Plan : staticdeps: enrich
parent
47d4c95264
commit
7e5abd9669
1 changed files with 57 additions and 6 deletions
|
@ -115,22 +115,36 @@ On SKX,
|
||||||
|
|
||||||
#### With valgrind
|
#### With valgrind
|
||||||
|
|
||||||
|
* Write a tool, valgrind-depsim, to instrument a binary to extract its
|
||||||
|
dependencies at runtime
|
||||||
|
* Can extract memory, register and temp-based dependencies
|
||||||
|
* Here, only the memory dependencies are relevant -- disable the other deps.
|
||||||
* Instrument binary:
|
* Instrument binary:
|
||||||
* for each write, add `write_addr -> writer_pc` to a hashmap
|
* for each write, add `write_addr -> writer_pc` to a hashmap
|
||||||
* for each read, fetch `writer_pc` from hashmap
|
* for each read, fetch `writer_pc` from hashmap
|
||||||
* if found, add a dependency `reader_pc -> writer_pc`
|
* if found, add a dependency `reader_pc -> writer_pc`
|
||||||
|
* use the process' memory map to translate PC to addresses inside ELF files
|
||||||
* At the end, write deps file:
|
* At the end, write deps file:
|
||||||
* `#occur, src_elf_pc, src_elf_path, dst_elf_pc, dst_elf_path`
|
* `#occur, src_elf_pc, src_elf_path, dst_elf_pc, dst_elf_path`
|
||||||
* Run for each binary in genbenchs
|
* Run for each binary in genbenchs
|
||||||
|
* Takes about 1h on 30 parallel cores on Pinocchio; heavy memory usage
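
The bookkeeping described in the list above can be modelled in a few lines of Python. This is only a sketch of the idea, not the valgrind-depsim code: the trace format `(kind, pc, addr)` and the function name `track_memory_deps` are assumptions made for illustration, and the translation of PCs to ELF addresses is left out.

```python
from collections import Counter


def track_memory_deps(trace):
    """Model of the instrumentation: `trace` is an iterable of
    (kind, pc, addr) tuples, with kind "w" for a write and "r" for a read.
    (Hypothetical trace format, for illustration only.)"""
    last_writer = {}   # write_addr -> writer_pc
    deps = Counter()   # (reader_pc, writer_pc) -> number of occurrences
    for kind, pc, addr in trace:
        if kind == "w":
            # for each write, remember which instruction last wrote this address
            last_writer[addr] = pc
        elif kind == "r":
            # for each read, look up the last writer of this address, if any
            writer_pc = last_writer.get(addr)
            if writer_pc is not None:
                deps[(pc, writer_pc)] += 1
    return deps


example_trace = [
    ("w", 0x10, 0x1000),
    ("r", 0x14, 0x1000),   # depends on the write at 0x10
    ("r", 0x18, 0x2000),   # never written: no dependency
    ("w", 0x1C, 0x1000),
    ("r", 0x14, 0x1000),   # now depends on the write at 0x1c
]
example_deps = track_memory_deps(example_trace)
```

The occurrence count per pair corresponds to the `#occur` column of the deps file.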
|
||||||
|
|
||||||
|
Then, compare with staticdeps: `eval/vg_depsim.py` script.
|
||||||
|
|
||||||
* For each binary in genbenchs,
|
* For each binary in genbenchs,
|
||||||
|
* use genbench's bb split/occurrences to retrieve basic blocks
|
||||||
* for each BB with more than 10% of max BB hits,
|
* for each BB with more than 10% of max BB hits,
|
||||||
* predict deps with staticdeps
|
* predict deps with staticdeps
|
||||||
* check which dependencies are found/missed from the instrumented ones
|
* cache the result: fast, but we're dealing with 3500 files.
|
||||||
* limitation: will only find deps from/to the same BB! Dependencies leaving
|
* translate staticdeps' periodic deps to PC deps, discard the `iter`
|
||||||
a BB are discarded.
|
parameter
|
||||||
|
* for each dependency from the depsim results that occurs inside this BB,
|
||||||
|
* check if found or missed, append to a list
|
||||||
|
* score: `|found| / (|found| + |missed|)`; occurrence counts are ignored.
|
||||||
|
* limitation: will only find deps from/to the same BB! Dependencies leaving
|
||||||
|
a BB are discarded.
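
As an illustration of the score above and of its occurrence-weighted variant, here is a minimal sketch. The function names `score` and `weighted_score` and the data layout (lists of dependency pairs, plus a dict of occurrence counts) are assumptions, not the actual contents of `eval/vg_depsim.py`.

```python
def score(found, missed):
    """Unweighted score: |found| / (|found| + |missed|)."""
    total = len(found) + len(missed)
    return len(found) / total if total else 0.0


def weighted_score(found, missed, occurrences):
    """Same ratio, but each dependency counts as many times as it occurred
    at runtime. `occurrences` maps a dependency to its occurrence count."""
    found_weight = sum(occurrences[dep] for dep in found)
    missed_weight = sum(occurrences[dep] for dep in missed)
    total = found_weight + missed_weight
    return found_weight / total if total else 0.0


# Hypothetical data: two dependencies found, one missed.
found = [("pc_a", "pc_b"), ("pc_c", "pc_d")]
missed = [("pc_e", "pc_f")]
occurrences = {("pc_a", "pc_b"): 10, ("pc_c", "pc_d"): 1, ("pc_e", "pc_f"): 1}
```

With this data, the unweighted score is 2/3 while the weighted one is 11/12: a frequently occurring dependency that is found pulls the weighted score up.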
|
||||||
|
|
||||||
* Result: about 38% of deps found.
|
* Result: about 38% of deps found; 44% if weighting by occurrences
|
||||||
|
|
||||||
* Cause: kernels executed in loops.
|
* Cause: kernels executed in loops.
|
||||||
* No dependency in the kernel
|
* No dependency in the kernel
|
||||||
|
@ -146,10 +160,41 @@ On SKX,
|
||||||
* Fix: introduce dependency lifetime
|
* Fix: introduce dependency lifetime
|
||||||
* timestamp = instructions executed (VG instrumentation, added up at the
|
* timestamp = instructions executed (VG instrumentation, added up at the
|
||||||
end of each BB)
|
end of each BB)
|
||||||
* lifetime fixed to 1024 instructions
|
* lifetime fixed to 1024 instructions, the order of magnitude of a ROB (reorder buffer) size
|
||||||
* dependencies are discarded if written to more than a lifetime ago
|
* dependencies are discarded if written to more than a lifetime ago
|
||||||
|
|
||||||
* Result: about (?? TODO) of deps found
|
* Result: about 58% of deps found; same when weighting by occurrences.
|
||||||
|
* If the lifetime is lowered to 512, about 56% of deps found, or 63% when weighting by occurrences.
|
||||||
|
* Results are quite similar; lowering the lifetime further makes no
|
||||||
|
particular sense.
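
The lifetime rule can be sketched by storing a timestamp next to each writer. This is a simplified model under stated assumptions: the trace format `(kind, pc, addr, now)`, where `now` is the number of instructions executed so far, and the name `track_with_lifetime` are hypothetical; the real tool only updates its instruction counter at the end of each basic block.

```python
from collections import Counter

LIFETIME = 1024  # in instructions, as in the notes


def track_with_lifetime(trace, lifetime=LIFETIME):
    """`trace` yields (kind, pc, addr, now) tuples, with `now` the number of
    instructions executed so far (hypothetical format, for illustration)."""
    last_writer = {}   # write_addr -> (writer_pc, timestamp of the write)
    deps = Counter()   # (reader_pc, writer_pc) -> number of occurrences
    for kind, pc, addr, now in trace:
        if kind == "w":
            last_writer[addr] = (pc, now)
        elif kind == "r":
            entry = last_writer.get(addr)
            if entry is None:
                continue
            writer_pc, written_at = entry
            # discard the dependency if the write is older than one lifetime
            if now - written_at > lifetime:
                continue
            deps[(pc, writer_pc)] += 1
    return deps


lifetime_trace = [
    ("w", 0x10, 0x1000, 0),
    ("r", 0x14, 0x1000, 100),    # 100 instructions later: kept
    ("r", 0x18, 0x1000, 2000),   # 2000 instructions later: discarded
]
lifetime_deps = track_with_lifetime(lifetime_trace)
```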
|
||||||
|
|
||||||
|
Raw results:
|
||||||
|
```
In [123]: res_success(res_life512)
Out[123]: 0.5640902544407105

In [124]: res_success(res_life1024)
Out[124]: 0.5761437608875034

In [125]: res_success(res_nolife)
Out[125]: 0.38143868803578085

In [126]: res_success_weight(res_life512)
Out[126]: 0.6347271857382266

In [127]: res_success_weight(res_life1024)
Out[127]: 0.5817404277466787

In [128]: res_success_weight(res_nolife)
Out[128]: 0.4397921976192802
```
|
||||||
|
|
||||||
|
* The results are reasonable, but not all the deps are caught
|
||||||
|
* As argued above, will never see aliasing; important in plenty of cases.
|
||||||
|
* e.g. if the compiler allocates `%rcx = A[i]` and `%rdx = A[i+2]` for some
|
||||||
|
reason, dependencies will be missed.
|
||||||
|
* As argued in the previous chapter, a complete dependency analysis would require
|
||||||
|
a broader range: take the full scope into account
|
||||||
|
|
||||||
#### With Gus
|
#### With Gus
|
||||||
|
|
||||||
|
@ -167,3 +212,9 @@ TODO ?
|
||||||
* results
|
* results
|
||||||
* Run CesASMe on the no-memdeps suite with uiCA and uiCA+staticdeps
|
* Run CesASMe on the no-memdeps suite with uiCA and uiCA+staticdeps
|
||||||
* results
|
* results
|
||||||
|
|
||||||
|
* Although not all dependencies are detected [paragraph above], the "important"
|
||||||
|
ones seem to be detected: this is the most critical property for throughput
|
||||||
|
analysis
|
||||||
|
* but this might not be true for other applications that require dependency
|
||||||
|
detection
|
||||||
|
|