diff --git a/plan/60_staticdeps.md b/plan/60_staticdeps.md
index e87c3bd..7c03924 100644
--- a/plan/60_staticdeps.md
+++ b/plan/60_staticdeps.md
@@ -115,22 +115,36 @@ On SKX,
 
 #### With valgrind
 
+* Write a tool, valgrind-depsim, to instrument a binary to extract its
+  dependencies at runtime
+* Can extract memory, register and temp-based dependencies
+* Here, only the memory dependencies are relevant -- disable the other deps.
 * Instrument binary:
   * for each write, add `write_addr -> writer_pc` to a hashmap
   * for each read, fetch `writer_pc` from hashmap
     * if found, add a dependency `reader_pc -> writer_pc`
+  * use the process' memory map to translate PC to addresses inside ELF files
 * At the end, write deps file:
   * `#occur, src_elf_pc, src_elf_path, dst_elf_pc, dst_elf_path`
 * Run for each binary in genbenchs
+  * Takes about 1h on 30 parallel cores on Pinocchio; heavy memory usage
+
+Then, compare with staticdeps: `eval/vg_depsim.py` script.
 
 * For each binary in genbenchs,
+  * use genbench's bb split/occurrences to retrieve basic blocks
   * for each BB with more than 10% of max BB hits,
     * predict deps with staticdeps
-    * check which dependencies are found/missed from the instrumented ones
-    * limitation: will only find deps from/to the same BB! Dependencies leaving
-      a BB are discarded.
+      * cache the result: fast, but we're dealing with 3500 files.
+    * translate staticdeps' periodic deps to PC deps, discard the `iter`
+      parameter
+    * for each dependency from the depsim results that occurs inside this BB,
+      * check if found or missed, append to a list
+* score: `|found| / (|found| + |missed|)`. Discards occurrences.
+* limitation: will only find deps from/to the same BB! Dependencies leaving
+  a BB are discarded.
 
-* Result: about 38% of deps found.
+* Result: about 38% of deps found; 44% if weighting by occurrences
 * Cause: kernels executed in loops.
   * No dependency in the kernel
 
@@ -146,10 +160,41 @@ On SKX,
 * Fix: introduce dependency lifetime
   * timestamp = instructions executed (VG instrumentation, added up at the end
     of each BB)
-  * lifetime fixed to 1024 instructions
+  * lifetime fixed to 1024 instructions, order of magnitude of a ROB
   * dependencies are discarded if written to more than a lifetime ago
 
-* Result: about (?? TODO) of deps found
+* Result: about 58% of deps found; same if weighting.
+* If lifetime lowered to 512, about 56% of deps found, or 63% if weighting.
+  * Results are quite similar, lowering the lifetime further makes no
+    particular sense.
+
+Raw results:
+```
+In [123]: res_success(res_life512)
+Out[123]: 0.5640902544407105
+
+In [124]: res_success(res_life1024)
+Out[124]: 0.5761437608875034
+
+In [125]: res_success(res_nolife)
+Out[125]: 0.38143868803578085
+
+In [126]: res_success_weight(res_life512)
+Out[126]: 0.6347271857382266
+
+In [127]: res_success_weight(res_life1024)
+Out[127]: 0.5817404277466787
+
+In [128]: res_success_weight(res_nolife)
+Out[128]: 0.4397921976192802
+```
+
+* The results are reasonable, but not all the deps are caught
+* As argued above, will never see aliasing; important in plenty of cases.
+  * e.g. if the compiler allocates `%rcx = A[i]` and `%rdx = A[i+2]` for some
+    reason, dependencies will be missed.
+* As argued in previous chapter, a complete dependency analysis would require
+  a broader range: take the full scope into account
 
 #### With Gus
 
@@ -167,3 +212,9 @@ TODO ?
   * results
 * Run CesASMe on the no-memdeps suite with uiCA and uiCA+staticdeps
   * results
+
+* Although not all dependencies are detected [paragraph above], the "important"
+  ones seem to be detected: this is the most critical property for throughput
+  analysis
+  * but might not be true for other applications that require dependency
+    detection
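
The instrumentation scheme, the dependency lifetime and the score described in this patch can be illustrated with a small Python sketch. This is not the valgrind-depsim tool nor the `eval/vg_depsim.py` script: the names `track_memory_deps` and `success_rate`, and the `(pc, kind, addr)` trace format, are assumptions made for illustration only. As a simplification, the timestamp here is the index of the memory access in the trace, whereas the plan counts executed instructions.

```python
from collections import Counter


def track_memory_deps(trace, lifetime=None):
    """Replay a memory-access trace and count read-after-write dependencies.

    `trace` is an iterable of (pc, kind, addr) with kind "R" or "W"
    (hypothetical format). The timestamp is the index of the access in
    the trace, a simplification of "instructions executed".
    """
    last_writer = {}   # addr -> (writer_pc, timestamp of the write)
    deps = Counter()   # (reader_pc, writer_pc) -> number of occurrences
    for now, (pc, kind, addr) in enumerate(trace):
        if kind == "W":
            last_writer[addr] = (pc, now)
        elif kind == "R" and addr in last_writer:
            writer_pc, written_at = last_writer[addr]
            # Discard the dependency if the write is older than one lifetime.
            if lifetime is None or now - written_at <= lifetime:
                deps[(pc, writer_pc)] += 1
    return deps


def success_rate(observed, predicted, weighted=False):
    """|found| / (|found| + |missed|), optionally weighted by occurrences."""
    found = missed = 0
    for dep, occurrences in observed.items():
        weight = occurrences if weighted else 1
        if dep in predicted:
            found += weight
        else:
            missed += weight
    total = found + missed
    return found / total if total else 0.0


# Toy trace: the last read targets a value written four accesses earlier.
trace = [
    (0x10, "W", 0x1000),
    (0x14, "R", 0x1000),
    (0x18, "W", 0x2000),
    (0x1C, "R", 0x2000),
    (0x20, "R", 0x1000),
]
predicted = {(0x14, 0x10)}  # stand-in for a static prediction
print(success_rate(track_memory_deps(trace), predicted))
print(success_rate(track_memory_deps(trace, lifetime=2), predicted))
```

With a lifetime, the stale dependency of the last read is dropped from the observed set, so it no longer counts as missed; this is the mechanism by which the lifetime raises the score in the results above.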