# Static extraction of memory-carried dependencies [[PREREQUISITES]] * CesASMe results * Gus * Static vs dynamic * PC * μarch: μop, renamer, L1-res, ROB * Osaca * UiCA [[END]] ## Intro * Previous chapt. : effect of mem-carried deps * Presented solution: Gus; in general dynamic analysis. * Effective * 2 O.M. slower => not acceptable in many cases * We need a static solution Dependencies are costly: assuming everything L1-resident, the latency of each μop on the dependency chain must be paid. On SKX, * `add %rax, %rdx` -> lat = 1 cycle (throughput = 1/4C) => `add %rax, %rdx ; add %rdx, %rcx` : 1.25C, would be 0.5C without deps * `vfmadd*pd %ymm0, %ymm1, %ymm2`: lat = 4C (TP = 1/2C) ## Types of dependencies 4 main types: 1. RaW: "real" dependency 2. WaW 3. WaR 4. RaR * 4: not an issue. * 2,3 : assuming the μarch has a renamer & enough μarch registers, not a problem either. Might be a problem for some archs. In all this chapter, we consider only RaW deps. Solution can be easily extended for WaW, WaR if necessary. Can occur: * through registers ``` A = 7 B = A + 2 ``` * through memory ``` store %rax, (%rbx) add (%rbx), %rcx ``` Can be: * in straight-line code * loop-carried: ``` for(i) B[i] = A[i-1] + 2 A[i] = 7 ``` ## Dynamic detection: Valgrind * Mention Gus ### Valgrind's VEX * Introduce Valgrind as an instrumentation tool * Introduce VEX * Should be portable to any architecture supported * but suffers limitations for recent extension sets; eg avx512 not supported (TODO check) ### Depsim * Write a tool, valgrind-depsim, to instrument a binary to extract its dependencies at runtime * Can extract memory, register and temp-based dependencies * Here, only the memory dependencies are relevant -- disable the other deps. * Instrument binary: * for each write, add `write_addr -> writer_pc` to a hashmap * for each read, fetch `writer_pc` from hashmap * if found, add a dependency `reader_pc -> writer_pc` * use the process' memory map to translate PC to addresses inside ELF files * At the end, write deps file: * `#occur, src_elf_pc, src_elf_path, dst_elf_pc, dst_elf_path` * Run for each binary in genbenchs * Takes about 1h on 30 parallel cores on Pinocchio; heavy memory usage ## Static detection * Reg-carried, straight-line: relatively easy. Keep track of which PC last wrote each register. * Reg-carried, loop-carried: can be adapted from straight-line. Indeed, * need to track only so many iterations behind: after a certain point, instructions are out of the ROB anyway * 224 μops in Intel's Skylake, 2015 * 512 μops in Intel's Golden Cove, 2021 * Source: https://fuse.wikichip.org/news/6111/intel-details-golden-cove-next-generation-big-core-for-client-and-server-socs/ [consulted 2023-09-13] * Can unroll until we have ~|ROB|+|K| instructions in the kernel: since instructions yield at least a μop, safe [TODO check] * Sometimes unrolled only once, eg. Osaca. Not sufficient; eg. Fibo. * Harder for memory-carried: * addresses may alias, eg. (%rax) = 8(%rbx) * pointer arithmetics: must track values * Usually not done, or only for trivial cases. ## Staticdeps heuristic * Aims to simply solve the 2nd point. * Could be solved with symbolic calculus, but not that easy to implement, slower. * Use random values * Operates at the scale of a kernel, unrolled enough times to fill the ROB * Whenever reading an unknown value (from mem or register), generate a fresh random value (64b), save it to shadow memory/register file * Whenever encountering integer arithmetics, compute the operation * Whenever encountering other kind of operations or unsupported operations, define the result as invalid (\bot): not pointer arithmetics. * Whenever writing to a memory address, keep track of which PC wrote where. * Whenever reading from a memory address, generate a dependency to the writing PC. * Reconstruct recurrent dependencies: transcribe each dependency to `(src, dst, kernel delta)`. * Verify that the dependency exists for each unroll (where it can exist, eg. 1st kernel cannot depend on the previous kernel unroll); if it happens in the majority of cases, keep; else drop * We need semantics for our assembly ### Limitations * Does not track aliasing that originates from outside of the kernel. * As advocated in CesASMe, would require a broader analysis range * Randomness may (theoretically) lead to false positives * but re-running with different seed should eliminate the hazard close to entirely * Should not have false negatives outside of aliasing or unsupported ops ## Evaluation ### Dependencies detection #### With valgrind Use valgrind-depsim. Then, compare with staticdeps: `eval/vg_depsim.py` script. * For each binary in genbenchs, * use genbench's bb split/occurrences to retrieve basic blocks * for each BB with more than 10% of max BB hits, * predict deps with staticdeps * cache the result: staticdeps is fast, but we're dealing with 3500 files. * translate staticdeps' periodic deps to PC deps, discard the `iter` parameter * for each dependency from the depsim results that occurs inside this BB, * check if found or missed, append to a list * score: `|found| / (|found| + |missed|)`. Discards occurrences. * limitation: will only find deps from/to the same BB! Dependencies leaving a BB are discarded. * Result: about 38% of deps found; 44% if weighting by occurrences * Cause: kernels executed in loops. * No dependency in the kernel ``` while: read (%rax) %rax ++ write (%rax) ``` * But dependencies if executed in a loop! "Unwanted" deps. * and irrelevant in real life anyway: they are far away and will not cause latency * Fix: introduce dependency lifetime * timestamp = instructions executed (VG instrumentation, added up at the end of each BB) * lifetime fixed to 1024 instructions, order of magnitude of a ROB * dependencies are discarded if written to more than a lifetime ago * Result: about 58% of deps found; same if weighing. * If lifetime lowered to 512, about 56% of deps found, or 63% if weighing. * Results are quite similar, lowering the lifetime further makes no particular sense. Raw results: ``` In [123]: res_success(res_life512) Out[123]: 0.5640902544407105 In [124]: res_success(res_life1024) Out[124]: 0.5761437608875034 In [125]: res_success(res_nolife) Out[125]: 0.38143868803578085 In [126]: res_success_weight(res_life512) Out[126]: 0.6347271857382266 In [127]: res_success_weight(res_life1024) Out[127]: 0.5817404277466787 In [128]: res_success_weight(res_nolife) Out[128]: 0.4397921976192802 ``` * The results are reasonable, but not all the deps are caught * As argued above, will never see aliasing; important in plenty of cases. * eg. if the compiler allocates `%rcx = A[i]` and `%rdx = A[i+2]` for some reason, dependencies will be missed. * As argued in previous chapter, a complete dependencies analysis would require a broader range: take the full scope into account #### With Gus TODO ? ### UiCA enriching * Plug Staticdeps into UiCA * UiCA has a μop-level representation; staticdeps has an instr-level representation * Add dependencies between each couple of μop in (src,dest). * A finer model would be necessary to be accurate * Pessimistic model * Run CesASMe on the full suite with uiCA and uiCA+staticdeps * results * Run CesASMe on the no-memdeps suite with uiCA and uiCA+staticdeps * results * Although not all dependencies are detected [paragraph above], the "important" ones seem to be detected: this is the most critical property for throughput analysis * but might not be true for other applications that require dependencies detection ### Speed TODO: evaluate speed?