7.8 KiB
Static extraction of memory-carried dependencies
- CesASMe results
- Gus
- Static vs dynamic
- PC
- μarch: μop, renamer, L1-res, ROB
- Osaca
- UiCA END
Intro
- Previous chapt. : effect of mem-carried deps
- Presented solution: Gus; in general dynamic analysis.
- Effective
- 2 O.M. slower => not acceptable in many cases
- We need a static solution
Dependencies are costly: assuming everything L1-resident, the latency of each μop on the dependency chain must be paid.
On SKX,
add %rax, %rdx
-> lat = 1 cycle (throughput = 1/4C) =>add %rax, %rdx ; add %rdx, %rcx
: 1.25C, would be 0.5C without depsvfmadd*pd %ymm0, %ymm1, %ymm2
: lat = 4C (TP = 1/2C)
Types of dependencies
4 main types:
- RaW: "real" dependency
- WaW
- WaR
- RaR
- 4: not an issue.
- 2,3 : assuming the μarch has a renamer & enough μarch registers, not a problem either. Might be a problem for some archs.
In all this chapter, we consider only RaW deps. Solution can be easily extended for WaW, WaR if necessary.
Can occur:
- through registers
A = 7 B = A + 2
- through memory
store %rax, (%rbx) add (%rbx), %rcx
Can be:
- in straight-line code
- loop-carried:
for(i) B[i] = A[i-1] + 2 A[i] = 7
Dynamic detection: Valgrind
- Mention Gus
Valgrind's VEX
- Introduce Valgrind as an instrumentation tool
- Introduce VEX
- Should be portable to any architecture supported
- but suffers limitations for recent extension sets; eg avx512 not supported (TODO check)
Depsim
- Write a tool, valgrind-depsim, to instrument a binary to extract its dependencies at runtime
- Can extract memory, register and temp-based dependencies
- Here, only the memory dependencies are relevant -- disable the other deps.
- Instrument binary:
- for each write, add
write_addr -> writer_pc
to a hashmap - for each read, fetch
writer_pc
from hashmap- if found, add a dependency
reader_pc -> writer_pc
- if found, add a dependency
- use the process' memory map to translate PC to addresses inside ELF files
- At the end, write deps file:
#occur, src_elf_pc, src_elf_path, dst_elf_pc, dst_elf_path
- Run for each binary in genbenchs
- Takes about 1h on 30 parallel cores on Pinocchio; heavy memory usage
- for each write, add
Static detection
-
Reg-carried, straight-line: relatively easy. Keep track of which PC last wrote each register.
-
Reg-carried, loop-carried: can be adapted from straight-line. Indeed,
- need to track only so many iterations behind: after a certain point,
instructions are out of the ROB anyway
- 224 μops in Intel's Skylake, 2015
- 512 μops in Intel's Golden Cove, 2021
- Source: https://fuse.wikichip.org/news/6111/intel-details-golden-cove-next-generation-big-core-for-client-and-server-socs/ [consulted 2023-09-13]
- Can unroll until we have ~|ROB|+|K| instructions in the kernel: since instructions yield at least a μop, safe [TODO check]
- Sometimes unrolled only once, eg. Osaca. Not sufficient; eg. Fibo.
- need to track only so many iterations behind: after a certain point,
instructions are out of the ROB anyway
-
Harder for memory-carried:
- addresses may alias, eg. (%rax) = 8(%rbx)
- pointer arithmetics: must track values
- Usually not done, or only for trivial cases.
Staticdeps heuristic
-
Aims to simply solve the 2nd point.
-
Could be solved with symbolic calculus, but not that easy to implement, slower.
-
Use random values
-
Operates at the scale of a kernel, unrolled enough times to fill the ROB
-
Whenever reading an unknown value (from mem or register), generate a fresh random value (64b), save it to shadow memory/register file
-
Whenever encountering integer arithmetics, compute the operation
-
Whenever encountering other kind of operations or unsupported operations, define the result as invalid (\bot): not pointer arithmetics.
-
Whenever writing to a memory address, keep track of which PC wrote where.
-
Whenever reading from a memory address, generate a dependency to the writing PC.
-
Reconstruct recurrent dependencies: transcribe each dependency to
(src, dst, kernel delta)
. -
Verify that the dependency exists for each unroll (where it can exist, eg. 1st kernel cannot depend on the previous kernel unroll); if it happens in the majority of cases, keep; else drop
-
We need semantics for our assembly
Limitations
- Does not track aliasing that originates from outside of the kernel.
- As advocated in CesASMe, would require a broader analysis range
- Randomness may (theoretically) lead to false positives
- but re-running with different seed should eliminate the hazard close to entirely
- Should not have false negatives outside of aliasing or unsupported ops
Evaluation
Dependencies detection
With valgrind
Use valgrind-depsim.
Then, compare with staticdeps: eval/vg_depsim.py
script.
-
For each binary in genbenchs,
- use genbench's bb split/occurrences to retrieve basic blocks
- for each BB with more than 10% of max BB hits,
- predict deps with staticdeps
- cache the result: staticdeps is fast, but we're dealing with 3500 files.
- translate staticdeps' periodic deps to PC deps, discard the
iter
parameter - for each dependency from the depsim results that occurs inside this BB,
- check if found or missed, append to a list
-
score:
|found| / (|found| + |missed|)
. Discards occurrences. -
limitation: will only find deps from/to the same BB! Dependencies leaving a BB are discarded.
-
Result: about 38% of deps found; 44% if weighting by occurrences
-
Cause: kernels executed in loops.
- No dependency in the kernel
while: read (%rax) %rax ++ write (%rax)
- But dependencies if executed in a loop! "Unwanted" deps.
- and irrelevant in real life anyway: they are far away and will not cause latency
-
Fix: introduce dependency lifetime
- timestamp = instructions executed (VG instrumentation, added up at the end of each BB)
- lifetime fixed to 1024 instructions, order of magnitude of a ROB
- dependencies are discarded if written to more than a lifetime ago
-
Result: about 58% of deps found; same if weighing.
-
If lifetime lowered to 512, about 56% of deps found, or 63% if weighing.
- Results are quite similar, lowering the lifetime further makes no particular sense.
Raw results:
In [123]: res_success(res_life512)
Out[123]: 0.5640902544407105
In [124]: res_success(res_life1024)
Out[124]: 0.5761437608875034
In [125]: res_success(res_nolife)
Out[125]: 0.38143868803578085
In [126]: res_success_weight(res_life512)
Out[126]: 0.6347271857382266
In [127]: res_success_weight(res_life1024)
Out[127]: 0.5817404277466787
In [128]: res_success_weight(res_nolife)
Out[128]: 0.4397921976192802
- The results are reasonable, but not all the deps are caught
- As argued above, will never see aliasing; important in plenty of cases.
- eg. if the compiler allocates
%rcx = A[i]
and%rdx = A[i+2]
for some reason, dependencies will be missed.
- eg. if the compiler allocates
- As argued in previous chapter, a complete dependencies analysis would require a broader range: take the full scope into account
With Gus
TODO ?
UiCA enriching
-
Plug Staticdeps into UiCA
-
UiCA has a μop-level representation; staticdeps has an instr-level representation
- Add dependencies between each couple of μop in (src,dest).
- A finer model would be necessary to be accurate
- Pessimistic model
-
Run CesASMe on the full suite with uiCA and uiCA+staticdeps
- results
-
Run CesASMe on the no-memdeps suite with uiCA and uiCA+staticdeps
- results
-
Although not all dependencies are detected [paragraph above], the "important" ones seem to be detected: this is the most critical property for throughput analysis
- but might not be true for other applications that require dependencies detection
Speed
TODO: evaluate speed?