249 lines
7.8 KiB
Markdown
249 lines
7.8 KiB
Markdown
# Static extraction of memory-carried dependencies
|
|
|
|
[[PREREQUISITES]]
|
|
* CesASMe results
|
|
* Gus
|
|
* Static vs dynamic
|
|
* PC
|
|
* μarch: μop, renamer, L1-res, ROB
|
|
* Osaca
|
|
* UiCA
|
|
[[END]]
|
|
|
|
|
|
## Intro
|
|
|
|
* Previous chapt. : effect of mem-carried deps
|
|
* Presented solution: Gus; in general dynamic analysis.
|
|
* Effective
|
|
* 2 O.M. slower => not acceptable in many cases
|
|
* We need a static solution
|
|
|
|
Dependencies are costly: assuming everything L1-resident, the latency of each
|
|
μop on the dependency chain must be paid.
|
|
|
|
On SKX,
|
|
* `add %rax, %rdx` -> lat = 1 cycle (throughput = 1/4C)
|
|
=> `add %rax, %rdx ; add %rdx, %rcx` : 1.25C, would be 0.5C without deps
|
|
* `vfmadd*pd %ymm0, %ymm1, %ymm2`: lat = 4C (TP = 1/2C)
|
|
|
|
|
|
## Types of dependencies
|
|
|
|
4 main types:
|
|
1. RaW: "real" dependency
|
|
2. WaW
|
|
3. WaR
|
|
4. RaR
|
|
|
|
* 4: not an issue.
|
|
* 2,3 : assuming the μarch has a renamer & enough μarch registers, not a problem
|
|
either. Might be a problem for some archs.
|
|
|
|
In all this chapter, we consider only RaW deps. Solution can be easily extended
|
|
for WaW, WaR if necessary.
|
|
|
|
Can occur:
|
|
* through registers
|
|
```
|
|
A = 7
|
|
B = A + 2
|
|
```
|
|
* through memory
|
|
```
|
|
store %rax, (%rbx)
|
|
add (%rbx), %rcx
|
|
```
|
|
|
|
Can be:
|
|
* in straight-line code
|
|
* loop-carried:
|
|
```
|
|
for(i)
|
|
B[i] = A[i-1] + 2
|
|
A[i] = 7
|
|
```
|
|
|
|
## Dynamic detection: Valgrind
|
|
|
|
* Mention Gus
|
|
|
|
### Valgrind's VEX
|
|
|
|
* Introduce Valgrind as an instrumentation tool
|
|
* Introduce VEX
|
|
* Should be portable to any architecture supported
|
|
* but suffers limitations for recent extension sets; eg avx512 not
|
|
supported (TODO check)
|
|
|
|
### Depsim
|
|
|
|
* Write a tool, valgrind-depsim, to instrument a binary to extract its
|
|
dependencies at runtime
|
|
* Can extract memory, register and temp-based dependencies
|
|
* Here, only the memory dependencies are relevant -- disable the other deps.
|
|
* Instrument binary:
|
|
* for each write, add `write_addr -> writer_pc` to a hashmap
|
|
* for each read, fetch `writer_pc` from hashmap
|
|
* if found, add a dependency `reader_pc -> writer_pc`
|
|
* use the process' memory map to translate PC to addresses inside ELF files
|
|
* At the end, write deps file:
|
|
* `#occur, src_elf_pc, src_elf_path, dst_elf_pc, dst_elf_path`
|
|
* Run for each binary in genbenchs
|
|
* Takes about 1h on 30 parallel cores on Pinocchio; heavy memory usage
|
|
|
|
## Static detection
|
|
|
|
* Reg-carried, straight-line: relatively easy. Keep track of which PC last
|
|
wrote each register.
|
|
* Reg-carried, loop-carried: can be adapted from straight-line. Indeed,
|
|
* need to track only so many iterations behind: after a certain point,
|
|
instructions are out of the ROB anyway
|
|
* 224 μops in Intel's Skylake, 2015
|
|
* 512 μops in Intel's Golden Cove, 2021
|
|
* Source: https://fuse.wikichip.org/news/6111/intel-details-golden-cove-next-generation-big-core-for-client-and-server-socs/ [consulted 2023-09-13]
|
|
* Can unroll until we have ~|ROB|+|K| instructions in the kernel: since
|
|
instructions yield at least a μop, safe [TODO check]
|
|
* Sometimes unrolled only once, eg. Osaca. Not sufficient; eg. Fibo.
|
|
|
|
* Harder for memory-carried:
|
|
* addresses may alias, eg. (%rax) = 8(%rbx)
|
|
* pointer arithmetics: must track values
|
|
* Usually not done, or only for trivial cases.
|
|
|
|
## Staticdeps heuristic
|
|
|
|
* Aims to simply solve the 2nd point.
|
|
* Could be solved with symbolic calculus, but not that easy to implement,
|
|
slower.
|
|
* Use random values
|
|
* Operates at the scale of a kernel, unrolled enough times to fill the ROB
|
|
|
|
|
|
* Whenever reading an unknown value (from mem or register), generate a fresh
|
|
random value (64b), save it to shadow memory/register file
|
|
* Whenever encountering integer arithmetics, compute the operation
|
|
* Whenever encountering other kind of operations or unsupported operations,
|
|
define the result as invalid (\bot): not pointer arithmetics.
|
|
* Whenever writing to a memory address, keep track of which PC wrote where.
|
|
* Whenever reading from a memory address, generate a dependency to the writing
|
|
PC.
|
|
|
|
* Reconstruct recurrent dependencies: transcribe each dependency to
|
|
`(src, dst, kernel delta)`.
|
|
* Verify that the dependency exists for each unroll (where it can exist, eg.
|
|
1st kernel cannot depend on the previous kernel unroll); if it happens in the
|
|
majority of cases, keep; else drop
|
|
|
|
* We need semantics for our assembly
|
|
|
|
### Limitations
|
|
|
|
* Does not track aliasing that originates from outside of the kernel.
|
|
* As advocated in CesASMe, would require a broader analysis range
|
|
* Randomness may (theoretically) lead to false positives
|
|
* but re-running with different seed should eliminate the hazard close to
|
|
entirely
|
|
* Should not have false negatives outside of aliasing or unsupported ops
|
|
|
|
## Evaluation
|
|
|
|
### Dependencies detection
|
|
|
|
#### With valgrind
|
|
|
|
Use valgrind-depsim.
|
|
Then, compare with staticdeps: `eval/vg_depsim.py` script.
|
|
|
|
* For each binary in genbenchs,
|
|
* use genbench's bb split/occurrences to retrieve basic blocks
|
|
* for each BB with more than 10% of max BB hits,
|
|
* predict deps with staticdeps
|
|
* cache the result: staticdeps is fast, but we're dealing with 3500
|
|
files.
|
|
* translate staticdeps' periodic deps to PC deps, discard the `iter`
|
|
parameter
|
|
* for each dependency from the depsim results that occurs inside this BB,
|
|
* check if found or missed, append to a list
|
|
* score: `|found| / (|found| + |missed|)`. Discards occurrences.
|
|
* limitation: will only find deps from/to the same BB! Dependencies leaving a
|
|
BB are discarded.
|
|
|
|
* Result: about 38% of deps found; 44% if weighting by occurrences
|
|
|
|
* Cause: kernels executed in loops.
|
|
* No dependency in the kernel
|
|
```
|
|
while:
|
|
read (%rax)
|
|
%rax ++
|
|
write (%rax)
|
|
```
|
|
* But dependencies if executed in a loop! "Unwanted" deps.
|
|
* and irrelevant in real life anyway: they are far away and will not cause
|
|
latency
|
|
* Fix: introduce dependency lifetime
|
|
* timestamp = instructions executed (VG instrumentation, added up at the
|
|
end of each BB)
|
|
* lifetime fixed to 1024 instructions, order of magnitude of a ROB
|
|
* dependencies are discarded if written to more than a lifetime ago
|
|
|
|
* Result: about 58% of deps found; same if weighing.
|
|
* If lifetime lowered to 512, about 56% of deps found, or 63% if weighing.
|
|
* Results are quite similar, lowering the lifetime further makes no
|
|
particular sense.
|
|
|
|
Raw results:
|
|
```
|
|
In [123]: res_success(res_life512)
|
|
Out[123]: 0.5640902544407105
|
|
|
|
In [124]: res_success(res_life1024)
|
|
Out[124]: 0.5761437608875034
|
|
|
|
In [125]: res_success(res_nolife)
|
|
Out[125]: 0.38143868803578085
|
|
|
|
In [126]: res_success_weight(res_life512)
|
|
Out[126]: 0.6347271857382266
|
|
|
|
In [127]: res_success_weight(res_life1024)
|
|
Out[127]: 0.5817404277466787
|
|
|
|
In [128]: res_success_weight(res_nolife)
|
|
Out[128]: 0.4397921976192802
|
|
```
|
|
|
|
* The results are reasonable, but not all the deps are caught
|
|
* As argued above, will never see aliasing; important in plenty of cases.
|
|
* eg. if the compiler allocates `%rcx = A[i]` and `%rdx = A[i+2]` for some
|
|
reason, dependencies will be missed.
|
|
* As argued in previous chapter, a complete dependencies analysis would require
|
|
a broader range: take the full scope into account
|
|
|
|
#### With Gus
|
|
|
|
TODO ?
|
|
|
|
### UiCA enriching
|
|
|
|
* Plug Staticdeps into UiCA
|
|
* UiCA has a μop-level representation; staticdeps has an instr-level
|
|
representation
|
|
* Add dependencies between each couple of μop in (src,dest).
|
|
* A finer model would be necessary to be accurate
|
|
* Pessimistic model
|
|
* Run CesASMe on the full suite with uiCA and uiCA+staticdeps
|
|
* results
|
|
* Run CesASMe on the no-memdeps suite with uiCA and uiCA+staticdeps
|
|
* results
|
|
|
|
* Although not all dependencies are detected [paragraph above], the "important"
|
|
ones seem to be detected: this is the most critical property for throughput
|
|
analysis
|
|
* but might not be true for other applications that require dependencies
|
|
detection
|
|
|
|
### Speed
|
|
|
|
TODO: evaluate speed?
|