# Static extraction of memory-carried dependencies
## Intro
* Previous chapter: effect of memory-carried dependencies
* Presented solution: Gus; more generally, dynamic analysis.
  * Effective
  * 2 orders of magnitude slower => not acceptable in many cases
* We need a static solution
## Types of dependencies
4 main types:
1. RaW (read after write): "real" dependency
2. WaW (write after write)
3. WaR (write after read)
4. RaR (read after read)
* 4: not an issue.
* 2, 3: assuming the μarch has a renamer and enough physical registers, not a problem
  either. Might be a problem for some archs.
Throughout this chapter, we consider only RaW deps. The solution can easily be
extended to WaW and WaR if necessary.
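To make the four categories concrete, here is a minimal Python sketch that classifies the hazards between an instruction and a later one from their read and write sets. The `Insn` type and the `dependency_kinds` function are hypothetical names introduced for illustration only; they do not belong to any tool mentioned in this chapter.

```python
from typing import NamedTuple


class Insn(NamedTuple):
    reads: frozenset   # locations (register names or addresses) read
    writes: frozenset  # locations written


def dependency_kinds(first: Insn, second: Insn) -> set:
    """Kinds of hazards between `first` and a later instruction `second`."""
    kinds = set()
    if first.writes & second.reads:
        kinds.add("RaW")  # second reads what first wrote: "real" dependency
    if first.writes & second.writes:
        kinds.add("WaW")
    if first.reads & second.writes:
        kinds.add("WaR")
    if first.reads & second.reads:
        kinds.add("RaR")  # not a dependency, listed for completeness
    return kinds


a = Insn(reads=frozenset(), writes=frozenset({"A"}))       # A = 7
b = Insn(reads=frozenset({"A"}), writes=frozenset({"B"}))  # B = A + 2
print(dependency_kinds(a, b))  # {'RaW'}
```

The same function applies to memory locations if addresses are used instead of register names, provided the addresses are known, which is precisely the hard part for static analysis.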
Can occur:
* through registers
```
A = 7
B = A + 2
```
* through memory
```
mov %rax, (%rbx)    # store to the address held in %rbx
add (%rbx), %rcx    # load from the same address: RaW through memory
```
Can be:
* in straight-line code
* loop-carried:
```
for (i = 1; i < n; i++) {
    B[i] = A[i-1] + 2;  // reads the value written to A by the previous iteration
    A[i] = 7;
}
```
## Cost of dependencies
Dependencies are costly: assuming everything is L1-resident, the latency of each
μop on the dependency chain must be paid.
On SKX,
* `add %rax, %rdx`: latency = 1 cycle (reciprocal throughput = 1/4 cycle)
=> `add %rax, %rdx ; add %rdx, %rcx`: 1.25C, would be 0.5C without deps
* `vfmadd*pd %ymm0, %ymm1, %ymm2`: latency = 4C (reciprocal throughput = 1/2C)
## Static detection
* Reg-carried, straight-line: relatively easy. Keep track of which PC last
  wrote each register.
* Reg-carried, loop-carried: can be adapted from straight-line. Indeed,
  * need to track only so many iterations behind: after a certain point,
    instructions are out of the ROB anyway
    * 224 μops in Intel's Skylake, 2015
    * 512 μops in Intel's Golden Cove, 2021
  * Can unroll until we have ~|ROB|+|K| instructions in the kernel: since
    each instruction yields at least one μop, safe [TODO check]
  * Sometimes unrolled only once, e.g. OSACA. Not sufficient; e.g. Fibonacci.
* Harder for memory-carried:
  * addresses may alias, e.g. (%rax) = 8(%rbx)
  * pointer arithmetic: must track values
  * Usually not done, or only for trivial cases.
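The register-carried case can be sketched in a few lines of Python: remember which instruction last wrote each register, and unroll the kernel to expose loop-carried dependencies. This is only an illustration under simplifying assumptions: instructions are given as hypothetical `(reads, writes)` pairs of register names, and `reg_deps` is a made-up name, not an existing tool's API.

```python
def reg_deps(kernel, unroll):
    """Register-carried RaW dependencies of a kernel unrolled `unroll` times.

    kernel: list of (reads, writes) pairs of register names, one per instruction.
    Returns a set of (src_index, dst_index, iteration_delta) triples.
    """
    last_writer = {}  # register -> (iteration, index) of its last writer
    deps = set()
    for it in range(unroll):
        for idx, (reads, writes) in enumerate(kernel):
            # Process reads before writes, so that an instruction reading and
            # writing the same register depends on the previous writer.
            for reg in reads:
                if reg in last_writer:
                    w_it, w_idx = last_writer[reg]
                    deps.add((w_idx, idx, it - w_it))
            for reg in writes:
                last_writer[reg] = (it, idx)
    return deps


# A Fibonacci-like kernel (one possible encoding, chosen for this example):
fib = [
    (("a", "b"), ("t",)),  # t = a + b
    (("b",), ("a",)),      # a = b
    (("t",), ("b",)),      # b = t
]
print(reg_deps(fib, 1))  # {(0, 2, 0)}: only the in-iteration dependency
print(reg_deps(fib, 2))  # also finds the three loop-carried dependencies
```

With a single copy of the kernel, the dependencies that cross the iteration boundary are invisible; they only appear once the kernel is unrolled.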
## Staticdeps heuristic
* Aims to solve the 2nd point in a simple way.
* Could be solved with symbolic calculus, but not that easy to implement,
  and slower.
* Use random values instead
* Operates at the scale of a kernel, unrolled enough times to fill the ROB
* Whenever reading an unknown value (from memory or a register), generate a fresh
  random value (64 bits), save it to the shadow memory/register file
* Whenever encountering integer arithmetic, compute the operation
* Whenever encountering other kinds of operations or unsupported operations,
  define the result as invalid (\bot): not pointer arithmetic.
* Whenever writing to a memory address, keep track of which PC wrote where.
* Whenever reading from a memory address, generate a dependency to the PC that
  last wrote to it, if any.
* Reconstruct recurrent dependencies: transcribe each dependency to
  `(src, dst, kernel delta)`.
* Verify that the dependency exists for each unroll (where it can exist, e.g.
  the 1st kernel cannot depend on the previous kernel unroll); if it happens in the
  majority of cases, keep; else drop
* Semantics of asm coming from Valgrind's IR -- should be portable to any
  architecture supported
  * but suffers limitations for recent extension sets; e.g. AVX-512 not
    supported (TODO check)
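The steps above can be sketched as follows. This is a toy model, not the actual implementation: the tuple-based instruction format (`add`, `load`, `store`, `other`) is a hypothetical stand-in for Valgrind's IR, the names `staticdeps`, `BOT` and `MASK` are chosen for this example, and the "majority of cases" rule is interpreted here as "more than half of the iterations in which the dependency can exist".

```python
import random
from collections import Counter

MASK = (1 << 64) - 1
BOT = None  # invalid value: result of an unsupported operation


def staticdeps(kernel, unroll, seed=0):
    """Return the set of memory-carried (src, dst, kernel_delta) dependencies."""
    rng = random.Random(seed)
    regs = {}         # shadow register file: name -> 64-bit value or BOT
    mem = {}          # shadow memory: address -> 64-bit value or BOT
    last_writer = {}  # address -> (iteration, index) of the last store

    def reg(name):
        if name not in regs:  # unknown value: draw a fresh random one
            regs[name] = rng.getrandbits(64)
        return regs[name]

    def addr(base, off):
        b = reg(base)
        return BOT if b is BOT else (b + off) & MASK

    found = Counter()
    for it in range(unroll):
        for idx, ins in enumerate(kernel):
            op = ins[0]
            if op == "add":  # ("add", dst, src): src is a register or an int
                _, dst, src = ins
                a = reg(dst)
                b = src if isinstance(src, int) else reg(src)
                regs[dst] = BOT if a is BOT or b is BOT else (a + b) & MASK
            elif op == "load":  # ("load", dst, base, offset)
                _, dst, base, off = ins
                a = addr(base, off)
                if a is BOT:
                    regs[dst] = BOT
                    continue
                if a in last_writer:  # RaW dependency through memory
                    w_it, w_idx = last_writer[a]
                    found[(w_idx, idx, it - w_it)] += 1
                if a not in mem:
                    mem[a] = rng.getrandbits(64)
                regs[dst] = mem[a]
            elif op == "store":  # ("store", src, base, offset)
                _, src, base, off = ins
                a = addr(base, off)
                if a is not BOT:
                    mem[a] = reg(src)
                    last_writer[a] = (it, idx)
            else:  # ("other", dst): unsupported operation
                regs[ins[1]] = BOT
    # A dependency with delta d can only exist in `unroll - d` iterations.
    return {d for d, n in found.items() if 2 * n > unroll - d[2]}


kernel = [
    ("load", "rcx", "rax", -8),  # rcx = mem[rax - 8]
    ("store", "rdx", "rax", 0),  # mem[rax] = rdx
    ("add", "rax", 8),           # rax += 8
]
print(staticdeps(kernel, unroll=8))  # {(1, 0, 1)}
```

In this example, the load of each iteration reads the address written by the store of the previous iteration, which the random base value of `rax` reveals without any symbolic reasoning.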
### Limitations
* Does not track aliasing that originates from outside of the kernel.
  * As advocated in CesASMe, would require a broader analysis range
* Randomness may lead to false positives
  * but re-running with a different seed should eliminate the hazard almost
    entirely
* Should not have false negatives outside of aliasing or unsupported ops
## Evaluation
### Dependency detection
#### With Valgrind
* Instrument binary:
  * for each write, add `write_addr -> writer_pc` to a hashmap
  * for each read, fetch `writer_pc` from the hashmap
  * if found, add a dependency `reader_pc -> writer_pc`
* At the end, write the deps file:
  * `#occur, src_elf_pc, src_elf_path, dst_elf_pc, dst_elf_path`
* Run the instrumentation on each binary in genbenchs
* For each binary in genbenchs,
  * for each BB with more than 10% of max BB hits,
    * predict deps with staticdeps
    * check which dependencies are found/missed from the instrumented ones
  * limitation: will only find deps from/to the same BB! Dependencies leaving
    a BB are discarded.
* Result: about 38% of deps found.
* Cause: kernels executed in loops.
  * No dependency in the kernel
```
while:
    read (%rax)
    %rax++
    write (%rax)
```
  * But dependencies if executed in a loop! "Unwanted" deps.
  * and irrelevant in real life anyway: they are far away and will not cause
    latency
* Fix: introduce dependency lifetime
  * timestamp = number of instructions executed (VG instrumentation, added up at the
    end of each BB)
  * lifetime fixed to 1024 instructions
  * dependencies are discarded if the address was written to more than a lifetime ago
* Result: about (?? TODO) of deps found
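The instrumentation logic, including the lifetime fix, can be sketched on a recorded trace rather than inside Valgrind. This is a simplified illustration: the trace format `(pc, kind, addr)` and the name `dynamic_deps` are assumptions made for this example, and the timestamp is incremented once per trace entry (assuming one memory access per instruction) instead of being added up at the end of each basic block.

```python
from collections import Counter

LIFETIME = 1024  # instructions


def dynamic_deps(trace, lifetime=LIFETIME):
    """Count memory-carried dependencies in an execution trace.

    trace: iterable of (pc, kind, addr), with kind "r" (read) or "w" (write).
    Returns a Counter mapping (reader_pc, writer_pc) to its occurrence count.
    """
    last_write = {}  # addr -> (writer_pc, timestamp of the write)
    deps = Counter()
    for timestamp, (pc, kind, addr) in enumerate(trace):
        if kind == "w":
            last_write[addr] = (pc, timestamp)
        elif kind == "r" and addr in last_write:
            writer_pc, written_at = last_write[addr]
            # Discard dependencies whose write is older than the lifetime.
            if timestamp - written_at <= lifetime:
                deps[(pc, writer_pc)] += 1
    return deps


trace = [(0x10, "w", 0x1000), (0x20, "r", 0x1000)]  # close dependency: kept
trace += [(0x30, "r", 0x2000)] * 2000               # unrelated accesses
trace += [(0x20, "r", 0x1000)]                      # same address, too far away
print(dynamic_deps(trace))  # Counter({(32, 16): 1})
```

The second read of `0x1000` happens more than 1024 entries after the write, so it is not counted; with an unbounded lifetime it would be reported as a second occurrence.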
#### With Gus
TODO ?
### uiCA enriching
* Plug staticdeps into uiCA
  * uiCA has a μop-level representation; staticdeps has an instr-level
    representation
  * Add dependencies between each pair of μops in (src, dst).
    * A finer model would be necessary to be accurate
    * Pessimistic model
* Run CesASMe on the full suite with uiCA and uiCA+staticdeps
  * results
* Run CesASMe on the no-memdeps suite with uiCA and uiCA+staticdeps
  * results
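The pessimistic μop-level expansion mentioned above amounts to a Cartesian product between the μops of the source and destination instructions. The sketch below only illustrates that idea: `expand_to_uops`, the `uops_of` mapping and the μop names are hypothetical and do not reflect uiCA's actual data structures.

```python
from itertools import product


def expand_to_uops(instr_deps, uops_of):
    """Turn instruction-level dependencies into μop-level ones.

    instr_deps: iterable of (src_instr, dst_instr) pairs.
    uops_of: dict mapping an instruction to the list of its μops.
    Every μop of dst is made dependent on every μop of src (pessimistic).
    """
    return {
        (s, d)
        for src, dst in instr_deps
        for s, d in product(uops_of[src], uops_of[dst])
    }


uops_of = {
    0: ["store_addr", "store_data"],  # μops of a store instruction
    1: ["load", "alu"],               # μops of a load+op instruction
}
uop_deps = expand_to_uops({(0, 1)}, uops_of)
print(len(uop_deps))  # 4
```

A finer model would, for instance, link only the store-data μop to the load μop; the product above over-approximates and can therefore lengthen the predicted critical path.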