Static extraction of memory-carried dependencies

Intro

Previous chapter: effect of memory-carried dependencies
Presented solution: Gus; more generally, dynamic analysis.
- Effective
- 2 orders of magnitude slower => not acceptable in many cases
We need a static solution

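4 main types:
1. Read after Write (RaW)
2. Write after Read (WaR)
3. Write after Write (WaW)
4. Read after Read (RaR)

4: not an issue.
2, 3: assuming the μarch has a renamer and enough physical registers, not a problem either. Might be a problem on some architectures.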

Throughout this chapter, we consider only RaW dependencies. The solution can easily be extended to WaW and WaR if necessary.

Can occur:

Can be:

loop-carried:

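for (i = 1; i < n; i++) {
    B[i] = A[i-1] + 2;  // reads A[i-1], written at the previous iteration
    A[i] = 7;           // written here, read at the next iteration
}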

Dependencies are costly: assuming everything is L1-resident, the latency of each μop on the dependency chain must be paid in full.

On SKX,

add %rax, %rdx: lat = 1C (reciprocal TP = 0.25C) => add %rax, %rdx ; add %rdx, %rcx: 1.25C, would be 0.5C without deps
vfmadd*pd %ymm0, %ymm1, %ymm2: lat = 4C (reciprocal TP = 0.5C)

Reg-carried, straight-line: relatively easy. Keep track of which PC last wrote each register.
Reg-carried, loop-carried: can be adapted from the straight-line case. Indeed,
- only need to track a bounded number of iterations back: past a certain point, instructions have left the ROB anyway
  - 224 μops in Intel's Skylake, 2015
  - 512 μops in Intel's Golden Cove, 2021
- Can unroll until we have ~|ROB|+|K| instructions in the kernel: since each instruction yields at least one μop, this is safe [TODO check]
- Some tools unroll only once, e.g. OSACA. Not sufficient; e.g. Fibonacci.
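
A minimal sketch of the register case, assuming a hypothetical kernel encoding where each instruction is a pair (registers read, registers written). The function name and the encoding are illustrative, not taken from any existing tool.

```python
from collections import Counter

def reg_raw_deps(kernel, unroll):
    """kernel: list of (reads, writes) tuples of register names, one per instruction.
    Returns a Counter of (writer_idx, reader_idx, iteration_delta) RaW dependencies."""
    last_writer = {}  # register -> (iteration, instruction index) of its last write
    deps = Counter()
    for it in range(unroll):
        for idx, (reads, writes) in enumerate(kernel):
            # Reads are handled first: an instruction reads its operands before writing.
            for reg in reads:
                if reg in last_writer:
                    w_it, w_idx = last_writer[reg]
                    deps[(w_idx, idx, it - w_it)] += 1
            for reg in writes:
                last_writer[reg] = (it, idx)
    return deps

# add %rax, %rdx ; add %rdx, %rcx (AT&T syntax: the last operand is read and written)
kernel = [(("rax", "rdx"), ("rdx",)), (("rdx", "rcx"), ("rcx",))]
print(sorted(reg_raw_deps(kernel, unroll=3)))
# -> [(0, 0, 1), (0, 1, 0), (1, 1, 1)]
```

An iteration delta of 0 is a straight-line dependency; a delta of 1 or more is loop-carried. In practice the unroll count would be derived from the ROB size, as discussed above.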
Harder for memory-carried:
- addresses may alias, e.g. (%rax) = 8(%rbx)
- pointer arithmetic: must track values
- Usually not done, or only for trivial cases.

staticdeps aims to solve the 2nd point (pointer arithmetic) in a simple way.
Could be solved with symbolic computation, but that is harder to implement and slower.
Instead, use random values.
Operates at the scale of a kernel, unrolled enough times to fill the ROB.
Whenever reading an unknown value (from memory or a register), generate a fresh random 64-bit value and save it to the shadow memory/register file.
Whenever encountering integer arithmetic, compute the operation on the shadow values.
Whenever encountering any other kind of operation, or an unsupported one, mark the result as invalid (\bot): it is not pointer arithmetic.
Whenever writing to a memory address, keep track of which PC wrote there.
Whenever reading from a memory address, generate a dependency to the PC that last wrote there, if any.
Reconstruct recurrent dependencies: transcribe each dependency to (src, dst, kernel delta).
Verify that the dependency exists for each unroll (where it can exist, e.g. the 1st kernel cannot depend on the previous kernel unroll); if it holds in the majority of cases, keep it; else drop it.
Semantics of asm come from Valgrind's IR -- should be portable to any architecture it supports
- but suffers limitations for recent extension sets; e.g. AVX-512 not supported (TODO check)
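
The steps above can be sketched as follows. This is an illustration under strong assumptions, not the actual implementation: the three-opcode instruction set ("addi", "load", "store"), the tuple encoding and the function name are all hypothetical, and the real tool works on Valgrind's IR.

```python
import random
from collections import Counter

MASK = (1 << 64) - 1
BOT = None  # invalid value: result of an unsupported operation

def staticdeps_sketch(kernel, unroll, seed=0):
    """Returns the set of (writer_idx, reader_idx, kernel_delta) memory RaW deps."""
    rng = random.Random(seed)
    regs, mem, writer = {}, {}, {}  # shadow registers, shadow memory, addr -> (iter, idx)
    found = Counter()

    def read_reg(r):
        if r not in regs:               # unknown value: fresh random 64-bit value
            regs[r] = rng.getrandbits(64)
        return regs[r]

    def address(base, off):
        b = read_reg(base)
        return BOT if b is BOT else (b + off) & MASK

    for it in range(unroll):
        for idx, ins in enumerate(kernel):
            if ins[0] == "addi":        # ("addi", reg, imm): reg += imm
                v = read_reg(ins[1])
                regs[ins[1]] = BOT if v is BOT else (v + ins[2]) & MASK
            elif ins[0] == "load":      # ("load", dst, base, off): dst = mem[base + off]
                _, dst, base, off = ins
                addr = address(base, off)
                if addr is BOT:
                    regs[dst] = BOT
                    continue
                if addr in writer:      # a tracked instruction wrote here: dependency
                    w_it, w_idx = writer[addr]
                    found[(w_idx, idx, it - w_it)] += 1
                if addr not in mem:
                    mem[addr] = rng.getrandbits(64)
                regs[dst] = mem[addr]
            elif ins[0] == "store":     # ("store", src, base, off): mem[base + off] = src
                _, src, base, off = ins
                addr = address(base, off)
                if addr is not BOT:
                    mem[addr] = read_reg(src)
                    writer[addr] = (it, idx)
            else:                       # anything else: (opcode, dst), result is invalid
                regs[ins[1]] = BOT
    # A dependency of delta d can only occur in (unroll - d) unrolls: majority vote.
    return {d for d, n in found.items() if 2 * n > unroll - d[2]}

# Each iteration loads the value stored by the previous one.
kernel = [("load", "rcx", "rax", -8), ("store", "rcx", "rax", 0), ("addi", "rax", 8)]
print(staticdeps_sketch(kernel, unroll=4))
# -> {(1, 0, 1)}
```

The random base value of `%rax` plays the role of a symbolic pointer: `-8(%rax)` at one iteration and `0(%rax)` at the previous one evaluate to the same 64-bit number, so the dependency is found without any symbolic reasoning.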

Does not track aliasing that originates from outside of the kernel.
- As advocated in CesASMe, would require a broader analysis range
Randomness may lead to false positives
- but re-running with a different seed should almost entirely eliminate this hazard
Should not have false negatives, apart from aliasing or unsupported ops

Instrument binary:
- for each write, add write_addr -> writer_pc to a hashmap
- for each read, fetch writer_pc from hashmap
  - if found, add a dependency reader_pc -> writer_pc
- At the end, write deps file:
  - #occur, src_elf_pc, src_elf_path, dst_elf_pc, dst_elf_path
- Run for each binary in genbenchs
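
A minimal sketch of the bookkeeping described above, replayed over a recorded trace instead of being done inside a Valgrind tool. The trace format `(pc, kind, addr)` and the function name are assumptions made for illustration.

```python
from collections import Counter

def trace_deps(trace):
    """trace: iterable of (pc, kind, addr), kind being "r" (read) or "w" (write).
    Returns a Counter mapping (reader_pc, writer_pc) to its number of occurrences."""
    last_writer = {}  # write_addr -> writer_pc
    deps = Counter()
    for pc, kind, addr in trace:
        if kind == "w":
            last_writer[addr] = pc
        elif addr in last_writer:
            deps[(pc, last_writer[addr])] += 1
    return deps

trace = [
    (0x10, "w", 0x1000),
    (0x14, "r", 0x1000),  # reads what 0x10 wrote
    (0x18, "r", 0x2000),  # never written in the trace: no dependency
    (0x14, "r", 0x1000),  # same dependency, second occurrence
]
for (reader, writer), occur in trace_deps(trace).items():
    print(occur, hex(reader), hex(writer))
# -> 2 0x14 0x10
```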
For each binary in genbenchs,
- for each basic block (BB) with more than 10% of the hit count of the most-executed BB,
  - predict deps with staticdeps
  - check which of the instrumented dependencies are found/missed
- limitation: will only find deps whose source and destination are in the same BB! Dependencies leaving a BB are discarded.
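
The comparison step can be sketched as below. This assumes dependencies are represented as `(src_pc, dst_pc)` pairs and that a `bb_of` mapping from PC to basic block identifier is available; both are hypothetical, and whether the reported ratio is weighted by occurrence count is not specified in these notes (the sketch is unweighted).

```python
def coverage(dynamic_deps, static_deps, bb_of):
    """dynamic_deps, static_deps: sets of (src_pc, dst_pc); bb_of: dict pc -> BB id.
    Returns (found, missed) over the dynamic dependencies that stay inside one BB."""
    # Dependencies leaving a basic block are discarded before comparing.
    in_bb = {(s, d) for s, d in dynamic_deps if bb_of[s] == bb_of[d]}
    found = in_bb & static_deps
    return found, in_bb - found

bb_of = {0x10: "A", 0x14: "A", 0x18: "A", 0x40: "B"}
dynamic = {(0x10, 0x14), (0x10, 0x18), (0x14, 0x40)}
static = {(0x10, 0x14)}
found, missed = coverage(dynamic, static, bb_of)
print(len(found), len(missed))
# -> 1 1
```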
Result: about 38% of deps found.
Cause: kernels executed in loops.
- No dependency in the kernel
```
while:
    read (%rax)
    %rax ++
    write (%rax)
```
- But dependencies appear if it is executed in a loop! "Unwanted" deps.
- and irrelevant in real life anyway: source and destination are too far apart to cost any latency
Fix: introduce dependency lifetime
- timestamp = number of instructions executed (Valgrind instrumentation, incremented at the end of each BB)
- lifetime fixed to 1024 instructions
- a dependency is discarded if the write happened more than one lifetime before the read
Result: about (?? TODO) of deps found
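
A sketch of the lifetime filter, assuming a trace of `(timestamp, pc, kind, addr)` events where the timestamp is the executed-instruction counter. The names are illustrative; in the instrumentation described above the counter only advances at basic block boundaries, so the filter is coarser than in this sketch.

```python
def trace_deps_with_lifetime(trace, lifetime=1024):
    """trace: iterable of (timestamp, pc, kind, addr), kind being "r" or "w".
    Returns the set of (reader_pc, writer_pc) whose write is at most `lifetime`
    instructions older than the read."""
    last_writer = {}  # addr -> (writer_pc, timestamp of the write)
    deps = set()
    for now, pc, kind, addr in trace:
        if kind == "w":
            last_writer[addr] = (pc, now)
        elif addr in last_writer:
            writer_pc, written_at = last_writer[addr]
            if now - written_at <= lifetime:
                deps.add((pc, writer_pc))
    return deps

trace = [
    (0, 0x10, "w", 0x1000),
    (5, 0x14, "r", 0x1000),     # 5 instructions after the write: kept
    (5000, 0x18, "r", 0x1000),  # 5000 instructions after the write: discarded
]
print(trace_deps_with_lifetime(trace))
# -> {(20, 16)}
```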

TODO ?

Plug staticdeps into uiCA
uiCA has a μop-level representation; staticdeps has an instruction-level representation
- For each dependency (src, dst), add a dependency between each pair (μop of src, μop of dst).
- A finer model would be necessary to be accurate
- Pessimistic model
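
The pessimistic expansion amounts to a Cartesian product per dependency. The sketch below assumes a hypothetical `uops_of` mapping from instruction to its list of μops; it does not reflect uiCA's actual data structures.

```python
from itertools import product

def expand_to_uops(instr_deps, uops_of):
    """instr_deps: iterable of (src_instr, dst_instr) pairs;
    uops_of: dict instr -> list of μop identifiers.
    Returns the set of (src_uop, dst_uop) pairs, one per combination."""
    return {
        pair
        for src, dst in instr_deps
        for pair in product(uops_of[src], uops_of[dst])
    }

# Hypothetical decomposition: instruction 0 yields two μops, instruction 1 yields one.
uops_of = {0: ["0.a", "0.b"], 1: ["1.a"]}
print(sorted(expand_to_uops([(0, 1)], uops_of)))
# -> [('0.a', '1.a'), ('0.b', '1.a')]
```

Only one of these pairs may be a real dependency (e.g. the store μop of the source to the load μop of the destination), which is why the model is pessimistic.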
Run CesASMe on the full suite with uiCA and uiCA+staticdeps
- results
Run CesASMe on the no-memdeps suite with uiCA and uiCA+staticdeps
- results