phd-thesis/plan/60_staticdeps.md

# Static extraction of memory-carried dependencies

## Intro

* Previous chapt. : effect of mem-carried deps
* Presented solution: Gus; in general dynamic analysis.
    * Effective
    * 2 O.M. slower => not acceptable in many cases
* We need a static solution

## Types of dependencies

4 main types:
1. RaW: "real" dependency
2. WaW
3. WaR
4. RaW

* 4: not an issue.
* 2,3 : assuming the μarch has a renamer & enough μarch registers, not a problem
  either. Might be a problem for some archs.

In all this chapter, we consider only RaW deps. Solution can be easily extended
for WaW, WaR if necessary.

Can occur:
* through registers
    ```
    A = 7
    B = A + 2
    ```
* through memory
    ```
    store %rax, (%rbx)
    add (%rbx), %rcx
    ```

Can be:
* in straight-line code
* loop-carried:
    ```
    for(i)
        B[i] = A[i-1] + 2
        A[i] = 7
    ```
## Cost of dependencies

Dependencies are costly: assuming everything L1-resident, the latency of each
μop on the dependency chain must be paid.

On SKX,
* `add %rax, %rdx`  -> lat = 1 cycle (throughput = 1/4C)
    => `add %rax, %rdx ; add %rdx, %rcx` : 1.25C, would be 0.5C without deps
* `vfmadd*pd %ymm0, %ymm1, %ymm2`: lat = 4C (TP = 1/2C)

## Static detection

* Reg-carried, straight-line: relatively easy. Keep track of which PC last
  wrote each register.
* Reg-carried, loop-carried: can be adapted from straight-line. Indeed,
    * need to track only so many iterations behind: after a certain point,
      instructions are out of the ROB anyway
        * 224 μops in Intel's Skylake, 2015
        * 512 μops in Intel's Golden Cove, 2021
    * Can unroll until we have ~|ROB|+|K| instructions in the kernel: since
      instructions yield at least a μop, safe [TODO check]
    * Sometimes unrolled only once, eg. Osaca. Not sufficient; eg. Fibo.

* Harder for memory-carried:
    * addresses may alias, eg. (%rax) = 8(%rbx)
    * pointer arithmetics: must track values
    * Usually not done, or only for trivial cases.

## Staticdeps heuristic

* Aims to simply solve the 2nd point.
* Could be solved with symbolic calculus, but not that easy to implement,
  slower.
* Use random values
* Operates at the scale of a kernel, unrolled enough times to fill the ROB


* Whenever reading an unknown value (from mem or register), generate a fresh
  random value (64b), save it to shadow memory/register file
* Whenever encountering integer arithmetics, compute the operation
* Whenever encountering other kind of operations or unsupported operations,
  define the result as invalid (\bot): not pointer arithmetics.
* Whenever writing to a memory address, keep track of which PC wrote where.
* Whenever reading from a memory address, generate a dependency to the writing
  PC.

* Reconstruct recurrent dependencies: transcribe each dependency to
`(src, dst, kernel delta)`.
* Verify that the dependency exists for each unroll (where it can exist, eg.
  1st kernel cannot depend on the previous kernel unroll); if it happens in the
  majority of cases, keep; else drop

* Semantics of asm coming from Valgrind's IR -- should be portable to any
  architecture supported
    * but suffers limitations for recent extension sets; eg avx512 not
      supported (TODO check)

### Limitations

* Does not track aliasing that originates from outside of the kernel.
    * As advocated in CesASMe, would require a broader analysis range
* Randomness may lead to false positives
    * but re-running with different seed should eliminate the hazard close to
      entirely
* Should not have false negatives outside of aliasing or unsupported ops

## Evaluation

### Dependencies detection

#### With valgrind

* Instrument binary:
    * for each write, add `write_addr -> writer_pc` to a hashmap
    * for each read, fetch `writer_pc` from hashmap
        * if found, add a dependency `reader_pc -> writer_pc`
    * At the end, write deps file:
        * `#occur, src_elf_pc, src_elf_path, dst_elf_pc, dst_elf_path`
    * Run for each binary in genbenchs

* For each binary in genbenchs,
    * for each BB with more than 10% of max BB hits,
    * predict deps with staticdeps
    * check which dependencies are found/missed from the instrumented ones
    * limitation: will only find deps from/to the same BB! Dependencies leaving
      a BB are discarded.

* Result: about 38% of deps found.

* Cause: kernels executed in loops.
    * No dependency in the kernel
    ```
    while:
        read (%rax)
        %rax ++
        write (%rax)
    ```
    * But dependencies if executed in a loop! "Unwanted" deps.
    * and irrelevant in real life anyway: they are far away and will not cause
      latency
* Fix: introduce dependency lifetime
    * timestamp = instructions executed (VG instrumentation, added up at the
      end of each BB)
    * lifetime fixed to 1024 instructions
    * dependencies are discarded if written to more than a lifetime ago

* Result: about (?? TODO) of deps found

#### With Gus

TODO ?

### UiCA enriching

* Plug Staticdeps into UiCA
* UiCA has a μop-level representation; staticdeps has an instr-level
  representation
    * Add dependencies between each couple of μop in (src,dest).
    * A finer model would be necessary to be accurate
    * Pessimistic model
* Run CesASMe on the full suite with uiCA and uiCA+staticdeps
    * results
* Run CesASMe on the no-memdeps suite with uiCA and uiCA+staticdeps
    * results
Staticdeps: first tentative version 2023-09-08 16:58:38 +02:00			`# Static extraction of memory-carried dependencies`

			`## Intro`

			`* Previous chapt. : effect of mem-carried deps`
			`* Presented solution: Gus; in general dynamic analysis.`
			`* Effective`
			`* 2 O.M. slower => not acceptable in many cases`
			`* We need a static solution`

			`## Types of dependencies`

			`4 main types:`
			`1. RaW: "real" dependency`
			`2. WaW`
			`3. WaR`
			`4. RaW`

			`* 4: not an issue.`
			`* 2,3 : assuming the μarch has a renamer & enough μarch registers, not a problem`
			`either. Might be a problem for some archs.`

			`In all this chapter, we consider only RaW deps. Solution can be easily extended`
			`for WaW, WaR if necessary.`

			`Can occur:`
			`* through registers`
			```
			`A = 7`
			`B = A + 2`
			```
			`* through memory`
			```
			`store %rax, (%rbx)`
			`add (%rbx), %rcx`
			```

			`Can be:`
			`* in straight-line code`
			`* loop-carried:`
			```
			`for(i)`
			`B[i] = A[i-1] + 2`
			`A[i] = 7`
			```
			`## Cost of dependencies`

			`Dependencies are costly: assuming everything L1-resident, the latency of each`
			`μop on the dependency chain must be paid.`

			`On SKX,`
			* `add %rax, %rdx` -> lat = 1 cycle (throughput = 1/4C)
			=> `add %rax, %rdx ; add %rdx, %rcx` : 1.25C, would be 0.5C without deps
			* `vfmadd*pd %ymm0, %ymm1, %ymm2`: lat = 4C (TP = 1/2C)

			`## Static detection`

			`* Reg-carried, straight-line: relatively easy. Keep track of which PC last`
			`wrote each register.`
			`* Reg-carried, loop-carried: can be adapted from straight-line. Indeed,`
			`* need to track only so many iterations behind: after a certain point,`
			`instructions are out of the ROB anyway`
			`* 224 μops in Intel's Skylake, 2015`
			`* 512 μops in Intel's Golden Cove, 2021`
			`* Can unroll until we have ~\|ROB\|+\|K\| instructions in the kernel: since`
			`instructions yield at least a μop, safe [TODO check]`
			`* Sometimes unrolled only once, eg. Osaca. Not sufficient; eg. Fibo.`

			`* Harder for memory-carried:`
			`* addresses may alias, eg. (%rax) = 8(%rbx)`
			`* pointer arithmetics: must track values`
			`* Usually not done, or only for trivial cases.`

			`## Staticdeps heuristic`

			`* Aims to simply solve the 2nd point.`
			`* Could be solved with symbolic calculus, but not that easy to implement,`
			`slower.`
			`* Use random values`
			`* Operates at the scale of a kernel, unrolled enough times to fill the ROB`


			`* Whenever reading an unknown value (from mem or register), generate a fresh`
			`random value (64b), save it to shadow memory/register file`
			`* Whenever encountering integer arithmetics, compute the operation`
			`* Whenever encountering other kind of operations or unsupported operations,`
			`define the result as invalid (\bot): not pointer arithmetics.`
			`* Whenever writing to a memory address, keep track of which PC wrote where.`
			`* Whenever reading from a memory address, generate a dependency to the writing`
			`PC.`

			`* Reconstruct recurrent dependencies: transcribe each dependency to`
			`(src, dst, kernel delta)`.
			`* Verify that the dependency exists for each unroll (where it can exist, eg.`
			`1st kernel cannot depend on the previous kernel unroll); if it happens in the`
			`majority of cases, keep; else drop`

			`* Semantics of asm coming from Valgrind's IR -- should be portable to any`
			`architecture supported`
			`* but suffers limitations for recent extension sets; eg avx512 not`
			`supported (TODO check)`

			`### Limitations`

			`* Does not track aliasing that originates from outside of the kernel.`
			`* As advocated in CesASMe, would require a broader analysis range`
			`* Randomness may lead to false positives`
			`* but re-running with different seed should eliminate the hazard close to`
			`entirely`
			`* Should not have false negatives outside of aliasing or unsupported ops`

			`## Evaluation`

			`### Dependencies detection`

Plan: staticdeps eval 2023-09-13 11:57:51 +02:00			`#### With valgrind`

			`* Instrument binary:`
			* for each write, add `write_addr -> writer_pc` to a hashmap
			* for each read, fetch `writer_pc` from hashmap
			* if found, add a dependency `reader_pc -> writer_pc`
			`* At the end, write deps file:`
			* `#occur, src_elf_pc, src_elf_path, dst_elf_pc, dst_elf_path`
			`* Run for each binary in genbenchs`

			`* For each binary in genbenchs,`
			`* for each BB with more than 10% of max BB hits,`
			`* predict deps with staticdeps`
			`* check which dependencies are found/missed from the instrumented ones`
			`* limitation: will only find deps from/to the same BB! Dependencies leaving`
			`a BB are discarded.`

			`* Result: about 38% of deps found.`

			`* Cause: kernels executed in loops.`
			`* No dependency in the kernel`
			```
			`while:`
			`read (%rax)`
			`%rax ++`
			`write (%rax)`
			```
			`* But dependencies if executed in a loop! "Unwanted" deps.`
			`* and irrelevant in real life anyway: they are far away and will not cause`
			`latency`
			`* Fix: introduce dependency lifetime`
			`* timestamp = instructions executed (VG instrumentation, added up at the`
			`end of each BB)`
			`* lifetime fixed to 1024 instructions`
			`* dependencies are discarded if written to more than a lifetime ago`

			`* Result: about (?? TODO) of deps found`

			`#### With Gus`

			`TODO ?`
Staticdeps: first tentative version 2023-09-08 16:58:38 +02:00
			`### UiCA enriching`

			`* Plug Staticdeps into UiCA`
			`* UiCA has a μop-level representation; staticdeps has an instr-level`
			`representation`
			`* Add dependencies between each couple of μop in (src,dest).`
			`* A finer model would be necessary to be accurate`
			`* Pessimistic model`
			`* Run CesASMe on the full suite with uiCA and uiCA+staticdeps`
			`* results`
			`* Run CesASMe on the no-memdeps suite with uiCA and uiCA+staticdeps`
			`* results`