tobast/phd-thesis

Théophile Bastian c2acf78476 Init staticdeps, write §1

2023-09-27 17:02:30 +02:00

Static extraction of memory-carried dependencies

PREREQUISITES

CesASMe results
Gus
Static vs dynamic
PC
μarch: μop, renamer, L1-res, ROB
Osaca
UiCA END

Intro

Previous chapter: effect of memory-carried dependencies
Presented solution: Gus; in general, dynamic analysis.
- Effective
- Two orders of magnitude slower => not acceptable in many cases
We need a static solution

Dependencies are costly: assuming everything L1-resident, the latency of each μop on the dependency chain must be paid.

On SKX,

add %rax, %rdx: lat = 1C (reciprocal throughput = 1/4C) => add %rax, %rdx ; add %rdx, %rcx takes 1.25C; would be 0.5C without the dependency
vfmadd*pd %ymm0, %ymm1, %ymm2: lat = 4C (reciprocal throughput = 1/2C)

Types of dependencies

4 main types:

RaW: "real" dependency
WaW
WaR
RaR

RaR: not an issue.
WaW, WaR: assuming the μarch has a renamer and enough physical registers, not a problem either. Might be a problem for some architectures.

Throughout this chapter, we consider only RaW deps. The solution can easily be extended to WaW and WaR if necessary.

Can occur:

through registers
```
A = 7
B = A + 2
```
through memory
```
mov %rax, (%rbx)    # store
add (%rbx), %rcx
```

Can be:

in straight-line code

loop-carried:

for(i)
    B[i] = A[i-1] + 2
    A[i] = 7

Dynamic detection: Valgrind

Mention Gus

Valgrind's VEX

Introduce Valgrind as an instrumentation tool
Introduce VEX
Should be portable to any architecture supported
- but suffers from limitations on recent instruction set extensions; e.g. AVX-512 not supported (TODO check)

Depsim

Write a tool, valgrind-depsim, to instrument a binary to extract its dependencies at runtime
Can extract memory, register and temp-based dependencies
Here, only the memory dependencies are relevant -- disable the other deps.
Instrument binary:
- for each write, add write_addr -> writer_pc to a hashmap
- for each read, fetch writer_pc from hashmap
  - if found, add a dependency reader_pc -> writer_pc
- use the process' memory map to translate PC to addresses inside ELF files
- At the end, write deps file:
  - #occur, src_elf_pc, src_elf_path, dst_elf_pc, dst_elf_path
- Run for each binary in genbenchs
- Takes about 1h on 30 parallel cores on Pinocchio; heavy memory usage
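
The instrumentation steps above can be sketched as follows. This is only a minimal model, assuming the execution is available as a trace of `(pc, kind, addr)` events; the function name and the trace format are hypothetical, and the real tool hooks into Valgrind instead of replaying a trace. Access sizes and the PC-to-ELF translation are left out.

```python
from collections import Counter


def extract_memory_deps(trace):
    """trace: iterable of (pc, kind, addr), with kind "R" (read) or "W" (write).

    Returns a Counter mapping (reader_pc, writer_pc) to its number of occurrences.
    """
    last_writer = {}  # address -> PC of the last instruction that wrote it
    deps = Counter()
    for pc, kind, addr in trace:
        if kind == "W":
            last_writer[addr] = pc
        elif kind == "R":
            writer_pc = last_writer.get(addr)
            if writer_pc is not None:
                deps[(pc, writer_pc)] += 1
    return deps


# Hypothetical trace: a store/load pair executed twice, plus an unrelated load.
trace = [
    (0x10, "W", 0x1000),
    (0x14, "R", 0x1000),
    (0x18, "R", 0x2000),  # address never written: no dependency
    (0x10, "W", 0x1000),
    (0x14, "R", 0x1000),
]
deps = extract_memory_deps(trace)
```

Here `deps` holds a single dependency, from reader `0x14` to writer `0x10`, seen twice: this corresponds to the `#occur` column of the deps file.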

Static detection

Reg-carried, straight-line: relatively easy. Keep track of which PC last wrote each register.
Reg-carried, loop-carried: can be adapted from straight-line. Indeed,
- need to track only so many iterations behind: after a certain point, instructions are out of the ROB anyway
  - 224 μops in Intel's Skylake, 2015
  - 512 μops in Intel's Golden Cove, 2021
  - Source: https://fuse.wikichip.org/news/6111/intel-details-golden-cove-next-generation-big-core-for-client-and-server-socs/ [consulted 2023-09-13]
- Can unroll until we have ~|ROB|+|K| instructions in the kernel: since each instruction yields at least one μop, this is safe [TODO check]
- Sometimes unrolled only once, e.g. OSACA. Not sufficient; e.g. Fibo.
Harder for memory-carried:
- addresses may alias, e.g. (%rax) = 8(%rbx)
- pointer arithmetic: must track values
- Usually not done, or only for trivial cases.
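
For the register-carried case, the "last writer" tracking over an unrolled kernel can be sketched as below. The kernel encoding (one `(reads, writes)` pair of register names per instruction) and the function name are assumptions made for this illustration; the sketch does not decode real assembly.

```python
def register_deps(kernel, unroll):
    """kernel: list of (reads, writes) tuples of register names, one per instruction.

    Returns a set of (src_idx, dst_idx, iteration_delta) RaW dependencies.
    """
    last_writer = {}  # register -> (iteration, index of the last writing instruction)
    deps = set()
    for iteration in range(unroll):
        for idx, (reads, writes) in enumerate(kernel):
            for reg in reads:
                if reg in last_writer:
                    w_iter, w_idx = last_writer[reg]
                    deps.add((w_idx, idx, iteration - w_iter))
            for reg in writes:
                last_writer[reg] = (iteration, idx)
    return deps


# Two mutually dependent additions:
#   0: add %rax, %rbx   (reads rax, rbx; writes rbx)
#   1: add %rbx, %rax   (reads rbx, rax; writes rax)
kernel = [(("rax", "rbx"), ("rbx",)), (("rbx", "rax"), ("rax",))]
straight_line = register_deps(kernel, unroll=1)
unrolled = register_deps(kernel, unroll=2)
```

With a single copy of the kernel, only the dependency from instruction 0 to instruction 1 is seen; the loop-carried ones (iteration delta of 1) only appear once a second copy is analyzed.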

Staticdeps heuristic

Aims to solve the second point (pointer arithmetic) in a simple way.
Could be solved with symbolic computation, but this is not that easy to implement, and slower.
Use random values
Operates at the scale of a kernel, unrolled enough times to fill the ROB
Whenever reading an unknown value (from mem or register), generate a fresh random value (64b), save it to shadow memory/register file
Whenever encountering integer arithmetic, compute the operation
Whenever encountering any other kind of operation, or an unsupported operation, define the result as invalid (\bot): it is assumed not to be pointer arithmetic.
Whenever writing to a memory address, keep track of which PC wrote where.
Whenever reading from a memory address, generate a dependency to the writing PC.
Reconstruct recurrent dependencies: transcribe each dependency to (src, dst, kernel delta).
Verify that the dependency exists in each unrolled iteration where it can exist (e.g. the first iteration of the kernel cannot depend on a previous one); if it is found in the majority of cases, keep it; otherwise, drop it
We need semantics for our assembly
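
The heuristic above can be sketched on a toy instruction set. Everything here is an assumption made for illustration: the three-operation instruction set (`load`, `store`, `addi`), the tuple encoding and the function name are hypothetical, and stand in for real assembly semantics. Stores through an invalid address are simply ignored in this sketch.

```python
import random
from collections import Counter

BOT = None            # invalid value (\bot): result of an unsupported operation
MASK = (1 << 64) - 1  # values are 64-bit


def static_mem_deps(kernel, unroll, seed=0):
    """Returns the set of (src_idx, dst_idx, kernel_delta) memory dependencies."""
    rng = random.Random(seed)
    regs, mem = {}, {}  # shadow register file and shadow memory
    last_writer = {}    # address -> (iteration, index of the writing instruction)
    seen = Counter()    # (src_idx, dst_idx, delta) -> number of occurrences

    def read_reg(name):
        if name not in regs:  # unknown value: draw a fresh random one
            regs[name] = rng.getrandbits(64)
        return regs[name]

    for iteration in range(unroll):
        for idx, ins in enumerate(kernel):
            op = ins[0]
            if op == "addi":  # integer arithmetic: compute it
                _, reg, imm = ins
                val = read_reg(reg)
                regs[reg] = BOT if val is BOT else (val + imm) & MASK
            elif op in ("load", "store"):
                _, reg, base, offset = ins
                base_val = read_reg(base)
                if base_val is BOT:  # unknown address: nothing can be tracked
                    if op == "load":
                        regs[reg] = BOT
                    continue
                addr = (base_val + offset) & MASK
                if op == "store":
                    mem[addr] = read_reg(reg)
                    last_writer[addr] = (iteration, idx)
                else:
                    if addr in last_writer:
                        w_iter, w_idx = last_writer[addr]
                        seen[(w_idx, idx, iteration - w_iter)] += 1
                    if addr not in mem:
                        mem[addr] = rng.getrandbits(64)
                    regs[reg] = mem[addr]
            else:  # unsupported operation: its destination becomes invalid
                regs[ins[1]] = BOT

    # A dependency with delta d can exist in (unroll - d) iterations;
    # keep it only if it was found in the majority of them.
    return {dep for dep, n in seen.items() if 2 * n > unroll - dep[2]}


# A[i] = A[i-1], with %rax pointing to A[i]
kernel = [
    ("load", "rcx", "rax", -8),
    ("store", "rcx", "rax", 0),
    ("addi", "rax", 8),
]
deps = static_mem_deps(kernel, unroll=4)
```

On this kernel, the only dependency reported is from the store (instruction 1) to the load (instruction 0) of the next iteration, that is `(1, 0, 1)`.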

Limitations

Does not track aliasing that originates from outside of the kernel.
- As advocated in CesASMe, would require a broader analysis range
Randomness may (theoretically) lead to false positives
- but re-running with different seed should eliminate the hazard close to entirely
Should not have false negatives outside of aliasing or unsupported ops

Evaluation

Dependencies detection

With valgrind

Use valgrind-depsim. Then, compare with staticdeps: eval/vg_depsim.py script.

For each binary in genbenchs,
- use genbench's bb split/occurrences to retrieve basic blocks
- for each BB with more than 10% of max BB hits,
  - predict deps with staticdeps
    - cache the result: staticdeps is fast, but we're dealing with 3500 files.
  - translate staticdeps' periodic deps to PC deps, discard the iter parameter
  - for each dependency from the depsim results that occurs inside this BB,
    - check if found or missed, append to a list
score: |found| / (|found| + |missed|). Discards occurrences.
limitation: will only find deps from/to the same BB! Dependencies leaving a BB are discarded.
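
A minimal sketch of this scoring, and of a variant weighted by occurrences. The data layout (depsim results as a dict from `(src_pc, dst_pc)` to occurrences, staticdeps results as a set of such pairs) and the function names are assumptions; they are not taken from `eval/vg_depsim.py`.

```python
def classify(dynamic_deps, static_deps):
    """Split the dynamically observed dependencies into found and missed ones."""
    found, missed = [], []
    for dep, occurrences in dynamic_deps.items():
        (found if dep in static_deps else missed).append((dep, occurrences))
    return found, missed


def success_rate(found, missed):
    """|found| / (|found| + |missed|), ignoring occurrences."""
    return len(found) / (len(found) + len(missed))


def weighted_success_rate(found, missed):
    """Same ratio, each dependency weighted by its number of occurrences."""
    found_weight = sum(occ for _, occ in found)
    missed_weight = sum(occ for _, occ in missed)
    return found_weight / (found_weight + missed_weight)


# Hypothetical data: three dynamic dependencies, two of them predicted statically.
dynamic_deps = {(0x10, 0x14): 90, (0x28, 0x30): 100, (0x18, 0x20): 10}
static_deps = {(0x10, 0x14), (0x28, 0x30)}
found, missed = classify(dynamic_deps, static_deps)
plain = success_rate(found, missed)
weighted = weighted_success_rate(found, missed)
```

In this made-up example, two dependencies out of three are found, but the missed one is rare, so the weighted score is higher than the plain one.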
Result: about 38% of deps found; 44% if weighting by occurrences
Cause: kernels executed in loops.
- No dependency in the kernel
```
while:
    read (%rax)
    %rax ++
    write (%rax)
```
- But dependencies if executed in a loop! "Unwanted" deps.
- and irrelevant in real life anyway: they are far away and will not cause latency
Fix: introduce dependency lifetime
- timestamp = instructions executed (VG instrumentation, added up at the end of each BB)
- lifetime fixed to 1024 instructions, order of magnitude of a ROB
- dependencies are discarded if written to more than a lifetime ago
Result: about 58% of deps found; same if weighting by occurrences.
If lifetime lowered to 512, about 56% of deps found, or 63% if weighting by occurrences.
- Results are quite similar; lowering the lifetime further makes no particular sense.
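
The lifetime mechanism can be sketched on the same kind of trace model. As a simplification, this sketch assumes one timestamp per event, counted in executed instructions, whereas the instrumentation described above only updates the counter at the end of each BB; the function name and the trace format are hypothetical.

```python
from collections import Counter


def extract_deps_with_lifetime(trace, lifetime=1024):
    """trace: iterable of (timestamp, pc, kind, addr), kind being "R" or "W".

    A read only depends on a write if it occurs at most `lifetime`
    instructions after it.
    """
    last_writer = {}  # address -> (writer_pc, timestamp of the write)
    deps = Counter()
    for timestamp, pc, kind, addr in trace:
        if kind == "W":
            last_writer[addr] = (pc, timestamp)
        elif kind == "R" and addr in last_writer:
            writer_pc, written_at = last_writer[addr]
            if timestamp - written_at <= lifetime:
                deps[(pc, writer_pc)] += 1
    return deps


# Hypothetical trace: one close read, one read long after the write.
trace = [
    (0, 0x10, "W", 0x1000),
    (5, 0x14, "R", 0x1000),
    (5000, 0x18, "R", 0x1000),  # write is more than a lifetime old: discarded
]
short_lived = extract_deps_with_lifetime(trace, lifetime=1024)
unbounded = extract_deps_with_lifetime(trace, lifetime=10**9)
```

With the default lifetime, only the dependency of `0x14` on `0x10` is kept; with a very large lifetime, the distant read at `0x18` is reported as well, which is the "unwanted" kind of dependency.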

Raw results:

In [123]: res_success(res_life512)
Out[123]: 0.5640902544407105

In [124]: res_success(res_life1024)
Out[124]: 0.5761437608875034

In [125]: res_success(res_nolife)
Out[125]: 0.38143868803578085

In [126]: res_success_weight(res_life512)
Out[126]: 0.6347271857382266

In [127]: res_success_weight(res_life1024)
Out[127]: 0.5817404277466787

In [128]: res_success_weight(res_nolife)
Out[128]: 0.4397921976192802

The results are reasonable, but not all the deps are caught
As argued above, will never see aliasing; important in plenty of cases.
- e.g. if the compiler allocates %rcx = A[i] and %rdx = A[i+2] for some reason, dependencies will be missed.
As argued in previous chapter, a complete dependencies analysis would require a broader range: take the full scope into account

With Gus

TODO ?

UiCA enriching

Plug staticdeps into uiCA
uiCA has a μop-level representation; staticdeps has an instruction-level representation
- Add a dependency between each pair of μops in (src, dest).
- A finer model would be necessary to be accurate
- Pessimistic model
Run CesASMe on the full suite with uiCA and uiCA+staticdeps
- results
Run CesASMe on the no-memdeps suite with uiCA and uiCA+staticdeps
- results
Although not all dependencies are detected [paragraph above], the "important" ones seem to be detected: this is the most critical property for throughput analysis
- but might not be true for other applications that require dependencies detection
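
The pessimistic expansion from instruction-level to μop-level dependencies can be sketched as a Cartesian product. This does not use uiCA's actual data structures: the mapping from instructions to μops, the μop names and the function name are hypothetical.

```python
from itertools import product


def uop_level_deps(instr_deps, uops_of):
    """instr_deps: set of (src_instr, dst_instr) dependencies.
    uops_of: mapping from an instruction to the list of its μops.

    Pessimistic model: every μop of dst depends on every μop of src.
    """
    return {
        (src_uop, dst_uop)
        for src, dst in instr_deps
        for src_uop, dst_uop in product(uops_of[src], uops_of[dst])
    }


# Hypothetical decomposition: instruction 0 yields two μops, instruction 1 one.
uops_of = {0: ["0.a", "0.b"], 1: ["1.a"]}
uop_deps = uop_level_deps({(0, 1)}, uops_of)
```

A finer model would only link the μop that actually writes to memory with the one that actually reads from it; the product above over-approximates this.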

Speed

TODO: evaluate speed?
