phd-thesis/plan/60_staticdeps.md

7.4 KiB

Static extraction of memory-carried dependencies

Intro

  • Previous chapt. : effect of mem-carried deps
  • Presented solution: Gus; in general dynamic analysis.
    • Effective
    • 2 O.M. slower => not acceptable in many cases
  • We need a static solution

Types of dependencies

4 main types:

  1. RaW: "real" dependency
  2. WaW
  3. WaR
  4. RaW
  • 4: not an issue.
  • 2,3 : assuming the μarch has a renamer & enough μarch registers, not a problem either. Might be a problem for some archs.

In all this chapter, we consider only RaW deps. Solution can be easily extended for WaW, WaR if necessary.

Can occur:

  • through registers
    A = 7
    B = A + 2
    
  • through memory
    store %rax, (%rbx)
    add (%rbx), %rcx
    

Can be:

  • in straight-line code
  • loop-carried:
    for(i)
        B[i] = A[i-1] + 2
        A[i] = 7
    

Cost of dependencies

Dependencies are costly: assuming everything L1-resident, the latency of each μop on the dependency chain must be paid.

On SKX,

  • add %rax, %rdx -> lat = 1 cycle (throughput = 1/4C) => add %rax, %rdx ; add %rdx, %rcx : 1.25C, would be 0.5C without deps
  • vfmadd*pd %ymm0, %ymm1, %ymm2: lat = 4C (TP = 1/2C)

Static detection

  • Reg-carried, straight-line: relatively easy. Keep track of which PC last wrote each register.

  • Reg-carried, loop-carried: can be adapted from straight-line. Indeed,

    • need to track only so many iterations behind: after a certain point, instructions are out of the ROB anyway
      • 224 μops in Intel's Skylake, 2015
      • 512 μops in Intel's Golden Cove, 2021
    • Can unroll until we have ~|ROB|+|K| instructions in the kernel: since instructions yield at least a μop, safe [TODO check]
    • Sometimes unrolled only once, eg. Osaca. Not sufficient; eg. Fibo.
  • Harder for memory-carried:

    • addresses may alias, eg. (%rax) = 8(%rbx)
    • pointer arithmetics: must track values
    • Usually not done, or only for trivial cases.

Staticdeps heuristic

  • Aims to simply solve the 2nd point.

  • Could be solved with symbolic calculus, but not that easy to implement, slower.

  • Use random values

  • Operates at the scale of a kernel, unrolled enough times to fill the ROB

  • Whenever reading an unknown value (from mem or register), generate a fresh random value (64b), save it to shadow memory/register file

  • Whenever encountering integer arithmetics, compute the operation

  • Whenever encountering other kind of operations or unsupported operations, define the result as invalid (\bot): not pointer arithmetics.

  • Whenever writing to a memory address, keep track of which PC wrote where.

  • Whenever reading from a memory address, generate a dependency to the writing PC.

  • Reconstruct recurrent dependencies: transcribe each dependency to (src, dst, kernel delta).

  • Verify that the dependency exists for each unroll (where it can exist, eg. 1st kernel cannot depend on the previous kernel unroll); if it happens in the majority of cases, keep; else drop

  • Semantics of asm coming from Valgrind's IR -- should be portable to any architecture supported

    • but suffers limitations for recent extension sets; eg avx512 not supported (TODO check)

Limitations

  • Does not track aliasing that originates from outside of the kernel.
    • As advocated in CesASMe, would require a broader analysis range
  • Randomness may lead to false positives
    • but re-running with different seed should eliminate the hazard close to entirely
  • Should not have false negatives outside of aliasing or unsupported ops

Evaluation

Dependencies detection

With valgrind

  • Write a tool, valgrind-depsim, to instrument a binary to extract its dependencies at runtime
  • Can extract memory, register and temp-based dependencies
  • Here, only the memory dependencies are relevant -- disable the other deps.
  • Instrument binary:
    • for each write, add write_addr -> writer_pc to a hashmap
    • for each read, fetch writer_pc from hashmap
      • if found, add a dependency reader_pc -> writer_pc
    • use the process' memory map to translate PC to addresses inside ELF files
    • At the end, write deps file:
      • #occur, src_elf_pc, src_elf_path, dst_elf_pc, dst_elf_path
    • Run for each binary in genbenchs
    • Takes about 1h on 30 parallel cores on Pinocchio; heavy memory usage

Then, compare with staticdeps: eval/vg_depsim.py script.

  • For each binary in genbenchs,

    • use genbench's bb split/occurrences to retrieve basic blocks
    • for each BB with more than 10% of max BB hits,
    • predict deps with staticdeps
      • cache the result: fast, but we're dealing with 3500 files.
    • translate staticdeps' periodic deps to PC deps, discard the iter parameter
    • for each dependency from the depsim results that occurs inside this BB,
      • check if found or missed, append to a list
  • score: |found| / (|found| + |missed|). Discards occurrences.

  • limitation: will only find deps from/to the same BB! Dependencies leaving a BB are discarded.

  • Result: about 38% of deps found; 44% if weighting by occurrences

  • Cause: kernels executed in loops.

    • No dependency in the kernel
    while:
        read (%rax)
        %rax ++
        write (%rax)
    
    • But dependencies if executed in a loop! "Unwanted" deps.
    • and irrelevant in real life anyway: they are far away and will not cause latency
  • Fix: introduce dependency lifetime

    • timestamp = instructions executed (VG instrumentation, added up at the end of each BB)
    • lifetime fixed to 1024 instructions, order of magnitude of a ROB
    • dependencies are discarded if written to more than a lifetime ago
  • Result: about 58% of deps found; same if weighing.

  • If lifetime lowered to 512, about 56% of deps found, or 63% if weighing.

    • Results are quite similar, lowering the lifetime further makes no particular sense.

Raw results:

In [123]: res_success(res_life512)
Out[123]: 0.5640902544407105

In [124]: res_success(res_life1024)
Out[124]: 0.5761437608875034

In [125]: res_success(res_nolife)
Out[125]: 0.38143868803578085

In [126]: res_success_weight(res_life512)
Out[126]: 0.6347271857382266

In [127]: res_success_weight(res_life1024)
Out[127]: 0.5817404277466787

In [128]: res_success_weight(res_nolife)
Out[128]: 0.4397921976192802
  • The results are reasonable, but not all the deps are caught
  • As argued above, will never see aliasing; important in plenty of cases.
    • eg. if the compiler allocates %rcx = A[i] and %rdx = A[i+2] for some reason, dependencies will be missed.
  • As argued in previous chapter, a complete dependencies analysis would require a broader range: take the full scope into account

With Gus

TODO ?

UiCA enriching

  • Plug Staticdeps into UiCA

  • UiCA has a μop-level representation; staticdeps has an instr-level representation

    • Add dependencies between each couple of μop in (src,dest).
    • A finer model would be necessary to be accurate
    • Pessimistic model
  • Run CesASMe on the full suite with uiCA and uiCA+staticdeps

    • results
  • Run CesASMe on the no-memdeps suite with uiCA and uiCA+staticdeps

    • results
  • Although not all dependencies are detected [paragraph above], the "important" ones seem to be detected: this is the most critical property for throughput analysis

    • but might not be true for other applications that require dependencies detection