phd-thesis/plan/30_palmed.md

4.4 KiB

Palmed: automatically modelling the backend

  • SotA: we saw efforts to build backend models

  • they take considerable expert knowledge/time

  • based on reverse-engineering, HW counters

  • What if these counters are not as precise? (TODO: investigate ZEN, ARM)

  • Too many new CPU/archs anyway for the experts to catch up

  • Goal: make a benchmarks-based tool

    • fully-automated
    • yet as accurate as possible
  • Mostly the work of Nicolas Derumigny

  • I worked on Palmed as engineer about a year, gain expertise in CPU architecture

Resource models

  • As seen before, CPU backend = ports
  • Instruction --> decode --> μop(s)
  • Each μop --> port able to process it
  • Ports: in most cases, fully pipelined. 1μop/cycle (even though time to completion is longer).
  • Classical port mapping: insn -> μop -> possible ports (disjunctive)
    • example where everything works well
    • example with port overlap: 2xADDSS + BSR [cf palmed paper]
    • nontrivial example? => in the general case, requires solving an optimisation problem
  • Resource model
    • presentation
    • formal definition
    • can be solved with a max
    • same examples => trivial to find throughput in any case
    • drawback: combinatorial explosion
      • but this is very reasonable on real-life CPUs because ports are not random.

Palmed

  • Find a resource model automatically
  • Concerned only with backend throughput
    • No dependencies
    • Ignore completely in-order effects
  • kernel = multiset of instructions

General idea: given enough, well-chosen kernels, and a measure of their execution time, we can build a model.

Indeed, (K, Cyc(K)) => max_r∈R(sum_i∈K (\rho_i,r) ) = Cyc(K) => many equations describing the \rho.

  • multi-stage model, builds intermediary results

[insert high-level view of Palmed]

  • quickly describe intermediary results
  • classes of instructions: will be useful later

Actually measuring a kernel's throughput

Pipedream

  • Original work by F. Gruber, cont. by N. Derumigny and C. Guillon

  • Goal: Measure #cycles of a multiset of instructions

    • Full throughput
    • No dependencies
    • L1-resident
  • Use HW counters to measure cycles (Papi)

  • Generate an asm kernel of the form

for NUM_MEASURES:
    HW_cycles_measure:
        for NUM_ITER:
            kernel
            kernel
            ...
            kernel

so that unrolled body of the loop has >= UNROLL_SIZE insn, and UNROLL_SIZE * NUM_ITER >= TOTAL_INSN.

  • Must instantiate insn:
    • reg alloc
    • mem addresses
  • Reg: split registers into read and write pool; enough read registers for each instruction.
    • Read: always read from the same registers (R -> R dep is not a problem)
    • Write: round-robin
      • On some architectures, W -> W dependencies does not allow full parallelism
      • On some ISAs, some insn have R+W operands
  • Mem: allocate a memory arena, L1-sized. Split into read and write pool.
    • Direct register addressing mode (eg ldr x0, [x1]): always the same address (load/store separated)
    • Base-index-displacement mode: constant base, 0 offset, round-robin displacement.
  • Whenever possible (\sum_i(lat_i) < #reg), no data dependency during measurement
  • L1-residence: memory arena is small enough; warm-up rounds.

=> kernel throughput measurement.

Note: this works only because we measure a multiset of instructions, not a given asm code. We control the operands.

Results

With all this, Palmed is capable of producing throughput models.

Tried on x86 (SKX, ZEN1) and ARM (A72). => results

Contributions

Reproducibility: measurements database

  • important both for efficiency and reproducibility
  • efficiency: avoid re-computing already made measurements
  • reproducibility: all the raw data is available after the run
    • ability to derive the model from raw data again
    • ability to assess the quality of raw measurements
    • backup/restore

Evaluation

  • Harness to evaluate Palmed against other code analyzers
    • Raw pipedream
    • Gus
    • Iaca
    • UOPS
    • llvm-mca
    • PMEvo
  • Based on basic blocks
  • The kernel is defined as a Palmed kernel: unordered, no dependencies
    • in practice, use Pipedream generated code as kernel
  • As Pipedream doesn't support all instructions, some instructions must be stripped from the kernel (eg. control flow)

Measures:

  • Coverage: proportion of benchmarks supported by the tool, wrt. Palmed
  • RMS error of IPC
  • Kendall's tau for the IPCs