# Palmed: automatically modelling the backend
* SotA: we saw efforts to build backend models
* they take considerable expert knowledge/time
* based on reverse-engineering, HW counters
* What if these counters are not as precise? (TODO: investigate ZEN, ARM)
* Too many new CPU/archs anyway for the experts to catch up
* Goal: make a benchmarks-based tool
* fully-automated
* yet as accurate as possible
* Mostly the work of Nicolas Derumigny
* I worked on Palmed as an engineer for about a year, gaining expertise in CPU
architecture
## Resource models
* As seen before, CPU backend = ports
* Instruction --> decode --> μop(s)
* Each μop --> port able to process it
* Ports: in most cases fully pipelined, accepting 1 μop/cycle (even though
each μop takes longer than one cycle to complete).
* Classical port mapping: insn -> μop -> possible ports (disjunctive)
* example where everything works well
* example with port overlap: 2xADDSS + BSR [cf palmed paper]
* nontrivial example?
=> in the general case, finding the throughput requires solving an
optimisation problem
* Resource model
* presentation
* formal definition
* can be solved with a max
* same examples
=> trivial to find throughput in any case
* drawback: combinatorial explosion of the number of resources
* but it stays very reasonable on real-life CPUs, because port assignments
are not random.
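
A minimal sketch of the "solved with a max" point, on the 2xADDSS + BSR example. The resource names (`r1`, `r01`) and the ρ values are illustrative assumptions (ADDSS on ports 0 or 1, BSR on port 1 only), not values taken from a real model.

```python
from fractions import Fraction

# Hypothetical resource model: RHO[insn][resource] is the load (in cycles)
# that one instance of insn puts on that resource. Values are illustrative.
RHO = {
    "ADDSS": {"r01": Fraction(1, 2)},
    "BSR": {"r1": Fraction(1), "r01": Fraction(1, 2)},
}


def cycles(kernel, rho=RHO):
    """Cycles per kernel iteration: max over resources of the summed load."""
    load = {}
    for insn, count in kernel.items():
        for res, usage in rho[insn].items():
            load[res] = load.get(res, 0) + count * usage
    return max(load.values())


# Kernel as a multiset: instruction -> number of occurrences.
print(cycles({"ADDSS": 2, "BSR": 1}))  # 3/2
```

No optimisation problem is solved here: the port overlap is already encoded in the combined resource `r01`.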
## Palmed
* Find a resource model automatically
* Concerned only with backend throughput
* No dependencies
* Completely ignore in-order effects
* kernel = multiset of instructions
**General idea:** given enough well-chosen kernels and a measurement of their
execution time, we can build a model.
Indeed, each pair `(K, Cyc(K))` yields `\max_{r \in R} \sum_{i \in K} \rho_{i,r} = Cyc(K)`
=> many equations describing the `\rho_{i,r}`.
* multi-stage model, builds intermediary results
[insert high-level view of Palmed]
* quickly describe intermediary results
* classes of instructions: will be useful later
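
The equation above can be read as a check that any candidate model must pass. The sketch below is only that check, under assumed names (`resource_loads`, `residuals`) and made-up instructions `A`, `B`; it is not the multi-stage procedure Palmed uses to find the ρ.

```python
def resource_loads(kernel, rho):
    """Sum, per resource, the load of every instruction in the kernel."""
    loads = {}
    for insn, count in kernel.items():
        for res, usage in rho[insn].items():
            loads[res] = loads.get(res, 0.0) + count * usage
    return loads


def residuals(measurements, rho):
    """For each (kernel, measured cycles) pair, return predicted - measured."""
    out = []
    for kernel, measured in measurements:
        predicted = max(resource_loads(kernel, rho).values())
        out.append(predicted - measured)
    return out


# Hypothetical candidate model and measurements (illustrative values only).
candidate = {"A": {"r0": 1.0}, "B": {"r0": 0.5, "r1": 1.0}}
measurements = [({"A": 1}, 1.0), ({"B": 1}, 1.0), ({"A": 1, "B": 2}, 2.0)]
print(residuals(measurements, candidate))
```

A model that explains every measurement has all residuals equal to zero; with real, noisy measurements one would expect them to be small rather than exactly zero.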
## Actually measuring a kernel's throughput
Pipedream
* Original work by F. Gruber, cont. by N. Derumigny and C. Guillon
* Goal: Measure #cycles of a multiset of instructions
* Full throughput
* No dependencies
* L1-resident
* Use HW counters to measure cycles (PAPI)
* Generate an asm kernel of the form
```
for NUM_MEASURES:
    HW_cycles_measure:
        for NUM_ITER:
            kernel
            kernel
            ...
            kernel
```
so that the unrolled loop body has >= `UNROLL_SIZE` insn, and
`UNROLL_SIZE * NUM_ITER >= TOTAL_INSN`.
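
As a sketch of how the two constraints fix the loop shape, assuming the smallest values that satisfy them are chosen (the function name and the numbers are hypothetical, not Pipedream's):

```python
import math


def loop_shape(kernel_len, unroll_size, total_insn):
    """Return (copies, num_iter).

    copies:   kernel repetitions in the loop body, so that the body holds
              at least unroll_size instructions.
    num_iter: loop iterations, so that unroll_size * num_iter >= total_insn.
    """
    copies = math.ceil(unroll_size / kernel_len)
    num_iter = math.ceil(total_insn / unroll_size)
    return copies, num_iter


copies, num_iter = loop_shape(kernel_len=3, unroll_size=100, total_insn=1_000_000)
print(copies, num_iter)
```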
* Must instantiate insn:
* reg alloc
* mem addresses
* Reg: split registers into a read pool and a write pool, with enough read
registers for any single instruction.
* Read: always read from the same registers (R -> R dep is not a problem)
* Write: round-robin
* On some architectures, W -> W dependencies do not allow full
parallelism
* On some ISAs, some insn have R+W operands
* Mem: allocate a memory arena, L1-sized. Split into read and write pool.
* Register-indirect addressing mode (e.g. `ldr x0, [x1]`): always the same
address (separate addresses for loads and stores)
* Base-index-displacement mode: constant base, 0 offset, round-robin
displacement.
* Whenever possible (`\sum_i(lat_i) < #reg`), no data dependency during
measurement
* L1-residence: memory arena is small enough; warm-up rounds.
=> kernel throughput measurement.
Note: this works only because we measure a multiset of instructions, not a
given piece of asm code: we control the operands.
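
The register policy above (fixed reads, round-robin writes) can be sketched as follows. This is an assumption-laden illustration: the kernel encoding `(mnemonic, n_reads, n_writes)` and the function name are hypothetical, and R+W operands are not handled.

```python
from itertools import cycle


def allocate_registers(kernel, read_pool, write_pool):
    """Instantiate operands for a list of (mnemonic, n_reads, n_writes).

    Reads always come from the start of read_pool (R -> R is harmless);
    writes rotate through write_pool to space out W -> W dependencies.
    """
    writes = cycle(write_pool)
    out = []
    for mnemonic, n_reads, n_writes in kernel:
        if n_reads > len(read_pool):
            raise ValueError("read pool too small for " + mnemonic)
        dst = [next(writes) for _ in range(n_writes)]
        src = list(read_pool[:n_reads])
        out.append((mnemonic, dst, src))
    return out


for insn in allocate_registers([("add", 2, 1)] * 3, ["x0", "x1"], ["x2", "x3"]):
    print(insn)
```

Since the pools are disjoint, no instruction reads a register that another one writes, which is what removes R -> W dependencies from the measured kernel.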
## Results
With all this, Palmed is capable of producing throughput models.
Tried on x86 (SKX, ZEN1) and ARM (A72).
=> results
## Contributions
### Reproducibility: measurements database
* important both for efficiency and reproducibility
* efficiency: avoid re-running measurements that were already made
* reproducibility: all the raw data is available after the run
* ability to derive the model from raw data again
* ability to assess the quality of raw measurements
* backup/restore
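
One possible shape for such a database, as a sketch only: the storage backend, schema and class name below are assumptions (SQLite via the standard library), not a description of the actual implementation.

```python
import json
import sqlite3


class MeasureDB:
    """Hypothetical cache of raw measurements, keyed by kernel multiset."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS measures "
            "(kernel TEXT PRIMARY KEY, samples TEXT)"
        )

    @staticmethod
    def _key(kernel):
        # A multiset is order-independent, so sort entries for a canonical key.
        return json.dumps(sorted(kernel.items()))

    def get_or_measure(self, kernel, measure):
        """Return stored raw samples, running measure(kernel) only on a miss."""
        key = self._key(kernel)
        row = self.conn.execute(
            "SELECT samples FROM measures WHERE kernel = ?", (key,)
        ).fetchone()
        if row is not None:
            return json.loads(row[0])
        samples = measure(kernel)
        self.conn.execute(
            "INSERT INTO measures VALUES (?, ?)", (key, json.dumps(samples))
        )
        self.conn.commit()
        return samples
```

Storing the raw samples rather than an aggregate is what allows re-deriving the model and assessing measurement quality afterwards; passing a file path instead of `:memory:` gives a file that can be backed up and restored.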
### Evaluation
* Harness to evaluate Palmed against other code analyzers
* Raw pipedream
* Gus
* IACA
* uops.info
* llvm-mca
* PMEvo
* Based on basic blocks
* The kernel is defined as a Palmed kernel: unordered, no dependencies
* in practice, use Pipedream generated code as kernel
* As Pipedream doesn't support all instructions, some instructions must be
stripped from the kernel (eg. control flow)
Measures:
* Coverage: proportion of benchmarks supported by the tool, wrt. Palmed
* RMS error of IPC
* Kendall's tau for the IPCs
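
A sketch of the three measures, with stated assumptions: the error is taken relative to the reference IPC, Kendall's tau is the simple tau-a variant (no tie correction), and all names and sample values are hypothetical.

```python
import math


def coverage(tool, baseline_blocks):
    """Fraction of the baseline's blocks for which the tool gives a prediction."""
    return sum(1 for b in baseline_blocks if b in tool) / len(baseline_blocks)


def rms_relative_error(tool, reference):
    """RMS of (predicted - reference) / reference over the common blocks."""
    common = [b for b in tool if b in reference]
    total = sum(((tool[b] - reference[b]) / reference[b]) ** 2 for b in common)
    return math.sqrt(total / len(common))


def kendall_tau(tool, reference):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    common = sorted(b for b in tool if b in reference)
    concordant = discordant = 0
    for i in range(len(common)):
        for j in range(i + 1, len(common)):
            a, b = common[i], common[j]
            sign = (tool[a] - tool[b]) * (reference[a] - reference[b])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    n_pairs = len(common) * (len(common) - 1) // 2
    return (concordant - discordant) / n_pairs


# Illustrative IPC values per basic block (not real results).
reference = {"b0": 1.0, "b1": 2.0, "b2": 4.0}
tool = {"b0": 1.0, "b1": 3.0, "b2": 2.0}
print(coverage(tool, list(reference)))
print(rms_relative_error(tool, reference))
print(kendall_tau(tool, reference))
```

The RMS error captures how far predictions are from the reference, while Kendall's tau only looks at whether the tool ranks blocks in the same order, which is what matters when the model is used to compare code variants.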