phd-thesis/plan/30_palmed.md

177 lines
5.3 KiB
Markdown
Raw Permalink Normal View History

# Palmed: automatically modelling the backend
[[PREREQUISITES]]
* Microarch: ports, μops, pipeline, cycle, L1-res
* Define Cyc(kernel)
* Backend models
* HW counters
* Tools:
* Iaca
* UOPS
* llvm-mca
* PMEvo
[[END]]
2023-09-06 17:52:50 +02:00
* SotA: we saw efforts to build backend models
* they take considerable expert knowledge/time
* based on reverse-engineering, HW counters
* What if these counters are not as precise? (TODO: investigate ZEN, ARM)
* Too many new CPU/archs anyway for the experts to catch up
* Goal: make a benchmarks-based tool
* fully-automated
* yet as accurate as possible
* Mostly the work of Nicolas Derumigny
* I worked on Palmed as engineer about a year, gain expertise in CPU
architecture
## Resource models
* As seen before, CPU backend = ports
* Instruction --> decode --> μop(s)
* Each μop --> port able to process it
* Ports: in most cases, fully pipelined. 1μop/cycle (even though time to
completion is longer).
* Classical port mapping: insn -> μop -> possible ports (disjunctive)
* example where everything works well
* example with port overlap: 2xADDSS + BSR [cf palmed paper]
* nontrivial example?
=> in the general case, requires solving an optimisation problem
* Resource model
* presentation
* formal definition
* can be solved with a max
* same examples
=> trivial to find throughput in any case
* drawback: combinatorial explosion
* but this is very reasonable on real-life CPUs because ports are not
random.
## Palmed
* Find a resource model automatically
* Concerned only with backend throughput
* No dependencies
* Ignore completely in-order effects
* kernel = multiset of instructions
**General idea:** given enough, well-chosen kernels, and a measure of their
execution time, we can build a model.
Indeed, (K, Cyc(K)) => `max_r∈R(sum_i∈K (\rho_i,r) ) = Cyc(K)`
=> many equations describing the \rho.
* multi-stage model, builds intermediary results
[insert high-level view of Palmed]
* quickly describe intermediary results
* classes of instructions: will be useful later
## Actually measuring a kernel's throughput
Pipedream
* Original work by F. Gruber, cont. by N. Derumigny and C. Guillon
* Goal: Measure #cycles of a multiset of instructions
* Full throughput
* No dependencies
* L1-resident
* Use HW counters to measure cycles (Papi)
* Generate an asm kernel of the form
```
for NUM_MEASURES:
HW_cycles_measure:
for NUM_ITER:
kernel
kernel
...
kernel
```
so that unrolled body of the loop has >= `UNROLL_SIZE` insn, and `UNROLL_SIZE *
NUM_ITER >= TOTAL_INSN`.
* Must instantiate insn:
* reg alloc
* mem addresses
* Reg: split registers into read and write pool; enough read registers for
each instruction.
* Read: always read from the same registers (R -> R dep is not a problem)
* Write: round-robin
* On some architectures, W -> W dependencies does not allow full
parallelism
* On some ISAs, some insn have R+W operands
* Mem: allocate a memory arena, L1-sized. Split into read and write pool.
* Direct register addressing mode (eg `ldr x0, [x1]`): always the same
address (load/store separated)
* Base-index-displacement mode: constant base, 0 offset, round-robin
displacement on x86 (constant displacement on ARM)
2023-09-06 17:52:50 +02:00
* Whenever possible (`\sum_i(lat_i) < #reg`), no data dependency during
measurement
* L1-residence: memory arena is small enough; warm-up rounds.
=> kernel throughput measurement.
Note: this works only because we measure a multiset of instructions, not a
given asm code. We control the operands.
## Results
With all this, Palmed is capable of producing throughput models.
Tried on x86 (SKX, ZEN1) and ARM (A72).
=> results
## Contributions
2023-09-06 17:52:50 +02:00
### Reproducibility: measurements database
* important both for efficiency and reproducibility
* efficiency: avoid re-computing already made measurements
* reproducibility: all the raw data is available after the run
* ability to derive the model from raw data again
* ability to assess the quality of raw measurements
* backup/restore
### Evaluation
#### Bench suites: SPEC, Polybench
* SPEC: real-world programs
* Mainly made to evaluate hardware on a fixed workload
* Provides a fixed workload to evaluate various pieces of software
experimentations as well
* Used throughout the litterature
* Describe versions of SPEC, architecture
* Polybench
* 30 numerical computations
* Computation kernels: domain specific (sci. computation, math, …)
* Kernel well-defined; no need to "figure out" the interesting basic blocks
* C language
* datasets
#### Experimental setup
2023-09-06 17:52:50 +02:00
* Harness to evaluate Palmed against other code analyzers
* Raw pipedream
* Iaca
* llvm-mca
* PMEvo
* UOPS
* UiCA did not exist at the time; + fair comparison (Palmed is backend)
2023-09-06 17:52:50 +02:00
* Based on basic blocks
* The kernel is defined as a Palmed kernel: unordered, no dependencies
* in practice, use Pipedream generated code as kernel
* As Pipedream doesn't support all instructions, some instructions must be
stripped from the kernel (eg. control flow)
Measures:
* Coverage: proportion of benchmarks supported by the tool, wrt. Palmed
* RMS error of IPC
* Kendall's tau for the IPCs
#### Results
* Results