176 lines
5.3 KiB
Markdown
176 lines
5.3 KiB
Markdown
# Palmed: automatically modelling the backend
|
|
|
|
[[PREREQUISITES]]
|
|
* Microarch: ports, μops, pipeline, cycle, L1-res
|
|
* Define Cyc(kernel)
|
|
* Backend models
|
|
* HW counters
|
|
* Tools:
|
|
* Iaca
|
|
* UOPS
|
|
* llvm-mca
|
|
* PMEvo
|
|
[[END]]
|
|
|
|
* SotA: we saw efforts to build backend models
|
|
* they take considerable expert knowledge/time
|
|
* based on reverse-engineering, HW counters
|
|
* What if these counters are not as precise? (TODO: investigate ZEN, ARM)
|
|
* Too many new CPU/archs anyway for the experts to catch up
|
|
|
|
* Goal: make a benchmarks-based tool
|
|
* fully-automated
|
|
* yet as accurate as possible
|
|
|
|
* Mostly the work of Nicolas Derumigny
|
|
* I worked on Palmed as engineer about a year, gain expertise in CPU
|
|
architecture
|
|
|
|
## Resource models
|
|
|
|
* As seen before, CPU backend = ports
|
|
* Instruction --> decode --> μop(s)
|
|
* Each μop --> port able to process it
|
|
* Ports: in most cases, fully pipelined. 1μop/cycle (even though time to
|
|
completion is longer).
|
|
* Classical port mapping: insn -> μop -> possible ports (disjunctive)
|
|
* example where everything works well
|
|
* example with port overlap: 2xADDSS + BSR [cf palmed paper]
|
|
* nontrivial example?
|
|
=> in the general case, requires solving an optimisation problem
|
|
* Resource model
|
|
* presentation
|
|
* formal definition
|
|
* can be solved with a max
|
|
* same examples
|
|
=> trivial to find throughput in any case
|
|
* drawback: combinatorial explosion
|
|
* but this is very reasonable on real-life CPUs because ports are not
|
|
random.
|
|
|
|
## Palmed
|
|
|
|
* Find a resource model automatically
|
|
* Concerned only with backend throughput
|
|
* No dependencies
|
|
* Ignore completely in-order effects
|
|
* kernel = multiset of instructions
|
|
|
|
**General idea:** given enough, well-chosen kernels, and a measure of their
|
|
execution time, we can build a model.
|
|
|
|
Indeed, (K, Cyc(K)) => `max_r∈R(sum_i∈K (\rho_i,r) ) = Cyc(K)`
|
|
=> many equations describing the \rho.
|
|
|
|
* multi-stage model, builds intermediary results
|
|
|
|
[insert high-level view of Palmed]
|
|
|
|
* quickly describe intermediary results
|
|
* classes of instructions: will be useful later
|
|
|
|
## Actually measuring a kernel's throughput
|
|
|
|
Pipedream
|
|
* Original work by F. Gruber, cont. by N. Derumigny and C. Guillon
|
|
* Goal: Measure #cycles of a multiset of instructions
|
|
* Full throughput
|
|
* No dependencies
|
|
* L1-resident
|
|
|
|
* Use HW counters to measure cycles (Papi)
|
|
* Generate an asm kernel of the form
|
|
```
|
|
for NUM_MEASURES:
|
|
HW_cycles_measure:
|
|
for NUM_ITER:
|
|
kernel
|
|
kernel
|
|
...
|
|
kernel
|
|
```
|
|
so that unrolled body of the loop has >= `UNROLL_SIZE` insn, and `UNROLL_SIZE *
|
|
NUM_ITER >= TOTAL_INSN`.
|
|
|
|
* Must instantiate insn:
|
|
* reg alloc
|
|
* mem addresses
|
|
* Reg: split registers into read and write pool; enough read registers for
|
|
each instruction.
|
|
* Read: always read from the same registers (R -> R dep is not a problem)
|
|
* Write: round-robin
|
|
* On some architectures, W -> W dependencies does not allow full
|
|
parallelism
|
|
* On some ISAs, some insn have R+W operands
|
|
* Mem: allocate a memory arena, L1-sized. Split into read and write pool.
|
|
* Direct register addressing mode (eg `ldr x0, [x1]`): always the same
|
|
address (load/store separated)
|
|
* Base-index-displacement mode: constant base, 0 offset, round-robin
|
|
displacement on x86 (constant displacement on ARM)
|
|
* Whenever possible (`\sum_i(lat_i) < #reg`), no data dependency during
|
|
measurement
|
|
* L1-residence: memory arena is small enough; warm-up rounds.
|
|
|
|
=> kernel throughput measurement.
|
|
|
|
Note: this works only because we measure a multiset of instructions, not a
|
|
given asm code. We control the operands.
|
|
|
|
## Results
|
|
|
|
With all this, Palmed is capable of producing throughput models.
|
|
|
|
Tried on x86 (SKX, ZEN1) and ARM (A72).
|
|
=> results
|
|
|
|
## Contributions
|
|
|
|
### Reproducibility: measurements database
|
|
|
|
* important both for efficiency and reproducibility
|
|
* efficiency: avoid re-computing already made measurements
|
|
* reproducibility: all the raw data is available after the run
|
|
* ability to derive the model from raw data again
|
|
* ability to assess the quality of raw measurements
|
|
* backup/restore
|
|
|
|
### Evaluation
|
|
|
|
#### Bench suites: SPEC, Polybench
|
|
|
|
* SPEC: real-world programs
|
|
* Mainly made to evaluate hardware on a fixed workload
|
|
* Provides a fixed workload to evaluate various pieces of software
|
|
experimentations as well
|
|
* Used throughout the litterature
|
|
* Describe versions of SPEC, architecture
|
|
* Polybench
|
|
* 30 numerical computations
|
|
* Computation kernels: domain specific (sci. computation, math, …)
|
|
* Kernel well-defined; no need to "figure out" the interesting basic blocks
|
|
* C language
|
|
* datasets
|
|
|
|
#### Experimental setup
|
|
|
|
* Harness to evaluate Palmed against other code analyzers
|
|
* Raw pipedream
|
|
* Iaca
|
|
* llvm-mca
|
|
* PMEvo
|
|
* UOPS
|
|
* UiCA did not exist at the time; + fair comparison (Palmed is backend)
|
|
* Based on basic blocks
|
|
* The kernel is defined as a Palmed kernel: unordered, no dependencies
|
|
* in practice, use Pipedream generated code as kernel
|
|
* As Pipedream doesn't support all instructions, some instructions must be
|
|
stripped from the kernel (eg. control flow)
|
|
|
|
Measures:
|
|
* Coverage: proportion of benchmarks supported by the tool, wrt. Palmed
|
|
* RMS error of IPC
|
|
* Kendall's tau for the IPCs
|
|
|
|
#### Results
|
|
|
|
* Results
|