Palmed: automatically modelling the backend
- Microarch: ports, μops, pipeline, cycle, L1-res
- Define Cyc(kernel)
- Backend models
- HW counters
- Tools:
  - Iaca
  - UOPS
  - llvm-mca
  - PMEvo
- SotA: we saw efforts to build backend models
- they take considerable expert knowledge/time
- based on reverse-engineering, HW counters
- What if these counters are not as precise? (TODO: investigate ZEN, ARM)
- Too many new CPUs/archs anyway for the experts to keep up
- Goal: make a benchmarks-based tool
  - fully-automated
  - yet as accurate as possible
- Mostly the work of Nicolas Derumigny
- I worked on Palmed as an engineer for about a year, gaining expertise in CPU architecture
Resource models
- As seen before, CPU backend = ports
- Instruction --> decode --> μop(s)
- Each μop --> port able to process it
- Ports: in most cases, fully pipelined. 1μop/cycle (even though time to completion is longer).
- Classical port mapping: insn -> μop -> possible ports (disjunctive)
- example where everything works well
- example with port overlap: 2xADDSS + BSR [cf palmed paper]
- nontrivial example? => in the general case, requires solving an optimisation problem
- Resource model
- presentation
- formal definition
- can be solved with a max
- same examples => trivial to find throughput in any case
- drawback: combinatorial explosion
- but this is very reasonable on real-life CPUs because ports are not random.
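The two formulations above can be compared on the 2xADDSS + BSR kernel with a minimal Python sketch. The port assignment (ADDSS on p0 or p1, BSR on p1 only), the resource names r1 and r01, the \rho values and the one-μop-per-instruction simplification are illustrative assumptions, not data taken from the Palmed paper. The port-mapping side uses the usual bottleneck formulation (the busiest subset of ports bounds the throughput), which is where the combinatorial cost comes from; the resource-model side is a plain max of sums.

```python
from collections import defaultdict
from itertools import combinations

# Assumed disjunctive port mapping: one μop per instruction, with the
# set of ports able to execute it (illustrative values).
PORTS = {"ADDSS": {"p0", "p1"}, "BSR": {"p1"}}

# Assumed equivalent resource model: RHO[insn][resource] is the share of
# a cycle of that resource consumed by one instance of the instruction.
RHO = {"ADDSS": {"r01": 0.5}, "BSR": {"r1": 1.0, "r01": 0.5}}


def cycles_port_mapping(kernel, ports):
    """Cycles per kernel: max over port subsets S of
    (number of μops that can only run inside S) / |S|."""
    all_ports = sorted(set().union(*ports.values()))
    best = 0.0
    for size in range(1, len(all_ports) + 1):
        for subset in combinations(all_ports, size):
            allowed = set(subset)
            trapped = sum(n for insn, n in kernel.items()
                          if ports[insn] <= allowed)
            best = max(best, trapped / size)
    return best


def cycles_resource_model(kernel, rho):
    """Cycles per kernel: max over resources of the summed usage."""
    load = defaultdict(float)
    for insn, n in kernel.items():
        for resource, usage in rho[insn].items():
            load[resource] += n * usage
    return max(load.values())


kernel = {"ADDSS": 2, "BSR": 1}  # a kernel is a multiset of instructions
print(cycles_port_mapping(kernel, PORTS))
print(cycles_resource_model(kernel, RHO))
```

Under these assumptions both functions agree on the kernel; the first enumerates every subset of ports, while the second only scans the resources kept in the model.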
Palmed
- Find a resource model automatically
- Concerned only with backend throughput
- No dependencies
- Ignore completely in-order effects
- kernel = multiset of instructions
General idea: given enough, well-chosen kernels, and a measure of their execution time, we can build a model.
Indeed, (K, Cyc(K)) => \max_{r \in R} \sum_{i \in K} \rho_{i,r} = Cyc(K)
=> many equations describing the \rho.
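As a minimal sketch of how each measured pair constrains the \rho coefficients, the Python snippet below evaluates the left-hand side of the equation for a candidate model and reports the residual against each measurement. The measurements, the resource names and the candidate values are made up for illustration; the snippet only checks a given model and is not Palmed's solving procedure.

```python
def predicted_cycles(kernel, rho, resources):
    """max over resources r of sum over instructions i of rho[i][r]."""
    return max(
        sum(n * rho[insn].get(r, 0.0) for insn, n in kernel.items())
        for r in resources
    )


def residuals(measurements, rho, resources):
    """Predicted minus measured cycles, one value per (K, Cyc(K)) pair."""
    return [predicted_cycles(k, rho, resources) - cyc
            for k, cyc in measurements]


# Hypothetical measurements: (kernel as a multiset, measured cycles).
MEASUREMENTS = [
    ({"ADDSS": 1}, 0.5),
    ({"BSR": 1}, 1.0),
    ({"ADDSS": 2, "BSR": 1}, 1.5),
]
RESOURCES = ["r1", "r01"]

# Candidate that satisfies every equation.
GOOD = {"ADDSS": {"r01": 0.5}, "BSR": {"r1": 1.0, "r01": 0.5}}
# Candidate that forgets BSR's share of the combined resource r01.
BAD = {"ADDSS": {"r01": 0.5}, "BSR": {"r1": 1.0}}

print(residuals(MEASUREMENTS, GOOD, RESOURCES))
print(residuals(MEASUREMENTS, BAD, RESOURCES))
```

The single-instruction kernels cannot tell the two candidates apart; only the mixed kernel exposes the missing coefficient, which is why the choice of kernels matters.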
- multi-stage model, builds intermediary results
[insert high-level view of Palmed]
- quickly describe intermediary results
- classes of instructions: will be useful later
Actually measuring a kernel's throughput
Pipedream
- Original work by F. Gruber, cont. by N. Derumigny and C. Guillon
- Goal: Measure #cycles of a multiset of instructions
  - Full throughput
  - No dependencies
  - L1-resident
- Use HW counters to measure cycles (PAPI)
- Generate an asm kernel of the form
    for NUM_MEASURES:
      HW_cycles_measure:
        for NUM_ITER:
          kernel
          kernel
          ...
          kernel
  so that the unrolled body of the loop has >= UNROLL_SIZE insn, and UNROLL_SIZE * NUM_ITER >= TOTAL_INSN.
- Must instantiate insn:
- reg alloc
- mem addresses
- Reg: split registers into read and write pools; enough read registers for each instruction.
- Read: always read from the same registers (R -> R dep is not a problem)
- Write: round-robin
- On some architectures, W -> W dependencies do not allow full parallelism
- On some ISAs, some insn have R+W operands
- Mem: allocate a memory arena, L1-sized. Split into read and write pool.
- Direct register addressing mode (e.g. ldr x0, [x1]): always the same address (load/store separated)
- Base-index-displacement mode: constant base, 0 offset, round-robin displacement on x86 (constant displacement on ARM)
- Whenever possible (\sum_i(lat_i) < #reg), no data dependency during measurement
- L1-residence: memory arena is small enough; warm-up rounds.
=> kernel throughput measurement.
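The loop-size constraints of the generated kernel (unrolled body >= UNROLL_SIZE instructions, UNROLL_SIZE * NUM_ITER >= TOTAL_INSN) can be sketched in Python as follows. The function name and the numeric values are hypothetical; how Pipedream actually picks these parameters is not described here.

```python
import math


def plan_loop(kernel_len, unroll_size, total_insn):
    """Return (copies, body_insn, num_iter) for a kernel of kernel_len insn.

    copies    -- how many times the kernel is repeated in the loop body
    body_insn -- resulting size of the unrolled body, >= unroll_size
    num_iter  -- loop trip count, so that unroll_size * num_iter >= total_insn
    """
    if kernel_len <= 0 or unroll_size <= 0:
        raise ValueError("kernel_len and unroll_size must be positive")
    copies = math.ceil(unroll_size / kernel_len)
    body_insn = copies * kernel_len
    num_iter = math.ceil(total_insn / unroll_size)
    return copies, body_insn, num_iter


# Hypothetical parameters: a 3-instruction kernel, bodies of at least
# 100 instructions, at least one million instructions per measurement.
print(plan_loop(kernel_len=3, unroll_size=100, total_insn=1_000_000))
```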
Note: this works only because we measure a multiset of instructions, not a given asm code. We control the operands.
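Because the operands are under our control, the register policy described above (fixed read pool, round-robin write pool) can be illustrated with a short Python sketch. The input format (mnemonic plus number of source operands), the three-operand ARM-like output syntax and the pool sizes are assumptions made for the example, not Pipedream's actual interface.

```python
from itertools import cycle


def allocate_registers(kernel, registers, n_read):
    """Instantiate (mnemonic, n_sources) pairs with concrete registers.

    Sources always come from the same read registers (R -> R is harmless);
    destinations rotate over the write pool to avoid W -> W chains.
    """
    read_pool, write_pool = registers[:n_read], registers[n_read:]
    if not write_pool:
        raise ValueError("write pool is empty")
    next_write = cycle(write_pool)
    lines = []
    for mnemonic, n_sources in kernel:
        if n_sources > len(read_pool):
            raise ValueError("not enough read registers for " + mnemonic)
        operands = [next(next_write)] + read_pool[:n_sources]
        lines.append(f"{mnemonic} " + ", ".join(operands))
    return lines


registers = [f"x{i}" for i in range(8)]  # hypothetical register file
for line in allocate_registers([("add", 2)] * 3, registers, n_read=2):
    print(line)
```

Instructions with R+W operands would need a different treatment, since the written register is also read; the sketch does not cover them.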
Results
With all this, Palmed is capable of producing throughput models.
Tried on x86 (SKX, ZEN1) and ARM (A72). => results
Contributions
Reproducibility: measurements database
- important both for efficiency and reproducibility
- efficiency: avoid re-computing already made measurements
- reproducibility: all the raw data is available after the run
- ability to derive the model from raw data again
- ability to assess the quality of raw measurements
- backup/restore
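A measurements database of this kind can be sketched as a small persistent cache keyed by the kernel multiset. The class name, the JSON file format and the measure callback are hypothetical choices for this example; they are not Palmed's actual storage layer.

```python
import json
import tempfile
from pathlib import Path


def kernel_key(kernel):
    """Canonical, order-independent key for a multiset of instructions."""
    return "|".join(f"{insn}*{n}" for insn, n in sorted(kernel.items()))


class MeasurementDB:
    """Caches raw measurements on disk so they are never taken twice."""

    def __init__(self, path):
        self.path = Path(path)
        self.data = {}
        if self.path.exists():
            self.data = json.loads(self.path.read_text())  # restore
        self.misses = 0

    def cycles(self, kernel, measure):
        key = kernel_key(kernel)
        if key not in self.data:
            self.misses += 1
            self.data[key] = measure(kernel)  # expensive benchmark run
            self.path.write_text(json.dumps(self.data, indent=2))  # backup
        return self.data[key]


with tempfile.TemporaryDirectory() as tmp:
    db = MeasurementDB(Path(tmp) / "measurements.json")
    fake_measure = lambda kernel: 1.5  # stand-in for a real benchmark
    db.cycles({"ADDSS": 2, "BSR": 1}, fake_measure)
    db.cycles({"BSR": 1, "ADDSS": 2}, fake_measure)  # same multiset: cached
    print(db.misses)
```

Keeping the raw values rather than the derived model is what allows the model to be rebuilt, or the measurements to be audited, after the run.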
Evaluation
Bench suites: SPEC, Polybench
- SPEC: real-world programs
- Mainly made to evaluate hardware on a fixed workload
- Also provides a fixed workload for evaluating software experiments
- Used throughout the literature
- Describe versions of SPEC, architecture
- Polybench
- 30 numerical computations
- Computation kernels: domain specific (sci. computation, math, …)
- Kernel well-defined; no need to "figure out" the interesting basic blocks
- C language
- datasets
Experimental setup
- Harness to evaluate Palmed against other code analyzers
- Raw pipedream
- Iaca
- llvm-mca
- PMEvo
- UOPS
- uiCA did not exist at the time; leaving it out also keeps the comparison fair (Palmed only models the backend)
- Based on basic blocks
- The kernel is defined as a Palmed kernel: unordered, no dependencies
- in practice, use Pipedream generated code as kernel
- As Pipedream doesn't support all instructions, some instructions must be stripped from the kernel (e.g. control flow)
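The conversion of a basic block into a Palmed kernel (unordered, stripped of unsupported instructions) can be sketched in Python as below. The mnemonics and the supported set are hypothetical; the real list depends on what Pipedream can instantiate on the target.

```python
from collections import Counter

# Hypothetical set of instructions the benchmark generator can handle.
SUPPORTED = {"add", "mul", "ldr", "str"}


def to_palmed_kernel(basic_block, supported=SUPPORTED):
    """Drop unsupported instructions and forget ordering: the result is a
    multiset mapping each remaining instruction to its number of uses."""
    return Counter(insn for insn in basic_block if insn in supported)


block = ["ldr", "add", "add", "mul", "str", "cmp", "b.ne"]
print(to_palmed_kernel(block))
```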
Measures:
- Coverage: proportion of benchmarks supported by the tool, wrt. Palmed
- RMS error of IPC
- Kendall's tau for the IPCs
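The three measures can be sketched in plain Python. Several choices here are assumptions: coverage is taken relative to the set of benchmarks Palmed supports, the RMS error is computed on unweighted relative IPC errors, and Kendall's tau is the tau-a variant, in which tied pairs count as neither concordant nor discordant.

```python
import math
from itertools import combinations


def coverage(tool_supported, palmed_supported):
    """Share of the benchmarks supported by Palmed that the tool supports."""
    reference = set(palmed_supported)
    return len(reference & set(tool_supported)) / len(reference)


def rms_relative_error(predicted_ipc, measured_ipc):
    """Root mean square of (predicted - measured) / measured."""
    errors = [(p - m) / m for p, m in zip(predicted_ipc, measured_ipc)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def kendall_tau(predicted_ipc, measured_ipc):
    """Tau-a: (concordant - discordant) pairs over the number of pairs."""
    concordant = discordant = 0
    pairs = combinations(zip(predicted_ipc, measured_ipc), 2)
    for (p1, m1), (p2, m2) in pairs:
        sign = (p1 - p2) * (m1 - m2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(predicted_ipc)
    return (concordant - discordant) / (n * (n - 1) / 2)


# Made-up IPC values for four basic blocks.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 4.0, 3.0]
print(coverage({"a", "b"}, {"a", "b", "c", "d"}))
print(rms_relative_error(predicted, measured))
print(kendall_tau(predicted, measured))
```

The RMS error rewards getting the absolute IPC right, while Kendall's tau only checks that the tool orders the blocks the same way as the hardware does.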
Results
- Results