Palmed: automatically modelling the backend

Microarch: ports, μops, pipeline, cycle, L1-res
Define Cyc(kernel)
Backend models
HW counters
Tools:
- Iaca
- UOPS
- llvm-mca
- PMEvo
SotA: we saw efforts to build backend models
they take considerable expert knowledge/time
based on reverse-engineering, HW counters
What if these counters are not as precise? (TODO: investigate ZEN, ARM)
Too many new CPU/archs anyway for the experts to catch up
Goal: make a benchmarks-based tool
- fully-automated
- yet as accurate as possible
Mostly the work of Nicolas Derumigny
I worked on Palmed as an engineer for about a year, gaining expertise in CPU architecture

Resource models

As seen before, CPU backend = ports
Instruction --> decode --> μop(s)
Each μop --> port able to process it
Ports: in most cases fully pipelined, i.e. each port accepts 1 μop/cycle (even though a μop takes longer than one cycle to complete).
Classical port mapping: insn -> μop -> possible ports (disjunctive)
- example where everything works well
- example with port overlap: 2xADDSS + BSR [cf palmed paper]
- nontrivial example? => in the general case, requires solving an optimisation problem
Resource model
- presentation
- formal definition
- can be solved with a max
- same examples => trivial to find throughput in any case
- drawback: combinatorial explosion
  - but this is very reasonable on real-life CPUs because ports are not random.
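
To make the "can be solved with a max" point concrete, here is a minimal sketch of throughput computation in a resource model. The table `RHO` and the resource names `r1` and `r01` are illustrative assumptions for the 2xADDSS + BSR example (ADDSS assumed to run on p0 or p1, BSR on p1 only); they are not values produced by Palmed.

```python
from collections import Counter, defaultdict
from fractions import Fraction

# Hypothetical resource model for the 2xADDSS + BSR example.
# Assumption: ADDSS can run on ports p0 or p1, BSR only on p1.
# "r1" stands for port p1 alone, "r01" for the combined pair {p0, p1}.
# RHO[insn][r] is the share of resource r used by one instance of insn:
# a combined resource of n ports absorbs n μops per cycle, hence 1/n.
RHO = {
    "ADDSS": {"r01": Fraction(1, 2)},
    "BSR": {"r1": Fraction(1), "r01": Fraction(1, 2)},
}


def cycles(kernel, rho):
    """Cyc(K): max over resources r of the sum over i in K of rho[i][r]."""
    load = defaultdict(Fraction)
    for insn, count in kernel.items():
        for resource, usage in rho[insn].items():
            load[resource] += count * usage
    return max(load.values(), default=Fraction(0))


# Kernel = multiset of instructions.
kernel = Counter({"ADDSS": 2, "BSR": 1})
print(cycles(kernel, RHO))  # r1 carries 1, r01 carries 3/2
```

No optimisation problem is solved here: the port overlap is encoded once in the combined resource `r01`, and every later query is a sum and a max.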

Find a resource model automatically
Concerned only with backend throughput
- No dependencies
- Completely ignore in-order effects
kernel = multiset of instructions

General idea: given enough well-chosen kernels and a measurement of their execution time, we can build a model.

Indeed, (K, Cyc(K)) => \max_{r \in R} \sum_{i \in K} \rho_{i,r} = Cyc(K) => many equations describing the \rho_{i,r}.
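
As an illustration of how measurements constrain the \rho, the sketch below checks a candidate model against a few (K, Cyc(K)) pairs. The measurements and both candidate tables are made-up values for the ADDSS/BSR example, and `worst_relative_error` is a hypothetical helper; this only evaluates a given model, it is not Palmed's solving procedure.

```python
from collections import Counter


def predicted_cycles(kernel, rho):
    """max over resources r of the sum over i in K of rho[i][r]."""
    resources = {r for usage in rho.values() for r in usage}
    return max(
        (sum(n * rho[i].get(r, 0.0) for i, n in kernel.items())
         for r in resources),
        default=0.0,
    )


def worst_relative_error(measurements, rho):
    """Largest |predicted - measured| / measured over all measurements."""
    return max(
        abs(predicted_cycles(kernel, rho) - cyc) / cyc
        for kernel, cyc in measurements
    )


# Made-up measurements (kernel, measured cycles), for illustration only.
measurements = [
    (Counter({"ADDSS": 2}), 1.0),
    (Counter({"BSR": 1}), 1.0),
    (Counter({"ADDSS": 2, "BSR": 1}), 1.5),
]

# Candidate A accounts for BSR also loading the combined resource r01.
rho_a = {"ADDSS": {"r01": 0.5}, "BSR": {"r1": 1.0, "r01": 0.5}}
# Candidate B forgets it, so it cannot explain the mixed kernel.
rho_b = {"ADDSS": {"r01": 0.5}, "BSR": {"r1": 1.0}}

print(worst_relative_error(measurements, rho_a))
print(worst_relative_error(measurements, rho_b))
```

Both candidates agree with the two single-instruction measurements; only the mixed kernel tells them apart, which is why the choice of kernels matters.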

[insert high-level view of Palmed]

Pipedream

Original work by F. Gruber, cont. by N. Derumigny and C. Guillon
Goal: Measure #cycles of a multiset of instructions
- Full throughput
- No dependencies
- L1-resident
Use HW counters to measure cycles (Papi)
Generate an asm kernel of the form

for NUM_MEASURES:
    HW_cycles_measure:
        for NUM_ITER:
            kernel
            kernel
            ...
            kernel

so that the unrolled loop body has >= UNROLL_SIZE insn, and UNROLL_SIZE * NUM_ITER >= TOTAL_INSN.
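
The two sizing constraints can be met as follows. This is a sketch only: `loop_shape` is a hypothetical helper and the numeric values are arbitrary, not Pipedream's actual defaults.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b, without going through floats."""
    return -(-a // b)


def loop_shape(kernel_size, unroll_size, total_insn):
    """Return (repetitions, num_iter) such that
    repetitions * kernel_size >= unroll_size
    and unroll_size * num_iter >= total_insn."""
    repetitions = ceil_div(unroll_size, kernel_size)
    num_iter = ceil_div(total_insn, unroll_size)
    return repetitions, num_iter


# Arbitrary example: a 3-instruction kernel, UNROLL_SIZE = 100,
# TOTAL_INSN = 1_000_000.
repetitions, num_iter = loop_shape(3, 100, 1_000_000)
print(repetitions, num_iter)
```

With these numbers the kernel is repeated 34 times (102 instructions in the loop body) and the loop runs 10000 times.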

Must instantiate insn:
- reg alloc
- mem addresses
Reg: split registers into read and write pool; enough read registers for each instruction.
- Read: always read from the same registers (R -> R dep is not a problem)
- Write: round-robin
  - On some architectures, W -> W dependencies do not allow full parallelism
  - On some ISAs, some insn have R+W operands
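
A minimal sketch of this register allocation policy, assuming each instruction is described only by its number of read and written register operands. `allocate_registers` and the register names are hypothetical, and instructions with R+W operands are deliberately left out.

```python
from itertools import cycle


def allocate_registers(instructions, read_pool, write_pool):
    """instructions: list of (mnemonic, n_reads, n_writes).

    Reads always use the same registers from the read pool;
    writes take registers from the write pool in round-robin order.
    Returns a list of (mnemonic, destinations, sources).
    """
    next_write = cycle(write_pool)
    allocated = []
    for mnemonic, n_reads, n_writes in instructions:
        if n_reads > len(read_pool):
            raise ValueError(f"not enough read registers for {mnemonic}")
        sources = list(read_pool[:n_reads])  # same registers every time
        destinations = [next(next_write) for _ in range(n_writes)]
        allocated.append((mnemonic, destinations, sources))
    return allocated


# Illustrative pools: two read-only registers, two write-only registers.
body = allocate_registers(
    [("add", 2, 1)] * 3, read_pool=["x0", "x1"], write_pool=["x2", "x3"]
)
for mnemonic, destinations, sources in body:
    print(mnemonic, destinations, sources)
```

Since the two pools are disjoint, no instruction reads a register written by another one, so no R -> W dependency is introduced.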
Mem: allocate a memory arena, L1-sized. Split into read and write pool.
- Direct register addressing mode (eg ldr x0, [x1]): always the same address (load/store separated)
- Base-index-displacement mode: constant base, 0 offset, round-robin displacement on x86 (constant displacement on ARM)
Whenever possible (\sum_i(lat_i) < #reg), no data dependency during measurement
L1-residence: memory arena is small enough; warm-up rounds.

=> kernel throughput measurement.

Note: this works only because we measure a multiset of instructions, not a given asm code. We control the operands.

With all this, Palmed is capable of producing throughput models.

Tried on x86 (SKX, ZEN1) and ARM (A72). => results

important both for efficiency and reproducibility
efficiency: avoid re-computing already made measurements
reproducibility: all the raw data is available after the run
- ability to derive the model from raw data again
- ability to assess the quality of raw measurements
- backup/restore

SPEC: real-world programs
- Mainly made to evaluate hardware on a fixed workload
- Also provides a fixed workload to evaluate software experiments
  - Used throughout the literature
- Describe versions of SPEC, architecture
Polybench
- 30 numerical computations
- Computation kernels: domain specific (sci. computation, math, …)
- Kernel well-defined; no need to "figure out" the interesting basic blocks
- C language
- datasets

Harness to evaluate Palmed against other code analyzers
- Raw pipedream
- Iaca
- llvm-mca
- PMEvo
- UOPS
  - UiCA is not included: it did not exist at the time, and leaving it out keeps the comparison fair (Palmed models the backend only)
Based on basic blocks
The kernel is defined as a Palmed kernel: unordered, no dependencies
- in practice, use Pipedream generated code as kernel
As Pipedream doesn't support all instructions, unsupported instructions (e.g. control flow) must be stripped from the kernel
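
Turning a basic block into a Palmed kernel can be sketched as below. The mnemonics and the `supported` set are illustrative assumptions, and `to_kernel` is a hypothetical helper, not the harness's actual interface.

```python
from collections import Counter


def to_kernel(basic_block, supported):
    """Drop unsupported instructions, then forget ordering.

    Returns (kernel, dropped): kernel is a multiset (Counter) of the
    supported instructions, dropped lists the instructions stripped out.
    """
    kernel = Counter(insn for insn in basic_block if insn in supported)
    dropped = [insn for insn in basic_block if insn not in supported]
    return kernel, dropped


# Illustrative basic block ending with a control-flow instruction.
block = ["ADDSS", "BSR", "ADDSS", "JNE"]
kernel, dropped = to_kernel(block, supported={"ADDSS", "BSR"})
print(kernel, dropped)
```

Keeping the list of dropped instructions makes it possible to report how much of each basic block is actually covered by the evaluation.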

Measures: