
# Palmed: automatically modelling the backend

[[PREREQUISITES]]
* Microarch: ports, μops, pipeline, cycle, L1-res
* Define Cyc(kernel)
* Backend models
* HW counters
* Tools:
    * IACA
    * UOPS
    * llvm-mca
    * PMEvo
[[END]]

* SotA: we saw efforts to build backend models
* they take considerable expert knowledge/time
* based on reverse-engineering, HW counters
* What if these counters are not as precise? (TODO: investigate ZEN, ARM)
* Too many new CPU/archs anyway for the experts to catch up

* Goal: make a benchmarks-based tool
    * fully-automated
    * yet as accurate as possible

* Mostly the work of Nicolas Derumigny
* I worked on Palmed as an engineer for about a year, gaining expertise in CPU
  architecture

## Resource models

* As seen before, CPU backend = ports
* Instruction --> decode --> μop(s)
* Each μop --> port able to process it
* Ports: in most cases fully pipelined, i.e. each port accepts 1 μop/cycle
  (even though a μop takes longer than one cycle to complete).
* Classical port mapping: insn -> μop -> possible ports (disjunctive)
    * example where everything works well
    * example with port overlap: 2xADDSS + BSR [cf palmed paper]
    * nontrivial example?
    => in the general case, requires solving an optimisation problem
* Resource model
    * presentation
    * formal definition
    * can be solved with a max
    * same examples
    => trivial to find throughput in any case
    * drawback: combinatorial explosion
        * but this is very reasonable on real-life CPUs because ports are not
          random.
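
A minimal sketch of the "solved with a max" point above. The resource names, the port assignments (ADDSS on p0 or p1, BSR on p1 only) and the `cycles` helper are assumptions made for illustration, not values taken from Palmed's output.

```python
from collections import Counter, defaultdict
from fractions import Fraction

# Hypothetical resource model: RHO[insn][resource] is the share of the
# resource (in cycles) used by one instance of the instruction.
# Assumption: "r01" is the combined resource for ports {p0, p1}
# (2 uops/cycle) and "r1" stands for port p1 alone.
RHO = {
    "ADDSS": {"r01": Fraction(1, 2)},
    "BSR": {"r01": Fraction(1, 2), "r1": Fraction(1)},
}

def cycles(kernel, rho=RHO):
    """Cyc(K) = max over resources of the summed usage of K's instructions."""
    load = defaultdict(Fraction)  # Fraction() == 0
    for insn, count in Counter(kernel).items():
        for resource, usage in rho[insn].items():
            load[resource] += count * usage
    return max(load.values())

# 2xADDSS + BSR: r01 carries 3/2 cycles, r1 carries 1 cycle.
print(cycles(["ADDSS", "ADDSS", "BSR"]))  # 3/2
```

No optimisation problem has to be solved here: the port overlap is already encoded in the combined resource `r01`.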

## Palmed

* Find a resource model automatically
* Concerned only with backend throughput
    * No dependencies
    * Completely ignore in-order effects
* kernel = multiset of instructions

**General idea:** given enough well-chosen kernels and a measurement of their
execution time, we can build a model.

Indeed, (K, Cyc(K)) => `\max_{r \in R} \sum_{i \in K} \rho_{i,r} = Cyc(K)`
=> many equations describing the `\rho_{i,r}`.
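
As a worked instance of this equation, consider a hypothetical model with two resources `r_1`, `r_2` and a kernel `K` made of two copies of an instruction `a` and one copy of an instruction `b` (all of these names are placeholders for illustration). Measuring `K` then yields a single constraint:

```latex
\max\left(
    2\rho_{a,r_1} + \rho_{b,r_1},\;
    2\rho_{a,r_2} + \rho_{b,r_2}
\right) = \mathrm{Cyc}(K),
\qquad K = \{a, a, b\}
```

Each additional measured kernel adds one such constraint on the same unknowns.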

* multi-stage model, builds intermediary results

[insert high-level view of Palmed]

* quickly describe intermediary results
* classes of instructions: will be useful later

## Actually measuring a kernel's throughput

Pipedream
* Original work by F. Gruber, cont. by N. Derumigny and C. Guillon
* Goal: Measure #cycles of a multiset of instructions
    * Full throughput
    * No dependencies
    * L1-resident

* Use HW counters to measure cycles (Papi)
* Generate an asm kernel of the form
```
for NUM_MEASURES:
    HW_cycles_measure:
        for NUM_ITER:
            kernel
            kernel
            ...
            kernel
```
so that the unrolled loop body has >= `UNROLL_SIZE` insn, and
`UNROLL_SIZE * NUM_ITER >= TOTAL_INSN`.
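
The two constraints above can be sketched as follows. The function name `unroll_plan` and the numeric values are hypothetical; only the two inequalities come from the text.

```python
import math

def unroll_plan(kernel_len, unroll_size, total_insn):
    """Return (copies, num_iter) for a kernel of kernel_len instructions.

    copies:   kernel repetitions in the loop body, so that the body
              holds at least unroll_size instructions.
    num_iter: loop iterations, so that unroll_size * num_iter >= total_insn.
    """
    copies = math.ceil(unroll_size / kernel_len)
    num_iter = math.ceil(total_insn / unroll_size)
    return copies, num_iter

# Hypothetical parameters: 3-instruction kernel, UNROLL_SIZE = 100,
# TOTAL_INSN = 10000.
copies, num_iter = unroll_plan(kernel_len=3, unroll_size=100, total_insn=10_000)
print(copies, num_iter)  # 34 100
```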

* Must instantiate insn:
    * reg alloc
    * mem addresses
* Reg: split registers into read and write pool; enough read registers for
  each instruction.
    * Read: always read from the same registers (R -> R dep is not a problem)
    * Write: round-robin
        * On some architectures, W -> W dependencies do not allow full
          parallelism
        * On some ISAs, some insn have R+W operands
* Mem: allocate a memory arena, L1-sized. Split into read and write pool.
    * Direct register addressing mode (e.g. `ldr x0, [x1]`): always the same
      address (loads and stores kept separate)
    * Base-index-displacement mode: constant base, 0 offset, round-robin
      displacement on x86 (constant displacement on ARM)
* Whenever possible (`\sum_i(lat_i) < #reg`), no data dependency during
  measurement
* L1-residence: memory arena is small enough; warm-up rounds.
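
A minimal sketch of the register instantiation scheme described above, assuming a flat list of interchangeable registers. The `RegisterPools` class is hypothetical and is not Pipedream's actual interface; it also leaves out R+W operands.

```python
from itertools import cycle

class RegisterPools:
    """Hypothetical allocator: fixed read registers, round-robin writes."""

    def __init__(self, registers, num_read):
        self.read_pool = registers[:num_read]
        self.write_pool = cycle(registers[num_read:])

    def allocate(self, num_reads):
        """Return (destination, sources) for one instruction instance."""
        # Reads always use the same registers: R -> R is not a dependency.
        sources = self.read_pool[:num_reads]
        # Writes rotate through the pool to space out W -> W dependencies.
        destination = next(self.write_pool)
        return destination, sources

pools = RegisterPools([f"x{i}" for i in range(8)], num_read=2)
allocations = [pools.allocate(num_reads=2) for _ in range(7)]
for destination, sources in allocations:
    print(destination, sources)
```

With 8 registers and 2 reserved for reads, the destinations cycle through `x2` to `x7` before wrapping around to `x2`.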

=> kernel throughput measurement.

Note: this works only because we measure a multiset of instructions, not a
given asm code. We control the operands.

## Results

With all this, Palmed is capable of producing throughput models.

Tried on x86 (SKX, ZEN1) and ARM (A72).
=> results

## Contributions

### Reproducibility: measurements database

* important both for efficiency and reproducibility
* efficiency: avoid re-running measurements that were already made
* reproducibility: all the raw data is available after the run
    * ability to derive the model from raw data again
    * ability to assess the quality of raw measurements
    * backup/restore
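
One way such a database could look, as a sketch only: the `MeasurementDB` class, its schema and the use of SQLite are assumptions for illustration, not a description of Palmed's actual storage layer.

```python
import json
import sqlite3

class MeasurementDB:
    """Hypothetical cache of kernel measurements, keyed on the multiset."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS measures "
            "(kernel TEXT PRIMARY KEY, cycles REAL)"
        )

    @staticmethod
    def key(kernel):
        # A kernel is a multiset: sort so that instruction order is ignored.
        return json.dumps(sorted(kernel))

    def cycles(self, kernel, measure):
        """Return the stored value, calling measure(kernel) only on a miss."""
        key = self.key(kernel)
        row = self.conn.execute(
            "SELECT cycles FROM measures WHERE kernel = ?", (key,)
        ).fetchone()
        if row is not None:
            return row[0]
        value = measure(kernel)
        self.conn.execute("INSERT INTO measures VALUES (?, ?)", (key, value))
        self.conn.commit()
        return value

calls = []

def fake_measure(kernel):
    # Stand-in for a real benchmark run; the value is made up.
    calls.append(kernel)
    return 1.5

db = MeasurementDB()
first = db.cycles(["ADDSS", "BSR"], fake_measure)
second = db.cycles(["BSR", "ADDSS"], fake_measure)  # same multiset: cache hit
print(first, second, len(calls))
```

Passing a file path instead of `:memory:` keeps the raw measurements after the run, which is what allows the model to be re-derived later.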

### Evaluation

#### Bench suites: SPEC, Polybench

* SPEC: real-world programs
    * Mainly made to evaluate hardware on a fixed workload
    * The fixed workload also serves to evaluate software experiments
        * Used throughout the literature
    * Describe versions of SPEC, architecture
* Polybench
    * 30 numerical computations
    * Computation kernels: domain specific (sci. computation, math, …)
    * Kernel well-defined; no need to "figure out" the interesting basic blocks
    * C language
    * datasets

#### Experimental setup

* Harness to evaluate Palmed against other code analyzers
    * Raw pipedream
    * IACA
    * llvm-mca
    * PMEvo
    * UOPS
        * uiCA did not exist at the time; also a fairer comparison (Palmed
          models only the backend)
* Based on basic blocks
* The kernel is defined as a Palmed kernel: unordered, no dependencies
    * in practice, use Pipedream generated code as kernel
* As Pipedream doesn't support all instructions, some instructions must be
  stripped from the kernel (e.g. control flow)

Measures:
* Coverage: proportion of benchmarks supported by the tool, relative to Palmed
* RMS error of IPC
* Kendall's tau for the IPCs
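
A sketch of how the three measures could be computed. The exact definitions are assumptions: the RMS error is taken here as an unweighted relative error, Kendall's tau as the tau-a variant, and the coverage denominator as the set of benchmarks Palmed supports. The thesis may use weighted or otherwise different forms.

```python
import math
from itertools import combinations

def coverage(supported_by_tool, supported_by_palmed):
    """Share of the benchmarks Palmed supports that the tool also supports."""
    return supported_by_tool / supported_by_palmed

def rms_relative_error(predicted, measured):
    """Root mean square of (predicted - measured) / measured."""
    errors = [((p - m) / m) ** 2 for p, m in zip(predicted, measured)]
    return math.sqrt(sum(errors) / len(errors))

def kendall_tau(predicted, measured):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    pairs = list(combinations(range(len(predicted)), 2))
    score = 0
    for i, j in pairs:
        product = (predicted[i] - predicted[j]) * (measured[i] - measured[j])
        score += (product > 0) - (product < 0)
    return score / len(pairs)

# Made-up IPC values for three basic blocks.
predicted_ipc = [1.0, 2.0, 3.0]
measured_ipc = [1.1, 2.5, 3.9]
print(kendall_tau(predicted_ipc, measured_ipc))  # 1.0: same ordering
```

Kendall's tau only looks at the ordering of the IPCs, so a tool can score well on it while still having a large RMS error.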

#### Results

* Results