# Palmed: automatically modelling the backend

* SotA: we saw efforts to build backend models
  * they take considerable expert knowledge/time
  * based on reverse-engineering, HW counters
  * What if these counters are not as precise? (TODO: investigate ZEN, ARM)
  * Too many new CPU/archs anyway for the experts to catch up
* Goal: make a benchmarks-based tool
  * fully-automated
  * yet as accurate as possible
* Mostly the work of Nicolas Derumigny
  * I worked on Palmed as an engineer for about a year, gained expertise in CPU architecture

## Resource models

* As seen before, CPU backend = ports
  * Instruction --> decode --> μop(s)
  * Each μop --> port able to process it
  * Ports: in most cases, fully pipelined. 1 μop/cycle (even though time to completion is longer).
* Classical port mapping: insn -> μop -> possible ports (disjunctive)
  * example where everything works well
  * example with port overlap: 2xADDSS + BSR [cf palmed paper]
  * nontrivial example? => in the general case, requires solving an optimisation problem
* Resource model
  * presentation
  * formal definition
  * can be solved with a max
  * same examples => trivial to find throughput in any case
  * drawback: combinatorial explosion
    * but this is very reasonable on real-life CPUs because ports are not random.

## Palmed

* Find a resource model automatically
* Concerned only with backend throughput
  * No dependencies
  * Ignore completely in-order effects
* kernel = multiset of instructions

**General idea:** given enough well-chosen kernels, and a measure of their execution time, we can build a model. Indeed, (K, Cyc(K)) => `max_{r∈R} (sum_{i∈K} \rho_{i,r}) = Cyc(K)` => many equations describing the `\rho_{i,r}`.

* multi-stage model, builds intermediary results

[insert high-level view of Palmed]

* quickly describe intermediary results
  * classes of instructions: will be useful later

## Actually measuring a kernel's throughput

Pipedream

* Original work by F. Gruber, cont. by N. Derumigny and C. Guillon
* Goal: Measure #cycles of a multiset of instructions
  * Full throughput
  * No dependencies
  * L1-resident
* Use HW counters to measure cycles (PAPI)
* Generate an asm kernel of the form

  ```
  for NUM_MEASURES:
      HW_cycles_measure:
          for NUM_ITER:
              kernel
              kernel
              ...
              kernel
  ```

  so that the unrolled body of the loop has >= `UNROLL_SIZE` insn, and `UNROLL_SIZE * NUM_ITER >= TOTAL_INSN`.
* Must instantiate insn:
  * reg alloc
  * mem addresses
* Reg: split registers into read and write pool; enough read registers for each instruction.
  * Read: always read from the same registers (R -> R dep is not a problem)
  * Write: round-robin
    * On some architectures, W -> W dependencies do not allow full parallelism
    * On some ISAs, some insn have R+W operands
* Mem: allocate a memory arena, L1-sized. Split into read and write pool.
  * Direct register addressing mode (e.g. `ldr x0, [x1]`): always the same address (load/store separated)
  * Base-index-displacement mode: constant base, 0 offset, round-robin displacement.
* Whenever possible (`\sum_i(lat_i) < #reg`), no data dependency during measurement
* L1-residence: memory arena is small enough; warm-up rounds.

=> kernel throughput measurement.

Note: this works only because we measure a multiset of instructions, not a given asm code. We control the operands.

## Results

With all this, Palmed is capable of producing throughput models. Tried on x86 (SKX, ZEN1) and ARM (A72).
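
To make concrete what such a throughput model computes, here is a minimal sketch that evaluates the resource-model formula from the Palmed section, `Cyc(K) = max_r sum_{i∈K} \rho_{i,r}`, on a toy model. The resource names, the `RHO` table and its values are assumptions made up for illustration; they are neither measured data nor actual Palmed output.

```python
from collections import Counter

# Hypothetical resource model: RHO[insn][resource] is the number of cycles of
# that resource consumed by one instance of insn. Illustrative values only.
RHO = {
    "ADDSS": {"r01": 0.5},
    "BSR": {"r1": 1.0, "r01": 0.5},
}


def predicted_cycles(kernel):
    """Return max over resources r of the sum over i in K of RHO[i][r]."""
    load = {}
    for insn, count in kernel.items():
        for resource, usage in RHO[insn].items():
            load[resource] = load.get(resource, 0.0) + count * usage
    # The most loaded resource is the bottleneck of the kernel.
    return max(load.values(), default=0.0)


# A kernel is a multiset of instructions: here 2 x ADDSS + 1 x BSR.
kernel = Counter({"ADDSS": 2, "BSR": 1})
print(predicted_cycles(kernel))
```

With these made-up coefficients, the combined resource `r01` is the bottleneck, with a load of 1.5 cycles against 1.0 for `r1`. No optimisation problem needs to be solved: a sum per resource and a max suffice.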
=> results

## Contributions

### Reproducibility: measurements database

* important both for efficiency and reproducibility
  * efficiency: avoid re-computing already made measurements
  * reproducibility: all the raw data is available after the run
    * ability to derive the model from raw data again
    * ability to assess the quality of raw measurements
* backup/restore

### Evaluation

* Harness to evaluate Palmed against other code analyzers
  * Raw Pipedream
  * Gus
  * IACA
  * UOPS
  * llvm-mca
  * PMEvo
* Based on basic blocks
  * The kernel is defined as a Palmed kernel: unordered, no dependencies
  * in practice, use Pipedream-generated code as kernel
  * As Pipedream doesn't support all instructions, some instructions must be stripped from the kernel (e.g. control flow)

Measures:

* Coverage: proportion of benchmarks supported by the tool, wrt. Palmed
* RMS error of IPC
* Kendall's tau for the IPCs
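
The three measures can be sketched as follows. This is only an illustration under assumptions: the error is taken as an unweighted relative error, Kendall's tau is the tau-a variant (ties count as neither concordant nor discordant), and the function names and IPC values are hypothetical. The actual harness may weight or normalise differently.

```python
import math
from itertools import combinations


def coverage(predicted, reference):
    """Fraction of the reference benchmarks for which the tool gave a prediction."""
    return sum(1 for name in reference if name in predicted) / len(reference)


def rms_relative_error(predicted, measured):
    """Root mean square of (predicted - measured) / measured on common benchmarks."""
    common = [name for name in measured if name in predicted]
    squares = [((predicted[n] - measured[n]) / measured[n]) ** 2 for n in common]
    return math.sqrt(sum(squares) / len(squares))


def kendall_tau_a(predicted, measured):
    """Kendall's tau-a: (concordant - discordant) pairs divided by all pairs."""
    common = [name for name in measured if name in predicted]
    score = 0
    for a, b in combinations(common, 2):
        product = (predicted[a] - predicted[b]) * (measured[a] - measured[b])
        score += (product > 0) - (product < 0)  # +1 concordant, -1 discordant
    n = len(common)
    return score / (n * (n - 1) / 2)


# Hypothetical IPC values for four basic blocks (made up, not experimental data).
measured = {"bb0": 2.0, "bb1": 4.0, "bb2": 1.0, "bb3": 3.0}
predicted = {"bb0": 2.0, "bb1": 3.0, "bb2": 1.5}  # the tool failed on bb3

print(coverage(predicted, measured))
print(rms_relative_error(predicted, measured))
print(kendall_tau_a(predicted, measured))
```

On this toy data the tool covers 3 of the 4 blocks, and it ranks the three common blocks in the same order as the measurements, so tau is 1 even though the RMS error is non-zero. This is why both measures are reported: one captures accuracy, the other captures whether the tool orders kernels correctly.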