From c28639fdec3ef55a8ebc4eceb26bf0f49650563a Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Th=C3=A9ophile=20Bastian?= <contact@tobast.fr>
Date: Wed, 6 Sep 2023 17:52:50 +0200
Subject: [PATCH] Tentative plan for Palmed

---
 plan/20_SotA.md   |  14 +++++
 plan/30_palmed.md | 140 +++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 153 insertions(+), 1 deletion(-)

diff --git a/plan/20_SotA.md b/plan/20_SotA.md
index 4ffa2d1..a293a23 100644
--- a/plan/20_SotA.md
+++ b/plan/20_SotA.md
@@ -16,3 +16,17 @@ Throughput pred. :
 * PMEvo
 * OSACA
 * UiCA
+
+Benchmark suites:
+* Polybench
+* SPEC
+
+Backend models:
+* To predict the throughput of a kernel, a precise model of the CPU backend is
+  required
+* Could be obtained from the manufacturer: ARM A72 optimization guide, Intel
+  manual, …
+  * but this is often incomplete, sometimes even wrong
+* Agner Fog
+* Uops.info
+
diff --git a/plan/30_palmed.md b/plan/30_palmed.md
index be5da08..045034d 100644
--- a/plan/30_palmed.md
+++ b/plan/30_palmed.md
@@ -1,5 +1,143 @@
 # Palmed: automatically modelling the backend
 
-## Introducing Palmed
+* SotA: we saw efforts to build backend models
+* they take considerable expert knowledge/time
+* based on reverse-engineering, HW counters
+* What if these counters are not as precise? (TODO: investigate ZEN, ARM)
+* Too many new CPU/archs anyway for the experts to catch up
+
+* Goal: make a benchmarks-based tool
+  * fully-automated
+  * yet as accurate as possible
+
+* Mostly the work of Nicolas Derumigny
+* I worked on Palmed as an engineer for about a year, gaining expertise in CPU
+  architecture
+
+## Resource models
+
+* As seen before, CPU backend = ports
+* Instruction --> decode --> μop(s)
+* Each μop --> port able to process it
+* Ports: in most cases, fully pipelined. 1μop/cycle (even though time to
+  completion is longer).
+* Classical port mapping: insn -> μop -> possible ports (disjunctive)
+  * example where everything works well
+  * example with port overlap: 2xADDSS + BSR [cf palmed paper]
+  * nontrivial example?
+  => in the general case, requires solving an optimisation problem
+* Resource model
+  * presentation
+  * formal definition
+  * can be solved with a max
+  * same examples
+  => trivial to find throughput in any case
+  * drawback: combinatorial explosion
+  * but this is very reasonable on real-life CPUs because ports are not
+    random.
+
+## Palmed
+
+* Find a resource model automatically
+* Concerned only with backend throughput
+  * No dependencies
+  * Ignore completely in-order effects
+* kernel = multiset of instructions
+
+**General idea:** given enough, well-chosen kernels, and a measure of their
+execution time, we can build a model.
+
+Indeed, (K, Cyc(K)) => `max_r∈R(sum_i∈K (\rho_i,r) ) = Cyc(K)`
+=> many equations describing the \rho.
+
+* multi-stage model, builds intermediary results
+
+[insert high-level view of Palmed]
+
+* quickly describe intermediary results
+* classes of instructions: will be useful later
+
+## Actually measuring a kernel's throughput
+
+Pipedream
+* Original work by F. Gruber, cont. by N. Derumigny and C. Guillon
+* Goal: Measure #cycles of a multiset of instructions
+  * Full throughput
+  * No dependencies
+  * L1-resident
+
+* Use HW counters to measure cycles (Papi)
+* Generate an asm kernel of the form
+```
+for NUM_MEASURES:
+  HW_cycles_measure:
+    for NUM_ITER:
+      kernel
+      kernel
+      ...
+      kernel
+```
+so that unrolled body of the loop has >= `UNROLL_SIZE` insn, and `UNROLL_SIZE *
+NUM_ITER >= TOTAL_INSN`.
+
+* Must instantiate insn:
+  * reg alloc
+  * mem addresses
+* Reg: split registers into read and write pool; enough read registers for
+  each instruction.
+  * Read: always read from the same registers (R -> R dep is not a problem)
+  * Write: round-robin
+  * On some architectures, W -> W dependencies do not allow full
+    parallelism
+  * On some ISAs, some insn have R+W operands
+* Mem: allocate a memory arena, L1-sized. Split into read and write pool.
+  * Direct register addressing mode (e.g. `ldr x0, [x1]`): always the same
+    address (load/store separated)
+  * Base-index-displacement mode: constant base, 0 offset, round-robin
+    displacement.
+* Whenever possible (`\sum_i(lat_i) < #reg`), no data dependency during
+  measurement
+* L1-residence: memory arena is small enough; warm-up rounds.
+
+=> kernel throughput measurement.
+
+Note: this works only because we measure a multiset of instructions, not a
+given asm code. We control the operands.
+
+## Results
+
+With all this, Palmed is capable of producing throughput models.
+
+Tried on x86 (SKX, ZEN1) and ARM (A72).
+=> results
 
 ## Contributions
+
+### Reproducibility: measurements database
+
+* important both for efficiency and reproducibility
+* efficiency: avoid re-computing already made measurements
+* reproducibility: all the raw data is available after the run
+  * ability to derive the model from raw data again
+  * ability to assess the quality of raw measurements
+  * backup/restore
+
+### Evaluation
+
+* Harness to evaluate Palmed against other code analyzers
+  * Raw Pipedream
+  * Gus
+  * IACA
+  * UOPS
+  * llvm-mca
+  * PMEvo
+* Based on basic blocks
+* The kernel is defined as a Palmed kernel: unordered, no dependencies
+  * in practice, use Pipedream generated code as kernel
+* As Pipedream doesn't support all instructions, some instructions must be
+  stripped from the kernel (e.g. control flow)
+
+Measures:
+* Coverage: proportion of benchmarks supported by the tool, wrt. Palmed
+* RMS error of IPC
+* Kendall's tau for the IPCs
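
The three measures listed at the end of the Evaluation notes can be sketched as follows. This is only an illustration under stated assumptions, not the harness actually used for Palmed: the function names and sample values are hypothetical, the RMS error is taken as an unweighted relative error on IPC (the plan does not say whether it is relative or weighted), and Kendall's tau is the plain tau-a variant without tie correction.

```python
from itertools import combinations
from math import sqrt


def coverage(tool_supported, palmed_supported):
    """Fraction of the Palmed-supported benchmarks that the tool also supports."""
    palmed = set(palmed_supported)
    return len(palmed & set(tool_supported)) / len(palmed)


def rms_relative_error(predicted, measured):
    """Root-mean-square of the relative IPC error (unweighted: an assumption)."""
    errors = [(p - m) / m for p, m in zip(predicted, measured)]
    return sqrt(sum(e * e for e in errors) / len(errors))


def kendall_tau(predicted, measured):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    score = 0
    pairs = 0
    for (p1, m1), (p2, m2) in combinations(zip(predicted, measured), 2):
        product = (p1 - p2) * (m1 - m2)
        # +1 for a concordant pair, -1 for a discordant one, 0 for a tie.
        score += (product > 0) - (product < 0)
        pairs += 1
    return score / pairs


# Hypothetical IPC values for four basic blocks, for illustration only.
measured_ipc = [1.0, 2.0, 4.0, 3.0]
predicted_ipc = [1.0, 2.5, 3.0, 3.0]

if __name__ == "__main__":
    print(coverage({"bb1", "bb2", "bb4"}, {"bb1", "bb2", "bb3", "bb4"}))
    print(rms_relative_error(predicted_ipc, measured_ipc))
    print(kendall_tau(predicted_ipc, measured_ipc))
```

The RMS error captures how far the predictions are from the measurements, while Kendall's tau only looks at whether the tool ranks the basic blocks in the same order as the measurements, which is why the plan lists both.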