From c28639fdec3ef55a8ebc4eceb26bf0f49650563a Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Th=C3=A9ophile=20Bastian?= <contact@tobast.fr>
Date: Wed, 6 Sep 2023 17:52:50 +0200
Subject: [PATCH] Tentative plan for Palmed

---
 plan/20_SotA.md   |  14 +++++
 plan/30_palmed.md | 140 +++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 153 insertions(+), 1 deletion(-)

diff --git a/plan/20_SotA.md b/plan/20_SotA.md
index 4ffa2d1..a293a23 100644
--- a/plan/20_SotA.md
+++ b/plan/20_SotA.md
@@ -16,3 +16,17 @@ Throughput pred. :
 * PMEvo
 * OSACA
 * UiCA
+
+Benchmark suites:
+* Polybench
+* SPEC
+
+Backend models:
+* To predict the throughput of a kernel, a precise model of the CPU backend is
+  required
+* Could be obtained from the manufacturer: ARM A72 optimization guide, Intel
+  manual, …
+    * but this is often incomplete, sometimes even wrong
+* Agner Fog
+* Uops.info
+
diff --git a/plan/30_palmed.md b/plan/30_palmed.md
index be5da08..045034d 100644
--- a/plan/30_palmed.md
+++ b/plan/30_palmed.md
@@ -1,5 +1,143 @@
 # Palmed: automatically modelling the backend
 
-## Introducing Palmed
+* SotA: we saw efforts to build backend models
+* they take considerable expert knowledge/time
+* based on reverse-engineering, HW counters
+* What if these counters are not as precise? (TODO: investigate ZEN, ARM)
+* Too many new CPU/archs anyway for the experts to catch up
+
+* Goal: make a benchmarks-based tool
+    * fully-automated
+    * yet as accurate as possible
+
+* Mostly the work of Nicolas Derumigny
+* I worked on Palmed as engineer about a year, gain expertise in CPU
+  architecture
+
+## Resource models
+
+* As seen before, CPU backend = ports
+* Instruction --> decode --> μop(s)
+* Each μop --> port able to process it
+* Ports: in most cases, fully pipelined. 1μop/cycle (even though time to
+  completion is longer).
+* Classical port mapping: insn -> μop -> possible ports (disjunctive)
+    * example where everything works well
+    * example with port overlap: 2xADDSS + BSR [cf palmed paper]
+    * nontrivial example?
+    => in the general case, requires solving an optimisation problem
+* Resource model
+    * presentation
+    * formal definition
+    * can be solved with a max
+    * same examples
+    => trivial to find throughput in any case
+    * drawback: combinatorial explosion
+        * but this is very reasonable on real-life CPUs because ports are not
+          random.
+
+## Palmed
+
+* Find a resource model automatically
+* Concerned only with backend throughput
+    * No dependencies
+    * Ignore completely in-order effects
+* kernel = multiset of instructions
+
+**General idea:** given enough, well-chosen kernels, and a measure of their
+execution time, we can build a model.
+
+Indeed, (K, Cyc(K)) => `max_r∈R(sum_i∈K (\rho_i,r) ) = Cyc(K)`
+=> many equations describing the \rho.
+
+* multi-stage model, builds intermediary results
+
+[insert high-level view of Palmed]
+
+* quickly describe intermediary results
+* classes of instructions: will be useful later
+
+## Actually measuring a kernel's throughput
+
+Pipedream
+* Original work by F. Gruber, cont. by N. Derumigny and C. Guillon
+* Goal: Measure #cycles of a multiset of instructions
+    * Full throughput
+    * No dependencies
+    * L1-resident
+
+* Use HW counters to measure cycles (Papi)
+* Generate an asm kernel of the form
+```
+for NUM_MEASURES:
+    HW_cycles_measure:
+        for NUM_ITER:
+            kernel
+            kernel
+            ...
+            kernel
+```
+so that unrolled body of the loop has >= `UNROLL_SIZE` insn, and `UNROLL_SIZE *
+NUM_ITER >= TOTAL_INSN`.
+
+* Must instantiate insn:
+    * reg alloc
+    * mem addresses
+* Reg: split registers into read and write pool; enough read registers for
+  each instruction.
+    * Read: always read from the same registers (R -> R dep is not a problem)
+    * Write: round-robin
+        * On some architectures, W -> W dependencies does not allow full
+          parallelism
+        * On some ISAs, some insn have R+W operands
+* Mem: allocate a memory arena, L1-sized. Split into read and write pool.
+    * Direct register addressing mode (eg `ldr x0, [x1]`): always the same
+      address (load/store separated)
+    * Base-index-displacement mode: constant base, 0 offset, round-robin
+      displacement.
+* Whenever possible (`\sum_i(lat_i) < #reg`), no data dependency during
+  measurement
+* L1-residence: memory arena is small enough; warm-up rounds.
+
+=> kernel throughput measurement.
+
+Note: this works only because we measure a multiset of instructions, not a
+given asm code. We control the operands.
+
+## Results
+
+With all this, Palmed is capable of producing throughput models.
+
+Tried on x86 (SKX, ZEN1) and ARM (A72).
+=> results
 
 ## Contributions
+
+### Reproducibility: measurements database
+
+* important both for efficiency and reproducibility
+* efficiency: avoid re-computing already made measurements
+* reproducibility: all the raw data is available after the run
+    * ability to derive the model from raw data again
+    * ability to assess the quality of raw measurements
+    * backup/restore
+
+### Evaluation
+
+* Harness to evaluate Palmed against other code analyzers
+    * Raw pipedream
+    * Gus
+    * Iaca
+    * UOPS
+    * llvm-mca
+    * PMEvo
+* Based on basic blocks
+* The kernel is defined as a Palmed kernel: unordered, no dependencies
+    * in practice, use Pipedream generated code as kernel
+* As Pipedream doesn't support all instructions, some instructions must be
+  stripped from the kernel (eg. control flow)
+
+Measures:
+* Coverage: proportion of benchmarks supported by the tool, wrt. Palmed
+* RMS error of IPC
+* Kendall's tau for the IPCs