Tentative plan for Palmed
parent
f3b6936736
commit
c28639fdec
2 changed files with 153 additions and 1 deletions
|
@ -16,3 +16,17 @@ Throughput pred. :
|
|||
* PMEvo
|
||||
* OSACA
|
||||
* uiCA
|
||||
|
||||
Benchmark suites:
|
||||
* Polybench
|
||||
* SPEC
|
||||
|
||||
Backend models:
|
||||
* To predict the throughput of a kernel, a precise model of the CPU backend is
|
||||
required
|
||||
* Could be obtained from the manufacturer: ARM A72 optimization guide, Intel
|
||||
manual, …
|
||||
* but this is often incomplete, sometimes even wrong
|
||||
* Agner Fog
|
||||
* Uops.info
|
||||
|
||||
|
|
|
@ -1,5 +1,143 @@
|
|||
# Palmed: automatically modelling the backend
|
||||
|
||||
## Introducing Palmed
|
||||
* SotA: we saw efforts to build backend models
|
||||
* they take considerable expert knowledge/time
|
||||
* based on reverse-engineering, HW counters
|
||||
* What if these counters are not precise enough? (TODO: investigate Zen, ARM)
|
||||
* Too many new CPUs/architectures for the experts to keep up anyway
|
||||
|
||||
* Goal: make a benchmark-based tool
|
||||
* fully-automated
|
||||
* yet as accurate as possible
|
||||
|
||||
* Mostly the work of Nicolas Derumigny
|
||||
* I worked on Palmed as an engineer for about a year and gained expertise in CPU architecture
|
||||
|
||||
## Resource models
|
||||
|
||||
* As seen before, CPU backend = ports
|
||||
* Instruction --> decode --> μop(s)
|
||||
* Each μop --> port able to process it
|
||||
* Ports: in most cases, fully pipelined: each accepts 1 μop/cycle (even though a μop takes longer than one cycle to complete).
|
||||
* Classical port mapping: insn -> μop -> possible ports (disjunctive)
|
||||
* example where everything works well
|
||||
* example with port overlap: 2xADDSS + BSR [cf. Palmed paper]
|
||||
* nontrivial example?
|
||||
=> in the general case, requires solving an optimisation problem
|
||||
* Resource model
|
||||
* presentation
|
||||
* formal definition
|
||||
* can be solved with a max
|
||||
* same examples
|
||||
=> trivial to find throughput in any case
|
||||
* drawback: combinatorial explosion
|
||||
* but this is very reasonable on real-life CPUs because ports are not
|
||||
random.
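
A minimal sketch of the "can be solved with a max" point above, applied to the 2xADDSS + BSR example. The resource names `r1` and `r01` and the usage values in `RHO` are illustrative assumptions (ADDSS runs on ports 0 or 1, BSR on port 1 only), not figures taken from a measured model.

```python
from collections import Counter

# Hypothetical resource model: RHO[insn][resource] is the fraction of a
# cycle of that resource consumed by one instance of the instruction.
# "r01" stands for the combined resource "port 0 or port 1".
RHO = {
    "ADDSS": {"r01": 0.5},
    "BSR": {"r1": 1.0, "r01": 0.5},
}


def cycles(kernel, rho=RHO):
    """Steady-state cycles for a kernel given as {instruction: count}."""
    load = Counter()
    for insn, count in kernel.items():
        for resource, usage in rho[insn].items():
            load[resource] += count * usage
    # The most loaded resource is the bottleneck.
    return max(load.values(), default=0.0)


if __name__ == "__main__":
    print(cycles({"ADDSS": 2, "BSR": 1}))
```

No scheduling or optimisation problem is solved here: the per-resource sums and a single `max` are enough, which is the point of the resource model.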
|
||||
|
||||
## Palmed
|
||||
|
||||
* Find a resource model automatically
|
||||
* Concerned only with backend throughput
|
||||
* No dependencies
|
||||
* Ignore completely in-order effects
|
||||
* kernel = multiset of instructions
|
||||
|
||||
**General idea:** given enough well-chosen kernels and a measurement of their execution time, we can build a model.
|
||||
|
||||
Indeed, (K, Cyc(K)) => `\max_{r \in R} \sum_{i \in K} \rho_{i,r} = Cyc(K)`
|
||||
=> many equations describing the \rho.
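
To make the "many equations" idea concrete, here is a small sketch that checks a candidate set of \rho values against measured kernels. It is not Palmed's solver: the function names, the candidate values and the measurements are all made up for illustration.

```python
def predicted_cycles(kernel, rho):
    """max over resources r of the sum over instructions i of rho[i][r]."""
    resources = {r for usage in rho.values() for r in usage}
    return max(
        sum(count * rho[insn].get(r, 0.0) for insn, count in kernel.items())
        for r in resources
    )


def worst_violation(measurements, rho):
    """Largest gap between predicted and measured cycles over all kernels."""
    return max(abs(predicted_cycles(k, rho) - cyc) for k, cyc in measurements)


# Hypothetical measurements: (kernel as {instruction: count}, measured cycles).
MEASUREMENTS = [
    ({"ADDSS": 1}, 0.5),
    ({"BSR": 1}, 1.0),
    ({"ADDSS": 2, "BSR": 1}, 1.5),
]

# Two hypothetical candidates; only the first one satisfies every equation.
GOOD = {"ADDSS": {"r01": 0.5}, "BSR": {"r1": 1.0, "r01": 0.5}}
BAD = {"ADDSS": {"r01": 0.5}, "BSR": {"r1": 1.0}}

if __name__ == "__main__":
    print(worst_violation(MEASUREMENTS, GOOD), worst_violation(MEASUREMENTS, BAD))
```

Both candidates agree on the single-instruction kernels; only the mixed kernel tells them apart, which is why the choice of kernels matters.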
|
||||
|
||||
* multi-stage model, builds intermediary results
|
||||
|
||||
[insert high-level view of Palmed]
|
||||
|
||||
* quickly describe intermediary results
|
||||
* classes of instructions: will be useful later
|
||||
|
||||
## Actually measuring a kernel's throughput
|
||||
|
||||
Pipedream
|
||||
* Original work by F. Gruber, cont. by N. Derumigny and C. Guillon
|
||||
* Goal: Measure #cycles of a multiset of instructions
|
||||
* Full throughput
|
||||
* No dependencies
|
||||
* L1-resident
|
||||
|
||||
* Use HW counters to measure cycles (PAPI)
|
||||
* Generate an asm kernel of the form
|
||||
```
for NUM_MEASURES:
    HW_cycles_measure:
        for NUM_ITER:
            kernel
            kernel
            ...
            kernel
```
|
||||
so that the unrolled loop body has >= `UNROLL_SIZE` insn, and `UNROLL_SIZE * NUM_ITER >= TOTAL_INSN`.
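
The two constraints above can be turned into loop parameters with ceiling divisions. This is only a sketch: `loop_shape` and its argument names are hypothetical and do not reflect Pipedream's actual interface.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)


def loop_shape(kernel_len, unroll_size, total_insn):
    """Return (copies, num_iter) for a kernel of kernel_len instructions.

    copies   : how many times the kernel is repeated in the loop body,
               so that copies * kernel_len >= unroll_size
    num_iter : trip count, so that unroll_size * num_iter >= total_insn
    """
    copies = ceil_div(unroll_size, kernel_len)
    num_iter = ceil_div(total_insn, unroll_size)
    return copies, num_iter


if __name__ == "__main__":
    print(loop_shape(kernel_len=3, unroll_size=100, total_insn=1000))
```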
|
||||
|
||||
* Must instantiate insn:
|
||||
* reg alloc
|
||||
* mem addresses
|
||||
* Reg: split registers into read and write pool; enough read registers for
|
||||
each instruction.
|
||||
* Read: always read from the same registers (R -> R dep is not a problem)
|
||||
* Write: round-robin
|
||||
* On some architectures, W -> W dependencies do not allow full parallelism
|
||||
* On some ISAs, some insn have R+W operands
|
||||
* Mem: allocate a memory arena, L1-sized. Split into read and write pool.
|
||||
* Direct register addressing mode (eg `ldr x0, [x1]`): always the same
|
||||
address (load/store separated)
|
||||
* Base-index-displacement mode: constant base, 0 offset, round-robin
|
||||
displacement.
|
||||
* Whenever possible (`\sum_i(lat_i) < #reg`), no data dependency during
|
||||
measurement
|
||||
* L1-residence: memory arena is small enough; warm-up rounds.
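
The register policy described above (fixed read registers, round-robin write registers) could look like the sketch below. The instruction representation `(mnemonic, number_of_sources)`, the three-operand output syntax and the function name are assumptions made for illustration. R+W operands and memory operands are not handled.

```python
from itertools import cycle


def allocate_registers(instructions, registers, n_read):
    """Instantiate instructions given as (mnemonic, number_of_sources).

    The first n_read registers form the read pool and are reused by every
    instruction; the remaining ones form the write pool and are handed
    out round-robin so consecutive writes target different registers.
    """
    read_pool, write_pool = registers[:n_read], registers[n_read:]
    if not write_pool:
        raise ValueError("write pool is empty")
    next_write = cycle(write_pool)
    lines = []
    for mnemonic, n_src in instructions:
        if n_src > len(read_pool):
            raise ValueError("not enough read registers")
        dst = next(next_write)
        lines.append(f"{mnemonic} {dst}, " + ", ".join(read_pool[:n_src]))
    return lines


if __name__ == "__main__":
    for line in allocate_registers([("add", 2)] * 3, ["x0", "x1", "x2", "x3"], 2):
        print(line)
```

Because no instruction ever reads a register from the write pool, the generated code carries no read-after-write dependency.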
|
||||
|
||||
=> kernel throughput measurement.
|
||||
|
||||
Note: this works only because we measure a multiset of instructions, not a
|
||||
given asm code. We control the operands.
|
||||
|
||||
## Results
|
||||
|
||||
With all this, Palmed is capable of producing throughput models.
|
||||
|
||||
Tried on x86 (SKX, ZEN1) and ARM (A72).
|
||||
=> results
|
||||
|
||||
## Contributions
|
||||
|
||||
### Reproducibility: measurements database
|
||||
|
||||
* important both for efficiency and reproducibility
|
||||
* efficiency: avoid redoing measurements that were already made
|
||||
* reproducibility: all the raw data is available after the run
|
||||
* ability to derive the model from raw data again
|
||||
* ability to assess the quality of raw measurements
|
||||
* backup/restore
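
A possible shape for such a database, sketched with Python's standard `sqlite3` module. The schema, the class name and the `measure` callback are hypothetical and are not Palmed's actual storage layer.

```python
import json
import sqlite3


class MeasurementDB:
    """Caches raw kernel measurements so they are never taken twice."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS measures "
            "(kernel TEXT PRIMARY KEY, cycles REAL)"
        )

    @staticmethod
    def key(kernel):
        # A kernel is a multiset: sort the items so order does not matter.
        return json.dumps(sorted(kernel.items()))

    def get_or_measure(self, kernel, measure):
        """Return the stored cycles, calling measure(kernel) only on a miss."""
        row = self.conn.execute(
            "SELECT cycles FROM measures WHERE kernel = ?", (self.key(kernel),)
        ).fetchone()
        if row is not None:
            return row[0]
        cycles = measure(kernel)
        self.conn.execute(
            "INSERT INTO measures VALUES (?, ?)", (self.key(kernel), cycles)
        )
        self.conn.commit()
        return cycles
```

Passing a file path instead of `":memory:"` keeps the raw data after the run, so a model could be derived again without re-measuring.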
|
||||
|
||||
### Evaluation
|
||||
|
||||
* Harness to evaluate Palmed against other code analyzers
|
||||
* Raw pipedream
|
||||
* Gus
|
||||
* IACA
|
||||
* uops.info
|
||||
* llvm-mca
|
||||
* PMEvo
|
||||
* Based on basic blocks
|
||||
* The kernel is defined as a Palmed kernel: unordered, no dependencies
|
||||
* in practice, use Pipedream generated code as kernel
|
||||
* As Pipedream doesn't support all instructions, some instructions (e.g. control flow) must be stripped from the kernel
|
||||
|
||||
Measures:
|
||||
* Coverage: proportion of benchmarks supported by the tool, relative to Palmed
|
||||
* RMS error of IPC
|
||||
* Kendall's tau for the IPCs
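
The three measures can be sketched in plain Python as follows. Two assumptions are made here: the RMS error is computed on raw IPC differences (a relative or weighted variant may be what is actually wanted), and Kendall's tau is the simple tau-a variant, with no correction for ties.

```python
import math
from itertools import combinations


def coverage(predictions):
    """Fraction of benchmarks for which the tool produced a prediction.

    Unsupported benchmarks are represented by None.
    """
    return sum(p is not None for p in predictions) / len(predictions)


def rms_error(predicted, measured):
    """Root-mean-square of the IPC differences (unnormalised)."""
    squares = ((p - m) ** 2 for p, m in zip(predicted, measured))
    return math.sqrt(sum(squares) / len(measured))


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(xs)
    return (concordant - discordant) / (n * (n - 1) / 2)
```

Kendall's tau only looks at ordering: a tool that ranks kernels correctly scores 1 even when its absolute IPC values are off, which is why it complements the RMS error.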
|
||||
|
|