Plan: list prerequisites for each chapter, ensure consistency

Théophile Bastian 2023-09-14 11:42:50 +02:00
parent 7e5abd9669
commit 9ed7be7fc6
6 changed files with 143 additions and 14 deletions

@@ -1,5 +1,17 @@
# Palmed: automatically modelling the backend
[[PREREQUISITES]]
* Microarch: ports, μops, pipeline, cycle, L1-res
* Define Cyc(kernel)
* Backend models
* HW counters
* Tools:
* Iaca
* UOPS
* llvm-mca
* PMEvo
[[END]]
* SotA: we saw efforts to build backend models
* they take considerable expert knowledge/time
* based on reverse-engineering, HW counters
@@ -94,7 +106,7 @@ NUM_ITER >= TOTAL_INSN`.
* Direct register addressing mode (eg `ldr x0, [x1]`): always the same
address (load/store separated)
* Base-index-displacement mode: constant base, 0 offset, round-robin
-displacement.
+displacement on x86 (constant displacement on ARM)
* Whenever possible (`\sum_i(lat_i) < #reg`), no data dependency during
measurement
* L1-residence: memory arena is small enough; warm-up rounds.
@@ -124,13 +136,30 @@ Tried on x86 (SKX, ZEN1) and ARM (A72).
### Evaluation
#### Bench suites: SPEC, Polybench
* SPEC: real-world programs
* Mainly made to evaluate hardware on a fixed workload
* Also provides a fixed workload to evaluate various pieces of software and
experiments
* Used throughout the literature
* Describe versions of SPEC, architecture
* Polybench
* 30 numerical computations
* Computation kernels: domain specific (sci. computation, math, …)
* Kernel well-defined; no need to "figure out" the interesting basic blocks
* C language
* datasets
#### Experimental setup
* Harness to evaluate Palmed against other code analyzers
* Raw pipedream
* Gus
* Iaca
* UOPS
* llvm-mca
* PMEvo
-* UOPS
* UiCA did not exist at the time; + fair comparison (Palmed is backend)
* Based on basic blocks
* The kernel is defined as a Palmed kernel: unordered, no dependencies
* in practice, use Pipedream generated code as kernel
@@ -141,3 +170,7 @@ Measures:
* Coverage: proportion of benchmarks supported by the tool, wrt. Palmed
* RMS error of IPC
* Kendall's tau for the IPCs
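
A minimal sketch of the three measures listed above, for illustration only. It assumes the RMS error is taken over *relative* IPC errors, that Kendall's tau is the plain tau-a variant, and that coverage is counted over the benchmarks Palmed supports; the notes do not state any of these choices, and all function names are hypothetical.

```python
import math


def coverage(tool_supported, palmed_supported):
    """Share of the benchmarks supported by Palmed that the tool also supports.

    Assumption: both arguments are sets of benchmark identifiers.
    """
    return len(tool_supported & palmed_supported) / len(palmed_supported)


def rms_relative_error(predicted, measured):
    """Root mean square of the relative IPC errors (pred - meas) / meas."""
    errors = [(p - m) / m for p, m in zip(predicted, measured)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

The quadratic pair loop is fine for a few thousand basic blocks; a library routine would be preferable on larger sets.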
#### Results
* Results

@@ -1,14 +1,26 @@
# Beyond ports: manually modelling the A72 frontend
[[PREREQUISITES]]
* Microarch: frontend, ports, in-order/OoO, μ/Mop
* Assembly
* SIMD
* Def Cyc(k) -> retired insn
* Palmed, Palmed results
* Palmed instruction classes
* Pipedream
* uops.info
* Notion of bottleneck
[[END]]
## Necessity to go beyond ports
* Palmed: concerned mostly with ports
* Noticed the importance of the frontend while investigating its performances
-* heatmap representation: uops gone wild
+* heatmap representation: uops predicts unreachably high IPCs (eg. 8 on SKX)
* example of a frontend-bound microkernel
* Palmed's vision of a frontend
* Real difference: in-order
-* UiCA: OK, but it's more complicated
+* UiCA: proves that frontends are important, implements Intel frontends
## Cortex A72
@@ -73,10 +85,10 @@ From now on, we try to find models answering:
### No-cross model
-* Hypothesis: the frontend cannot decross a multi-uop instruction across cycle
+* Hypothesis: the frontend cannot decode a multi-uop instruction across cycle
boundaries.
-* Reasonable: similar things on x86-64 [uica] (?? investigate)
+* Reasonable: similar things on x86-64 -- cf [uica] predecoder §4.1
* Would explain the example above [show again].
* Frontend state ∈ [|0,2|]: how many μops already decoded this cycle
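
The no-cross hypothesis can be illustrated with a small simulation. This is a sketch under assumptions, not the model from the notes: it assumes a decode width of 3 μops per cycle (suggested by the state range [|0,2|]), that every instruction yields at most 3 μops, and that the function name and input format are free to choose.

```python
def no_cross_cycles(uop_counts, width=3):
    """Cycles needed to decode a sequence of instructions when a multi-μop
    instruction may not straddle a cycle boundary.

    uop_counts: number of μops of each instruction, in program order.
    width: μops decoded per cycle (assumed to be 3 here).
    """
    cycles = 0
    state = 0  # μops already decoded in the current cycle, in [0, width - 1]
    for n in uop_counts:
        if not 1 <= n <= width:
            raise ValueError("this sketch only handles 1..width μops per instruction")
        if state + n > width:
            # The instruction does not fit in what is left of this cycle:
            # the remaining slots are lost and decoding resumes next cycle.
            cycles += 1
            state = 0
        state += n
        if state == width:
            cycles += 1
            state = 0
    if state > 0:
        cycles += 1  # last, partially filled cycle
    return cycles
```

For instance, three 2-μop instructions take 3 cycles under this model, while a frontend allowed to split instructions across cycles would decode the same 6 μops in 2 cycles.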

@@ -1,5 +1,12 @@
# A more systematic approach to throughput prediction performance analysis
[[PREREQUISITES]]
* BB
* ISA
* ELF
[[END]]
* So far, evaluation only on lone basic blocks.
* Extracted with somewhat automated methods, somewhat reproducible with manual
effort.
@@ -38,6 +45,14 @@ Benchsuite
a given number of times per second. 2nd mode used.
* Extract PC for each sample
### ELF navigation: pyelftools & capstone
* Present the tools
* Pyelftools: find symbols, read ELF sections, etc.
* Capstone: disassemble for many ISAs
* Inspect operands, registers, …
* Instruction groups: control flow instructions
### Extract BBs
* For each sampled PC,
@@ -69,3 +84,17 @@ This way, chunk only the relevant portions
## CesASMe
[paper with edits]
### GUS
* Dynamic tool based on QEMU
* User-defined regions of interest
* In these regions, instrument all instructions, accesses, etc; using
throughput + latency + μarch models for instructions, analyze resource
usage, produce cycles prediction
* Sensitivity analysis: by tweaking the model (multiplying cost of some
resources by a factor), can stress/alleviate parts of the model
* Determine if a resource is bottleneck
* Dynamic with heavy instrumentation => slow
* Very detailed insight
* In particular, access to real-run instruction dependencies
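
The sensitivity-analysis idea can be sketched on a toy model. This is not Gus's actual model: it assumes that the predicted cycle count is simply the load of the busiest resource, and that a resource is a bottleneck when alleviating it (scaling its cost by a factor below 1) lowers the prediction. All names and numbers are hypothetical.

```python
def predicted_cycles(usage, cost):
    """Toy throughput model: the busiest resource bounds the cycle count."""
    return max(usage[r] * cost[r] for r in usage)


def bottlenecks(usage, cost, factor=0.5):
    """Resources whose alleviation (cost scaled by `factor` < 1) speeds up
    the prediction, found by re-running the model once per resource."""
    base = predicted_cycles(usage, cost)
    found = []
    for resource in usage:
        tweaked = dict(cost)
        tweaked[resource] = cost[resource] * factor
        if predicted_cycles(usage, tweaked) < base:
            found.append(resource)
    return found
```

With loads of 10 on ports, 6 on the frontend and 4 on memory at unit cost, only the ports are reported. When two resources are tied for the maximum, alleviating either one alone does not change the prediction, so neither is reported by this simple test.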

@@ -1,5 +1,16 @@
# Static extraction of memory-carried dependencies
[[PREREQUISITES]]
* CesASMe results
* Gus
* Static vs dynamic
* PC
* μarch: μop, renamer, L1-res, ROB
* Osaca
* UiCA
[[END]]
## Intro
* Previous chapt. : effect of mem-carried deps
@@ -62,6 +73,7 @@ On SKX,
instructions are out of the ROB anyway
* 224 μops in Intel's Skylake, 2015
* 512 μops in Intel's Golden Cove, 2021
* Source: https://fuse.wikichip.org/news/6111/intel-details-golden-cove-next-generation-big-core-for-client-and-server-socs/ [consulted 2023-09-13]
* Can unroll until we have ~|ROB|+|K| instructions in the kernel: since
instructions yield at least a μop, safe [TODO check]
* Sometimes unrolled only once, eg. Osaca. Not sufficient; eg. Fibo.
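
As a quick worked example of the unrolling bound above, the number of kernel copies needed to reach about |ROB|+|K| instructions can be computed as follows. The function name is hypothetical, and the bound itself is the one the notes still mark as to be checked.

```python
import math


def unroll_factor(rob_size, kernel_len):
    """Smallest number of kernel copies whose total instruction count
    reaches rob_size + kernel_len."""
    if kernel_len <= 0:
        raise ValueError("kernel must contain at least one instruction")
    return math.ceil((rob_size + kernel_len) / kernel_len)
```

With Skylake's 224-entry ROB, a 10-instruction kernel would be unrolled 24 times under this bound.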
@@ -95,8 +107,13 @@ On SKX,
1st kernel cannot depend on the previous kernel unroll); if it happens in the
majority of cases, keep; else drop
-* Semantics of asm coming from Valgrind's IR -- should be portable to any
-architecture supported
+* We need semantics for our assembly
### Valgrind's VEX
* Introduce Valgrind as an instrumentation tool
* Introduce VEX
* Should be portable to any architecture supported
* but suffers limitations for recent extension sets; eg avx512 not
supported (TODO check)
@@ -104,7 +121,7 @@ On SKX,
* Does not track aliasing that originates from outside of the kernel.
* As advocated in CesASMe, would require a broader analysis range
-* Randomness may lead to false positives
+* Randomness may (theoretically) lead to false positives
* but re-running with different seed should eliminate the hazard close to
entirely
* Should not have false negatives outside of aliasing or unsupported ops
@@ -135,14 +152,15 @@ Then, compare with staticdeps: `eval/vg_depsim.py` script.
* use genbench's bb split/occurrences to retrieve basic blocks
* for each BB with more than 10% of max BB hits,
* predict deps with staticdeps
-* cache the result: fast, but we're dealing with 3500 files.
+* cache the result: staticdeps is fast, but we're dealing with 3500
+files.
* translate staticdeps' periodic deps to PC deps, discard the `iter`
parameter
* for each dependency from the depsim results that occurs inside this BB,
* check if found or missed, append to a list
* score: `|found| / (|found| + |missed|)`. Discards occurrences.
-* limitation: will only find deps from/to the same BB! Dependencies leaving
-a BB are discarded.
+* limitation: will only find deps from/to the same BB! Dependencies leaving a
+BB are discarded.
* Result: about 38% of deps found; 44% if weighting by occurrences
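
The two scores mentioned above can be sketched as follows. This is an illustration, not the content of `eval/vg_depsim.py`: it assumes each dependency is a `(source_pc, destination_pc)` pair and that occurrence counts come as a dictionary keyed by such pairs; both are hypothetical representations.

```python
def score(found, missed):
    """Unweighted score |found| / (|found| + |missed|); occurrences ignored."""
    total = len(found) + len(missed)
    return len(found) / total if total else 0.0


def weighted_score(found, missed, occurrences):
    """Same ratio, each dependency weighted by how often it was observed.

    Dependencies absent from `occurrences` count once (an assumption).
    """
    w_found = sum(occurrences.get(dep, 1) for dep in found)
    w_missed = sum(occurrences.get(dep, 1) for dep in missed)
    total = w_found + w_missed
    return w_found / total if total else 0.0
```

A frequently executed dependency that is found pulls the weighted score above the unweighted one, which is consistent with the gap between the two figures reported above.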
@@ -218,3 +236,7 @@ TODO ?
analysis
* but might not be true for other applications that require dependencies
detection
### Speed
TODO: evaluate speed?

@@ -2,7 +2,10 @@
## 10. Introduction
-## 20. State of the art
+## 20. Foundations
Introduce the related works, present their techniques, define and introduce
notions, …
## 30. Palmed: automatically modelling the backend

@@ -0,0 +1,30 @@
# Stuff that must be introduced early (intro/foundations)
* Static vs. dynamic
* PC
* ELF
* ISA
* Assembly
* SIMD
* Basic block
* μarch:
* frontend
* ports
* in-order/out-of-order
* pipeline
* Mop
* μop
* renamer
* ROB
* L1-residence
* HW counters
* Tools:
* IACA
* llvm-mca
* Osaca
* uops.info
* UiCA
* PMEvo
* Define Cycles(K): retired instructions
* Define notion of bottleneck