Plan: list prerequisites for each chapter, ensure consistency
parent 7e5abd9669
commit 9ed7be7fc6
6 changed files with 143 additions and 14 deletions
@@ -1,5 +1,17 @@
# Palmed: automatically modelling the backend

[[PREREQUISITES]]
* Microarch: ports, μops, pipeline, cycle, L1-res
* Define Cyc(kernel)
* Backend models
* HW counters
* Tools:
* Iaca
* UOPS
* llvm-mca
* PMEvo
[[END]]

* SotA: we saw efforts to build backend models
* they take considerable expert knowledge/time
* based on reverse-engineering, HW counters

@@ -94,7 +106,7 @@ NUM_ITER >= TOTAL_INSN`.
* Direct register addressing mode (eg `ldr x0, [x1]`): always the same
address (load/store separated)
* Base-index-displacement mode: constant base, 0 offset, round-robin
displacement.
displacement on x86 (constant displacement on ARM)
* Whenever possible (`\sum_i(lat_i) < #reg`), no data dependency during
measurement
* L1-residence: memory arena is small enough; warm-up rounds.

@@ -124,13 +136,30 @@ Tried on x86 (SKX, ZEN1) and ARM (A72).

### Evaluation

#### Bench suites: SPEC, Polybench

* SPEC: real-world programs
* Mainly made to evaluate hardware on a fixed workload
* Provides a fixed workload to evaluate various pieces of software
experimentations as well
* Used throughout the literature
* Describe versions of SPEC, architecture
* Polybench
* 30 numerical computations
* Computation kernels: domain specific (sci. computation, math, …)
* Kernel well-defined; no need to "figure out" the interesting basic blocks
* C language
* datasets

#### Experimental setup

* Harness to evaluate Palmed against other code analyzers
* Raw pipedream
* Gus
* Iaca
* UOPS
* llvm-mca
* PMEvo
* UOPS
* UiCA did not exist at the time; + fair comparison (Palmed is backend)
* Based on basic blocks
* The kernel is defined as a Palmed kernel: unordered, no dependencies
* in practice, use Pipedream generated code as kernel

@@ -141,3 +170,7 @@ Measures:
* Coverage: proportion of benchmarks supported by the tool, wrt. Palmed
* RMS error of IPC
* Kendall's tau for the IPCs
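
The three measures above could be computed along these lines. This is a minimal sketch, not the evaluation harness's actual code: the function names are hypothetical, coverage is assumed to be the share of Palmed-supported benchmarks that the tool also supports, the RMS error is assumed to be taken on the unweighted relative IPC error, and Kendall's tau is the plain tau-a variant (tied pairs count as neither concordant nor discordant).

```python
from itertools import combinations
from math import sqrt


def coverage(supported_by_tool, supported_by_palmed):
    """Fraction of Palmed-supported benchmarks the tool also supports (assumed definition)."""
    return len(supported_by_tool & supported_by_palmed) / len(supported_by_palmed)


def rms_relative_error(predicted, measured):
    """Root mean square of the relative IPC error (unweighted: an assumption)."""
    errors = [(p - m) / m for p, m in zip(predicted, measured)]
    return sqrt(sum(e * e for e in errors) / len(errors))


def kendall_tau(predicted, measured):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for (p1, m1), (p2, m2) in combinations(zip(predicted, measured), 2):
        sign = (p1 - p2) * (m1 - m2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(predicted) * (len(predicted) - 1) // 2
    return (concordant - discordant) / pairs
```

The RMS error penalises how far each prediction is from the measurement, while Kendall's tau only looks at whether the tool ranks the benchmarks in the same order as the measurements; a tool can do well on one and poorly on the other.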

#### Results

* Results

@@ -1,14 +1,26 @@
# Beyond ports: manually modelling the A72 frontend

[[PREREQUISITES]]
* Microarch: frontend, ports, in-order/OoO, μ/Mop
* Assembly
* SIMD
* Def Cyc(k) -> retired insn
* Palmed, Palmed results
* Palmed instruction classes
* Pipedream
* uops.info
* Notion of bottleneck
[[END]]

## Necessity to go beyond ports

* Palmed: concerned mostly with ports
* Noticed the importance of the frontend while investigating its performance
* heatmap representation: uops gone wild
* heatmap representation: uops predicts unreachably high IPCs (eg. 8 on SKX)
* example of a frontend-bound microkernel
* Palmed's vision of a frontend
* Real difference: in-order
* UiCA: OK, but it's more complicated
* UiCA: proves that frontends are important, implements Intel frontends

## Cortex A72

@@ -73,10 +85,10 @@ From now on, we try to find models answering:

### No-cross model

* Hypothesis: the frontend cannot decross a multi-uop instruction across cycle
* Hypothesis: the frontend cannot decode a multi-uop instruction across cycle
boundaries.

* Reasonable: similar things on x86-64 [uica] (?? investigate)
* Reasonable: similar things on x86-64 -- cf [uica] predecoder §4.1
* Would explain the example above [show again].

* Frontend state ∈ [|0,2|]: how many μops already decoded this cycle
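
The no-cross hypothesis can be illustrated with a small simulation. This is a minimal sketch under stated assumptions: the decode width is taken to be 3 μops per cycle (which is what a state in [|0,2|] suggests), every instruction is assumed to decode to between 1 and 3 μops, and `DECODE_WIDTH` and `no_cross_cycles` are hypothetical names. The kernel is given as the list of μop counts of its instructions, and the function returns the steady-state number of cycles per kernel iteration.

```python
from fractions import Fraction

DECODE_WIDTH = 3  # assumption: the frontend decodes at most 3 μops per cycle


def no_cross_cycles(kernel_uops):
    """Steady-state frontend cycles per kernel iteration under the no-cross hypothesis."""
    if not kernel_uops or any(u < 1 or u > DECODE_WIDTH for u in kernel_uops):
        raise ValueError("each instruction must decode to 1..DECODE_WIDTH μops")

    # Frontend state at the start of an iteration -> (iteration index, cycles completed).
    seen = {}
    state, cycles, iteration = 0, 0, 0
    while state not in seen:
        seen[state] = (iteration, cycles)
        for uops in kernel_uops:
            if state + uops > DECODE_WIDTH:
                # The instruction would cross the cycle boundary: start a new cycle.
                cycles += 1
                state = 0
            state += uops
            if state == DECODE_WIDTH:
                # The cycle is full.
                cycles += 1
                state = 0
        iteration += 1

    # The state has looped: average over one period of the loop.
    first_iteration, first_cycles = seen[state]
    return Fraction(cycles - first_cycles, iteration - first_iteration)
```

For instance, `no_cross_cycles([1, 1, 2])` yields 3/2 cycles per iteration, whereas a frontend allowed to split the 2-μop instruction across cycles would only need 4/3.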

@@ -1,5 +1,12 @@
# A more systematic approach to throughput prediction performance analysis

[[PREREQUISITES]]
* BB
* ISA
* ELF
[[END]]


* So far, evaluation only on lone basic blocks.
* Extracted with somewhat automated methods, somewhat reproducible with manual
effort.

@@ -38,6 +45,14 @@ Benchsuite
a given number of times per second. 2nd mode used.
* Extract PC for each sample

### ELF navigation: pyelftools & capstone

* Present the tools
* Pyelftools: find symbols, read ELF sections, etc.
* Capstone: disassemble for many ISAs
* Inspect operands, registers, …
* Instruction groups: control flow instructions

### Extract BBs

* For each sampled PC,

@@ -69,3 +84,17 @@ This way, chunk only the relevant portions
## CesASMe

[paper with edits]

### GUS

* Dynamic tool based on QEMU
* User-defined regions of interest
* In these regions, instrument all instructions, accesses, etc; using
throughput + latency + μarch models for instructions, analyze resource
usage, produce cycles prediction
* Sensitivity analysis: by tweaking the model (multiplying cost of some
resources by a factor), can stress/alleviate parts of the model
* Determine if a resource is bottleneck
* Dynamic with heavy instrumentation => slow
* Very detailed insight
* In particular, access to real-run instruction dependencies

@@ -1,5 +1,16 @@
# Static extraction of memory-carried dependencies

[[PREREQUISITES]]
* CesASMe results
* Gus
* Static vs dynamic
* PC
* μarch: μop, renamer, L1-res, ROB
* Osaca
* UiCA
[[END]]


## Intro

* Previous chapt. : effect of mem-carried deps

@@ -62,6 +73,7 @@ On SKX,
instructions are out of the ROB anyway
* 224 μops in Intel's Skylake, 2015
* 512 μops in Intel's Golden Cove, 2021
* Source: https://fuse.wikichip.org/news/6111/intel-details-golden-cove-next-generation-big-core-for-client-and-server-socs/ [consulted 2023-09-13]
* Can unroll until we have ~|ROB|+|K| instructions in the kernel: since
instructions yield at least a μop, safe [TODO check]
* Sometimes unrolled only once, eg. Osaca. Not sufficient; eg. Fibo.
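
The unrolling bound above translates into a simple computation. This is a minimal sketch, assuming the bound is exactly |ROB|+|K| instructions (the notes only give it approximately and flag it as still to be checked); `unroll_factor` is a hypothetical helper name.

```python
def unroll_factor(kernel_len: int, rob_size: int) -> int:
    """Smallest number of kernel copies reaching at least |ROB| + |K| instructions."""
    if kernel_len <= 0 or rob_size < 0:
        raise ValueError("kernel_len must be positive and rob_size non-negative")
    target = rob_size + kernel_len
    # Ceiling division without floating point.
    return -(-target // kernel_len)
```

With the 224-entry ROB quoted above for Skylake and a 10-instruction kernel, this gives 24 copies (240 instructions, just past the 234-instruction target).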

@@ -95,8 +107,13 @@ On SKX,
1st kernel cannot depend on the previous kernel unroll); if it happens in the
majority of cases, keep; else drop

* Semantics of asm coming from Valgrind's IR -- should be portable to any
architecture supported
* We need semantics for our assembly

### Valgrind's VEX

* Introduce Valgrind as an instrumentation tool
* Introduce VEX
* Should be portable to any architecture supported
* but suffers limitations for recent extension sets; eg avx512 not
supported (TODO check)


@@ -104,7 +121,7 @@ On SKX,

* Does not track aliasing that originates from outside of the kernel.
* As advocated in CesASMe, would require a broader analysis range
* Randomness may lead to false positives
* Randomness may (theoretically) lead to false positives
* but re-running with different seed should eliminate the hazard close to
entirely
* Should not have false negatives outside of aliasing or unsupported ops

@@ -135,14 +152,15 @@ Then, compare with staticdeps: `eval/vg_depsim.py` script.
* use genbench's bb split/occurrences to retrieve basic blocks
* for each BB with more than 10% of max BB hits,
* predict deps with staticdeps
* cache the result: fast, but we're dealing with 3500 files.
* cache the result: staticdeps is fast, but we're dealing with 3500
files.
* translate staticdeps' periodic deps to PC deps, discard the `iter`
parameter
* for each dependency from the depsim results that occurs inside this BB,
* check if found or missed, append to a list
* score: `|found| / (|found| + |missed|)`. Discards occurrences.
* limitation: will only find deps from/to the same BB! Dependencies leaving
a BB are discarded.
* limitation: will only find deps from/to the same BB! Dependencies leaving a
BB are discarded.

* Result: about 38% of deps found; 44% if weighting by occurrences
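
The two scores quoted above (plain, and weighted by occurrences) can be sketched as follows. This is an illustration only: it does not reproduce `eval/vg_depsim.py`, and the input format, a list of `(found, occurrences)` pairs with one entry per dependency reported by depsim, is a hypothetical choice made for the example.

```python
def staticdeps_scores(deps):
    """Return (plain score, occurrence-weighted score).

    deps: list of (found, occurrences) pairs, one per dependency reported by
    depsim; `found` tells whether staticdeps also predicted it.
    """
    if not deps:
        raise ValueError("no dependency to score")
    found = sum(1 for was_found, _ in deps if was_found)
    total_occurrences = sum(occurrences for _, occurrences in deps)
    found_occurrences = sum(occ for was_found, occ in deps if was_found)
    # Plain score: |found| / (|found| + |missed|), each dependency counted once.
    plain = found / len(deps)
    # Weighted score: same ratio, each dependency counted once per occurrence.
    weighted = found_occurrences / total_occurrences
    return plain, weighted
```

The weighted score rises above the plain one when the dependencies that staticdeps finds are the ones occurring most often at runtime, which is consistent with the gap between the two figures reported above.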

@@ -218,3 +236,7 @@ TODO ?
analysis
* but might not be true for other applications that require dependencies
detection

### Speed

TODO: evaluate speed?

@@ -2,7 +2,10 @@

## 10. Introduction

## 20. State of the art
## 20. Foundations

Introduce the related works, present their techniques, define and introduce
notions, …

## 30. Palmed: automatically modelling the backend

30
plan/to_introduce_early.md
Normal file
@@ -0,0 +1,30 @@
# Stuff that must be introduced early (intro/foundations)

* Static vs. dynamic
* PC
* ELF
* ISA
* Assembly
* SIMD
* Basic block
* μarch:
* frontend
* ports
* in-order/out-of-order
* pipeline
* Mop
* μop
* renamer
* ROB
* L1-residence
* HW counters
* Tools:
* IACA
* llvm-mca
* Osaca
* uops.info
* UiCA
* PMEvo

* Define Cycles(K): retired instructions
* Define notion of bottleneck