Plan: list prerequisites for each chapter, ensure consistency
parent
7e5abd9669
commit
9ed7be7fc6
6 changed files with 143 additions and 14 deletions
|
@ -1,5 +1,17 @@
|
||||||
# Palmed: automatically modelling the backend
|
# Palmed: automatically modelling the backend
|
||||||
|
|
||||||
|
[[PREREQUISITES]]
|
||||||
|
* Microarch: ports, μops, pipeline, cycle, L1-res
|
||||||
|
* Define Cyc(kernel)
|
||||||
|
* Backend models
|
||||||
|
* HW counters
|
||||||
|
* Tools:
|
||||||
|
* Iaca
|
||||||
|
* UOPS
|
||||||
|
* llvm-mca
|
||||||
|
* PMEvo
|
||||||
|
[[END]]
|
||||||
|
|
||||||
* SotA: we saw efforts to build backend models
|
* SotA: we saw efforts to build backend models
|
||||||
* they take considerable expert knowledge/time
|
* they take considerable expert knowledge/time
|
||||||
* based on reverse-engineering, HW counters
|
* based on reverse-engineering, HW counters
|
||||||
|
@ -94,7 +106,7 @@ NUM_ITER >= TOTAL_INSN`.
|
||||||
* Direct register addressing mode (eg `ldr x0, [x1]`): always the same
|
* Direct register addressing mode (eg `ldr x0, [x1]`): always the same
|
||||||
address (load/store separated)
|
address (load/store separated)
|
||||||
* Base-index-displacement mode: constant base, 0 offset, round-robin
|
* Base-index-displacement mode: constant base, 0 offset, round-robin
|
||||||
displacement.
|
displacement on x86 (constant displacement on ARM)
|
||||||
* Whenever possible (`\sum_i(lat_i) < #reg`), no data dependency during
|
* Whenever possible (`\sum_i(lat_i) < #reg`), no data dependency during
|
||||||
measurement
|
measurement
|
||||||
* L1-residence: memory arena is small enough; warm-up rounds.
|
* L1-residence: memory arena is small enough; warm-up rounds.
|
||||||
|
@ -124,13 +136,30 @@ Tried on x86 (SKX, ZEN1) and ARM (A72).
|
||||||
|
|
||||||
### Evaluation
|
### Evaluation
|
||||||
|
|
||||||
|
#### Bench suites: SPEC, Polybench
|
||||||
|
|
||||||
|
* SPEC: real-world programs
|
||||||
|
* Mainly made to evaluate hardware on a fixed workload
|
||||||
|
* Provides a fixed workload to evaluate various pieces of software
|
||||||
|
experiments as well
|
||||||
|
* Used throughout the literature
|
||||||
|
* Describe versions of SPEC, architecture
|
||||||
|
* Polybench
|
||||||
|
* 30 numerical computations
|
||||||
|
* Computation kernels: domain specific (sci. computation, math, …)
|
||||||
|
* Kernel well-defined; no need to "figure out" the interesting basic blocks
|
||||||
|
* C language
|
||||||
|
* datasets
|
||||||
|
|
||||||
|
#### Experimental setup
|
||||||
|
|
||||||
* Harness to evaluate Palmed against other code analyzers
|
* Harness to evaluate Palmed against other code analyzers
|
||||||
* Raw pipedream
|
* Raw pipedream
|
||||||
* Gus
|
|
||||||
* Iaca
|
* Iaca
|
||||||
* UOPS
|
|
||||||
* llvm-mca
|
* llvm-mca
|
||||||
* PMEvo
|
* PMEvo
|
||||||
|
* UOPS
|
||||||
|
* UiCA did not exist at the time; + fair comparison (Palmed is backend)
|
||||||
* Based on basic blocks
|
* Based on basic blocks
|
||||||
* The kernel is defined as a Palmed kernel: unordered, no dependencies
|
* The kernel is defined as a Palmed kernel: unordered, no dependencies
|
||||||
* in practice, use Pipedream generated code as kernel
|
* in practice, use Pipedream generated code as kernel
|
||||||
|
@ -141,3 +170,7 @@ Measures:
|
||||||
* Coverage: proportion of benchmarks supported by the tool, wrt. Palmed
|
* Coverage: proportion of benchmarks supported by the tool, wrt. Palmed
|
||||||
* RMS error of IPC
|
* RMS error of IPC
|
||||||
* Kendall's tau for the IPCs
|
* Kendall's tau for the IPCs
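A minimal sketch of the two accuracy measures listed above, using only the standard library. It assumes the RMS error is taken over the relative IPC error of each benchmark and that Kendall's tau is the plain tau-a variant; the notes do not specify either, and the function names are hypothetical.

```python
import math
from itertools import combinations


def rms_relative_error(predicted, measured):
    """Root-mean-square of the relative IPC error (assumed definition)."""
    errors = [(p - m) / m for p, m in zip(predicted, measured)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two benchmarks")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
        # ties count for neither side in tau-a
    return (concordant - discordant) / (n * (n - 1) / 2)
```

The RMS error penalizes the magnitude of mispredictions, while tau only checks whether the tool ranks the benchmarks in the same order as the measurements.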
|
||||||
|
|
||||||
|
#### Results
|
||||||
|
|
||||||
|
* Results
|
||||||
|
|
|
@ -1,14 +1,26 @@
|
||||||
# Beyond ports: manually modelling the A72 frontend
|
# Beyond ports: manually modelling the A72 frontend
|
||||||
|
|
||||||
|
[[PREREQUISITES]]
|
||||||
|
* Microarch: frontend, ports, in-order/OoO, μ/Mop
|
||||||
|
* Assembly
|
||||||
|
* SIMD
|
||||||
|
* Def Cyc(k) -> retired insn
|
||||||
|
* Palmed, Palmed results
|
||||||
|
* Palmed instruction classes
|
||||||
|
* Pipedream
|
||||||
|
* uops.info
|
||||||
|
* Notion of bottleneck
|
||||||
|
[[END]]
|
||||||
|
|
||||||
## Necessity to go beyond ports
|
## Necessity to go beyond ports
|
||||||
|
|
||||||
* Palmed: concerned mostly with ports
|
* Palmed: concerned mostly with ports
|
||||||
* Noticed the importance of the frontend while investigating its performances
|
* Noticed the importance of the frontend while investigating its performances
|
||||||
* heatmap representation: uops gone wild
|
* heatmap representation: uops predicts unreachably high IPCs (eg. 8 on SKX)
|
||||||
* example of a frontend-bound microkernel
|
* example of a frontend-bound microkernel
|
||||||
* Palmed's vision of a frontend
|
* Palmed's vision of a frontend
|
||||||
* Real difference: in-order
|
* Real difference: in-order
|
||||||
* UiCA: OK, but it's more complicated
|
* UiCA: proves that frontends are important, implements Intel frontends
|
||||||
|
|
||||||
## Cortex A72
|
## Cortex A72
|
||||||
|
|
||||||
|
@ -73,10 +85,10 @@ From now on, we try to find models answering:
|
||||||
|
|
||||||
### No-cross model
|
### No-cross model
|
||||||
|
|
||||||
* Hypothesis: the frontend cannot decross a multi-uop instruction across cycle
|
* Hypothesis: the frontend cannot decode a multi-uop instruction across cycle
|
||||||
boundaries.
|
boundaries.
|
||||||
|
|
||||||
* Reasonable: similar things on x86-64 [uica] (?? investigate)
|
* Reasonable: similar things on x86-64 -- cf [uica] predecoder §4.1
|
||||||
* Would explain the example above [show again].
|
* Would explain the example above [show again].
|
||||||
|
|
||||||
* Frontend state ∈ [|0,2|]: how many μops already decoded this cycle
|
* Frontend state ∈ [|0,2|]: how many μops already decoded this cycle
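A minimal sketch of the no-cross hypothesis as a small state machine. It assumes a decode width of 3 μops per cycle (inferred from the state range [|0,2|], not stated in the notes) and that every instruction yields at most `width` μops; all names are hypothetical.

```python
from fractions import Fraction


def decode_once(uops_per_insn, state, width):
    """Decode one kernel iteration; return (cycles elapsed, final state)."""
    cycles = 0
    for n in uops_per_insn:
        if n > width:
            raise ValueError("instruction wider than the decode width")
        if state + n > width:
            # The instruction would cross a cycle boundary: end the
            # current cycle early and start the instruction afresh.
            cycles += 1
            state = 0
        state += n
        if state == width:
            cycles += 1
            state = 0
    return cycles, state


def steady_state_cycles(uops_per_insn, width=3):
    """Cycles per kernel iteration once the frontend state loops."""
    seen = {}  # state -> (iteration index, cumulative cycles)
    state, total, iteration = 0, 0, 0
    while state not in seen:
        seen[state] = (iteration, total)
        cycles, state = decode_once(uops_per_insn, state, width)
        total += cycles
        iteration += 1
    first_iteration, first_total = seen[state]
    return Fraction(total - first_total, iteration - first_iteration)
```

Under these assumptions, a kernel of two 2-μop instructions takes 2 cycles per iteration, whereas a pure bandwidth model would predict 4/3.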
|
||||||
|
|
|
@ -1,5 +1,12 @@
|
||||||
# A more systematic approach to throughput prediction performance analysis
|
# A more systematic approach to throughput prediction performance analysis
|
||||||
|
|
||||||
|
[[PREREQUISITES]]
|
||||||
|
* BB
|
||||||
|
* ISA
|
||||||
|
* ELF
|
||||||
|
[[END]]
|
||||||
|
|
||||||
|
|
||||||
* So far, evaluation only on lone basic blocks.
|
* So far, evaluation only on lone basic blocks.
|
||||||
* Extracted with somewhat automated methods, somewhat reproducible with manual
|
* Extracted with somewhat automated methods, somewhat reproducible with manual
|
||||||
effort.
|
effort.
|
||||||
|
@ -38,6 +45,14 @@ Benchsuite
|
||||||
a given number of times per second. 2nd mode used.
|
a given number of times per second. 2nd mode used.
|
||||||
* Extract PC for each sample
|
* Extract PC for each sample
|
||||||
|
|
||||||
|
### ELF navigation: pyelftools & capstone
|
||||||
|
|
||||||
|
* Present the tools
|
||||||
|
* Pyelftools: find symbols, read ELF sections, etc.
|
||||||
|
* Capstone: disassemble for many ISAs
|
||||||
|
* Inspect operands, registers, …
|
||||||
|
* Instruction groups: control flow instructions
|
||||||
|
|
||||||
### Extract BBs
|
### Extract BBs
|
||||||
|
|
||||||
* For each sampled PC,
|
* For each sampled PC,
|
||||||
|
@ -69,3 +84,17 @@ This way, chunk only the relevant portions
|
||||||
## CesASMe
|
## CesASMe
|
||||||
|
|
||||||
[paper with edits]
|
[paper with edits]
|
||||||
|
|
||||||
|
### GUS
|
||||||
|
|
||||||
|
* Dynamic tool based on QEMU
|
||||||
|
* User-defined regions of interest
|
||||||
|
* In these regions, instrument all instructions, accesses, etc; using
|
||||||
|
throughput + latency + μarch models for instructions, analyze resource
|
||||||
|
usage, produce cycles prediction
|
||||||
|
* Sensitivity analysis: by tweaking the model (multiplying cost of some
|
||||||
|
resources by a factor), can stress/alleviate parts of the model
|
||||||
|
* Determine if a resource is bottleneck
|
||||||
|
* Dynamic with heavy instrumentation => slow
|
||||||
|
* Very detailed insight
|
||||||
|
* In particular, access to real-run instruction dependencies
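A toy sketch of the sensitivity-analysis idea described above. This is not Gus's actual model: it assumes the predicted cycle count is simply the maximum, over resources, of usage times cost, and flags a resource as a bottleneck when alleviating it alone lowers the prediction. Resource names and functions are hypothetical.

```python
def predict_cycles(usage, cost):
    """Toy prediction: the most loaded resource determines the cycles."""
    return max(usage[r] * cost[r] for r in usage)


def find_bottlenecks(usage, cost, factor=0.5, eps=1e-9):
    """Resources whose alleviation (cost * factor) lowers the prediction."""
    baseline = predict_cycles(usage, cost)
    bottlenecks = []
    for resource in usage:
        tweaked = dict(cost)
        tweaked[resource] = cost[resource] * factor
        if predict_cycles(usage, tweaked) < baseline - eps:
            bottlenecks.append(resource)
    return bottlenecks
```

In this toy model, two resources that are exactly tied are not reported, since alleviating either one alone leaves the prediction unchanged.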
|
||||||
|
|
|
@ -1,5 +1,16 @@
|
||||||
# Static extraction of memory-carried dependencies
|
# Static extraction of memory-carried dependencies
|
||||||
|
|
||||||
|
[[PREREQUISITES]]
|
||||||
|
* CesASMe results
|
||||||
|
* Gus
|
||||||
|
* Static vs dynamic
|
||||||
|
* PC
|
||||||
|
* μarch: μop, renamer, L1-res, ROB
|
||||||
|
* Osaca
|
||||||
|
* UiCA
|
||||||
|
[[END]]
|
||||||
|
|
||||||
|
|
||||||
## Intro
|
## Intro
|
||||||
|
|
||||||
* Previous chapt. : effect of mem-carried deps
|
* Previous chapt. : effect of mem-carried deps
|
||||||
|
@ -62,6 +73,7 @@ On SKX,
|
||||||
instructions are out of the ROB anyway
|
instructions are out of the ROB anyway
|
||||||
* 224 μops in Intel's Skylake, 2015
|
* 224 μops in Intel's Skylake, 2015
|
||||||
* 512 μops in Intel's Golden Cove, 2021
|
* 512 μops in Intel's Golden Cove, 2021
|
||||||
|
* Source: https://fuse.wikichip.org/news/6111/intel-details-golden-cove-next-generation-big-core-for-client-and-server-socs/ [consulted 2023-09-13]
|
||||||
* Can unroll until we have ~|ROB|+|K| instructions in the kernel: since
|
* Can unroll until we have ~|ROB|+|K| instructions in the kernel: since
|
||||||
instructions yield at least a μop, safe [TODO check]
|
instructions yield at least a μop, safe [TODO check]
|
||||||
* Sometimes unrolled only once, eg. Osaca. Not sufficient; eg. Fibo.
|
* Sometimes unrolled only once, eg. Osaca. Not sufficient; eg. Fibo.
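The unrolling bound above can be sketched as a small computation. This assumes the goal is the smallest unroll count giving at least |ROB|+|K| instructions, with every instruction yielding at least one μop (the point still marked as to be checked in the notes); the function name is hypothetical.

```python
def unroll_factor(kernel_len, rob_size):
    """Smallest number of kernel copies with >= rob_size + kernel_len insns."""
    if kernel_len <= 0:
        raise ValueError("kernel must contain at least one instruction")
    # Integer ceiling division of (rob_size + kernel_len) by kernel_len.
    return -(-(rob_size + kernel_len) // kernel_len)
```

With the ROB sizes quoted above, a 10-instruction kernel would be unrolled 24 times for Skylake (224 μops) and 53 times for Golden Cove (512 μops).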
|
||||||
|
@ -95,8 +107,13 @@ On SKX,
|
||||||
1st kernel cannot depend on the previous kernel unroll); if it happens in the
|
1st kernel cannot depend on the previous kernel unroll); if it happens in the
|
||||||
majority of cases, keep; else drop
|
majority of cases, keep; else drop
|
||||||
|
|
||||||
* Semantics of asm coming from Valgrind's IR -- should be portable to any
|
* We need semantics for our assembly
|
||||||
architecture supported
|
|
||||||
|
### Valgrind's VEX
|
||||||
|
|
||||||
|
* Introduce Valgrind as an instrumentation tool
|
||||||
|
* Introduce VEX
|
||||||
|
* Should be portable to any architecture supported
|
||||||
* but suffers limitations for recent extension sets; eg avx512 not
|
* but suffers limitations for recent extension sets; eg avx512 not
|
||||||
supported (TODO check)
|
supported (TODO check)
|
||||||
|
|
||||||
|
@ -104,7 +121,7 @@ On SKX,
|
||||||
|
|
||||||
* Does not track aliasing that originates from outside of the kernel.
|
* Does not track aliasing that originates from outside of the kernel.
|
||||||
* As advocated in CesASMe, would require a broader analysis range
|
* As advocated in CesASMe, would require a broader analysis range
|
||||||
* Randomness may lead to false positives
|
* Randomness may (theoretically) lead to false positives
|
||||||
* but re-running with different seed should eliminate the hazard close to
|
* but re-running with different seed should eliminate the hazard close to
|
||||||
entirely
|
entirely
|
||||||
* Should not have false negatives outside of aliasing or unsupported ops
|
* Should not have false negatives outside of aliasing or unsupported ops
|
||||||
|
@ -135,14 +152,15 @@ Then, compare with staticdeps: `eval/vg_depsim.py` script.
|
||||||
* use genbench's bb split/occurrences to retrieve basic blocks
|
* use genbench's bb split/occurrences to retrieve basic blocks
|
||||||
* for each BB with more than 10% of max BB hits,
|
* for each BB with more than 10% of max BB hits,
|
||||||
* predict deps with staticdeps
|
* predict deps with staticdeps
|
||||||
* cache the result: fast, but we're dealing with 3500 files.
|
* cache the result: staticdeps is fast, but we're dealing with 3500
|
||||||
|
files.
|
||||||
* translate staticdeps' periodic deps to PC deps, discard the `iter`
|
* translate staticdeps' periodic deps to PC deps, discard the `iter`
|
||||||
parameter
|
parameter
|
||||||
* for each dependency from the depsim results that occurs inside this BB,
|
* for each dependency from the depsim results that occurs inside this BB,
|
||||||
* check if found or missed, append to a list
|
* check if found or missed, append to a list
|
||||||
* score: `|found| / (|found| + |missed|)`. Discards occurrences.
|
* score: `|found| / (|found| + |missed|)`. Discards occurrences.
|
||||||
* limitation: will only find deps from/to the same BB! Dependencies leaving
|
* limitation: will only find deps from/to the same BB! Dependencies leaving a
|
||||||
a BB are discarded.
|
BB are discarded.
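A minimal sketch of the score above, with and without weighting by occurrences. It assumes the depsim results for one BB are available as a mapping from (source PC, destination PC) pairs to occurrence counts, and the staticdeps predictions as a set of such pairs; this data layout and the function name are hypothetical, not taken from `eval/vg_depsim.py`.

```python
def coverage_scores(depsim_deps, static_deps):
    """Return (unweighted, occurrence-weighted) share of deps found."""
    if not depsim_deps:
        return None  # nothing to score for this basic block
    found = {dep: n for dep, n in depsim_deps.items() if dep in static_deps}
    missed = {dep: n for dep, n in depsim_deps.items() if dep not in static_deps}
    # |found| / (|found| + |missed|): each dependency counts once.
    unweighted = len(found) / (len(found) + len(missed))
    # Same ratio, but each dependency counts once per dynamic occurrence.
    weighted = sum(found.values()) / (sum(found.values()) + sum(missed.values()))
    return unweighted, weighted
```

The two numbers can differ widely: a single frequently executed dependency that is found raises the weighted score much more than the unweighted one.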
|
||||||
|
|
||||||
* Result: about 38% of deps found; 44% if weighting by occurrences
|
* Result: about 38% of deps found; 44% if weighting by occurrences
|
||||||
|
|
||||||
|
@ -218,3 +236,7 @@ TODO ?
|
||||||
analysis
|
analysis
|
||||||
* but might not be true for other applications that require dependencies
|
* but might not be true for other applications that require dependencies
|
||||||
detection
|
detection
|
||||||
|
|
||||||
|
### Speed
|
||||||
|
|
||||||
|
TODO: evaluate speed?
|
||||||
|
|
|
@ -2,7 +2,10 @@
|
||||||
|
|
||||||
## 10. Introduction
|
## 10. Introduction
|
||||||
|
|
||||||
## 20. State of the art
|
## 20. Foundations
|
||||||
|
|
||||||
|
Introduce the related works, present their techniques, define and introduce
|
||||||
|
notions, …
|
||||||
|
|
||||||
## 30. Palmed: automatically modelling the backend
|
## 30. Palmed: automatically modelling the backend
|
||||||
|
|
||||||
|
|
30
plan/to_introduce_early.md
Normal file
|
@ -0,0 +1,30 @@
|
||||||
|
# Stuff that must be introduced early (intro/foundations)
|
||||||
|
|
||||||
|
* Static vs. dynamic
|
||||||
|
* PC
|
||||||
|
* ELF
|
||||||
|
* ISA
|
||||||
|
* Assembly
|
||||||
|
* SIMD
|
||||||
|
* Basic block
|
||||||
|
* μarch:
|
||||||
|
* frontend
|
||||||
|
* ports
|
||||||
|
* in-order/out-of-order
|
||||||
|
* pipeline
|
||||||
|
* Mop
|
||||||
|
* μop
|
||||||
|
* renamer
|
||||||
|
* ROB
|
||||||
|
* L1-residence
|
||||||
|
* HW counters
|
||||||
|
* Tools:
|
||||||
|
* IACA
|
||||||
|
* llvm-mca
|
||||||
|
* Osaca
|
||||||
|
* uops.info
|
||||||
|
* UiCA
|
||||||
|
* PMEvo
|
||||||
|
|
||||||
|
* Define Cycles(K): retired instructions
|
||||||
|
* Define notion of bottleneck
|