2023-09-06 17:41:57 +02:00
# A more systematic approach to throughput prediction performance analysis
2023-09-14 11:42:50 +02:00
[[PREREQUISITES]]
* BB
* ISA
* ELF
[[END]]
* So far, evaluation only on isolated basic blocks.
* Extracted with partly automated methods; reproducible only with some manual effort.
* Problematic when changing ISA: the same benchmark suite must be re-compiled… and re-extracted.

## Benchsuite-bb

* Fully automated, cross-platform extraction of (weighted) BBs, based on benchmark suites
* Extract relevant basic blocks from real workloads: the BBs with the highest weights are the most frequently executed ones

[big picture: Benchsuite -> knobs [size, cflags, …] -> compile -> run with perf -> extract BBs]

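The "knobs" stage of the pipeline can be pictured as enumerating every combination of parameters before compiling. The following is only a sketch under assumptions: the knob names mirror the note above, but the values and the `configurations` helper are hypothetical and would differ for each benchmark suite.

```python
from itertools import product

# Hypothetical knob values; the real ones depend on each benchmark suite.
knobs = {
    "size": ["small", "large"],
    "cflags": ["-O2", "-O3"],
}


def configurations(knobs):
    """Yield one dict per combination of knob values."""
    names = list(knobs)
    for values in product(*(knobs[name] for name in names)):
        yield dict(zip(names, values))


# Each configuration would then be compiled, run with perf, and its BBs extracted.
configs = list(configurations(knobs))
```
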
### Benchsuites

* Polybench: described earlier
* SPEC: described earlier
* Rodinia: benchmark suite for heterogeneous computing
  * Targeting GPU, OpenMP
  * Used in OpenMP mode
  * Exhibits various usual kernels (K-means, backprop, BFS, …)
* A lot of code and tooling had to be written to "standardize" the interfaces and bring them into a single tool

### Perf analysis

* Perf profiler: works by sampling the PC (+ stack), either on event occurrences or a given number of times per second. The second mode is used here.
* Extract the PC for each sample

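Counting the samples per PC could look like the sketch below. It assumes the samples have already been dumped to text with one hexadecimal PC per line (for instance with `perf script -F ip`); the `count_samples` helper and the inline sample text are hypothetical.

```python
from collections import Counter


def count_samples(perf_output: str) -> Counter:
    """Count how many times each PC was sampled.

    Assumes one hexadecimal PC per line; blank lines are ignored.
    """
    counts = Counter()
    for line in perf_output.splitlines():
        line = line.strip()
        if line:
            counts[int(line, 16)] += 1
    return counts


# Hypothetical samples, in the assumed one-PC-per-line format.
samples = """
    401136
    40113a
    401136
"""
pc_weights = count_samples(samples)
```

The resulting counter maps each sampled PC to its number of occurrences, which is the raw material for weighting basic blocks.
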
### ELF navigation: pyelftools & capstone

* Present the tools
  * Pyelftools: find symbols, read ELF sections, etc.
  * Capstone: disassemble for many ISAs
    * Inspect operands, registers, …
    * Instruction groups: control flow instructions

### Extract BBs

* For each sampled PC,
  * Find the corresponding binary symbol
  * Break this symbol into basic blocks using Capstone
    * Break at control flow instructions
    * Break at jump sites
  * Cache the BBs from this symbol
  * Map this PC to its corresponding BB
* Extract weighted BBs

This way, only the relevant portions of the binary are split into BBs.

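The splitting and mapping steps above can be sketched in plain Python. This is an illustration under assumptions, not the tool's actual code: the `Insn` record is a hypothetical stand-in for what a disassembler such as Capstone would provide, "jump sites" is read here as the addresses targeted by jumps, and every sampled PC is assumed to lie inside the symbol.

```python
from bisect import bisect_right
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Insn:
    """Hypothetical stand-in for a decoded instruction."""
    addr: int
    size: int
    is_control_flow: bool = False
    target: Optional[int] = None  # static jump target, if known


def split_into_bbs(insns):
    """Return the sorted start addresses of the basic blocks of one symbol."""
    if not insns:
        return []
    addrs = {insn.addr for insn in insns}
    leaders = {insns[0].addr}
    for insn in insns:
        if insn.is_control_flow:
            # Break right after a control flow instruction...
            following = insn.addr + insn.size
            if following in addrs:
                leaders.add(following)
            # ...and at the site it jumps to, if it lies inside this symbol.
            if insn.target in addrs:
                leaders.add(insn.target)
    return sorted(leaders)


def weigh_bbs(bb_starts, pc_counts):
    """Map each sampled PC to its enclosing BB and sum the sample counts.

    Assumes every PC belongs to the symbol the BBs were split from.
    """
    weights = Counter()
    for pc, count in pc_counts.items():
        idx = bisect_right(bb_starts, pc) - 1
        if idx >= 0:
            weights[bb_starts[idx]] += count
    return weights


# Hypothetical symbol: a prologue, a loop, then a return.
insns = [
    Insn(0x10, 4),
    Insn(0x14, 4),                                     # loop head
    Insn(0x18, 4),
    Insn(0x1C, 4, is_control_flow=True, target=0x14),  # backward jump
    Insn(0x20, 4, is_control_flow=True),               # return
]
bb_starts = split_into_bbs(insns)
bb_weights = weigh_bbs(bb_starts, {0x14: 5, 0x18: 3, 0x20: 1})
```

Keeping the sorted block starts per symbol is what makes caching cheap: once a symbol has been split, each further PC falling in it costs only a binary search.
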
### Conclusion

* Tooling to extract BBs from several benchmark suites
  * On any architecture supported by the suite
  * Weighted by measured occurrences on actual runs

* Works well to evaluate tools such as Palmed: kernels are multisets, no dependencies, everything is L1-resident.
  * We can use Pipedream as a baseline measurement.

* What if we want an execution of the real kernel as baseline (not Pipedream)?
  * The extracted BB cannot be measured as-is: it lacks context.
  * BHive

…transition to

## CesASMe

[paper with edits]

### GUS

* Dynamic tool based on QEMU
* User-defined regions of interest
* In these regions, instrument all instructions, accesses, etc.; using throughput + latency + μarch models for instructions, analyze resource usage and produce a cycles prediction
* Sensitivity analysis: by tweaking the model (multiplying the cost of some resources by a factor), can stress/alleviate parts of the model
  * Determine whether a resource is a bottleneck
* Dynamic with heavy instrumentation => slow
* Very detailed insight
  * In particular, access to real-run instruction dependencies
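
The idea behind sensitivity analysis can be illustrated with a toy model. This is not GUS's actual model: it assumes, purely for illustration, that the predicted cycle count is bounded by the most loaded resource, and the `predict_cycles` and `sensitivity` helpers, as well as the usage and cost figures, are hypothetical.

```python
def predict_cycles(usage, cost):
    """Toy prediction: the most loaded resource bounds the execution time."""
    return max(usage[r] * cost[r] for r in usage)


def sensitivity(usage, cost, factor=0.5):
    """Relative change of the prediction when one resource's cost is scaled."""
    base = predict_cycles(usage, cost)
    ratios = {}
    for r in usage:
        tweaked = {**cost, r: cost[r] * factor}
        ratios[r] = predict_cycles(usage, tweaked) / base
    return ratios


# Hypothetical per-resource usage counts and per-use costs (in cycles).
usage = {"alu": 8, "load": 6, "store": 1}
cost = {"alu": 0.25, "load": 0.5, "store": 1.0}

# Alleviate each resource in turn: if the prediction drops, it was a bottleneck.
ratios = sensitivity(usage, cost)
bottlenecks = [r for r, ratio in ratios.items() if ratio < 1.0]
```

Alleviating (factor below 1) rather than stressing is used in this sketch because, in such a max-based model, stressing a non-limiting resource hard enough would turn it into the bottleneck and blur the diagnosis.
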