# A more systematic approach to throughput prediction performance analysis
[[PREREQUISITES]]
* BB
* ISA
* ELF
[[END]]
* So far, evaluation has been done only on lone basic blocks.
* These were extracted with semi-automated methods, reproducible only with
  manual effort.
* Problematic when changing ISA: the same bench suite must be re-compiled… and
  re-extracted.
## Benchsuite-bb
* Fully automated, cross-platform extraction of (weighted) BBs, based on bench
  suites
* Extracts the relevant basic blocks from real workloads: the most-weighted
  BBs are the most frequently executed
[big picture:
  benchsuite
  -> knobs (size, cflags, …)
  -> compile
  -> run with perf
  -> extract BBs
]
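
A minimal driver sketch for this pipeline, assuming a make-based suite; the
paths, the `compile_and_profile` name, the knob spelling, and the sampling
frequency are illustrative, not the actual tool's interface:

```python
import subprocess

def compile_and_profile(bench_dir: str, binary: str, cflags: str = "-O3") -> str:
    """Compile one suite with the chosen knobs, then profile a run."""
    # 1. compile: knobs are forwarded as compiler flags
    subprocess.run(["make", "-C", bench_dir, f"CFLAGS={cflags}"], check=True)
    # 2. run under perf, sampling the PC 1000 times per second
    subprocess.run(
        ["perf", "record", "-F", "1000", "-o", "perf.data", "--", binary],
        check=True,
    )
    # 3. BB extraction from perf.data: see the following sections
    return "perf.data"
```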
### Benchsuites
* Polybench: described earlier
* Spec: described earlier
* Rodinia: bench suite for heterogeneous computing
  * Targeting GPU and OpenMP
  * Used here in OpenMP mode
  * Exhibits various usual kernels (K-means, backprop, BFS, …)
* A lot of code and tooling had to be written to "standardize" the interfaces
  and bring the suites into a single tool
### Perf analysis
* Perf profiler: works by sampling the PC (+ stack), either on event
  occurrences or a given number of times per second. The 2nd mode is used
  here.
* Extract the PC of each sample
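
A possible sketch of this step, assuming a `perf.data` produced as above;
`perf script -F ip` prints one sampled instruction pointer per line:

```python
import subprocess
from collections import Counter

def sampled_pcs(perf_data: str = "perf.data") -> Counter:
    """Count how many samples landed on each PC."""
    out = subprocess.run(
        ["perf", "script", "-i", perf_data, "-F", "ip"],
        capture_output=True, text=True, check=True,
    ).stdout
    pcs = Counter()
    for line in out.splitlines():
        line = line.strip()
        if line:
            pcs[int(line, 16)] += 1  # each sample adds weight to its PC
    return pcs
```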
### ELF navigation: pyelftools & capstone
* Present the tools
* Pyelftools: find symbols, read ELF sections, etc.
* Capstone: disassemble code for many ISAs
  * Inspect operands, registers, …
  * Instruction groups: identify control-flow instructions
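
For instance, a sketch combining both tools to fetch and disassemble one
symbol, assuming an x86-64 binary whose symbol lives in `.text` (in the real
tool, the ISA is a parameter):

```python
from elftools.elf.elffile import ELFFile
from capstone import Cs, CS_ARCH_X86, CS_MODE_64, CS_GRP_JUMP, CS_GRP_CALL, CS_GRP_RET

def disasm_symbol(path: str, name: str):
    with open(path, "rb") as f:
        elf = ELFFile(f)
        sym = elf.get_section_by_name(".symtab").get_symbol_by_name(name)[0]
        text = elf.get_section_by_name(".text")
        # translate the symbol's virtual address into a file offset
        f.seek(sym["st_value"] - text["sh_addr"] + text["sh_offset"])
        code = f.read(sym["st_size"])

    md = Cs(CS_ARCH_X86, CS_MODE_64)
    md.detail = True  # required to query groups and operands
    for insn in md.disasm(code, sym["st_value"]):
        cf = any(g in insn.groups for g in (CS_GRP_JUMP, CS_GRP_CALL, CS_GRP_RET))
        print(f"{insn.address:#x}\t{insn.mnemonic} {insn.op_str}"
              + ("\t; control flow" if cf else ""))
```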
### Extract BBs
* For each sampled PC:
  * Find the corresponding binary symbol
  * Break this symbol into basic blocks using Capstone
    * Break at control-flow instructions
    * Break at jump targets
  * Cache the BBs of this symbol
  * Map the PC to its corresponding BB
* Extract weighted BBs
* This way, only the relevant portions of the binary are chunked
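
A sketch of the splitting step on the Capstone instructions of one symbol
(x86-64 and `detail=True` assumed; `split_basic_blocks` is an illustrative
name, not the tool's actual API):

```python
from capstone import CS_GRP_JUMP, CS_GRP_CALL, CS_GRP_RET
from capstone.x86 import X86_OP_IMM

CONTROL_FLOW = (CS_GRP_JUMP, CS_GRP_CALL, CS_GRP_RET)

def split_basic_blocks(insns):
    """insns: in-order Capstone instructions of one symbol (detail=True)."""
    leaders = {insns[0].address}  # the symbol's entry starts a BB
    for insn in insns:
        if any(g in insn.groups for g in CONTROL_FLOW):
            leaders.add(insn.address + insn.size)  # fall-through starts a BB
            for op in insn.operands:
                if op.type == X86_OP_IMM:          # direct jump/call target
                    leaders.add(op.imm)
    blocks, cur = [], []
    for insn in insns:
        if insn.address in leaders and cur:
            blocks.append(cur)
            cur = []
        cur.append(insn)
    if cur:
        blocks.append(cur)
    return blocks  # a sampled PC is then mapped to the block containing it
```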
### Conclusion
* Tooling to extract BBs from several benchmark suites
* On any architecture supported by the suite
* Weighted by measured occurrences on actual runs
* Works well to evaluate tools such as Palmed: kernels are multisets of
  instructions, with no dependencies, and everything is L1-resident.
* We can use Pipedream as a baseline measurement.
* What if we want an execution of the real kernel as a baseline (instead of
  Pipedream)?
  * The extracted BB cannot be measured as-is: it lacks its context.
  * BHive
* …transition to CesASMe
## CesASMe
[paper with edits]
### GUS
* Dynamic tool based on QEMU
* User-defined regions of interest
* In these regions, instrument all instructions, memory accesses, etc.; using
  throughput + latency + μarch models for instructions, analyze resource
  usage and produce a cycle count prediction
* Sensitivity analysis: by tweaking the model (multiplying the cost of some
  resources by a factor), stress or alleviate parts of the model (see the toy
  sketch below)
  * Determines whether a resource is a bottleneck
* Dynamic with heavy instrumentation => slow
* Very detailed insight
  * In particular, access to real-run instruction dependencies
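
To illustrate the principle (a toy stand-in, not GUS's actual model):
predicted cycles come from the most saturated resource, and doubling one
resource's cost reveals whether it drives the prediction. All names and
numbers below are made up:

```python
def predicted_cycles(usage, cost, scaled=None, factor=1.0):
    """Toy prediction: cycles = load of the most saturated resource."""
    return max(
        usage[r] * cost[r] * (factor if r == scaled else 1.0)
        for r in usage
    )

# made-up per-iteration resource usages and per-use costs (in cycles)
usage = {"FP_ADD": 120, "LOAD": 200, "STORE": 60}
cost = {"FP_ADD": 0.5, "LOAD": 0.5, "STORE": 1.0}

base = predicted_cycles(usage, cost)
for r in usage:
    stressed = predicted_cycles(usage, cost, scaled=r, factor=2.0)
    # only the bottleneck's scaling moves the prediction proportionally
    print(f"{r}: x2 cost -> {stressed:.0f} cycles ({stressed / base:.2f}x)")
```

Here only scaling LOAD doubles the prediction, flagging it as the bottleneck.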