A more systematic approach to throughput prediction performance analysis
So far, evaluation only on lone basic blocks.
Extracted with somewhat automated methods, somewhat reproducible with manual effort.
Problematic when changing ISA: the same bench suite must be re-compiled… and re-extracted.
- Fully automated, cross-platform (weighted) BB extraction, based on bench suites
- Extract relevant basic blocks from real workloads: the most heavily weighted BBs are the most frequently executed
[big picture: Benchsuite -> knobs [size, cflags, …] -> compile -> run with perf -> extract BBs ]
- Polybench: described earlier
- Spec: described earlier
- Rodinia: bench suite for heterogeneous computing
- Targeting GPU, OpenMP
- Used in OpenMP mode
- Exhibits various usual kernels (K-means, backprop, BFS, …)
- A lot of code and tooling must be written to "standardize" the suites' interfaces and bring them into a single tool
Perf analysis
- Perf profiler: works by sampling the PC (+ stack), either on event occurrences or a given number of times per second. The second mode is used here.
- Extract PC for each sample
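
A minimal sketch of the per-sample PC extraction, assuming the samples have already been dumped as text with one hexadecimal address per line (roughly what `perf script -F ip` is expected to print; the exact dump format is an assumption, and `count_sampled_pcs` is a hypothetical helper name):

```python
from collections import Counter


def count_sampled_pcs(lines):
    """Count how many samples landed on each program counter.

    `lines` is any iterable of text lines, each assumed to hold a single
    hexadecimal address. Blank or unparsable lines are skipped.
    """
    counts = Counter()
    for line in lines:
        token = line.strip()
        if not token:
            continue
        try:
            counts[int(token, 16)] += 1
        except ValueError:
            continue  # not an address: ignore the line
    return counts


# Hypothetical sample dump: three hits on one PC, one on another.
sample_lines = ["401136", "40113a", "401136", "", "401136"]
pc_counts = count_sampled_pcs(sample_lines)
```

The resulting counter maps each sampled PC to its number of occurrences, which is the raw material for the weights used later.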
ELF navigation: pyelftools & capstone
- Present the tools
- Pyelftools: find symbols, read ELF sections, etc.
- Capstone: disassemble for many ISAs
- Inspect operands, registers, …
- Instruction groups: control flow instructions
Extract BBs
- For each sampled PC,
- Find corresponding binary symbol
- Break this symbol into basic blocks using Capstone
- Break at control flow instructions
- Break at jump sites
- Cache BBs from this symbol
- Map this PC to its corresponding BB
- Extract weighted BBs
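
The steps above can be sketched as follows. This is an illustration under stated assumptions, not the actual tool: instructions are given as pre-decoded `Insn` records (a hypothetical type; in practice the addresses, sizes, control-flow flags and jump targets would come from Capstone), and the per-symbol cache is left out for brevity.

```python
from bisect import bisect_right
from collections import Counter
from typing import NamedTuple, Optional


class Insn(NamedTuple):
    """Hypothetical pre-decoded instruction record."""
    addr: int
    size: int
    is_control_flow: bool
    target: Optional[int] = None  # static jump target, if known


def split_basic_blocks(insns):
    """Split one symbol into sorted (start, end) ranges, end exclusive."""
    if not insns:
        return []
    lo = insns[0].addr
    hi = insns[-1].addr + insns[-1].size
    valid = {i.addr for i in insns}
    leaders = {lo}
    for insn in insns:
        if insn.is_control_flow:
            nxt = insn.addr + insn.size
            if nxt < hi:
                leaders.add(nxt)  # break after a control flow instruction
            if insn.target in valid:
                leaders.add(insn.target)  # break at a jump site
    starts = sorted(leaders)
    return list(zip(starts, starts[1:] + [hi]))


def weigh_blocks(blocks, pc_counts):
    """Map each sampled PC to its block and sum the sample counts."""
    starts = [s for s, _ in blocks]
    weights = Counter()
    for pc, n in pc_counts.items():
        i = bisect_right(starts, pc) - 1
        if i >= 0 and pc < blocks[i][1]:
            weights[blocks[i]] += n
    return weights


# Hypothetical symbol: a small loop followed by a return.
insns = [
    Insn(0x00, 4, False),
    Insn(0x04, 4, False),
    Insn(0x08, 2, True, target=0x04),  # backward conditional jump
    Insn(0x0A, 4, False),
    Insn(0x0E, 1, True),               # return
]
blocks = split_basic_blocks(insns)
weights = weigh_blocks(blocks, {0x04: 5, 0x08: 3, 0x0A: 1, 0x20: 2})
```

Samples falling outside the symbol (here `0x20`) are simply dropped; a real implementation would look up their own symbol instead.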
This way, chunk only the relevant portions
Tooling to extract BBs from several benchmark suites
On any architecture supported by the suite
Weighted by measured occurrences on actual runs
Works well to evaluate tools such as Palmed: kernels are multisets, no dependencies, everything is L1-resident.
We can use Pipedream as a baseline measurement.
What if we want an execution of the real kernel as baseline (not Pipedream)?
The extracted BB cannot be measured as-is: lacks context.
BHive …transition to
[paper with edits]
- Dynamic tool based on QEMU
- User-defined regions of interest
- In these regions, instrument all instructions, accesses, etc.; using throughput + latency + μarch models for instructions, analyze resource usage, produce cycles prediction
- Sensitivity analysis: by tweaking the model (multiplying cost of some resources by a factor), can stress/alleviate parts of the model
- Determine if a resource is a bottleneck
- Dynamic with heavy instrumentation => slow
- Very detailed insight
- In particular, access to real-run instruction dependencies
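
The sensitivity-analysis idea can be illustrated with a toy model. This is only a sketch under strong assumptions, not the tool's actual model: the predicted cycle count is taken to be the load of the most heavily used resource, and all resource names and numbers are made up.

```python
def predict_cycles(usage, cost):
    """Toy prediction: the most loaded resource dictates the cycle count."""
    return max(usage[r] * cost[r] for r in usage)


def sensitivity(usage, cost, factor=2.0):
    """Scale the cost of each resource in turn by `factor`.

    Returns, per resource, the ratio between the tweaked prediction and
    the baseline prediction.
    """
    base = predict_cycles(usage, cost)
    ratios = {}
    for r in usage:
        tweaked = dict(cost)
        tweaked[r] = cost[r] * factor
        ratios[r] = predict_cycles(usage, tweaked) / base
    return ratios


# Hypothetical resource usage counts and unit costs.
usage = {"alu": 40, "load": 100, "store": 20}
cost = {"alu": 1.0, "load": 1.0, "store": 1.0}

ratios = sensitivity(usage, cost, factor=2.0)
# A resource whose stressed prediction differs from the baseline is a bottleneck.
bottlenecks = [r for r, ratio in ratios.items() if ratio != 1.0]
```

Stressing a resource (factor above 1) only changes the prediction if that resource is, or becomes, limiting; alleviating it (factor below 1) only helps if it was the bottleneck in the first place.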