A more systematic approach to throughput prediction performance analysis
- BB
- ISA
- ELF END
- So far, evaluation only on lone basic blocks.
- Extracted with somewhat automated methods, somewhat reproducible with manual effort.
- Problematic when changing ISA: the same bench suite must be re-compiled… and re-extracted.
Benchsuite-bb
- Fully automated, cross-platform (weighted) BB extraction, based on bench suites
- Extract relevant basic blocks from real workloads: the most heavily weighted BBs are those executed most often
[big picture: Benchsuite -> knobs [size, cflags, …] -> compile -> run with perf -> extract BBs ]
Benchsuites
- Polybench: described earlier
- Spec: described earlier
- Rodinia: bench suite for heterogeneous computing
- Targeting GPU, OpenMP
- Used in OpenMP mode
- Exhibits various usual kernels (K-means, backprop, BFS, …)
- A lot of code and tooling had to be written to "standardize" the interfaces and bring the suites into a single tool
Perf analysis
- Perf profiler: works by sampling the PC (+ stack) either on event occurrences, or a given number of times per second. The second mode (fixed sampling frequency) is the one used here.
- Extract PC for each sample
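The per-sample PC extraction can be sketched as a simple counting step. This is a minimal sketch, not the actual tool: it assumes the samples have already been dumped as text with one hexadecimal address per line (roughly what something like `perf script -F ip` would give), and the function name `count_samples` is hypothetical.

```python
from collections import Counter


def count_samples(lines):
    """Count how many samples hit each program counter.

    `lines` is an iterable of text lines, each assumed to hold a single
    hexadecimal address (hypothetical input format); blank lines are skipped.
    """
    counts = Counter()
    for line in lines:
        token = line.strip()
        if not token:
            continue  # ignore empty lines
        counts[int(token, 16)] += 1  # base 16 also accepts a "0x" prefix
    return counts


# Hypothetical samples: three hits at 0x401000, one at 0x401010.
samples = ["401000", "  401010", "401000", "", "0x401000"]
pc_counts = count_samples(samples)
```

The resulting `pc_counts` mapping (PC to number of samples) is the only input the later BB weighting needs.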
ELF navigation: pyelftools & capstone
- Present the tools
- Pyelftools: find symbols, read ELF sections, etc.
- Capstone: disassemble for many ISAs
- Inspect operands, registers, …
- Instruction groups: control flow instructions
Extract BBs
- For each sampled PC,
- Find corresponding binary symbol
- Break this symbol into basic blocks using Capstone
- Break at control flow instructions
- Break at jump sites
- Cache BBs from this symbol
- Map this PC to its corresponding BB
- Extract weighted BBs
This way, chunk only the relevant portions
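The splitting and weighting steps above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the tool's code: instructions are given as hypothetical `Insn` tuples rather than Capstone objects (in a Capstone-based version, `is_cf` would come from the instruction groups and `target` from the operands), and branch targets are assumed to fall on instruction boundaries.

```python
from bisect import bisect_right
from collections import Counter, namedtuple

# Hypothetical decoded instruction. `is_cf` marks control flow instructions;
# `target` is the branch destination, or None if absent or unknown.
Insn = namedtuple("Insn", "addr size is_cf target")


def split_basic_blocks(insns):
    """Split one symbol into sorted (start, end) ranges, end exclusive."""
    if not insns:
        return []
    lo = insns[0].addr
    hi = insns[-1].addr + insns[-1].size
    leaders = {lo}
    for ins in insns:
        if not ins.is_cf:
            continue
        nxt = ins.addr + ins.size
        if nxt < hi:
            leaders.add(nxt)  # break after a control flow instruction
        if ins.target is not None and lo <= ins.target < hi:
            leaders.add(ins.target)  # break at a jump site
    starts = sorted(leaders)
    return list(zip(starts, starts[1:] + [hi]))


def block_of(blocks, pc):
    """Return the block containing `pc`, or None if it lies outside."""
    starts = [start for start, _ in blocks]
    i = bisect_right(starts, pc) - 1
    if i >= 0 and blocks[i][0] <= pc < blocks[i][1]:
        return blocks[i]
    return None


def weigh_blocks(blocks, pc_counts):
    """Sum the sample counts of all PCs falling into each block."""
    weights = Counter()
    for pc, n in pc_counts.items():
        bb = block_of(blocks, pc)
        if bb is not None:
            weights[bb] += n
    return weights


# Hypothetical symbol: a small loop (0x04..0x0d) followed by a return.
insns = [
    Insn(0x00, 4, False, None),
    Insn(0x04, 4, False, None),  # loop head, targeted by the jump below
    Insn(0x08, 4, False, None),
    Insn(0x0C, 2, True, 0x04),   # backward conditional jump
    Insn(0x0E, 1, True, None),   # return
]
blocks = split_basic_blocks(insns)
weights = weigh_blocks(blocks, {0x08: 5, 0x0C: 2, 0x00: 1})
```

Caching per symbol would amount to storing `blocks` in a dictionary keyed by symbol, so that each symbol is disassembled and split only once, however many samples land in it.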
Conclusion
- Tooling to extract BBs from several benchmark suites
- On any architecture supported by the suite
- Weighted by measured occurrences on actual runs
- Works well to evaluate tools such as Palmed: kernels are multisets, no dependencies, everything is L1-resident.
- We can use Pipedream as a baseline measurement.
- What if we want an execution of the real kernel as baseline (not Pipedream)?
- The extracted BB cannot be measured as-is: lacks context.
- BHive …transition to
CesASMe
[paper with edits]
GUS
- Dynamic tool based on QEMU
- User-defined regions of interest
- In these regions, instrument all instructions, accesses, etc.; using throughput + latency + μarch models for instructions, analyze resource usage and produce a cycle prediction
- Sensitivity analysis: by tweaking the model (multiplying the cost of some resources by a factor), can stress/alleviate parts of the model
- Determine if a resource is bottleneck
- Dynamic with heavy instrumentation => slow
- Very detailed insight
- In particular, access to real-run instruction dependencies
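The idea behind sensitivity analysis can be illustrated on a toy model. This is not GUS's model, which combines throughput, latency and μarch models; here, as a loud simplification, the predicted cycle count is just the load of the busiest resource, and all names and numbers are hypothetical.

```python
def predict_cycles(usage, cost):
    """Toy bottleneck model: the busiest resource sets the cycle count."""
    return max(usage[r] * cost[r] for r in usage)


def sensitivity(usage, cost, factor):
    """Scale each resource's cost in turn; return new/base cycle ratios."""
    base = predict_cycles(usage, cost)
    ratios = {}
    for r in usage:
        tweaked = dict(cost)
        tweaked[r] = cost[r] * factor  # stress (>1) or alleviate (<1) r
        ratios[r] = predict_cycles(usage, tweaked) / base
    return ratios


# Hypothetical per-resource usage counts and unit costs.
usage = {"alu": 40, "load": 100, "store": 10}
cost = {"alu": 1.0, "load": 1.0, "store": 1.0}

# Alleviate each resource by halving its cost.
report = sensitivity(usage, cost, factor=0.5)
bottlenecks = [r for r, ratio in report.items() if ratio < 1.0]
```

In this toy setting, only alleviating `load` shortens the prediction, which flags it as the bottleneck; alleviating the other resources leaves the prediction unchanged.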