phd-thesis/plan/50_systematic_evaluation.md

A more systematic approach to throughput prediction performance analysis

PREREQUISITES

  • BB

  • ISA

  • ELF END

  • So far, evaluation has been done only on isolated basic blocks.

  • These were extracted with partly automated methods, reproducible only with manual effort.

  • This is problematic when changing ISA: the same bench suite must be re-compiled… and re-extracted.

Benchsuite-bb

  • Fully automated, cross-platform extraction of (weighted) BBs from benchmark suites
  • Extracts the relevant basic blocks from real workloads: the highest-weighted BBs are those executed most often

[big picture: Benchsuite -> knobs [size, cflags, …] -> compile -> run with perf -> extract BBs ]

Benchsuites

  • Polybench: described earlier
  • Spec: described earlier
  • Rodinia: bench suite for heterogeneous computing
    • Targets GPU and OpenMP backends
      • Used here in OpenMP mode
    • Covers a variety of common kernels (k-means, backprop, BFS, …)
  • A lot of code and tooling to write in order to "standardize" the interfaces and bring the suites into a single tool

Perf analysis

  • Perf profiler: works by sampling the PC (plus call stack) either on event occurrences or a fixed number of times per second; the second mode is used here.
  • Extract the PC of each sample
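A minimal sketch of this PC-extraction step, assuming a simplified `address (dso)` line layout (real `perf script` output varies with perf version and `--fields`; the regex and helper name below are illustrative):

```python
import re
from collections import Counter

# Hypothetical simplified sample line, e.g. "  401a2f (/path/to/bin)";
# real perf script output has more fields and varies across versions.
SAMPLE_RE = re.compile(r"^\s*([0-9a-f]+)\s+\((\S+)\)")

def sampled_pcs(lines):
    """Count how many samples fell on each (binary, PC) pair."""
    counts = Counter()
    for line in lines:
        m = SAMPLE_RE.match(line)
        if m:
            pc, dso = int(m.group(1), 16), m.group(2)
            counts[(dso, pc)] += 1
    return counts
```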

ELF navigation: pyelftools & capstone

  • Present the tools
  • Pyelftools: find symbols, read ELF sections, etc.
  • Capstone: disassembler supporting many ISAs
    • Inspect operands, registers, …
    • Instruction groups, e.g. to identify control-flow instructions

Extract BBs

  • For each sampled PC,
    • Find corresponding binary symbol
    • Break this symbol into basic blocks using Capstone
      • Break at control-flow instructions
      • Break at jump targets
    • Cache the BBs of this symbol
    • Map this PC to its corresponding BB
  • Extract BBs, weighted by their sample counts

This way, only the relevant portions of the binary are chunked into BBs.
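The splitting step above can be sketched in Python over pre-decoded instructions (the tuple layout and function names are hypothetical stand-ins for what Capstone actually exposes):

```python
def split_basic_blocks(insns):
    """Split a symbol's instructions into basic blocks.

    `insns` is a list of (addr, size, is_control_flow, target) tuples — a
    simplified stand-in for Capstone's decoded instructions. A new block
    starts after every control-flow instruction and at every jump target
    that lands inside the symbol.
    """
    leaders = {insns[0][0]}
    addrs = {a for a, _, _, _ in insns}
    for addr, size, is_cf, target in insns:
        if is_cf:
            leaders.add(addr + size)      # fall-through successor starts a block
            if target is not None and target in addrs:
                leaders.add(target)       # jump landing site starts a block
    blocks, cur = [], []
    for ins in insns:
        if ins[0] in leaders and cur:
            blocks.append(cur)
            cur = []
        cur.append(ins)
    if cur:
        blocks.append(cur)
    return blocks

def bb_of_pc(blocks, pc):
    """Map a sampled PC to the basic block containing it, if any."""
    for bb in blocks:
        start, end = bb[0][0], bb[-1][0] + bb[-1][1]
        if start <= pc < end:
            return bb
    return None
```

Mapping each sampled PC through `bb_of_pc` and summing sample counts per block yields the weighted BBs.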

Conclusion

  • Tooling to extract BBs from several benchmark suites

  • On any architecture supported by the suite

  • Weighted by measured occurrences on actual runs

  • Works well for evaluating tools such as Palmed: kernels are treated as instruction multisets, with no dependencies and everything L1-resident.

  • We can use Pipedream as a baseline measurement.

  • What if we want an execution of the real kernel as baseline (not Pipedream)?

  • An extracted BB cannot be measured as-is: it lacks its execution context.

  • …transition to BHive

CesASMe

[paper with edits]

GUS

  • Dynamic tool based on QEMU
  • User-defined regions of interest
  • In these regions, instrument all instructions, memory accesses, etc.; using throughput, latency, and μarch models for each instruction, analyze resource usage and produce a cycle-count prediction
  • Sensitivity analysis: by tweaking the model (multiplying the cost of some resources by a factor), one can stress or relieve parts of the model
    • Determine if a resource is bottleneck
  • Dynamic with heavy instrumentation => slow
  • Very detailed insight
  • In particular, access to the instruction dependencies of a real run
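A toy illustration of the sensitivity-analysis idea, assuming a much-simplified bottleneck-resource model (this is not GUS's actual model; all names and the cost structure are hypothetical):

```python
def predicted_cycles(kernel, costs):
    """Toy model: each instruction uses some abstract resources; the most
    loaded resource determines the steady-state cycle count."""
    usage = {}
    for insn in kernel:
        for res, amount in insn.items():
            usage[res] = usage.get(res, 0.0) + amount * costs.get(res, 1.0)
    return max(usage.values())

def sensitivity(kernel, factor=2.0):
    """Inflate each resource's cost in turn; resources whose inflation
    changes the prediction are (close to) the bottleneck."""
    base = predicted_cycles(kernel, {})
    resources = {r for insn in kernel for r in insn}
    return {r: predicted_cycles(kernel, {r: factor}) / base for r in resources}
```

In this sketch, a ratio of `factor` means the resource is the sole bottleneck, while a ratio of 1.0 means it has slack.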