# A more systematic approach to throughput prediction performance analysis

[[PREREQUISITES]]
* BB
* ISA
* ELF
[[END]]


* So far, evaluation only on lone basic blocks.
* Extracted with partially automated methods, reproducible only with some
  manual effort.
* Problematic when changing ISA: the same bench suite must be re-compiled… and
  re-extracted.

## Benchsuite-bb

* Fully automated, cross-platform (weighted) BB extraction, based on bench
  suites
* Extract relevant basic blocks from real workloads: the most heavily weighted
  BBs are the ones executed most often

[big picture:
Benchsuite
-> knobs [size, cflags, …]
-> compile
-> run with perf
-> extract BBs
]

### Benchsuites

* Polybench: described earlier
* Spec: described earlier
* Rodinia: bench suite for heterogeneous computing
    * Targeting GPU, OpenMP
        * Used in OpenMP mode
    * Features several common kernels (K-means, backprop, BFS, …)
* A lot of code and tooling to write to "standardize" the interfaces and bring
  them into a single tool

### Perf analysis

* Perf profiler: works by sampling the PC (+ stack), either on event
  occurrences or a fixed number of times per second. The second mode is used
  here.
* Extract the PC of each sample (sketched below)
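
A minimal sketch of this sampling step, assuming a hypothetical `./bench`
binary and a perf version whose `perf script --fields ip` output is one
hexadecimal address per line:

```python
import subprocess
from collections import Counter

BINARY = "./bench"  # hypothetical benchmark binary

# Time-based sampling: record the PC a fixed number of times per second.
subprocess.run(["perf", "record", "-F", "997", "-o", "perf.data", "--", BINARY],
               check=True)

# Dump one instruction pointer per sample; exact field support may vary
# slightly across perf versions.
out = subprocess.run(["perf", "script", "-i", "perf.data", "--fields", "ip"],
                     check=True, capture_output=True, text=True).stdout

# Count how many samples landed on each PC.
pc_hits = Counter(int(line.strip(), 16) for line in out.splitlines() if line.strip())
for pc, hits in pc_hits.most_common(10):
    print(f"{pc:#x}: {hits} samples")
```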

### ELF navigation: pyelftools & capstone

* Present the tools
* Pyelftools: find symbols, read ELF sections, etc.
* Capstone: disassemble for many ISAs
    * Inspect operands, registers, …
    * Instruction groups: control flow instructions
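
A sketch of how the two libraries fit together, assuming an x86-64 binary
`./bench` with an unstripped symbol table and a `main` symbol (all
hypothetical):

```python
from elftools.elf.elffile import ELFFile
from capstone import Cs, CS_ARCH_X86, CS_MODE_64, CS_GRP_JUMP, CS_GRP_CALL, CS_GRP_RET

with open("./bench", "rb") as f:
    elf = ELFFile(f)
    # pyelftools: locate the symbol and the bytes backing it.
    symtab = elf.get_section_by_name(".symtab")
    sym = symtab.get_symbol_by_name("main")[0]
    text = elf.get_section_by_name(".text")
    start = sym["st_value"] - text["sh_addr"]        # offset inside .text
    code = text.data()[start:start + sym["st_size"]]

# Capstone: disassemble and flag control-flow instructions via instruction groups.
md = Cs(CS_ARCH_X86, CS_MODE_64)
md.detail = True  # required to query operands and instruction groups
for insn in md.disasm(code, sym["st_value"]):
    is_cf = any(insn.group(g) for g in (CS_GRP_JUMP, CS_GRP_CALL, CS_GRP_RET))
    flag = "  <- control flow" if is_cf else ""
    print(f"{insn.address:#x}\t{insn.mnemonic}\t{insn.op_str}{flag}")
```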

### Extract BBs

* For each sampled PC,
    * Find corresponding binary symbol
    * Break this symbol into basic blocks using Capstone
        * Break at control flow instructions
        * Break at jump targets
    * Cache BBs from this symbol
    * Map this PC to its corresponding BB
* Extract weighted BBs

This way, only the relevant portions of the binary are chunked into basic
blocks (see the sketch below).
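
A sketch of the splitting and weighting steps, under the same assumptions as
above (x86-64, Capstone with details enabled); `code`, `sym_addr` and
`pc_hits` stand for the hypothetical outputs of the previous sketches:

```python
import bisect
from collections import Counter
from capstone import Cs, CS_ARCH_X86, CS_MODE_64, CS_GRP_JUMP, CS_GRP_CALL, CS_GRP_RET
from capstone.x86 import X86_OP_IMM

CONTROL_FLOW = (CS_GRP_JUMP, CS_GRP_CALL, CS_GRP_RET)

def split_into_bbs(md, code, base_addr):
    """Return the sorted start addresses of the basic blocks of one symbol."""
    insns = list(md.disasm(code, base_addr))
    leaders = {base_addr}
    end = base_addr + len(code)
    for i, insn in enumerate(insns):
        if any(insn.group(g) for g in CONTROL_FLOW):
            # Cut after the control-flow instruction...
            if i + 1 < len(insns):
                leaders.add(insns[i + 1].address)
            # ...and at the target of a direct jump, if it stays in the symbol.
            ops = insn.operands
            if ops and ops[0].type == X86_OP_IMM and base_addr <= ops[0].imm < end:
                leaders.add(ops[0].imm)
    return sorted(leaders)

def weight_bbs(pc_hits, bb_starts):
    """Map each sampled PC to the BB containing it; weight = number of samples."""
    weights = Counter()
    for pc, hits in pc_hits.items():
        idx = bisect.bisect_right(bb_starts, pc) - 1
        if idx >= 0:
            weights[bb_starts[idx]] += hits
    return weights

md = Cs(CS_ARCH_X86, CS_MODE_64)
md.detail = True
# bb_weights = weight_bbs(pc_hits, split_into_bbs(md, code, sym_addr))
```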

### Conclusion

* Tooling to extract BBs from several benchmark suites
* On any architecture supported by the suite
* Weighted by measured occurrences on actual runs

* Works well to evaluate tools such as Palmed: kernels are treated as
  multisets of instructions, with no dependencies and everything L1-resident.
* We can use Pipedream as a baseline measurement.

* What if we want an execution of the real kernel as baseline (not Pipedream)?
* The extracted BB cannot be measured as-is: it lacks its execution context.
* BHive
…transition to

## CesASMe

[paper with edits]

### GUS

* Dynamic tool based on QEMU
* User-defined regions of interest
* In these regions, instrument all instructions, memory accesses, etc.; using
  throughput + latency + μarch models for instructions, analyze resource
  usage and produce a cycle count prediction
* Sensitivity analysis: by tweaking the model (multiplying the cost of some
  resources by a factor), parts of the model can be stressed or alleviated
  (see the toy sketch after this list)
    * Determines whether a resource is a bottleneck
* Dynamic with heavy instrumentation => slow
* Very detailed insight
* In particular, access to real-run instruction dependencies
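
GUS's internals are not reproduced here; the toy model below only illustrates
the sensitivity-analysis idea. If the predicted cycle count is driven by the
most loaded resource, then multiplying one resource's cost by a factor moves
the prediction only when that resource is (close to) the bottleneck:

```python
# Toy bottleneck model, not GUS itself: predicted cycles = load of the most
# used resource.  All names and numbers below are made up for illustration.
def predicted_cycles(uses, cost):
    return max(n * cost[r] for r, n in uses.items())

def sensitivity(uses, cost, resource, factor=2.0):
    """Ratio of predicted cycles after/before stressing one resource's cost."""
    stressed_cost = dict(cost, **{resource: cost[resource] * factor})
    return predicted_cycles(uses, stressed_cost) / predicted_cycles(uses, cost)

uses = {"load_port": 1000, "fp_unit": 400}   # dynamic uses of each resource
cost = {"load_port": 1.0, "fp_unit": 1.0}    # cycles per use in the model
for r in uses:
    print(r, sensitivity(uses, cost, r))
# load_port 2.0  -> bottleneck: the prediction follows the stress factor
# fp_unit   1.0  -> not a bottleneck: the prediction does not move
```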