# A more systematic approach to throughput prediction performance analysis

[[PREREQUISITES]]
* BB
* ISA
* ELF
[[END]]


* So far, evaluation only on lone basic blocks.
* Extracted with only partly automated methods; reproducing the extraction
  requires manual effort.
* Problematic when changing ISA: the same bench suite must be re-compiled… and
  re-extracted.

## Benchsuite-bb

* Fully automated, cross-platform (weighted) BB extraction, based on bench
  suites
* Extract relevant basic blocks from real workloads: the most heavily weighted
  BBs are the most frequently executed ones

[big picture:
Benchsuite
-> knobs [size, cflags, …]
-> compile
-> run with perf
-> extract BBs
]
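
A minimal sketch of the first stages of this pipeline, under stated assumptions: the `Knobs` class, the helper names and the way the size knob is passed (as a `-D` macro) are hypothetical and only illustrate how the knob space could drive compilation and profiling. BB extraction itself is not covered here.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Knobs:
    """One point in the configuration space (hypothetical knob names)."""
    size: str       # dataset-size macro, assumed to be understood by the suite
    cflags: tuple   # compiler flags used for this build


def knob_space(sizes, cflag_sets):
    """Enumerate every (size, cflags) combination to build and profile."""
    return [Knobs(size, tuple(cflags))
            for size, cflags in product(sizes, cflag_sets)]


def compile_cmd(source, binary, knobs, cc="cc"):
    """Command line compiling one benchmark with the given knobs."""
    return [cc, *knobs.cflags, f"-D{knobs.size}", source, "-o", binary]


def perf_cmd(binary, perf_data, freq_hz=1000):
    """Command line running the binary under perf, sampling at freq_hz."""
    return ["perf", "record", "-F", str(freq_hz), "-o", perf_data,
            "--", binary]
```

Each command list can be passed to `subprocess.run`; keeping command construction separate from execution makes the knob space easy to inspect before launching long runs.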

### Benchsuites

* Polybench: described earlier
* Spec: described earlier
* Rodinia: bench suite for heterogeneous computing
    * Targeting GPU, OpenMP
        * Used in OpenMP mode
    * Exhibits various usual kernels (K-means, backprop, BFS, …)
* A lot of code and tooling must be written to "standardize" the interfaces
  and bring the suites into a single tool

### Perf analysis

* Perf profiler: works by sampling the PC (+ stack) either on event
  occurrences, or a given number of times per second. The second mode
  (fixed-frequency sampling) is used.
* Extract PC for each sample
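
A possible way to turn the samples into per-PC counts, as a sketch: it assumes the profile was recorded without call stacks and dumped with `perf script -F ip`, so that each non-empty line holds a single hexadecimal address. The function name is hypothetical.

```python
from collections import Counter


def count_sampled_pcs(lines):
    """Count how many samples hit each program counter.

    Assumes one hexadecimal address per non-empty line (with or without
    a leading 0x), which is an assumption on the dump format.
    """
    counts = Counter()
    for line in lines:
        token = line.strip()
        if not token:
            continue  # skip blank lines
        counts[int(token, 16)] += 1
    return counts
```

The function accepts any iterable of lines, so it can read `sys.stdin` when fed through a pipe from `perf script -i perf.data -F ip`.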

### ELF navigation: pyelftools & capstone

* Present the tools
* Pyelftools: find symbols, read ELF sections, etc.
* Capstone: disassemble for many ISAs
    * Inspect operands, registers, …
    * Instruction groups: control flow instructions

### Extract BBs

* For each sampled PC,
    * Find corresponding binary symbol
    * Break this symbol into basic blocks using Capstone
        * Break at control flow instructions
        * Break at jump sites
    * Cache BBs from this symbol
    * Map this PC to its corresponding BB
* Extract weighted BBs

This way, only the relevant portions of the binary (the symbols that were actually sampled) are chunked into BBs.
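
The splitting and weighting steps above can be sketched in plain Python. To keep the example independent of any disassembler, instructions are given as hypothetical `(address, size, is_control_flow, jump_target)` tuples; in practice these would come from Capstone. The per-symbol cache is left out for brevity.

```python
from bisect import bisect_right
from collections import Counter


def split_basic_blocks(insns):
    """Split one symbol into basic blocks.

    `insns` is a list of (address, size, is_control_flow, jump_target)
    tuples sorted by address; jump_target is None when unknown.
    Returns sorted half-open (start, end) address ranges.
    """
    if not insns:
        return []
    addresses = {addr for addr, _, _, _ in insns}
    sym_end = insns[-1][0] + insns[-1][1]
    leaders = {insns[0][0]}
    for addr, size, is_cf, target in insns:
        if is_cf:
            leaders.add(addr + size)  # break after a control flow instruction
            if target in addresses:
                leaders.add(target)   # break at a jump site in this symbol
    starts = sorted(start for start in leaders if start < sym_end)
    return list(zip(starts, starts[1:] + [sym_end]))


def weigh_blocks(blocks, sampled_pcs):
    """Map each sampled PC to its enclosing block and count the hits."""
    starts = [start for start, _ in blocks]
    weights = Counter()
    for pc in sampled_pcs:
        idx = bisect_right(starts, pc) - 1
        if idx >= 0 and pc < blocks[idx][1]:
            weights[blocks[idx]] += 1  # PCs outside the symbol are ignored
    return weights
```

Since the blocks of a symbol are sorted and disjoint, a binary search suffices to map a PC to its block, which keeps the cost per sample low even for large symbols.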

### Conclusion

* Tooling to extract BBs from several benchmark suites
* On any architecture supported by the suite
* Weighted by measured occurrences on actual runs

* Works well to evaluate tools such as Palmed: kernels are multisets, no
  dependencies, everything is L1-resident.
* We can use Pipedream as a baseline measurement.

* What if we want an execution of the real kernel as baseline (not Pipedream)?
* The extracted BB cannot be measured as-is: lacks context.
* BHive
…transition to

## CesASMe

[paper with edits]

### GUS

* Dynamic tool based on QEMU
* User-defined regions of interest
* In these regions, instrument all instructions, accesses, etc.; using
  throughput + latency + μarch models for instructions, analyze resource
  usage, produce a cycles prediction
* Sensitivity analysis: by tweaking the model (multiplying cost of some
  resources by a factor), can stress/alleviate parts of the model
    * Determine whether a resource is a bottleneck
* Dynamic with heavy instrumentation => slow
* Very detailed insight
* In particular, access to real-run instruction dependencies
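
The idea behind sensitivity analysis can be illustrated on a toy model. This is not the model GUS implements: it assumes that predicted cycles are simply the load of the most heavily used resource, and the resource names are hypothetical.

```python
def predict_cycles(load, scale=None):
    """Toy bottleneck model: the most loaded resource bounds execution."""
    scale = scale or {}
    return max(cycles * scale.get(res, 1.0) for res, cycles in load.items())


def sensitivity(load, factor=0.5):
    """Speedup obtained when the cost of each resource is scaled by `factor`.

    A speedup above 1 means that alleviating this resource shortens the
    prediction, i.e. the resource is a bottleneck in this toy model.
    """
    base = predict_cycles(load)
    return {res: base / predict_cycles(load, {res: factor}) for res in load}
```

With loads `{"frontend": 100.0, "port_alu": 80.0, "memory": 40.0}` and the default factor, only `frontend` yields a speedup (1.25, since the prediction drops from 100 to 80 cycles); the other two resources leave the prediction unchanged.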