phd-thesis/plan/50_systematic_evaluation.md

A more systematic approach to throughput prediction performance analysis

PREREQUISITES

  • BB

  • ISA

  • ELF END

  • So far, evaluation has been done only on isolated basic blocks.

  • These were extracted with partly automated methods, reproducible only with manual effort.

  • This is problematic when changing ISA: the same bench suite must be re-compiled… and re-extracted.

Benchsuite-bb

  • Fully automated, cross-platform extraction of (weighted) BBs from benchmark suites
  • Extracts the relevant basic blocks from real workloads: the highest-weighted BBs are those executed most often

[big picture: Benchsuite -> knobs [size, cflags, …] -> compile -> run with perf -> extract BBs ]

Benchsuites

  • Polybench: described earlier
  • Spec: described earlier
  • Rodinia: bench suite for heterogeneous computing
    • Targets GPU and OpenMP backends
      • Used here in OpenMP mode
    • Covers a variety of common kernels (k-means, backprop, BFS, …)
  • A lot of code and tooling to write in order to "standardize" the interfaces and bring the suites into a single tool

Perf analysis

  • Perf profiler: works by sampling the PC (plus call stack) either on event occurrences or a fixed number of times per second; the second mode is used here.
  • Extract the PC of each sample
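A minimal sketch of this PC-extraction step, assuming a simplified `address (dso)` line layout (real `perf script` output varies with perf version and `--fields`; the regex and helper name below are illustrative):

```python
import re
from collections import Counter

# Hypothetical simplified sample line, e.g. "  401a2f (/path/to/bin)";
# real perf script output has more fields and varies across versions.
SAMPLE_RE = re.compile(r"^\s*([0-9a-f]+)\s+\((\S+)\)")

def sampled_pcs(lines):
    """Count how many samples fell on each (binary, PC) pair."""
    counts = Counter()
    for line in lines:
        m = SAMPLE_RE.match(line)
        if m:
            pc, dso = int(m.group(1), 16), m.group(2)
            counts[(dso, pc)] += 1
    return counts
```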

ELF navigation: pyelftools & capstone

  • Present the tools
  • Pyelftools: find symbols, read ELF sections, etc.
  • Capstone: disassembler supporting many ISAs
    • Inspect operands, registers, …
    • Instruction groups, e.g. to identify control-flow instructions

Extract BBs

  • For each sampled PC,
    • Find corresponding binary symbol
    • Break this symbol into basic blocks using Capstone
      • Break at control-flow instructions
      • Break at jump targets
    • Cache the BBs of this symbol
    • Map this PC to its corresponding BB
  • Extract BBs, weighted by their sample counts

This way, only the relevant portions of the binary are chunked into BBs.
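The splitting step above can be sketched in Python over pre-decoded instructions (the tuple layout and function names are hypothetical stand-ins for what Capstone actually exposes):

```python
def split_basic_blocks(insns):
    """Split a symbol's instructions into basic blocks.

    `insns` is a list of (addr, size, is_control_flow, target) tuples — a
    simplified stand-in for Capstone's decoded instructions. A new block
    starts after every control-flow instruction and at every jump target
    that lands inside the symbol.
    """
    leaders = {insns[0][0]}
    addrs = {a for a, _, _, _ in insns}
    for addr, size, is_cf, target in insns:
        if is_cf:
            leaders.add(addr + size)      # fall-through successor starts a block
            if target is not None and target in addrs:
                leaders.add(target)       # jump landing site starts a block
    blocks, cur = [], []
    for ins in insns:
        if ins[0] in leaders and cur:
            blocks.append(cur)
            cur = []
        cur.append(ins)
    if cur:
        blocks.append(cur)
    return blocks

def bb_of_pc(blocks, pc):
    """Map a sampled PC to the basic block containing it, if any."""
    for bb in blocks:
        start, end = bb[0][0], bb[-1][0] + bb[-1][1]
        if start <= pc < end:
            return bb
    return None
```

Mapping each sampled PC through `bb_of_pc` and summing sample counts per block yields the weighted BBs.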

Conclusion

  • Tooling to extract BBs from several benchmark suites

  • On any architecture supported by the suite

  • Weighted by measured occurrences on actual runs

  • Works well for evaluating tools such as Palmed: kernels are treated as instruction multisets, with no dependencies and everything L1-resident.

  • We can use Pipedream as a baseline measurement.

  • What if we want an execution of the real kernel as baseline (not Pipedream)?

  • An extracted BB cannot be measured as-is: it lacks its execution context.

  • …transition to BHive

CesASMe

[paper with edits]

GUS

  • Dynamic tool based on QEMU
  • User-defined regions of interest
  • In these regions, instrument all instructions, memory accesses, etc.; using throughput, latency, and μarch models for each instruction, analyze resource usage and produce a cycle-count prediction
  • Sensitivity analysis: by tweaking the model (multiplying the cost of some resources by a factor), one can stress or relieve parts of the model
    • Determine if a resource is bottleneck
  • Dynamic with heavy instrumentation => slow
  • Very detailed insight
  • In particular, access to the instruction dependencies of a real run
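A toy illustration of the sensitivity-analysis idea, assuming a much-simplified bottleneck-resource model (this is not GUS's actual model; all names and the cost structure are hypothetical):

```python
def predicted_cycles(kernel, costs):
    """Toy model: each instruction uses some abstract resources; the most
    loaded resource determines the steady-state cycle count."""
    usage = {}
    for insn in kernel:
        for res, amount in insn.items():
            usage[res] = usage.get(res, 0.0) + amount * costs.get(res, 1.0)
    return max(usage.values())

def sensitivity(kernel, factor=2.0):
    """Inflate each resource's cost in turn; resources whose inflation
    changes the prediction are (close to) the bottleneck."""
    base = predicted_cycles(kernel, {})
    resources = {r for insn in kernel for r in insn}
    return {r: predicted_cycles(kernel, {r: factor}) / base for r in resources}
```

In this sketch, a ratio of `factor` means the resource is the sole bottleneck, while a ratio of 1.0 means it has slack.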