# A more systematic approach to throughput prediction performance analysis

[[PREREQUISITES]]
* BB
* ISA
* ELF
[[END]]

* So far, evaluation only on lone basic blocks.
* Extracted with somewhat automated methods, somewhat reproducible with manual effort.
* Problematic when changing ISA: the same bench suite must be re-compiled… and re-extracted.

## Benchsuite-bb

* Fully automated, cross-platform (weighted) BB extraction, based on bench suites
* Extract relevant basic blocks from real workloads: the highest-weighted BBs are the most frequently executed

[big picture: Benchsuite -> knobs [size, cflags, …] -> compile -> run with perf -> extract BBs ]

### Benchsuites

* Polybench: described earlier
* Spec: described earlier
* Rodinia: bench suite for heterogeneous computing
    * Targeting GPU, OpenMP
    * Used in OpenMP mode
    * Exhibits various usual kernels (K-means, backprop, BFS, …)
* A lot of code and tooling to write to "standardize" the interfaces and bring them into a single tool

### Perf analysis

* Perf profiler: works by sampling PC (+ stack) either on event occurrences, or a given number of times per second. The second mode is used.
* Extract PC for each sample

### ELF navigation: pyelftools & capstone

* Present the tools
    * Pyelftools: find symbols, read ELF sections, etc.
    * Capstone: disassemble for many ISAs
        * Inspect operands, registers, …
        * Instruction groups: control flow instructions

### Extract BBs

* For each sampled PC,
    * Find corresponding binary symbol
    * Break this symbol into basic blocks using Capstone
        * Break at control flow instructions
        * Break at jump sites
    * Cache BBs from this symbol
    * Map this PC to its corresponding BB
* Extract weighted BBs

This way, only the relevant portions are chunked.

### Conclusion

* Tooling to extract BBs from several benchmark suites
    * On any architecture supported by the suite
    * Weighted by measured occurrences on actual runs
* Works well to evaluate tools such as Palmed: kernels are multisets, no dependencies, everything is L1-resident.
* We can use Pipedream as a baseline measurement.
* What if we want an execution of the real kernel as baseline (not Pipedream)?
    * The extracted BB cannot be measured as-is: lacks context.
    * BHive

…transition to

## CesASMe

[paper with edits]

### GUS

* Dynamic tool based on QEMU
* User-defined regions of interest
* In these regions, instrument all instructions, accesses, etc.; using throughput + latency + μarch models for instructions, analyze resource usage, produce cycles prediction
* Sensitivity analysis: by tweaking the model (multiplying cost of some resources by a factor), can stress/alleviate parts of the model
    * Determine if a resource is a bottleneck
* Dynamic with heavy instrumentation => slow
* Very detailed insight
    * In particular, access to real-run instruction dependencies
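
To illustrate the sensitivity analysis idea above, here is a minimal sketch on a toy model. It is not GUS's actual model: it assumes that the predicted cycle count of a region is simply the load of its most heavily used resource, and ignores latencies and dependencies. The resource names, the load values and the functions `predict_cycles`, `sensitivity` and `bottlenecks` are all hypothetical and chosen for the example.

```python
from typing import Dict, List, Optional


def predict_cycles(usage: Dict[str, float],
                   scale: Optional[Dict[str, float]] = None) -> float:
    """Toy prediction: cycles = load of the most heavily used resource.

    `usage` maps a (hypothetical) resource name to the cycles it is busy.
    `scale` optionally multiplies the cost of some resources by a factor.
    """
    scale = scale or {}
    return max(load * scale.get(res, 1.0) for res, load in usage.items())


def sensitivity(usage: Dict[str, float], factor: float = 0.5) -> Dict[str, float]:
    """Relative change of the prediction when each resource, in turn,
    has its cost multiplied by `factor` (< 1 alleviates, > 1 stresses)."""
    base = predict_cycles(usage)
    return {
        res: predict_cycles(usage, {res: factor}) / base - 1.0
        for res in usage
    }


def bottlenecks(usage: Dict[str, float], factor: float = 0.5,
                eps: float = 1e-9) -> List[str]:
    """Resources whose alleviation changes the predicted cycle count."""
    return [res for res, delta in sensitivity(usage, factor).items()
            if abs(delta) > eps]


# Made-up resource loads for one region of interest.
usage = {"port0": 40.0, "port1": 25.0, "load_unit": 60.0, "frontend": 50.0}

if __name__ == "__main__":
    print(predict_cycles(usage))
    print(sensitivity(usage))
    print(bottlenecks(usage))
```

In this toy setting, alleviating a resource (factor below 1) only changes the prediction if that resource was limiting, which makes it a direct bottleneck test. Stressing a resource (factor above 1) can instead turn a non-limiting resource into the new bottleneck.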