Plan: list prerequisites for each chapter, ensure consistency

2023-09-14 11:42:50 +02:00 · 2023-09-14 11:42:50 +02:00 · 9ed7be7fc6
commit 9ed7be7fc6
parent 7e5abd9669
6 changed files with 143 additions and 14 deletions
--- a/plan/30_palmed.md
+++ b/plan/30_palmed.md
@@ -1,5 +1,17 @@
 # Palmed: automatically modelling the backend
 [[PREREQUISITES]]
 * Microarch: ports, μops, pipeline, cycle, L1-res
 * Define Cyc(kernel)
 * Backend models
 * HW counters
 * Tools:
    * IACA
    * UOPS
    * llvm-mca
    * PMEvo
 [[END]]
 * SotA: we saw efforts to build backend models
 * they take considerable expert knowledge/time
 * based on reverse-engineering, HW counters
@@ -94,7 +106,7 @@ NUM_ITER >= TOTAL_INSN`.
    * Direct register addressing mode (eg `ldr x0, [x1]`): always the same
      address (load/store separated)
    * Base-index-displacement mode: constant base, 0 offset, round-robin
-      displacement.
+      displacement on x86 (constant displacement on ARM)
 * Whenever possible (`\sum_i(lat_i) < #reg`), no data dependency during
  measurement
 * L1-residence: memory arena is small enough; warm-up rounds.
@@ -124,13 +136,30 @@ Tried on x86 (SKX, ZEN1) and ARM (A72).
 ### Evaluation
 #### Bench suites: SPEC, Polybench
 * SPEC: real-world programs
    * Mainly made to evaluate hardware on a fixed workload
    * Also provides a fixed workload to evaluate software experiments
        * Used throughout the literature
    * Describe versions of SPEC, architecture
 * Polybench
    * 30 numerical computations
    * Computation kernels: domain specific (sci. computation, math, …)
    * Kernel well-defined; no need to "figure out" the interesting basic blocks
    * C language
    * datasets
 #### Experimental setup
 * Harness to evaluate Palmed against other code analyzers
    * Raw pipedream
    * Gus
    * IACA
    * llvm-mca
    * PMEvo
    * UOPS
        * UiCA did not exist at the time; + fair comparison (Palmed is backend)
 * Based on basic blocks
 * The kernel is defined as a Palmed kernel: unordered, no dependencies
    * in practice, use Pipedream generated code as kernel
@@ -141,3 +170,7 @@ Measures:
 * Coverage: proportion of benchmarks supported by the tool, wrt. Palmed
 * RMS error of IPC
 * Kendall's tau for the IPCs
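
The three measures above can be sketched as follows. This is a minimal illustration, not the harness's actual code: the function names are hypothetical, the RMS error is assumed to be an unweighted relative error, and Kendall's tau is computed in its simple tau-a form (ties count as neither concordant nor discordant).

```python
import math
from itertools import combinations


def coverage(tool_supported, palmed_supported):
    """Fraction of the benchmarks supported by Palmed that the tool supports too."""
    palmed = set(palmed_supported)
    return len(palmed & set(tool_supported)) / len(palmed)


def rms_relative_error(predicted, measured):
    """Root-mean-square of the relative IPC error (unweighted: an assumption)."""
    errs = [((p - m) / m) ** 2 for p, m in zip(predicted, measured)]
    return math.sqrt(sum(errs) / len(errs))


def kendall_tau(predicted, measured):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    pairs = list(combinations(range(len(predicted)), 2))
    score = 0
    for i, j in pairs:
        s = (predicted[i] - predicted[j]) * (measured[i] - measured[j])
        score += (s > 0) - (s < 0)
    return score / len(pairs)


if __name__ == "__main__":
    measured_ipc = [1.0, 2.0, 4.0]    # made-up values for illustration
    predicted_ipc = [1.0, 3.0, 2.0]
    print(rms_relative_error(predicted_ipc, measured_ipc))
    print(kendall_tau(predicted_ipc, measured_ipc))
```

The RMS error captures how far off the predictions are, while Kendall's tau only looks at whether the tool ranks the benchmarks in the same order as the measurements.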
 #### Results
 * Results
--- a/plan/40_a72_frontend.md
+++ b/plan/40_a72_frontend.md
@@ -1,14 +1,26 @@
 # Beyond ports: manually modelling the A72 frontend
 [[PREREQUISITES]]
 * Microarch: frontend, ports, in-order/OoO, μ/Mop
 * Assembly
 * SIMD
 * Def Cyc(k) -> retired insn
 * Palmed, Palmed results
    * Palmed instruction classes
 * Pipedream
 * uops.info
 * Notion of bottleneck
 [[END]]
 ## Necessity to go beyond ports
 * Palmed: concerned mostly with ports
 * Noticed the importance of the frontend while investigating its performance
-    * heatmap representation: uops gone wild
+    * heatmap representation: uops predicts unreachably high IPCs (eg. 8 on SKX)
    * example of a frontend-bound microkernel
 * Palmed's vision of a frontend
 * Real difference: in-order
-* UiCA: OK, but it's more complicated
+* UiCA: proves that frontends are important, implements Intel frontends
 ## Cortex A72
@@ -73,10 +85,10 @@ From now on, we try to find models answering:
 ### No-cross model
-* Hypothesis: the frontend cannot decross a multi-uop instruction across cycle
+* Hypothesis: the frontend cannot decode a multi-uop instruction across cycle
  boundaries.
-* Reasonable: similar things on x86-64 [uica] (?? investigate)
+* Reasonable: similar things on x86-64 -- cf [uica] predecoder §4.1
 * Would explain the example above [show again].
 * Frontend state ∈ [|0,2|]: how many μops already decoded this cycle
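
The no-cross hypothesis can be sketched as a small state machine. This is only an illustration under stated assumptions: the decode width of 3 μops per cycle is inferred from the state range [|0,2|], every instruction is assumed to yield between 1 and 3 μops, and the function names are hypothetical.

```python
WIDTH = 3  # assumed: μops decoded per cycle, so the state ranges over 0..2


def decode_kernel(uop_counts, state=0):
    """Decode one kernel iteration under the no-cross hypothesis.

    `state` is the number of μops already decoded in the current cycle.
    Returns (number of cycle boundaries crossed, state after the iteration).
    """
    cycles = 0
    for n in uop_counts:
        if not 1 <= n <= WIDTH:
            raise ValueError("sketch only handles 1..WIDTH μops per instruction")
        if state + n > WIDTH:
            # The instruction would straddle a cycle boundary: wait for the
            # next cycle and leave the remaining decode slots unused.
            cycles += 1
            state = 0
        state += n
        if state == WIDTH:
            # The current cycle is full; the next instruction starts a new one.
            cycles += 1
            state = 0
    return cycles, state


def steady_state_cycles(uop_counts, iterations=1000):
    """Average frontend cycles per kernel iteration over many iterations."""
    total, state = 0, 0
    for _ in range(iterations):
        cycles, state = decode_kernel(uop_counts, state)
        total += cycles
    return total / iterations


if __name__ == "__main__":
    print(steady_state_cycles([1, 2]))  # fills each cycle exactly
    print(steady_state_cycles([2, 2]))  # wastes one slot per instruction
```

A kernel made of two 2-μop instructions shows the effect: a model that only counts μops would predict 4/3 cycles per iteration, while this sketch converges to 2.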
--- a/plan/50_systematic_evaluation.md
+++ b/plan/50_systematic_evaluation.md
@@ -1,5 +1,12 @@
 # A more systematic approach to throughput prediction performance analysis
 [[PREREQUISITES]]
 * BB
 * ISA
 * ELF
 [[END]]
 * So far, evaluation only on lone basic blocks.
 * Extracted with somewhat automated methods, somewhat reproducible with manual
  effort.
@@ -38,6 +45,14 @@ Benchsuite
  a given number of times per second. 2nd mode used.
 * Extract PC for each sample
 ### ELF navigation: pyelftools & capstone
 * Present the tools
 * Pyelftools: find symbols, read ELF sections, etc.
 * Capstone: disassemble for many ISAs
    * Inspect operands, registers, …
    * Instruction groups: control flow instructions
 ### Extract BBs
 * For each sampled PC,
@@ -69,3 +84,17 @@ This way, chunk only the relevant portions
 ## CesASMe
 [paper with edits]
 ### GUS
 * Dynamic tool based on QEMU
 * User-defined regions of interest
 * In these regions, instrument all instructions, accesses, etc; using
  throughput + latency + μarch models for instructions, analyze resource
  usage, produce cycles prediction
 * Sensitivity analysis: by tweaking the model (multiplying cost of some
  resources by a factor), can stress/alleviate parts of the model
    * Determine if a resource is bottleneck
 * Dynamic with heavy instrumentation => slow
 * Very detailed insight
 * In particular, access to real-run instruction dependencies
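
The sensitivity analysis idea can be illustrated on a toy model. This is not Gus's actual model: it assumes, purely for illustration, that the predicted cycle count is set by the most loaded resource, and all names and numbers are hypothetical.

```python
def predicted_cycles(usage, cost):
    """Toy model: the most loaded resource determines the cycle count."""
    return max(usage[r] * cost[r] for r in usage)


def sensitivity(usage, cost, factor=1.1):
    """Relative slowdown when the cost of each resource is scaled by `factor`."""
    base = predicted_cycles(usage, cost)
    result = {}
    for r in usage:
        tweaked = dict(cost)
        tweaked[r] = cost[r] * factor  # stress a single resource
        result[r] = predicted_cycles(usage, tweaked) / base
    return result


def bottlenecks(usage, cost, factor=1.1):
    """Resources whose stressed cost slows down the prediction."""
    return [r for r, s in sensitivity(usage, cost, factor).items() if s > 1.0]


if __name__ == "__main__":
    usage = {"alu": 10, "load": 4, "frontend": 6}   # made-up resource usage
    cost = {"alu": 1.0, "load": 1.0, "frontend": 1.0}
    print(sensitivity(usage, cost))
    print(bottlenecks(usage, cost))
```

A small factor keeps the probe local: with a large one, a resource that is not currently limiting could overtake the real bottleneck and be reported as well.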
--- a/plan/60_staticdeps.md
+++ b/plan/60_staticdeps.md
@@ -1,5 +1,16 @@
 # Static extraction of memory-carried dependencies
 [[PREREQUISITES]]
 * CesASMe results
 * Gus
 * Static vs dynamic
 * PC
 * μarch: μop, renamer, L1-res, ROB
 * Osaca
 * UiCA
 [[END]]
 ## Intro
 * Previous chapt. : effect of mem-carried deps
@@ -62,6 +73,7 @@ On SKX,
      instructions are out of the ROB anyway
        * 224 μops in Intel's Skylake, 2015
        * 512 μops in Intel's Golden Cove, 2021
        * Source: https://fuse.wikichip.org/news/6111/intel-details-golden-cove-next-generation-big-core-for-client-and-server-socs/ [consulted 2023-09-13]
    * Can unroll until we have ~|ROB|+|K| instructions in the kernel: since
      instructions yield at least a μop, safe [TODO check]
    * Sometimes unrolled only once, eg. Osaca. Not sufficient; eg. Fibo.
@@ -95,8 +107,13 @@ On SKX,
  1st kernel cannot depend on the previous kernel unroll); if it happens in the
  majority of cases, keep; else drop
-* Semantics of asm coming from Valgrind's IR -- should be portable to any
+* We need semantics for our assembly
-  architecture supported
+
 ### Valgrind's VEX
 * Introduce Valgrind as an instrumentation tool
 * Introduce VEX
 * Should be portable to any architecture supported
    * but suffers limitations for recent extension sets; eg avx512 not
      supported (TODO check)
@@ -104,7 +121,7 @@ On SKX,
 * Does not track aliasing that originates from outside of the kernel.
    * As advocated in CesASMe, would require a broader analysis range
-* Randomness may lead to false positives
+* Randomness may (theoretically) lead to false positives
    * but re-running with a different seed should almost entirely eliminate
      the hazard
 * Should not have false negatives outside of aliasing or unsupported ops
@@ -135,14 +152,15 @@ Then, compare with staticdeps: `eval/vg_depsim.py` script.
    * use genbench's bb split/occurrences to retrieve basic blocks
    * for each BB with more than 10% of max BB hits,
    * predict deps with staticdeps
-        * cache the result: fast, but we're dealing with 3500 files.
+        * cache the result: staticdeps is fast, but we're dealing with 3500
          files.
    * translate staticdeps' periodic deps to PC deps, discard the `iter`
      parameter
    * for each dependency from the depsim results that occurs inside this BB,
        * check if found or missed, append to a list
 * score: `|found| / (|found| + |missed|)`. Discards occurrences.
-* limitation: will only find deps from/to the same BB! Dependencies leaving
+* limitation: will only find deps from/to the same BB! Dependencies leaving a
-  a BB are discarded.
+  BB are discarded.
 * Result: about 38% of deps found; 44% if weighting by occurrences
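
The basic-block filter and the two scores above can be sketched as follows. The data layout (one `(was_found, occurrences)` pair per depsim dependency) and the function names are assumptions made for illustration; they are not taken from `eval/vg_depsim.py`.

```python
def hot_blocks(hits, ratio=0.10):
    """Keep basic blocks with more than `ratio` of the maximum hit count."""
    threshold = ratio * max(hits.values())
    return [bb for bb, n in hits.items() if n > threshold]


def dep_scores(deps):
    """Return (unweighted, occurrence-weighted) fraction of found dependencies.

    `deps` holds one (was_found, occurrences) pair per dependency reported
    by the dynamic analysis inside the considered basic blocks.
    """
    found = sum(1 for hit, _ in deps if hit)
    unweighted = found / len(deps)  # |found| / (|found| + |missed|)
    found_occ = sum(occ for hit, occ in deps if hit)
    weighted = found_occ / sum(occ for _, occ in deps)
    return unweighted, weighted


if __name__ == "__main__":
    hits = {"bb0": 1000, "bb1": 150, "bb2": 50}               # made-up counts
    deps = [(True, 90), (False, 5), (False, 5), (True, 100)]  # made-up deps
    print(hot_blocks(hits))
    print(dep_scores(deps))
```

The weighted score gives more importance to dependencies that occur often at runtime, so the two figures can differ noticeably when a few hot dependencies dominate.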
@@ -218,3 +236,7 @@ TODO ?
  analysis
    * but might not be true for other applications that require dependencies
      detection
 ### Speed
 TODO: evaluate speed?
--- a/plan/main.md
+++ b/plan/main.md
@@ -2,7 +2,10 @@
 ## 10. Introduction
-## 20. State of the art
+## 20. Foundations
 Introduce the related works, present their techniques, define and introduce
 notions, …
 ## 30. Palmed: automatically modelling the backend
--- a/plan/to_introduce_early.md
+++ b/plan/to_introduce_early.md
@@ -0,0 +1,30 @@
 # Stuff that must be introduced early (intro/foundations)
 * Static vs. dynamic
 * PC
 * ELF
 * ISA
 * Assembly
 * SIMD
 * Basic block
 * μarch:
    * frontend
    * ports
    * in-order/out-of-order
    * pipeline
    * Mop
    * μop
    * renamer
    * ROB
    * L1-residence
 * HW counters
 * Tools:
    * IACA
    * llvm-mca
    * Osaca
    * uops.info
    * UiCA
    * PMEvo
 * Define Cycles(K): retired instructions
 * Define notion of bottleneck