Plan: list prerequisites for each chapter, ensure consistency

2023-09-14 11:42:50 +02:00 · 2023-09-14 11:42:50 +02:00 · 9ed7be7fc6
commit 9ed7be7fc6
parent 7e5abd9669
6 changed files with 143 additions and 14 deletions
--- a/plan/30_palmed.md
+++ b/plan/30_palmed.md
@ -1,5 +1,17 @@
 # Palmed: automatically modelling the backend

+[[PREREQUISITES]]
+* Microarch: ports, μops, pipeline, cycle, L1-res
+* Define Cyc(kernel)
+* Backend models
+* HW counters
+* Tools:
+    * IACA
+    * UOPS
+    * llvm-mca
+    * PMEvo
+[[END]]
+
 * SotA: we saw efforts to build backend models
 * they take considerable expert knowledge/time
 * based on reverse-engineering, HW counters
@ -94,7 +106,7 @@ NUM_ITER >= TOTAL_INSN`.
    * Direct register addressing mode (eg `ldr x0, [x1]`): always the same
      address (load/store separated)
    * Base-index-displacement mode: constant base, 0 offset, round-robin
-      displacement.
+      displacement on x86 (constant displacement on ARM)
 * Whenever possible (`\sum_i(lat_i) < #reg`), no data dependency during
  measurement
 * L1-residence: memory arena is small enough; warm-up rounds.
@ -124,13 +136,30 @@ Tried on x86 (SKX, ZEN1) and ARM (A72).

 ### Evaluation

+#### Bench suites: SPEC, Polybench
+
+* SPEC: real-world programs
+    * Mainly made to evaluate hardware on a fixed workload
+    * Also provides a fixed workload against which to evaluate software
+      experiments
+        * Used throughout the literature
+    * Describe versions of SPEC, architecture
+* Polybench
+    * 30 numerical computations
+    * Computation kernels: domain specific (sci. computation, math, …)
+    * Kernel well-defined; no need to "figure out" the interesting basic blocks
+    * C language
+    * datasets
+
+#### Experimental setup
+
 * Harness to evaluate Palmed against other code analyzers
    * Raw pipedream
-    * Gus
    * Iaca
-    * UOPS
    * llvm-mca
    * PMEvo
+    * UOPS
+        * UiCA did not exist at the time; also a fair comparison (Palmed is a backend model)
 * Based on basic blocks
 * The kernel is defined as a Palmed kernel: unordered, no dependencies
    * in practice, use Pipedream generated code as kernel
@ -141,3 +170,7 @@ Measures:
 * Coverage: proportion of benchmarks supported by the tool, wrt. Palmed
 * RMS error of IPC
 * Kendall's tau for the IPCs
+
+#### Results
+
+* Results
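
The three measures listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the harness's actual code: the function names are hypothetical, an unsupported benchmark is represented by `None`, the coverage denominator is taken to be the benchmarks Palmed supports, the RMS error is computed on unweighted relative errors, and Kendall's tau is the tau-a variant, which ignores ties.

```python
from itertools import combinations
from math import sqrt


def coverage(predictions):
    """Fraction of benchmarks for which the tool produced a prediction.

    `predictions` maps a benchmark name to a predicted IPC, or None when the
    tool does not support the benchmark (hypothetical convention). The keys
    are assumed to be the benchmarks supported by Palmed.
    """
    supported = sum(1 for ipc in predictions.values() if ipc is not None)
    return supported / len(predictions)


def rms_relative_error(predicted, measured):
    """Root mean square of the relative IPC errors (unweighted: assumption)."""
    errors = [(p - m) / m for p, m in zip(predicted, measured)]
    return sqrt(sum(e * e for e in errors) / len(errors))


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(xs)
    return (concordant - discordant) / (n * (n - 1) / 2)
```

Kendall's tau only compares rankings, so it complements the RMS error: a tool that is consistently off by a constant factor still ranks kernels correctly.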
--- a/plan/40_a72_frontend.md
+++ b/plan/40_a72_frontend.md
@ -1,14 +1,26 @@
 # Beyond ports: manually modelling the A72 frontend

+[[PREREQUISITES]]
+* Microarch: frontend, ports, in-order/OoO, μ/Mop
+* Assembly
+* SIMD
+* Def Cyc(k) -> retired insn
+* Palmed, Palmed results
+    * Palmed instruction classes
+* Pipedream
+* uops.info
+* Notion of bottleneck
+[[END]]
+
 ## Necessity to go beyond ports

 * Palmed: concerned mostly with ports
 * Noticed the importance of the frontend while investigating its performance
-    * heatmap representation: uops gone wild
+    * heatmap representation: uops predicts unreachably high IPCs (eg. 8 on SKX)
    * example of a frontend-bound microkernel
 * Palmed's vision of a frontend
 * Real difference: in-order
-* UiCA: OK, but it's more complicated
+* UiCA: proves that frontends are important, implements Intel frontends

 ## Cortex A72

@ -73,10 +85,10 @@ From now on, we try to find models answering:

 ### No-cross model

-* Hypothesis: the frontend cannot decross a multi-uop instruction across cycle
+* Hypothesis: the frontend cannot decode a multi-uop instruction across cycle
  boundaries.

-* Reasonable: similar things on x86-64 [uica] (?? investigate)
+* Reasonable: similar things on x86-64 -- cf [uica] predecoder §4.1
 * Would explain the example above [show again].

 * Frontend state ∈ [|0,2|]: how many μops already decoded this cycle
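
One possible reading of the no-cross hypothesis can be simulated as follows. This is a sketch under assumptions, not the model from the chapter: a decode width of 3 μops per cycle is inferred from the state range [|0,2|], instructions are assumed to decode into at most 3 μops, and the function name is hypothetical.

```python
DECODE_WIDTH = 3  # assumption: inferred from frontend state in [|0,2|]


def no_cross_cycles(uops_per_insn, iterations=1):
    """Count frontend cycles needed to decode a kernel under the no-cross rule.

    `uops_per_insn` lists the number of μops of each instruction, in program
    order. An instruction that does not fit in the current cycle is postponed
    to the next one as a whole: it never straddles a cycle boundary.
    """
    cycles = 0
    state = 0  # μops already decoded in the current cycle
    for _ in range(iterations):
        for n in uops_per_insn:
            if not 1 <= n <= DECODE_WIDTH:
                raise ValueError("sketch only handles 1..3 μops per instruction")
            if state + n > DECODE_WIDTH:
                # Instruction would cross the boundary: close the cycle early.
                cycles += 1
                state = 0
            state += n
            if state == DECODE_WIDTH:
                # Cycle is full: move on to the next one.
                cycles += 1
                state = 0
    # Account for a last, partially filled cycle.
    return cycles + (1 if state else 0)
```

For instance, three 2-μop instructions take 3 cycles under this rule, whereas a frontend allowed to split instructions across cycles would need only 2 for the same 6 μops.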
--- a/plan/50_systematic_evaluation.md
+++ b/plan/50_systematic_evaluation.md
@ -1,5 +1,12 @@
 # A more systematic approach to throughput prediction performance analysis

+[[PREREQUISITES]]
+* BB
+* ISA
+* ELF
+[[END]]
+
+
 * So far, evaluation only on lone basic blocks.
 * Extracted with somewhat automated methods, somewhat reproducible with manual
  effort.
@ -38,6 +45,14 @@ Benchsuite
  a given number of times per second. 2nd mode used.
 * Extract PC for each sample

+### ELF navigation: pyelftools & capstone
+
+* Present the tools
+* Pyelftools: find symbols, read ELF sections, etc.
+* Capstone: disassemble for many ISAs
+    * Inspect operands, registers, …
+    * Instruction groups: control flow instructions
+
 ### Extract BBs

 * For each sampled PC,
@ -69,3 +84,17 @@ This way, chunk only the relevant portions
 ## CesASMe

 [paper with edits]
+
+### GUS
+
+* Dynamic tool based on QEMU
+* User-defined regions of interest
+* In these regions, instrument all instructions, accesses, etc; using
+  throughput + latency + μarch models for instructions, analyze resource
+  usage, produce cycles prediction
+* Sensitivity analysis: by tweaking the model (multiplying cost of some
+  resources by a factor), can stress/alleviate parts of the model
+    * Determine if a resource is a bottleneck
+* Dynamic with heavy instrumentation => slow
+* Very detailed insight
+* In particular, access to real-run instruction dependencies
--- a/plan/60_staticdeps.md
+++ b/plan/60_staticdeps.md
@ -1,5 +1,16 @@
 # Static extraction of memory-carried dependencies

+[[PREREQUISITES]]
+* CesASMe results
+* Gus
+* Static vs dynamic
+* PC
+* μarch: μop, renamer, L1-res, ROB
+* Osaca
+* UiCA
+[[END]]
+
+
 ## Intro

 * Previous chapt. : effect of mem-carried deps
@ -62,6 +73,7 @@ On SKX,
      instructions are out of the ROB anyway
        * 224 μops in Intel's Skylake, 2015
        * 512 μops in Intel's Golden Cove, 2021
+        * Source: https://fuse.wikichip.org/news/6111/intel-details-golden-cove-next-generation-big-core-for-client-and-server-socs/ [consulted 2023-09-13]
    * Can unroll until we have ~|ROB|+|K| instructions in the kernel: since
      instructions yield at least a μop, safe [TODO check]
    * Sometimes unrolled only once, eg. Osaca. Not sufficient; eg. Fibo.
@ -95,8 +107,13 @@ On SKX,
  1st kernel cannot depend on the previous kernel unroll); if it happens in the
  majority of cases, keep; else drop

-* Semantics of asm coming from Valgrind's IR -- should be portable to any
-  architecture supported
+* We need semantics for our assembly
+
+### Valgrind's VEX
+
+* Introduce Valgrind as an instrumentation tool
+* Introduce VEX
+* Should be portable to any architecture supported
    * but suffers limitations for recent extension sets; eg avx512 not
      supported (TODO check)

@ -104,7 +121,7 @@ On SKX,

 * Does not track aliasing that originates from outside of the kernel.
    * As advocated in CesASMe, would require a broader analysis range
-* Randomness may lead to false positives
+* Randomness may (theoretically) lead to false positives
    * but re-running with a different seed should almost entirely eliminate
      the hazard
 * Should not have false negatives outside of aliasing or unsupported ops
@ -135,14 +152,15 @@ Then, compare with staticdeps: `eval/vg_depsim.py` script.
    * use genbench's bb split/occurrences to retrieve basic blocks
    * for each BB with more than 10% of max BB hits,
    * predict deps with staticdeps
-        * cache the result: fast, but we're dealing with 3500 files.
+        * cache the result: staticdeps is fast, but we're dealing with 3500
+          files.
    * translate staticdeps' periodic deps to PC deps, discard the `iter`
      parameter
    * for each dependency from the depsim results that occurs inside this BB,
        * check if found or missed, append to a list
 * score: `|found| / (|found| + |missed|)`. Discards occurrences.
-* limitation: will only find deps from/to the same BB! Dependencies leaving
-  a BB are discarded.
+* limitation: will only find deps from/to the same BB! Dependencies leaving a
+  BB are discarded.

 * Result: about 38% of deps found; 44% if weighting by occurrences
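
The score above, along with a variant weighted by occurrences, can be sketched as follows. This is not the content of `eval/vg_depsim.py`: the function name, the `(src_pc, dst_pc)` representation of a dependency and the weighting scheme are assumptions made for illustration.

```python
def dependency_score(reference, predicted, weighted=False):
    """Compute |found| / (|found| + |missed|) for one set of dependencies.

    `reference` maps each (src_pc, dst_pc) dependency seen in the dynamic
    baseline to its number of occurrences; `predicted` is an iterable of
    (src_pc, dst_pc) pairs reported by the static analysis (both hypothetical
    representations). With `weighted=True`, each dependency counts as many
    times as it occurred; otherwise occurrences are discarded.
    """
    predicted = set(predicted)
    found = missed = 0
    for dep, occurrences in reference.items():
        weight = occurrences if weighted else 1
        if dep in predicted:
            found += weight
        else:
            missed += weight
    if found + missed == 0:
        return None  # no reference dependency: the score is undefined
    return found / (found + missed)
```

Dependencies predicted statically but absent from the reference do not affect this score: it measures how much of the baseline is recovered, not the false positive rate.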

@ -218,3 +236,7 @@ TODO ?
  analysis
    * but might not be true for other applications that require dependencies
      detection
+
+### Speed
+
+TODO: evaluate speed?
--- a/plan/main.md
+++ b/plan/main.md
@ -2,7 +2,10 @@

 ## 10. Introduction

-## 20. State of the art
+## 20. Foundations
+
+Introduce the related works, present their techniques, define and introduce
+notions, …

 ## 30. Palmed: automatically modelling the backend

--- a/plan/to_introduce_early.md
+++ b/plan/to_introduce_early.md
@ -0,0 +1,30 @@
+# Stuff that must be introduced early (intro/foundations)
+
+* Static vs. dynamic
+* PC
+* ELF
+* ISA
+* Assembly
+* SIMD
+* Basic block
+* μarch:
+    * frontend
+    * ports
+    * in-order/out-of-order
+    * pipeline
+    * Mop
+    * μop
+    * renamer
+    * ROB
+    * L1-residence
+* HW counters
+* Tools:
+    * IACA
+    * llvm-mca
+    * Osaca
+    * uops.info
+    * UiCA
+    * PMEvo
+
+* Define Cycles(K): retired instructions
+* Define notion of bottleneck