a72 frontend: rework with broken UopNoCross model

2023-09-06 15:59:13 +02:00
commit 277b9e8483
parent df32b9d0f9
1 changed file with 66 additions and 17 deletions
--- a/plan/40_a72_frontend.md
+++ b/plan/40_a72_frontend.md
@@ -7,6 +7,7 @@
    * heatmap representation: uops gone wild
    * example of a frontend-bound microkernel
 * Palmed's vision of a frontend
+* Real difference: in-order
 * UiCA: OK, but it's more complicated

 ## Cortex A72
@@ -25,7 +26,10 @@
    * FP0
    * FP1
 * Frontend: 3 insn/cycle
-    * very limiting: eg. (?)
+    * very limiting compared to its backend (TODO: example?)
+
+* Very few hardware counters regarding the frontend! In particular, no access
+  *at all* to macro-ops. No micro-op count.

 * Pure Palmed results

@@ -46,26 +50,28 @@
        * Measure with Pipedream
    * General case: F(i) = 3xCyc(B) - |simples|

+* Examples: find the μop-count of
+    * `ADD_RD_SP_W_SP_RN_SP_W_SP_AIMM` (1)
+    * `ADDV_FD_H_VN_V_8H` (2)
+
 ### Bubbles

-* Instructions:
-    * `ADD_RD_SP_W_SP_RN_SP_W_SP_AIMM`, eg. `add w0, w1, #0x10`: 1 μop, Int01
-    * `MUL_RD_W_RN_W_RM_W`, eg. `mul w0,w1,w2`, 1μop, IntM
-    * `FMIN_FD_D_FN_D_FM_D`, eg `fmin d0, d1, d2`, 1μop, FP01
-    * `ADDV_FD_H_VN_V_8H`, eg. `addv h0, v0.8h`, 2μop, FP01
-      [doc](https://developer.arm.com/documentation/ddi0602/2022-12/Base-Instructions/ADD--immediate---Add--immediate--?lang=en)
+The frontend is not as simple as a linear resource.

-* Example:
-    * `add`: 1/2 cycle (backend bound)
-    * `addv`: 1 cycle (backend bound)
-    * `add; fmin`: 2/3 cycle (frontend bound)
-    * `add; addv`: 1 cycle (frontend bound)
-    * [find nice example]
+* Example: addv + 3x add. Expect 1.67c, actually 2c.

-### Adopted model
+From now on, we try to find models answering:
+> given a kernel K, how many (frac.) cycles does it take to be decoded in
+> steady state?
+
+### No-cross model
+
+* Hypothesis: the frontend cannot decode a multi-μop instruction across cycle
+  boundaries.
+
+* Reasonable: similar things on x86-64 [uica] (?? investigate)
+* Would explain the example above [show again].

-* Question: given a kernel K, how many (frac.) cycles does it take to be
-  decoded in steady state?
 * Frontend state ∈ [|0,2|]: how many μops already decoded this cycle
 * Assumption: if st > 0 and i s.t. F(i) > 1 cannot be decoded fully without
  crossing a cycle boundary, we leave a bubble and start decoding it next cycle
@@ -75,8 +81,51 @@
 * Execute K enough (t) times to reach the same state as the first
 * Result is C(K^t) / t

+* …although this is crappy: it mispredicts `addv + 2x add`.
+
+### Dispatch-queues model
+
+* Found in the optimisation manual
+* Dispatcher: limits to 3μop/cycle
+* But also has dispatch queues with tighter limits
+
+#### Finding a dispatch model
+
+Two sources of data:
+* Palmed
+* Optimisation manual
+Plus Pipedream experiments.
+
+* Palmed not usable as-is: its resources are not an accurate, 1-to-1 match
+* However, good basis: eg. Ld, St ports are 1-1 match
+* Multiple resources not coalesced for eg. Int, FP01
+* For each insn class,
+    * generate a base dispatch model with Palmed
+    * cross-check with manual
+
+* Some special cases.
+    * More #dispatch than #uop: does not happen
+    * Single #dispatch, multiple #uop: replicate dispatch #uop times
+    * #dispatch = #uop > 1: arbitrary order. This is a problem, but future
+      work.
+    * 1 < #dispatch < #uop: unsupported. Only 35 of 1749 insns.
+
+* The model is a very simple version of an abstract resources model: indeed,
+  FP0 and FP1 are separate dispatch queues, yet some μops hit FP01.
+
+#### Implementing the model
+
+* Assuming each insn has at least 1μop, the dispatcher is always the frontend's
+  bottleneck
+* The state *at the end of a kernel* is still determined only by dispatch pos
+* => the same algorithm + keep track of queues still works
+
 ### Evaluation on Palmed

-* Add this model to Palmed: for each kernel, simply take
+* Add these models to Palmed: for each kernel, simply take
  max(frontend(K), Palmed(k))
 * Results
+
+### Discussion: how to generalize
+
+[TODO]
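
Note on the no-cross model added in this commit: the state machine it describes (frontend state in [|0,2|], bubble when a multi-μop instruction would straddle a cycle boundary, result C(K^t) / t) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the implementation behind these notes: the names `WIDTH`, `linear_cycles` and `no_cross_cycles` are hypothetical, a kernel is represented simply as the list of its instructions' μop counts F(i), and the loop stops on the first repeated end-of-iteration state rather than strictly waiting for the initial state.

```python
from fractions import Fraction

WIDTH = 3  # frontend width in μops per cycle, as stated in the notes


def linear_cycles(uops):
    """Cycles per iteration if the frontend were a plain 3-μop/cycle resource."""
    return Fraction(sum(uops), WIDTH)


def no_cross_cycles(uops):
    """Steady-state cycles per iteration under the no-cross hypothesis.

    `uops` lists the μop count F(i) of each kernel instruction, in order.
    """
    state = 0    # μops already decoded in the current cycle, in [0, WIDTH - 1]
    cycles = 0   # completed cycles so far
    seen = {}    # end-of-iteration state -> (iteration index, cycles so far)
    iteration = 0
    while True:
        for f in uops:
            # Bubble: a multi-μop instruction that would straddle the cycle
            # boundary is delayed to the start of the next cycle.
            if state > 0 and f > 1 and state + f > WIDTH:
                cycles += 1
                state = 0
            state += f
            cycles += state // WIDTH
            state %= WIDTH
        iteration += 1
        if state in seen:
            # Same state as after an earlier iteration: the behaviour is now
            # periodic, so average the cycle count over one period.
            first_iter, first_cycles = seen[state]
            return Fraction(cycles - first_cycles, iteration - first_iter)
        seen[state] = (iteration, cycles)
```

With the μop counts from the notes (`addv`: 2, `add`: 1), `linear_cycles([2, 1, 1, 1])` gives 5/3 and `no_cross_cycles([2, 1, 1, 1])` gives 2, which is the "expect 1.67c, actually 2c" example. For `addv + 2x add`, the sketch returns 3/2, which is the case the notes report as mispredicted.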