diff --git a/plan/40_a72_frontend.md b/plan/40_a72_frontend.md
index 4d49962..8fb88c4 100644
--- a/plan/40_a72_frontend.md
+++ b/plan/40_a72_frontend.md
@@ -7,6 +7,7 @@
 * heatmap representation: uops gone wild
 * example of a frontend-bound microkernel
 * Palmed's vision of a frontend
+* Real difference: in-order
 * UiCA: OK, but it's more complicated
 
 ## Cortex A72
@@ -25,7 +26,10 @@
   * FP0
   * FP1
 * Frontend: 3 insn/cycle
-  * very limiting: eg. (?)
+  * very limiting compared to its backend (TODO: example?)
+
+* Very few hardware counters regarding the frontend! In particular, no access
+  *at all* to macro-ops. No micro-op count.
 
 * Pure Palmed results
 
@@ -46,26 +50,28 @@
 * Measure with Pipedream
 * General case: F(i) = 3xCyc(B) - |simples|
 
+* Examples: find the μop-count of
+  * `ADD_RD_SP_W_SP_RN_SP_W_SP_AIMM` (1)
+  * `ADDV_FD_H_VN_V_8H` (2)
+
 ### Bubbles
 
-* Instructions:
-  * `ADD_RD_SP_W_SP_RN_SP_W_SP_AIMM`, eg. `add w0, w1, #0x10`: 1 μop, Int01
-  * `MUL_RD_W_RN_W_RM_W`, eg. `mul w0,w1,w2`, 1μop, IntM
-  * `FMIN_FD_D_FN_D_FM_D`, eg `fmin d0, d1, d2`, 1μop, FP01
-  * `ADDV_FD_H_VN_V_8H`, eg. `addv h0, v0.8h`, 2μop, FP01
-  [doc](https://developer.arm.com/documentation/ddi0602/2022-12/Base-Instructions/ADD--immediate---Add--immediate--?lang=en)
+The frontend is not as simple as a linear resource.
 
-* Example:
-  * `add`: 1/2 cycle (backend bound)
-  * `addv`: 1 cycle (backend bound)
-  * `add; fmin`: 2/3 cycle (frontend bound)
-  * `add; addv`: 1 cycle (frontend bound)
-  * [find nice example]
+* Example: `addv` + 3x `add`. Expect 1.67c, actually 2c.
 
-### Adopted model
+From now on, we try to find models answering:
+> given a kernel K, how many (frac.) cycles does it take to be decoded in
+> steady state?
+
+### No-cross model
+
+* Hypothesis: the frontend cannot split a multi-μop instruction across cycle
+  boundaries.
+
+* Reasonable: similar behaviour on x86-64 [uica] (?? investigate)
+* Would explain the example above [show again].
 
-* Question: given a kernel K, how many (frac.) cycles does it take to be
-  decoded in steady state?
 * Frontend state ∈ [|0,2|]: how many μops already decoded this cycle
 * Assumption: if st > 0 and i s.t. F(i) > 1 cannot be decoded fully without
   crossing a cycle boundary, we leave a bubble and start decoding it next cycle
@@ -75,8 +81,51 @@
 * Execute K enough (t) times to reach the same state as the first
 * Result is C(K^t) / t
 
+* …although, this is still wrong: it predicts incorrectly on `addv + 2x add`.
+
+### Dispatch-queues model
+
+* Found in the optimisation manual
+* Dispatcher: limited to 3 μops/cycle
+* But also has dispatch queues with tighter limits
+
+#### Finding a dispatch model
+
+Two sources of data:
+* Palmed
+* Optim manual
+Plus Pipedream experiments.
+
+* Palmed not usable as-is: its resources are not an accurate, 1-to-1 match
+* However, good basis: eg. Ld, St ports are a 1-to-1 match
+* Multiple resources not coalesced for eg. Int, FP01
+* For each insn class:
+  * generate a base dispatch model with Palmed
+  * cross-check with the manual
+
+* Some special cases (see the sketch below):
+  * More #dispatch than #uop: does not happen
+  * Single #dispatch, multiple #uop: replicate the dispatch #uop times
+  * #dispatch = #uop > 1: arbitrary order. This is a problem, but future
+    work.
+  * 1 < #dispatch < #uop: unsupported. Only 35 insn/1749.
+
+* The model is a very simple version of the abstract resources model: indeed,
+  FP0 and FP1 are separate dispatch queues, yet some μops hit FP01.
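+
+A minimal sketch (hypothetical Python, not the actual Palmed code) of what the
+resulting per-instruction dispatch model could look like. The queue names echo
+the ports above, the per-cycle caps are placeholder values to be read off the
+optimisation manual, and `per_uop_queues` / `EXPAND` are names of mine encoding
+the special cases listed above:
+
+```python
+# Placeholder dispatch limits: a global dispatcher cap (3 μops/cycle, from the
+# manual) and per-queue caps (made-up values, to be taken from the manual).
+# These would be enforced per cycle by the full dispatch simulation.
+TOTAL_DISPATCH = 3
+QUEUE_CAPS = {"Int01": 2, "IntM": 1, "Ld": 1, "St": 1, "FP0": 1, "FP1": 1}
+
+# Combined Palmed classes map to several acceptable queues.
+EXPAND = {"FP01": {"FP0", "FP1"}}
+
+def per_uop_queues(n_uops, ports):
+    """One set of candidate queues per μop, following the special cases above."""
+    queues = [EXPAND.get(p, {p}) for p in ports]
+    if len(queues) == 1:
+        # single #dispatch, multiple #uop: replicate it #uop times
+        return queues * n_uops
+    if len(queues) == n_uops:
+        # one dispatch per μop; the pairing order is arbitrary (future work)
+        return queues
+    # 1 < #dispatch < #uop: unsupported (only 35 insns out of 1749)
+    raise NotImplementedError(f"{len(queues)} dispatches for {n_uops} μops")
+
+# The two examples worked out above:
+add_uops  = per_uop_queues(1, ["Int01"])   # [{'Int01'}]
+addv_uops = per_uop_queues(2, ["FP01"])    # [{'FP0', 'FP1'}] * 2
+```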
+
+#### Implementing the model
+
+* Assuming each insn has at least 1 μop, the 3 μop/cycle dispatcher is always
+  at least as restrictive as the 3 insn/cycle decoder: it is the bottleneck
+* The state *at the end of a kernel* is still determined only by the dispatch position
+* => the same algorithm + keeping track of the queues still works (sketch at the end)
+
 ### Evaluation on Palmed
 
-* Add this model to Palmed: for each kernel, simply take
+* Add these models to Palmed: for each kernel, simply take
   max(frontend(K), Palmed(k))
 * Results
+
+### Discussion: how to generalize
+
+[TODO]
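+
+A minimal sketch (hypothetical Python; the function names are mine) of the
+steady-state computation described under "No-cross model" and "Implementing
+the model" above. It only covers the base no-cross model; the dispatch-queues
+variant would additionally keep per-queue counters inside `kernel_cycles`:
+
+```python
+from fractions import Fraction
+
+WIDTH = 3  # the frontend handles 3 μops per cycle
+
+def kernel_cycles(uops, state=0):
+    """Decode one kernel instance. `uops` lists F(i) for each instruction,
+    `state` is how many μops were already decoded in the current cycle.
+    Returns (cycle boundaries crossed, new state)."""
+    cycles = 0
+    for f in uops:
+        if state > 0 and state + f > WIDTH:
+            # no-cross hypothesis: leave a bubble, restart next cycle
+            cycles, state = cycles + 1, 0
+        state += f
+        cycles += state // WIDTH
+        state %= WIDTH
+    return cycles, state
+
+def steady_state_cycles(uops):
+    """Repeat K until the start-of-kernel state repeats; return C(K^t) / t
+    over that period, as a fraction of a cycle."""
+    seen = {}  # state -> (iterations done, cycles consumed) when first seen
+    state, cycles, t = 0, 0, 0
+    while state not in seen:
+        seen[state] = (t, cycles)
+        c, state = kernel_cycles(uops, state)
+        cycles, t = cycles + c, t + 1
+    t0, c0 = seen[state]
+    return Fraction(cycles - c0, t - t0)
+
+# Examples from above: addv (2 μops) followed by adds (1 μop each).
+print(steady_state_cycles([2, 1, 1, 1]))  # 2    (matches the 2c measurement)
+print(steady_state_cycles([2, 1, 1]))     # 3/2  (the case the model gets wrong)
+```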