# Beyond ports: manually modelling the A72 frontend

[[PREREQUISITES]]
* Microarch: frontend, ports, in-order/OoO, μ/Mop
* Assembly
* SIMD
* Def Cyc(k) -> retired insn
* Palmed, Palmed results
* Palmed instruction classes
* Pipedream
* uops.info
* Notion of bottleneck
[[END]]

## Necessity to go beyond ports

* Palmed: concerned mostly with ports
* Noticed the importance of the frontend while investigating its performance
    * heatmap representation: uops.info predicts unreachably high IPCs (eg. 8 on SKX)
    * example of a frontend-bound microkernel
* Palmed's vision of a frontend
    * Real difference: in-order
* uiCA: proves that frontends are important, implements Intel frontends

## Cortex A72

* General intro
    * ARMv8-A
    * Out of order
    * Designed as a general-purpose, high-performance core for low-power applications: "Next generation of high-efficiency compute"
    * Raspberry Pi 4 (BCM2711)
* Backend
    * 2x Int
    * IntM
    * Load
    * Store
    * FP0
    * FP1
* Frontend: 3 insn/cycle
    * very limiting compared to its backend.
    * Example: 2nd-order polynomial calculation:

      ```
      P[i] = aX[i]² + bX[i] + c
      <=> P[i] = (a*X[i] + b) * X[i] + c
      <=> P[i] = (a*X[i] + b); P[i] = P[i] * X[i] + c
      ```

      so load, FMAdd, FMAdd, store. Backend OK, frontend bottleneck.
* Very few hardware counters regarding the frontend! In particular, no access *at all* to macro-ops. No micro-op count.
* Pure Palmed results

## Manual frontend

### Base methodology

* Basis: throughput model
    * eg. Palmed, uops.info, official reference
* Simple instructions: 1μop, single port.
* Categorize by Palmed quads: a~b iff ∀i, Cyc(ai) = Cyc(bi)
    * (not necessary, but reduces number of experiments required)
* Find the impact of insn i on frontend.
    * The frontend must be the bottleneck; build a benchmark B = i + (simples) so that the simples do not cause a bottleneck backend-wise
    * Add simples until Cyc(B) > Cyc(i)
    * Should need at most 3 × Cyc(i) - 1 simples
    * Measure with Pipedream
    * General case: F(i) = 3 × Cyc(B) - |simples|
* Examples: find the μop-count of
    * `ADD_RD_SP_W_SP_RN_SP_W_SP_AIMM` (1)
    * `ADDV_FD_H_VN_V_8H` (2)

### Bubbles

The frontend is not as simple as a linear resource.

* Example: addv + 3x add. Expect 1.67c, actually 2c.

From now on, we try to find models answering:

> given a kernel K, how many (frac.) cycles does it take to be decoded in
> steady state?

### No-cross model

* Hypothesis: the frontend cannot decode a multi-μop instruction across cycle boundaries.
    * Reasonable: similar things on x86-64 -- cf [uica] predecoder §4.1
    * Would explain the example above [show again].
* Frontend state ∈ {0, 1, 2}: how many μops already decoded this cycle
* Assumption: if st > 0 and i s.t. F(i) > 1 cannot be decoded fully without crossing a cycle boundary, we leave a bubble and start decoding it next cycle
* next: st ↦ st after executing K
    * longest "cycle" of next is at most 3
    * thus next^3(0) brings us into steady state
* Execute K enough times (t) to come back to the state we started from
* Result is C(K^t) / t
* …although, this is crappy: predicts incorrectly on `addv + 2x add`.

### Dispatch-queues model

* Found in the optimisation manual
* Dispatcher: limits to 3μop/cycle
* But also has dispatch queues with tighter limits

#### Finding a dispatch model

Two sources of data:

* Palmed
* Optim manual

Plus Pipedream experiments.

* Palmed not usable as-is: resources are not an accurate, 1-to-1 match
    * However, good basis: eg. Ld, St ports are 1-1 match
    * Multiple resources not coalesced for eg. Int, FP01
* For each insn class,
    * generate a base dispatch model with Palmed
    * cross-check with manual
* Some special cases.
    * More #dispatch than #uop: does not happen
    * Single #dispatch, multiple #uop: replicate dispatch #uop times
    * #dispatch = #uop > 1: arbitrary order. This is a problem, but future work.
    * 1 < #dispatch < #uop: unsupported. Only 35 insns out of 1749.
* The model is a very simple version of the abstract resources model: indeed, FP0 and FP1 are separate dispatch queues, yet some μops hit FP01.

#### Implementing the model

* Assuming each insn has at least 1μop, the dispatcher is always the frontend's bottleneck
* The state *at the end of a kernel* is still determined only by dispatch pos
* => the same algorithm + keep track of queues still works

### Evaluation on Palmed

* Add these models to Palmed: for each kernel, simply take max(frontend(K), Palmed(K))
* Results

### Discussion: how to generalize

[TODO]
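
Side note on the no-cross model from earlier in these notes: its steady-state computation can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual implementation. The names `run_kernel` and `steady_state_cycles` are hypothetical, a kernel is assumed to be given as a list of per-instruction frontend μop counts F(i) ≥ 1, and the frontend width is fixed at 3 μops per cycle.

```python
from fractions import Fraction

WIDTH = 3  # frontend width in μops per cycle (A72 figure used in these notes)


def run_kernel(kernel, state):
    """Decode one iteration of `kernel` under the no-cross hypothesis.

    `kernel` is a list of per-instruction μop counts F(i) (assumed >= 1);
    `state` is the number of μops already decoded in the current cycle.
    Returns (number of cycles completed, state after the iteration).
    """
    cycles = 0
    for uops in kernel:
        # A multi-μop instruction that would straddle a cycle boundary
        # is delayed: leave a bubble and start it on a fresh cycle.
        if state > 0 and uops > 1 and state + uops > WIDTH:
            cycles += 1
            state = 0
        state += uops
        cycles += state // WIDTH  # every full group of WIDTH μops ends a cycle
        state %= WIDTH
    return cycles, state


def steady_state_cycles(kernel):
    """Cycles per iteration of `kernel` in steady state, as an exact fraction."""
    state = 0
    # next^3(0): three warm-up iterations are enough to land on the
    # periodic part of the state sequence, since there are only 3 states.
    for _ in range(WIDTH):
        _, state = run_kernel(kernel, state)
    start, total, t = state, 0, 0
    # Iterate until the state comes back to `start`, then average: C(K^t) / t.
    while True:
        cycles, state = run_kernel(kernel, state)
        total += cycles
        t += 1
        if state == start:
            return Fraction(total, t)
```

Taking F(addv) = 2 and F(add) = 1 as measured above, `steady_state_cycles([2, 1, 1, 1])` evaluates to 2 cycles for `addv + 3x add`, which agrees with the observation in the Bubbles section.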