# Beyond ports: manually modelling the A72 frontend

## Necessity to go beyond ports

* Palmed: concerned mostly with ports
* Noticed the importance of the frontend while investigating its performance
    * heatmap representation: uops gone wild
    * example of a frontend-bound microkernel
* Palmed's vision of a frontend
* uiCA: OK, but it's more complicated

## Cortex A72

* General intro
    * ARMv8-A
    * Out of order
    * Designed as a general-purpose, high-performance core for low-power applications: "Next generation of high-efficiency compute"
    * Raspberry Pi 4 (BCM2711)
* Backend
    * 2x Int
    * IntM
    * Load
    * Store
    * FP0
    * FP1
* Frontend: 3 insn/cycle
    * very limiting: eg. (?)
* Pure Palmed results

## Manual frontend

### Base methodology

* Basis: throughput model
    * eg. Palmed, uops, official reference
* Simple instructions: 1 μop, single port.
* Categorize by Palmed quads: a~b iff ∀i, Cyc(ai) = Cyc(bi)
    * (not necessary, but reduces the number of experiments required)
* Find the impact of insn i on the frontend.
    * The frontend must be the bottleneck; build a benchmark B = i + (simples) so that the simples do not cause a bottleneck backend-wise
    * Add simples until Cyc(B) > Cyc(i)
    * Should need at most 3 × Cyc(i) - 1 simples
    * Measure with Pipedream
* General case: F(i) = 3 × Cyc(B) - |simples|

### Bubbles

* Instructions:
    * `ADD_RD_SP_W_SP_RN_SP_W_SP_AIMM`, eg. `add w0, w1, #0x10`: 1 μop, Int01
    * `MUL_RD_W_RN_W_RM_W`, eg. `mul w0, w1, w2`: 1 μop, IntM
    * `FMIN_FD_D_FN_D_FM_D`, eg. `fmin d0, d1, d2`: 1 μop, FP01
    * `ADDV_FD_H_VN_V_8H`, eg. `addv h0, v0.8h`: 2 μops, FP01
    * [doc](https://developer.arm.com/documentation/ddi0602/2022-12/Base-Instructions/ADD--immediate---Add--immediate--?lang=en)
* Example:
    * `add`: 1/2 cycle (backend bound)
    * `addv`: 1 cycle (backend bound)
    * `add; fmin`: 2/3 cycle (frontend bound)
    * `add; addv`: 1 cycle (frontend bound)
    * [find nice example]

### Adopted model

* Question: given a kernel K, how many (frac.) cycles does it take to be decoded in steady state?
* Frontend state st ∈ [|0,2|]: how many μops already decoded this cycle
* Assumption: if st > 0 and an instruction i with F(i) > 1 cannot be decoded fully without crossing a cycle boundary, we leave a bubble and start decoding it next cycle
* next: st ↦ state after executing K
    * longest "cycle" of next is at most 3
    * thus next^3(0) brings us into steady state
* Execute K enough (t) times to reach the same state as the first
* Result is C(K^t) / t

### Evaluation on Palmed

* Add this model to Palmed: for each kernel, simply take max(frontend(K), Palmed(K))
* Results
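
A minimal Python sketch of the adopted frontend model and of the `max(frontend(K), Palmed(K))` combination, under stated assumptions. The function names (`decode_once`, `frontend_cycles`, `combined_cycles`) are hypothetical and do not come from Palmed. F(i) is assumed to be a positive integer, and in the sample table it is assumed to equal the μop count of each instruction. The backend figure is passed in as a plain number rather than queried from Palmed.

```python
from fractions import Fraction

WIDTH = 3  # frontend slots per cycle (3 on the Cortex A72, per the notes above)


def decode_once(kernel, f, state):
    """Decode one iteration of `kernel`, starting with `state` slots already
    used in the current cycle. Returns (completed cycles, new state)."""
    cycles = 0
    for insn in kernel:
        cost = f[insn]  # F(i), assumed to be a positive integer
        if state > 0 and cost > 1 and state + cost > WIDTH:
            # Bubble: the instruction does not fit in what is left of this
            # cycle, so the remaining slots are wasted and decoding restarts
            # at the next cycle boundary.
            cycles += 1
            state = 0
        total = state + cost
        cycles += total // WIDTH
        state = total % WIDTH
    return cycles, state


def frontend_cycles(kernel, f):
    """Steady-state decode cost of `kernel`, in cycles per iteration."""
    state = 0
    # `next` maps {0, 1, 2} to itself, so three applications from state 0
    # are enough to land on a periodic state.
    for _ in range(WIDTH):
        _, state = decode_once(kernel, f, state)
    start = state
    total, t = 0, 0
    while True:
        cycles, state = decode_once(kernel, f, state)
        total += cycles
        t += 1
        if state == start:  # back to the same state after t iterations
            return Fraction(total, t)  # C(K^t) / t


def combined_cycles(kernel, f, backend_cycles):
    """max(frontend(K), backend(K)); `backend_cycles` stands in for Palmed(K)."""
    return max(frontend_cycles(kernel, f), backend_cycles)


# Assumed F(i) values: one frontend slot per μop.
F = {"add": 1, "mul": 1, "fmin": 1, "addv": 2}

print(frontend_cycles(["add", "fmin"], F))  # 2/3
print(frontend_cycles(["add", "addv"], F))  # 1
print(combined_cycles(["add", "fmin"], F, Fraction(1, 2)))  # 2/3
```

With these assumed F(i) values, the sketch gives 2/3 cycle for `add; fmin` and 1 cycle for `add; addv`, matching the frontend-bound figures in the Bubbles example. Exact fractions are used so that steady-state results such as 2/3 are not blurred by floating-point rounding.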