Beyond ports: manually modelling the A72 frontend
- Microarch: frontend, ports, in-order/OoO, μops vs Mops (micro- vs macro-ops)
- Assembly
- SIMD
- Def. Cyc(K): steady-state cycles per iteration of kernel K, measured via retired instructions
- Palmed, Palmed results
- Palmed instruction classes
- Pipedream
- uops.info
- Notion of bottleneck
Necessity to go beyond ports
- Palmed: concerned mostly with ports
- Noticed the importance of the frontend while investigating its performance
- heatmap representation: uops predicts unreachably high IPCs (eg. 8 on SKX)
- example of a frontend-bound microkernel
- Palmed's vision of a frontend
- Real difference from port models: the frontend is in-order
- UiCA: shows that frontends are important; implements Intel frontends
Cortex A72
- General intro
  - ARMv8-A
  - Out-of-order
  - Designed as a general-purpose, high-performance core for low-power applications: "Next generation of high-efficiency compute"
  - Raspberry Pi 4 (BCM2711)
- Backend
  - 2x Int
  - IntM
  - Load
  - Store
  - FP0
  - FP1
- Frontend: 3 insn/cycle
  - Very limiting compared to its backend.
  - Example: 2nd-order polynomial evaluation P[i] = a·X[i]² + b·X[i] + c, computed with Horner's rule as P[i] = (a·X[i] + b)·X[i] + c, ie. load, FMAdd, FMAdd, store per element. The backend keeps up, but at 4 instructions per element the frontend needs 4/3 ≈ 1.33 cycles: the kernel is frontend-bound. (See the sketch at the end of this list.)
  - Very few hardware counters regarding the frontend! In particular, no access at all to macro-ops, and no μop count.
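To make the bounds arithmetic concrete, a minimal Python sketch; the port mapping of the four instructions below is an illustrative assumption, not a verified A72 mapping.

    from collections import Counter

    FRONTEND_WIDTH = 3  # A72 decodes at most 3 instructions per cycle

    def throughput_bounds(kernel):
        """kernel: one port name per instruction (assuming 1 uop each).
        Returns (backend bound, frontend bound) in cycles/iteration."""
        backend = max(Counter(kernel).values())   # busiest port
        frontend = len(kernel) / FRONTEND_WIDTH   # decode width limit
        return backend, frontend

    # Horner kernel: load, fmadd, fmadd, store -> (1, 1.333...):
    # the frontend, not the backend, limits the loop.
    print(throughput_bounds(["Load", "FP0", "FP1", "Store"]))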
Pure Palmed results
Manual frontend
Base methodology
- Basis: a throughput model of the backend
  - eg. Palmed, uops.info, the official reference
- Simple instructions: a single μop, on a single port.
- Categorise instructions by Palmed quads: a ~ b iff ∀i, Cyc(ai) = Cyc(bi) (see the sketch below)
  - (not necessary, but reduces the number of experiments required)
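A minimal sketch of this bucketing, assuming a hypothetical table cyc[a][i] = Cyc(ai) of measured cycles for each candidate instruction a against each probe instruction i:

    def equivalence_classes(cyc):
        """Group instructions whose Cyc(. i) profile agrees on every probe i."""
        classes = {}
        for a, row in cyc.items():
            signature = tuple(sorted(row.items()))  # full measured profile of a
            classes.setdefault(signature, []).append(a)
        return list(classes.values())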
- Find the impact of an instruction i on the frontend
  - The frontend must be the bottleneck: build a benchmark B = i + (simples) such that the simples do not cause a backend-side bottleneck
  - Add simples until Cyc(B) > Cyc(i)
  - Should need at most 3 x Cyc(i) - 1 simples
  - Measure with Pipedream
  - General case: F(i) = 3 x Cyc(B) - |simples| (see the sketch after the examples below)
- Examples: find the μop count of
  - ADD_RD_SP_W_SP_RN_SP_W_SP_AIMM (1 μop)
  - ADDV_FD_H_VN_V_8H (2 μops)
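A sketch of the measurement loop above; pipedream_cycles is a hypothetical wrapper returning Pipedream's measured cycles per kernel iteration, and simple is a 1-μop instruction chosen not to compete with insn for backend ports.

    WIDTH = 3  # A72 frontend: 3 insn/cycle

    def uop_count(insn, simple, pipedream_cycles):
        base = pipedream_cycles([insn])          # Cyc(i)
        max_n = round(WIDTH * base) - 1          # at most 3 x Cyc(i) - 1 simples
        for n in range(1, max_n + 1):
            cyc = pipedream_cycles([insn] + [simple] * n)
            if cyc > base:                       # frontend became the bottleneck
                return round(WIDTH * cyc - n)    # F(i) = 3 x Cyc(B) - |simples|
        return 1                                 # frontend never limited i: 1 uop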
Bubbles
The frontend is not as simple as a linear resource.
- Example: addv (2 μops) + 3x add (1 μop each): expect 5/3 ≈ 1.67 cycles, measure 2 cycles.
From now on, we try to find models answering: given a kernel K, how many (fractional) cycles does it take to decode in steady state?
No-cross model
- Hypothesis: the frontend cannot decode a multi-μop instruction across a cycle boundary.
- Reasonable: similar behaviour exists on x86-64 -- cf the predecoder in [uica], §4.1
- Would explain the example above: in steady state, addv's two μops repeatedly land on the last decode slot of a cycle, which the no-cross rule wastes as a bubble.
- Frontend state st ∈ [|0,2|]: how many μops were already decoded this cycle
- Assumption: if st > 0 and an instruction i with F(i) > 1 cannot be decoded fully without crossing a cycle boundary, we leave a bubble and start decoding it next cycle
- next: st ↦ state after decoding K once
- the longest "cycle" of next is at most 3 states long
- thus next³(0) brings us into steady state
- decode K enough times (t) to reach the same state again
- the result is Cyc(K^t) / t (see the sketch below)
- …although this model is still imperfect: it mispredicts on addv + 2x add.
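A minimal sketch of this model; the input is the list of per-instruction μop counts F(i), and all names are mine:

    WIDTH = 3  # decode slots per cycle

    def decode_once(uops, st):
        """Decode one kernel iteration from state st (slots already used
        this cycle). Returns (new state, slots consumed, bubbles included)."""
        slots = 0
        for f in uops:
            if st > 0 and f > 1 and st + f > WIDTH:
                slots += WIDTH - st      # no-cross rule: waste a bubble
                st = 0
            slots += f
            st = (st + f) % WIDTH
        return st, slots

    def nocross_cycles(uops):
        """Steady-state decode cost, in cycles per kernel iteration."""
        st = 0
        for _ in range(WIDTH):           # next^3(0) lands in the steady cycle
            st, _ = decode_once(uops, st)
        first, slots, t = st, 0, 0
        while True:                      # follow the cycle back to `first`
            st, s = decode_once(uops, st)
            slots, t = slots + s, t + 1
            if st == first:
                return slots / (WIDTH * t)

    print(nocross_cycles([2, 1, 1, 1]))  # addv + 3x add: 2.0, as measured
    print(nocross_cycles([2, 1, 1]))     # addv + 2x add: mispredicted (1.5)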
Dispatch-queues model
- Found in the optimisation manual
- Dispatcher: limits the frontend to 3 μops/cycle overall
- But it also feeds dispatch queues with tighter per-cycle limits
Finding a dispatch model
Two sources of data:
- Palmed
- the optimisation manual, plus Pipedream experiments
- Palmed is not usable as-is: its resources are not an accurate, 1-to-1 match with the dispatch queues
- However, it is a good basis: eg. the Ld and St ports are a 1-to-1 match
- Multiple resources are not coalesced for eg. Int, FP01
- For each insn class:
  - generate a base dispatch model with Palmed
  - cross-check it against the manual
- Some special cases (see the sketch below):
  - #dispatch > #uop: does not happen
  - single #dispatch, multiple #uop: replicate the dispatch #uop times
  - #dispatch = #uop > 1: arbitrary order; this is a problem, left as future work
  - 1 < #dispatch < #uop: unsupported; only 35 insns out of 1749
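A sketch of these rules, assuming that for each class Palmed gives a μop count and the list of dispatch-queue hits (interface names are illustrative):

    def dispatch_model(n_uops, queue_hits):
        """Assign one dispatch queue per uop, following the cases above."""
        if len(queue_hits) > n_uops:
            raise ValueError("more dispatch hits than uops: never observed")
        if len(queue_hits) == 1:          # replicate the queue across all uops
            return queue_hits * n_uops
        if len(queue_hits) == n_uops:     # one queue per uop, arbitrary order
            return list(queue_hits)
        raise NotImplementedError(        # 1 < #dispatch < #uop: 35/1749 insns
            "partial dispatch information is unsupported")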
- The model is a very simple version of an abstract resources model: indeed, FP0 and FP1 are separate dispatch queues, yet some μops may hit either (FP01).
Implementing the model
- Assuming each insn has at least 1 μop, the dispatcher is always the frontend's bottleneck
- The state at the end of a kernel is still determined only by the dispatch position
- => the same algorithm still works, additionally keeping track of the queues (see the sketch below)
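A sketch extending the same steady-state algorithm with per-queue counters; the per-cycle limits below are placeholders, not the optimisation guide's actual values, and μops are dispatched one by one.

    WIDTH = 3                                        # global limit: 3 uops/cycle
    LIMITS = {"Int": 2, "Ld": 1, "St": 1, "FP": 2}   # PLACEHOLDER queue limits

    def dispatch_once(queues, state):
        """queues: one queue name per uop of the kernel, in program order.
        state: (slots used this cycle, frozen per-queue counts)."""
        pos, counts = state[0], dict(state[1])
        boundaries = 0
        for q in queues:
            if pos == WIDTH or counts.get(q, 0) == LIMITS[q]:
                boundaries += 1                      # stall to the next cycle
                pos, counts = 0, {}
            pos += 1
            counts[q] = counts.get(q, 0) + 1
        return (pos, frozenset(counts.items())), boundaries

    def dispatch_cycles(queues):
        """Steady-state cost, in cycles per kernel iteration."""
        state, seen, total, t = (0, frozenset()), {}, 0, 0
        while state not in seen:                     # finite states: must repeat
            seen[state] = (t, total)
            state, b = dispatch_once(queues, state)
            t, total = t + 1, total + b
        t0, total0 = seen[state]
        return (total - total0) / (t - t0)           # boundaries crossed per iter

    # Illustrative only: a 5-uop kernel, two "FP" uops plus three "Int" uops.
    print(dispatch_cycles(["FP", "FP", "Int", "Int", "Int"]))  # 2.0, not 5/3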
Evaluation on Palmed
- Add these models to Palmed: for each kernel K, simply predict max(frontend(K), Palmed(K)) (see the sketch below)
- Results
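The combination itself is trivial; a sketch with hypothetical wrappers around the two predictors:

    def combined_prediction(kernel, palmed_cycles, frontend_cycles):
        """Frontend and backend models are independent bounds: take the max."""
        return max(frontend_cycles(kernel), palmed_cycles(kernel))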
Discussion: how to generalize
[TODO]