Beyond ports: manually modelling the A72 frontend
Necessity to go beyond ports
- Palmed: concerned mostly with ports
- Noticed the importance of the frontend while investigating its performance
- heatmap representation: uops gone wild
- example of a frontend-bound microkernel
- Palmed's vision of a frontend
- UiCA: OK, but it's more complicated
Cortex A72
- General intro
  - ARMv8-A
  - Out of order
  - Designed as general-purpose, high-performance core for low-power applications "Next generation of high-efficiency compute"
  - Raspberry Pi 4 (BCM2711)
- Backend
  - 2x Int
  - IntM
  - Load
  - Store
  - FP0
  - FP1
- Frontend: 3 insn/cycle
  - very limiting: eg. (?)
- Pure Palmed results
Manual frontend
Base methodology
- Basis: throughput model
- eg. Palmed, uops, official reference
- Simple instructions: 1μop, single port.
- Categorize by Palmed quads: a~b iff ∀i, Cyc(ai) = Cyc(bi)
- (not necessary, but reduces number of experiments required)
- Find the impact of insn i on frontend.
- The frontend must be the bottleneck: build a benchmark B = i + (simples) such that the simples do not cause a bottleneck in the backend
- Add until Cyc(B) > Cyc(i)
- Should need at most 3 × Cyc(i) - 1 simples
- Measure with Pipedream
- General case: F(i) = 3 × Cyc(B) - |simples|
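The measurement loop above can be sketched as follows. This is only an illustration under stated assumptions: `measure` is a hypothetical callback returning Cyc(B) for the benchmark made of i plus n simple instructions (for instance a wrapper around a Pipedream run; it is not Pipedream's actual API), the names `frontend_cost` and `probe` are made up for this sketch, and `max_simples` is an arbitrary safety cap.

```python
from fractions import Fraction

DECODE_WIDTH = 3  # frontend width from the notes: 3 insn/cycle


def frontend_cost(cyc_b, n_simples):
    """F(i) = 3 * Cyc(B) - |simples|, valid when B is frontend-bound."""
    return DECODE_WIDTH * Fraction(cyc_b) - n_simples


def probe(cyc_i, measure, max_simples=16):
    """Add simples one by one until Cyc(B) > Cyc(i), then apply the formula.

    `measure(n)` is a hypothetical callback: it must return Cyc(B) for
    B = i + n simples, with the simples chosen so that they do not
    saturate the backend.
    """
    for n in range(1, max_simples + 1):
        cyc_b = Fraction(measure(n))
        if cyc_b > cyc_i:
            return frontend_cost(cyc_b, n)
    raise ValueError("benchmark never became frontend-bound")


if __name__ == "__main__":
    # Synthetic stand-in for a real measurement: an instruction with
    # backend Cyc(i) = 1 that occupies 2 frontend slots.
    fake = lambda n: max(Fraction(1), Fraction(2 + n, DECODE_WIDTH))
    print(probe(Fraction(1), fake))
```

Applied directly to the measurements listed under Bubbles, `frontend_cost(1, 1)` gives F(addv) = 2 from `add; addv` at 1 cycle, and `frontend_cost(Fraction(2, 3), 1)` gives F(fmin) = 1 from `add; fmin` at 2/3 cycle.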
Bubbles
- Instructions:
  - ADD_RD_SP_W_SP_RN_SP_W_SP_AIMM, eg. add w0, w1, #0x10: 1 μop, Int01
  - MUL_RD_W_RN_W_RM_W, eg. mul w0, w1, w2: 1 μop, IntM
  - FMIN_FD_D_FN_D_FM_D, eg. fmin d0, d1, d2: 1 μop, FP01
  - ADDV_FD_H_VN_V_8H, eg. addv h0, v0.8h: 2 μop, FP01 (doc)
- Example:
  - add: 1/2 cycle (backend bound)
  - addv: 1 cycle (backend bound)
  - add; fmin: 2/3 cycle (frontend bound)
  - add; addv: 1 cycle (frontend bound)
  - [find nice example]
Adopted model
- Question: given a kernel K, how many (frac.) cycles does it take to be decoded in steady state?
- Frontend state ∈ [|0,2|]: how many μops already decoded this cycle
- Assumption: if st > 0 and an instruction i with F(i) > 1 cannot be fully decoded without crossing a cycle boundary, we leave a bubble and start decoding i in the next cycle
- next: st ↦ frontend state after executing K from state st
- longest "cycle" of next is at most 3
- thus next^3(0) brings us into steady state
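- From there, execute K repeatedly (t times) until the frontend is back in the state it started from
- Result is C(K^t) / t, where C(K^t) is the number of cycles taken to decode these t repetitions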
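A minimal sketch of this model, under assumptions that go beyond the notes: every F(i) is taken to be a positive integer, "cannot be decoded fully without crossing a cycle boundary" is read as st + F(i) > 3, and the function names are invented for illustration.

```python
from fractions import Fraction

WIDTH = 3  # frontend slots per cycle


def run_kernel(state, costs):
    """Decode one iteration of K starting from `state`.

    `costs` lists F(i) for each instruction of K (assumed integers).
    Returns (new_state, number_of_cycle_boundaries_crossed).
    """
    cycles = 0
    for f in costs:
        # Bubble rule (assumed reading): a multi-slot instruction that does
        # not fit in the rest of the current cycle waits for the next one.
        if state > 0 and f > 1 and state + f > WIDTH:
            cycles += 1
            state = 0
        state += f
        cycles += state // WIDTH
        state %= WIDTH
    return state, cycles


def frontend_cycles(costs):
    """Steady-state decode time of K, in (fractional) cycles per iteration."""
    state = 0
    for _ in range(3):  # next^3(0) reaches the steady-state cycle
        state, _unused = run_kernel(state, costs)
    start, total, t = state, 0, 0
    while True:  # at most 3 rounds, since there are only 3 states
        state, c = run_kernel(state, costs)
        total += c
        t += 1
        if state == start:
            return Fraction(total, t)


if __name__ == "__main__":
    print(frontend_cycles([1, 1]))  # e.g. add; fmin with F = 1, 1
    print(frontend_cycles([1, 2]))  # e.g. add; addv with F = 1, 2
```

With F(add) = F(fmin) = 1 and F(addv) = 2, the sketch yields 2/3 cycle for `add; fmin` and 1 cycle for `add; addv`, in line with the frontend-bound figures in the Bubbles example.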
Evaluation on Palmed
- Add this model to Palmed: for each kernel K, simply take max(frontend(K), Palmed(K))
- Results