# Beyond ports: manually modelling the A72 frontend

## Necessity to go beyond ports

* Palmed: concerned mostly with ports
* Noticed the importance of the frontend while investigating its performance
  * heatmap representation: uops gone wild
  * example of a frontend-bound microkernel
* Palmed's vision of a frontend
* uiCA: OK, but it's more complicated

## Cortex A72

* General intro
  * ARMv8-A
  * Out of order
  * Designed as a general-purpose, high-performance core for low-power applications:
    "Next generation of high-efficiency compute"
  * Raspberry Pi 4 (BCM2711)
* Backend
  * 2x Int
  * IntM
  * Load
  * Store
  * FP0
  * FP1
* Frontend: 3 insn/cycle
  * very limiting: e.g. (?)
* Pure Palmed results

## Manual frontend

### Base methodology

* Basis: throughput model
  * e.g. Palmed, uops, official reference
* Simple instructions: 1 μop, single port.
* Categorize by Palmed quads: a ~ b iff ∀i, Cyc(ai) = Cyc(bi)
  * (not necessary, but reduces the number of experiments required)
* Find the impact of insn i on the frontend.
  * The frontend must be the bottleneck; build a benchmark B = i + (simples) so
    that the simples do not cause a bottleneck backend-wise
  * Add simples until Cyc(B) > Cyc(i)
  * Should need at most 3 × Cyc(i) - 1 simples
  * Measure with Pipedream
  * General case: F(i) = 3 × Cyc(B) - |simples|
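
The padding procedure above can be sketched as follows. This is only an illustration: `cyc` is a hypothetical callback standing in for a Pipedream measurement, `toy_cyc` and its tables are a made-up bubble-free machine used to exercise the code, and the search cap is a conservative assumption of this sketch rather than the bound quoted above. The caller remains responsible for choosing simples that do not saturate the backend.

```python
from fractions import Fraction
from itertools import cycle, islice
from math import ceil
from typing import Callable, Sequence

FRONTEND_WIDTH = 3  # insn/cycle, as stated for the A72 frontend

Kernel = Sequence[str]
Measure = Callable[[Kernel], Fraction]  # hypothetical: steady-state cycles per iteration


def frontend_cost(insn: str, simples: Sequence[str], cyc: Measure) -> Fraction:
    """Estimate F(insn) = 3 * Cyc(B) - |simples| by padding insn with simples.

    `simples` must be picked so that they never become the backend bottleneck
    (for instance, they only use ports that `insn` leaves idle).
    """
    base = cyc([insn])
    # Conservative cap on the number of simples tried (assumption of this sketch).
    max_simples = ceil(FRONTEND_WIDTH * base) + 1
    for n in range(1, max_simples + 1):
        padding = list(islice(cycle(simples), n))
        measured = cyc([insn, *padding])
        if measured > base:  # the frontend has become the bottleneck
            return FRONTEND_WIDTH * measured - n
    raise RuntimeError(f"frontend never became the bottleneck for {insn!r}")


# --- Toy stand-in for real measurements (all values are illustrative) ---
TOY_F = {"add": 1, "fmin": 1, "addv": 2}
TOY_BACKEND = {
    "add": ("Int01", Fraction(1, 2)),
    "fmin": ("FP01", Fraction(1, 2)),
    "addv": ("FP01", Fraction(1)),
}


def toy_cyc(kernel: Kernel) -> Fraction:
    """Cycles per iteration on a toy machine: max of port pressure and decode time."""
    load: dict[str, Fraction] = {}
    for insn in kernel:
        port, cost = TOY_BACKEND[insn]
        load[port] = load.get(port, Fraction(0)) + cost
    frontend = Fraction(sum(TOY_F[i] for i in kernel), FRONTEND_WIDTH)
    return max(max(load.values()), frontend)


if __name__ == "__main__":
    # addv is padded with integer adds only, so the FP ports are not overloaded.
    print(frontend_cost("addv", ["add"], toy_cyc))
    print(frontend_cost("add", ["fmin"], toy_cyc))
```

Padding `addv` with `fmin` instead of `add` would overload the toy FP ports and yield a meaningless value, which is why the choice of simples matters.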

### Bubbles

* Instructions:
  * `ADD_RD_SP_W_SP_RN_SP_W_SP_AIMM`, e.g. `add w0, w1, #0x10`: 1 μop, Int01
  * `MUL_RD_W_RN_W_RM_W`, e.g. `mul w0, w1, w2`: 1 μop, IntM
  * `FMIN_FD_D_FN_D_FM_D`, e.g. `fmin d0, d1, d2`: 1 μop, FP01
  * `ADDV_FD_H_VN_V_8H`, e.g. `addv h0, v0.8h`: 2 μops, FP01
  * ISA reference for `add` (immediate): [doc](https://developer.arm.com/documentation/ddi0602/2022-12/Base-Instructions/ADD--immediate---Add--immediate--?lang=en)
* Example:
  * `add`: 1/2 cycle (backend bound)
  * `addv`: 1 cycle (backend bound)
  * `add; fmin`: 2/3 cycle (frontend bound)
  * `add; addv`: 1 cycle (frontend bound)
  * [find nice example]
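
As a sanity check, the two frontend-bound figures listed above can be recovered from the 3 insn/cycle decode width. This assumes (it is not stated explicitly in these notes) that the frontend cost of each instruction equals its μop count, i.e. F(`add`) = F(`fmin`) = 1 and F(`addv`) = 2:

$$
\mathrm{Cyc}_{\text{front}}(\texttt{add; fmin}) = \frac{1 + 1}{3} = \frac{2}{3},
\qquad
\mathrm{Cyc}_{\text{front}}(\texttt{add; addv}) = \frac{1 + 2}{3} = 1.
$$

Both values match the figures in the list.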

### Adopted model

* Question: given a kernel K, how many (fractional) cycles does it take to be
  decoded in steady state?
* Frontend state st ∈ {0, 1, 2}: how many μops have already been decoded this cycle
* Assumption: if st > 0 and an instruction i with F(i) > 1 cannot be decoded fully without
  crossing a cycle boundary, we leave a bubble and start decoding it next cycle
* next: st ↦ st after executing K
  * the longest "cycle" of next is at most 3
  * thus next^3(0) brings us into steady state
* Execute K enough times (t) to come back to the first steady state reached
* Result is C(K^t) / t
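
The model above is small enough to simulate directly. The sketch below is one possible reading of it, not a reference implementation: it assumes that every F(i) is a positive integer number of decode slots, and the names `decode_once` and `frontend_cycles` are made up for this illustration.

```python
from fractions import Fraction
from typing import Sequence

WIDTH = 3  # decode slots per cycle


def decode_once(costs: Sequence[int], state: int) -> tuple[int, int]:
    """Decode one iteration of the kernel, starting with `state` slots already used.

    Returns (slots consumed, bubbles included; frontend state afterwards).
    """
    slots = 0
    for f in costs:
        # Bubble rule: a multi-slot instruction that would cross the cycle
        # boundary waits for the next cycle, wasting the remaining slots.
        if state > 0 and f > 1 and state + f > WIDTH:
            slots += WIDTH - state
            state = 0
        slots += f
        state = (state + f) % WIDTH
    return slots, state


def frontend_cycles(costs: Sequence[int]) -> Fraction:
    """Steady-state decode time of the kernel, in cycles per iteration: C(K^t) / t."""
    state = 0
    for _ in range(WIDTH):  # next^3(0) is guaranteed to lie on a cycle of `next`
        _, state = decode_once(costs, state)
    start, total, t = state, 0, 0
    while True:  # iterate K until the frontend state repeats
        slots, state = decode_once(costs, state)
        total += slots
        t += 1
        if state == start:
            break
    return Fraction(total, WIDTH * t)


if __name__ == "__main__":
    print(frontend_cycles([1, 1]))     # add; fmin
    print(frontend_cycles([1, 2]))     # add; addv
    print(frontend_cycles([1, 1, 2]))  # add; add; addv
```

Under these assumptions the simulation returns 2/3 for `add; fmin` and 1 for `add; addv`. For `add; add; addv` it predicts 3/2 cycles rather than the bubble-free 4/3, because the `addv` is pushed to the next cycle on every other iteration. That last figure is a prediction of the sketch, not a measurement.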

### Evaluation on Palmed

* Add this model to Palmed: for each kernel K, simply take
  max(frontend(K), Palmed(K))
* Results
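
The combination rule above amounts to taking the slower of the two bounds. A minimal sketch, in which both predictors are hypothetical callables and the stub values are placeholders rather than results:

```python
from fractions import Fraction
from typing import Callable, Sequence

Kernel = Sequence[str]
Predictor = Callable[[Kernel], Fraction]  # hypothetical: cycles per iteration


def combined_prediction(kernel: Kernel, frontend: Predictor, palmed: Predictor) -> Fraction:
    """A kernel can go no faster than its slowest bottleneck."""
    return max(frontend(kernel), palmed(kernel))


if __name__ == "__main__":
    # Placeholder predictors returning fixed values, for illustration only.
    stub_frontend = lambda kernel: Fraction(2, 3)
    stub_palmed = lambda kernel: Fraction(1, 2)
    print(combined_prediction(["add", "fmin"], stub_frontend, stub_palmed))
```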