2023-08-23 17:37:17 +02:00
# Beyond ports: manually modelling the A72 frontend
## Necessity to go beyond ports

* Palmed: concerned mostly with ports
* Noticed the importance of the frontend while investigating its performance
* heatmap representation: uops gone wild
* example of a frontend-bound microkernel
* Palmed's vision of a frontend
* Real difference: in-order
* UiCA: OK, but it's more complicated

|
## Cortex A72

* General intro
  * ARMv8-A
  * Out of order
  * Designed as a general-purpose, high-performance core for low-power applications:
    "Next generation of high-efficiency compute"
  * Raspberry Pi 4 (BCM2711)
* Backend
  * 2x Int
  * IntM
  * Load
  * Store
  * FP0
  * FP1
* Frontend: 3 insn/cycle
  * very limiting compared to its backend.
  * Example: 2nd-order polynomial calculation:
    ```
    P[i] = aX[i]² + bX[i] + c
    <=> P[i] = (a*X[i] + b) * X[i] + c
    <=> P[i] = (a*X[i] + b); P[i] = P[i] * X[i] + c
    ```
    i.e. load, FMAdd, FMAdd, store: 4 insns per element. Backend OK, frontend
    bottleneck.
  * Very few hardware counters regarding the frontend! In particular, no access
    *at all* to macro-ops. No micro-op count.
* Pure Palmed results

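
The polynomial example above can be checked with a back-of-the-envelope computation. The sketch below is an illustration only, not a measurement. It assumes each of the four instructions is a single μop, that the two FMAdds can be spread over FP0 and FP1, and that each port executes one μop per cycle. The names `KERNEL`, `backend_cycles`, `frontend_cycles` and `predicted_cycles` are hypothetical.

```python
from collections import Counter
from fractions import Fraction

FRONTEND_WIDTH = 3  # instructions per cycle, as stated in the notes

# One iteration of the polynomial kernel. The (instruction, port) pairs are
# an assumption made for this illustration: one μop per instruction.
KERNEL = [("load", "Load"), ("fmadd", "FP0"), ("fmadd", "FP1"), ("store", "Store")]


def backend_cycles(kernel):
    """Cycles per iteration if every port executes one μop per cycle."""
    port_load = Counter(port for _, port in kernel)
    return Fraction(max(port_load.values()))


def frontend_cycles(kernel):
    """Cycles per iteration for a purely linear 3-wide frontend."""
    return Fraction(len(kernel), FRONTEND_WIDTH)


def predicted_cycles(kernel):
    """The slower of the two sides bounds the steady-state throughput."""
    return max(backend_cycles(kernel), frontend_cycles(kernel))


print(backend_cycles(KERNEL), frontend_cycles(KERNEL), predicted_cycles(KERNEL))
```

Under these assumptions the backend needs 1 cycle per iteration while the frontend needs 4/3, so the frontend is the bottleneck.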
## Manual frontend
### Base methodology

* Basis: throughput model
  * eg. Palmed, uops, official reference
* Simple instructions: 1 μop, single port.
* Categorize instructions by Palmed quads: a ~ b iff ∀i, Cyc(ai) = Cyc(bi)
  * (not necessary, but reduces the number of experiments required)
* Find the impact of insn i on the frontend.
  * The frontend must be the bottleneck; build a benchmark B = i + (simples) so
    that the simples do not cause a bottleneck backend-wise
  * Add simples until Cyc(B) > Cyc(i)
  * Should need at most 3 × Cyc(i) - 1 simples
  * Measure with Pipedream
  * General case: F(i) = 3 × Cyc(B) - |simples|
* Examples: find the μop-count of
  * `ADD_RD_SP_W_SP_RN_SP_W_SP_AIMM` (1)
  * `ADDV_FD_H_VN_V_8H` (2)

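
The procedure above can be sketched as a small search loop. This is a minimal sketch under stated assumptions, not the actual tooling. `measure(n)` is a hypothetical callback returning Cyc(B) for B = i + n simples (in practice it would wrap a Pipedream run), and `fake_measure` is a synthetic stand-in that assumes a perfectly linear frontend with no bubbles. Real measurements are noisy and would need rounding.

```python
from fractions import Fraction

FRONTEND_WIDTH = 3  # μops per cycle, as stated in the notes


def frontend_uops(cyc_alone, measure, max_simples=64):
    """Estimate F(i) = 3 * Cyc(B) - |simples|.

    cyc_alone: Cyc(i), the cycles of the instruction on its own.
    measure:   hypothetical callback, measure(n) -> Cyc(i + n simples).
    """
    for n in range(1, max_simples + 1):
        cyc_b = measure(n)
        if cyc_b > cyc_alone:  # the frontend has become the bottleneck
            return FRONTEND_WIDTH * cyc_b - n
    raise ValueError("the frontend never became the bottleneck")


def fake_measure(uops, cyc_alone):
    """Synthetic measurement: max of the backend time and a linear frontend."""
    return lambda n: max(Fraction(cyc_alone), Fraction(uops + n, FRONTEND_WIDTH))


# Synthetic instructions with a backend time of 1 cycle and 1 or 2 μops.
print(frontend_uops(1, fake_measure(1, 1)), frontend_uops(1, fake_measure(2, 1)))
```

The synthetic inputs only check that the loop recovers the μop count it was fed. They say nothing about the actual timings of `ADD` or `ADDV` on the A72.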
### Bubbles

The frontend is not as simple as a linear resource.

* Example: addv + 3x add. Expect (2 + 3) μops / 3 ≈ 1.67c, actually 2c.

From now on, we try to find models answering:

> given a kernel K, how many (possibly fractional) cycles does it take to be
> decoded in steady state?

### No-cross model

* Hypothesis: the frontend cannot decode a multi-μop instruction across cycle
  boundaries.
  * Reasonable: similar things on x86-64 [uica] (?? investigate)
  * Would explain the example above [show again].
* Frontend state ∈ [|0,2|]: how many μops already decoded this cycle
* Assumption: if st > 0 and i s.t. F(i) > 1 cannot be decoded fully without
  crossing a cycle boundary, we leave a bubble and start decoding it next cycle
* next: st ↦ st after executing K
  * longest "cycle" of next is at most 3
  * thus next^3(0) brings us into steady state
* Execute K enough (t) times to reach the same state as the first
* Result is C(K^t) / t
* …although this model is not good enough: it mispredicts `addv + 2x add`.

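
The no-cross procedure above can be sketched as follows. This is a minimal sketch, assuming a kernel is given as the list of its μop counts F(i) and that every instruction has at least one μop. The names `run_once` and `no_cross_cycles` are hypothetical.

```python
from fractions import Fraction

WIDTH = 3  # μops decoded per cycle, as stated in the notes


def run_once(kernel, state):
    """Decode one iteration of `kernel`, a list of μop counts F(i).

    `state` is the number of μops already decoded in the current cycle.
    Returns (slots consumed, bubbles included; state after the iteration).
    """
    slots = 0
    for uops in kernel:
        # A multi-μop instruction may not straddle a cycle boundary:
        # leave a bubble and start decoding it on the next cycle.
        if uops > 1 and state > 0 and state + uops > WIDTH:
            slots += WIDTH - state
            state = 0
        slots += uops
        state = (state + uops) % WIDTH
    return slots, state


def no_cross_cycles(kernel):
    """Steady-state cycles per iteration, C(K^t) / t."""
    state = 0
    for _ in range(WIDTH):  # next^3(0): reach the steady-state loop
        _, state = run_once(kernel, state)
    first, total, t = state, 0, 0
    while True:  # iterate until the starting state comes back
        slots, state = run_once(kernel, state)
        total += slots
        t += 1
        if state == first:
            return Fraction(total, WIDTH * t)


print(no_cross_cycles([2, 1, 1, 1]))  # addv + 3x add
```

For `addv + 3x add` this sketch yields 2 cycles per iteration, matching the bubble example. For `addv + 2x add` it yields 3/2.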
### Dispatch-queues model

* Found in the optimisation manual
* Dispatcher: limits to 3 μop/cycle
* But also has dispatch queues with tighter limits

#### Finding a dispatch model

Two sources of data:

* Palmed
* Optim manual

Plus Pipedream experiments.

* Palmed not usable as-is: its resources are not an accurate 1-to-1 match with
  the dispatch queues
  * However, good basis: eg. Ld, St ports are a 1-1 match
  * Multiple resources not coalesced for eg. Int, FP01
* For each insn class,
  * generate a base dispatch model with Palmed
  * cross-check with manual
* Some special cases.
  * More #dispatch than #uop: does not happen
  * Single #dispatch, multiple #uop: replicate dispatch #uop times
  * #dispatch = #uop > 1: arbitrary order. This is a problem, but future work.
  * 1 < #dispatch < #uop: unsupported. Only 35 of 1749 insns.
* The model is a very simple version of the abstract resources model: indeed, FP0
  and FP1 are separate dispatch queues, yet some μops hit FP01.

#### Implementing the model

* Assuming each insn has at least 1 μop, the dispatcher is always the frontend's
  bottleneck
* The state *at the end of a kernel* is still determined only by the dispatch
  position
* => the same algorithm + keep track of queues still works

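
One way to picture "the same algorithm + keep track of queues" is the greedy in-order simulation below. It is a sketch under strong assumptions, not the actual implementation. A kernel is flattened to the list of dispatch queues its μops go to, μops dispatch strictly in order, and a μop that does not fit (dispatcher full, or its queue full) starts a new cycle. The per-queue limits in `LIMITS` are placeholders for illustration, not the values from the optimisation manual. The names `dispatch_cycles` and `LIMITS` are hypothetical.

```python
from fractions import Fraction

# Placeholder per-cycle queue limits, NOT the values from the manual.
LIMITS = {"Int": 2, "Load": 1, "Store": 1, "FP0": 1, "FP1": 1}


def dispatch_cycles(kernel, limits, width=3):
    """Steady-state cycles per iteration of `kernel`, a list of queue names."""
    used = {}    # μops sent to each queue during the current cycle
    total = 0    # μops dispatched during the current cycle
    cycles = 0   # cycles completed so far
    seen = {}    # state at an iteration boundary -> (iteration, cycles)
    iteration = 0
    while True:
        # The queue usage is redundant with the dispatch position, but
        # keeping it in the key is harmless and easier to trust.
        key = (total, tuple(sorted(used.items())))
        if key in seen:
            first_iteration, first_cycles = seen[key]
            return Fraction(cycles - first_cycles, iteration - first_iteration)
        seen[key] = (iteration, cycles)
        for queue in kernel:
            if total == width or used.get(queue, 0) == limits[queue]:
                cycles += 1  # this μop does not fit: start a new cycle
                used, total = {}, 0
            used[queue] = used.get(queue, 0) + 1
            total += 1
        iteration += 1


print(dispatch_cycles(["Int", "Int", "Int"], LIMITS))
```

With these placeholder limits, three Int μops take 3/2 cycles per iteration, although the dispatcher alone would allow 1. Detecting a repeated state at an iteration boundary plays the same role as `next` in the no-cross model.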
### Evaluation on Palmed

* Add these models to Palmed: for each kernel K, simply take
  max(frontend(K), Palmed(K))
* Results

### Discussion: how to generalize
[TODO]