Théophile Bastian 72c289239d A72 frontend: add example of front-bottleneck

load, fmadd, fmadd, store

2023-09-06 16:55:54 +02:00

Beyond ports: manually modelling the A72 frontend

Necessity to go beyond ports

Palmed: concerned mostly with ports
Noticed the importance of the frontend while investigating its performance
- heatmap representation: uops gone wild
- example of a frontend-bound microkernel
Palmed's vision of a frontend
Real difference: in-order
UiCA: OK, but it's more complicated

General intro
- ARMv8-A
- Out of order
- Designed as a general-purpose, high-performance core for low-power applications ("Next generation of high-efficiency compute")
- Raspberry Pi 4 (BCM2711)
Backend
- 2x Int
- IntM
- Load
- Store
- FP0
- FP1
Frontend: 3 insn/cycle
- very limiting compared to its backend.
- Example: 2nd-order polynomial calculation:
```
    P[i] = aX[i]² + bX[i] + c
<=> P[i] = (a*X[i] + b) * X[i] + c
<=> P[i] = (a*X[i] + b); P[i] = P[i] * X[i] + c
```
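  so load, FMAdd, FMAdd, store. Backend OK (Load, FP0, FP1 and Store can each take one instruction per cycle), frontend bottleneck (4 insn at 3 insn/cycle: 4/3 cycle per iteration).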
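The two bounds for this microkernel can be checked with a small sketch. It assumes (this is not measured here) that each of the four instructions is a single μop and that an FMAdd may go to either FP0 or FP1; the mnemonics and the `PORT_COUNT` table are illustrative names only.

```python
from collections import Counter
from fractions import Fraction

FRONTEND_WIDTH = 3  # insn/cycle, as stated above

# Number of ports per class, from the backend list above.
# Assumption: FP0 and FP1 are treated as one class of two interchangeable ports.
PORT_COUNT = {"Int": 2, "IntM": 1, "Load": 1, "Store": 1, "FP": 2}

# Assumption: one μop per instruction, each bound to a single port class.
kernel = [("ldr", "Load"), ("fmadd", "FP"), ("fmadd", "FP"), ("str", "Store")]

# Backend bound: the most loaded port class decides.
usage = Counter(port for _, port in kernel)
backend_cycles = max(Fraction(n, PORT_COUNT[port]) for port, n in usage.items())

# Frontend bound: linear model, FRONTEND_WIDTH instructions per cycle.
frontend_cycles = Fraction(len(kernel), FRONTEND_WIDTH)

print(f"backend: {backend_cycles}, frontend: {frontend_cycles}")
```

Under these assumptions the frontend bound is the larger of the two, which is what makes this kernel frontend-bound.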
Very few hardware counters regarding the frontend! In particular, no access at all to macro-ops. No micro-op count.
Pure Palmed results

Basis: throughput model
- e.g. Palmed, uops, official reference
Simple instructions: 1 μop, single port.
Categorize by Palmed quads: a~b iff ∀i, Cyc(ai) = Cyc(bi)
- (not necessary, but reduces number of experiments required)
Find the impact of insn i on frontend.
- The frontend must be the bottleneck: build a benchmark B = i + (simples) such that the simples do not cause a backend bottleneck
- Add simples until Cyc(B) > Cyc(i)
  - Should need at most 3 × Cyc(i) - 1 simples
  - Measure with Pipedream
- General case: F(i) = 3 × Cyc(B) - |simples|
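The procedure above can be sketched as follows. `measure` is a hypothetical callable standing in for a Pipedream measurement of Cyc(i + n simples); it is not Pipedream's actual API. The simulated measurement at the end assumes a purely linear frontend and exists only to exercise the loop.

```python
from fractions import Fraction
from typing import Callable

FRONTEND_WIDTH = 3  # insn/cycle, as stated above


def frontend_uops(cyc_alone: Fraction,
                  measure: Callable[[int], Fraction],
                  max_simples: int = 64) -> Fraction:
    """Return F(i) = 3 × Cyc(B) - |simples| for the first B with Cyc(B) > Cyc(i).

    `measure(n)` is assumed to return Cyc(i + n simple instructions).
    Real measurements are noisy, so a tolerance would be needed in the
    comparison below; exact fractions are used here to keep the sketch simple.
    """
    for n in range(1, max_simples + 1):
        cyc_b = measure(n)
        if cyc_b > cyc_alone:  # the simples made the frontend the bottleneck
            return FRONTEND_WIDTH * cyc_b - n
    raise RuntimeError("frontend never became the bottleneck")


def simulated(f: int, cyc_alone: Fraction) -> Callable[[int], Fraction]:
    """Fake measurement: linear frontend, backend bound fixed at cyc_alone."""
    return lambda n: max(cyc_alone, Fraction(f + n, FRONTEND_WIDTH))


one = Fraction(1)
print(frontend_uops(one, simulated(1, one)))
print(frontend_uops(one, simulated(2, one)))
```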
Examples: find the μop-count of
- ADD_RD_SP_W_SP_RN_SP_W_SP_AIMM (1)
- ADDV_FD_H_VN_V_8H (2)

The frontend is not as simple as a linear resource.

From now on, we try to find models answering:

given a kernel K, how many (frac.) cycles does it take to be decoded in steady state?

Hypothesis: the frontend cannot decode a multi-μop instruction across cycle boundaries.
Reasonable: similar things on x86-64 [uica] (?? investigate)
Would explain the example above [show again].
Frontend state st ∈ {0, 1, 2}: how many μops were already decoded this cycle
Assumption: if st > 0 and an instruction i with F(i) > 1 cannot be decoded fully without crossing a cycle boundary, we leave a bubble and start decoding it in the next cycle
next: st ↦ state after executing K
the longest "cycle" of next has length at most 3
thus next^3(0) brings us into steady state
From there, execute K enough times (t) to come back to the same state
Result is C(K^t) / t
…although, this is crappy: predicts incorrectly on addv + 2x add.
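A minimal sketch of this state machine, under the hypothesis above. A kernel is given as the list of F(i) for its instructions; the function names are illustrative, and the bubble rule is one reading of the assumption (an instruction longer than a full cycle is simply started at the beginning of a cycle).

```python
from fractions import Fraction
from typing import List, Tuple

WIDTH = 3  # μops decoded per cycle


def step(kernel: List[int], st: int) -> Tuple[int, int]:
    """Decode one iteration of `kernel` starting in state `st`.

    Returns (cycle boundaries crossed, state after the iteration)."""
    cycles = 0
    for uops in kernel:
        if st > 0 and uops > 1 and st + uops > WIDTH:
            cycles += 1  # bubble: leave the rest of this cycle empty
            st = 0
        st += uops
        cycles += st // WIDTH
        st %= WIDTH
    return cycles, st


def steady_state_cycles(kernel: List[int]) -> Fraction:
    """C(K^t) / t, measured once the state sequence has become periodic."""
    st = 0
    for _ in range(WIDTH):  # next^3(0) is in steady state
        _, st = step(kernel, st)
    start, total, t = st, 0, 0
    while True:
        cycles, st = step(kernel, st)
        total += cycles
        t += 1
        if st == start:  # back to the initial state: one full period
            return Fraction(total, t)


print(steady_state_cycles([1, 1, 1, 1]))  # load, fmadd, fmadd, store
print(steady_state_cycles([2, 1, 1]))     # addv + 2x add
```

For the four single-μop instructions this gives 4/3 cycle per iteration, the same as the linear model. For addv + 2x add it gives 3/2, which is the model's prediction only, not a measured value.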

Two sources of data:

Palmed
Optimization manual
Plus Pipedream experiments.
Palmed is not usable as-is: its resources are not an accurate, 1-to-1 match to the ports
However, it is a good basis: e.g. the Ld and St ports are a 1-to-1 match
Multiple resources are not coalesced, e.g. for Int, FP01
For each insn class,
- generate a base dispatch model with Palmed
- cross-check with manual
Some special cases.
- More #dispatch than #uop: does not happen
- Single #dispatch, multiple #uop: replicate dispatch #uop times
- #dispatch = #uop > 1: arbitrary order. This is a problem, but future work.
- 1 < #dispatch < #uop: unsupported. Only 35 insn/1749.
The model is a very simple version of abstract resources model: indeed, FP0 and FP1 are separate dispatch queues, yet some μops hit FP01.
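The special cases listed above can be summarised in a small sketch. `expand_dispatch` is a hypothetical helper, not part of Palmed; the queue names are only examples taken from these notes.

```python
from typing import List, Optional


def expand_dispatch(dispatch: List[str], n_uops: int) -> Optional[List[str]]:
    """Return one dispatch queue per μop, or None if the case is unsupported.

    `dispatch` is the list of dispatch queues found for the instruction,
    `n_uops` its μop count F(i)."""
    n_dispatch = len(dispatch)
    if n_dispatch > n_uops:
        # More #dispatch than #uop: does not happen.
        raise ValueError("more dispatches than μops")
    if n_dispatch == n_uops:
        # #dispatch = #uop: keep as-is (the order is arbitrary when > 1).
        return list(dispatch)
    if n_dispatch == 1:
        # Single #dispatch, multiple #uop: replicate the dispatch.
        return dispatch * n_uops
    # 1 < #dispatch < #uop: unsupported.
    return None


print(expand_dispatch(["FP01"], 2))
print(expand_dispatch(["Int", "Ld"], 3))
```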

Assuming each insn has at least 1 μop, the dispatcher is always the frontend's bottleneck
The state at the end of a kernel is still determined only by the dispatch position
=> the same algorithm, additionally keeping track of the queues, still works

Add these models to Palmed: for each kernel K, simply take max(frontend(K), Palmed(K))
Results

[TODO]