
# Beyond ports: manually modelling the A72 frontend

## Necessity to go beyond ports

* Palmed: concerned mostly with ports
* Noticed the importance of the frontend while investigating its performance
    * heatmap representation: uops gone wild
    * example of a frontend-bound microkernel
* Palmed's vision of a frontend
* Real difference: in-order
* uiCA: OK, but it's more complicated

## Cortex A72

* General intro
    * ARMv8-A
    * Out of order
    * Designed as general-purpose, high-performance core for low-power applications
        "Next generation of high-efficiency compute"
    * Raspberry Pi 4 (BCM2711)
* Backend
    * 2x Int
    * IntM
    * Load
    * Store
    * FP0
    * FP1
* Frontend: 3 insn/cycle
    * very limiting compared to its backend.
    * Example: 2nd-order polynomial calculation:
        ```
            P[i] = aX[i]² + bX[i] + c
        <=> P[i] = (a*X[i] + b) * X[i] + c
        <=> P[i] = (a*X[i] + b); P[i] = P[i] * X[i] + c
        ```
        so load, FMAdd, FMAdd, store. Backend OK, frontend bottleneck.

* Very few hardware counters regarding the frontend! In particular, no access
  *at all* to macro-ops. No micro-op count.

* Pure Palmed results
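
The polynomial microkernel above (load, FMAdd, FMAdd, store) can be checked with a back-of-the-envelope bound. This is a minimal sketch, assuming each instruction is a single μop occupying one frontend slot, and assuming the two FMAdd spread evenly over FP0 and FP1. The `port_load` table is a hypothetical hand-written mapping, not a Palmed output.

```python
from fractions import Fraction

FRONTEND_WIDTH = 3  # instructions per cycle, as stated above


def frontend_cycles(kernel, width=FRONTEND_WIDTH):
    """Linear frontend bound: one slot per instruction (assumption)."""
    return Fraction(len(kernel), width)


def backend_cycles(port_load):
    """Backend bound: the most loaded port determines the cycle count."""
    return max(port_load.values())


kernel = ["load", "fmadd", "fmadd", "store"]

# Hypothetical per-iteration port pressure: one load, one store,
# and the two FMAdd assumed to split evenly across FP0 and FP1.
port_load = {
    "Load": Fraction(1),
    "Store": Fraction(1),
    "FP0": Fraction(1),
    "FP1": Fraction(1),
}

front = frontend_cycles(kernel)
back = backend_cycles(port_load)
print(f"frontend={front} backend={back} bound={max(front, back)}")
```

Under these assumptions the backend sustains one iteration per cycle, while the frontend needs 4/3 cycles, so the frontend is the bottleneck.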

## Manual frontend

### Base methodology

* Basis: throughput model
    * eg. Palmed, uops, official reference
* Simple instructions: 1μop, single port.
* Categorize by Palmed quads: a~b iff ∀i, Cyc(ai) = Cyc(bi)
    * (not necessary, but reduces number of experiments required)
* Find the impact of insn i on frontend.
    * The frontend must be bottleneck; build a benchmark B = i + (simples) so
      that the simples do not cause a bottleneck backend-wise
    * Add until Cyc(B) > Cyc(i)
        * Should need at most 3 × Cyc(i) - 1 simples
        * Measure with Pipedream
    * General case: F(i) = 3 × Cyc(B) - |simples|

* Examples: find the μop-count of
    * `ADD_RD_SP_W_SP_RN_SP_W_SP_AIMM` (1)
    * `ADDV_FD_H_VN_V_8H` (2)
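
The general-case formula above can be written down directly. This is a minimal sketch: `frontend_uops` is a hypothetical helper name, and the cycle counts passed to it are illustrative values chosen to agree with the μop counts quoted above (1 and 2), not actual Pipedream measurements.

```python
from fractions import Fraction

FRONTEND_WIDTH = 3  # frontend slots per cycle, as stated above


def frontend_uops(cyc_b, n_simples, width=FRONTEND_WIDTH):
    """F(i) = width * Cyc(B) - |simples|, for a frontend-bound benchmark B.

    cyc_b:     cycles per iteration of B = i + simples
    n_simples: number of simple (1 μop) instructions added to B
    """
    return width * Fraction(cyc_b) - n_simples


# Illustrative inputs only (assumed, not measured):
# an instruction padded with 2 simples, B running at 1 cycle/iteration
print(frontend_uops(1, 2))
# an instruction padded with 4 simples, B running at 2 cycles/iteration
print(frontend_uops(2, 4))
```

The formula only holds when B is actually frontend-bound, which is why the simples must be chosen so that they do not saturate any backend port.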

### Bubbles

The frontend is not as simple as a linear resource.

* Example: addv + 3x add. Expect 1.67c, actually 2c.

From now on, we try to find models answering:
> given a kernel K, how many (frac.) cycles does it take to be decoded in
> steady state?

### No-cross model

* Hypothesis: the frontend cannot decode a multi-μop instruction across cycle
  boundaries.

* Reasonable: similar things on x86-64 [uica] (?? investigate)
* Would explain the example above [show again].

* Frontend state ∈ [|0,2|]: how many μops already decoded this cycle
* Assumption: if st > 0 and i s.t. F(i) > 1 cannot be decoded fully without
  crossing a cycle boundary, we leave a bubble and start decoding it next cycle
* next: st \mapsto st after executing K
* longest "cycle" of next is at most 3
* thus next^3(0) brings us into steady state
* Execute K enough (t) times to reach the same state as the first
* Result is C(K^t) / t

* …although this model is flawed: it predicts incorrectly on `addv + 2x add`.
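
The steady-state procedure described in this section can be sketched as follows. This is a minimal illustration under the stated hypothesis, not a reference implementation: a kernel is given as the list of F(i) values of its instructions, and `run_kernel` and `no_cross_cycles` are hypothetical names.

```python
from fractions import Fraction

WIDTH = 3  # μops decoded per cycle, as stated above


def run_kernel(uops, st):
    """Decode one iteration of the kernel, starting with `st` μops
    already decoded in the current cycle. Returns (cycles, new_st)."""
    cycles = 0
    for f in uops:
        # A multi-μop instruction that would cross the cycle boundary
        # leaves a bubble and starts decoding in the next cycle.
        if st > 0 and f > 1 and st + f > WIDTH:
            cycles += 1
            st = 0
        st += f
        cycles += st // WIDTH
        st %= WIDTH
    return cycles, st


def no_cross_cycles(uops):
    """Steady-state cycles per iteration under the no-cross hypothesis."""
    st = 0
    # `next` maps a 3-element set to itself, so three applications
    # from state 0 are enough to land on its cycle (steady state).
    for _ in range(WIDTH):
        _, st = run_kernel(uops, st)
    start, total, t = st, 0, 0
    # Execute K until the initial steady state comes back: C(K^t) / t.
    while True:
        cycles, st = run_kernel(uops, st)
        total += cycles
        t += 1
        if st == start:
            return Fraction(total, t)


print(no_cross_cycles([2, 1, 1, 1]))  # addv + 3x add
print(no_cross_cycles([2, 1, 1]))     # addv + 2x add
```

With F(addv) = 2 and F(add) = 1, this sketch yields 2 cycles for `addv + 3x add`, in line with the bubble example, and 3/2 cycles for `addv + 2x add`, the case reported above as mispredicted.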

### Dispatch-queues model

* Found in the optimisation manual
* Dispatcher: limits to 3μop/cycle
* But also has dispatch queues with tighter limits

#### Finding a dispatch model

Two sources of data:

* Palmed
* Optimisation manual

Plus Pipedream experiments.

* Palmed not usable as-is: its resources are not an accurate 1-to-1 match with
  the dispatch queues
* However, good basis: eg. Ld, St ports are 1-1 match
* Multiple resources not coalesced for eg. Int, FP01
* For each insn class,
    * generate a base dispatch model with Palmed
    * cross-check with manual

* Some special cases.
    * More #dispatch than #uop: does not happen
    * Single #dispatch, multiple #uop: replicate dispatch #uop times
    * #dispatch = #uop > 1: arbitrary order. This is a problem, but future
      work.
    * 1 < #dispatch < #uop: unsupported. Only 35 insn/1749.

* The model is a very simplified version of an abstract resources model:
  indeed, FP0 and FP1 are separate dispatch queues, yet some μops hit FP01.

#### Implementing the model

* Assuming each insn has at least 1μop, the dispatcher is always the frontend's
  bottleneck
* The state *at the end of a kernel* is still determined only by dispatch pos
* => the same algorithm + keep track of queues still works
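
A minimal sketch of such an implementation is given below, assuming in-order dispatch where a μop that exceeds either the global width or its queue's limit waits for the next cycle. The `LIMITS` table is a placeholder and is not taken from the optimisation manual. The mapping of `addv` to two FP1 μops in the usage lines is equally hypothetical. For safety, the sketch detects the steady state on the full state (dispatch position and queue usage) rather than on the dispatch position alone.

```python
from fractions import Fraction

WIDTH = 3  # μops dispatched per cycle, as stated above

# Per-queue, per-cycle limits: ILLUSTRATIVE placeholders (assumption).
LIMITS = {"Int": 2, "FP0": 1, "FP1": 1, "Ld": 1, "St": 1}


def dispatch_kernel(uops, state, limits=LIMITS):
    """Dispatch one kernel iteration in order.

    uops:  list of queue names, one per μop
    state: (pos, used) = μops dispatched in the current cycle and
           per-queue usage, as a sorted tuple of (queue, count) pairs
    Returns (cycles_completed, new_state).
    """
    pos, used = state[0], dict(state[1])
    cycles = 0
    for queue in uops:
        # Start a new cycle when the dispatcher or the queue is full.
        if pos == WIDTH or used.get(queue, 0) == limits[queue]:
            cycles += 1
            pos, used = 0, {}
        pos += 1
        used[queue] = used.get(queue, 0) + 1
    return cycles, (pos, tuple(sorted(used.items())))


def dispatch_cycles(uops, limits=LIMITS):
    """Steady-state cycles per iteration: iterate until a state repeats."""
    state = (0, ())
    seen = {}  # state -> (iteration index, cycles elapsed so far)
    total, it = 0, 0
    while state not in seen:
        seen[state] = (it, total)
        cycles, state = dispatch_kernel(uops, state, limits)
        total += cycles
        it += 1
    it0, total0 = seen[state]
    return Fraction(total - total0, it - it0)


# Hypothetical mapping: addv as two FP1 μops, add as one Int μop.
print(dispatch_cycles(["FP1", "FP1", "Int", "Int", "Int"]))
print(dispatch_cycles(["Int", "Int", "Int"]))
```

Because the set of reachable states is finite, the loop always terminates. With real per-queue limits and per-instruction dispatch data, only `LIMITS` and the μop lists would change.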

### Evaluation on Palmed

* Add these models to Palmed: for each kernel K, simply take
  max(frontend(K), Palmed(K))
* Results

### Discussion: how to generalize

[TODO]