a72 frontend: rework with broken UopNoCross model
parent df32b9d0f9
commit 277b9e8483
1 changed file with 66 additions and 17 deletions
@@ -7,6 +7,7 @@
* heatmap representation: uops gone wild
* example of a frontend-bound microkernel
* Palmed's vision of a frontend
* Real difference: in-order
* UiCA: OK, but it's more complicated

## Cortex A72

@@ -25,7 +26,10 @@
* FP0
* FP1
* Frontend: 3 insn/cycle
* very limiting: eg. (?)
* very limiting compared to its backend (TODO: example?)

* Very few hardware counters regarding the frontend! In particular, no access
  *at all* to macro-ops. No micro-op count.

* Pure Palmed results

@@ -46,26 +50,28 @@
* Measure with Pipedream
* General case: F(i) = 3 × Cyc(B) - |simples|

* Examples: find the μop-count of
  * `ADD_RD_SP_W_SP_RN_SP_W_SP_AIMM` (1)
  * `ADDV_FD_H_VN_V_8H` (2)
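
As a rough illustration of the general-case formula, the sketch below recovers F(i) from a measured cycle count. It assumes that B is a frontend-bound benchmark made of one instance of i plus |simples| single-μop instructions, and that the frontend handles 3 μops per cycle. The function name `uop_count` and the sample cycle value are hypothetical, not Pipedream output.

```python
def uop_count(cycles_b: float, n_simples: int, width: int = 3) -> int:
    """F(i) = width * Cyc(B) - |simples|, rounded to the nearest integer.

    cycles_b  -- measured steady-state cycles per iteration of B (assumed input)
    n_simples -- number of single-μop instructions padding B
    width     -- μops the frontend handles per cycle (3 in these notes)
    """
    return round(width * cycles_b - n_simples)


# Made-up measurement: B = i + 4 simple instructions, Cyc(B) = 2.0
print(uop_count(2.0, 4))  # 3 * 2.0 - 4 = 2
```

Rounding absorbs measurement noise, but the formula only makes sense while B stays frontend-bound.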

### Bubbles

* Instructions:
  * `ADD_RD_SP_W_SP_RN_SP_W_SP_AIMM`, eg. `add w0, w1, #0x10`: 1 μop, Int01
  * `MUL_RD_W_RN_W_RM_W`, eg. `mul w0, w1, w2`: 1 μop, IntM
  * `FMIN_FD_D_FN_D_FM_D`, eg. `fmin d0, d1, d2`: 1 μop, FP01
  * `ADDV_FD_H_VN_V_8H`, eg. `addv h0, v0.8h`: 2 μops, FP01
[doc](https://developer.arm.com/documentation/ddi0602/2022-12/Base-Instructions/ADD--immediate---Add--immediate--?lang=en)
The frontend is not as simple as a linear resource.

* Example:
  * `add`: 1/2 cycle (backend bound)
  * `addv`: 1 cycle (backend bound)
  * `add; fmin`: 2/3 cycle (frontend bound)
  * `add; addv`: 1 cycle (frontend bound)
* [find nice example]
* Example: `addv` + 3x `add`. Expect 1.67c, actually 2c.
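
To make the 1.67c figure concrete: if the frontend were a linear resource handling 3 μops per cycle, a kernel would cost its total μop count divided by 3. A minimal sketch, using the μop counts from the instruction list above; `linear_frontend` and the `UOPS` table are names made up for this note.

```python
from fractions import Fraction

# μop counts taken from the instruction list in these notes
UOPS = {"add": 1, "mul": 1, "fmin": 1, "addv": 2}


def linear_frontend(kernel, width=3):
    """Cycles per iteration if the frontend were a plain `width`-μop/cycle resource."""
    return Fraction(sum(UOPS[insn] for insn in kernel), width)


# addv + 3x add: 5 μops, so the linear model expects 5/3 ≈ 1.67 cycles
# (the notes report 2 cycles in practice).
print(linear_frontend(["addv", "add", "add", "add"]))
```

The same function gives 2/3 for `add; fmin` and 1 for `add; addv`, matching the frontend-bound examples; only the longer kernel exposes the gap.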

### Adopted model
From now on, we try to find models answering:
> given a kernel K, how many (frac.) cycles does it take to be decoded in
> steady state?

### No-cross model

* Hypothesis: the frontend cannot decode a multi-uop instruction across cycle
  boundaries.

* Reasonable: similar things on x86-64 [uica] (?? investigate)
* Would explain the example above [show again].

* Question: given a kernel K, how many (frac.) cycles does it take to be
  decoded in steady state?
* Frontend state ∈ [|0,2|]: how many μops already decoded this cycle
* Assumption: if st > 0 and i s.t. F(i) > 1 cannot be decoded fully without
  crossing a cycle boundary, we leave a bubble and start decoding it next cycle

@@ -75,8 +81,51 @@
* Execute K enough times (t) to come back to the initial state
* Result is C(K^t) / t

* …although, this is crappy: predicts incorrectly on `addv + 2x add`.
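
The no-cross computation can be sketched as a small state machine. This is only an illustration under stated assumptions: a kernel is given as its list of per-instruction μop counts F(i); an instruction with more than 3 μops (not discussed in the notes) is assumed to start on a fresh cycle and spill over; and instead of waiting for the very first state to come back, the sketch stops at the first repeated state, which gives the same steady-state ratio. The names `step_kernel` and `no_cross_cycles` are hypothetical.

```python
from fractions import Fraction

WIDTH = 3  # μops per cycle; the frontend state st is in {0, 1, 2}


def step_kernel(state, uops):
    """Decode one iteration of the kernel; return (cycles completed, new state)."""
    cycles = 0
    for f in uops:
        # No-cross rule: a multi-μop instruction that does not fit in the
        # current cycle leaves a bubble and starts on the next cycle.
        if state > 0 and f > 1 and state + f > WIDTH:
            cycles += 1
            state = 0
        state += f
        cycles += state // WIDTH
        state %= WIDTH
    return cycles, state


def no_cross_cycles(uops):
    """Steady-state cycles per iteration, i.e. C(K^t) / t over one state period."""
    seen = {}  # state at iteration start -> (iteration index, cycles so far)
    state, total, it = 0, 0, 0
    while state not in seen:
        seen[state] = (it, total)
        cycles, state = step_kernel(state, uops)
        total += cycles
        it += 1
    it0, total0 = seen[state]
    return Fraction(total - total0, it - it0)


print(no_cross_cycles([2, 1, 1, 1]))  # addv + 3x add -> 2
print(no_cross_cycles([2, 1, 1]))     # addv + 2x add -> 3/2
```

With these assumptions the sketch reproduces the 2 cycles observed on `addv` + 3x `add`, and yields 3/2 for `addv + 2x add`, the case the notes flag as mispredicted.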

### Dispatch-queues model

* Found in the optimisation manual
* Dispatcher: limits to 3 μops/cycle
* But also has dispatch queues with tighter limits

#### Finding a dispatch model

Two sources of data:
* Palmed
* Optimisation manual
Plus Pipedream experiments.

* Palmed not usable as-is: its resources are not an accurate, 1-to-1 match
* However, good basis: eg. Ld, St ports are a 1-to-1 match
* Multiple resources not coalesced for eg. Int, FP01
* For each insn class,
  * generate a base dispatch model with Palmed
  * cross-check with manual

* Some special cases.
  * More #dispatch than #uop: does not happen
  * Single #dispatch, multiple #uop: replicate dispatch #uop times
  * #dispatch = #uop > 1: arbitrary order. This is a problem, but future
    work.
  * 1 < #dispatch < #uop: unsupported. Only 35 insns out of 1749.

* The model is a very simple version of the abstract resources model: indeed, FP0
  and FP1 are separate dispatch queues, yet some μops hit FP01.

#### Implementing the model

* Assuming each insn has at least 1 μop, the dispatcher is always the frontend's
  bottleneck
* The state *at the end of a kernel* is still determined only by dispatch pos
* => the same algorithm + keep track of queues still works

### Evaluation on Palmed

* Add this model to Palmed: for each kernel, simply take
* Add these models to Palmed: for each kernel, simply take
  max(frontend(K), Palmed(K))
* Results

### Discussion: how to generalize

[TODO]