phd-thesis/plan/40_a72_frontend.md
2023-09-06 16:55:54 +02:00

Beyond ports: manually modelling the A72 frontend

Necessity to go beyond ports

  • Palmed: concerned mostly with ports
  • Noticed the importance of the frontend while investigating Palmed's performance
    • heatmap representation: uops gone wild
    • example of a frontend-bound microkernel
  • Palmed's vision of a frontend
  • Real difference: in-order
  • UiCA: OK, but it's more complicated

Cortex A72

  • General intro

    • ARMv8-A
    • Out of order
    • Designed as a general-purpose, high-performance core for low-power applications ("Next generation of high-efficiency compute")
    • Raspberry Pi 4 (BCM2711)
  • Backend

    • 2x Int
    • IntM
    • Load
    • Store
    • FP0
    • FP1
  • Frontend: 3 insn/cycle

    • very limiting compared to its backend.
    • Example: 2nd-order polynomial calculation:
          P[i] = aX[i]² + bX[i] + c
      <=> P[i] = (a*X[i] + b) * X[i] + c
      <=> P[i] = (a*X[i] + b); P[i] = P[i] * X[i] + c
      
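      so each iteration is load, FMAdd, FMAdd, store: four instructions. Backend OK (no port is needed more than once per iteration if the two FMAdds are split over FP0 and FP1), but at 3 insn/cycle the frontend needs at least 4/3 cycles: frontend bottleneck.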
  • Very few hardware counters regarding the frontend! In particular, no access at all to macro-ops. No micro-op count.

  • Pure Palmed results
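
The polynomial example from the frontend item above can be checked with a few lines of arithmetic. The sketch below is only illustrative: the port assignment of each instruction (load on Load, FMAdd on FP0 or FP1, store on Store), the single-μop assumption and the even spreading over candidate ports are assumptions made for this example, not measured data.

```python
from fractions import Fraction

FRONTEND_WIDTH = 3  # insn/cycle, as stated in the outline

# Assumed port usage (hypothetical): one μop per instruction,
# FMAdd may be issued on either FP0 or FP1.
KERNEL = [
    ("load", ["Load"]),
    ("fmadd", ["FP0", "FP1"]),
    ("fmadd", ["FP0", "FP1"]),
    ("store", ["Store"]),
]

def frontend_bound(kernel):
    """Cycles per iteration if only the 3 insn/cycle frontend limits us."""
    return Fraction(len(kernel), FRONTEND_WIDTH)

def backend_bound(kernel):
    """Cycles per iteration if only the ports limit us.

    Each instruction is spread evenly over its candidate ports; this is
    good enough for this example but is not an optimal port assignment
    in general.
    """
    load = {}
    for _, ports in kernel:
        for port in ports:
            load[port] = load.get(port, 0) + Fraction(1, len(ports))
    return max(load.values())

print("frontend:", frontend_bound(KERNEL), "backend:", backend_bound(KERNEL))
```

Under these assumptions the backend bound is 1 cycle per iteration while the frontend bound is 4/3, which is the bottleneck described above.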

Manual frontend

Base methodology

  • Basis: throughput model

    • e.g. Palmed, uops, official reference
  • Simple instructions: 1μop, single port.

  • Categorize by Palmed quads: a~b iff ∀i, Cyc(ai) = Cyc(bi)

    • (not necessary, but reduces number of experiments required)
  • Find the impact of insn i on frontend.

    • The frontend must be bottleneck; build a benchmark B = i + (simples) so that the simples do not cause a bottleneck backend-wise
    • Add simples until Cyc(B) > Cyc(i)
      • Should need at most 3 × Cyc(i) - 1 simples
      • Measure with Pipedream
    • General case: F(i) = 3 × Cyc(B) - |simples|
  • Examples: find the μop-count of

    • ADD_RD_SP_W_SP_RN_SP_W_SP_AIMM (1)
    • ADDV_FD_H_VN_V_8H (2)
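
The procedure above can be sketched as a small search loop. This is a minimal sketch, not the actual tooling: `measure` is a hypothetical callable standing in for a Pipedream measurement of B = i + n simples, and `fake_measure` is a made-up linear frontend (no bubbles) used only to exercise the loop.

```python
from fractions import Fraction

def frontend_uops(cyc_i, measure, width=3, max_simples=64):
    """Estimate F(i) = width * Cyc(B) - |simples|.

    cyc_i:   Cyc(i), throughput of the instruction alone (cycles).
    measure: hypothetical callable, measure(n) -> Cyc(B) where
             B = i + n simple (1 μop, non-conflicting) instructions.
    """
    n = 0
    # Add simples until the benchmark gets slower than i alone,
    # i.e. until the frontend has become the bottleneck.
    while measure(n) <= cyc_i:
        n += 1
        if n > max_simples:
            raise RuntimeError("frontend never became the bottleneck")
    return width * measure(n) - n

def fake_measure(n, f_i, cyc_i, width=3):
    """Idealised machine: Cyc(B) = max(backend time of i, μops / width)."""
    return max(Fraction(cyc_i), Fraction(f_i + n, width))

# A 2-μop instruction with Cyc(i) = 1 on the idealised machine.
estimate = frontend_uops(1, lambda n: fake_measure(n, f_i=2, cyc_i=1))
print(estimate)
```

With real measurements, the comparison would need a noise tolerance and the result would need rounding; both are left out here to keep the sketch short.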

Bubbles

The frontend is not as simple as a linear resource.

  • Example: addv + 3x add. Expect 1.67c, actually 2c.

From now on, we try to find models answering:

given a kernel K, how many (frac.) cycles does it take to be decoded in steady state?

No-cross model

  • Hypothesis: the frontend cannot decode a multi-uop instruction across cycle boundaries.

  • Reasonable: similar things on x86-64 [uica] (?? investigate)

  • Would explain the example above [show again].

  • Frontend state ∈ {0, 1, 2}: how many μops already decoded this cycle

  • Assumption: if st > 0 and i s.t. F(i) > 1 cannot be decoded fully without crossing a cycle boundary, we leave a bubble and start decoding it next cycle

  • next: st ↦ st after executing K

  • longest "cycle" of next is at most 3

  • thus next^3(0) brings us into steady state

  • From that steady state, execute K enough times (t) to come back to the state we started from

  • Result is C(K^t) / t

  • …although, this is crappy: predicts incorrectly on addv + 2x add.
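
The steps above can be sketched as follows. This is a minimal sketch under stated assumptions: a kernel is represented only by the list of its per-instruction μop counts F(i), and an instruction with more than 3 μops is assumed to start on a fresh cycle and spill over the following ones (the outline does not say how such instructions behave).

```python
from fractions import Fraction

WIDTH = 3  # μops decoded per cycle

def run_once(uops, st):
    """Decode the kernel once from state st; return (cycles completed, new st)."""
    cycles = 0
    for f in uops:
        # No-cross rule: a multi-μop instruction that does not fit in the
        # rest of the current cycle leaves a bubble and starts next cycle.
        if st > 0 and f > 1 and st + f > WIDTH:
            cycles += 1
            st = 0
        st += f
        cycles += st // WIDTH
        st %= WIDTH
    return cycles, st

def no_cross_cycles(uops):
    """Steady-state decode time of the kernel, in (fractional) cycles."""
    st = 0
    for _ in range(WIDTH):        # next^3(0) is in steady state
        _, st = run_once(uops, st)
    start, total, t = st, 0, 0
    while True:                   # run K until the state repeats
        c, st = run_once(uops, st)
        total += c
        t += 1
        if st == start:
            return Fraction(total, t)   # C(K^t) / t

print(no_cross_cycles([2, 1, 1, 1]))  # addv + 3x add
```

On addv + 3x add (μop counts 2, 1, 1, 1) this sketch yields 2 cycles, in line with the measurement quoted in the Bubbles section rather than the linear 1.67c.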

Dispatch-queues model

  • Found in the optimisation manual
  • Dispatcher: limits to 3μop/cycle
  • But also has dispatch queues with tighter limits

Finding a dispatch model

Two sources of data:

  • Palmed

  • Optimisation manual, plus Pipedream experiments.

  • Palmed not usable as-is: its resources are not an accurate 1-to-1 match

  • However, good basis: e.g. the Ld and St ports are a 1-to-1 match

  • Multiple resources not coalesced for e.g. Int, FP01

  • For each insn class,

    • generate a base dispatch model with Palmed
    • cross-check with manual
  • Some special cases.

    • More #dispatch than #uop: does not happen
    • Single #dispatch, multiple #uop: replicate dispatch #uop times
    • #dispatch = #uop > 1: arbitrary order. This is a problem, but future work.
    • 1 < #dispatch < #uop: unsupported. Only 35 insn out of 1749.
  • The model is a very simple version of an abstract-resources model: indeed, FP0 and FP1 are separate dispatch queues, yet some μops hit FP01.
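
The special cases listed above can be written down as a small function that turns an instruction's dispatch queues and μop count into one queue per μop. This is a sketch only: the function name, the list-of-strings representation and the queue names used in the example are hypothetical.

```python
def per_uop_dispatch(queues, n_uops):
    """Return one dispatch queue per μop, following the special cases above.

    queues: dispatch queues found for the instruction (hypothetical names).
    n_uops: μop count F(i) of the instruction.
    """
    if n_uops < 1 or not queues:
        raise ValueError("need at least one μop and one dispatch queue")
    if len(queues) > n_uops:
        # "More #dispatch than #uop: does not happen"
        raise ValueError("more dispatch queues than μops")
    if len(queues) == 1:
        # Single #dispatch, multiple #uop: replicate it #uop times
        return list(queues) * n_uops
    if len(queues) == n_uops:
        # #dispatch = #uop > 1: the order is arbitrary here
        return list(queues)
    # 1 < #dispatch < #uop
    raise NotImplementedError("unsupported: 1 < #dispatch < #uop")

print(per_uop_dispatch(["FP1"], 2))
```

Keeping the unsupported case as an explicit exception makes it easy to count how many instructions fall into it.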

Implementing the model

  • Assuming each insn has at least 1μop, the dispatcher is always the frontend's bottleneck
  • The state at the end of a kernel is still determined only by dispatch pos
  • => the same algorithm + keep track of queues still works

Evaluation on Palmed

  • Add these models to Palmed: for each kernel K, simply take max(frontend(K), Palmed(K))
  • Results

Discussion: how to generalize

[TODO]