Beyond ports: manually modelling the A72 frontend

Necessity to go beyond ports

  • Palmed: concerned mostly with ports
  • Noticed the importance of the frontend while investigating Palmed's performance
    • heatmap representation: uops gone wild
    • example of a frontend-bound microkernel
  • Palmed's vision of a frontend
  • uiCA: OK, but it's more complicated

Cortex A72

  • General intro

    • ARMv8-A
    • Out of order
    • Designed as a general-purpose, high-performance core for low-power applications ("Next generation of high-efficiency compute")
    • Raspberry Pi 4 (BCM2711)
  • Backend

    • 2x Int
    • IntM
    • Load
    • Store
    • FP0
    • FP1
  • Frontend: 3 insn/cycle

    • very limiting: e.g. (?)
  • Pure Palmed results

Manual frontend

Base methodology

  • Basis: throughput model
    • e.g. Palmed, uops, official reference
  • Simple instructions: 1 μop, single port.
  • Categorize by Palmed quads: a~b iff ∀i, Cyc(ai) = Cyc(bi)
    • (not necessary, but reduces number of experiments required)
  • Find the impact of insn i on frontend.
    • The frontend must be the bottleneck: build a benchmark B = i + (simples), choosing the simples so that they do not cause a backend bottleneck
    • Add simples until Cyc(B) > Cyc(i)
      • Should need at most 3 × Cyc(i) - 1 simples
      • Measure with Pipedream
    • General case: F(i) = 3 × Cyc(B) - |simples|
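The measurement loop above can be sketched as follows. This is a minimal illustration, not the actual tooling: `frontend_cost`, `measure`, `tol`, `max_simples` and `toy_measure` are hypothetical names, and `measure(n)` stands in for whatever reports Cyc(B) for B = i + n simples (for instance a wrapper around Pipedream). The toy measurement assumes an idealised machine with no bubbles and no backend pressure from the simples.

```python
from fractions import Fraction

FRONTEND_WIDTH = 3  # the A72 frontend handles 3 insn/cycle


def frontend_cost(cyc_i, measure, tol=0, max_simples=64):
    """Estimate F(i) from the formula F(i) = 3 x Cyc(B) - |simples|.

    cyc_i:   Cyc(i), cycles of the instruction alone (backend bound).
    measure: hypothetical callback, measure(n) = Cyc(i + n simples).
    tol:     slack for noisy measurements (assumption, 0 for exact data).
    """
    for n in range(max_simples + 1):
        cyc_b = measure(n)
        if cyc_b > cyc_i + tol:  # the frontend is now the bottleneck
            return FRONTEND_WIDTH * cyc_b - n
    raise RuntimeError("frontend never became the bottleneck")


def toy_measure(true_f, cyc_i):
    """Idealised stand-in for a real measurement (assumption)."""
    return lambda n: max(cyc_i, Fraction(true_f + n, FRONTEND_WIDTH))


# An instruction that costs 2 frontend slots and 1 backend cycle.
print(frontend_cost(Fraction(1), toy_measure(2, Fraction(1))))  # 2
```

With real measurements, `tol` would have to be set above the measurement noise, since the stopping test compares two measured cycle counts.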

Bubbles

  • Instructions:

    • ADD_RD_SP_W_SP_RN_SP_W_SP_AIMM, e.g. add w0, w1, #0x10: 1 μop, Int01
    • MUL_RD_W_RN_W_RM_W, e.g. mul w0, w1, w2: 1 μop, IntM
    • FMIN_FD_D_FN_D_FM_D, e.g. fmin d0, d1, d2: 1 μop, FP01
    • ADDV_FD_H_VN_V_8H, e.g. addv h0, v0.8h: 2 μops, FP01 doc
  • Example:

    • add: 1/2 cycle (backend bound)
    • addv: 1 cycle (backend bound)
    • add; fmin: 2/3 cycle (frontend bound)
    • add; addv: 1 cycle (frontend bound)
    • [find nice example]
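The four figures above can be cross-checked with a naive bound that ignores bubbles: the backend bound is the most loaded port class, and the frontend bound is the total μop count divided by 3. The sketch below makes several assumptions that are not stated in these notes: Int01 and FP01 each denote two interchangeable ports, IntM is a single port, and each μop takes one frontend slot. The names `UOPS`, `PORTS` and `bounds` are illustrative only.

```python
from fractions import Fraction

# μops per port class, taken from the instruction list above.
UOPS = {
    "add": {"Int01": 1},
    "mul": {"IntM": 1},
    "fmin": {"FP01": 1},
    "addv": {"FP01": 2},
}
# Number of ports in each class (assumption drawn from the backend list).
PORTS = {"Int01": 2, "IntM": 1, "FP01": 2}
WIDTH = 3  # frontend width


def bounds(kernel):
    """Return (backend bound, frontend bound) in cycles, ignoring bubbles."""
    load = {}
    for insn in kernel:
        for port, n in UOPS[insn].items():
            load[port] = load.get(port, 0) + n
    backend = max(Fraction(n, PORTS[p]) for p, n in load.items())
    frontend = Fraction(sum(load.values()), WIDTH)
    return backend, frontend


for kernel in (["add"], ["addv"], ["add", "fmin"], ["add", "addv"]):
    backend, frontend = bounds(kernel)
    print("; ".join(kernel), "->", max(backend, frontend))
```

Under these assumptions, add; addv is a tie between the two bounds, so this naive estimate alone cannot tell which side is limiting.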

Adopted model

  • Question: given a kernel K, how many (frac.) cycles does it take to be decoded in steady state?
  • Frontend state st ∈ {0, 1, 2}: number of μops already decoded in the current cycle
  • Assumption: if st > 0 and the next instruction i has F(i) > 1 and cannot be fully decoded without crossing a cycle boundary, we leave a bubble and start decoding i in the next cycle
  • next: st ↦ state reached after executing K from st
  • The longest "cycle" of next has length at most 3
  • Thus next^3(0) is in steady state
  • Execute K enough times (t) to come back to that same state
  • Result is C(K^t) / t
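A minimal sketch of this state machine, under stated assumptions: a kernel is given as the list of its F(i) values, these are positive integers, and a "cycle" is counted each time a cycle boundary is crossed. The names `run_once` and `frontend_cycles` are illustrative, not taken from an existing implementation.

```python
from fractions import Fraction

WIDTH = 3  # μops decoded per cycle


def run_once(st, kernel):
    """Decode the kernel once from state st.

    Returns (new state, number of cycle boundaries crossed).
    """
    cycles = 0
    for f in kernel:
        # Bubble: a multi-μop instruction that does not fit in the
        # remainder of the current cycle waits for the next cycle.
        if st > 0 and f > 1 and st + f > WIDTH:
            cycles += 1
            st = 0
        st += f
        cycles += st // WIDTH
        st %= WIDTH
    return st, cycles


def frontend_cycles(kernel):
    """Steady-state decode time of the kernel, in (fractional) cycles."""
    st = 0
    for _ in range(WIDTH):  # next^3(0) is in steady state
        st = run_once(st, kernel)[0]
    start, total, t = st, 0, 0
    while True:  # repeat K until the state loops back: C(K^t) / t
        st, c = run_once(st, kernel)
        total += c
        t += 1
        if st == start:
            return Fraction(total, t)


print(frontend_cycles([1, 1]))  # add; fmin -> 2/3
print(frontend_cycles([1, 2]))  # add; addv -> 1
```

Because of the bubble rule, a kernel made of a single 2-μop instruction (`[2]`) decodes in 1 cycle here instead of the 2/3 that a bubble-free count would give.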

Evaluation on Palmed

  • Add this model to Palmed: for each kernel K, simply take max(frontend(K), Palmed(K))
  • Results