Foundations: llvm-mca examples

Théophile Bastian 2024-03-28 11:32:08 +01:00
parent 5459729661
commit e8b94b4b5a
8 changed files with 315 additions and 2 deletions


@ -163,11 +163,128 @@ be analyzed.
\subsection{Examples with \llvmmca}
\todo{}
We have now covered enough of the theoretical background to introduce code
analyzers in a concrete way, through examples of their usage. For this purpose,
we use \llvmmca{}, one of the state-of-the-art code analyzers.
Due to its relative simplicity --~at least compared to \eg{} Intel's x86-64
implementations~--, we will base the following examples on ARM's Cortex A72,
which we introduce in depth later in \autoref{chap:frontend}. No specific
knowledge of this microarchitecture is required to understand the following
examples; for our purposes, it suffices to say that:
\begin{itemize}
\item the A72 has a single load port, a single store port and two integer arithmetic ports;
\item the \texttt{xN} registers are 64-bit registers;
\item the \texttt{ldr} instruction (\textbf{l}oa\textbf{d}
\textbf{r}egister) loads a value from memory into a register;
\item the \texttt{str} instruction (\textbf{st}ore
\textbf{r}egister) stores the value of a register to memory;
\item the \texttt{add} instruction adds the integer values of its last two
operands and stores the result in the first.
\end{itemize}
\bigskip{}
\paragraph{Simple example: a single load.} We start by running \llvmmca{}
on a single load operation: \lstarmasm{ldr x1, [x2]}.
\lstinputlisting[language=bash]{assets/src/20_foundations/llvm_mca_examples/01_ldr.out}
The first rows (2--10) are high-level metrics. \llvmmca{} works by simulating
the execution of the kernel, here 100 times, as seen on row 2. This simple
kernel contains only one instruction, which breaks down into a single \uop{}.
Iterating it takes 106 cycles instead of the expected 100, as this
execution is \emph{not} in steady-state: the count spans from the
decoding of the first instruction to the retirement of the last.
Row 7 indicates that the frontend can issue at most 3 \uops{} per cycle.
The next two rows are simple ratios. Row 10 is the block's \emph{reciprocal
throughput}, which we formalize later in \autoref{sssec:def:rthroughput}, but
which is roughly defined as the number of cycles a single iteration of the
kernel takes.
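
As a worked check of these two ratios, and assuming (as the figures above
suggest) that \llvmmca{} simply divides each total by the total cycle count:
\[
  \mathrm{IPC} = \frac{\mathrm{Instructions}}{\mathrm{Total\ Cycles}}
               = \frac{100}{106} \approx 0.94,
  \qquad
  \mathrm{uOps\ per\ cycle} = \frac{\mathrm{Total\ uOps}}{\mathrm{Total\ Cycles}}
               = \frac{100}{106} \approx 0.94.
\]
The two values coincide here only because the single instruction of the kernel
maps to a single \uop{}.
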
The next section, \emph{instruction info}, lists data about the instructions
present.
Finally, the last section, \emph{resources}, breaks down individual
instructions into load incurred on execution ports, first aggregating it by
full iteration of the kernel, then instruction by instruction. The maximal load
of each port is normalized to 1, which amounts to saying that it is expressed
as the number of cycles required to process the load.
Here, the only pressure is 1 on the port labeled \texttt{[2]}, that is, the
load port. Thus, the kernel cannot complete in less than a full cycle, as it
takes up all load resources available.
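
This bound can be written out explicitly. The following is only a sketch that
is consistent with the outputs shown in this section, not a statement of
\llvmmca{}'s exact implementation; the symbols $n_{\mu}$ (number of \uops{} per
iteration), $w$ (dispatch width) and $\rho_p$ (per-iteration pressure on
port~$p$) are notations introduced here for illustration:
\[
  \mathrm{Block\ RThroughput}
    \approx \max\left(\frac{n_{\mu}}{w},\; \max_{p} \rho_p\right)
    = \max\left(\frac{1}{3},\; 1.00\right)
    = 1.0 .
\]
For this kernel, the load port is thus the limiting resource, while the
frontend alone would allow up to three iterations per cycle.
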
\paragraph{The timeline mode.} Another useful view that can be displayed by
\llvmmca{} is its timeline mode, enabled by passing an extra
\lstbash{--timeline} flag. In the previous example, it further outputs:
\lstinputlisting[language=bash]{assets/src/20_foundations/llvm_mca_examples/02_ldr_timeline.out}
which indicates, for each instruction, the timeline of its execution. Here,
\texttt{D} stands for decode, \texttt{e} for being executed (in the
pipeline), \texttt{E} for the last cycle of its execution (leaving the
pipeline), and \texttt{R} for retiring. When an instruction is decoded and
waiting to be dispatched to execution, a \texttt{=} is shown; when it has
finished executing and is waiting to be retired, a \texttt{-} is shown.
The identifier at the beginning of each row indicates the kernel iteration
number, and the instruction within.
Here, we can better understand the 106 cycles seen earlier: it takes a first
cycle to decode the first instruction, the instruction then remains in the
pipeline for 5 cycles, and must finally be retired. In steady-state, however,
the instruction would already be decoded (while a previous instruction was
being executed), the retirement would also take place while another
instruction executes, and the pipeline would be accepting new instructions for
four of these five cycles. We can thus avoid using up 6 of those 106 cycles in
steady-state, taking us back to the expected 100 cycles.
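
This accounting can be summarized as a worked equation. It is a sketch that
assumes, as the timeline suggests, that one load completes per cycle once the
pipeline is full; $n$ (the number of iterations) and $T(n)$ (the total cycle
count) are notations introduced here for illustration:
\[
  T(n) = \underbrace{1}_{\mathrm{decode}}
       + \underbrace{5}_{\mathrm{execute}}
       + \underbrace{1}_{\mathrm{retire}}
       + (n - 1)
       = n + 6,
  \qquad
  T(100) = 106 .
\]
The constant term is a one-off cost: as $n$ grows, $T(n)/n$ tends to 1 cycle
per iteration.
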
\paragraph{Single integer add.} If we substitute this load operation with an
integer add operation, the reciprocal throughput is halved:
\lstinputlisting[language=bash]{assets/src/20_foundations/llvm_mca_examples/10_add.out}
Indeed, as we have two integer arithmetic units, two adds may be executed in
parallel, as can be seen in the timeline view.
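
The figures of this listing can be recovered by hand. As a sketch, assuming
that the single \uop{} of each add is spread evenly over the two integer ports
\texttt{[1.0]} and \texttt{[1.1]}, and that the two-adds-per-cycle pattern of
the timeline holds for all 100 iterations ($\rho_p$ denotes the pressure on
port~$p$, a notation introduced here for illustration):
\[
  \rho_{[1.0]} = \rho_{[1.1]} = \frac{1}{2} = 0.50,
  \qquad
  \mathrm{Total\ Cycles} = \underbrace{4}_{\mathrm{first\ add:\ DeER}}
                         + \left(\frac{100}{2} - 1\right)
                         = 53 .
\]
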
\paragraph{Load and two adds.} If we combine those two instructions in a kernel
with a single load and two adds, we obtain a kernel that still fits in the
execution ports in a single cycle. \llvmmca{} confirms this:
\lstinputlisting[language=bash]{assets/src/20_foundations/llvm_mca_examples/20_laa.out}
We can indeed see that an iteration fully utilizes the three ports, but still
fits: the kernel still manages to have a reciprocal throughput of 1.
\newpage
\paragraph{Three adds.} A kernel of three adds, however, will not be able to
run in a single cycle:
\lstinputlisting[language=bash]{assets/src/20_foundations/llvm_mca_examples/30_aaa.out}
The \emph{resource pressure per iteration} view confirms that we exceed the
integer arithmetic capacity of the processor for a single cycle. This is
correctly reflected in the timeline view: the instruction \texttt{[0,2]} starts
executing only in the third cycle (column 2 of the timeline), along with
\texttt{[1,0]}.
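
As a worked check, under the assumption that the three \uops{} of an iteration
are spread evenly over the two integer ports, the per-port pressure exceeds
what the frontend alone would impose with its dispatch width of 3:
\[
  \underbrace{\frac{3\ \mathrm{uops}}{2\ \mathrm{ports}} = 1.5}_{\mathrm{integer\ ports}}
  \;>\;
  \underbrace{\frac{3\ \mathrm{uops}}{3\ \mathrm{uops\ per\ cycle}} = 1}_{\mathrm{frontend}},
\]
so the integer ports are the bottleneck here, which matches the reported block
reciprocal throughput of 1.5.
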
\paragraph{Load, store and two adds.} A kernel of one load, two adds and one
store should, ports-wise, fit in a single cycle. However, \llvmmca{} finds for
this kernel a reciprocal throughput of 1.3:
\lstinputlisting[language=bash]{assets/src/20_foundations/llvm_mca_examples/40_laas.out}
While the resource pressure views confirm that the ports are able to handle
this kernel in a single cycle, the timeline shows that it is in fact the
frontend that stalls the computation. As at most three \uops{} may be
dispatched per cycle while the kernel contains four, the backend is not fed
fast enough to reach a reciprocal throughput of 1.
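
The reported value follows from simple arithmetic on the dispatch width. As a
sketch, assuming the frontend dispatches exactly three \uops{} per cycle
whenever \uops{} are available:
\[
  \frac{4\ \mathrm{uops\ per\ iteration}}{3\ \mathrm{uops\ per\ cycle}}
    = \frac{4}{3} \approx 1.33
  \quad\mbox{cycles per iteration},
  \qquad
  \left\lceil \frac{400}{3} \right\rceil = 134
  \quad\mbox{cycles to dispatch 100 iterations}.
\]
Dispatch alone thus accounts for 134 of the 139 total cycles, the remainder
being spent filling and draining the pipeline; the ports, each under a
pressure of 1.00, would only need 100 cycles.
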
\subsection{Definitions}
\subsubsection{Throughput and reciprocal
throughput}\label{sssec:def:rthroughput}
Given a kernel $\kerK$ of straight-line assembly code, we have referred to
$\cyc{\kerK}$ as the \emph{reciprocal throughput} of $\kerK$, that is, how many


@ -0,0 +1 @@
!*.out


@ -0,0 +1,42 @@
$ echo 'ldr x1,[x2]' | llvm-mca --march=aarch64 --mcpu=cortex-a72 -
Iterations: 100
Instructions: 100
Total Cycles: 106
Total uOps: 100
Dispatch Width: 3
uOps Per Cycle: 0.94
IPC: 0.94
Block RThroughput: 1.0
Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)
[1] [2] [3] [4] [5] [6] Instructions:
1 4 1.00 * ldr x1, [x2]
Resources:
[0] - A57UnitB
[1.0] - A57UnitI
[1.1] - A57UnitI
[2] - A57UnitL
[3] - A57UnitM
[4] - A57UnitS
[5] - A57UnitW
[6] - A57UnitX
Resource pressure per iteration:
[0] [1.0] [1.1] [2] [3] [4] [5] [6]
- - - 1.00 - - - -
Resource pressure by instruction:
[0] [1.0] [1.1] [2] [3] [4] [5] [6] Instructions:
- - - 1.00 - - - - ldr x1, [x2]


@ -0,0 +1,14 @@
Timeline view:
012345
Index 0123456789
[0,0] DeeeeER . . ldr x1, [x2]
[1,0] D=eeeeER . . ldr x1, [x2]
[2,0] D==eeeeER . . ldr x1, [x2]
[3,0] .D==eeeeER. . ldr x1, [x2]
[4,0] .D===eeeeER . ldr x1, [x2]
[5,0] .D====eeeeER . ldr x1, [x2]
[6,0] . D====eeeeER . ldr x1, [x2]
[7,0] . D=====eeeeER . ldr x1, [x2]
[8,0] . D======eeeeER. ldr x1, [x2]
[9,0] . D======eeeeER ldr x1, [x2]


@ -0,0 +1,35 @@
$ echo 'add x1,x2,x3' | llvm-mca --march=aarch64 --mcpu=cortex-a72 -
Iterations: 100
Instructions: 100
Total Cycles: 53
Total uOps: 100
Dispatch Width: 3
uOps Per Cycle: 1.89
IPC: 1.89
Block RThroughput: 0.5
[...]
[1.0] - A57UnitI
[1.1] - A57UnitI
[...]
Resource pressure by instruction:
[0] [1.0] [1.1] [2] [3] [4] [5] [6] Instructions:
- 0.50 0.50 - - - - - add x1, x2, x3
Timeline view:
Index 01234567
[0,0] DeER . . add x1, x2, x3
[1,0] DeER . . add x1, x2, x3
[2,0] D=eER. . add x1, x2, x3
[3,0] .DeER. . add x1, x2, x3
[4,0] .D=eER . add x1, x2, x3
[5,0] .D=eER . add x1, x2, x3
[6,0] . D=eER. add x1, x2, x3
[7,0] . D=eER. add x1, x2, x3
[8,0] . D==eER add x1, x2, x3
[9,0] . D=eER add x1, x2, x3


@ -0,0 +1,25 @@
$ echo -e 'ldr x1,[x2]
add x3,x4,x5
add x6,x7,x8' | llvm-mca --march=aarch64 --mcpu=cortex-a72 -
Iterations: 100
Instructions: 300
Total Cycles: 106
Total uOps: 300
Dispatch Width: 3
uOps Per Cycle: 2.83
IPC: 2.83
Block RThroughput: 1.0
[...]
Resource pressure per iteration:
[0] [1.0] [1.1] [2] [3] [4] [5] [6]
- 1.00 1.00 1.00 - - - -
Resource pressure by instruction:
[0] [1.0] [1.1] [2] [3] [4] [5] [6] Instructions:
- - - 1.00 - - - - ldr x1, [x2]
- - 1.00 - - - - - add x3, x4, x5
- 1.00 - - - - - - add x6, x7, x8


@ -0,0 +1,38 @@
$ echo -e 'add x1,x2,x3
add x4,x5,x6
add x7,x8,x9' | llvm-mca --march=aarch64 --mcpu=cortex-a72 --timeline -
Iterations: 100
Instructions: 300
Total Cycles: 153
Total uOps: 300
Dispatch Width: 3
uOps Per Cycle: 1.96
IPC: 1.96
Block RThroughput: 1.5
[...]
Resource pressure per iteration:
[0] [1.0] [1.1] [2] [3] [4] [5] [6]
- 1.50 1.50 - - - - -
Resource pressure by instruction:
[0] [1.0] [1.1] [2] [3] [4] [5] [6] Instructions:
- 0.50 0.50 - - - - - add x1, x2, x3
- 0.50 0.50 - - - - - add x4, x5, x6
- 0.50 0.50 - - - - - add x7, x8, x9
Timeline view:
01234567
Index 0123456789
[0,0] DeER . . . . add x1, x2, x3
[0,1] DeER . . . . add x4, x5, x6
[0,2] D=eER. . . . add x7, x8, x9
[1,0] .DeER. . . . add x1, x2, x3
[1,1] .D=eER . . . add x4, x5, x6
[1,2] .D=eER . . . add x7, x8, x9
[...]


@ -0,0 +1,41 @@
$ echo -e 'ldr x1,[x2]
add x3,x4,x5
add x6,x7,x8
str x9,[x10]' | llvm-mca --march=aarch64 --mcpu=cortex-a72 --timeline -
Iterations: 100
Instructions: 400
Total Cycles: 139
Total uOps: 400
Dispatch Width: 3
uOps Per Cycle: 2.88
IPC: 2.88
Block RThroughput: 1.3
[...]
Resource pressure per iteration:
[0] [1.0] [1.1] [2] [3] [4] [5] [6]
- 1.00 1.00 1.00 - 1.00 - -
Resource pressure by instruction:
[0] [1.0] [1.1] [2] [3] [4] [5] [6] Instructions:
- - - 1.00 - - - - ldr x1, [x2]
- - 1.00 - - - - - add x3, x4, x5
- 1.00 - - - - - - add x6, x7, x8
- - - - - 1.00 - - str x9, [x10]
Timeline view:
012345678
Index 0123456789
[0,0] DeeeeER . . . ldr x1, [x2]
[0,1] DeE---R . . . add x3, x4, x5
[0,2] DeE---R . . . add x6, x7, x8
[0,3] .DeE--R . . . str x9, [x10]
[1,0] .DeeeeER . . . ldr x1, [x2]
[1,1] .DeE---R . . . add x3, x4, x5
[1,2] . DeE--R . . . add x6, x7, x8
[1,3] . DeE--R . . . str x9, [x10]