Foundations: llvm-mca examples
parent
5459729661
commit
e8b94b4b5a
8 changed files with 315 additions and 2 deletions
|
@ -163,11 +163,128 @@ be analyzed.
|
|||
|
||||
\subsection{Examples with \llvmmca}
|
||||
|
||||
\todo{}
|
||||
We have now covered enough of the theoretical background to introduce code
|
||||
analyzers in a concrete way, through examples of their usage. For this purpose,
|
||||
we use \llvmmca{}, one of the state-of-the-art code analyzers.
|
||||
|
||||
Due to its relative simplicity --~at least compared to \eg{} Intel's x86-64
|
||||
implementations~--, we will base the following examples on ARM's Cortex A72,
|
||||
which we introduce in depth later in \autoref{chap:frontend}. No specific
|
||||
knowledge of this microarchitecture is required to understand the following
|
||||
examples; for our purposes, it suffices to say that:
|
||||
|
||||
\begin{itemize}
|
||||
\item the A72 has a single load port, a single store port and two integer arithmetic ports;
|
||||
\item the \texttt{xN} registers are 64-bit registers;
|
||||
\item the \texttt{ldr} instruction (\textbf{l}oa\textbf{d}
|
||||
\textbf{r}egister) loads a value from memory into a register;
|
||||
\item the \texttt{str} instruction (\textbf{st}ore
|
||||
\textbf{r}egister) stores the value of a register to memory;
|
||||
\item the \texttt{add} instruction adds the integer values of its last two
|
||||
operands and stores the result in the first.
|
||||
\end{itemize}
|
||||
|
||||
\bigskip{}
|
||||
|
||||
\paragraph{Simple example: a single load.} We start by running \llvmmca{}
|
||||
on a single load operation: \lstarmasm{ldr x1, [x2]}.
|
||||
|
||||
\lstinputlisting[language=bash]{assets/src/20_foundations/llvm_mca_examples/01_ldr.out}
|
||||
|
||||
The first rows (2--10) are high-level metrics. \llvmmca{} works by simulating
|
||||
the execution of the kernel --~here, 100 times, as seen on row 2~--. This simple
|
||||
kernel contains only one instruction, which breaks down into a single \uop{}.
|
||||
Iterating it takes 106 cycles instead of the expected 100 cycles, as this
|
||||
execution is \emph{not} in steady-state, but accounts for the cycles from the
|
||||
decoding of the first instruction to the retirement of the last.
|
||||
|
||||
Row 7 indicates that, each cycle, the frontend can issue at most 3 \uops{}.
|
||||
The next two rows are simple ratios. Row 10 is the block's \emph{reciprocal
|
||||
throughput}, which we formalize later in \autoref{sssec:def:rthroughput}, but
|
||||
is roughly defined as the number of cycles a single iteration of the kernel
|
||||
takes.
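
As a quick check of these two ratios, and assuming (as the output above
suggests) that they are plain quotients of the totals reported on the first
rows, we have
\[
    \mathrm{IPC}
    = \frac{\mathrm{Instructions}}{\mathrm{Total\ Cycles}}
    = \frac{100}{106} \approx 0.94,
\]
and the same value for the \uops{} per cycle, since each instruction of this
kernel is a single \uop{}.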
|
||||
|
||||
The next section, \emph{instruction info}, lists data about the instructions
|
||||
present.
|
||||
|
||||
Finally, the last section, \emph{resources}, breaks down individual
|
||||
instructions into the load they incur on execution ports, first aggregating it by
|
||||
full iteration of the kernel, then instruction by instruction. The maximal load
|
||||
of each port is normalized to 1, which amounts to saying that it is expressed in
|
||||
number of cycles required to process the load.
|
||||
|
||||
Here, the only pressure is 1 on the port labeled \texttt{[2]}, that is, the
|
||||
load port. Thus, the kernel cannot complete in less than a full cycle, as it
|
||||
takes up all load resources available.
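
This reasoning can be written as a lower bound. The following is only a
simplified, ports-only sketch, which ignores the frontend and any dependency
between instructions; the notation $p_r$, for the pressure reported on the
port labeled $r$, is introduced here for illustration:
\[
    \mathrm{cycles\ per\ iteration}
    \;\geq\; \max_{r} \, p_r
    \;=\; p_{[2]}
    \;=\; 1.00 .
\]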
|
||||
|
||||
\paragraph{The timeline mode.} Another useful view that can be displayed by
|
||||
\llvmmca{} is its timeline mode, enabled by passing an extra
|
||||
\lstbash{--timeline} flag. In the previous example, it further outputs:
|
||||
|
||||
\lstinputlisting[language=bash]{assets/src/20_foundations/llvm_mca_examples/02_ldr_timeline.out}
|
||||
|
||||
which indicates, for each instruction, the timeline of its execution. Here,
|
||||
\texttt{D} stands for decode, \texttt{e} for being executed --~in the
|
||||
pipeline~--, \texttt{E} for the last cycle of its execution --~leaving the
|
||||
pipeline~--, \texttt{R} for retiring. When an instruction is decoded and
|
||||
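waiting to be dispatched to execution, a \texttt{=} is shown. Conversely, a
\texttt{-} marks a cycle in which an instruction has finished executing and is
waiting to be retired.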
|
||||
|
||||
The identifier at the beginning of each row indicates the kernel iteration
|
||||
number, and the instruction within.
|
||||
|
||||
Here, we can better understand the 106 cycles seen earlier: it takes a first
|
||||
cycle to decode the first instruction, the instruction remains in the pipeline
|
||||
for 5 cycles, and must finally be retired. In steady-state, however, the
|
||||
instruction would be already decoded (while a previous instruction was being
|
||||
executed), the retirement would also be taking place while another instruction
|
||||
executes, and the pipeline would be accepting new instructions for four of
|
||||
these five cycles. We can thus avoid using up 6 of those 106 cycles in
|
||||
steady-state, taking us back to the expected 100 cycles.
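
The figure of 106 cycles can be recovered from the timeline by a
back-of-the-envelope count, assuming that the pattern of the first ten
iterations shown above carries on up to the hundredth. The first load spans
7 cycles (one \texttt{D}, five in the pipeline, one \texttt{R}), and each
subsequent load retires one cycle after its predecessor, so that
\[
    \underbrace{7}_{\text{first load}}
    \;+\; \underbrace{(100 - 1) \times 1}_{\text{remaining loads}}
    \;=\; 106 .
\]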
|
||||
|
||||
\paragraph{Single integer add.} If we substitute this load operation with an
|
||||
integer add operation, we find that the reciprocal throughput is halved:
|
||||
|
||||
\lstinputlisting[language=bash]{assets/src/20_foundations/llvm_mca_examples/10_add.out}
|
||||
|
||||
Indeed, as we have two integer arithmetic units, two adds may be executed in
|
||||
parallel, as can be seen in the timeline view.
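
The totals reported above are consistent with this reading. As a rough check,
assuming that the 100 adds proceed at the reported 0.5 cycle per instruction
once the pipeline is full:
\[
    100 \times 0.5 = 50 \ \text{cycles},
    \qquad
    53 - 50 = 3 \ \text{cycles outside steady-state},
    \qquad
    \mathrm{IPC} = \frac{100}{53} \approx 1.89 .
\]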
|
||||
|
||||
\paragraph{Load and two adds.} If we combine those two instructions in a kernel
|
||||
with a single load and two adds, we obtain a kernel that still fits in the
|
||||
execution ports in a single cycle. \llvmmca{} confirms this:
|
||||
|
||||
\lstinputlisting[language=bash]{assets/src/20_foundations/llvm_mca_examples/20_laa.out}
|
||||
|
||||
We can indeed see that an iteration fully utilizes the three ports, but still
|
||||
fits: the kernel still manages to have a reciprocal throughput of 1.
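
As a sanity check on the figures reported above, and assuming that the IPC is
the plain quotient of the two totals:
\[
    \mathrm{IPC} = \frac{300}{106} \approx 2.83 ,
\]
which is close to the 3 instructions per cycle that a kernel of three
instructions running in one cycle would reach in steady-state. The gap comes
from the same 6 extra cycles as in the single-load example.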
|
||||
|
||||
\newpage
|
||||
\paragraph{Three adds.} A kernel of three adds, however, will not be able to
|
||||
run in a single cycle:
|
||||
|
||||
\lstinputlisting[language=bash]{assets/src/20_foundations/llvm_mca_examples/30_aaa.out}
|
||||
|
||||
The resource pressure per iteration view confirms that we exceed the integer
|
||||
arithmetic capacity of the processor for a single cycle. This is correctly
|
||||
reflected in the timeline view: the instruction \texttt{[0,2]} starts executing
|
||||
only at the third cycle (index 2 in the timeline), along with \texttt{[1,0]}.
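
The value of 1.5 follows directly from the per-instruction pressures listed
above. Each add is reported as spread evenly over the two integer ports, so
that, summing over the three adds of the kernel (with $p_r$ denoting the
pressure on the port labeled $r$, a notation used here for illustration only):
\[
    p_{[1.0]} = p_{[1.1]} = 3 \times 0.50 = 1.50 .
\]
Equivalently, three adds of one cycle each, shared between two units, take
$3 / 2 = 1.5$ cycles per iteration.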
|
||||
|
||||
\paragraph{Load, store and two adds.} A kernel of one load, two adds and one
|
||||
store should, ports-wise, fit in a single cycle. However, \llvmmca{} finds for
|
||||
this kernel a reciprocal throughput of 1.3:
|
||||
|
||||
\lstinputlisting[language=bash]{assets/src/20_foundations/llvm_mca_examples/40_laas.out}
|
||||
|
||||
While the resource pressure views confirm that the ports are able to handle
|
||||
this kernel in a single cycle, the timeline shows that it is in fact the
|
||||
frontend that stalls the computation. As only three instructions may be decoded
|
||||
and issued per cycle, the backend is not fed with enough instructions per cycle
|
||||
to reach a reciprocal throughput of 1.
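
The reported value can be recovered with a simple bound. This is only a
sketch, assuming that the frontend limit is the number of \uops{} of the
kernel divided by the dispatch width, and that the ports limit is the highest
pressure reported on any port:
\[
    \max\left( \frac{4}{3},\; 1.00 \right) = \frac{4}{3} \approx 1.3 .
\]
The frontend term dominates, which matches the 1.3 reported as
\texttt{Block RThroughput}.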
|
||||
|
||||
\subsection{Definitions}
|
||||
|
||||
|
||||
\subsubsection{Throughput and reciprocal
|
||||
throughput}\label{sssec:def:rthroughput}
|
||||
|
||||
Given a kernel $\kerK$ of straight-line assembly code, we have referred to
|
||||
$\cyc{\kerK}$ as the \emph{reciprocal throughput} of $\kerK$, that is, how many
|
||||
|
|
1
manuscrit/assets/src/20_foundations/llvm_mca_examples/.gitignore
vendored
Normal file
|
@ -0,0 +1 @@
|
|||
!*.out
|
|
@ -0,0 +1,42 @@
|
|||
$ echo 'ldr x1,[x2]' | llvm-mca --march=aarch64 --mcpu=cortex-a72 -
|
||||
Iterations: 100
|
||||
Instructions: 100
|
||||
Total Cycles: 106
|
||||
Total uOps: 100
|
||||
|
||||
Dispatch Width: 3
|
||||
uOps Per Cycle: 0.94
|
||||
IPC: 0.94
|
||||
Block RThroughput: 1.0
|
||||
|
||||
|
||||
Instruction Info:
|
||||
[1]: #uOps
|
||||
[2]: Latency
|
||||
[3]: RThroughput
|
||||
[4]: MayLoad
|
||||
[5]: MayStore
|
||||
[6]: HasSideEffects (U)
|
||||
|
||||
[1] [2] [3] [4] [5] [6] Instructions:
|
||||
1 4 1.00 * ldr x1, [x2]
|
||||
|
||||
|
||||
Resources:
|
||||
[0] - A57UnitB
|
||||
[1.0] - A57UnitI
|
||||
[1.1] - A57UnitI
|
||||
[2] - A57UnitL
|
||||
[3] - A57UnitM
|
||||
[4] - A57UnitS
|
||||
[5] - A57UnitW
|
||||
[6] - A57UnitX
|
||||
|
||||
|
||||
Resource pressure per iteration:
|
||||
[0] [1.0] [1.1] [2] [3] [4] [5] [6]
|
||||
- - - 1.00 - - - -
|
||||
|
||||
Resource pressure by instruction:
|
||||
[0] [1.0] [1.1] [2] [3] [4] [5] [6] Instructions:
|
||||
- - - 1.00 - - - - ldr x1, [x2]
|
|
@ -0,0 +1,14 @@
|
|||
Timeline view:
|
||||
012345
|
||||
Index 0123456789
|
||||
|
||||
[0,0] DeeeeER . . ldr x1, [x2]
|
||||
[1,0] D=eeeeER . . ldr x1, [x2]
|
||||
[2,0] D==eeeeER . . ldr x1, [x2]
|
||||
[3,0] .D==eeeeER. . ldr x1, [x2]
|
||||
[4,0] .D===eeeeER . ldr x1, [x2]
|
||||
[5,0] .D====eeeeER . ldr x1, [x2]
|
||||
[6,0] . D====eeeeER . ldr x1, [x2]
|
||||
[7,0] . D=====eeeeER . ldr x1, [x2]
|
||||
[8,0] . D======eeeeER. ldr x1, [x2]
|
||||
[9,0] . D======eeeeER ldr x1, [x2]
|
|
@ -0,0 +1,35 @@
|
|||
$ echo 'add x1,x2,x3' | llvm-mca --march=aarch64 --mcpu=cortex-a72 -
|
||||
Iterations: 100
|
||||
Instructions: 100
|
||||
Total Cycles: 53
|
||||
Total uOps: 100
|
||||
|
||||
Dispatch Width: 3
|
||||
uOps Per Cycle: 1.89
|
||||
IPC: 1.89
|
||||
Block RThroughput: 0.5
|
||||
|
||||
[...]
|
||||
|
||||
[1.0] - A57UnitI
|
||||
[1.1] - A57UnitI
|
||||
|
||||
[...]
|
||||
|
||||
Resource pressure by instruction:
|
||||
[0] [1.0] [1.1] [2] [3] [4] [5] [6] Instructions:
|
||||
- 0.50 0.50 - - - - - add x1, x2, x3
|
||||
|
||||
Timeline view:
|
||||
Index 01234567
|
||||
|
||||
[0,0] DeER . . add x1, x2, x3
|
||||
[1,0] DeER . . add x1, x2, x3
|
||||
[2,0] D=eER. . add x1, x2, x3
|
||||
[3,0] .DeER. . add x1, x2, x3
|
||||
[4,0] .D=eER . add x1, x2, x3
|
||||
[5,0] .D=eER . add x1, x2, x3
|
||||
[6,0] . D=eER. add x1, x2, x3
|
||||
[7,0] . D=eER. add x1, x2, x3
|
||||
[8,0] . D==eER add x1, x2, x3
|
||||
[9,0] . D=eER add x1, x2, x3
|
|
@ -0,0 +1,25 @@
|
|||
$ echo -e 'ldr x1,[x2]
|
||||
add x3,x4,x5
|
||||
add x6,x7,x8' | llvm-mca --march=aarch64 --mcpu=cortex-a72 -
|
||||
|
||||
Iterations: 100
|
||||
Instructions: 300
|
||||
Total Cycles: 106
|
||||
Total uOps: 300
|
||||
|
||||
Dispatch Width: 3
|
||||
uOps Per Cycle: 2.83
|
||||
IPC: 2.83
|
||||
Block RThroughput: 1.0
|
||||
|
||||
[...]
|
||||
|
||||
Resource pressure per iteration:
|
||||
[0] [1.0] [1.1] [2] [3] [4] [5] [6]
|
||||
- 1.00 1.00 1.00 - - - -
|
||||
|
||||
Resource pressure by instruction:
|
||||
[0] [1.0] [1.1] [2] [3] [4] [5] [6] Instructions:
|
||||
- - - 1.00 - - - - ldr x1, [x2]
|
||||
- - 1.00 - - - - - add x3, x4, x5
|
||||
- 1.00 - - - - - - add x6, x7, x8
|
|
@ -0,0 +1,38 @@
|
|||
$ echo -e 'add x1,x2,x3
|
||||
add x4,x5,x6
|
||||
add x7,x8,x9' | llvm-mca --march=aarch64 --mcpu=cortex-a72 --timeline -
|
||||
|
||||
Iterations: 100
|
||||
Instructions: 300
|
||||
Total Cycles: 153
|
||||
Total uOps: 300
|
||||
|
||||
Dispatch Width: 3
|
||||
uOps Per Cycle: 1.96
|
||||
IPC: 1.96
|
||||
Block RThroughput: 1.5
|
||||
|
||||
[...]
|
||||
|
||||
Resource pressure per iteration:
|
||||
[0] [1.0] [1.1] [2] [3] [4] [5] [6]
|
||||
- 1.50 1.50 - - - - -
|
||||
|
||||
Resource pressure by instruction:
|
||||
[0] [1.0] [1.1] [2] [3] [4] [5] [6] Instructions:
|
||||
- 0.50 0.50 - - - - - add x1, x2, x3
|
||||
- 0.50 0.50 - - - - - add x4, x5, x6
|
||||
- 0.50 0.50 - - - - - add x7, x8, x9
|
||||
|
||||
|
||||
Timeline view:
|
||||
01234567
|
||||
Index 0123456789
|
||||
|
||||
[0,0] DeER . . . . add x1, x2, x3
|
||||
[0,1] DeER . . . . add x4, x5, x6
|
||||
[0,2] D=eER. . . . add x7, x8, x9
|
||||
[1,0] .DeER. . . . add x1, x2, x3
|
||||
[1,1] .D=eER . . . add x4, x5, x6
|
||||
[1,2] .D=eER . . . add x7, x8, x9
|
||||
[...]
|
|
@ -0,0 +1,41 @@
|
|||
$ echo -e 'ldr x1,[x2]
|
||||
add x3,x4,x5
|
||||
add x6,x7,x8
|
||||
str x9,[x10]' | llvm-mca --march=aarch64 --mcpu=cortex-a72 --timeline -
|
||||
|
||||
Iterations: 100
|
||||
Instructions: 400
|
||||
Total Cycles: 139
|
||||
Total uOps: 400
|
||||
|
||||
Dispatch Width: 3
|
||||
uOps Per Cycle: 2.88
|
||||
IPC: 2.88
|
||||
Block RThroughput: 1.3
|
||||
|
||||
[...]
|
||||
|
||||
Resource pressure per iteration:
|
||||
[0] [1.0] [1.1] [2] [3] [4] [5] [6]
|
||||
- 1.00 1.00 1.00 - 1.00 - -
|
||||
|
||||
Resource pressure by instruction:
|
||||
[0] [1.0] [1.1] [2] [3] [4] [5] [6] Instructions:
|
||||
- - - 1.00 - - - - ldr x1, [x2]
|
||||
- - 1.00 - - - - - add x3, x4, x5
|
||||
- 1.00 - - - - - - add x6, x7, x8
|
||||
- - - - - 1.00 - - str x9, [x10]
|
||||
|
||||
|
||||
Timeline view:
|
||||
012345678
|
||||
Index 0123456789
|
||||
|
||||
[0,0] DeeeeER . . . ldr x1, [x2]
|
||||
[0,1] DeE---R . . . add x3, x4, x5
|
||||
[0,2] DeE---R . . . add x6, x7, x8
|
||||
[0,3] .DeE--R . . . str x9, [x10]
|
||||
[1,0] .DeeeeER . . . ldr x1, [x2]
|
||||
[1,1] .DeE---R . . . add x3, x4, x5
|
||||
[1,2] . DeE--R . . . add x6, x7, x8
|
||||
[1,3] . DeE--R . . . str x9, [x10]
|