Foundations: llvm-mca examples

2024-03-28 11:32:08 +01:00 · e8b94b4b5a
commit e8b94b4b5a
parent 5459729661
8 changed files with 315 additions and 2 deletions
--- a/manuscrit/20_foundations/20_code_analyzers.tex
+++ b/manuscrit/20_foundations/20_code_analyzers.tex
@ -163,11 +163,128 @@ be analyzed.

 \subsection{Examples with \llvmmca}

-\todo{}
+We have now covered enough of the theoretical background to introduce code
+analyzers in a concrete way, through examples of their usage. For this purpose,
+we use \llvmmca{}, one of the state-of-the-art code analyzers.
+
+Due to its relative simplicity --~at least compared to \eg{} Intel's x86-64
+implementations~--, we will base the following examples on ARM's Cortex A72,
+which we introduce in depth later in \autoref{chap:frontend}. No specific
+knowledge of this microarchitecture is required to understand the following
+examples; for our purposes, it suffices to say that:
+
+\begin{itemize}
+    \item the A72 has a single load port, a single store port and two integer arithmetic ports;
+    \item the \texttt{xN} registers are 64-bit registers;
+    \item the \texttt{ldr} instruction (\textbf{l}oa\textbf{d}
+        \textbf{r}egister) loads a value from memory into a register;
+    \item the \texttt{str} instruction (\textbf{st}ore
+        \textbf{r}egister) stores the value of a register to memory;
+    \item the \texttt{add} instruction adds the integer values of its last two
+        operands and stores the result in the first.
+\end{itemize}
+
+\bigskip{}
+
+\paragraph{Simple example: a single load.} We start by running \llvmmca{}
+on a single load operation: \lstarmasm{ldr x1, [x2]}.
+
+\lstinputlisting[language=bash]{assets/src/20_foundations/llvm_mca_examples/01_ldr.out}
+
+The first rows (2--10) are high-level metrics. \llvmmca{} works by simulating
+the execution of the kernel --~here, 100 times, as seen on row 2~--. This simple
+kernel contains only one instruction, which breaks down into a single \uop{}.
+Running these 100 iterations takes 106 cycles instead of the expected 100, as
+the simulation is \emph{not} in steady state: it accounts for all cycles from
+the decoding of the first instruction to the retirement of the last.
+
+Row 7 indicates that, each cycle, the frontend can issue at most 3 \uops{}.
+The next two rows are simple ratios. Row 10 is the block's \emph{reciprocal
+throughput}, which we formalize later in \autoref{sssec:def:rthroughput}, but
+which is roughly defined as the number of cycles a single iteration of the
+kernel takes.
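+
+As a quick check of these two ratios against the listing, both divide a total
+by the 106 simulated cycles; they coincide here because each instruction is a
+single \uop{}:
+\[
+    \frac{\mathrm{Total\ uOps}}{\mathrm{Total\ Cycles}}
+    = \frac{\mathrm{Instructions}}{\mathrm{Total\ Cycles}}
+    = \frac{100}{106} \approx 0.94
+\]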
+
+The next section, \emph{instruction info}, lists data about the instructions
+present.
+
+Finally, the last section, \emph{resources}, breaks down the load that
+individual instructions place on execution ports, first aggregated over a
+full iteration of the kernel, then instruction by instruction. The maximal load
+of each port is normalized to 1, which amounts to saying that the load is
+expressed as the number of cycles required to process it.
+
+Here, the only pressure is 1 on the port labeled \texttt{[2]}, that is, the
+load port. Thus, the kernel cannot complete in less than a full cycle, as it
+takes up all load resources available.
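+
+One way to see where the reported value comes from is to assume that
+\llvmmca{} computes \emph{Block RThroughput} as the larger of two bounds: the
+number of \uops{} in one iteration divided by the dispatch width, and the
+highest per-port pressure. This is our reading of its documentation rather
+than something established in this section. Under this assumption, the
+figures of the listing give
+\[
+    \max\left(\frac{1}{3},\ 1.00\right) = 1.0,
+\]
+so that the load port, and not the frontend, is the limiting resource for this
+kernel.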
+
+\paragraph{The timeline mode.} Another useful view that can be displayed by
+\llvmmca{} is its timeline mode, enabled by passing an extra
+\lstbash{--timeline} flag. In the previous example, it further outputs:
+
+\lstinputlisting[language=bash]{assets/src/20_foundations/llvm_mca_examples/02_ldr_timeline.out}
+
+which indicates, for each instruction, the timeline of its execution. Here,
+\texttt{D} stands for decode, \texttt{e} for being executed --~in the
+pipeline~--, \texttt{E} for the last cycle of its execution --~leaving the
+pipeline~--, \texttt{R} for retiring. When an instruction is decoded and
+waiting to be dispatched to execution, a \texttt{=} is shown. A \texttt{-},
+which appears in later examples, marks an instruction that has finished
+executing and is waiting to be retired.
+
+The identifier at the beginning of each row gives the kernel iteration
+number, followed by the index of the instruction within that iteration.
+
+Here, we can better understand the 106 cycles seen earlier: it takes a first
+cycle to decode the first instruction, the instruction remains in the pipeline
+for 5 cycles, and must finally be retired. In steady state, however, the
+instruction would already be decoded (while a previous instruction was being
+executed), the retirement would also take place while another instruction
+executes, and the pipeline would be accepting new instructions for four of
+these five cycles. We can thus avoid using up 6 of those 106 cycles in
+steady state, taking us back to the expected 100 cycles.
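+
+This accounting can be summarized as follows, where the three extra terms are
+the initial decode cycle, the four pipeline cycles that overlap with other
+instructions in steady state, and the final retirement cycle:
+\[
+    106 = \underbrace{100}_{\mathrm{steady\ state}} + 1 + 4 + 1
+\]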
+
+\paragraph{Single integer add.} If we replace this load operation with an
+integer add operation, the reciprocal throughput is halved:
+
+\lstinputlisting[language=bash]{assets/src/20_foundations/llvm_mca_examples/10_add.out}
+
+Indeed, as we have two integer arithmetic units, two adds may be executed in
+parallel, as can be seen in the timeline view.
+
+\paragraph{Load and two adds.} If we combine those two instructions in a kernel
+with a single load and two adds, we obtain a kernel that still fits in the
+execution ports in a single cycle. \llvmmca{} confirms this:
+
+\lstinputlisting[language=bash]{assets/src/20_foundations/llvm_mca_examples/20_laa.out}
+
+We can indeed see that an iteration fully utilizes the three ports, but still
+fits: the kernel still manages to have a reciprocal throughput of 1.
+
+\newpage
+\paragraph{Three adds.} A kernel of three adds, however, will not be able to
+run in a single cycle:
+
+\lstinputlisting[language=bash]{assets/src/20_foundations/llvm_mca_examples/30_aaa.out}
+
+The \emph{resource pressure per iteration} view confirms that we exceed the
+integer arithmetic capacity of the processor for a single cycle. This is
+correctly reflected in the timeline view: the instruction \texttt{[0,2]} starts
+executing only in the third cycle (index 2), along with \texttt{[1,0]}.
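+
+Reading the per-port figures from the listing, each add contributes $0.50$ to
+each of the two integer ports, so that one iteration loads each port with
+\[
+    3 \times 0.50 = 1.50 \ \mathrm{cycles},
+\]
+or equivalently $3$ adds shared between $2$ ports, that is, $3 / 2 = 1.5$
+cycles per iteration, which matches the reported \emph{Block RThroughput}.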
+
+\paragraph{Load, store and two adds.} A kernel of one load, two adds and one
+store should, port-wise, fit in a single cycle. However, \llvmmca{} finds for
+this kernel a reciprocal throughput of 1.3:
+
+\lstinputlisting[language=bash]{assets/src/20_foundations/llvm_mca_examples/40_laas.out}
+
+While the resource pressure views confirm that the ports are able to handle
+this kernel in a single cycle, the timeline shows that it is in fact the
+frontend that stalls the computation. As only three instructions may be decoded
+and issued per cycle, the backend is not fed with enough instructions per cycle
+to reach a reciprocal throughput of 1.
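+
+The frontend bound can be made explicit from the figures in the listing: with
+four \uops{} per iteration and a dispatch width of 3, an iteration needs at
+least
+\[
+    \frac{4}{3} \approx 1.33 \ \mathrm{cycles},
+\]
+which \llvmmca{} reports, rounded, as 1.3. The total of 139 cycles for 100
+iterations is consistent with this bound once the few cycles needed to fill
+and drain the pipeline are taken into account.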

 \subsection{Definitions}

-\subsubsection{Throughput and reciprocal throughput}
+\subsubsection{Throughput and reciprocal
+throughput}\label{sssec:def:rthroughput}

 Given a kernel $\kerK$ of straight-line assembly code, we have referred to
 $\cyc{\kerK}$ as the \emph{reciprocal throughput} of $\kerK$, that is, how many
--- a/manuscrit/assets/src/20_foundations/llvm_mca_examples/.gitignore
+++ b/manuscrit/assets/src/20_foundations/llvm_mca_examples/.gitignore
@ -0,0 +1 @@
+!*.out
--- a/manuscrit/assets/src/20_foundations/llvm_mca_examples/01_ldr.out
+++ b/manuscrit/assets/src/20_foundations/llvm_mca_examples/01_ldr.out
@ -0,0 +1,42 @@
+$ echo 'ldr x1,[x2]' | llvm-mca --march=aarch64 --mcpu=cortex-a72 -
+Iterations:        100
+Instructions:      100
+Total Cycles:      106
+Total uOps:        100
+
+Dispatch Width:    3
+uOps Per Cycle:    0.94
+IPC:               0.94
+Block RThroughput: 1.0
+
+
+Instruction Info:
+[1]: #uOps
+[2]: Latency
+[3]: RThroughput
+[4]: MayLoad
+[5]: MayStore
+[6]: HasSideEffects (U)
+
+[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
+ 1      4     1.00    *                   ldr   x1, [x2]
+
+
+Resources:
+[0]   - A57UnitB
+[1.0] - A57UnitI
+[1.1] - A57UnitI
+[2]   - A57UnitL
+[3]   - A57UnitM
+[4]   - A57UnitS
+[5]   - A57UnitW
+[6]   - A57UnitX
+
+
+Resource pressure per iteration:
+[0]    [1.0]  [1.1]  [2]    [3]    [4]    [5]    [6]    
+ -      -      -     1.00    -      -      -      -     
+
+Resource pressure by instruction:
+[0]    [1.0]  [1.1]  [2]    [3]    [4]    [5]    [6]    Instructions:
+ -      -      -     1.00    -      -      -      -     ldr     x1, [x2]
--- a/manuscrit/assets/src/20_foundations/llvm_mca_examples/02_ldr_timeline.out
+++ b/manuscrit/assets/src/20_foundations/llvm_mca_examples/02_ldr_timeline.out
@ -0,0 +1,14 @@
+Timeline view:
+                    012345
+Index     0123456789      
+
+[0,0]     DeeeeER   .    .   ldr        x1, [x2]
+[1,0]     D=eeeeER  .    .   ldr        x1, [x2]
+[2,0]     D==eeeeER .    .   ldr        x1, [x2]
+[3,0]     .D==eeeeER.    .   ldr        x1, [x2]
+[4,0]     .D===eeeeER    .   ldr        x1, [x2]
+[5,0]     .D====eeeeER   .   ldr        x1, [x2]
+[6,0]     . D====eeeeER  .   ldr        x1, [x2]
+[7,0]     . D=====eeeeER .   ldr        x1, [x2]
+[8,0]     . D======eeeeER.   ldr        x1, [x2]
+[9,0]     .  D======eeeeER   ldr        x1, [x2]
--- a/manuscrit/assets/src/20_foundations/llvm_mca_examples/10_add.out
+++ b/manuscrit/assets/src/20_foundations/llvm_mca_examples/10_add.out
@ -0,0 +1,35 @@
+$ echo 'add x1,x2,x3' | llvm-mca --march=aarch64 --mcpu=cortex-a72 -
+Iterations:        100
+Instructions:      100
+Total Cycles:      53
+Total uOps:        100
+
+Dispatch Width:    3
+uOps Per Cycle:    1.89
+IPC:               1.89
+Block RThroughput: 0.5
+
+[...]
+
+[1.0] - A57UnitI
+[1.1] - A57UnitI
+
+[...]
+
+Resource pressure by instruction:
+[0]    [1.0]  [1.1]  [2]    [3]    [4]    [5]    [6]    Instructions:
+ -     0.50   0.50    -      -      -      -      -     add     x1, x2, x3
+
+Timeline view:
+Index     01234567
+
+[0,0]     DeER . .   add        x1, x2, x3
+[1,0]     DeER . .   add        x1, x2, x3
+[2,0]     D=eER. .   add        x1, x2, x3
+[3,0]     .DeER. .   add        x1, x2, x3
+[4,0]     .D=eER .   add        x1, x2, x3
+[5,0]     .D=eER .   add        x1, x2, x3
+[6,0]     . D=eER.   add        x1, x2, x3
+[7,0]     . D=eER.   add        x1, x2, x3
+[8,0]     . D==eER   add        x1, x2, x3
+[9,0]     .  D=eER   add        x1, x2, x3
--- a/manuscrit/assets/src/20_foundations/llvm_mca_examples/20_laa.out
+++ b/manuscrit/assets/src/20_foundations/llvm_mca_examples/20_laa.out
@ -0,0 +1,25 @@
+$ echo -e 'ldr x1,[x2]
+add x3,x4,x5
+add x6,x7,x8' | llvm-mca --march=aarch64 --mcpu=cortex-a72 -
+
+Iterations:        100
+Instructions:      300
+Total Cycles:      106
+Total uOps:        300
+
+Dispatch Width:    3
+uOps Per Cycle:    2.83
+IPC:               2.83
+Block RThroughput: 1.0
+
+[...]
+
+Resource pressure per iteration:
+[0]    [1.0]  [1.1]  [2]    [3]    [4]    [5]    [6]    
+ -     1.00   1.00   1.00    -      -      -      -     
+
+Resource pressure by instruction:
+[0]    [1.0]  [1.1]  [2]    [3]    [4]    [5]    [6]    Instructions:
+ -      -      -     1.00    -      -      -      -     ldr     x1, [x2]
+ -      -     1.00    -      -      -      -      -     add     x3, x4, x5
+ -     1.00    -      -      -      -      -      -     add     x6, x7, x8
--- a/manuscrit/assets/src/20_foundations/llvm_mca_examples/30_aaa.out
+++ b/manuscrit/assets/src/20_foundations/llvm_mca_examples/30_aaa.out
@ -0,0 +1,38 @@
+$ echo -e 'add x1,x2,x3
+add x4,x5,x6
+add x7,x8,x9' | llvm-mca --march=aarch64 --mcpu=cortex-a72 --timeline -
+
+Iterations:        100
+Instructions:      300
+Total Cycles:      153
+Total uOps:        300
+
+Dispatch Width:    3
+uOps Per Cycle:    1.96
+IPC:               1.96
+Block RThroughput: 1.5
+
+[...]
+
+Resource pressure per iteration:
+[0]    [1.0]  [1.1]  [2]    [3]    [4]    [5]    [6]
+ -     1.50   1.50    -      -      -      -      -
+
+Resource pressure by instruction:
+[0]    [1.0]  [1.1]  [2]    [3]    [4]    [5]    [6]    Instructions:
+ -     0.50   0.50    -      -      -      -      -     add     x1, x2, x3
+ -     0.50   0.50    -      -      -      -      -     add     x4, x5, x6
+ -     0.50   0.50    -      -      -      -      -     add     x7, x8, x9
+
+
+Timeline view:
+                    01234567
+Index     0123456789
+
+[0,0]     DeER .    .    . .   add      x1, x2, x3
+[0,1]     DeER .    .    . .   add      x4, x5, x6
+[0,2]     D=eER.    .    . .   add      x7, x8, x9
+[1,0]     .DeER.    .    . .   add      x1, x2, x3
+[1,1]     .D=eER    .    . .   add      x4, x5, x6
+[1,2]     .D=eER    .    . .   add      x7, x8, x9
+[...]
--- a/manuscrit/assets/src/20_foundations/llvm_mca_examples/40_laas.out
+++ b/manuscrit/assets/src/20_foundations/llvm_mca_examples/40_laas.out
@ -0,0 +1,41 @@
+$ echo -e 'ldr x1,[x2]
+add x3,x4,x5
+add x6,x7,x8
+str x9,[x10]' | llvm-mca --march=aarch64 --mcpu=cortex-a72 --timeline -
+
+Iterations:        100
+Instructions:      400
+Total Cycles:      139
+Total uOps:        400
+
+Dispatch Width:    3
+uOps Per Cycle:    2.88
+IPC:               2.88
+Block RThroughput: 1.3
+
+[...]
+
+Resource pressure per iteration:
+[0]    [1.0]  [1.1]  [2]    [3]    [4]    [5]    [6]
+ -     1.00   1.00   1.00    -     1.00    -      -
+
+Resource pressure by instruction:
+[0]    [1.0]  [1.1]  [2]    [3]    [4]    [5]    [6]    Instructions:
+ -      -      -     1.00    -      -      -      -     ldr     x1, [x2]
+ -      -     1.00    -      -      -      -      -     add     x3, x4, x5
+ -     1.00    -      -      -      -      -      -     add     x6, x7, x8
+ -      -      -      -      -     1.00    -      -     str     x9, [x10]
+
+
+Timeline view:
+                    012345678
+Index     0123456789
+
+[0,0]     DeeeeER   .    .  .   ldr     x1, [x2]
+[0,1]     DeE---R   .    .  .   add     x3, x4, x5
+[0,2]     DeE---R   .    .  .   add     x6, x7, x8
+[0,3]     .DeE--R   .    .  .   str     x9, [x10]
+[1,0]     .DeeeeER  .    .  .   ldr     x1, [x2]
+[1,1]     .DeE---R  .    .  .   add     x3, x4, x5
+[1,2]     . DeE--R  .    .  .   add     x6, x7, x8
+[1,3]     . DeE--R  .    .  .   str     x9, [x10]