diff --git a/manuscrit/20_foundations/20_code_analyzers.tex b/manuscrit/20_foundations/20_code_analyzers.tex
index 9c66b4a..54b15bc 100644
--- a/manuscrit/20_foundations/20_code_analyzers.tex
+++ b/manuscrit/20_foundations/20_code_analyzers.tex
@@ -163,11 +163,128 @@ be analyzed.
 
 \subsection{Examples with \llvmmca}
 
-\todo{}
+We have now covered enough of the theoretical background to introduce code
+analyzers in a concrete way, through examples of their usage. For this purpose,
+we use \llvmmca{}, one of the state-of-the-art code analyzers.
+
+Due to its relative simplicity --~at least compared to \eg{} Intel's x86-64
+implementations~--, we will base the following examples on ARM's Cortex A72,
+which we introduce in depth later in \autoref{chap:frontend}. No specific
+knowledge of this microarchitecture is required to understand the following
+examples; for our purposes, it suffices to say that:
+
+\begin{itemize}
+    \item the A72 has a single load port, a single store port and two integer arithmetic ports;
+    \item the \texttt{xN} registers are 64-bit registers;
+    \item the \texttt{ldr} instruction (\textbf{l}oa\textbf{d}
+        \textbf{r}egister) loads a value from memory into a register;
+    \item the \texttt{str} instruction (\textbf{st}ore
+        \textbf{r}egister) stores the value of a register to memory;
+    \item the \texttt{add} instruction adds the integer values of its last two
+        operands and stores the result in the first.
+\end{itemize}
+
+\bigskip{}
+
+\paragraph{Simple example: a single load.} We start by running \llvmmca{}
+on a single load operation: \lstarmasm{ldr x1, [x2]}.
+
+\lstinputlisting[language=bash]{assets/src/20_foundations/llvm_mca_examples/01_ldr.out}
+
+The first rows (2--10) are high-level metrics. \llvmmca{} works by simulating
+the execution of the kernel --~here, 100 times, as seen in row 2~--. This simple
+kernel contains only one instruction, which breaks down into a single \uop{}.
+Iterating it takes 106 cycles instead of the expected 100 cycles, as this
+execution is \emph{not} in steady-state, but accounts for the cycles from the
+decoding of the first instruction to the retirement of the last.
+
+Row 7 indicates that, each cycle, the frontend can issue at most 3 \uops{}.
+The next two rows are simple ratios. Row 10 is the block's \emph{reciprocal
+throughput}, which we formalize later in \autoref{sssec:def:rthroughput}, but
+is roughly defined as the number of cycles a single iteration of the kernel
+takes.
+
+The next section, \emph{instruction info}, lists data about the instructions
+present.
+
+Finally, the last section, \emph{resources}, breaks down the load that
+instructions incur on execution ports, first aggregated over a full
+iteration of the kernel, then instruction by instruction. The maximal load
+of each port is normalized to 1, which amounts to saying that it is expressed
+as the number of cycles required to process the load.
+
+Here, the only pressure is 1 on the port labeled \texttt{[2]}, that is, the
+load port. Thus, the kernel cannot complete in less than a full cycle, as it
+takes up all load resources available.
+
+\paragraph{The timeline mode.} Another useful view that can be displayed by
+\llvmmca{} is its timeline mode, enabled by passing an extra
+\lstbash{--timeline} flag. In the previous example, it further outputs:
+
+\lstinputlisting[language=bash]{assets/src/20_foundations/llvm_mca_examples/02_ldr_timeline.out}
+
+which indicates, for each instruction, the timeline of its execution. Here,
+\texttt{D} stands for decode, \texttt{e} for being executed --~in the
+pipeline~--, \texttt{E} for last cycle of its execution --~leaving the
+pipeline~--, \texttt{R} for retiring. When an instruction is decoded and
+waiting to be dispatched to execution, a \texttt{=} is shown.
+
+The identifier at the beginning of each row indicates the kernel iteration
+number, and the instruction within.
+
+Here, we can better understand the 106 cycles seen earlier: it takes a first
+cycle to decode the first instruction, the instruction remains in the pipeline
+for 5 cycles, and must finally be retired. In steady-state, however, the
+instruction would already be decoded (while a previous instruction was being
+executed), the retirement would also be taking place while another instruction
+executes, and the pipeline would be accepting new instructions for four of
+these five cycles. We can thus avoid using up 6 of those 106 cycles in
+steady-state, taking us back to the expected 100 cycles.
+
+\paragraph{Single integer add.} If we replace this load operation with an
+integer add operation, the reciprocal throughput is halved:
+
+\lstinputlisting[language=bash]{assets/src/20_foundations/llvm_mca_examples/10_add.out}
+
+Indeed, as we have two integer arithmetic units, two adds may be executed in
+parallel, as can be seen in the timeline view.
+
+\paragraph{Load and two adds.} If we combine those two instructions in a kernel
+with a single load and two adds, we obtain a kernel that still fits in the
+execution ports in a single cycle. \llvmmca{} confirms this:
+
+\lstinputlisting[language=bash]{assets/src/20_foundations/llvm_mca_examples/20_laa.out}
+
+We can indeed see that an iteration fully utilizes the three ports, but still
+fits: the kernel still manages to have a reciprocal throughput of 1.
+
+\newpage
+\paragraph{Three adds.} A kernel of three adds, however, will not be able to
+run in a single cycle:
+
+\lstinputlisting[language=bash]{assets/src/20_foundations/llvm_mca_examples/30_aaa.out}
+
+The resource pressure per iteration view confirms that we exceed the integer
+arithmetic capacity of the processor for a single cycle. This is correctly
+reflected in the timeline view: the instruction \texttt{[0,2]} starts executing
+only in the third cycle (index~2), along with \texttt{[1,0]}.
+
+\paragraph{Load, store and two adds.} A kernel of one load, two adds and one
+store should, port-wise, fit in a single cycle. However, \llvmmca{} reports a
+reciprocal throughput of 1.3 for this kernel:
+
+\lstinputlisting[language=bash]{assets/src/20_foundations/llvm_mca_examples/40_laas.out}
+
+While the resource pressure views confirm that the ports are able to handle
+this kernel in a single cycle, the timeline shows that it is in fact the
+frontend that stalls the computation. As only three instructions may be decoded
+and issued per cycle, this four-instruction kernel needs at least $4/3 \approx
+1.3$ cycles per iteration, and cannot reach a reciprocal throughput of 1.
 
 \subsection{Definitions}
 
-\subsubsection{Throughput and reciprocal throughput}
+\subsubsection{Throughput and reciprocal
+throughput}\label{sssec:def:rthroughput}
 
 Given a kernel $\kerK$ of straight-line assembly code, we have referred to
 $\cyc{\kerK}$ as the \emph{reciprocal throughput} of $\kerK$, that is, how many
diff --git a/manuscrit/assets/src/20_foundations/llvm_mca_examples/.gitignore b/manuscrit/assets/src/20_foundations/llvm_mca_examples/.gitignore
new file mode 100644
index 0000000..6aa6b80
--- /dev/null
+++ b/manuscrit/assets/src/20_foundations/llvm_mca_examples/.gitignore
@@ -0,0 +1 @@
+!*.out
diff --git a/manuscrit/assets/src/20_foundations/llvm_mca_examples/01_ldr.out b/manuscrit/assets/src/20_foundations/llvm_mca_examples/01_ldr.out
new file mode 100644
index 0000000..6e32b03
--- /dev/null
+++ b/manuscrit/assets/src/20_foundations/llvm_mca_examples/01_ldr.out
@@ -0,0 +1,42 @@
+$ echo 'ldr x1,[x2]' | llvm-mca --march=aarch64 --mcpu=cortex-a72 -
+Iterations:        100
+Instructions:      100
+Total Cycles:      106
+Total uOps:        100
+
+Dispatch Width:    3
+uOps Per Cycle:    0.94
+IPC:               0.94
+Block RThroughput: 1.0
+
+
+Instruction Info:
+[1]: #uOps
+[2]: Latency
+[3]: RThroughput
+[4]: MayLoad
+[5]: MayStore
+[6]: HasSideEffects (U)
+
+[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
+ 1      4     1.00    *                   ldr x1, [x2]
+
+
+Resources:
+[0]   - A57UnitB
+[1.0] - A57UnitI
+[1.1] - A57UnitI
+[2]   - A57UnitL
+[3]   - A57UnitM
+[4]   - A57UnitS
+[5]   - A57UnitW
+[6]   - A57UnitX
+
+
+Resource pressure per iteration:
+[0]    [1.0]  [1.1]  [2]    [3]    [4]    [5]    [6]
+ -      -      -     1.00    -      -      -      -
+
+Resource pressure by instruction:
+[0]    [1.0]  [1.1]  [2]    [3]    [4]    [5]    [6]    Instructions:
+ -      -      -     1.00    -      -      -      -     ldr x1, [x2]
diff --git a/manuscrit/assets/src/20_foundations/llvm_mca_examples/02_ldr_timeline.out b/manuscrit/assets/src/20_foundations/llvm_mca_examples/02_ldr_timeline.out
new file mode 100644
index 0000000..bc329d1
--- /dev/null
+++ b/manuscrit/assets/src/20_foundations/llvm_mca_examples/02_ldr_timeline.out
@@ -0,0 +1,14 @@
+Timeline view:
+                    012345
+Index     0123456789
+
+[0,0]     DeeeeER   .    .   ldr x1, [x2]
+[1,0]     D=eeeeER  .    .   ldr x1, [x2]
+[2,0]     D==eeeeER .    .   ldr x1, [x2]
+[3,0]     .D==eeeeER.    .   ldr x1, [x2]
+[4,0]     .D===eeeeER    .   ldr x1, [x2]
+[5,0]     .D====eeeeER   .   ldr x1, [x2]
+[6,0]     . D====eeeeER  .   ldr x1, [x2]
+[7,0]     . D=====eeeeER .   ldr x1, [x2]
+[8,0]     . D======eeeeER.   ldr x1, [x2]
+[9,0]     .  D======eeeeER   ldr x1, [x2]
diff --git a/manuscrit/assets/src/20_foundations/llvm_mca_examples/10_add.out b/manuscrit/assets/src/20_foundations/llvm_mca_examples/10_add.out
new file mode 100644
index 0000000..ce8ed44
--- /dev/null
+++ b/manuscrit/assets/src/20_foundations/llvm_mca_examples/10_add.out
@@ -0,0 +1,35 @@
+$ echo 'add x1,x2,x3' | llvm-mca --march=aarch64 --mcpu=cortex-a72 -
+Iterations:        100
+Instructions:      100
+Total Cycles:      53
+Total uOps:        100
+
+Dispatch Width:    3
+uOps Per Cycle:    1.89
+IPC:               1.89
+Block RThroughput: 0.5
+
+[...]
+
+[1.0] - A57UnitI
+[1.1] - A57UnitI
+
+[...]
+
+Resource pressure by instruction:
+[0]    [1.0]  [1.1]  [2]    [3]    [4]    [5]    [6]    Instructions:
+ -     0.50   0.50    -      -      -      -      -     add x1, x2, x3
+
+Timeline view:
+Index     01234567
+
+[0,0]     DeER . .   add x1, x2, x3
+[1,0]     DeER . .   add x1, x2, x3
+[2,0]     D=eER. .   add x1, x2, x3
+[3,0]     .DeER. .   add x1, x2, x3
+[4,0]     .D=eER .   add x1, x2, x3
+[5,0]     .D=eER .   add x1, x2, x3
+[6,0]     . D=eER.   add x1, x2, x3
+[7,0]     . D=eER.   add x1, x2, x3
+[8,0]     . D==eER   add x1, x2, x3
+[9,0]     .  D=eER   add x1, x2, x3
diff --git a/manuscrit/assets/src/20_foundations/llvm_mca_examples/20_laa.out b/manuscrit/assets/src/20_foundations/llvm_mca_examples/20_laa.out
new file mode 100644
index 0000000..663ee23
--- /dev/null
+++ b/manuscrit/assets/src/20_foundations/llvm_mca_examples/20_laa.out
@@ -0,0 +1,25 @@
+$ echo -e 'ldr x1,[x2]
+add x3,x4,x5
+add x6,x7,x8' | llvm-mca --march=aarch64 --mcpu=cortex-a72 -
+
+Iterations:        100
+Instructions:      300
+Total Cycles:      106
+Total uOps:        300
+
+Dispatch Width:    3
+uOps Per Cycle:    2.83
+IPC:               2.83
+Block RThroughput: 1.0
+
+[...]
+
+Resource pressure per iteration:
+[0]    [1.0]  [1.1]  [2]    [3]    [4]    [5]    [6]
+ -     1.00   1.00   1.00    -      -      -      -
+
+Resource pressure by instruction:
+[0]    [1.0]  [1.1]  [2]    [3]    [4]    [5]    [6]    Instructions:
+ -      -      -     1.00    -      -      -      -     ldr x1, [x2]
+ -      -     1.00    -      -      -      -      -     add x3, x4, x5
+ -     1.00    -      -      -      -      -      -     add x6, x7, x8
diff --git a/manuscrit/assets/src/20_foundations/llvm_mca_examples/30_aaa.out b/manuscrit/assets/src/20_foundations/llvm_mca_examples/30_aaa.out
new file mode 100644
index 0000000..3471901
--- /dev/null
+++ b/manuscrit/assets/src/20_foundations/llvm_mca_examples/30_aaa.out
@@ -0,0 +1,38 @@
+$ echo -e 'add x1,x2,x3
+add x4,x5,x6
+add x7,x8,x9' | llvm-mca --march=aarch64 --mcpu=cortex-a72 --timeline -
+
+Iterations:        100
+Instructions:      300
+Total Cycles:      153
+Total uOps:        300
+
+Dispatch Width:    3
+uOps Per Cycle:    1.96
+IPC:               1.96
+Block RThroughput: 1.5
+
+[...]
+
+Resource pressure per iteration:
+[0]    [1.0]  [1.1]  [2]    [3]    [4]    [5]    [6]
+ -     1.50   1.50    -      -      -      -      -
+
+Resource pressure by instruction:
+[0]    [1.0]  [1.1]  [2]    [3]    [4]    [5]    [6]    Instructions:
+ -     0.50   0.50    -      -      -      -      -     add x1, x2, x3
+ -     0.50   0.50    -      -      -      -      -     add x4, x5, x6
+ -     0.50   0.50    -      -      -      -      -     add x7, x8, x9
+
+
+Timeline view:
+                    01234567
+Index     0123456789
+
+[0,0]     DeER .    .    . .   add x1, x2, x3
+[0,1]     DeER .    .    . .   add x4, x5, x6
+[0,2]     D=eER.    .    . .   add x7, x8, x9
+[1,0]     .DeER.    .    . .   add x1, x2, x3
+[1,1]     .D=eER    .    . .   add x4, x5, x6
+[1,2]     .D=eER    .    . .   add x7, x8, x9
+[...]
diff --git a/manuscrit/assets/src/20_foundations/llvm_mca_examples/40_laas.out b/manuscrit/assets/src/20_foundations/llvm_mca_examples/40_laas.out
new file mode 100644
index 0000000..e498322
--- /dev/null
+++ b/manuscrit/assets/src/20_foundations/llvm_mca_examples/40_laas.out
@@ -0,0 +1,41 @@
+$ echo -e 'ldr x1,[x2]
+add x3,x4,x5
+add x6,x7,x8
+str x9,[x10]' | llvm-mca --march=aarch64 --mcpu=cortex-a72 --timeline -
+
+Iterations:        100
+Instructions:      400
+Total Cycles:      139
+Total uOps:        400
+
+Dispatch Width:    3
+uOps Per Cycle:    2.88
+IPC:               2.88
+Block RThroughput: 1.3
+
+[...]
+
+Resource pressure per iteration:
+[0]    [1.0]  [1.1]  [2]    [3]    [4]    [5]    [6]
+ -     1.00   1.00   1.00    -     1.00    -      -
+
+Resource pressure by instruction:
+[0]    [1.0]  [1.1]  [2]    [3]    [4]    [5]    [6]    Instructions:
+ -      -      -     1.00    -      -      -      -     ldr x1, [x2]
+ -      -     1.00    -      -      -      -      -     add x3, x4, x5
+ -     1.00    -      -      -      -      -      -     add x6, x7, x8
+ -      -      -      -      -     1.00    -      -     str x9, [x10]
+
+
+Timeline view:
+                    012345678
+Index     0123456789
+
+[0,0]     DeeeeER   .    .  .   ldr x1, [x2]
+[0,1]     DeE---R   .    .  .   add x3, x4, x5
+[0,2]     DeE---R   .    .  .   add x6, x7, x8
+[0,3]     .DeE--R   .    .  .   str x9, [x10]
+[1,0]     .DeeeeER  .    .  .   ldr x1, [x2]
+[1,1]     .DeE---R  .    .  .   add x3, x4, x5
+[1,2]     . DeE--R  .    .  .   add x6, x7, x8
+[1,3]     . DeE--R  .    .  .   str x9, [x10]