\section{Kernel optimization and code analyzers}\label{ssec:code_analyzers}

Optimizing a program, in most contexts, mainly means optimizing it from an algorithmic point of view ---~using efficient data structures, running some computations in parallel on multiple cores, etc.
As pointed out in our introduction, though, optimizations close to the machine's microarchitecture can yield large efficiency benefits, sometimes up to two orders of magnitude~\cite{dgemm_finetune}.
These optimizations, however, are difficult to carry out for multiple reasons: they depend on the specific machine on which the code is run; they require deep expert knowledge; they are most often manual, requiring expert time ---~and thus making them expensive.
Such optimizations are, however, routinely used in some domains.
Scientific computation ---~such as ocean simulation, weather forecasting, \ldots{}~--- often relies on the same operations, implemented by low-level libraries optimized in such a way, such as OpenBLAS~\cite{openblas_webpage, openblas_2013} or Intel's MKL~\cite{intel_mkl}, which implement low-level math operations such as linear algebra.
Machine learning applications, on the other hand, may typically be trained for extensive periods of time, on many cores and accelerators, on well-defined hardware, with small portions of code being executed many times on different data; as such, they are very well suited for such specific and low-level optimizations.

\medskip{}

When optimizing those short fragments of code whose efficiency is critical, or \emph{computation kernels}, insights on what limits the code's performance, or \emph{performance bottlenecks}, are precious to the expert.
These insights can be gained by reading the processor's hardware counters, described above in \autoref{sssec:hw_counters}, typically accurate but of limited versatility.
Specialized profilers, such as Intel's VTune~\cite{intel_vtune}, integrate these counters with profiling to derive further performance metrics at runtime.

\subsection{Code analyzers}

Another approach is to rely on \emph{code analyzers}, pieces of software that analyze a code fragment ---~typically at assembly or binary level~---, and provide insights on its performance metrics on a given hardware.
Code analyzers thus work statically, that is, without executing the code.

\paragraph{Common hypotheses.}
Code analyzers operate under common hypotheses, derived from the typical intended usage.
The kernel analyzed is expected to be the body of a loop, or nest of loops, that is iterated enough times to be approximated by an infinite loop.
The kernel is further analyzed under the assumption that it is in \emph{steady-state}; the analysis thus ignores startup or border effects occurring in extremal cases.
As the kernels analyzed are those worth optimizing manually, it is reasonable to assume that they will be executed many times, and to focus on their steady-state.
The kernel is further assumed to be \emph{L1-resident}, that is, to work only on data that resides in the L1 cache.
This assumption is reasonable in two ways.
First, if data must be fetched from farther caches, or even the main memory, these fetch operations will be multiple orders of magnitude slower than the computation being analyzed, making it useless to optimize this kernel for CPU efficiency ---~the expert should, in this case, focus instead on data locality, prefetching, etc.
Second, code analyzers typically focus only on the CPU itself, and ignore memory effects.
This hypothesis formalizes this focus; code analyzers' metrics are thus to be regarded \textit{assuming the CPU is the bottleneck}.
Code analyzers also disregard control flow, and thus assume the code to be \emph{straight-line code}: the kernel analyzed is considered as a sequence of instructions without influence on the control flow, executed in order, and jumping unconditionally back to the first instruction after the last ---~or, more accurately, the last jump is always assumed taken, and any control flow instruction in the middle is assumed not taken, while their computational cost is accounted for.

\paragraph{Metrics produced.}
The insights provided as output vary with the code analyzer used.
All of them are able to predict either the throughput or reciprocal throughput ---~defined below~--- of the kernel studied, that is, how many cycles one iteration of the loop takes, on average and in steady-state.
Although throughput can already be measured at runtime with hardware counters, a static estimation ---~if reliable~--- is already an improvement, as a static analyzer is typically faster than running the actual program under profiling.
Each code analyzer relies on a model, or a collection of models, of the hardware on which it provides analyses.
Depending on what is, or is not, modeled by a specific code analyzer, it may further extract any available and relevant metric from its model: whether the frontend is saturated, which computation units from the backend are stressed and by which precise instructions, when the CPU stalls and why, etc.
Code analyzers may further point towards the resources that are limiting the kernel's performance, or \emph{bottlenecks}.

\paragraph{Static vs.\ dynamic analyzers.}
Tools analyzing code, and code analyzers among them, generally perform either \emph{static} or \emph{dynamic} analyses.
Static analyzers work on the program itself, be it source code, assembly or any representation, without running it; dynamic analyzers run the analyzed program, keeping it under scrutiny through instrumentation, monitoring or any relevant technique.
Some analyzers mix both strategies to further refine their analyses.
As a general rule of thumb, dynamic analyzers are typically more accurate, being able to study the actual execution trace (or traces) of the program, but are significantly slower due to instrumentation's large overhead and focus more on the general, average case than on edge cases.
As most code analyzers are static, this manuscript largely focuses on static analysis.
The only dynamic code analyzer we are aware of is \gus{}, described more thoroughly later in \autoref{sec:sota}, which trades a much longer runtime for a gain in accuracy, especially regarding data dependencies that may not be easily obtained otherwise.

\paragraph{Input formats used.}
The analyzers studied in this manuscript all take as input either assembly code, or assembled binaries.
In the case of assembly code, as for instance with \llvmmca{}, analyzers take either a short assembly snippet, treated as straight-line code and analyzed as such; or longer pieces of assembly, part or parts of which are marked for analysis by surrounding assembly comments.
In the case of assembled binaries, as all analyzers were run on Linux, executables or object files are ELF files.
Some analyzers work on sections of the file defined by user-provided offsets in the binary, while others require the presence of \iaca{} markers around the code portion or portions to be analyzed.
Those markers, introduced by \iaca{} as C-level preprocessor statements, consist of the following x86 assembly snippets:

\hfill\begin{minipage}{0.35\textwidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
mov ebx, 111
db 0x64, 0x67, 0x90
\end{lstlisting}
\textit{\iaca{} start marker}
\end{minipage}\hfill\begin{minipage}{0.35\textwidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
mov ebx, 222
db 0x64, 0x67, 0x90
\end{lstlisting}
\textit{\iaca{} end marker}
\end{minipage}

\medskip

On UNIX-based operating systems, the standard format for assembled binaries ---~either object files (\lstc{.o}) or executables~--- is ELF~\cite{elf_tis}.
Such files are organized in sections, the assembled instructions themselves being found in the \texttt{.text} section ---~the rest holding metadata, program data (strings, icons, \ldots), debugging information, etc.
When an ELF is loaded to memory for execution, each segment may be \emph{mapped} to a portion of the address space.
For instance, if the \texttt{.text} section has 1024 bytes, starting at offset 4096 of the ELF file itself, it may be mapped at virtual address \texttt{0x454000}; as such, the byte that could be read from the program by dereferencing address \texttt{0x454010} would be the byte at offset 16 in the \texttt{.text} section, that is, the byte at offset 4112 in the ELF file.
Throughout the ELF file, \emph{symbols} are defined as references, or pointers, to specific offsets or chunks in the file.
This mechanism is used, among others, to refer to the program's functions.
For instance, a symbol \texttt{main} may be defined, that would point to the offset of the first byte of the \lstc{main} function, and may also hold its total number of bytes.
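
\medskip

The address arithmetic of the example above can be sketched in a few lines of Python. This is only an illustration: the function \texttt{vaddr\_to\_file\_offset} and its parameters are hypothetical names, standing for values that a real tool would read from the ELF section headers.

\begin{lstlisting}[language=Python]
def vaddr_to_file_offset(vaddr, section_vaddr, section_offset, section_size):
    """Map a virtual address inside a mapped section to its ELF file offset.

    All parameters are illustrative: they stand for values that would be
    read from the section header of the ELF file.
    """
    delta = vaddr - section_vaddr
    if not 0 <= delta < section_size:
        raise ValueError("address is outside of the mapped section")
    return section_offset + delta


# Values from the example above: a 1024-byte .text section stored at file
# offset 4096 and mapped at virtual address 0x454000.
offset = vaddr_to_file_offset(0x454010, 0x454000, 4096, 1024)
print(offset)  # 4112
\end{lstlisting}

The range check reflects the fact that the translation is only meaningful for addresses that fall inside the mapped section.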
Both these mechanisms can be used to identify, without \iaca{} markers or the like, a section of ELF file to be analyzed: an offset and size in the \texttt{.text} section can be provided (which can be found with tools like \lstc{objdump}), or a symbol name can be provided, if an entire function is to be analyzed.

\subsection{Examples with \llvmmca}

We have now covered enough of the theoretical background to introduce code analyzers in a concrete way, through examples of their usage.
For this purpose, we use \llvmmca{}, one of the state-of-the-art code analyzers.
Due to its relative simplicity --~at least compared to \eg{} Intel's x86-64 implementations~--, we will base the following examples on ARM's Cortex A72, which we introduce in depth later in \autoref{chap:frontend}.
No specific knowledge of this microarchitecture is required to understand the following examples; for our purposes, it suffices to say that:
\begin{itemize}
    \item the A72 has a single load port, a single store port and two integer arithmetic ports;
    \item the \texttt{xN} registers are 64-bit registers;
    \item the \texttt{ldr} instruction (\textbf{l}oa\textbf{d} \textbf{r}egister) loads a value from memory into a register;
    \item the \texttt{str} instruction (\textbf{st}ore \textbf{r}egister) stores the value of a register to memory;
    \item the \texttt{add} instruction adds integer values from its two last operands and stores the result in the first.
\end{itemize}

\bigskip{}

\paragraph{Simple example: a single load.}
We start by running \llvmmca{} on a single load operation: \lstarmasm{ldr x1, [x2]}.
\lstinputlisting[language=bash]{assets/src/20_foundations/llvm_mca_examples/01_ldr.out}
The first rows (2--10) are high-level metrics.
\llvmmca{} works by simulating the execution of the kernel --~here, 100 times, as seen on row 2~--.
This simple kernel contains only one instruction, which breaks down into a single \uop{}.
Iterating it takes 106 cycles instead of the expected 100 cycles, as this execution is \emph{not} in steady-state, but accounts for the cycles from the decoding of the first instruction to the retirement of the last.
Row 7 indicates that the frontend can issue at most 3 \uops{} per cycle.
The next two rows are simple ratios.
Row 10 is the block's \emph{reciprocal throughput}, which we formalize later in \autoref{sssec:def:rthroughput}, but is roughly defined as the number of cycles a single iteration of the kernel takes.
The next section, \emph{instruction info}, lists data about the instructions present.
Finally, the last section, \emph{resources}, breaks down the load that individual instructions incur on execution ports, first aggregating it by full iteration of the kernel, then instruction by instruction.
The maximal load of each port is normalized to 1, which amounts to saying that it is expressed in number of cycles required to process the load.
Here, the only pressure is 1 on the port labeled \texttt{[2]}, that is, the load port.
Thus, the kernel cannot complete in less than a full cycle, as it takes up all load resources available.

\paragraph{The timeline mode.}
Another useful view that can be displayed by \llvmmca{} is its timeline mode, enabled by passing an extra \lstbash{--timeline} flag.
In the previous example, it further outputs:
\lstinputlisting[language=bash]{assets/src/20_foundations/llvm_mca_examples/02_ldr_timeline.out}
which indicates, for each instruction, the timeline of its execution.
Here, \texttt{D} stands for decode, \texttt{e} for being executed --~in the pipeline~--, \texttt{E} for last cycle of its execution --~leaving the pipeline~--, \texttt{R} for retiring.
When an instruction is decoded and waiting to be dispatched to execution, a \texttt{=} is shown.
The identifier at the beginning of each row indicates the kernel iteration number, and the instruction within.
Here, we can better understand the 106 cycles seen earlier: it takes a first cycle to decode the first instruction, the instruction remains in the pipeline for 5 cycles, and must finally be retired.
In steady-state, however, the instruction would be already decoded (while a previous instruction was being executed), the retirement would also be taking place while another instruction executes, and the pipeline would be accepting new instructions for four of these five cycles.
We can thus avoid using up 6 of those 106 cycles in steady-state, taking us back to the expected 100 cycles.

\paragraph{Single integer add.}
If we substitute this load operation with an integer add operation, the reciprocal throughput is halved:
\lstinputlisting[language=bash]{assets/src/20_foundations/llvm_mca_examples/10_add.out}
Indeed, as we have two integer arithmetic units, two adds may be executed in parallel, as can be seen in the timeline view.

\paragraph{Load and two adds.}
If we combine those two instructions in a kernel with a single load and two adds, we obtain a kernel that still fits in the execution ports in a single cycle.
\llvmmca{} confirms this:
\lstinputlisting[language=bash]{assets/src/20_foundations/llvm_mca_examples/20_laa.out}
We can indeed see that an iteration fully utilizes the three ports, but still fits: the kernel still manages to have a reciprocal throughput of 1.

\newpage

\paragraph{Three adds.}
A kernel of three adds, however, will not be able to run in a single cycle:
\lstinputlisting[language=bash]{assets/src/20_foundations/llvm_mca_examples/30_aaa.out}
The resource pressure by iteration view confirms that we exceed the integer arithmetic capacity of the processor for a single cycle.
This is correctly reflected in the timeline view: the instruction \texttt{[0,2]} starts executing only at cycle 3, along with \texttt{[1,0]}.

\paragraph{Load, store and two adds.}
A kernel of one load, two adds and one store should, ports-wise, fit in a single cycle.
However, \llvmmca{} reports a reciprocal throughput of 1.3 for this kernel:
\lstinputlisting[language=bash]{assets/src/20_foundations/llvm_mca_examples/40_laas.out}
While the resource pressure views confirm that the ports are able to handle this kernel in a single cycle, the timeline shows that it is in fact the frontend that stalls the computation.
As only three instructions may be decoded and issued per cycle, the backend is not fed with enough instructions per cycle to reach a reciprocal throughput of 1.

\subsection{Definitions}

\subsubsection{Throughput and reciprocal throughput}\label{sssec:def:rthroughput}

Given a kernel $\kerK$ of straight-line assembly code, we have referred to $\cyc{\kerK}$ as the \emph{reciprocal throughput} of $\kerK$, that is, how many cycles $\kerK$ will require to complete its execution in steady-state.
We define this notion here more formally.

\begin{notation}[$\kerK^n$]\label{not:kerK_N}
    Given a kernel $\kerK$ and a positive integer $n \in \nat^*$, we note $\kerK^n$ the kernel $\kerK$ repeated $n$ times, that is, the instructions of $\kerK$ concatenated $n$ times.
\end{notation}

\begin{definition}[$C(\kerK)$]
    The \emph{number of cycles} of a kernel $\kerK$ is defined, \emph{in steady-state}, as the number of elapsed cycles from the moment the first instruction of $\kerK$ starts to be decoded to the moment the last instruction of $\kerK$ is issued.
    We note $C(\kerK)$ the number of cycles of $\kerK$.
    We extend this definition so that $C(\emptyset) = 0$; however, care must be taken that, as we work in steady-state, this $\emptyset$ must be \emph{in the context of a given kernel} (\ie{} we run $\kerK$ until steady-state is reached, then consider how many cycles it takes to execute 0 further instructions).
    This context is clarified by noting $\ckn{0}$.
\end{definition}

Due to the pipelined nature of execution units, this means that the same instruction of each iteration of $\kerK$ will be retired ---~\ie{} yield its result~--- every steady-state execution time.
For this reason, the execution time is measured until the last instruction is issued, not retired.

\begin{lemma}[Periodicity of $\ckn{n+1}-\ckn{n}$]
    Given a kernel $\kerK$, the sequence $\left(\ckn{n+1} - \ckn{n}\right)_{n \in \nat}$ is periodic, that is, there exists $p \in \nat^*$ such that
    \[ \forall n \in \nat, \ckn{n+1} - \ckn{n} = \ckn{n+p+1} - \ckn{n+p} \]
    We note this period $\calP(\kerK)$.
\end{lemma}

\begin{proof}
    The number of CPU resources that can be shared between instructions in a processor is finite (and relatively small, usually on the order of magnitude of 10).
    These resources are typically the number of \uops{} issued for each port in the current cycle, the number of decoded instructions, the total number of \uops{} issued this cycle, and such.
    For each of these resources, the number of possible states is also finite (and also small).
    Thus, the total number of possible states of a processor at the end of a kernel iteration cannot exceed the number of combinations of those states.
    For a given kernel $\kerK$, we note $\sigma(\kerK)$ the CPU state reached after executing $\kerK$, in steady-state.
    Given a kernel $\kerK$, the set $\left\{\sigma(\kerK^n), n\in \nat\right\}$ is a subset of the total set of possible states of the processor, and is thus finite ---~and, in all realistic cases, is usually way smaller than the full set, given that only a portion of those resources are used by a kernel.

    \medskip{}

    We further note that, for all $n \in \nat$, $\sigma(\kerK^{n+1})$ is a function of only the processor considered, $\kerK$ and $\sigma(\kerK^n)$: indeed, a steady-state for $\kerK^{n}$ is also a steady-state for $\kerK^{n+1}$ and, knowing $\sigma(\kerK^n)$, the execution can be continued for the following $\kerK$, reaching $\sigma(\kerK^{n+1})$.
    \medskip{}

    Thus, by the pigeonhole principle, there exists $p \in \nat^*$ such that $\sigma(\kerK) = \sigma(\kerK^{p+1})$.
    By induction, as each state depends only on the previous one, we thus obtain that $(\sigma(\kerK^n))_n$ is periodic of period $p$.
    As the number of cycles needed to execute $\kerK$ only depends on the initial state of the processor, we thus have
    \[ \forall n \in \nat, \ckn{n+1} - \ckn{n} = \ckn{n+p+1} - \ckn{n+p} \]
\end{proof}

\begin{definition}[Reciprocal throughput of a kernel]\label{def:cyc_kerK}
    The \emph{reciprocal throughput} of a kernel $\kerK$, noted $\cyc{\kerK}$ and measured in \emph{cycles per iteration}, is also called the steady-state execution time of a kernel.
    We note $p \in \nat^*$ the period of $\ckn{n+1} - \ckn{n}$ (by the above lemma), and define
    \[ \cyc{\kerK} = \dfrac{\ckn{p}}{p} \]
\end{definition}

We define this as the average on a whole period because subsequent kernel iterations may ``share'' a cycle.

\begin{example}
    Let $\kerK$ be a kernel of three instructions, and assume that a given processor can only issue two instructions per cycle, but has no other bottleneck for $\kerK$.
    Then, $C(\kerK) = 2$, as three instructions cannot be issued in a single cycle; yet $\ckn{2} = 3$, as six instructions can be issued in only three cycles.
    In this case, the period $p$ is clearly $2$.
    Thus, in this case, $\cyc{\kerK} = 1.5$.
\end{example}

\begin{remark}
    As $C(\kerK)$ depends on the microarchitecture of the processor considered, the reciprocal throughput $\cyc{\kerK}$ of a kernel $\kerK$ depends on the processor considered.
\end{remark}

\medskip

\begin{lemma}
    Let $\kerK$ be a kernel and $p = \calP(\kerK)$.
    For all $n \in \nat^*$ such that $n = kp + r$, with $k, r \in \nat$, $1 \leq r \leq p$,
    \[ \ckn{n} = k \ckn{p} + \ckn{r} \]
\end{lemma}

\begin{proof}
    From the previous lemma instantiated with $n = 0$, and as $\ckn{0} = 0$, we have
    \begin{align*}
        \ckn{1} - \ckn{0} &= \ckn{p+1} - \ckn{p} \\
        \iff{} \ckn{p} &= \ckn{p+1} - \ckn{1}
    \end{align*}
    and thus by induction, $\forall m \in \nat, \ckn{m+p} - \ckn{m} = \ckn{p}$.

    \medskip{}

    Thus, if $k = 0$, the property is trivial.
    If $k = 1$, it is a direct application of the above:
    \[ \ckn{p+r} = \ckn{p} + \ckn{r} \]
    The cases $k > 1$ follow by induction on $k$.
\end{proof}

\begin{lemma}\label{lem:cyc_k_conv}
    Given a kernel $\kerK$,
    \[ \dfrac{C(\kerK^n)}{n} \limarrow{n}{\infty} \cyc{\kerK} \]
    Furthermore, this convergence is linear:
    \[ \abs{\dfrac{\ckn{n}}{n} - \cyc{\kerK}} = \bigO{\dfrac{1}{n}} \]
\end{lemma}

\begin{proof}
    Let $n \in \nat^*$.
    We note $p \in \nat^*$ the period given by the above lemma.
    Let $k \in \nat$ and $r \in \nat^*$ such that $n = kp+r$, $0 < r \leq p$.
    \begin{align*}
        \ckn{n} &= k \cdot \ckn{p} + \ckn{r} & \textit{(by lemma)} \\
        &= kp \dfrac{\ckn{p}}{p} + \ckn{r} \\
        &= kp \cyc{\kerK} + \ckn{r} \\
        \implies \abs{\ckn{n} - n \cyc{\kerK}} &= \abs{kp \cyc{\kerK} + \ckn{r} - (kp+r) \cyc{\kerK}}\\
        &= \abs{\ckn{r} - r \cyc{\kerK}} \\
        &\leq \ckn{r} + r \cyc{\kerK} & \textit{(all terms are nonnegative)} \\
        &\leq \left(\max_{m \leq p}\ckn{m}\right) + p \cyc{\kerK}
    \end{align*}
    This last right-hand expression is independent of $n$; we note it $M$.
    Dividing by $n$, we obtain
    \[ \abs{\dfrac{\ckn{n}}{n} - \cyc{\kerK}} \leq \dfrac{M}{n} \]
    from which both results follow.
\end{proof}

\medskip

Throughout this manuscript, we mostly use reciprocal throughput as a metric, as we find it more relevant from an optimization point of view ---~an opinion we detail in \autoref{chap:CesASMe}.
However, the \emph{throughput} of a kernel is most widely used in the literature in its stead.
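
\medskip

The period, the reciprocal throughput and the bound of \autoref{lem:cyc_k_conv} can be checked numerically on a toy model. The sketch below assumes, as in the example above, a hypothetical processor whose only bottleneck is its issue width $w$, so that $\ckn{n} = \lceil n \card{\kerK} / w \rceil$. This closed form and all function names are assumptions made for illustration; they are not the model of any actual analyzer.

\begin{lstlisting}[language=Python]
from fractions import Fraction


def cycles(n, kernel_size, issue_width):
    """C(K^n) in the toy model: ceil(n * |K| / issue_width)."""
    return (n * kernel_size + issue_width - 1) // issue_width


def period(kernel_size, issue_width, horizon=64):
    """Smallest p such that C(K^{n+1}) - C(K^n) is p-periodic on the window."""
    diffs = [cycles(n + 1, kernel_size, issue_width)
             - cycles(n, kernel_size, issue_width)
             for n in range(2 * horizon)]
    for p in range(1, horizon + 1):
        if all(diffs[n] == diffs[n + p] for n in range(horizon)):
            return p
    raise ValueError("no period found within the horizon")


def reciprocal_throughput(kernel_size, issue_width):
    """C(K^p) / p, as an exact fraction."""
    p = period(kernel_size, issue_width)
    return Fraction(cycles(p, kernel_size, issue_width), p)


# Three instructions, two issued per cycle (the example above).
p = period(3, 2)
rthroughput = reciprocal_throughput(3, 2)
print(p, rthroughput)  # 2 3/2

# Bound of the convergence lemma: |C(K^n)/n - rthroughput| <= M / n.
M = max(cycles(m, 3, 2) for m in range(1, p + 1)) + p * rthroughput
for n in range(1, 200):
    assert abs(Fraction(cycles(n, 3, 2), n) - rthroughput) <= M / n
\end{lstlisting}

Under the same assumptions, a kernel of four instructions with an issue width of three has a period of 3 and a reciprocal throughput of $\sfrac{4}{3}$, which is consistent with the value of 1.3 reported by \llvmmca{} for the kernel of one load, two adds and one store.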
\medskip

\begin{definition}[Throughput of a kernel]
    The \emph{throughput} of a kernel $\kerK$, measured in \emph{instructions per cycle}, or IPC, is defined as the number of instructions in $\kerK$, divided by the steady-state execution time of $\kerK$:
    \[ \operatorname{IPC}(\kerK) = \dfrac{\card{\kerK}}{\cyc{\kerK}} \]
\end{definition}

In the literature or in analyzers' reports, the throughput of a kernel is often referred to as its \emph{IPC} (its unit).

\newpage

\begin{notation}[Experimental measure of $\cyc{\kerK}$]
    We note $\cycmes{\kerK}{n}$ the experimental measure of $\cyc{\kerK}$, realized by:
    \begin{itemize}
        \item sampling the hardware counter of total number of instructions retired and the counter of total number of cycles elapsed,
        \item executing $\kerK^n$,
        \item sampling again the same counters, and noting respectively $\Delta_n\text{ret}$ and $\Delta_{n}C$ their differences,
        \item noting $\cycmes{\kerK}{n} = \dfrac{\Delta_{n}C\cdot \card{\kerK}}{\Delta_n\text{ret}}$, where $\card{\kerK}$ is the number of instructions in $\kerK$.
    \end{itemize}
\end{notation}

\begin{lemma}
    For any kernel $\kerK$, $\cycmes{\kerK}{n} \limarrow{n}{\infty} \cyc{\kerK}$.
\end{lemma}

\begin{proof}
    For an integer number of kernel iterations $n$, $\sfrac{\Delta_n\text{ret}}{\card{\kerK}} = n$.
    While measurement errors may make $\Delta_{n}\text{ret}$ fluctuate slightly, this fluctuation remains below a constant threshold $E_\text{ret}$:
    \[ \abs{\dfrac{\Delta_n\text{ret}}{\card{\kerK}} - n} \leq E_\text{ret} \]
    In the same way, and due to the pipelining effects we noted below the definition of $\cyc{\kerK}$,
    \[ \abs{\Delta_{n}C - C(\kerK^n)} \leq E_C \]
    with $E_C$ a constant.
    As those errors are constant, while the other quantities grow linearly with $n$, we thus have
    \[ \cycmes{\kerK}{n} = \dfrac{\Delta_n C}{\sfrac{\Delta_n\text{ret}}{\card{\kerK}}} \limarrow{n}{\infty} \dfrac{C(\kerK^n)}{n} \]
    and, composing limits with the previous lemma, we thus obtain
    \[ \cycmes{\kerK}{n} \limarrow{n}{\infty} \cyc{\kerK} \]
\end{proof}

Given this property, we will use $\cyc{\kerK}$ to refer to $\cycmes{\kerK}{n}$ for large values of $n$ in this manuscript whenever it is clear that this value is a measure.

\subsubsection{Basic block of an assembly-level program}

Code analyzers are meant to analyze sections of straight-line code, that is, portions of code which do not contain control flow.
As such, it is convenient to split the program into \emph{basic blocks}, that is, portions of straight-line code linked to other basic blocks to reflect control flow.
We define this notion here formally, to use it soundly in the following chapters of this manuscript.

\newpage % FIXME

\begin{notation}
    For the purposes of this section,
    \begin{itemize}
        \item we formalize a segment of assembly code as a sequence of instructions;
        \item we confuse an instruction with its address.
    \end{itemize}

    \smallskip{}

    An instruction is said to be a \emph{flow-altering instruction} if it may alter the normal control flow of the program.
    This is typically true of jumps (conditional or unconditional), function calls, function returns, \ldots

    \smallskip{}

    An address is said to be a \emph{jump site} if any flow-altering instruction in the considered sequence may transfer control to this address (excluding the address reached by the natural flow of the program, \eg{} when a conditional jump is not taken).
\end{notation}

\begin{definition}[Basic block decomposition]
    Consider a sequence of assembly code $A$.
    We note $J_A$ the set of jump sites of $A$, and $F_A$ the set of flow-altering instructions of $A$.
    As each element of those sets is the address of an instruction, we note $F^+_A$ the set of addresses of instructions \emph{directly following} an instruction from $F_A$ ---~note that, as instructions may be longer than one byte, it is not sufficient to increase by 1 each address from $F_A$.
    We note $S_A = J_A \cup F^+_A$.
    We split the instructions from $A$ into $BB(A)$, the set of segments that begin either at the beginning of $A$, or at instructions from $S_A$ ---~less formally, we split $A$ at each point from $S_A$, including each boundary in the following segment.
    The members of $BB(A)$ are the \emph{basic blocks} of $A$, and are segments of code which, by construction, will always be executed as straight-line code, and whose execution will always begin from their first instruction.
\end{definition}

\begin{remark}
    This definition gives a direct algorithm to split a segment of assembly code into basic blocks, as long as we have access to a semantics of the considered assembly that indicates whether an instruction is flow-altering and, if so, what its possible jump sites are.
\end{remark}
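
\medskip

The algorithm suggested by this remark can be sketched as follows. This is a minimal illustration, assuming the semantics is already available as per-instruction data: the \texttt{Instr} record, its fields and the \texttt{basic\_blocks} function are hypothetical names, and a real tool would obtain this information from a disassembler.

\begin{lstlisting}[language=Python]
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record, filled from some disassembler."""
    addr: int                    # address of the instruction
    size: int                    # encoded length in bytes
    flow_altering: bool = False  # whether it may alter control flow
    targets: tuple = ()          # statically known jump sites, if any


def basic_blocks(code):
    """Split a list of Instr, sorted by address, into basic blocks."""
    addrs = {i.addr for i in code}
    # J_A: jump sites that fall inside the sequence.
    jump_sites = {t for i in code if i.flow_altering
                  for t in i.targets if t in addrs}
    # F_A^+: addresses directly following a flow-altering instruction.
    following = {i.addr + i.size for i in code if i.flow_altering}
    split_points = jump_sites | following  # S_A
    blocks = []
    for instr in code:
        # Start a new block at the beginning of A and at each point of S_A.
        if not blocks or instr.addr in split_points:
            blocks.append([])
        blocks[-1].append(instr)
    return blocks


# Five 4-byte instructions; the one at address 8 is a conditional jump to 4,
# the one at address 16 is a return (no statically known target).
code = [
    Instr(0, 4),
    Instr(4, 4),
    Instr(8, 4, flow_altering=True, targets=(4,)),
    Instr(12, 4),
    Instr(16, 4, flow_altering=True),
]
blocks = basic_blocks(code)
print([[i.addr for i in b] for b in blocks])  # [[0], [4, 8], [12, 16]]
\end{lstlisting}

Using \texttt{addr + size} rather than \texttt{addr + 1} to build $F^+_A$ accounts for instructions longer than one byte, as noted in the definition.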