\section{Kernel optimization and code analyzers}
Optimizing a program, in most contexts, mainly means optimizing it from an algorithmic point of view ---~using efficient data structures, running some computations in parallel on multiple cores, etc. As pointed out in our introduction, though, optimizations close to the machine's microarchitecture can yield large efficiency benefits, sometimes up to two orders of magnitude~\cite{dgemm_finetune}. These optimizations, however, are difficult to carry out, for multiple reasons: they depend on the specific machine on which the code is run; they require deep expert knowledge; and they are most often manual, requiring expert time ---~and thus making them expensive.
Such optimizations are nevertheless routinely used in some domains. Scientific computations ---~ocean simulation, weather forecasting, \ldots{}~--- often rely on the same core operations, implemented by low-level libraries optimized in this way, such as OpenBLAS~\cite{openblas_webpage, openblas_2013} or Intel's MKL~\cite{intel_mkl}, which provide carefully tuned linear algebra routines. Machine learning applications, on the other hand, are typically trained for extensive periods of time, on many cores and accelerators, on well-defined hardware, with small portions of code being executed many times on different data; as such, they are very well suited to such specific, low-level optimizations.
\medskip{}
When optimizing those short fragments of code whose efficiency is critical, or \emph{computation kernels}, insights on what limits the code's performance, or \emph{performance bottlenecks}, are precious to the expert. These insights can be gained by reading the processor's hardware counters, described above in \autoref{sssec:hw_counters}, which are typically accurate but of limited versatility. Specialized profilers, such as Intel's VTune~\cite{intel_vtune}, integrate these counters with profiling to derive further performance metrics at runtime.
\subsection{Code analyzers}
Another approach is to rely on \emph{code analyzers}, pieces of software that analyze a code fragment ---~typically at the assembly or binary level~--- and provide insights on its performance metrics on given hardware. Code analyzers thus work statically, that is, without executing the code.
\paragraph{Common hypotheses.} Code analyzers operate under a common set of hypotheses, derived from their typical intended usage. The kernel analyzed is expected to be the body of a loop, or nest of loops, iterated enough times to be reasonably approximated by an infinite loop. The kernel is further analyzed under the assumption that it is in \emph{steady-state}, thus ignoring startup or border effects occurring in extremal cases. As the kernels analyzed are those worth optimizing manually, it is reasonable to assume that they will be executed many times, and to focus on their steady-state.
The kernel is further assumed to be \emph{L1-resident}, that is, to work only on data that resides in the L1 cache. This assumption is reasonable in two ways. First, if data must be fetched from farther caches, or even from main memory, these fetch operations will be multiple orders of magnitude slower than the computation being analyzed, making it pointless to optimize this kernel for CPU efficiency ---~the expert should, in this case, focus instead on data locality, prefetching, etc. Second, code analyzers typically focus only on the CPU itself and ignore memory effects. This hypothesis formalizes this focus; code analyzers' metrics are thus to be regarded \textit{assuming the CPU is the bottleneck}.
Code analyzers also disregard control flow, and thus assume the code to be \emph{straight-line code}: the kernel analyzed is considered as a sequence of instructions without influence on the control flow, executed in order, and jumping unconditionally back to the first instruction after the last ---~or, more accurately, the last jump is always assumed taken, and any control flow instruction in the middle is assumed not taken, while its computational cost is still accounted for.
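As an illustration, consider the following loop, a minimal example of our own (function name and array size are ours) in the spirit of the kernels targeted by code analyzers: its body compiles down to a handful of straight-line instructions, iterated many times over L1-resident data.
\begin{lstlisting}[language=C]
#include <stddef.h>

#define N 1024  /* two arrays of 1024 floats, i.e. 8 KiB: fits in a typical 32 KiB L1 data cache */

/* The body of this loop is the computation kernel: in steady-state, each
 * iteration performs two loads, a multiplication and an addition (possibly
 * fused), a store, and the loop control, with no other control flow. */
void saxpy(float a, const float *x, float *y) {
    for (size_t i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];
}
\end{lstlisting}
Such a loop runs long enough for its steady-state behaviour to dominate its total execution time, which is precisely what code analyzers model.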
\paragraph{Metrics produced.} The insights provided as output vary with the code analyzer used. All of them are able to predict either the throughput or the reciprocal throughput ---~defined below~--- of the kernel studied, that is, how many cycles one iteration of the loop takes, on average and in steady-state. Although throughput can already be measured at runtime with hardware counters, a static estimation ---~if reliable~--- is already an improvement, as a static analyzer is typically faster than running the actual program under profiling.
Each code analyzer relies on a model, or a collection of models, of the hardware for which it provides analyses. Depending on what is, or is not, modelled by a specific code analyzer, it may further extract any available and relevant metric from its model: whether the frontend is saturated, which computation units from the backend are stressed and by which precise instructions, when the CPU stalls and why, etc. Code analyzers may further point towards the resources that limit the kernel's performance, or \emph{bottlenecks}.
\paragraph{Static vs.\ dynamic analyzers.} Tools analyzing code, and code analyzers among them, generally perform either \emph{static} or \emph{dynamic} analyses. Static analyzers work on the program itself, be it source code, assembly or any other representation, without running it; dynamic analyzers, on the other hand, run the analyzed program, keeping it under scrutiny through instrumentation, monitoring or any other relevant technique. Some analyzers mix both strategies to further refine their analyses. As a general rule of thumb, dynamic analyzers are typically more accurate, being able to study the actual execution trace (or traces) of the program, but they are significantly slower due to the large overhead of instrumentation, and they focus more on the general, average case than on edge cases. As most code analyzers are static, this manuscript largely focuses on static analysis. The only dynamic code analyzer we are aware of is \gus{}, described more thoroughly in \autoref{sec:sota}; it trades a heavy runtime cost for accuracy, especially regarding data dependencies that may not be easily obtained otherwise.
\paragraph{Input formats used.} The analyzers studied in this manuscript all take as input either assembly code or assembled binaries. In the case of assembly code, as for instance with \llvmmca{}, analyzers take either a short assembly snippet, treated as straight-line code and analyzed as such, or longer pieces of assembly, part or parts of which are marked for analysis by surrounding assembly comments. In the case of assembled binaries, as all analyzers were run on Linux, executables or object files are ELF files. Some analyzers work on sections of the file defined by user-provided offsets in the binary, while others require the presence of \iaca{} markers around the code portion or portions to be analyzed.
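At the C level, these markers are conventionally inserted through the \lstc{IACA_START} and \lstc{IACA_END} macros provided by the \lstc{iacaMarks.h} header shipped with \iaca{}; the sketch below is only an illustration (the function and its parameters are ours), following the usual pattern of marking a loop body:
\begin{lstlisting}[language=C]
#include <stddef.h>
#include "iacaMarks.h"  /* ships with IACA; defines IACA_START and IACA_END */

void saxpy(float a, const float *x, float *y, size_t n) {
    for (size_t i = 0; i < n; i++) {
        IACA_START          /* start marker, at the top of the loop body */
        y[i] = a * x[i] + y[i];
    }
    IACA_END                /* end marker, right after the loop */
}
\end{lstlisting}
Marking the loop this way makes the analyzed region cover one loop iteration, including its backward jump, which matches the straight-line, infinite-loop hypotheses described above.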
At the assembly level, these markers, which \iaca{} introduces as C-level preprocessor statements, consist of the following x86 snippets:
\hfill\begin{minipage}{0.35\textwidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
mov ebx, 111
db 0x64, 0x67, 0x90
\end{lstlisting}
\textit{\iaca{} start marker}
\end{minipage}\hfill\begin{minipage}{0.35\textwidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
mov ebx, 222
db 0x64, 0x67, 0x90
\end{lstlisting}
\textit{\iaca{} end marker}
\end{minipage}
\medskip
On UNIX-based operating systems, the standard format for assembled binaries ---~either object files (\lstc{.o}) or executables~--- is ELF~\cite{elf_tis}. Such files are organized in sections, the assembled instructions themselves being found in the \texttt{.text} section ---~the rest holding metadata, program data (strings, icons, \ldots), debugging information, etc. When an ELF file is loaded in memory for execution, its contents may be \emph{mapped} to portions of the address space. For instance, if the \texttt{.text} section is 1024 bytes long and starts at offset 4096 of the ELF file itself, it may be mapped at virtual address \texttt{0x454000}; the byte read by the program when dereferencing address \texttt{0x454010} would then be the byte at offset 16 from the start of the \texttt{.text} section, that is, the byte at offset $4096 + 16 = 4112$ in the ELF file.
Throughout the ELF file, \emph{symbols} are defined as references, or pointers, to specific offsets or chunks in the file. This mechanism is used, among others, to refer to the program's functions. For instance, a symbol \texttt{main} may be defined, pointing to the offset of the first byte of the \lstc{main} function and also holding its size in bytes.
Both these mechanisms can be used to identify, without \iaca{} markers or the like, the portion of an ELF file to be analyzed: either an offset and size in the \texttt{.text} section can be provided (which can be found with tools such as \lstc{objdump}), or a symbol name can be provided, if an entire function is to be analyzed.
\subsection{Examples with \llvmmca}
\todo{}
\subsection{Definitions}
\subsubsection{Throughput and reciprocal throughput}
Given a kernel $\kerK$ of straight-line assembly code, we have referred to $\cyc{\kerK}$ as the \emph{reciprocal throughput} of $\kerK$, that is, the number of cycles $\kerK$ requires to complete one execution in steady-state. We define this notion here more formally.
\begin{notation}[$\kerK^n$]\label{not:kerK_N}
Given a kernel $\kerK$ and a positive integer $n \in \nat^*$, we note $\kerK^n$ the kernel $\kerK$ repeated $n$ times, that is, the instructions of $\kerK$ concatenated $n$ times.
\end{notation}
\begin{definition}[$C(\kerK)$]
The \emph{number of cycles} of a kernel $\kerK$ is defined, \emph{in steady-state}, as the number of cycles elapsed from the moment the first instruction of $\kerK$ starts to be decoded to the moment the last instruction of $\kerK$ is issued. We note $C(\kerK)$ the number of cycles of $\kerK$.
We extend this definition so that $C(\emptyset) = 0$; however, care must be taken that, as we work in steady-state, this $\emptyset$ must be understood \emph{in the context of a given kernel} (\ie{} we run $\kerK$ until steady-state is reached, then consider how many cycles it takes to execute zero further instructions). This context is clarified by noting $\ckn{0}$.
\end{definition}
Due to the pipelined nature of execution units, this means that, in steady-state, a given instruction of $\kerK$ is retired ---~\ie{} yields its result~--- once every steady-state execution time, one occurrence for each iteration of $\kerK$. For this reason, the execution time is measured until the last instruction is issued, not retired.
\begin{lemma}[Periodicity of $\ckn{n+1}-\ckn{n}$]
Given a kernel $\kerK$, the sequence $\left(\ckn{n+1} - \ckn{n}\right)_{n \in \nat}$ is periodic, that is, there exists $p \in \nat^*$ such that
\[ \forall n \in \nat, \ckn{n+1} - \ckn{n} = \ckn{n+p+1} - \ckn{n+p} \]
We note this period $\calP(\kerK)$.
\end{lemma}
\begin{proof}
The number of CPU resources that can be shared between instructions in a processor is finite (and relatively small, usually on the order of ten). Such resources typically include the number of \uops{} issued to each port in the current cycle, the number of instructions decoded this cycle, the total number of \uops{} issued this cycle, and so on. For each of these resources, the number of possible states is also finite (and also small). Thus, the total number of possible states of the processor at the end of a kernel iteration is bounded by the number of combinations of those resource states.
For a given kernel $\kerK$, we note $\sigma(\kerK)$ the CPU state reached after executing $\kerK$, in steady-state. Given a kernel $\kerK$, the set $\left\{\sigma(\kerK^n), n\in \nat\right\}$ is a subset of the total set of possible states of the processor, and is thus finite ---~and, in all realistic cases, much smaller than the full set, given that a kernel only uses a portion of those resources.
\medskip{}
We further note that, for all $n \in \nat$, $\sigma(\kerK^{n+1})$ is a function only of the processor considered, of $\kerK$ and of $\sigma(\kerK^n)$: indeed, a steady-state for $\kerK^{n}$ is also a steady-state for $\kerK^{n+1}$ and, knowing $\sigma(\kerK^n)$, the execution can be continued for the following $\kerK$, reaching $\sigma(\kerK^{n+1})$.
\medskip{}
Thus, by the pigeon-hole principle, there exist $m < m' \in \nat$ such that $\sigma(\kerK^m) = \sigma(\kerK^{m'})$. As each state only depends on the previous one, the sequence $(\sigma(\kerK^n))_n$ is then periodic from rank $m$ onwards; since we only consider executions already in steady-state, every $\sigma(\kerK^n)$ belongs to this periodic regime, and the sequence is periodic from the start. Noting $p \in \nat^*$ its period, we have in particular $\sigma(\kerK) = \sigma(\kerK^{p+1})$. As the number of cycles needed to execute $\kerK$ only depends on the state of the processor at the start of this execution, we thus have
\[ \forall n \in \nat, \ckn{n+1} - \ckn{n} = \ckn{n+p+1} - \ckn{n+p} \]
\end{proof}
\begin{definition}[Reciprocal throughput of a kernel]\label{def:cyc_kerK}
The \emph{reciprocal throughput} of a kernel $\kerK$, noted $\cyc{\kerK}$ and measured in \emph{cycles per iteration}, is also called the steady-state execution time of the kernel. We note $p \in \nat^*$ the period of $\ckn{n+1} - \ckn{n}$ (given by the above lemma), and define
\[ \cyc{\kerK} = \dfrac{\ckn{p}}{p} \]
\end{definition}
We define this as the average over a whole period because subsequent kernel iterations may ``share'' a cycle.
\begin{example}
Let $\kerK$ be a kernel of three instructions, and assume that a given processor can only issue two instructions per cycle, but has no other bottleneck for $\kerK$. Then, $C(\kerK) = 2$, as three instructions cannot be issued in a single cycle; yet $\ckn{2} = 3$, as six instructions can be issued in three cycles. In this case, the period $p$ is clearly $2$. Thus, in this case, $\cyc{\kerK} = \sfrac{\ckn{2}}{2} = 1.5$.
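More generally, under the idealized two-instructions-per-cycle model of this example, the $3n$ instructions of $\kerK^n$ are issued in
\[ \ckn{n} = \left\lceil \dfrac{3n}{2} \right\rceil \text{ cycles,} \qquad \text{so that} \qquad \dfrac{\ckn{n}}{n} \limarrow{n}{\infty} \dfrac{3}{2} = \cyc{\kerK}, \]
illustrating the convergence result established below (\autoref{lem:cyc_k_conv}).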
\end{example}
\begin{remark}
As $C(\kerK)$ depends on the microarchitecture of the processor considered, so does the reciprocal throughput $\cyc{\kerK}$ of a kernel $\kerK$.
\end{remark}
\medskip
\begin{lemma}
Let $\kerK$ be a kernel and $p = \calP(\kerK)$. For all $n \in \nat^*$ written as $n = kp + r$, with $k \in \nat$, $r \in \nat$, $1 \leq r \leq p$,
\[ \ckn{n} = k \ckn{p} + \ckn{r} \]
\end{lemma}
\begin{proof}
From the previous lemma instantiated with $n = 0$, and recalling that $\ckn{0} = 0$, we have
\begin{align*}
\ckn{1} - \ckn{0} &= \ckn{p+1} - \ckn{p} \\
\iff{} \ckn{p} &= \ckn{p+1} - \ckn{1}
\end{align*}
and thus, by induction, $\forall m \in \nat, \ckn{m+p} - \ckn{m} = \ckn{p}$.
\medskip{}
Thus, if $k = 0$, the property is trivial. If $k = 1$, it is a direct application of the above:
\[ \ckn{p+r} = \ckn{p} + \ckn{r} \]
The cases $k > 1$ follow by induction on $k$.
\end{proof}
\begin{lemma}\label{lem:cyc_k_conv}
Given a kernel $\kerK$,
\[ \dfrac{C(\kerK^n)}{n} \limarrow{n}{\infty} \cyc{\kerK} \]
Furthermore, this convergence is linear:
\[ \abs{\dfrac{\ckn{n}}{n} - \cyc{\kerK}} = \bigO{\dfrac{1}{n}} \]
\end{lemma}
\begin{proof}
Let $n \in \nat^*$. We note $p = \calP(\kerK) \in \nat^*$ the period given by the periodicity lemma. Let $k \in \nat$ and $r \in \nat^*$ such that $n = kp+r$, $0 < r \leq p$.
\begin{align*}
\ckn{n} &= k \cdot \ckn{p} + \ckn{r} & \textit{(by the previous lemma)} \\
&= kp \dfrac{\ckn{p}}{p} + \ckn{r} \\
&= kp \cyc{\kerK} + \ckn{r} \\
\implies \abs{\ckn{n} - n \cyc{\kerK}} &= \abs{kp \cyc{\kerK} + \ckn{r} - (kp+r) \cyc{\kerK}}\\
&= \abs{\ckn{r} - r \cyc{\kerK}} \\
&\leq \ckn{r} + r \cyc{\kerK} & \textit{(all terms are positive)} \\
&\leq \left(\max_{m \leq p}\ckn{m}\right) + p \cyc{\kerK}
\end{align*}
This last right-hand side, which we note $M$, is independent of $n$. Dividing by $n$, we obtain
\[ \abs{\dfrac{\ckn{n}}{n} - \cyc{\kerK}} \leq \dfrac{M}{n} \]
from which both results follow.
\end{proof}
\medskip
Throughout this manuscript, we mostly use reciprocal throughput as a metric, as we find it more relevant from an optimization point of view ---~an opinion we detail in \autoref{chap:CesASMe}. However, the \emph{throughput} of a kernel is the metric most widely used in the literature in its stead.
\medskip
\begin{definition}[Throughput of a kernel]
The \emph{throughput} of a kernel $\kerK$, measured in \emph{instructions per cycle}, or IPC, is defined as the number of instructions in $\kerK$ divided by the steady-state execution time of $\kerK$:
\[ \operatorname{IPC}(\kerK) = \dfrac{\card{\kerK}}{\cyc{\kerK}} \]
\end{definition}
In the literature, as in analyzers' reports, the throughput of a kernel is often referred to simply as its \emph{IPC}, after its unit.
\newpage
\begin{notation}[Experimental measure of $\cyc{\kerK}$]
We note $\cycmes{\kerK}{n}$ the experimental measure of $\cyc{\kerK}$, obtained by:
\begin{itemize}
\item sampling the hardware counter of the total number of instructions retired, as well as the counter of the total number of cycles elapsed,
\item executing $\kerK^n$,
\item sampling again the same counters, and noting respectively $\Delta_n\text{ret}$ and $\Delta_{n}C$ their differences,
\item noting $\cycmes{\kerK}{n} = \dfrac{\Delta_{n}C\cdot \card{\kerK}}{\Delta_n\text{ret}}$, where $\card{\kerK}$ is the number of instructions in $\kerK$.
\end{itemize}
\end{notation}
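In practice, such a measurement can be implemented on Linux through the \lstc{perf_event_open} system call. The sketch below only illustrates this protocol; it assumes a hypothetical \lstc{run_kernel_n_times} function wrapping the execution of $\kerK^n$, and keeps error handling minimal.
\begin{lstlisting}[language=C]
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* Hypothetical wrapper: runs the kernel n times back-to-back, i.e. K^n. */
void run_kernel_n_times(uint64_t n);

/* Opens one hardware counter (cycles or instructions retired) for this thread. */
static int open_counter(uint64_t config) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = config;          /* PERF_COUNT_HW_CPU_CYCLES or ..._INSTRUCTIONS */
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

/* Measured reciprocal throughput: Delta_n(C) * |K| / Delta_n(ret). */
double measure_cyc(uint64_t n, uint64_t kernel_size) {
    int cyc = open_counter(PERF_COUNT_HW_CPU_CYCLES);
    int ret = open_counter(PERF_COUNT_HW_INSTRUCTIONS);

    ioctl(cyc, PERF_EVENT_IOC_RESET, 0);   ioctl(ret, PERF_EVENT_IOC_RESET, 0);
    ioctl(cyc, PERF_EVENT_IOC_ENABLE, 0);  ioctl(ret, PERF_EVENT_IOC_ENABLE, 0);

    run_kernel_n_times(n);                 /* execute K^n */

    ioctl(cyc, PERF_EVENT_IOC_DISABLE, 0); ioctl(ret, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t d_cycles = 0, d_retired = 0;
    read(cyc, &d_cycles, sizeof(d_cycles));    /* Delta_n(C)   */
    read(ret, &d_retired, sizeof(d_retired));  /* Delta_n(ret) */
    close(cyc); close(ret);

    return (double)d_cycles * (double)kernel_size / (double)d_retired;
}
\end{lstlisting}
The sampling itself costs a small, bounded number of instructions and cycles; as the following lemma shows, this constant overhead vanishes for large $n$.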
\begin{lemma}
For any kernel $\kerK$, $\cycmes{\kerK}{n} \limarrow{n}{\infty} \cyc{\kerK}$.
\end{lemma}
\begin{proof}
For an integer number of kernel iterations $n$, $\sfrac{\Delta_n\text{ret}}{\card{\kerK}} = n$. While measurement errors may make $\Delta_{n}\text{ret}$ fluctuate slightly, this fluctuation remains below a constant threshold $E_\text{ret}$:
\[ \abs{\dfrac{\Delta_n\text{ret}}{\card{\kerK}} - n} \leq E_\text{ret} \]
In the same way, and due to the pipelining effects noted below the definition of $C(\kerK)$,
\[ \abs{\Delta_{n}C - C(\kerK^n)} \leq E_C \]
with $E_C$ a constant. As these error terms are bounded by constants, while the measured quantities grow linearly with $n$, we have
\[ \abs{\cycmes{\kerK}{n} - \dfrac{C(\kerK^n)}{n}} = \abs{\dfrac{\Delta_n C}{\sfrac{\Delta_n\text{ret}}{\card{\kerK}}} - \dfrac{C(\kerK^n)}{n}} \limarrow{n}{\infty} 0 \]
and, combining this with \autoref{lem:cyc_k_conv}, we thus obtain
\[ \cycmes{\kerK}{n} \limarrow{n}{\infty} \cyc{\kerK} \]
\end{proof}
Given this property, throughout this manuscript, we use $\cyc{\kerK}$ to refer to $\cycmes{\kerK}{n}$ for large values of $n$ whenever it is clear from context that this value is a measurement.
\subsubsection{Basic block of an assembly-level program}
Code analyzers are meant to analyze sections of straight-line code, that is, portions of code which do not contain control flow. As such, it is convenient to split the program into \emph{basic blocks}, that is, portions of straight-line code linked to other basic blocks to reflect control flow. We define this notion here formally, to use it soundly in the following chapters of this manuscript.
\begin{notation}
For the purposes of this section,
\begin{itemize}
\item we formalize a segment of assembly code as a sequence of instructions;
\item we identify an instruction with its address.
\end{itemize}
\smallskip{}
An instruction is said to be a \emph{flow-altering instruction} if it may alter the normal control flow of the program. This is typically true of jumps (conditional or unconditional), function calls, function returns, \ldots
\smallskip{}
An address is said to be a \emph{jump site} if any flow-altering instruction in the considered sequence may transfer control to this address (and this address does not merely follow that instruction in the natural flow of the program, \eg{} the fall-through of a conditional jump).
\end{notation}
\begin{definition}[Basic block decomposition]
\todo{}
\end{definition}
\begin{remark}
This definition gives a direct algorithm to split a segment of assembly code into basic blocks, as long as we have access to a semantics of the considered assembly that indicates whether an instruction is flow-altering and, if so, what its possible jump sites are.
\end{remark}
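Following this remark, the sketch below implements one textbook-style decomposition consistent with the notions above; the instruction representation (\lstc{struct insn}) and its fields are hypothetical placeholders for whatever decoder is actually used.
\begin{lstlisting}[language=C]
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical decoded instruction: its address, whether it is flow-altering,
 * and (when statically known) the address of its jump site. */
struct insn {
    uint64_t addr;
    bool     flow_altering;
    bool     has_jump_site;
    uint64_t jump_site;
};

/* An instruction starts a new basic block (is a "leader") if it is the first
 * of the segment, if it directly follows a flow-altering instruction, or if
 * it is a jump site of some flow-altering instruction of the segment. */
void find_leaders(const struct insn *code, size_t n, bool *leader) {
    for (size_t i = 0; i < n; i++)
        leader[i] = (i == 0) || code[i - 1].flow_altering;
    for (size_t i = 0; i < n; i++) {
        if (!code[i].flow_altering || !code[i].has_jump_site)
            continue;
        for (size_t j = 0; j < n; j++)   /* mark the jump site, if it lies in the segment */
            if (code[j].addr == code[i].jump_site)
                leader[j] = true;
    }
}

/* Each basic block then spans from one leader (included) to the next
 * (excluded): a straight-line sequence, entered only at its first
 * instruction and left only at its last. */
void print_blocks(const bool *leader, size_t n) {
    size_t start = 0;
    for (size_t i = 1; i <= n; i++) {
        if (i == n || leader[i]) {
            printf("basic block: instructions %zu to %zu\n", start, i - 1);
            start = i;
        }
    }
}
\end{lstlisting}
Each block produced this way is straight-line code, which is exactly the form of input expected by the code analyzers described in this chapter.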