\section{Kernel optimization and code analyzers}

Optimizing a program, in most contexts, mainly means optimizing it from an algorithmic point of view ---~using efficient data structures, running some computations in parallel on multiple cores, etc. As pointed out in our introduction, though, optimizations close to the machine's microarchitecture can yield large efficiency benefits, sometimes up to two orders of magnitude~\cite{dgemm_finetune}. These optimizations, however, are difficult to carry out, for multiple reasons: they depend on the specific machine on which the code is run; they require deep expert knowledge; and they are most often manual, requiring expert time ---~and thus making them expensive.

Such optimizations are nevertheless routinely used in some domains. Scientific computations ---~ocean simulation, weather forecasting, \ldots{}~--- often rely on the same operations, implemented by low-level libraries optimized in this way, such as OpenBLAS~\cite{openblas_webpage, openblas_2013} or Intel's MKL~\cite{intel_mkl}, which provide low-level mathematical operations such as linear algebra routines. Machine learning applications, on the other hand, are typically trained for extensive periods of time, on many cores and accelerators, on well-defined hardware, with small portions of code being executed many times on different data; as such, they are very well suited to such specific, low-level optimizations.

\medskip{}

When optimizing those short fragments of code whose efficiency is critical, or \emph{computation kernels}, insights into what limits the code's performance, or \emph{performance bottlenecks}, are precious to the expert. These insights can be gained by reading the processor's hardware counters, described above in \autoref{sssec:hw_counters}, which are typically accurate but of limited versatility. Specialized profilers, such as Intel's VTune~\cite{intel_vtune}, integrate these counters with profiling to derive further performance metrics at runtime.

\subsection{Code analyzers}

Another approach is to rely on \emph{code analyzers}, pieces of software that analyze a code fragment ---~typically at the assembly or binary level~--- and provide insights into its performance on a given hardware. Code analyzers typically work statically, that is, without executing the code.

\paragraph{Common hypotheses.} Code analyzers operate under a set of common hypotheses, derived from their typical intended usage. The kernel analyzed is expected to be the body of a loop, or of a nest of loops, iterated enough times to be approximated by an infinite loop. The kernel is further analyzed under the assumption that it is in \emph{steady-state}; startup and border effects occurring in extremal cases are thus ignored. As the kernels analyzed are those worth optimizing manually, it is reasonable to assume that they will be executed many times, and to focus on their steady-state.

The kernel is also assumed to be \emph{L1-resident}, that is, to work only on data that resides in the L1 cache. This assumption is reasonable in two ways. First, if data must be fetched from farther caches, or even from main memory, these fetch operations will be orders of magnitude slower than the computation being analyzed, making it useless to optimize this kernel for CPU efficiency ---~the expert should, in this case, focus instead on data locality, prefetching, etc. Second, code analyzers typically focus only on the CPU itself and ignore memory effects; this hypothesis formalizes that focus. Code analyzers' metrics are thus to be regarded \textit{assuming the CPU is the bottleneck}.

Code analyzers also disregard control flow, and thus assume the code to be \emph{straight-line code}: the kernel analyzed is considered as a sequence of instructions without influence on the control flow, executed in order, and jumping unconditionally back to the first instruction after the last ---~or, more accurately, the last jump is always assumed taken, and any control-flow instruction in the middle is assumed not taken, while its computational cost is still accounted for.
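As an illustration, consider the following hypothetical kernel body, an AVX implementation of $y[i] \gets a \cdot x[i] + y[i]$ over single-precision arrays. It assumes that the multiplier $a$ has been broadcast into \texttt{\%ymm0}, that the arrays pointed to by \texttt{\%rsi} and \texttt{\%rdi} are L1-resident, and that \texttt{\%rdx} holds the number of elements. A code analyzer treats these six instructions as straight-line code repeated indefinitely, the final backward jump being assumed always taken:

\begin{verbatim}
.loop:                                        # hypothetical SAXPY-style kernel
    vmovups     (%rsi,%rcx,4), %ymm1          # load 8 floats from x
    vfmadd213ps (%rdi,%rcx,4), %ymm0, %ymm1   # ymm1 <- a*x[i..] + y[i..]
    vmovups     %ymm1, (%rdi,%rcx,4)          # store the result back into y
    addq        $8, %rcx                      # advance the index by 8 elements
    cmpq        %rdx, %rcx                    # all elements processed?
    jb          .loop                         # backward jump, assumed taken
\end{verbatim}

Such a snippet is typical of what the analyzers discussed in this section take as input, either as assembly text or as the corresponding instruction bytes in a compiled binary.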
\paragraph{Metrics produced.} The insights provided as output vary with the code analyzer used. All of them are able to predict the throughput or reciprocal throughput ---~both defined below~--- of the kernel studied, that is, how many cycles one iteration of the loop takes on average, in steady-state. Although throughput can be measured at runtime with hardware counters, a static estimation ---~if reliable~--- is already an improvement, as a static analyzer is typically faster than running the actual program under profiling.

Each code analyzer relies on a model, or a collection of models, of the hardware on which it provides analyses. Depending on what is, or is not, modelled by a specific code analyzer, it may further extract any available and relevant metric from its model: whether the frontend is saturated, which computation units from the backend are stressed and by which precise instructions, when the CPU stalls and why, etc. Code analyzers may further point towards the resources that are limiting the kernel's performance, or \emph{bottlenecks}.

\paragraph{Static vs.\ dynamic analyzers.} Tools analyzing code, and code analyzers among them, generally perform either \emph{static} or \emph{dynamic} analyses. Static analyzers work on the program itself, be it source code, assembly or any other representation, without running it, while dynamic analyzers run the analyzed program, keeping it under scrutiny through instrumentation, monitoring or any other relevant technique. Some analyzers mix both strategies to further refine their analyses. As a general rule of thumb, dynamic analyzers are typically more accurate, being able to study the actual execution trace (or traces) of the program, but they are significantly slower due to the large overhead of instrumentation, and they focus more on the general, average case than on edge cases. As most code analyzers are static, this manuscript largely focuses on static analysis. The only dynamic code analyzer we are aware of is \gus{}, described more thoroughly later in \autoref{sec:sota}; it trades a heavily increased runtime for better accuracy, especially regarding data dependencies that may not be easily obtained otherwise.

\paragraph{Input formats used.} The analyzers studied in this manuscript all take as input either assembly code or assembled binaries. In the case of assembly code, as for instance with \llvmmca{}, analyzers accept either a short assembly snippet, treated as straight-line code and analyzed as such, or longer pieces of assembly, part or parts of which are marked for analysis by surrounding assembly comments. In the case of assembled binaries, as all analyzers were run on Linux, executables or object files are ELF files. Some analyzers work on sections of the file defined by user-provided offsets into the binary, while others require the presence of \iaca{} markers around the code portion or portions to be analyzed. Those markers, introduced by \iaca{}, consist of the following assembly snippets.
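In the form distributed with \iaca{}'s C header, which is also, to our knowledge, the form recognized by the other analyzers supporting these markers, the start and end markers read as follows in AT\&T syntax:

\begin{verbatim}
    movl  $111, %ebx          # IACA start marker, placed before the kernel
    .byte 0x64, 0x67, 0x90

    # ... kernel to be analyzed ...

    movl  $222, %ebx          # IACA end marker, placed after the kernel
    .byte 0x64, 0x67, 0x90
\end{verbatim}

The byte sequence \texttt{0x64, 0x67, 0x90} decodes to an innocuous \texttt{fs addr32 nop}; combined with the distinctive constants 111 and 222 loaded into \texttt{ebx}, this pattern is unlikely to occur in regular code and can thus be reliably located in an assembled binary.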
\subsection{Examples with \llvmmca}

\subsection{Definitions}

\subsubsection{Throughput and reverse-throughput}

\subsubsection{Basic block of an assembly-level program}