\section{Kernel optimization and code analyzers}

Optimizing a program, in most contexts, mainly means optimizing it from an algorithmic point of view ---~using efficient data structures, running some computations in parallel on multiple cores, etc. As pointed out in our introduction, though, optimizations close to the machine's microarchitecture can yield large efficiency benefits, sometimes up to two orders of magnitude~\cite{dgemm_finetune}. These optimizations, however, are difficult to carry out, for multiple reasons: they depend on the specific machine on which the code is run; they require deep expert knowledge; and they are most often manual, requiring expert time ---~and thus making them expensive.

Such optimizations are nevertheless routinely used in some domains. Scientific computations ---~ocean simulation, weather forecasting, \ldots{}~--- often rely on the same operations, implemented by low-level libraries optimized in this way, such as OpenBLAS~\cite{openblas_webpage, openblas_2013} or Intel's MKL~\cite{intel_mkl}, which provide low-level mathematical operations such as linear algebra routines. Machine learning applications, on the other hand, are typically trained for extensive periods of time, on many cores and accelerators, on well-defined hardware, with small portions of code being executed many times on different data; as such, they are very well suited to such specific, low-level optimizations.

\medskip{}

When optimizing those short fragments of code whose efficiency is critical, or \emph{computation kernels}, insights into what limits the code's performance, or \emph{performance bottlenecks}, are precious to the expert. These insights can be gained by reading the processor's hardware counters, described above in \autoref{sssec:hw_counters}, which are typically accurate but of limited versatility. Specialized profilers, such as Intel's VTune~\cite{intel_vtune}, integrate these counters with profiling to derive further performance metrics at runtime.

\subsection{Code analyzers}

Another approach is to rely on \emph{code analyzers}, pieces of software that analyze a code fragment ---~typically at the assembly or binary level~--- and provide insights into its performance on a given hardware. Code analyzers typically work statically, that is, without executing the code.

\paragraph{Common hypotheses.} Code analyzers operate under a set of common hypotheses, derived from their typical intended usage. The kernel analyzed is expected to be the body of a loop, or of a nest of loops, iterated enough times to be approximated by an infinite loop. The kernel is further analyzed under the assumption that it is in \emph{steady-state}; startup and border effects occurring in extremal cases are thus ignored. As the kernels analyzed are those worth optimizing manually, it is reasonable to assume that they will be executed many times, and to focus on their steady-state.

The kernel is also assumed to be \emph{L1-resident}, that is, to work only on data that resides in the L1 cache. This assumption is reasonable in two ways. First, if data must be fetched from farther caches, or even from main memory, these fetch operations will be orders of magnitude slower than the computation being analyzed, making it useless to optimize this kernel for CPU efficiency ---~the expert should, in this case, focus instead on data locality, prefetching, etc. Second, code analyzers typically focus only on the CPU itself and ignore memory effects; this hypothesis formalizes that focus. Code analyzers' metrics are thus to be regarded \textit{assuming the CPU is the bottleneck}.

Code analyzers also disregard control flow, and thus assume the code to be \emph{straight-line code}: the kernel analyzed is considered as a sequence of instructions without influence on the control flow, executed in order, and jumping unconditionally back to the first instruction after the last ---~or, more accurately, the last jump is always assumed taken, and any control-flow instruction in the middle is assumed not taken, while its computational cost is still accounted for.
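As an illustration, consider the following hypothetical kernel body, an AVX implementation of $y[i] \gets a \cdot x[i] + y[i]$ over single-precision arrays. It assumes that the multiplier $a$ has been broadcast into \texttt{\%ymm0}, that the arrays pointed to by \texttt{\%rsi} and \texttt{\%rdi} are L1-resident, and that \texttt{\%rdx} holds the number of elements. A code analyzer treats these six instructions as straight-line code repeated indefinitely, the final backward jump being assumed always taken:

\begin{verbatim}
.loop:                                        # hypothetical SAXPY-style kernel
    vmovups     (%rsi,%rcx,4), %ymm1          # load 8 floats from x
    vfmadd213ps (%rdi,%rcx,4), %ymm0, %ymm1   # ymm1 <- a*x[i..] + y[i..]
    vmovups     %ymm1, (%rdi,%rcx,4)          # store the result back into y
    addq        $8, %rcx                      # advance the index by 8 elements
    cmpq        %rdx, %rcx                    # all elements processed?
    jb          .loop                         # backward jump, assumed taken
\end{verbatim}

Such a snippet is typical of what the analyzers discussed in this section take as input, either as assembly text or as the corresponding instruction bytes in a compiled binary.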
\paragraph{Metrics produced.} The insights provided as output vary with the code analyzer used. All of them are able to predict the throughput or reciprocal throughput ---~both defined below~--- of the kernel studied, that is, how many cycles one iteration of the loop takes on average, in steady-state. Although throughput can be measured at runtime with hardware counters, a static estimation ---~if reliable~--- is already an improvement, as a static analyzer is typically faster than running the actual program under profiling.

Each code analyzer relies on a model, or a collection of models, of the hardware on which it provides analyses. Depending on what is, or is not, modelled by a specific code analyzer, it may further extract any available and relevant metric from its model: whether the frontend is saturated, which computation units from the backend are stressed and by which precise instructions, when the CPU stalls and why, etc. Code analyzers may further point towards the resources that are limiting the kernel's performance, or \emph{bottlenecks}.

\paragraph{Static vs.\ dynamic analyzers.} Tools analyzing code, and code analyzers among them, generally perform either \emph{static} or \emph{dynamic} analyses. Static analyzers work on the program itself, be it source code, assembly or any other representation, without running it, while dynamic analyzers run the analyzed program, keeping it under scrutiny through instrumentation, monitoring or any other relevant technique. Some analyzers mix both strategies to further refine their analyses. As a general rule of thumb, dynamic analyzers are typically more accurate, being able to study the actual execution trace (or traces) of the program, but they are significantly slower due to the large overhead of instrumentation, and they focus more on the general, average case than on edge cases. As most code analyzers are static, this manuscript largely focuses on static analysis. The only dynamic code analyzer we are aware of is \gus{}, described more thoroughly later in \autoref{sec:sota}; it trades a heavily increased runtime for better accuracy, especially regarding data dependencies that may not be easily obtained otherwise.

\paragraph{Input formats used.} The analyzers studied in this manuscript all take as input either assembly code or assembled binaries. In the case of assembly code, as for instance with \llvmmca{}, analyzers accept either a short assembly snippet, treated as straight-line code and analyzed as such, or longer pieces of assembly, part or parts of which are marked for analysis by surrounding assembly comments. In the case of assembled binaries, as all analyzers were run on Linux, executables or object files are ELF files. Some analyzers work on sections of the file defined by user-provided offsets into the binary, while others require the presence of \iaca{} markers around the code portion or portions to be analyzed. Those markers, introduced by \iaca{}, consist of the following assembly snippets.
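In the form distributed with \iaca{}'s C header, which is also, to our knowledge, the form recognized by the other analyzers supporting these markers, the start and end markers read as follows in AT\&T syntax:

\begin{verbatim}
    movl  $111, %ebx          # IACA start marker, placed before the kernel
    .byte 0x64, 0x67, 0x90

    # ... kernel to be analyzed ...

    movl  $222, %ebx          # IACA end marker, placed after the kernel
    .byte 0x64, 0x67, 0x90
\end{verbatim}

The byte sequence \texttt{0x64, 0x67, 0x90} decodes to an innocuous \texttt{fs addr32 nop}; combined with the distinctive constants 111 and 222 loaded into \texttt{ebx}, this pattern is unlikely to occur in regular code and can thus be reliably located in an assembled binary.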
\subsection{Examples with \llvmmca}

\subsection{Definitions}

\subsubsection{Throughput and reverse-throughput}

\subsubsection{Basic block of an assembly-level program}