\chapter*{Introduction}\label{chap:intro}
\addcontentsline{toc}{chapter}{Introduction}

Developing new features and fixing bugs are often regarded as the major parts of the development cycle of a program. However, performance optimization might be just as crucial for compute-intensive software. On small-scale applications, it improves usability by reducing, or even hiding, the waiting time the user must endure between operations, or by allowing heavier workloads to be processed without requiring larger resources, including in constrained embedded hardware environments. On large-scale applications, which may run for extended periods of time or on whole clusters, optimization is a cost-effective path, as it allows the same workload to run on smaller clusters, for shorter periods of time.

The most significant optimization gains come from ``high-level'' algorithmic changes, such as computing on multiple cores instead of sequentially, caching already-computed results, reimplementing a function to run asymptotically in $\bigO{n\cdot \log(n)}$ instead of $\bigO{n^2}$ or avoiding copies of large data structures. However, once a program is already well-optimized in these respects, the impact of low-level considerations, stemming from the hardware implementation of the machine itself, can no longer be neglected. A common example of such effects is the traversal of a large matrix in either row-major or column-major order:

\vspace{1em}
\definecolor{rowmajor_row}{HTML}{1b5898}
\definecolor{rowmajor_col}{HTML}{9c6210}
\begin{minipage}[c]{0.48\linewidth}
\begin{algorithmic}
\State{sum $\gets 0$}
\For{{\color{rowmajor_row}row} $<$ MAX\_ROW}
\For{{\color{rowmajor_col}col} $<$ MAX\_COLUMN}
\State{sum $\gets$ sum $+~\text{matrix}[\text{\color{rowmajor_row}row}][\text{\color{rowmajor_col}col}]$}
\EndFor
\EndFor
\end{algorithmic}
\end{minipage}\hfill
\begin{minipage}[c]{0.48\linewidth}
\begin{algorithmic}
\State{sum $\gets 0$}
\For{{\color{rowmajor_col}col} $<$ MAX\_COLUMN}
\For{{\color{rowmajor_row}row} $<$ MAX\_ROW}
\State{sum $\gets$ sum $+~\text{matrix}[\text{\color{rowmajor_row}row}][\text{\color{rowmajor_col}col}]$}
\EndFor
\EndFor
\end{algorithmic}
\end{minipage}
\vspace{1em}

While both programs perform the exact same computation, the left one iterates over rows first, in \textit{row-major} order, while the right one iterates over columns first, in \textit{column-major} order. On large matrices, the latter causes frequent cache misses, and was measured to run up to about six times slower than the former~\cite{rowmajor_repo}.

This, however, is still an optimization that holds for the vast majority of CPUs. In many cases, transformations targeting a specific microarchitecture can be very beneficial. For instance, Uday Bondhugula found that manually tuning a general matrix multiplication, through many techniques and tools, could multiply its throughput by roughly 13.5 compared to \texttt{gcc~-O3}, or even make it 130 times faster than \texttt{clang~-O3}~\cite{dgemm_finetune}. This kind of optimization, however, requires manual effort and deep expert knowledge, both of optimization techniques and of the specific architecture targeted. These techniques are only worth applying to the most-executed parts of a program ---~usually called its \emph{hottest} parts~---: loop bodies that are iterated enough times to be assumed infinite. Such loop bodies are called \emph{computation kernels}; they are the focus of this whole manuscript.
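To make the row- versus column-major effect above concrete, here is a minimal, self-contained C sketch of both traversals (the matrix size and timing harness are ours, chosen for illustration; the actual slowdown varies with the hardware). Compiling it with \texttt{gcc -O2} and running it under \texttt{perf stat -e cache-misses} should make the excess cache misses of the column-major loop directly visible.

\begin{verbatim}
#include <stdio.h>
#include <time.h>

#define N 4096  /* large enough that a column traversal exceeds the caches */

static double matrix[N][N];

int main(void) {
    double sum;   /* printing the sums keeps the loops from being elided */
    clock_t t;

    /* Row-major: consecutive accesses are contiguous in memory. */
    sum = 0.0; t = clock();
    for (size_t row = 0; row < N; row++)
        for (size_t col = 0; col < N; col++)
            sum += matrix[row][col];
    printf("row-major:    %g in %.2fs\n", sum,
           (double)(clock() - t) / CLOCKS_PER_SEC);

    /* Column-major: each access jumps N * sizeof(double) bytes,
     * defeating the caches and the hardware prefetchers. */
    sum = 0.0; t = clock();
    for (size_t col = 0; col < N; col++)
        for (size_t row = 0; row < N; row++)
            sum += matrix[row][col];
    printf("column-major: %g in %.2fs\n", sum,
           (double)(clock() - t) / CLOCKS_PER_SEC);
    return 0;
}
\end{verbatim}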
\medskip{}

Developers are used to \emph{functional debugging}, the practice of tracking down the root cause of an unexpected, incorrect functional behaviour. Akin to it is \emph{performance debugging}, the practice of tracking down the root cause of performance below expectations. Just as functional debugging can be carried out in a variety of ways, from guessing and inserting print instructions to sophisticated tools such as \gdb{}, performance debugging can be carried out with different tools. Crude timing measurements and profiling can point to a general part of the program or hint at an issue; reading \emph{hardware counters} ---~metrics reported by the CPU~--- can lead to a better understanding, and may confirm or invalidate a hypothesis. Still other tools, \emph{code analyzers}, analyze the assembly code and, in the light of a built-in hardware model, strive to provide a performance analysis.

An exact modelling of the processor would require a cycle-accurate simulator, reproducing the precise behaviour of the silicon and allowing one to observe any desired metric. Such a simulator, however, would be prohibitively slow, and is not available for most architectures anyway, as processors are not usually open hardware and manufacturers regard their implementation as an industrial secret. Code analyzers thus resort to approximate, higher-level models of various kinds. Tools based on such models, as opposed to measurements or hardware-counter sampling, may not always be precise and faithful. They can, however, inspect their inner model state at will, and derive more advanced metrics or hypotheses, for instance by predicting which resource might be overloaded and slow down the whole computation.

\vspace{2em}

In this thesis, we explore the three major aspects that contribute to a code analyzer's accuracy: a \emph{backend model}, a \emph{frontend model} and a \emph{dependencies model}. We propose contributions to strengthen them, as well as to automate the underlying models' synthesis. We focus on \emph{static} code analyzers, which derive metrics, including runtime predictions, from assembly code or an assembled binary without executing it.

The \hyperref[chap:foundations]{first chapter} introduces the foundations of this manuscript, describing the microarchitectural notions on which our analyses are based, and exploring the current state of the art. \autoref{chap:palmed} introduces \palmed{}, a benchmarks-based tool that automatically synthesizes a model of a CPU's backend. Although the theoretical core of \palmed{} is not my own work, I made major contributions to other aspects of the tool. The chapter also presents the foundations and methodologies \palmed{} shares with the following parts. In \autoref{chap:frontend}, we explore the frontend aspects of static code analyzers. This chapter focuses on a manual study of the Cortex A72 processor, and proposes a static model of its frontend. We finally reflect on generalizing our manual approach into an automated frontend modelling tool, akin to \palmed{}. \autoref{chap:CesASMe} presents an extensive study of the strengths and shortcomings of state-of-the-art code analyzers. To this end, we introduce a fully-tooled approach in two parts: first, a benchmark-generation procedure, yielding thousands of benchmarks relevant in the context of our approach; then, a benchmarking harness evaluating code analyzers on these benchmarks. We find that most state-of-the-art code analyzers struggle to correctly account for some types of data dependencies.
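As a minimal illustration of such a dependency (a hypothetical kernel of our own, not drawn from our benchmarks), consider the following C function: once compiled, each iteration's store is loaded back by the very next iteration, so the dependency is carried through memory rather than through registers.

\begin{verbatim}
#include <stddef.h>

/* Hypothetical kernel, for illustration only. The store to a[i + 1]
 * is read back by the next iteration's load of a[i]: a loop-carried
 * dependency flowing through memory, invisible to an analyzer that
 * assumes memory accesses to be independent. */
void scale_chain(double *a, size_t n, double k)
{
    for (size_t i = 0; i + 1 < n; i++)
        a[i + 1] = a[i] * k;
}
\end{verbatim}

An analyzer that overlooks this aliasing predicts fully parallel iterations, and thus overestimates the kernel's throughput.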
Further building on our findings, \autoref{chap:staticdeps} introduces \staticdeps{}, an accurate heuristic-based tool to statically extract data dependencies from an assembly computation kernel. We extend \uica{}, a state-of-the-art code analyzer, with \staticdeps{} predictions, and evaluate the resulting gain in accuracy.

\bigskip{}

Throughout this manuscript, we explore notions that cut across the hardware blocks the chapters lay out.

\medskip{}

Most of our approaches work towards \emph{automated, microarchitecture-independent} tooling. While fine-grained, accurate code analysis is directly concerned with the underlying hardware and its specific implementation, we strive to write tooling that depends as little as possible on vendor-specific interfaces. In practice, this rules out most uses of hardware counters, which vary greatly between manufacturers, or even between specific chips. As some CPUs expose only very bare hardware counters, we see this commitment as an opportunity to develop methodologies able to model these processors. This is particularly true of \palmed, in \autoref{chap:palmed}, whose goal is to model a processor's backend resources without resorting to its vendor-specific hardware counters. Our frontend study, in \autoref{chap:frontend}, also follows this strategy by focusing on a processor whose hardware counters give little to no insight into its frontend. While this goal is less relevant to \staticdeps{}, we rely there on external libraries to abstract away the underlying architecture.

\medskip{}

Our methodologies are, whenever relevant, \emph{benchmarks- and experiments-driven}, in a bottom-up style, placing real hardware at the center. In this spirit, \palmed{} is based solely on benchmarks, discarding the manufacturer's documentation entirely. Our model of the Cortex A72 frontend is based on both measurements and documentation, yet it strives to be a case study from which future work can generalize, to automatically synthesize frontend models in a benchmarks-based fashion. One of the goals of our survey of the state of the art, in \autoref{chap:CesASMe}, is to identify through experiments the shortcomings that are most crucial to address in order to strengthen static code analyzers.

\medskip{}

Finally, given the extent of the ecological and climatic crises we are facing, as assessed among others by the IPCC~\cite{ipcc_ar6_syr}, we believe that every field and discipline should strive for a positive impact or, at the very least, reduce its negative impact as much as possible. Our very modest contribution to this end, throughout this thesis, is to commit ourselves to computations as \emph{frugal} as possible: running computation-heavy experiments as sparingly as possible; avoiding running the same experiment multiple times by caching results whenever feasible; etc. This commitment partly motivated us to implement a results database in \palmed{}, so that each benchmark is run only once. As our experiments in \autoref{chap:CesASMe} take many hours to yield a result, we at least evaluate their carbon impact. We believe it noteworthy, however, to point out that although this thesis is concerned with tools that help optimize large computation workloads, \emph{optimization does not lead to frugality}. In most cases, the Jevons paradox ---~also called the rebound effect~--- makes it more likely to lead instead to an increased absolute usage of computational resources~\cite{jevons_coal_question,understanding_jevons_paradox}.