\chapter*{Introduction}\label{chap:intro}
\addcontentsline{toc}{chapter}{Introduction}
Developing new features and fixing problems are often regarded as the major
parts of the development cycle of a program. However, performance optimization
may be just as crucial for compute-intensive software. On small-scale
applications, it improves usability by reducing, or even hiding, the waiting
time the user must endure between operations, or by allowing heavier workloads
to be processed without requiring larger resources, even in constrained
embedded hardware environments. On large-scale applications, which may run for
extended periods of time or on whole clusters, optimization is a
cost-effective path, as it allows the same workload to run on smaller
clusters, for shorter periods of time.
The most significant optimization gains come from ``high-level'' algorithmic
changes, such as computing on multiple cores instead of sequentially, caching
already-computed results, reimplementing a function to run asymptotically in
$\bigO{n\cdot \log(n)}$ instead of $\bigO{n^2}$, or avoiding copies of large
data structures. However, when a program is already well-optimized in these
respects, the impact of low-level considerations, stemming from the hardware
implementation of the machine itself, can no longer be neglected. A common
example of such impacts is iterating over a large matrix in either row-major
or column-major order:
\vspace{1em}
\definecolor{rowmajor_row}{HTML}{1b5898}
\definecolor{rowmajor_col}{HTML}{9c6210}
\begin{minipage}[c]{0.48\linewidth}
\begin{algorithmic}
\State{sum $\gets 0$}
\For{{\color{rowmajor_row}row} $<$ MAX\_ROW}
\For{{\color{rowmajor_col}col} $<$ MAX\_COLUMN}
\State{sum $\gets$ sum $+~\text{matrix}[\text{\color{rowmajor_row}row}][\text{\color{rowmajor_col}col}]$}
\EndFor
\EndFor
\end{algorithmic}
\end{minipage}\hfill
\begin{minipage}[c]{0.48\linewidth}
\begin{algorithmic}
\State{sum $\gets 0$}
\For{{\color{rowmajor_col}col} $<$ MAX\_COLUMN}
\For{{\color{rowmajor_row}row} $<$ MAX\_ROW}
\State{sum $\gets$ sum $+~\text{matrix}[\text{\color{rowmajor_row}row}][\text{\color{rowmajor_col}col}]$}
\EndFor
\EndFor
\end{algorithmic}
\end{minipage}
\vspace{1em}
While both programs perform the exact same computation, the left one iterates
on rows first, or \textit{row-major}, while the right one iterates on columns
first, or \textit{column-major}. On large matrices, the latter causes frequent
cache misses, and was measured to run up to about six times slower than the
former~\cite{rowmajor_repo}.
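In C, for instance, two-dimensional arrays are laid out in row-major order, so
the two traversals above may be written as follows ---~a minimal sketch, with
hypothetical function names and an arbitrary matrix size:
\begin{verbatim}
#define MAX_ROW    4096
#define MAX_COLUMN 4096

/* C stores this matrix row by row: matrix[r][c] and matrix[r][c+1] are
 * adjacent in memory, while matrix[r][c] and matrix[r+1][c] lie
 * MAX_COLUMN elements apart. */
static double matrix[MAX_ROW][MAX_COLUMN];

double sum_row_major(void) {    /* contiguous, cache-friendly accesses */
    double sum = 0.;
    for (int row = 0; row < MAX_ROW; row++)
        for (int col = 0; col < MAX_COLUMN; col++)
            sum += matrix[row][col];
    return sum;
}

double sum_column_major(void) { /* strided accesses, frequent misses */
    double sum = 0.;
    for (int col = 0; col < MAX_COLUMN; col++)
        for (int row = 0; row < MAX_ROW; row++)
            sum += matrix[row][col];
    return sum;
}
\end{verbatim}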
This optimization, however, is generic: it holds for the vast majority of
CPUs. In many cases, transformations targeting a specific microarchitecture
can be even more beneficial.
For instance, Uday Bondhugula found that manually tuning a general matrix
multiplication, through a variety of techniques and tools, could multiply its
throughput by roughly 13.5 compared to \texttt{gcc~-O3}, or even make it 130
times faster than \texttt{clang -O3}~\cite{dgemm_finetune}.
This kind of optimization, however, requires manual effort, as well as deep
expert knowledge of both optimization techniques and the specific architecture
targeted.
These techniques are only worth applying to the most executed parts of a
program ---~usually called the \emph{hottest} parts~---: loop bodies that are
iterated enough times to be assumed infinite. Such loop bodies are called
\emph{computation kernels}; they are the focus of this whole manuscript.
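As a minimal illustration ---~a hypothetical example, not drawn from a real
codebase~---, virtually all of the execution time of the following function is
spent in its loop body, which therefore constitutes such a kernel:
\begin{verbatim}
#include <stddef.h>

/* Dot product: for a large n, virtually all of the execution time is
 * spent in the loop body below. */
double dot(const double *a, const double *b, size_t n) {
    double sum = 0.;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
\end{verbatim}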
\medskip{}
Developers are used to \emph{functional debugging}, the practice of tracking
down the root cause of an unexpected bad functional behaviour. Akin to it is
\emph{performance debugging}, the practice of tracking down the root cause of
performance below expectations. Just as functional debugging can be carried
out in a variety of ways, from guessing and inserting print instructions to
sophisticated tools such as \gdb{}, performance debugging can be carried out
with different tools. Crude timing measurements and profiling can point to a
general part of the program or hint at an issue; reading \emph{hardware
counters} ---~metrics reported by the CPU~--- can lead to a better
understanding, and may confirm or invalidate a hypothesis. Other tools still,
\emph{code analyzers}, analyze the assembly code and, in the light of a
built-in hardware model, strive to provide a performance analysis.
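On Linux, for instance, hardware counters can be read programmatically through
the \texttt{perf\_event\_open(2)} system call. The following minimal sketch
---~error handling reduced to the bare minimum~--- counts the cache misses
incurred by a section of code:
\begin{verbatim}
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <inttypes.h>

/* Thin wrapper: glibc provides no perf_event_open() wrapper. */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags) {
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CACHE_MISSES; /* counter to read */
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    int fd = perf_event_open(&attr, 0, -1, -1, 0); /* this process */
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    /* ... code under scrutiny ... */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t misses;
    if (read(fd, &misses, sizeof(misses)) != sizeof(misses)) return 1;
    printf("cache misses: %" PRIu64 "\n", misses);
    close(fd);
    return 0;
}
\end{verbatim}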
An exact modelling of the processor would require a cycle-accurate simulator,
reproducing the precise behaviour of the silicon and allowing one to observe
any desired metric. Such a simulator, however, would be prohibitively slow,
and is not available for most architectures anyway, as processors are usually
not open hardware and manufacturers regard their implementation as an
industrial secret. Code analyzers thus resort to approximate, higher-level
models of varied kinds. Tools based on such models, as opposed to measurements
or hardware-counter sampling, may not always be precise and faithful. They
can, however, inspect their inner model state at will, and derive more
advanced metrics or hypotheses, for instance by predicting which resource
might be overloaded and slow down the whole computation.
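As a toy illustration of such a prediction ---~with made-up figures~---,
suppose that a backend model maps the instructions of a kernel to execution
ports, each able to execute one micro-operation per cycle, and that one
iteration of the kernel issues three micro-operations to port $P_0$ and one to
port $P_1$. The steady-state execution speed is then bounded by the most
loaded resource:
\[
\text{cycles per iteration} \geq \max\left(\frac{3}{1}, \frac{1}{1}\right) = 3
\]
and such a model would report $P_0$ as the likely bottleneck.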
\vspace{2em}
In this thesis, we explore the three major aspects that contribute to a code
analyzer's accuracy: a \emph{backend model}, a \emph{frontend model} and a
\emph{dependencies model}. We propose contributions to strengthen them, as
well as to automate the synthesis of the underlying models. We focus on
\emph{static} code analyzers, which derive metrics, including runtime
predictions, from an assembly code or assembled binary without executing it.
The \hyperref[chap:foundations]{first chapter} introduces the foundations
of this manuscript, describing the microarchitectural notions on which our
analyses will be based, and exploring the current state of the art.
\autoref{chap:palmed} introduces \palmed{}, a benchmarks-based tool
automatically synthesizing a model of a CPU's backend. Although the
theoretical core of \palmed{} is not my own work, I made major contributions to
other aspects of the tool. The chapter also presents the foundations and
methodologies \palmed{} shares with the following parts.
In \autoref{chap:frontend}, we explore the frontend aspects of static code
analyzers. This chapter focuses on the manual study of the Cortex A72
processor, and proposes a static model of its frontend. We finally reflect on
the generalization of our manual approach into an automated frontend modelling
tool, akin to \palmed.
Chapter~\ref{chap:CesASMe} presents an extensive study of the strengths and
shortcomings of state-of-the-art code analyzers. To this end, we introduce a
fully-tooled approach in two parts: first, a benchmarks-generation procedure,
yielding thousands of benchmarks relevant in the context of our approach;
then, a benchmarking harness evaluating code analyzers on these benchmarks. We
find that most state-of-the-art code analyzers struggle to correctly account
for some types of data dependencies.
Further building on our findings, \autoref{chap:staticdeps} introduces
\staticdeps{}, an accurate heuristic-based tool to statically extract data
dependencies from an assembly computation kernel. We extend \uica{}, a
state-of-the-art code analyzer, with \staticdeps{} predictions, and evaluate
the resulting gain in accuracy.
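As a minimal illustration of the dependencies at stake ---~a hypothetical
example~---, each iteration of the following C kernel reads a value written by
the previous one; the resulting serialization dominates the loop's execution
time, and an analyzer unaware of it will be overly optimistic:
\begin{verbatim}
#include <stddef.h>

/* Loop-carried dependency through memory: iteration i reads s[i-1],
 * which iteration i-1 has just written, so the multiply-adds cannot
 * overlap across iterations. */
void prefix_scale(double *s, size_t n, double k) {
    for (size_t i = 1; i < n; i++)
        s[i] += k * s[i - 1];
}
\end{verbatim}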
\bigskip{}
Throughout this manuscript, we explore notions that cut across the hardware
blocks around which the chapters are organized.
\medskip{}
Most of our approaches work towards an \emph{automated,
microarchitecture-independent} tooling. While fine-grained, accurate code
analysis is directly concerned with the underlying hardware and its specific
implementation, we strive to write tooling that depends as little as possible
on vendor-specific interfaces. In practice, this rules out most uses of
hardware counters, which depend greatly on the manufacturer, or even on the
specific chip considered. As some CPUs expose only very limited hardware
counters, we see this commitment as an opportunity to develop methodologies
able to model these processors.
This is particularly true of \palmed, in \autoref{chap:palmed}, whose goal is
to model a processor's backend resources without resorting to its hardware
counters. Our frontend study, in \autoref{chap:frontend}, also follows this
strategy by focusing on a processor whose hardware counters give little to no
insight into its frontend. While this goal is less relevant to \staticdeps{},
there too we rely on external libraries to abstract away the underlying
architecture.
\medskip{}
Our methodologies are, whenever relevant, \emph{benchmarks- and
experiments-driven}, in a bottom-up style, placing real hardware at the
center. In this spirit, \palmed{} is based solely on benchmarks, discarding
the manufacturer's documentation entirely. Our model of the Cortex A72
frontend is based on both measurements and documentation, yet it strives to be
a case study from which future works can generalize, to automatically
synthesize frontend models in a benchmarks-based fashion. One of the goals of
our survey of the state of the art, in \autoref{chap:CesASMe}, is to identify
through experiments the shortcomings that are most crucial to address in order
to strengthen static code analyzers.
\medskip{}
Finally, given the extent of the ecological and climatic crises we are facing,
as assessed among others by the IPCC~\cite{ipcc_ar6_syr}, we believe that
every field and discipline should strive for a positive impact or, at the very
least, reduce its negative impact as much as possible. Our very modest
contribution to this end, throughout this thesis, is to commit ourselves to
computations as \emph{frugal} as possible: run computation-heavy experiments
as little as possible; avoid running the same experiment multiple times,
caching results instead whenever feasible; etc. This commitment partly
motivated us to implement a results database in \palmed{}, so that each
benchmark is computed only once. As our experiments in \autoref{chap:CesASMe}
take many hours to yield a result, we at least evaluate their carbon impact.
It is noteworthy, however, that although this thesis is concerned with tools
that help optimize large computation workloads, \emph{optimization does not
lead to frugality}. In most cases, the Jevons paradox ---~also called the
rebound effect~--- instead makes it more likely to lead to an increased
absolute usage of computational
resources~\cite{jevons_coal_question,understanding_jevons_paradox}.