\chapter*{Introduction}\label{chap:intro}
\addcontentsline{toc}{chapter}{Introduction}

Developing new features and fixing bugs are often regarded as the major parts
of a program's development cycle. However, performance optimization may be
just as crucial for compute-intensive software. For small-scale applications,
it improves usability by reducing, or even hiding, the waiting time the user
must endure between operations, and by allowing heavier workloads to be
processed without requiring larger resources, even in constrained embedded
hardware environments. For large-scale applications, which may run for
extended periods of time or on whole clusters, optimization is a
cost-effective path, as it allows the same workload to run on smaller
clusters, for shorter periods of time.

The most significant optimization gains come from ``high-level'' algorithmic
changes, such as computing on multiple cores instead of sequentially, caching
already-computed results, reimplementing a function to run asymptotically in
$\bigO{n \cdot \log(n)}$ instead of $\bigO{n^2}$, or avoiding copies of large
data structures. However, once a piece of software is already well-optimized
in these respects, the impact of low-level considerations, stemming from the
hardware implementation of the machine itself, can no longer be neglected. A
common example of such impacts is the traversal of a large matrix in either
row-major or column-major order:

\vspace{1em}
\definecolor{rowmajor_row}{HTML}{1b5898}
\definecolor{rowmajor_col}{HTML}{9c6210}
\begin{minipage}[c]{0.48\linewidth}
\begin{algorithmic}
\State{sum $\gets 0$}
\For{{\color{rowmajor_row}row} $<$ MAX\_ROW}
\For{{\color{rowmajor_col}col} $<$ MAX\_COLUMN}
\State{sum $\gets$ sum $+~\text{matrix}[\text{\color{rowmajor_row}row}][\text{\color{rowmajor_col}col}]$}
\EndFor
\EndFor
\end{algorithmic}
\end{minipage}\hfill
\begin{minipage}[c]{0.48\linewidth}
\begin{algorithmic}
\State{sum $\gets 0$}
\For{{\color{rowmajor_col}col} $<$ MAX\_COLUMN}
\For{{\color{rowmajor_row}row} $<$ MAX\_ROW}
\State{sum $\gets$ sum $+~\text{matrix}[\text{\color{rowmajor_row}row}][\text{\color{rowmajor_col}col}]$}
\EndFor
\EndFor
\end{algorithmic}
\end{minipage}
\vspace{1em}

While both programs perform the exact same computation, the left one iterates
over rows first, in \textit{row-major} order, while the right one iterates
over columns first, in \textit{column-major} order. On large matrices, the
latter causes frequent cache misses, and was measured to run up to about six
times slower than the former~\cite{rowmajor_repo}.
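A minimal C sketch of both traversals, with hypothetical matrix dimensions,
can reproduce this effect. Note that an aggressively optimizing compiler may
interchange the loops itself, so a moderate optimization level such as
\texttt{-O1} is safer for this experiment.

\begin{verbatim}
#include <stdio.h>
#include <time.h>

/* Hypothetical dimensions; large enough to exceed the caches. */
#define MAX_ROW    4096
#define MAX_COLUMN 4096

static double matrix[MAX_ROW][MAX_COLUMN];

/* Row-major traversal: consecutive accesses are adjacent in
 * memory, making good use of each cache line fetched. */
static double sum_row_major(void) {
    double sum = 0.0;
    for (int row = 0; row < MAX_ROW; row++)
        for (int col = 0; col < MAX_COLUMN; col++)
            sum += matrix[row][col];
    return sum;
}

/* Column-major traversal: consecutive accesses are MAX_COLUMN
 * doubles apart, so nearly every access misses the cache once
 * the matrix exceeds the cache size. */
static double sum_column_major(void) {
    double sum = 0.0;
    for (int col = 0; col < MAX_COLUMN; col++)
        for (int row = 0; row < MAX_ROW; row++)
            sum += matrix[row][col];
    return sum;
}

/* Returns the wall-clock time, in seconds, taken by f(). */
static double timed(double (*f)(void)) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    volatile double sum = f();
    (void)sum;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (double)(t1.tv_sec - t0.tv_sec)
         + 1e-9 * (double)(t1.tv_nsec - t0.tv_nsec);
}

int main(void) {
    printf("row-major:    %.3f s\n", timed(sum_row_major));
    printf("column-major: %.3f s\n", timed(sum_column_major));
    return 0;
}
\end{verbatim}

On typical hardware, the second figure printed should be noticeably larger
than the first, in line with the measurements cited above.

\medskip{}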
This, however, is still an optimization that holds for the vast majority of
CPUs. In many cases, transformations targeting a specific microarchitecture
can be very beneficial. For instance, Uday Bondhugula found that manually
tuning a general matrix multiplication kernel, through many techniques and
tools, could multiply its throughput by roughly 13.5 compared to
\texttt{gcc~-O3}, or even make it 130 times faster than
\texttt{clang -O3}~\cite{dgemm_finetune}. This kind of optimization, however,
requires manual effort and deep expert knowledge, both of optimization
techniques and of the specific architecture targeted. These techniques are
only worth applying to the parts of a program that are executed the most
---~usually called the \emph{hottest} parts~---: loop bodies that are
iterated enough times to be assumed infinite. Such loop bodies are called
\emph{computation kernels}, and this whole manuscript is concerned with them.
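While reproducing such a finely-tuned kernel is beyond the scope of this
introduction, loop tiling gives a flavour of the transformations involved: it
is one of the classic techniques on which such manual tuning builds, not the
cited work itself. Below is a sketch, where the dimension \texttt{N} and tile
size \texttt{TILE} are hypothetical; \texttt{TILE} must divide \texttt{N} and
would in practice be tuned to the target's cache hierarchy.

\begin{verbatim}
#define N    1024   /* hypothetical matrix dimension */
#define TILE 64     /* hypothetical tile size; must divide N */

/* Naive kernel computing C += A * B. */
void matmul_naive(const double A[N][N], const double B[N][N],
                  double C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
}

/* Tiled kernel: the same computation, iterated block by block so
 * that the data touched by the three innermost loops stays small
 * enough to fit in cache, letting each cache line be reused many
 * times before eviction. */
void matmul_tiled(const double A[N][N], const double B[N][N],
                  double C[N][N]) {
    for (int ii = 0; ii < N; ii += TILE)
        for (int kk = 0; kk < N; kk += TILE)
            for (int jj = 0; jj < N; jj += TILE)
                for (int i = ii; i < ii + TILE; i++)
                    for (int k = kk; k < kk + TILE; k++)
                        for (int j = jj; j < jj + TILE; j++)
                            C[i][j] += A[i][k] * B[k][j];
}
\end{verbatim}

Picking the right tile size, combining further transformations such as
vectorization and unrolling, and adapting them all to one specific
microarchitecture is precisely the kind of expert work mentioned above.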
\medskip{}

Developers are used to \emph{functional debugging}, the practice of tracking
down the root cause of an unexpected, incorrect functional behaviour. Akin to
it is \emph{performance debugging}, the practice of tracking down the root
cause of performance below expectations. Just as functional debugging can be
carried out in a variety of ways, from guesswork and inserted print
statements to sophisticated tools such as \gdb{}, performance debugging can
be carried out with different tools. Crude timing measurements and profiling
can point to a general part of the program or hint at an issue; reading
\emph{hardware counters} ---~metrics reported by the CPU~--- can lead to a
better understanding, and may confirm or invalidate a hypothesis. Still other
tools, \emph{code analyzers}, analyze the assembly code and, in light of a
built-in hardware model, strive to provide a performance analysis.
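As an illustration, on Linux, hardware counters can be read programmatically
through the \texttt{perf\_event\_open(2)} system call; the minimal,
Linux-specific sketch below counts the CPU cycles spent in a region of
interest.

\begin{verbatim}
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* glibc provides no wrapper for perf_event_open; call it directly. */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags) {
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;         /* generic hardware event */
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES; /* count CPU cycles */
    attr.disabled = 1;                      /* start disabled */
    attr.exclude_kernel = 1;                /* user-space cycles only */
    attr.exclude_hv = 1;

    int fd = (int)perf_event_open(&attr, 0 /* this process */,
                                  -1 /* any CPU */, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* Region of interest: the code being measured. */
    volatile double sum = 0.0;
    for (int i = 0; i < 10000000; i++)
        sum += 0.5 * i;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t cycles;
    if (read(fd, &cycles, sizeof(cycles)) != sizeof(cycles)) return 1;
    printf("cycles: %llu\n", (unsigned long long)cycles);
    close(fd);
    return 0;
}
\end{verbatim}

Which events are available, and how reliable they are, varies greatly from
one CPU to another.

\medskip{}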
An exact modelling of the processor would require a cycle-accurate simulator,
reproducing the precise behaviour of the silicon and allowing one to observe
any desired metric. Such a simulator, however, would be prohibitively slow,
and is not even available for most architectures, as processors are usually
not open hardware and manufacturers regard their implementation as an
industrial secret. Code analyzers thus resort to approximate, higher-level
models of various kinds. Tools based on such models, as opposed to
measurements or hardware counter sampling, may not always be precise and
faithful. They can, however, inspect their inner model state at will, and
derive more advanced metrics or hypotheses, for instance by predicting which
resource might be overloaded and slow down the whole computation.
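To illustrate on a toy example, not tied to any real CPU: assume a backend
model with two execution ports $p_0$ and $p_1$, each able to execute one
micro-operation (µop) per cycle, and a kernel whose loop body issues four
µops bound to $p_0$ and two µops that may execute on either port. Scheduling
the two flexible µops on $p_1$ balances the load as well as possible, so the
model predicts a steady-state throughput of $\max(4, 2) = 4$ cycles per
iteration and flags $p_0$ as the overloaded resource: any further speedup
must come from relieving $p_0$.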
\vspace{2em}

In this thesis, we explore the three major aspects that contribute to a code
analyzer's accuracy: a \emph{backend model}, a \emph{frontend model} and a
\emph{dependencies model}. We propose contributions to strengthen each of
them, as well as to automate the synthesis of the underlying models. We focus
on \emph{static} code analyzers, which derive metrics, including runtime
predictions, from assembly code or an assembled binary without executing it.
The \hyperref[chap:foundations]{first chapter} introduces the foundations of
this manuscript, describing the microarchitectural notions on which our
analyses are based, and surveying the current state of the art.

\autoref{chap:palmed} introduces \palmed{}, a benchmark-based tool that
automatically synthesizes a model of a CPU's backend. Although the
theoretical core of \palmed{} is not my own work, I made major contributions
to other aspects of the tool. The chapter also presents the foundations and
methodologies that \palmed{} shares with the following parts.

In \autoref{chap:frontend}, we explore the frontend aspects of static code
analyzers. This chapter focuses on a manual study of the Cortex A72 processor
and proposes a static model of its frontend. We finally reflect on
generalizing our manual approach into an automated frontend modelling tool,
akin to \palmed.

Chapter~\ref{chap:CesASMe} presents an extensive study of the strengths and
shortcomings of state-of-the-art code analyzers. To this end, we introduce a
fully-tooled approach in two parts: first, a benchmark-generation procedure,
yielding thousands of benchmarks relevant in the context of our approach;
then, a benchmarking harness evaluating code analyzers on these benchmarks.
We find that most state-of-the-art code analyzers struggle to correctly
account for some types of data dependencies.

Further building on these findings, \autoref{chap:staticdeps} introduces
\staticdeps{}, an accurate heuristic-based tool that statically extracts data
dependencies from an assembly computation kernel. We extend \uica{}, a
state-of-the-art code analyzer, with \staticdeps{} predictions, and evaluate
the resulting gain in accuracy.

\bigskip{}
Throughout this manuscript, we explore notions that are transversal to the
hardware blocks around which the chapters are organized.

\medskip{}
Most of our approaches work towards \emph{automated,
microarchitecture-independent} tooling. While fine-grained, accurate code
analysis is directly concerned with the underlying hardware and its specific
implementation, we strive to write tooling that depends as little as possible
on vendor-specific interfaces. In practice, this rules out most uses of
hardware counters, which vary greatly between manufacturers, or even between
specific chips. As some CPUs expose only very limited hardware counters, we
see this commitment as an opportunity to develop methodologies able to model
such processors.

This is particularly true of \palmed, in \autoref{chap:palmed}, whose goal is
to model a processor's backend resources without resorting to its
vendor-specific hardware counters. Our frontend study, in
\autoref{chap:frontend}, also follows this strategy by focusing on a
processor whose hardware counters give little to no insight into its
frontend. While this goal is less relevant to \staticdeps{}, there we rely on
external libraries to abstract away the underlying architecture.

\medskip{}
Our methodologies are, whenever relevant, \emph{benchmark- and
experiment-driven}, in a bottom-up style, placing real hardware at the
center. In this spirit, \palmed{} is based solely on benchmarks, discarding
the manufacturer's documentation entirely. Our model of the Cortex A72
frontend is based on both measurements and documentation, yet it strives to
be a case study from which future work can generalize towards automatically
synthesizing frontend models in a benchmark-based fashion. One of the goals
of our survey of the state of the art, in \autoref{chap:CesASMe}, is to
identify through experiments the shortcomings that are most crucial to
address in order to strengthen static code analyzers.

\medskip{}
Finally, given the extent of the ecological and climatic crises we are
facing, as assessed among others by the IPCC~\cite{ipcc_ar6_syr}, we believe
that every field and discipline should strive for a positive impact or, at
the very least, reduce its negative impact as much as possible. Our very
modest contribution to this end, throughout this thesis, is to commit
ourselves to computations as \emph{frugal} as possible: running
computation-heavy experiments as little as possible; avoiding running the
same experiment multiple times, instead caching results whenever feasible;
etc. This commitment partly motivated us to implement a results database in
\palmed{}, so that each benchmark is computed only once. As our experiments
in \autoref{chap:CesASMe} take many hours to yield a result, we at least
evaluate their carbon impact.

It is noteworthy, however, that although this thesis is concerned with tools
that help optimize large computation workloads, \emph{optimization does not
lead to frugality}. In most cases, the Jevons paradox ---~also called the
rebound effect~--- makes it more likely to lead instead to an increased
absolute usage of computational
resources~\cite{jevons_coal_question,understanding_jevons_paradox}.