\chapter{Introduction}\label{chap:intro}

Developing new features and fixing problems are often regarded as the major
parts of the development cycle of a program. However, performance optimization
can be just as crucial for compute-intensive software. In small-scale
applications, it improves usability by reducing, or even hiding, the waiting
time the user must endure between operations, or by allowing heavier workloads
to be processed without requiring more resources, or within constrained
embedded hardware environments. In large-scale applications, which may run for
an extended period of time or on whole clusters, optimization is a
cost-effective path, as it allows the same workload to run on smaller
clusters, for shorter periods of time.

The most significant optimization gains come from ``high-level'' algorithmic
changes, such as computing on multiple cores instead of sequentially, caching
already computed results, reimplementing a function to run asymptotically in
$\bigO{n\cdot \log(n)}$ instead of $\bigO{n^2}$, or avoiding copies of large
data structures. However, when a program is already well-optimized from these
perspectives, the impact of low-level considerations, stemming from the
hardware implementation of the machine itself, can no longer be neglected. A
common example of such an impact is the iteration over a large matrix in either
row-major or column-major order:

\vspace{1em}

\definecolor{rowmajor_row}{HTML}{1b5898}
\definecolor{rowmajor_col}{HTML}{9c6210}
\begin{minipage}[c]{0.48\linewidth}
    \begin{algorithmic}
        \State{sum $\gets 0$}
        \For{{\color{rowmajor_row}row} $<$ MAX\_ROW}
            \For{{\color{rowmajor_col}col} $<$ MAX\_COLUMN}
                \State{sum $\gets$ sum $+~\text{matrix}[\text{\color{rowmajor_row}row}][\text{\color{rowmajor_col}col}]$}
            \EndFor
        \EndFor
    \end{algorithmic}
\end{minipage}\hfill
\begin{minipage}[c]{0.48\linewidth}
    \begin{algorithmic}
        \State{sum $\gets 0$}
        \For{{\color{rowmajor_col}col} $<$ MAX\_COLUMN}
            \For{{\color{rowmajor_row}row} $<$ MAX\_ROW}
                \State{sum $\gets$ sum $+~\text{matrix}[\text{\color{rowmajor_row}row}][\text{\color{rowmajor_col}col}]$}
            \EndFor
        \EndFor
    \end{algorithmic}
\end{minipage}

\vspace{1em}

While both programs are performing the exact same computation, the left one
iterates over rows first, or \textit{row-major}, while the right one iterates
over columns first, or \textit{column-major}. On large matrices stored row by
row in memory, as is the case in C, the latter causes frequent cache misses,
and was measured to run up to about six times slower than the
former~\cite{rowmajor_repo}.
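As a rough illustration of why this happens, consider the following
back-of-the-envelope estimate. It rests on assumptions that are not taken from
the measurement above: the matrix is stored row by row, each element occupies
8~bytes, a cache line holds 64~bytes, and the matrix is large enough that a
cache line fetched while walking down one column has been evicted by the time
the next column is visited. Writing $m_{\text{row}}$ and $m_{\text{col}}$
(notation introduced here for illustration only) for the average number of
cache misses per accessed element in the row-major and column-major traversals,
a cache line holds $64 / 8 = 8$ consecutive elements of a row, so that
\[
    m_{\text{row}} \approx \frac{8}{64} = \frac{1}{8}
    \qquad\text{and}\qquad
    m_{\text{col}} \approx 1,
\]
that is, about eight times as many misses for the column-major traversal. This
only counts misses under the stated assumptions and ignores, among others,
hardware prefetching; it should not be read as a prediction of the measured
slowdown.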
%

This, however, is still an optimization that holds for the vast majority of
CPUs. In many cases, transformations targeting a specific microarchitecture can
also be very beneficial.
For instance, Uday Bondhugula found that manually tuning a general matrix
multiplication, through many techniques and tools, could multiply its
throughput by roughly 13.5 compared to \texttt{gcc~-O3}, and by as much as 130
compared to \texttt{clang~-O3}~\cite{dgemm_finetune}.
This kind of optimization, however, requires manual effort, and deep expert
knowledge of both optimization techniques and the specific architecture
targeted.
These techniques are only worth applying to the most executed parts of a
program, usually called its \emph{hottest} parts: loop bodies that are iterated
enough times for the loop to be regarded as infinite. Such loop bodies are
called \emph{computation kernels}, with which this whole manuscript is
concerned.

\medskip{}

Developers are used to \emph{functional debugging}, the practice of tracking
the root cause of an unexpected functional misbehaviour. Akin to it is
\emph{performance debugging}, the practice of tracking the root cause of
performance below expectations. Just as functional debugging can be carried
out in a variety of ways, from guessing and inserting print instructions to
sophisticated tools such as \gdb{}, performance debugging can be carried out
with different tools. Crude timing measurements and profiling can point to a
general part of the program or hint at an issue; reading \emph{hardware
counters}, metrics reported by the CPU, can lead to a better understanding, and
may confirm or invalidate a hypothesis. Still other tools, \emph{code
analyzers}, analyze the assembly code and, in light of a built-in hardware
model, strive to provide a performance analysis.

An exact modelling of the processor would require a cycle-accurate simulator,
reproducing the precise behaviour of the silicon and allowing one to observe any
desired metric. Such a simulator, however, would be prohibitively slow, and is
not available for most architectures anyway, as processors are not usually open
hardware and manufacturers regard their implementation as an industrial
secret. Code analyzers thus resort to approximate, higher-level models of
various kinds. Tools based on such models, as opposed to measurements or
hardware counter sampling, may not always be precise and faithful. They can,
however, inspect their inner model state at will, and derive more advanced
metrics or hypotheses, for instance by predicting which resource might be
overloaded and slow down the whole computation.