
\chapter{Introduction}\label{chap:intro}

Developing new features and fixing problems are often regarded as the major
parts of the development cycle of a program. However, performance optimization
can be just as crucial for compute-intensive software. For small-scale
applications, it improves usability by reducing, or even hiding, the waiting
time the user must endure between operations; it also allows heavier workloads
to be processed without larger resources, or to run in constrained embedded
hardware environments. For large-scale applications, which may run for an
extended period of time or on whole clusters, optimization is a
cost-effective path, as it allows the same workload to run on smaller
clusters or for shorter periods of time.

The most significant optimization gains come from ``high-level'' algorithmic
changes, such as computing on multiple cores instead of sequentially, caching
already computed results, reimplementing a function to run asymptotically in
$\bigO{n\cdot \log(n)}$ instead of $\bigO{n^2}$, or avoiding copies of large
data structures. However, when a program is already well-optimized from these
perspectives, the impact of low-level considerations, stemming from the
hardware implementation of the machine itself, can no longer be neglected. A
common example of such an impact is the traversal of a large matrix in either
row-major or column-major order:

\vspace{1em}

\definecolor{rowmajor_row}{HTML}{1b5898}
\definecolor{rowmajor_col}{HTML}{9c6210}
\begin{minipage}[c]{0.48\linewidth}
    \begin{algorithmic}
        \State{sum $\gets 0$}
        \For{{\color{rowmajor_row}row} $<$ MAX\_ROW}
            \For{{\color{rowmajor_col}col} $<$ MAX\_COLUMN}
                \State{sum $\gets$ sum $+~\text{matrix}[\text{\color{rowmajor_row}row}][\text{\color{rowmajor_col}col}]$}
            \EndFor
        \EndFor
    \end{algorithmic}
\end{minipage}\hfill
\begin{minipage}[c]{0.48\linewidth}
    \begin{algorithmic}
        \State{sum $\gets 0$}
        \For{{\color{rowmajor_col}col} $<$ MAX\_COLUMN}
            \For{{\color{rowmajor_row}row} $<$ MAX\_ROW}
                \State{sum $\gets$ sum $+~\text{matrix}[\text{\color{rowmajor_row}row}][\text{\color{rowmajor_col}col}]$}
            \EndFor
        \EndFor
    \end{algorithmic}
\end{minipage}

\vspace{1em}

While both programs perform the exact same computation, the left one
iterates over rows first, or \textit{row-major}, while the right one iterates over
columns first, or \textit{column-major}. Assuming the matrix is stored row by
row in memory, as is the case in C, the latter will cause frequent cache misses
on large matrices, and was measured to run up to about six times
slower than the former~\cite{rowmajor_repo}.
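
As a rough, back-of-the-envelope illustration of this gap, assume
(hypothetically, as these values are not taken from the cited measurements)
that the matrix is stored row by row, that each element occupies $s = 8$ bytes,
that a cache line holds $\ell = 64$ bytes, and that the matrix is too large for
a line to remain cached until the traversal returns to it. The fraction of
accesses that miss the cache is then approximately
\[
    m_{\text{row-major}} \approx \frac{s}{\ell} = \frac{8}{64} = 0.125,
    \qquad
    m_{\text{column-major}} \approx 1,
\]
that is, one miss every eight accesses in the first case, against one miss per
access in the second. This simplistic model ignores hardware prefetching, the
multiple cache levels and address translation, so it only suggests an order of
magnitude for the slowdown.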

This, however, is still an optimization that holds for the vast majority of
CPUs. In many cases, transformations targeting a specific microarchitecture can
be very beneficial.
For instance, Uday Bondhugula found that manually tuning a general matrix
multiplication, through many techniques and tools, could multiply its
throughput by roughly 13.5 compared to \texttt{gcc~-O3}, or even by 130
compared to \texttt{clang~-O3}~\cite{dgemm_finetune}.
This kind of optimization, however, requires manual effort and
deep expert knowledge of both optimization techniques and the specific
architecture targeted.
These techniques are only worth applying to the most frequently executed parts
of a program, usually called its \emph{hottest} parts: loop bodies that
are iterated enough times to be treated as infinite loops. Such loop bodies are called
\emph{computation kernels}, and are the focus of this whole manuscript.

\medskip{}

Developers are used to \emph{functional debugging}, the practice of tracking
down the root cause of an unexpected functional behaviour. Akin to it is
\emph{performance debugging}, the practice of tracking down the root cause of
performance that falls below expectations. Just as functional debugging can be
carried out in a variety of ways, from guessing and inserting print
instructions to sophisticated tools such as \gdb{}, performance debugging can
be carried out with different tools. Crude timing measurements and profiling
can point to a general part of the program or hint at an issue; reading
\emph{hardware counters}, which are metrics reported by the CPU, can lead to a
better understanding, and may confirm or invalidate a hypothesis. Yet other
tools, \emph{code analyzers}, analyze the assembly code and, in the light of a
built-in hardware model, strive to provide a performance analysis.

An exact model of the processor would require a cycle-accurate simulator,
reproducing the precise behaviour of the silicon and allowing one to observe any
desired metric. Such a simulator, however, would be prohibitively slow, and is
not available for most architectures anyway, as processors are not usually open
hardware and manufacturers regard their implementation as an industrial
secret. Code analyzers thus resort to approximate, higher-level models of
various kinds. Tools based on such models, as opposed to measurements or
hardware counter sampling, may not always be precise and faithful. They can,
however, inspect the state of their inner model at will, and derive more
advanced metrics or hypotheses, for instance by predicting which resource
might be overloaded and slow down the whole computation.