Intro: progress

Théophile Bastian 2023-10-08 22:32:04 +02:00
parent 3285743cd4
commit 0d7bf6a974

parts of the development cycle of a program. However, performance optimization
might be just as crucial for compute-intensive software. On small-scale
applications, it improves usability by reducing, or even hiding, the waiting
time the user must endure between operations, or by allowing heavier workloads
to be processed without needing larger resources, or to run in constrained
embedded hardware environments. On large-scale applications that may run for an
extended period of time, or that run on whole clusters, optimization is a
cost-effective path, as it allows the same workload to be run on smaller
clusters, for reduced periods of time.

The most significant optimization gains come from ``high-level'' algorithmic
changes, such as computing on multiple cores instead of sequentially, caching
row-major or column-major:
\vspace{1em}
\definecolor{rowmajor_row}{HTML}{1b5898}
\definecolor{rowmajor_col}{HTML}{9c6210}
\begin{minipage}[c]{0.48\linewidth}
\begin{algorithmic}
\State{sum $\gets 0$}
\For{{\color{rowmajor_row}row} $<$ MAX\_ROW}
\For{{\color{rowmajor_col}col} $<$ MAX\_COLUMN}
\State{sum $\gets$ sum $+~\text{matrix}[\text{\color{rowmajor_row}row}][\text{\color{rowmajor_col}col}]$}
\EndFor
\EndFor
\end{algorithmic}
\end{minipage}
\begin{minipage}[c]{0.48\linewidth}
\begin{algorithmic}
\State{sum $\gets 0$}
\For{{\color{rowmajor_col}col} $<$ MAX\_COLUMN}
\For{{\color{rowmajor_row}row} $<$ MAX\_ROW}
\State{sum $\gets$ sum $+~\text{matrix}[\text{\color{rowmajor_row}row}][\text{\color{rowmajor_col}col}]$}
\EndFor
\EndFor
\end{algorithmic}
\end{minipage}

iterates on rows first, or \textit{row-major}, while the right one iterates on
columns first, or \textit{column-major}. The latter, on large matrices, will
cause frequent cache misses, and was measured to run up to about six times
slower than the former~\cite{rowmajor_repo}.

This, however, is still an optimization that holds for the vast majority of
CPUs. In many cases, transformations targeting a specific microarchitecture can
be very beneficial. \qtodo{Insert number/ref \wrt{} matmult or some kernel of
the like.} This kind of optimization, however, requires manual effort, and a
deep expert knowledge both of optimization techniques and of the specific
architecture targeted.

These techniques are only worth applying to the parts of a program that are
executed most ---~usually called the \emph{hottest} parts~---: loop bodies that
are iterated enough times to be assumed infinite. Such loop bodies are called
\emph{computation kernels}, with which this whole manuscript will be concerned.
\medskip{}

Developers are used to \emph{functional debugging}, the practice of tracking
the root cause of an unexpected bad functional behaviour. Akin to it is
\emph{performance debugging}, the practice of tracking the root cause of a
performance below expectations. Just as functional debugging can be carried out
in a variety of ways, from guessing and inserting print instructions to
sophisticated tools such as \gdb{}, performance debugging can be carried out
with different tools. Crude timing measurements and profiling can point to a
general part of the program or hint at an issue; reading \emph{hardware
counters} ---~metrics reported by the CPU~--- can lead to a better
understanding, and may confirm or invalidate a hypothesis. Yet other tools,
\emph{code analyzers}, analyze the assembly code and, in the light of a
built-in hardware model, strive to provide a performance analysis.

An exact modelling of the processor would require a cycle-accurate simulator,
reproducing the precise behaviour of the silicon and allowing one to observe
any desired metric. Such a simulator, however, would be prohibitively slow, and
is not available for most architectures anyway, as processors are not usually
open hardware and manufacturers regard their implementation as an industrial
secret. Code analyzers thus resort to approximated, higher-level models of
varied kinds. Tools based on such models, as opposed to measurements or
hardware-counter sampling, may not always be precise and faithful. They can,
however, inspect their inner model state at will, and derive more advanced
metrics or hypotheses, for instance by predicting which resource might be
overloaded and slow down the whole computation.