Intro: progress
parts of the development cycle of a program. However, performance optimization
might be just as crucial for compute-intensive software. On small-scale
applications, it improves usability by reducing, or even hiding, the waiting
time the user must endure between operations, or by allowing heavier workloads
to be processed without needing larger resources, or in constrained embedded
hardware environments. On large-scale applications that may run for an
extended period of time, or may be run on whole clusters, optimization is a
cost-effective path, as it allows the same workload to be run on smaller
clusters, for reduced periods of time.

The most significant optimization gains come from ``high-level'' algorithmic
changes, such as computing on multiple cores instead of sequentially, caching

row-major or column-major:

\vspace{1em}

\definecolor{rowmajor_row}{HTML}{1b5898}
\definecolor{rowmajor_col}{HTML}{9c6210}
\begin{minipage}[c]{0.48\linewidth}
  \begin{algorithmic}
    \State{sum $\gets 0$}
    \For{{\color{rowmajor_row}row} $<$ MAX\_ROW}
      \For{{\color{rowmajor_col}col} $<$ MAX\_COLUMN}
        \State{sum $\gets$ sum $+~\text{matrix}[\text{\color{rowmajor_row}row}][\text{\color{rowmajor_col}col}]$}
      \EndFor
    \EndFor
  \end{algorithmic}
\end{minipage}\hfill
\begin{minipage}[c]{0.48\linewidth}
  \begin{algorithmic}
    \State{sum $\gets 0$}
    \For{{\color{rowmajor_col}col} $<$ MAX\_COLUMN}
      \For{{\color{rowmajor_row}row} $<$ MAX\_ROW}
        \State{sum $\gets$ sum $+~\text{matrix}[\text{\color{rowmajor_row}row}][\text{\color{rowmajor_col}col}]$}
      \EndFor
    \EndFor
  \end{algorithmic}
\end{minipage}
|
Both snippets above compute the sum of every element of a matrix. The left one
iterates on rows first, or \textit{row-major}, while the right one iterates on
columns first, or \textit{column-major}. The latter, on large matrices, will
cause frequent cache misses, and was measured to run up to about six times
slower than the former~\cite{rowmajor_repo}.
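
In concrete terms, the two traversals could be written in C as follows (a
minimal sketch, assuming the matrix is an ordinary two-dimensional C array,
which the language lays out in row-major order; the names and sizes are purely
illustrative):

\begin{verbatim}
#include <stddef.h>

#define MAX_ROW    4096
#define MAX_COLUMN 4096

/* Row-major traversal: consecutive iterations touch consecutive
 * addresses, so most accesses hit in the cache. */
long sum_row_major(const int matrix[MAX_ROW][MAX_COLUMN]) {
    long sum = 0;
    for (size_t row = 0; row < MAX_ROW; row++)
        for (size_t col = 0; col < MAX_COLUMN; col++)
            sum += matrix[row][col];
    return sum;
}

/* Column-major traversal: successive accesses are a whole row apart,
 * evicting cache lines long before they are reused. */
long sum_column_major(const int matrix[MAX_ROW][MAX_COLUMN]) {
    long sum = 0;
    for (size_t col = 0; col < MAX_COLUMN; col++)
        for (size_t row = 0; row < MAX_ROW; row++)
            sum += matrix[row][col];
    return sum;
}
\end{verbatim}
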
This, however, is still an optimization that holds for the vast majority of
CPUs. In many cases, transformations targeting a specific microarchitecture
can be very beneficial. \qtodo{Insert number/ref \wrt{} matmult or some kernel
of the like.} This kind of optimization, however, requires manual effort, and
deep expert knowledge both of optimization techniques and of the specific
architecture targeted.
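
As a purely illustrative sketch (not one of the measured examples of this
manuscript), the following C code unrolls a reduction over four independent
accumulators, breaking the dependency chain so that several additions can be
in flight at once; the profitable unrolling factor depends on the latency and
on the number of floating-point units of the target microarchitecture, so the
transformation only pays off where those parameters match:

\begin{verbatim}
#include <stddef.h>

/* Baseline: a single accumulator chains every addition after the
 * previous one, so throughput is bounded by the latency of one
 * floating-point add. */
float sum_naive(const float *data, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += data[i];
    return sum;
}

/* Four independent accumulators hide the add latency.  Note that this
 * reassociates floating-point additions, which may change rounding. */
float sum_unrolled(const float *data, size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += data[i];
        s1 += data[i + 1];
        s2 += data[i + 2];
        s3 += data[i + 3];
    }
    for (; i < n; i++)      /* leftover elements */
        s0 += data[i];
    return (s0 + s1) + (s2 + s3);
}
\end{verbatim}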
|
These techniques are only worth applying to the parts of a program that are
executed the most ---~usually called the \emph{hottest} parts~---: loop bodies
that are iterated enough times to be assumed infinite. Such loop bodies are
called \emph{computation kernels}, with which this whole manuscript will be
concerned.

\medskip{}

Developers are used to \emph{functional debugging}, the practice of tracking
down the root cause of an unexpected, incorrect functional behaviour. Akin to
it is \emph{performance debugging}, the practice of tracking down the root
cause of performance below expectations. Just as functional debugging can be
carried out in a variety of ways, from guessing and inserting print
instructions to sophisticated tools such as \gdb{}, performance debugging can
be carried out with different tools. Crude timing measurements and profiling
can point to a general part of the program or hint at an issue; reading
\emph{hardware counters} ---~metrics reported by the CPU~--- can lead to a
better understanding, and may confirm or invalidate a hypothesis. Other tools
still, \emph{code analyzers}, analyze the assembly code and, in the light of a
built-in hardware model, strive to provide a performance analysis.
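
For instance, on Linux, hardware counters can be read programmatically through
the \texttt{perf\_event\_open} system call ---~the same counters that the
\texttt{perf} profiler reports. The sketch below is a minimal, illustrative
example assuming a Linux machine; error handling is mostly omitted and the
measured loop merely stands in for a real kernel:

\begin{verbatim}
#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Thin wrapper: glibc does not expose perf_event_open() directly. */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags) {
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CACHE_MISSES;   /* count cache misses */
    attr.disabled = 1;                          /* enabled explicitly below */
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    int fd = perf_event_open(&attr, 0, -1, -1, 0); /* this process, any CPU */
    if (fd == -1) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* Region of interest: a stand-in for the kernel under study.
     * (volatile keeps the compiler from removing the reads) */
    static volatile int data[1 << 20];
    long sum = 0;
    for (size_t i = 0; i < sizeof(data) / sizeof(data[0]); i++)
        sum += data[i];

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t misses = 0;
    if (read(fd, &misses, sizeof(misses)) < 0) perror("read");
    printf("sum = %ld, cache misses = %llu\n",
           sum, (unsigned long long)misses);

    close(fd);
    return 0;
}
\end{verbatim}

Counting cache misses this way around the row-major and column-major
traversals shown earlier would typically expose their difference in cache
behaviour directly, without any timing measurement.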
|
An exact modelling of the processor would require a cycle-accurate simulator,
reproducing the precise behaviour of the silicon and allowing one to observe
any desired metric. Such a simulator, however, would be prohibitively slow,
and is not available for most architectures anyway, as processors are not
usually open hardware and manufacturers regard their implementation as an
industrial secret. Code analyzers thus resort to approximated, higher-level
models of varied kinds. Tools based on such models, as opposed to measurements
or hardware-counter sampling, may not always be precise and faithful. They
can, however, inspect their inner model's state at will, and derive more
advanced metrics or hypotheses, for instance by predicting which resource
might be overloaded and slow down the whole computation.