Intro: progress
parent 3285743cd4
commit 0d7bf6a974
1 changed files with 51 additions and 10 deletions
@@ -5,10 +5,11 @@ parts of the development cycle of a program. However, performance optimization
 might be just as crucial for compute-intensive software. On small-scale
 applications, it improves usability by reducing, or even hiding, the waiting
 time the user must endure between operations, or by allowing heavier workloads
-to be processed without needing larger resources. On large-scale applications,
-that may run for an extended period of time, or may be run on whole clusters,
-optimization is a cost-effective path, as it allows the same workload to be run
-on smaller clusters, for reduced periods of time.
+to be processed without needing larger resources, or in constrained embedded
+hardware environments. On large-scale applications, which may run for an
+extended period of time, or may be run on whole clusters, optimization is a
+cost-effective path, as it allows the same workload to be run on smaller
+clusters, for reduced periods of time.
 
 The most significant optimisation gains come from ``high-level'' algorithmic
 changes, such as computing on multiple cores instead of sequentially, caching
@@ -22,12 +23,14 @@ row-major or column-major:
 
 \vspace{1em}
 
+\definecolor{rowmajor_row}{HTML}{1b5898}
+\definecolor{rowmajor_col}{HTML}{9c6210}
 \begin{minipage}[c]{0.48\linewidth}
 \begin{algorithmic}
 \State{sum $\gets 0$}
-\For{row $<$ MAX\_ROW}
-\For{column $<$ MAX\_COLUMN}
-\State{sum $\gets$ sum $+ \text{matrix}[\text{row}][\text{col}]$}
+\For{{\color{rowmajor_row}row} $<$ MAX\_ROW}
+\For{{\color{rowmajor_col}col} $<$ MAX\_COLUMN}
+\State{sum $\gets$ sum $+~\text{matrix}[\text{\color{rowmajor_row}row}][\text{\color{rowmajor_col}col}]$}
 \EndFor
 \EndFor
 \end{algorithmic}
@@ -35,9 +38,9 @@ row-major or column-major:
 \begin{minipage}[c]{0.48\linewidth}
 \begin{algorithmic}
 \State{sum $\gets 0$}
-\For{column $<$ MAX\_COLUMN}
-\For{row $<$ MAX\_ROW}
-\State{sum $\gets$ sum $+ \text{matrix}[\text{row}][\text{col}]$}
+\For{{\color{rowmajor_col}col} $<$ MAX\_COLUMN}
+\For{{\color{rowmajor_row}row} $<$ MAX\_ROW}
+\State{sum $\gets$ sum $+~\text{matrix}[\text{\color{rowmajor_row}row}][\text{\color{rowmajor_col}col}]$}
 \EndFor
 \EndFor
 \end{algorithmic}
@@ -50,3 +53,41 @@ iterates on rows first, or \textit{row-major}, while the right one iterates on
 columns first, or \textit{column-major}. The latter, on large matrices, will
 cause frequent cache misses, and was measured to run up to about six times
 slower than the former~\cite{rowmajor_repo}.
+
+This, however, is still an optimization that holds for the vast majority of
+CPUs. In many cases, transformations targeting a specific microarchitecture can
+be very beneficial. \qtodo{Insert number/ref \wrt{} matmult or some kernel of
+the like.} This kind of optimization, however, requires manual effort and
+deep expert knowledge of both optimization techniques and the specific
+architecture targeted.
+These techniques are only worth applying to the parts of a program that are
+executed most often, usually called the \emph{hottest} parts: loop bodies that
+are iterated enough times to be assumed infinite. Such loop bodies are called
+\emph{computation kernels}, with which this whole manuscript will be concerned.
+
+\medskip{}
+
+Developers are used to \emph{functional debugging}, the practice of tracking
+the root cause of unexpectedly bad functional behaviour. Akin to it is
+\emph{performance debugging}, the practice of tracking the root cause of
+performance below expectations. Just as functional debugging can be carried out
+in a variety of ways, from guessing and inserting print instructions to
+sophisticated tools such as \gdb{}, performance debugging can be carried out with
+different tools. Crude timing measurements and profiling can point to a general
+part of the program or hint at an issue; reading \emph{hardware counters}
+(metrics reported by the CPU) can lead to a better understanding, and may
+confirm or invalidate a hypothesis. Other tools still, \emph{code analyzers},
+analyze the assembly code and, in the light of a built-in hardware model,
+strive to provide a performance analysis.
+
+An exact modelling of the processor would require a cycle-accurate simulator,
+reproducing the precise behaviour of the silicon, allowing one to observe any
+desired metric. Such a simulator, however, would be prohibitively slow, and is
+not available on most architectures anyway, as processors are not usually open
+hardware and manufacturers regard their implementation as an industrial
+secret. Code analyzers thus resort to approximate, higher-level models of
+varied kinds. Tools based on such models, as opposed to measurements or hardware
+counter sampling, may not always be precise and faithful. They can, however,
+inspect their inner model state at will, and derive more advanced metrics or
+hypotheses, for instance by predicting which resource might be overloaded and
+slow the whole computation.
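The two traversal orders compared in the pseudocode of this commit, together with the "crude timing measurements" mentioned in the added text, could be sketched as follows. This is a minimal illustration and not the benchmark behind \cite{rowmajor_repo}: the names `make_matrix`, `sum_row_major`, `sum_column_major` and `crude_timing` are hypothetical, and the matrix is assumed to be a non-empty list of rows. In pure Python, interpreter overhead is likely to dominate, so the roughly six-fold slowdown quoted in the text should not be expected to show up here; a compiled language operating on a contiguous array would be a more faithful setting.

```python
import time


def make_matrix(n_rows, n_cols):
    """Build a matrix as a list of rows; cell (r, c) holds r * n_cols + c."""
    return [[r * n_cols + c for c in range(n_cols)] for r in range(n_rows)]


def sum_row_major(matrix):
    """Outer loop on rows: consecutive accesses stay within the same row."""
    total = 0
    for row in range(len(matrix)):
        for col in range(len(matrix[0])):
            total += matrix[row][col]
    return total


def sum_column_major(matrix):
    """Outer loop on columns: consecutive accesses jump from one row to the next."""
    total = 0
    for col in range(len(matrix[0])):
        for row in range(len(matrix)):
            total += matrix[row][col]
    return total


def crude_timing(func, matrix, repeats=3):
    """Return the best wall-clock time over a few runs (a crude measure, not a profiler)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(matrix)
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    matrix = make_matrix(500, 500)
    print("row-major:   ", crude_timing(sum_row_major, matrix))
    print("column-major:", crude_timing(sum_column_major, matrix))
```

Both functions compute the same sum; only the order of memory accesses differs, which is the kind of high-level transformation the text describes. Timings such as these can only point to a general part of the program, whereas hardware counters or a code analyzer would be needed to attribute a slowdown to cache misses.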