diff --git a/manuscrit/10_introduction/main.tex b/manuscrit/10_introduction/main.tex
index a2e4e35..1a5c205 100644
--- a/manuscrit/10_introduction/main.tex
+++ b/manuscrit/10_introduction/main.tex
@@ -5,10 +5,11 @@
 parts of the development cycle of a program. However, performance optimization
 might be just as crucial for compute-intensive software. On small-scale
 applications, it improves usability by reducing, or even hiding, the waiting
 time the user must endure between operations, or by allowing heavier workloads
-to be processed without needing larger resources. On large-scale applications,
-that may run for an extended period of time, or may be run on whole clusters,
-optimization is a cost-effective path, as it allows the same workload to be run
-on smaller clusters, for reduced periods of time.
+to be processed without needing larger resources, or to fit within constrained
+embedded hardware environments. On large-scale applications that may run for
+extended periods of time, or across whole clusters, optimization is a
+cost-effective path, as it allows the same workload to run on smaller
+clusters, for shorter periods of time.
 The most significant optimisation gains come from ``high-level'' algorithmic
 changes, such as computing on multiple cores instead of sequentially, caching
@@ -22,12 +23,14 @@
 row-major or column-major:
 
 \vspace{1em}
+\definecolor{rowmajor_row}{HTML}{1b5898}
+\definecolor{rowmajor_col}{HTML}{9c6210}
 \begin{minipage}[c]{0.48\linewidth}
 \begin{algorithmic}
 \State{sum $\gets 0$}
-\For{row $<$ MAX\_ROW}
-\For{column $<$ MAX\_COLUMN}
-\State{sum $\gets$ sum $+ \text{matrix}[\text{row}][\text{col}]$}
+\For{{\color{rowmajor_row}row} $<$ MAX\_ROW}
+\For{{\color{rowmajor_col}col} $<$ MAX\_COLUMN}
+\State{sum $\gets$ sum $+~\text{matrix}[\text{\color{rowmajor_row}row}][\text{\color{rowmajor_col}col}]$}
 \EndFor
 \EndFor
 \end{algorithmic}
@@ -35,9 +38,9 @@
 \begin{minipage}[c]{0.48\linewidth}
 \begin{algorithmic}
 \State{sum $\gets 0$}
-\For{column $<$ MAX\_COLUMN}
-\For{row $<$ MAX\_ROW}
-\State{sum $\gets$ sum $+ \text{matrix}[\text{row}][\text{col}]$}
+\For{{\color{rowmajor_col}col} $<$ MAX\_COLUMN}
+\For{{\color{rowmajor_row}row} $<$ MAX\_ROW}
+\State{sum $\gets$ sum $+~\text{matrix}[\text{\color{rowmajor_row}row}][\text{\color{rowmajor_col}col}]$}
 \EndFor
 \EndFor
 \end{algorithmic}
@@ -50,3 +53,41 @@
 iterates on rows first, or \textit{row-major}, while the right one iterates on
 columns first, or \textit{column-major}. The latter, on large matrices, will
 cause frequent cache misses, and was measured to run up to about six times
 slower than the former~\cite{rowmajor_repo}.
+
+This, however, is still an optimization that holds for the vast majority of
+CPUs. In many cases, transformations targeting a specific microarchitecture
+can be very beneficial. \qtodo{Insert number/ref \wrt{} matmult or some kernel
+of the like.} This kind of optimization, however, requires manual effort, and
+deep expert knowledge both of optimization techniques and of the specific
+target architecture.
+These techniques are only worth applying to the parts of a program that are
+executed the most ---~usually called the \emph{hottest} parts~---: loop
+bodies that are iterated enough times to be assumed infinite. Such loop
+bodies are called \emph{computation kernels}, with which this whole
+manuscript will be concerned.
+
+\medskip{}
+
+Developers are used to \emph{functional debugging}, the practice of tracking
+down the root cause of an unexpected, incorrect functional behaviour. Akin to
+it is \emph{performance debugging}, the practice of tracking down the root
+cause of a performance below expectations. Just as functional debugging can
+be carried out in a variety of ways, from guessing and inserting print
+instructions to sophisticated tools such as \gdb{}, performance debugging can
+be carried out with different tools. Crude timing measurements and profiling
+can point to a general part of the program or hint at an issue; reading
+\emph{hardware counters} ---~metrics reported by the CPU~--- can lead to a
+better understanding, and may confirm or invalidate a hypothesis. Still other
+tools, \emph{code analyzers}, analyze the assembly code and, in the light of
+a built-in hardware model, strive to provide a performance analysis.
+
+An exact model of the processor would require a cycle-accurate simulator,
+reproducing the precise behaviour of the silicon and allowing one to observe
+any desired metric. Such a simulator, however, would be prohibitively slow,
+and is not available on most architectures anyway, as processors are not
+usually open hardware and manufacturers regard their implementation as an
+industrial secret. Code analyzers thus resort to approximate, higher-level
+models of various kinds. Tools based on such models, as opposed to
+measurements or hardware counter sampling, may not always be precise and
+faithful.
+They can, however, inspect their inner model state at will, and derive more
+advanced metrics or hypotheses, for instance by predicting which resource
+might be overloaded and slow down the whole computation.