Intro: progress

Théophile Bastian 2023-10-08 22:32:04 +02:00
parent 3285743cd4
commit 0d7bf6a974

parts of the development cycle of a program. However, performance optimization
might be just as crucial for compute-intensive software. On small-scale
applications, it improves usability by reducing, or even hiding, the waiting
time the user must endure between operations, or by allowing heavier workloads
to be processed without needing larger resources, or to run in constrained
embedded hardware environments. On large-scale applications that may run for an
extended period of time, or that run on whole clusters, optimization is a
cost-effective path, as it allows the same workload to be run on smaller
clusters, for reduced periods of time.

The most significant optimization gains come from ``high-level'' algorithmic
changes, such as computing on multiple cores instead of sequentially, caching
row-major or column-major:
\vspace{1em}
\definecolor{rowmajor_row}{HTML}{1b5898}
\definecolor{rowmajor_col}{HTML}{9c6210}
\begin{minipage}[c]{0.48\linewidth}
\begin{algorithmic}
\State{sum $\gets 0$}
\For{{\color{rowmajor_row}row} $<$ MAX\_ROW}
\For{{\color{rowmajor_col}col} $<$ MAX\_COLUMN}
\State{sum $\gets$ sum $+~\text{matrix}[\text{\color{rowmajor_row}row}][\text{\color{rowmajor_col}col}]$}
\EndFor
\EndFor
\end{algorithmic}
\end{minipage}
\begin{minipage}[c]{0.48\linewidth}
\begin{algorithmic}
\State{sum $\gets 0$}
\For{{\color{rowmajor_col}col} $<$ MAX\_COLUMN}
\For{{\color{rowmajor_row}row} $<$ MAX\_ROW}
\State{sum $\gets$ sum $+~\text{matrix}[\text{\color{rowmajor_row}row}][\text{\color{rowmajor_col}col}]$}
\EndFor
\EndFor
\end{algorithmic}
\end{minipage}

iterates on rows first, or \textit{row-major}, while the right one iterates on
columns first, or \textit{column-major}. The latter, on large matrices, will
cause frequent cache misses, and was measured to run up to about six times
slower than the former~\cite{rowmajor_repo}.

This, however, is still an optimization that holds for the vast majority of
CPUs. In many cases, transformations targeting a specific microarchitecture can
be very beneficial. \qtodo{Insert number/ref \wrt{} matmult or some kernel of
the like.} This kind of optimization, however, requires manual effort, and a
deep expert knowledge both of optimization techniques and of the specific
architecture targeted.

These techniques are only worth applying to the parts of a program that are
executed most ---~usually called the \emph{hottest} parts~---: loop bodies that
are iterated enough times to be assumed infinite. Such loop bodies are called
\emph{computation kernels}, with which this whole manuscript will be concerned.
\medskip{}

Developers are used to \emph{functional debugging}, the practice of tracking
the root cause of an unexpected bad functional behaviour. Akin to it is
\emph{performance debugging}, the practice of tracking the root cause of a
performance below expectations. Just as functional debugging can be carried out
in a variety of ways, from guessing and inserting print instructions to
sophisticated tools such as \gdb{}, performance debugging can be carried out
with different tools. Crude timing measurements and profiling can point to a
general part of the program or hint at an issue; reading \emph{hardware
counters} ---~metrics reported by the CPU~--- can lead to a better
understanding, and may confirm or invalidate a hypothesis. Yet other tools,
\emph{code analyzers}, analyze the assembly code and, in the light of a
built-in hardware model, strive to provide a performance analysis.

An exact modelling of the processor would require a cycle-accurate simulator,
reproducing the precise behaviour of the silicon and allowing one to observe
any desired metric. Such a simulator, however, would be prohibitively slow, and
is not available for most architectures anyway, as processors are not usually
open hardware and manufacturers regard their implementation as an industrial
secret. Code analyzers thus resort to approximated, higher-level models of
varied kinds. Tools based on such models, as opposed to measurements or
hardware-counter sampling, may not always be precise and faithful. They can,
however, inspect their inner model state at will, and derive more advanced
metrics or hypotheses, for instance by predicting which resource might be
overloaded and slow down the whole computation.