Intro: progress
parts of the development cycle of a program. However, performance optimization
might be just as crucial for compute-intensive software. On small-scale
applications, it improves usability by reducing, or even hiding, the waiting
time the user must endure between operations, or by allowing heavier workloads
to be processed without needing larger resources, or in constrained embedded
hardware environments. On large-scale applications that may run for an
extended period of time, or may be run on whole clusters, optimization is a
cost-effective path, as it allows the same workload to be run on smaller
clusters, for reduced periods of time.

The most significant optimization gains come from ``high-level'' algorithmic
changes, such as computing on multiple cores instead of sequentially, caching

row-major or column-major:

\vspace{1em}

\definecolor{rowmajor_row}{HTML}{1b5898}
\definecolor{rowmajor_col}{HTML}{9c6210}
\begin{minipage}[c]{0.48\linewidth}
  \begin{algorithmic}
    \State{sum $\gets 0$}
    \For{{\color{rowmajor_row}row} $<$ MAX\_ROW}
      \For{{\color{rowmajor_col}col} $<$ MAX\_COLUMN}
        \State{sum $\gets$ sum $+~\text{matrix}[\text{\color{rowmajor_row}row}][\text{\color{rowmajor_col}col}]$}
      \EndFor
    \EndFor
  \end{algorithmic}
\end{minipage}\hfill
\begin{minipage}[c]{0.48\linewidth}
  \begin{algorithmic}
    \State{sum $\gets 0$}
    \For{{\color{rowmajor_col}col} $<$ MAX\_COLUMN}
      \For{{\color{rowmajor_row}row} $<$ MAX\_ROW}
        \State{sum $\gets$ sum $+~\text{matrix}[\text{\color{rowmajor_row}row}][\text{\color{rowmajor_col}col}]$}
      \EndFor
    \EndFor
  \end{algorithmic}
\end{minipage}
|
Both snippets above compute the sum of every element of a matrix. The left one
iterates on rows first, or \textit{row-major}, while the right one iterates on
columns first, or \textit{column-major}. The latter, on large matrices, will
cause frequent cache misses, and was measured to run up to about six times
slower than the former~\cite{rowmajor_repo}.
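
In concrete terms, the two traversals could be written in C as follows (a
minimal sketch, assuming the matrix is an ordinary two-dimensional C array,
which the language lays out in row-major order; the names and sizes are purely
illustrative):

\begin{verbatim}
#include <stddef.h>

#define MAX_ROW    4096
#define MAX_COLUMN 4096

/* Row-major traversal: consecutive iterations touch consecutive
 * addresses, so most accesses hit in the cache. */
long sum_row_major(const int matrix[MAX_ROW][MAX_COLUMN]) {
    long sum = 0;
    for (size_t row = 0; row < MAX_ROW; row++)
        for (size_t col = 0; col < MAX_COLUMN; col++)
            sum += matrix[row][col];
    return sum;
}

/* Column-major traversal: successive accesses are a whole row apart,
 * evicting cache lines long before they are reused. */
long sum_column_major(const int matrix[MAX_ROW][MAX_COLUMN]) {
    long sum = 0;
    for (size_t col = 0; col < MAX_COLUMN; col++)
        for (size_t row = 0; row < MAX_ROW; row++)
            sum += matrix[row][col];
    return sum;
}
\end{verbatim}
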
This, however, is still an optimization that holds for the vast majority of
CPUs. In many cases, transformations targeting a specific microarchitecture
can be very beneficial. \qtodo{Insert number/ref \wrt{} matmult or some kernel
of the like.} This kind of optimization, however, requires manual effort, and
deep expert knowledge both of optimization techniques and of the specific
architecture targeted.
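
As a purely illustrative sketch (not one of the measured examples of this
manuscript), the following C code unrolls a reduction over four independent
accumulators, breaking the dependency chain so that several additions can be
in flight at once; the profitable unrolling factor depends on the latency and
on the number of floating-point units of the target microarchitecture, so the
transformation only pays off where those parameters match:

\begin{verbatim}
#include <stddef.h>

/* Baseline: a single accumulator chains every addition after the
 * previous one, so throughput is bounded by the latency of one
 * floating-point add. */
float sum_naive(const float *data, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += data[i];
    return sum;
}

/* Four independent accumulators hide the add latency.  Note that this
 * reassociates floating-point additions, which may change rounding. */
float sum_unrolled(const float *data, size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += data[i];
        s1 += data[i + 1];
        s2 += data[i + 2];
        s3 += data[i + 3];
    }
    for (; i < n; i++)      /* leftover elements */
        s0 += data[i];
    return (s0 + s1) + (s2 + s3);
}
\end{verbatim}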
|
These techniques are only worth applying to the parts of a program that are
executed the most ---~usually called the \emph{hottest} parts~---: loop bodies
that are iterated enough times to be assumed infinite. Such loop bodies are
called \emph{computation kernels}, with which this whole manuscript will be
concerned.

\medskip{}

Developers are used to \emph{functional debugging}, the practice of tracking
down the root cause of an unexpected, incorrect functional behaviour. Akin to
it is \emph{performance debugging}, the practice of tracking down the root
cause of performance below expectations. Just as functional debugging can be
carried out in a variety of ways, from guessing and inserting print
instructions to sophisticated tools such as \gdb{}, performance debugging can
be carried out with different tools. Crude timing measurements and profiling
can point to a general part of the program or hint at an issue; reading
\emph{hardware counters} ---~metrics reported by the CPU~--- can lead to a
better understanding, and may confirm or invalidate a hypothesis. Other tools
still, \emph{code analyzers}, analyze the assembly code and, in the light of a
built-in hardware model, strive to provide a performance analysis.
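
For instance, on Linux, hardware counters can be read programmatically through
the \texttt{perf\_event\_open} system call ---~the same counters that the
\texttt{perf} profiler reports. The sketch below is a minimal, illustrative
example assuming a Linux machine; error handling is mostly omitted and the
measured loop merely stands in for a real kernel:

\begin{verbatim}
#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Thin wrapper: glibc does not expose perf_event_open() directly. */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags) {
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CACHE_MISSES;   /* count cache misses */
    attr.disabled = 1;                          /* enabled explicitly below */
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    int fd = perf_event_open(&attr, 0, -1, -1, 0); /* this process, any CPU */
    if (fd == -1) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* Region of interest: a stand-in for the kernel under study.
     * (volatile keeps the compiler from removing the reads) */
    static volatile int data[1 << 20];
    long sum = 0;
    for (size_t i = 0; i < sizeof(data) / sizeof(data[0]); i++)
        sum += data[i];

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t misses = 0;
    if (read(fd, &misses, sizeof(misses)) < 0) perror("read");
    printf("sum = %ld, cache misses = %llu\n",
           sum, (unsigned long long)misses);

    close(fd);
    return 0;
}
\end{verbatim}

Counting cache misses this way around the row-major and column-major
traversals shown earlier would typically expose their difference in cache
behaviour directly, without any timing measurement.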
|
An exact modelling of the processor would require a cycle-accurate simulator,
reproducing the precise behaviour of the silicon and allowing one to observe
any desired metric. Such a simulator, however, would be prohibitively slow,
and is not available for most architectures anyway, as processors are not
usually open hardware and manufacturers regard their implementation as an
industrial secret. Code analyzers thus resort to approximated, higher-level
models of varied kinds. Tools based on such models, as opposed to measurements
or hardware-counter sampling, may not always be precise and faithful. They
can, however, inspect their inner model's state at will, and derive more
advanced metrics or hypotheses, for instance by predicting which resource
might be overloaded and slow down the whole computation.