From 6a24e7a4c587ba79a61d04785ad42f2889918a79 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Th=C3=A9ophile=20Bastian?= <contact@tobast.fr>
Date: Wed, 27 Dec 2023 20:14:44 +0100
Subject: [PATCH] Foundations: start writeup on code analyzers

---
 manuscrit/20_foundations/10_cpu_arch.tex      |  2 +-
 .../20_foundations/20_code_analyzers.tex      | 90 +++++++++++++++++++
 manuscrit/20_foundations/main.tex             |  1 +
 manuscrit/biblio/tools.bib                    | 37 ++++++++
 4 files changed, 129 insertions(+), 1 deletion(-)
 create mode 100644 manuscrit/20_foundations/20_code_analyzers.tex

diff --git a/manuscrit/20_foundations/10_cpu_arch.tex b/manuscrit/20_foundations/10_cpu_arch.tex
index b0a6cea..1d64a45 100644
--- a/manuscrit/20_foundations/10_cpu_arch.tex
+++ b/manuscrit/20_foundations/10_cpu_arch.tex
@@ -214,7 +214,7 @@ It is also important to note that out-of-order processors are only out-of-order
 \emph{from a certain point on}: a substantial part of the processor's frontend
 is typically still in-order.
 
-\subsubsection{Hardware counters}
+\subsubsection{Hardware counters}\label{sssec:hw_counters}
 
 Many processors provide \emph{hardware counters}, to help (low-level)
 programmers understand how their code is executed. The counters available
diff --git a/manuscrit/20_foundations/20_code_analyzers.tex b/manuscrit/20_foundations/20_code_analyzers.tex
new file mode 100644
index 0000000..ca6a3dd
--- /dev/null
+++ b/manuscrit/20_foundations/20_code_analyzers.tex
@@ -0,0 +1,90 @@
+\section{Kernel optimization and code analyzers}
+
+Optimizing a program, in most contexts, mainly means optimizing it from an
+algorithmic point of view ---~using efficient data structures, running some
+computations in parallel on multiple cores, etc. As pointed out in our
+introduction, though, optimizations close to the machine's microarchitecture
+can yield large efficiency benefits, sometimes up to two orders of
+magnitude~\cite{dgemm_finetune}. These optimizations, however, are difficult to
+carry out, for multiple reasons: they depend on the specific machine on which
+the code is run; they require deep expert knowledge; and they are most often
+applied manually, requiring expert time ---~and thus making them expensive.
+
+Such optimizations are, however, routinely used in some domains. Scientific
+computation ---~ocean simulation, weather forecasting, \ldots{}~--- often
+relies on a common set of basic operations, implemented by low-level libraries
+optimized in this way, such as OpenBLAS~\cite{openblas_webpage, openblas_2013}
+or Intel's MKL~\cite{intel_mkl}, which provide low-level mathematical
+operations such as linear algebra. Machine learning applications, on the other
+hand, are typically trained for extensive periods of time, on many cores and
+accelerators, on well-defined hardware, with small portions of code executed
+many times on different data; as such, they are very well suited to such
+specific, low-level optimizations.
+
+\medskip{}
+
+When optimizing these short fragments of code whose efficiency is critical,
+or \emph{computation kernels}, insights into what limits the code's
+performance, or \emph{performance bottlenecks}, are precious to the expert.
+Such insights can be gained by reading the processor's hardware counters,
+described above in \autoref{sssec:hw_counters}, which are typically accurate
+but of limited versatility. Specialized profilers, such as Intel's
+VTune~\cite{intel_vtune}, integrate these counters with profiling to derive
+further performance metrics at runtime.
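+
+As an illustration of how such counters can be read at runtime, the following
+minimal C sketch ---~Linux-specific, modelled on the canonical usage of the
+\texttt{perf\_event\_open(2)} system call, with a purely hypothetical
+computation kernel~--- counts the core cycles spent in a code fragment:
+
+\begin{verbatim}
+#define _GNU_SOURCE
+#include <stdint.h>
+#include <stdio.h>
+#include <string.h>
+#include <unistd.h>
+#include <sys/ioctl.h>
+#include <sys/syscall.h>
+#include <linux/perf_event.h>
+
+/* Hypothetical kernel: scale a small, L1-resident array. */
+static double data[1024];
+static void compute_kernel(void) {
+    for (int i = 0; i < 1024; i++)
+        data[i] *= 1.01;
+}
+
+/* Open a hardware counter for CPU cycles on the calling thread. */
+static int open_cycles_counter(void) {
+    struct perf_event_attr attr;
+    memset(&attr, 0, sizeof(attr));
+    attr.type = PERF_TYPE_HARDWARE;
+    attr.size = sizeof(attr);
+    attr.config = PERF_COUNT_HW_CPU_CYCLES;
+    attr.disabled = 1;
+    attr.exclude_kernel = 1;
+    attr.exclude_hv = 1;
+    /* pid = 0: this thread; cpu = -1: any CPU. */
+    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
+}
+
+int main(void) {
+    int fd = open_cycles_counter();
+    if (fd < 0) { perror("perf_event_open"); return 1; }
+
+    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
+    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
+    compute_kernel();
+    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
+
+    uint64_t cycles;
+    read(fd, &cycles, sizeof(cycles));
+    printf("%llu cycles\n", (unsigned long long)cycles);
+    close(fd);
+    return 0;
+}
+\end{verbatim}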
+
+\subsection{Code analyzers}
+
+Another approach is to rely on \emph{code analyzers}: pieces of software that
+analyze a code fragment ---~typically at the assembly or binary level~--- and
+provide insights into its performance on a given piece of hardware. Code
+analyzers typically work statically, that is, without executing the code.
+
+\paragraph{Common hypotheses.} Code analyzers operate under a common set of
+hypotheses, derived from their typical intended usage.
+
+The kernel analyzed is expected to be the body of a loop, or nest of loops,
+iterated enough times to be well approximated by an infinite loop. The kernel
+is further analyzed under the assumption that it is in \emph{steady-state};
+startup or border effects occurring in extremal cases are thus ignored. As the
+kernels analyzed are those worth optimizing manually, it is reasonable to
+assume that they are executed many times, and to focus on their steady-state.
+
+The kernel is further assumed to be \emph{L1-resident}, that is, to work only
+on data that resides in the L1 cache. This assumption is reasonable in two
+ways. First, if data must be fetched from more distant cache levels, or even
+from main memory, these fetch operations will be orders of magnitude slower
+than the computation being analyzed, making it pointless to optimize this
+kernel for CPU efficiency ---~the expert should, in this case, focus instead
+on data locality, prefetching, etc. Second, code analyzers typically focus on
+the CPU itself and ignore memory effects; this hypothesis formalizes that
+focus. Code analyzers' metrics are thus to be read \emph{assuming the CPU is
+the bottleneck}.
+
+Code analyzers also disregard control flow, and thus assume the code to be
+\emph{straight-line code}: the kernel analyzed is treated as a sequence of
+instructions without influence on the control flow, executed in order, and
+jumping unconditionally back to the first instruction after the last ---~or,
+more accurately, the final jump is always assumed taken, while any control
+flow instruction in the middle is assumed not taken, its computational cost
+being nevertheless accounted for.
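+
+To make these hypotheses concrete, consider the following C kernel ---~a
+purely illustrative sketch, not taken from any real benchmark. Once compiled,
+the body of its loop is what a code analyzer would study: a small, L1-resident
+working set, iterated enough times to reach steady-state, and treated as
+straight-line code, the final backward jump being assumed taken and the branch
+compiled from the \texttt{if} assumed not taken.
+
+\begin{verbatim}
+#include <stdio.h>
+
+/* Illustrative kernel: clamp-and-accumulate over a small array.
+ * The array (256 doubles = 2 KiB) easily fits in the L1 cache, and
+ * the compiled loop body is what a code analyzer studies. */
+static double clamp_sum(const double *a, int n, double bound) {
+    double sum = 0.0;
+    for (int i = 0; i < n; i++) {
+        double x = a[i];
+        if (x > bound)  /* compiles to an in-body branch: assumed not taken */
+            x = bound;
+        sum += x;
+    }   /* the loop's backward jump is assumed always taken */
+    return sum;
+}
+
+int main(void) {
+    double a[256];
+    for (int i = 0; i < 256; i++)
+        a[i] = (double)i / 256.0;
+    printf("%f\n", clamp_sum(a, 256, 0.5));
+    return 0;
+}
+\end{verbatim}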
+
+\paragraph{Metrics produced.} The insights that code analyzers provide as
+output vary from one analyzer to the next. All of them are able to predict
+either the throughput or the reciprocal throughput ---~defined below~--- of
+the kernel studied, that is, how many cycles one iteration of the loop takes,
+on average and in steady-state. Although throughput can also be measured at
+runtime with hardware counters, a static estimation ---~if reliable~--- is
+already an improvement, as a static analyzer is typically faster than running
+the actual program under a profiler.
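+
+As a simple restatement of this convention ---~anticipating the formal
+definition below~---, if running $N \gg 1$ iterations of the kernel in
+steady-state takes $C$ cycles, then
+\[
+    \textrm{reciprocal throughput} = \frac{C}{N}
+    \textrm{~cycles per iteration,}
+    \qquad
+    \textrm{throughput} = \frac{N}{C}
+    \textrm{~iterations per cycle.}
+\]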
+
+Each code analyzer relies on a model, or a collection of models, of the
+hardware for which it provides analyses. Depending on what is, or is not,
+modelled by a specific code analyzer, it may further extract any available and
+relevant metric from its model: whether the frontend is saturated, which
+computation units of the backend are stressed and by which precise
+instructions, when the CPU stalls and why, etc. Code analyzers may further
+point towards the resources that limit the kernel's performance, or
+\emph{bottlenecks}.
+
+
+\paragraph{Static vs.\ dynamic analyzers.}
diff --git a/manuscrit/20_foundations/main.tex b/manuscrit/20_foundations/main.tex
index f732f0e..63b3a0d 100644
--- a/manuscrit/20_foundations/main.tex
+++ b/manuscrit/20_foundations/main.tex
@@ -2,3 +2,4 @@
 
 \input{00_intro.tex}
 \input{10_cpu_arch.tex}
+\input{20_code_analyzers.tex}
diff --git a/manuscrit/biblio/tools.bib b/manuscrit/biblio/tools.bib
index a62f603..af8ecc5 100644
--- a/manuscrit/biblio/tools.bib
+++ b/manuscrit/biblio/tools.bib
@@ -91,3 +91,40 @@
 	howpublished={\url{https://www.qemu.org}}
 }
 
+% OpenBLAS
+@inproceedings{openblas_2013,
+	author = {Wang, Qian and Zhang, Xianyi and Zhang, Yunquan and Yi, Qing},
+	title = {AUGEM: Automatically Generate High Performance Dense Linear Algebra Kernels on X86 CPUs},
+	year = {2013},
+	isbn = {9781450323789},
+	publisher = {Association for Computing Machinery},
+	address = {New York, NY, USA},
+	url = {https://doi.org/10.1145/2503210.2503219},
+	doi = {10.1145/2503210.2503219},
+	abstract = {Basic Liner algebra subprograms (BLAS) is a fundamental library in scientific computing. In this paper, we present a template-based optimization framework, AUGEM, which can automatically generate fully optimized assembly code for several dense linear algebra (DLA) kernels, such as GEMM, GEMV, AXPY and DOT, on varying multi-core CPUs without requiring any manual interference from developers. In particular, based on domain-specific knowledge about algorithms of the DLA kernels, we use a collection of parameterized code templates to formulate a number of commonly occurring instruction sequences within the optimized low-level C code of these DLA kernels. Then, our framework uses a specialized low-level C optimizer to identify instruction sequences that match the pre-defined code templates and thereby translates them into extremely efficient SSE/AVX instructions. The DLA kernels generated by our template-based approach surpass the implementations of Intel MKL and AMD ACML BLAS libraries, on both Intel Sandy Bridge and AMD Piledriver processors.},
+	booktitle = {Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis},
+	articleno = {25},
+	numpages = {12},
+	keywords = {auto-tuning, code generation, DLA code optimization},
+	location = {Denver, Colorado},
+	series = {SC '13}
+}
+
+@misc{openblas_webpage,
+	title={{OpenBLAS}: an optimized {BLAS} library},
+	author={Zhang, Xianyi},
+	howpublished={\url{https://www.openblas.net}}
+}
+
+@misc{intel_mkl,
+	title={oneAPI Math Kernel Library ({oneMKL})},
+	author={{Intel}},
+	howpublished={\url{https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl.html}},
+	year={2003},
+}
+
+@misc{intel_vtune,
+	title={{VTune} profiler},
+	author={{Intel}},
+	howpublished={\url{https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler.html}},
+}