Foundations: start writeup on code analyzers

This commit is contained in:
Théophile Bastian 2023-12-27 20:14:44 +01:00
parent 1b3607b18c
commit 6a24e7a4c5
4 changed files with 129 additions and 1 deletions

View file

@ -214,7 +214,7 @@ It is also important to note that out-of-order processors are only out-of-order
\emph{from a certain point on}: a substantial part of the processor's frontend
is typically still in-order.
\subsubsection{Hardware counters}
\subsubsection{Hardware counters}\label{sssec:hw_counters}
Many processors provide \emph{hardware counters}, to help (low-level)
programmers understand how their code is executed. The counters available

View file

@ -0,0 +1,90 @@
\section{Kernel optimization and code analyzers}
Optimizing a program, in most contexts, mainly means optimizing it from an
algorithmic point of view ---~using efficient data structures, running some
computations in parallel on multiple cores, etc. As pointed out in our
introduction, though, optimizations close to the machine's microarchitecture
can yield large efficiency benefits, sometimes up to two orders of
magnitude~\cite{dgemm_finetune}. These optimizations, however, are difficult to
carry out for multiple reasons: they depend on the specific machine on which the
code is run; they require deep expert knowledge; they are most often manual,
requiring expert time ---~and thus making them expensive.
Such optimizations are, however, routinely used in some domains. Scientific
computation ---~ocean simulation, weather forecasting, \ldots{}~--- often
relies on the same low-level operations, such as linear algebra, implemented by
hand-optimized libraries such as OpenBLAS~\cite{openblas_webpage,
openblas_2013} or Intel's MKL~\cite{intel_mkl}. Machine learning applications,
on the other hand, are typically trained for extensive periods of time, on many
cores and accelerators, on well-defined hardware, with small portions of code
being executed many times on different data; as such, they are very well suited
for this kind of specific, low-level optimization.
\medskip{}
When optimizing those short fragments of code whose efficiency is critical, or
\emph{computation kernels}, insights on what limits the code's performance, or
\emph{performance bottlenecks}, are precious to the expert. These insights can
be gained by reading the processor's hardware counters, described above in
\autoref{sssec:hw_counters}, which are typically accurate but of limited
versatility.
Specialized profilers, such as Intel's VTune~\cite{intel_vtune}, integrate
these counters with profiling to derive further performance metrics at runtime.
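As an illustration, consider the following simplified computation kernel,
computing $y \gets a \cdot x + y$ (SAXPY); it is a minimal sketch of ours, not
taken from any particular library, and only the loop body matters to the
analysis.
\begin{verbatim}
/* Illustrative computation kernel: y <- a*x + y (SAXPY).
 * The body of this loop is the "kernel"; its compiled form is
 * what an expert would inspect and try to optimize. */
void saxpy(float a, const float *x, float *y, unsigned n)
{
    for (unsigned i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
\end{verbatim}
On Linux, a first, coarse-grained view of such a kernel's behaviour can for
instance be obtained from the hardware counters with \texttt{perf stat -e
cycles,instructions ./program}, which reports among others the number of cycles
elapsed and instructions retired over a run.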
\subsection{Code analyzers}
Another approach is to rely on \emph{code analyzers}, pieces of software that
analyze a code fragment ---~typically at the assembly or binary level~--- and
provide insights on its expected performance on a given processor. Code analyzers
thus work statically, that is, without executing the code.
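For instance, LLVM's \texttt{llvm-mca} accepts an assembly fragment and
reports, without ever executing it, an estimated number of cycles per loop
iteration together with the pressure exerted on each execution port; a kernel
written in C might be analyzed with something along the lines of \texttt{clang
-O2 -S -o - kernel.c | llvm-mca -mcpu=skylake} (the exact invocation and output
format depend, of course, on the tool and its version).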
\paragraph{Common hypotheses.} Code analyzers operate under a set of common
hypotheses, derived from their typical intended usage.
The kernel analyzed is expected to be the body of a loop, or nest of loops,
iterated enough times to be well approximated by an infinite loop. The kernel
is further analyzed under the assumption that it is in \emph{steady-state}:
startup and border effects occurring in extremal cases are ignored. As the
kernels analyzed are those worth optimizing manually, it is reasonable to
assume that they will be executed many times, and to focus on their
steady-state behaviour.
The kernel is further assumed to be \emph{L1-resident}, that is, to work only
on data that resides in the L1 cache. This assumption is reasonable in two
ways. First, if data must be fetched from farther caches, or even the main
memory, these fetch operations will be multiple orders of magnitude slower than
the computation being analyzed, making it useless to optimize this kernel for
CPU efficiency ---~the expert should, in this case, focus instead on data
locality, prefetching, etc. Second, code analyzers typically focus only on the
CPU itself and ignore memory effects; this hypothesis formalizes that focus.
Code analyzers' metrics are thus to be regarded \textit{assuming the CPU is the
bottleneck}.
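In practice, this hypothesis can be checked on a real run using the hardware
counters from \autoref{sssec:hw_counters}: for instance, on Linux, \texttt{perf
stat -e L1-dcache-loads,L1-dcache-load-misses ./program} reports how many loads
missed the L1 data cache; a kernel with a negligible miss ratio reasonably fits
the L1-resident hypothesis.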
Code analyzers also disregard control flow, and thus assume the code to be
\emph{straight-line code}: the kernel analyzed is considered as a sequence of
instructions without influence on the control flow, executed in order, and
jumping unconditionally back to the first instruction after the last ---~or,
more accurately, the final jump is always assumed taken, while any control-flow
instruction in the middle is assumed not taken, although its computational cost
is still accounted for.
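As an illustration ---~the function below is ours, purely for exposition~---,
consider a kernel whose body contains an early exit:
\begin{verbatim}
/* Illustrative kernel containing a conditional branch in its body.
 * A code analyzer treats it as straight-line code: the early "break"
 * is assumed never taken (though its compare-and-branch instructions
 * are still accounted for), while the loop's own backward jump is
 * assumed always taken. */
float sum_until_negative(const float *x, unsigned n)
{
    float acc = 0.f;
    for (unsigned i = 0; i < n; i++) {
        if (x[i] < 0.f)   /* assumed not taken by the analyzer */
            break;
        acc += x[i];
    }
    return acc;
}
\end{verbatim}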
\paragraph{Metrics produced.} The insights provided as output vary from one
code analyzer to another. All of them are able to predict either the throughput
or the reciprocal throughput ---~defined below~--- of the kernel studied, that
is, how many cycles one iteration of the loop takes, on average and in
steady-state. Although throughput can already be measured at runtime with
hardware counters, a reliable static estimation is an improvement in itself, as
a static analyzer is typically much faster than running the actual program
under a profiler.
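For instance, a kernel for which the processor completes, in steady-state, two
iterations every three cycles has a reciprocal throughput of $3/2 = 1.5$ cycles
per iteration.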
Each code analyzer relies on a model, or a collection of models, of the
hardware for which it provides analyses. Depending on what is, or is not,
modelled by a specific code analyzer, it may further extract any available and
relevant metric from its model: whether the frontend is saturated, which
computation units of the backend are stressed and by which precise
instructions, when the CPU stalls and why, etc. Code analyzers may further
point towards the resources that limit the kernel's performance, or
\emph{bottlenecks}.
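As a toy illustration ---~the figures are made up for the sake of
exposition~---, if one iteration of a kernel contains four multiplications
while the modelled CPU can only start two multiplications per cycle, the
backend alone already forces a reciprocal throughput of at least $4/2 = 2$
cycles per iteration; if every other resource (frontend, other execution ports,
load units, \ldots{}) can sustain a higher rate, the analyzer will report the
multiplication units as the kernel's bottleneck.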
\paragraph{Static vs.\ dynamic analyzers.}

View file

@ -2,3 +2,4 @@
\input{00_intro.tex}
\input{10_cpu_arch.tex}
\input{20_code_analyzers.tex}

View file

@ -91,3 +91,40 @@
howpublished={\url{https://www.qemu.org}}
}
% OpenBLAS
@inproceedings{openblas_2013,
author = {Wang, Qian and Zhang, Xianyi and Zhang, Yunquan and Yi, Qing},
title = {AUGEM: Automatically Generate High Performance Dense Linear Algebra Kernels on X86 CPUs},
year = {2013},
isbn = {9781450323789},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/2503210.2503219},
doi = {10.1145/2503210.2503219},
abstract = {Basic Liner algebra subprograms (BLAS) is a fundamental library in scientific computing. In this paper, we present a template-based optimization framework, AUGEM, which can automatically generate fully optimized assembly code for several dense linear algebra (DLA) kernels, such as GEMM, GEMV, AXPY and DOT, on varying multi-core CPUs without requiring any manual interference from developers. In particular, based on domain-specific knowledge about algorithms of the DLA kernels, we use a collection of parameterized code templates to formulate a number of commonly occurring instruction sequences within the optimized low-level C code of these DLA kernels. Then, our framework uses a specialized low-level C optimizer to identify instruction sequences that match the pre-defined code templates and thereby translates them into extremely efficient SSE/AVX instructions. The DLA kernels generated by our template-based approach surpass the implementations of Intel MKL and AMD ACML BLAS libraries, on both Intel Sandy Bridge and AMD Piledriver processors.},
booktitle = {Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis},
articleno = {25},
numpages = {12},
keywords = {auto-tuning, code generation, DLA code optimization},
location = {Denver, Colorado},
series = {SC '13}
}
@misc{openblas_webpage,
title={{OpenBLAS}: an optimized {BLAS} library},
author={Zhang, Xianyi},
howpublished={\url{https://www.openblas.net}}
}
@misc{intel_mkl,
title={oneAPI Math Kernel Library ({oneMKL})},
author={{Intel}},
howpublished={\url{https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl.html}},
year=2003,
}
@misc{intel_vtune,
title={{VTune} profiler},
author={{Intel}},
howpublished={\url{https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler.html}},
}