Foundations: start writeup on code analyzers
parent 1b3607b18c
commit 6a24e7a4c5
4 changed files with 129 additions and 1 deletion
@@ -214,7 +214,7 @@ It is also important to note that out-of-order processors are only out-of-order
\emph{from a certain point on}: a substantial part of the processor's frontend
is typically still in-order.

\subsubsection{Hardware counters}
\subsubsection{Hardware counters}\label{sssec:hw_counters}

Many processors provide \emph{hardware counters}, to help (low-level)
programmers understand how their code is executed. The counters available

90 manuscrit/20_foundations/20_code_analyzers.tex Normal file
@@ -0,0 +1,90 @@
\section{Kernel optimization and code analyzers}

Optimizing a program, in most contexts, mainly means optimizing it from an
algorithmic point of view ---~using efficient data structures, running some
computations in parallel on multiple cores, etc. As pointed out in our
introduction, though, optimizations close to the machine's microarchitecture
can yield large efficiency benefits, sometimes up to two orders of
magnitude~\cite{dgemm_finetune}. These optimizations, however, are difficult
to carry out for multiple reasons: they depend on the specific machine on
which the code is run; they require deep expert knowledge; and they are most
often applied manually, requiring expert time ---~thus making them expensive.

Such optimizations are, however, routinely used in some domains. Scientific
computations ---~such as ocean simulation, weather forecasting, \ldots{}~---
often rely on the same operations, implemented by low-level libraries
optimized in this way, such as OpenBLAS~\cite{openblas_webpage, openblas_2013}
or Intel's MKL~\cite{intel_mkl}, which provide low-level math operations such
as linear algebra routines. Machine learning applications, on the other hand,
are typically trained for extended periods of time, on many cores and
accelerators, on well-defined hardware, with small portions of code being
executed many times on different data; as such, they are very well suited to
such specific, low-level optimizations.
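
Concretely, such a small portion of code is usually a short, heavily reused
loop. As a purely illustrative sketch ---~not taken from either library~---, a
BLAS-style \texttt{axpy} operation could look like:

\begin{verbatim}
#include <stddef.h>

/* Illustrative kernel (hypothetical, not from OpenBLAS or MKL):
 * computes y <- a*x + y over n doubles, the BLAS "axpy" operation.
 * The loop body is tiny and executed a huge number of times, which
 * is what makes machine-level tuning of such kernels worthwhile. */
void axpy(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
\end{verbatim}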

\medskip{}

When optimizing those short fragments of code whose efficiency is critical, or
\emph{computation kernels}, insights on what limits the code's performance, or
\emph{performance bottlenecks}, are precious to the expert. These insights can
be gained by reading the processor's hardware counters, described above in
\autoref{sssec:hw_counters}, which are typically accurate but of limited
versatility. Specialized profilers, such as Intel's
VTune~\cite{intel_vtune}, integrate these counters with profiling to derive
further performance metrics at runtime.
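
On Linux, for instance, such counters can be read programmatically through the
\texttt{perf\_event\_open} system call. The following sketch ---~purely
illustrative, with error handling omitted~--- counts the CPU cycles spent in a
region of interest:

\begin{verbatim}
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
#include <stdint.h>
#include <string.h>
#include <stdio.h>

/* glibc provides no wrapper for this system call. */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;          /* hardware counter...    */
    attr.config = PERF_COUNT_HW_CPU_CYCLES;  /* ...counting CPU cycles */
    attr.disabled = 1;                       /* start disabled         */
    attr.exclude_kernel = 1;                 /* user-space cycles only */

    int fd = perf_event_open(&attr, 0, -1, -1, 0); /* this thread, any CPU */

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    volatile double acc = 0.;                /* region of interest */
    for (int i = 0; i < 1000000; i++)
        acc += 1.;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t cycles = 0;
    read(fd, &cycles, sizeof(cycles));
    printf("cycles: %llu\n", (unsigned long long)cycles);

    close(fd);
    return 0;
}
\end{verbatim}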

\subsection{Code analyzers}

Another approach is to rely on \emph{code analyzers}, pieces of software that
analyze a code fragment ---~typically at the assembly or binary level~---, and
provide insights on its performance on a given hardware platform. Code
analyzers thus work statically, that is, without executing the code.

\paragraph{Common hypotheses.} Code analyzers operate under a set of common
hypotheses, derived from their typical intended usage.

The kernel analyzed is expected to be the body of a loop, or a nest of loops,
iterated enough times to be approximated by an infinite loop. The kernel is
further analyzed under the assumption that it is in \emph{steady-state}; the
analysis thus ignores startup or border effects occurring in edge cases. As
the kernels analyzed are those worth optimizing manually, it is reasonable to
assume that they will be executed many times, and to focus on their
steady-state behaviour.

The kernel is further assumed to be \emph{L1-resident}, that is, to work only
on data that resides in the L1 cache. This assumption is reasonable for two
reasons. First, if data must be fetched from farther cache levels, or even
from the main memory, these fetch operations will be orders of magnitude
slower than the computation being analyzed, making it pointless to optimize
this kernel for CPU efficiency ---~the expert should, in this case, focus
instead on data locality, prefetching, etc. Second, code analyzers typically
focus only on the CPU itself and ignore memory effects. This hypothesis
formalizes that focus; code analyzers' metrics are thus to be regarded
\emph{assuming the CPU is the bottleneck}.
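
As an order of magnitude ---~assuming, for illustration, a 32\,KiB L1 data
cache, a common size~---, a kernel streaming over three arrays of 1024
double-precision values each works on
$3 \times 1024 \times 8\,\mathrm{B} = 24\,\mathrm{KiB}$ of data, and is thus
L1-resident.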

Code analyzers also disregard control flow, and thus assume the code to be
\emph{straight-line code}: the kernel analyzed is treated as a sequence of
instructions without influence on the control flow, executed in order, and
jumping unconditionally back to the first instruction after the last ---~or,
more accurately, the last jump is always assumed taken, and any control-flow
instruction in the middle is assumed not taken, while its computational cost
is still accounted for.
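
For instance, on a hypothetical kernel such as the following sketch, a code
analyzer would assume that the inner branch (the \texttt{continue}) is never
taken and that the loop's backward jump is always taken, while still
accounting for the cost of the comparison itself:

\begin{verbatim}
#include <stddef.h>

/* Hypothetical kernel, illustrating the straight-line-code assumption:
 * the body is analyzed as if executed in order on every iteration, the
 * early "continue" branch is assumed not taken, and the backward jump
 * closing the loop is assumed always taken. */
double inv_sum(size_t n, const double *x)
{
    double acc = 0.;
    for (size_t i = 0; i < n; i++) {
        if (x[i] == 0.)    /* assumed not taken; its cost still counts */
            continue;
        acc += 1. / x[i];
    }
    return acc;
}
\end{verbatim}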

\paragraph{Metrics produced.} The insights produced as output vary with the
code analyzer used. All of them are able to predict either the throughput or
reciprocal throughput ---~defined below~--- of the kernel studied, that is,
how many cycles one iteration of the loop takes, on average and in
steady-state. Although throughput can already be measured at runtime with
hardware counters, a static estimation ---~if reliable~--- is an improvement
in itself, as a static analyzer is typically much faster than running the
actual program under a profiler.
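
As a purely illustrative example, a kernel whose iterations complete, in
steady-state, every $1.5$ cycles on average has a reciprocal throughput of
$1.5$ cycles per iteration ---~or, equivalently, a throughput of
$1 / 1.5 \approx 0.67$ iterations per cycle.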

Each code analyzer relies on a model, or a collection of models, of the
hardware for which it provides analyses. Depending on what is, or is not,
modelled by a specific code analyzer, it may further extract any available and
relevant metric from its model: whether the frontend is saturated, which
computation units of the backend are stressed and by which precise
instructions, when the CPU stalls and why, etc. Code analyzers may further
point towards the resources that limit the kernel's performance, or
\emph{bottlenecks}.


\paragraph{Static vs.\ dynamic analyzers.}

@@ -2,3 +2,4 @@

\input{00_intro.tex}
\input{10_cpu_arch.tex}
\input{20_code_analyzers.tex}

@@ -91,3 +91,40 @@
howpublished={\url{https://www.qemu.org}}
}

% OpenBLAS
@inproceedings{openblas_2013,
author = {Wang, Qian and Zhang, Xianyi and Zhang, Yunquan and Yi, Qing},
title = {AUGEM: Automatically Generate High Performance Dense Linear Algebra Kernels on X86 CPUs},
year = {2013},
isbn = {9781450323789},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/2503210.2503219},
doi = {10.1145/2503210.2503219},
booktitle = {Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis},
articleno = {25},
numpages = {12},
keywords = {auto-tuning, code generation, DLA code optimization},
location = {Denver, Colorado},
series = {SC '13}
}

@misc{openblas_webpage,
title={{OpenBLAS}: an optimized {BLAS} library},
author={Zhang, Xianyi},
howpublished={\url{https://www.openblas.net}}
}

@misc{intel_mkl,
title={oneAPI Math Kernel Library ({oneMKL})},
author={{Intel}},
howpublished={\url{https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl.html}},
year=2003,
}

@misc{intel_vtune,
title={{VTune} profiler},
author={{Intel}},
howpublished={\url{https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler.html}},
}