\section{Kernel optimization and code analyzers}
Optimizing a program, in most contexts, mainly means optimizing it from an
algorithmic point of view ---~using efficient data structures, running some
computations in parallel on multiple cores, etc. As pointed out in our
introduction, though, optimizations close to the machine's microarchitecture
can yield large efficiency benefits, sometimes up to two orders of
magnitude~\cite{dgemm_finetune}. These optimizations, however, are difficult
to carry out for multiple reasons: they depend on the specific machine on
which the code is run; they require deep expert knowledge; and they are most
often applied manually, requiring expert time ---~thus making them expensive.

Such optimizations are, however, routinely used in some domains. Scientific
computations ---~ocean simulation, weather forecasting, \ldots{}~--- often
rely on the same operations, implemented by low-level libraries optimized in
this way, such as OpenBLAS~\cite{openblas_webpage, openblas_2013} or Intel's
MKL~\cite{intel_mkl}, which provide low-level mathematical operations such as
linear algebra routines. Machine learning applications, on the other hand,
are typically trained for extensive periods of time, on many cores and
accelerators, on well-defined hardware, with small portions of code being
executed many times on different data; as such, they are very well suited to
this kind of specific, low-level optimization.

\medskip{}
When optimizing those short fragments of code whose efficiency is critical,
or \emph{computation kernels}, insights on what limits the code's
performance, or \emph{performance bottlenecks}, are precious to the expert.
These insights can be gained by reading the processor's hardware counters,
described above in \autoref{sssec:hw_counters}, which are typically accurate
but of limited versatility. Specialized profilers, such as Intel's
VTune~\cite{intel_vtune}, integrate these counters with profiling to derive
further performance metrics at runtime.

\subsection{Code analyzers}
Another approach is to rely on \emph{code analyzers}, pieces of software that
analyze a code fragment ---~typically at the assembly or binary level~--- and
provide insights on its performance metrics on a given hardware. Code
analyzers usually work statically, that is, without executing the code.

\paragraph{Common hypotheses.} Code analyzers operate under a common set of
hypotheses, derived from their typical intended usage.

The kernel analyzed is expected to be the body of a loop, or nest of loops,
iterated enough times to be well approximated by an infinite loop. The kernel
is further analyzed under the assumption that it is in \emph{steady-state};
startup or border effects occurring in extremal cases are thus ignored. As
the kernels analyzed are those worth optimizing manually, it is reasonable to
assume that they will be executed many times, and to focus on their
steady-state behaviour.

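For concreteness, the following hypothetical x86-64 kernel ---~a sketch of
our own, in AT\&T syntax, summing an array of double-precision floats~--- is
the kind of loop body such tools consider; we use it as a running example
below.

\begin{verbatim}
# Hypothetical example kernel: sums %rsi doubles starting at (%rdi).
.loop:
    addsd   (%rdi), %xmm0      # accumulate one double into xmm0
    addq    $8, %rdi           # advance the data pointer
    subq    $1, %rsi           # decrement the remaining-iterations counter
    jnz     .loop              # loop back while iterations remain
\end{verbatim}
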
The kernel is further assumed to be \emph{L1-resident}, that is, to work only
on data that resides in the L1 cache. This assumption is reasonable in two
ways. First, if data must be fetched from farther caches, or even from main
memory, these fetch operations will be orders of magnitude slower than the
computation being analyzed, making it pointless to optimize this kernel for
CPU efficiency ---~the expert should, in this case, focus instead on data
locality, prefetching, etc. Second, code analyzers typically focus only on
the CPU itself, and ignore memory effects. This hypothesis formalizes that
focus; code analyzers' metrics are thus to be regarded \textit{assuming the
CPU is the bottleneck}.

Code analyzers also disregard control flow, and thus assume the code to be
\emph{straight-line code}: the kernel analyzed is considered as a sequence of
instructions without influence on the control flow, executed in order, and
jumping unconditionally back to the first instruction after the last ---~or,
more accurately, the final jump is always assumed taken, and any control flow
instruction in the middle is assumed not taken, while its computational cost
is still accounted for.

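As an illustration, consider this hypothetical variant of the running
example, which exits early upon encountering a zero element:

\begin{verbatim}
.loop:
    movq    (%rdi), %rax       # load the next 64-bit element
    testq   %rax, %rax         # check for a zero sentinel
    je      .done              # mid-loop branch: assumed not taken
    addq    %rax, %rdx         # accumulate into rdx
    addq    $8, %rdi           # advance the data pointer
    subq    $1, %rsi           # decrement the counter
    jnz     .loop              # final jump: assumed always taken
.done:
\end{verbatim}

Under the straight-line hypothesis, the analyzer accounts for the cost of
executing \texttt{je .done}, but models the control flow as if the loop body
were repeated indefinitely.
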
\paragraph{Metrics produced.} The insights code analyzers provide as output
vary from one tool to another. All of them are able to predict either the
throughput or the reciprocal throughput ---~defined below~--- of the kernel
studied, that is, how many cycles one iteration of the loop takes, on average
and in steady-state. Although throughput can already be measured at runtime
with hardware counters, a static estimation ---~if reliable~--- is already an
improvement, as a static analyzer is typically faster than running the actual
program under profiling.

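As a first intuition ---~the formal definitions are given in the dedicated
subsection below~---, writing $C(n)$ for the number of cycles needed to
execute $n$ iterations of the kernel in steady-state, one may think of the
throughput and the reciprocal (or reverse) throughput as
\[
  \text{throughput} = \lim_{n \to \infty} \frac{n}{C(n)}
  \qquad\text{and}\qquad
  \text{reciprocal throughput} = \lim_{n \to \infty} \frac{C(n)}{n},
\]
expressed respectively in iterations per cycle and in cycles per iteration.
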
Each code analyzer relies on a model, or a collection of models, of the
hardware on which it provides analyses. Depending on what is, or is not,
modelled by a specific code analyzer, it may further extract any available
and relevant metric from its model: whether the frontend is saturated, which
computation units from the backend are stressed and by which precise
instructions, when the CPU stalls and why, etc. Code analyzers may further
point towards the resources that limit the kernel's performance, or
\emph{bottlenecks}.

\paragraph{Static vs.\ dynamic analyzers.} Tools analyzing code, and code
analyzers among them, generally perform either \emph{static} or
\emph{dynamic} analyses. Static analyzers work on the program itself, be it
source code, assembly or any other representation, without running it, while
dynamic analyzers run the analyzed program, keeping it under scrutiny through
instrumentation, monitoring or any other relevant technique. Some analyzers
mix both strategies to further refine their analyses. As a general rule of
thumb, dynamic analyzers are typically more accurate, being able to study the
actual execution trace (or traces) of the program, but are significantly
slower due to the large overhead of instrumentation, and focus more on the
general, average case than on edge cases.

As most code analyzers are static, this manuscript largely focuses on static
analysis. The only dynamic code analyzer we are aware of is \gus{}, described
more thoroughly in \autoref{sec:sota} below; it heavily trades runtime for
accuracy, especially regarding data dependencies that may not be easily
obtained otherwise.

\paragraph{Input formats used.} The analyzers studied in this manuscript all
take as input either assembly code or assembled binaries.

In the case of assembly code, as for instance with \llvmmca{}, analyzers take
either a short assembly snippet, treated as straight-line code and analyzed
as such, or longer pieces of assembly, part or parts of which are marked for
analysis by surrounding assembly comments.

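With \llvmmca{}, for instance, such regions are delimited by special
comments; applied to our hypothetical running example, this looks as follows:

\begin{verbatim}
# LLVM-MCA-BEGIN sum_kernel
.loop:
    addsd   (%rdi), %xmm0
    addq    $8, %rdi
    subq    $1, %rsi
    jnz     .loop
# LLVM-MCA-END
\end{verbatim}
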
In the case of assembled binaries, as all analyzers were run on Linux,
executables or object files are ELF files. Some analyzers work on sections of
the file defined by user-provided offsets in the binary, while others require
the presence of \iaca{} markers around the code portion or portions to be
analyzed. Those markers, introduced by \iaca{}, consist of the following
assembly snippets (shown here in AT\&T syntax):

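\begin{verbatim}
    # IACA start marker (from Intel's iacaMarks.h, rendered in AT&T syntax)
    movl    $111, %ebx          # magic value 111 identifies the region start
    .byte   0x64, 0x67, 0x90    # fs- and addr32-prefixed nop

    # ... code portion to be analyzed ...

    # IACA end marker
    movl    $222, %ebx          # magic value 222 identifies the region end
    .byte   0x64, 0x67, 0x90
\end{verbatim}
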
\subsection{Examples with \llvmmca}
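As a minimal, hypothetical illustration, assume the running example above is
saved to a file \texttt{kernel.s}. Running
\texttt{llvm-mca -mcpu=skylake kernel.s} analyzes it as straight-line code
for a Skylake CPU; the report includes, among other metrics, the total cycle
count over a fixed number of simulated iterations, the block reciprocal
throughput, and the pressure exerted by each instruction on the backend's
execution ports. The \texttt{-timeline} option additionally prints a
cycle-by-cycle view of each instruction's progress through the pipeline.
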
\subsection{Definitions}
\subsubsection{Throughput and reverse-throughput}
\subsubsection{Basic block of an assembly-level program}