\section{Kernel optimization and code analyzers}
Optimizing a program, in most contexts, mainly means optimizing it from an
algorithmic point of view ---~using efficient data structures, running some
computations in parallel on multiple cores, etc. As pointed out in our
introduction, though, optimizations close to the machine's microarchitecture
can yield large efficiency benefits, sometimes up to two orders of
magnitude~\cite{dgemm_finetune}. These optimizations, however, are difficult
to carry out, for multiple reasons: they depend on the specific machine on
which the code is run; they require deep expert knowledge; and they are most
often manual, requiring expert time ---~and thus making them expensive.

Such optimizations are, however, routinely used in some domains. Scientific
computations ---~ocean simulation, weather forecasting, \ldots{}~--- often
rely on the same low-level math operations, such as linear algebra,
implemented by libraries optimized in such a way ---~for instance
OpenBLAS~\cite{openblas_webpage, openblas_2013} or Intel's
MKL~\cite{intel_mkl}. Machine learning applications, on the other hand, are
typically trained for extensive periods of time, on many cores and
accelerators, on well-defined hardware, with small portions of code being
executed many times on different data; as such, they are very well suited to
such specific, low-level optimizations.

\medskip{}

When optimizing those short fragments of code whose efficiency is critical,
or \emph{computation kernels}, insights into what limits the code's
performance, or \emph{performance bottlenecks}, are precious to the expert.
These insights can be gained by reading the processor's hardware counters,
described above in \autoref{sssec:hw_counters}, which are typically accurate
but of limited versatility. Specialized profilers, such as Intel's
VTune~\cite{intel_vtune}, integrate these counters with profiling to derive
further performance metrics at runtime.
\subsection{Code analyzers}
Another approach is to rely on \emph{code analyzers}: pieces of software
that analyze a code fragment ---~typically at the assembly or binary
level~--- and provide insights into its performance on a given hardware.
Unlike profilers, code analyzers work statically, that is, without executing
the code.

\paragraph{Common hypotheses.} Code analyzers operate under a common set of
hypotheses, derived from their typical intended usage.

The kernel analyzed is expected to be the body of a loop, or nest of loops,
iterated enough times to be approximated by an infinite loop. The kernel
will further be analyzed under the assumption that it is in
\emph{steady-state}, thus ignoring startup or border effects occurring in
extremal cases. As the kernels analyzed are those worth optimizing manually,
it is reasonable to assume that they will be executed many times, and to
focus on their steady-state.

The kernel is further assumed to be \emph{L1-resident}, that is, to work
only on data that resides in the L1 cache. This assumption is reasonable in
two ways. First, if data must be fetched from farther caches, or even from
the main memory, these fetch operations will be multiple orders of magnitude
slower than the computation being analyzed, making it pointless to optimize
this kernel for CPU efficiency ---~the expert should, in this case, focus
instead on data locality, prefetching, etc. Second, code analyzers typically
focus only on the CPU itself, and ignore memory effects. This hypothesis
formalizes this focus; code analyzers' metrics are thus to be regarded
\textit{assuming the CPU is the bottleneck}.

Code analyzers also disregard control flow, and thus assume the code to be
\emph{straight-line code}: the kernel analyzed is considered as a sequence
of instructions without influence on the control flow, executed in order,
and jumping unconditionally back to the first instruction after the last
---~or, more accurately, the final jump is always assumed taken, while any
control flow instruction in the middle is assumed not taken, though its
computational cost is still accounted for.
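
As a purely illustrative example, the following kernel, which sums an array
of double-precision floats, would be analyzed as these four instructions
repeating indefinitely: the trailing \lstc{jne} is assumed always taken, and
the data pointed to by \lstc{rdi} is assumed to reside in the L1 cache.

\begin{lstlisting}[language={[x86masm]Assembler}]
loop:
addsd xmm0, [rdi] ; accumulate one double from memory
add rdi, 8        ; advance the data pointer
sub rcx, 1        ; decrement the iteration counter
jne loop          ; backward jump, assumed always taken
\end{lstlisting}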

\paragraph{Metrics produced.} The insights provided as an output vary with
the code analyzer used. All of them are able to predict either the
throughput or the reciprocal throughput ---~defined below~--- of the kernel
studied, that is, how many cycles one iteration of the loop takes, on
average and in steady-state. Although throughput can already be measured at
runtime with hardware counters, a static estimation ---~if reliable~--- is
an improvement in itself, as a static analyzer is typically faster than
running the actual program under profiling.
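
As a purely numerical illustration ---~anticipating the definitions given
below, with throughput counted in iterations per cycle and reciprocal
throughput in cycles per iteration~---, a kernel with a reciprocal
throughput of $1.5$ cycles per iteration has a throughput of
$1/1.5 \approx 0.67$ iterations per cycle; $10^{6}$ steady-state iterations
are thus expected to take about $1.5 \times 10^{6}$ cycles.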

Each code analyzer relies on a model, or a collection of models, of the
hardware on which it provides analyses. Depending on what is, or is not,
modelled by a specific code analyzer, it may further extract any available
and relevant metric from its model: whether the frontend is saturated, which
computation units from the backend are stressed and by which precise
instructions, when the CPU stalls and why, etc. Code analyzers may further
point towards the resources that limit the kernel's performance, or
\emph{bottlenecks}.

\paragraph{Static vs.\ dynamic analyzers.} Tools analyzing code, and code
analyzers among them, generally perform either \emph{static} or
\emph{dynamic} analyses. Static analyzers work on the program itself, be it
source code, assembly or any other representation, without running it;
dynamic analyzers run the analyzed program, keeping it under scrutiny
through instrumentation, monitoring or any other relevant technique. Some
analyzers mix both strategies to further refine their analyses. As a general
rule of thumb, dynamic analyzers are typically more accurate, being able to
study the actual execution trace (or traces) of the program, but they are
significantly slower due to instrumentation's large overhead, and they focus
more on the general, average case than on edge cases.

As most code analyzers are static, this manuscript largely focuses on static
analysis. The only dynamic code analyzer we are aware of is \gus{},
described more thoroughly later in \autoref{sec:sota}; it trades a heavily
increased runtime for better accuracy, especially regarding data
dependencies, which may not be easily obtained otherwise.

\paragraph{Input formats used.} The analyzers studied in this manuscript all
take as input either assembly code or assembled binaries.

In the case of assembly code, as for instance with \llvmmca{}, analyzers
take either a short assembly snippet, treated as straight-line code and
analyzed as such; or longer pieces of assembly, part or parts of which are
marked for analysis by surrounding assembly comments.
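
For instance, \llvmmca{} delimits the region or regions to analyze with
special assembly comments. Using the GNU assembler's comment syntax, a
purely illustrative kernel could be marked as follows, \lstc{kernel} being
an arbitrary region name:

\begin{lstlisting}[language={[x86masm]Assembler}]
# LLVM-MCA-BEGIN kernel
addsd xmm0, [rdi]
add rdi, 8
# LLVM-MCA-END
\end{lstlisting}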

In the case of assembled binaries ---~as all analyzers were run on Linux,
executables and object files are ELF files~---, some analyzers work on
regions of the file defined by user-provided offsets into the binary, while
others require the presence of \iaca{} markers around the code portion or
portions to be analyzed. These markers, introduced by \iaca{} as C-level
preprocessor statements, consist of the following x86 assembly snippets:

\hfill\begin{minipage}{0.35\textwidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
mov ebx, 111
db 0x64, 0x67, 0x90
\end{lstlisting}
\textit{\iaca{} start marker}
\end{minipage}\hfill\begin{minipage}{0.35\textwidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
mov ebx, 222
db 0x64, 0x67, 0x90
\end{lstlisting}
\textit{\iaca{} end marker}
\end{minipage}

\medskip

On UNIX-based operating systems, the standard format for assembled binaries
---~either object files (\lstc{.o}) or executables~--- is ELF~\cite{elf_tis}.
Such files are organized in sections, the assembled instructions themselves
being found in the \texttt{.text} section ---~the rest holding metadata,
program data (strings, icons, \ldots), debugging information, etc. When an
ELF file is loaded into memory for execution, each segment may be
\emph{mapped} to a portion of the address space. For instance, if the
\texttt{.text} section has 1024 bytes, starting at offset 4096 of the ELF
file itself, it may be mapped at virtual address \texttt{0x454000}; as such,
the byte read by the program when dereferencing address \texttt{0x454010}
would be the byte at offset 16 in the \texttt{.text} section, that is, the
byte at offset 4112 in the ELF file.
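
Under this illustrative layout, the file offset corresponding to a virtual
address $v$ falling within the mapped \texttt{.text} section is thus
\[
\mathit{off}(v) = \mathit{off}_{\texttt{.text}} + (v - v_{\texttt{.text}})
= 4096 + (\texttt{0x454010} - \texttt{0x454000}) = 4096 + 16 = 4112.
\]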

Throughout the ELF file, \emph{symbols} are defined as references, or
pointers, to specific offsets or chunks in the file. This mechanism is used,
among others, to refer to the program's functions. For instance, a symbol
\texttt{main} may be defined, pointing to the offset of the first byte of
the \lstc{main} function, and possibly also holding its total number of
bytes.

Both these mechanisms can be used to identify, without \iaca{} markers or
the like, a portion of an ELF file to be analyzed: either an offset and size
in the \texttt{.text} section can be provided (which can be found with tools
like \lstc{objdump}), or a symbol name can be provided, if an entire
function is to be analyzed.
\subsection{Examples with \llvmmca}
\subsection{Definitions}
\subsubsection{Throughput and reverse-throughput}
\subsubsection{Basic block of an assembly-level program}