\section{Kernel optimization and code analyzers}
Optimizing a program, in most contexts, mainly means optimizing it from an
algorithmic point of view ---~using efficient data structures, running some
computations in parallel on multiple cores, etc. As pointed out in our
introduction, though, optimizations close to the machine's microarchitecture
can yield large efficiency benefits, sometimes up to two orders of
magnitude~\cite{dgemm_finetune}. These optimizations, however, are difficult to
carry out for multiple reasons: they depend on the specific machine on which the
code is run; they require deep expert knowledge; and they are most often
performed by hand, requiring expert time ---~and thus making them expensive.

Such optimizations are, however, routinely used in some domains. Scientific
computations ---~ocean simulation, weather forecasting, \ldots{}~--- often rely
on the same core operations, implemented by libraries hand-optimized in this
way, such as OpenBLAS~\cite{openblas_webpage, openblas_2013} or Intel's
MKL~\cite{intel_mkl}, which provide low-level mathematical operations such as
linear algebra. Machine learning applications, on the other hand, are typically
trained for extensive periods of time, on many cores and accelerators, on
well-defined hardware, with small portions of code being executed many times on
different data; as such, they are very well suited to such specific, low-level
optimizations.

\medskip{}

When optimizing those short fragments of code whose efficiency is critical, or
\emph{computation kernels}, insights on what limits the code's performance, or
\emph{performance bottlenecks}, are precious to the expert. These insights can
be gained by reading the processor's hardware counters, described above in
\autoref{sssec:hw_counters}, which are typically accurate but of limited versatility.
Specialized profilers, such as Intel's VTune~\cite{intel_vtune}, integrate
these counters with profiling to derive further performance metrics at runtime.
\subsection{Code analyzers}
Another approach is to rely on \emph{code analyzers}, pieces of software that
analyze a code fragment ---~typically at assembly or binary level~---, and
provide insights on its expected performance on a given hardware platform. Most
code analyzers work statically, that is, without executing the code.
\paragraph{Common hypotheses.} Code analyzers operate under a common set of
hypotheses, derived from their typical intended usage.
The kernel analyzed is expected to be the body of a loop, or nest of loops,
iterated enough times to be approximated by an infinite loop. The kernel is
further analyzed under the assumption that it is in \emph{steady-state}; the
analysis thus ignores startup and border effects occurring in extremal cases.
As the kernels analyzed are those worth optimizing manually, it is reasonable
to assume that they will be executed many times, and to focus on their
steady-state.
The kernel is further assumed to be \emph{L1-resident}, that is, to work only
on data that resides in the L1 cache. This assumption is reasonable in two
ways. First, if data must be fetched from farther caches, or even the main
memory, these fetch operations will be multiple orders of magnitude slower than
the computation being analyzed, making it useless to optimize this kernel for
CPU efficiency ---~the expert should, in this case, focus instead on data
locality, prefetching, etc. Second, code analyzers typically focus only on the
CPU itself, and ignore memory effects. This hypothesis formalizes this focus;
code analyzers' metrics are thus to be interpreted \textit{assuming the CPU is
the bottleneck}.
Code analyzers also disregard control flow, and thus assume the code to be
\emph{straight-line code}: the kernel analyzed is considered as a sequence of
instructions without influence on the control flow, executed in order, and
jumping unconditionally back to the first instruction after the last ---~or,
more accurately, the last jump is always assumed taken, and any control-flow
instructions in the middle are assumed not taken, though their computational
cost is still accounted for.
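As an illustration, the following (purely hypothetical) C loop is typical of the
kernels targeted by code analyzers: its body compiles to a handful of
straight-line instructions, it is iterated many times and, provided the arrays
are small enough, it works on L1-resident data.

\begin{lstlisting}[language=C]
/* Hypothetical computation kernel: the compiled body of this loop is
 * the straight-line code a code analyzer would consider. */
void saxpy(long n, float a, const float *x, float *y)
{
    for (long i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
\end{lstlisting}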
\paragraph{Metrics produced.} The insights provided as output vary with the
code analyzer used. All of them are able to predict either the throughput or
the reciprocal throughput ---~defined below~--- of the kernel studied, that is,
how many cycles one iteration of the loop takes, on average and in
steady-state. Although throughput can already be measured at runtime with
hardware counters, a static estimation ---~if reliable~--- is an improvement in
itself, as a static analyzer is typically faster than running the actual
program under profiling.
Each code analyzer relies on a model, or a collection of models, of the
hardware on which it provides analyses. Depending on what is, or is not,
modelled by a specific code analyzer, it may further extract any available and
relevant metric from its model: whether the frontend is saturated, which
computation units from the backend are stressed and by which precise
instructions, when the CPU stalls and why, etc. Code analyzers may further
point towards the resources that are limiting the kernel's performance, or
\emph{bottlenecks}.
\paragraph{Static vs.\ dynamic analyzers.} Tools analyzing code, and code
analyzers among them, generally perform either \emph{static} or \emph{dynamic}
analyses. Static analyzers work on the program itself, be it source code,
assembly or any other representation, without running it, while dynamic
analyzers run the analyzed program, keeping it under scrutiny through
instrumentation, monitoring or any other relevant technique. Some analyzers mix
both strategies to further refine their analyses. As a general rule of thumb,
dynamic analyzers are typically more accurate, as they can study the actual
execution trace (or traces) of the program, but they are significantly slower
due to the large overhead of instrumentation, and they focus more on the
general, average case than on edge cases.
As most code analyzers are static, this manuscript largely focuses on static
analysis. The only dynamic code analyzer we are aware of is \gus{}, described
more thoroughly in \autoref{sec:sota} later, which trades a heavy runtime cost
for accuracy, especially regarding data dependencies that may not be easily
obtained otherwise.
\paragraph{Input formats used.} The analyzers studied in this manuscript all
take as input either assembly code, or assembled binaries.
In the case of assembly code, as for instance with \llvmmca{}, analyzers
take either a short assembly snippet, treated as straight-line code and
analyzed as such; or longer pieces of assembly, in which the part or parts to
be analyzed are marked by surrounding assembly comments.
In the case of assembled binaries, as all analyzers were run on Linux,
executables or object files are ELF files. Some analyzers work on sections of
the file defined by user-provided offsets in the binary, while others require
the presence of \iaca{} markers around the code portion or portions to be
analyzed. Those markers, introduced by \iaca{} as C-level preprocessor
statements, consist of the following x86 assembly snippets:
\hfill\begin{minipage}{0.35\textwidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
mov ebx, 111
db 0x64, 0x67, 0x90
\end{lstlisting}
\textit{\iaca{} start marker}
\end{minipage}\hfill\begin{minipage}{0.35\textwidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
mov ebx, 222
db 0x64, 0x67, 0x90
\end{lstlisting}
\textit{\iaca{} end marker}
\end{minipage}
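In practice, these markers are usually not written by hand: the
\texttt{iacaMarks.h} header distributed with \iaca{} defines
\texttt{IACA\_START} and \texttt{IACA\_END} macros expanding to the snippets
above, to be placed around the loop of interest at C level. A possible
placement, reusing the hypothetical kernel from earlier, is sketched below: the
start marker at the top of the loop body and the end marker right after the
loop delimit one iteration of the analyzed region.

\begin{lstlisting}[language=C]
#include "iacaMarks.h"  /* header distributed with IACA */

void saxpy(long n, float a, const float *x, float *y)
{
    for (long i = 0; i < n; i++) {
        IACA_START      /* expands to the start marker above */
        y[i] = a * x[i] + y[i];
    }
    IACA_END            /* expands to the end marker above */
}
\end{lstlisting}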
\medskip
On UNIX-based operating systems, the standard format for assembled binaries
---~either object files (\lstc{.o}) or executables~--- is ELF~\cite{elf_tis}.
Such files are organized in sections, the assembled instructions themselves
being found in the \texttt{.text} section ---~the rest holding metadata,
program data (strings, icons, \ldots), debugging information, etc. When an ELF
is loaded to memory for execution, each segment may be \emph{mapped} to a
portion of the address space. For instance, if the \texttt{.text} section has
1024 bytes, starting at offset 4096 of the ELF file itself, it may be mapped at
virtual address \texttt{0x454000}; as such, the byte that could be read from
the program by dereferencing address \texttt{0x454010} would be the byte at
offset 16 within the \texttt{.text} section, that is, the byte at offset 4112
in the ELF file.
Throughout the ELF file, \emph{symbols} are defined as references, or pointers,
to specific offsets or chunks in the file. This mechanism is used, among
others, to refer to the program's functions. For instance, a symbol
\texttt{main} may be defined, that would point to the offset of the first byte
of the \lstc{main} function, and may also hold its total number of bytes.
Both these mechanisms can be used to identify, without \iaca{} markers or the
like, the portion of an ELF file to be analyzed: an offset and size in the
\texttt{.text} section can be provided (which can be found with tools like
\lstc{objdump}), or a symbol name can be provided, if an entire function is to
be analyzed.
\subsection{Examples with \llvmmca}
\todo{}
\subsection{Definitions}
\subsubsection{Throughput and reciprocal throughput}
Given a kernel $\kerK$ of straight-line assembly code, we have referred to
$\cyc{\kerK}$ as the \emph{reciprocal throughput} of $\kerK$, that is, how many
cycles one iteration of $\kerK$ requires, on average, in steady-state. We now
define this notion more formally.
\begin{notation}[$\kerK^n$]\label{not:kerK_N}
Given a kernel $\kerK$ and a positive integer $n \in \nat^*$, we note
$\kerK^n$ the kernel $\kerK$ repeated $n$ times, that is, the instructions
of $\kerK$ concatenated $n$ times.
\end{notation}
\begin{definition}[Reciprocal throughput of a kernel]\label{def:cyc_kerK}
The \emph{reciprocal throughput} of a kernel $\kerK$, noted $\cyc{\kerK}$
and measured in \emph{cycles per iteration}, is also called the
steady-state execution time of a kernel.
Let us note $C(\kerK)$ the number of cycles, \emph{in steady-state}, from the
moment the first instruction of $\kerK$ starts to be decoded to the
moment the last instruction of $\kerK$ is issued.
We then define \[
\cyc{\kerK} = \min_{n \in \nat^*} \left( \dfrac{C(\kerK^n)}{n} \right)
\]
\end{definition}
Due to the pipelined nature of execution units, this means that, in
steady-state, the same instruction of successive iterations of $\kerK$ is
retired ---~\ie{} yields its result~--- on average once every $\cyc{\kerK}$
cycles. For this reason, the execution time is measured until the last
instruction is issued, not retired.
We define this as a minimum over concatenated kernels because subsequent
kernel iterations may ``share'' a cycle.
\begin{example}
Let $\kerK$ be a kernel of three instructions, and assume that a given processor can only
issue two instructions per cycle, but has no other bottleneck for $\kerK$.
Then, $C(\kerK) = 2$, as three
instructions cannot be issued in a single cycle; yet $C(\kerK^2) = 3$, as six
instructions can be issued in only three cycles. Thus, in this case,
$\cyc{\kerK} = 1.5$.
\end{example}
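Continuing this example under the same simplifying assumptions (an issue width
of two instructions per cycle and no other bottleneck), we have $C(\kerK^n) =
\left\lceil \sfrac{3n}{2} \right\rceil$, and the ratio $\sfrac{C(\kerK^n)}{n}$
takes the successive values
\[
2,\quad 1.5,\quad \dfrac{5}{3},\quad 1.5,\quad \dfrac{8}{5},\quad 1.5,\quad \ldots
\]
reaching its minimum of $1.5$ as early as $n = 2$ and converging towards that
same value, as formalized by the lemma below.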
\begin{remark}
As $C(\kerK)$ depends on the microarchitecture of the processor considered,
the reciprocal throughput $\cyc{\kerK}$ of a kernel $\kerK$ also depends on
the processor considered.
\end{remark}
\medskip
Although we define $\cyc{\kerK}$ as a minimum over $\nat^*$, only a bounded
number of repetitions needs to be considered before this minimum is reached.
\begin{lemma}\label{lem:cyc_k_conv}
Given a kernel $\kerK$,
\begin{enumerate}[(i)]
\item{}\label{lem:cyc_k_conv:low_n} the minimum considered in the
definition of $\cyc{\kerK}$ is reached for a small value of $n \leq
N_0$, $N_0$ being commensurate with the complexity of the
microarchitecture considered.
\item{}\label{lem:cyc_k_conv:conv} Furthermore, the sequence converges
towards $\cyc{\kerK}$:
\[
\lim_{n \to \infty} \dfrac{C(\kerK^n)}{n} = \cyc{\kerK}
\]
\end{enumerate}
\end{lemma}
\begin{proof}
Indeed, the number of resources that can be shared between instructions
in a processor is finite (and relatively small, usually on the order of
ten), and their number of possible states is also finite (and also small);
the total number of possible states of the processor at the end of a kernel
iteration thus cannot be higher than the number of combinations of those
states ---~and is usually far smaller, given that only a portion of those
resources is used by a kernel.
Thus, by the pigeonhole principle, and as each state depends only on the
previous one, the sequence of states reached after each iteration of $\kerK$
is periodic, of some period $p$. As such, and as we are by hypothesis
in steady-state already (and not merely periodic from a certain rank onwards),
for any $n \geq p$, we have
\[
C(\kerK^n) = C(\kerK^{n-p}) + C(\kerK^p)
\]
Take $r_0 \in \nat^*$ realizing
$\min_{0 < r \leq p}\left(\sfrac{C(\kerK^r)}{r}\right)$.
For any $n \in \nat^*$ such that $n = kp+r$, $0 < r \leq p$, $k, r \in \nat$,
\begin{align*}
C(\kerK^n) &= k \cdot C(\kerK^p) + C(\kerK^r) & \textit{(by induction)} \\
&= kp \dfrac{C(\kerK^p)}{p} + r \dfrac{C(\kerK^r)}{r} \\
&\geq kp \cdot \dfrac{C(\kerK^{r_0})}{r_0} + r \dfrac{C(\kerK^{r_0})}{r_0} \\
&\geq (kp+r) \dfrac{C(\kerK^{r_0})}{r_0} \\
&\geq n \dfrac{C(\kerK^{r_0})}{r_0} \\
\implies \dfrac{C(\kerK^n)}{n} &\geq \dfrac{C(\kerK^{r_0})}{r_0} = \cyc{\kerK}
\end{align*}
Thus, $r_0$ realizes the minimum from the definition of $\cyc{\kerK}$, with
$r_0 \leq p$, $p$ being commensurate with the complexity of the
microarchitecture, proving~(\ref{lem:cyc_k_conv:low_n}).
\medskip{}
For any $n > r_0$, we decompose $n = r_0 + m$ and $m = k'p + r'$, $0 < r'
\leq p$, $k', r' \in \nat$.
\begin{align*}
C(\kerK^n) = C(\kerK^{r_0}) + k'p \dfrac{C(\kerK^p)}{p} +
\end{align*}
\todo{}
\end{proof}
\medskip
Throughout this manuscript, we mostly use reciprocal throughput as a metric, as
we find it more relevant from an optimization point of view ---~an opinion we
detail in \autoref{chap:CesASMe}. The \emph{throughput} of a kernel is,
however, more widely used in the literature.
\medskip
\begin{definition}[Throughput of a kernel]
The \emph{throughput} of a kernel $\kerK$, measured in \emph{instructions
per cycle}, or IPC, is defined as the number of instructions in $\kerK$, divided
by the steady-state execution time of $\kerK$.
\end{definition}
In the literature and in analyzers' reports, the throughput of a kernel is
often referred to by its unit, \emph{IPC}.
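The two metrics are directly related: for instance, the three-instruction
kernel of the example above, for which $\cyc{\kerK} = 1.5$ cycles per
iteration, has a throughput of $\sfrac{3}{1.5} = 2$ instructions per cycle,
that is, 2~IPC.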
\begin{notation}[Experimental measure of $\cyc{\kerK}$]
We note $\cycmes{\kerK}{n}$ the experimental measure of $\cyc{\kerK}$ over
$n$ iterations, realized by:
\begin{itemize}
\item sampling the hardware counter of total number of instructions
retired and the counter of total number of cycles elapsed,
\item executing $\kerK^n$,
\item sampling again the same counters, and noting respectively
$\Delta_n\text{ret}$ and $\Delta_{n}C$ their differences,
\item noting $\cycmes{\kerK}{n} = \dfrac{\Delta_{n}C\cdot
\card{\kerK}}{\Delta_n\text{ret}}$, where $\card{\kerK}$ is the
number of instructions in $\kerK$.
\end{itemize}
\end{notation}
\begin{lemma}
For any kernel $\kerK$,
$\cycmes{\kerK}{n} \xrightarrow[n \to \infty]{} \cyc{\kerK}$.
\end{lemma}
\begin{proof}
In the absence of measurement errors, for an integer number of kernel
iterations $n$, we would have $\sfrac{\Delta_n\text{ret}}{\card{\kerK}} = n$.
Measurement errors may make $\Delta_{n}\text{ret}$ fluctuate slightly, but
this fluctuation remains below a constant threshold:
\[
\abs{\dfrac{\Delta_n\text{ret}}{\card{\kerK}} - n}
\leq E_\text{ret}
\]
In the same way, and due to the pipelining effects noted below
the definition of $\cyc{\kerK}$,
\[
\abs{\Delta_{n}C - C(\kerK^n)} \leq E_C
\]
with $E_C$ a constant.
As such, for a given $n$, \todo{}
\end{proof}
Given this property, we will, throughout this manuscript, use $\cyc{\kerK}$ to
refer to $\cycmes{\kerK}{n}$ for large values of $n$ whenever it is clear from
context that the value is a measurement.
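As an illustration, the sketch below shows one possible way to realize such a
measure on Linux through the \texttt{perf\_event\_open} system call; the
\texttt{run\_kernel} function, which stands for the execution of $\kerK^n$, and
the constants used are hypothetical, and error handling is omitted for brevity.

\begin{lstlisting}[language=C]
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

/* Open one hardware counter (cycles or retired instructions),
 * counting user-space events of the current thread only. */
static int open_counter(uint64_t config)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = config;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

extern void run_kernel(long n);    /* hypothetical: executes K^n */

int main(void)
{
    const long n = 1000000;        /* number of kernel iterations */
    const long k_size = 3;         /* |K|: instructions per iteration */
    int cyc_fd = open_counter(PERF_COUNT_HW_CPU_CYCLES);
    int ret_fd = open_counter(PERF_COUNT_HW_INSTRUCTIONS);
    uint64_t c0, c1, r0, r1;

    read(cyc_fd, &c0, sizeof(c0)); /* first sample of both counters */
    read(ret_fd, &r0, sizeof(r0));
    run_kernel(n);                 /* execute K^n */
    read(cyc_fd, &c1, sizeof(c1)); /* second sample */
    read(ret_fd, &r1, sizeof(r1));

    /* cycmes(K, n) = Delta_C * |K| / Delta_ret */
    printf("measured cycles per iteration: %f\n",
           (double)(c1 - c0) * (double)k_size / (double)(r1 - r0));
    return 0;
}
\end{lstlisting}

Only user-space events are counted here (\texttt{exclude\_kernel}), so that
instructions executed by the operating system on behalf of the measurement
harness do not pollute the measure.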
\subsubsection{Basic block of an assembly-level program}