\section{Kernel optimization and code analyzers}

Optimizing a program, in most contexts, mainly means optimizing it from an
algorithmic point of view ---~using efficient data structures, running some
computations in parallel on multiple cores, etc. As pointed out in our
introduction, though, optimizations close to the machine's microarchitecture
can yield large efficiency benefits, sometimes up to two orders of
magnitude~\cite{dgemm_finetune}. These optimizations, however, are difficult
to carry out for multiple reasons: they depend on the specific machine on
which the code is run; they require deep expert knowledge; and they are most
often manual, requiring expert time ---~thus making them expensive.

Such optimizations are, however, routinely used in some domains. Scientific
computations ---~such as ocean simulation, weather forecasting, \ldots{}~---
often rely on the same operations, implemented by low-level libraries
optimized in this way, such as OpenBLAS~\cite{openblas_webpage,
openblas_2013} or Intel's MKL~\cite{intel_mkl}, which provide basic
mathematical operations such as linear algebra routines. Machine learning
applications, on the other hand, are typically trained for extensive periods
of time, on many cores and accelerators, on well-defined hardware, with small
portions of code being executed many times on different data; as such, they
are very well suited to such specific, low-level optimizations.

\medskip{}

When optimizing those short fragments of code whose efficiency is critical,
or \emph{computation kernels}, insights into what limits the code's
performance, or \emph{performance bottlenecks}, are precious to the expert.
These insights can be gained by reading the processor's hardware counters,
described above in \autoref{sssec:hw_counters}, which are typically accurate
but of limited versatility. Specialized profilers, such as Intel's
VTune~\cite{intel_vtune}, integrate these counters with profiling to derive
further performance metrics at runtime.

\subsection{Code analyzers}

Another approach is to rely on \emph{code analyzers}: pieces of software that
analyze a code fragment ---~typically at the assembly or binary level~--- and
provide insights into its expected performance on a given processor. Most
code analyzers work statically, that is, without executing the code.

\paragraph{Common hypotheses.} Code analyzers operate under a set of common
hypotheses, derived from their typical intended usage.

The kernel analyzed is expected to be the body of a loop, or nest of loops,
iterated enough times to be approximated by an infinite loop. The kernel is
further analyzed under the assumption that it is in \emph{steady-state};
startup or border effects occurring in extremal cases are thus ignored. As
the kernels analyzed are those worth optimizing manually, it is reasonable to
assume that they will be executed many times, and to focus on their
steady-state behaviour.

The kernel is further assumed to be \emph{L1-resident}, that is, to work only
on data that resides in the L1 cache. This assumption is reasonable in two
ways. First, if data must be fetched from farther caches, or even the main
memory, these fetch operations will be multiple orders of magnitude slower
than the computation being analyzed, making it pointless to optimize the
kernel for CPU efficiency ---~the expert should, in this case, focus instead
on data locality, prefetching, etc. Second, code analyzers typically focus
only on the CPU itself, and ignore memory effects. This hypothesis formalizes
this focus; code analyzers' metrics are thus to be read \textit{assuming the
CPU is the bottleneck}.

Code analyzers also disregard control flow, and thus assume the code to be
\emph{straight-line code}: the kernel analyzed is considered as a sequence of
instructions without influence on the control flow, executed in order, and
jumping unconditionally back to the first instruction after the last ---~or,
more accurately, the final jump is always assumed taken, while any control
flow instructions in the middle are assumed not taken, though their
computational cost is still accounted for.
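
As an illustration, consider the following hypothetical kernel, which scales
an array of double-precision floats, with the current and past-the-end
pointers held in \lstc{rdi} and \lstc{rsi}:

\begin{lstlisting}[language={[x86masm]Assembler}]
loop:
    movsd xmm0, [rdi]  ; load a[i]
    mulsd xmm0, xmm1   ; multiply it by the constant held in xmm1
    movsd [rdi], xmm0  ; store it back
    add   rdi, 8       ; advance to a[i+1]
    cmp   rdi, rsi     ; end of the array reached?
    jne   loop
\end{lstlisting}

A code analyzer would consider the final \lstc{jne} always taken, and analyze
the loop body as an endless, in-order stream of iterations of these six
instructions.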

\paragraph{Metrics produced.} The insights provided as output vary with the
code analyzer used. All of them are able to predict either the throughput or
the reciprocal throughput ---~both defined below~--- of the kernel studied,
that is, how many cycles one iteration of the loop takes, on average and in
steady-state. Although throughput can already be measured at runtime with
hardware counters, a reliable static estimation is an improvement in itself,
as a static analyzer is typically far faster than running the actual program
under profiling.

Each code analyzer relies on a model, or a collection of models, of the
hardware for which it provides analyses. Depending on what is, or is not,
modelled by a specific code analyzer, it may further extract any available
and relevant metric from its model: whether the frontend is saturated, which
computation units from the backend are stressed and by which precise
instructions, when the CPU stalls and why, etc. Code analyzers may further
point towards the resources limiting the kernel's performance, or
\emph{bottlenecks}.

\paragraph{Static vs.\ dynamic analyzers.} Tools analyzing code, and code
analyzers among them, generally perform either \emph{static} or
\emph{dynamic} analyses. Static analyzers work on the program itself ---~be
it source code, assembly or any other representation~--- without running it,
while dynamic analyzers run the analyzed program, keeping it under scrutiny
through instrumentation, monitoring or any other relevant technique. Some
analyzers mix both strategies to further refine their analyses. As a general
rule of thumb, dynamic analyzers are typically more accurate, being able to
study the actual execution trace (or traces) of the program, but are
significantly slower due to instrumentation's large overhead, and focus more
on the general, average case than on edge cases.

As most code analyzers are static, this manuscript largely focuses on static
analysis. The only dynamic code analyzer we are aware of is \gus{}, described
more thoroughly in \autoref{sec:sota} later, which trades a heavily increased
runtime for accuracy, especially regarding data dependencies that may not be
easily obtained otherwise.

\paragraph{Input formats used.} The analyzers studied in this manuscript all
take as input either assembly code or assembled binaries.

In the case of assembly code, as for instance with \llvmmca{}, analyzers take
either a short assembly snippet, treated as straight-line code and analyzed
as such, or longer pieces of assembly, in which the part or parts to be
analyzed are marked by surrounding assembly comments.
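
For instance, \llvmmca{} delimits the region or regions to analyze with
\lstc{LLVM-MCA-BEGIN} and \lstc{LLVM-MCA-END} comments, optionally followed
by a region name. A minimal sketch, the kernel body itself being an arbitrary
example:

\begin{lstlisting}[language={[x86masm]Assembler}]
# LLVM-MCA-BEGIN my_kernel
mulsd xmm0, xmm1
addsd xmm2, xmm0
# LLVM-MCA-END
\end{lstlisting}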

In the case of assembled binaries, as all analyzers were run on Linux,
executables or object files are ELF files. Some analyzers work on sections of
the file defined by user-provided offsets in the binary, while others require
the presence of \iaca{} markers around the code portion or portions to be
analyzed. Those markers, introduced by \iaca{} as C-level preprocessor
statements, consist of the following x86 assembly snippets:

\hfill\begin{minipage}{0.35\textwidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
mov ebx, 111
db 0x64, 0x67, 0x90
\end{lstlisting}
\textit{\iaca{} start marker}
\end{minipage}\hfill\begin{minipage}{0.35\textwidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
mov ebx, 222
db 0x64, 0x67, 0x90
\end{lstlisting}
\textit{\iaca{} end marker}
\end{minipage}

\medskip

On UNIX-based operating systems, the standard format for assembled binaries
---~either object files (\lstc{.o}) or executables~--- is ELF~\cite{elf_tis}.
Such files are organized in sections, the assembled instructions themselves
being found in the \texttt{.text} section ---~the rest holding metadata,
program data (strings, icons, \ldots), debugging information, etc. When an
ELF file is loaded into memory for execution, each segment may be
\emph{mapped} to a portion of the address space. For instance, if the
\texttt{.text} section is 1024 bytes long, starting at offset 4096 of the ELF
file itself, it may be mapped at virtual address \texttt{0x454000}; the byte
read by the program when dereferencing address \texttt{0x454010} would then
be the byte at offset 16 in the \texttt{.text} section, that is, the byte at
offset 4112 in the ELF file.

Throughout the ELF file, \emph{symbols} are defined as references, or
pointers, to specific offsets or chunks of the file. This mechanism is used,
among others, to refer to the program's functions. For instance, a symbol
\texttt{main} may be defined, pointing to the offset of the first byte of the
\lstc{main} function, and possibly also holding its total number of bytes.

Both these mechanisms can be used to identify, without \iaca{} markers or the
like, a portion of an ELF file to be analyzed: either an offset and size in
the \texttt{.text} section can be provided (which can be found with tools
like \lstc{objdump}), or a symbol name can be given, if an entire function is
to be analyzed.
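
For instance, assuming an executable \lstc{a.out} containing a function
\lstc{kernel} (both names hypothetical), \lstc{objdump -t a.out} prints the
symbol table, including the address and size of \lstc{kernel}, while
\lstc{objdump -d a.out} disassembles the \texttt{.text} section along with
the address of each instruction; either output is enough to derive the
offsets to provide to an analyzer.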

\subsection{Examples with \llvmmca}

\todo{}
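
As a minimal example, consider a short assembly file \lstc{kernel.s},
containing a hypothetical kernel written in the GNU assembler's Intel syntax:

\begin{lstlisting}[language={[x86masm]Assembler}]
.intel_syntax noprefix
mulsd xmm0, xmm1    # a floating-point multiplication...
addsd xmm2, xmm0    # ...feeding a dependent addition
add   rdi, 8        # and an independent pointer increment
\end{lstlisting}

Running \lstc{llvm-mca -mcpu=skylake kernel.s} analyzes this snippet as
straight-line code in steady-state for a Skylake CPU. Its report includes
summary metrics such as the predicted IPC and the block reciprocal throughput
(\textit{Block RThroughput}), along with the pressure exerted on each
computation unit; further views, such as a cycle-by-cycle timeline, can be
requested through additional flags. The exact figures and layout of the
report vary with the LLVM version and the \lstc{-mcpu} target.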

\subsection{Definitions}

\subsubsection{Throughput and reciprocal throughput}

Given a kernel $\kerK$ of straight-line assembly code, we have referred to
$\cyc{\kerK}$ as the \emph{reciprocal throughput} of $\kerK$, that is, how
many cycles $\kerK$ will require to complete its execution in steady-state.
We define this notion here more formally.

\begin{notation}[$\kerK^n$]\label{not:kerK_N}
    Given a kernel $\kerK$ and a positive integer $n \in \nat^*$, we note
    $\kerK^n$ the kernel $\kerK$ repeated $n$ times, that is, the
    instructions of $\kerK$ concatenated $n$ times.
\end{notation}
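
For instance, if $\kerK$ consists of the two instructions \lstc{mulsd xmm0,
xmm1} and \lstc{add rdi, 8} (an arbitrary example), then $\kerK^2$ is the
four-instruction kernel repeating this sequence twice, in order.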

\begin{definition}[Reciprocal throughput of a kernel]\label{def:cyc_kerK}
    The \emph{reciprocal throughput} of a kernel $\kerK$, noted $\cyc{\kerK}$
    and measured in \emph{cycles per iteration}, is also called the
    steady-state execution time of the kernel.

    Let us note $C(\kerK)$ the number of cycles, \emph{in steady-state}, from
    the moment the first instruction of $\kerK$ starts to be decoded to the
    moment the last instruction of $\kerK$ is issued.

    We then define \[
        \cyc{\kerK} = \min_{n \in \nat^*} \left( \dfrac{C(\kerK^n)}{n} \right)
    \]
\end{definition}

Due to the pipelined nature of execution units, the same instruction of
successive iterations of $\kerK$ will be retired ---~\ie{} yield its
result~--- once every $\cyc{\kerK}$ cycles in steady-state. For this reason,
the execution time is measured until the last instruction is issued, not
retired.

We define $\cyc{\kerK}$ as a minimum over concatenated kernels because
subsequent kernel iterations may ``share'' a cycle, as the following example
illustrates.

\begin{example}
    Let $\kerK$ be a kernel of three instructions, and assume that a given
    processor can only issue two instructions per cycle, but has no other
    bottleneck for $\kerK$. Then, $C(\kerK) = 2$, as three instructions
    cannot be issued in a single cycle; yet $C(\kerK^2) = 3$, as six
    instructions can be issued in only three cycles. Thus, in this case,
    $\cyc{\kerK} = 1.5$.
\end{example}

\begin{remark}
    As $C(\kerK)$ depends on the microarchitecture of the processor
    considered, the reciprocal throughput $\cyc{\kerK}$ of a kernel $\kerK$
    also depends on the processor considered.
\end{remark}

\medskip

Although we define $\cyc{\kerK}$ as a minimum over the whole of $\nat^*$,
only finitely many concatenations of the kernel need to be considered before
this minimum is reached.

\begin{lemma}\label{lem:cyc_k_conv}
    Given a kernel $\kerK$,

    \begin{enumerate}[(i)]

        \item{}\label{lem:cyc_k_conv:low_n} the minimum considered in the
            definition of $\cyc{\kerK}$ is reached for a small value of $n
            \leq N_0$, $N_0$ being commensurate with the complexity of the
            microarchitecture considered;

        \item{}\label{lem:cyc_k_conv:conv} furthermore, the sequence
            converges towards $\cyc{\kerK}$:
            \[
                \lim_{n \to \infty} \dfrac{C(\kerK^n)}{n} = \cyc{\kerK}
            \]

    \end{enumerate}

\end{lemma}

\begin{proof}
    As the number of resources that can be shared between instructions in a
    processor is finite (and relatively small, usually on the order of tens),
    and the number of possible states of each resource is also finite (and
    also small), the total number of possible states of the processor at the
    end of a kernel iteration is bounded by the product of those numbers of
    states ---~and is usually far smaller, given that only a portion of those
    resources is used by a given kernel.

    Thus, by the pigeonhole principle, and as each state depends only on the
    previous one, the sequence of processor states reached at the end of each
    successive kernel iteration is periodic, of some period $p$. As such, and
    as we are by hypothesis in steady-state already (and not only periodic
    from a certain rank onwards), for any $n > p$, we have
    \[
        C(\kerK^n) = C(\kerK^{n-p}) + C(\kerK^p)
    \]

    Take $r_0 \in \nat^*$ realizing
    $\min_{0 < r \leq p}\left(\sfrac{C(\kerK^r)}{r}\right)$.

    For any $n \in \nat^*$ such that $n = kp+r$, $0 < r \leq p$, $k, r \in \nat$,
    \begin{align*}
        C(\kerK^n) &= k \cdot C(\kerK^p) + C(\kerK^r) & \textit{(by induction)} \\
        &= kp \dfrac{C(\kerK^p)}{p} + r \dfrac{C(\kerK^r)}{r} \\
        &\geq kp \cdot \dfrac{C(\kerK^{r_0})}{r_0} + r \dfrac{C(\kerK^{r_0})}{r_0} \\
        &\geq (kp+r) \dfrac{C(\kerK^{r_0})}{r_0} \\
        &\geq n \dfrac{C(\kerK^{r_0})}{r_0} \\
        \implies \dfrac{C(\kerK^n)}{n} &\geq \dfrac{C(\kerK^{r_0})}{r_0} = \cyc{\kerK}
    \end{align*}

    Thus, $r_0$ realizes the minimum from the definition of $\cyc{\kerK}$,
    with $r_0 \leq p$, $p$ being commensurate with the complexity of the
    microarchitecture, proving~(\ref{lem:cyc_k_conv:low_n}).

    \medskip{}

    For the convergence~(\ref{lem:cyc_k_conv:conv}), decompose any $n \in
    \nat^*$ as $n = kp + r$, $0 < r \leq p$. By the same induction as above,
    \[
        \dfrac{C(\kerK^n)}{n} = \dfrac{k \cdot C(\kerK^p) + C(\kerK^r)}{kp + r}
    \]
    As $n$ grows, so does $k$, and $C(\kerK^r)$ only takes finitely many
    values; this ratio thus tends to $\sfrac{C(\kerK^p)}{p}$.

    It remains to prove that $\sfrac{C(\kerK^p)}{p} = \cyc{\kerK}$. As noted
    above, subsequent kernel iterations may share a cycle, but cannot delay
    one another; $C$ is thus subadditive: $C(\kerK^{a+b}) \leq C(\kerK^a) +
    C(\kerK^b)$. In particular, $C(\kerK^{r_0 p}) \leq p \cdot
    C(\kerK^{r_0})$, while, by periodicity, $C(\kerK^{r_0 p}) = r_0 \cdot
    C(\kerK^p)$. Hence $\sfrac{C(\kerK^p)}{p} \leq \sfrac{C(\kerK^{r_0})}{r_0}
    = \cyc{\kerK}$; the converse inequality was established above, proving
    the convergence~(\ref{lem:cyc_k_conv:conv}).
\end{proof}

\medskip

Throughout this manuscript, we mostly use the reciprocal throughput as a
metric, as we find it more relevant from an optimization point of view ---~an
opinion we detail in \autoref{chap:CesASMe}. The literature, however, most
often uses the \emph{throughput} of a kernel in its stead.

\medskip

\begin{definition}[Throughput of a kernel]
    The \emph{throughput} of a kernel $\kerK$, measured in
    \emph{instructions per cycle}, or IPC, is defined as the number of
    instructions in $\kerK$, divided by the steady-state execution time of
    $\kerK$.
\end{definition}

In the literature and in analyzers' reports, the throughput of a kernel is
often referred to by its unit, \emph{IPC}.
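
Both metrics carry the same information: a kernel $\kerK$ has a throughput of
$\sfrac{\card{\kerK}}{\cyc{\kerK}}$ instructions per cycle. In the example
above, the three-instruction kernel with $\cyc{\kerK} = 1.5$ thus has a
throughput of $\sfrac{3}{1.5} = 2$~IPC, matching the issue width of the
processor considered.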

\begin{notation}[Experimental measure of $\cyc{\kerK}$]
    We note $\cycmes{\kerK}{n}$ the experimental measure of $\cyc{\kerK}$,
    realized by:
    \begin{itemize}
        \item sampling the hardware counter of the total number of
            instructions retired and the counter of the total number of
            cycles elapsed,
        \item executing $\kerK^n$,
        \item sampling the same counters again, and noting respectively
            $\Delta_n\text{ret}$ and $\Delta_{n}C$ their differences,
        \item noting $\cycmes{\kerK}{n} = \dfrac{\Delta_{n}C\cdot
            \card{\kerK}}{\Delta_n\text{ret}}$, where $\card{\kerK}$ is the
            number of instructions in $\kerK$.
    \end{itemize}
\end{notation}
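
As a purely illustrative computation, take again the three-instruction kernel
from the example above, with $\cyc{\kerK} = 1.5$, and let $n = 10\,000$: we
would expect to measure $\Delta_n\text{ret} \simeq 30\,000$ retired
instructions and $\Delta_n C \simeq 15\,000$ elapsed cycles, yielding
$\cycmes{\kerK}{n} = \sfrac{\left(15\,000 \times 3\right)}{30\,000} = 1.5$
cycles per iteration.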

\begin{lemma}
    For any kernel $\kerK$,
    $\cycmes{\kerK}{n} \xrightarrow[n \to \infty]{} \cyc{\kerK}$.
\end{lemma}

\begin{proof}
    For an integer number of kernel iterations $n$,
    $\sfrac{\Delta_n\text{ret}}{\card{\kerK}} = n$. While measurement errors
    may make $\Delta_{n}\text{ret}$ fluctuate slightly, this fluctuation
    remains below a constant threshold:
    \[
        \abs{\dfrac{\Delta_n\text{ret}}{\card{\kerK}} - n}
        \leq E_\text{ret}
    \]

    In the same way, and due to the pipelining effects we noted below the
    definition of $\cyc{\kerK}$,
    \[
        \abs{\Delta_{n}C - C(\kerK^n)} \leq E_C
    \]
    with $E_C$ a constant.

    As such, for a given $n$, there exist $\epsilon_C$ and
    $\epsilon_\text{ret}$, with $\abs{\epsilon_C} \leq E_C$ and
    $\abs{\epsilon_\text{ret}} \leq E_\text{ret}$, such that
    \[
        \cycmes{\kerK}{n}
        = \dfrac{\Delta_{n}C \cdot \card{\kerK}}{\Delta_n\text{ret}}
        = \dfrac{C(\kerK^n) + \epsilon_C}{n + \epsilon_\text{ret}}
    \]
    Dividing both numerator and denominator by $n$, the numerator tends to
    $\cyc{\kerK}$ by \autoref{lem:cyc_k_conv}, while the denominator tends
    to $1$; hence the result.
\end{proof}

Given this property, we use $\cyc{\kerK}$ throughout this manuscript to also
refer to $\cycmes{\kerK}{n}$ for large values of $n$, whenever it is clear
from the context that this value is a measure.

\subsubsection{Basic block of an assembly-level program}