\section{Kernel optimization and code analyzers}\label{ssec:code_analyzers}

Optimizing a program, in most contexts, mainly means optimizing it from an
algorithmic point of view ---~using efficient data structures, running some
computations in parallel on multiple cores, etc. As pointed out in our
introduction, though, optimizations close to the machine's microarchitecture
can yield large efficiency benefits, sometimes up to two orders of
magnitude~\cite{dgemm_finetune}. These optimizations, however, are difficult to
carry out for multiple reasons: they depend on the specific machine on which
the code is run; they require deep expert knowledge; and they are most often
manual, requiring expert time ---~and thus making them expensive.

Such optimizations are, however, routinely used in some domains. Scientific
computations ---~such as ocean simulation, weather forecast, \ldots{}~--- often
rely on the same operations, implemented by low-level libraries optimized in
such a way, such as OpenBLAS~\cite{openblas_webpage, openblas_2013} or Intel's
MKL~\cite{intel_mkl}, which implement low-level mathematical operations, such
as linear algebra. Machine learning applications, on the other hand, are
typically trained for extensive periods of time, on many cores and
accelerators, on well-defined hardware, with small portions of code being
executed many times on different data; as such, they are very well suited for
such specific and low-level optimizations.

\medskip{}

When optimizing those short fragments of code whose efficiency is critical, or
\emph{computation kernels}, insights on what limits the code's performance, or
\emph{performance bottlenecks}, are precious to the expert. These insights can
be gained by reading the processor's hardware counters, described above in
\autoref{sssec:hw_counters}, which are typically accurate but of limited
versatility. Specialized profilers, such as Intel's
VTune~\cite{intel_vtune}, integrate these counters with profiling to derive
further performance metrics at runtime.
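
For instance, on Linux, the \texttt{perf} profiler exposes these counters; a
minimal sketch, where \texttt{./kernel\_bench} is a hypothetical benchmark
binary:

\begin{lstlisting}[language=bash]
# Count cycles and retired instructions over a whole run;
# their ratio yields a program-wide average IPC.
perf stat -e cycles,instructions ./kernel_bench
\end{lstlisting}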

\subsection{Code analyzers}

Another approach is to rely on \emph{code analyzers}, pieces of software that
analyze a code fragment ---~typically at assembly or binary level~--- and
provide insights on its performance metrics on a given hardware. Code analyzers
thus work statically, that is, without executing the code.

\paragraph{Common hypotheses.} Code analyzers operate under a set of common
hypotheses, derived from their typical intended usage.

The kernel analyzed is expected to be the body of a loop, or nest of loops,
iterated enough times to be approximated by an infinite loop. The kernel will
further be analyzed under the assumption that it is in \emph{steady-state}; the
analysis thus ignores startup or border effects occurring in extremal cases. As
the kernels analyzed are those worth optimizing manually, it is reasonable to
assume that they will be executed many times, and to focus on their
steady-state.

The kernel is further assumed to be \emph{L1-resident}, that is, to work only
on data that resides in the L1 cache. This assumption is reasonable in two
ways. First, if data must be fetched from farther caches, or even the main
memory, these fetch operations will be multiple orders of magnitude slower than
the computation being analyzed, making it useless to optimize this kernel for
CPU efficiency ---~the expert should, in this case, focus instead on data
locality, prefetching, etc. Second, code analyzers typically focus only on the
CPU itself, and ignore memory effects. This hypothesis formalizes this focus;
code analyzers' metrics are thus to be regarded \textit{assuming the CPU is the
bottleneck}.

Code analyzers also disregard control flow, and thus assume the code to be
\emph{straight-line code}: the kernel analyzed is considered as a sequence of
instructions without influence on the control flow, executed in order, and
jumping unconditionally back to the first instruction after the last ---~or,
more accurately, the last jump is always assumed taken, and any control flow
instruction in the middle is assumed not taken, although their computational
cost is still accounted for.
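
As an illustration, consider the following hypothetical x86 loop summing an
array of 32-bit integers (a sketch of ours, not taken from any analyzer's
documentation):

\begin{lstlisting}[language={[x86masm]Assembler}]
loop:
    add eax, DWORD PTR [rbx]  ; accumulate the current element
    add rbx, 4                ; advance to the next element
    sub ecx, 1                ; decrement the remaining count
    jnz loop                  ; assumed always taken by the analyzer
\end{lstlisting}

A code analyzer would treat these four instructions as an endlessly repeating
straight-line sequence, accounting for the cost of \texttt{jnz} but never
leaving the loop.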

\paragraph{Metrics produced.} The insights provided as an output vary with the
code analyzer used. All of them are able to predict either the throughput or
reciprocal throughput ---~defined below~--- of the kernel studied, that is, how
many cycles one iteration of the loop takes, on average and in steady-state.
Although throughput can be measured at runtime with hardware counters, a static
estimation ---~if reliable~--- is already an improvement, as a static analyzer
is typically faster than running the actual program under profiling.

Each code analyzer relies on a model, or a collection of models, of the
hardware on which it provides analyses. Depending on what is, or is not,
modelled by a specific code analyzer, it may further extract any available and
relevant metric from its model: whether the frontend is saturated, which
computation units from the backend are stressed and by which precise
instructions, when the CPU stalls and why, etc. Code analyzers may further
point towards the resources that limit the kernel's performance, or
\emph{bottlenecks}.

\paragraph{Static vs.\ dynamic analyzers.} Tools analyzing code, and code
analyzers among them, generally perform either \emph{static} or \emph{dynamic}
analyses. Static analyzers work on the program itself, be it source code,
assembly or any other representation, without running it; dynamic analyzers
run the analyzed program, keeping it under scrutiny through instrumentation,
monitoring or any other relevant technique. Some analyzers mix both strategies
to further refine their analyses. As a general rule of thumb, dynamic
analyzers are typically more accurate, as they can study the actual execution
trace (or traces) of the program; but they are significantly slower, due to
instrumentation's large overhead, and focus more on the general, average case
than on edge cases.

As most code analyzers are static, this manuscript largely focuses on static
analysis. The only dynamic code analyzer we are aware of is \gus{}, described
more thoroughly in \autoref{sec:sota} later, which trades a heavily increased
run time for better accuracy, especially regarding data dependencies that may
not be easily obtained otherwise.

\paragraph{Input formats used.} The analyzers studied in this manuscript all
take as input either assembly code, or assembled binaries.

In the case of assembly code, as for instance with \llvmmca{}, analyzers take
either a short assembly snippet, treated as straight-line code and analyzed as
such; or longer pieces of assembly, part or parts of which are marked for
analysis by surrounding assembly comments.
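
For instance, \llvmmca{} recognizes analysis regions delimited by special
comments; a minimal sketch, where the region name \texttt{kernel} is
arbitrary:

\begin{lstlisting}[language={[x86masm]Assembler}]
# LLVM-MCA-BEGIN kernel
add eax, ebx
# LLVM-MCA-END
\end{lstlisting}

Only the instructions between the two markers are analyzed; the markers
themselves are plain comments, discarded by the assembler.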

In the case of assembled binaries, as all analyzers were run on Linux,
executables or object files are ELF files. Some analyzers work on sections of
the file defined by user-provided offsets in the binary, while others require
the presence of \textit{\iaca{} markers} around the code portion or portions
to be analyzed. Those markers, introduced by \iaca{} as C-level preprocessor
statements, consist of the following x86 assembly snippets:

\hfill\begin{minipage}{0.35\textwidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
mov ebx, 111
db 0x64, 0x67, 0x90
\end{lstlisting}
\textit{\iaca{} start marker}
\end{minipage}\hfill\begin{minipage}{0.35\textwidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
mov ebx, 222
db 0x64, 0x67, 0x90
\end{lstlisting}
\textit{\iaca{} end marker}
\end{minipage}

\medskip

On UNIX-based operating systems, the standard format for assembled binaries
---~either object files (\lstc{.o}) or executables~--- is ELF~\cite{elf_tis}.
Such files are organized in sections, the assembled instructions themselves
being found in the \texttt{.text} section ---~the rest holding metadata,
program data (strings, icons, \ldots), debugging information, etc. When an ELF
file is loaded into memory for execution, each section may be \emph{mapped} to
a portion of the address space. For instance, if the \texttt{.text} section
has 1024 bytes, starting at offset 4096 of the ELF file itself, it may be
mapped at virtual address \texttt{0x454000}; as such, the byte that could be
read from the program by dereferencing address \texttt{0x454010} would be the
byte at offset 16 of the \texttt{.text} section, that is, the byte at offset
4112 in the ELF file.

Throughout the ELF file, \emph{symbols} are defined as references, or
pointers, to specific offsets or chunks in the file. This mechanism is used,
among others, to refer to the program's functions. For instance, a symbol
\texttt{main} may be defined, pointing to the offset of the first byte of the
\lstc{main} function, and possibly also holding its total number of bytes.

Both these mechanisms can be used to identify, without \iaca{} markers or the
like, a section of an ELF file to be analyzed: either an offset and size in
the \texttt{.text} section can be provided (which can be found with tools like
\lstc{objdump}), or a symbol name can be provided, if an entire function is to
be analyzed.
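
For instance, with the standard binutils tools, taking \texttt{a.out} as a
placeholder binary, both pieces of information can be recovered as follows:

\begin{lstlisting}[language=bash]
# Size, virtual address and file offset of the .text section:
objdump -h a.out | grep '\.text'
# Virtual address and size of the symbol main, if present:
nm --print-size a.out | grep ' main$'
\end{lstlisting}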

\subsection{Examples with \llvmmca}

We have now covered enough of the theoretical background to introduce code
analyzers in a concrete way, through examples of their usage. For this
purpose, we use \llvmmca{}, one of the state-of-the-art code analyzers.

Due to its relative simplicity ---~at least compared to \eg{} Intel's x86-64
implementations~---, we will base the following examples on ARM's Cortex A72,
which we introduce in depth later in \autoref{chap:frontend}. No specific
knowledge of this microarchitecture is required to understand the following
examples; for our purposes, it suffices to say that:

\begin{itemize}
    \item the A72 has a single load port, a single store port and two integer
        arithmetic ports;
    \item the \texttt{xN} registers are 64-bit registers;
    \item the \texttt{ldr} instruction (\textbf{l}oa\textbf{d}
        \textbf{r}egister) loads a value from memory into a register;
    \item the \texttt{str} instruction (\textbf{st}ore
        \textbf{r}egister) stores the value of a register to memory;
    \item the \texttt{add} instruction adds the integer values of its last two
        operands and stores the result in the first.
\end{itemize}

\bigskip{}

\paragraph{Simple example: a single load.} We start by running \llvmmca{} on a
single load operation: \lstarmasm{ldr x1, [x2]}.

\lstinputlisting[language=bash]{assets/src/20_foundations/llvm_mca_examples/01_ldr.out}

The first rows (2-10) are high-level metrics. \llvmmca{} works by simulating
the execution of the kernel ---~here, 100 times, as seen on row 2~---. This
simple kernel contains only one instruction, which breaks down into a single
\uop{}. Iterating it takes 106 cycles instead of the expected 100 cycles, as
this execution is \emph{not} in steady-state, but accounts for the cycles from
the decoding of the first instruction to the retirement of the last.

Row 7 indicates that each cycle, the frontend can issue at most 3 \uops{}. The
next two rows are simple ratios. Row 10 is the block's \emph{reciprocal
throughput}, which we will note $\cyc{\kerK}$ and formalize later in
\autoref{sssec:def:rthroughput}, but which is roughly defined as the number of
cycles a single iteration of the kernel takes.

The next section, \emph{instruction info}, lists data about the instructions
present.

Finally, the last section, \emph{resources}, breaks down individual
instructions into the load they incur on the execution ports, first
aggregating it by full iteration of the kernel, then instruction by
instruction. The maximal load of each port is normalized to 1, which amounts
to saying that it is expressed in number of cycles required to process the
load.

Here, the only pressure is 1 on the port labeled \texttt{[2]}, that is, the
load port. Thus, the kernel cannot complete in less than a full cycle, as it
takes up all load resources available.

\paragraph{The timeline mode.} Another useful view that can be displayed by
\llvmmca{} is its timeline mode, enabled by passing an extra
\lstbash{--timeline} flag. In the previous example, it further outputs:

\lstinputlisting[language=bash]{assets/src/20_foundations/llvm_mca_examples/02_ldr_timeline.out}

which indicates, for each instruction, the timeline of its execution. Here,
\texttt{D} stands for decode, \texttt{e} for being executed ---~in the
pipeline~---, \texttt{E} for the last cycle of its execution ---~leaving the
pipeline~--- and \texttt{R} for retiring. When an instruction is decoded and
waiting to be dispatched to execution, an \texttt{=} is shown.

The identifier at the beginning of each row indicates the kernel iteration
number, and the instruction within.

Here, we can better understand the 106 cycles seen earlier: it takes a first
cycle to decode the first instruction, the instruction remains in the pipeline
for 5 cycles, and must finally be retired. In steady-state, however, the
instruction would already be decoded (while a previous instruction was being
executed), the retirement would also take place while another instruction
executes, and the pipeline would be accepting new instructions for four of
these five cycles. We can thus avoid using up 6 of those 106 cycles in
steady-state, taking us back to the expected 100 cycles.

\paragraph{Single integer add.} If we replace this load operation with an
integer add operation, we find a reciprocal throughput halved:

\lstinputlisting[language=bash]{assets/src/20_foundations/llvm_mca_examples/10_add.out}

Indeed, as we have two integer arithmetic units, two adds may be executed in
parallel, as can be seen in the timeline view.

\paragraph{Load and two adds.} If we combine those two instructions in a
kernel with a single load and two adds, we obtain a kernel that still fits in
the execution ports in a single cycle. \llvmmca{} confirms this:

\lstinputlisting[language=bash]{assets/src/20_foundations/llvm_mca_examples/20_laa.out}

We can indeed see that an iteration fully utilizes the three ports, but still
fits: the kernel still manages to have a reciprocal throughput of 1.

\newpage
\paragraph{Three adds.} A kernel of three adds, however, will not be able to
run in a single cycle:

\lstinputlisting[language=bash]{assets/src/20_foundations/llvm_mca_examples/30_aaa.out}

The resource pressure by iteration view confirms that we exceed the
processor's integer arithmetic capacity for a single cycle. This is correctly
reflected in the timeline view: the instruction \texttt{[0,2]} starts
executing only at cycle 3, along with \texttt{[1,0]}.

\paragraph{Load, store and two adds.} A kernel of one load, two adds and one
store should, ports-wise, fit in a single cycle. However, \llvmmca{} finds for
this kernel a reciprocal throughput of 1.3:

\lstinputlisting[language=bash]{assets/src/20_foundations/llvm_mca_examples/40_laas.out}

While the resource pressure views confirm that the ports are able to handle
this kernel in a single cycle, the timeline shows that it is in fact the
frontend that stalls the computation. As only three instructions may be
decoded and issued per cycle, the backend is not fed enough instructions per
cycle to reach a reciprocal throughput of 1.

\subsection{Definitions}

\subsubsection{Throughput and reciprocal
throughput}\label{sssec:def:rthroughput}

Given a kernel $\kerK$ of straight-line assembly code, we have referred to
$\cyc{\kerK}$ as the \emph{reciprocal throughput} of $\kerK$, that is, how many
cycles $\kerK$ will require to complete its execution in steady-state. We
define this notion here more formally.

\begin{notation}[$\kerK^n$]\label{not:kerK_N}
    Given a kernel $\kerK$ and a positive integer $n \in \nat^*$, we note
    $\kerK^n$ the kernel $\kerK$ repeated $n$ times, that is, the instructions
    of $\kerK$ concatenated $n$ times.
\end{notation}

\begin{definition}[$C(\kerK)$]\label{def:ker_cycles}
    The \emph{number of cycles} of a kernel $\kerK$ is defined, \emph{in
    steady-state}, as the number of elapsed cycles from the moment the first
    instruction of $\kerK$ starts to be decoded to the moment the last
    instruction of $\kerK$ is issued.

    We note $C(\kerK)$ the number of cycles of $\kerK$.

    We extend this definition so that $C(\emptyset) = 0$; however, care must be
    taken that, as we work in steady-state, this $\emptyset$ must be \emph{in
    the context of a given kernel} (\ie{} we run $\kerK$ until steady-state is
    reached, then consider how many cycles it takes to execute 0 further
    instructions). This context is clarified by noting $\ckn{0}$.
\end{definition}

Due to the pipelined nature of execution units, this means that the same
instruction of each iteration of $\kerK$ will be retired ---~\ie{} yield its
result~--- once every steady-state execution time. For this reason, the
execution time is measured until the last instruction is issued, not retired.

\begin{lemma}[Periodicity of $\ckn{n+1}-\ckn{n}$]
    Given a kernel $\kerK$, the sequence $\left(\ckn{n+1} - \ckn{n}\right)_{n
    \in \nat}$ is periodic, that is, there exists $p \in \nat^*$ such that
    \[
        \forall n \in \nat, \ckn{n+1} - \ckn{n} = \ckn{n+p+1} - \ckn{n+p}
    \]

    We note this period $\calP(\kerK)$.
\end{lemma}

\begin{proof}
    The number of CPU resources that can be shared between instructions in a
    processor is finite (and relatively small, usually on the order of ten).
    These resources are typically the number of \uops{} issued to each port in
    the current cycle, the number of decoded instructions, the total number of
    \uops{} issued this cycle, and such.

    For each of these resources, the number of possible states is also finite
    (and also small). Thus, the total number of possible states of a processor
    at the end of a kernel iteration cannot be higher than the combination of
    those states.

    For a given kernel $\kerK$, we note $\sigma(\kerK)$ the CPU state reached
    after executing $\kerK$, in steady-state.

    Given a kernel $\kerK$, the set $\left\{\sigma(\kerK^n), n\in
    \nat\right\}$ is a subset of the total set of possible states of the
    processor, and is thus finite ---~and, in all realistic cases, is usually
    far smaller than the full set, given that only a portion of those
    resources are used by a kernel.

    \medskip{}

    We further note that, for all $n \in \nat$, $\sigma(\kerK^{n+1})$ is a
    function only of the processor considered, $\kerK$ and $\sigma(\kerK^n)$:
    indeed, a steady-state for $\kerK^{n}$ is also a steady-state for
    $\kerK^{n+1}$ and, knowing $\sigma(\kerK^n)$, the execution can be
    continued for the following $\kerK$, reaching $\sigma(\kerK^{n+1})$.

    \medskip{}

    Thus, by the pigeonhole principle, there exists $p \in \nat^*$ such that
    $\sigma(\kerK) = \sigma(\kerK^{p+1})$. By induction, as each state depends
    only on the previous one, we thus obtain that $(\sigma(\kerK^n))_n$ is
    periodic of period $p$. As we consider only the execution's steady-state,
    the sequence is periodic from rank 0.

    As the number of cycles needed to execute $\kerK$ depends only on the
    initial state of the processor, we thus have
    \[
        \forall n \in \nat, \ckn{n+1} - \ckn{n} = \ckn{n+p+1} - \ckn{n+p}
    \]
\end{proof}

\begin{definition}[Reciprocal throughput of a kernel]\label{def:cyc_kerK}
    The \emph{reciprocal throughput} of a kernel $\kerK$, noted $\cyc{\kerK}$
    and measured in \emph{cycles per iteration}, is also called the
    steady-state execution time of a kernel.

    We note $p = \calP(\kerK) \in \nat^*$ the period of $\ckn{n+1} - \ckn{n}$
    (given by the above lemma), and define \[
        \cyc{\kerK} = \dfrac{\ckn{p}}{p}
    \]
\end{definition}

We define this as the average over a whole period because subsequent kernel
iterations may ``share'' a cycle.

\begin{example}
    Let $\kerK$ be a kernel of three instructions, and assume that a given
    processor can only issue two instructions per cycle, but has no other
    bottleneck for $\kerK$. Then, $C(\kerK) = 2$, as three instructions cannot
    be issued in a single cycle; yet $\ckn{2} = 3$, as six instructions can be
    issued in only three cycles. In this case, the period $p$ is clearly $2$;
    thus, $\cyc{\kerK} = \sfrac{3}{2} = 1.5$.
\end{example}

\begin{remark}
    As $C(\kerK)$ depends on the microarchitecture of the processor
    considered, the reciprocal throughput $\cyc{\kerK}$ of a kernel $\kerK$
    also implicitly depends on the processor considered.
\end{remark}

\medskip

\begin{lemma}
    Let $\kerK$ be a kernel and $p = \calP(\kerK)$. For all $n \in \nat$ such
    that $n = kp + r$, with $k, r \in \nat$, $1 \leq r \leq p$,
    \[
        \ckn{n} = k \ckn{p} + \ckn{r}
    \]
\end{lemma}

\begin{proof}
    From the previous lemma instantiated with $n = 0$, we have
    \begin{align*}
        \ckn{1} - \ckn{0} &= \ckn{p+1} - \ckn{p} \\
        \iff{} \ckn{p} &= \ckn{p+1} - \ckn{1}
    \end{align*}
    and thus, by induction, $\forall m \in \nat, \ckn{m+p} - \ckn{m} =
    \ckn{p}$.

    \medskip{}

    Thus, if $k = 0$, the property is trivial. If $k = 1$, it is a direct
    application of the above:
    \[
        \ckn{p+r} = \ckn{p} + \ckn{r}
    \]

    The cases for $k > 1$ follow by induction.
\end{proof}

\begin{lemma}\label{lem:cyc_k_conv}
    Given a kernel $\kerK$,
    \[
        \dfrac{C(\kerK^n)}{n} \limarrow{n}{\infty} \cyc{\kerK}
    \]

    Furthermore, this convergence is linear:
    \[
        \abs{\dfrac{\ckn{n}}{n} - \cyc{\kerK}} = \bigO{\dfrac{1}{n}}
    \]
\end{lemma}

\begin{proof}
    Let $n \in \nat^*$ and $p = \calP(\kerK) \in \nat^*$ the period given by
    the above lemma.

    Let $k, r \in \nat$ be such that $n = kp+r$, $1 \leq r \leq p$.

    \begin{align*}
        \ckn{n} &= k \cdot \ckn{p} + \ckn{r} & \textit{(by lemma)} \\
        &= kp \dfrac{\ckn{p}}{p} + \ckn{r} \\
        &= kp \cyc{\kerK} + \ckn{r} \\
        \implies \abs{\ckn{n} - n \cyc{\kerK}} &= \abs{kp \cyc{\kerK} + \ckn{r} - (kp+r) \cyc{\kerK}}\\
        &= \abs{\ckn{r} - r \cyc{\kerK}} \\
        &\leq \ckn{r} + r \cyc{\kerK} & \textit{(all is positive)} \\
        &\leq \left(\max_{m \leq p}\ckn{m}\right) + p \cyc{\kerK}
    \end{align*}

    This last right-hand expression is independent of $n$; we note it $M$.
    Dividing by $n$, we obtain
    \[
        \abs{\dfrac{\ckn{n}}{n} - \cyc{\kerK}} \leq \dfrac{M}{n}
    \]

    from which both results follow.
\end{proof}

\medskip

Throughout this manuscript, we mostly use reciprocal throughput as a metric,
as we find it more relevant from an optimization point of view ---~an opinion
we detail in \autoref{chap:CesASMe}. However, the \emph{throughput} of a
kernel is most widely used in the literature in its stead.

\medskip

\begin{definition}[Throughput of a kernel]\label{def:ipc}
    The \emph{throughput} of a kernel $\kerK$, measured in \emph{instructions
    per cycle}, or IPC, is defined as the number of instructions in $\kerK$,
    divided by the steady-state execution time of $\kerK$:
    \[
        \operatorname{IPC}(\kerK) = \dfrac{\card{\kerK}}{\cyc{\kerK}}
    \]
\end{definition}

In the literature and in analyzers' reports, the throughput of a kernel is
often referred to simply as its \emph{IPC}, which is, strictly speaking, its
unit.
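
As a sanity check, we can revisit the example given above for reciprocal
throughput:

\begin{example}
    For the three-instruction kernel $\kerK$ above, $\card{\kerK} = 3$ and
    $\cyc{\kerK} = 1.5$; hence $\operatorname{IPC}(\kerK) = \sfrac{3}{1.5} =
    2$, matching the two instructions the processor can issue each cycle.
\end{example}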

\begin{notation}[Experimental measure of $\cyc{\kerK}$]\label{def:cycmes_kerK}
    We note $\cycmes{\kerK}{n}$ the experimental measure of $\cyc{\kerK}$,
    realized by:
    \begin{itemize}
        \item sampling the hardware counter of the total number of
            instructions retired and the counter of the total number of cycles
            elapsed,
        \item executing $\kerK^n$,
        \item sampling again the same counters, and noting respectively
            $\Delta_n\text{ret}$ and $\Delta_{n}C$ their differences,
        \item noting $\cycmes{\kerK}{n} = \dfrac{\Delta_{n}C\cdot
            \card{\kerK}}{\Delta_n\text{ret}}$, where $\card{\kerK}$ is the
            number of instructions in $\kerK$.
    \end{itemize}
\end{notation}

\begin{lemma}
    For any kernel $\kerK$,
    $\cycmes{\kerK}{n} \limarrow{n}{\infty} \cyc{\kerK}$.
\end{lemma}
\begin{proof}
    For an integer number of kernel iterations $n$,
    $\sfrac{\Delta_n\text{ret}}{\card{\kerK}} = n$. While measurement errors
    may make $\Delta_{n}\text{ret}$ fluctuate slightly, this fluctuation will
    remain below a constant threshold:
    \[
        \abs{\dfrac{\Delta_n\text{ret}}{\card{\kerK}} - n}
        \leq E_\text{ret}
    \]

    In the same way, and due to the pipelining effects we noted below the
    definition of $\cyc{\kerK}$,
    \[
        \abs{\Delta_{n}C - C(\kerK^n)} \leq E_C
    \]
    with $E_C$ a constant.

    As those errors are constant, while the other quantities grow linearly
    with $n$, we thus have
    \[
        \cycmes{\kerK}{n} = \dfrac{\Delta_n C}{\sfrac{\Delta_n
        \text{ret}}{\card{\kerK}}}
        \qquad\text{with}\qquad
        \cycmes{\kerK}{n} - \dfrac{C(\kerK^n)}{n} \limarrow{n}{\infty} 0
    \]

    and, composing limits with the previous lemma, we thus obtain
    \[
        \cycmes{\kerK}{n} \limarrow{n}{\infty} \cyc{\kerK}
    \]
\end{proof}

Given this property, we will use $\cyc{\kerK}$ to refer to $\cycmes{\kerK}{n}$
for large values of $n$ in this manuscript whenever it is clear that this
value is a measure.

\subsubsection{Basic block of an assembly-level program}\label{sssec:def:bbs}

Code analyzers are meant to analyze sections of straight-line code, that is,
portions of code which do not contain control flow. As such, it is convenient
to split the program into \emph{basic blocks}, that is, portions of
straight-line code linked to other basic blocks to reflect control flow. We
define this notion here formally, to use it soundly in the following chapters
of this manuscript.

\begin{notation}
    For the purposes of this section,
    \begin{itemize}
        \item we formalize a segment of assembly code as a sequence of
            instructions;
        \item we identify an instruction with its address.
    \end{itemize}

    \smallskip{}

    An instruction is said to be a \emph{flow-altering instruction} if this
    instruction may alter the normal control flow of the program. This is
    typically true of jumps (conditional or unconditional), function calls,
    function returns, \ldots

    \smallskip{}

    An address is said to be a \emph{jump site} if any flow-altering
    instruction in the considered sequence may transfer control to this
    address (and this address is not reached through the natural flow of the
    program, \eg{} in the case of a conditional jump).
\end{notation}

\begin{definition}[Basic block decomposition]
    Consider a sequence of assembly code $A$. We note $J_A$ the set of jump
    sites of $A$ and $F_A$ the set of flow-altering instructions of $A$. As
    each element of those sets is the address of an instruction, we note
    $F^+_A$ the set of addresses of instructions \emph{directly following} an
    instruction from $F_A$ ---~note that, as instructions may be longer than
    one byte, it is not sufficient to increment each address from $F_A$ by 1.

    We note $S_A = J_A \cup F^+_A$. We split the instructions from $A$ into
    $BB(A)$, the set of segments that begin either at the beginning of $A$, or
    at instructions from $S_A$ ---~less formally, we split $A$ at each point
    from $S_A$, including each boundary in the following segment.

    The members of $BB(A)$ are the \emph{basic blocks} of $A$: segments of
    code which, by construction, will always be executed as straight-line
    code, and whose execution will always begin from their first instruction.
\end{definition}

\begin{remark}
    This definition gives a direct algorithm to split a segment of assembly
    code into basic blocks, as long as we have access to a semantics of the
    considered assembly that indicates whether an instruction is
    flow-altering, and if so, what its possible jump sites are.
\end{remark}
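
As an illustration only, the following Python sketch implements this
decomposition. It assumes the semantics is available through two hypothetical
oracles, \texttt{is\_flow\_altering} and \texttt{jump\_sites}, and identifies
instructions with their addresses, as in the notation above.

\begin{lstlisting}[language=Python]
def basic_blocks(instrs, is_flow_altering, jump_sites):
    """Split `instrs`, a list of (address, size) pairs in program
    order, into basic blocks, each a list of instruction addresses."""
    splits = set()  # S_A: addresses at which a new block begins
    for addr, size in instrs:
        if is_flow_altering(addr):
            splits.add(addr + size)          # F+_A: next instruction
            splits.update(jump_sites(addr))  # J_A: possible jump sites
    blocks, current = [], []
    for addr, _size in instrs:
        if addr in splits and current:  # boundary: close current block
            blocks.append(current)
            current = []
        current.append(addr)
    if current:
        blocks.append(current)
    return blocks
\end{lstlisting}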