Proof-read up to Foundations (incl)

Théophile Bastian 2024-08-15 18:53:08 +02:00
parent c8c2b2db2a
commit 8d4887cc63
5 changed files with 48 additions and 44 deletions


@@ -1,6 +1,6 @@
 \selectlanguage{french}
 \begin{abstract}
-    Whether it be massive computations distributed over several racks,
+    Whether it be massive computations distributed over several bays,
     computations in constrained environments --~such as embedded systems or
     \emph{edge computing}~-- or attempts to reduce the ecological footprint
     of a frequently used program, many use cases
@@ -43,7 +43,7 @@
 the generated assembly with respect to the microarchitecture of the
 specific microprocessor used to fine-tune it.
-Such an optimisation level requires a very detailed comprehension of both
+Such an optimisation level requires a very detailed understanding of both
 the software and hardware aspects implied, and is most often the realm of
 experts. \emph{Code analyzers}, however, are tools that help lowering the
 expertise threshold required to perform such optimisations by automating


@@ -100,7 +100,7 @@ slow the whole computation.
 \vspace{2em}
 
 In this thesis, we explore the three major aspects that work towards a code
-analyzers' accuracy: a \emph{backend model}, a \emph{frontend model} and a
+analyzer's accuracy: a \emph{backend model}, a \emph{frontend model} and a
 \emph{dependencies model}. We propose contributions to strengthen them, as well
 as to automate the underlying models' synthesis. We focus on \emph{static}
 code analyzers, that derive metrics, including runtime predictions, from an
@@ -124,7 +124,7 @@ tool, akin to \palmed.
 Chapter~\ref{chap:CesASMe} makes an extensive study of the state-of-the-art
 code analyzers' strengths and shortcomings. To this end, we introduce a
-fully-tooled approach in two parts: first, a benchmarks-generation procedure,
+fully-tooled approach in two parts: first, a benchmark-generation procedure,
 yielding thousands of benchmarks relevant in the context of our approach; then,
 a benchmarking harness evaluating code analyzers on these benchmarks. We find
 that most state-of-the-art code analyzers struggle to correctly account for
@@ -154,11 +154,12 @@ we see this commitment as an opportunity to develop methodologies able to model
 these processors.
 
 This is particularly true of \palmed, in \autoref{chap:palmed}, whose goal is
-to model a processor's backend resources without resorting to its hardware
-counters. Our frontend study, in \autoref{chap:frontend}, also follows this
-strategy by focusing on a processor whose hardware counters give little to no
-insight on its frontend. While this goal is less relevant to \staticdeps{}, we
-rely on external libraries to abstract the underlying architecture.
+to model a processor's backend resources without resorting to its
+vendor-specific hardware counters. Our frontend study, in
+\autoref{chap:frontend}, also follows this strategy by focusing on a processor
+whose hardware counters give little to no insight on its frontend. While this
+goal is less relevant to \staticdeps{}, we rely on external libraries to
+abstract the underlying architecture.
 
 \medskip{}


@@ -154,7 +154,7 @@ port for both memory loads and stores.
 In most cases, execution units are \emph{fully pipelined}, meaning that while
 processing a single \uop{} takes multiple cycles, the unit is able to start
-processing a new \uop{} every cycle: multiple \uops{} are then being processed,
+processing a new \uop{} every cycle: multiple \uops{} are thus being processed,
 at different stages, during each cycle, akin to a factory's assembly line.
 
 \smallskip{}
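
The assembly-line picture boils down to simple arithmetic. As a back-of-the-envelope sketch (the latency value below is made up for illustration): a fully pipelined unit with a latency of $\ell$ cycles that accepts one new uop per cycle completes $n$ independent uops in $\ell + (n - 1)$ cycles, instead of $n \cdot \ell$ without pipelining.

    % Hypothetical fully pipelined unit: latency of 3 cycles,
    % one new uop accepted per cycle, 100 independent uops.
    \[
        \underbrace{\ell + (n - 1)}_{\text{pipelined}} = 3 + 99 = 102
        \qquad \text{vs.} \qquad
        \underbrace{n \cdot \ell}_{\text{no pipelining}} = 300
    \]

Whatever its latency, such a unit thus sustains an asymptotic throughput of one uop per cycle.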
@@ -204,11 +204,12 @@ For this reason, many processors are now \emph{out-of-order}, while processors
 issuing \uops{} strictly in their original order are called \emph{in-order}.
 
 Out-of-order microarchitectures feature a \emph{reorder buffer}, from which
 instructions are picked to be issued. The reorder buffer acts as a sliding
-window of microarchitecturally-fixed size over \uops{}, from which the oldest
-\uop{} whose dependencies are satisfied will be executed. Thus, out-of-order
-CPUs are only able to execute operations out of order as long as the
-\uop{} to be executed is not too far ahead from the oldest \uop{} awaiting to
-be issued ---~specifically, not more than the size of the reorder buffer ahead.
+window of microarchitecturally-fixed size over decoded \uops{}, from which the
+oldest \uop{} whose dependencies are satisfied will be executed. Thus,
+out-of-order CPUs are only able to execute operations out of order as long as
+the \uop{} to be executed is not too far ahead from the oldest \uop{} awaiting
+to be issued ---~specifically, not more than the size of the reorder buffer
+ahead.
 
 It is also important to note that out-of-order processors are only out-of-order
 \emph{from a certain point on}: a substantial part of the processor's frontend
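
The sliding-window behaviour can be made concrete in a few lines of C. This is a toy sketch only: the window size, type and function names are all made up, and a real reorder buffer tracks much more state.

    #include <stdbool.h>
    #include <stddef.h>

    /* Toy reorder buffer: a fixed-size window over decoded uops,
     * kept in program order. ROB_SIZE is a hypothetical value. */
    #define ROB_SIZE 8

    typedef struct {
        bool deps_ready; /* all operands of this uop are available */
        bool issued;     /* already sent to an execution unit */
    } uop_t;

    /* Return the index of the oldest ready, non-issued uop, looking at
     * most ROB_SIZE entries ahead of the oldest uop still awaiting
     * issue; return -1 if nothing can issue this cycle (stall). */
    static int pick_next_uop(const uop_t *rob, size_t n, size_t oldest)
    {
        for (size_t i = oldest; i < n && i < oldest + ROB_SIZE; i++)
            if (!rob[i].issued && rob[i].deps_ready)
                return (int)i;
        return -1;
    }

A uop lying more than ROB_SIZE entries ahead of the oldest awaiting uop is never considered, which is exactly the limit stated above.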
@@ -238,13 +239,14 @@ word sizes.
 Some instructions, however, operate on chunks of multiple words at once. These
 instructions are called \emph{vector instructions}, or \emph{SIMD} for Single
 Instruction, Multiple Data. A SIMD ``add'' instruction may, for instance, add
-two chunks of 128 bits, which can for instance be treated each as four integers
+two chunks of 128 bits, treated each as four integers
 of 32 bits bundled together, as illustrated in \autoref{fig:cpu_simd}.
 
 \begin{figure}
     \centering
     \includegraphics[width=0.6\textwidth]{simd.svg}
-    \caption{Example of SIMD add instruction on 128b}\label{fig:cpu_simd}
+    \caption{Example of SIMD $4 \times 32\,\text{bits}$ add instruction on
+    128 bits}\label{fig:cpu_simd}
 \end{figure}
 
 Such instructions present clear efficiency advantages. If the processor is able
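
Concretely, the 4 x 32-bit add of the figure exists as a single x86 instruction (paddd, exposed in C through SSE2 intrinsics). A minimal self-contained sketch:

    #include <emmintrin.h> /* SSE2 intrinsics */
    #include <stdio.h>

    int main(void)
    {
        /* Two 128-bit chunks, each bundling four 32-bit integers. */
        __m128i a = _mm_setr_epi32(1, 2, 3, 4);
        __m128i b = _mm_setr_epi32(10, 20, 30, 40);

        /* A single SIMD instruction performs the four additions. */
        __m128i c = _mm_add_epi32(a, b);

        int out[4];
        _mm_storeu_si128((__m128i *)out, c);
        printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]); /* 11 22 33 44 */
        return 0;
    }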


@@ -38,7 +38,7 @@ analyze a code fragment ---~typically at assembly or binary level~---, and
 provide insights on its performance metrics on a given hardware. Code analyzers
 thus work statically, that is, without executing the code.
 
-\paragraph{Common hypotheses.} Code analyzers operate under a common
+\paragraph{Common hypotheses.} Code analyzers operate under a set of common
 hypotheses, derived from the typical intended usage.
 
 The kernel analyzed is expected to be the body of a loop, or
@@ -101,8 +101,8 @@ than on edge cases.
 As most code analyzers are static, this manuscript largely focuses on static
 analysis. The only dynamic code analyzer we are aware of is \gus{}, described
-more thoroughly in \autoref{sec:sota} later, trading heavily runtime to gain in
-accuracy, especially regarding data dependencies that may not be easily
+more thoroughly in \autoref{sec:sota} later, trading heavily run time to gain
+in accuracy, especially regarding data dependencies that may not be easily
 obtained otherwise.
 
 \paragraph{Input formats used.} The analyzers studied in this manuscript all
@@ -111,12 +111,12 @@ take as input either assembly code, or assembled binaries.
 In the case of assembly code, as for instance with \llvmmca{}, analyzers
 take either a short assembly snippet, treated as straight-line code and
 analyzed as such; or longer pieces of assembly, part or parts of which being
-marked for analysis my surrounding assembly comments.
+marked for analysis by surrounding assembly comments.
 
 In the case of assembled binaries, as all analyzers were run on Linux,
 executables or object files are ELF files. Some analyzers work on sections of
 the file defined by user-provided offsets in the binary, while others require
-the presence of \iaca{} markers around the code portion or portions to be
+the presence of \textit{\iaca{} markers} around the code portion or portions to be
 analyzed. Those markers, introduced by \iaca{} as C-level preprocessor
 statements, consist in the following x86 assembly snippets:
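
For reference, the markers are short macros expanding to inline assembly; the sketch below reproduces them from memory after Intel's iacaMarks.h, so the exact constants and byte sequence should be checked against the original header. The start and end markers move the magic values 111 and 222 into ebx, each followed by a distinctive three-byte pattern that the analyzer scans for in the binary.

    /* Sketch of the IACA markers as C-level preprocessor statements
     * (quoted from memory, after Intel's iacaMarks.h). */
    #define IACA_SSC_MARK(ID)                            \
        __asm__ __volatile__ ("movl $" #ID ", %%ebx\n\t" \
                              ".byte 0x64, 0x67, 0x90"   \
                              : : : "memory")

    #define IACA_START IACA_SSC_MARK(111) /* begin analyzed region */
    #define IACA_END   IACA_SSC_MARK(222) /* end analyzed region */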
@@ -198,11 +198,11 @@ Iterating it takes 106 cycles instead of the expected 100 cycles, as this
 execution is \emph{not} in steady-state, but accounts for the cycles from the
 decoding of the first instruction to the retirement of the last.
 
-The row 7 indicates that each cycle, the frontend can issue at most 3 \uops{}.
-The next two rows are simple ratios. Row 10 is the block's \emph{reverse
-throughput}, which we formalize later in \autoref{sssec:def:rthroughput}, but
-is roughly defined as the number of cycles a single iteration of the kernel
-takes.
+Row 7 indicates that each cycle, the frontend can issue at most 3 \uops{}. The
+next two rows are simple ratios. Row 10 is the block's \emph{reverse
+throughput}, which we will note $\cyc{\kerK}$ and formalize later in
+\autoref{sssec:def:rthroughput}, but is roughly defined as the number of cycles
+a single iteration of the kernel takes.
 
 The next section, \emph{instruction info}, lists data about the instructions
 present.
@@ -227,7 +227,7 @@ which indicates, for each instruction, the timeline of its execution. Here,
 \texttt{D} stands for decode, \texttt{e} for being executed --~in the
 pipeline~--, \texttt{E} for last cycle of its execution --~leaving the
 pipeline~--, \texttt{R} for retiring. When an instruction is decoded and
-waiting to be dispatched to execution, a \texttt{=} is shown.
+waiting to be dispatched to execution, an \texttt{=} is shown.
 
 The identifier at the beginning of each row indicates the kernel iteration
 number, and the instruction within.
@@ -361,7 +361,8 @@ time is measured until the last instruction is issued, not retired.
 Thus, by the pigeon-hole principle, there exists $p \in \nat$ such that
 $\sigma(\kerK) = \sigma(\kerK^{p+1})$. By induction, as each state depends only on the
 previous one, we thus obtain that $(\sigma(\kerK^n))_n$ is periodic of
-period $p$.
+period $p$. As we consider only the execution's steady state, the sequence
+is periodic from rank 0.
 
 As the number of cycles needed to execute $\kerK$ only depend on the
 initial state of the processor, we thus have
@@ -375,8 +376,8 @@ time is measured until the last instruction is issued, not retired.
 and measured in \emph{cycles per iteration}, is also called the
 steady-state execution time of a kernel.
 
-We note $p \in \nat^*$ the period of $\ckn{n+1} - \ckn{n}$ (by the above
-lemma), and define \[
+We note $p = \calP(\kerK) \in \nat^*$ the period of $\ckn{n+1} - \ckn{n}$
+(by the above lemma), and define \[
     \cyc{\kerK} = \dfrac{\ckn{p}}{p}
 \]
 \end{definition}
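
As a concrete instance (cycle counts made up, chosen to match the $\cyc{\kerK} = 1.5$ figure used in the surrounding examples): suppose that, in steady state, successive iterations of $\kerK$ alternately cost 1 and 2 cycles.

    % Hypothetical kernel: iterations cost 1, 2, 1, 2, ... cycles, so
    % C(K^{n+1}) - C(K^n) has period p = 2 and C(K^2) = 1 + 2 = 3.
    \[
        \cyc{\kerK} = \dfrac{\ckn{p}}{p} = \dfrac{\ckn{2}}{2}
                    = \dfrac{3}{2} = 1.5 \text{ cycles per iteration}
    \]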
@@ -396,8 +397,8 @@ $\cyc{\kerK} = 1.5$.
 \begin{remark}
     As $C(\kerK)$ depends on the microarchitecture of the processor considered,
-    the throughput $\cyc{\kerK}$ of a kernel $\kerK$ depends on the processor
-    considered.
+    the throughput $\cyc{\kerK}$ of a kernel $\kerK$ implicitly depends on the
+    processor considered.
 \end{remark}
 
 \medskip
@@ -440,10 +441,10 @@ $\cyc{\kerK} = 1.5$.
 \end{lemma}
 
 \begin{proof}
-    Let $n \in \nat^*$. We note $p \in \nat^*$ the periodicity by the above
-    lemma.
+    Let $n \in \nat^*$ and $p = \calP(\kerK) \in \nat^*$ the periodicity by the
+    above lemma.
 
-    Let $k, r \in \nat^*$ such that $n = kp+r$, $0 < r \leq p$.
+    Let $k, r \in \nat^*$ such that $n = kp+r$, $1 \leq r \leq p$.
 
     \begin{align*}
         \ckn{n} &= k \cdot \ckn{p} + \ckn{r} & \textit{(by lemma)} \\
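
Instantiating the decomposition on the same made-up example ($p = 2$, $\ckn{2} = 3$): taking $n = 7$ gives $k = 3$ and $r = 1$, hence

    % n = kp + r: 7 = 3 * 2 + 1
    \[
        \ckn{7} = 3 \cdot \ckn{2} + \ckn{1} = 9 + \ckn{1}
    \]

so $\ckn{n}/n$ tends to $\ckn{p}/p = 1.5$ as $n$ grows, the residual term $\ckn{r}$ being bounded.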
@@ -561,9 +562,9 @@ of this manuscript.
 \smallskip{}
 
 An instruction is said to be a \emph{flow-altering instruction} if this
-address may alter the normal control flow of the program. This is typically
-true of jumps (conditional or unconditional), function calls, function
-returns, \ldots
+instruction may alter the normal control flow of the program. This is
+typically true of jumps (conditional or unconditional), function calls,
+function returns, \ldots
 
 \smallskip{}


@@ -81,7 +81,7 @@ approach, but also limits it to microarchitectures offering such counters, and
 requires a manual analysis of each microarchitecture to be supported in order
 to find a fitting set of blocking instructions. Although we have no theoretical
 guarantee of the existence of such instructions, this should never be a
-problem, as all pragmatic microarchitecture design will yield to their
+problem, as all pragmatic microarchitecture design will lead to their
 existence.
 
 \subsection{Code analyzers and their models}
@@ -89,8 +89,8 @@ existence.
 Going further than data extraction at the individual instruction level,
 academics and industrials interested in this domain now mostly work on
 code analyzers, as described in \autoref{ssec:code_analyzers} above. Each such
-tool embeds a model --~or collection of models~-- on which to base its
-inference, and whose definition, embedded data and obtention method varies from
+tool embeds a model --~or collection of models~-- on which its inference is
+based, and whose definition, embedded data and obtention method varies from
 tool to tool. These tools often use, to some extent, the data on individual
 instructions obtained either from the manufacturer or the third-party efforts
 mentioned above.
@@ -142,7 +142,7 @@ its context. This approach, in our experiments, was significantly less accurate
 than those not based on machine learning. In our opinion, its main issue,
 however, is to be a \textit{black-box model}: given a kernel, it is only able
 to predict its reverse throughput. Doing so, even with perfect accuracy, does
-not explain the source of a performance problem: the model is unable to help in
+not explain the source of a performance problem: the model is unable to help
 detecting which resource is the performance bottleneck of a kernel; in other
 words, it quantifies a potential issue, but does not help in \emph{explaining}
 it --~or debugging it.
@@ -171,4 +171,4 @@ this manuscript), with results comparable with \llvmmca{}. Its source code
 --~under free software license~-- is self-contained and reasonably concise
 (about 2,000 lines of Python for the main part), making it a good basis and
 baseline for experiments. It is, however, closely tied by design to Intel
-microarchitectures, or microarchitectures very alike to Intel's ones.
+microarchitectures, or microarchitectures very close to Intel's ones.