Proof-read up to Foundations (incl)

parent c8c2b2db2a
commit 8d4887cc63
5 changed files with 48 additions and 44 deletions
@@ -1,6 +1,6 @@
 \selectlanguage{french}
 \begin{abstract}
-Qu'il s'agisse de calculs massifs distribués sur plusieurs racks, de
+Qu'il s'agisse de calculs massifs distribués sur plusieurs baies, de
 calculs en environnement contraint --~comme de l'embarqué ou de
 l'\emph{edge computing}~-- ou encore de tentatives de réduire l'empreinte
 écologique d'un programme fréquemment utilisé, de nombreux cas d'usage
@@ -43,7 +43,7 @@
 the generated assembly with respect to the microarchitecture of the
 specific microprocessor used to fine-tune it.

-Such an optimisation level requires a very detailed comprehension of both
+Such an optimisation level requires a very detailed understanding of both
 the software and hardware aspects implied, and is most often the realm of
 experts. \emph{Code analyzers}, however, are tools that help lowering the
 expertise threshold required to perform such optimisations by automating
@@ -100,7 +100,7 @@ slow the whole computation.
 \vspace{2em}

 In this thesis, we explore the three major aspects that work towards a code
-analyzers' accuracy: a \emph{backend model}, a \emph{frontend model} and a
+analyzer's accuracy: a \emph{backend model}, a \emph{frontend model} and a
 \emph{dependencies model}. We propose contributions to strengthen them, as well
 as to automate the underlying models' synthesis. We focus on \emph{static}
 code analyzers, that derive metrics, including runtime predictions, from an
@@ -124,7 +124,7 @@ tool, akin to \palmed.

 Chapter~\ref{chap:CesASMe} makes an extensive study of the state-of-the-art
 code analyzers' strengths and shortcomings. To this end, we introduce a
-fully-tooled approach in two parts: first, a benchmarks-generation procedure,
+fully-tooled approach in two parts: first, a benchmark-generation procedure,
 yielding thousands of benchmarks relevant in the context of our approach; then,
 a benchmarking harness evaluating code analyzers on these benchmarks. We find
 that most state-of-the-art code analyzers struggle to correctly account for
@@ -154,11 +154,12 @@ we see this commitment as an opportunity to develop methodologies able to model
 these processors.

 This is particularly true of \palmed, in \autoref{chap:palmed}, whose goal is
-to model a processor's backend resources without resorting to its hardware
-counters. Our frontend study, in \autoref{chap:frontend}, also follows this
-strategy by focusing on a processor whose hardware counters give little to no
-insight on its frontend. While this goal is less relevant to \staticdeps{}, we
-rely on external libraries to abstract the underlying architecture.
+to model a processor's backend resources without resorting to its
+vendor-specific hardware counters. Our frontend study, in
+\autoref{chap:frontend}, also follows this strategy by focusing on a processor
+whose hardware counters give little to no insight on its frontend. While this
+goal is less relevant to \staticdeps{}, we rely on external libraries to
+abstract the underlying architecture.

 \medskip{}

@@ -154,7 +154,7 @@ port for both memory loads and stores.

 In most cases, execution units are \emph{fully pipelined}, meaning that while
 processing a single \uop{} takes multiple cycles, the unit is able to start
-processing a new \uop{} every cycle: multiple \uops{} are then being processed,
+processing a new \uop{} every cycle: multiple \uops{} are thus being processed,
 at different stages, during each cycle, akin to a factory's assembly line.

 \smallskip{}
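The pipelining behaviour described in this hunk can be sketched numerically: a fully pipelined unit of latency $L$ cycles that accepts one new \uop{} each cycle finishes $n$ \uops{} in $L + n - 1$ cycles, versus $n \cdot L$ without pipelining. A minimal sketch (function names are hypothetical, not from the thesis):

```python
def pipelined_cycles(n_uops: int, latency: int) -> int:
    """Cycles for a fully pipelined unit: one new uop enters each cycle,
    and each uop takes `latency` cycles to traverse the unit."""
    if n_uops == 0:
        return 0
    # The first uop finishes after `latency` cycles; every following uop
    # finishes one cycle later, assembly-line style.
    return latency + (n_uops - 1)

def unpipelined_cycles(n_uops: int, latency: int) -> int:
    """Cycles if the unit could only hold one uop at a time."""
    return n_uops * latency

print(pipelined_cycles(100, 3))    # 102
print(unpipelined_cycles(100, 3))  # 300
```

With a latency of 3 cycles, pipelining brings 100 \uops{} from 300 cycles down to 102: the throughput approaches one \uop{} per cycle as $n$ grows.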
@@ -204,11 +204,12 @@ For this reason, many processors are now \emph{out-of-order}, while processors
 issuing \uops{} strictly in their original order are called \emph{in-order}.
 Out-of-order microarchitectures feature a \emph{reorder buffer}, from which
 instructions are picked to be issued. The reorder buffer acts as a sliding
-window of microarchitecturally-fixed size over \uops{}, from which the oldest
-\uop{} whose dependencies are satisfied will be executed. Thus, out-of-order
-CPUs are only able to execute operations out of order as long as the
-\uop{} to be executed is not too far ahead from the oldest \uop{} awaiting to
-be issued ---~specifically, not more than the size of the reorder buffer ahead.
+window of microarchitecturally-fixed size over decoded \uops{}, from which the
+oldest \uop{} whose dependencies are satisfied will be executed. Thus,
+out-of-order CPUs are only able to execute operations out of order as long as
+the \uop{} to be executed is not too far ahead from the oldest \uop{} awaiting
+to be issued ---~specifically, not more than the size of the reorder buffer
+ahead.

 It is also important to note that out-of-order processors are only out-of-order
 \emph{from a certain point on}: a substantial part of the processor's frontend
@@ -238,13 +239,14 @@ word sizes.
 Some instructions, however, operate on chunks of multiple words at once. These
 instructions are called \emph{vector instructions}, or \emph{SIMD} for Single
 Instruction, Multiple Data. A SIMD ``add'' instruction may, for instance, add
-two chunks of 128 bits, which can for instance be treated each as four integers
+two chunks of 128 bits, treated each as four integers
 of 32 bits bundled together, as illustrated in \autoref{fig:cpu_simd}.

 \begin{figure}
 \centering
 \includegraphics[width=0.6\textwidth]{simd.svg}
-\caption{Example of SIMD add instruction on 128b}\label{fig:cpu_simd}
+\caption{Example of SIMD $4 \times 32\,\text{bits}$ add instruction on
+128 bits}\label{fig:cpu_simd}
 \end{figure}

 Such instructions present clear efficiency advantages. If the processor is able
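The $4 \times 32$-bit SIMD add from the figure can be emulated to make the lane semantics concrete: each 128-bit operand is split into four 32-bit lanes that are added independently, so a carry out of one lane never spills into the next. A sketch with hypothetical helper names:

```python
MASK32 = (1 << 32) - 1

def pack(values):
    """Bundle four 32-bit unsigned integers into one 128-bit value
    (lane 0 in the low-order bits)."""
    return sum((v & MASK32) << (32 * i) for i, v in enumerate(values))

def simd_add_4x32(a: int, b: int) -> int:
    """Lane-wise addition of two 128-bit values treated as four packed
    32-bit unsigned integers; each lane wraps around independently."""
    result = 0
    for lane in range(4):
        shift = 32 * lane
        lane_sum = (((a >> shift) & MASK32) + ((b >> shift) & MASK32)) & MASK32
        result |= lane_sum << shift
    return result

x = simd_add_4x32(pack([1, 2, 3, 4]), pack([10, 20, 30, 40]))
print(x == pack([11, 22, 33, 44]))  # True
```

Note the overflow behaviour: adding 1 to a lane holding `0xFFFFFFFF` wraps that lane to 0 without touching its neighbour, which is exactly what distinguishes a SIMD add from a single 128-bit scalar addition.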
@@ -38,7 +38,7 @@ analyze a code fragment ---~typically at assembly or binary level~---, and
 provide insights on its performance metrics on a given hardware. Code analyzers
 thus work statically, that is, without executing the code.

-\paragraph{Common hypotheses.} Code analyzers operate under a common
+\paragraph{Common hypotheses.} Code analyzers operate under a set of common
 hypotheses, derived from the typical intended usage.

 The kernel analyzed is expected to be the body of a loop, or
@@ -101,8 +101,8 @@ than on edge cases.

 As most code analyzers are static, this manuscript largely focuses on static
 analysis. The only dynamic code analyzer we are aware of is \gus{}, described
-more thoroughly in \autoref{sec:sota} later, trading heavily runtime to gain in
-accuracy, especially regarding data dependencies that may not be easily
+more thoroughly in \autoref{sec:sota} later, trading heavily run time to gain
+in accuracy, especially regarding data dependencies that may not be easily
 obtained otherwise.

 \paragraph{Input formats used.} The analyzers studied in this manuscript all
@@ -111,12 +111,12 @@ take as input either assembly code, or assembled binaries.
 In the case of assembly code, as for instance with \llvmmca{}, analyzers
 take either a short assembly snippet, treated as straight-line code and
 analyzed as such; or longer pieces of assembly, part or parts of which being
-marked for analysis my surrounding assembly comments.
+marked for analysis by surrounding assembly comments.

 In the case of assembled binaries, as all analyzers were run on Linux,
 executables or object files are ELF files. Some analyzers work on sections of
 the file defined by user-provided offsets in the binary, while others require
-the presence of \iaca{} markers around the code portion or portions to be
+the presence of \textit{\iaca{} markers} around the code portion or portions to be
 analyzed. Those markers, introduced by \iaca{} as C-level preprocessor
 statements, consist in the following x86 assembly snippets:
@@ -198,11 +198,11 @@ Iterating it takes 106 cycles instead of the expected 100 cycles, as this
 execution is \emph{not} in steady-state, but accounts for the cycles from the
 decoding of the first instruction to the retirement of the last.

-The row 7 indicates that each cycle, the frontend can issue at most 3 \uops{}.
-The next two rows are simple ratios. Row 10 is the block's \emph{reverse
-throughput}, which we formalize later in \autoref{sssec:def:rthroughput}, but
-is roughly defined as the number of cycles a single iteration of the kernel
-takes.
+Row 7 indicates that each cycle, the frontend can issue at most 3 \uops{}. The
+next two rows are simple ratios. Row 10 is the block's \emph{reverse
+throughput}, which we will note $\cyc{\kerK}$ and formalize later in
+\autoref{sssec:def:rthroughput}, but is roughly defined as the number of cycles
+a single iteration of the kernel takes.

 The next section, \emph{instruction info}, lists data about the instructions
 present.
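The issue-width figure in this hunk directly yields a lower bound on execution time: a frontend issuing at most 3 \uops{} per cycle needs at least $\lceil n / 3 \rceil$ cycles for $n$ \uops{}, whatever the backend does. A one-liner sketch (the function name is hypothetical):

```python
from math import ceil

def frontend_bound(n_uops: int, issue_width: int = 3) -> int:
    """Minimum cycles the frontend needs to issue `n_uops` when it can
    issue at most `issue_width` uops per cycle."""
    return ceil(n_uops / issue_width)

print(frontend_bound(7))  # 3: cycles issuing 3, 3 and 1 uops
```

When this bound exceeds the backend's reverse throughput, the kernel is frontend-bound.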
@@ -227,7 +227,7 @@ which indicates, for each instruction, the timeline of its execution. Here,
 \texttt{D} stands for decode, \texttt{e} for being executed --~in the
 pipeline~--, \texttt{E} for last cycle of its execution --~leaving the
 pipeline~--, \texttt{R} for retiring. When an instruction is decoded and
-waiting to be dispatched to execution, a \texttt{=} is shown.
+waiting to be dispatched to execution, an \texttt{=} is shown.

 The identifier at the beginning of each row indicates the kernel iteration
 number, and the instruction within.
@@ -361,7 +361,8 @@ time is measured until the last instruction is issued, not retired.
 Thus, by the pigeon-hole principle, there exists $p \in \nat$ such that
 $\sigma(\kerK) = \sigma(\kerK^{p+1})$. By induction, as each state depends only on the
 previous one, we thus obtain that $(\sigma(\kerK^n))_n$ is periodic of
-period $p$.
+period $p$. As we consider only the execution's steady state, the sequence
+is periodic from rank 0.

 As the number of cycles needed to execute $\kerK$ only depend on the
 initial state of the processor, we thus have
|
@ -375,8 +376,8 @@ time is measured until the last instruction is issued, not retired.
|
||||||
and measured in \emph{cycles per iteration}, is also called the
|
and measured in \emph{cycles per iteration}, is also called the
|
||||||
steady-state execution time of a kernel.
|
steady-state execution time of a kernel.
|
||||||
|
|
||||||
We note $p \in \nat^*$ the period of $\ckn{n+1} - \ckn{n}$ (by the above
|
We note $p = \calP(\kerK) \in \nat^*$ the period of $\ckn{n+1} - \ckn{n}$
|
||||||
lemma), and define \[
|
(by the above lemma), and define \[
|
||||||
\cyc{\kerK} = \dfrac{\ckn{p}}{p}
|
\cyc{\kerK} = \dfrac{\ckn{p}}{p}
|
||||||
\]
|
\]
|
||||||
\end{definition}
|
\end{definition}
|
||||||
|
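The definition in this hunk is computable: given measured values of $C(\kerK^n)$ for increasing $n$, one can find the period $p$ of the differences $C(\kerK^{n+1}) - C(\kerK^n)$ and return $C(\kerK^p)/p$. A sketch under the assumption that the observed prefix covers at least two full periods (function names are illustrative, not from the thesis):

```python
from fractions import Fraction

def period_of(diffs):
    """Smallest p such that diffs is periodic of period p; requires the
    observed prefix to span at least two periods."""
    for p in range(1, len(diffs) // 2 + 1):
        if all(diffs[i] == diffs[i + p] for i in range(len(diffs) - p)):
            return p
    raise ValueError("no period found in the observed prefix")

def cyc(c):
    """c[n-1] = C(K^n), the cycles to run n unrolled kernel iterations.
    Returns Cyc(K) = C(K^p) / p, p being the period of the differences."""
    diffs = [c[i + 1] - c[i] for i in range(len(c) - 1)]
    p = period_of(diffs)
    return Fraction(c[p - 1], p)

# A kernel whose iterations alternately take 2 and 1 cycles:
print(cyc([2, 3, 5, 6, 8, 9, 11]))  # 3/2
```

The example reproduces the manuscript's $\cyc{\kerK} = 1.5$ case: the differences alternate $1, 2, 1, 2, \ldots$, so $p = 2$ and $\cyc{\kerK} = C(\kerK^2)/2 = 3/2$.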
@@ -396,8 +397,8 @@ $\cyc{\kerK} = 1.5$.

 \begin{remark}
 As $C(\kerK)$ depends on the microarchitecture of the processor considered,
-the throughput $\cyc{\kerK}$ of a kernel $\kerK$ depends on the processor
-considered.
+the throughput $\cyc{\kerK}$ of a kernel $\kerK$ implicitly depends on the
+processor considered.
 \end{remark}

 \medskip
@@ -440,10 +441,10 @@ $\cyc{\kerK} = 1.5$.
 \end{lemma}

 \begin{proof}
-Let $n \in \nat^*$. We note $p \in \nat^*$ the periodicity by the above
-lemma.
+Let $n \in \nat^*$ and $p = \calP(\kerK) \in \nat^*$ the periodicity by the
+above lemma.

-Let $k, r \in \nat^*$ such that $n = kp+r$, $0 < r \leq p$.
+Let $k, r \in \nat^*$ such that $n = kp+r$, $1 \leq r \leq p$.

 \begin{align*}
 \ckn{n} &= k \cdot \ckn{p} + \ckn{r} & \textit{(by lemma)} \\
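The decomposition $\ckn{n} = k \cdot \ckn{p} + \ckn{r}$ in this proof can be checked on a concrete instance; the values below are hypothetical, chosen to match the manuscript's $\cyc{\kerK} = 1.5$ example with $\calP(\kerK) = 2$, $\ckn{1} = 2$ and $\ckn{2} = 3$. For $n = 5$, write $n = kp + r$ with $k = 2$ and $r = 1$:

```latex
\ckn{5} = k \cdot \ckn{p} + \ckn{r}
        = 2 \cdot \ckn{2} + \ckn{1}
        = 2 \cdot 3 + 2
        = 8
```

which agrees with the differences $\ckn{n+1} - \ckn{n}$ alternating between $1$ and $2$ cycles: $2, 3, 5, 6, 8, \ldots$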
@@ -561,9 +562,9 @@ of this manuscript.
 \smallskip{}

 An instruction is said to be a \emph{flow-altering instruction} if this
-address may alter the normal control flow of the program. This is typically
-true of jumps (conditional or unconditional), function calls, function
-returns, \ldots
+instruction may alter the normal control flow of the program. This is
+typically true of jumps (conditional or unconditional), function calls,
+function returns, \ldots

 \smallskip{}

@@ -81,7 +81,7 @@ approach, but also limits it to microarchitectures offering such counters, and
 requires a manual analysis of each microarchitecture to be supported in order
 to find a fitting set of blocking instructions. Although we have no theoretical
 guarantee of the existence of such instructions, this should never be a
-problem, as all pragmatic microarchitecture design will yield to their
+problem, as all pragmatic microarchitecture design will lead to their
 existence.

 \subsection{Code analyzers and their models}
@@ -89,8 +89,8 @@ existence.
 Going further than data extraction at the individual instruction level,
 academics and industrials interested in this domain now mostly work on
 code analyzers, as described in \autoref{ssec:code_analyzers} above. Each such
-tool embeds a model --~or collection of models~-- on which to base its
-inference, and whose definition, embedded data and obtention method varies from
+tool embeds a model --~or collection of models~-- on which its inference is
+based, and whose definition, embedded data and obtention method varies from
 tool to tool. These tools often use, to some extent, the data on individual
 instructions obtained either from the manufacturer or the third-party efforts
 mentioned above.
@@ -142,7 +142,7 @@ its context. This approach, in our experiments, was significantly less accurate
 than those not based on machine learning. In our opinion, its main issue,
 however, is to be a \textit{black-box model}: given a kernel, it is only able
 to predict its reverse throughput. Doing so, even with perfect accuracy, does
-not explain the source of a performance problem: the model is unable to help in
+not explain the source of a performance problem: the model is unable to help
 detecting which resource is the performance bottleneck of a kernel; in other
 words, it quantifies a potential issue, but does not help in \emph{explaining}
 it --~or debugging it.
@@ -171,4 +171,4 @@ this manuscript), with results comparable with \llvmmca{}. Its source code
 --~under free software license~-- is self-contained and reasonably concise
 (about 2,000 lines of Python for the main part), making it a good basis and
 baseline for experiments. It is, however, closely tied by design to Intel
-microarchitectures, or microarchitectures very alike to Intel's ones.
+microarchitectures, or microarchitectures very close to Intel's ones.