Proof-read up to Foundations (incl)

commit 8d4887cc63 (parent c8c2b2db2a)
5 changed files with 48 additions and 44 deletions

@@ -1,6 +1,6 @@
 \selectlanguage{french}
 \begin{abstract}
-Qu'il s'agisse de calculs massifs distribués sur plusieurs racks, de
+Qu'il s'agisse de calculs massifs distribués sur plusieurs baies, de
 calculs en environnement contraint --~comme de l'embarqué ou de
 l'\emph{edge computing}~-- ou encore de tentatives de réduire l'empreinte
 écologique d'un programme fréquemment utilisé, de nombreux cas d'usage
@@ -43,7 +43,7 @@
 the generated assembly with respect to the microarchitecture of the
 specific microprocessor used to fine-tune it.
 
-Such an optimisation level requires a very detailed comprehension of both
+Such an optimisation level requires a very detailed understanding of both
 the software and hardware aspects implied, and is most often the realm of
 experts. \emph{Code analyzers}, however, are tools that help lowering the
 expertise threshold required to perform such optimisations by automating
@@ -100,7 +100,7 @@ slow the whole computation.
 \vspace{2em}
 
 In this thesis, we explore the three major aspects that work towards a code
-analyzers' accuracy: a \emph{backend model}, a \emph{frontend model} and a
+analyzer's accuracy: a \emph{backend model}, a \emph{frontend model} and a
 \emph{dependencies model}. We propose contributions to strengthen them, as well
 as to automate the underlying models' synthesis. We focus on \emph{static}
 code analyzers, that derive metrics, including runtime predictions, from an
@@ -124,7 +124,7 @@ tool, akin to \palmed.
 
 Chapter~\ref{chap:CesASMe} makes an extensive study of the state-of-the-art
 code analyzers' strengths and shortcomings. To this end, we introduce a
-fully-tooled approach in two parts: first, a benchmarks-generation procedure,
+fully-tooled approach in two parts: first, a benchmark-generation procedure,
 yielding thousands of benchmarks relevant in the context of our approach; then,
 a benchmarking harness evaluating code analyzers on these benchmarks. We find
 that most state-of-the-art code analyzers struggle to correctly account for
@@ -154,11 +154,12 @@ we see this commitment as an opportunity to develop methodologies able to model
 these processors.
 
 This is particularly true of \palmed, in \autoref{chap:palmed}, whose goal is
-to model a processor's backend resources without resorting to its hardware
-counters. Our frontend study, in \autoref{chap:frontend}, also follows this
-strategy by focusing on a processor whose hardware counters give little to no
-insight on its frontend. While this goal is less relevant to \staticdeps{}, we
-rely on external libraries to abstract the underlying architecture.
+to model a processor's backend resources without resorting to its
+vendor-specific hardware counters. Our frontend study, in
+\autoref{chap:frontend}, also follows this strategy by focusing on a processor
+whose hardware counters give little to no insight on its frontend. While this
+goal is less relevant to \staticdeps{}, we rely on external libraries to
+abstract the underlying architecture.
 
 \medskip{}
 
@@ -154,7 +154,7 @@ port for both memory loads and stores.
 
 In most cases, execution units are \emph{fully pipelined}, meaning that while
 processing a single \uop{} takes multiple cycles, the unit is able to start
-processing a new \uop{} every cycle: multiple \uops{} are then being processed,
+processing a new \uop{} every cycle: multiple \uops{} are thus being processed,
 at different stages, during each cycle, akin to a factory's assembly line.
 
 \smallskip{}
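The hunk above describes fully pipelined execution units: a new µop can start every cycle even though each one takes several cycles to complete. As a quick illustrative sketch (not part of the diffed thesis; function names and numbers are made up), the cycle counts for `n` independent µops on such a unit versus an unpipelined one are:

```python
def pipelined_cycles(n_uops: int, latency: int) -> int:
    """Fully pipelined unit: one new uop starts each cycle, so the
    last of n independent uops starts at cycle n-1 and finishes
    `latency` cycles later."""
    return latency + (n_uops - 1)

def unpipelined_cycles(n_uops: int, latency: int) -> int:
    """Unpipelined unit: each uop must fully complete before the
    next one can start."""
    return n_uops * latency
```

For instance, 10 independent 4-cycle µops take 13 cycles pipelined but 40 cycles unpipelined, which is the factory-assembly-line effect the text alludes to.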
@@ -204,11 +204,12 @@ For this reason, many processors are now \emph{out-of-order}, while processors
 issuing \uops{} strictly in their original order are called \emph{in-order}.
 Out-of-order microarchitectures feature a \emph{reorder buffer}, from which
 instructions are picked to be issued. The reorder buffer acts as a sliding
-window of microarchitecturally-fixed size over \uops{}, from which the oldest
-\uop{} whose dependencies are satisfied will be executed. Thus, out-of-order
-CPUs are only able to execute operations out of order as long as the
-\uop{} to be executed is not too far ahead from the oldest \uop{} awaiting to
-be issued ---~specifically, not more than the size of the reorder buffer ahead.
+window of microarchitecturally-fixed size over decoded \uops{}, from which the
+oldest \uop{} whose dependencies are satisfied will be executed. Thus,
+out-of-order CPUs are only able to execute operations out of order as long as
+the \uop{} to be executed is not too far ahead from the oldest \uop{} awaiting
+to be issued ---~specifically, not more than the size of the reorder buffer
+ahead.
 
 It is also important to note that out-of-order processors are only out-of-order
 \emph{from a certain point on}: a substantial part of the processor's frontend
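The reorder-buffer paragraph in the hunk above can be sketched as a toy scheduler (an illustration under simplifying assumptions, not the thesis's processor model): each cycle, the oldest ready µop among the `window` oldest pending ones issues; a window of 1 degenerates to in-order issue.

```python
def issue_order(ready, window):
    """Toy sliding-window issue model. ready[i] is the cycle at
    which uop i's dependencies are satisfied. Each cycle, issue
    the oldest ready, not-yet-issued uop among the `window` oldest
    pending uops (the reorder buffer); otherwise stall."""
    n, cycle, issued, order = len(ready), 0, set(), []
    while len(order) < n:
        pending = [i for i in range(n) if i not in issued][:window]
        for i in pending:
            if ready[i] <= cycle:
                issued.add(i)
                order.append(i)
                break
        cycle += 1
    return order
```

With `ready = [5, 0, 0]`, a window of 3 lets µops 1 and 2 overtake the stalled µop 0, while a window of 1 forces strict program order, matching the text's point that reordering only happens within the buffer.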
@@ -238,13 +239,14 @@ word sizes.
 Some instructions, however, operate on chunks of multiple words at once. These
 instructions are called \emph{vector instructions}, or \emph{SIMD} for Single
 Instruction, Multiple Data. A SIMD ``add'' instruction may, for instance, add
-two chunks of 128 bits, which can for instance be treated each as four integers
+two chunks of 128 bits, treated each as four integers
 of 32 bits bundled together, as illustrated in \autoref{fig:cpu_simd}.
 
 \begin{figure}
 \centering
 \includegraphics[width=0.6\textwidth]{simd.svg}
-\caption{Example of SIMD add instruction on 128b}\label{fig:cpu_simd}
+\caption{Example of SIMD $4 \times 32\,\text{bits}$ add instruction on
+128 bits}\label{fig:cpu_simd}
 \end{figure}
 
 Such instructions present clear efficiency advantages. If the processor is able
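The lane-wise semantics of the 4×32-bit packed add discussed in the hunk above can be emulated in a few lines (an illustrative sketch; real SIMD hardware does this in one instruction): each 32-bit lane is added independently and wraps around, with carries never crossing lane boundaries.

```python
MASK32 = (1 << 32) - 1

def simd_add_4x32(a: int, b: int) -> int:
    """Emulate a packed add of two 128-bit values seen as four
    independent 32-bit lanes: each lane is added modulo 2**32,
    and no carry propagates into the next lane."""
    out = 0
    for lane in range(4):
        shift = 32 * lane
        la = (a >> shift) & MASK32
        lb = (b >> shift) & MASK32
        out |= ((la + lb) & MASK32) << shift
    return out
```

Note that adding `0xFFFFFFFF` and `1` in lane 0 yields `0` with no carry into lane 1, unlike a plain 128-bit integer addition.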
@@ -38,7 +38,7 @@ analyze a code fragment ---~typically at assembly or binary level~---, and
 provide insights on its performance metrics on a given hardware. Code analyzers
 thus work statically, that is, without executing the code.
 
-\paragraph{Common hypotheses.} Code analyzers operate under a common
+\paragraph{Common hypotheses.} Code analyzers operate under a set of common
 hypotheses, derived from the typical intended usage.
 
 The kernel analyzed is expected to be the body of a loop, or
@@ -101,8 +101,8 @@ than on edge cases.
 
 As most code analyzers are static, this manuscript largely focuses on static
 analysis. The only dynamic code analyzer we are aware of is \gus{}, described
-more thoroughly in \autoref{sec:sota} later, trading heavily runtime to gain in
-accuracy, especially regarding data dependencies that may not be easily
+more thoroughly in \autoref{sec:sota} later, trading heavily run time to gain
+in accuracy, especially regarding data dependencies that may not be easily
 obtained otherwise.
 
 \paragraph{Input formats used.} The analyzers studied in this manuscript all
@@ -111,12 +111,12 @@ take as input either assembly code, or assembled binaries.
 In the case of assembly code, as for instance with \llvmmca{}, analyzers
 take either a short assembly snippet, treated as straight-line code and
 analyzed as such; or longer pieces of assembly, part or parts of which being
-marked for analysis my surrounding assembly comments.
+marked for analysis by surrounding assembly comments.
 
 In the case of assembled binaries, as all analyzers were run on Linux,
 executables or object files are ELF files. Some analyzers work on sections of
 the file defined by user-provided offsets in the binary, while others require
-the presence of \iaca{} markers around the code portion or portions to be
+the presence of \textit{\iaca{} markers} around the code portion or portions to be
 analyzed. Those markers, introduced by \iaca{} as C-level preprocessor
 statements, consist in the following x86 assembly snippets:
 
@@ -198,11 +198,11 @@ Iterating it takes 106 cycles instead of the expected 100 cycles, as this
 execution is \emph{not} in steady-state, but accounts for the cycles from the
 decoding of the first instruction to the retirement of the last.
 
-The row 7 indicates that each cycle, the frontend can issue at most 3 \uops{}.
-The next two rows are simple ratios. Row 10 is the block's \emph{reverse
-throughput}, which we formalize later in \autoref{sssec:def:rthroughput}, but
-is roughly defined as the number of cycles a single iteration of the kernel
-takes.
+Row 7 indicates that each cycle, the frontend can issue at most 3 \uops{}. The
+next two rows are simple ratios. Row 10 is the block's \emph{reverse
+throughput}, which we will note $\cyc{\kerK}$ and formalize later in
+\autoref{sssec:def:rthroughput}, but is roughly defined as the number of cycles
+a single iteration of the kernel takes.
 
 The next section, \emph{instruction info}, lists data about the instructions
 present.
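The issue-width row discussed in the hunk above directly bounds the reverse throughput: with at most 3 µops issued per cycle, a kernel of `n` µops cannot average fewer than `n / 3` cycles per iteration. A minimal sketch of that bound (the issue width of 3 is taken from the example output; the function name is made up):

```python
def frontend_issue_cycles(n_uops: int, issue_width: int = 3) -> float:
    """Lower bound on steady-state cycles per kernel iteration
    imposed by the frontend alone: at most `issue_width` uops
    issue per cycle, so n_uops need at least n_uops / issue_width
    cycles on average (fractional, since successive iterations
    overlap in steady state)."""
    return n_uops / issue_width
```

The actual reverse throughput is at least this value, and may be larger if a backend resource is the bottleneck instead.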
@@ -227,7 +227,7 @@ which indicates, for each instruction, the timeline of its execution. Here,
 \texttt{D} stands for decode, \texttt{e} for being executed --~in the
 pipeline~--, \texttt{E} for last cycle of its execution --~leaving the
 pipeline~--, \texttt{R} for retiring. When an instruction is decoded and
-waiting to be dispatched to execution, a \texttt{=} is shown.
+waiting to be dispatched to execution, an \texttt{=} is shown.
 
 The identifier at the beginning of each row indicates the kernel iteration
 number, and the instruction within.
@@ -361,7 +361,8 @@ time is measured until the last instruction is issued, not retired.
 Thus, by the pigeon-hole principle, there exists $p \in \nat$ such that
 $\sigma(\kerK) = \sigma(\kerK^{p+1})$. By induction, as each state depends only on the
 previous one, we thus obtain that $(\sigma(\kerK^n))_n$ is periodic of
-period $p$.
+period $p$. As we consider only the execution's steady state, the sequence
+is periodic from rank 0.
 
 As the number of cycles needed to execute $\kerK$ only depend on the
 initial state of the processor, we thus have
@@ -375,8 +376,8 @@ time is measured until the last instruction is issued, not retired.
 and measured in \emph{cycles per iteration}, is also called the
 steady-state execution time of a kernel.
 
-We note $p \in \nat^*$ the period of $\ckn{n+1} - \ckn{n}$ (by the above
-lemma), and define \[
+We note $p = \calP(\kerK) \in \nat^*$ the period of $\ckn{n+1} - \ckn{n}$
+(by the above lemma), and define \[
     \cyc{\kerK} = \dfrac{\ckn{p}}{p}
 \]
 \end{definition}
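The definition edited in the hunk above, $\cyc{\kerK} = \ckn{p}/p$ with $p$ the period of the successive differences $\ckn{n+1} - \ckn{n}$, can be sketched numerically (an illustration of the definition, assuming the input sequence is long enough and periodic from rank 0, as the diff's added sentence states):

```python
def find_period(diffs):
    """Smallest p >= 1 such that the sequence repeats with
    period p (assumed periodic from rank 0)."""
    for p in range(1, len(diffs)):
        if all(diffs[i] == diffs[i % p] for i in range(len(diffs))):
            return p
    return len(diffs)

def reverse_throughput(cycles):
    """cycles[i] plays the role of C(K^(i+1)): cycles to run i+1
    kernel iterations. Returns C(K^p)/p for p the period of the
    successive differences (with C(K^0) = 0)."""
    diffs = [cycles[0]] + [cycles[i] - cycles[i - 1]
                           for i in range(1, len(cycles))]
    p = find_period(diffs)
    return cycles[p - 1] / p
```

A sequence such as `[2, 3, 5, 6, 8, 9]` has period-2 differences `2, 1, ...` and yields a reverse throughput of 1.5 cycles per iteration, matching the $\cyc{\kerK} = 1.5$ example appearing later in the diffed chapter.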
@@ -396,8 +397,8 @@ $\cyc{\kerK} = 1.5$.
 
 \begin{remark}
 As $C(\kerK)$ depends on the microarchitecture of the processor considered,
-the throughput $\cyc{\kerK}$ of a kernel $\kerK$ depends on the processor
-considered.
+the throughput $\cyc{\kerK}$ of a kernel $\kerK$ implicitly depends on the
+processor considered.
 \end{remark}
 
 \medskip
@@ -440,10 +441,10 @@ $\cyc{\kerK} = 1.5$.
 \end{lemma}
 
 \begin{proof}
-Let $n \in \nat^*$. We note $p \in \nat^*$ the periodicity by the above
-lemma.
+Let $n \in \nat^*$ and $p = \calP(\kerK) \in \nat^*$ the periodicity by the
+above lemma.
 
-Let $k, r \in \nat^*$ such that $n = kp+r$, $0 < r \leq p$.
+Let $k, r \in \nat^*$ such that $n = kp+r$, $1 \leq r \leq p$.
 
 \begin{align*}
 \ckn{n} &= k \cdot \ckn{p} + \ckn{r} & \textit{(by lemma)} \\
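The proof edited above uses the slightly unusual Euclidean decomposition $n = kp + r$ with $1 \leq r \leq p$ (rather than $0 \leq r < p$), so that $\ckn{r}$ is always defined. A small sketch of this convention, checked against the lemma's identity $\ckn{n} = k \cdot \ckn{p} + \ckn{r}$ on a made-up periodic cycle sequence (illustration only, not the thesis's code):

```python
def decompose(n, p):
    """Write n = k*p + r with 1 <= r <= p, as in the proof's
    convention: when p divides n, take r = p and k = n/p - 1."""
    r = n % p or p
    return (n - r) // p, r
```

For the period-2 sequence `C(K^1..6) = [2, 3, 5, 6, 8, 9]`, each `C(K^n)` indeed equals `k * C(K^p) + C(K^r)` under this decomposition, which is the step the `align*` block above carries out symbolically.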
@@ -561,9 +562,9 @@ of this manuscript.
 \smallskip{}
 
 An instruction is said to be a \emph{flow-altering instruction} if this
-address may alter the normal control flow of the program. This is typically
-true of jumps (conditional or unconditional), function calls, function
-returns, \ldots
+instruction may alter the normal control flow of the program. This is
+typically true of jumps (conditional or unconditional), function calls,
+function returns, \ldots
 
 \smallskip{}
 
@@ -81,7 +81,7 @@ approach, but also limits it to microarchitectures offering such counters, and
 requires a manual analysis of each microarchitecture to be supported in order
 to find a fitting set of blocking instructions. Although we have no theoretical
 guarantee of the existence of such instructions, this should never be a
-problem, as all pragmatic microarchitecture design will yield to their
+problem, as all pragmatic microarchitecture design will lead to their
 existence.
 
 \subsection{Code analyzers and their models}
@@ -89,8 +89,8 @@ existence.
 Going further than data extraction at the individual instruction level,
 academics and industrials interested in this domain now mostly work on
 code analyzers, as described in \autoref{ssec:code_analyzers} above. Each such
-tool embeds a model --~or collection of models~-- on which to base its
-inference, and whose definition, embedded data and obtention method varies from
+tool embeds a model --~or collection of models~-- on which its inference is
+based, and whose definition, embedded data and obtention method varies from
 tool to tool. These tools often use, to some extent, the data on individual
 instructions obtained either from the manufacturer or the third-party efforts
 mentioned above.
@@ -142,7 +142,7 @@ its context. This approach, in our experiments, was significantly less accurate
 than those not based on machine learning. In our opinion, its main issue,
 however, is to be a \textit{black-box model}: given a kernel, it is only able
 to predict its reverse throughput. Doing so, even with perfect accuracy, does
-not explain the source of a performance problem: the model is unable to help in
+not explain the source of a performance problem: the model is unable to help
 detecting which resource is the performance bottleneck of a kernel; in other
 words, it quantifies a potential issue, but does not help in \emph{explaining}
 it --~or debugging it.
@@ -171,4 +171,4 @@ this manuscript), with results comparable with \llvmmca{}. Its source code
 --~under free software license~-- is self-contained and reasonably concise
 (about 2,000 lines of Python for the main part), making it a good basis and
 baseline for experiments. It is, however, closely tied by design to Intel
-microarchitectures, or microarchitectures very alike to Intel's ones.
+microarchitectures, or microarchitectures very close to Intel's ones.