Typography: -- to ---

Théophile Bastian 2024-09-01 16:56:48 +02:00
parent 103e6a0687
commit d1401b068f
11 changed files with 54 additions and 54 deletions

View file

@@ -1,8 +1,8 @@
\selectlanguage{french}
\begin{abstract}
Qu'il s'agisse de calculs massifs distribués sur plusieurs baies, de
calculs en environnement contraint --~comme de l'embarqué ou de
l'\emph{edge computing}~-- ou encore de tentatives de réduire l'empreinte
calculs en environnement contraint ---~comme de l'embarqué ou de
l'\emph{edge computing}~--- ou encore de tentatives de réduire l'empreinte
écologique d'un programme fréquemment utilisé, de nombreux cas d'usage
justifient l'optimisation poussée d'un programme. Celle-ci s'arrête souvent
à l'optimisation de haut niveau (algorithmique, parallélisme, \ldots), mais
@@ -34,11 +34,11 @@
\selectlanguage{english}
\begin{abstract}
Be it massively distributed computation over multiple server racks,
constrained computation --~such as in embedded environments or in
\emph{edge computing}~--, or even an attempt to reduce the ecological
constrained computation ---~such as in embedded environments or in
\emph{edge computing}~---, or even an attempt to reduce the ecological
footprint of a frequently-run program, many use-cases make it relevant to
deeply optimise a program. This optimisation is often limited to high-level
optimisation --~choice of algorithms, parallel computing, \ldots{}~-- yet it
optimisation ---~choice of algorithms, parallel computing, \ldots{}~--- yet it
is possible to carry it further to low-level optimisations, by inspecting
the generated assembly with respect to the microarchitecture of the
specific microprocessor used, in order to fine-tune it.

View file

@@ -101,7 +101,7 @@ processor.
The CPU frontend constantly fetches a flow of instruction bytes. This flow must
first be broken down into a sequence of instructions. While on some ISAs, each
instruction is made of a constant number of bytes ---~\eg{} ARM~--, this is not
instruction is made of a constant number of bytes ---~\eg{} ARM~---, this is not
always the case: for instance, x86-64 instructions can be as short as one byte,
while the ISA only limits an instruction to 15
bytes~\cite{ref:intel64_software_dev_reference_vol1}. This task is performed by
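This variable length can be observed directly; a minimal sketch, assuming an x86-64 host with GNU binutils (the file name len.s is arbitrary):

cat > len.s <<'EOF'
push %rax                          # shortest form: 1 byte (0x50)
movabs $0x1122334455667788, %rax   # 10 bytes: REX.W prefix, opcode, 8-byte immediate
EOF
as len.s -o len.o && objdump -d len.o

The disassembly prints each instruction's raw bytes, making the length difference immediately visible.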

View file

@@ -167,8 +167,8 @@ We have now covered enough of the theoretical background to introduce code
analyzers in a concrete way, through examples of their usage. For this purpose,
we use \llvmmca{}, one of the state-of-the-art code analyzers.
Due to its relative simplicity --~at least compared to \eg{} Intel's x86-64
implementations~--, we will base the following examples on ARM's Cortex A72,
Due to its relative simplicity ---~at least compared to \eg{} Intel's x86-64
implementations~---, we will base the following examples on ARM's Cortex A72,
which we introduce in depth later in \autoref{chap:frontend}. No specific
knowledge of this microarchitecture is required to understand the following
examples; for our purposes, it suffices to say that:
@@ -192,7 +192,7 @@ on a single load operation: \lstarmasm{ldr x1, [x2]}.
\lstinputlisting[language=bash]{assets/src/20_foundations/llvm_mca_examples/01_ldr.out}
The first rows (2-10) are high-level metrics. \llvmmca{} works by simulating
the execution of the kernel --~here, 100 times, as seen on row 2~--. This simple
the execution of the kernel ---~here, 100 times, as seen on row 2~---. This simple
kernel contains only one instruction, which breaks down into a single \uop{}.
Iterating it takes 106 cycles instead of the expected 100 cycles, as this
execution is \emph{not} in steady-state, but accounts for the cycles from the
@@ -224,9 +224,9 @@ takes up all load resources available.
\lstinputlisting[language=bash]{assets/src/20_foundations/llvm_mca_examples/02_ldr_timeline.out}
which indicates, for each instruction, the timeline of its execution. Here,
\texttt{D} stands for decode, \texttt{e} for being executed --~in the
pipeline~--, \texttt{E} for the last cycle of its execution --~leaving the
pipeline~--, \texttt{R} for retiring. When an instruction is decoded and
\texttt{D} stands for decode, \texttt{e} for being executed ---~in the
pipeline~---, \texttt{E} for the last cycle of its execution ---~leaving the
pipeline~---, \texttt{R} for retiring. When an instruction is decoded and
waiting to be dispatched to execution, an \texttt{=} is shown.
The identifier at the beginning of each row indicates the kernel iteration
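Output such as the two listings above can be reproduced with an invocation along the following lines; the flags are genuine \llvmmca{} options, though whether they are the exact ones behind these listings is an assumption:

cat > kernel.s <<'EOF'
ldr x1, [x2]
EOF
llvm-mca --mtriple=aarch64 --mcpu=cortex-a72 --iterations=100 --timeline kernel.s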

View file

@@ -17,8 +17,8 @@ Reference Manual}~\cite{ref:intel64_architectures_optim_reference_vol1},
regularly updated, whose nearly 1,000 pages give relevant details on Intel's
microarchitectures, such as block diagrams, pipelines, ports available, etc. It
further gives data tables with throughput and latencies for some instructions.
While the manual provides a huge collection of important insights --~from the
optimisation perspective~-- on their microarchitectures, it lacks exhaustive
While the manual provides a huge collection of important insights ---~from the
optimisation perspective~--- on their microarchitectures, it lacks exhaustive
and (conveniently) machine-parsable data tables and does not detail port usages
of each instruction.
@@ -30,7 +30,7 @@ AMD, since 2020, releases lengthy and complete optimisation manuals for its
microarchitecture. For instance, the Zen4 optimisation
manual~\cite{ref:amd_zen4_optim_manual} contains both detailed insights on the
processor's workflow and ports, and a spreadsheet of about 3,400 x86
instructions --~with operand variants broken down~-- and their port usage,
instructions ---~with operand variants broken down~--- and their port usage,
throughput and latencies. Such an effort, which certainly translates to a
non-negligible financial cost to the company, showcases the importance of such
documents and the expectations recently placed on them.
@@ -89,7 +89,7 @@ existence.
Going further than data extraction at the individual instruction level,
academics and industry practitioners interested in this domain now mostly work on
code analyzers, as described in \autoref{ssec:code_analyzers} above. Each such
tool embeds a model --~or collection of models~-- on which its inference is
tool embeds a model ---~or collection of models~--- on which its inference is
based, and whose definition, embedded data and construction method vary from
tool to tool. These tools often use, to some extent, the data on individual
instructions obtained either from the manufacturer or the third-party efforts
@@ -106,16 +106,16 @@ microarchitectures. Yet, being closed-source and relying on data that is
partially unavailable to the public, the model is not totally satisfactory to
academics or engineers trying to understand specific performance results. This
also makes the tool vulnerable to deprecation, as the community is unable to
\textit{fork} the project --~and indeed, \iaca{} was discontinued by Intel
\textit{fork} the project ---~and indeed, \iaca{} was discontinued by Intel
in 2019. Thus, \iaca{} does not support recent microarchitectures, and its
binary was recently removed from official download pages.
\medskip{}
In the meantime, the LLVM Machine Code Analyzer --~or \llvmmca{}~-- was
In the meantime, the LLVM Machine Code Analyzer ---~or \llvmmca{}~--- was
developed as an internal tool at Sony, and was proposed for inclusion in
\llvm{} in 2018~\cite{llvm_mca_rfc}. This code analyzer is based on the
data tables that \llvm{} --~a compiler~-- has to maintain for each
data tables that \llvm{} ---~a compiler~--- has to maintain for each
microarchitecture in order to produce optimized code. The project has since
evolved to be fairly accurate, as seen in the experiments later presented
in this manuscript. It is the alternative Intel offers to \iaca{} subsequent
@@ -125,7 +125,7 @@ to its deprecation.
Another model, \osaca{}, was developed by Jan Laukemann \textit{et al.}
starting in 2017~\cite{osaca1,osaca2}. Its development stemmed from the lack
(at the time) of an open-source --~and thus, open-model~-- alternative to \iaca{}.
(at the time) of an open-source ---~and thus, open-model~--- alternative to \iaca{}.
As a data source, \osaca{} makes use of Agner Fog's data tables or \uopsinfo{}.
It still lacks, however, a good model of the frontend and of data dependencies,
making it less performant than other code analyzers in our experiments later in this
@@ -145,7 +145,7 @@ to predict its reverse throughput. Doing so, even with perfect accuracy, does
not explain the source of a performance problem: the model is unable to help
detect which resource is the performance bottleneck of a kernel; in other
words, it quantifies a potential issue, but does not help in \emph{explaining}
it --~or debugging it.
it ---~or debugging it.
\medskip{}
@@ -168,7 +168,7 @@ Abel and Reineke, the authors of \uopsinfo{}, recently released
reverse-engineering through the use of hardware counters to model the frontend
and pipelines. We found this tool to be very accurate (see experiments later in
this manuscript), with results comparable with \llvmmca{}. Its source code
--~under a free software license~-- is self-contained and reasonably concise
---~under a free software license~--- is self-contained and reasonably concise
(about 2,000 lines of Python for the main part), making it a good basis and
baseline for experiments. It is, however, closely tied by design to Intel
microarchitectures, or microarchitectures very close to Intel's.

View file

@@ -3,7 +3,7 @@ of good microarchitectural analysis and predictions in many aspects. One thing,
however, that we found lacking was a generic method to obtain a model for a
given microarchitecture. Indeed, while \eg{} \iaca{} and \uopsinfo{} are
performant and quite exhaustive models of Intel's x86-64 implementations, they
are restricted to Intel CPUs --~and a few others for \uopsinfo{}. These models
are restricted to Intel CPUs ---~and a few others for \uopsinfo{}. These models
were, at least up to a point, handcrafted. While \iaca{} is based on insider
knowledge from Intel (and thus would not work for \eg{} AMD), \uopsinfo{}'s
method is based on specific hardware counters and handpicked instructions with

View file

@@ -39,11 +39,11 @@ that $\texttt{\footnotesize{}UNROLL\_SIZE} \times
parameters of the benchmark generation.
\pipedream{} must be able to distinguish between variants of instructions with
the same mnemonic --~\eg{} \lstxasm{mov}~-- but different operand kinds,
altering the semantics and performance of the instruction --~such as a
the same mnemonic ---~\eg{} \lstxasm{mov}~--- but different operand kinds,
altering the semantics and performance of the instruction ---~such as a
\lstxasm{mov} loading from memory versus a \lstxasm{mov} between registers. To
this end, \pipedream{} represents instructions fully qualified with their
operands' kind --~this can be seen as a process akin to C++'s name mangling.
operands' kind ---~this can be seen as a process akin to C++'s name mangling.
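The analogy can be made concrete with a tangential sketch; this is not \pipedream{}'s actual naming scheme, only the C++ mechanism it is compared to. Two overloads differing solely in operand kinds receive distinct mangled symbols, much as a register-register \lstxasm{mov} is qualified apart from a memory-loading one:

cat > mangle.cpp <<'EOF'
void mov(long dst, long src) {}          // register-to-register flavour
void mov(long dst, const long *src) {}   // memory-loading flavour
EOF
g++ -c mangle.cpp -o mangle.o && nm mangle.o
# prints _Z3movll and _Z3movlPKl: one symbol per operand-kind variant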
As \pipedream{} takes a multiset of instructions as a kernel, these
instructions' arguments must be instantiated to turn them into actual assembly

View file

@@ -136,7 +136,7 @@
\figthreerowlegend{Polybench}{polybench-W-zen}
\end{figleftlabel}
\caption{IPC prediction profile heatmaps~--~predictions closer to the
\caption{IPC prediction profile heatmaps~---~predictions closer to the
red line are more accurate. Predicted IPC ratio (Y) against native
IPC (X)}
\label{fig:palmed_heatmaps}

View file

@@ -64,13 +64,13 @@ each instruction. Its parameters are:
The first step in modeling a processor's frontend should certainly be to
characterize the number of \uops{} that can be dispatched in a cycle. We assume
that a model of the backend is known --~by taking for instance a model
that a model of the backend is known ---~by taking for instance a model
generated by \palmed{}, using tables from \uopsinfo{} or any other means. To the
best of our knowledge, we can safely further assume that instructions that load
a single backend port only once are also composed of a single \uop{}.
Generating a few combinations from a variety of those and measuring their
effective throughput --~using the backend model to check that the latter is
not the bottleneck~-- and keeping the maximal throughput reached should provide
effective throughput ---~using the backend model to check that the latter is
not the bottleneck~--- and keeping the maximal throughput reached should provide
a good value.
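A minimal measurement sketch, assuming a hypothetical kernel.s holding such an unrolled combination of presumed single-\uop{} instructions and a driver.c looping over it (both file names are placeholders):

as -o kernel.o kernel.s && cc -O2 -o bench driver.c kernel.o
perf stat -e cycles,instructions ./bench
# dispatch width ~ best instructions/cycles ratio over such kernels

The highest IPC observed across several such kernels, with the backend model ruling out port saturation, serves as the dispatch-width estimate.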
\medskip{}
@@ -88,7 +88,7 @@ The core of the model presented in this chapter is the discovery, for each
instruction, of its \uop{} count. Still assuming the knowledge of a backend
model, the method described in \autoref{ssec:a72_insn_muop_count} should be
generic enough to be used on any processor. The basic instructions may be
easily selected using the backend model --~we assume their existence in most
easily selected using the backend model ---~we assume their existence in most
microarchitectures, as pragmatic concerns guide the port design. Counting the
\uops{} of an instruction thus follows, using only elapsed-cycle counters.
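As an illustrative back-of-the-envelope relation (the symbols $W$, $K$ and $C$ are ours, not the manuscript's): with a dispatch width of $W$ \uops{} per cycle, a frontend-bound kernel made of one copy of an instruction $I$ plus $K$ known single-\uop{} instructions, iterating in $C$ cycles, satisfies approximately
\[ \mu(I) \approx W \cdot C - K \]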
@@ -141,7 +141,7 @@ be investigated if the model does not reach the expected accuracy.
\uops{} are repeatedly streamed from the decode queue, without even the need
to hit a cache. We are unaware of similar features in other
commercial processors. In embedded programming, however, \emph{hardware
loops} --~which are set up explicitly by the programmer~-- achieve,
loops} ---~which are set up explicitly by the programmer~--- achieve,
among others, the same
goal~\cite{hardware_loops_patent,kavvadias2007hardware,talla2001hwloops}.
@@ -157,8 +157,8 @@ be investigated if the model does not reach the expected accuracy.
\item{} In reality, there is an intermediate step between instructions and
\uops{}: macro-ops. Although it serves a design and semantic
purpose, we omit this step in the current model as --~we
believe~-- it is of little importance for predicting performance.
purpose, we omit this step in the current model as ---~we
believe~--- it is of little importance for predicting performance.
\item{} On x86 architectures at least, common pairs of micro- or
macro-operations may be ``fused'' into a single one, up to various

View file

@@ -29,7 +29,7 @@ We were also able to show in Section~\ref{ssec:memlatbound}
that state-of-the-art static analyzers struggle to
account for memory-carried dependencies, a weakness significantly impacting
their overall results on our benchmarks. We believe that detecting
and accounting for these dependencies is an important topic --~which we will
and accounting for these dependencies is an important topic ---~which we will
tackle in the following chapter.
Moreover, we present this work in the form of a modular software package, each

View file

@@ -43,7 +43,7 @@ of a CPU and, in particular, how \uops{} transit in (decoded) and out
If a \uop{} has not yet been retired (that is, issued and executed), it cannot be
replaced in the ROB by any freshly decoded instruction. In other words, every
non-retired decoded \uop{} --~also called \emph{in-flight}~-- remains in the
non-retired decoded \uop{} ---~also called \emph{in-flight}~--- remains in the
reorder buffer. This is possible thanks to the notion of \emph{full reorder
buffer}:

View file

@@ -54,7 +54,7 @@ loop:
\end{lstlisting}
\end{minipage}\\
a read-after-write dependency from line 4 to line 2 is reported by \depsim{}
--~although there is no such dependency inherent to the kernel.
---~although there is no such dependency inherent to the kernel.
However, each iteration of the
\texttt{measure} (outer) loop and each iteration of the \texttt{repeat} (inner)
@@ -64,8 +64,8 @@ write it back. This creates a dependency to the previous iteration of the inner
loop, which should in practice be meaningless if \lstc{BENCHMARK_SIZE} is large
enough. Such dependencies, however, pollute the evaluation results: as \depsim{}
does not report a dependency's distance, they are considered meaningful; and as
they cannot be detected by \staticdeps{} --~which is unaware of the outer and
inner loop~--, they introduce unfairness in the evaluation. The actual loss of
they cannot be detected by \staticdeps{} ---~which is unaware of the outer and
inner loop~---, they introduce unfairness in the evaluation. The actual loss of
precision introduced by not discovering such dependencies is instead assessed
later by enriching \uica{} with \staticdeps{}.
@@ -99,14 +99,14 @@ source and destination program counters are not in the same basic block are
discarded, as \staticdeps{} cannot detect them by construction.
For each of the considered basic blocks, we run our static analysis,
\staticdeps{}. We discard the $\Delta{}k$ parameter --~how many loop iterations
the dependency spans~--, as our dynamic analysis does not report an equivalent
\staticdeps{}. We discard the $\Delta{}k$ parameter ---~how many loop iterations
the dependency spans~---, as our dynamic analysis does not report an equivalent
parameter, but only a pair of program counters.
Dynamic dependencies from \depsim{} are converted to
\emph{periodic dependencies} in the sense of \staticdeps{} as described in
\autoref{ssec:staticdeps:practical_implem}: only dependencies occurring in at
least 80\% of the block's iterations are kept --~otherwise, they are
least 80\% of the block's iterations are kept ---~otherwise, they are
considered measurement artifacts. The \emph{periodic coverage}
of \staticdeps{} dependencies for this basic block \wrt{} \depsim{} is the
proportion of dependencies found by \staticdeps{} among the periodic
@@ -117,7 +117,7 @@ dependencies extracted from \depsim{}:
\smallskip{}
We also keep the raw dependencies from \depsim{} --~that is, without converting
We also keep the raw dependencies from \depsim{} ---~that is, without converting
them to periodic dependencies. From these, we consider two metrics:
the unweighted dependency coverage, \[
\cov_u = \dfrac{\card{\text{found}}}{\card{\text{found}} + \card{\text{missed}}}
@@ -158,20 +158,20 @@ very close to 100\,\%, giving us good confidence on the accuracy of
The same methodology can be re-used as a proxy for estimating the rate of
aliasing among supposedly independent pointers in our dataset. Indeed, a major approximation
made by \staticdeps{} is to assume that any newly encountered pointer --~function
parameters, values read from memory, \ldots~-- does \emph{not} alias with
made by \staticdeps{} is to assume that any newly encountered pointer ---~function
parameters, values read from memory, \ldots~--- does \emph{not} alias with
previously encountered values. This is implemented by using a fresh
random value for each yet-unknown value.
Determining which pointers may point to which other pointers --~and, by
extension, may point to the same memory region~-- is called a \emph{points-to
Determining which pointers may point to which other pointers ---~and, by
extension, may point to the same memory region~--- is called a \emph{points-to
analysis}~\cite{points_to}. In the context of \staticdeps{}, it characterizes
the pointers for which taking a fresh value was \emph{not} representative of
reality.
If we detect, through dynamic analysis, that a value derived from a
pointer \lstc{a} shares a value with one derived from a pointer \lstc{b}
--~say, \lstc{a + k == b + l}~--, we can deduce that \lstc{a} \emph{points-to}
---~say, \lstc{a + k == b + l}~---, we can deduce that \lstc{a} \emph{points-to}
\lstc{b}. This is true even if \lstc{a + k} at the very beginning of the
execution is equal to \lstc{b + l} at the very end of the execution: although
the pointers will not alias (that is, share the same value at the same moment),
@@ -215,7 +215,7 @@ The results of this analysis are presented in
\autoref{table:cov_staticdeps_pointsto}. The very high coverage rate gives us
good confidence that our hypothesis of independent pointers is reasonable, at
least within the scope of Polybench, which we believe to be representative of
scientific computation --~one of the prominent use-cases of tools such as code
scientific computation ---~one of the prominent use-cases of tools such as code
analyzers.
\subsection{Enriching \uica{}'s model}
@@ -329,7 +329,7 @@ constituting basic blocks.
\centering
\includegraphics[width=\linewidth]{staticdeps_time_boxplot.svg}
\captionof{figure}{Statistical distribution of \staticdeps{} and \depsim{} run times
on \cesasme{}'s kernels --~log y scale}\label{fig:staticdeps_cesasme_runtime_boxplot}
on \cesasme{}'s kernels ---~log y scale}\label{fig:staticdeps_cesasme_runtime_boxplot}
\end{minipage}\hfill\begin{minipage}{0.48\linewidth}
\centering
\includegraphics[width=\linewidth]{staticdeps_speedup_boxplot.svg}
@@ -344,14 +344,14 @@ constituting basic blocks.
\toprule
\textbf{Sequence} & \textbf{Average} & \textbf{Median} & \textbf{Q1} & \textbf{Q3} \\
\midrule
Seq.\ (\ref{messeq:depsim}) --~\depsim{}
Seq.\ (\ref{messeq:depsim}) ---~\depsim{}
& 18083 ms & 17645 ms & 17080 ms & 18650 ms \\
Seq.\ (\ref{messeq:staticdeps_sum}) --~\staticdeps{} (sum)
Seq.\ (\ref{messeq:staticdeps_sum}) ---~\staticdeps{} (sum)
& 2307 ms & 677 ms & 557 ms & 2700 ms \\
Seq.\ (\ref{messeq:staticdeps_one}) --~\staticdeps{} (single)
Seq.\ (\ref{messeq:staticdeps_one}) ---~\staticdeps{} (single)
& 529 ms & 545 ms & 425 ms & 588 ms \\
\midrule
Seq.\ (\ref{messeq:staticdeps_speedup}) --~speedup
Seq.\ (\ref{messeq:staticdeps_speedup}) ---~speedup
& $\times$36.1 & $\times$33.5 & $\times$30.1 &
$\times$41.7 \\
\bottomrule