Typography: -- to ---

parent 103e6a0687
commit d1401b068f

11 changed files with 54 additions and 54 deletions
@@ -1,8 +1,8 @@
 \selectlanguage{french}
 \begin{abstract}
 Qu'il s'agisse de calculs massifs distribués sur plusieurs baies, de
-calculs en environnement contraint --~comme de l'embarqué ou de
-l'\emph{edge computing}~-- ou encore de tentatives de réduire l'empreinte
+calculs en environnement contraint ---~comme de l'embarqué ou de
+l'\emph{edge computing}~--- ou encore de tentatives de réduire l'empreinte
 écologique d'un programme fréquemment utilisé, de nombreux cas d'usage
 justifient l'optimisation poussée d'un programme. Celle-ci s'arrête souvent
 à l'optimisation de haut niveau (algorithmique, parallélisme, \ldots), mais
@@ -34,11 +34,11 @@
 \selectlanguage{english}
 \begin{abstract}
 Be it massively distributed computation over multiple server racks,
-constrained computation --~such as in embedded environments or in
-\emph{edge computing}~--, or still an attempt to reduce the ecological
+constrained computation ---~such as in embedded environments or in
+\emph{edge computing}~---, or still an attempt to reduce the ecological
 footprint of a frequently-run program, many use-cases make it relevant to
 deeply optimize a program. This optimisation is often limited to high-level
-optimisation --~choice of algorithms, parallel computing, \ldots{} Yet, it
+optimisation ---~choice of algorithms, parallel computing, \ldots{} Yet, it
 is possible to carry it further to low-level optimisations, by inspecting
 the generated assembly with respect to the microarchitecture of the
 specific microprocessor used to fine-tune it.
@@ -101,7 +101,7 @@ processor.
 
 The CPU frontend constantly fetches a flow of instruction bytes. This flow must
 first be broken down into a sequence of instructions. While on some ISAs, each
-instruction is made of a constant amount of bytes --~\eg{} ARM~--, this is not
+instruction is made of a constant amount of bytes ---~\eg{} ARM~---, this is not
 always the case: for instance, x86-64 instructions can be as short as one byte,
 while the ISA only limits an instruction to 15
 bytes~\cite{ref:intel64_software_dev_reference_vol1}. This task is performed by
@@ -167,8 +167,8 @@ We have now covered enough of the theoretical background to introduce code
 analyzers in a concrete way, through examples of their usage. For this purpose,
 we use \llvmmca{}, one of the state-of-the-art code analyzers.
 
-Due to its relative simplicity --~at least compared to \eg{} Intel's x86-64
-implementations~--, we will base the following examples on ARM's Cortex A72,
+Due to its relative simplicity ---~at least compared to \eg{} Intel's x86-64
+implementations~---, we will base the following examples on ARM's Cortex A72,
 which we introduce in depth later in \autoref{chap:frontend}. No specific
 knowledge of this microarchitecture is required to understand the following
 examples; for our purposes, it suffices to say that:
@@ -192,7 +192,7 @@ on a single load operation: \lstarmasm{ldr x1, [x2]}.
 \lstinputlisting[language=bash]{assets/src/20_foundations/llvm_mca_examples/01_ldr.out}
 
 The first rows (2-10) are high-level metrics. \llvmmca{} works by simulating
-the execution of the kernel --~here, 100 times, as seen row 2~--. This simple
+the execution of the kernel ---~here, 100 times, as seen row 2~---. This simple
 kernel contains only one instruction, which breaks down into a single \uop{}.
 Iterating it takes 106 cycles instead of the expected 100 cycles, as this
 execution is \emph{not} in steady-state, but accounts for the cycles from the
@@ -224,9 +224,9 @@ takes up all load resources available.
 \lstinputlisting[language=bash]{assets/src/20_foundations/llvm_mca_examples/02_ldr_timeline.out}
 
 which indicates, for each instruction, the timeline of its execution. Here,
-\texttt{D} stands for decode, \texttt{e} for being executed --~in the
-pipeline~--, \texttt{E} for last cycle of its execution --~leaving the
-pipeline~--, \texttt{R} for retiring. When an instruction is decoded and
+\texttt{D} stands for decode, \texttt{e} for being executed ---~in the
+pipeline~---, \texttt{E} for last cycle of its execution ---~leaving the
+pipeline~---, \texttt{R} for retiring. When an instruction is decoded and
 waiting to be dispatched to execution, an \texttt{=} is shown.
 
 The identifier at the beginning of each row indicates the kernel iteration
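As an aside, the timeline notation described in the hunk above can be sketched in a few lines. This is an illustrative toy, not llvm-mca's implementation; the cycle values in the example are made up:

```python
def render_timeline(decode: int, exec_start: int, exec_end: int, retire: int) -> str:
    """Render one timeline row in the notation of the text above:
    D = decode, = waiting for dispatch, e = executing (in the pipeline),
    E = last cycle of execution (leaving the pipeline), - = waiting to
    retire, R = retiring."""
    cells = []
    for cycle in range(retire + 1):
        if cycle < decode:
            cells.append(" ")
        elif cycle == decode:
            cells.append("D")
        elif cycle < exec_start:
            cells.append("=")   # decoded, waiting to be dispatched
        elif cycle < exec_end:
            cells.append("e")   # being executed
        elif cycle == exec_end:
            cells.append("E")   # last cycle of execution
        elif cycle < retire:
            cells.append("-")   # executed, waiting to retire
        else:
            cells.append("R")   # retiring
    return "".join(cells)

# A load decoded at cycle 0, executing on cycles 1-4, retiring at cycle 5:
print(render_timeline(0, 1, 4, 5))  # DeeeER
```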
@@ -17,8 +17,8 @@ Reference Manual}~\cite{ref:intel64_architectures_optim_reference_vol1},
 regularly updated, whose nearly 1,000 pages give relevant details to Intel's
 microarchitectures, such as block diagrams, pipelines, ports available, etc. It
 further gives data tables with throughput and latencies for some instructions.
-While the manual provides a huge collection of important insights --~from the
-optimisation perspective~-- on their microarchitectures, it lacks exhaustive
+While the manual provides a huge collection of important insights ---~from the
+optimisation perspective~--- on their microarchitectures, it lacks exhaustive
 and (conveniently) machine-parsable data tables and does not detail port usages
 of each instruction.
 
@@ -30,7 +30,7 @@ AMD, since 2020, releases lengthy and complete optimisation manuals for its
 microarchitecture. For instance, the Zen4 optimisation
 manual~\cite{ref:amd_zen4_optim_manual} contains both detailed insights on the
 processor's workflow and ports, and a spreadsheet of about 3,400 x86
-instructions --~with operands variants broken down~-- and their port usage,
+instructions ---~with operands variants broken down~--- and their port usage,
 throughput and latencies. Such an effort, which certainly translates to a
 non-negligible financial cost to the company, showcases the importance and
 recent expectations on such documents.
@@ -89,7 +89,7 @@ existence.
 Going further than data extraction at the individual instruction level,
 academics and industrials interested in this domain now mostly work on
 code analyzers, as described in \autoref{ssec:code_analyzers} above. Each such
-tool embeds a model --~or collection of models~-- on which its inference is
+tool embeds a model ---~or collection of models~--- on which its inference is
 based, and whose definition, embedded data and obtention method varies from
 tool to tool. These tools often use, to some extent, the data on individual
 instructions obtained either from the manufacturer or the third-party efforts
@@ -106,16 +106,16 @@ microarchitectures. Yet, being closed-source and relying on data that is
 partially unavailable to the public, the model is not totally satisfactory to
 academics or engineers trying to understand specific performance results. It
 also makes it vulnerable to deprecation, as the community is unable to
-\textit{fork} the project --~and indeed, \iaca{} has been discontinued by Intel
+\textit{fork} the project ---~and indeed, \iaca{} has been discontinued by Intel
 in 2019. Thus, \iaca{} does not support recent microarchitectures, and its
 binary was recently removed from official download pages.
 
 \medskip{}
 
-In the meantime, the LLVM Machine Code Analyzer --~or \llvmmca{}~-- was
+In the meantime, the LLVM Machine Code Analyzer ---~or \llvmmca{}~--- was
 developed as an internal tool at Sony, and was proposed for inclusion in
 \llvm{} in 2018~\cite{llvm_mca_rfc}. This code analyzer is based on the
-data tables that \llvm{} --~a compiler~-- has to maintain for each
+data tables that \llvm{} ---~a compiler~--- has to maintain for each
 microarchitecture in order to produce optimized code. The project has since
 then evolved to be fairly accurate, as seen in the experiments later presented
 in this manuscript. It is the alternative Intel offers to \iaca{} subsequently
@@ -125,7 +125,7 @@ to its deprecation.
 
 Another model, \osaca{}, was developed by Jan Laukemann \textit{et al.}
 starting in 2017~\cite{osaca1,osaca2}. Its development stemmed from the lack
-(at the time) of an open-source --~and thus, open-model~-- alternative to IACA.
+(at the time) of an open-source ---~and thus, open-model~--- alternative to IACA.
 As a data source, \osaca{} makes use of Agner Fog's data tables or \uopsinfo{}.
 It still lacks, however, a good model of frontend and data dependencies, making
 it less performant than other code analyzers in our experiments later in this
@@ -145,7 +145,7 @@ to predict its reverse throughput. Doing so, even with perfect accuracy, does
 not explain the source of a performance problem: the model is unable to help
 detecting which resource is the performance bottleneck of a kernel; in other
 words, it quantifies a potential issue, but does not help in \emph{explaining}
-it --~or debugging it.
+it ---~or debugging it.
 
 \medskip{}
 
@@ -168,7 +168,7 @@ Abel and Reineke, the authors of \uopsinfo{}, recently released
 reverse-engineering through the use of hardware counters to model the frontend
 and pipelines. We found this tool to be very accurate (see experiments later in
 this manuscript), with results comparable with \llvmmca{}. Its source code
---~under free software license~-- is self-contained and reasonably concise
+---~under free software license~--- is self-contained and reasonably concise
 (about 2,000 lines of Python for the main part), making it a good basis and
 baseline for experiments. It is, however, closely tied by design to Intel
 microarchitectures, or microarchitectures very close to Intel's ones.
@@ -3,7 +3,7 @@ of good microarchitectural analysis and predictions in many aspects. One thing,
 however, that we found lacking, was a generic method to obtain a model for a
 given microarchitecture. Indeed, while \eg{} \iaca{} and \uopsinfo{} are
 performant and quite exhaustive models of Intel's x86-64 implementations, they
-are restricted to Intel CPUs --~and few others for \uopsinfo{}. These models
+are restricted to Intel CPUs ---~and few others for \uopsinfo{}. These models
 were, at least up to a point, handcrafted. While \iaca{} is based on insider's
 knowledge from Intel (and thus would not work for \eg{} AMD), \uopsinfo{}'s
 method is based on specific hardware counters and handpicked instructions with
@@ -39,11 +39,11 @@ that $\texttt{\footnotesize{}UNROLL\_SIZE} \times
 parameters of the benchmark generation.
 
 \pipedream{} must be able to distinguish between variants of instructions with
-the same mnemonic --~\eg{} \lstxasm{mov}~-- but different operand kinds,
-altering the semantics and performance of the instruction --~such as a
+the same mnemonic ---~\eg{} \lstxasm{mov}~--- but different operand kinds,
+altering the semantics and performance of the instruction ---~such as a
 \lstxasm{mov} loading from memory versus a \lstxasm{mov} between registers. To
 this end, \pipedream{} represents instructions fully qualified with their
-operands' kind --~this can be seen as a process akin to C++'s name mangling.
+operands' kind ---~this can be seen as a process akin to C++'s name mangling.
 
 As \pipedream{} gets a multiset of instructions as a kernel, these
 instructions' arguments must be instantiated to turn them into actual assembly
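The fully-qualified representation mentioned in the hunk above can be mimicked with a toy sketch. The `r64`/`m64` operand-kind names are hypothetical and not Pipedream's actual encoding:

```python
def qualify(mnemonic: str, operand_kinds: list[str]) -> str:
    """Qualify an instruction by its operand kinds, akin to C++ name
    mangling, so that e.g. a register-to-register mov and a memory-loading
    mov get distinct names despite sharing a mnemonic."""
    return mnemonic + "_" + "_".join(operand_kinds)

# Two mov variants with different semantics and performance:
print(qualify("mov", ["r64", "r64"]))  # mov_r64_r64 (register to register)
print(qualify("mov", ["r64", "m64"]))  # mov_r64_m64 (load from memory)
```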
|
@ -136,7 +136,7 @@
|
|||
\figthreerowlegend{Polybench}{polybench-W-zen}
|
||||
\end{figleftlabel}
|
||||
|
||||
\caption{IPC prediction profile heatmaps~--~predictions closer to the
|
||||
\caption{IPC prediction profile heatmaps~---~predictions closer to the
|
||||
red line are more accurate. Predicted IPC ratio (Y) against native
|
||||
IPC (X)}
|
||||
\label{fig:palmed_heatmaps}
|
||||
|
|
|
@ -64,13 +64,13 @@ each instruction. Its parameters are:
|
|||
|
||||
The first step in modeling a processor's frontend should certainly be to
|
||||
characterize the number of \uops{} that can be dispatched in a cycle. We assume
|
||||
that a model of the backend is known --~by taking for instance a model
|
||||
that a model of the backend is known ---~by taking for instance a model
|
||||
generated by \palmed{}, using tables from \uopsinfo{} or any other mean. To the
|
||||
best of our knowledge, we can safely further assume that instructions that load
|
||||
a single backend port only once are also composed of a single \uop{}.
|
||||
Generating a few combinations of a diversity of those and measuring their
|
||||
effective throughput --~making sure using the backend model that the latter is
|
||||
not the bottleneck~-- and keeping the maximal throughput reached should provide
|
||||
effective throughput ---~making sure using the backend model that the latter is
|
||||
not the bottleneck~--- and keeping the maximal throughput reached should provide
|
||||
a good value.
|
||||
|
||||
\medskip{}
|
||||
|
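The measurement described in the hunk above can be sketched over synthetic data. The IPC numbers below are invented for illustration; real values would come from benchmark runs of generated kernels:

```python
# Hypothetical measured throughputs (instructions per cycle) for a few
# kernels built only from single-uop instructions, none backend-bound:
measured_ipc = {
    ("add",): 2.0,
    ("add", "mul"): 2.9,
    ("add", "mul", "ldr"): 3.0,
    ("add", "ldr"): 2.8,
}

def estimate_dispatch_width(measurements: dict[tuple, float]) -> int:
    """The frontend dispatch width is the best throughput reached by any
    single-uop kernel; keep the maximum, rounded to an integer."""
    return round(max(measurements.values()))

print(estimate_dispatch_width(measured_ipc))  # 3 on this synthetic data
```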
@@ -88,7 +88,7 @@ The core of the model presented in this chapter is the discovery, for each
 instruction, of its \uop{} count. Still assuming the knowledge of a backend
 model, the method described in \autoref{ssec:a72_insn_muop_count} should be
 generic enough to be used on any processor. The basic instructions may be
-easily selected using the backend model --~we assume their existence in most
+easily selected using the backend model ---~we assume their existence in most
 microarchitectures, as pragmatic concerns guide the ports design. Counting the
 \uops{} of an instruction thus follows, using only elapsed cycles counters.
 
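The counting argument can be sketched as follows, with synthetic numbers; the actual method relies on measured cycle counters for a frontend-bound kernel combining the tested instruction with known single-uop fillers:

```python
def uop_count(cycles_per_iter: float, dispatch_width: int, n_filler_uops: int) -> int:
    """If a kernel made of one tested instruction plus n_filler_uops known
    single-uop instructions is frontend-bound, the uops dispatched per
    iteration are cycles x width; the remainder after subtracting the
    fillers belongs to the tested instruction."""
    total_uops = round(cycles_per_iter * dispatch_width)
    return total_uops - n_filler_uops

# 2 cycles/iteration at dispatch width 3, with 4 single-uop fillers:
print(uop_count(2.0, 3, 4))  # the tested instruction is made of 2 uops
```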
@@ -141,7 +141,7 @@ be investigated if the model does not reach the expected accuracy.
 \uops{} are repeatedly streamed from the decode queue, without even the
 necessity to hit a cache. We are unaware of similar features in other
 commercial processors. In embedded programming, however, \emph{hardware
-loops} --~which are set up explicitly by the programmer~-- achieve,
+loops} ---~which are set up explicitly by the programmer~--- achieve,
 among others, the same
 goal~\cite{hardware_loops_patent,kavvadias2007hardware,talla2001hwloops}.
 
@@ -157,8 +157,8 @@ be investigated if the model does not reach the expected accuracy.
 
 \item{} In reality, there is an intermediary step between instructions and
 \uops{}: macro-ops. Although it serves a designing and semantic
-purpose, we omit this step in the current model as --~we
-believe~-- it is of little importance to predict performance.
+purpose, we omit this step in the current model as ---~we
+believe~--- it is of little importance to predict performance.
 
 \item{} On x86 architectures at least, common pairs of micro- or
 macro-operations may be ``fused'' into a single one, up to various
@@ -29,7 +29,7 @@ We were also able to show in Section~\ref{ssec:memlatbound}
 that state-of-the-art static analyzers struggle to
 account for memory-carried dependencies; a weakness significantly impacting
 their overall results on our benchmarks. We believe that detecting
-and accounting for these dependencies is an important topic --~which we will
+and accounting for these dependencies is an important topic ---~which we will
 tackle in the following chapter.
 
 Moreover, we present this work in the form of a modular software package, each
@@ -43,7 +43,7 @@ of a CPU and, in particular, how \uops{} transit in (decoded) and out
 
 If a \uop{} has not been retired yet (issued and executed), it cannot be
 replaced in the ROB by any freshly decoded instruction. In other words, every
-non-retired decoded \uop{} --~also called \emph{in-flight}~-- remains in the
+non-retired decoded \uop{} ---~also called \emph{in-flight}~--- remains in the
 reorder buffer. This is possible thanks to the notion of \emph{full reorder
 buffer}:
 
@@ -54,7 +54,7 @@ loop:
 \end{lstlisting}
 \end{minipage}\\
 a read-after-write dependency from line 4 to line 2 is reported by \depsim{}
---~although there is no such dependency inherent to the kernel.
+---~although there is no such dependency inherent to the kernel.
 
 However, each iteration of the
 \texttt{measure} (outer) loop and each iteration of the \texttt{repeat} (inner)
@@ -64,8 +64,8 @@ write it back. This creates a dependency to the previous iteration of the inner
 loop, which should in practice be meaningless if \lstc{BENCHMARK_SIZE} is large
 enough. Such dependencies, however, pollute the evaluation results: as \depsim{}
 does not report a dependency's distance, they are considered meaningful; and as
-they cannot be detected by \staticdeps{} --~which is unaware of the outer and
-inner loop~--, they introduce unfairness in the evaluation. The actual loss of
+they cannot be detected by \staticdeps{} ---~which is unaware of the outer and
+inner loop~---, they introduce unfairness in the evaluation. The actual loss of
 precision introduced by not discovering such dependencies is instead assessed
 later by enriching \uica{} with \staticdeps{}.
 
@@ -99,14 +99,14 @@ source and destination program counters are not in the same basic block are
 discarded, as \staticdeps{} cannot detect them by construction.
 
 For each of the considered basic blocks, we run our static analysis,
-\staticdeps{}. We discard the $\Delta{}k$ parameter --~how many loop iterations
-the dependency spans~--, as our dynamic analysis does not report an equivalent
+\staticdeps{}. We discard the $\Delta{}k$ parameter ---~how many loop iterations
+the dependency spans~---, as our dynamic analysis does not report an equivalent
 parameter, but only a pair of program counters.
 
 Dynamic dependencies from \depsim{} are converted to
 \emph{periodic dependencies} in the sense of \staticdeps{} as described in
 \autoref{ssec:staticdeps:practical_implem}: only dependencies occurring on at
-least 80\% of the block's iterations are kept --~else, dependencies are
+least 80\% of the block's iterations are kept ---~else, dependencies are
 considered measurement artifacts. The \emph{periodic coverage}
 of \staticdeps{} dependencies for this basic block \wrt{} \depsim{} is the
 proportion of dependencies found by \staticdeps{} among the periodic
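The conversion to periodic dependencies described in the hunk above can be sketched as follows; representing a dependency as a (source PC, destination PC) tuple is an assumption for illustration:

```python
from collections import Counter

def periodic_dependencies(observed, n_iterations, threshold=0.8):
    """Keep only dependencies seen on at least `threshold` of the block's
    iterations; rarer ones are treated as measurement artifacts."""
    counts = Counter(observed)
    return {dep for dep, n in counts.items() if n / n_iterations >= threshold}

# (source PC, destination PC) pairs reported by the dynamic analysis:
observed = [(0x40, 0x48)] * 95 + [(0x40, 0x50)] * 3
periodic = periodic_dependencies(observed, n_iterations=100)

static = {(0x40, 0x48)}  # dependencies found by the static analysis
# Periodic coverage: static deps found among the periodic dynamic deps.
coverage = len(static & periodic) / len(periodic)
print(periodic, coverage)
```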
@@ -117,7 +117,7 @@ dependencies extracted from \depsim{}:
 
 \smallskip{}
 
-We also keep the raw dependencies from \depsim{} --~that is, without converting
+We also keep the raw dependencies from \depsim{} ---~that is, without converting
 them to periodic dependencies. From these, we consider two metrics:
 the unweighted dependencies coverage, \[
 \cov_u = \dfrac{\card{\text{found}}}{\card{\text{found}} + \card{\text{missed}}}
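In code, the unweighted coverage is a direct transcription of the formula in the hunk above:

```python
def cov_u(found: int, missed: int) -> float:
    """Unweighted dependency coverage: the fraction of raw dynamic
    dependencies that the static analysis also finds."""
    return found / (found + missed)

print(cov_u(80, 20))  # 0.8
```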
@@ -158,20 +158,20 @@ very close to 100\,\%, giving us good confidence on the accuracy of
 
 The same methodology can be re-used as a proxy for estimating the rate of
 aliasing independent pointers in our dataset. Indeed, a major approximation
-made by \staticdeps{} is to assume that any new encountered pointer --~function
-parameters, value read from memory, \ldots~-- does \emph{not} alias with
+made by \staticdeps{} is to assume that any new encountered pointer ---~function
+parameters, value read from memory, \ldots~--- does \emph{not} alias with
 previously encountered values. This is implemented by the use of a fresh
 random value for each value yet unknown.
 
-Determining which pointers may point to which other pointers --~and, by
-extension, may point to the same memory region~-- is called a \emph{points-to
+Determining which pointers may point to which other pointers ---~and, by
+extension, may point to the same memory region~--- is called a \emph{points-to
 analysis}~\cite{points_to}. In the context of \staticdeps{}, it characterizes
 the pointers for which taking a fresh value was \emph{not} representative of
 the reality.
 
 If we detect, through dynamic analysis, that a value derived from a
 pointer \lstc{a} shares a value with one derived from a pointer \lstc{b}
---~say, \lstc{a + k == b + l}~--, we can deduce that \lstc{a} \emph{points-to}
+---~say, \lstc{a + k == b + l}~---, we can deduce that \lstc{a} \emph{points-to}
 \lstc{b}. This is true even if \lstc{a + k} at the very beginning of the
 execution is equal to \lstc{b + l} at the very end of the execution: although
 the pointers will not alias (that is, share the same value at the same moment),
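The detection described in the hunk above can be sketched as follows; modelling the dynamic trace as the set of all values ever derived from each pointer is a hypothetical format for illustration:

```python
def points_to(derived_a: set[int], derived_b: set[int]) -> bool:
    """If any value derived from pointer a coincides with a value derived
    from pointer b (say, a + k == b + l), even at different moments of the
    execution, then a may point to b's memory region."""
    return not derived_a.isdisjoint(derived_b)

a_values = {0x1000, 0x1008, 0x1010}  # a, a+8, a+16 observed at runtime
b_values = {0x1010, 0x1018}          # b, b+8, with b == a+16
print(points_to(a_values, b_values))  # True
```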
@@ -215,7 +215,7 @@ The results of this analysis are presented in
 \autoref{table:cov_staticdeps_pointsto}. The very high coverage rate gives us
 good confidence that our hypothesis of independent pointers is reasonable, at
 least within the scope of Polybench, which we believe representative of
-scientific computation --~one of the prominent use-cases of tools such as code
+scientific computation ---~one of the prominent use-cases of tools such as code
 analyzers.
 
 \subsection{Enriching \uica{}'s model}
@@ -329,7 +329,7 @@ constituting basic blocks.
 \centering
 \includegraphics[width=\linewidth]{staticdeps_time_boxplot.svg}
 \captionof{figure}{Statistical distribution of \staticdeps{} and \depsim{} run times
-on \cesasme{}'s kernels --~log y scale}\label{fig:staticdeps_cesasme_runtime_boxplot}
+on \cesasme{}'s kernels ---~log y scale}\label{fig:staticdeps_cesasme_runtime_boxplot}
 \end{minipage}\hfill\begin{minipage}{0.48\linewidth}
 \centering
 \includegraphics[width=\linewidth]{staticdeps_speedup_boxplot.svg}
@@ -344,14 +344,14 @@ constituting basic blocks.
 \toprule
 \textbf{Sequence} & \textbf{Average} & \textbf{Median} & \textbf{Q1} & \textbf{Q3} \\
 \midrule
-Seq.\ (\ref{messeq:depsim}) --~\depsim{}
+Seq.\ (\ref{messeq:depsim}) ---~\depsim{}
 & 18083 ms & 17645 ms & 17080 ms & 18650 ms \\
-Seq.\ (\ref{messeq:staticdeps_sum}) --~\staticdeps{} (sum)
+Seq.\ (\ref{messeq:staticdeps_sum}) ---~\staticdeps{} (sum)
 & 2307 ms & 677 ms & 557 ms & 2700 ms \\
-Seq.\ (\ref{messeq:staticdeps_one}) --~\staticdeps{} (single)
+Seq.\ (\ref{messeq:staticdeps_one}) ---~\staticdeps{} (single)
 & 529 ms & 545 ms & 425 ms & 588 ms \\
 \midrule
-Seq.\ (\ref{messeq:staticdeps_speedup}) --~speedup
+Seq.\ (\ref{messeq:staticdeps_speedup}) ---~speedup
 & $\times$36.1 & $\times$33.5 & $\times$30.1 &
 $\times$41.7 \\
 \bottomrule