Typography: -- to ---

parent 103e6a0687
commit d1401b068f
11 changed files with 54 additions and 54 deletions

manuscrit/00_opening
manuscrit/20_foundations
manuscrit/30_palmed
manuscrit/40_A72-frontend
manuscrit/50_CesASMe
manuscrit/60_staticdeps
@@ -1,8 +1,8 @@
 \selectlanguage{french}
 \begin{abstract}
 Qu'il s'agisse de calculs massifs distribués sur plusieurs baies, de
-calculs en environnement contraint --~comme de l'embarqué ou de
-l'\emph{edge computing}~-- ou encore de tentatives de réduire l'empreinte
+calculs en environnement contraint ---~comme de l'embarqué ou de
+l'\emph{edge computing}~--- ou encore de tentatives de réduire l'empreinte
 écologique d'un programme fréquemment utilisé, de nombreux cas d'usage
 justifient l'optimisation poussée d'un programme. Celle-ci s'arrête souvent
 à l'optimisation de haut niveau (algorithmique, parallélisme, \ldots), mais
@@ -34,11 +34,11 @@
 \selectlanguage{english}
 \begin{abstract}
 Be it massively distributed computation over multiple server racks,
-constrained computation --~such as in embedded environments or in
-\emph{edge computing}~--, or still an attempt to reduce the ecological
+constrained computation ---~such as in embedded environments or in
+\emph{edge computing}~---, or still an attempt to reduce the ecological
 footprint of a frequently-run program, many use-cases make it relevant to
 deeply optimize a program. This optimisation is often limited to high-level
-optimisation --~choice of algorithms, parallel computing, \ldots{} Yet, it
+optimisation ---~choice of algorithms, parallel computing, \ldots{} Yet, it
 is possible to carry it further to low-level optimisations, by inspecting
 the generated assembly with respect to the microarchitecture of the
 specific microprocessor used to fine-tune it.
@@ -101,7 +101,7 @@ processor.
 
 The CPU frontend constantly fetches a flow of instruction bytes. This flow must
 first be broken down into a sequence of instructions. While on some ISAs, each
-instruction is made of a constant amount of bytes ---~\eg{} ARM~--, this is not
+instruction is made of a constant amount of bytes ---~\eg{} ARM~---, this is not
 always the case: for instance, x86-64 instructions can be as short as one byte,
 while the ISA only limits an instruction to 15
 bytes~\cite{ref:intel64_software_dev_reference_vol1}. This task is performed by
@@ -167,8 +167,8 @@ We have now covered enough of the theoretical background to introduce code
 analyzers in a concrete way, through examples of their usage. For this purpose,
 we use \llvmmca{}, one of the state-of-the-art code analyzers.
 
-Due to its relative simplicity --~at least compared to \eg{} Intel's x86-64
-implementations~--, we will base the following examples on ARM's Cortex A72,
+Due to its relative simplicity ---~at least compared to \eg{} Intel's x86-64
+implementations~---, we will base the following examples on ARM's Cortex A72,
 which we introduce in depth later in \autoref{chap:frontend}. No specific
 knowledge of this microarchitecture is required to understand the following
 examples; for our purposes, it suffices to say that:
@@ -192,7 +192,7 @@ on a single load operation: \lstarmasm{ldr x1, [x2]}.
 \lstinputlisting[language=bash]{assets/src/20_foundations/llvm_mca_examples/01_ldr.out}
 
 The first rows (2-10) are high-level metrics. \llvmmca{} works by simulating
-the execution of the kernel --~here, 100 times, as seen row 2~--. This simple
+the execution of the kernel ---~here, 100 times, as seen row 2~---. This simple
 kernel contains only one instruction, which breaks down into a single \uop{}.
 Iterating it takes 106 cycles instead of the expected 100 cycles, as this
 execution is \emph{not} in steady-state, but accounts for the cycles from the
@@ -224,9 +224,9 @@ takes up all load resources available.
 \lstinputlisting[language=bash]{assets/src/20_foundations/llvm_mca_examples/02_ldr_timeline.out}
 
 which indicates, for each instruction, the timeline of its execution. Here,
-\texttt{D} stands for decode, \texttt{e} for being executed --~in the
-pipeline~--, \texttt{E} for last cycle of its execution --~leaving the
-pipeline~--, \texttt{R} for retiring. When an instruction is decoded and
+\texttt{D} stands for decode, \texttt{e} for being executed ---~in the
+pipeline~---, \texttt{E} for last cycle of its execution ---~leaving the
+pipeline~---, \texttt{R} for retiring. When an instruction is decoded and
 waiting to be dispatched to execution, an \texttt{=} is shown.
 
 The identifier at the beginning of each row indicates the kernel iteration
@@ -17,8 +17,8 @@ Reference Manual}~\cite{ref:intel64_architectures_optim_reference_vol1},
 regularly updated, whose nearly 1,000 pages give relevant details to Intel's
 microarchitectures, such as block diagrams, pipelines, ports available, etc. It
 further gives data tables with throughput and latencies for some instructions.
-While the manual provides a huge collection of important insights --~from the
-optimisation perspective~-- on their microarchitectures, it lacks exhaustive
+While the manual provides a huge collection of important insights ---~from the
+optimisation perspective~--- on their microarchitectures, it lacks exhaustive
 and (conveniently) machine-parsable data tables and does not detail port usages
 of each instruction.
 
@@ -30,7 +30,7 @@ AMD, since 2020, releases lengthy and complete optimisation manuals for its
 microarchitecture. For instance, the Zen4 optimisation
 manual~\cite{ref:amd_zen4_optim_manual} contains both detailed insights on the
 processor's workflow and ports, and a spreadsheet of about 3,400 x86
-instructions --~with operands variants broken down~-- and their port usage,
+instructions ---~with operands variants broken down~--- and their port usage,
 throughput and latencies. Such an effort, which certainly translates to a
 non-negligible financial cost to the company, showcases the importance and
 recent expectations on such documents.
@@ -89,7 +89,7 @@ existence.
 Going further than data extraction at the individual instruction level,
 academics and industrials interested in this domain now mostly work on
 code analyzers, as described in \autoref{ssec:code_analyzers} above. Each such
-tool embeds a model --~or collection of models~-- on which its inference is
+tool embeds a model ---~or collection of models~--- on which its inference is
 based, and whose definition, embedded data and obtention method varies from
 tool to tool. These tools often use, to some extent, the data on individual
 instructions obtained either from the manufacturer or the third-party efforts
@@ -106,16 +106,16 @@ microarchitectures. Yet, being closed-source and relying on data that is
 partially unavailable to the public, the model is not totally satisfactory to
 academics or engineers trying to understand specific performance results. It
 also makes it vulnerable to deprecation, as the community is unable to
-\textit{fork} the project --~and indeed, \iaca{} has been discontinued by Intel
+\textit{fork} the project ---~and indeed, \iaca{} has been discontinued by Intel
 in 2019. Thus, \iaca{} does not support recent microarchitectures, and its
 binary was recently removed from official download pages.
 
 \medskip{}
 
-In the meantime, the LLVM Machine Code Analyzer --~or \llvmmca{}~-- was
+In the meantime, the LLVM Machine Code Analyzer ---~or \llvmmca{}~--- was
 developed as an internal tool at Sony, and was proposed for inclusion in
 \llvm{} in 2018~\cite{llvm_mca_rfc}. This code analyzer is based on the
-data tables that \llvm{} --~a compiler~-- has to maintain for each
+data tables that \llvm{} ---~a compiler~--- has to maintain for each
 microarchitecture in order to produce optimized code. The project has since
 then evolved to be fairly accurate, as seen in the experiments later presented
 in this manuscript. It is the alternative Intel offers to \iaca{} subsequently
@@ -125,7 +125,7 @@ to its deprecation.
 
 Another model, \osaca{}, was developed by Jan Laukemann \textit{et al.}
 starting in 2017~\cite{osaca1,osaca2}. Its development stemmed from the lack
-(at the time) of an open-source --~and thus, open-model~-- alternative to IACA.
+(at the time) of an open-source ---~and thus, open-model~--- alternative to IACA.
 As a data source, \osaca{} makes use of Agner Fog's data tables or \uopsinfo{}.
 It still lacks, however, a good model of frontend and data dependencies, making
 it less performant than other code analyzers in our experiments later in this
@@ -145,7 +145,7 @@ to predict its reverse throughput. Doing so, even with perfect accuracy, does
 not explain the source of a performance problem: the model is unable to help
 detecting which resource is the performance bottleneck of a kernel; in other
 words, it quantifies a potential issue, but does not help in \emph{explaining}
-it --~or debugging it.
+it ---~or debugging it.
 
 \medskip{}
 
@@ -168,7 +168,7 @@ Abel and Reineke, the authors of \uopsinfo{}, recently released
 reverse-engineering through the use of hardware counters to model the frontend
 and pipelines. We found this tool to be very accurate (see experiments later in
 this manuscript), with results comparable with \llvmmca{}. Its source code
---~under free software license~-- is self-contained and reasonably concise
+---~under free software license~--- is self-contained and reasonably concise
 (about 2,000 lines of Python for the main part), making it a good basis and
 baseline for experiments. It is, however, closely tied by design to Intel
 microarchitectures, or microarchitectures very close to Intel's ones.
@@ -3,7 +3,7 @@ of good microarchitectural analysis and predictions in many aspects. One thing,
 however, that we found lacking, was a generic method to obtain a model for a
 given microarchitecture. Indeed, while \eg{} \iaca{} and \uopsinfo{} are
 performant and quite exhaustive models of Intel's x86-64 implementations, they
-are restricted to Intel CPUs --~and few others for \uopsinfo{}. These models
+are restricted to Intel CPUs ---~and few others for \uopsinfo{}. These models
 were, at least up to a point, handcrafted. While \iaca{} is based on insider's
 knowledge from Intel (and thus would not work for \eg{} AMD), \uopsinfo{}'s
 method is based on specific hardware counters and handpicked instructions with
@@ -39,11 +39,11 @@ that $\texttt{\footnotesize{}UNROLL\_SIZE} \times
 parameters of the benchmark generation.
 
 \pipedream{} must be able to distinguish between variants of instructions with
-the same mnemonic --~\eg{} \lstxasm{mov}~-- but different operand kinds,
-altering the semantics and performance of the instruction --~such as a
+the same mnemonic ---~\eg{} \lstxasm{mov}~--- but different operand kinds,
+altering the semantics and performance of the instruction ---~such as a
 \lstxasm{mov} loading from memory versus a \lstxasm{mov} between registers. To
 this end, \pipedream{} represents instructions fully qualified with their
-operands' kind --~this can be seen as a process akin to C++'s name mangling.
+operands' kind ---~this can be seen as a process akin to C++'s name mangling.
 
 As \pipedream{} gets a multiset of instructions as a kernel, these
 instructions' arguments must be instantiated to turn them into actual assembly
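As an aside to the hunk above: the fully-qualified instruction names that \pipedream{} uses can be pictured with a small sketch. The `qualify` helper and the concrete `mnemonic_kind_kind` naming scheme below are ours, chosen purely for illustration; \pipedream{}'s real representation may differ.

```python
def qualify(mnemonic, operand_kinds):
    # Fully qualify an instruction with the kinds of its operands, in the
    # spirit of a C++-name-mangling-like scheme (hypothetical naming).
    return "_".join([mnemonic, *operand_kinds])

# Distinguishes a register-to-register mov from a memory-loading mov:
reg_mov = qualify("mov", ["r64", "r64"])  # -> "mov_r64_r64"
mem_mov = qualify("mov", ["r64", "m64"])  # -> "mov_r64_m64"
```

Two variants with the same mnemonic but different operand kinds thus get distinct identities, which is the property the text describes.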
@@ -136,7 +136,7 @@
 \figthreerowlegend{Polybench}{polybench-W-zen}
 \end{figleftlabel}
 
-\caption{IPC prediction profile heatmaps~--~predictions closer to the
+\caption{IPC prediction profile heatmaps~---~predictions closer to the
 red line are more accurate. Predicted IPC ratio (Y) against native
 IPC (X)}
 \label{fig:palmed_heatmaps}
@@ -64,13 +64,13 @@ each instruction. Its parameters are:
 
 The first step in modeling a processor's frontend should certainly be to
 characterize the number of \uops{} that can be dispatched in a cycle. We assume
-that a model of the backend is known --~by taking for instance a model
+that a model of the backend is known ---~by taking for instance a model
 generated by \palmed{}, using tables from \uopsinfo{} or any other means. To the
 best of our knowledge, we can safely further assume that instructions that load
 a single backend port only once are also composed of a single \uop{}.
 Generating a few combinations of a diversity of those and measuring their
-effective throughput --~making sure using the backend model that the latter is
-not the bottleneck~-- and keeping the maximal throughput reached should provide
+effective throughput ---~making sure using the backend model that the latter is
+not the bottleneck~--- and keeping the maximal throughput reached should provide
 a good value.
 
 \medskip{}
@@ -88,7 +88,7 @@ The core of the model presented in this chapter is the discovery, for each
 instruction, of its \uop{} count. Still assuming the knowledge of a backend
 model, the method described in \autoref{ssec:a72_insn_muop_count} should be
 generic enough to be used on any processor. The basic instructions may be
-easily selected using the backend model --~we assume their existence in most
+easily selected using the backend model ---~we assume their existence in most
 microarchitectures, as pragmatic concerns guide the ports design. Counting the
 \uops{} of an instruction thus follows, using only elapsed cycles counters.
 
@@ -141,7 +141,7 @@ be investigated if the model does not reach the expected accuracy.
 \uops{} are repeatedly streamed from the decode queue, without even the
 necessity to hit a cache. We are unaware of similar features in other
 commercial processors. In embedded programming, however, \emph{hardware
-loops} --~which are set up explicitly by the programmer~-- achieve,
+loops} ---~which are set up explicitly by the programmer~--- achieve,
 among others, the same
 goal~\cite{hardware_loops_patent,kavvadias2007hardware,talla2001hwloops}.
 
@@ -157,8 +157,8 @@ be investigated if the model does not reach the expected accuracy.
 
 \item{} In reality, there is an intermediary step between instructions and
 \uops{}: macro-ops. Although it serves a designing and semantic
-purpose, we omit this step in the current model as --~we
-believe~-- it is of little importance to predict performance.
+purpose, we omit this step in the current model as ---~we
+believe~--- it is of little importance to predict performance.
 
 \item{} On x86 architectures at least, common pairs of micro- or
 macro-operations may be ``fused'' into a single one, up to various
@@ -29,7 +29,7 @@ We were also able to show in Section~\ref{ssec:memlatbound}
 that state-of-the-art static analyzers struggle to
 account for memory-carried dependencies; a weakness significantly impacting
 their overall results on our benchmarks. We believe that detecting
-and accounting for these dependencies is an important topic --~which we will
+and accounting for these dependencies is an important topic ---~which we will
 tackle in the following chapter.
 
 Moreover, we present this work in the form of a modular software package, each
@@ -43,7 +43,7 @@ of a CPU and, in particular, how \uops{} transit in (decoded) and out
 
 If a \uop{} has not been retired yet (issued and executed), it cannot be
 replaced in the ROB by any freshly decoded instruction. In other words, every
-non-retired decoded \uop{} --~also called \emph{in-flight}~-- remains in the
+non-retired decoded \uop{} ---~also called \emph{in-flight}~--- remains in the
 reorder buffer. This is possible thanks to the notion of \emph{full reorder
 buffer}:
 
@@ -54,7 +54,7 @@ loop:
 \end{lstlisting}
 \end{minipage}\\
 a read-after-write dependency from line 4 to line 2 is reported by \depsim{}
----~although there is no such dependency inherent to the kernel.
+----~although there is no such dependency inherent to the kernel.
 
 However, each iteration of the
 \texttt{measure} (outer) loop and each iteration of the \texttt{repeat} (inner)
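The four-dash `----~` introduced in the hunk above shows the pitfall of a blind `--` to `---` substitution: a run that is already an em-dash gets lengthened once more. A guarded replacement avoids this; the `en_to_em` helper below is ours, a minimal sketch of such a guard.

```python
import re

def en_to_em(text):
    # Replace exactly two hyphens with three, leaving longer runs
    # (already-em-dashes) untouched, via lookaround guards.
    return re.sub(r'(?<!-)--(?!-)', '---', text)

# A bare substitution also turns an existing "---" into "----":
naive = "an em-dash ---~here".replace('--', '---')  # "an em-dash ----~here"
safe = en_to_em("an em-dash ---~here")              # unchanged
```

Running such a guarded substitution over the manuscript would have produced this commit without the `----~` regression visible above.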
@@ -64,8 +64,8 @@ write it back. This creates a dependency to the previous iteration of the inner
 loop, which should in practice be meaningless if \lstc{BENCHMARK_SIZE} is large
 enough. Such dependencies, however, pollute the evaluation results: as \depsim{}
 does not report a dependency's distance, they are considered meaningful; and as
-they cannot be detected by \staticdeps{} --~which is unaware of the outer and
-inner loop~--, they introduce unfairness in the evaluation. The actual loss of
+they cannot be detected by \staticdeps{} ---~which is unaware of the outer and
+inner loop~---, they introduce unfairness in the evaluation. The actual loss of
 precision introduced by not discovering such dependencies is instead assessed
 later by enriching \uica{} with \staticdeps{}.
 
@@ -99,14 +99,14 @@ source and destination program counters are not in the same basic block are
 discarded, as \staticdeps{} cannot detect them by construction.
 
 For each of the considered basic blocks, we run our static analysis,
-\staticdeps{}. We discard the $\Delta{}k$ parameter --~how many loop iterations
-the dependency spans~--, as our dynamic analysis does not report an equivalent
+\staticdeps{}. We discard the $\Delta{}k$ parameter ---~how many loop iterations
+the dependency spans~---, as our dynamic analysis does not report an equivalent
 parameter, but only a pair of program counters.
 
 Dynamic dependencies from \depsim{} are converted to
 \emph{periodic dependencies} in the sense of \staticdeps{} as described in
 \autoref{ssec:staticdeps:practical_implem}: only dependencies occurring on at
-least 80\% of the block's iterations are kept --~else, dependencies are
+least 80\% of the block's iterations are kept ---~else, dependencies are
 considered measurement artifacts. The \emph{periodic coverage}
 of \staticdeps{} dependencies for this basic block \wrt{} \depsim{} is the
 proportion of dependencies found by \staticdeps{} among the periodic
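The 80% periodicity threshold described in the hunk above amounts to a one-line filter; a minimal sketch, with a helper name of our own choosing:

```python
def is_periodic(occurrences, iterations, threshold=0.8):
    # Keep a dynamic dependency only if it occurs on at least 80% of the
    # basic block's iterations; below that, treat it as a measurement artifact.
    return occurrences / iterations >= threshold
```

For instance, a dependency seen on 85 of 100 iterations is kept, while one seen on 10 of 100 is discarded as an artifact.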
@@ -117,7 +117,7 @@ dependencies extracted from \depsim{}:
 
 \smallskip{}
 
-We also keep the raw dependencies from \depsim{} --~that is, without converting
+We also keep the raw dependencies from \depsim{} ---~that is, without converting
 them to periodic dependencies. From these, we consider two metrics:
 the unweighted dependencies coverage, \[
 \cov_u = \dfrac{\card{\text{found}}}{\card{\text{found}} + \card{\text{missed}}}
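The unweighted coverage formula quoted in the hunk above translates directly to code; a minimal sketch, with a helper name of our own:

```python
def unweighted_coverage(n_found, n_missed):
    # cov_u = |found| / (|found| + |missed|)
    return n_found / (n_found + n_missed)

# 3 dependencies found, 1 missed: coverage of 0.75
cov_u = unweighted_coverage(3, 1)
```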
@ -158,20 +158,20 @@ very close to 100\,\%, giving us good confidence on the accuracy of
|
||||||
|
|
||||||
The same methodology can be re-used as a proxy for estimating the rate of
|
The same methodology can be re-used as a proxy for estimating the rate of
|
||||||
aliasing independent pointers in our dataset. Indeed, a major approximation
|
aliasing independent pointers in our dataset. Indeed, a major approximation
|
||||||
made by \staticdeps{} is to assume that any new encountered pointer --~function
|
made by \staticdeps{} is to assume that any new encountered pointer ---~function
|
||||||
parameters, value read from memory, \ldots~-- does \emph{not} alias with
|
parameters, value read from memory, \ldots~--- does \emph{not} alias with
|
||||||
previously encountered values. This is implemented by the use of a fresh
|
previously encountered values. This is implemented by the use of a fresh
|
||||||
random value for each value yet unknown.
|
random value for each value yet unknown.
|
||||||
|
|
||||||
Determining which pointers may point to which other pointers ---~and, by
extension, may point to the same memory region~--- is called a \emph{points-to
analysis}~\cite{points_to}. In the context of \staticdeps{}, it characterizes
the pointers for which taking a fresh value was \emph{not} representative of
reality.

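As an illustration of this characterization, a toy check of whether two pointers may point to the same region, assuming we recorded during execution the sets of values derived from each pointer (names and offsets hypothetical):

```python
def may_point_to(derived_a: set[int], derived_b: set[int]) -> bool:
    """a may point to b if any value derived from a (a + k for the
    observed offsets k) coincides with a value derived from b,
    regardless of *when* in the execution each value was seen."""
    return not derived_a.isdisjoint(derived_b)

# Values observed at runtime for two pointers a and b:
a, b = 0x1000, 0x1008
derived_a = {a + k for k in (0, 8, 16)}    # a, a+8, a+16
derived_b = {b + l for l in (0, 8)}        # b, b+8
print(may_point_to(derived_a, derived_b))  # → True, since a + 8 == b + 0
```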
If we detect, through dynamic analysis, that a value derived from a
pointer \lstc{a} shares a value with one derived from a pointer \lstc{b}
---~say, \lstc{a + k == b + l}~---, we can deduce that \lstc{a} \emph{points-to}
\lstc{b}. This is true even if \lstc{a + k} at the very beginning of the
execution is equal to \lstc{b + l} at the very end of the execution: although
the pointers will not alias (that is, share the same value at the same moment),
@@ -215,7 +215,7 @@ The results of this analysis are presented in
\autoref{table:cov_staticdeps_pointsto}. The very high coverage rate gives us
good confidence that our hypothesis of independent pointers is reasonable, at
least within the scope of Polybench, which we believe representative of
scientific computation ---~one of the prominent use-cases of tools such as code
analyzers.

\subsection{Enriching \uica{}'s model}
@@ -329,7 +329,7 @@ constituting basic blocks.
\centering
\includegraphics[width=\linewidth]{staticdeps_time_boxplot.svg}
\captionof{figure}{Statistical distribution of \staticdeps{} and \depsim{} run times
    on \cesasme{}'s kernels ---~log y scale}\label{fig:staticdeps_cesasme_runtime_boxplot}
\end{minipage}\hfill\begin{minipage}{0.48\linewidth}
\centering
\includegraphics[width=\linewidth]{staticdeps_speedup_boxplot.svg}
@@ -344,14 +344,14 @@ constituting basic blocks.
\toprule
\textbf{Sequence} & \textbf{Average} & \textbf{Median} & \textbf{Q1} & \textbf{Q3} \\
\midrule
Seq.\ (\ref{messeq:depsim}) ---~\depsim{}
    & 18083 ms & 17645 ms & 17080 ms & 18650 ms \\
Seq.\ (\ref{messeq:staticdeps_sum}) ---~\staticdeps{} (sum)
    & 2307 ms & 677 ms & 557 ms & 2700 ms \\
Seq.\ (\ref{messeq:staticdeps_one}) ---~\staticdeps{} (single)
    & 529 ms & 545 ms & 425 ms & 588 ms \\
\midrule
Seq.\ (\ref{messeq:staticdeps_speedup}) ---~speedup
    & $\times$36.1 & $\times$33.5 & $\times$30.1 & $\times$41.7 \\
\bottomrule
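The speedup row is best read as a distribution of per-kernel ratios rather than a ratio of the aggregated times: for instance, the ratio of median run times (17645\,/\,545 $\approx$ $\times$32.4) differs from the median of the per-kernel speedups ($\times$33.5). A sketch of this aggregation, with hypothetical per-kernel timings:

```python
import statistics

# Hypothetical per-kernel run times in ms (not the evaluation's data).
depsim_ms     = [18000, 17500, 17200, 18600]
staticdeps_ms = [520, 540, 430, 590]

# Aggregate per-kernel ratios, not a ratio of aggregates.
speedups = [d / s for d, s in zip(depsim_ms, staticdeps_ms)]
print(f"median speedup: x{statistics.median(speedups):.1f}")  # → x33.5
```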