Typography: -- to ---

Théophile Bastian 2024-09-01 16:56:48 +02:00
parent 103e6a0687
commit d1401b068f
11 changed files with 54 additions and 54 deletions
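
The change is mechanical: en dashes (--) used to set off parenthetical asides are promoted to em dashes (---), keeping the existing non-breaking space (~) on the inner side of each aside. A bulk edit of this kind is typically done with a substitution along the following lines (a sketch only; the author's actual command is not recorded in this commit, and the file glob is illustrative):

    # Promote aside-delimiting en dashes to em dashes in the LaTeX sources.
    # Only the tilde-adjacent patterns are touched, so ranges such as
    # "pages 12--15" keep their en dash.
    sed -i -e 's/--~/---~/g' -e 's/~--/~---/g' *.tex

    # Caveat: run only once. An already-correct "---~" still contains "--~",
    # so a second pass (or a pre-converted line) yields "----~".

One hunk below indeed ends up with four dashes (----~although), the typical artifact of such a replacement hitting an already-converted line.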


@@ -1,8 +1,8 @@
 \selectlanguage{french}
 \begin{abstract}
 Qu'il s'agisse de calculs massifs distribués sur plusieurs baies, de
-calculs en environnement contraint --~comme de l'embarqué ou de
-l'\emph{edge computing}~-- ou encore de tentatives de réduire l'empreinte
+calculs en environnement contraint ---~comme de l'embarqué ou de
+l'\emph{edge computing}~--- ou encore de tentatives de réduire l'empreinte
 écologique d'un programme fréquemment utilisé, de nombreux cas d'usage
 justifient l'optimisation poussée d'un programme. Celle-ci s'arrête souvent
 à l'optimisation de haut niveau (algorithmique, parallélisme, \ldots), mais
@@ -34,11 +34,11 @@
 \selectlanguage{english}
 \begin{abstract}
 Be it massively distributed computation over multiple server racks,
-constrained computation --~such as in embedded environments or in
-\emph{edge computing}~--, or still an attempt to reduce the ecological
+constrained computation ---~such as in embedded environments or in
+\emph{edge computing}~---, or still an attempt to reduce the ecological
 footprint of a frequently-run program, many use-cases make it relevant to
 deeply optimize a program. This optimisation is often limited to high-level
-optimisation --~choice of algorithms, parallel computing, \ldots{} Yet, it
+optimisation ---~choice of algorithms, parallel computing, \ldots{} Yet, it
 is possible to carry it further to low-level optimisations, by inspecting
 the generated assembly with respect to the microarchitecture of the
 specific microprocessor used to fine-tune it.


@@ -101,7 +101,7 @@ processor.
 The CPU frontend constantly fetches a flow of instruction bytes. This flow must
 first be broken down into a sequence of instructions. While on some ISAs, each
-instruction is made of a constant amount of bytes ---~\eg{} ARM~--, this is not
+instruction is made of a constant amount of bytes ---~\eg{} ARM~---, this is not
 always the case: for instance, x84-64 instructions can be as short as one byte,
 while the ISA only limits an instruction to 15
 bytes~\cite{ref:intel64_software_dev_reference_vol1}. This task is performed by


@@ -167,8 +167,8 @@ We have now covered enough of the theoretical background to introduce code
 analyzers in a concrete way, through examples of their usage. For this purpose,
 we use \llvmmca{}, one of the state-of-the-art code analyzers.
-Due to its relative simplicity --~at least compared to \eg{} Intel's x86-64
-implementations~--, we will base the following examples on ARM's Cortex A72,
+Due to its relative simplicity ---~at least compared to \eg{} Intel's x86-64
+implementations~---, we will base the following examples on ARM's Cortex A72,
 which we introduce in depth later in \autoref{chap:frontend}. No specific
 knowledge of this microarchitecture is required to understand the following
 examples; for our purposes, if suffices to say that:
@@ -192,7 +192,7 @@ on a single load operation: \lstarmasm{ldr x1, [x2]}.
 \lstinputlisting[language=bash]{assets/src/20_foundations/llvm_mca_examples/01_ldr.out}
 The first rows (2-10) are high-level metrics. \llvmmca{} works by simulating
-the execution of the kernel --~here, 100 times, as seen row 2~--. This simple
+the execution of the kernel ---~here, 100 times, as seen row 2~---. This simple
 kernel contains only one instruction, which breaks down into a single \uop{}.
 Iterating it takes 106 cycles instead of the expected 100 cycles, as this
 execution is \emph{not} in steady-state, but accounts for the cycles from the
@@ -224,9 +224,9 @@ takes up all load resources available.
 \lstinputlisting[language=bash]{assets/src/20_foundations/llvm_mca_examples/02_ldr_timeline.out}
 which indicates, for each instruction, the timeline of its execution. Here,
-\texttt{D} stands for decode, \texttt{e} for being executed --~in the
-pipeline~--, \texttt{E} for last cycle of its execution --~leaving the
-pipeline~--, \texttt{R} for retiring. When an instruction is decoded and
+\texttt{D} stands for decode, \texttt{e} for being executed ---~in the
+pipeline~---, \texttt{E} for last cycle of its execution ---~leaving the
+pipeline~---, \texttt{R} for retiring. When an instruction is decoded and
 waiting to be dispatched to execution, an \texttt{=} is shown.
 The identifier at the beginning of each row indicates the kernel iteration
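
As context for the two captured .out listings referenced in this file: such reports come from invoking llvm-mca on the kernel's assembly. A plausible invocation for the timeline example, using standard llvm-mca options (the exact command used for the thesis is not part of this diff, and the file name is illustrative):

    # Sketch: reproduce a timeline report like 02_ldr_timeline.out for the
    # single-load Cortex-A72 kernel discussed above.
    printf 'ldr x1, [x2]\n' > kernel.s
    llvm-mca -mtriple=aarch64 -mcpu=cortex-a72 -iterations=100 -timeline kernel.s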


@@ -17,8 +17,8 @@ Reference Manual}~\cite{ref:intel64_architectures_optim_reference_vol1},
 regularly updated, whose nearly 1,000 pages give relevant details to Intel's
 microarchitectures, such as block diagrams, pipelines, ports available, etc. It
 further gives data tables with throughput and latencies for some instructions.
-While the manual provides a huge collection of important insights --~from the
-optimisation perspective~-- on their microarchitectures, it lacks exhaustive
+While the manual provides a huge collection of important insights ---~from the
+optimisation perspective~--- on their microarchitectures, it lacks exhaustive
 and (conveniently) machine-parsable data tables and does not detail port usages
 of each instruction.
@@ -30,7 +30,7 @@ AMD, since 2020, releases lengthy and complete optimisation manuals for its
 microarchitecture. For instance, the Zen4 optimisation
 manual~\cite{ref:amd_zen4_optim_manual} contains both detailed insights on the
 processor's workflow and ports, and a spreadsheet of about 3,400 x86
-instructions --~with operands variants broken down~-- and their port usage,
+instructions ---~with operands variants broken down~--- and their port usage,
 throughput and latencies. Such an effort, which certainly translates to a
 non-negligible financial cost to the company, showcases the importance and
 recent expectations on such documents.
@@ -89,7 +89,7 @@ existence.
 Going further than data extraction at the individual instruction level,
 academics and industrials interested in this domain now mostly work on
 code analyzers, as described in \autoref{ssec:code_analyzers} above. Each such
-tool embeds a model --~or collection of models~-- on which its inference is
+tool embeds a model ---~or collection of models~--- on which its inference is
 based, and whose definition, embedded data and obtention method varies from
 tool to tool. These tools often use, to some extent, the data on individual
 instructions obtained either from the manufacturer or the third-party efforts
@@ -106,16 +106,16 @@ microarchitectures. Yet, being closed-source and relying on data that is
 partially unavailable to the public, the model is not totally satisfactory to
 academics or engineers trying to understand specific performance results. It
 also makes it vulnerable to deprecation, as the community is unable to
-\textit{fork} the project --~and indeed, \iaca{} has been discontinued by Intel
+\textit{fork} the project ---~and indeed, \iaca{} has been discontinued by Intel
 in 2019. Thus, \iaca{} does not support recent microarchitectures, and its
 binary was recently removed from official download pages.
 \medskip{}
-In the meantime, the LLVM Machine Code Analyzer --~or \llvmmca{}~-- was
+In the meantime, the LLVM Machine Code Analyzer ---~or \llvmmca{}~--- was
 developed as an internal tool at Sony, and was proposed for inclusion in
 \llvm{} in 2018~\cite{llvm_mca_rfc}. This code analyzer is based on the
-data tables that \llvm{} --~a compiler~-- has to maintain for each
+data tables that \llvm{} ---~a compiler~--- has to maintain for each
 microarchitecture in order to produce optimized code. The project has since
 then evolved to be fairly accurate, as seen in the experiments later presented
 in this manuscript. It is the alternative Intel offers to \iaca{} subsequently
@@ -125,7 +125,7 @@ to its deprecation.
 Another model, \osaca{}, was developed by Jan Laukemann \textit{et al.}
 starting in 2017~\cite{osaca1,osaca2}. Its development stemmed from the lack
-(at the time) of an open-source --~and thus, open-model~-- alternative to IACA.
+(at the time) of an open-source ---~and thus, open-model~--- alternative to IACA.
 As a data source, \osaca{} makes use of Agner Fog's data tables or \uopsinfo{}.
 It still lacks, however, a good model of frontend and data dependencies, making
 it less performant than other code analyzers in our experiments later in this
@@ -145,7 +145,7 @@ to predict its reverse throughput. Doing so, even with perfect accuracy, does
 not explain the source of a performance problem: the model is unable to help
 detecting which resource is the performance bottleneck of a kernel; in other
 words, it quantifies a potential issue, but does not help in \emph{explaining}
-it --~or debugging it.
+it ---~or debugging it.
 \medskip{}
@@ -168,7 +168,7 @@ Abel and Reineke, the authors of \uopsinfo{}, recently released
 reverse-engineering through the use of hardware counters to model the frontend
 and pipelines. We found this tool to be very accurate (see experiments later in
 this manuscript), with results comparable with \llvmmca{}. Its source code
---~under free software license~-- is self-contained and reasonably concise
+---~under free software license~--- is self-contained and reasonably concise
 (about 2,000 lines of Python for the main part), making it a good basis and
 baseline for experiments. It is, however, closely tied by design to Intel
 microarchitectures, or microarchitectures very close to Intel's ones.


@@ -3,7 +3,7 @@ of good microarchitectural analysis and predictions in many aspects. One thing,
 however, that we found lacking, was a generic method to obtain a model for a
 given microarchitecture. Indeed, while \eg{} \iaca{} and \uopsinfo{} are
 performant and quite exhaustive models of Intel's x86-64 implementations, they
-are restricted to Intel CPUs --~and few others for \uopsinfo{}. These models
+are restricted to Intel CPUs ---~and few others for \uopsinfo{}. These models
 were, at least up to a point, handcrafted. While \iaca{} is based on insider's
 knowledge from Intel (and thus would not work for \eg{} AMD), \uopsinfo{}'s
 method is based on specific hardware counters and handpicked instructions with


@@ -39,11 +39,11 @@ that $\texttt{\footnotesize{}UNROLL\_SIZE} \times
 parameters of the benchmark generation.
 \pipedream{} must be able to distinguish between variants of instructions with
-the same mnemonic --~\eg{} \lstxasm{mov}~-- but different operand kinds,
-altering the semantics and performance of the instruction --~such as a
+the same mnemonic ---~\eg{} \lstxasm{mov}~--- but different operand kinds,
+altering the semantics and performance of the instruction ---~such as a
 \lstxasm{mov} loading from memory versus a \lstxasm{mov} between registers. To
 this end, \pipedream{} represents instructions fully qualified with their
-operands' kind --~this can be seen as a process akin to C++'s name mangling.
+operands' kind ---~this can be seen as a process akin to C++'s name mangling.
 As \pipedream{} gets a multiset of instructions as a kernel, these
 instructions' arguments must be instantiated to turn them into actual assembly


@@ -136,7 +136,7 @@
 \figthreerowlegend{Polybench}{polybench-W-zen}
 \end{figleftlabel}
-\caption{IPC prediction profile heatmaps~--~predictions closer to the
+\caption{IPC prediction profile heatmaps~---~predictions closer to the
 red line are more accurate. Predicted IPC ratio (Y) against native
 IPC (X)}
 \label{fig:palmed_heatmaps}


@@ -64,13 +64,13 @@ each instruction. Its parameters are:
 The first step in modeling a processor's frontend should certainly be to
 characterize the number of \uops{} that can be dispatched in a cycle. We assume
-that a model of the backend is known --~by taking for instance a model
+that a model of the backend is known ---~by taking for instance a model
 generated by \palmed{}, using tables from \uopsinfo{} or any other mean. To the
 best of our knowledge, we can safely further assume that instructions that load
 a single backend port only once are also composed of a single \uop{}.
 Generating a few combinations of a diversity of those and measuring their
-effective throughput --~making sure using the backend model that the latter is
-not the bottleneck~-- and keeping the maximal throughput reached should provide
+effective throughput ---~making sure using the backend model that the latter is
+not the bottleneck~--- and keeping the maximal throughput reached should provide
 a good value.
 \medskip{}
@@ -88,7 +88,7 @@ The core of the model presented in this chapter is the discovery, for each
 instruction, of its \uop{} count. Still assuming the knowledge of a backend
 model, the method described in \autoref{ssec:a72_insn_muop_count} should be
 generic enough to be used on any processor. The basic instructions may be
-easily selected using the backend model --~we assume their existence in most
+easily selected using the backend model ---~we assume their existence in most
 microarchitectures, as pragmatic concerns guide the ports design. Counting the
 \uops{} of an instruction thus follows, using only elapsed cycles counters.
@@ -141,7 +141,7 @@ be investigated if the model does not reach the expected accuracy.
 \uops{} are repeatedly streamed from the decode queue, without even the
 necessity to hit a cache. We are unaware of similar features in other
 commercial processors. In embedded programming, however, \emph{hardware
-loops} --~which are set up explicitly by the programmer~-- achieve,
+loops} ---~which are set up explicitly by the programmer~--- achieve,
 among others, the same
 goal~\cite{hardware_loops_patent,kavvadias2007hardware,talla2001hwloops}.
@@ -157,8 +157,8 @@ be investigated if the model does not reach the expected accuracy.
 \item{} In reality, there is an intermediary step between instructions and
 \uops{}: macro-ops. Although it serves a designing and semantic
-purpose, we omit this step in the current model as --~we
-believe~-- it is of little importance to predict performance.
+purpose, we omit this step in the current model as ---~we
+believe~--- it is of little importance to predict performance.
 \item{} On x86 architectures at least, common pairs of micro- or
 macro-operations may be ``fused'' into a single one, up to various
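
A compact restatement of the µop-counting scheme sketched in this file's hunks (our formula, not the thesis's notation, under the assumptions stated above: a dispatch width of F µops per cycle and filler instructions known to decode to a single µop each): pad the target instruction i with k single-µop fillers until the frontend is the bottleneck, and let C(i, k) be the steady-state cycles per kernel iteration. Then

    \[ \mu(i) \;=\; F \cdot C(i, k) \,-\, k \]

since a frontend-bound kernel dispatches F µops every cycle, F·C(i, k) counts all µops of one iteration, of which k belong to the fillers.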


@@ -29,7 +29,7 @@ We were also able to show in Section~\ref{ssec:memlatbound}
 that state-of-the-art static analyzers struggle to
 account for memory-carried dependencies; a weakness significantly impacting
 their overall results on our benchmarks. We believe that detecting
-and accounting for these dependencies is an important topic --~which we will
+and accounting for these dependencies is an important topic ---~which we will
 tackle in the following chapter.
 Moreover, we present this work in the form of a modular software package, each


@@ -43,7 +43,7 @@ of a CPU and, in particular, how \uops{} transit in (decoded) and out
 If a \uop{} has not been retired yet (issued and executed), it cannot be
 replaced in the ROB by any freshly decoded instruction. In other words, every
-non-retired decoded \uop{} --~also called \emph{in-flight}~-- remains in the
+non-retired decoded \uop{} ---~also called \emph{in-flight}~--- remains in the
 reorder buffer. This is possible thanks to the notion of \emph{full reorder
 buffer}:


@@ -54,7 +54,7 @@ loop:
 \end{lstlisting}
 \end{minipage}\\
 a read-after-write dependency from line 4 to line 2 is reported by \depsim{}
----~although there is no such dependency inherent to the kernel.
+----~although there is no such dependency inherent to the kernel.
 However, each iteration of the
 \texttt{measure} (outer) loop and each iteration of the \texttt{repeat} (inner)
@@ -64,8 +64,8 @@ write it back. This creates a dependency to the previous iteration of the inner
 loop, which should in practice be meaningless if \lstc{BENCHMARK_SIZE} is large
 enough. Such dependencies, however, pollute the evaluation results: as \depsim{}
 does not report a dependency's distance, they are considered meaningful; and as
-they cannot be detected by \staticdeps{} --~which is unaware of the outer and
-inner loop~--, they introduce unfairness in the evaluation. The actual loss of
+they cannot be detected by \staticdeps{} ---~which is unaware of the outer and
+inner loop~---, they introduce unfairness in the evaluation. The actual loss of
 precision introduced by not discovering such dependencies is instead assessed
 later by enriching \uica{} with \staticdeps{}.
@@ -99,14 +99,14 @@ source and destination program counters are not in the same basic block are
 discarded, as \staticdeps{} cannot detect them by construction.
 For each of the considered basic blocks, we run our static analysis,
-\staticdeps{}. We discard the $\Delta{}k$ parameter --~how many loop iterations
-the dependency spans~--, as our dynamic analysis does not report an equivalent
+\staticdeps{}. We discard the $\Delta{}k$ parameter ---~how many loop iterations
+the dependency spans~---, as our dynamic analysis does not report an equivalent
 parameter, but only a pair of program counters.
 Dynamic dependencies from \depsim{} are converted to
 \emph{periodic dependencies} in the sense of \staticdeps{} as described in
 \autoref{ssec:staticdeps:practical_implem}: only dependencies occurring on at
-least 80\% of the block's iterations are kept --~else, dependencies are
+least 80\% of the block's iterations are kept ---~else, dependencies are
 considered measurement artifacts. The \emph{periodic coverage}
 of \staticdeps{} dependencies for this basic block \wrt{} \depsim{} is the
 proportion of dependencies found by \staticdeps{} among the periodic
@@ -117,7 +117,7 @@ dependencies extracted from \depsim{}:
 \smallskip{}
-We also keep the raw dependencies from \depsim{} --~that is, without converting
+We also keep the raw dependencies from \depsim{} ---~that is, without converting
 them to periodic dependencies. From these, we consider two metrics:
 the unweighted dependencies coverage, \[
 \cov_u = \dfrac{\card{\text{found}}}{\card{\text{found}} + \card{\text{missed}}}
@@ -158,20 +158,20 @@ very close to 100\,\%, giving us good confidence on the accuracy of
 The same methodology can be re-used as a proxy for estimating the rate of
 aliasing independent pointers in our dataset. Indeed, a major approximation
-made by \staticdeps{} is to assume that any new encountered pointer --~function
-parameters, value read from memory, \ldots~-- does \emph{not} alias with
+made by \staticdeps{} is to assume that any new encountered pointer ---~function
+parameters, value read from memory, \ldots~--- does \emph{not} alias with
 previously encountered values. This is implemented by the use of a fresh
 random value for each value yet unknown.
-Determining which pointers may point to which other pointers --~and, by
-extension, may point to the same memory region~-- is called a \emph{points-to
+Determining which pointers may point to which other pointers ---~and, by
+extension, may point to the same memory region~--- is called a \emph{points-to
 analysis}~\cite{points_to}. In the context of \staticdeps{}, it characterizes
 the pointers for which taking a fresh value was \emph{not} representative of
 the reality.
 If we detect, through dynamic analysis, that a value derived from a
 pointer \lstc{a} shares a value with one derived from a pointer \lstc{b}
---~say, \lstc{a + k == b + l}~--, we can deduce that \lstc{a} \emph{points-to}
+--~say, \lstc{a + k == b + l}~---, we can deduce that \lstc{a} \emph{points-to}
 \lstc{b}. This is true even if \lstc{a + k} at the very beginning of the
 execution is equal to \lstc{b + l} at the very end of the execution: although
 the pointers will not alias (that is, share the same value at the same moment),
@@ -215,7 +215,7 @@ The results of this analysis are presented in
 \autoref{table:cov_staticdeps_pointsto}. The very high coverage rate gives us
 good confidence that our hypothesis of independent pointers is reasonable, at
 least within the scope of Polybench, which we believe representative of
-scientific computation --~one of the prominent use-cases of tools such as code
+scientific computation ---~one of the prominent use-cases of tools such as code
 analyzers.
 \subsection{Enriching \uica{}'s model}
@@ -329,7 +329,7 @@ constituting basic blocks.
 \centering
 \includegraphics[width=\linewidth]{staticdeps_time_boxplot.svg}
 \captionof{figure}{Statistical distribution of \staticdeps{} and \depsim{} run times
-on \cesasme{}'s kernels --~log y scale}\label{fig:staticdeps_cesasme_runtime_boxplot}
+on \cesasme{}'s kernels ---~log y scale}\label{fig:staticdeps_cesasme_runtime_boxplot}
 \end{minipage}\hfill\begin{minipage}{0.48\linewidth}
 \centering
 \includegraphics[width=\linewidth]{staticdeps_speedup_boxplot.svg}
@@ -344,14 +344,14 @@ constituting basic blocks.
 \toprule
 \textbf{Sequence} & \textbf{Average} & \textbf{Median} & \textbf{Q1} & \textbf{Q3} \\
 \midrule
-Seq.\ (\ref{messeq:depsim}) --~\depsim{}
+Seq.\ (\ref{messeq:depsim}) ---~\depsim{}
 & 18083 ms & 17645 ms & 17080 ms & 18650 ms \\
-Seq.\ (\ref{messeq:staticdeps_sum}) --~\staticdeps{} (sum)
+Seq.\ (\ref{messeq:staticdeps_sum}) ---~\staticdeps{} (sum)
 & 2307 ms & 677 ms & 557 ms & 2700 ms \\
-Seq.\ (\ref{messeq:staticdeps_one}) --~\staticdeps{} (single)
+Seq.\ (\ref{messeq:staticdeps_one}) ---~\staticdeps{} (single)
 & 529 ms & 545 ms & 425 ms & 588 ms \\
 \midrule
-Seq.\ (\ref{messeq:staticdeps_speedup}) --~speedup
+Seq.\ (\ref{messeq:staticdeps_speedup}) ---~speedup
 & $\times$36.1 & $\times$33.5 & $\times$30.1 &
 $\times$41.7 \\
 \bottomrule