Typography: -- to ---

Théophile Bastian 2024-09-01 16:56:48 +02:00
parent 103e6a0687
commit d1401b068f
11 changed files with 54 additions and 54 deletions

View file

@@ -1,8 +1,8 @@
\selectlanguage{french}
\begin{abstract}
Qu'il s'agisse de calculs massifs distribués sur plusieurs baies, de
calculs en environnement contraint --~comme de l'embarqué ou de
l'\emph{edge computing}~-- ou encore de tentatives de réduire l'empreinte
calculs en environnement contraint ---~comme de l'embarqué ou de
l'\emph{edge computing}~--- ou encore de tentatives de réduire l'empreinte
écologique d'un programme fréquemment utilisé, de nombreux cas d'usage
justifient l'optimisation poussée d'un programme. Celle-ci s'arrête souvent
à l'optimisation de haut niveau (algorithmique, parallélisme, \ldots), mais
@@ -34,11 +34,11 @@
\selectlanguage{english}
\begin{abstract}
Be it massively distributed computation over multiple server racks,
constrained computation --~such as in embedded environments or in
\emph{edge computing}~--, or even an attempt to reduce the ecological
constrained computation ---~such as in embedded environments or in
\emph{edge computing}~---, or even an attempt to reduce the ecological
footprint of a frequently-run program, many use-cases make it relevant to
deeply optimise a program. This optimisation is often limited to high-level
optimisation --~choice of algorithms, parallel computing, \ldots{}~-- yet it
optimisation ---~choice of algorithms, parallel computing, \ldots{}~--- yet it
is possible to carry it further to low-level optimisations, by inspecting
the generated assembly with respect to the microarchitecture of the
specific microprocessor used, in order to fine-tune it.

View file

@@ -101,7 +101,7 @@ processor.
The CPU frontend constantly fetches a flow of instruction bytes. This flow must
first be broken down into a sequence of instructions. While on some ISAs, each
instruction is made of a constant number of bytes ---~\eg{} ARM~--, this is not
instruction is made of a constant number of bytes ---~\eg{} ARM~---, this is not
always the case: for instance, x86-64 instructions can be as short as one byte,
while the ISA only limits an instruction to 15
bytes~\cite{ref:intel64_software_dev_reference_vol1}. This task is performed by
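This variable length can be observed directly; a minimal sketch, assuming an x86-64 host with GNU binutils (the file name len.s is arbitrary):

cat > len.s <<'EOF'
push %rax                          # shortest form: 1 byte (0x50)
movabs $0x1122334455667788, %rax   # 10 bytes: REX.W prefix, opcode, 8-byte immediate
EOF
as len.s -o len.o && objdump -d len.o

The disassembly prints each instruction's raw bytes, making the length difference immediately visible.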

View file

@@ -167,8 +167,8 @@ We have now covered enough of the theoretical background to introduce code
analyzers in a concrete way, through examples of their usage. For this purpose,
we use \llvmmca{}, one of the state-of-the-art code analyzers.
Due to its relative simplicity --~at least compared to \eg{} Intel's x86-64
implementations~--, we will base the following examples on ARM's Cortex A72,
Due to its relative simplicity ---~at least compared to \eg{} Intel's x86-64
implementations~---, we will base the following examples on ARM's Cortex A72,
which we introduce in depth later in \autoref{chap:frontend}. No specific
knowledge of this microarchitecture is required to understand the following
examples; for our purposes, it suffices to say that:
@@ -192,7 +192,7 @@ on a single load operation: \lstarmasm{ldr x1, [x2]}.
\lstinputlisting[language=bash]{assets/src/20_foundations/llvm_mca_examples/01_ldr.out}
The first rows (2-10) are high-level metrics. \llvmmca{} works by simulating
the execution of the kernel --~here, 100 times, as seen on row 2~--. This simple
the execution of the kernel ---~here, 100 times, as seen on row 2~---. This simple
kernel contains only one instruction, which breaks down into a single \uop{}.
Iterating it takes 106 cycles instead of the expected 100 cycles, as this
execution is \emph{not} in steady-state, but accounts for the cycles from the
@@ -224,9 +224,9 @@ takes up all load resources available.
\lstinputlisting[language=bash]{assets/src/20_foundations/llvm_mca_examples/02_ldr_timeline.out}
which indicates, for each instruction, the timeline of its execution. Here,
\texttt{D} stands for decode, \texttt{e} for being executed --~in the
pipeline~--, \texttt{E} for the last cycle of its execution --~leaving the
pipeline~--, \texttt{R} for retiring. When an instruction is decoded and
\texttt{D} stands for decode, \texttt{e} for being executed ---~in the
pipeline~---, \texttt{E} for the last cycle of its execution ---~leaving the
pipeline~---, \texttt{R} for retiring. When an instruction is decoded and
waiting to be dispatched to execution, an \texttt{=} is shown.
The identifier at the beginning of each row indicates the kernel iteration
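Output such as the two listings above can be reproduced with an invocation along the following lines; the flags are genuine \llvmmca{} options, though whether they are the exact ones behind these listings is an assumption:

cat > kernel.s <<'EOF'
ldr x1, [x2]
EOF
llvm-mca --mtriple=aarch64 --mcpu=cortex-a72 --iterations=100 --timeline kernel.s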

View file

@@ -17,8 +17,8 @@ Reference Manual}~\cite{ref:intel64_architectures_optim_reference_vol1},
regularly updated, whose nearly 1,000 pages give relevant details on Intel's
microarchitectures, such as block diagrams, pipelines, ports available, etc. It
further gives data tables with throughput and latencies for some instructions.
While the manual provides a huge collection of important insights --~from the
optimisation perspective~-- on their microarchitectures, it lacks exhaustive
While the manual provides a huge collection of important insights ---~from the
optimisation perspective~--- on their microarchitectures, it lacks exhaustive
and (conveniently) machine-parsable data tables and does not detail port usages
of each instruction.
@@ -30,7 +30,7 @@ AMD, since 2020, releases lengthy and complete optimisation manuals for its
microarchitecture. For instance, the Zen4 optimisation
manual~\cite{ref:amd_zen4_optim_manual} contains both detailed insights on the
processor's workflow and ports, and a spreadsheet of about 3,400 x86
instructions --~with operand variants broken down~-- and their port usage,
instructions ---~with operand variants broken down~--- and their port usage,
throughput and latencies. Such an effort, which certainly translates to a
non-negligible financial cost to the company, showcases the importance of such
documents and the expectations recently placed on them.
@@ -89,7 +89,7 @@ existence.
Going further than data extraction at the individual instruction level,
academics and industry practitioners interested in this domain now mostly work on
code analyzers, as described in \autoref{ssec:code_analyzers} above. Each such
tool embeds a model --~or collection of models~-- on which its inference is
tool embeds a model ---~or collection of models~--- on which its inference is
based, and whose definition, embedded data and construction method vary from
tool to tool. These tools often use, to some extent, the data on individual
instructions obtained either from the manufacturer or the third-party efforts
@@ -106,16 +106,16 @@ microarchitectures. Yet, being closed-source and relying on data that is
partially unavailable to the public, the model is not totally satisfactory to
academics or engineers trying to understand specific performance results. This
also makes the tool vulnerable to deprecation, as the community is unable to
\textit{fork} the project --~and indeed, \iaca{} was discontinued by Intel
\textit{fork} the project ---~and indeed, \iaca{} was discontinued by Intel
in 2019. Thus, \iaca{} does not support recent microarchitectures, and its
binary was recently removed from official download pages.
\medskip{}
In the meantime, the LLVM Machine Code Analyzer --~or \llvmmca{}~-- was
In the meantime, the LLVM Machine Code Analyzer ---~or \llvmmca{}~--- was
developed as an internal tool at Sony, and was proposed for inclusion in
\llvm{} in 2018~\cite{llvm_mca_rfc}. This code analyzer is based on the
data tables that \llvm{} --~a compiler~-- has to maintain for each
data tables that \llvm{} ---~a compiler~--- has to maintain for each
microarchitecture in order to produce optimized code. The project has since
evolved to be fairly accurate, as seen in the experiments later presented
in this manuscript. It is the alternative Intel offers to \iaca{} subsequent
@@ -125,7 +125,7 @@ to its deprecation.
Another model, \osaca{}, was developed by Jan Laukemann \textit{et al.}
starting in 2017~\cite{osaca1,osaca2}. Its development stemmed from the lack
(at the time) of an open-source --~and thus, open-model~-- alternative to \iaca{}.
(at the time) of an open-source ---~and thus, open-model~--- alternative to \iaca{}.
As a data source, \osaca{} makes use of Agner Fog's data tables or \uopsinfo{}.
It still lacks, however, a good model of the frontend and of data dependencies,
making it less performant than other code analyzers in our experiments later in this
@@ -145,7 +145,7 @@ to predict its reverse throughput. Doing so, even with perfect accuracy, does
not explain the source of a performance problem: the model is unable to help
detect which resource is the performance bottleneck of a kernel; in other
words, it quantifies a potential issue, but does not help in \emph{explaining}
it --~or debugging it.
it ---~or debugging it.
\medskip{}
@@ -168,7 +168,7 @@ Abel and Reineke, the authors of \uopsinfo{}, recently released
reverse-engineering through the use of hardware counters to model the frontend
and pipelines. We found this tool to be very accurate (see experiments later in
this manuscript), with results comparable with \llvmmca{}. Its source code
--~under a free software license~-- is self-contained and reasonably concise
---~under a free software license~--- is self-contained and reasonably concise
(about 2,000 lines of Python for the main part), making it a good basis and
baseline for experiments. It is, however, closely tied by design to Intel
microarchitectures, or microarchitectures very close to Intel's.

View file

@@ -3,7 +3,7 @@ of good microarchitectural analysis and predictions in many aspects. One thing,
however, that we found lacking was a generic method to obtain a model for a
given microarchitecture. Indeed, while \eg{} \iaca{} and \uopsinfo{} are
performant and quite exhaustive models of Intel's x86-64 implementations, they
are restricted to Intel CPUs --~and a few others for \uopsinfo{}. These models
are restricted to Intel CPUs ---~and a few others for \uopsinfo{}. These models
were, at least up to a point, handcrafted. While \iaca{} is based on insider
knowledge from Intel (and thus would not work for \eg{} AMD), \uopsinfo{}'s
method is based on specific hardware counters and handpicked instructions with

View file

@@ -39,11 +39,11 @@ that $\texttt{\footnotesize{}UNROLL\_SIZE} \times
parameters of the benchmark generation.
\pipedream{} must be able to distinguish between variants of instructions with
the same mnemonic --~\eg{} \lstxasm{mov}~-- but different operand kinds,
altering the semantics and performance of the instruction --~such as a
the same mnemonic ---~\eg{} \lstxasm{mov}~--- but different operand kinds,
altering the semantics and performance of the instruction ---~such as a
\lstxasm{mov} loading from memory versus a \lstxasm{mov} between registers. To
this end, \pipedream{} represents instructions fully qualified with their
operands' kind --~this can be seen as a process akin to C++'s name mangling.
operands' kind ---~this can be seen as a process akin to C++'s name mangling.
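The analogy can be made concrete with a tangential sketch; this is not \pipedream{}'s actual naming scheme, only the C++ mechanism it is compared to. Two overloads differing solely in operand kinds receive distinct mangled symbols, much as a register-register \lstxasm{mov} is qualified apart from a memory-loading one:

cat > mangle.cpp <<'EOF'
void mov(long dst, long src) {}          // register-to-register flavour
void mov(long dst, const long *src) {}   // memory-loading flavour
EOF
g++ -c mangle.cpp -o mangle.o && nm mangle.o
# prints _Z3movll and _Z3movlPKl: one symbol per operand-kind variant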
As \pipedream{} takes a multiset of instructions as a kernel, these
instructions' arguments must be instantiated to turn them into actual assembly

View file

@@ -136,7 +136,7 @@
\figthreerowlegend{Polybench}{polybench-W-zen}
\end{figleftlabel}
\caption{IPC prediction profile heatmaps~--~predictions closer to the
\caption{IPC prediction profile heatmaps~---~predictions closer to the
red line are more accurate. Predicted IPC ratio (Y) against native
IPC (X)}
\label{fig:palmed_heatmaps}

View file

@@ -64,13 +64,13 @@ each instruction. Its parameters are:
The first step in modeling a processor's frontend should certainly be to
characterize the number of \uops{} that can be dispatched in a cycle. We assume
that a model of the backend is known --~by taking for instance a model
that a model of the backend is known ---~by taking for instance a model
generated by \palmed{}, using tables from \uopsinfo{} or any other means. To the
best of our knowledge, we can safely further assume that instructions that load
a single backend port only once are also composed of a single \uop{}.
Generating a few combinations from a variety of those and measuring their
effective throughput --~using the backend model to check that the latter is
not the bottleneck~-- and keeping the maximal throughput reached should provide
effective throughput ---~using the backend model to check that the latter is
not the bottleneck~--- and keeping the maximal throughput reached should provide
a good value.
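A minimal measurement sketch, assuming a hypothetical kernel.s holding such an unrolled combination of presumed single-\uop{} instructions and a driver.c looping over it (both file names are placeholders):

as -o kernel.o kernel.s && cc -O2 -o bench driver.c kernel.o
perf stat -e cycles,instructions ./bench
# dispatch width ~ best instructions/cycles ratio over such kernels

The highest IPC observed across several such kernels, with the backend model ruling out port saturation, serves as the dispatch-width estimate.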
\medskip{}
@@ -88,7 +88,7 @@ The core of the model presented in this chapter is the discovery, for each
instruction, of its \uop{} count. Still assuming the knowledge of a backend
model, the method described in \autoref{ssec:a72_insn_muop_count} should be
generic enough to be used on any processor. The basic instructions may be
easily selected using the backend model --~we assume their existence in most
easily selected using the backend model ---~we assume their existence in most
microarchitectures, as pragmatic concerns guide the port design. Counting the
\uops{} of an instruction thus follows, using only elapsed-cycle counters.
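As an illustrative back-of-the-envelope relation (the symbols $W$, $K$ and $C$ are ours, not the manuscript's): with a dispatch width of $W$ \uops{} per cycle, a frontend-bound kernel made of one copy of an instruction $I$ plus $K$ known single-\uop{} instructions, iterating in $C$ cycles, satisfies approximately
\[ \mu(I) \approx W \cdot C - K \]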
@@ -141,7 +141,7 @@ be investigated if the model does not reach the expected accuracy.
\uops{} are repeatedly streamed from the decode queue, without even the need
to hit a cache. We are unaware of similar features in other
commercial processors. In embedded programming, however, \emph{hardware
loops} --~which are set up explicitly by the programmer~-- achieve,
loops} ---~which are set up explicitly by the programmer~--- achieve,
among others, the same
goal~\cite{hardware_loops_patent,kavvadias2007hardware,talla2001hwloops}.
@@ -157,8 +157,8 @@ be investigated if the model does not reach the expected accuracy.
\item{} In reality, there is an intermediate step between instructions and
\uops{}: macro-ops. Although it serves a design and semantic
purpose, we omit this step in the current model as --~we
believe~-- it is of little importance for predicting performance.
purpose, we omit this step in the current model as ---~we
believe~--- it is of little importance for predicting performance.
\item{} On x86 architectures at least, common pairs of micro- or
macro-operations may be ``fused'' into a single one, up to various

View file

@@ -29,7 +29,7 @@ We were also able to show in Section~\ref{ssec:memlatbound}
that state-of-the-art static analyzers struggle to
account for memory-carried dependencies, a weakness significantly impacting
their overall results on our benchmarks. We believe that detecting
and accounting for these dependencies is an important topic --~which we will
and accounting for these dependencies is an important topic ---~which we will
tackle in the following chapter.
Moreover, we present this work in the form of a modular software package, each

View file

@@ -43,7 +43,7 @@ of a CPU and, in particular, how \uops{} transit in (decoded) and out
If a \uop{} has not yet been retired (that is, issued and executed), it cannot be
replaced in the ROB by any freshly decoded instruction. In other words, every
non-retired decoded \uop{} --~also called \emph{in-flight}~-- remains in the
non-retired decoded \uop{} ---~also called \emph{in-flight}~--- remains in the
reorder buffer. This is possible thanks to the notion of \emph{full reorder
buffer}:

View file

@@ -54,7 +54,7 @@ loop:
\end{lstlisting}
\end{minipage}\\
a read-after-write dependency from line 4 to line 2 is reported by \depsim{}
--~although there is no such dependency inherent to the kernel.
---~although there is no such dependency inherent to the kernel.
However, each iteration of the
\texttt{measure} (outer) loop and each iteration of the \texttt{repeat} (inner)
@@ -64,8 +64,8 @@ write it back. This creates a dependency to the previous iteration of the inner
loop, which should in practice be meaningless if \lstc{BENCHMARK_SIZE} is large
enough. Such dependencies, however, pollute the evaluation results: as \depsim{}
does not report a dependency's distance, they are considered meaningful; and as
they cannot be detected by \staticdeps{} --~which is unaware of the outer and
inner loop~--, they introduce unfairness in the evaluation. The actual loss of
they cannot be detected by \staticdeps{} ---~which is unaware of the outer and
inner loop~---, they introduce unfairness in the evaluation. The actual loss of
precision introduced by not discovering such dependencies is instead assessed
later by enriching \uica{} with \staticdeps{}.
@@ -99,14 +99,14 @@ source and destination program counters are not in the same basic block are
discarded, as \staticdeps{} cannot detect them by construction.
For each of the considered basic blocks, we run our static analysis,
\staticdeps{}. We discard the $\Delta{}k$ parameter --~how many loop iterations
the dependency spans~--, as our dynamic analysis does not report an equivalent
\staticdeps{}. We discard the $\Delta{}k$ parameter ---~how many loop iterations
the dependency spans~---, as our dynamic analysis does not report an equivalent
parameter, but only a pair of program counters.
Dynamic dependencies from \depsim{} are converted to
\emph{periodic dependencies} in the sense of \staticdeps{} as described in
\autoref{ssec:staticdeps:practical_implem}: only dependencies occurring in at
least 80\% of the block's iterations are kept --~otherwise, they are
least 80\% of the block's iterations are kept ---~otherwise, they are
considered measurement artifacts. The \emph{periodic coverage}
of \staticdeps{} dependencies for this basic block \wrt{} \depsim{} is the
proportion of dependencies found by \staticdeps{} among the periodic
@@ -117,7 +117,7 @@ dependencies extracted from \depsim{}:
\smallskip{}
We also keep the raw dependencies from \depsim{} --~that is, without converting
We also keep the raw dependencies from \depsim{} ---~that is, without converting
them to periodic dependencies. From these, we consider two metrics:
the unweighted dependency coverage, \[
\cov_u = \dfrac{\card{\text{found}}}{\card{\text{found}} + \card{\text{missed}}}
@@ -158,20 +158,20 @@ very close to 100\,\%, giving us good confidence on the accuracy of
The same methodology can be re-used as a proxy for estimating the rate of
aliasing among supposedly independent pointers in our dataset. Indeed, a major approximation
made by \staticdeps{} is to assume that any newly encountered pointer --~function
parameters, values read from memory, \ldots~-- does \emph{not} alias with
made by \staticdeps{} is to assume that any newly encountered pointer ---~function
parameters, values read from memory, \ldots~--- does \emph{not} alias with
previously encountered values. This is implemented by using a fresh
random value for each yet-unknown value.
Determining which pointers may point to which other pointers --~and, by
extension, may point to the same memory region~-- is called a \emph{points-to
Determining which pointers may point to which other pointers ---~and, by
extension, may point to the same memory region~--- is called a \emph{points-to
analysis}~\cite{points_to}. In the context of \staticdeps{}, it characterizes
the pointers for which taking a fresh value was \emph{not} representative of
reality.
If we detect, through dynamic analysis, that a value derived from a
pointer \lstc{a} shares a value with one derived from a pointer \lstc{b}
--~say, \lstc{a + k == b + l}~--, we can deduce that \lstc{a} \emph{points-to}
---~say, \lstc{a + k == b + l}~---, we can deduce that \lstc{a} \emph{points-to}
\lstc{b}. This is true even if \lstc{a + k} at the very beginning of the
execution is equal to \lstc{b + l} at the very end of the execution: although
the pointers will not alias (that is, share the same value at the same moment),
@@ -215,7 +215,7 @@ The results of this analysis are presented in
\autoref{table:cov_staticdeps_pointsto}. The very high coverage rate gives us
good confidence that our hypothesis of independent pointers is reasonable, at
least within the scope of Polybench, which we believe to be representative of
scientific computation --~one of the prominent use-cases of tools such as code
scientific computation ---~one of the prominent use-cases of tools such as code
analyzers.
\subsection{Enriching \uica{}'s model}
@@ -329,7 +329,7 @@ constituting basic blocks.
\centering
\includegraphics[width=\linewidth]{staticdeps_time_boxplot.svg}
\captionof{figure}{Statistical distribution of \staticdeps{} and \depsim{} run times
on \cesasme{}'s kernels --~log y scale}\label{fig:staticdeps_cesasme_runtime_boxplot}
on \cesasme{}'s kernels ---~log y scale}\label{fig:staticdeps_cesasme_runtime_boxplot}
\end{minipage}\hfill\begin{minipage}{0.48\linewidth}
\centering
\includegraphics[width=\linewidth]{staticdeps_speedup_boxplot.svg}
@@ -344,14 +344,14 @@ constituting basic blocks.
\toprule
\textbf{Sequence} & \textbf{Average} & \textbf{Median} & \textbf{Q1} & \textbf{Q3} \\
\midrule
Seq.\ (\ref{messeq:depsim}) --~\depsim{}
Seq.\ (\ref{messeq:depsim}) ---~\depsim{}
& 18083 ms & 17645 ms & 17080 ms & 18650 ms \\
Seq.\ (\ref{messeq:staticdeps_sum}) --~\staticdeps{} (sum)
Seq.\ (\ref{messeq:staticdeps_sum}) ---~\staticdeps{} (sum)
& 2307 ms & 677 ms & 557 ms & 2700 ms \\
Seq.\ (\ref{messeq:staticdeps_one}) --~\staticdeps{} (single)
Seq.\ (\ref{messeq:staticdeps_one}) ---~\staticdeps{} (single)
& 529 ms & 545 ms & 425 ms & 588 ms \\
\midrule
Seq.\ (\ref{messeq:staticdeps_speedup}) --~speedup
Seq.\ (\ref{messeq:staticdeps_speedup}) ---~speedup
& $\times$36.1 & $\times$33.5 & $\times$30.1 &
$\times$41.7 \\
\bottomrule