diff --git a/manuscrit/00_opening/10_abstract.tex b/manuscrit/00_opening/10_abstract.tex index d64b031..dd49d1c 100644 --- a/manuscrit/00_opening/10_abstract.tex +++ b/manuscrit/00_opening/10_abstract.tex @@ -1,8 +1,8 @@ \selectlanguage{french} \begin{abstract} Qu'il s'agisse de calculs massifs distribués sur plusieurs baies, de - calculs en environnement contraint --~comme de l'embarqué ou de - l'\emph{edge computing}~-- ou encore de tentatives de réduire l'empreinte + calculs en environnement contraint ---~comme de l'embarqué ou de + l'\emph{edge computing}~--- ou encore de tentatives de réduire l'empreinte écologique d'un programme fréquemment utilisé, de nombreux cas d'usage justifient l'optimisation poussée d'un programme. Celle-ci s'arrête souvent à l'optimisation de haut niveau (algorithmique, parallélisme, \ldots), mais @@ -34,11 +34,11 @@ \selectlanguage{english} \begin{abstract} Be it massively distributed computation over multiple server racks, - constrained computation --~such as in embedded environments or in - \emph{edge computing}~--, or still an attempt to reduce the ecological + constrained computation ---~such as in embedded environments or in + \emph{edge computing}~---, or even an attempt to reduce the ecological footprint of a frequently-run program, many use-cases make it relevant to deeply optimize a program. This optimisation is often limited to high-level - optimisation --~choice of algorithms, parallel computing, \ldots{} Yet, it + optimisation ---~choice of algorithms, parallel computing, \ldots~---. Yet, it is possible to carry it further to low-level optimisations, by inspecting the generated assembly with respect to the microarchitecture of the specific microprocessor used to fine-tune it. diff --git a/manuscrit/20_foundations/10_cpu_arch.tex b/manuscrit/20_foundations/10_cpu_arch.tex index 89e5e2a..34976c7 100644 --- a/manuscrit/20_foundations/10_cpu_arch.tex +++ b/manuscrit/20_foundations/10_cpu_arch.tex @@ -101,7 +101,7 @@ processor.
The CPU frontend constantly fetches a flow of instruction bytes. This flow must first be broken down into a sequence of instructions. While on some ISAs, each -instruction is made of a constant amount of bytes ---~\eg{} ARM~--, this is not +instruction is made of a constant number of bytes ---~\eg{} ARM~---, this is not always the case: for instance, x86-64 instructions can be as short as one byte, while the ISA only limits an instruction to 15 bytes~\cite{ref:intel64_software_dev_reference_vol1}. This task is performed by diff --git a/manuscrit/20_foundations/20_code_analyzers.tex b/manuscrit/20_foundations/20_code_analyzers.tex index 97eb5cd..d86eb8c 100644 --- a/manuscrit/20_foundations/20_code_analyzers.tex +++ b/manuscrit/20_foundations/20_code_analyzers.tex @@ -167,8 +167,8 @@ We have now covered enough of the theoretical background to introduce code analyzers in a concrete way, through examples of their usage. For this purpose, we use \llvmmca{}, one of the state-of-the-art code analyzers. -Due to its relative simplicity --~at least compared to \eg{} Intel's x86-64 -implementations~--, we will base the following examples on ARM's Cortex A72, +Due to its relative simplicity ---~at least compared to \eg{} Intel's x86-64 +implementations~---, we will base the following examples on ARM's Cortex A72, which we introduce in depth later in \autoref{chap:frontend}. No specific knowledge of this microarchitecture is required to understand the following examples; for our purposes, it suffices to say that: @@ -192,7 +192,7 @@ on a single load operation: \lstarmasm{ldr x1, [x2]}. \lstinputlisting[language=bash]{assets/src/20_foundations/llvm_mca_examples/01_ldr.out} The first rows (2-10) are high-level metrics. \llvmmca{} works by simulating -the execution of the kernel --~here, 100 times, as seen row 2~--. This simple +the execution of the kernel ---~here, 100 times, as seen in row 2~---.
This simple kernel contains only one instruction, which breaks down into a single \uop{}. Iterating it takes 106 cycles instead of the expected 100 cycles, as this execution is \emph{not} in steady-state, but accounts for the cycles from the @@ -224,9 +224,9 @@ takes up all load resources available. \lstinputlisting[language=bash]{assets/src/20_foundations/llvm_mca_examples/02_ldr_timeline.out} which indicates, for each instruction, the timeline of its execution. Here, -\texttt{D} stands for decode, \texttt{e} for being executed --~in the -pipeline~--, \texttt{E} for last cycle of its execution --~leaving the -pipeline~--, \texttt{R} for retiring. When an instruction is decoded and +\texttt{D} stands for decode, \texttt{e} for being executed ---~in the +pipeline~---, \texttt{E} for the last cycle of its execution ---~leaving the +pipeline~---, \texttt{R} for retiring. When an instruction is decoded and waiting to be dispatched to execution, an \texttt{=} is shown. The identifier at the beginning of each row indicates the kernel iteration diff --git a/manuscrit/20_foundations/30_sota.tex b/manuscrit/20_foundations/30_sota.tex index 115a42b..9406def 100644 --- a/manuscrit/20_foundations/30_sota.tex +++ b/manuscrit/20_foundations/30_sota.tex @@ -17,8 +17,8 @@ Reference Manual}~\cite{ref:intel64_architectures_optim_reference_vol1}, regularly updated, whose nearly 1,000 pages give relevant details on Intel's microarchitectures, such as block diagrams, pipelines, ports available, etc. It further gives data tables with throughput and latencies for some instructions. -While the manual provides a huge collection of important insights --~from the -optimisation perspective~-- on their microarchitectures, it lacks exhaustive +While the manual provides a huge collection of important insights ---~from the +optimisation perspective~--- on their microarchitectures, it lacks exhaustive and (conveniently) machine-parsable data tables and does not detail the port usage of each instruction.
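The cycle accounting in the first \llvmmca{} example (106 cycles for 100 iterations of a one-instruction kernel) can be illustrated with a toy model. This is only a sketch, not \llvmmca{}'s actual implementation: the 6-cycle pipeline fill-up constant below is an assumption chosen to match that example, whereas \llvmmca{} derives it from its pipeline simulation.

```python
# Toy model of a code analyzer's cycle prediction for a simple kernel.
# ASSUMPTION: a 6-cycle pipeline fill-up, picked to match the
# 106-cycles-for-100-iterations example above.
PIPELINE_FILL_CYCLES = 6

def predicted_cycles(iterations: int, rthroughput: float) -> float:
    """Cycles to run `iterations` instances of a kernel whose bottleneck
    resource sustains one kernel instance every `rthroughput` cycles."""
    return iterations * rthroughput + PIPELINE_FILL_CYCLES

# One `ldr x1, [x2]` per iteration, reciprocal throughput 1:
print(predicted_cycles(100, 1.0))  # 106.0
```

In steady state the fill-up term is amortized, which is why the reported throughput converges to one kernel instance per cycle as the iteration count grows.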
@@ -30,7 +30,7 @@ AMD, since 2020, releases lengthy and complete optimisation manuals for its microarchitecture. For instance, the Zen4 optimisation manual~\cite{ref:amd_zen4_optim_manual} contains both detailed insights on the processor's workflow and ports, and a spreadsheet of about 3,400 x86 -instructions --~with operands variants broken down~-- and their port usage, +instructions ---~with operand variants broken down~--- and their port usage, throughput and latencies. Such an effort, which certainly translates to a non-negligible financial cost to the company, showcases the importance of, and recent expectations on, such documents. @@ -89,7 +89,7 @@ existence. Going further than data extraction at the individual instruction level, academics and industry professionals interested in this domain now mostly work on code analyzers, as described in \autoref{ssec:code_analyzers} above. Each such -tool embeds a model --~or collection of models~-- on which its inference is +tool embeds a model ---~or collection of models~--- on which its inference is based, and whose definition, embedded data and acquisition method vary from tool to tool. These tools often use, to some extent, the data on individual instructions obtained either from the manufacturer or the third-party efforts @@ -106,16 +106,16 @@ microarchitectures. Yet, being closed-source and relying on data that is partially unavailable to the public, the model is not totally satisfactory to academics or engineers trying to understand specific performance results. This also makes it vulnerable to deprecation, as the community is unable to -\textit{fork} the project --~and indeed, \iaca{} has been discontinued by Intel +\textit{fork} the project ---~and indeed, \iaca{} was discontinued by Intel in 2019. Thus, \iaca{} does not support recent microarchitectures, and its binary was recently removed from official download pages.
\medskip{} -In the meantime, the LLVM Machine Code Analyzer --~or \llvmmca{}~-- was +In the meantime, the LLVM Machine Code Analyzer ---~or \llvmmca{}~--- was developed as an internal tool at Sony, and was proposed for inclusion in \llvm{} in 2018~\cite{llvm_mca_rfc}. This code analyzer is based on the -data tables that \llvm{} --~a compiler~-- has to maintain for each +data tables that \llvm{} ---~a compiler~--- has to maintain for each microarchitecture in order to produce optimized code. The project has since evolved to be fairly accurate, as seen in the experiments later presented in this manuscript. It is the alternative Intel offers to \iaca{} subsequent @@ -125,7 +125,7 @@ to its deprecation. Another model, \osaca{}, was developed by Jan Laukemann \textit{et al.} starting in 2017~\cite{osaca1,osaca2}. Its development stemmed from the lack -(at the time) of an open-source --~and thus, open-model~-- alternative to IACA. +(at the time) of an open-source ---~and thus, open-model~--- alternative to \iaca{}. As a data source, \osaca{} makes use of Agner Fog's data tables or \uopsinfo{}. It still lacks, however, a good model of the frontend and of data dependencies, making it less performant than other code analyzers in our experiments later in this @@ -145,7 +145,7 @@ to predict its reverse throughput. Doing so, even with perfect accuracy, does not explain the source of a performance problem: the model is unable to help detect which resource is the performance bottleneck of a kernel; in other words, it quantifies a potential issue, but does not help in \emph{explaining} -it --~or debugging it. +it ---~or debugging it. \medskip{} @@ -168,7 +168,7 @@ Abel and Reineke, the authors of \uopsinfo{}, recently released reverse-engineering through the use of hardware counters to model the frontend and pipelines. We found this tool to be very accurate (see experiments later in this manuscript), with results comparable with \llvmmca{}.
Its source code ---~under free software license~-- is self-contained and reasonably concise +---~under free software license~--- is self-contained and reasonably concise (about 2,000 lines of Python for the main part), making it a good basis and baseline for experiments. It is, however, closely tied by design to Intel microarchitectures, or microarchitectures very close to Intel's. diff --git a/manuscrit/30_palmed/00_intro.tex b/manuscrit/30_palmed/00_intro.tex index 67c7479..65202b6 100644 --- a/manuscrit/30_palmed/00_intro.tex +++ b/manuscrit/30_palmed/00_intro.tex @@ -3,7 +3,7 @@ of good microarchitectural analysis and predictions in many aspects. One thing we found lacking, however, was a generic method to obtain a model for a given microarchitecture. Indeed, while \eg{} \iaca{} and \uopsinfo{} are performant and quite exhaustive models of Intel's x86-64 implementations, they -are restricted to Intel CPUs --~and few others for \uopsinfo{}. These models +are restricted to Intel CPUs ---~and few others for \uopsinfo{}. These models were, at least up to a point, handcrafted. While \iaca{} is based on insider knowledge from Intel (and thus would not work for \eg{} AMD), \uopsinfo{}'s method is based on specific hardware counters and handpicked instructions with diff --git a/manuscrit/30_palmed/30_pipedream.tex b/manuscrit/30_palmed/30_pipedream.tex index 035d214..ce5420a 100644 --- a/manuscrit/30_palmed/30_pipedream.tex +++ b/manuscrit/30_palmed/30_pipedream.tex @@ -39,11 +39,11 @@ that $\texttt{\footnotesize{}UNROLL\_SIZE} \times parameters of the benchmark generation.
\pipedream{} must be able to distinguish between variants of instructions with -the same mnemonic --~\eg{} \lstxasm{mov}~-- but different operand kinds, -altering the semantics and performance of the instruction --~such as a +the same mnemonic ---~\eg{} \lstxasm{mov}~--- but different operand kinds, +altering the semantics and performance of the instruction ---~such as a \lstxasm{mov} loading from memory versus a \lstxasm{mov} between registers. To this end, \pipedream{} represents instructions fully qualified with their -operands' kind --~this can be seen as a process akin to C++'s name mangling. +operands' kind ---~this can be seen as a process akin to C++'s name mangling. As \pipedream{} gets a multiset of instructions as a kernel, these instructions' arguments must be instantiated to turn them into actual assembly diff --git a/manuscrit/30_palmed/40-1_results_fig.tex b/manuscrit/30_palmed/40-1_results_fig.tex index e8fd610..191d5b3 100644 --- a/manuscrit/30_palmed/40-1_results_fig.tex +++ b/manuscrit/30_palmed/40-1_results_fig.tex @@ -136,7 +136,7 @@ \figthreerowlegend{Polybench}{polybench-W-zen} \end{figleftlabel} - \caption{IPC prediction profile heatmaps~--~predictions closer to the + \caption{IPC prediction profile heatmaps~---~predictions closer to the red line are more accurate. Predicted IPC ratio (Y) against native IPC (X)} \label{fig:palmed_heatmaps} diff --git a/manuscrit/40_A72-frontend/50_future_works.tex b/manuscrit/40_A72-frontend/50_future_works.tex index 918f852..a28753a 100644 --- a/manuscrit/40_A72-frontend/50_future_works.tex +++ b/manuscrit/40_A72-frontend/50_future_works.tex @@ -64,13 +64,13 @@ each instruction. Its parameters are: The first step in modeling a processor's frontend should certainly be to characterize the number of \uops{} that can be dispatched in a cycle. 
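This fully-qualified naming can be sketched as follows. The \texttt{qualify} helper and the operand-kind names are hypothetical, meant to illustrate the idea rather than \pipedream{}'s actual code.

```python
# Hypothetical sketch of qualifying an instruction by its operand
# kinds, akin to C++ name mangling; pipedream's real naming scheme
# may differ.
def qualify(mnemonic: str, operand_kinds: list[str]) -> str:
    """Build a fully-qualified name from a mnemonic and operand kinds."""
    return mnemonic + "_" + "_".join(operand_kinds)

# A mov loading from memory and a register-to-register mov share a
# mnemonic but get distinct qualified names:
print(qualify("mov", ["r64", "m64"]))  # mov_r64_m64
print(qualify("mov", ["r64", "r64"]))  # mov_r64_r64
```

Distinct qualified names let the benchmark generator treat each variant as a separate instruction, since the variants differ in semantics and performance.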
We assume -that a model of the backend is known --~by taking for instance a model +that a model of the backend is known ---~for instance, by taking a model generated by \palmed{}, using tables from \uopsinfo{} or any other means. To the best of our knowledge, we can safely further assume that instructions that load a single backend port only once are also composed of a single \uop{}. Generating a few combinations of a variety of those and measuring their -effective throughput --~making sure using the backend model that the latter is -not the bottleneck~-- and keeping the maximal throughput reached should provide +effective throughput ---~using the backend model to make sure that the latter is +not the bottleneck~--- and keeping the maximal throughput reached should provide a good value. \medskip{} @@ -88,7 +88,7 @@ The core of the model presented in this chapter is the discovery, for each instruction, of its \uop{} count. Still assuming the knowledge of a backend model, the method described in \autoref{ssec:a72_insn_muop_count} should be generic enough to be used on any processor. The basic instructions may be -easily selected using the backend model --~we assume their existence in most +easily selected using the backend model ---~we assume their existence in most microarchitectures, as pragmatic concerns guide the port design. Counting the \uops{} of an instruction thus follows, using only elapsed-cycles counters. @@ -141,7 +141,7 @@ be investigated if the model does not reach the expected accuracy. \uops{} are repeatedly streamed from the decode queue, without even the necessity to hit a cache. We are unaware of similar features in other commercial processors. In embedded programming, however, \emph{hardware - loops} --~which are set up explicitly by the programmer~-- achieve, + loops} ---~which are set up explicitly by the programmer~--- achieve, among other things, the same goal~\cite{hardware_loops_patent,kavvadias2007hardware,talla2001hwloops}.
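The dispatch-width characterization sketched above (keep the maximal effective throughput over benchmarks of single-\uop{} instructions that the backend model guarantees are not backend-bound) boils down to a maximum over measurements. A minimal sketch, with hypothetical measured values:

```python
# Sketch of the dispatch-width estimation; the measured throughputs
# below are invented values for illustration, a real setup would run
# the generated benchmarks and read elapsed-cycles counters.
def estimate_dispatch_width(measured_uops_per_cycle: list[float]) -> float:
    """Given effective throughputs (uops/cycle) of benchmarks known not
    to be backend-bound, the frontend dispatch width is the maximal
    throughput reached."""
    return max(measured_uops_per_cycle)

print(estimate_dispatch_width([2.9, 3.0, 2.5]))  # 3.0
```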
@@ -157,8 +157,8 @@ be investigated if the model does not reach the expected accuracy. \item{} In reality, there is an intermediate step between instructions and \uops{}: macro-ops. Although it serves a design and semantic - purpose, we omit this step in the current model as --~we - believe~-- it is of little importance to predict performance. + purpose, we omit this step in the current model as ---~we + believe~--- it is of little importance for predicting performance. \item{} On x86 architectures at least, common pairs of micro- or macro-operations may be ``fused'' into a single one, up to various diff --git a/manuscrit/50_CesASMe/30_future_works.tex b/manuscrit/50_CesASMe/30_future_works.tex index a3dc9e6..059ee46 100644 --- a/manuscrit/50_CesASMe/30_future_works.tex +++ b/manuscrit/50_CesASMe/30_future_works.tex @@ -29,7 +29,7 @@ We were also able to show in Section~\ref{ssec:memlatbound} that state-of-the-art static analyzers struggle to account for memory-carried dependencies; a weakness significantly impacting their overall results on our benchmarks. We believe that detecting -and accounting for these dependencies is an important topic --~which we will +and accounting for these dependencies is an important topic ---~which we will tackle in the following chapter. Moreover, we present this work in the form of a modular software package, each diff --git a/manuscrit/60_staticdeps/35_rob_proof.tex b/manuscrit/60_staticdeps/35_rob_proof.tex index 8ad85a2..cc6a01d 100644 --- a/manuscrit/60_staticdeps/35_rob_proof.tex +++ b/manuscrit/60_staticdeps/35_rob_proof.tex @@ -43,7 +43,7 @@ of a CPU and, in particular, how \uops{} transit in (decoded) and out If a \uop{} has not been retired yet (even if already issued and executed), it cannot be replaced in the ROB by any freshly decoded instruction.
In other words, every -non-retired decoded \uop{} --~also called \emph{in-flight}~-- remains in the +non-retired decoded \uop{} ---~also called \emph{in-flight}~--- remains in the reorder buffer. This is possible thanks to the notion of \emph{full reorder buffer}: diff --git a/manuscrit/60_staticdeps/50_eval.tex b/manuscrit/60_staticdeps/50_eval.tex index daddef9..e79c6f0 100644 --- a/manuscrit/60_staticdeps/50_eval.tex +++ b/manuscrit/60_staticdeps/50_eval.tex @@ -54,7 +54,7 @@ loop: \end{lstlisting} \end{minipage}\\ a read-after-write dependency from line 4 to line 2 is reported by \depsim{} ---~although there is no such dependency inherent to the kernel. +---~although there is no such dependency inherent to the kernel. However, each iteration of the \texttt{measure} (outer) loop and each iteration of the \texttt{repeat} (inner) @@ -64,8 +64,8 @@ write it back. This creates a dependency on the previous iteration of the inner loop, which should in practice be meaningless if \lstc{BENCHMARK_SIZE} is large enough. Such dependencies, however, pollute the evaluation results: as \depsim{} does not report a dependency's distance, they are considered meaningful; and as -they cannot be detected by \staticdeps{} --~which is unaware of the outer and -inner loop~--, they introduce unfairness in the evaluation. The actual loss of +they cannot be detected by \staticdeps{} ---~which is unaware of the outer and +inner loop~---, they introduce unfairness in the evaluation. The actual loss of precision introduced by not discovering such dependencies is instead assessed later by enriching \uica{} with \staticdeps{}. @@ -99,14 +99,14 @@ source and destination program counters are not in the same basic block are discarded, as \staticdeps{} cannot detect them by construction. For each of the considered basic blocks, we run our static analysis, -\staticdeps{}.
We discard the $\Delta{}k$ parameter --~how many loop iterations -the dependency spans~--, as our dynamic analysis does not report an equivalent +\staticdeps{}. We discard the $\Delta{}k$ parameter ---~how many loop iterations +the dependency spans~---, as our dynamic analysis does not report an equivalent parameter, but only a pair of program counters. Dynamic dependencies from \depsim{} are converted to \emph{periodic dependencies} in the sense of \staticdeps{} as described in \autoref{ssec:staticdeps:practical_implem}: only dependencies occurring on at -least 80\% of the block's iterations are kept --~else, dependencies are +least 80\% of the block's iterations are kept ---~otherwise, dependencies are considered measurement artifacts. The \emph{periodic coverage} of \staticdeps{} dependencies for this basic block \wrt{} \depsim{} is the proportion of dependencies found by \staticdeps{} among the periodic @@ -117,7 +117,7 @@ dependencies extracted from \depsim{}: \smallskip{} -We also keep the raw dependencies from \depsim{} --~that is, without converting +We also keep the raw dependencies from \depsim{} ---~that is, without converting them to periodic dependencies. From these, we consider two metrics: the unweighted dependency coverage, \[ \cov_u = \dfrac{\card{\text{found}}}{\card{\text{found}} + \card{\text{missed}}} \] @@ -158,20 +158,20 @@ very close to 100\,\%, giving us good confidence in the accuracy of The same methodology can be re-used as a proxy for estimating the rate of aliasing among supposedly independent pointers in our dataset. Indeed, a major approximation -made by \staticdeps{} is to assume that any new encountered pointer --~function -parameters, value read from memory, \ldots~-- does \emph{not} alias with +made by \staticdeps{} is to assume that any newly encountered pointer ---~function +parameters, value read from memory, \ldots~--- does \emph{not} alias with previously encountered values.
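The two evaluation ingredients above, the 80\,\% periodicity threshold and the unweighted coverage $\cov_u$, can be sketched as follows; the helper names are ours, not part of \staticdeps{}'s actual code.

```python
# Sketch of the evaluation metrics described above (illustrative only).

def is_periodic(occurrences: int, iterations: int,
                threshold: float = 0.8) -> bool:
    """A dynamic dependency is kept as 'periodic' if it occurs on at
    least `threshold` of the basic block's iterations; otherwise it is
    considered a measurement artifact."""
    return occurrences >= threshold * iterations

def unweighted_coverage(found: int, missed: int) -> float:
    """cov_u = |found| / (|found| + |missed|)."""
    return found / (found + missed)

print(unweighted_coverage(9, 1))  # 0.9
print(is_periodic(85, 100))       # True
print(is_periodic(40, 100))       # False
```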
This is implemented by using a fresh random value for each yet-unknown value. -Determining which pointers may point to which other pointers --~and, by -extension, may point to the same memory region~-- is called a \emph{points-to +Determining which pointers may point to which other pointers ---~and, by +extension, may point to the same memory region~--- is called a \emph{points-to analysis}~\cite{points_to}. In the context of \staticdeps{}, it characterizes the pointers for which taking a fresh value was \emph{not} representative of the reality. If we detect, through dynamic analysis, that a value derived from a pointer \lstc{a} shares a value with one derived from a pointer \lstc{b} ---~say, \lstc{a + k == b + l}~--, we can deduce that \lstc{a} \emph{points-to} +---~say, \lstc{a + k == b + l}~---, we can deduce that \lstc{a} \emph{points-to} \lstc{b}. This is true even if \lstc{a + k} at the very beginning of the execution is equal to \lstc{b + l} at the very end of the execution: although the pointers will not alias (that is, share the same value at the same moment), @@ -215,7 +215,7 @@ The results of this analysis are presented in \autoref{table:cov_staticdeps_pointsto}. The very high coverage rate gives us good confidence that our hypothesis of independent pointers is reasonable, at least within the scope of Polybench, which we believe to be representative of -scientific computation --~one of the prominent use-cases of tools such as code +scientific computation ---~one of the prominent use-cases of tools such as code analyzers. \subsection{Enriching \uica{}'s model} @@ -329,7 +329,7 @@ constituting basic blocks.
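The detection step can be sketched as a disjointness test on the sets of values observed for each pointer over the whole execution; this is an illustrative reconstruction, not \depsim{}'s implementation, and the addresses are made-up values.

```python
# Illustrative points-to detection: if any value derived from pointer
# a equals any value derived from pointer b at any point of the
# execution (a + k == b + l for some observed offsets k, l), a is
# deduced to point-to b. Comparing whole sets of observed values also
# catches equalities across different moments of the execution.
def may_point_to(values_from_a: set[int], values_from_b: set[int]) -> bool:
    return not values_from_a.isdisjoint(values_from_b)

print(may_point_to({0x1000, 0x1008}, {0x1008, 0x2000}))  # True
print(may_point_to({0x1000}, {0x2000}))                  # False
```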
\centering \includegraphics[width=\linewidth]{staticdeps_time_boxplot.svg} \captionof{figure}{Statistical distribution of \staticdeps{} and \depsim{} run times - on \cesasme{}'s kernels --~log y scale}\label{fig:staticdeps_cesasme_runtime_boxplot} + on \cesasme{}'s kernels ---~log y scale}\label{fig:staticdeps_cesasme_runtime_boxplot} \end{minipage}\hfill\begin{minipage}{0.48\linewidth} \centering \includegraphics[width=\linewidth]{staticdeps_speedup_boxplot.svg} @@ -344,14 +344,14 @@ constituting basic blocks. \toprule \textbf{Sequence} & \textbf{Average} & \textbf{Median} & \textbf{Q1} & \textbf{Q3} \\ \midrule - Seq.\ (\ref{messeq:depsim}) --~\depsim{} + Seq.\ (\ref{messeq:depsim}) ---~\depsim{} & 18083 ms & 17645 ms & 17080 ms & 18650 ms \\ - Seq.\ (\ref{messeq:staticdeps_sum}) --~\staticdeps{} (sum) + Seq.\ (\ref{messeq:staticdeps_sum}) ---~\staticdeps{} (sum) & 2307 ms & 677 ms & 557 ms & 2700 ms \\ - Seq.\ (\ref{messeq:staticdeps_one}) --~\staticdeps{} (single) + Seq.\ (\ref{messeq:staticdeps_one}) ---~\staticdeps{} (single) & 529 ms & 545 ms & 425 ms & 588 ms \\ \midrule - Seq.\ (\ref{messeq:staticdeps_speedup}) --~speedup + Seq.\ (\ref{messeq:staticdeps_speedup}) ---~speedup & $\times$36.1 & $\times$33.5 & $\times$30.1 & $\times$41.7 \\ \bottomrule