diff --git a/manuscrit/00_opening/10_abstract.tex b/manuscrit/00_opening/10_abstract.tex
index f70a35e..d64b031 100644
--- a/manuscrit/00_opening/10_abstract.tex
+++ b/manuscrit/00_opening/10_abstract.tex
@@ -1,6 +1,6 @@
 \selectlanguage{french}
 \begin{abstract}
-    Qu'il s'agisse de calculs massifs distribués sur plusieurs racks, de
+    Qu'il s'agisse de calculs massifs distribués sur plusieurs baies, de
     calculs en environnement contraint --~comme de l'embarqué ou de
     l'\emph{edge computing}~-- ou encore de tentatives de réduire
     l'empreinte écologique d'un programme fréquemment utilisé, de nombreux cas d'usage
@@ -43,7 +43,7 @@
 the generated assembly with respect to the microarchitecture of the
 specific microprocessor used to fine-tune it.
-    Such an optimisation level requires a very detailed comprehension of both
+    Such an optimisation level requires a very detailed understanding of both
 the software and hardware aspects involved, and is most often the realm of
 experts. \emph{Code analyzers}, however, are tools that help lower the
 expertise threshold required to perform such optimisations by automating
diff --git a/manuscrit/10_introduction/main.tex b/manuscrit/10_introduction/main.tex
index 597c4d7..c58f62d 100644
--- a/manuscrit/10_introduction/main.tex
+++ b/manuscrit/10_introduction/main.tex
@@ -100,7 +100,7 @@ slow the whole computation.
 \vspace{2em}
 
 In this thesis, we explore the three major aspects that work towards a code
-analyzers' accuracy: a \emph{backend model}, a \emph{frontend model} and a
+analyzer's accuracy: a \emph{backend model}, a \emph{frontend model} and a
 \emph{dependencies model}. We propose contributions to strengthen them, as
 well as to automate the underlying models' synthesis. We focus on \emph{static}
 code analyzers, which derive metrics, including runtime predictions, from an
@@ -124,7 +124,7 @@ tool, akin to \palmed.
 
 Chapter~\ref{chap:CesASMe} makes an extensive study of the state-of-the-art
 code analyzers' strengths and shortcomings. To this end, we introduce a
-fully-tooled approach in two parts: first, a benchmarks-generation procedure,
+fully-tooled approach in two parts: first, a benchmark-generation procedure,
 yielding thousands of benchmarks relevant in the context of our approach; then,
 a benchmarking harness evaluating code analyzers on these benchmarks. We find
 that most state-of-the-art code analyzers struggle to correctly account for
@@ -154,11 +154,12 @@ we see this commitment as an opportunity to develop methodologies able to model
 these processors.
 
 This is particularly true of \palmed, in \autoref{chap:palmed}, whose goal is
-to model a processor's backend resources without resorting to its hardware
-counters. Our frontend study, in \autoref{chap:frontend}, also follows this
-strategy by focusing on a processor whose hardware counters give little to no
-insight on its frontend. While this goal is less relevant to \staticdeps{}, we
-rely on external libraries to abstract the underlying architecture.
+to model a processor's backend resources without resorting to its
+vendor-specific hardware counters. Our frontend study, in
+\autoref{chap:frontend}, also follows this strategy by focusing on a processor
+whose hardware counters give little to no insight on its frontend. While this
+goal is less relevant to \staticdeps{}, we rely on external libraries to
+abstract the underlying architecture.
\medskip{}
diff --git a/manuscrit/20_foundations/10_cpu_arch.tex b/manuscrit/20_foundations/10_cpu_arch.tex
index c4e0194..89e5e2a 100644
--- a/manuscrit/20_foundations/10_cpu_arch.tex
+++ b/manuscrit/20_foundations/10_cpu_arch.tex
@@ -154,7 +154,7 @@ port for both memory loads and stores.
 
 In most cases, execution units are \emph{fully pipelined}, meaning that while
 processing a single \uop{} takes multiple cycles, the unit is able to start
-processing a new \uop{} every cycle: multiple \uops{} are then being processed,
+processing a new \uop{} every cycle: multiple \uops{} are thus being processed,
 at different stages, during each cycle, akin to a factory's assembly line.
 
 \smallskip{}
@@ -204,11 +204,12 @@ For this reason, many processors are now \emph{out-of-order}, while processors
 issuing \uops{} strictly in their original order are called \emph{in-order}.
 Out-of-order microarchitectures feature a \emph{reorder buffer}, from which
 instructions are picked to be issued. The reorder buffer acts as a sliding
-window of microarchitecturally-fixed size over \uops{}, from which the oldest
-\uop{} whose dependencies are satisfied will be executed. Thus, out-of-order
-CPUs are only able to execute operations out of order as long as the
-\uop{} to be executed is not too far ahead from the oldest \uop{} awaiting to
-be issued ---~specifically, not more than the size of the reorder buffer ahead.
+window of microarchitecturally-fixed size over decoded \uops{}, from which the
+oldest \uop{} whose dependencies are satisfied will be executed. Thus,
+out-of-order CPUs are only able to execute operations out of order as long as
+the \uop{} to be executed is not too far ahead of the oldest \uop{} waiting
+to be issued ---~specifically, not more than the size of the reorder buffer
+ahead.
 
 It is also important to note that out-of-order processors are only out-of-order
 \emph{from a certain point on}: a substantial part of the processor's frontend
@@ -238,13 +239,14 @@ word sizes.
 
 Some instructions, however, operate on chunks of multiple words at once. These
 instructions are called \emph{vector instructions}, or \emph{SIMD} for Single
 Instruction, Multiple Data. A SIMD ``add'' instruction may, for instance, add
-two chunks of 128 bits, which can for instance be treated each as four integers
+two chunks of 128 bits, each treated as four integers
 of 32 bits bundled together, as illustrated in \autoref{fig:cpu_simd}.
 
 \begin{figure}
     \centering
     \includegraphics[width=0.6\textwidth]{simd.svg}
-    \caption{Example of SIMD add instruction on 128b}\label{fig:cpu_simd}
+    \caption{Example of a SIMD $4 \times 32$-bit add instruction on
+    128 bits}\label{fig:cpu_simd}
 \end{figure}
 
 Such instructions present clear efficiency advantages. If the processor is able
diff --git a/manuscrit/20_foundations/20_code_analyzers.tex b/manuscrit/20_foundations/20_code_analyzers.tex
index 3754960..99831f1 100644
--- a/manuscrit/20_foundations/20_code_analyzers.tex
+++ b/manuscrit/20_foundations/20_code_analyzers.tex
@@ -38,7 +38,7 @@ analyze a code fragment ---~typically at assembly or binary level~---, and
 provide insights on its performance metrics on given hardware. Code analyzers
 thus work statically, that is, without executing the code.
 
-\paragraph{Common hypotheses.} Code analyzers operate under a common
+\paragraph{Common hypotheses.} Code analyzers operate under a set of common
 hypotheses, derived from the typical intended usage.
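The SIMD addition patched in the \autoref{fig:cpu_simd} hunk above can be made
concrete with a minimal C sketch using x86 SSE2 intrinsics (a hypothetical
illustration, not part of the manuscript): the intrinsic _mm_add_epi32
compiles down to a single 128-bit vector add (paddd) that treats each operand
as four bundled 32-bit integers.

    #include <emmintrin.h>  /* SSE2 intrinsics: 128-bit integer SIMD */
    #include <stdio.h>

    int main(void)
    {
        /* Two 128-bit chunks, each bundling four 32-bit integers. */
        __m128i a = _mm_set_epi32(4, 3, 2, 1);
        __m128i b = _mm_set_epi32(40, 30, 20, 10);

        /* A single SIMD "add" sums all four lanes at once. */
        __m128i sum = _mm_add_epi32(a, b);

        int out[4];
        _mm_storeu_si128((__m128i *)out, sum);
        printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);
        /* Prints: 11 22 33 44 */
        return 0;
    }

A scalar version would need four separate 32-bit additions to produce the same
result; performing the four lane-wise sums as one instruction is the
efficiency advantage discussed after the figure.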
The kernel analyzed is expected to be the body of a loop, or
@@ -101,8 +101,8 @@ than on edge cases.
 
 As most code analyzers are static, this manuscript largely focuses on static
 analysis. The only dynamic code analyzer we are aware of is \gus{}, described
-more thoroughly in \autoref{sec:sota} later, trading heavily runtime to gain in
-accuracy, especially regarding data dependencies that may not be easily
+more thoroughly in \autoref{sec:sota} later, which heavily trades run time
+for accuracy, especially regarding data dependencies that may not be easily
 obtained otherwise.
 
 \paragraph{Input formats used.} The analyzers studied in this manuscript all
@@ -111,12 +111,12 @@ take as input either assembly code, or assembled binaries.
 
 In the case of assembly code, as for instance with \llvmmca{}, analyzers
 take either a short assembly snippet, treated as straight-line code and
 analyzed as such; or longer pieces of assembly, part or parts of which are
-marked for analysis my surrounding assembly comments.
+marked for analysis by surrounding assembly comments.
 
 In the case of assembled binaries, as all analyzers were run on Linux,
 executables or object files are ELF files. Some analyzers work on sections of
 the file defined by user-provided offsets in the binary, while others require
-the presence of \iaca{} markers around the code portion or portions to be
+the presence of \textit{\iaca{} markers} around the code portion or portions to be
 analyzed. Those markers, introduced by \iaca{} as C-level preprocessor
 statements, consist of the following x86 assembly snippets:
@@ -198,11 +198,11 @@
 Iterating it takes 106 cycles instead of the expected 100 cycles, as this
 execution is \emph{not} in steady-state, but accounts for the cycles from the
 decoding of the first instruction to the retirement of the last.
-The row 7 indicates that each cycle, the frontend can issue at most 3 \uops{}.
-The next two rows are simple ratios. Row 10 is the block's \emph{reverse
-throughput}, which we formalize later in \autoref{sssec:def:rthroughput}, but
-is roughly defined as the number of cycles a single iteration of the kernel
-takes.
+Row 7 indicates that, each cycle, the frontend can issue at most 3 \uops{}. The
+next two rows are simple ratios. Row 10 is the block's \emph{reverse
+throughput}, which we will denote by $\cyc{\kerK}$ and formalize later in
+\autoref{sssec:def:rthroughput}, but is roughly defined as the number of cycles
+a single iteration of the kernel takes.
 
 The next section, \emph{instruction info}, lists data about the instructions
 present.
@@ -227,7 +227,7 @@ which indicates, for each instruction, the timeline of its execution. Here,
 \texttt{D} stands for decode, \texttt{e} for being executed --~in the
 pipeline~--, \texttt{E} for last cycle of its execution --~leaving the
 pipeline~--, \texttt{R} for retiring. When an instruction is decoded and
-waiting to be dispatched to execution, a \texttt{=} is shown.
+waiting to be dispatched to execution, an \texttt{=} is shown.
 
 The identifier at the beginning of each row indicates the kernel iteration
 number, and the instruction within.
@@ -361,7 +361,8 @@ time is measured until the last instruction is issued, not retired.
 Thus, by
 the pigeon-hole principle, there exists $p \in \nat$ such that
 $\sigma(\kerK) = \sigma(\kerK^{p+1})$. By induction, as each state depends only
 on the previous one, we thus obtain that $(\sigma(\kerK^n))_n$ is periodic of
-    period $p$.
+    period $p$. 
As we consider only the execution's steady state, the sequence
+    is periodic from rank 0.
 
 As the number of cycles needed to execute $\kerK$ only depends on the initial
 state of the processor, we thus have
@@ -375,8 +376,8 @@ time is measured until the last instruction is issued, not retired.
     and measured in \emph{cycles per iteration}, is also called the
     steady-state execution time of a kernel.
 
-    We note $p \in \nat^*$ the period of $\ckn{n+1} - \ckn{n}$ (by the above
-    lemma), and define \[
+    We denote by $p = \calP(\kerK) \in \nat^*$ the period of
+    $\ckn{n+1} - \ckn{n}$ (given by the above lemma), and define \[
         \cyc{\kerK} = \dfrac{\ckn{p}}{p}
     \]
 \end{definition}
@@ -396,8 +397,8 @@ $\cyc{\kerK} = 1.5$.
 \begin{remark}
     As $C(\kerK)$ depends on the microarchitecture of the processor considered,
-    the throughput $\cyc{\kerK}$ of a kernel $\kerK$ depends on the processor
-    considered.
+    the throughput $\cyc{\kerK}$ of a kernel $\kerK$ implicitly depends on the
+    processor considered.
 \end{remark}
 
 \medskip
@@ -440,10 +441,10 @@ $\cyc{\kerK} = 1.5$.
 \end{lemma}
 
 \begin{proof}
-    Let $n \in \nat^*$. We note $p \in \nat^*$ the periodicity by the above
-    lemma.
+    Let $n \in \nat^*$, and let $p = \calP(\kerK) \in \nat^*$ be the period
+    given by the above lemma.
 
-    Let $k, r \in \nat^*$ such that $n = kp+r$, $0 < r \leq p$.
+    Let $k \in \nat$ and $r \in \nat^*$ be such that $n = kp+r$, $1 \leq r \leq p$.
 
     \begin{align*}
         \ckn{n} &= k \cdot \ckn{p} + \ckn{r} & \textit{(by lemma)} \\
@@ -561,9 +562,9 @@ of this manuscript.
 \smallskip{}
 
 An instruction is said to be a \emph{flow-altering instruction} if this
-    address may alter the normal control flow of the program. This is typically
-    true of jumps (conditional or unconditional), function calls, function
-    returns, \ldots
+    instruction may alter the normal control flow of the program. This is
+    typically true of jumps (conditional or unconditional), function calls,
+    function returns, \ldots
 
 \smallskip{}
 
diff --git a/manuscrit/20_foundations/30_sota.tex b/manuscrit/20_foundations/30_sota.tex
index 7fa930e..115a42b 100644
--- a/manuscrit/20_foundations/30_sota.tex
+++ b/manuscrit/20_foundations/30_sota.tex
@@ -81,7 +81,7 @@ approach, but also limits it to microarchitectures offering such counters, and
 requires a manual analysis of each microarchitecture to be supported in order
 to find a fitting set of blocking instructions. Although we have no theoretical
 guarantee of the existence of such instructions, this should never be a
-problem, as all pragmatic microarchitecture design will yield to their
+problem, as any pragmatic microarchitecture design will lead to their
 existence.
 
 \subsection{Code analyzers and their models}
@@ -89,8 +89,8 @@ existence.
 Going further than data extraction at the individual instruction level,
 academia and industry interested in this domain now mostly work on code
 analyzers, as described in \autoref{ssec:code_analyzers} above. Each such
-tool embeds a model --~or collection of models~-- on which to base its
-inference, and whose definition, embedded data and obtention method varies from
+tool embeds a model --~or collection of models~-- on which its inference is
+based, and whose definition, embedded data and acquisition method vary from
 tool to tool. These tools often use, to some extent, the data on individual
 instructions obtained either from the manufacturer or the third-party efforts
 mentioned above.
@@ -142,7 +142,7 @@ its context.
 This approach, in our experiments, was significantly less accurate
 than those not based on machine learning. 
In our opinion, its main issue, however, is to be a \textit{black-box
model}: given a kernel, it is only able to predict its reverse throughput.
Doing so, even with perfect accuracy, does
-not explain the source of a performance problem: the model is unable to help in
+not explain the source of a performance problem: the model is unable to help
 detecting which resource is the performance bottleneck of a kernel; in other
 words, it quantifies a potential issue, but does not help in \emph{explaining}
 it --~or debugging it.
@@ -171,4 +171,4 @@ this manuscript), with results comparable with \llvmmca{}. Its source code
 --~under a free software license~-- is self-contained and reasonably concise
 (about 2,000 lines of Python for the main part), making it a good basis and
 baseline for experiments. It is, however, closely tied by design to Intel
-microarchitectures, or microarchitectures very alike to Intel's ones.
+microarchitectures, or microarchitectures very close to Intel's.
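As a worked complement to the reverse-throughput definition patched in the
hunks above, the LaTeX fragment below spells out a small instance in the
manuscript's notation, reusing its \cyc, \ckn, \calP and \kerK macros; the
cycle counts are hypothetical, chosen to land on the $\cyc{\kerK} = 1.5$ value
quoted in the hunk headers.

    % Hypothetical cycle counts for repeated executions of a kernel \kerK:
    %   \ckn{1} = 2, \ckn{2} = 3, \ckn{3} = 5, \ckn{4} = 6, \ldots
    % The increments \ckn{n+1} - \ckn{n} alternate 1, 2, 1, 2, \ldots, so
    % their period is p = \calP(\kerK) = 2, and the definition yields
    \[
        \cyc{\kerK} = \dfrac{\ckn{p}}{p} = \dfrac{\ckn{2}}{2} = \dfrac{3}{2}
                    = 1.5
    \]
    % consistently with the decomposition lemma: for n = 3 = 1 \cdot p + 1,
    % that is k = 1 and r = 1, we get
    % \ckn{3} = k \cdot \ckn{p} + \ckn{r} = 3 + 2 = 5.

In steady state, an iteration of \kerK thus costs 1.5 cycles on average, even
though each individual iteration completes in an integer number of cycles.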