Proof-read up to Foundations (incl)

commit 8d4887cc63 (parent c8c2b2db2a)
5 changed files with 48 additions and 44 deletions

@@ -1,6 +1,6 @@
 \selectlanguage{french}
 \begin{abstract}
-Qu'il s'agisse de calculs massifs distribués sur plusieurs racks, de
+Qu'il s'agisse de calculs massifs distribués sur plusieurs baies, de
 calculs en environnement contraint --~comme de l'embarqué ou de
 l'\emph{edge computing}~-- ou encore de tentatives de réduire l'empreinte
 écologique d'un programme fréquemment utilisé, de nombreux cas d'usage
@@ -43,7 +43,7 @@
 the generated assembly with respect to the microarchitecture of the
 specific microprocessor used to fine-tune it.
 
-Such an optimisation level requires a very detailed comprehension of both
+Such an optimisation level requires a very detailed understanding of both
 the software and hardware aspects implied, and is most often the realm of
 experts. \emph{Code analyzers}, however, are tools that help lowering the
 expertise threshold required to perform such optimisations by automating
@@ -100,7 +100,7 @@ slow the whole computation.
 \vspace{2em}
 
 In this thesis, we explore the three major aspects that work towards a code
-analyzers' accuracy: a \emph{backend model}, a \emph{frontend model} and a
+analyzer's accuracy: a \emph{backend model}, a \emph{frontend model} and a
 \emph{dependencies model}. We propose contributions to strengthen them, as well
 as to automate the underlying models' synthesis. We focus on \emph{static}
 code analyzers, that derive metrics, including runtime predictions, from an
@@ -124,7 +124,7 @@ tool, akin to \palmed.
 
 Chapter~\ref{chap:CesASMe} makes an extensive study of the state-of-the-art
 code analyzers' strengths and shortcomings. To this end, we introduce a
-fully-tooled approach in two parts: first, a benchmarks-generation procedure,
+fully-tooled approach in two parts: first, a benchmark-generation procedure,
 yielding thousands of benchmarks relevant in the context of our approach; then,
 a benchmarking harness evaluating code analyzers on these benchmarks. We find
 that most state-of-the-art code analyzers struggle to correctly account for
@@ -154,11 +154,12 @@ we see this commitment as an opportunity to develop methodologies able to model
 these processors.
 
 This is particularly true of \palmed, in \autoref{chap:palmed}, whose goal is
-to model a processor's backend resources without resorting to its hardware
-counters. Our frontend study, in \autoref{chap:frontend}, also follows this
-strategy by focusing on a processor whose hardware counters give little to no
-insight on its frontend. While this goal is less relevant to \staticdeps{}, we
-rely on external libraries to abstract the underlying architecture.
+to model a processor's backend resources without resorting to its
+vendor-specific hardware counters. Our frontend study, in
+\autoref{chap:frontend}, also follows this strategy by focusing on a processor
+whose hardware counters give little to no insight on its frontend. While this
+goal is less relevant to \staticdeps{}, we rely on external libraries to
+abstract the underlying architecture.
 
 \medskip{}
 
@@ -154,7 +154,7 @@ port for both memory loads and stores.
 
 In most cases, execution units are \emph{fully pipelined}, meaning that while
 processing a single \uop{} takes multiple cycles, the unit is able to start
-processing a new \uop{} every cycle: multiple \uops{} are then being processed,
+processing a new \uop{} every cycle: multiple \uops{} are thus being processed,
 at different stages, during each cycle, akin to a factory's assembly line.
 
 \smallskip{}
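The hunk above describes fully pipelined execution units: a new µop can start every cycle even though each one takes several cycles to complete. As a quick illustrative sketch (not part of the diffed thesis; function names and numbers are made up), the cycle counts for `n` independent µops on such a unit versus an unpipelined one are:

```python
def pipelined_cycles(n_uops: int, latency: int) -> int:
    """Fully pipelined unit: one new uop starts each cycle, so the
    last of n independent uops starts at cycle n-1 and finishes
    `latency` cycles later."""
    return latency + (n_uops - 1)

def unpipelined_cycles(n_uops: int, latency: int) -> int:
    """Unpipelined unit: each uop must fully complete before the
    next one can start."""
    return n_uops * latency
```

For instance, 10 independent 4-cycle µops take 13 cycles pipelined but 40 cycles unpipelined, which is the factory-assembly-line effect the text alludes to.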
@@ -204,11 +204,12 @@ For this reason, many processors are now \emph{out-of-order}, while processors
 issuing \uops{} strictly in their original order are called \emph{in-order}.
 Out-of-order microarchitectures feature a \emph{reorder buffer}, from which
 instructions are picked to be issued. The reorder buffer acts as a sliding
-window of microarchitecturally-fixed size over \uops{}, from which the oldest
-\uop{} whose dependencies are satisfied will be executed. Thus, out-of-order
-CPUs are only able to execute operations out of order as long as the
-\uop{} to be executed is not too far ahead from the oldest \uop{} awaiting to
-be issued ---~specifically, not more than the size of the reorder buffer ahead.
+window of microarchitecturally-fixed size over decoded \uops{}, from which the
+oldest \uop{} whose dependencies are satisfied will be executed. Thus,
+out-of-order CPUs are only able to execute operations out of order as long as
+the \uop{} to be executed is not too far ahead from the oldest \uop{} awaiting
+to be issued ---~specifically, not more than the size of the reorder buffer
+ahead.
 
 It is also important to note that out-of-order processors are only out-of-order
 \emph{from a certain point on}: a substantial part of the processor's frontend
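The reorder-buffer paragraph in the hunk above can be sketched as a toy scheduler (an illustration under simplifying assumptions, not the thesis's processor model): each cycle, the oldest ready µop among the `window` oldest pending ones issues; a window of 1 degenerates to in-order issue.

```python
def issue_order(ready, window):
    """Toy sliding-window issue model. ready[i] is the cycle at
    which uop i's dependencies are satisfied. Each cycle, issue
    the oldest ready, not-yet-issued uop among the `window` oldest
    pending uops (the reorder buffer); otherwise stall."""
    n, cycle, issued, order = len(ready), 0, set(), []
    while len(order) < n:
        pending = [i for i in range(n) if i not in issued][:window]
        for i in pending:
            if ready[i] <= cycle:
                issued.add(i)
                order.append(i)
                break
        cycle += 1
    return order
```

With `ready = [5, 0, 0]`, a window of 3 lets µops 1 and 2 overtake the stalled µop 0, while a window of 1 forces strict program order, matching the text's point that reordering only happens within the buffer.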
@@ -238,13 +239,14 @@ word sizes.
 Some instructions, however, operate on chunks of multiple words at once. These
 instructions are called \emph{vector instructions}, or \emph{SIMD} for Single
 Instruction, Multiple Data. A SIMD ``add'' instruction may, for instance, add
-two chunks of 128 bits, which can for instance be treated each as four integers
+two chunks of 128 bits, treated each as four integers
 of 32 bits bundled together, as illustrated in \autoref{fig:cpu_simd}.
 
 \begin{figure}
 \centering
 \includegraphics[width=0.6\textwidth]{simd.svg}
-\caption{Example of SIMD add instruction on 128b}\label{fig:cpu_simd}
+\caption{Example of SIMD $4 \times 32\,\text{bits}$ add instruction on
+128 bits}\label{fig:cpu_simd}
 \end{figure}
 
 Such instructions present clear efficiency advantages. If the processor is able
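The lane-wise semantics of the 4×32-bit packed add discussed in the hunk above can be emulated in a few lines (an illustrative sketch; real SIMD hardware does this in one instruction): each 32-bit lane is added independently and wraps around, with carries never crossing lane boundaries.

```python
MASK32 = (1 << 32) - 1

def simd_add_4x32(a: int, b: int) -> int:
    """Emulate a packed add of two 128-bit values seen as four
    independent 32-bit lanes: each lane is added modulo 2**32,
    and no carry propagates into the next lane."""
    out = 0
    for lane in range(4):
        shift = 32 * lane
        la = (a >> shift) & MASK32
        lb = (b >> shift) & MASK32
        out |= ((la + lb) & MASK32) << shift
    return out
```

Note that adding `0xFFFFFFFF` and `1` in lane 0 yields `0` with no carry into lane 1, unlike a plain 128-bit integer addition.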
@@ -38,7 +38,7 @@ analyze a code fragment ---~typically at assembly or binary level~---, and
 provide insights on its performance metrics on a given hardware. Code analyzers
 thus work statically, that is, without executing the code.
 
-\paragraph{Common hypotheses.} Code analyzers operate under a common
+\paragraph{Common hypotheses.} Code analyzers operate under a set of common
 hypotheses, derived from the typical intended usage.
 
 The kernel analyzed is expected to be the body of a loop, or
@@ -101,8 +101,8 @@ than on edge cases.
 
 As most code analyzers are static, this manuscript largely focuses on static
 analysis. The only dynamic code analyzer we are aware of is \gus{}, described
-more thoroughly in \autoref{sec:sota} later, trading heavily runtime to gain in
-accuracy, especially regarding data dependencies that may not be easily
+more thoroughly in \autoref{sec:sota} later, trading heavily run time to gain
+in accuracy, especially regarding data dependencies that may not be easily
 obtained otherwise.
 
 \paragraph{Input formats used.} The analyzers studied in this manuscript all
@@ -111,12 +111,12 @@ take as input either assembly code, or assembled binaries.
 In the case of assembly code, as for instance with \llvmmca{}, analyzers
 take either a short assembly snippet, treated as straight-line code and
 analyzed as such; or longer pieces of assembly, part or parts of which being
-marked for analysis my surrounding assembly comments.
+marked for analysis by surrounding assembly comments.
 
 In the case of assembled binaries, as all analyzers were run on Linux,
 executables or object files are ELF files. Some analyzers work on sections of
 the file defined by user-provided offsets in the binary, while others require
-the presence of \iaca{} markers around the code portion or portions to be
+the presence of \textit{\iaca{} markers} around the code portion or portions to be
 analyzed. Those markers, introduced by \iaca{} as C-level preprocessor
 statements, consist in the following x86 assembly snippets:
 
@@ -198,11 +198,11 @@ Iterating it takes 106 cycles instead of the expected 100 cycles, as this
 execution is \emph{not} in steady-state, but accounts for the cycles from the
 decoding of the first instruction to the retirement of the last.
 
-The row 7 indicates that each cycle, the frontend can issue at most 3 \uops{}.
-The next two rows are simple ratios. Row 10 is the block's \emph{reverse
-throughput}, which we formalize later in \autoref{sssec:def:rthroughput}, but
-is roughly defined as the number of cycles a single iteration of the kernel
-takes.
+Row 7 indicates that each cycle, the frontend can issue at most 3 \uops{}. The
+next two rows are simple ratios. Row 10 is the block's \emph{reverse
+throughput}, which we will note $\cyc{\kerK}$ and formalize later in
+\autoref{sssec:def:rthroughput}, but is roughly defined as the number of cycles
+a single iteration of the kernel takes.
 
 The next section, \emph{instruction info}, lists data about the instructions
 present.
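The issue-width row discussed in the hunk above directly bounds the reverse throughput: with at most 3 µops issued per cycle, a kernel of `n` µops cannot average fewer than `n / 3` cycles per iteration. A minimal sketch of that bound (the issue width of 3 is taken from the example output; the function name is made up):

```python
def frontend_issue_cycles(n_uops: int, issue_width: int = 3) -> float:
    """Lower bound on steady-state cycles per kernel iteration
    imposed by the frontend alone: at most `issue_width` uops
    issue per cycle, so n_uops need at least n_uops / issue_width
    cycles on average (fractional, since successive iterations
    overlap in steady state)."""
    return n_uops / issue_width
```

The actual reverse throughput is at least this value, and may be larger if a backend resource is the bottleneck instead.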
@@ -227,7 +227,7 @@ which indicates, for each instruction, the timeline of its execution. Here,
 \texttt{D} stands for decode, \texttt{e} for being executed --~in the
 pipeline~--, \texttt{E} for last cycle of its execution --~leaving the
 pipeline~--, \texttt{R} for retiring. When an instruction is decoded and
-waiting to be dispatched to execution, a \texttt{=} is shown.
+waiting to be dispatched to execution, an \texttt{=} is shown.
 
 The identifier at the beginning of each row indicates the kernel iteration
 number, and the instruction within.
@@ -361,7 +361,8 @@ time is measured until the last instruction is issued, not retired.
 Thus, by the pigeon-hole principle, there exists $p \in \nat$ such that
 $\sigma(\kerK) = \sigma(\kerK^{p+1})$. By induction, as each state depends only on the
 previous one, we thus obtain that $(\sigma(\kerK^n))_n$ is periodic of
-period $p$.
+period $p$. As we consider only the execution's steady state, the sequence
+is periodic from rank 0.
 
 As the number of cycles needed to execute $\kerK$ only depend on the
 initial state of the processor, we thus have
@@ -375,8 +376,8 @@ time is measured until the last instruction is issued, not retired.
 and measured in \emph{cycles per iteration}, is also called the
 steady-state execution time of a kernel.
 
-We note $p \in \nat^*$ the period of $\ckn{n+1} - \ckn{n}$ (by the above
-lemma), and define \[
+We note $p = \calP(\kerK) \in \nat^*$ the period of $\ckn{n+1} - \ckn{n}$
+(by the above lemma), and define \[
     \cyc{\kerK} = \dfrac{\ckn{p}}{p}
 \]
 \end{definition}
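The definition edited in the hunk above, $\cyc{\kerK} = \ckn{p}/p$ with $p$ the period of the successive differences $\ckn{n+1} - \ckn{n}$, can be sketched numerically (an illustration of the definition, assuming the input sequence is long enough and periodic from rank 0, as the diff's added sentence states):

```python
def find_period(diffs):
    """Smallest p >= 1 such that the sequence repeats with
    period p (assumed periodic from rank 0)."""
    for p in range(1, len(diffs)):
        if all(diffs[i] == diffs[i % p] for i in range(len(diffs))):
            return p
    return len(diffs)

def reverse_throughput(cycles):
    """cycles[i] plays the role of C(K^(i+1)): cycles to run i+1
    kernel iterations. Returns C(K^p)/p for p the period of the
    successive differences (with C(K^0) = 0)."""
    diffs = [cycles[0]] + [cycles[i] - cycles[i - 1]
                           for i in range(1, len(cycles))]
    p = find_period(diffs)
    return cycles[p - 1] / p
```

A sequence such as `[2, 3, 5, 6, 8, 9]` has period-2 differences `2, 1, ...` and yields a reverse throughput of 1.5 cycles per iteration, matching the $\cyc{\kerK} = 1.5$ example appearing later in the diffed chapter.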
@@ -396,8 +397,8 @@ $\cyc{\kerK} = 1.5$.
 
 \begin{remark}
 As $C(\kerK)$ depends on the microarchitecture of the processor considered,
-the throughput $\cyc{\kerK}$ of a kernel $\kerK$ depends on the processor
-considered.
+the throughput $\cyc{\kerK}$ of a kernel $\kerK$ implicitly depends on the
+processor considered.
 \end{remark}
 
 \medskip
@@ -440,10 +441,10 @@ $\cyc{\kerK} = 1.5$.
 \end{lemma}
 
 \begin{proof}
-Let $n \in \nat^*$. We note $p \in \nat^*$ the periodicity by the above
-lemma.
+Let $n \in \nat^*$ and $p = \calP(\kerK) \in \nat^*$ the periodicity by the
+above lemma.
 
-Let $k, r \in \nat^*$ such that $n = kp+r$, $0 < r \leq p$.
+Let $k, r \in \nat^*$ such that $n = kp+r$, $1 \leq r \leq p$.
 
 \begin{align*}
 \ckn{n} &= k \cdot \ckn{p} + \ckn{r} & \textit{(by lemma)} \\
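The proof edited above uses the slightly unusual Euclidean decomposition $n = kp + r$ with $1 \leq r \leq p$ (rather than $0 \leq r < p$), so that $\ckn{r}$ is always defined. A small sketch of this convention, checked against the lemma's identity $\ckn{n} = k \cdot \ckn{p} + \ckn{r}$ on a made-up periodic cycle sequence (illustration only, not the thesis's code):

```python
def decompose(n, p):
    """Write n = k*p + r with 1 <= r <= p, as in the proof's
    convention: when p divides n, take r = p and k = n/p - 1."""
    r = n % p or p
    return (n - r) // p, r
```

For the period-2 sequence `C(K^1..6) = [2, 3, 5, 6, 8, 9]`, each `C(K^n)` indeed equals `k * C(K^p) + C(K^r)` under this decomposition, which is the step the `align*` block above carries out symbolically.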
@@ -561,9 +562,9 @@ of this manuscript.
 \smallskip{}
 
 An instruction is said to be a \emph{flow-altering instruction} if this
-address may alter the normal control flow of the program. This is typically
-true of jumps (conditional or unconditional), function calls, function
-returns, \ldots
+instruction may alter the normal control flow of the program. This is
+typically true of jumps (conditional or unconditional), function calls,
+function returns, \ldots
 
 \smallskip{}
 
@@ -81,7 +81,7 @@ approach, but also limits it to microarchitectures offering such counters, and
 requires a manual analysis of each microarchitecture to be supported in order
 to find a fitting set of blocking instructions. Although we have no theoretical
 guarantee of the existence of such instructions, this should never be a
-problem, as all pragmatic microarchitecture design will yield to their
+problem, as all pragmatic microarchitecture design will lead to their
 existence.
 
 \subsection{Code analyzers and their models}
@@ -89,8 +89,8 @@ existence.
 Going further than data extraction at the individual instruction level,
 academics and industrials interested in this domain now mostly work on
 code analyzers, as described in \autoref{ssec:code_analyzers} above. Each such
-tool embeds a model --~or collection of models~-- on which to base its
-inference, and whose definition, embedded data and obtention method varies from
+tool embeds a model --~or collection of models~-- on which its inference is
+based, and whose definition, embedded data and obtention method varies from
 tool to tool. These tools often use, to some extent, the data on individual
 instructions obtained either from the manufacturer or the third-party efforts
 mentioned above.
@@ -142,7 +142,7 @@ its context. This approach, in our experiments, was significantly less accurate
 than those not based on machine learning. In our opinion, its main issue,
 however, is to be a \textit{black-box model}: given a kernel, it is only able
 to predict its reverse throughput. Doing so, even with perfect accuracy, does
-not explain the source of a performance problem: the model is unable to help in
+not explain the source of a performance problem: the model is unable to help
 detecting which resource is the performance bottleneck of a kernel; in other
 words, it quantifies a potential issue, but does not help in \emph{explaining}
 it --~or debugging it.
@@ -171,4 +171,4 @@ this manuscript), with results comparable with \llvmmca{}. Its source code
 --~under free software license~-- is self-contained and reasonably concise
 (about 2,000 lines of Python for the main part), making it a good basis and
 baseline for experiments. It is, however, closely tied by design to Intel
-microarchitectures, or microarchitectures very alike to Intel's ones.
+microarchitectures, or microarchitectures very close to Intel's ones.