Proof-read chapter 5 (staticdeps)

Théophile Bastian 2024-09-01 16:05:21 +02:00
parent 9cfeddeef7
commit b5b0296102
10 changed files with 107 additions and 90 deletions
@@ -20,7 +20,7 @@ down into four categories:
 For instance, in the kernel presented in the introduction of this chapter, the
 first instruction (\lstxasm{add \%rax, \%rbx}) reads its first operand, the
-register \reg{rax}, and both reads and write its second operand \reg{rbx}. The
+register \reg{rax}, and both reads and writes its second operand \reg{rbx}. The
 second \lstxasm{add} has the same behaviour. Thus, as \reg{rbx} is written at
 line 1, and read at line 2, there is a read-after-write dependency between the
 two.
@@ -51,7 +51,7 @@ however, other channels.
 As we saw in the introduction to this chapter, as well as in the previous
 chapter, dependencies can also be \emph{memory-carried}, in more or less
 straightforward ways, such as in the examples from
-\autoref{lst:mem_carried_exn}, where the last line always depend on the first.
+\autoref{lst:mem_carried_exn}, where the last line always depends on the first.
 
 \begin{lstfloat}[h!]
 \begin{minipage}[t]{0.32\linewidth}
@@ -67,8 +67,8 @@ add -8(%rbx), %rcx\end{lstlisting}
 \end{minipage}\hfill
 \begin{minipage}[t]{0.32\linewidth}
 \begin{lstlisting}[language={[x86masm]Assembler}]
-lea 16(%rbx), %r10
 add %rax, (%rbx)
+lea 16(%rbx), %r10
 add -16(%r10), %rcx\end{lstlisting}
 \end{minipage}\hfill
 \caption{Examples of memory-carried dependencies.}\label{lst:mem_carried_exn}
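The aliasing in the third example can be checked by resolving the effective
addresses by hand. The short Python sketch below does exactly that; the base
address is an arbitrary stand-in chosen for illustration, not a value taken
from a real execution.

\begin{lstlisting}[language=Python]
# Why `add -16(%r10), %rcx` reads the very cell that `add %rax, (%rbx)`
# wrote, once `lea 16(%rbx), %r10` has executed.
rbx = 0x7F0000001000             # hypothetical base address
store_addr = rbx                 # add %rax, (%rbx) writes at rbx + 0
r10 = rbx + 16                   # lea 16(%rbx), %r10
load_addr = r10 - 16             # add -16(%r10), %rcx reads here
assert store_addr == load_addr   # same cell: read-after-write dependency
\end{lstlisting}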
@@ -94,27 +94,36 @@ with a large emphasis on memory-carried dependencies.
 \paragraph{Presence of loops.} The previous examples were all pieces of
 \emph{straight-line code} in which a dependency arose. However, many
 dependencies are actually \emph{loop-carried}, such as those in
-\autoref{lst:loop_carried_exn}.
+\autoref{lst:loop_carried_exn}. In \autoref{lst:loop_carried_exn:sumA}, line 2
+depends on the previous iteration's line 2 as \reg{r10} is read, then written
+back. In \autoref{lst:loop_carried_exn:quasiFibo}, line 3 depends on line 2 of
+the same iteration; but line 2 also depends on line 3 two iterations ago by
+reading \lstxasm{-16(\%rbx, \%r10)}.
 
 \begin{lstfloat}
-\begin{minipage}[t]{0.48\linewidth}
-\begin{lstlisting}[language={[x86masm]Assembler}]
-# Compute sum(A), %rax points to A
+\centering
+\begin{sublstfloat}[b]{0.48\linewidth}
+\begin{lstlisting}[language={[x86masm]Assembler}]
 loop:
 add (%rax), %r10
 add $8, %rax
 jmp loop
 \end{lstlisting}
-\end{minipage}\hfill
-\begin{minipage}[t]{0.48\linewidth}
-\begin{lstlisting}[language={[x86masm]Assembler}]
-# Compute B[i] = A[i] + B[i-2]
+\caption{Compute the sum of array \lstxasm{A}'s terms in \reg{r10}. \reg{rax} points to
+\lstxasm{A}.}\label{lst:loop_carried_exn:sumA}
+\end{sublstfloat}
+\hfill
+\begin{sublstfloat}[b]{0.48\linewidth}
+\begin{lstlisting}[language={[x86masm]Assembler}]
 loop:
 mov -16(%rbx, %r10), (%rbx, %r10)
 add (%rax, %r10), (%rbx, %r10)
 add $8, %r10
 jmp loop
 \end{lstlisting}
-\end{minipage}\hfill
-\caption{Examples of loop-carried dependencies.}\label{lst:loop_carried_exn}
+\caption{Compute \lstxasm{B[i] = A[i] + B[i-2]}. \reg{rax} points to
+\lstxasm{A}, \reg{rbx} points to \lstxasm{B}.}\label{lst:loop_carried_exn:quasiFibo}
+\end{sublstfloat}
+\caption{Examples of loop-carried dependencies.}\label{lst:loop_carried_exn}
 \end{lstfloat}
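The two-iterations distance claimed for
\autoref{lst:loop_carried_exn:quasiFibo} follows from the ratio between the
load's offset and the induction step; a back-of-envelope check in Python:

\begin{lstlisting}[language=Python]
# The load reads 16 bytes below the cell currently written, while %r10
# advances by 8 bytes per iteration: the producing store therefore ran
# two iterations earlier.
offset_bytes, stride_bytes = 16, 8
print(offset_bytes // stride_bytes)   # -> 2 iterations back
\end{lstlisting}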
@@ -5,18 +5,20 @@ the actual data dependencies occurring throughout an execution. While such
 analyzers are often too slow to use in practice, they can be used as a baseline
 to evaluate static alternatives.
 
-Due to its complexity, \gus{} is, however, ill-suited for the implementation of
-a simple experiment if one is not already familiar with its codebase. For this
-reason, we implement \depsim{}, a dynamic analysis tool based on top of
+As it is a complex tool performing a wide range of analyses, \gus{} is,
+however, needlessly heavyweight to simply serve as a baseline. For the same
+reason, it is also impractically slow compared to a simple dynamic analysis.
+For this reason, we implement \depsim{}, a dynamic analysis tool built on top of
 \valgrind{}, whose goal is to report dependencies encountered at runtime.
 
 \subsection{Valgrind}
 
 Most low-level developers and computer scientists know
-\valgrind{}~\cite{tool:valgrind} as a memory
-analysis tool, reporting bad memory accesses, memory leaks and the like.
-However, this is only the \texttt{memcheck} tool built into \valgrind{}, while the
-whole program is actually a binary instrumentation framework.
+\valgrind{}~\cite{tool:valgrind} as a memory analysis tool, reporting bad
+memory accesses, memory leaks and the like. However, this is only a small part
+of \valgrind{} ---~the \texttt{memcheck} tool. The whole program is actually a
+binary instrumentation framework, upon which \texttt{memcheck} itself is
+built.
 
 \valgrind{} supports a wide variety of platforms, including x86-64 and
 ARM. However, at the time of writing, it supports AVX2, but does not yet
@@ -41,7 +43,7 @@ for some assembly code, independently of the Valgrind framework.
 \subsection{Depsim}\label{ssec:depsim}
 
-The tool we write to extract runtime-gathered dependencies, \depsim{}, is
+The tool we wrote to extract runtime-gathered dependencies, \depsim{}, is
 able to extract dependencies through registers, memory and temporary
 variables ---~in its intermediate representation, Valgrind keeps some values
 assigned to temporary variables in static single-assignment (SSA) form.
@@ -92,5 +94,5 @@ avoid excessive instrumentation slowdown.
 We further annotate every write to the shadow memory with the timestamp at
 which it occurred. Whenever a dependency should be added, we first check that
 the dependency has not expired ---~that is, that it is not older than a given
-threshold. This threshold is tunable for each run --~and may be set to infinity
+threshold. This threshold is tunable for each run ---~and may be set to infinity
 to keep every dependency.
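A minimal sketch of this timestamped shadow memory, with illustrative names
and an arbitrary threshold value rather than \depsim{}'s actual internals:

\begin{lstlisting}[language=Python]
THRESHOLD = 1 << 20      # tunable per run; None means "keep everything"

shadow_memory = {}       # address -> (writing instruction, timestamp)
deps = set()             # (writer, reader) dependencies reported
clock = 0

def on_write(addr, instr):
    global clock
    clock += 1
    shadow_memory[addr] = (instr, clock)

def on_read(addr, instr):
    global clock
    clock += 1
    if addr not in shadow_memory:
        return
    writer, stamp = shadow_memory[addr]
    # discard the dependency if it has expired, i.e. if the write is
    # older than the threshold
    if THRESHOLD is None or clock - stamp <= THRESHOLD:
        deps.add((writer, instr))
\end{lstlisting}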
@@ -1,28 +1,24 @@
 \section{Static dependencies detection}\label{ssec:staticdeps_detection}
 
-Depending on the type of dependencies considered, it is more or less difficult
-to statically detect them.
-
-\paragraph{Register-carried dependencies in straight-line code.} This case is
-the easiest to statically detect, and is most often supported by code analyzers
----~for instance, \llvmmca{} supports it. The same strategy that was used to
-dynamically find dependencies in \autoref{ssec:depsim} can still be used: a
-shadow register file simply keeps track of which instruction last wrote each
-register.
-
-\paragraph{Register-carried, loop-carried dependencies.} Loop-carried
-dependencies can, to some extent, be detected the same way. As the basic block
-is always assumed to be the body of an infinite loop, a straight-line analysis
-can be performed on a duplicated kernel. This strategy is \eg{} adopted by
-\osaca{}~\cite{osaca2} (§II.D).
-
-When dealing only with register accesses, this
-strategy is always sufficient: as each iteration always executes the same basic
-block, it is not possible for an instruction to depend on another instruction
-two iterations earlier or more.
-
-\paragraph{Memory-carried dependencies in straight-line code.} Memory
-dependencies, however, are significantly harder to tackle. While basic
+Depending on their type, some dependencies are significantly harder to
+statically detect than others.
+
+\textbf{Register-carried dependencies}, when in straight-line code, can be
+detected by keeping track of which instruction last wrote each register in a
+\emph{shadow register file}. This is most often supported by code analyzers
+---~for instance, \llvmmca{} and \uica{} support it.
+
+Loop-carried dependencies can, to some extent, be detected the same way. As the
+basic block is always assumed to be the body of an infinite loop, a
+straight-line analysis can be performed on a duplicated kernel. This strategy
+is \eg{} adopted by \osaca{}~\cite{osaca2} (§II.D). When dealing only with
+register accesses, this strategy is always sufficient: as each iteration always
+executes the same basic block, it is not possible for an instruction to depend
+on another instruction two iterations earlier or more.
+
+\smallskip
+
+\textbf{Memory-carried dependencies}, however, are significantly harder to tackle. While basic
 heuristics can handle some simple cases, in the general case two main
 difficulties arise:
 \begin{enumerate}[(i)]
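The shadow-register-file strategy above is compact enough to sketch in full.
The Python below is an illustrative reconstruction (not code from any of the
analyzers cited), and its (written, read) encoding of instructions is a
deliberate simplification; duplicating the kernel is what surfaces the
loop-carried pairs.

\begin{lstlisting}[language=Python]
def register_deps(kernel, unroll=2):
    """kernel: one (written_regs, read_regs) pair per instruction.
    Returns (producer line, consumer line) pairs, 0-indexed."""
    last_writer = {}    # shadow register file: reg -> last writing index
    deps = set()
    n = len(kernel)
    for idx in range(n * unroll):    # analyse the duplicated kernel
        written, read = kernel[idx % n]
        for reg in read:
            if reg in last_writer:
                deps.add((last_writer[reg] % n, idx % n))
        for reg in written:
            last_writer[reg] = idx
    return deps

# The sum(A) kernel: add (%rax), %r10 / add $8, %rax / jmp loop
kernel = [({"r10"}, {"rax", "r10"}),   # reads rax and r10, writes r10
          ({"rax"}, {"rax"}),          # reads and writes rax
          (set(), set())]              # jmp: no register data flow
print(register_deps(kernel))           # {(0, 0), (1, 0), (1, 1)}
\end{lstlisting}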
@@ -41,12 +37,14 @@ difficulties arise:
 Tracking memory-carried dependencies is, to the best of our knowledge, not done
 in code analyzers, as our results in \autoref{chap:CesASMe} suggest.
 
-\paragraph{Loop-carried, memory-carried dependencies.} While the strategy
-previously used for register-carried dependencies is sufficient to detect
-loop-carried dependencies from one occurrence to the next one, it is not
-sufficient at all times when the dependencies tracked are memory-carried. For
-instance, in the second example from \autoref{lst:loop_carried_exn}, an
-instruction depends on another two iterations ago.
+\smallskip{}
+
+While the strategy previously used for register-carried dependencies is
+sufficient to detect loop-carried dependencies from one occurrence to the next
+one, it is not sufficient at all times when the dependencies tracked are
+memory-carried. For instance, in the second example from
+\autoref{lst:loop_carried_exn}, an instruction depends on another two
+iterations ago.
 
 Dependencies can reach arbitrarily old iterations of a loop: in this example,
 \lstxasm{-8192(\%rbx, \%r10)} may be used to reach 1\,024 iterations back.
@@ -55,7 +53,7 @@ necessarily \emph{relevant} from a performance analysis point of view. Indeed,
 if an instruction $i_2$ depends on a result previously produced by an
 instruction $i_1$, this dependency is only relevant if it is possible that
 $i_1$ is not yet completed when $i_2$ is considered for issuing ---~else, the
-result is already produced, and $i_2$ needs not wait to execute.
+result is already produced, and $i_2$ need never wait to execute.
 
 The reorder buffer (ROB) of a CPU can be modelled as a sliding window of fixed
 size over \uops{}. In particular, if a \uop{} $\mu_1$ is not yet retired, the
@@ -7,11 +7,11 @@ performance}\label{ssec:staticdeps:rob_proof}
 for the subtrace $\left(I_r\right)_{p < r < p'}$.
 \end{definition}
 
-\begin{theorem}[Long distance dependencies]\label{thm.longdist}
-There exists $R \in \nat$, only dependent of microarchitectural parameters,
-such that the presence or absence of a dependency between two instructions
-that are separated by at least $R-1$ other \uops{} has no impact on the
-performance of this kernel.
+\begin{theorem}[Long distance dependencies]\label{thm.longdist} Given a kernel
+$\kerK$, there exists $R \in \nat$, dependent only on microarchitectural
+parameters, such that the presence or absence of a dependency between two
+instructions that are separated by at least $R-1$ other \uops{} has no
+impact on the performance of this kernel.
 
 More formally, let $\kerK$ be a kernel of $n$ instructions. Let $(I_p)_{p
 \in \nat}$ be the trace of $\kerK$'s instructions executed in a loop. For
@@ -29,7 +29,10 @@ the reorder buffer.
 It contains only decoded \uops{}.
 Let us denote $i_d$ the \uop{} at position $d$ in the reorder buffer.
 Assume $i_d$ just got decoded.
-We have that for every $q$ and $q'$ in $[0,R)$:
+\nopagebreak{}As the buffer is a circular FIFO, we have that for every $q$
+and $q'$ in $[0,R)$:
 \[
 (q-d-1) \% R<(q'-d-1) \% R\
 \iff \ i_q \text{ was decoded before } i_{q'}
@@ -45,7 +48,8 @@ buffer}:
 \begin{postulate}[Full reorder buffer]
 Let us denote by $i_d$ the \uop{} that just got decoded.
-The reorder buffer is said to be full if for $q=(d+1) \% R$, \uop{} $i_q$ is not retired yet.
+The reorder buffer is said to be \emph{full} if for $q=(d+1) \% R$, \uop{}
+$i_q$ is not retired yet.
 If the reorder buffer is full, then instruction decoding is stalled.
 \end{postulate}
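Both the decode-order test and this stall condition become concrete on a toy
circular buffer; the values of $R$, $d$ and the slot states below are made up
for illustration:

\begin{lstlisting}[language=Python]
R = 8
retired = [True] * R    # whether the uop in each slot is retired
d = 5                   # slot of the uop i_d that just got decoded

def decoded_before(q, q_prime):
    # i_q was decoded before i_q' iff q is "older" modulo R, counting
    # from the slot right after the newest uop i_d
    return (q - d - 1) % R < (q_prime - d - 1) % R

def rob_full():
    # decoding stalls iff the oldest slot, (d + 1) % R, is not retired
    return not retired[(d + 1) % R]

assert decoded_before((d + 1) % R, d)   # the oldest slot precedes i_d
\end{lstlisting}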
@@ -58,10 +62,10 @@ To prove the theorem above, we need to state that any two in-flight \uops{} are
 For any instruction $I_p$, we denote as $Q_p$ the range of indices such that
 $(i_q)_{q\in Q_p}$ are the \uops{} obtained from the decoding of $I_p$.
-Note that in practice, we may not have $\bigcup{}_p Q_p = [0, n)$, as \eg{}
-branch mispredictions may introduce unwanted \uops{} in the pipeline. However,
-as the worst case for the lemma below occurs when no such ``spurious'' \uops{}
-are present, we may safely ignore such occurrences.
+Note that in practice, it is possible that we do not have $\bigcup{}_p Q_p =
+[0, n)$, as \eg{} branch mispredictions may introduce unwanted \uops{} in the
+pipeline. However, as the worst case for the lemma below occurs when no such
+``spurious'' \uops{} are present, we may safely ignore such occurrences.
 
 \begin{lemma}[Distance of in-flight \uops{}]
 For any pair of instructions $(I_p,I_{p'})$, and two corresponding \uops{},
@@ -1,8 +1,9 @@
 \section{Staticdeps}
 
 The static analyzer we present, \staticdeps{}, only aims to tackle the
-difficulty~\ref{memcarried_difficulty_arith} mentioned above: tracking
-dependencies across arbitrarily complex pointer arithmetic.
+difficulty~(\ref{memcarried_difficulty_arith}) mentioned in
+\autoref{ssec:staticdeps_detection}: tracking dependencies across arbitrarily
+complex pointer arithmetic.
 
 To do so, \staticdeps{} works at the basic-block level, unrolled enough times
 to fill the reorder buffer as detailed above; this way, arbitrarily
@@ -19,7 +20,7 @@ two arbitrary expressions can be costly.
 \caption{The \staticdeps{} algorithm}\label{alg:staticdeps}
 \end{algorithm}
 
-Instead, we use an heuristic based on random values. We consider the set $\calR
+Instead, we use a heuristic based on random values. We consider the set $\calR
 = \left\{0, 1, \ldots, 2^{64}-1\right\}$ of values representable by a 64-bit
 unsigned integer; we extend this set to $\bar\calR = \calR \cup \{\bot\}$,
 where $\bot$ denotes an invalid value. We then proceed as previously for
@@ -13,7 +13,7 @@
 \smallskip{}
 
 \Function{read\_memory}{address}
-\State{} \textbf{Assert} address \neq{}
+\State{} \textbf{Assert} address \neq{} $\bot$
 \If{address \not\in{} shadow\_memory}
 \State{} shadow\_memory[address] $\gets$ \Call{fresh}{}
 \EndIf{}
@@ -21,7 +21,7 @@
 \EndFunction{}
 
 \Function{read\_register}{register}
-\Comment{likewise, without dependency tracking}
+\State{} \ldots \Comment{Likewise, without dependency tracking}
 \EndFunction{}
 
 \smallskip{}
@@ -36,13 +36,13 @@
 \EndIf{}
 \State{} \Return{} \Call{read\_memory}{addr}
 \ElsIf{expr == IntegerArithmeticOp(operator, op1, …, opN)}
-\If{\Call{expr\_value}{op\_i} == for any i}
-\State{} \Return{}
+\If{\Call{expr\_value}{op\_i} == $\bot$ for any i}
+\State{} \Return{} $\bot$
 \EndIf{}
-\State\Return{} semantics(operator)(\Comment{provided by Valgrind's Vex}
+\State\Return{} semantics(operator)(\Comment{Provided by Valgrind's Vex}
 \State{} \quad \Call{expr\_value}{op1}, …, \Call{expr\_value}{opN})
 \Else{}
-\Return{}
+\Return{} $\bot$
 \EndIf{}
 \EndFunction{}
@@ -52,8 +52,8 @@
 \State{} shadow\_register[reg] \gets{} \Call{expr\_value}{rhs}
 \ElsIf{lhs == Memory(addr\_expr)}
 \State{} addr \gets{} \Call{expr\_value}{addr\_expr}
-\State{} last\_wrote\_at[addr] \gets{}(cur\_iter, cur\_instruction)
-\State{} shadow\_memory[addr] <- \Call{expr\_value}{rhs}
+\State{} last\_wrote\_at[addr] \gets{} (cur\_iter, cur\_instruction)
+\State{} shadow\_memory[addr] \gets{} \Call{expr\_value}{rhs}
 \ElsIf{\ldots}
 \Comment{Etc.}
 \EndIf{}
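Read in diff form, these fragments can be hard to follow; the rough Python
transcription below assembles them. Here \texttt{BOT} models $\bot$, and the
tuple-based expression encoding is a simplified stand-in for Valgrind's Vex
IR, not its real API.

\begin{lstlisting}[language=Python]
import random

BOT = None                     # the invalid value, bottom
shadow_memory, shadow_register = {}, {}
last_wrote_at = {}             # address -> (iteration, instruction)

def fresh():
    return random.randrange(2 ** 64)     # uniform draw from R

def read_memory(address):
    assert address is not BOT
    if address not in shadow_memory:
        shadow_memory[address] = fresh()
    return shadow_memory[address]

def expr_value(expr):
    kind = expr[0]
    if kind == "const":
        return expr[1]
    if kind == "register":               # likewise, no dep tracking
        return shadow_register.setdefault(expr[1], fresh())
    if kind == "memory":                 # ("memory", address expr)
        addr = expr_value(expr[1])
        return BOT if addr is BOT else read_memory(addr)
    if kind == "arith":                  # ("arith", semantics, op1, ..., opN)
        operands = [expr_value(op) for op in expr[2:]]
        if any(v is BOT for v in operands):
            return BOT
        return expr[1](*operands) % 2 ** 64
    return BOT                           # any unmodelled expression

def write(lhs, rhs, cur_iter, cur_instruction):
    if lhs[0] == "register":
        shadow_register[lhs[1]] = expr_value(rhs)
    elif lhs[0] == "memory":
        addr = expr_value(lhs[1])
        if addr is not BOT:
            last_wrote_at[addr] = (cur_iter, cur_instruction)
            shadow_memory[addr] = expr_value(rhs)
\end{lstlisting}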
@@ -220,8 +220,8 @@ analyzers.
 \subsection{Enriching \uica{}'s model}
 
-To estimate the real gain in performance debugging scenarios, however, we
-integrate \staticdeps{} into \uica{}.
+To estimate the real gain in performance debugging scenarios, we integrate
+\staticdeps{} into \uica{}.
 
 There is, however, a discrepancy between the two tools: while \staticdeps{}
 works at the assembly instruction level, \uica{} works at the \uop{} level. In
@@ -233,11 +233,11 @@ for the \staticdeps{} analysis.
 We bridge this gap in a conservative way: whenever two instructions $i_1, i_2$
 are found to be dependent, we add a dependency between each couple $\mu_1 \in
-i_1, \mu_2 \in i_2$. This approximation is thus pessimistic, and should predict
-execution times biased towards a slower computation kernel. A finer model, or a
-finer (conservative) filtering of which \uops{} must be considered dependent
----~\eg{} a memory dependency can only come from a memory-related \uop{}~---
-may enhance the accuracy of our integration.
+i_1, \mu_2 \in i_2$. This approximation is thus largely pessimistic, and should
+predict execution times biased towards a slower computation kernel. A finer
+model, or a finer (conservative) filtering of which \uops{} must be considered
+dependent ---~\eg{} a memory dependency can only come from a memory-related
+\uop{}~--- may enhance the accuracy of our integration.
 
 \begin{table}
 \centering
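The conservative bridging just described amounts to a Cartesian product per
instruction-level dependency; a sketch, with a hypothetical data layout
rather than \uica{}'s actual structures:

\begin{lstlisting}[language=Python]
from itertools import product

def uop_level_deps(instr_deps, uops_of):
    """instr_deps: set of (i1, i2) instruction-level dependencies;
    uops_of: maps each instruction to the list of its uops."""
    deps = set()
    for i1, i2 in instr_deps:
        # every uop of i1 is paired with every uop of i2
        deps.update(product(uops_of[i1], uops_of[i2]))
    return deps
\end{lstlisting}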
@@ -17,8 +17,10 @@ context} pointed out in the previous chapter.
 \medskip{}
 
 Our evaluation of \staticdeps{} against a dynamic analysis baseline,
-\depsim{}, shows that it only finds about 60\,\% of the existing dependencies.
-We however enrich \uica{} with \staticdeps{}, and find that it performs on the
+\depsim{}, shows that it finds between 95\,\% and 98\,\% of the existing
+dependencies, depending on the metric used, giving us good confidence in the
+reliability of \staticdeps{}.
+We further enrich \uica{} with \staticdeps{}, and find that it performs on the
 full \cesasme{} dataset as well as \uica{} alone on the pruned dataset of
 \cesasme{}, removing memory-carried bottlenecks. From this, we conclude that
 \staticdeps{} is very successful at finding the data dependencies through
@@ -44,15 +44,15 @@ create a full analyzer implementing this idea, such as what we did with \palmed{
 for backend models, or such as \uica{}'s implementation, focusing on frontend
 analysis.
 
-In hindsight, we advocate for the emergence of such a modular code analyzer.
-It would maybe not be as convenient or well-integrated as ``production-ready''
-code analyzers, such as \llvmmca{} ---~which is packaged for Debian. It could,
-however, greatly simplify the academic process of trying a new idea on any of
-the three main models, by decorrelating them. It would also ease the
-comparative evaluation of those ideas, while eliminating many of the discrepancies
-between experimental setups that make an actual comparison difficult ---~the
-reason that prompted us to make \cesasme{} in \autoref{chap:CesASMe}. Indeed,
-with such a modular tool, it would be easy to run the same experiment, in the
-same conditions, while only changing \eg{} the frontend model but keeping a
-well-tried backend model.
+In hindsight, we advocate for the emergence of such a modular code analyzer. It
+might not be as convenient or well-integrated as ``production-ready''
+code analyzers, such as \llvmmca{} ---~which is officially packaged for Debian.
+It could, however, greatly simplify the academic process of trying a new idea
+on any of the three main models, by decoupling them. It would also ease the
+comparative evaluation of those ideas, while eliminating many of the
+discrepancies between experimental setups that make an actual comparison
+difficult ---~the reason that prompted us to make \cesasme{} in
+\autoref{chap:CesASMe}. Indeed, with such a modular tool, it would be easy to
+run the same experiment, in the same conditions, while only changing \eg{} the
+frontend model but keeping a well-tried backend model.
@@ -88,6 +88,7 @@
 \newfloat{lstfloat}{htbp}{lop}
 \floatname{lstfloat}{Listing}
 \def\lstfloatautorefname{Listing}
+\DeclareCaptionSubType{lstfloat}
 \newfloat{algorithm}{htbp}{lop}
 \floatname{algorithm}{Algorithm}