Init staticdeps, write §1

2023-09-27 17:02:30 +02:00 · 2023-09-27 17:02:30 +02:00 · c2acf78476
commit c2acf78476
parent e1eb20bb2c
9 changed files with 202 additions and 30 deletions
--- a/manuscrit/60_staticdeps/00_intro.tex
+++ b/manuscrit/60_staticdeps/00_intro.tex
@ -0,0 +1,38 @@
+In the previous chapter, our major finding was that, in the current state of
+the art, code analyzers deal poorly with memory-carried dependencies. We found
+this flaw to be responsible, in our dataset, for a roughly $1.5\times$ increase
+in MAPE, and up to $2.6\times$ on the third quartile of error.
+
+The large impact of dependencies on the final runtime of a kernel is, in
+reality, not very surprising. In chapters~\ref{chap:palmed}
+and~\ref{chap:frontend}, we did not consider latency; hence, the only impact of
+an instruction was its throughput, each instruction being issued as soon as
+possible. Dependencies, however, force the processor to wait for some
+instructions' results before issuing some others; the \emph{latency} of an
+instruction becomes a critical factor.
+
+On Skylake, for instance, the instruction \lstxasm{add \%rax, \%rbx} has a
+latency of one full cycle. Thus, the kernel
+\begin{lstlisting}[language={[x86masm]Assembler}]
+add %rax, %rbx
+add %rbx, %rcx
+\end{lstlisting}
+would execute, in steady state, in half a cycle if the dependency were
+ignored; yet, once the dependency is accounted for, these two instructions in
+isolation take $1\,\sfrac{1}{4}$ cycles. Some instructions are even more
+extreme; for instance, the \lstxasm{vfmadd*pd \%ymm0, \%ymm1, \%ymm2} family
+of instructions has a latency of four full cycles, while, without
+dependencies, two can be issued every cycle.
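+
+Both figures for the two \lstxasm{add} instructions can be recovered with a
+back-of-the-envelope computation. This is only a sketch: it assumes the
+latency of one cycle stated above and a reciprocal throughput of
+$\sfrac{1}{4}$ cycle per \lstxasm{add}; the names $T_{\mathrm{indep}}$ and
+$T_{\mathrm{dep}}$ are introduced here for illustration only. Ignoring the
+dependency, only throughput matters; accounting for it, the second
+\lstxasm{add} must additionally wait for the result of the first:
+\[
+    T_{\mathrm{indep}} = 2 \times \frac{1}{4} = \frac{1}{2},
+    \qquad
+    T_{\mathrm{dep}} = 1 + \frac{1}{4} = \frac{5}{4}
+    \quad \mathrm{(cycles)}.
+\]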
+
+\medskip{}
+
+In the previous chapter, we also presented \gus{}, a dynamic code analyzer
+based on \qemu{}, which we found to be very effective at detecting memory-carried
+dependencies and the slowdown they incur on the whole program. However, this
+solution results in a runtime increase of about two orders of magnitude, which
+may not be acceptable in many use cases.
+
+In this chapter, we instead present \staticdeps{}, a fully static analyzer able
+to detect memory-carried dependencies in many cases. We evaluate it by
+providing \uica{} with its analysis of dependencies, bringing it on par with
+\gus{} on the full, non-pruned dataset of the previous chapter.
--- a/manuscrit/60_staticdeps/10_types_of_deps.tex
+++ b/manuscrit/60_staticdeps/10_types_of_deps.tex
@ -0,0 +1,113 @@
+\section{Types of dependencies}
+
+A dependency, in the most general sense, can be seen as an interaction between
+two instructions stemming from shared data. This definition is deliberately broad
+as, depending on the circumstances, the CPU implementation, \ldots{}, some
+categories of dependencies must be taken into account, while some may be
+ignored.
+
+\paragraph{Read-write categories.} The first distinction that can be made
+between dependencies, and the one that is most often made, is whether the data
+through which the dependency is created is read or written. They can be broken
+down into four categories:
+
+\begin{itemize}
+    \item read-after-write (RaW);
+    \item write-after-write (WaW);
+    \item write-after-read (WaR);
+    \item read-after-read (RaR).
+\end{itemize}
+
+For instance, in the kernel presented in the introduction of this chapter, the
+first instruction (\lstxasm{add \%rax, \%rbx}) reads its first operand, the
+register \reg{rax}, and both reads and writes its second operand \reg{rbx}. The
+second \lstxasm{add} has the same behaviour. Thus, as \reg{rbx} is written at
+line 1, and read at line 2, there is a read-after-write dependency between the
+two.
+
+Most of the time, \emph{dependency} is actually used to mean
+\emph{read-after-write dependency}, sometimes called ``flow dependency''.
+However, depending on the actual hardware implementation of the architecture,
+other kinds of dependencies might induce a latency. While a read-after-read
+dependency will not induce a latency in the vast majority of architectures, a
+write-after-read could prevent instructions from being reordered in such a way
+that the writing instruction commits its result before the reading instruction
+uses the previously stored value. In most modern CPUs, the processor actually
+has more physical registers than what is exposed to the user through the ISA; a
+renaming phase will allocate those registers to avoid the effects of WaR and WaW
+dependencies as much as possible.
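+
+As an illustration, the following hypothetical kernel (a sketch written for
+this explanation, not taken from our dataset) exhibits the two kinds of
+dependencies that renaming is meant to hide:
+\begin{lstlisting}[language={[x86masm]Assembler}]
+add %rax, %rbx  # reads rax, rbx; writes rbx
+mov %rcx, %rax  # writes rax: WaR with line 1
+mov %rdx, %rbx  # writes rbx: WaW with line 1
+\end{lstlisting}
+Neither \lstxasm{mov} reads a value produced by the \lstxasm{add}; given
+enough physical registers, a renamer can thus map each write to a fresh
+register, so that the three instructions need not wait for one another.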
+
+\smallskip{} For the present chapter, \textit{we only consider read-after-write
+dependencies}; however, all the techniques we present are applicable to other
+types of dependencies if the considered architecture requires taking them into
+account.
+
+\paragraph{Dependency medium.} In the example above, we only introduced
+dependencies induced through registers, or \emph{register-carried}. There are,
+however, other channels.
+
+\smallskip{}
+
+As we saw in the introduction to this chapter, as well as in the previous
+chapter, dependencies can also be \emph{memory-carried}, in more or less
+straightforward ways, such as in the following examples, where the last line
+always reads the memory cell written by \lstxasm{add \%rax, (\%rbx)}:
+
+\begin{minipage}[t]{0.32\linewidth}
+    \begin{lstlisting}[language={[x86masm]Assembler}]
+add %rax, (%rbx)
+add (%rbx), %rcx\end{lstlisting}
+\end{minipage}\hfill
+\begin{minipage}[t]{0.32\linewidth}
+    \begin{lstlisting}[language={[x86masm]Assembler}]
+add %rax, (%rbx)
+add $8, %rbx
+add -8(%rbx), %rcx\end{lstlisting}
+\end{minipage}\hfill
+\begin{minipage}[t]{0.32\linewidth}
+    \begin{lstlisting}[language={[x86masm]Assembler}]
+lea 16(%rbx), %r10
+add %rax, (%rbx)
+add -16(%r10), %rcx\end{lstlisting}
+\end{minipage}\hfill
+
+\smallskip{}
+
+Some dependencies are also \emph{flag-carried}. These are akin to
+register-carried dependencies, but are not directly visible in the instruction.
+For instance, a subtract operation may set flags indicating whether the result
+is zero, and a subsequent jump may use this flag to choose whether the branch is
+taken or not.
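+
+A minimal sketch of such a situation, in which \lstxasm{target} is a
+hypothetical label, could be:
+\begin{lstlisting}[language={[x86masm]Assembler}]
+sub %rax, %rbx  # sets the zero flag if rbx - rax == 0
+jz target       # reads the zero flag
+\end{lstlisting}
+The jump has no register operand at all, yet it depends on the result of the
+subtraction through the zero flag.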
+
+\smallskip{}
+
+Depending on the architecture, other channels may still exist.
+
+\medskip{}
+
+In this chapter, we focus on register-carried and memory-carried dependencies,
+with a large emphasis on memory-carried dependencies.
+
+\paragraph{Presence of loops.} The previous examples were all pieces of
+\emph{straight-line code} in which a dependency arose. However, many
+dependencies are actually \emph{loop-carried}, such as the following:
+
+\begin{minipage}[t]{0.48\linewidth}
+    \begin{lstlisting}[language={[x86masm]Assembler}]
+# Compute sum(A), %rax points to A
+loop:
+    add (%rax), %r10
+    add $8, %rax
+    jmp loop
+\end{lstlisting}
+\end{minipage}\hfill
+\begin{minipage}[t]{0.48\linewidth}
+    \begin{lstlisting}[language={[x86masm]Assembler}]
+# Compute B[i] = A[i] + B[i-1]
+loop:
+    mov -8(%rbx, %r10), %r11
+    add (%rax, %r10), %r11
+    mov %r11, (%rbx, %r10)
+    add $8, %r10
+    jmp loop
+\end{lstlisting}
+\end{minipage}\hfill
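+
+To make the loop-carried nature of these dependencies explicit, one may
+unroll the first kernel by hand. The following is only an illustrative sketch
+of two consecutive iterations:
+\begin{lstlisting}[language={[x86masm]Assembler}]
+add (%rax), %r10  # iteration i: writes r10
+add $8, %rax      # iteration i: writes rax
+add (%rax), %r10  # iteration i+1: reads r10, rax
+add $8, %rax
+\end{lstlisting}
+Both \reg{r10} and \reg{rax} are written at one iteration and read at the
+next one: these dependencies are register-carried and loop-carried. In the
+second kernel, the dependency is carried both by the loop and by memory: the
+value stored to \texttt{B[i]} is loaded at the next iteration as
+\texttt{B[i-1]}.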
--- a/manuscrit/60_staticdeps/20_dynamic.tex
+++ b/manuscrit/60_staticdeps/20_dynamic.tex
@ -0,0 +1,3 @@
+\section{A baseline: dynamic dependencies detection with Valgrind}
+
+
--- a/manuscrit/60_staticdeps/30_static_principle.tex
+++ b/manuscrit/60_staticdeps/30_static_principle.tex
@ -0,0 +1 @@
+\section{Static dependencies detection}
--- a/manuscrit/60_staticdeps/40_staticdeps.tex
+++ b/manuscrit/60_staticdeps/40_staticdeps.tex
@ -0,0 +1 @@
+\section{The \staticdeps{} heuristic}
--- a/manuscrit/60_staticdeps/50_eval.tex
+++ b/manuscrit/60_staticdeps/50_eval.tex
@ -0,0 +1 @@
+\section{Evaluation}
--- a/manuscrit/60_staticdeps/main.tex
+++ b/manuscrit/60_staticdeps/main.tex
@ -1 +1,8 @@
 \chapter{Static extraction of memory-carried dependencies}
+
+\input{00_intro.tex}
+\input{10_types_of_deps.tex}
+\input{20_dynamic.tex}
+\input{30_static_principle.tex}
+\input{40_staticdeps.tex}
+\input{50_eval.tex}
--- a/manuscrit/include/macros.tex
+++ b/manuscrit/include/macros.tex
@ -45,6 +45,7 @@
 \newcommand{\anica}{\texttt{AnICA}}
 \newcommand{\cesasme}{\texttt{CesASMe}}
 \newcommand{\benchsuitebb}{\texttt{benchsuite-bb}}
+\newcommand{\staticdeps}{\texttt{staticdeps}}

 \newcommand{\gdb}{\texttt{gdb}}

--- a/plan/60_staticdeps.md
+++ b/plan/60_staticdeps.md
@ -19,13 +19,22 @@
    * 2 O.M. slower => not acceptable in many cases
 * We need a static solution

+Dependencies are costly: assuming everything L1-resident, the latency of each
+μop on the dependency chain must be paid.
+
+On SKX,
+* `add %rax, %rdx`  -> lat = 1 cycle (throughput = 1/4C)
+    => `add %rax, %rdx ; add %rdx, %rcx` : 1.25C, would be 0.5C without deps
+* `vfmadd*pd %ymm0, %ymm1, %ymm2`: lat = 4C (TP = 1/2C)
+
+
 ## Types of dependencies

 4 main types:
 1. RaW: "real" dependency
 2. WaW
 3. WaR
-4. RaW
+4. RaR

 * 4: not an issue.
 * 2,3 : assuming the μarch has a renamer & enough μarch registers, not a problem
@ -54,15 +63,34 @@ Can be:
        B[i] = A[i-1] + 2
        A[i] = 7
    ```
-## Cost of dependencies

-Dependencies are costly: assuming everything L1-resident, the latency of each
-μop on the dependency chain must be paid.
+## Dynamic detection: Valgrind

-On SKX,
-* `add %rax, %rdx`  -> lat = 1 cycle (throughput = 1/4C)
-    => `add %rax, %rdx ; add %rdx, %rcx` : 1.25C, would be 0.5C without deps
-* `vfmadd*pd %ymm0, %ymm1, %ymm2`: lat = 4C (TP = 1/2C)
+* Mention Gus
+
+### Valgrind's VEX
+
+* Introduce Valgrind as an instrumentation tool
+* Introduce VEX
+* Should be portable to any architecture supported
+    * but suffers limitations for recent extension sets; eg avx512 not
+      supported (TODO check)
+
+### Depsim
+
+* Write a tool, valgrind-depsim, to instrument a binary to extract its
+  dependencies at runtime
+* Can extract memory, register and temp-based dependencies
+* Here, only the memory dependencies are relevant -- disable the other deps.
+* Instrument binary:
+    * for each write, add `write_addr -> writer_pc` to a hashmap
+    * for each read, fetch `writer_pc` from hashmap
+        * if found, add a dependency `reader_pc -> writer_pc`
+    * use the process' memory map to translate PC to addresses inside ELF files
+    * At the end, write deps file:
+        * `#occur, src_elf_pc, src_elf_path, dst_elf_pc, dst_elf_path`
+    * Run for each binary in genbenchs
+    * Takes about 1h on 30 parallel cores on Pinocchio; heavy memory usage

 ## Static detection

@ -109,14 +137,6 @@ On SKX,

 * We need semantics for our assembly

-### Valgrind's VEX
-
-* Introduce Valgrind as an instrumentation tool
-* Introduce VEX
-* Should be portable to any architecture supported
-    * but suffers limitations for recent extension sets; eg avx512 not
-      supported (TODO check)
-
 ### Limitations

 * Does not track aliasing that originates from outside of the kernel.
@ -132,20 +152,7 @@ On SKX,

 #### With valgrind

-* Write a tool, valgrind-depsim, to instrument a binary to extract its
-  dependencies at runtime
-* Can extract memory, register and temp-based dependencies
-* Here, only the memory dependencies are relevant -- disable the other deps.
-* Instrument binary:
-    * for each write, add `write_addr -> writer_pc` to a hashmap
-    * for each read, fetch `writer_pc` from hashmap
-        * if found, add a dependency `reader_pc -> writer_pc`
-    * use the process' memory map to translate PC to addresses inside ELF files
-    * At the end, write deps file:
-        * `#occur, src_elf_pc, src_elf_path, dst_elf_pc, dst_elf_path`
-    * Run for each binary in genbenchs
-    * Takes about 1h on 30 parallel cores on Pinocchio; heavy memory usage
-
+Use valgrind-depsim.
 Then, compare with staticdeps: `eval/vg_depsim.py` script.

 * For each binary in genbenchs,