diff --git a/manuscrit/60_staticdeps/00_intro.tex b/manuscrit/60_staticdeps/00_intro.tex new file mode 100644 index 0000000..98c54fe --- /dev/null +++ b/manuscrit/60_staticdeps/00_intro.tex @@ -0,0 +1,38 @@ +In the previous chapter, our major finding was that, in the current state of +the art, code analyzers deal poorly with memory-carried dependencies. We found +this flaw to be responsible, in our dataset, for a roughly $1.5\times$ increase +in MAPE, and up to $2.6\times$ on the third quartile of error. + +The large impact of dependencies on the final runtime of a kernel is, in +reality, not very surprising. In chapters~\ref{chap:palmed} +and~\ref{chap:frontend}, we did not consider latency; hence, the only impact of +an instruction was its throughput, each instruction being issued as soon as +possible. Dependencies, however, force the processor to wait for some +instructions' results before issuing some others; the \emph{latency} of an +instruction becomes a critical factor. + +On Skylake, for instance, the instruction \lstxasm{add \%rax, \%rbx} has a +latency of one full cycle. Thus, the kernel +\begin{lstlisting}[language={[x86masm]Assembler}] +add %rax, %rbx +add %rbx, %rcx +\end{lstlisting} +executes, in steady state, in half a cycle without accounting for the +dependency; yet these two instructions in isolation would take +$1\,\sfrac{1}{4}$ cycles when accounting for the dependency. Some instructions +are more extreme still; for instance, the \lstxasm{vfmadd*pd \%ymm0, \%ymm1, +\%ymm2} family of instructions has a latency of four full cycles, while +without dependencies, two can be issued every cycle. + +\medskip{} + +In the previous chapter, we also presented \gus{}, a dynamic code analyzer +based on \qemu{}, which we found to be very effective at detecting memory-carried +dependencies and the slowdown they incur on the whole program. 
However, this +solution results in a runtime increase of about two orders of magnitude, which +may not be acceptable in many use cases. + +In this chapter, we instead present \staticdeps{}, a fully static analyzer able +to detect memory-carried dependencies in many cases. We evaluate it by +providing \uica{} with its analysis of dependencies, bringing it on par with +\gus{} on the full, non-pruned dataset of the previous chapter. diff --git a/manuscrit/60_staticdeps/10_types_of_deps.tex b/manuscrit/60_staticdeps/10_types_of_deps.tex new file mode 100644 index 0000000..41b279a --- /dev/null +++ b/manuscrit/60_staticdeps/10_types_of_deps.tex @@ -0,0 +1,113 @@ +\section{Types of dependencies} + +A dependency, in the most general sense, can be seen as an interaction between +two instructions stemming from shared data. This definition is deliberately broad +as, depending on the circumstances, the CPU implementation, \ldots{}, some +categories of dependencies must be taken into account, while some may be +ignored. + +\paragraph{Read-write categories.} The first distinction that can be made +between dependencies, and the one that is most often made, is whether the data +through which the dependency is created is read or written. They can be broken +down into four categories: + +\begin{itemize} + \item read-after-write (RaW); + \item write-after-write (WaW); + \item write-after-read (WaR); + \item read-after-read (RaR). +\end{itemize} + +For instance, in the kernel presented in the introduction of this chapter, the +first instruction (\lstxasm{add \%rax, \%rbx}) reads its first operand, the +register \reg{rax}, and both reads and writes its second operand \reg{rbx}. The +second \lstxasm{add} has the same behaviour. Thus, as \reg{rbx} is written at +line 1, and read at line 2, there is a read-after-write dependency between the +two. + +Most of the time, \emph{dependency} is actually used to mean +\emph{read-after-write dependency}, sometimes called ``flow dependency''. 
+However, depending on the actual hardware implementation of the architecture, +other kinds of dependencies might induce a latency. While a read-after-read +dependency will not induce a latency in the vast majority of architectures, a +write-after-read could prevent instructions from being re-ordered in a way that the +writing instruction commits its result before the reading instruction uses the +previously stored value. In most modern CPUs, the processor actually has more +physical registers than what is exposed to the user through the ISA; a renaming +phase will allocate those registers to avoid the effects of WaR and WaW +dependencies as much as possible. + +\smallskip{} For the present chapter, \textit{we only consider read-after-write +dependencies}; however, all the techniques we present are applicable to other +types of dependencies if the considered architecture requires taking them into +account. + +\paragraph{Dependency medium.} In the example above, we only introduced +dependencies induced through registers, or \emph{register-carried}. There are, +however, other channels. + +\smallskip{} + +As we saw in the introduction to this chapter, as well as in the previous +chapter, dependencies can also be \emph{memory-carried}, in more or less +straightforward ways, such as in the following examples, where the last line +always depends on the first: + +\begin{minipage}[t]{0.32\linewidth} + \begin{lstlisting}[language={[x86masm]Assembler}] +add %rax, (%rbx) +add (%rbx), %rcx\end{lstlisting} +\end{minipage}\hfill +\begin{minipage}[t]{0.32\linewidth} + \begin{lstlisting}[language={[x86masm]Assembler}] +add %rax, (%rbx) +add $8, %rbx +add -8(%rbx), %rcx\end{lstlisting} +\end{minipage}\hfill +\begin{minipage}[t]{0.32\linewidth} + \begin{lstlisting}[language={[x86masm]Assembler}] +lea 16(%rbx), %r10 +add %rax, (%rbx) +add -16(%r10), %rcx\end{lstlisting} +\end{minipage}\hfill + +\smallskip{} + +Some dependencies are also \emph{flag-carried}. 
These are closely akin to +register-carried dependencies, but are not directly visible in the instruction. +For instance, a subtract operation may set flags indicating whether the result +is zero, and a subsequent jump may use this flag to choose whether the branch is +taken or not. + +\smallskip{} + +Depending on the architecture, other channels may still exist. + +\medskip{} + +In this chapter, we focus on register-carried and memory-carried dependencies, +with a large emphasis on memory-carried dependencies. + +\paragraph{Presence of loops.} The previous examples were all pieces of +\emph{straight-line code} in which a dependency arose. However, many +dependencies are actually \emph{loop-carried}, such as the following: + +\begin{minipage}[t]{0.48\linewidth} + \begin{lstlisting}[language={[x86masm]Assembler}] +# Compute sum(A), %rax points to A +loop: + add (%rax), %r10 + add $8, %rax + jmp loop +\end{lstlisting} +\end{minipage}\hfill +\begin{minipage}[t]{0.48\linewidth} + \begin{lstlisting}[language={[x86masm]Assembler}] +# Compute B[i] = A[i] + B[i-1] +loop: + mov -8(%rbx, %r10), (%rbx, %r10) + add (%rax, %r10), (%rbx, %r10) + add $8, %r10 + jmp loop +\end{lstlisting} +\end{minipage}\hfill diff --git a/manuscrit/60_staticdeps/20_dynamic.tex b/manuscrit/60_staticdeps/20_dynamic.tex new file mode 100644 index 0000000..329d5d5 --- /dev/null +++ b/manuscrit/60_staticdeps/20_dynamic.tex @@ -0,0 +1,3 @@ +\section{A baseline: dynamic dependency detection with Valgrind} + + diff --git a/manuscrit/60_staticdeps/30_static_principle.tex b/manuscrit/60_staticdeps/30_static_principle.tex new file mode 100644 index 0000000..cffcb8a --- /dev/null +++ b/manuscrit/60_staticdeps/30_static_principle.tex @@ -0,0 +1 @@ +\section{Static dependency detection} diff --git a/manuscrit/60_staticdeps/40_staticdeps.tex b/manuscrit/60_staticdeps/40_staticdeps.tex new file mode 100644 index 0000000..77dc877 --- /dev/null +++ b/manuscrit/60_staticdeps/40_staticdeps.tex @@ -0,0 +1 @@ 
+\section{The \staticdeps{} heuristic} diff --git a/manuscrit/60_staticdeps/50_eval.tex b/manuscrit/60_staticdeps/50_eval.tex new file mode 100644 index 0000000..cd3fc7f --- /dev/null +++ b/manuscrit/60_staticdeps/50_eval.tex @@ -0,0 +1 @@ +\section{Evaluation} diff --git a/manuscrit/60_staticdeps/main.tex b/manuscrit/60_staticdeps/main.tex index f5cabea..13a3457 100644 --- a/manuscrit/60_staticdeps/main.tex +++ b/manuscrit/60_staticdeps/main.tex @@ -1 +1,8 @@ \chapter{Static extraction of memory-carried dependencies} + +\input{00_intro.tex} +\input{10_types_of_deps.tex} +\input{20_dynamic.tex} +\input{30_static_principle.tex} +\input{40_staticdeps.tex} +\input{50_eval.tex} diff --git a/manuscrit/include/macros.tex b/manuscrit/include/macros.tex index 2494c42..6a69260 100644 --- a/manuscrit/include/macros.tex +++ b/manuscrit/include/macros.tex @@ -45,6 +45,7 @@ \newcommand{\anica}{\texttt{AnICA}} \newcommand{\cesasme}{\texttt{CesASMe}} \newcommand{\benchsuitebb}{\texttt{benchsuite-bb}} +\newcommand{\staticdeps}{\texttt{staticdeps}} \newcommand{\gdb}{\texttt{gdb}} diff --git a/plan/60_staticdeps.md b/plan/60_staticdeps.md index fff14db..68069f9 100644 --- a/plan/60_staticdeps.md +++ b/plan/60_staticdeps.md @@ -19,13 +19,22 @@ * 2 O.M. slower => not acceptable in many cases * We need a static solution +Dependencies are costly: assuming everything L1-resident, the latency of each +μop on the dependency chain must be paid. + +On SKX, +* `add %rax, %rdx` -> lat = 1 cycle (throughput = 1/4C) + => `add %rax, %rdx ; add %rdx, %rcx` : 1.25C, would be 0.5C without deps +* `vfmadd*pd %ymm0, %ymm1, %ymm2`: lat = 4C (TP = 1/2C) + + ## Types of dependencies 4 main types: 1. RaW: "real" dependency 2. WaW 3. WaR -4. RaW +4. RaR * 4: not an issue. 
* 2,3 : assuming the μarch has a renamer & enough μarch registers, not a problem @@ -54,15 +63,34 @@ Can be: B[i] = A[i-1] + 2 A[i] = 7 ``` -## Cost of dependencies -Dependencies are costly: assuming everything L1-resident, the latency of each -μop on the dependency chain must be paid. +## Dynamic detection: Valgrind -On SKX, -* `add %rax, %rdx` -> lat = 1 cycle (throughput = 1/4C) - => `add %rax, %rdx ; add %rdx, %rcx` : 1.25C, would be 0.5C without deps -* `vfmadd*pd %ymm0, %ymm1, %ymm2`: lat = 4C (TP = 1/2C) +* Mention Gus + +### Valgrind's VEX + +* Introduce Valgrind as an instrumentation tool +* Introduce VEX +* Should be portable to any architecture supported + * but suffers limitations for recent extension sets; eg avx512 not + supported (TODO check) + +### Depsim + +* Write a tool, valgrind-depsim, to instrument a binary to extract its + dependencies at runtime +* Can extract memory, register and temp-based dependencies +* Here, only the memory dependencies are relevant -- disable the other deps. +* Instrument binary: + * for each write, add `write_addr -> writer_pc` to a hashmap + * for each read, fetch `writer_pc` from hashmap + * if found, add a dependency `reader_pc -> writer_pc` + * use the process' memory map to translate PC to addresses inside ELF files + * At the end, write deps file: + * `#occur, src_elf_pc, src_elf_path, dst_elf_pc, dst_elf_path` + * Run for each binary in genbenchs + * Takes about 1h on 30 parallel cores on Pinocchio; heavy memory usage ## Static detection @@ -109,14 +137,6 @@ On SKX, * We need semantics for our assembly -### Valgrind's VEX - -* Introduce Valgrind as an instrumentation tool -* Introduce VEX -* Should be portable to any architecture supported - * but suffers limitations for recent extension sets; eg avx512 not - supported (TODO check) - ### Limitations * Does not track aliasing that originates from outside of the kernel. 
@@ -132,20 +152,7 @@ On SKX, #### With valgrind -* Write a tool, valgrind-depsim, to instrument a binary to extract its - dependencies at runtime -* Can extract memory, register and temp-based dependencies -* Here, only the memory dependencies are relevant -- disable the other deps. -* Instrument binary: - * for each write, add `write_addr -> writer_pc` to a hashmap - * for each read, fetch `writer_pc` from hashmap - * if found, add a dependency `reader_pc -> writer_pc` - * use the process' memory map to translate PC to addresses inside ELF files - * At the end, write deps file: - * `#occur, src_elf_pc, src_elf_path, dst_elf_pc, dst_elf_path` - * Run for each binary in genbenchs - * Takes about 1h on 30 parallel cores on Pinocchio; heavy memory usage - +Use valgrind-depsim. Then, compare with staticdeps: `eval/vg_depsim.py` script. * For each binary in genbenchs,