Init staticdeps, write §1

Théophile Bastian 2023-09-27 17:02:30 +02:00
parent e1eb20bb2c
commit c2acf78476
9 changed files with 202 additions and 30 deletions

@ -0,0 +1,38 @@
In the previous chapter, our major finding was that, in the current state of
the art, code analyzers deal poorly with memory-carried dependencies. We found
this flaw to be responsible, in our dataset, for a roughly $1.5\times$ increase
in MAPE, and up to $2.6\times$ on the third quartile of error.
The large impact of dependencies on the final runtime of a kernel is, in
reality, not very surprising. In chapters~\ref{chap:palmed}
and~\ref{chap:frontend}, we did not consider latency; hence, the only impact of
an instruction was its throughput, each instruction being issued as soon as
possible. Dependencies, however, force the processor to wait for some
instructions' results before issuing some others; the \emph{latency} of an
instruction becomes a critical factor.
On Skylake, for instance, the instruction \lstxasm{add \%rax, \%rbx} has a
latency of one full cycle. Thus, the kernel
\begin{lstlisting}[language={[x86masm]Assembler}]
add %rax, %rbx
add %rbx, %rcx
\end{lstlisting}
would execute, in steady state, in half a cycle if the dependency were not
accounted for; yet these two instructions, in isolation, take
$1\,\sfrac{1}{4}$ cycles once the dependency is accounted for. Some
instructions are even more extreme; for instance, the \lstxasm{vfmadd*pd
\%ymm0, \%ymm1, \%ymm2} family of instructions has a latency of four full
cycles, while without dependencies, two can be issued every cycle.
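As a rough sanity check of the two \lstxasm{add} figures above, one may
assume a reciprocal throughput of $\sfrac{1}{4}$ cycle for \lstxasm{add} on
Skylake (four such instructions issued per cycle) and a simplified model in
which a dependent instruction is issued only once its producer's result is
available. Under these assumptions, which are only a sketch and not a full
pipeline model, the two estimates can be recovered as
% Assumption: simplified model, the dependent instruction waits for the full
% latency of its producer, then costs its own reciprocal throughput.
\[
    t_{\mathrm{no\,dep}} = 2 \times \frac{1}{4} = \frac{1}{2}
    \quad\mathrm{cycle},
    \qquad
    t_{\mathrm{dep}} = \underbrace{1}_{\mathrm{latency}}
        + \underbrace{\frac{1}{4}}_{\mathrm{throughput}}
    = 1\,\frac{1}{4}
    \quad\mathrm{cycles}.
\]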
\medskip{}
In the previous chapter, we also presented \gus{}, a dynamic code analyzer
based on \qemu{}, which we found to be very effective at detecting memory-carried
dependencies and the slowdown they incur on the whole program. However, this
solution results in a runtime increase of about two orders of magnitude, which
may not be acceptable in many use cases.
In this chapter, we instead present \staticdeps{}, a fully static analyzer able
to detect memory-carried dependencies in many cases. We evaluate it by
providing \uica{} with its analysis of dependencies, bringing it on par with
\gus{} on the full, non-pruned dataset of the previous chapter.

@ -0,0 +1,113 @@
\section{Types of dependencies}
A dependency, in the most general sense, can be seen as an interaction between
two instructions stemming from shared data. This definition is deliberately broad
as, depending on the circumstances, the CPU implementation, \ldots{}, some
categories of dependencies must be taken into account, while some may be
ignored.
\paragraph{Read-write categories.} The first distinction that can be made
between dependencies, and the one that is most often made, is whether the data
through which the dependency is created is read or written. They can be broken
down into four categories:
\begin{itemize}
\item read-after-write (RaW);
\item write-after-write (WaW);
\item write-after-read (WaR);
\item read-after-read (RaR).
\end{itemize}
For instance, in the kernel presented in the introduction of this chapter, the
first instruction (\lstxasm{add \%rax, \%rbx}) reads its first operand, the
register \reg{rax}, and both reads and writes its second operand \reg{rbx}. The
second \lstxasm{add} has the same behaviour. Thus, as \reg{rbx} is written at
line 1, and read at line 2, there is a read-after-write dependency between the
two.
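These four categories can be stated slightly more formally. As a sketch, with
notation introduced here only for illustration, write $R(i)$ and $W(i)$ for
the sets of locations read and written by an instruction $i$, and let $i$
precede $j$ in program order. Ignoring any intervening write to the same
location, the pair $(i, j)$ carries a dependency of the given kind when the
corresponding intersection is non-empty:
% Assumption: R(.) and W(.) are illustrative notation, not used elsewhere.
\[
    \begin{array}{ll}
        \textrm{RaW:} & W(i) \cap R(j) \neq \emptyset \\
        \textrm{WaW:} & W(i) \cap W(j) \neq \emptyset \\
        \textrm{WaR:} & R(i) \cap W(j) \neq \emptyset \\
        \textrm{RaR:} & R(i) \cap R(j) \neq \emptyset
    \end{array}
\]
In the example above, \reg{rbx} belongs both to $W(i)$ for the first
\lstxasm{add} and to $R(j)$ for the second, which is exactly the RaW case.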
Most of the time, \emph{dependency} is actually used to mean
\emph{read-after-write dependency}, sometimes called ``flow dependency''.
However, depending on the actual hardware implementation of the architecture,
other kinds of dependencies might induce a latency. While a read-after-read
dependency will not induce a latency in the vast majority of architectures, a
write-after-read could prevent instructions from being reordered in such a way
that the writing instruction commits its result before the reading instruction
uses the previously stored value. In most modern CPUs, the processor actually
has more physical registers than what is exposed to the user through the ISA;
a renaming phase allocates those registers to avoid the effects of WaR and WaW
dependencies as much as possible.
\smallskip{} For the present chapter, \textit{we only consider read-after-write
dependencies}; however, all the techniques we present are applicable to other
types of dependencies if the considered architecture requires taking them into
account.
\paragraph{Dependency medium.} In the example above, we only introduced
dependencies induced through registers, or \emph{register-carried}. There are,
however, other channels.
\smallskip{}
As we saw in the introduction to this chapter, as well as in the previous
chapter, dependencies can also be \emph{memory-carried}, in more or less
straightforward ways, such as in the following examples, where the last line
always reads the memory location written by \lstxasm{add \%rax, (\%rbx)}:
\begin{minipage}[t]{0.32\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
add %rax, (%rbx)
add (%rbx), %rcx\end{lstlisting}
\end{minipage}\hfill
\begin{minipage}[t]{0.32\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
add %rax, (%rbx)
add $8, %rbx
add -8(%rbx), %rcx\end{lstlisting}
\end{minipage}\hfill
\begin{minipage}[t]{0.32\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
lea 16(%rbx), %r10
add %rax, (%rbx)
add -16(%r10), %rcx\end{lstlisting}
\end{minipage}\hfill
\smallskip{}
Some dependencies are also \emph{flag-carried}. These are closely akin to
register-carried dependencies, but are not directly visible in the instruction.
For instance, a subtract operation may set flags indicating whether the result
is zero, and a subsequent jump may use this flag to choose whether the branch is
taken or not.
\smallskip{}
Depending on the architecture, other channels may still exist.
\medskip{}
In this chapter, we focus on register-carried and memory-carried dependencies,
with a large emphasis on memory-carried dependencies.
\paragraph{Presence of loops.} The previous examples were all pieces of
\emph{straight-line code} in which a dependency arose. However, many
dependencies are actually \emph{loop-carried}, such as the following:
\begin{minipage}[t]{0.48\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
# Compute sum(A), %rax points to A
loop:
add (%rax), %r10
add $8, %rax
jmp loop
\end{lstlisting}
\end{minipage}\hfill
\begin{minipage}[t]{0.48\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
# Compute B[i] = A[i] + B[i-1]
loop:
mov -8(%rbx, %r10), %rdx
add (%rax, %r10), %rdx
mov %rdx, (%rbx, %r10)
add $8, %r10
jmp loop
\end{lstlisting}
\end{minipage}\hfill

@ -0,0 +1,3 @@
\section{A baseline: dynamic dependencies detection with Valgrind}

@ -0,0 +1 @@
\section{Static dependencies detection}

@ -0,0 +1 @@
\section{The \staticdeps{} heuristic}

@ -0,0 +1 @@
\section{Evaluation}

@ -1 +1,8 @@
\chapter{Static extraction of memory-carried dependencies}
\input{00_intro.tex}
\input{10_types_of_deps.tex}
\input{20_dynamic.tex}
\input{30_static_principle.tex}
\input{40_staticdeps.tex}
\input{50_eval.tex}

@ -45,6 +45,7 @@
\newcommand{\anica}{\texttt{AnICA}}
\newcommand{\cesasme}{\texttt{CesASMe}}
\newcommand{\benchsuitebb}{\texttt{benchsuite-bb}}
\newcommand{\staticdeps}{\texttt{staticdeps}}
\newcommand{\gdb}{\texttt{gdb}}

@ -19,13 +19,22 @@
* 2 O.M. slower => not acceptable in many cases
* We need a static solution
Dependencies are costly: assuming everything L1-resident, the latency of each
μop on the dependency chain must be paid.
On SKX,
* `add %rax, %rdx` -> lat = 1 cycle (throughput = 1/4C)
=> `add %rax, %rdx ; add %rdx, %rcx` : 1.25C, would be 0.5C without deps
* `vfmadd*pd %ymm0, %ymm1, %ymm2`: lat = 4C (TP = 1/2C)
## Types of dependencies
4 main types:
1. RaW: "real" dependency
2. WaW
3. WaR
4. RaW
4. RaR
* 4: not an issue.
* 2,3 : assuming the μarch has a renamer & enough μarch registers, not a problem
@ -54,15 +63,34 @@ Can be:
B[i] = A[i-1] + 2
A[i] = 7
```
## Cost of dependencies
Dependencies are costly: assuming everything L1-resident, the latency of each
μop on the dependency chain must be paid.
## Dynamic detection: Valgrind
On SKX,
* `add %rax, %rdx` -> lat = 1 cycle (throughput = 1/4C)
=> `add %rax, %rdx ; add %rdx, %rcx` : 1.25C, would be 0.5C without deps
* `vfmadd*pd %ymm0, %ymm1, %ymm2`: lat = 4C (TP = 1/2C)
* Mention Gus
### Valgrind's VEX
* Introduce Valgrind as an instrumentation tool
* Introduce VEX
* Should be portable to any architecture supported
* but suffers limitations for recent extension sets; eg avx512 not
supported (TODO check)
### Depsim
* Write a tool, valgrind-depsim, to instrument a binary to extract its
dependencies at runtime
* Can extract memory, register and temp-based dependencies
* Here, only the memory dependencies are relevant -- disable the other deps.
* Instrument binary:
* for each write, add `write_addr -> writer_pc` to a hashmap
* for each read, fetch `writer_pc` from hashmap
* if found, add a dependency `reader_pc -> writer_pc`
* use the process' memory map to translate PC to addresses inside ELF files
* At the end, write deps file:
* `#occur, src_elf_pc, src_elf_path, dst_elf_pc, dst_elf_path`
* Run for each binary in genbenchs
* Takes about 1h on 30 parallel cores on Pinocchio; heavy memory usage
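
Sketch of the hashmap bookkeeping above, in plain Python rather than as a Valgrind tool. Assumptions: the `Access` trace format, the `memory_deps` name and the toy addresses are hypothetical and only illustrate the idea; this is not the depsim implementation. It also tracks whole addresses only (no access sizes or partial overlaps) and skips the PC-to-ELF translation.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Access(NamedTuple):
    """One memory access of a hypothetical execution trace."""
    pc: int         # address of the instruction performing the access
    is_write: bool  # True for a store, False for a load
    addr: int       # memory address accessed


def memory_deps(trace: Iterable[Access]) -> Counter:
    """Count read-after-write dependencies, keyed by (reader_pc, writer_pc)."""
    last_writer = {}  # write_addr -> writer_pc
    deps = Counter()
    for acc in trace:
        if acc.is_write:
            # Remember which instruction last wrote to this address.
            last_writer[acc.addr] = acc.pc
        else:
            # On a read, look up the last writer of this address, if any.
            writer_pc = last_writer.get(acc.addr)
            if writer_pc is not None:
                deps[(acc.pc, writer_pc)] += 1
    return deps


# Toy trace for B[i] = A[i] + B[i-1]:
# pc 0x10 loads B[i-1], pc 0x20 stores B[i] (loads of A are omitted).
trace = []
for i in range(1, 4):
    trace.append(Access(pc=0x10, is_write=False, addr=0x1000 + 8 * (i - 1)))
    trace.append(Access(pc=0x20, is_write=True, addr=0x1000 + 8 * i))

deps = memory_deps(trace)
for (reader_pc, writer_pc), occur in deps.items():
    print(f"{occur}, {writer_pc:#x}, {reader_pc:#x}")
```

The first load finds no writer in the map, so only the later iterations record the loop-carried `0x10 -> 0x20` dependency; the occurrence count plays the role of the `#occur` column of the deps file.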
## Static detection
@ -109,14 +137,6 @@ On SKX,
* We need semantics for our assembly
### Valgrind's VEX
* Introduce Valgrind as an instrumentation tool
* Introduce VEX
* Should be portable to any architecture supported
* but suffers limitations for recent extension sets; eg avx512 not
supported (TODO check)
### Limitations
* Does not track aliasing that originates from outside of the kernel.
@ -132,20 +152,7 @@ On SKX,
#### With valgrind
* Write a tool, valgrind-depsim, to instrument a binary to extract its
dependencies at runtime
* Can extract memory, register and temp-based dependencies
* Here, only the memory dependencies are relevant -- disable the other deps.
* Instrument binary:
* for each write, add `write_addr -> writer_pc` to a hashmap
* for each read, fetch `writer_pc` from hashmap
* if found, add a dependency `reader_pc -> writer_pc`
* use the process' memory map to translate PC to addresses inside ELF files
* At the end, write deps file:
* `#occur, src_elf_pc, src_elf_path, dst_elf_pc, dst_elf_path`
* Run for each binary in genbenchs
* Takes about 1h on 30 parallel cores on Pinocchio; heavy memory usage
Use valgrind-depsim.
Then, compare with staticdeps: `eval/vg_depsim.py` script.
* For each binary in genbenchs,