Init staticdeps, write §1
parent
e1eb20bb2c
commit
c2acf78476
9 changed files with 202 additions and 30 deletions
38
manuscrit/60_staticdeps/00_intro.tex
Normal file
|
@ -0,0 +1,38 @@
|
|||
In the previous chapter, our major finding was that, in the current state of
|
||||
the art, code analyzers deal poorly with memory-carried dependencies. We found
|
||||
this flaw to be responsible, in our dataset, for a roughly $1.5\times$ increase
|
||||
in MAPE, and up to $2.6\times$ on the third quartile of error.
|
||||
|
||||
The large impact of dependencies on the final runtime of a kernel is, in
|
||||
reality, not very surprising. In chapters~\ref{chap:palmed}
|
||||
and~\ref{chap:frontend}, we did not consider latency; hence, the only impact of
|
||||
an instruction was its throughput, each instruction being issued as soon as
|
||||
possible. Dependencies, however, force the processor to wait for some
|
||||
instructions' results before issuing some others; the \emph{latency} of an
|
||||
instruction becomes a critical factor.
|
||||
|
||||
On Skylake, for instance, the instruction \lstxasm{add \%rax, \%rbx} has a
|
||||
latency of one full cycle. Thus, the kernel
|
||||
\begin{lstlisting}[language={[x86masm]Assembler}]
|
||||
add %rax, %rbx
|
||||
add %rbx, %rcx
|
||||
\end{lstlisting}
|
||||
executes, in steady state, in half a cycle without accounting for the
|
||||
dependency; yet these two instructions in isolation would take
|
||||
$1\,\sfrac{1}{4}$ cycles when accounting for the dependency. Some instructions
|
||||
are more extreme still; for instance, the \lstxasm{vfmadd*pd \%ymm0, \%ymm1,
|
||||
\%ymm2} family of instructions has a latency of four full cycles, while
|
||||
without dependencies, two can be issued every cycle.
|
||||
|
||||
\medskip{}
|
||||
|
||||
In the previous chapter, we also presented \gus{}, a dynamic code analyzer
|
||||
based on \qemu{}, which we found to be very effective at detecting memory-carried
|
||||
dependencies and the slowdown they incur on the whole program. However, this
|
||||
solution results in a runtime increase of about two orders of magnitude, which
|
||||
may not be acceptable in many use cases.
|
||||
|
||||
In this chapter, we instead present \staticdeps{}, a fully static analyzer able
|
||||
to detect memory-carried dependencies in many cases. We evaluate it by
|
||||
providing \uica{} with its analysis of dependencies, bringing it on par with
|
||||
\gus{} on the full, non-pruned dataset of the previous chapter.
|
113
manuscrit/60_staticdeps/10_types_of_deps.tex
Normal file
|
@ -0,0 +1,113 @@
|
|||
\section{Types of dependencies}
|
||||
|
||||
A dependency, in the most general sense, can be seen as an interaction between
|
||||
two instructions stemming from shared data. This definition is deliberately broad
|
||||
as, depending on the circumstances, the CPU implementation, \ldots{}, some
|
||||
categories of dependencies must be taken into account, while some may be
|
||||
ignored.
|
||||
|
||||
\paragraph{Read-write categories.} The first distinction that can be made
|
||||
between dependencies, and the one that is most often made, is whether the data
|
||||
through which the dependency is created is read or written. They can be broken
|
||||
down into four categories:
|
||||
|
||||
\begin{itemize}
|
||||
\item read-after-write (RaW);
|
||||
\item write-after-write (WaW);
|
||||
\item write-after-read (WaR);
|
||||
\item read-after-read (RaR).
|
||||
\end{itemize}
|
||||
|
||||
For instance, in the kernel presented in the introduction of this chapter, the
|
||||
first instruction (\lstxasm{add \%rax, \%rbx}) reads its first operand, the
|
||||
register \reg{rax}, and both reads and writes its second operand \reg{rbx}. The
|
||||
second \lstxasm{add} has the same behaviour. Thus, as \reg{rbx} is written at
|
||||
line 1, and read at line 2, there is a read-after-write dependency between the
|
||||
two.
|
||||
|
||||
Most of the time, \emph{dependency} is actually used to mean
|
||||
\emph{read-after-write dependency}, sometimes called ``flow dependency''.
|
||||
However, depending on the actual hardware implementation of the architecture,
|
||||
other kinds of dependencies might induce a latency. While a read-after-read
|
||||
dependency will not induce a latency in the vast majority of architectures, a
|
||||
write-after-read could prevent instructions from being reordered in such a way that the
|
||||
writing instruction commits its result before the reading instruction uses the
|
||||
previously stored value. In most modern CPUs, the processor actually has more
|
||||
physical registers than what is exposed to the user through the ISA; a renaming
|
||||
phase will allocate those registers to avoid the effects of WaR and WaW
|
||||
dependencies as much as possible.
|
||||
|
||||
\smallskip{} For the present chapter, \textit{we only consider read-after-write
|
||||
dependencies}; however, all the techniques we present are applicable to other
|
||||
types of dependencies if the considered architecture requires taking them into
|
||||
account.
|
||||
|
||||
\paragraph{Dependency medium.} In the example above, we only introduced
|
||||
dependencies induced through registers, or \emph{register-carried}. There are,
|
||||
however, other channels.
|
||||
|
||||
\smallskip{}
|
||||
|
||||
As we saw in the introduction to this chapter, as well as in the previous
|
||||
chapter, dependencies can also be \emph{memory-carried}, in more or less
|
||||
straightforward ways, such as in the following examples, where the last line
|
||||
always depends on the first:
|
||||
|
||||
\begin{minipage}[t]{0.32\linewidth}
|
||||
\begin{lstlisting}[language={[x86masm]Assembler}]
|
||||
add %rax, (%rbx)
|
||||
add (%rbx), %rcx\end{lstlisting}
|
||||
\end{minipage}\hfill
|
||||
\begin{minipage}[t]{0.32\linewidth}
|
||||
\begin{lstlisting}[language={[x86masm]Assembler}]
|
||||
add %rax, (%rbx)
|
||||
add $8, %rbx
|
||||
add -8(%rbx), %rcx\end{lstlisting}
|
||||
\end{minipage}\hfill
|
||||
\begin{minipage}[t]{0.32\linewidth}
|
||||
\begin{lstlisting}[language={[x86masm]Assembler}]
|
||||
lea 16(%rbx), %r10
|
||||
add %rax, (%rbx)
|
||||
add -16(%r10), %rcx\end{lstlisting}
|
||||
\end{minipage}\hfill
|
||||
|
||||
\smallskip{}
|
||||
|
||||
Some dependencies are also \emph{flag-carried}. These are very akin to
|
||||
register-carried dependencies, but are not directly visible in the instruction.
|
||||
For instance, a subtract operation may set flags indicating whether the result
|
||||
is zero, and a subsequent jump may use this flag to choose whether the branch is
|
||||
taken or not.
|
||||
|
||||
\smallskip{}
|
||||
|
||||
Depending on the architecture, other channels may still exist.
|
||||
|
||||
\medskip{}
|
||||
|
||||
In this chapter, we focus on register-carried and memory-carried dependencies,
|
||||
with a large emphasis on memory-carried dependencies.
|
||||
|
||||
\paragraph{Presence of loops.} The previous examples were all pieces of
|
||||
\emph{straight-line code} in which a dependency arose. However, many
|
||||
dependencies are actually \emph{loop-carried}, such as the following:
|
||||
|
||||
\begin{minipage}[t]{0.48\linewidth}
|
||||
\begin{lstlisting}[language={[x86masm]Assembler}]
|
||||
# Compute sum(A), %rax points to A
|
||||
loop:
|
||||
add (%rax), %r10
|
||||
add $8, %rax
|
||||
jmp loop
|
||||
\end{lstlisting}
|
||||
\end{minipage}\hfill
|
||||
\begin{minipage}[t]{0.48\linewidth}
|
||||
\begin{lstlisting}[language={[x86masm]Assembler}]
|
||||
# Compute B[i] = A[i] + B[i-1]
|
||||
loop:
|
||||
mov -8(%rbx, %r10), %r11
add (%rax, %r10), %r11
mov %r11, (%rbx, %r10)
|
||||
add $8, %r10
|
||||
jmp loop
|
||||
\end{lstlisting}
|
||||
\end{minipage}\hfill
|
3
manuscrit/60_staticdeps/20_dynamic.tex
Normal file
|
@ -0,0 +1,3 @@
|
|||
\section{A baseline: dynamic dependency detection with Valgrind}
|
||||
|
||||
|
1
manuscrit/60_staticdeps/30_static_principle.tex
Normal file
|
@ -0,0 +1 @@
|
|||
\section{Static dependency detection}
|
1
manuscrit/60_staticdeps/40_staticdeps.tex
Normal file
|
@ -0,0 +1 @@
|
|||
\section{The \staticdeps{} heuristic}
|
1
manuscrit/60_staticdeps/50_eval.tex
Normal file
|
@ -0,0 +1 @@
|
|||
\section{Evaluation}
|
|
@ -1 +1,8 @@
|
|||
\chapter{Static extraction of memory-carried dependencies}
|
||||
|
||||
\input{00_intro.tex}
|
||||
\input{10_types_of_deps.tex}
|
||||
\input{20_dynamic.tex}
|
||||
\input{30_static_principle.tex}
|
||||
\input{40_staticdeps.tex}
|
||||
\input{50_eval.tex}
|
||||
|
|
|
@ -45,6 +45,7 @@
|
|||
\newcommand{\anica}{\texttt{AnICA}}
|
||||
\newcommand{\cesasme}{\texttt{CesASMe}}
|
||||
\newcommand{\benchsuitebb}{\texttt{benchsuite-bb}}
|
||||
\newcommand{\staticdeps}{\texttt{staticdeps}}
|
||||
|
||||
\newcommand{\gdb}{\texttt{gdb}}
|
||||
|
||||
|
|
|
@ -19,13 +19,22 @@
|
|||
* 2 O.M. slower => not acceptable in many cases
|
||||
* We need a static solution
|
||||
|
||||
Dependencies are costly: assuming everything L1-resident, the latency of each
|
||||
μop on the dependency chain must be paid.
|
||||
|
||||
On SKX,
|
||||
* `add %rax, %rdx` -> lat = 1 cycle (throughput = 1/4C)
|
||||
=> `add %rax, %rdx ; add %rdx, %rcx` : 1.25C, would be 0.5C without deps
|
||||
* `vfmadd*pd %ymm0, %ymm1, %ymm2`: lat = 4C (TP = 1/2C)
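
A minimal sketch of how the `add` figures above could be reproduced. The cost model is an assumption, not something these notes state: without dependencies, reciprocal throughputs simply add up; on a dependency chain, every instruction but the last pays its full latency, and the last one pays its reciprocal throughput. The function names are hypothetical.

```python
from fractions import Fraction


def cost_without_deps(rthroughputs):
    """Cost when only throughput matters: reciprocal throughputs add up."""
    return sum(rthroughputs, Fraction(0))


def cost_with_chain(latencies, rthroughputs):
    """Cost of a straight dependency chain (assumed model, see lead-in).

    Each instruction waits for the full latency of its predecessor; the last
    instruction of the chain only contributes its reciprocal throughput.
    """
    return sum(latencies[:-1], Fraction(0)) + rthroughputs[-1]


# Figures taken from the notes above for `add` on SKX.
ADD_LATENCY = Fraction(1)          # 1 cycle
ADD_RTHROUGHPUT = Fraction(1, 4)   # 1/4 cycle per instruction

no_deps = cost_without_deps([ADD_RTHROUGHPUT] * 2)
chained = cost_with_chain([ADD_LATENCY] * 2, [ADD_RTHROUGHPUT] * 2)
print(float(no_deps), float(chained))
```

Under this assumed model, the two chained `add` instructions come out at 1.25C, against 0.5C when the dependency is ignored, which matches the values quoted above.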
|
||||
|
||||
|
||||
## Types of dependencies
|
||||
|
||||
4 main types:
|
||||
1. RaW: "real" dependency
|
||||
2. WaW
|
||||
3. WaR
|
||||
4. RaR
|
||||
|
||||
* 4: not an issue.
|
||||
* 2,3 : assuming the μarch has a renamer & enough μarch registers, not a problem
|
||||
|
@ -54,15 +63,34 @@ Can be:
|
|||
B[i] = A[i-1] + 2
|
||||
A[i] = 7
|
||||
```
|
||||
## Cost of dependencies
|
||||
|
||||
Dependencies are costly: assuming everything L1-resident, the latency of each
|
||||
μop on the dependency chain must be paid.
|
||||
## Dynamic detection: Valgrind
|
||||
|
||||
On SKX,
|
||||
* `add %rax, %rdx` -> lat = 1 cycle (throughput = 1/4C)
|
||||
=> `add %rax, %rdx ; add %rdx, %rcx` : 1.25C, would be 0.5C without deps
|
||||
* `vfmadd*pd %ymm0, %ymm1, %ymm2`: lat = 4C (TP = 1/2C)
|
||||
* Mention Gus
|
||||
|
||||
### Valgrind's VEX
|
||||
|
||||
* Introduce Valgrind as an instrumentation tool
|
||||
* Introduce VEX
|
||||
* Should be portable to any architecture supported
|
||||
* but suffers limitations for recent extension sets; eg avx512 not
|
||||
supported (TODO check)
|
||||
|
||||
### Depsim
|
||||
|
||||
* Write a tool, valgrind-depsim, to instrument a binary to extract its
|
||||
dependencies at runtime
|
||||
* Can extract memory, register and temp-based dependencies
|
||||
* Here, only the memory dependencies are relevant -- disable the other deps.
|
||||
* Instrument binary:
|
||||
* for each write, add `write_addr -> writer_pc` to a hashmap
|
||||
* for each read, fetch `writer_pc` from hashmap
|
||||
* if found, add a dependency `reader_pc -> writer_pc`
|
||||
* use the process' memory map to translate PC to addresses inside ELF files
|
||||
* At the end, write deps file:
|
||||
* `#occur, src_elf_pc, src_elf_path, dst_elf_pc, dst_elf_path`
|
||||
* Run for each binary in genbenchs
|
||||
* Takes about 1h on 30 parallel cores on Pinocchio; heavy memory usage
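
The instrumentation steps above can be sketched as follows. This is an illustrative model only, not the actual valgrind-depsim code: it works on a pre-recorded list of `(pc, kind, addr)` events rather than on VEX instrumentation, ignores access widths, and skips the PC-to-ELF translation. The function name and the event format are assumptions.

```python
from collections import Counter


def collect_memory_deps(trace):
    """Count memory-carried read-after-write dependencies in a trace.

    `trace` is an iterable of (pc, kind, addr) events, where `kind` is
    either "read" or "write" (hypothetical format, for illustration).
    Returns a Counter mapping (reader_pc, writer_pc) to occurrences.
    """
    last_writer = {}   # write_addr -> writer_pc
    deps = Counter()   # (reader_pc, writer_pc) -> number of occurrences
    for pc, kind, addr in trace:
        if kind == "write":
            # Remember which instruction last wrote to this address.
            last_writer[addr] = pc
        elif kind == "read":
            # A read depends on the last writer of the address, if any.
            writer_pc = last_writer.get(addr)
            if writer_pc is not None:
                deps[(pc, writer_pc)] += 1
    return deps


# Two loop iterations: the store at 0x10 feeds the load at 0x14 each time;
# the read of 0x2000 has no known writer and yields no dependency.
example_trace = [
    (0x10, "write", 0x1000),
    (0x14, "read", 0x1000),
    (0x14, "read", 0x2000),
    (0x10, "write", 0x1008),
    (0x14, "read", 0x1008),
]
example_deps = collect_memory_deps(example_trace)
```

Keeping only the last writer per address is what restricts the result to read-after-write dependencies; the occurrence count plays the role of the `#occur` field of the deps file.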
|
||||
|
||||
## Static detection
|
||||
|
||||
|
@ -109,14 +137,6 @@ On SKX,
|
|||
|
||||
* We need semantics for our assembly
|
||||
|
||||
### Valgrind's VEX
|
||||
|
||||
* Introduce Valgrind as an instrumentation tool
|
||||
* Introduce VEX
|
||||
* Should be portable to any architecture supported
|
||||
* but suffers limitations for recent extension sets; eg avx512 not
|
||||
supported (TODO check)
|
||||
|
||||
### Limitations
|
||||
|
||||
* Does not track aliasing that originates from outside of the kernel.
|
||||
|
@ -132,20 +152,7 @@ On SKX,
|
|||
|
||||
#### With valgrind
|
||||
|
||||
* Write a tool, valgrind-depsim, to instrument a binary to extract its
|
||||
dependencies at runtime
|
||||
* Can extract memory, register and temp-based dependencies
|
||||
* Here, only the memory dependencies are relevant -- disable the other deps.
|
||||
* Instrument binary:
|
||||
* for each write, add `write_addr -> writer_pc` to a hashmap
|
||||
* for each read, fetch `writer_pc` from hashmap
|
||||
* if found, add a dependency `reader_pc -> writer_pc`
|
||||
* use the process' memory map to translate PC to addresses inside ELF files
|
||||
* At the end, write deps file:
|
||||
* `#occur, src_elf_pc, src_elf_path, dst_elf_pc, dst_elf_path`
|
||||
* Run for each binary in genbenchs
|
||||
* Takes about 1h on 30 parallel cores on Pinocchio; heavy memory usage
|
||||
|
||||
Use valgrind-depsim.
|
||||
Then, compare with staticdeps: `eval/vg_depsim.py` script.
|
||||
|
||||
* For each binary in genbenchs,
|
||||
|
|