
In the previous chapter, our major finding was that, in the current state of
the art, code analyzers deal poorly with memory-carried dependencies. We found
this flaw to be responsible, in our dataset, for a roughly $1.5\times$ increase
in MAPE, and up to $2.6\times$ on the third quartile of error.

The large impact of dependencies on the final runtime of a kernel is, in
reality, not very surprising. In chapters~\ref{chap:palmed}
and~\ref{chap:frontend}, we did not consider latency; hence, the only impact of
an instruction was its throughput, each instruction being issued as soon as
possible. Dependencies, however, force the processor to wait for some
instructions' results before issuing some others; the \emph{latency} of an
instruction becomes a critical factor.

On Skylake, for instance, the instruction \lstxasm{add \%rax, \%rbx} has a
latency of one full cycle. Thus, the kernel
\begin{lstlisting}[language={[x86masm]Assembler}]
add %rax, %rbx
add %rbx, %rcx
\end{lstlisting}
would execute, in steady state, in half a cycle if the dependency were
ignored; yet these two instructions in isolation would take
$1\,\sfrac{1}{4}$ cycles when accounting for the dependency. Some instructions
are more extreme still; for instance, the
\lstxasm{vfmadd*pd \%ymm0, \%ymm1, \%ymm2} family of instructions has a latency
of four full cycles, while without dependencies, two can be issued every cycle.
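
As a rough illustration of the gap between these two figures, consider a
hypothetical chain of $n$ such instructions, and ignore every bottleneck other
than latency and issue rate. The notations $T_{\mathrm{dep}}$ and
$T_{\mathrm{indep}}$ are introduced here for this example only: the former
assumes that each instruction depends on the result of the previous one, the
latter that all instructions are independent. Under these assumptions,
\[
    T_{\mathrm{dep}}(n) \approx 4n \ \mathrm{cycles},
    \qquad
    T_{\mathrm{indep}}(n) \approx \frac{n}{2} \ \mathrm{cycles},
    \qquad
    \frac{T_{\mathrm{dep}}(n)}{T_{\mathrm{indep}}(n)} \approx 8,
\]
so that overlooking such a dependency chain could, in this idealized setting,
underestimate the execution time by a factor of about eight.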

\medskip{}

In the previous chapter, we also presented \gus{}, a dynamic code analyzer
based on \qemu{}, which we found to be very effective at detecting memory-carried
dependencies and the slowdown they incur on the whole program. However, this
solution results in a runtime increase of about two orders of magnitude, which
may not be acceptable in many use cases.

In this chapter, we instead present \staticdeps{}, a fully static analyzer able
to detect memory-carried dependencies in many cases. We evaluate it by
feeding its dependency analysis to \uica{}, which brings the latter on par with
\gus{} on the full, non-pruned dataset of the previous chapter.