phd-thesis/manuscrit/60_staticdeps/30_static_principle.tex

\section{Static dependencies detection}\label{ssec:staticdeps_detection}

Depending on their type, some dependencies are significantly harder to
statically detect than others.

\textbf{Register-carried dependencies}, when in straight-line code, can be
detected by keeping track of which instruction last wrote each register in a
\emph{shadow register file}. This is most often supported by code analyzers
---~for instance, \llvmmca{} and \uica{} support it.

Loop-carried dependencies can, to some extent, be detected the same way. As the
basic block is always assumed to be the body of an infinite loop, a
straight-line analysis can be performed on a duplicated kernel. This strategy
is \eg{} adopted by \osaca{}~\cite[§II.D]{osaca2}. When dealing only with
register accesses, this strategy is always sufficient: as each iteration always
executes the same basic block, it is not possible for an instruction to depend
on another instruction two iterations earlier or more.

\smallskip

\textbf{Memory-carried dependencies}, however, are significantly harder to tackle. While basic
heuristics can handle some simple cases, in the general case two main
difficulties arise:
\begin{enumerate}[(i)]
    \item{}\label{memcarried_difficulty_alias} pointers may \emph{alias}, \ie{}
        point to the same address or array; for instance, if \reg{rax} points
        to an array, it may be that \reg{rbx} points to $\reg{rax} + 8$, making
        the detection of such a dependency difficult;
    \item{}\label{memcarried_difficulty_arith} arbitrary arithmetic operations
        may be performed on pointers, possibly through diverting paths: \eg{}
        it might be necessary to detect that $\reg{rax} + 16 << 2$ is identical
        to $\reg{rax} + 128 / 2$; this requires semantics for assembly
        instructions and tracking formal expressions across register values
        ---~and possibly even memory.
\end{enumerate}

Tracking memory-carried dependencies is, to the best of our knowledge, not done
in code analyzers, as our results in \autoref{chap:CesASMe} suggests.

\smallskip{}

While the strategy previously used for register-carried dependencies is
sufficient to detect loop-carried dependencies from one occurrence to the next
one, it is not sufficient at all times when the dependencies tracked are
memory-carried. For instance, in the second example from
\autoref{lst:loop_carried_exn}, an instruction depends on another two
iterations ago.

Dependencies can reach arbitrarily old iterations of a loop: in this example,
\lstxasm{-8192(\%rbx, \%r10)} may be used to reach 1\,024 iterations back.
However, while far-reaching dependencies may \emph{exist}, they are not
necessarily \emph{relevant} from a performance analysis point of view. Indeed,
if an instruction $i_2$ depends on a result previously produced by an
instruction $i_1$, this dependency is only relevant if it is possible that
$i_1$ is not yet completed when $i_2$ is considered for issuing ---~else, the
result is already produced, and $i_2$ needs never wait to execute.

The reorder buffer (ROB) of a CPU can be modelled as a sliding window of fixed
size over \uops{}. In particular, if a \uop{} $\mu_1$ is not yet retired, the
ROB may not contain \uops{} more than the ROB's size ahead of $\mu_1$. This is
in particular also true for instructions, as the vast majority of instructions
decode to at least one \uop{}\footnote{Some \texttt{mov} instructions from
    register to register may, for instance, only have an impact on the renamer;
no \uops{} are dispatched to the backend.}. We formalize this in
\autoref{ssec:staticdeps:rob_proof} below.

A possible solution to detect loop-carried dependencies in a kernel $\kerK$ is
thus to unroll it until it contains about $\card{\text{ROB}} +
\card{\kerK}$. This ensures that every instruction in the last kernel can find
dependencies reaching up to $\card{\text{ROB}}$ back.

On Intel CPUs, the reorder buffer size contained 224 \uops{} on Skylake (2015),
or 512 \uops{} on Golden Cove (2021)~\cite{wikichip_intel_rob_size}. These
sizes are small enough to reasonably use this solution without excessive
slowdown.
No results found.