phd-thesis/manuscrit/60_staticdeps/30_static_principle.tex

76 lines
4.2 KiB
TeX

\section{Static dependencies detection}\label{ssec:staticdeps_detection}
Depending on the type of dependencies considered, it is more or less difficult
to statically detect them.
\paragraph{Register-carried dependencies in straight-line code.} This case is
the easiest to statically detect, and is most often supported by code analyzers
---~for instance, \llvmmca{} supports it. The same strategy that was used to
dynamically find dependencies in \autoref{ssec:depsim} can still be used: a
shadow register file simply keeps track of which instruction last wrote each
register.
\paragraph{Register-carried, loop-carried dependencies.} Loop-carried
dependencies can, to some extent, be detected the same way. As the basic block
is always assumed to be the body of an infinite loop, a straight-line analysis
can be performed on a duplicated kernel. This strategy is \eg{} adopted by
\osaca{}~\cite{osaca2} (§II.D).
When dealing only with register accesses, this
strategy is always sufficient: as each iteration always executes the same basic
block, it is not possible for an instruction to depend on another instruction
two iterations earlier or more.
\paragraph{Memory-carried dependencies in straight-line code.} Memory
dependencies, however, are significantly harder to tackle. While basic
heuristics can handle some simple cases, in the general case two main
difficulties arise:
\begin{enumerate}[(i)]
\item{}\label{memcarried_difficulty_alias} pointers may \emph{alias}, \ie{}
point to the same address or array; for instance, if \reg{rax} points
to an array, it may be that \reg{rbx} points to $\reg{rax} + 8$, making
the detection of such a dependency difficult;
\item{}\label{memcarried_difficulty_arith} arbitrary arithmetic operations
may be performed on pointers, possibly through diverting paths: \eg{}
it might be necessary to detect that $\reg{rax} + 16 << 2$ is identical
to $\reg{rax} + 128 / 2$; this requires semantics for assembly
instructions and tracking formal expressions across register values
---~and possibly even memory.
\end{enumerate}
Tracking memory-carried dependencies is, to the best of our knowledge, not done
in code analyzers, as our results in \autoref{chap:CesASMe} suggests.
\paragraph{Loop-carried, memory-carried dependencies.} While the strategy
previously used for register-carried dependencies is sufficient to detect
loop-carried dependencies from one occurrence to the next one, it is not
sufficient at all times when the dependencies tracked are memory-carried. For
instance, in the second example from \autoref{lst:loop_carried_exn}, an
instruction depends on another two iterations ago.
Dependencies can reach arbitrarily old iterations of a loop: in this example,
\lstxasm{-8192(\%rbx, \%r10)} may be used to reach 1\,024 iterations back.
However, while far-reaching dependencies may \emph{exist}, they are not
necessarily \emph{relevant} from a performance analysis point of view. Indeed,
if an instruction $i_2$ depends on a result previously produced by an
instruction $i_1$, this dependency is only relevant if it is possible that
$i_1$ is not yet completed when $i_2$ is considered for issuing ---~else, the
result is already produced, and $i_2$ needs not wait to execute.
The reorder buffer (ROB) of a CPU can be modelled as a sliding window of fixed
size over \uops{}. In particular, if a \uop{} $\mu_1$ is not yet retired, the
ROB may not contain \uops{} more than the ROB's size ahead of $\mu_1$. This is
in particular also true for instructions, as the vast majority of instructions
decode to at least one \uop{}\footnote{Some \texttt{mov} instructions from
register to register may, for instance, only have an impact on the renamer;
no \uops{} are dispatched to the backend.}.
A possible solution to detect loop-carried dependencies in a kernel $\kerK$ is
thus to unroll it until it contains about $\card{\text{ROB}} +
\card{\kerK}$. This ensures that every instruction in the last kernel can find
dependencies reaching up to $\card{\text{ROB}}$ back.
On Intel CPUs, the reorder buffer size contained 224 \uops{} on Skylake (2015),
or 512 \uops{} on Golden Cove (2021)~\cite{wikichip_intel_rob_size}. These
sizes are small enough to reasonably use this solution without excessive
slowdown.