75 lines
4 KiB
TeX
75 lines
4 KiB
TeX
\section{Static dependencies detection}\label{ssec:staticdeps_detection}
|
|
|
|
Depending on their type, some dependencies are significantly harder to
|
|
statically detect than others.
|
|
|
|
\textbf{Register-carried dependencies}, when in straight-line code, can be
|
|
detected by keeping track of which instruction last wrote each register in a
|
|
\emph{shadow register file}. This is most often supported by code analyzers
|
|
---~for instance, \llvmmca{} and \uica{} support it.
|
|
|
|
Loop-carried dependencies can, to some extent, be detected the same way. As the
|
|
basic block is always assumed to be the body of an infinite loop, a
|
|
straight-line analysis can be performed on a duplicated kernel. This strategy
|
|
is \eg{} adopted by \osaca{}~\cite[§II.D]{osaca2}. When dealing only with
|
|
register accesses, this strategy is always sufficient: as each iteration always
|
|
executes the same basic block, it is not possible for an instruction to depend
|
|
on another instruction two iterations earlier or more.
|
|
|
|
\smallskip
|
|
|
|
\textbf{Memory-carried dependencies}, however, are significantly harder to tackle. While basic
|
|
heuristics can handle some simple cases, in the general case two main
|
|
difficulties arise:
|
|
\begin{enumerate}[(i)]
|
|
\item{}\label{memcarried_difficulty_alias} pointers may \emph{alias}, \ie{}
|
|
point to the same address or array; for instance, if \reg{rax} points
|
|
to an array, it may be that \reg{rbx} points to $\reg{rax} + 8$, making
|
|
the detection of such a dependency difficult;
|
|
\item{}\label{memcarried_difficulty_arith} arbitrary arithmetic operations
|
|
may be performed on pointers, possibly through diverting paths: \eg{}
|
|
it might be necessary to detect that $\reg{rax} + 16 << 2$ is identical
|
|
to $\reg{rax} + 128 / 2$; this requires semantics for assembly
|
|
instructions and tracking formal expressions across register values
|
|
---~and possibly even memory.
|
|
\end{enumerate}
|
|
|
|
Tracking memory-carried dependencies is, to the best of our knowledge, not done
|
|
in code analyzers, as our results in \autoref{chap:CesASMe} suggests.
|
|
|
|
\smallskip{}
|
|
|
|
While the strategy previously used for register-carried dependencies is
|
|
sufficient to detect loop-carried dependencies from one occurrence to the next
|
|
one, it is not sufficient at all times when the dependencies tracked are
|
|
memory-carried. For instance, in the second example from
|
|
\autoref{lst:loop_carried_exn}, an instruction depends on another two
|
|
iterations ago.
|
|
|
|
Dependencies can reach arbitrarily old iterations of a loop: in this example,
|
|
\lstxasm{-8192(\%rbx, \%r10)} may be used to reach 1\,024 iterations back.
|
|
However, while far-reaching dependencies may \emph{exist}, they are not
|
|
necessarily \emph{relevant} from a performance analysis point of view. Indeed,
|
|
if an instruction $i_2$ depends on a result previously produced by an
|
|
instruction $i_1$, this dependency is only relevant if it is possible that
|
|
$i_1$ is not yet completed when $i_2$ is considered for issuing ---~else, the
|
|
result is already produced, and $i_2$ needs never wait to execute.
|
|
|
|
The reorder buffer (ROB) of a CPU can be modelled as a sliding window of fixed
|
|
size over \uops{}. In particular, if a \uop{} $\mu_1$ is not yet retired, the
|
|
ROB may not contain \uops{} more than the ROB's size ahead of $\mu_1$. This is
|
|
in particular also true for instructions, as the vast majority of instructions
|
|
decode to at least one \uop{}\footnote{Some \texttt{mov} instructions from
|
|
register to register may, for instance, only have an impact on the renamer;
|
|
no \uops{} are dispatched to the backend.}. We formalize this in
|
|
\autoref{ssec:staticdeps:rob_proof} below.
|
|
|
|
A possible solution to detect loop-carried dependencies in a kernel $\kerK$ is
|
|
thus to unroll it until it contains about $\card{\text{ROB}} +
|
|
\card{\kerK}$. This ensures that every instruction in the last kernel can find
|
|
dependencies reaching up to $\card{\text{ROB}}$ back.
|
|
|
|
On Intel CPUs, the reorder buffer size contained 224 \uops{} on Skylake (2015),
|
|
or 512 \uops{} on Golden Cove (2021)~\cite{wikichip_intel_rob_size}. These
|
|
sizes are small enough to reasonably use this solution without excessive
|
|
slowdown.
|