\section{Static dependencies detection}\label{ssec:staticdeps_detection}

Some types of dependencies are harder to detect statically than others.

\paragraph{Register-carried dependencies in straight-line code.} This case is
the easiest to statically detect, and is supported by most code analyzers;
for instance, \llvmmca{} supports it. The same strategy that was used to
dynamically find dependencies in \autoref{ssec:depsim} can still be used: a
shadow register file simply keeps track of which instruction last wrote each
register.
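
This bookkeeping can be stated more formally; the notation below is introduced
for illustration only and is an assumption of this sketch. Writing
$i_1, \dots, i_n$ for the instructions of the basic block in program order, and
$\mathrm{writes}(i)$ for the set of registers written by $i$, an instruction
$i_k$ reading a register $r$ depends on $i_j$, where
\[
    j = \max \left\{ j' < k \mid r \in \mathrm{writes}(i_{j'}) \right\},
\]
provided this set is not empty. The shadow register file stores this index $j$
for each register $r$ while the instructions are scanned in order.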

\paragraph{Register-carried, loop-carried dependencies.} Loop-carried
dependencies can, to some extent, be detected the same way. As the basic block
is always assumed to be the body of an infinite loop, a straight-line analysis
can be performed on a duplicated kernel, \ie{} on two consecutive copies of the
basic block. This strategy is \eg{} adopted by
\osaca{}~\cite{osaca2} (§II.D).

When dealing only with register accesses, this
strategy is always sufficient: as each iteration executes the same basic
block, any register written in the block is written again at every iteration,
so an instruction cannot depend on another instruction two or more iterations
earlier.

\paragraph{Memory-carried dependencies in straight-line code.} Memory
dependencies, however, are significantly harder to tackle. While basic
heuristics can handle some simple cases, in the general case two main
difficulties arise:
\begin{enumerate}[(i)]
    \item{}\label{memcarried_difficulty_alias} pointers may \emph{alias}, \ie{}
        point to the same address or array; for instance, if \reg{rax} points
        to an array, it may be that \reg{rbx} points to $\reg{rax} + 8$, making
        the detection of such a dependency difficult;
    \item{}\label{memcarried_difficulty_arith} arbitrary arithmetic operations
        may be performed on pointers, possibly through different paths: \eg{}
        it might be necessary to detect that $\reg{rax} + (16 << 2)$ is identical
        to $\reg{rax} + (128 / 2)$; this requires semantics for assembly
        instructions and tracking formal expressions across register values,
        and possibly even memory.
\end{enumerate}

Tracking memory-carried dependencies is, to the best of our knowledge, not done
in code analyzers, as our results in \autoref{chap:CesASMe} suggest.
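
As a worked instance of difficulty~\ref{memcarried_difficulty_arith}, and
assuming that the shift and the division are meant to bind tighter than the
addition, both offsets evaluate to the same constant:
\[
    16 << 2 = 16 \times 2^{2} = 64 = 128 / 2,
\]
so that both expressions denote the address $\reg{rax} + 64$. A static analysis
can only reach this conclusion if it models the semantics of both the shift and
the division, and compares the resulting expressions rather than their syntax.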

\paragraph{Loop-carried, memory-carried dependencies.} While the strategy
previously used for register-carried dependencies is sufficient to detect
loop-carried dependencies from one iteration to the next, it is not
always sufficient when the dependencies tracked are memory-carried. For
instance, in the second example from \autoref{lst:loop_carried_exn}, an
instruction depends on another from two iterations earlier.

Dependencies can reach arbitrarily old iterations of a loop: in this example,
\lstxasm{-8192(\%rbx, \%r10)} may be used to reach 1\,024 iterations back.
However, while far-reaching dependencies may \emph{exist}, they are not
necessarily \emph{relevant} from a performance analysis point of view. Indeed,
if an instruction $i_2$ depends on a result previously produced by an
instruction $i_1$, this dependency is only relevant if it is possible that
$i_1$ is not yet completed when $i_2$ is considered for issuing; otherwise, the
result is already available, and $i_2$ need not wait to execute.

The reorder buffer (ROB) of a CPU can be modelled as a sliding window of fixed
size over \uops{}. In particular, if a \uop{} $\mu_1$ is not yet retired, the
ROB cannot contain \uops{} that are more than the ROB's size ahead of $\mu_1$.
This also holds for instructions, as the vast majority of instructions
decode to at least one \uop{}\footnote{Some \texttt{mov} instructions from
    register to register may, for instance, only have an impact on the renamer;
    no \uops{} are dispatched to the backend.}.

A possible solution to detect loop-carried dependencies in a kernel $\kerK$ is
thus to unroll it until it contains about $\card{\text{ROB}} +
\card{\kerK}$ instructions. This ensures that every instruction in the last
copy of the kernel can find dependencies reaching up to $\card{\text{ROB}}$
instructions back.

On Intel CPUs, the reorder buffer holds 224 \uops{} on Skylake (2015),
and 512 \uops{} on Golden Cove (2021)~\cite{wikichip_intel_rob_size}. These
sizes are small enough for this solution to be usable without excessive
slowdown.
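
As a rough sanity check of this bound, under the simplifying assumption that
every instruction decodes to exactly one \uop{}, unrolling a kernel $\kerK$
until it holds at least $\card{\text{ROB}} + \card{\kerK}$ instructions takes
\[
    n = \left\lceil
        \frac{\card{\text{ROB}} + \card{\kerK}}{\card{\kerK}}
    \right\rceil
\]
copies of the kernel. For a hypothetical kernel of $\card{\kerK} = 10$
instructions, a size chosen here purely for illustration, this gives
$n = \lceil 234 / 10 \rceil = 24$ copies on Skylake and
$n = \lceil 522 / 10 \rceil = 53$ copies on Golden Cove, that is, 240 and 530
instructions respectively. Under the same assumption, a dependency reaching
1\,024 iterations back spans at least 1\,024 \uops{}, which exceeds both
reorder buffer sizes; in this model, it would thus not be relevant on either
microarchitecture.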