\subsection{Far-reaching dependencies do not impact performance}\label{ssec:staticdeps:rob_proof}

\begin{definition}[Distance between instructions]
    Let $\left(I_p\right)_{0\leq p<n}$ be a trace of executed instructions.
    For $p<p'$, $\distance{I_p}{I_{p'}}$ is the total number of \uops{}
    decoded for the subtrace $\left(I_r\right)_{p < r < p'}$.
\end{definition}

\begin{theorem}[Long distance dependencies]\label{thm.longdist}
    There exists $R \in \nat$, dependent only on microarchitectural
    parameters, such that the presence or absence of a dependency between two
    instructions of a kernel that are separated by at least $R-1$ other
    \uops{} has no impact on the performance of this kernel.

    More formally, let $\kerK$ be a kernel of $n$ instructions. Let $(I_p)_{p
    \in \nat}$ be the trace of $\kerK$'s instructions executed in a loop. For
    any $p, q \in \nat$ with $p < q$ such that $\distance{I_p}{I_q} \geq R-1$,
    $\cyc{\kerK}$ is unaffected by the presence or absence of a dependency
    between the pairs of instructions $\left(I_{p+kn},
    I_{q+kn}\right)_{k\in\nat}$.
\end{theorem}

To prove this assertion, we require a few postulates describing how a CPU
operates and, in particular, how \uops{} enter (when decoded) and leave (when
retired) the reorder buffer.

\begin{postulate}[Reorder buffer as a circular buffer]
    The reorder buffer is a circular buffer of size $R \in \nat^+$. It
    contains only decoded \uops{}. Let us denote by $i_d$ the \uop{} at
    position $d$ in the reorder buffer. Assume $i_d$ has just been decoded.
    Then, for every $q$ and $q'$ in $[0,R)$:
    \[
        (q-d-1) \% R<(q'-d-1) \% R\ \iff \ i_q \text{ was decoded before } i_{q'}
    \]
\end{postulate}

As long as a \uop{} has not been retired (\ie{} issued and executed), it
cannot be replaced in the ROB by any freshly decoded \uop{}. In other words,
every non-retired decoded \uop{} (also called \emph{in-flight}) remains in the
reorder buffer.
This is made possible by the notion of a \emph{full reorder buffer}:

\begin{postulate}[Full reorder buffer]
    Let us denote by $i_d$ the \uop{} that has just been decoded. The reorder
    buffer is said to be full if, for $q=(d+1) \% R$, \uop{} $i_q$ is not
    retired yet. If the reorder buffer is full, then instruction decoding is
    stalled.
\end{postulate}

Let $(I_p)_{0\le{} p<n}$ be a trace of executed instructions. Each of these
instructions is successively decoded, issued, and retired. We also denote by
$(i_q)_{0\le q<m}$ the trace of decoded \uops{}.

To prove the theorem above, we first establish that any two in-flight \uops{}
are at most $R$ \uops{} apart.

For any instruction $I_p$, we denote by $Q_p$ the range of indices such that
$(i_q)_{q\in Q_p}$ are the \uops{} obtained from the decoding of $I_p$. Note
that in practice, we may not have $\bigcup{}_p Q_p = [0, m)$, as \eg{} branch
mispredictions may introduce unwanted \uops{} in the pipeline. However, as the
worst case for the lemma below occurs when no such ``spurious'' \uops{} are
present, we may safely ignore such occurrences.

\begin{lemma}[Distance of in-flight \uops{}]
    For any pair of instructions $(I_p,I_{p'})$ with $p < p'$, and two
    corresponding \uops{}, $(i_q,i_{q'})$ such that $q \in Q_p, q' \in Q_{p'}$,
    \[
        \operatorname{inflight}(i_q) \wedge \operatorname{inflight}(i_{q'})
        \implies \distance{I_p}{I_{p'}} < R - 1
    \]
\end{lemma}

\begin{proof}
    The sets $(Q_p)$ are pairwise disjoint, and for any pair of instructions
    $(I_p,I_{p'})$, and any two corresponding \uops{}, $(i_q,i_{q'})$ such
    that $q \in Q_p$, $q' \in Q_{p'}$, $p < p' \implies q < q'$. Every \uop{}
    of an instruction lying strictly between $I_p$ and $I_{p'}$ therefore has
    an index strictly between $q$ and $q'$, so that $\distance{I_p}{I_{p'}}
    \leq q' - q - 1$. Thus, $\distance{I_p}{I_{p'}} < |q'-q|$.

    Observe that at any time, the content of the ROB can be seen as a window
    of length at most $R$ over $(i_q)_{0\le q<m}$. Consequently, if both $i_q$
    and $i_{q'}$ are in-flight, then $|q'-q|<R$, \ie{} $|q'-q| \leq R - 1$,
    and thus $\distance{I_p}{I_{p'}} < R - 1$.
\end{proof}

\begin{postulate}[Issue delay]
    The issue of a \uop{} $i$ may be delayed for any of the following reasons:
    \begin{enumerate}
        \item $i$ is not yet in the reorder buffer;
        \item $i$ depends on a \uop{} $i'$ which is not retired yet;
        \item the ports on which $i$ can be mapped are all occupied.
    \end{enumerate}
\end{postulate}

\begin{proof}[Proof of the long distance dependencies theorem]
    The theorem above is now a direct consequence of the previous
    observations.

    Let us consider two \uops{}, $i'$ and $i$, respectively introduced by
    instructions $I_p$ and $I_q$. Assume that the issue of \uop{} $i$ is
    delayed and that the unique cause is a dependency on \uop{} $i'$, that is:
    \begin{enumerate}
        \item $i$ is already in the reorder buffer;
        \item $i$ depends on \uop{} $i'$, which is not retired yet;
        \item at least one port on which $i$ can be mapped is available.
    \end{enumerate}

    Since $i'$ is not retired yet and $i'$ is ``before'' $i$, $i'$ is still in
    the reorder buffer, \ie{} both $i$ and $i'$ are in the reorder buffer. By
    the previous lemma, we have $\distance{I_p}{I_q} < R - 1$.

    By contraposition, for any two instructions $I_a, I_b$ such that
    $\distance{I_a}{I_b} \geq R-1$, no \uop{} of $I_b$ may have its execution
    delayed by a dependency between $I_a$ and $I_b$.
\end{proof}

\begin{remark}
    The claim made earlier follows directly from this theorem: to detect
    meaningful dependencies over a kernel $\kerK$, it suffices to analyze the
    kernel unrolled enough times to obtain a sequence of at least $R +
    \card{\kerK}$ instructions, as this yields a sequence where every
    instruction from the last occurrence of $\kerK$ is preceded by at least
    $R - 1$ instructions.
\end{remark}
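
\begin{remark}
    As a purely illustrative instance of the previous remark, the number $u$
    of copies of $\kerK$ needed to reach at least $R + \card{\kerK}$
    instructions can be computed as
    \[
        u = \left\lceil \frac{R + \card{\kerK}}{\card{\kerK}} \right\rceil
          = \left\lceil \frac{R}{\card{\kerK}} \right\rceil + 1.
    \]
    Assuming, hypothetically, $R = 224$ and $\card{\kerK} = 10$ (values chosen
    only for the sake of the example, not taken from a specific
    microarchitecture), this gives $u = 23 + 1 = 24$ copies, \ie{} $240 \geq
    234$ instructions, and every instruction of the last copy is preceded by
    at least $230 \geq R - 1 = 223$ instructions.
\end{remark}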