\subsection{Far-reaching dependencies do not impact
performance}\label{ssec:staticdeps:rob_proof}

\begin{definition}[Distance between instructions]
Let $\left(I_p\right)_{0\leq p<n}$ be a trace of executed instructions.
For $p<p'$, $\distance{I_p}{I_{p'}}$ is the total number of \uops{} decoded
for the subtrace $\left(I_r\right)_{p\leq r\leq p'}$, minus one.
\end{definition}

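As an illustration, consider a hypothetical trace of three instructions
$I_0, I_1, I_2$ that decode into one, two and one \uops{} respectively (these
counts are chosen for the sake of the example only). The subtrace
$\left(I_r\right)_{0\leq r\leq 2}$ then decodes into four \uops{}, so that
\[
    \distance{I_0}{I_2} = (1 + 2 + 1) - 1 = 3,
\]
while $\distance{I_0}{I_1} = (1 + 2) - 1 = 2$.
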
\begin{theorem}[Long distance dependencies]\label{thm.longdist}
There exists $R \in \nat$ such that the presence or absence of a dependency
between two instructions that are separated by at least $R$ other \uops{}
has no impact on the performance of a kernel.

More formally, let $\kerK$ be a kernel of $n$ instructions. Let $(I_p)_{p
\in \nat}$ be the trace of $\kerK$'s instructions executed in a loop. For
any $p, q \in \nat$ such that $\distance{I_p}{I_q} > R$, $\cyc{\kerK}$ does
not depend on the presence or absence of a dependency between the pairs of
instructions $\left(I_{p+kn}, I_{q+kn}\right)_{k\in\nat}$.
\end{theorem}

To prove this assertion, we require a few postulates describing how a CPU
operates and, in particular, how \uops{} enter (when decoded) and leave (when
retired) the reorder buffer.

\begin{postulate}[Reorder buffer as a circular buffer]
The reorder buffer is a circular buffer of size $R \in \nat^\star$.
It contains only decoded \uops{}.
Let us denote by $i_d$ the \uop{} at position $d$ in the reorder buffer.
Assume $i_d$ has just been decoded.
Then, for every $q$ and $q'$ in $[0,R)$:
\[
    (q-d-1) \% R<(q'-d-1) \% R
    \ \iff \ i_q \text{ was decoded before } i_{q'}
\]
\end{postulate}

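As a sanity check of this ordering, take the hypothetical values $R = 4$ and
$d = 1$, and read $\%$ as the non-negative remainder (an assumption on the
notation). Then
\[
    \begin{array}{c|cccc}
        q & 2 & 3 & 0 & 1 \\ \hline
        (q-d-1) \% R & 0 & 1 & 2 & 3
    \end{array}
\]
so the \uops{} were decoded in the order $i_2, i_3, i_0, i_1$: the slot
following $d$ holds the oldest \uop{}, and $i_d = i_1$ is the most recent one.
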
If a \uop{} has not been retired yet (issued and executed), it cannot be
replaced in the reorder buffer (ROB) by any freshly decoded \uop{}. In other
words, every non-retired decoded \uop{} --~also called \emph{in-flight}~--
remains in the reorder buffer. This is guaranteed by the notion of a
\emph{full reorder buffer}:

\begin{postulate}[Full reorder buffer]
Let us denote by $i_d$ the \uop{} that has just been decoded.
The reorder buffer is said to be \emph{full} if, for $q=(d+1) \% R$, the \uop{} $i_q$ is not retired yet.

If the reorder buffer is full, then instruction decoding is stalled.
\end{postulate}

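For instance, with the hypothetical values $R = 4$ and $d = 3$, the slot to
be written next is
\[
    q = (d+1) \% R = (3+1) \% 4 = 0 .
\]
If $i_0$ has not been retired yet, the reorder buffer is full and decoding
stalls until $i_0$ retires; otherwise, the next decoded \uop{} may take its
place.
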
Let $(I_p)_{0\le{} p<n}$ be a trace of executed instructions.
Each of these instructions is, in turn, decoded, issued, and retired.
We also denote by $(i_q)_{0\le q<m}$ the trace of decoded \uops{}.
To prove the theorem above, we need to establish that two in-flight \uops{} are at most $R$ \uops{} apart.

For any instruction $I_p$, we denote by $Q_p$ the range of indices such that
$(i_q)_{q\in Q_p}$ are the \uops{} obtained from the decoding of $I_p$.

\begin{lemma}[Distance of in-flight \uops{}]
For any pair of instructions $(I_p,I_{p'})$ and two corresponding \uops{}
$(i_q,i_{q'})$ such that $q \in Q_p, q' \in Q_{p'}$,
\[
    \operatorname{inflight}(i_q) \wedge \operatorname{inflight}(i_{q'}) \Rightarrow \distance{I_p}{I_{p'}}<R
\]
\end{lemma}

\begin{proof}
In case of branch misprediction, some additional instructions might also be decoded (and potentially issued).
In other words, $(i_q)_{0\le q<m}$ potentially contains more \uops{} than those matching the executed instructions.
However, although not necessarily surjective, the relation from $\{I_p\}_{0\le p<n}$ to $\{i_q\}_{0\le q<m}$ is injective and increasing (with respect to the order in which instructions and \uops{} are listed).
Consequently, $\distance{I_p}{I_{p'}}\le |q'-q|$.
Observe that, at any time, the content of the ROB can be seen as a window of length $R$ over $(i_q)_{0\le q<m}$.
Hence, if both $i_q$ and $i_{q'}$ are in-flight, then
$|q'-q|<R$, and thus $\distance{I_p}{I_{p'}}<R$.
\end{proof}

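To make the window argument concrete, suppose (hypothetically) that $R = 4$
and that the ROB currently holds $i_5, i_6, i_7, i_8$. Any two in-flight
\uops{} $i_q$ and $i_{q'}$ then satisfy
\[
    |q'-q| \le 8 - 5 = 3 < R .
\]
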
\begin{postulate}[Issue delay]
The issue of a \uop{} $i$ can be delayed for the following reasons:
\begin{enumerate}
    \item $i$ is not yet in the reorder buffer;
    \item $i$ depends on a \uop{} $i'$ which is not retired yet;
    \item the ports on which $i$ can be mapped are all occupied.
\end{enumerate}
\end{postulate}

\begin{proof}[Proof of Long distance dependencies theorem]
The theorem above is now a direct consequence (proof
by contradiction) of the previous observations. Let us consider a delayed issue
for \uop{} $i$ whose unique cause is a dependency on \uop{} $i'$, that
is:
\begin{enumerate}
    \item $i$ is already in the reorder buffer;
    \item $i$ depends on \uop{} $i'$ which is not retired yet;
    \item at least one port on which $i$ can be mapped is available.
\end{enumerate}
Since $i'$ is not retired yet and $i'$ is ``before'' $i$, $i'$ is still in the
reorder buffer, \ie{} both $i$ and $i'$ are in the reorder buffer.
Both \uops{} are thus in-flight and, by the lemma above, the instructions they
originate from are at a distance less than $R$. A dependency between
instructions at a distance greater than $R$ can therefore never be the unique
cause of a delayed issue.
\end{proof}