\section{Manually modelling the A72 frontend}

Our objective is now to manually construct a frontend model of the Cortex A72.
We strive, however, to remain as close to an algorithmic methodology as
possible: while our model's structure is manually crafted, its data should
come from experiments that can later be automated.

\subsection{Finding micro-operation count for each
instruction}\label{ssec:a72_insn_muop_count}

As we saw in \autoref{sec:a72_descr}, the Cortex A72's frontend can only
dispatch three \uops{} per cycle. The first important piece of data to
collect is thus the number of \uops{} each instruction is decoded into.

To that end, the optimisation manual~\cite{ref:a72_optim} helps, but is not
thorough enough: for each instruction, it lists the ports on which load is
incurred, which sets a lower bound on the number of \uops{} the instruction is
decomposed into. This approach, however, is not fully satisfying. First, it
cannot be reproduced for another architecture whose optimisation manual is not
as detailed, cannot be automated, and fully trusts the manufacturer. Second,
if an instruction loads \eg{} the integer ports, it may have one or several
\uops{} executed on those ports; the manual is only of limited help in
determining this.

\medskip{}

We instead use an approach akin to \palmed{}'s saturating kernels, itself
inspired by Agner Fog's method to identify ports in the absence of hardware
counters~\cite{AgnerFog}. To this end, we assume the availability of a port
mapping for the backend ---~in the case of the Cortex A72, we use \palmed{}'s
output, sometimes manually cross-checked against the software optimisation
guide~\cite{ref:a72_optim}~---; \uopsinfo{} could also be used on the
architectures it models.

The \palmed{} resource mapping we use as a basis is composed of 1\,975
instructions. To make this more manageable in a semi-automated method, we
reuse the instruction classes provided by \palmed{}, introduced in
\autoref{sec:palmed_design}: as instructions in the same class are mapped to
the same resources, they are decomposed into the same \uops{}. This results
in only 98 instruction classes.

\paragraph{Basic instructions.} We use \palmed{}'s mapping to hand-pick
\emph{basic instructions}: for each port, we select one instruction that
decodes into a single \uop{} executed by this port. We use the following
instructions, in \pipedream{}'s notation:

\begin{itemize}
    \item{} integer 0/1: \lstarmasm{ADC_RD_X_RN_X_RM_X}, \eg{}
        \lstarmasm{adc x0, x1, x2};
    \item{} integer multi-cycle: \lstarmasm{MUL_RD_W_RN_W_RM_W}, \eg{}
        \lstarmasm{mul w0, w1, w2};
    \item{} load: \lstarmasm{LDR_RT_X_ADDR_REGOFF}, \eg{}
        \lstarmasm{ldr x0, [x1, x2]};
    \item{} store: \lstarmasm{STR_RT_X_ADDR_REGOFF}, \eg{}
        \lstarmasm{str x0, [x1, x2]};
    \item{} FP/SIMD 0: \lstarmasm{FRINTA_FD_D_FN_D}, \eg{}
        \lstarmasm{frinta d0, d1} (floating-point rounding to integral);
    \item{} FP/SIMD 1: \lstarmasm{FCMP_FN_D_FM_D}, \eg{}
        \lstarmasm{fcmp d0, d1} (floating-point comparison);
    \item{} FP/SIMD 0/1: \lstarmasm{FMIN_FD_D_FN_D_FM_D}, \eg{}
        \lstarmasm{fmin d0, d1, d1} (floating-point minimum);
    \item{} (Branch: no instruction, as branches are unsupported by
        \pipedream{}).
\end{itemize}

As the integer ports are not specialized, a single basic instruction is
sufficient for both of them. The FP/SIMD ports, however, are slightly
specialized (see \autoref{sec:a72_descr}); we thus use three basic
instructions: one that stresses each of them independently, and one that
stresses both without distinction.

For each of these ports, we note $\basic{p}$ the basic instruction for
port \texttt{p}; \eg{}, $\basic{Int01}$ is \lstarmasm{ADC_RD_X_RN_X_RM_X}.
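
For later reference, this selection can be written down as a small mapping;
the sketch below is one possible encoding as a Python dictionary, keyed by
port name (the instruction names are \pipedream{}'s, as listed above):

\begin{lstlisting}[language=Python]
# Basic instructions, one per port, in pipedream notation.
BASIC = {
    "Int01": "ADC_RD_X_RN_X_RM_X",    # integer 0/1
    "IntM":  "MUL_RD_W_RN_W_RM_W",    # integer multi-cycle
    "Ld":    "LDR_RT_X_ADDR_REGOFF",  # load
    "St":    "STR_RT_X_ADDR_REGOFF",  # store
    "FP0":   "FRINTA_FD_D_FN_D",      # FP/SIMD 0
    "FP1":   "FCMP_FN_D_FM_D",        # FP/SIMD 1
    "FP01":  "FMIN_FD_D_FN_D_FM_D",   # FP/SIMD 0/1
    # Branch: no basic instruction (unsupported by pipedream).
}
\end{lstlisting}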

\paragraph{Counting the micro-ops of an
instruction.}\label{def:cycB}\label{def:cycF}\label{def:mucount} There are
three main sources of bottleneck for a kernel $\kerK$: the backend, the
frontend and dependencies. When measuring the execution time with
\pipedream{}, we eliminate dependencies as far as possible, leaving us with
only the backend and the frontend. We note $\cycF{\kerK}$ the execution time
of $\kerK$ if it were only limited by its frontend, and $\cycB{\kerK}$ its
execution time if it were only limited by its backend. If we consider a
kernel $\kerK$ that is simple enough to exhibit a purely linear frontend
behaviour ---~that is, the frontend's throughput is a linear function of the
number of \uops{} in the kernel~---, we then know that either $\cyc{\kerK} =
\cycF{\kerK}$ or $\cyc{\kerK} = \cycB{\kerK}$.

For a given instruction $i$ and for a certain $k \in \nat$, we then construct
a kernel $\kerK_k$ such that:
\begin{enumerate}[(i)]
    \item\label{cnd:kerKk:compo} $\kerK_k$ is composed of the instruction
        $i$, followed by $k$ basic instructions;
    \item\label{cnd:kerKk:linear} the kernel $\kerK_k$ is simple enough to
        exhibit this purely linear frontend behaviour;
    \item\label{cnd:kerKk:fbound} $\cycB{\kerK_k} \leq \cycF{\kerK_k}$.
\end{enumerate}

We denote by $\mucount{}\kerK$ the number of \uops{} in a kernel $\kerK$.
Under condition~(\ref{cnd:kerKk:linear}), we have, for any $k \in \nat$,
\begin{align*}
    \cycF{\kerK_k} &= \dfrac{\mucount{}\left(\kerK_k\right)}{3} & \text{for the A72} \\
    &= \dfrac{\mucount{}i + k}{3} & \text{by condition (\ref{cnd:kerKk:compo})} \\
    &\geq \dfrac{k+1}{3}
\end{align*}

We pick $k_0 := 3 \ceil{\cyc{\imath}} - 1$. Thus, we have
$\ceil{\cyc{\imath}} \leq \cycF{\kerK_{k_0}} \leq \cyc{\kerK_{k_0}}$.
Condition (\ref{cnd:kerKk:fbound}) can then be relaxed to
$\cycB{\kerK_{k_0}} \leq \ceil{\cyc{\imath}}$, which we know to be true if
the load from $\kerK_{k_0}$ on each port does not exceed
$\ceil{\cyc{\imath}}$ (as execution takes at least this number of cycles).

We build $\kerK_{k_0}$ by adding basic instructions to $i$, using the port
mapping to pick basic instructions that do not load any port over
$\ceil{\cyc{\imath}}$. This is always possible, as we can independently load
seven ports (leaving out the branch port), while each instruction can load
at most three ports per cycle it takes to execute ---~each \uop{} is
executed by a single port, and only three \uops{} can be dispatched per
cycle~---, leaving four ports under-loaded. We build $\kerK_{k_0 + 1}$ in
the same way, still not loading any port over $\ceil{\cyc{\imath}}$; in
particular, we still have $\cycB{\kerK_{k_0 + 1}} \leq \ceil{\cyc{\imath}}
\leq \cycF{\kerK_{k_0+1}}$. To help ensure that condition
(\ref{cnd:kerKk:linear}) holds, as we will see later in
\autoref{sssec:a72_dispatch_queues}, we spread instructions loading the same
port as much as possible: for instance, $i + \basic{Int01} + \basic{FP01} +
\basic{Int01}$ is preferred over $i + 2\times \basic{Int01} + \basic{FP01}$.
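
As an illustration, such a padding can be obtained by interleaving the
chosen basic instructions round-robin; the sketch below is a minimal
version, assuming the set of suitable fillers (those keeping every port's
load under $\ceil{\cyc{\imath}}$) has already been selected from the port
mapping:

\begin{lstlisting}[language=Python]
from itertools import cycle

def build_kernel(insn, k, fillers):
    """Pad insn with k basic instructions, cycling through the
    fillers so that basics loading the same port are spread out."""
    filler = cycle(fillers)
    return [insn] + [next(filler) for _ in range(k)]

# build_kernel("I", 3, ["Int01", "FP01"]) yields
# I + Int01 + FP01 + Int01, the preferred layout above.
\end{lstlisting}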

Unless condition (\ref{cnd:kerKk:linear}) is not met or our port model is
incorrect for this instruction, we should measure $\ceil{\cyc{\imath}} \leq
\cyc{\kerK_{k_0}}$ and $\cyc{\kerK_{k_0}} + \sfrac{1}{3} =
\cyc{\kerK_{k_0+1}}$. For the instructions $i$ where this is not the case,
increasing $k_0$ by 3 or using other basic instructions eventually yielded
satisfying measures. Finally, we obtain
\[
    \mucount{}i = 3 \cyc{\kerK_{k_0}} - k_0
\]
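
The whole procedure can be summarised by the following sketch, reusing the
\texttt{build\_kernel} sketch above; \texttt{measure\_cycles} is a
hypothetical wrapper around a \pipedream{} measurement:

\begin{lstlisting}[language=Python]
import math

def muop_count(insn, fillers, measure_cycles):
    """Measure the number of uops insn decodes into; fillers are
    basic instructions chosen, from the port mapping, so that no
    port is loaded over ceil(cyc(insn))."""
    c_i = measure_cycles([insn])
    # Round the raw measure (e.g. 1.01 ~ 1) before taking the
    # ceiling, so that noise does not inflate k0.
    k0 = 3 * math.ceil(round(c_i, 1)) - 1
    c_k0 = measure_cycles(build_kernel(insn, k0, fillers))
    c_k1 = measure_cycles(build_kernel(insn, k0 + 1, fillers))
    # One extra single-uop basic must cost exactly 1/3 cycle;
    # otherwise, increase k0 by 3 or change the fillers.
    assert abs((c_k1 - c_k0) - 1 / 3) < 0.05
    return round(3 * c_k0 - k0)
\end{lstlisting}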

\medskip{}

Applying this procedure manually to each instruction class provides us with
a model mapping each supported instruction of the ISA to its \uop{} count.

\begin{example}[\uop{} count measure: \lstarmasm{ADC_RD_X_RN_X_RM_X}]
    We measure the \uop{} count of $i =$ \lstarmasm{ADC_RD_X_RN_X_RM_X},
    our basic instruction for the integer ports.

    We measure $\cyc{\imath} = 0.51 \simeq \sfrac{1}{2}\,\text{cycle}$;
    hence, we consider $\kerK_2$ and $\kerK_3$. Our mapping indicates that
    this instruction loads only the \texttt{Int01} port, with a load of
    $\sfrac{1}{2}$.

    We select \eg{} $\kerK_2 = i + 2\times \basic{FP01}$ and $\kerK_3 = i +
    \basic{FP01} + \basic{Ld} + \basic{FP01}$.

    We measure
    \begin{itemize}
        \item $\cyc{\kerK_2} = 1.01 \simeq 1\,\text{cycle}$
        \item $\cyc{\kerK_3} = 1.35 \simeq 1\,\sfrac{1}{3}\,\text{cycles}$
    \end{itemize}
    which is consistent. We conclude that, as expected, $\mucount i =
    3\cyc{\kerK_2} - 2 = 3 - 2 = 1$.
\end{example}

\begin{example}[\uop{} count measure: \lstarmasm{ADDV_FD_H_VN_V_8H}]
    We measure the \uop{} count of $i =$ \lstarmasm{ADDV_FD_H_VN_V_8H}, the
    SIMD ``add across vector'' operation on a vector of eight sixteen-bit
    operands.

    We measure $\cyc{\imath} = 1.01 \simeq 1\,\text{cycle}$; hence, we
    consider $\kerK_2$ and $\kerK_3$. Our mapping indicates that this
    instruction loads the \texttt{FP1} port with a load of $1$, and the
    \texttt{FP01} port with a load of $1$\footnote{The \texttt{FP01} port
    has a throughput of 2, hence a load of 1 means two \uops{}. As there is
    already a \uop{} loading the \texttt{FP1} port, which also loads the
    combined port \texttt{FP01}, this can be understood as one \uop{} on
    \texttt{FP1} exclusively, plus one on either \texttt{FP0} or
    \texttt{FP1}.}.

    We select \eg{} $\kerK_2 = i + 2\times \basic{Int01}$ and $\kerK_3 = i +
    \basic{Int01} + \basic{Ld} + \basic{Int01}$.

    We measure
    \begin{itemize}
        \item $\cyc{\kerK_2} = 1.35 \simeq 1\,\sfrac{1}{3}\,\text{cycles}$
        \item $\cyc{\kerK_3} = 1.68 \simeq 1\,\sfrac{2}{3}\,\text{cycles}$
    \end{itemize}
    which is consistent. We conclude that $\mucount i = 3\cyc{\kerK_2} - 2 =
    4 - 2 = 2$.
\end{example}

\subsection{Bubbles in the pipeline}

The frontend, however, does not always exhibit a purely linear behaviour.
Consider for instance the kernel $\kerK =$ \lstarmasm{ADDV_FD_H_VN_V_8H} $+
3\times\basic{Int01}$; for the rest of this chapter, we refer to
\lstarmasm{ADDV_FD_H_VN_V_8H} simply as \lstarmasm{ADDV} unless stated
otherwise.

Backend-wise, \texttt{ADDV} fully loads \texttt{FP1} and \texttt{FP01},
while $\basic{Int01}$ half-loads \texttt{Int01}. The port most loaded by
$\kerK$ is thus \texttt{Int01}, with a load of $1\,\sfrac{1}{2}$. We then
expect $\cycB{\kerK} = 1\,\sfrac{1}{2}$ cycles.

Frontend-wise, \texttt{ADDV} decomposes into two \uops{}, while
$\basic{Int01}$ decomposes into a single \uop{}; thus, $\mucount{}\kerK =
5$. We then expect $\cycF{\kerK} = 1\,\sfrac{2}{3}$ cycles.

As the frontend dominates the backend, we expect $\cyc{\kerK} = \cycF{\kerK}
= 1\,\sfrac{2}{3}$ cycles. In reality, however, we measure $\cyc{\kerK} =
2.01 \simeq 2$ cycles.

\medskip{}

From here on, we strive to find a model that reliably predicts, for a given
kernel, how many cycles it requires to execute in steady-state,
frontend-wise.

\subsubsection{No-cross model}

\begin{figure}
    \centering
    \hfill\begin{minipage}[c]{0.25\linewidth}
        \centering
        \includegraphics[width=3cm]{timeline_front_ex1_linear.svg}\\
        \textit{With linear frontend}
    \end{minipage}\begin{minipage}[c]{0.2\linewidth}
        \centering
        \Huge$\rightarrow$
    \end{minipage}\begin{minipage}[c]{0.25\linewidth}
        \centering
        \includegraphics[width=3cm]{timeline_front_ex1_nocross.svg}\\
        \textit{With no-cross frontend}
    \end{minipage}\hfill~

    \caption{Illustration of the no-cross frontend model. Rows represent CPU
    cycles.}\label{fig:frontend_nocross}
\end{figure}

On the x86-64 architectures they analyzed, \uica{}'s authors find that the
CPU's predecoder may cause an instruction's \uops{} to be postponed to the
next cycle if the instruction is pre-decoded across a cycle
boundary~\cite[§4.1]{uica}.

We hypothesize that the same kind of effect could postpone an instruction's
\uops{} to the next cycle if they would otherwise cross a cycle boundary.
This behaviour is illustrated in \autoref{fig:frontend_nocross}, with a
kernel composed of three instructions: the first two each decode to a single
\uop{}, while the third one decodes to two \uops{}. In this figure, each row
represents a CPU cycle, and each square a \uop{} slot in the frontend; there
are thus at most three squares per row. In the no-cross case (right), the
constraint forces the third instruction to start decoding at the beginning
of the second cycle, leaving a ``bubble'' in the frontend on the first
cycle.

\medskip{}

\begin{wrapfigure}{R}{0.2\linewidth}
    \centering
    \vspace{-2em}
    \includegraphics[width=3cm]{timeline_front_nocross_addv_3add.svg}
    \caption{No-cross frontend for $\texttt{ADDV} +
    3\times\basic{Int01}$}\label{fig:frontend_nocross_addv_3add}
\end{wrapfigure}

This model explains the $\kerK = \texttt{ADDV} + 3\times\basic{Int01}$
example introduced above, as depicted in
\autoref{fig:frontend_nocross_addv_3add}, where $\kerK$ is represented
twice to ensure that the steady-state is reached. Here, the frontend indeed
requires two full cycles to issue $\kerK$, which is consistent with our
measure.

\medskip{}

The notion of steady-state is, in the general case, not as straightforward:
it is quite possible that, after executing the kernel once, the second
iteration of the kernel does not begin at a cycle boundary. The state of the
model, however, is entirely defined by the number $s \in
\left\{0,1,2\right\}$ of \uops{} already decoded in the current cycle. Thus,
if at the end of a full execution of the kernel, $s$ is equal to a state
previously encountered at the end of a kernel execution, $k$ kernel
iterations before, a steady-state has been reached for this portion: we know
that executing the kernel $k$ more times will bring us back to the same
state. The steady-state execution time, frontend-wise, of a kernel is then
the number of cycles elapsed between the beginning and the end of the
steady-state pattern (as the start and end states are the same), divided by
the number of kernel repetitions inside the pattern.

The no-cross model is formalized by the \texttt{next\_state} function
defined in \autoref{lst:nocross_next_state}, in Python.

\begin{lstfloat}
    \lstinputlisting[language=Python]{assets/src/40_A72-frontend/nocross_next.py}
    \caption{Implementation of the \texttt{next\_state} function for the
    no-cross frontend model}\label{lst:nocross_next_state}
\end{lstfloat}

\medskip{}

There are two main phases when repeatedly applying the \texttt{next\_state}
function. Consider the following graph representation of the
\texttt{next\_state} function, ignoring the \texttt{cycles\_started} return
value:

\begin{center}
    \begin{tikzpicture}[
        state/.style={circle, draw=black, thick, minimum size=0.5cm,
        align=center}
    ]
        \node[state] at (0, 0) (0) {$0$};
        \node[state] at (3, 0) (1) {$1$};
        \node[state] at (6, 0) (2) {$2$};

        \draw[->, thick] (0) to[bend left] (1);
        \draw[->, thick] (1) to[bend left] (2);
        \draw[->, thick] (2) to[bend left] (1);
    \end{tikzpicture}
\end{center}

\begin{wrapfigure}{R}{0.2\linewidth}
    \centering
    %\vspace{-2em}
    \includegraphics[width=3cm]{timeline_front_nocross_addv_2add.svg}
    \caption{No-cross frontend for $\texttt{ADDV} +
    2\times\basic{Int01}$}\label{fig:frontend_nocross_addv_2add}
\end{wrapfigure}

When repeatedly applied starting from $0$, the \texttt{next\_state} function
yields the sequence $0, 1, 2, 1, 2, 1, 2, \ldots$ The first iteration brings
us to state $1$, which belongs to the steady-state; from there, subsequent
iterations loop through the steady-state.

In the general case, the model iterates the \texttt{next\_state} function
starting from state $0$ until a previously-encountered state is reached
---~this requires at most three iterations~---; at this point, steady-state
is reached. The function is then iterated further until the same state is
encountered again, also requiring at most three iterations. The number of
cycles elapsed during this second phase, divided by the number of iterations
of the function, is returned as the predicted steady-state execution time of
the kernel, frontend-wise.
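
A minimal sketch of this two-phase driver follows, assuming that
\texttt{next\_state} takes the current state and returns the new state
together with the number of cycles started, as in
\autoref{lst:nocross_next_state}:

\begin{lstlisting}[language=Python]
def predict_cycles(next_state):
    """Steady-state frontend prediction for one kernel."""
    # Phase 1: iterate from state 0 until a state repeats;
    # at most three iterations, as there are three states.
    state, seen = 0, set()
    while state not in seen:
        seen.add(state)
        state, _ = next_state(state)
    # Phase 2: walk once around the steady-state loop,
    # counting cycles and kernel iterations.
    start, cycles, iters = state, 0, 0
    while True:
        state, started = next_state(state)
        cycles += started
        iters += 1
        if state == start:
            return cycles / iters
\end{lstlisting}

On the three-state graph above, this driver visits $0, 1, 2$, detects the
repetition of $1$, then walks the $1 \rightarrow 2 \rightarrow 1$ loop,
averaging its cycle count over two iterations.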

\bigskip{}

This model, however, is not satisfactory in many cases. For instance, the
kernel $\kerK' = \texttt{ADDV} + 2\times\basic{Int01}$ is predicted to run
in $1.5$ cycles, as depicted in \autoref{fig:frontend_nocross_addv_2add};
however, a \pipedream{} measure yields $\cyc{\kerK'} = 1.35 \simeq
1\,\sfrac{1}{3}$ cycles.

\subsubsection{Dispatch-queues model}\label{sssec:a72_dispatch_queues}

\begin{wrapfigure}{R}{0.2\linewidth}
    \centering
    \vspace{-2em}
    \includegraphics[width=3cm]{timeline_front_dispq_addv_3add.svg}
    \caption{Disp.\ queues frontend for $\texttt{ADDV} +
    3\times\basic{Int01}$}\label{fig:frontend_dispq_addv_3add}
    \vspace{-10em} % wow.
\end{wrapfigure}

The software optimisation guide, however, details additional dispatch
constraints in its Section~4.1~\cite{ref:a72_optim}. While it confirms the
global constraint of at most three \uops{} dispatched per cycle, it also
introduces more specific ones: there are six dispatch pipelines, each of
which bottlenecks at fewer than three \uops{} dispatched per cycle:

\medskip{}

\begin{minipage}{0.75\textwidth}
    \begin{center}
        \begin{tabular}{l l r}
            \toprule
            \textbf{Pipeline} & \textbf{Related ports} & \textbf{\uops{}/cyc.} \\
            \midrule
            \texttt{Branch} & \texttt{Branch} & 1 \\
            \texttt{Int} & \texttt{Int01} & 2 \\
            \texttt{IntM} & \texttt{IntM} & 2 \\
            \texttt{FP0} & \texttt{FP0} & 1 \\
            \texttt{FP1} & \texttt{FP1} & 1 \\
            \texttt{LdSt} & \texttt{Ld}, \texttt{St} & 2 \\
            \bottomrule
        \end{tabular}
    \end{center}
\end{minipage}
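
These per-pipeline limits translate directly into data; the sketch below is
one possible Python encoding of the table above, used by the model sketched
at the end of this section:

\begin{lstlisting}[language=Python]
# Maximum uops dispatched per cycle for each dispatch
# pipeline; the global limit of three uops per cycle
# applies on top of these.
DISPATCH_CAPACITY = {
    "Branch": 1,
    "Int":    2,  # serves port Int01
    "IntM":   2,
    "FP0":    1,
    "FP1":    1,
    "LdSt":   2,  # serves ports Ld and St
}
\end{lstlisting}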

\medskip{}

These dispatch constraints could also explain the measured run time of
$\kerK = \texttt{ADDV} + 3\times\basic{Int01}$, as detailed in
\autoref{fig:frontend_dispq_addv_3add}: the frontend enters steady-state on
the fourth cycle and, on the fifth (and every second cycle from then on),
can only dispatch two \uops{}, as the third one would be a third \uop{}
loading the \texttt{Int} dispatch queue, which can only dispatch two
\uops{} per cycle. As this part of the CPU is in-order, the frontend is
stalled until the next cycle, leaving a dispatch bubble. In steady-state,
the dispatch-queues model thus predicts $\cycF{\kerK} = 2$ cycles, which is
consistent with our measure.

\medskip{}

This model also explains the measured time of $\kerK' = \texttt{ADDV} +
2\times\basic{Int01}$, as detailed in
\autoref{fig:frontend_dispq_addv_2add}: the dispatch-queue constraints do
not force any bubble into the dispatch pipeline. The model thus predicts
$\cycF{\kerK'} = 1\,\sfrac{1}{3}$ cycles, which is consistent with our
measure.

\begin{wrapfigure}{r}{0.2\linewidth}
    \centering
    \vspace{-2em}
    \includegraphics[width=3cm]{timeline_front_dispq_addv_2add.svg}
    \caption{Disp.\ queues frontend for $\texttt{ADDV} +
    2\times\basic{Int01}$}\label{fig:frontend_dispq_addv_2add}
\end{wrapfigure}

\paragraph{Finding a dispatch model.} This model, however, cannot be deduced
straightforwardly from our previous \uops{} model: each \uop{} must further
be mapped to a dispatch queue.

The model yielded by \palmed{} is not usable as-is to deduce the dispatch
queues used by each instruction: the detected resources are not a perfect
match for the CPU's, and some resources tend to be duplicated due to
measurement artifacts. For instance, the \texttt{Int01} resource may be
duplicated into $r_0$ and $r_1$, with some integer operations loading $r_0$,
some loading $r_1$, and a majority loading both. While this makes for a
reasonably good throughput model, it would require extra cleaning work to be
used as a dispatch model. It is, however, a good basis for a manual mapping:
for instance, the \texttt{Ld} and \texttt{St} ports are one-to-one matches,
and allow us to automatically map all load and store operations.

We generate a base dispatch model from the \palmed{} model, and manually
cross-check each class of instructions against the optimisation manual,
using \pipedream{} measures in some cases for further clarification.

\medskip{}

This method trivially works for most instructions, which are built out of a
single \uop{} and for which we find a single relevant dispatch queue.
Instructions where this is not the case, however, require specific
processing. For an instruction $i$, with $u = \mucount{}i$ \uops{}, found to
belong to $d$ distinct dispatch queues, we break down the following cases
(a small sketch of this case analysis follows the list).

\begin{itemize}
    \item{} If $d > u$, our \uops{} model is wrong for this instruction, as
        each \uop{} belongs to a single dispatch queue. This did not happen
        in our model.

    \item{} If $d = 1, u > 1$, all the \uops{} belong to the same dispatch
        queue.

    \item{} If $d = u > 1$, each \uop{} belongs to its own dispatch queue.
        For now, our model orders those \uops{} arbitrarily. Their order
        might however matter due to dispatch constraints; settling it would
        require a specific investigation, with kernels meant to stress the
        dispatch queues under an assumed \uop{} order.

    \item{} If $1 < d < u$, we do not have enough data to determine how many
        \uops{} belong to each queue; this would require further
        measurements. We do not support those instructions, as they
        represent only 35 of the 1\,749 instructions supported by our
        \palmed{}-generated backend model.
\end{itemize}
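
The case analysis itself is straightforward; a minimal sketch, assuming $u$
and $d$ have been extracted as described above:

\begin{lstlisting}[language=Python]
def classify(u, d):
    """Case analysis for an instruction with u uops found to
    belong to d distinct dispatch queues."""
    if d > u:
        return "inconsistent: the uop count model is wrong"
    if d == 1:
        return "all uops on the same dispatch queue"
    if d == u:
        return "one uop per queue, ordered arbitrarily"
    return "unsupported: 1 < d < u needs more measurements"
\end{lstlisting}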

\medskip{}

Due to the separation of \texttt{FP0} and \texttt{FP1} into two different
dispatch queues, this model also borrows the abstract resources' formalism
in a simple case: it actually models seven queues, including \texttt{FP0},
\texttt{FP1} and \texttt{FP01}, the two former with a maximal load of 1 and
the latter with a maximal load of 2. Any \uop{} loading the \texttt{FP0} or
\texttt{FP1} queue likewise loads the \texttt{FP01} queue.

\paragraph{Implementing the dispatch model.} A key observation for
implementing this model is that, as with the no-cross model, the state of
the model at the end of a kernel occurrence is still determined solely by
the number of \uops{} already dispatched in the current cycle. Indeed, since
the dispatcher is in-order, the last \uops{} dispatched at the end of a
kernel occurrence are always the same in steady-state, as they come from the
last few instructions of the kernel.

On account of this observation, the general structure of the no-cross
implementation remains correct: at most three kernel iterations are needed
to reach steady-state, and at most three more to find a fixed point. The
\texttt{next\_state} function is adapted to account for the dispatch-queue
limits.
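
A minimal sketch of such an adapted \texttt{next\_state} follows. It
represents a kernel as the list of the dispatch queues of its \uops{}, in
order, and carries the per-queue loads of the current cycle along with the
\uop{} count; on the two kernels above, it reproduces the predicted $2$ and
$1\,\sfrac{1}{3}$ cycles. Our actual implementation may differ in its
details.

\begin{lstlisting}[language=Python]
# Per-cycle capacities, from the table above, plus the combined
# FP01 queue; uops on FP0 or FP1 also count against FP01.
CAPACITY = {"Branch": 1, "Int": 2, "IntM": 2,
            "FP0": 1, "FP1": 1, "FP01": 2, "LdSt": 2}

def queues_of(q):
    return [q, "FP01"] if q in ("FP0", "FP1") else [q]

def next_state(state, kernel):
    """Dispatch one kernel occurrence in order; state holds the
    uops already dispatched in the current cycle and their
    per-queue loads. Returns the new state and cycles started."""
    used, loads = state[0], dict(state[1])
    started = 0
    for q in kernel:
        # Stall to the next cycle if the global limit of three
        # uops or a per-queue capacity would be exceeded.
        if used >= 3 or any(loads.get(r, 0) >= CAPACITY[r]
                            for r in queues_of(q)):
            used, loads, started = 0, {}, started + 1
        used += 1
        for r in queues_of(q):
            loads[r] = loads.get(r, 0) + 1
    if used == 3:  # normalise: keep the uop count in {0, 1, 2}
        used, loads, started = 0, {}, started + 1
    return (used, loads), started
\end{lstlisting}

As argued above, in steady-state the per-queue loads are determined by the
\uop{} count alone, so the driver can still detect repeated states by keying
on that count, keeping the structure of the no-cross implementation.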