\section{Manually modelling the A72 frontend}
\todo{}
\subsection{Finding micro-operation count for each instruction}
As we saw in \autoref{sec:a72_descr}, the Cortex A72's frontend can only
dispatch three \uops{} per cycle. The first important piece of data to collect
is thus the number of \uops{} each instruction is decoded into.

To that end, the optimisation manual~\cite{ref:a72_optim} helps, but is not
thorough enough: for each instruction, it lists the ports on which load is
incurred, which sets a lower bound on the number of \uops{} the instruction is
decomposed into. This approach, however, is unsatisfactory for two reasons.
First, it cannot be reproduced on another architecture whose optimisation
manual is not as detailed, it cannot be automated, and it fully trusts the
manufacturer. Second, if an instruction loads \eg{} the integer ports,
it may have one or several \uops{} executed on the integer ports; the
manual only helps to some extent in determining this.

\medskip{}

We instead use an approach akin to \palmed{}'s saturating kernels, itself
inspired by Agner Fog's method to identify ports in the absence of hardware
counters~\cite{AgnerFog}. To this end, we assume the availability of a port
mapping for the backend. In the case of the Cortex A72, we use \palmed{}'s
output, sometimes manually cross-checked against the software optimisation
guide~\cite{ref:a72_optim}; \uopsinfo{} could also be used on the architectures
it models.

The \palmed{} resource mapping we use as a basis is composed of 1\,975
instructions. To make this more manageable in a semi-automated method, we
reuse the instruction classes provided by \palmed{}, introduced in
\autoref{sec:palmed_design}: instructions in the same class are mapped to
the same resources, and are thus decomposed into the same \uops{}. This results
in only 98 classes of instructions.
\paragraph{Basic instructions.} We use \palmed{}'s mapping to hand-pick
\emph{basic instructions}: for each port, we select one instruction which
decodes into a single \uop{} executed by this port. We use the following
instructions, in \pipedream{}'s notation:

\begin{itemize}
    \item{} integer 0/1: \lstarmasm{ADC_RD_X_RN_X_RM_X}, \eg{}
        \lstarmasm{adc x0, x1, x2};
    \item{} integer multi-cycle: \lstarmasm{MUL_RD_W_RN_W_RM_W}, \eg{}
        \lstarmasm{mul w0, w1, w2};
    \item{} load: \lstarmasm{LDR_RT_X_ADDR_REGOFF}, \eg{}
        \lstarmasm{ldr x0, [x1, x2]};
    \item{} store: \lstarmasm{STR_RT_X_ADDR_REGOFF}, \eg{}
        \lstarmasm{str x0, [x1, x2]};
    \item{} FP/SIMD 0: \lstarmasm{FRINTA_FD_D_FN_D}, \eg{}
        \lstarmasm{frinta d0, d1} (floating-point rounding to integral);
    \item{} FP/SIMD 1: \lstarmasm{FCMP_FN_D_FM_D}, \eg{}
        \lstarmasm{fcmp d0, d1} (floating-point comparison);
    \item{} FP/SIMD 0/1: \lstarmasm{FMIN_FD_D_FN_D_FM_D}, \eg{}
        \lstarmasm{fmin d0, d1, d1} (floating-point minimum);
    \item (Branch: no instruction, as they are unsupported by \pipedream{}).
\end{itemize}

As the integer ports are not specialised, a single basic instruction is
sufficient for both of them. The FP/SIMD ports are slightly specialised
(see \autoref{sec:a72_descr}); we thus use three basic instructions: one for
each port taken independently, and one that stresses both without
distinction.

For each of these ports, we denote by $\basic{p}$ the basic instruction for
port \texttt{p}; \eg{}, $\basic{Int01}$ is \lstarmasm{ADC_RD_X_RN_X_RM_X}.
\paragraph{Counting the micro-ops of an instruction.} There are three main
sources of bottleneck for a kernel $\kerK$: backend, frontend and dependencies.
When measuring the execution time with \pipedream{}, we eliminate (as far as
possible) the dependencies, leaving us with only backend and frontend. We
denote by $\cycF{\kerK}$ the execution time of $\kerK$ if it were limited only
by its frontend, and by $\cycB{\kerK}$ the execution time of $\kerK$ if it were
limited only by its backend. If we consider a kernel $\kerK$ that is simple
enough to exhibit a purely linear frontend behaviour (that is,
$\cycF{\kerK}$ is a linear function of the number of \uops{} in the kernel), we
then know that either $\cyc{\kerK} = \cycF{\kerK}$ or
$\cyc{\kerK} = \cycB{\kerK}$.

For a given instruction $i$, we then construct a sequence $\kerK_k$ of kernels
such that:
\begin{enumerate}[(i)]
    \item\label{cnd:kerKk:compo} for all $k \in \nat$, $\kerK_k$ is composed
        of the instruction $i$, followed by $k$ basic instructions;
    \item\label{cnd:kerKk:linear} the kernels $\kerK_k$ are simple enough to
        exhibit this purely linear frontend behaviour;
    \item\label{cnd:kerKk:fbound} after a certain rank,
        $\cycB{\kerK_k} \leq \cycF{\kerK_k}$.
\end{enumerate}

We denote by $\mucount{}\kerK$ the number of \uops{} in kernel $\kerK$. Under
condition~(\ref{cnd:kerKk:linear}), we have for any $k \in \nat$:
\begin{align*}
    \cycF{\kerK_k} &= \dfrac{\mucount{}\left(\kerK_k\right)}{3} & \text{for the A72} \\
    &= \dfrac{\mucount{}i + k}{3} & \text{by condition (\ref{cnd:kerKk:compo})} \\
    &\geq \dfrac{k+1}{3} & \text{as $\mucount{}i \geq 1$}
\end{align*}

We pick $k_0 := 3 \ceil{\cyc{\imath}} - 1$. Thus, we have
$\ceil{\cyc{\imath}} \leq \cycF{\kerK_{k_0}} \leq \cyc{\kerK_{k_0}}$.
Condition (\ref{cnd:kerKk:fbound}) can then be replaced by the sufficient
condition $\cycB{\kerK_{k_0}} \leq \ceil{\cyc{\imath}}$, which we know to be
true if the load from $\kerK_{k_0}$ on each port does not exceed
$\ceil{\cyc{\imath}}$ (as execution takes at least this number of cycles).

We build $\kerK_{k_0}$ by adding basic instructions to $i$, using the port
mapping to pick basic instructions that do not load a port over
$\ceil{\cyc{\imath}}$. This is always possible, as we can load independently
seven ports (leaving out the branch port), while each instruction can load at
most three ports per cycle it takes to execute (each \uop{} is executed by a
single port, and only three \uops{} can be dispatched per cycle), leaving
four ports under-loaded. We build $\kerK_{k_0 + 1}$ the same way, still not
loading a port over $\ceil{\cyc{\imath}}$; in particular, we still have
$\cycB{\kerK_{k_0 + 1}} \leq \ceil{\cyc{\imath}} \leq \cycF{\kerK_{k_0+1}}$. To
ensure that condition (\ref{cnd:kerKk:linear}) is valid, as we will see later in
\qtodo{ref}, we spread instructions loading the same port as far apart as
possible: for instance, $i + \basic{Int01} + \basic{FP01} + \basic{Int01}$ is
preferred over $i + 2\times \basic{Int01} + \basic{FP01}$.

Unless condition (\ref{cnd:kerKk:linear}) is not met or our port model is
incorrect for this instruction, we should measure
$\ceil{\cyc{\imath}} \leq \cyc{\kerK_{k_0}}$ and $\cyc{\kerK_{k_0}} +
\sfrac{1}{3} = \cyc{\kerK_{k_0+1}}$. For instructions $i$ where this is not the
case, increasing $k_0$ by 3 or using other basic instructions eventually
yielded satisfactory measurements. We finally obtain
\[
    \mucount{}i = 3 \cyc{\kerK_{k_0}} - k_0
\]
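To make the choice of $k_0$ concrete, consider a hypothetical instruction $i$,
used here only for illustration, with $\cyc{\imath} = 1.5$ and for which we
would measure $\cyc{\kerK_{k_0}} = 3$; these figures are assumptions, not
measurements. We would then have
\begin{align*}
    k_0 &= 3 \ceil{1.5} - 1 = 5 \\
    \cycF{\kerK_{5}} &\geq \dfrac{5 + 1}{3} = 2 = \ceil{\cyc{\imath}} \\
    \mucount{}i &= 3 \times 3 - 5 = 4
\end{align*}
and the consistency check would expect
$\cyc{\kerK_{6}} = 3\,\sfrac{1}{3}$ cycles.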
\medskip{}

Applying this procedure manually to each instruction class provides us with a
model mapping each supported instruction of the ISA to its \uop{} count.
\begin{example}[\uop{} count measure: \lstarmasm{ADC_RD_X_RN_X_RM_X}]
    We measure the \uop{} count of $i =$ \lstarmasm{ADC_RD_X_RN_X_RM_X}, our
    basic instruction for the \texttt{Int01} port.

    We measure $\cyc{\imath} = 0.51 \simeq \sfrac{1}{2}\,\text{cycle}$; hence,
    $k_0 = 3 \ceil{\cyc{\imath}} - 1 = 2$, and we consider $\kerK_2$ and
    $\kerK_3$. Our mapping indicates that this instruction loads only the
    \texttt{Int01} port with a load of $\sfrac{1}{2}$.

    We select \eg{} $\kerK_2 = i + 2\times \basic{FP01}$ and $\kerK_3 = i +
    \basic{FP01} + \basic{Ld} + \basic{FP01}$.

    We measure
    \begin{itemize}
        \item $\cyc{\kerK_2} = 1.01 \simeq 1\,\text{cycle}$
        \item $\cyc{\kerK_3} = 1.35 \simeq 1\,\sfrac{1}{3}\,\text{cycles}$
    \end{itemize}
    which is consistent. We conclude that, as expected, $\mucount i =
    3\cyc{\kerK_2} - k_0 = 3-2 = 1$.
\end{example}
\begin{example}[\uop{} count measure: \lstarmasm{ADDV_FD_H_VN_V_8H}]
    We measure the \uop{} count of $i =$ \lstarmasm{ADDV_FD_H_VN_V_8H}, the
    SIMD ``add across vector'' operation on a vector of eight sixteen-bit
    operands.

    We measure $\cyc{\imath} = 1.01 \simeq 1\,\text{cycle}$; hence,
    $k_0 = 3 \ceil{\cyc{\imath}} - 1 = 2$, and we consider $\kerK_2$ and
    $\kerK_3$. Our mapping indicates that this instruction loads
    the \texttt{FP1} port with a load of $1$, and the \texttt{FP01} port with a
    load of $1$\footnote{The \texttt{FP01} port has a throughput of 2, hence a
        load of 1 means two \uops{}. As there is already a \uop{} loading the
        \texttt{FP1} port, which also loads the combined port \texttt{FP01},
        this can be understood as one \uop{} on \texttt{FP1} exclusively, plus one
    on either \texttt{FP0} or \texttt{FP1}.}.

    We select \eg{} $\kerK_2 = i + 2\times \basic{Int01}$ and $\kerK_3 = i +
    \basic{Int01} + \basic{Ld} + \basic{Int01}$.

    We measure
    \begin{itemize}
        \item $\cyc{\kerK_2} = 1.35 \simeq 1\,\sfrac{1}{3}\,\text{cycles}$
        \item $\cyc{\kerK_3} = 1.68 \simeq 1\,\sfrac{2}{3}\,\text{cycles}$
    \end{itemize}
    which is consistent. We conclude that $\mucount i = 3\cyc{\kerK_2} - k_0 =
    4-2 = 2$.
\end{example}
\subsection{Bubbles in the pipeline}
The frontend, however, does not always exhibit a purely linear behaviour. We
consider for instance the kernel $\kerK =$ \lstarmasm{ADDV_FD_H_VN_V_8H} $+
3\times\basic{Int01}$; for the rest of this chapter, we refer to
\lstarmasm{ADDV_FD_H_VN_V_8H} simply as \lstarmasm{ADDV} unless stated
otherwise.

Backend-wise, \texttt{ADDV} fully loads \texttt{FP1} and \texttt{FP01}, while
each $\basic{Int01}$ half-loads \texttt{Int01}. The port most loaded by $\kerK$
is thus \texttt{Int01}, with a load of
$3 \times \sfrac{1}{2} = 1\,\sfrac{1}{2}$. We then expect
$\cycB{\kerK} = 1\,\sfrac{1}{2}$.

Frontend-wise, \texttt{ADDV} decomposes into two \uops{}, while each
$\basic{Int01}$ decomposes into a single \uop{}; thus,
$\mucount{}\kerK = 2 + 3 = 5$. We then expect
$\cycF{\kerK} = \sfrac{5}{3} = 1\,\sfrac{2}{3}$.

As the frontend dominates the backend, we expect $\cyc{\kerK} = \cycF{\kerK} =
1\,\sfrac{2}{3}$. However, in reality, we measure $\cyc{\kerK} = 2.01 \simeq 2$
cycles.

\medskip{}

From there on, we strive to find a model that reliably predicts, given a
kernel, how many cycles it requires to execute, frontend-wise, in
steady state.
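
One way to read the gap observed on the kernel $\kerK$ above is in terms of
dispatch slots. Assuming the frontend offers exactly three slots per cycle, as
stated earlier, the two measured cycles leave
\[
    2 \times 3 - \mucount{}\kerK = 6 - 5 = 1
\]
slot unused per iteration, where the linear model expects none. This is only a
reading of the measurement under that assumption; it does not say where in the
iteration the bubble occurs.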
\subsubsection{No-cross model}
\begin{figure}
    \centering
    \includegraphics[width=0.7\linewidth]{timeline_front_nocross.svg}
\end{figure}
On the x86-64 architectures they analysed, \uica{}'s authors find that the
CPU's predecoder might cause an instruction's \uops{} to be postponed to the
next cycle if it is pre-decoded across a cycle boundary~\cite[§4.1]{uica}.

We hypothesise that the same kind of effect could postpone an instruction's
\uops{} until the next cycle if its \uops{} would otherwise cross a cycle
boundary, as illustrated in \qtodo{ref}.
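
As a sketch of what this hypothesis would imply, take again the kernel
$\kerK = \texttt{ADDV} + 3\times\basic{Int01}$. We assume, as a simplification
made here for illustration only, that \uops{} are dispatched in program order,
at most three per cycle, and that an instruction whose \uops{} do not all fit
in the current cycle is postponed as a whole. Two consecutive iterations of
$\kerK$ would then be dispatched as follows:
\begin{center}
    \begin{tabular}{c l c}
        \hline
        Cycle & Instructions dispatched & \uops{} \\
        \hline
        1 & \texttt{ADDV}, $\basic{Int01}$ & 3 \\
        2 & $\basic{Int01}$, $\basic{Int01}$ & 2 \\
        3 & \texttt{ADDV}, $\basic{Int01}$ & 3 \\
        4 & $\basic{Int01}$, $\basic{Int01}$ & 2 \\
        \hline
    \end{tabular}
\end{center}
In cycle 2, a single slot remains, which is not enough for the two \uops{} of
the next \texttt{ADDV}; under the assumptions above, it is postponed to
cycle 3. Each iteration would thus take $2$ cycles instead of
$1\,\sfrac{2}{3}$, which agrees with the measurement for this particular
kernel.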