\section{Manually modelling the A72 frontend}
\todo{}

\subsection{Finding the micro-operation count for each instruction}

As we saw in \autoref{sec:a72_descr}, the Cortex A72's frontend can only dispatch three \uops{} per cycle. The first important data to collect is thus the number of \uops{} each instruction is decoded into. To that end, the optimisation manual~\cite{ref:a72_optim} helps, but is not thorough enough: for each instruction, it lists the ports on which load is incurred, which only sets a lower bound on the number of \uops{} the instruction is decomposed into. This approach, however, is not really satisfying. First, it cannot be reproduced for another architecture whose optimisation manual is not as detailed, cannot be automated, and fully trusts the manufacturer. Second, if an instruction loads \eg{} the integer ports, it may have a single or multiple \uops{} executed on those ports; the manual is only of limited help in determining which. \medskip{}

We instead use an approach akin to \palmed{}'s saturating kernels, itself inspired by Agner Fog's method to identify ports in the absence of hardware counters~\cite{AgnerFog}. To this end, we assume the availability of a port mapping for the backend ---~in the case of the Cortex A72, we use \palmed{}'s output, sometimes manually cross-checked against the software optimisation guide~\cite{ref:a72_optim}; \uopsinfo{} could also be used on the architectures it models. The \palmed{} resource mapping we use as a basis comprises 1\,975 instructions. To make this more manageable in a semi-automated method, we reuse the instruction classes provided by \palmed{}, introduced in \autoref{sec:palmed_design}: as instructions in the same class are mapped to the same resources, they are decomposed into the same \uops{}. This leaves only 98 classes of instructions.
\paragraph{Basic instructions.} We use \palmed{}'s mapping to hand-pick \emph{basic instructions}: for each port, we select one instruction which decodes into a single \uop{} executed by this port. We use the following instructions, in \pipedream{}'s notation:
\begin{itemize}
\item{} integer 0/1: \lstarmasm{ADC_RD_X_RN_X_RM_X}, \eg{} \lstarmasm{adc x0, x1, x2};
\item{} integer multi-cycle: \lstarmasm{MUL_RD_W_RN_W_RM_W}, \eg{} \lstarmasm{mul w0, w1, w2};
\item{} load: \lstarmasm{LDR_RT_X_ADDR_REGOFF}, \eg{} \lstarmasm{ldr x0, [x1, x2]};
\item{} store: \lstarmasm{STR_RT_X_ADDR_REGOFF}, \eg{} \lstarmasm{str x0, [x1, x2]};
\item{} FP/SIMD 0: \lstarmasm{FRINTA_FD_D_FN_D}, \eg{} \lstarmasm{frinta d0, d1} (floating-point rounding to integral);
\item{} FP/SIMD 1: \lstarmasm{FCMP_FN_D_FM_D}, \eg{} \lstarmasm{fcmp d0, d1} (floating-point comparison);
\item{} FP/SIMD 0/1: \lstarmasm{FMIN_FD_D_FN_D_FM_D}, \eg{} \lstarmasm{fmin d0, d1, d1} (floating-point minimum);
\item{} (branch: no basic instruction, as branch instructions are unsupported by \pipedream{}).
\end{itemize}
As the integer ports are not specialized, a single basic instruction is sufficient for both of them. The FP/SIMD ports are slightly specialized (see \autoref{sec:a72_descr}); we thus use three basic instructions: two that each stress one of these ports independently, and one that stresses both without distinction. For each of these ports, we denote by $\basic{p}$ the basic instruction for port \texttt{p}; \eg{}, $\basic{Int01}$ is \lstarmasm{ADC_RD_X_RN_X_RM_X}.

\paragraph{Counting the micro-ops of an instruction.} There are three main sources of bottleneck for a kernel $\kerK$: the backend, the frontend and dependencies. When measuring the execution time with \pipedream{}, we eliminate dependencies as far as possible, leaving only the backend and the frontend. We denote by $\cycF{\kerK}$ the execution time of $\kerK$ if it were only limited by its frontend, and by $\cycB{\kerK}$ the execution time of $\kerK$ if it were only limited by its backend.
If we consider a kernel $\kerK$ that is simple enough to exhibit a purely linear frontend behaviour ---~that is, the frontend's throughput is a linear function of the number of \uops{} in the kernel~---, we then know that either $\cyc{\kerK} = \cycF{\kerK}$ or $\cyc{\kerK} = \cycB{\kerK}$. For a given instruction $i$, we then construct a sequence $\kerK_k$ of kernels such that:
\begin{enumerate}[(i)]
\item\label{cnd:kerKk:compo} for all $k \in \nat$, $\kerK_k$ is composed of the instruction $i$, followed by $k$ basic instructions;
\item\label{cnd:kerKk:linear} the kernels $\kerK_k$ are simple enough to exhibit this purely linear frontend behaviour;
\item\label{cnd:kerKk:fbound} after a certain rank, $\cycB{\kerK_k} \leq \cycF{\kerK_k}$.
\end{enumerate}
We denote by $\mucount{}\kerK$ the number of \uops{} in kernel $\kerK$. Under condition~(\ref{cnd:kerKk:linear}), we have for any $k \in \nat$
\begin{align*} \cycF{\kerK_k} &= \dfrac{\mucount{}\left(\kerK_k\right)}{3} & \text{for the A72} \\ &= \dfrac{\mucount{}i + k}{3} & \text{by condition (\ref{cnd:kerKk:compo})} \\ &\geq \dfrac{k+1}{3} \end{align*}
We pick $k_0 := 3 \ceil{\cyc{\imath}} - 1$. Thus, we have $\ceil{\cyc{\imath}} \leq \cycF{\kerK_{k_0}} \leq \cyc{\kerK_{k_0}}$. Condition (\ref{cnd:kerKk:fbound}) can then be relaxed to $\cycB{\kerK_{k_0}} \leq \ceil{\cyc{\imath}}$, which we know to be true if the load from $\kerK_{k_0}$ on each port does not exceed $\ceil{\cyc{\imath}}$ (as the backend execution time is given by the most loaded port). We build $\kerK_{k_0}$ by adding basic instructions to $i$, using the port mapping to pick basic instructions that do not load any port over $\ceil{\cyc{\imath}}$. This is always possible, as we can independently load seven ports (leaving out the branch port), while each instruction can load at most three ports per cycle it takes to execute ---~each \uop{} is executed by a single port, and only three \uops{} can be dispatched per cycle~---, leaving four ports under-loaded.
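This construction can be sketched in Python. The sketch below is only illustrative: the port names follow the text, but the per-port loads attached to each basic instruction are placeholder values rather than measured ones, and \texttt{build\_kernel} is a hypothetical helper, not part of \palmed{} or \pipedream{}.

```python
from math import ceil

# Illustrative per-port loads of the basic instructions (placeholder
# values, not measurements): each basic instruction is a single uop
# executed by its port.
BASIC_LOADS = {
    "Int01": {"Int01": 0.5},
    "IntM":  {"IntM": 1.0},
    "Ld":    {"Ld": 0.5},
    "St":    {"St": 0.5},
    "FP0":   {"FP0": 1.0, "FP01": 0.5},
    "FP1":   {"FP1": 1.0, "FP01": 0.5},
    "FP01":  {"FP01": 0.5},
}

def build_kernel(cyc_i, load_i):
    """Greedily pick k0 = 3*ceil(cyc_i) - 1 basic instructions to
    append to instruction i, such that no port's cumulated load
    exceeds ceil(cyc_i); `load_i` maps port -> load incurred by i."""
    bound = ceil(cyc_i)
    k0 = 3 * bound - 1
    load = dict(load_i)      # running per-port load of the kernel
    kernel = []
    for _ in range(k0):
        # pick any basic instruction that keeps every port <= bound
        for name, ports in BASIC_LOADS.items():
            if all(load.get(p, 0) + l <= bound for p, l in ports.items()):
                for p, l in ports.items():
                    load[p] = load.get(p, 0) + l
                kernel.append(name)
                break
        else:
            raise ValueError("no basic instruction fits")
    return kernel
```

For instance, with an instruction measured at $\cyc{\imath} = 0.51$ and loading only \texttt{Int01} by $\sfrac{1}{2}$, this yields $k_0 = 2$ basic instructions, none of which pushes a port's load over $\ceil{\cyc{\imath}} = 1$.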
We build $\kerK_{k_0 + 1}$ the same way, still not loading any port over $\ceil{\cyc{\imath}}$; in particular, we still have $\cycB{\kerK_{k_0 + 1}} \leq \ceil{\cyc{\imath}} \leq \cycF{\kerK_{k_0+1}}$. To ensure that condition (\ref{cnd:kerKk:linear}) holds, as we will see later in \qtodo{ref}, we spread instructions loading the same port as much as possible: for instance, $i + \basic{Int01} + \basic{FP01} + \basic{Int01}$ is preferred over $i + 2\times \basic{Int01} + \basic{FP01}$. Unless condition (\ref{cnd:kerKk:linear}) fails or our port model is incorrect for this instruction, we should measure $\ceil{\cyc{\imath}} \leq \cyc{\kerK_{k_0}}$ and $\cyc{\kerK_{k_0}} + \sfrac{1}{3} = \cyc{\kerK_{k_0+1}}$. For the instructions $i$ for which this is not the case, increasing $k_0$ by 3 or using other basic instructions eventually yielded satisfying measures. Finally, we obtain
\[ \mucount{}i = 3 \cyc{\kerK_{k_0}} - k_0 \]
\medskip{}
Applying this procedure manually to each instruction class provides us with a model mapping each supported instruction of the ISA to its \uop{} count.
\begin{example}[\uop{} count measure: \lstarmasm{ADC_RD_X_RN_X_RM_X}]
We measure the \uop{} count of $i =$ \lstarmasm{ADC_RD_X_RN_X_RM_X}, our basic instruction for the integer ports. We measure $\cyc{\imath} = 0.51 \simeq \sfrac{1}{2}\,\text{cycle}$; hence, $k_0 = 2$ and we consider $\kerK_2$ and $\kerK_3$. Our mapping indicates that this instruction loads only the \texttt{Int01} port, with a load of $\sfrac{1}{2}$. We select \eg{} $\kerK_2 = i + 2\times \basic{FP01}$ and $\kerK_3 = i + \basic{FP01} + \basic{Ld} + \basic{FP01}$. We measure
\begin{itemize}
\item $\cyc{\kerK_2} = 1.01 \simeq 1\,\text{cycle}$
\item $\cyc{\kerK_3} = 1.35 \simeq 1\,\sfrac{1}{3}\,\text{cycles}$
\end{itemize}
which is consistent. We conclude that, as expected, $\mucount i = 3\cyc{\kerK_2} - 2 = 3-2 = 1$.
\end{example}
\begin{example}[\uop{} count measure: \lstarmasm{ADDV_FD_H_VN_V_8H}]
We measure the \uop{} count of $i =$ \lstarmasm{ADDV_FD_H_VN_V_8H}, the SIMD ``add across vector'' operation on a vector of eight sixteen-bit operands. We measure $\cyc{\imath} = 1.01 \simeq 1\,\text{cycle}$; hence, $k_0 = 2$ and we consider $\kerK_2$ and $\kerK_3$. Our mapping indicates that this instruction loads the \texttt{FP1} port with a load of $1$, and the \texttt{FP01} port with a load of $1$\footnote{The \texttt{FP01} port has a throughput of 2, hence a load of 1 means two \uops{}. As there is already a \uop{} loading the \texttt{FP1} port, which also loads the combined port \texttt{FP01}, this can be understood as one \uop{} on \texttt{FP1} exclusively, plus one on either \texttt{FP0} or \texttt{FP1}.}. We select \eg{} $\kerK_2 = i + 2\times \basic{Int01}$ and $\kerK_3 = i + \basic{Int01} + \basic{Ld} + \basic{Int01}$. We measure
\begin{itemize}
\item $\cyc{\kerK_2} = 1.35 \simeq 1\,\sfrac{1}{3}\,\text{cycles}$
\item $\cyc{\kerK_3} = 1.68 \simeq 1\,\sfrac{2}{3}\,\text{cycles}$
\end{itemize}
which is consistent. We conclude that $\mucount i = 3\cyc{\kerK_2} - 2 = 4-2 = 2$.
\end{example}

\subsection{Bubbles in the pipeline}

The frontend, however, does not always exhibit a purely linear behaviour. Consider for instance the kernel $\kerK =$ \lstarmasm{ADDV_FD_H_VN_V_8H} $+ 3\times\basic{Int01}$; for the rest of this chapter, we refer to \lstarmasm{ADDV_FD_H_VN_V_8H} simply as \lstarmasm{ADDV} unless stated otherwise. Backend-wise, \texttt{ADDV} fully loads \texttt{FP1} and \texttt{FP01}, while $\basic{Int01}$ half-loads \texttt{Int01}. The port most loaded by $\kerK$ is thus \texttt{Int01}, with a load of $1\,\sfrac{1}{2}$; we then expect $\cycB{\kerK} = 1\,\sfrac{1}{2}$. Frontend-wise, \texttt{ADDV} decomposes into two \uops{}, while $\basic{Int01}$ decomposes into a single \uop{}; thus, $\mucount{}\kerK = 5$, and we expect $\cycF{\kerK} = 1\,\sfrac{2}{3}$.
As the frontend dominates the backend, we expect $\cyc{\kerK} = \cycF{\kerK} = 1\,\sfrac{2}{3}$. However, in reality, we measure $\cyc{\kerK} = 2.01 \simeq 2$ cycles. \medskip{}

From then on, we strive to find a model that reliably predicts, given a kernel, how many cycles it requires to execute, frontend-wise, in steady state.

\subsubsection{No-cross model}

\begin{figure} \centering \hfill\begin{minipage}[c]{0.25\linewidth} \centering \includegraphics[width=3cm]{timeline_front_ex1_linear.svg}\\ \textit{With linear frontend} \end{minipage}\begin{minipage}[c]{0.2\linewidth} \centering \Huge$\rightarrow$ \end{minipage}\begin{minipage}[c]{0.25\linewidth} \centering \includegraphics[width=3cm]{timeline_front_ex1_nocross.svg}\\ \textit{With no-cross frontend} \end{minipage}\hfill~ \caption{Illustration of the no-cross frontend model. Rows represent CPU cycles.}\label{fig:frontend_nocross} \end{figure}

On the x86-64 architectures they analyzed, \uica{}'s authors found that the CPU's predecoder may postpone an instruction's \uops{} to the next cycle if the instruction is pre-decoded across a cycle boundary~\cite{uica} (\S4.1). We hypothesize that the same kind of effect could postpone an instruction's \uops{} until the next cycle if they would otherwise cross a cycle boundary. This behaviour is illustrated in \autoref{fig:frontend_nocross}, with a kernel composed of three instructions: the first two each decode to a single \uop{}, while the third one decodes to two \uops{}. In this figure, each row represents a CPU cycle, while each square represents a \uop{}-slot in the frontend; there are thus three squares in each row. In the no-cross case (right), the constraint forces the third instruction to start decoding at the beginning of the second cycle, leaving a ``bubble'' in the frontend during the first cycle.
\medskip{}
\begin{wrapfigure}{R}{0.2\linewidth} \centering \vspace{-2em} \includegraphics[width=3cm]{timeline_front_nocross_addv_3add.svg} \caption{No-cross frontend for $\texttt{ADDV} + 3\times\basic{Int01}$}\label{fig:frontend_nocross_addv_3add} \end{wrapfigure}
This model explains the $\kerK = \texttt{ADDV} + 3\times\basic{Int01}$ example introduced above, as depicted in \autoref{fig:frontend_nocross_addv_3add}, where $\kerK$ is represented twice to ensure that the steady state is reached. Here, the frontend indeed requires two full cycles to issue $\kerK$, which is consistent with our measure. \medskip{}

The notion of steady state is, in the general case, not as straightforward: it may well be that, after executing the kernel once, the second iteration of the kernel does not begin at a cycle boundary. The state of the model, however, is entirely defined by the number $s \in \left\{0,1,2\right\}$ of \uops{} already decoded in the current cycle. Thus, if the state $s$ at the end of a full execution of the kernel is equal to the state encountered $k$ kernel iterations earlier, a steady state is reached for this portion: further executing the kernel $k$ times will bring us back to the same state. The steady-state execution time, frontend-wise, of a kernel is then the number of cycles elapsed between the beginning and the end of the steady-state pattern (as the start and end states are the same), divided by the number of kernel repetitions inside the pattern. The no-cross model is formalized by the \texttt{next\_state} function defined in Python in \autoref{lst:nocross_next_state}.
\begin{lstfloat} \lstinputlisting[language=Python]{assets/src/40_A72-frontend/nocross_next.py} \caption{Implementation of the \texttt{next\_state} function for the no-cross frontend model}\label{lst:nocross_next_state} \end{lstfloat}
\medskip{}
There are two main phases when repeatedly applying the \texttt{next\_state} function.
Consider the following example of a graph representation of the \texttt{next\_state} function, ignoring the \texttt{cycles\_started} return value:
\begin{center} \begin{tikzpicture}[ state/.style={circle,draw=black, thick, minimum size=0.5cm, align=center} ] \node[state] at (0, 0) (0) {$0$}; \node[state] at (3, 0) (1) {$1$}; \node[state] at (6, 0) (2) {$2$}; \draw[->, thick] (0) to[bend left] (1); \draw[->, thick] (1) to[bend left] (2); \draw[->, thick] (2) to[bend left] (1); \end{tikzpicture} \end{center}
When repeatedly applied starting from $0$, the \texttt{next\_state} function yields the sequence $0, 1, 2, 1, 2, 1, 2, \ldots$. The first iteration brings us to state $1$, which belongs to the steady state; starting from there, the next iterations loop through the steady state.
\begin{wrapfigure}{R}{0.2\linewidth} \centering \vspace{-2em} \includegraphics[width=3cm]{timeline_front_nocross_addv_2add.svg} \caption{No-cross frontend for $\texttt{ADDV} + 2\times\basic{Int01}$}\label{fig:frontend_nocross_addv_2add} \end{wrapfigure}
In the general case, the model iterates the \texttt{next\_state} function starting from state $0$ until a previously-encountered state is reached, which requires at most three iterations. At this point, the steady state is reached. The function is then iterated further until this same state is encountered again, which also requires at most three iterations. The number of cycles elapsed during this second phase, divided by the number of iterations of the function, is returned as the predicted steady-state execution time of the kernel, frontend-wise. \bigskip{}

This model, however, is not satisfactory in many cases. For instance, the kernel $\kerK' = \texttt{ADDV} + 2\times\basic{Int01}$ is predicted to run in $1.5$ cycles, as depicted in \autoref{fig:frontend_nocross_addv_2add}; however, a \pipedream{} measure yields $\cyc{\kerK'} = 1.35 \simeq 1\,\sfrac{1}{3}$ cycles.
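The two-phase steady-state search described above can be sketched in Python. This is a minimal re-implementation under the stated assumptions (three dispatch slots per cycle, bubble inserted whenever an instruction's \uops{} would cross a cycle boundary), not the \texttt{next\_state} listing itself; it represents the state $s$ as an absolute \uop{}-slot index taken modulo 3.

```python
def predict_cycles(kernel):
    """Predicted steady-state cycles per kernel iteration under the
    no-cross model; `kernel` lists the uop count of each instruction,
    and the frontend dispatches 3 uops per cycle."""
    slot = 0          # absolute uop-slot index since the start
    seen = {}         # state (slot % 3) -> (iteration, slot)
    iteration = 0
    while True:
        state = slot % 3
        if state in seen:
            # Same end-of-kernel state as k iterations earlier:
            # steady state. Cycles elapsed per kernel iteration =
            # slots elapsed / 3 (same residue), / iterations.
            it0, slot0 = seen[state]
            return (slot - slot0) / 3 / (iteration - it0)
        seen[state] = (iteration, slot)
        for uops in kernel:
            if slot % 3 and slot % 3 + uops > 3:
                slot += 3 - slot % 3   # bubble: jump to the next cycle
            slot += uops
        iteration += 1
```

On \uop{} counts $(2, 1, 1, 1)$, i.e. $\texttt{ADDV} + 3\times\basic{Int01}$, this returns the $2$ cycles discussed above; on $(2, 1, 1)$, i.e. $\kerK'$, it returns the $1.5$ cycles that the no-cross model predicts but \pipedream{} refutes.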