\section{Manually modelling the A72 frontend}
\todo{}

\subsection{Finding micro-operation count for each instruction}

As we saw in \autoref{sec:a72_descr}, the Cortex A72's frontend can only dispatch three \uops{} per cycle. The first important piece of data to collect is thus the number of \uops{} each instruction is decoded into.
To that end, the optimisation manual~\cite{ref:a72_optim} helps, but is not thorough enough: for each instruction, it lists the ports on which load is incurred, which sets a lower bound on the number of \uops{} the instruction is decomposed into. This approach, however, is not fully satisfactory. First, because it cannot be reproduced for another architecture whose optimisation manual is not as detailed, cannot be automated, and requires fully trusting the manufacturer. Second, because if an instruction loads \eg{} the integer ports, it may have a single or multiple \uops{} executed on the integer ports; the manual is only helpful to some extent to determine this.

\medskip{}

We instead use an approach akin to \palmed{}'s saturating kernels, itself inspired by Agner Fog's method to identify ports in the absence of hardware counters~\cite{AgnerFog}. To this end, we assume the availability of a port mapping for the backend ---~in the case of the Cortex A72, we use \palmed{}'s output, sometimes manually cross-checked against the software optimisation guide~\cite{ref:a72_optim}; \uopsinfo{} could also be used on the architectures it models.
The \palmed{} resource mapping we use as a basis is composed of 1\,975 instructions. To make this more manageable in a semi-automated method, we reuse the instruction classes provided by \palmed{}, introduced in \autoref{sec:palmed_design}, as instructions in the same class are mapped to the same resources, and thus are decomposed into the same \uops{}; this results in only 98 classes of instructions.
\paragraph{Basic instructions.} We use \palmed{}'s mapping to hand-pick \emph{basic instructions}: for each port, we select one instruction which decodes into a single \uop{} executed by this port. We use the following instructions, in \pipedream{}'s notation:
\begin{itemize}
  \item{} integer 0/1: \lstarmasm{ADC_RD_X_RN_X_RM_X}, \eg{} \lstarmasm{adc x0, x1, x2};
  \item{} integer multi-cycle: \lstarmasm{MUL_RD_W_RN_W_RM_W}, \eg{} \lstarmasm{mul w0, w1, w2};
  \item{} load: \lstarmasm{LDR_RT_X_ADDR_REGOFF}, \eg{} \lstarmasm{ldr x0, [x1, x2]};
  \item{} store: \lstarmasm{STR_RT_X_ADDR_REGOFF}, \eg{} \lstarmasm{str x0, [x1, x2]};
  \item{} FP/SIMD 0: \lstarmasm{FRINTA_FD_D_FN_D}, \eg{} \lstarmasm{frinta d0, d1} (floating-point rounding to integral);
  \item{} FP/SIMD 1: \lstarmasm{FCMP_FN_D_FM_D}, \eg{} \lstarmasm{fcmp d0, d1} (floating-point comparison);
  \item{} FP/SIMD 0/1: \lstarmasm{FMIN_FD_D_FN_D_FM_D}, \eg{} \lstarmasm{fmin d0, d1, d1} (floating-point minimum);
  \item (Branch: no instruction, as branches are unsupported by \pipedream{}).
\end{itemize}
As the integer ports are not specialized, a single basic instruction is sufficient for both of them. The FP/SIMD ports are slightly specialized (see \autoref{sec:a72_descr}); we thus use three basic instructions: one that stresses each of them independently, and one that stresses both without distinction.
For each of these ports, we denote by $\basic{p}$ the basic instruction for port \texttt{p}; \eg{}, $\basic{Int01}$ is \lstarmasm{ADC_RD_X_RN_X_RM_X}.
\paragraph{Counting the micro-ops of an instruction.} There are three main potential bottlenecks for a kernel $\kerK$: backend, frontend and dependencies. When measuring the execution time with \pipedream{}, we eliminate (as far as possible) the dependencies, leaving us with only backend and frontend. We denote by $\cycF{\kerK}$ the execution time of $\kerK$ if it were limited only by its frontend, and by $\cycB{\kerK}$ the execution time of $\kerK$ if it were limited only by its backend.
If we consider a kernel $\kerK$ that is simple enough to exhibit a purely linear frontend behaviour ---~that is, the frontend's throughput is a linear function of the number of \uops{} in the kernel~---, we then know that either $\cyc{\kerK} = \cycF{\kerK}$ or $\cyc{\kerK} = \cycB{\kerK}$.
For a given instruction $i$ and for a certain $k \in \nat$, we then construct a kernel $\kerK_k$ such that:
\begin{enumerate}[(i)]
  \item\label{cnd:kerKk:compo} $\kerK_k$ is composed of the instruction $i$, followed by $k$ basic instructions;
  \item\label{cnd:kerKk:linear} the kernel $\kerK_k$ is simple enough to exhibit this purely linear frontend behaviour;
  \item\label{cnd:kerKk:fbound} $\cycB{\kerK_k} \leq \cycF{\kerK_k}$.
\end{enumerate}
We denote by $\mucount{}\kerK$ the number of \uops{} in kernel $\kerK$. Under the condition~(\ref{cnd:kerKk:linear}), we have for any $k \in \nat$
\begin{align*}
  \cycF{\kerK_k} &= \dfrac{\mucount{}\left(\kerK_k\right)}{3} & \text{for the A72} \\
  &= \dfrac{\mucount{}i + k}{3} & \text{by condition (\ref{cnd:kerKk:compo})} \\
  &\geq \dfrac{k+1}{3} & \text{as } \mucount{}i \geq 1
\end{align*}
We pick $k_0 := 3 \ceil{\cyc{\imath}} - 1$, so that $\sfrac{(k_0+1)}{3} = \ceil{\cyc{\imath}}$. Thus, we have $\ceil{\cyc{\imath}} \leq \cycF{\kerK_{k_0}} \leq \cyc{\kerK_{k_0}}$. Condition (\ref{cnd:kerKk:fbound}) can then be relaxed as $\cycB{\kerK_{k_0}} \leq \ceil{\cyc{\imath}}$, which we know to be true if the load from $\kerK_{k_0}$ on each port does not exceed $\ceil{\cyc{\imath}}$ (as execution takes at least this number of cycles).
We build $\kerK_{k_0}$ by adding basic instructions to $i$, using the port mapping to pick basic instructions that do not load a port over $\ceil{\cyc{\imath}}$. This is always possible, as we can load independently seven ports (leaving out the branch port), while each instruction can load at most three ports per cycle it takes to execute ---~each \uop{} is executed by a single port, and only three \uops{} can be dispatched per cycle~---, leaving four ports under-loaded.
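
\medskip{}

The bookkeeping of this construction can be sketched in Python. This is only an illustration under the linear-frontend assumption of condition~(\ref{cnd:kerKk:linear}): all helper names are hypothetical, and snapping noisy measures to multiples of $\sfrac{1}{6}$ cycle is an assumption of this sketch rather than part of the method.

\begin{lstlisting}[language=Python]
from fractions import Fraction
from math import ceil

DISPATCH_WIDTH = 3  # uops dispatched per cycle by the A72 frontend


def snap(measure, denominator=6):
    """Round a noisy cycle measure to the nearest multiple of 1/denominator.

    Assumption of this sketch: throughputs are multiples of 1/6 cycle.
    """
    return Fraction(round(measure * denominator), denominator)


def pick_k0(cycles_i):
    """k0 = 3 * ceil(cyc(i)) - 1: number of basic instructions to append."""
    return DISPATCH_WIDTH * ceil(cycles_i) - 1


def linear_frontend_cycles(uops_i, k):
    """Frontend-bound time of K_k: uops of i plus k single-uop instructions."""
    return Fraction(uops_i + k, DISPATCH_WIDTH)


def fits_backend_budget(port_loads, cycles_i):
    """True if no port of the candidate kernel is loaded over ceil(cyc(i))."""
    return max(port_loads.values()) <= ceil(cycles_i)


def uop_count(cycles_kernel, k):
    """Invert cycF(K_k) = (uops(i) + k) / 3 for a frontend-bound kernel."""
    return DISPATCH_WIDTH * cycles_kernel - k
\end{lstlisting}

For an instruction measured at about half a cycle, \texttt{pick\_k0} returns 2; if the resulting kernel is then measured at one cycle, \texttt{uop\_count} yields a single \uop{}.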
We build $\kerK_{k_0 + 1}$ the same way, still not loading a port over $\ceil{\cyc{\imath}}$; in particular, we still have $\cycB{\kerK_{k_0 + 1}} \leq \ceil{\cyc{\imath}} \leq \cycF{\kerK_{k_0+1}}$. To ensure that condition (\ref{cnd:kerKk:linear}) is valid, as we will see later in \autoref{sssec:a72_dispatch_queues}, we spread as much as possible instructions loading the same port: for instance, $i + \basic{Int01} + \basic{FP01} + \basic{Int01}$ is preferred over $i + 2\times \basic{Int01} + \basic{FP01}$.
Unless condition (\ref{cnd:kerKk:linear}) is not met or our ports model is incorrect for this instruction, we should measure $\ceil{\cyc{\imath}} \leq \cyc{\kerK_{k_0}}$ and $\cyc{\kerK_{k_0}} + \sfrac{1}{3} = \cyc{\kerK_{k_0+1}}$. For instructions $i$ where it is not the case, increasing $k_0$ by 3 or using other basic instructions eventually yielded satisfactory measurements. Finally, we obtain
\[ \mucount{}i = 3 \cyc{\kerK_{k_0}} - k_0 \]
\medskip{}
Applying this procedure manually on each instruction class provides us with a model mapping each supported instruction of the ISA to its \uop{} count.
\begin{example}[\uop{} count measure: \lstarmasm{ADC_RD_X_RN_X_RM_X}]
We measure the \uop{}-count of $i =$ \lstarmasm{ADC_RD_X_RN_X_RM_X}, our basic instruction for the integer ports. We measure $\cyc{\imath} = 0.51 \simeq \sfrac{1}{2}\,\text{cycle}$; hence $k_0 = 3\ceil{\cyc{\imath}} - 1 = 2$, and we consider $\kerK_2$ and $\kerK_3$. Our mapping indicates that this instruction loads only the \texttt{Int01} port with a load of $\sfrac{1}{2}$. We select \eg{} $\kerK_2 = i + 2\times \basic{FP01}$ and $\kerK_3 = i + \basic{FP01} + \basic{Ld} + \basic{FP01}$. We measure
\begin{itemize}
  \item $\cyc{\kerK_2} = 1.01 \simeq 1\,\text{cycle}$
  \item $\cyc{\kerK_3} = 1.35 \simeq 1\,\sfrac{1}{3}\,\text{cycles}$
\end{itemize}
which is consistent. We conclude that, as expected, $\mucount i = 3\cyc{\kerK_2} - 2 = 3-2 = 1$.
\end{example}
\begin{example}[\uop{} count measure: \lstarmasm{ADDV_FD_H_VN_V_8H}]
We measure the \uop{}-count of $i =$ \lstarmasm{ADDV_FD_H_VN_V_8H}, the SIMD ``add across vector'' operation on a vector of eight sixteen-bit operands. We measure $\cyc{\imath} = 1.01 \simeq 1\,\text{cycle}$; hence $k_0 = 3\ceil{\cyc{\imath}} - 1 = 2$, and we consider $\kerK_2$ and $\kerK_3$. Our mapping indicates that this instruction loads the \texttt{FP1} port with a load of $1$, and the \texttt{FP01} port with a load of $1$\footnote{The \texttt{FP01} port has a throughput of 2, hence a load of 1 means two \uops{}. As there is already a \uop{} loading the \texttt{FP1} port, which also loads the combined port \texttt{FP01}, this can be understood as one \uop{} on \texttt{FP1} exclusively, plus one on either \texttt{FP0} or \texttt{FP1}.}. We select \eg{} $\kerK_2 = i + 2\times \basic{Int01}$ and $\kerK_3 = i + \basic{Int01} + \basic{Ld} + \basic{Int01}$. We measure
\begin{itemize}
  \item $\cyc{\kerK_2} = 1.35 \simeq 1\,\sfrac{1}{3}\,\text{cycles}$
  \item $\cyc{\kerK_3} = 1.68 \simeq 1\,\sfrac{2}{3}\,\text{cycles}$
\end{itemize}
which is consistent. We conclude that $\mucount i = 3\cyc{\kerK_2} - 2 = 4-2 = 2$.
\end{example}

\subsection{Bubbles in the pipeline}

The frontend, however, does not always exhibit a purely linear behaviour. We consider for instance the kernel $\kerK =$ \lstarmasm{ADDV_FD_H_VN_V_8H} $+ 3\times\basic{Int01}$; for the rest of this chapter, we refer to \lstarmasm{ADDV_FD_H_VN_V_8H} as simply \lstarmasm{ADDV} when not stated otherwise.
Backend-wise, \texttt{ADDV} fully loads \texttt{FP1} and \texttt{FP01}, while $\basic{Int01}$ half-loads \texttt{Int01}. The port most loaded by $\kerK$ is thus \texttt{Int01}, with a load of $1\,\sfrac{1}{2}$. We then expect $\cycB{\kerK} = 1\,\sfrac{1}{2}$.
Frontend-wise, \texttt{ADDV} decomposes into two \uops{}, while $\basic{Int01}$ decomposes into a single \uop{}; thus, $\mucount{}\kerK = 5$. We then expect $\cycF{\kerK} = 1\,\sfrac{2}{3}$.
As the frontend dominates the backend, we expect $\cyc{\kerK} = \cycF{\kerK} = 1\,\sfrac{2}{3}$. However, in reality, we measure $\cyc{\kerK} = 2.01 \simeq 2$ cycles.
\medskip{}
From then on, we strive to find a model that can reliably predict, given a kernel, how many cycles it requires to execute, frontend-wise, in a steady-state.

\subsubsection{No-cross model}

\begin{figure}
  \centering
  \hfill\begin{minipage}[c]{0.25\linewidth} \centering \includegraphics[width=3cm]{timeline_front_ex1_linear.svg}\\ \textit{With linear frontend} \end{minipage}\begin{minipage}[c]{0.2\linewidth} \centering \Huge$\rightarrow$ \end{minipage}\begin{minipage}[c]{0.25\linewidth} \centering \includegraphics[width=3cm]{timeline_front_ex1_nocross.svg}\\ \textit{With no-cross frontend} \end{minipage}\hfill~
  \caption{Illustration of the no-cross frontend model. Rows represent CPU cycles.}\label{fig:frontend_nocross}
\end{figure}
On the x86-64 architectures they analyzed, \uica{}'s authors find that the CPU's predecoder might cause an instruction's \uops{} to be postponed to the next cycle if it is pre-decoded across a cycle boundary~\cite{uica} (\S4.1). We hypothesize that the same kind of effect could postpone an instruction's \uops{} until the next cycle if its \uops{} would cross a cycle boundary otherwise. This behaviour is illustrated in \autoref{fig:frontend_nocross}, with a kernel composed of three instructions: the first two each decode to a single \uop{}, while the third one decodes to two \uops{}. In this figure, each row represents a CPU cycle, while each square represents a \uop{}-slot in the frontend; there are thus three squares in each row. In the no-cross case (right), the constraint forces the third instruction to start its decoding at the beginning of the second cycle, leaving a ``bubble'' in the frontend in the first cycle.
\medskip{}
\begin{wrapfigure}{R}{0.2\linewidth}
  \centering
  \vspace{-2em}
  \includegraphics[width=3cm]{timeline_front_nocross_addv_3add.svg}
  \caption{No-cross frontend for $\texttt{ADDV} + 3\times\basic{Int01}$}\label{fig:frontend_nocross_addv_3add}
\end{wrapfigure}
This model explains the $\kerK = \texttt{ADDV} + 3\times\basic{Int01}$ example introduced above, as depicted in \autoref{fig:frontend_nocross_addv_3add}, where $\kerK$ is represented twice, to ensure that the steady-state is reached. Here, the frontend indeed requires two full cycles to issue $\kerK$, which is consistent with our measure.
\medskip{}
The notion of steady-state is, in the general case, not as straightforward: it is quite possible that, after executing the kernel once, the second iteration of the kernel does not begin at a cycle boundary. The state of the model, however, is entirely defined by the number $s \in \left\{0,1,2\right\}$ of \uops{} already decoded this cycle. Thus, if at the end of a full execution of a kernel, $s$ is equal to a state previously encountered at the end of a kernel, $k$ kernel iterations before, steady-state has been reached for this portion: we know that further executing the kernel $k$ times will bring us again to the same state. The steady-state execution time, frontend-wise, of a kernel is then the number of elapsed cycles between the beginning and end of the steady-state pattern (as the start and end state are the same), divided by the number of kernel repetitions inside the pattern.
The no-cross model is formalized by the \texttt{next\_state} function defined in \autoref{lst:nocross_next_state} in Python.
\begin{lstfloat}
  \lstinputlisting[language=Python]{assets/src/40_A72-frontend/nocross_next.py}
  \caption{Implementation of the \texttt{next\_state} function for the no-cross frontend model}\label{lst:nocross_next_state}
\end{lstfloat}
\medskip{}
There are two main phases when repeatedly applying the \texttt{next\_state} function.
Consider the following example of a graph representation of the \texttt{next\_state} function, ignoring the \texttt{cycles\_started} return value:
\begin{center}
  \begin{tikzpicture}[ state/.style={circle,draw=black, thick, minimum size=0.5cm, align=center} ]
    \node[state] at (0, 0) (0) {$0$};
    \node[state] at (3, 0) (1) {$1$};
    \node[state] at (6, 0) (2) {$2$};
    \draw[->, thick] (0) to[bend left] (1);
    \draw[->, thick] (1) to[bend left] (2);
    \draw[->, thick] (2) to[bend left] (1);
  \end{tikzpicture}
\end{center}
When repeatedly applied starting from $0$, the \texttt{next\_state} function will yield the sequence $0, 1, 2, 1, 2, 1, 2, \ldots$. The first iteration brings us to state $1$, which belongs to the steady-state; starting from there, the next iterations will loop through the steady-state.
\begin{wrapfigure}{R}{0.2\linewidth}
  \centering
  \vspace{-2em}
  \includegraphics[width=3cm]{timeline_front_nocross_addv_2add.svg}
  \caption{No-cross frontend for $\texttt{ADDV} + 2\times\basic{Int01}$}\label{fig:frontend_nocross_addv_2add}
\end{wrapfigure}
In the general case, the model iterates the \texttt{next\_state} function starting from state $0$ until a previously-encountered state is reached (this requires at most three iterations). At this point, steady-state is reached. The function is further iterated until the same state is encountered again (also requiring at most three iterations). The number of elapsed cycles during this second phase, divided by the number of iterations of the function, is returned as the predicted steady-state execution time of the kernel, frontend-wise.
\bigskip{}
This model, however, is not satisfactory in many cases. For instance, the kernel $\kerK' = \texttt{ADDV} + 2\times\basic{Int01}$ is predicted to run in $1.5$ cycles, as depicted in \autoref{fig:frontend_nocross_addv_2add}; however, a \pipedream{} measure yields $\cyc{\kerK'} = 1.35 \simeq 1\,\sfrac{1}{3}$ cycles.
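
\medskip{}

For illustration, the iteration described above can be sketched as follows. This is a minimal, self-contained sketch and not the content of \autoref{lst:nocross_next_state}: the kernel is assumed to be given as the list of the \uop{} counts of its instructions, the names are hypothetical, and both phases are folded into a single loop that remembers when each state was first seen.

\begin{lstlisting}[language=Python]
from fractions import Fraction

WIDTH = 3  # uop slots per cycle in the A72 frontend


def next_state(kernel, state):
    """Run one kernel iteration from `state` uops already decoded this cycle.

    `kernel` lists the uop count of each instruction (hypothetical encoding).
    Returns (new_state, number of cycle boundaries crossed).
    """
    cycles = 0
    for uops in kernel:
        if state > 0 and state + uops > WIDTH:
            # The instruction would straddle a cycle boundary: leave a bubble.
            cycles += 1
            state = 0
        state += uops
        cycles += state // WIDTH
        state %= WIDTH
    return state, cycles


def steady_state_cycles(kernel):
    """Cycles per kernel iteration once the state sequence loops."""
    seen = {}  # state -> (iteration index, cycles elapsed so far)
    state, elapsed, iteration = 0, 0, 0
    while state not in seen:
        seen[state] = (iteration, elapsed)
        state, cycles = next_state(kernel, state)
        elapsed += cycles
        iteration += 1
    first_iteration, first_elapsed = seen[state]
    return Fraction(elapsed - first_elapsed, iteration - first_iteration)


addv_3add = [2, 1, 1, 1]  # ADDV + 3 x basic Int01
addv_2add = [2, 1, 1]     # ADDV + 2 x basic Int01
print(steady_state_cycles(addv_3add))  # 2
print(steady_state_cycles(addv_2add))  # 3/2
\end{lstlisting}

Under these assumptions, the sketch returns $2$ cycles for $\texttt{ADDV} + 3\times\basic{Int01}$ and $\sfrac{3}{2}$ cycles for $\texttt{ADDV} + 2\times\basic{Int01}$, matching the no-cross predictions discussed above.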
\subsubsection{Dispatch-queues model}\label{sssec:a72_dispatch_queues}
\todo{Somehow say that I missed the optimisation guide's relevant part.}
\begin{wrapfigure}{R}{0.2\linewidth}
  \centering
  \vspace{-2em}
  \includegraphics[width=3cm]{timeline_front_dispq_addv_3add.svg}
  \caption{Disp.\ queues frontend for $\texttt{ADDV} + 3\times\basic{Int01}$}\label{fig:frontend_dispq_addv_3add}
  \vspace{-10em} % wow.
\end{wrapfigure}
The software optimisation guide, however, details additional dispatch constraints in Section~4.1~\cite{ref:a72_optim}. While it confirms the dispatch constraint of at most three \uops{} per cycle, it also introduces more specific constraints. There are six dispatch pipelines, each of which can dispatch fewer than three \uops{} per cycle:
\medskip{}
\begin{minipage}{0.75\textwidth}
  \begin{center}
    \begin{tabular}{l l r}
      \toprule
      \textbf{Pipeline} & \textbf{Related ports} & \textbf{\uops{}/cyc.} \\
      \midrule
      \texttt{Branch} & \texttt{Branch} & 1 \\
      \texttt{Int} & \texttt{Int01} & 2 \\
      \texttt{IntM} & \texttt{IntM} & 2 \\
      \texttt{FP0} & \texttt{FP0} & 1 \\
      \texttt{FP1} & \texttt{FP1} & 1 \\
      \texttt{LdSt} & \texttt{Ld}, \texttt{St} & 2 \\
      \bottomrule
    \end{tabular}
  \end{center}
\end{minipage}
\medskip{}
These dispatch constraints could also explain the $\kerK = \texttt{ADDV} + 3\times\basic{Int01}$ measured run time, as detailed in \autoref{fig:frontend_dispq_addv_3add}: the frontend enters steady state on the fourth cycle and, on the fifth (and every second cycle from then on), can only dispatch two \uops{}, as the third one would be a third \uop{} loading the \texttt{Int} dispatch queue, which can only dispatch two \uops{} per cycle. As this part of the CPU is in-order, the frontend is stalled until the next cycle, leaving a dispatch bubble. In steady-state, the dispatch-queues model thus predicts $\cycF{\kerK} = 2$ cycles, which is consistent with our measure.
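
\medskip{}

A minimal Python sketch of these constraints is given below, as an illustration only. It assumes that each \uop{} is described by the tuple of queues it loads, and it handles a \uop{} that may go to either FP pipeline through a combined \texttt{FP01} entry limited to two \uops{} per cycle. This encoding, the decomposition of \texttt{ADDV} into one \texttt{FP1} \uop{} plus one \uop{} on either FP pipeline (following the port mapping above), and all names are assumptions of the sketch.

\begin{lstlisting}[language=Python]
from collections import Counter
from fractions import Fraction

WIDTH = 3  # uops dispatched per cycle, all queues together
LIMITS = {"Branch": 1, "Int": 2, "IntM": 2, "FP0": 1, "FP1": 1,
          "FP01": 2, "LdSt": 2}  # FP01: combined entry (assumption)


def run_kernel(kernel, count, used):
    """Dispatch one kernel iteration in order.

    `count` uops were already dispatched this cycle, loading queues as
    recorded in `used`. Returns (count, used, cycle boundaries crossed).
    """
    used = Counter(used)
    cycles = 0
    for queues in kernel:
        full = count == WIDTH or any(used[q] == LIMITS[q] for q in queues)
        if full:
            # In-order dispatch: stall until the next cycle (bubble).
            cycles += 1
            count = 0
            used.clear()
        count += 1
        used.update(queues)
    return count, used, cycles


def steady_state_cycles(kernel):
    """Cycles per kernel iteration once the dispatcher state repeats."""
    seen = {}  # state key -> (iteration index, cycles elapsed so far)
    count, used, elapsed, iteration = 0, Counter(), 0, 0
    key = (count, ())
    while key not in seen:
        seen[key] = (iteration, elapsed)
        count, used, cycles = run_kernel(kernel, count, used)
        elapsed += cycles
        iteration += 1
        key = (count, tuple(sorted(used.items())))
    first_iteration, first_elapsed = seen[key]
    return Fraction(elapsed - first_elapsed, iteration - first_iteration)


ADDV = [("FP1", "FP01"), ("FP01",)]  # hypothetical uop encoding
INT = [("Int",)]                     # basic Int01 instruction
print(steady_state_cycles(ADDV + 3 * INT))  # 2
print(steady_state_cycles(ADDV + 2 * INT))  # 4/3
\end{lstlisting}

The sketch keys the steady-state detection on the full per-queue usage of the current cycle rather than on the \uop{} count alone; this is a conservative choice that avoids relying on any property of the kernel's last instructions.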
\medskip{}
This model also explains the $\kerK' = \texttt{ADDV} + 2\times\basic{Int01}$ measured time, as detailed in \autoref{fig:frontend_dispq_addv_2add}: the dispatch-queues constraints do not force any bubble in the dispatch pipeline. The model thus predicts $\cycF{\kerK'} = 1\,\sfrac{1}{3}$, which is consistent with our measure.
\begin{wrapfigure}{r}{0.2\linewidth}
  \centering
  \vspace{-2em}
  \includegraphics[width=3cm]{timeline_front_dispq_addv_2add.svg}
  \caption{Disp.\ queues frontend for $\texttt{ADDV} + 2\times\basic{Int01}$}\label{fig:frontend_dispq_addv_2add}
\end{wrapfigure}
\paragraph{Finding a dispatch model.} This model, however, cannot be deduced straightforwardly from our previous \uops{} model: each \uop{} needs to further be mapped to a dispatch queue. The model yielded by \palmed{} is not usable as-is to deduce the dispatch queues used by each instruction: the detected resources are not a perfect match for the CPU, and some resources tend to be duplicated due to measurement artifacts. For instance, the \texttt{Int01} resource might be duplicated into $r_0$ and $r_1$, with some integer operations loading $r_0$, some loading $r_1$, and a majority loading both --- while it makes for a reasonably good throughput model, it would require extra cleaning work to be used as a dispatch model. It is, however, a good basis for manual mapping: for instance, the \texttt{Ld} and \texttt{St} ports are one-to-one matches, and allow us to map all load and store operations automatically.
We generate a base dispatch model from the \palmed{} model, and manually cross-check each class of instructions using the optimisation manual, with \pipedream{} measures in some cases to further clarify.
\medskip{}
This method trivially works for most instructions, which are built out of a single \uop{} and for which we find a single relevant dispatch queue. However, instructions where this is not the case require specific processing.
For an instruction $i$, with $u = \mucount{}i$ \uops{} and found to belong to $d$ distinct dispatch queues, we break down the following cases.
\begin{itemize}
  \item{} If $d > u$, our \uops{} model is wrong for this instruction, as each \uop{} belongs to a single dispatch queue. This did not happen in our model.
  \item{} If $d = 1, u > 1$, each \uop{} belongs to the same dispatch queue.
  \item{} If $d = u > 1$, each \uop{} belongs to its own dispatch queue. For now, our model orders those \uops{} arbitrarily. However, the order of those \uops{} might be important due to dispatch constraints, and would require specific investigation with kernels meant to stress the dispatch queues, assuming a certain \uop{} order.
  \item{} If $1 < d < u$, we do not have enough data to determine how many \uops{} belong to each queue; this would require further measurements. We do not support those instructions, as they represent only 35 instructions out of the 1\,749 instructions supported by our \palmed{}-generated backend model.
\end{itemize}
\medskip{}
Due to the separation of \texttt{FP0} and \texttt{FP1} in two different dispatch queues, this model also borrows the abstract resources' formalism in a simple case: it actually models seven queues, including \texttt{FP0}, \texttt{FP1} and \texttt{FP01}, the two former with a maximal load of 1 and the latter with a maximal load of 2. Any \uop{} loading queue \texttt{FP0} or \texttt{FP1} also loads the \texttt{FP01} queue likewise.
\paragraph{Implementing the dispatch model.} A key observation to implement this model is that, as with the no-cross model, the state of the model at the end of a kernel occurrence is still only determined by the number of \uops{} already dispatched in the current cycle. Indeed, since the dispatcher is in-order, at the end of a kernel occurrence, the last \uops{} dispatched will always be the same in steady-state, as they come from the last few instructions of the kernel.
On account of this observation, the general structure of the no-cross implementation remains correct: at most three kernel iterations to reach steady-state, and at most three more to find a fixed point. The \texttt{next\_state} function is adapted to account for the dispatch queues limit. \section{Evaluation on Palmed} To evaluate the gain brought by each frontend model, we plug them successively on top of the \palmed{} backend model. The number of cycles for a kernel $\kerK$ is then predicted as the maximum between the backend-predicted time and the frontend-predicted time. We evaluate four models: \palmed{}'s backend alone, \palmed{} with a purely linear frontend, based on our modeled number of \uops{} for each instruction, \palmed{} with the no-cross frontend, and finally \palmed{} with the dispatch-queues frontend. The results of each model are reported in \autoref{table:a72_frontend_err}, to which we add \llvmmca{}'s results as a basis for comparison with the state-of-the-art. \begin{table} \centering \begin{tabular}{l l c r r r r r} \toprule & & & \multirow{2}{*}{\llvmmca{}} & \multicolumn{4}{c}{\palmed{} with frontend\ldots} \\ & & & & none & linear & no-cross & disp.\ queues \\ \midrule{} \multirow{3}{*}{SPEC} & \covrow{} & 100.0 & \na{} & 97.21 & 97.21 & 97.16 \\ & \errrow{} & 9.0 & 20.1 & 6.2 & 6.3 & 4.6 \\ & \taurow{} & 0.83 & 0.88 & 0.91 & 0.91 & 0.93 \\ \midrule \multirow{3}{*}{Polybench} & \covrow{} & 100.00 & \na{} & 99.33 & 99.33 & 99.33 \\ & \errrow{} & 13.9 & 12.6 & 8.1 & 8.1 & 8.0 \\ & \taurow{} & 0.47 & 0.82 & 0.88 & 0.88 & 0.90 \\ \bottomrule \end{tabular} \caption{Comparative accuracy of IPC predictions with different frontend models on the Cortex A72}\label{table:a72_frontend_err} \end{table} As expected, the error is greatly reduced with the addition of any reasonable frontend model ---~especially on the SPEC benchmark suite. 
Using the dispatch-queues model, which models the frontend more accurately, further reduces the error on SPEC by 1.6 points (from 6.2 to 4.6), while only marginally improving the $\tau_K$ coefficient (from 0.91 to 0.93). On Polybench, however, the gains brought by the dispatch-queues model are very modest: only 0.1 point. In all cases, \palmed{} with a frontend model performs significantly better than \llvmmca{} on the Cortex A72.

\section{Future work: benchmark-based automatic frontend model generation}
\todo{}