\section{Manually modelling the A72 frontend}
\todo{}
\subsection{Finding micro-operation count for each instruction}
As we saw in \autoref{sec:a72_descr}, the Cortex A72's frontend can only
dispatch three \uops{} per cycle. The first piece of data to collect is thus
the number of \uops{} each instruction is decoded into.
To that end, the optimisation manual~\cite{ref:a72_optim} helps, but is not
thorough enough: for each instruction, it lists the ports on which load is
incurred, which sets a lower bound on the number of \uops{} the instruction is
decomposed into. This approach, however, is not satisfactory, for two reasons.
First, it cannot be reproduced for another architecture whose optimisation
manual is not as detailed, cannot be automated, and fully trusts the
manufacturer. Second, if an instruction loads \eg{} the integer ports, it may
have a single or multiple \uops{} executed on the integer ports; the manual
only partially helps to determine which.
\medskip{}
We instead use an approach akin to \palmed{}'s saturating kernels, itself
inspired by Agner Fog's method to identify ports in the absence of hardware
counters~\cite{AgnerFog}. To this end, we assume the availability of a port
mapping for the backend. In the case of the Cortex A72, we use \palmed{}'s
output, sometimes manually cross-checked against the software optimisation
guide~\cite{ref:a72_optim}; \uopsinfo{} could also be used on the
architectures it models.
The \palmed{} resource mapping we use as a basis is composed of 1\,975
instructions. To make this more manageable in a semi-automated method, we
reuse the instruction classes provided by \palmed{}, introduced in
\autoref{sec:palmed_design}: instructions in the same class are mapped to the
same resources, and are thus decomposed into the same \uops{}. This results in
only 98 classes of instructions.
\paragraph{Basic instructions.} We use \palmed{}'s mapping to hand-pick
\emph{basic instructions}: for each port, we select one instruction which
decodes into a single \uop{} executed by this port. We use the following
instructions, in \pipedream{}'s notation:
\begin{itemize}
    \item{} integer 0/1: \lstarmasm{ADC_RD_X_RN_X_RM_X}, \eg{} \lstarmasm{adc x0, x1, x2};
    \item{} integer multi-cycle: \lstarmasm{MUL_RD_W_RN_W_RM_W}, \eg{} \lstarmasm{mul w0, w1, w2};
    \item{} load: \lstarmasm{LDR_RT_X_ADDR_REGOFF}, \eg{} \lstarmasm{ldr x0, [x1, x2]};
    \item{} store: \lstarmasm{STR_RT_X_ADDR_REGOFF}, \eg{} \lstarmasm{str x0, [x1, x2]};
    \item{} FP/SIMD 0: \lstarmasm{FRINTA_FD_D_FN_D}, \eg{} \lstarmasm{frinta d0, d1} (floating-point rounding to integral);
    \item{} FP/SIMD 1: \lstarmasm{FCMP_FN_D_FM_D}, \eg{} \lstarmasm{fcmp d0, d1} (floating-point comparison);
    \item{} FP/SIMD 0/1: \lstarmasm{FMIN_FD_D_FN_D_FM_D}, \eg{} \lstarmasm{fmin d0, d1, d1} (floating-point minimum);
    \item (Branch: no instruction, as branches are unsupported by \pipedream{}).
\end{itemize}
As the integer ports are not specialised, a single basic instruction is
sufficient for both of them. The FP/SIMD ports are slightly specialised (see
\autoref{sec:a72_descr}); we thus use three basic instructions: one that
stresses each of them independently, and one that stresses both without
distinction.
For each of these ports, we denote by $\basic{p}$ the basic instruction for
port \texttt{p}; \eg{}, $\basic{Int01}$ is \lstarmasm{ADC_RD_X_RN_X_RM_X}.

\paragraph{Counting the micro-ops of an instruction.} There are three main
sources of bottleneck for a kernel $\kerK$: backend, frontend and
dependencies. When measuring the execution time with \pipedream{}, we
eliminate (as far as possible) the dependencies, leaving us with only backend
and frontend. We denote by $\cycF{\kerK}$ the execution time of $\kerK$ if it
were only limited by its frontend, and by $\cycB{\kerK}$ the execution time of
$\kerK$ if it were only limited by its backend.
If we consider a kernel $\kerK$ that is simple enough to exhibit a purely
linear frontend behaviour (that is, the number of cycles spent in the frontend
is a linear function of the number of \uops{} in the kernel), we then know
that either $\cyc{\kerK} = \cycF{\kerK}$ or $\cyc{\kerK} = \cycB{\kerK}$.
For a given instruction $i$, we then construct a sequence $\kerK_k$ of kernels
such that:
\begin{enumerate}[(i)]
    \item\label{cnd:kerKk:compo} for all $k \in \nat$, $\kerK_k$ is composed of the instruction $i$, followed by $k$ basic instructions;
    \item\label{cnd:kerKk:linear} the kernels $\kerK_k$ are simple enough to exhibit this purely linear frontend behaviour;
    \item\label{cnd:kerKk:fbound} after a certain rank, $\cycB{\kerK_k} \leq \cycF{\kerK_k}$.
\end{enumerate}
We denote by $\mucount{}\kerK$ the number of \uops{} in kernel $\kerK$. Under
condition~(\ref{cnd:kerKk:linear}), we have for any $k \in \nat$
\begin{align*}
    \cycF{\kerK_k} &= \dfrac{\mucount{}\left(\kerK_k\right)}{3} & \text{for the A72} \\
    &= \dfrac{\mucount{}i + k}{3} & \text{by condition (\ref{cnd:kerKk:compo})} \\
    &\geq \dfrac{k+1}{3}
\end{align*}
We pick $k_0 := 3 \ceil{\cyc{\imath}} - 1$. Thus, we have
$\ceil{\cyc{\imath}} \leq \cycF{\kerK_{k_0}} \leq \cyc{\kerK_{k_0}}$.
Condition (\ref{cnd:kerKk:fbound}) can then be relaxed as
$\cycB{\kerK_{k_0}} \leq \ceil{\cyc{\imath}}$, which we know to be true if the
load from $\kerK_{k_0}$ on each port does not exceed $\ceil{\cyc{\imath}}$ (as
execution takes at least this number of cycles).
We build $\kerK_{k_0}$ by adding basic instructions to $i$, using the port
mapping to pick basic instructions that do not load a port over
$\ceil{\cyc{\imath}}$. This is always possible, as we can independently load
seven ports (leaving out the branch port), while each instruction can load at
most three ports per cycle it takes to execute (each \uop{} is executed by a
single port, and only three \uops{} can be dispatched per cycle), leaving four
ports under-loaded.
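
As an illustration, with purely hypothetical figures that are not
measurements, consider an instruction $i$ for which $\cyc{\imath} = 2$ and
whose load falls entirely on the load port. We would then pick
$k_0 = 3 \times 2 - 1 = 5$, and one possible kernel would be
\[
    \kerK_5 = i + \basic{Int01} + \basic{FP01} + \basic{Int01} + \basic{FP01} + \basic{Int01}
\]
In this kernel, the two integer ports share three \uops{}, the two FP/SIMD
ports share two, and the load port only receives the load of $i$ itself: no
port is loaded over $\ceil{\cyc{\imath}} = 2$ cycles. On the frontend side,
whatever the actual value of $\mucount{}i \geq 1$,
\begin{align*}
    \cycF{\kerK_5} &= \dfrac{\mucount{}i + 5}{3} \geq \dfrac{6}{3} = 2 = \ceil{\cyc{\imath}} \geq \cycB{\kerK_5}
\end{align*}
so that, under these assumptions, this kernel would be frontend-bound.
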
We build $\kerK_{k_0 + 1}$ the same way, still not loading a port over
$\ceil{\cyc{\imath}}$; in particular, we still have
$\cycB{\kerK_{k_0 + 1}} \leq \ceil{\cyc{\imath}} \leq \cycF{\kerK_{k_0+1}}$.
To ensure that condition (\ref{cnd:kerKk:linear}) is valid, as we will see
later in \qtodo{ref}, we spread instructions loading the same port as far
apart as possible: for instance,
$i + \basic{Int01} + \basic{FP01} + \basic{Int01}$ is preferred over
$i + 2\times \basic{Int01} + \basic{FP01}$.
Provided that condition (\ref{cnd:kerKk:linear}) is met and that our port
model is correct for this instruction, we should measure
$\ceil{\cyc{\imath}} \leq \cyc{\kerK_{k_0}}$ and
$\cyc{\kerK_{k_0}} + \sfrac{1}{3} = \cyc{\kerK_{k_0+1}}$.
For instructions $i$ where this is not the case, increasing $k_0$ by 3 or
using other basic instructions eventually yielded satisfactory measurements.
Finally, we obtain
\[
    \mucount{}i = 3 \cyc{\kerK_{k_0}} - k_0
\]
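
As a sanity check of this formula, with purely hypothetical values that are
not measurements, suppose that $k_0 = 5$ and that we measure
$\cyc{\kerK_{5}} = \sfrac{7}{3}$ and $\cyc{\kerK_{6}} = \sfrac{8}{3}$. The two
measurements differ by $\sfrac{1}{3}$, as expected for frontend-bound
kernels, and
\begin{align*}
    \mucount{}i &= 3 \cyc{\kerK_{5}} - 5 = 3 \times \dfrac{7}{3} - 5 = 2
\end{align*}
that is, $i$ would be decoded into two \uops{}. Conversely, plugging this value
back into the frontend model gives
$\cycF{\kerK_5} = \sfrac{(2 + 5)}{3} = \sfrac{7}{3}$, which is consistent with
the assumed measurement.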