\section{The Cortex A72 CPU}\label{sec:a72_descr}

The Cortex A72~\cite{a72_doc} is a CPU based on ARMv8-A, the first ARM ISA to
implement AArch64 (the 64-bit ARM extension). It is an out-of-order CPU with
Neon SIMD support, designed as a general-purpose, high-performance core for
low-power applications.

The Raspberry Pi 4 uses a four-core A72 CPU, integrated by Broadcom in the
BCM2711 SoC; an A72 is thus readily available to run experiments.

\begin{figure}
    \centering
    \includegraphics[width=\linewidth]{A72_pipeline_diagram.svg}
    \caption{Simplified overview of the Cortex A72
    pipeline}\label{fig:a72_pipeline}
\end{figure}

\paragraph{Backend.} As can be seen in \autoref{fig:a72_pipeline} (adapted from
the software optimization guide for the Cortex A72, published by
ARM~\cite{ref:a72_optim}), the Cortex A72 has eight execution ports:
\begin{itemize}
    \item a branch port (branch instructions, equivalent to x86 jumps);
    \item two identical integer ports (integer arithmetic operations), denoted
        \texttt{Int01};
    \item an integer multi-cycle port (complex integer operations, \eg{}
        divisions), denoted \texttt{IntM};
    \item two nearly identical floating point and SIMD ports, denoted
        \texttt{FP0} and \texttt{FP1}, or \texttt{FP01} to denote both. They
        differ only by slight specializations: \eg{} only port \texttt{FP0}
        can do SIMD multiplication, while only port \texttt{FP1} can do
        floating point comparisons;
    \item a load port, denoted \texttt{Ld};
    \item a store port, denoted \texttt{St}.
\end{itemize}

\paragraph{Frontend.} The Cortex A72 frontend can decode at most three
instructions and dispatch at most three \uops{} per
cycle~\cite{ref:a72_optim}. In comparison, Intel's \texttt{SKL-SP}, which we
considered before, has a frontend that bottlenecks at four \uops{} per
cycle~\cite{agnerfog_skl_front4}. This difference of one \uop{} per cycle is
significant: in steady state, at most three of the eight backend ports can be
fed a new \uop{} each cycle on average.

\begin{example}[2nd order polynomial evaluation]
    Consider a kernel evaluating the 2nd order polynomial expression for
    different values of $x$:
    \begin{align*}
        P[i] &= a{X[i]}^2 + bX[i] + c \\
             &= \left( aX[i] + b \right) \times X[i] + c
    \end{align*}
    which directly translates to four operations: a load of $X[i]$, two
    floating point multiply-adds, and a store of the result $P[i]$. The
    backend, having a load port, two SIMD ports and a store port, can execute
    one iteration of such a kernel every cycle; in steady state, out-of-order
    execution can hide the latency of the dependent operations. However, as
    the frontend bottlenecks at three \uops{} per cycle, the four operations
    of one iteration cannot all be dispatched in a single cycle.
\end{example}
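
The bound in the example above can be made explicit with a rough estimate. As
a simplifying assumption, suppose that each of the four operations decodes to
exactly one \uop{} and ignore the loop overhead (index increment and branch).
The symbols $C_{\text{back}}$ and $C_{\text{front}}$ are introduced here for
illustration only; they denote the minimal number of cycles per iteration
imposed by the backend and by the frontend respectively. The three fractions
in $C_{\text{back}}$ are the number of \uops{} divided by the number of ports
available for \texttt{Ld}, \texttt{FP01} and \texttt{St}, in that order:
\begin{align*}
    C_{\text{back}} &= \max\left( \frac{1}{1}, \frac{2}{2}, \frac{1}{1}
        \right) = 1 \text{ cycle}, \\
    C_{\text{front}} &= \frac{4}{3} \approx 1.33 \text{ cycles}, \\
    C &\geq \max\left( C_{\text{back}}, C_{\text{front}} \right)
        = \frac{4}{3} \text{ cycles}.
\end{align*}
Under these assumptions, the frontend rather than the backend limits the
kernel, which can complete at most three iterations every four cycles.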

\paragraph{Lack of hardware counters.}
The Cortex A72 only features a very limited set of specialized hardware counters.
While the CPU is able to report the number of elapsed cycles,
retired instructions, branch misses and various metrics on cache misses, it
does not report any event regarding macro- or micro-operations, dispatching or
issuing to specific ports. This makes it, as pointed out before, a
particularly relevant target for \palmed{}.