\section{The Cortex A72 CPU}\label{sec:a72_descr}

The Cortex A72~\cite{a72_doc} is a CPU based on the ARMv8-A ISA, the first ARM
ISA to implement AArch64 (the 64-bit ARM extension). It is an out-of-order CPU
with Neon SIMD support, designed as a general-purpose, high-performance core for
low-power applications.

The Raspberry Pi 4 uses a four-core A72 CPU, implemented by Broadcom as
BCM2711; an A72 is thus readily available to run experiments.

\begin{figure}
    \centering
    \includegraphics[width=\linewidth]{A72_pipeline_diagram.svg}
    \caption{Simplified overview of the Cortex A72 pipeline}\label{fig:a72_pipeline}
\end{figure}

\paragraph{Backend.} As can be seen in \autoref{fig:a72_pipeline} (adapted from
the software optimization guide for the Cortex A72, published by
ARM~\cite{ref:a72_optim}), the Cortex A72 has eight execution ports:
\begin{itemize}
    \item a branch port (branch instructions, equivalent to x86 jumps);
    \item two identical integer ports (integer arithmetic operations), denoted
        \texttt{Int01};
    \item an integer multi-cycle port (complex integer operations, \eg{}
        divisions), denoted \texttt{IntM};
    \item two nearly identical floating point and SIMD ports, denoted
        \texttt{FP0} and \texttt{FP1}, or \texttt{FP01} to denote both (they
        differ only by slight specializations: \eg{} only port \texttt{FP0}
        can do SIMD multiplication, while only port \texttt{FP1} can do
        floating point comparisons);
    \item a load port, denoted \texttt{Ld};
    \item a store port, denoted \texttt{St}.
\end{itemize}

\paragraph{Frontend.} The Cortex A72 frontend can only decode three
instructions and dispatch three \uops{} per cycle~\cite{ref:a72_optim}.
Intel's \texttt{SKL-SP}, which we considered before, has a frontend that
bottlenecks at four \uops{} per cycle~\cite{agnerfog_skl_front4}. This
difference of one \uop{} per cycle is significant: on the A72, at most three of
the eight backend ports can be fed a new \uop{} each cycle on average, whatever
the instruction mix.
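
As a rough sketch of what this limit implies, assume (a simplification made
here for illustration, not a statement from the optimization guide) that a loop
body decodes to $n$ \uops{} and that dispatch is the only limiting resource.
Writing $w$ for the dispatch width in \uops{} per cycle and
$C_{\mathrm{front}}(n)$ for the steady-state cost of one iteration in cycles
(both symbols are introduced here and are not part of the cited documentation),
we get the lower bound
\begin{equation*}
    C_{\mathrm{front}}(n) \geq \frac{n}{w},
    \qquad \text{with } w = 3 \text{ on the Cortex A72 and } w = 4
    \text{ on \texttt{SKL-SP}.}
\end{equation*}
For instance, under these assumptions, a body of $n = 12$ \uops{} needs at
least $12/3 = 4$ cycles per iteration on the A72, against $12/4 = 3$ cycles on
\texttt{SKL-SP}, however well its \uops{} spread over the backend ports.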

\begin{example}[2nd order polynomial evaluation]
    Consider a kernel evaluating a second-order polynomial for different values
    $X[i]$ of $x$:
    \begin{align*}
        P[i] &= a{X[i]}^2 + bX[i] + c \\
             &= \left( aX[i] + b \right) \times X[i] + c
    \end{align*}
    which directly translates to four operations: a load of $X[i]$, two
    floating point multiply-adds, and a store of the result $P[i]$.

    The backend, having a load port, two SIMD ports and a store port, can
    execute one iteration of such a kernel every cycle; in steady state,
    out-of-order execution can overlap successive iterations and thus lift the
    latency-induced pressure. However, as the frontend bottlenecks at three
    \uops{} per cycle, the four operations of an iteration cannot all be
    dispatched in a single cycle: assuming each of them is a single \uop{},
    one iteration takes at least $4/3$ cycles on average.
\end{example}

\paragraph{Lack of hardware counters.} The Cortex A72 only features a very
limited set of specialized hardware counters. While the CPU is able to report
the number of elapsed cycles, retired instructions, branch misses and various
metrics on cache misses, it does not report any event regarding macro- or
micro-operations, or dispatching or issuing to specific ports. This makes it,
as pointed out earlier, a particularly relevant target for \palmed{}.