The usual reverse-engineering methods for CPU models make abundant use of
hardware counters ---~and legitimately so, as they are the natural and accurate
way to obtain insight into the internals of a CPU\@. Such methods include,
among others, the optimisation guides from Agner Fog~\cite{AgnerFog}, as well
as \uopsinfo{}~\cite{uopsinfo} and \uica{}'s~\cite{uica} approaches to
respectively model the CPU's back- and front-end. In \autoref{chap:palmed}, we
introduced Palmed, whose main goal is to automatically produce port-mappings of
CPUs without assuming the presence of specific hardware counters.

\smallskip{}

The ARM architectures occupy a growing space in the global computing ecosystem.
They are already pervasive among embedded and mobile devices, with most mobile
phones featuring an ARM CPU~\cite{arm_mobile}. Processors based on ARM are also
emerging in datacenters and supercomputers: the Fugaku supercomputer
---~ranked the fastest supercomputer in the world by the TOP500
ranking~\cite{fugaku_top500}~--- runs on ARM-based CPUs~\cite{fugaku_arm},
while the MareNostrum 4 supercomputer has an ARM-based
cluster~\cite{marenostrum4_arm}.

Yet, the ARM ecosystem still lacks performance debugging tooling. While
\llvmmca{} supports ARM, it is one of only a few analyzers to do so: \iaca{},
made by Intel, does not support it ---~and never will, as it is
end-of-life~---; \uica{} is focused on Intel architectures and cannot easily be
ported, as it heavily relies on reverse engineering that is specific to Intel
and enabled by Intel-specific hardware counters; Intel \texttt{VTune}, a
commonly used performance profiling tool, supports only x86-64.

\smallskip{}

In this context, modelling an ARM CPU ---~the Cortex A72~--- with Palmed seemed
an important goal, all the more meaningful as this particular CPU has only very
few hardware counters. However, it yielded only mixed results, as shown in
\autoref{sec:palmed_results}.

In this chapter, we show that a major cause of imprecision in these results is
the absence of a frontend model.

\section{Necessity to go beyond ports}

The resource models produced by \palmed{} are mainly concerned with the backend
of the modeled CPUs. However, the importance of the frontend for the accuracy
of a model's predictions cannot be ignored. Its effect is clearly visible in
the evaluation heatmaps of the various code analyzers in
\autoref{fig:palmed_heatmaps}. Each heatmap has a clear-cut limit on the
horizontal axis: independently of the benchmark's content, it is impossible to
reach more than a given number of instructions per cycle (IPC) on a given
processor ---~4 instructions per cycle for the \texttt{SKL-SP}, 5 for the
\texttt{ZEN1}. This limit is imposed by the frontend.

Some analyzers, such as \palmed{} and \iaca{}, model this limit: their heatmaps
show that the predicted IPC does not surpass it. The other three analyzers
studied, however, do not model this limit; for instance, on the \texttt{SKL-SP}
CPU, \uopsinfo{} predicts a high density of SPEC2017 benchmarks at 8
instructions per cycle, while the native measurement yields only 4 instructions
per cycle. The same effect is visible on the \pmevo{} and \llvmmca{} heatmaps.

\begin{example}{High back-end throughput on \texttt{SKL-SP}}
    On the \texttt{SKL-SP} microarchitecture, assuming an infinitely large
    frontend, a number of instructions per cycle higher than 4 is easy to
    reach.

    According to \uopsinfo{} data, a 64-bit integer \lstxasm{addq} is
    processed as a single \uop{}, dispatched on port 0, 1, 5 or 6. Meanwhile,
    a simple 64-bit register store to an address held directly in a register
    ---~\eg{} a \lstxasm{movq \%rax, (\%rbx)}~--- is also processed as a
    single \uop{}, dispatched on port 2 or 3.

    Thus, backend-wise, the kernel $4\times \texttt{addq} + 2\times
    \texttt{movq}$ has a throughput of 6 instructions per cycle. However, in
    reality, this kernel is frontend-bound, with a theoretical maximum
    throughput of 4 instructions per cycle ---~in fact, a \pipedream{}
    measurement only yields 3 instructions per cycle.
\end{example}

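To make the arithmetic of this example concrete, the following Python sketch
is purely illustrative: the port mapping is hard-coded from the \uopsinfo{}
data quoted above, the port names and the brute-force assignment are ours, and
the frontend is reduced to a single width parameter. It computes the
backend-only bound of one isolated kernel iteration, then caps it with the
4-wide frontend.
\begin{lstlisting}[language=Python]
from itertools import product

# Illustrative port mapping, hard-coded from the uops.info data quoted
# above: one uop per instruction, dispatched on one of the listed ports.
PORTS = {
    "addq": ("p0", "p1", "p5", "p6"),
    "store": ("p2", "p3"),
}

KERNEL = ["addq"] * 4 + ["store"] * 2

def backend_cycles(kernel):
    """Cycles needed by one isolated iteration if only port conflicts
    matter: try every port assignment, keep the one whose busiest port
    receives the fewest uops."""
    best = len(kernel)
    for assignment in product(*(PORTS[insn] for insn in kernel)):
        busiest = max(assignment.count(port) for port in assignment)
        best = min(best, busiest)
    return best

cycles = backend_cycles(KERNEL)
print(len(KERNEL) / cycles)          # 6.0: backend-only bound, 6 IPC
print(min(len(KERNEL) / cycles, 4.0))  # 4.0: capped by the 4-wide frontend
\end{lstlisting}

The gap between the two printed values is exactly the error that a
backend-only model makes on this kernel.
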
\bigskip{}

To account for this, \palmed{} tries to detect an additional resource, apart
from the backend ports and combined ports, on which every \uop{} incurs a load.
This allows \palmed{} to avoid large errors on frontend-bound kernels.

The approach is, however, far from perfect. The clearest reason is that the
frontend, both on x86-64 and ARM architectures, works in-order, while
\palmed{} inherently models kernels as multisets of instructions, thus
completely ignoring ordering. This resource model is purely linear: an
instruction incurs a load on the frontend resource in a fully commutative way,
independently of the instructions previously executed in the same cycle and of
many other effects.

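Concretely, in this linear view (written here as a simplified sketch, with
notation introduced only for illustration), a kernel $K$ containing $\sigma_i$
occurrences of instruction $i$, each placing a load $\rho_{i,r}$ on abstract
resource $r$, is predicted to execute in
\begin{align*}
    t(K) = \max_{r} \sum_{i} \sigma_i \cdot \rho_{i,r}
\end{align*}
cycles per iteration, where $r$ ranges over the ports, combined ports and the
extra frontend resource. The inner sum depends only on the multiset of
instructions: it is invariant under any reordering of the kernel, so a
phenomenon that depends on which instructions are fetched, decoded or
dispatched together cannot be expressed by such an additional resource.
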
The article introducing \uica{}~\cite{uica} explores this question in detail
for x86-64 Intel architectures. The authors, who previously developed
\uopsinfo{}, discuss the importance of correctly modelling the frontend to
accurately predict throughput. Their approach, based on the exploration and
reverse-engineering of the crucial parts of the frontend, showcases many
important and non-trivial aspects of frontends that are usually neglected, such
as the switching between the decoders and the \uop{}-cache as the source of
instructions ---~which cannot be modelled linearly.

\section{The Cortex A72 CPU}

The Cortex A72~\cite{a72_doc} is a CPU based on the ARMv8-A ISA ---~the first
ARM ISA to implement AArch64, the 64-bit ARM extension. It is an out-of-order
CPU with Neon SIMD support, designed as a general-purpose, high-performance
core for low-power applications.

The Raspberry Pi 4 uses a 4-core A72 CPU, implemented by Broadcom as the
BCM2711; it is thus easy to get access to an A72 to run experiments.

\begin{figure}
    \centering
    \includegraphics[width=\linewidth]{A72_pipeline_diagram.svg}
    \caption{Simplified overview of the Cortex A72
    pipeline}\label{fig:a72_pipeline}
\end{figure}

\paragraph{Backend.} As can be seen in \autoref{fig:a72_pipeline} (adapted from
the software optimization guide for the Cortex A72, published by
ARM~\cite{ref:a72_optim}), the Cortex A72 has eight execution ports (a toy
encoding of this layout is sketched right after the list):
\begin{itemize}
    \item a branch port (branch instructions, equivalent to x86 jumps);
    \item two identical integer ports (integer arithmetic operations);
    \item an integer multi-cycle port (complex integer operations, \eg{}
        divisions);
    \item two nearly-identical floating point and SIMD ports (with slight
        specializations: \eg{} only port FP0 can do SIMD multiplication, while
        only port FP1 can do floating point comparisons);
    \item a load port;
    \item a store port.
\end{itemize}

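For reference, here is a toy Python encoding of this port layout. The port and
class names are ours, chosen only for readability, and the FP0/FP1
specializations are collapsed into a shared class; the point is merely that
such a layout boils down to a small port-to-capability mapping.
\begin{lstlisting}[language=Python]
# Toy encoding of the Cortex A72 execution ports (names are ours).
# Each port maps to the instruction classes it can accept.
A72_PORTS = {
    "B":   {"branch"},
    "I0":  {"int"},
    "I1":  {"int"},
    "M":   {"int_multi_cycle"},
    "FP0": {"fp"},  # also the only port doing SIMD multiplication
    "FP1": {"fp"},  # also the only port doing FP comparisons
    "L":   {"load"},
    "S":   {"store"},
}

def eligible_ports(insn_class):
    """Ports on which an instruction of the given class may be dispatched."""
    return {port for port, classes in A72_PORTS.items() if insn_class in classes}

print(sorted(eligible_ports("int")))   # ['I0', 'I1']
print(sorted(eligible_ports("load")))  # ['L']
\end{lstlisting}
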
\paragraph{Frontend.} The Cortex A72 frontend can only decode three
instructions and dispatch three \uops{} per cycle~\cite{ref:a72_optim}.
Intel's \texttt{SKL-SP}, which we considered before, has a frontend that
bottlenecks at four \uops{} per cycle~\cite{agnerfog_skl_front4}. This
difference of one \uop{} per cycle is meaningful: it means that at most three
of the eight backend ports can be used in any given cycle.

\begin{example}[2nd order polynomial evaluation]
    Consider a kernel evaluating a second-order polynomial for different
    values of $X[i]$:
    \begin{align*}
        P[i] &= a{X[i]}^2 + bX[i] + c \\
             &= \left( aX[i] + b \right) \times X[i] + c
    \end{align*}
    which directly translates to four operations: load $X[i]$, two floating
    point multiply-adds, store the result $P[i]$. The backend, having a load
    port, two SIMD ports and a store port, can execute one iteration of such a
    kernel every cycle; in steady state, out-of-order execution can lift the
    latency-induced pressure. However, as the frontend bottlenecks at three
    \uops{} per cycle, this kernel does not fit in a single cycle.
\end{example}

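The arithmetic behind this example can be spelled out explicitly. The Python
sketch below is a back-of-the-envelope computation only, using the numbers
from the example rather than a model of the actual dispatch logic: the backend
needs one cycle per iteration, the 3-wide frontend needs $4/3$ cycles, so the
kernel is frontend-bound at 3 instructions per cycle.
\begin{lstlisting}[language=Python]
from fractions import Fraction

# Back-of-the-envelope steady-state throughput of the polynomial kernel
# on the Cortex A72 (illustrative; numbers from the example above).
INSNS_PER_ITER = 4           # load, 2x multiply-add, store
BACKEND_CYCLES_PER_ITER = 1  # each operation has its own port (L, FP0, FP1, S)
FRONTEND_WIDTH = 3           # the A72 dispatches at most 3 uops per cycle

frontend_cycles = Fraction(INSNS_PER_ITER, FRONTEND_WIDTH)     # 4/3
cycles_per_iter = max(BACKEND_CYCLES_PER_ITER, frontend_cycles)
ipc = INSNS_PER_ITER / cycles_per_iter

print(cycles_per_iter)  # 4/3: the kernel is frontend-bound
print(ipc)              # 3: the frontend-limited throughput
\end{lstlisting}
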
\paragraph{Lack of hardware counters.}
The Cortex A72 only features a very limited set of specialized hardware
counters. While the CPU is able to report the number of elapsed cycles, retired
instructions, branch misses and various metrics on cache misses, it does not
report any event regarding macro- or micro-operations, dispatching or issuing
to specific ports. This makes it, as pointed out before, a particularly
relevant target for \palmed{}.

\section{Manually modelling the A72 frontend}

% TODO

\subsection{Methodology}