\section{The Cortex A72 CPU}\label{sec:a72_descr}
The Cortex A72~\cite{a72_doc} is a CPU based on the ARMv8-A ISA (the first
ARM ISA to implement AArch64, the 64-bit ARM extension). It is an out-of-order
CPU with Neon SIMD support, designed as a general-purpose, high-performance
core for low-power applications.

The Raspberry Pi 4 uses a quad-core A72 CPU, implemented by Broadcom as the
BCM2711; an A72 is thus easy to obtain for running experiments.

\begin{figure}
    \centering
    \includegraphics[width=\linewidth]{A72_pipeline_diagram.svg}
    \caption{Simplified overview of the Cortex A72
    pipeline}\label{fig:a72_pipeline}
\end{figure}

\paragraph{Backend.} As shown in \autoref{fig:a72_pipeline} (adapted from
the software optimization guide for the Cortex A72, published by
ARM~\cite{ref:a72_optim}), the Cortex A72 has eight execution ports:
\begin{itemize}
    \item a branch port (branch instructions, equivalent to x86 jumps);
    \item two identical integer ports (integer arithmetic operations), denoted
        \texttt{Int01};
    \item an integer multi-cycle port (complex integer operations, \eg{}
        divisions), denoted \texttt{IntM};
    \item two nearly identical floating-point and SIMD ports, denoted
        \texttt{FP0} and \texttt{FP1}, or \texttt{FP01} to denote both. They
        differ only by slight specializations: \eg{} only port \texttt{FP0}
        can do SIMD multiplications, while only port \texttt{FP1} can do
        floating-point comparisons;
    \item a load port, denoted \texttt{Ld};
    \item a store port, denoted \texttt{St}.
\end{itemize}

\paragraph{Frontend.} The Cortex A72 frontend can only decode three
instructions and dispatch three \uops{} per cycle~\cite{ref:a72_optim}.
By comparison, the frontend of Intel's \texttt{SKL-SP}, which we considered
before, bottlenecks at four \uops{} per cycle~\cite{agnerfog_skl_front4}. This
difference of one \uop{} per cycle is significant: with three \uops{}
dispatched per cycle, at most three of the eight backend ports can receive new
work each cycle.

\begin{example}[2nd order polynomial evaluation]
    Consider a kernel evaluating a 2nd order polynomial for each element
    $X[i]$ of an input array:
    \begin{align*}
        P[i] &= a{X[i]}^2 + bX[i] + c \\
             &= \left( aX[i] + b \right) \times X[i] + c
    \end{align*}
    which directly translates to four operations: a load of $X[i]$, two
    floating-point multiply-adds, and a store of the result $P[i]$. The
    backend, having a load port, two SIMD ports and a store port, can execute
    one iteration of such a kernel every cycle; in steady state, out-of-order
    execution can overlap successive iterations and thus hide instruction
    latencies. However, as the frontend bottlenecks at three \uops{} per
    cycle, the four operations of one iteration do not fit in a single cycle.
\end{example}
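
The reasoning of this example can be sketched as a pair of lower bounds on the
steady-state cost of one iteration. This is only an illustration under two
assumptions that are not stated above: each of the four operations decodes to
a single \uop{}, and both multiply-adds can be issued on either of the
\texttt{FP01} ports. The symbols $C_{\text{back}}$, $C_{\text{front}}$ and $C$
(in cycles per iteration) are introduced here for this sketch only.
\begin{align*}
    C_{\text{back}} &= \max\Bigl(
        \underbrace{\tfrac{1}{1}}_{\texttt{Ld}},\;
        \underbrace{\tfrac{2}{2}}_{\texttt{FP01}},\;
        \underbrace{\tfrac{1}{1}}_{\texttt{St}}
    \Bigr) = 1 \\
    C_{\text{front}} &= \frac{4 \text{ \uops{}}}{3 \text{ \uops{} per cycle}}
        = \frac{4}{3} \\
    C &\ge \max\left( C_{\text{back}}, C_{\text{front}} \right)
        = \frac{4}{3} \approx 1.33
\end{align*}
Under these assumptions, the kernel is frontend-bound: it completes at most
$3/4$ of an iteration per cycle, whereas the backend alone could sustain one
iteration per cycle.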

\paragraph{Lack of hardware counters.}
The Cortex A72 features only a very limited set of specialized hardware
counters. While the CPU can report the number of elapsed cycles, retired
instructions, branch misses and various cache-miss metrics, it does not report
any event regarding macro- or micro-operations, dispatching, or issuing to
specific ports. As pointed out earlier, this makes it a particularly relevant
target for \palmed{}.