\section{The Cortex A72 CPU}
The Cortex A72~\cite{a72_doc} is a CPU based on ARMv8-A, the first ARM ISA to
implement AArch64 (the 64-bit ARM extension). It is an out-of-order CPU with
Neon SIMD support, designed as a general-purpose, high-performance core for
low-power applications.

The Raspberry Pi 4 uses a four-core A72 CPU, implemented by Broadcom as the
BCM2711; an A72 is thus easy to obtain for running experiments.

\begin{figure}
    \centering
    \includegraphics[width=\linewidth]{A72_pipeline_diagram.svg}
    \caption{Simplified overview of the Cortex A72
    pipeline}\label{fig:a72_pipeline}
\end{figure}

\paragraph{Backend.} As can be seen in \autoref{fig:a72_pipeline} (adapted from
the software optimization guide for the Cortex A72, published by
ARM~\cite{ref:a72_optim}), the Cortex A72 has eight execution ports:
\begin{itemize}
    \item a branch port (branch instructions, equivalent to x86 jumps);
    \item two identical integer ports (integer arithmetic operations);
    \item an integer multi-cycle port (complex integer operations, \eg{} divisions);
    \item two nearly identical floating-point and SIMD ports (with slight
        specializations: \eg{} only port FP0 can do SIMD multiplication,
        while only port FP1 can do floating-point comparisons);
    \item a load port;
    \item a store port.
\end{itemize}

\paragraph{Frontend.} The Cortex A72 frontend can only decode three
instructions and dispatch three \uops{} per cycle~\cite{ref:a72_optim}.
Intel's \texttt{SKL-SP}, which we considered before, has a frontend that
bottlenecks at four \uops{} per cycle~\cite{agnerfog_skl_front4}. This
difference of one \uop{} per cycle is meaningful: it means that, on average,
at most three of the eight backend ports can be fed each cycle.

\begin{example}[Second-order polynomial evaluation]
    Consider a kernel evaluating a second-order polynomial expression for
    different values of $x$:
    \begin{align*}
        P[i] &= a{X[i]}^2 + bX[i] + c \\
             &= \left( aX[i] + b \right) \times X[i] + c
    \end{align*}
    which directly translates to four operations: a load of $X[i]$, two
    floating-point multiply-adds, and a store of the result $P[i]$. The
    backend, having a load port, two SIMD ports and a store port, can execute
    one iteration of such a kernel every cycle; in steady state, out-of-order
    execution can relieve the latency-induced pressure. However, as the
    frontend bottlenecks at three \uops{} per cycle, one iteration of this
    kernel does not fit in a single cycle.
\end{example}
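
As a rough check of the example above, the steady-state cost of one iteration
can be bounded separately for the backend and the frontend. This is only a
sketch under stated assumptions: each of the four operations is assumed to map
to exactly one \uop{}, and loop overhead (index update, branch) as well as
latencies are ignored. The symbols $C_{\text{back}}$ and $C_{\text{front}}$ are
notation introduced here for illustration; they denote the minimal number of
cycles per iteration imposed by the backend ports and by the frontend,
respectively:
\begin{align*}
    C_{\text{back}} &= \max\left( \frac{1}{1},\ \frac{2}{2},\ \frac{1}{1} \right) = 1
        && \text{(load, FP/SIMD and store ports)} \\
    C_{\text{front}} &= \frac{4}{3} \approx 1.33
        && \text{(four operations, three dispatched per cycle)}
\end{align*}
Under these assumptions, one iteration takes at least
$\max\left( C_{\text{back}}, C_{\text{front}} \right) = 4/3$ cycles, that is,
at most $0.75$ iterations per cycle: the frontend, not the backend, is the
limiting resource for this kernel.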

\paragraph{Lack of hardware counters.}
The Cortex A72 only features a very limited set of specialized hardware counters.
While the CPU is able to report the number of elapsed cycles,
retired instructions, branch misses and various metrics on cache misses, it
does not report any event regarding macro- or micro-operations, or dispatching
or issuing to specific ports. This makes it, as pointed out earlier, a
particularly relevant target for \palmed{}.