The usual reverse-engineering methods for CPU models make abundant use of
hardware counters ---~and legitimately so, as they are the natural and accurate
way to obtain insight into the internals of a CPU\@. Such methods include,
among others, the optimisation guides from Agner Fog~\cite{AgnerFog}, as well
as \uopsinfo{}~\cite{uopsinfo} and \uica{}'s~\cite{uica} approaches to
respectively model the CPU's back- and front-end. In \autoref{chap:palmed}, we
introduced Palmed, whose main goal is to automatically produce port-mappings of
CPUs without assuming the presence of specific hardware counters.

\smallskip{}

The ARM architectures occupy a growing space in the global computing ecosystem.
They are already pervasive among embedded and mobile devices, with most mobile
phones featuring an ARM CPU~\cite{arm_mobile}. Processors based on ARM are also
emerging in datacenters and supercomputers: the Fugaku supercomputer
---~ranked the fastest supercomputer in the world by the TOP500
ranking~\cite{fugaku_top500}~--- runs on ARM-based CPUs~\cite{fugaku_arm},
while the MareNostrum 4 supercomputer has an ARM-based
cluster~\cite{marenostrum4_arm}.

Yet, the ARM ecosystem still lacks performance debugging tooling. While
\llvmmca{} supports ARM, it is one of only a few analyzers to do so: \iaca{},
made by Intel, does not support it ---~and never will, as it is
end-of-life~---; \uica{} is focused on Intel architectures and cannot easily be
ported, as it heavily relies on reverse engineering that is specific to Intel
and enabled by Intel-specific hardware counters; Intel \texttt{VTune}, a
commonly used performance profiling tool, supports only x86-64.

\smallskip{}

In this context, modelling an ARM CPU ---~the Cortex A72~--- with Palmed seemed
an important goal, all the more meaningful as this particular CPU has only very
few hardware counters. However, it yielded only mixed results, as shown in
\autoref{sec:palmed_results}.

In this chapter, we show that a major cause of imprecision in these results is
the absence of a frontend model.

\section{Necessity to go beyond ports}

The resource models produced by \palmed{} are mainly concerned with the backend
of the modeled CPUs. However, the importance of the frontend for the accuracy
of a model's predictions cannot be ignored. Its effect is clearly visible in
the evaluation heatmaps of the various code analyzers in
\autoref{fig:palmed_heatmaps}. Each heatmap has a clear-cut limit on the
horizontal axis: independently of the benchmark's content, it is impossible to
reach more than a given number of instructions per cycle (IPC) on a given
processor ---~4 instructions per cycle for the \texttt{SKL-SP}, 5 for the
\texttt{ZEN1}. This limit is imposed by the frontend.

Some analyzers, such as \palmed{} and \iaca{}, model this limit: their heatmaps
show that the predicted IPC does not surpass it. The other three analyzers
studied, however, do not model this limit; for instance, on the \texttt{SKL-SP}
CPU, \uopsinfo{} predicts a high density of SPEC2017 benchmarks at 8
instructions per cycle, while the native measurement yields only 4 instructions
per cycle. The same effect is visible on the \pmevo{} and \llvmmca{} heatmaps.

\begin{example}{High back-end throughput on \texttt{SKL-SP}}
    On the \texttt{SKL-SP} microarchitecture, assuming an infinitely large
    frontend, a number of instructions per cycle higher than 4 is easy to
    reach.

    According to \uopsinfo{} data, a 64-bit integer \lstxasm{addq} is
    processed as a single \uop{}, dispatched on port 0, 1, 5 or 6. Meanwhile,
    a simple 64-bit register store to an address held directly in a register
    ---~\eg{} a \lstxasm{movq \%rax, (\%rbx)}~--- is also processed as a
    single \uop{}, dispatched on port 2 or 3.

    Thus, backend-wise, the kernel $4\times \texttt{addq} + 2\times
    \texttt{movq}$ has a throughput of 6 instructions per cycle. However, in
    reality, this kernel is frontend-bound, with a theoretical maximum
    throughput of 4 instructions per cycle ---~in fact, a \pipedream{}
    measurement only yields 3 instructions per cycle.
\end{example}

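To make the arithmetic of this example concrete, the following Python sketch
is purely illustrative: the port mapping is hard-coded from the \uopsinfo{}
data quoted above, the port names and the brute-force assignment are ours, and
the frontend is reduced to a single width parameter. It computes the
backend-only bound of one isolated kernel iteration, then caps it with the
4-wide frontend.
\begin{lstlisting}[language=Python]
from itertools import product

# Illustrative port mapping, hard-coded from the uops.info data quoted
# above: one uop per instruction, dispatched on one of the listed ports.
PORTS = {
    "addq": ("p0", "p1", "p5", "p6"),
    "store": ("p2", "p3"),
}

KERNEL = ["addq"] * 4 + ["store"] * 2

def backend_cycles(kernel):
    """Cycles needed by one isolated iteration if only port conflicts
    matter: try every port assignment, keep the one whose busiest port
    receives the fewest uops."""
    best = len(kernel)
    for assignment in product(*(PORTS[insn] for insn in kernel)):
        busiest = max(assignment.count(port) for port in assignment)
        best = min(best, busiest)
    return best

cycles = backend_cycles(KERNEL)
print(len(KERNEL) / cycles)          # 6.0: backend-only bound, 6 IPC
print(min(len(KERNEL) / cycles, 4.0))  # 4.0: capped by the 4-wide frontend
\end{lstlisting}

The gap between the two printed values is exactly the error that a
backend-only model makes on this kernel.
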
\bigskip{}

To account for this, \palmed{} tries to detect an additional resource, apart
from the backend ports and combined ports, on which every \uop{} incurs a load.
This allows \palmed{} to avoid large errors on frontend-bound kernels.

The approach is, however, far from perfect. The clearest reason is that the
frontend, both on x86-64 and ARM architectures, works in-order, while
\palmed{} inherently models kernels as multisets of instructions, thus
completely ignoring ordering. This resource model is purely linear: an
instruction incurs a load on the frontend resource in a fully commutative way,
independently of the instructions previously executed in the same cycle and of
many other effects.

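Concretely, in this linear view (written here as a simplified sketch, with
notation introduced only for illustration), a kernel $K$ containing $\sigma_i$
occurrences of instruction $i$, each placing a load $\rho_{i,r}$ on abstract
resource $r$, is predicted to execute in
\begin{align*}
    t(K) = \max_{r} \sum_{i} \sigma_i \cdot \rho_{i,r}
\end{align*}
cycles per iteration, where $r$ ranges over the ports, combined ports and the
extra frontend resource. The inner sum depends only on the multiset of
instructions: it is invariant under any reordering of the kernel, so a
phenomenon that depends on which instructions are fetched, decoded or
dispatched together cannot be expressed by such an additional resource.
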
The article introducing \uica{}~\cite{uica} explores this question in detail
for x86-64 Intel architectures. The authors, who previously developed
\uopsinfo{}, discuss the importance of correctly modelling the frontend to
accurately predict throughput. Their approach, based on the exploration and
reverse-engineering of the crucial parts of the frontend, showcases many
important and non-trivial aspects of frontends that are usually neglected, such
as the switching between the decoders and the \uop{}-cache as the source of
instructions ---~which cannot be modelled linearly.

\section{The Cortex A72 CPU}

The Cortex A72~\cite{a72_doc} is a CPU based on the ARMv8-A ISA ---~the first
ARM ISA to implement AArch64, the 64-bit ARM extension. It is an out-of-order
CPU with Neon SIMD support, designed as a general-purpose, high-performance
core for low-power applications.

The Raspberry Pi 4 uses a 4-core A72 CPU, implemented by Broadcom as the
BCM2711; it is thus easy to get access to an A72 to run experiments.

\begin{figure}
    \centering
    \includegraphics[width=\linewidth]{A72_pipeline_diagram.svg}
    \caption{Simplified overview of the Cortex A72
    pipeline}\label{fig:a72_pipeline}
\end{figure}

\paragraph{Backend.} As can be seen in \autoref{fig:a72_pipeline} (adapted from
the software optimization guide for the Cortex A72, published by
ARM~\cite{ref:a72_optim}), the Cortex A72 has eight execution ports (a toy
encoding of this layout is sketched right after the list):
\begin{itemize}
    \item a branch port (branch instructions, equivalent to x86 jumps);
    \item two identical integer ports (integer arithmetic operations);
    \item an integer multi-cycle port (complex integer operations, \eg{}
        divisions);
    \item two nearly-identical floating point and SIMD ports (with slight
        specializations: \eg{} only port FP0 can do SIMD multiplication, while
        only port FP1 can do floating point comparisons);
    \item a load port;
    \item a store port.
\end{itemize}

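For reference, here is a toy Python encoding of this port layout. The port and
class names are ours, chosen only for readability, and the FP0/FP1
specializations are collapsed into a shared class; the point is merely that
such a layout boils down to a small port-to-capability mapping.
\begin{lstlisting}[language=Python]
# Toy encoding of the Cortex A72 execution ports (names are ours).
# Each port maps to the instruction classes it can accept.
A72_PORTS = {
    "B":   {"branch"},
    "I0":  {"int"},
    "I1":  {"int"},
    "M":   {"int_multi_cycle"},
    "FP0": {"fp"},  # also the only port doing SIMD multiplication
    "FP1": {"fp"},  # also the only port doing FP comparisons
    "L":   {"load"},
    "S":   {"store"},
}

def eligible_ports(insn_class):
    """Ports on which an instruction of the given class may be dispatched."""
    return {port for port, classes in A72_PORTS.items() if insn_class in classes}

print(sorted(eligible_ports("int")))   # ['I0', 'I1']
print(sorted(eligible_ports("load")))  # ['L']
\end{lstlisting}
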
\paragraph{Frontend.} The Cortex A72 frontend can only decode three
instructions and dispatch three \uops{} per cycle~\cite{ref:a72_optim}.
Intel's \texttt{SKL-SP}, which we considered before, has a frontend that
bottlenecks at four \uops{} per cycle~\cite{agnerfog_skl_front4}. This
difference of one \uop{} per cycle is meaningful: it means that at most three
of the eight backend ports can be used in any given cycle.

\begin{example}[2nd order polynomial evaluation]
    Consider a kernel evaluating a second-order polynomial for different
    values of $X[i]$:
    \begin{align*}
        P[i] &= a{X[i]}^2 + bX[i] + c \\
             &= \left( aX[i] + b \right) \times X[i] + c
    \end{align*}
    which directly translates to four operations: load $X[i]$, two floating
    point multiply-adds, store the result $P[i]$. The backend, having a load
    port, two SIMD ports and a store port, can execute one iteration of such a
    kernel every cycle; in steady state, out-of-order execution can lift the
    latency-induced pressure. However, as the frontend bottlenecks at three
    \uops{} per cycle, this kernel does not fit in a single cycle.
\end{example}

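The arithmetic behind this example can be spelled out explicitly. The Python
sketch below is a back-of-the-envelope computation only, using the numbers
from the example rather than a model of the actual dispatch logic: the backend
needs one cycle per iteration, the 3-wide frontend needs $4/3$ cycles, so the
kernel is frontend-bound at 3 instructions per cycle.
\begin{lstlisting}[language=Python]
from fractions import Fraction

# Back-of-the-envelope steady-state throughput of the polynomial kernel
# on the Cortex A72 (illustrative; numbers from the example above).
INSNS_PER_ITER = 4           # load, 2x multiply-add, store
BACKEND_CYCLES_PER_ITER = 1  # each operation has its own port (L, FP0, FP1, S)
FRONTEND_WIDTH = 3           # the A72 dispatches at most 3 uops per cycle

frontend_cycles = Fraction(INSNS_PER_ITER, FRONTEND_WIDTH)     # 4/3
cycles_per_iter = max(BACKEND_CYCLES_PER_ITER, frontend_cycles)
ipc = INSNS_PER_ITER / cycles_per_iter

print(cycles_per_iter)  # 4/3: the kernel is frontend-bound
print(ipc)              # 3: the frontend-limited throughput
\end{lstlisting}
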
\paragraph{Lack of hardware counters.}
The Cortex A72 only features a very limited set of specialized hardware
counters. While the CPU is able to report the number of elapsed cycles, retired
instructions, branch misses and various metrics on cache misses, it does not
report any event regarding macro- or micro-operations, dispatching or issuing
to specific ports. This makes it, as pointed out before, a particularly
relevant target for \palmed{}.

\section{Manually modelling the A72 frontend}

% TODO

\subsection{Methodology}