Continue writing on A72 frontend

Théophile Bastian 2023-09-20 17:46:06 +02:00
parent 8e1485e03d
commit 344b959862
4 changed files with 75 additions and 4 deletions

@ -1,4 +1,4 @@
\section{Palmed design}
\section{Palmed design}\label{sec:palmed_design}
Palmed is a tool aiming to construct a resource mapping for a CPU, in a fully
automated way, based on the execution of well-chosen benchmarks. As its goal is

@ -1,5 +1,7 @@
\section{Measuring a kernel's throughput: \pipedream{}}
\todo{Introduce pipedream notation for instructions}
To build a mapping of a CPU, Palmed fundamentally depends on the ability to
measure the execution time $\cyc{\kerK}$ of a kernel $\kerK$. However, as we
saw above, Palmed defines a kernel as a multiset of instructions, and makes

@ -1,4 +1,4 @@
\section{The Cortex A72 CPU}
\section{The Cortex A72 CPU}\label{sec:a72_descr}
The Cortex A72~\cite{a72_doc} is a CPU based on the ARMv8-A ISA ---~the first
ARM ISA to implement AArch64, the 64-bit ARM extension. It is an out-of-order

@ -1,5 +1,74 @@
\section{Manually modelling the A72 frontend}
% TODO
\todo{}
\subsection{Finding micro-operation count for each instruction}
As we saw in \autoref{sec:a72_descr}, the Cortex A72's frontend can only
dispatch three \uops{} per cycle. The first piece of data to collect is thus
the number of \uops{} each instruction is decoded into.
To that end, the optimisation manual~\cite{ref:a72_optim} helps, but is not
thorough enough: for each instruction, it lists the ports on which load is
incurred, which only sets a lower bound on the number of \uops{} the
instruction is decomposed into. Relying on the manual alone is unsatisfying for
two reasons. First, this approach cannot be automated, cannot be reproduced on
another architecture whose optimisation manual is less detailed, and fully
trusts the manufacturer. Second, if an instruction loads \eg{} the integer
ports, it may be decoded into one or several \uops{} executed on these ports,
which the manual only partly helps to determine.
\medskip{}
We instead use an approach akin to \palmed{}'s saturating kernels, itself
inspired by Agner Fog's method to identify ports in the absence of hardware
counters~\cite{AgnerFog}. To this end, we assume the availability of a port
mapping for the backend. In the case of the Cortex A72, we use \palmed{}'s
output, occasionally cross-checked by hand against the software optimisation
guide~\cite{ref:a72_optim}; \uopsinfo{} could also be used on the architectures
it models.
The \palmed{} resource mapping we use as a basis is composed of 1\,975
instructions. To make this more manageable with a semi-automated method, we
reuse the instruction classes provided by \palmed{}, introduced in
\autoref{sec:palmed_design}: instructions in the same class are mapped to the
same resources, and thus are decomposed into the same \uops{}. This leaves only
98 classes of instructions to study.
\paragraph{Basic instructions.} We use \palmed{}'s mapping to hand-pick
\emph{basic instructions}: for each port, we select one instruction which
decodes into a single \uop{} executed by this port. We use the following
instructions, in \pipedream{}'s notation:
\begin{itemize}
\item{} integer 0/1: \lstarmasm{ADC_RD_X_RN_X_RM_X}, \eg{} \lstarmasm{adc x0,
x1, x2};
\item{} integer multi-cycle: \lstarmasm{MUL_RD_W_RN_W_RM_W}, \eg{}
\lstarmasm{mul w0, w1, w2};
\item{} load: \lstarmasm{LDR_RT_X_ADDR_REGOFF}, \eg{} \lstarmasm{ldr x0,
[x1, x2]};
\item{} store: \lstarmasm{STR_RT_X_ADDR_REGOFF}, \eg{} \lstarmasm{str x0,
[x1, x2]};
\item{} FP/SIMD 0: \lstarmasm{FRINTA_FD_D_FN_D}, \eg{} \lstarmasm{frinta
d0, d1} (floating-point rounding to integral);
\item{} FP/SIMD 1: \lstarmasm{FCMP_FN_D_FM_D}, \eg{} \lstarmasm{fcmp
d0, d1} (floating-point comparison);
\item{} FP/SIMD 0/1: \lstarmasm{FMIN_FD_D_FN_D_FM_D}, \eg{} \lstarmasm{fmin
d0, d1, d1} (floating-point minimum);
\item (Branch: no basic instruction, as branch instructions are unsupported by \pipedream{}).
\end{itemize}
As the integer ports are not specialized, a single basic instruction is
sufficient for both of them. The FP/SIMD ports, however, are slightly
specialized (see \autoref{sec:a72_descr}); we thus use three basic
instructions: one that stresses each of them independently, and one that
stresses both without distinction.
\paragraph{Counting the micro-ops of an instruction.} Based on the port mapping
of an instruction~$i$, we are able to create kernels built out of one
occurrence of $i$ and occurrences of basic instructions that should not,
backend-wise, take more cycles than $i$ alone.
By measuring these kernels using \pipedream{}, we also ensure that our
measurements are not latency-bound. Thus, if the kernel takes more cycles than
$i$ alone, this limitation must come from the frontend. \todo{detail method,
explain, examples}
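\medskip{}
The reasoning above can be sketched as a worked equation. This is only an
illustration under stated assumptions, not necessarily the method eventually
retained. We write $\mu(i)$ for the number of \uops{} of $i$ (a notation
introduced here for the sake of the example), and let $\kerK$ be made of one
occurrence of $i$ and $n$ basic instructions, each decoded into a single
\uop{}. We assume that the only frontend constraint is its dispatch width of
three \uops{} per cycle, that the backend needs no more than $\cyc{i}$ cycles
to execute $\kerK$, and that adding instructions to a kernel never makes it
faster. In steady state, we would then expect
\[
  \cyc{\kerK} = \max\left(\cyc{i},\ \frac{\mu(i) + n}{3}\right).
\]
Under these assumptions, whenever the measure yields $\cyc{\kerK} > \cyc{i}$,
the frontend term is the binding one, and
\[
  \mu(i) = 3 \cdot \cyc{\kerK} - n.
\]
For instance, with purely hypothetical values $\cyc{i} = 1$, $n = 2$ and a
measured $\cyc{\kerK} = \frac{4}{3}$, this would give
$\mu(i) = 3 \cdot \frac{4}{3} - 2 = 2$. Conversely, a kernel for which
$\cyc{\kerK} = \cyc{i}$ only shows that $\mu(i) \leq 3 \cdot \cyc{i} - n$, so
$n$ should be chosen large enough for the frontend to become the bottleneck.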
\subsection{Methodology}