Continue writing on A72 frontend
parent
8e1485e03d
commit
344b959862
4 changed files with 75 additions and 4 deletions
|
@ -1,4 +1,4 @@
|
|||
\section{Palmed design}
|
||||
\section{Palmed design}\label{sec:palmed_design}
|
||||
|
||||
Palmed is a tool aiming to construct a resource mapping for a CPU, in a fully
|
||||
automated way, based on the execution of well-chosen benchmarks. As its goal is
|
||||
|
|
|
@ -1,5 +1,7 @@
|
|||
\section{Measuring a kernel's throughput: \pipedream{}}
|
||||
|
||||
\todo{Introduce pipedream notation for instructions}
|
||||
|
||||
To build a mapping of a CPU, Palmed fundamentally depends on the ability to
|
||||
measure the execution time $\cyc{\kerK}$ of a kernel $\kerK$. However, as we
|
||||
saw above, Palmed defines a kernel as a multiset of instructions, and makes
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
\section{The Cortex A72 CPU}
|
||||
\section{The Cortex A72 CPU}\label{sec:a72_descr}
|
||||
|
||||
The Cortex A72~\cite{a72_doc} is a CPU based on the ARMv8-A ISA ---~the first
|
||||
ARM ISA to implement AArch64, the 64-bit ARM extension. It is an out-of-order
|
||||
|
|
|
@ -1,5 +1,74 @@
|
|||
\section{Manually modelling the A72 frontend}
|
||||
|
||||
% TODO
|
||||
\todo{}
|
||||
|
||||
\subsection{Finding micro-operation count for each instruction}
|
||||
|
||||
As we saw in \autoref{sec:a72_descr}, the Cortex A72's frontend can only
|
||||
dispatch three \uops{} per cycle. The first piece of data to collect is thus
|
||||
the number of \uops{} each instruction is decoded into.
|
||||
|
||||
To that end, the optimisation manual~\cite{ref:a72_optim} helps, but is not
|
||||
thorough enough: for each instruction, it lists the ports on which load is
|
||||
incurred, which sets a lower bound on the number of \uops{} the instruction is
|
||||
decomposed into. This approach, however, is not really satisfying. First,
|
||||
because it cannot be reproduced for another architecture whose optimisation
|
||||
manual is not as detailed, cannot be automated, and fully trusts the
|
||||
manufacturer. Second, because if an instruction loads \eg{} the integer ports,
|
||||
it may have one or several \uops{} executed on the integer ports; the
|
||||
manual is only helpful to some extent to determine this.
|
||||
|
||||
\medskip{}
|
||||
|
||||
We instead use an approach akin to \palmed{}'s saturating kernels, itself
|
||||
inspired by Agner Fog's method to identify ports in the absence of hardware
|
||||
counters~\cite{AgnerFog}. To this end, we assume the availability of a port
|
||||
mapping for the backend ---~in the case of the Cortex A72, we use \palmed{}'s
|
||||
output, sometimes manually cross-checked against the software optimisation
|
||||
guide~\cite{ref:a72_optim}; \uopsinfo{} could also be used on the architectures
|
||||
it models.
|
||||
|
||||
The \palmed{} resource mapping we use as a basis is composed of 1\,975
|
||||
instructions. To keep this manageable for a semi-automated method, we reuse the
instruction classes provided by \palmed{}, introduced in
|
||||
\autoref{sec:palmed_design}, as instructions in the same class are mapped to
|
||||
the same resources, and thus are decomposed into the same \uops{}; this results
|
||||
in only 98 classes of instructions.
|
||||
|
||||
\paragraph{Basic instructions.} We use \palmed{}'s mapping to hand-pick
|
||||
\emph{basic instructions}: for each port, we select one instruction which
|
||||
decodes into a single \uop{} executed by this port. We use the following
|
||||
instructions, in \pipedream{}'s notation:
|
||||
|
||||
\begin{itemize}
|
||||
\item{} integer 0/1: \lstarmasm{ADC_RD_X_RN_X_RM_X}, \eg{} \lstarmasm{adc x0,
|
||||
x1, x2};
|
||||
\item{} integer multi-cycle: \lstarmasm{MUL_RD_W_RN_W_RM_W}, \eg{}
|
||||
\lstarmasm{mul w0, w1, w2};
|
||||
\item{} load: \lstarmasm{LDR_RT_X_ADDR_REGOFF}, \eg{} \lstarmasm{ldr x0,
|
||||
[x1, x2]};
|
||||
\item{} store: \lstarmasm{STR_RT_X_ADDR_REGOFF}, \eg{} \lstarmasm{str x0,
|
||||
[x1, x2]};
|
||||
\item{} FP/SIMD 0: \lstarmasm{FRINTA_FD_D_FN_D}, \eg{} \lstarmasm{frinta
|
||||
d0, d1} (floating-point rounding to integral);
|
||||
\item{} FP/SIMD 1: \lstarmasm{FCMP_FN_D_FM_D}, \eg{} \lstarmasm{fcmp
|
||||
d0, d1} (floating-point comparison);
|
||||
\item{} FP/SIMD 0/1: \lstarmasm{FMIN_FD_D_FN_D_FM_D}, \eg{} \lstarmasm{fmin
|
||||
d0, d1, d1} (floating-point minimum);
|
||||
\item{} branch: no instruction, as branches are unsupported by \pipedream{}.
|
||||
\end{itemize}
|
||||
|
||||
As the integer ports are not specialised, a single basic instruction is
|
||||
sufficient for both of them. As the FP/SIMD ports are slightly specialised
|
||||
(see \autoref{sec:a72_descr}), we thus use three basic instructions: one that
|
||||
stresses each of them independently, and one that stresses both without
|
||||
distinction.
|
||||
|
||||
\paragraph{Counting the micro-ops of an instruction.} Based on the port mapping
|
||||
of an instruction~$i$, we are able to create kernels built out of one
|
||||
occurrence of $i$ and occurrences of basic instructions that should not,
|
||||
backend-wise, take more cycles than $i$ alone.
|
||||
By measuring them using \pipedream{}, we also ensure that our
|
||||
measurement is not latency-bound. Thus, if the kernel takes more cycles than $i$
|
||||
alone, this limitation must come from the frontend. \todo{detail method,
|
||||
explain, examples}
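
\medskip{}

The argument above can be sketched as follows, under assumptions that are ours
rather than established results: we write $\mu(i)$ (hypothetical notation) for
the number of \uops{} instruction~$i$ decodes into, and we assume that the
frontend sustains exactly three \uops{} per cycle in steady state. Consider a
kernel $\kerK$ made of one occurrence of $i$ and $n$ basic instructions, each
decoding into a single \uop{}. The frontend then imposes
\[
    \cyc{\kerK} \geq \frac{\mu(i) + n}{3}.
\]
If the basic instructions are chosen so that the backend alone would need no
more than $\cyc{i}$ cycles, and yet we measure $\cyc{\kerK} > \cyc{i}$, the
frontend must be the bottleneck; assuming it is then saturated, the bound is
tight and
\[
    \mu(i) = 3 \cdot \cyc{\kerK} - n.
\]
For instance, with purely illustrative figures (not measurements), if
$\cyc{i} = 1$ and a kernel with $n = 2$ basic instructions is measured at
$\cyc{\kerK} = \frac{4}{3}$, this would give
$\mu(i) = 3 \times \frac{4}{3} - 2 = 2$.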
|
||||
|
||||
\subsection{Methodology}
|
||||
|
|