\section{Manually modelling the A72 frontend}

\todo{}

\subsection{Finding micro-operation count for each instruction}

As we saw in \autoref{sec:a72_descr}, the Cortex A72's frontend can only
dispatch three \uops{} per cycle. The first important piece of data to collect
is thus the number of \uops{} each instruction is decoded into.

To that end, the optimisation manual~\cite{ref:a72_optim} helps, but is not
thorough enough: for each instruction, it lists the ports on which load is
incurred, which sets a lower bound on the number of \uops{} the instruction is
decomposed into. Relying on the manual alone, however, is not satisfactory.
First, this approach cannot be reproduced on another architecture whose
optimisation manual is less detailed, cannot be automated, and fully trusts the
manufacturer. Second, if an instruction loads \eg{} the integer ports, it may
have one or several \uops{} executed on these ports; the manual only helps to
some extent in telling these cases apart.

\medskip{}

We instead use an approach akin to \palmed{}'s saturating kernels, itself
inspired by Agner Fog's method to identify ports in the absence of hardware
counters~\cite{AgnerFog}. To this end, we assume that a port mapping is
available for the backend. In the case of the Cortex A72, we use \palmed{}'s
output, sometimes manually cross-checked against the software optimisation
guide~\cite{ref:a72_optim}; \uopsinfo{} could also be used on the architectures
it models.

The \palmed{} resource mapping we use as a basis comprises 1\,975
instructions. To keep this manageable with a semi-automated method, we reuse
the instruction classes provided by \palmed{}, introduced in
\autoref{sec:palmed_design}: instructions in the same class are mapped to the
same resources, and are thus decomposed into the same \uops{}. This leaves only
98 classes of instructions.

\paragraph{Basic instructions.} We use \palmed{}'s mapping to hand-pick
\emph{basic instructions}: for each port, we select one instruction that
decodes into a single \uop{} executed by this port. We use the following
instructions, in \pipedream{}'s notation:

\begin{itemize}
    \item{} integer 0/1: \lstarmasm{ADC_RD_X_RN_X_RM_X}, \eg{}
        \lstarmasm{adc x0, x1, x2};
    \item{} integer multi-cycle: \lstarmasm{MUL_RD_W_RN_W_RM_W}, \eg{}
        \lstarmasm{mul w0, w1, w2};
    \item{} load: \lstarmasm{LDR_RT_X_ADDR_REGOFF}, \eg{}
        \lstarmasm{ldr x0, [x1, x2]};
    \item{} store: \lstarmasm{STR_RT_X_ADDR_REGOFF}, \eg{}
        \lstarmasm{str x0, [x1, x2]};
    \item{} FP/SIMD 0: \lstarmasm{FRINTA_FD_D_FN_D}, \eg{}
        \lstarmasm{frinta d0, d1} (floating-point rounding to integral);
    \item{} FP/SIMD 1: \lstarmasm{FCMP_FN_D_FM_D}, \eg{}
        \lstarmasm{fcmp d0, d1} (floating-point comparison);
    \item{} FP/SIMD 0/1: \lstarmasm{FMIN_FD_D_FN_D_FM_D}, \eg{}
        \lstarmasm{fmin d0, d1, d1} (floating-point minimum);
    \item{} (Branch: no instruction, as they are unsupported by \pipedream{}).
\end{itemize}

As the integer ports are not specialised, a single basic instruction is
sufficient for both of them. The FP/SIMD ports are slightly specialised
(see \autoref{sec:a72_descr}); we thus use three basic instructions: one for
each port taken independently, and one that stresses both without distinction.

\paragraph{Counting the micro-ops of an instruction.} Based on the port mapping
of an instruction~$i$, we can build kernels made of one occurrence of $i$ and
several occurrences of basic instructions, chosen so that, backend-wise, the
kernel should not take more cycles than $i$ alone. Measuring these kernels with
\pipedream{} also ensures that our measurements are not latency-bound. Thus, if
a kernel takes more cycles than $i$ alone, the limitation must come from the
frontend. \todo{detail method, explain, examples}
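\paragraph{A possible formalisation.} The reasoning above can be sketched as
follows; this is only an illustration under stated assumptions, not necessarily
the exact procedure used here. The notations $\mu$, $n$, $C$ and
$C_{\mathrm{be}}$ are introduced for this sketch only. Let $\mu(i)$ be the
unknown number of \uops{} of $i$, and let $K$ be a kernel made of one
occurrence of $i$ and $n$ basic instructions, each decoding into a single
\uop{}. Let $C(\cdot)$ denote the measured steady-state number of cycles per
iteration, and $C_{\mathrm{be}}(\cdot)$ the number of cycles imposed by the
backend ports alone. Assuming that the only bottlenecks are the backend ports
and a frontend dispatching at most three \uops{} per cycle, and that the
slowest of the two is reached in practice:
\[
    C(K) = \max\left( C_{\mathrm{be}}(K),\; \frac{\mu(i) + n}{3} \right),
    \qquad
    C_{\mathrm{be}}(K) = C_{\mathrm{be}}(i) \leq C(i).
\]
The second relation holds by construction of $K$: the basic instructions are
picked so as not to add backend pressure beyond that of $i$ alone. Hence, if
$C(K) > C(i)$, the maximum is reached by the frontend term, and
\[
    C(K) = \frac{\mu(i) + n}{3}
    \quad\Longrightarrow\quad
    \mu(i) = 3\,C(K) - n.
\]
As a purely hypothetical numerical example (these figures are not
measurements), with $n = 4$ basic instructions, $C(i) = 1$ and $C(K) = 2$,
this would give $\mu(i) = 3 \times 2 - 4 = 2$ \uops{}.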