Continue writing on A72 frontend

2023-09-20 17:46:06 +02:00 · 2023-09-20 17:46:06 +02:00 · 344b959862
commit 344b959862
parent 8e1485e03d
4 changed files with 75 additions and 4 deletions
--- a/manuscrit/30_palmed/20_palmed_design.tex
+++ b/manuscrit/30_palmed/20_palmed_design.tex
@@ -1,4 +1,4 @@
-\section{Palmed design}
+\section{Palmed design}\label{sec:palmed_design}

 Palmed is a tool aiming to construct a resource mapping for a CPU, in a fully
 automated way, based on the execution of well-chosen benchmarks. As its goal is
--- a/manuscrit/30_palmed/30_pipedream.tex
+++ b/manuscrit/30_palmed/30_pipedream.tex
@@ -1,5 +1,7 @@
 \section{Measuring a kernel's throughput: \pipedream{}}

+\todo{Introduce pipedream notation for instructions}
+
 To build a mapping of a CPU, Palmed fundamentally depends on the ability to
 measure the execution time $\cyc{\kerK}$ of a kernel $\kerK$. However, as we
 saw above, Palmed defines a kernel as a multiset of instructions, and makes
--- a/manuscrit/40_A72-frontend/20_cortex_a72.tex
+++ b/manuscrit/40_A72-frontend/20_cortex_a72.tex
@@ -1,4 +1,4 @@
-\section{The Cortex A72 CPU}
+\section{The Cortex A72 CPU}\label{sec:a72_descr}

 The Cortex A72~\cite{a72_doc} is a CPU based on the ARMv8-A ISA ---~the first
 ARM ISA to implement Aarch64, the 64-bits ARM extension. It is an out-of-order
--- a/manuscrit/40_A72-frontend/30_manual_frontend.tex
+++ b/manuscrit/40_A72-frontend/30_manual_frontend.tex
@@ -1,5 +1,74 @@
 \section{Manually modelling the A72 frontend}

-% TODO
+\todo{}
+
+\subsection{Finding micro-operation count for each instruction}
+
+As we saw in \autoref{sec:a72_descr}, the Cortex A72's frontend can only
+dispatch three \uops{} per cycle. The first important piece of data to collect
+is thus the number of \uops{} each instruction is decoded into.
+
+To that end, the optimisation manual~\cite{ref:a72_optim} helps, but is not
+thorough enough: for each instruction, it lists the ports on which load is
+incurred, which sets a lower bound on the number of \uops{} the instruction is
+decomposed into. Relying on it alone, however, is not satisfactory. First,
+this approach cannot be automated, cannot be reproduced for another
+architecture whose optimisation manual is less detailed, and fully trusts the
+manufacturer. Second, if an instruction loads \eg{} the integer ports, it may
+have one or several \uops{} executed on these ports; the manual only helps
+determine this to some extent.
+
+\medskip{}
+
+We instead use an approach akin to \palmed{}'s saturating kernels, itself
+inspired by Agner Fog's method to identify ports in the absence of hardware
+counters~\cite{AgnerFog}. To this end, we assume the availability of a port
+mapping for the backend ---~in the case of the Cortex A72, we use \palmed{}'s
+output, sometimes manually cross-checked against the software optimisation
+guide~\cite{ref:a72_optim}; \uopsinfo{} could also be used on the architectures
+it models.
+
+The \palmed{} resource mapping we use as a basis is composed of 1\,975
+instructions. To make this more manageable in a semi-automated method, we reuse the instruction classes provided by \palmed{}, introduced in
+\autoref{sec:palmed_design}, as instructions in the same class are mapped to
+the same resources, and thus are decomposed into the same \uops{}; this results
+in only 98 classes of instructions.
+
+\paragraph{Basic instructions.} We use \palmed{}'s mapping to hand-pick
+\emph{basic instructions}: for each port, we select one instruction which
+decodes into a single \uop{} executed by this port. We use the following
+instructions, in \pipedream{}'s notation:
+
+\begin{itemize}
+    \item{} integer 0/1: \lstarmasm{ADC_RD_X_RN_X_RM_X}, \eg{} \lstarmasm{adc x0,
+        x1, x2};
+    \item{} integer multi-cycle: \lstarmasm{MUL_RD_W_RN_W_RM_W}, \eg{}
+        \lstarmasm{mul w0, w1, w2};
+    \item{} load: \lstarmasm{LDR_RT_X_ADDR_REGOFF}, \eg{} \lstarmasm{ldr x0,
+        [x1, x2]};
+    \item{} store: \lstarmasm{STR_RT_X_ADDR_REGOFF}, \eg{} \lstarmasm{str x0,
+        [x1, x2]};
+    \item{} FP/SIMD 0: \lstarmasm{FRINTA_FD_D_FN_D}, \eg{} \lstarmasm{frinta
+        d0, d1} (floating-point rounding to integral);
+    \item{} FP/SIMD 1: \lstarmasm{FCMP_FN_D_FM_D}, \eg{} \lstarmasm{fcmp
+        d0, d1} (floating-point comparison);
+    \item{} FP/SIMD 0/1: \lstarmasm{FMIN_FD_D_FN_D_FM_D}, \eg{} \lstarmasm{fmin
+        d0, d1, d1} (floating-point minimum);
+    \item{} (branch: no instruction, as branches are unsupported by \pipedream{}).
+\end{itemize}
+
+As the integer ports are not specialised, a single basic instruction is
+sufficient for both of them. The FP/SIMD ports, however, are slightly
+specialised (see \autoref{sec:a72_descr}); we thus use three basic
+instructions: two that each stress a single port, and one that stresses both
+without distinction.
+
+\paragraph{Counting the micro-ops of an instruction.} Based on the port mapping
+of an instruction~$i$, we are able to create kernels built out of one
+occurrence of $i$ and occurrences of basic instructions that should not,
+backend-wise, take more cycles than $i$ alone.
+By measuring them using \pipedream{}, we also ensure that our measurement is
+not latency-bound. Thus, if the kernel takes more cycles than $i$
+alone, this limitation must come from the frontend. \todo{detail method,
+explain, examples}
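+
+As a rough sketch of this reasoning, under assumptions of ours that remain to
+be validated: let $\mu(i)$ (a notation introduced here only for illustration)
+be the number of \uops{} of $i$, and let $\kerK$ be made of one occurrence of
+$i$ and $n$ basic instructions, each decoded into a single \uop{}. If the
+frontend, dispatching at most three \uops{} per cycle, is the only bottleneck
+of $\kerK$, we would expect
+\[
+    \cyc{\kerK} = \frac{\mu(i) + n}{3}
+    \quad\Longrightarrow\quad
+    \mu(i) = 3 \, \cyc{\kerK} - n.
+\]
+For instance, with a hypothetical $n = 2$ and a hypothetical measurement of
+$\cyc{\kerK} = 4/3$ cycles, this would suggest $\mu(i) = 3 \times 4/3 - 2 = 2$.
+This only holds if the backend and latencies are indeed not limiting, which is
+what the choice of basic instructions above aims to guarantee.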

-\subsection{Methodology}