diff --git a/manuscrit/30_palmed/20_palmed_design.tex b/manuscrit/30_palmed/20_palmed_design.tex
index 7090cb7..d15f956 100644
--- a/manuscrit/30_palmed/20_palmed_design.tex
+++ b/manuscrit/30_palmed/20_palmed_design.tex
@@ -1,4 +1,4 @@
-\section{Palmed design}
+\section{Palmed design}\label{sec:palmed_design}
 Palmed is a tool aiming to construct a resource mapping for a CPU, in a fully automated way, based on the execution of well-chosen benchmarks. As its goal is
diff --git a/manuscrit/30_palmed/30_pipedream.tex b/manuscrit/30_palmed/30_pipedream.tex
index bf97a43..d4135c5 100644
--- a/manuscrit/30_palmed/30_pipedream.tex
+++ b/manuscrit/30_palmed/30_pipedream.tex
@@ -1,5 +1,7 @@
 \section{Measuring a kernel's throughput: \pipedream{}}
+\todo{Introduce pipedream notation for instructions}
+
 To build a mapping of a CPU, Palmed fundamentally depends on the ability to measure the execution time $\cyc{\kerK}$ of a kernel $\kerK$. However, as we saw above, Palmed defines a kernel as a multiset of instructions, and makes
diff --git a/manuscrit/40_A72-frontend/20_cortex_a72.tex b/manuscrit/40_A72-frontend/20_cortex_a72.tex
index 254447f..1dc6dbb 100644
--- a/manuscrit/40_A72-frontend/20_cortex_a72.tex
+++ b/manuscrit/40_A72-frontend/20_cortex_a72.tex
@@ -1,4 +1,4 @@
-\section{The Cortex A72 CPU}
+\section{The Cortex A72 CPU}\label{sec:a72_descr}
 The Cortex A72~\cite{a72_doc} is a CPU based on the ARMv8-A ISA ---~the first ARM ISA to implement Aarch64, the 64-bits ARM extension.
It is an out-of-order
diff --git a/manuscrit/40_A72-frontend/30_manual_frontend.tex b/manuscrit/40_A72-frontend/30_manual_frontend.tex
index 9bcf016..b58b1c8 100644
--- a/manuscrit/40_A72-frontend/30_manual_frontend.tex
+++ b/manuscrit/40_A72-frontend/30_manual_frontend.tex
@@ -1,5 +1,74 @@
 \section{Manually modelling the A72 frontend}
-% TODO
+\todo{}
+
+\subsection{Finding the micro-operation count of each instruction}
+
+As we saw in \autoref{sec:a72_descr}, the Cortex A72's frontend can only
+dispatch three \uops{} per cycle. The first piece of data to collect is thus
+the number of \uops{} each instruction is decoded into.
+
+To that end, the optimisation manual~\cite{ref:a72_optim} helps, but is not
+thorough enough: for each instruction, it lists the ports on which load is
+incurred, which sets a lower bound on the number of \uops{} the instruction is
+decomposed into. This approach, however, is not satisfactory, for two reasons.
+First, it cannot be reproduced for another architecture whose optimisation
+manual is not as detailed, it cannot be automated, and it fully trusts the
+manufacturer. Second, if an instruction loads \eg{} the integer ports, it may
+have a single or multiple \uops{} executed on the integer ports; the manual
+is only helpful to some extent to determine this.
+
+\medskip{}
+
+We instead use an approach akin to \palmed{}'s saturating kernels, itself
+inspired by Agner Fog's method to identify ports in the absence of hardware
+counters~\cite{AgnerFog}. To this end, we assume the availability of a port
+mapping for the backend ---~in the case of the Cortex A72, we use \palmed{}'s
+output, sometimes manually cross-checked against the software optimisation
+guide~\cite{ref:a72_optim}; \uopsinfo{} could also be used on the architectures
+it models.
+
+The \palmed{} resource mapping we use as a basis is composed of 1\,975
+instructions.
+To make this more manageable in a semi-automated method, we reuse the
+instruction classes provided by \palmed{}, introduced in
+\autoref{sec:palmed_design}, as instructions in the same class are mapped to
+the same resources, and are thus decomposed into the same \uops{}; this results
+in only 98 classes of instructions.
+
+\paragraph{Basic instructions.} We use \palmed{}'s mapping to hand-pick
+\emph{basic instructions}: for each port, we select one instruction which
+decodes into a single \uop{} executed by this port. We use the following
+instructions, in \pipedream{}'s notation:
+
+\begin{itemize}
+    \item{} integer 0/1: \lstarmasm{ADC_RD_X_RN_X_RM_X}, \eg{}
+        \lstarmasm{adc x0, x1, x2};
+    \item{} integer multi-cycle: \lstarmasm{MUL_RD_W_RN_W_RM_W}, \eg{}
+        \lstarmasm{mul w0, w1, w2};
+    \item{} load: \lstarmasm{LDR_RT_X_ADDR_REGOFF}, \eg{}
+        \lstarmasm{ldr x0, [x1, x2]};
+    \item{} store: \lstarmasm{STR_RT_X_ADDR_REGOFF}, \eg{}
+        \lstarmasm{str x0, [x1, x2]};
+    \item{} FP/SIMD 0: \lstarmasm{FRINTA_FD_D_FN_D}, \eg{}
+        \lstarmasm{frinta d0, d1} (floating-point rounding to integral);
+    \item{} FP/SIMD 1: \lstarmasm{FCMP_FN_D_FM_D}, \eg{}
+        \lstarmasm{fcmp d0, d1} (floating-point comparison);
+    \item{} FP/SIMD 0/1: \lstarmasm{FMIN_FD_D_FN_D_FM_D}, \eg{}
+        \lstarmasm{fmin d0, d1, d1} (floating-point minimum);
+    \item{} (branch: no instruction, as branch instructions are unsupported
+        by \pipedream{}).
+\end{itemize}
+
+As the integer ports are not specialised, a single basic instruction is
+sufficient for both of them. The FP/SIMD ports are slightly specialised
+(see \autoref{sec:a72_descr}); we thus use three basic instructions: one that
+stresses each of them independently, and one that stresses both without
+distinction.
+
+\paragraph{Counting the micro-ops of an instruction.} Based on the port mapping
+of an instruction~$i$, we are able to create kernels built out of one
+occurrence of $i$ and occurrences of basic instructions that should not,
+backend-wise, take more cycles than $i$ alone.
+By measuring them using \pipedream{}, we also ensure that our
+measurement is not latency-bound. Thus, if the kernel takes more cycles than
+$i$ alone, this limitation must come from the frontend. \todo{detail method,
+explain, examples}
-\subsection{Methodology}
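+
+\medskip{}
+
+The arithmetic behind this reasoning can be sketched as follows, under
+assumptions that are ours and not part of the method detailed so far. Write
+$\mu(i)$ for the unknown number of \uops{} of $i$ (a notation introduced only
+for this illustration), and let $\kerK$ be a kernel made of one occurrence of
+$i$ and $n$ basic instructions, each decoded into a single \uop{}. Assuming
+that the frontend is the only bottleneck of $\kerK$ and that it dispatches
+three \uops{} per cycle in steady state, we would expect
+\[
+    \cyc{\kerK} = \frac{\mu(i) + n}{3}
+    \quad\Longrightarrow\quad
+    \mu(i) = 3 \, \cyc{\kerK} - n.
+\]
+For instance, with purely hypothetical figures, $n = 2$ and a measured
+$\cyc{\kerK} = \frac{4}{3}$ would give
+$\mu(i) = 3 \times \frac{4}{3} - 2 = 2$. If the frontend is not the
+bottleneck of $\kerK$, the relation degrades to the upper bound
+$\mu(i) \leq 3 \, \cyc{\kerK} - n$.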