From 24e3d4a81743f11f6dcb0f9d72dd7ff4bd26097c Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Th=C3=A9ophile=20Bastian?=
Date: Sat, 17 Aug 2024 15:43:20 +0200
Subject: [PATCH] Proof-read chapter 3 (A72 frontend)

---
 manuscrit/40_A72-frontend/00_intro.tex        | 20 ++++++-----
 manuscrit/40_A72-frontend/10_beyond_ports.tex |  2 +-
 .../40_A72-frontend/30_manual_frontend.tex    | 36 +++++++++----------
 manuscrit/40_A72-frontend/40_evaluation.tex   |  2 +-
 manuscrit/40_A72-frontend/50_future_works.tex |  3 +-
 5 files changed, 33 insertions(+), 30 deletions(-)

diff --git a/manuscrit/40_A72-frontend/00_intro.tex b/manuscrit/40_A72-frontend/00_intro.tex
index 12edfff..21f96f1 100644
--- a/manuscrit/40_A72-frontend/00_intro.tex
+++ b/manuscrit/40_A72-frontend/00_intro.tex
@@ -27,19 +27,23 @@ analysis tool, supports only x86-64.
 
 \smallskip{}
 
-In this context, modelling an ARM CPU ---~the Cortex A72~--- with Palmed seemed
-to be an important goal, especially meaningful as this particular CPU only has
-very few hardware counters. However, it yielded only mixed results, as shown in
-\autoref{sec:palmed_results}.
+In this context, modelling an ARM CPU ---~the Cortex A72~--- with \palmed{}
+seemed to be an important goal, especially meaningful as this particular CPU
+only has very few hardware counters. However, it yielded only mixed results, as
+we will see in \autoref{sec:a40_eval}.
 
 \bigskip{}
 
 In this chapter, we show that a major cause of imprecision in these results is
-the absence of a frontend model. We manually model the Cortex A72 frontend to
-compare a raw \palmed{}-generated model, to one naively augmented with a
-frontend model.
+the absence in \palmed{} of a frontend model. We manually model the Cortex A72
+frontend to compare a raw \palmed{}-generated model, to one naively augmented
+with a frontend model.
 
 While this chapter only documents a manual approach, we view it as a
 preliminary work towards an automation of the synthesis of a model that stems
 from benchmarks data, in the same way that \palmed{} synthesises a backend
-model.
+model. In this direction, we propose in \autoref{sec:frontend_parametric_model}
+a generic, parametric frontend that, we expect, could be used with good results
+on many architectures. We also offer methodologies that we expect to be able to
+automatically fill some of the parameters of this model for an arbitrary
+architecture.
diff --git a/manuscrit/40_A72-frontend/10_beyond_ports.tex b/manuscrit/40_A72-frontend/10_beyond_ports.tex
index adaf881..bc86967 100644
--- a/manuscrit/40_A72-frontend/10_beyond_ports.tex
+++ b/manuscrit/40_A72-frontend/10_beyond_ports.tex
@@ -18,7 +18,7 @@ SPEC2017 on the \texttt{SKL-SP} CPU, while the native measurement yielded only
 4 instructions per cycle. The same effect is visible on \pmevo{} and \llvmmca{}
 heatmaps.
 
-\begin{example}{High back-end throughput on \texttt{SKL-SP}}
+\begin{example}[High back-end throughput on \texttt{SKL-SP}]
     On the \texttt{SKL-SP} microarchitecture, assuming an infinitely large
     frontend, a number of instructions per cycle higher than 4 is easy to
     reach.
diff --git a/manuscrit/40_A72-frontend/30_manual_frontend.tex b/manuscrit/40_A72-frontend/30_manual_frontend.tex
index 64a1569..f2dcd3a 100644
--- a/manuscrit/40_A72-frontend/30_manual_frontend.tex
+++ b/manuscrit/40_A72-frontend/30_manual_frontend.tex
@@ -1,7 +1,7 @@
 \section{Manually modelling the A72 frontend}
 
 Our objective is now to manually construct a frontend model of the Cortex A72.
-We strive, however, to remain as close to an algorithmic methodology that is
+We strive, however, to remain as close to an algorithmic methodology as
 possible: while our model's structure is manually crafted, its data should come
 from experiments that can be later automated.
 
@@ -24,7 +24,7 @@ manual is only helpful to some extent to determine this.
 
 \medskip{}
 
-We instead use an approach akin to \palmed{}' saturating kernels, itself
+We instead use an approach akin to \palmed{}'s saturating kernels, itself
 inspired by Agner Fog's method to identify ports in the absence of hardware
 counters~\cite{AgnerFog}. To this end, we assume the availability of a port
 mapping for the backend ---~in the case of the Cortex A72, we use \palmed{}'s
@@ -146,20 +146,20 @@ model mapping each supported instruction of the ISA to its \uop{} count.
     basic instruction for the integer port.
 
     We measure $\cyc{\imath} = 0.51 \simeq \sfrac{1}{2}\,\text{cycle}$; hence,
-    we consider $\kerK_3$ and $\kerK_4$. Our mapping indicates that this
+    we consider $\kerK_2$ and $\kerK_3$. Our mapping indicates that this
     instruction loads only the \texttt{Int01} port with a load of
     $\sfrac{1}{2}$.
 
-    We select \eg{} $\kerK_3 = i + 2\times \basic{FP01}$ and $\kerK_4 = i +
+    We select \eg{} $\kerK_2 = i + 2\times \basic{FP01}$ and $\kerK_3 = i +
     \basic{FP01} + \basic{Ld} + \basic{FP01}$.
 
     We measure
     \begin{itemize}
-        \item $\cyc{\kerK_3} = 1.01 \simeq 1\,\text{cycle}$
-        \item $\cyc{\kerK_4} = 1.35 \simeq 1\,\sfrac{1}{3}\,\text{cycles}$
+        \item $\cyc{\kerK_2} = 1.01 \simeq 1\,\text{cycle}$
+        \item $\cyc{\kerK_3} = 1.35 \simeq 1\,\sfrac{1}{3}\,\text{cycles}$
     \end{itemize}
     which is consistent. We conclude that, as expected, $\mucount i =
-    3\cyc{\kerK_3} = 3-2 = 1$.
+    3\cyc{\kerK_2} - 2 = 3-2 = 1$.
 \end{example}
 
 \begin{example}[\uop{} count measure: \lstarmasm{ADDV_FD_H_VN_V_8H}]
@@ -168,7 +168,7 @@ model mapping each supported instruction of the ISA to its \uop{} count.
     operands.
 
     We measure $\cyc{\imath} = 1.01 \simeq 1\,\text{cycle}$; hence, we consider
-    $\kerK_3$ and $\kerK_4$. Our mapping indicates that this instruction loads
+    $\kerK_2$ and $\kerK_3$. Our mapping indicates that this instruction loads
     the \texttt{FP1} port with a load of $1$, and the \texttt{FP01} port with a
     load of $1$\footnote{The \texttt{FP01} port has a throughput of 2, hence a
     load of 1 means two \uops{}. As there is already a \uop{} loading the
@@ -176,16 +176,16 @@ model mapping each supported instruction of the ISA to its \uop{} count.
     this can be understood as one \uop{} on \texttt{FP1} exclusively, plus one
     on either \texttt{FP0} or \texttt{FP1}.}.
 
-    We select \eg{} $\kerK_3 = i + 2\times \basic{Int01}$ and $\kerK_4 = i +
+    We select \eg{} $\kerK_2 = i + 2\times \basic{Int01}$ and $\kerK_3 = i +
     \basic{Int01} + \basic{Ld} + \basic{Int01}$.
 
     We measure
     \begin{itemize}
-        \item $\cyc{\kerK_3} = 1.35 \simeq 1\,\sfrac{1}{3}\,\text{cycles}$
-        \item $\cyc{\kerK_4} = 1.68 \simeq 1\,\sfrac{2}{3}\,\text{cycles}$
+        \item $\cyc{\kerK_2} = 1.35 \simeq 1\,\sfrac{1}{3}\,\text{cycles}$
+        \item $\cyc{\kerK_3} = 1.68 \simeq 1\,\sfrac{2}{3}\,\text{cycles}$
     \end{itemize}
-    which is consistent. We conclude that $\mucount i = 3\cyc{\kerK_3} = 4-2 =
-    2$.
+    which is consistent. We conclude that $\mucount i = 3\cyc{\kerK_2} - 2 =
+    4-2 = 2$.
 \end{example}
 
 
@@ -240,7 +240,7 @@ steady-state.
 
 On the x86-64 architectures they analyzed, \uica{}'s authors find that the
 CPU's predecoder might cause an instruction's \uops{} to be postponed to the
-next cycle if it is pre-decoded across a cycle boundary~\cite{uica} (§4.1).
+next cycle if it is pre-decoded across a cycle boundary~\cite[§4.1]{uica}.
 
 We hypothesize that the same kind of effect could postpone an instruction's
 \uops{} until the next cycle if its \uops{} would cross a cycle boundary
@@ -248,10 +248,10 @@ otherwise. This behaviour is illustrated in \autoref{fig:frontend_nocross},
 with a kernel composed of three instructions: the first two each decode to a
 single \uop{}, while the third one decodes to two \uops{}. In this figure, each
 row represents a CPU cycle, while each square represents a \uop{}-slot in the
-frontend; there are thus three squares in each row. In the no-cross case
-(right), the constraint forced the third instruction to start its decoding at
-the beginning of the second cycle, leaving a ``bubble'' in the frontend in the
-first cycle.
+frontend; there are thus at most three squares in each row. In the no-cross
+case (right), the constraint forced the third instruction to start its decoding
+at the beginning of the second cycle, leaving a ``bubble'' in the frontend on
+the first cycle.
 
 \medskip{}
 
diff --git a/manuscrit/40_A72-frontend/40_evaluation.tex b/manuscrit/40_A72-frontend/40_evaluation.tex
index 33c3456..af67010 100644
--- a/manuscrit/40_A72-frontend/40_evaluation.tex
+++ b/manuscrit/40_A72-frontend/40_evaluation.tex
@@ -1,4 +1,4 @@
-\section{Evaluation on Palmed}
+\section{Evaluation on Palmed}\label{sec:a40_eval}
 
 To evaluate the gain brought by each frontend model, we plug them successively
 on top of the \palmed{} backend model. The number of cycles for a kernel
diff --git a/manuscrit/40_A72-frontend/50_future_works.tex b/manuscrit/40_A72-frontend/50_future_works.tex
index bc459d5..918f852 100644
--- a/manuscrit/40_A72-frontend/50_future_works.tex
+++ b/manuscrit/40_A72-frontend/50_future_works.tex
@@ -1,6 +1,5 @@
 \section{A parametric model for future works of automatic frontend model
-generation}
-%\section{Future works: benchmarks-based automatic frontend model generation}
+generation}\label{sec:frontend_parametric_model}
 
 While this chapter was solely centered on the Cortex A72, we believe that this
 study paves the way for an automated frontend model synthesis akin to