Proof-read chapter 3 (A72 frontend)

2024-08-17 15:43:20 +02:00 · 24e3d4a817
commit 24e3d4a817
parent 4e13835886
5 changed files with 33 additions and 30 deletions
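
As a sanity check on the corrected uop-count formula in 30_manual_frontend.tex (the `- 2` term added in both examples), the arithmetic can be reproduced with the minimal sketch below. It assumes a frontend width of 3 uops per cycle, inferred from the factor 3 in the formula and from the three slots per row described for the figure, and it assumes that the kernel K_2 is frontend-bound. The function name `uop_count` and its parameters are hypothetical and do not appear in the manuscript.

```python
def uop_count(cycles_k2: float, frontend_width: int = 3, other_uops: int = 2) -> float:
    """Estimate the uop count of an instruction i from the cycles measured on K_2.

    Assumption: K_2 = i + 2 single-uop basic instructions, and K_2 is bound
    by a frontend that issues `frontend_width` uops per cycle.
    """
    # Total uops issued by K_2, minus the uops of the two basic instructions.
    return frontend_width * cycles_k2 - other_uops


# Measurements of K_2 quoted in the two examples of 30_manual_frontend.tex.
first_example = uop_count(1.01)   # example measuring cyc(i) = 0.51
second_example = uop_count(1.35)  # example for ADDV_FD_H_VN_V_8H

print(round(first_example), round(second_example))
```

Under these assumptions, the raw measurements give roughly 1.03 and 2.05, which round to the counts of 1 and 2 stated in the examples.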
--- a/manuscrit/40_A72-frontend/00_intro.tex
+++ b/manuscrit/40_A72-frontend/00_intro.tex
@@ -27,19 +27,23 @@ analysis tool, supports only x86-64.

 \smallskip{}

-In this context, modelling an ARM CPU ---~the Cortex A72~--- with Palmed seemed
-to be an important goal, especially meaningful as this particular CPU only has
-very few hardware counters. However, it yielded only mixed results, as shown in
-\autoref{sec:palmed_results}.
+In this context, modelling an ARM CPU ---~the Cortex A72~--- with \palmed{}
+seemed to be an important goal, especially meaningful as this particular CPU
+only has very few hardware counters. However, it yielded only mixed results, as
+we will see in \autoref{sec:a40_eval}.

 \bigskip{}

 In this chapter, we show that a major cause of imprecision in these results is
-the absence of a frontend model. We manually model the Cortex A72 frontend to
-compare a raw \palmed{}-generated model, to one naively augmented with a
-frontend model.
+the absence in \palmed{} of a frontend model. We manually model the Cortex A72
+frontend to compare a raw \palmed{}-generated model, to one naively augmented
+with a frontend model.

 While this chapter only documents a manual approach, we view it as a
 preliminary work towards an automation of the synthesis of a model that stems
 from benchmarks data, in the same way that \palmed{} synthesises a backend
-model.
+model. In this direction, we propose in \autoref{sec:frontend_parametric_model}
+a generic, parametric frontend that, we expect, could be used with good results
+on many architectures. We also offer methodologies that we expect to be able to
+automatically fill some of the parameters of this model for an arbitrary
+architecture.
--- a/manuscrit/40_A72-frontend/10_beyond_ports.tex
+++ b/manuscrit/40_A72-frontend/10_beyond_ports.tex
@@ -18,7 +18,7 @@ SPEC2017 on the \texttt{SKL-SP} CPU, while the native measurement yielded only
 4 instructions per cycle. The same effect is visible on \pmevo{} and \llvmmca{}
 heatmaps.

-\begin{example}{High back-end throughput on \texttt{SKL-SP}}
+\begin{example}[High back-end throughput on \texttt{SKL-SP}]
    On the \texttt{SKL-SP} microarchitecture, assuming an infinitely large
    frontend, a number of instructions per cycle higher than 4 is easy to
    reach.
--- a/manuscrit/40_A72-frontend/30_manual_frontend.tex
+++ b/manuscrit/40_A72-frontend/30_manual_frontend.tex
@@ -1,7 +1,7 @@
 \section{Manually modelling the A72 frontend}

 Our objective is now to manually construct a frontend model of the Cortex A72.
-We strive, however, to remain as close to an algorithmic methodology that is
+We strive, however, to remain as close to an algorithmic methodology as
 possible: while our model's structure is manually crafted, its data should come
 from experiments that can be later automated.

@@ -24,7 +24,7 @@ manual is only helpful to some extent to determine this.

 \medskip{}

-We instead use an approach akin to \palmed{}' saturating kernels, itself
+We instead use an approach akin to \palmed{}'s saturating kernels, itself
 inspired by Agner Fog's method to identify ports in the absence of hardware
 counters~\cite{AgnerFog}. To this end, we assume the availability of a port
 mapping for the backend ---~in the case of the Cortex A72, we use \palmed{}'s
@@ -146,20 +146,20 @@ model mapping each supported instruction of the ISA to its \uop{} count.
    basic instruction for the integer port.

    We measure $\cyc{\imath} = 0.51 \simeq \sfrac{1}{2}\,\text{cycle}$; hence,
-    we consider $\kerK_3$ and $\kerK_4$. Our mapping indicates that this
+    we consider $\kerK_2$ and $\kerK_3$. Our mapping indicates that this
    instruction loads only the \texttt{Int01} port with a load of
    $\sfrac{1}{2}$.

-    We select \eg{} $\kerK_3 = i + 2\times \basic{FP01}$ and $\kerK_4 = i +
+    We select \eg{} $\kerK_2 = i + 2\times \basic{FP01}$ and $\kerK_3 = i +
    \basic{FP01} + \basic{Ld} + \basic{FP01}$.

    We measure
    \begin{itemize}
-        \item $\cyc{\kerK_3} = 1.01 \simeq 1\,\text{cycle}$
-        \item $\cyc{\kerK_4} = 1.35 \simeq 1\,\sfrac{1}{3}\,\text{cycles}$
+        \item $\cyc{\kerK_2} = 1.01 \simeq 1\,\text{cycle}$
+        \item $\cyc{\kerK_3} = 1.35 \simeq 1\,\sfrac{1}{3}\,\text{cycles}$
    \end{itemize}
    which is consistent. We conclude that, as expected, $\mucount i =
-    3\cyc{\kerK_3} = 3-2 = 1$.
+    3\cyc{\kerK_2} - 2 = 3-2 = 1$.
 \end{example}

 \begin{example}[\uop{} count measure: \lstarmasm{ADDV_FD_H_VN_V_8H}]
@@ -168,7 +168,7 @@ model mapping each supported instruction of the ISA to its \uop{} count.
    operands.

    We measure $\cyc{\imath} = 1.01 \simeq 1\,\text{cycle}$; hence, we consider
-    $\kerK_3$ and $\kerK_4$. Our mapping indicates that this instruction loads
+    $\kerK_2$ and $\kerK_3$. Our mapping indicates that this instruction loads
    the \texttt{FP1} port with a load of $1$, and the \texttt{FP01} port with a
    load of $1$\footnote{The \texttt{FP01} port has a throughput of 2, hence a
        load of 1 means two \uops{}. As there is already a \uop{} loading the
@@ -176,16 +176,16 @@ model mapping each supported instruction of the ISA to its \uop{} count.
    this can be understood as one \uop{} on \texttt{FP1} exclusively, plus one
    on either \texttt{FP0} or \texttt{FP1}.}.

-    We select \eg{} $\kerK_3 = i + 2\times \basic{Int01}$ and $\kerK_4 = i +
+    We select \eg{} $\kerK_2 = i + 2\times \basic{Int01}$ and $\kerK_3 = i +
    \basic{Int01} + \basic{Ld} + \basic{Int01}$.

    We measure
    \begin{itemize}
-        \item $\cyc{\kerK_3} = 1.35 \simeq 1\,\sfrac{1}{3}\,\text{cycles}$
-        \item $\cyc{\kerK_4} = 1.68 \simeq 1\,\sfrac{2}{3}\,\text{cycles}$
+        \item $\cyc{\kerK_2} = 1.35 \simeq 1\,\sfrac{1}{3}\,\text{cycles}$
+        \item $\cyc{\kerK_3} = 1.68 \simeq 1\,\sfrac{2}{3}\,\text{cycles}$
    \end{itemize}
-    which is consistent. We conclude that $\mucount i = 3\cyc{\kerK_3} = 4-2 =
-    2$.
+    which is consistent. We conclude that $\mucount i = 3\cyc{\kerK_2} - 2 =
+    4-2 = 2$.
 \end{example}


@@ -240,7 +240,7 @@ steady-state.

 On the x86-64 architectures they analyzed, \uica{}'s authors find that the
 CPU's predecoder might cause an instruction's \uops{} to be postponed to the
-next cycle if it is pre-decoded across a cycle boundary~\cite{uica} (§4.1).
+next cycle if it is pre-decoded across a cycle boundary~\cite[§4.1]{uica}.

 We hypothesize that the same kind of effect could postpone an instruction's
 \uops{} until the next cycle if its \uops{} would cross a cycle boundary
@@ -248,10 +248,10 @@ otherwise. This behaviour is illustrated in \autoref{fig:frontend_nocross},
 with a kernel composed of three instructions: the first two each decode to a
 single \uop{}, while the third one decodes to two \uops{}. In this figure, each
 row represents a CPU cycle, while each square represents a \uop{}-slot in the
-frontend; there are thus three squares in each row. In the no-cross case
-(right), the constraint forced the third instruction to start its decoding at
-the beginning of the second cycle, leaving a ``bubble'' in the frontend in the
-first cycle.
+frontend; there are thus at most three squares in each row. In the no-cross
+case (right), the constraint forced the third instruction to start its decoding
+at the beginning of the second cycle, leaving a ``bubble'' in the frontend on
+the first cycle.

 \medskip{}

--- a/manuscrit/40_A72-frontend/40_evaluation.tex
+++ b/manuscrit/40_A72-frontend/40_evaluation.tex
@@ -1,4 +1,4 @@
-\section{Evaluation on Palmed}
+\section{Evaluation on Palmed}\label{sec:a40_eval}

 To evaluate the gain brought by each frontend model, we plug them successively
 on top of the \palmed{} backend model. The number of cycles for a kernel
--- a/manuscrit/40_A72-frontend/50_future_works.tex
+++ b/manuscrit/40_A72-frontend/50_future_works.tex
@@ -1,6 +1,5 @@
 \section{A parametric model for future works of automatic frontend model
-generation}
-%\section{Future works: benchmarks-based automatic frontend model generation}
+generation}\label{sec:frontend_parametric_model}

 While this chapter was solely centered on the Cortex A72, we believe that this
 study paves the way for an automated frontend model synthesis akin to