Proof-read chapter 3 (A72 frontend)

parent 4e13835886
commit 24e3d4a817

5 changed files with 33 additions and 30 deletions
@@ -27,19 +27,23 @@ analysis tool, supports only x86-64.
 \smallskip{}

-In this context, modelling an ARM CPU ---~the Cortex A72~--- with Palmed seemed
-to be an important goal, especially meaningful as this particular CPU only has
-very few hardware counters. However, it yielded only mixed results, as shown in
-\autoref{sec:palmed_results}.
+In this context, modelling an ARM CPU ---~the Cortex A72~--- with \palmed{}
+seemed to be an important goal, especially meaningful as this particular CPU
+only has very few hardware counters. However, it yielded only mixed results, as
+we will see in \autoref{sec:a40_eval}.

 \bigskip{}

 In this chapter, we show that a major cause of imprecision in these results is
-the absence of a frontend model. We manually model the Cortex A72 frontend to
-compare a raw \palmed{}-generated model, to one naively augmented with a
-frontend model.
+the absence in \palmed{} of a frontend model. We manually model the Cortex A72
+frontend to compare a raw \palmed{}-generated model, to one naively augmented
+with a frontend model.

 While this chapter only documents a manual approach, we view it as a
 preliminary work towards an automation of the synthesis of a model that stems
 from benchmarks data, in the same way that \palmed{} synthesises a backend
-model.
+model. In this direction, we propose in \autoref{sec:frontend_parametric_model}
+a generic, parametric frontend that, we expect, could be used with good results
+on many architectures. We also offer methodologies that we expect to be able to
+automatically fill some of the parameters of this model for an arbitrary
+architecture.
@@ -18,7 +18,7 @@ SPEC2017 on the \texttt{SKL-SP} CPU, while the native measurement yielded only
 4 instructions per cycle. The same effect is visible on \pmevo{} and \llvmmca{}
 heatmaps.

-\begin{example}{High back-end throughput on \texttt{SKL-SP}}
+\begin{example}[High back-end throughput on \texttt{SKL-SP}]
 On the \texttt{SKL-SP} microarchitecture, assuming an infinitely large
 frontend, a number of instructions per cycle higher than 4 is easy to
 reach.
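The example is cut off by the hunk boundary above. For orientation only, relying on the publicly documented Skylake-SP layout rather than on this commit: the backend exposes eight execution ports behind an allocation stage about four \uops{} wide, so with the frontend assumed infinitely large, a kernel of, say, four independent single-\uop{} ALU instructions plus one independent load (five \uops{} on five distinct ports) completes in one cycle, i.e.\ 5 instructions per cycle, above the 4 that the real machine sustains.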
@@ -1,7 +1,7 @@
 \section{Manually modelling the A72 frontend}

 Our objective is now to manually construct a frontend model of the Cortex A72.
-We strive, however, to remain as close to an algorithmic methodology that is
+We strive, however, to remain as close to an algorithmic methodology as
 possible: while our model's structure is manually crafted, its data should come
 from experiments that can be later automated.

@@ -24,7 +24,7 @@ manual is only helpful to some extent to determine this.

 \medskip{}

-We instead use an approach akin to \palmed{}' saturating kernels, itself
+We instead use an approach akin to \palmed{}'s saturating kernels, itself
 inspired by Agner Fog's method to identify ports in the absence of hardware
 counters~\cite{AgnerFog}. To this end, we assume the availability of a port
 mapping for the backend ---~in the case of the Cortex A72, we use \palmed{}'s
@@ -146,20 +146,20 @@ model mapping each supported instruction of the ISA to its \uop{} count.
 basic instruction for the integer port.

 We measure $\cyc{\imath} = 0.51 \simeq \sfrac{1}{2}\,\text{cycle}$; hence,
-we consider $\kerK_3$ and $\kerK_4$. Our mapping indicates that this
+we consider $\kerK_2$ and $\kerK_3$. Our mapping indicates that this
 instruction loads only the \texttt{Int01} port with a load of
 $\sfrac{1}{2}$.

-We select \eg{} $\kerK_3 = i + 2\times \basic{FP01}$ and $\kerK_4 = i +
+We select \eg{} $\kerK_2 = i + 2\times \basic{FP01}$ and $\kerK_3 = i +
 \basic{FP01} + \basic{Ld} + \basic{FP01}$.

 We measure
 \begin{itemize}
-\item $\cyc{\kerK_3} = 1.01 \simeq 1\,\text{cycle}$
-\item $\cyc{\kerK_4} = 1.35 \simeq 1\,\sfrac{1}{3}\,\text{cycles}$
+\item $\cyc{\kerK_2} = 1.01 \simeq 1\,\text{cycle}$
+\item $\cyc{\kerK_3} = 1.35 \simeq 1\,\sfrac{1}{3}\,\text{cycles}$
 \end{itemize}
 which is consistent. We conclude that, as expected, $\mucount i =
-3\cyc{\kerK_3} = 3-2 = 1$.
+3\cyc{\kerK_2} - 2 = 3-2 = 1$.
 \end{example}

 \begin{example}[\uop{} count measure: \lstarmasm{ADDV_FD_H_VN_V_8H}]
@@ -168,7 +168,7 @@ model mapping each supported instruction of the ISA to its \uop{} count.
 operands.

 We measure $\cyc{\imath} = 1.01 \simeq 1\,\text{cycle}$; hence, we consider
-$\kerK_3$ and $\kerK_4$. Our mapping indicates that this instruction loads
+$\kerK_2$ and $\kerK_3$. Our mapping indicates that this instruction loads
 the \texttt{FP1} port with a load of $1$, and the \texttt{FP01} port with a
 load of $1$\footnote{The \texttt{FP01} port has a throughput of 2, hence a
 load of 1 means two \uops{}. As there is already a \uop{} loading the
@@ -176,16 +176,16 @@ model mapping each supported instruction of the ISA to its \uop{} count.
 this can be understood as one \uop{} on \texttt{FP1} exclusively, plus one
 on either \texttt{FP0} or \texttt{FP1}.}.

-We select \eg{} $\kerK_3 = i + 2\times \basic{Int01}$ and $\kerK_4 = i +
+We select \eg{} $\kerK_2 = i + 2\times \basic{Int01}$ and $\kerK_3 = i +
 \basic{Int01} + \basic{Ld} + \basic{Int01}$.

 We measure
 \begin{itemize}
-\item $\cyc{\kerK_3} = 1.35 \simeq 1\,\sfrac{1}{3}\,\text{cycles}$
-\item $\cyc{\kerK_4} = 1.68 \simeq 1\,\sfrac{2}{3}\,\text{cycles}$
+\item $\cyc{\kerK_2} = 1.35 \simeq 1\,\sfrac{1}{3}\,\text{cycles}$
+\item $\cyc{\kerK_3} = 1.68 \simeq 1\,\sfrac{2}{3}\,\text{cycles}$
 \end{itemize}
-which is consistent. We conclude that $\mucount i = 3\cyc{\kerK_3} = 4-2 =
-2$.
+which is consistent. We conclude that $\mucount i = 3\cyc{\kerK_2} - 2 =
+4-2 = 2$.
 \end{example}


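Both \uop{}-count examples above instantiate the same relation. As a sketch of the underlying arithmetic, under two assumptions of ours (suggested by the hunks but not stated in them): the A72 frontend issues at most three \uops{} per cycle, and each basic instruction \basic{B} decodes to a single \uop{}. A frontend-bound kernel $\kerK_2 = \imath + 2\times\basic{B}$ then satisfies

\[
\cyc{\kerK_2} = \frac{\mucount{\imath} + 2}{3}
\quad\Longleftrightarrow\quad
\mucount{\imath} = 3\,\cyc{\kerK_2} - 2,
\]

so $\cyc{\kerK_2} \simeq 1$ yields $\mucount{\imath} = 1$ and $\cyc{\kerK_2} \simeq \sfrac{4}{3}$ yields $\mucount{\imath} = 2$, matching the two conclusions.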
@@ -240,7 +240,7 @@ steady-state.

 On the x86-64 architectures they analyzed, \uica{}'s authors find that the
 CPU's predecoder might cause an instruction's \uops{} to be postponed to the
-next cycle if it is pre-decoded across a cycle boundary~\cite{uica} (§4.1).
+next cycle if it is pre-decoded across a cycle boundary~\cite[§4.1]{uica}.

 We hypothesize that the same kind of effect could postpone an instruction's
 \uops{} until the next cycle if its \uops{} would cross a cycle boundary
@@ -248,10 +248,10 @@ otherwise. This behaviour is illustrated in \autoref{fig:frontend_nocross},
 with a kernel composed of three instructions: the first two each decode to a
 single \uop{}, while the third one decodes to two \uops{}. In this figure, each
 row represents a CPU cycle, while each square represents a \uop{}-slot in the
-frontend; there are thus three squares in each row. In the no-cross case
-(right), the constraint forced the third instruction to start its decoding at
-the beginning of the second cycle, leaving a ``bubble'' in the frontend in the
-first cycle.
+frontend; there are thus at most three squares in each row. In the no-cross
+case (right), the constraint forced the third instruction to start its decoding
+at the beginning of the second cycle, leaving a ``bubble'' in the frontend on
+the first cycle.

 \medskip{}

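The hypothesised no-cross rule is simple enough to spell out as executable pseudo-code. The sketch below is not part of the commit: the three-slot dispatch width, the function name and the example kernel are assumptions taken from the figure's description, and the code only mimics the dispatch rule as hypothesised above.

# Minimal sketch (not from the commit): in-order frontend dispatching at most
# `slots_per_cycle` uop-slots per cycle, with or without the hypothesised
# "an instruction's uops may not cross a cycle boundary" rule.
def frontend_cycles(uop_counts, slots_per_cycle=3, no_cross=True):
    cycles, used = 0, 0  # completed cycles, slots used in the current cycle
    for uops in uop_counts:
        if no_cross and uops <= slots_per_cycle and used + uops > slots_per_cycle:
            cycles, used = cycles + 1, 0  # postpone: leave a bubble in this cycle
        used += uops
        while used >= slots_per_cycle:  # spill into the following cycle(s)
            cycles += 1
            used -= slots_per_cycle
    return cycles + (1 if used > 0 else 0)

# The figure's kernel: two single-uop instructions, then one two-uop instruction.
kernel = [1, 1, 2]
print(frontend_cycles(kernel, no_cross=False))  # 2: the 4 uops are packed tightly
print(frontend_cycles(kernel, no_cross=True))   # 2: same count, but cycle 1 ends with a bubble

On a single pass over this kernel both variants need two cycles; repeating the kernel lets the bubbles accumulate (with this sketch, three back-to-back iterations take 5 cycles with the rule against 4 without).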
@@ -1,4 +1,4 @@
-\section{Evaluation on Palmed}
+\section{Evaluation on Palmed}\label{sec:a40_eval}

 To evaluate the gain brought by each frontend model, we plug them successively
 on top of the \palmed{} backend model. The number of cycles for a kernel
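For orientation only: one plausible way to plug a frontend model on top of a backend model (an assumption on our part, not the commit's wording) is to take, for each kernel, the larger of the two predicted cycle counts, e.g.\ $C(\kerK) = \max\bigl(C_{\text{backend}}(\kerK),\, C_{\text{frontend}}(\kerK)\bigr)$, where $C$ is our own notation for a predicted cycles-per-iteration count.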
@@ -1,6 +1,5 @@
 \section{A parametric model for future works of automatic frontend model
-generation}
-%\section{Future works: benchmarks-based automatic frontend model generation}
+generation}\label{sec:frontend_parametric_model}

 While this chapter was solely centered on the Cortex A72, we believe that this
 study paves the way for an automated frontend model synthesis akin to