Proof-read chapter 3 (A72 frontend)

Théophile Bastian 2024-08-17 15:43:20 +02:00
parent 4e13835886
commit 24e3d4a817
5 changed files with 33 additions and 30 deletions


@@ -27,19 +27,23 @@ analysis tool, supports only x86-64.
 \smallskip{}
-In this context, modelling an ARM CPU ---~the Cortex A72~--- with Palmed seemed
-to be an important goal, especially meaningful as this particular CPU only has
-very few hardware counters. However, it yielded only mixed results, as shown in
-\autoref{sec:palmed_results}.
+In this context, modelling an ARM CPU ---~the Cortex A72~--- with \palmed{}
+seemed to be an important goal, especially meaningful as this particular CPU
+only has very few hardware counters. However, it yielded only mixed results, as
+we will see in \autoref{sec:a40_eval}.
 \bigskip{}
 In this chapter, we show that a major cause of imprecision in these results is
-the absence of a frontend model. We manually model the Cortex A72 frontend to
-compare a raw \palmed{}-generated model, to one naively augmented with a
-frontend model.
+the absence in \palmed{} of a frontend model. We manually model the Cortex A72
+frontend to compare a raw \palmed{}-generated model, to one naively augmented
+with a frontend model.
 While this chapter only documents a manual approach, we view it as a
 preliminary work towards an automation of the synthesis of a model that stems
 from benchmarks data, in the same way that \palmed{} synthesises a backend
-model.
+model. In this direction, we propose in \autoref{sec:frontend_parametric_model}
+a generic, parametric frontend that, we expect, could be used with good results
+on many architectures. We also offer methodologies that we expect to be able to
+automatically fill some of the parameters of this model for an arbitrary
+architecture.


@@ -18,7 +18,7 @@ SPEC2017 on the \texttt{SKL-SP} CPU, while the native measurement yielded only
 4 instructions per cycle. The same effect is visible on \pmevo{} and \llvmmca{}
 heatmaps.
-\begin{example}{High back-end throughput on \texttt{SKL-SP}}
+\begin{example}[High back-end throughput on \texttt{SKL-SP}]
 On the \texttt{SKL-SP} microarchitecture, assuming an infinitely large
 frontend, a number of instructions per cycle higher than 4 is easy to
 reach.


@@ -1,7 +1,7 @@
 \section{Manually modelling the A72 frontend}
 Our objective is now to manually construct a frontend model of the Cortex A72.
-We strive, however, to remain as close to an algorithmic methodology that is
+We strive, however, to remain as close to an algorithmic methodology as
 possible: while our model's structure is manually crafted, its data should come
 from experiments that can be later automated.
@@ -24,7 +24,7 @@ manual is only helpful to some extent to determine this.
 \medskip{}
-We instead use an approach akin to \palmed{}' saturating kernels, itself
+We instead use an approach akin to \palmed{}'s saturating kernels, itself
 inspired by Agner Fog's method to identify ports in the absence of hardware
 counters~\cite{AgnerFog}. To this end, we assume the availability of a port
 mapping for the backend ---~in the case of the Cortex A72, we use \palmed{}'s
@@ -146,20 +146,20 @@ model mapping each supported instruction of the ISA to its \uop{} count.
 basic instruction for the integer port.
 We measure $\cyc{\imath} = 0.51 \simeq \sfrac{1}{2}\,\text{cycle}$; hence,
-we consider $\kerK_3$ and $\kerK_4$. Our mapping indicates that this
+we consider $\kerK_2$ and $\kerK_3$. Our mapping indicates that this
 instruction loads only the \texttt{Int01} port with a load of
 $\sfrac{1}{2}$.
-We select \eg{} $\kerK_3 = i + 2\times \basic{FP01}$ and $\kerK_4 = i +
+We select \eg{} $\kerK_2 = i + 2\times \basic{FP01}$ and $\kerK_3 = i +
 \basic{FP01} + \basic{Ld} + \basic{FP01}$.
 We measure
 \begin{itemize}
-\item $\cyc{\kerK_3} = 1.01 \simeq 1\,\text{cycle}$
-\item $\cyc{\kerK_4} = 1.35 \simeq 1\,\sfrac{1}{3}\,\text{cycles}$
+\item $\cyc{\kerK_2} = 1.01 \simeq 1\,\text{cycle}$
+\item $\cyc{\kerK_3} = 1.35 \simeq 1\,\sfrac{1}{3}\,\text{cycles}$
 \end{itemize}
 which is consistent. We conclude that, as expected, $\mucount i =
-3\cyc{\kerK_3} = 3-2 = 1$.
+3\cyc{\kerK_2} - 2 = 3-2 = 1$.
 \end{example}
 \begin{example}[\uop{} count measure: \lstarmasm{ADDV_FD_H_VN_V_8H}]
@@ -168,7 +168,7 @@ model mapping each supported instruction of the ISA to its \uop{} count.
 operands.
 We measure $\cyc{\imath} = 1.01 \simeq 1\,\text{cycle}$; hence, we consider
-$\kerK_3$ and $\kerK_4$. Our mapping indicates that this instruction loads
+$\kerK_2$ and $\kerK_3$. Our mapping indicates that this instruction loads
 the \texttt{FP1} port with a load of $1$, and the \texttt{FP01} port with a
 load of $1$\footnote{The \texttt{FP01} port has a throughput of 2, hence a
 load of 1 means two \uops{}. As there is already a \uop{} loading the
@@ -176,16 +176,16 @@ model mapping each supported instruction of the ISA to its \uop{} count.
 this can be understood as one \uop{} on \texttt{FP1} exclusively, plus one
 on either \texttt{FP0} or \texttt{FP1}.}.
-We select \eg{} $\kerK_3 = i + 2\times \basic{Int01}$ and $\kerK_4 = i +
+We select \eg{} $\kerK_2 = i + 2\times \basic{Int01}$ and $\kerK_3 = i +
 \basic{Int01} + \basic{Ld} + \basic{Int01}$.
 We measure
 \begin{itemize}
-\item $\cyc{\kerK_3} = 1.35 \simeq 1\,\sfrac{1}{3}\,\text{cycles}$
-\item $\cyc{\kerK_4} = 1.68 \simeq 1\,\sfrac{2}{3}\,\text{cycles}$
+\item $\cyc{\kerK_2} = 1.35 \simeq 1\,\sfrac{1}{3}\,\text{cycles}$
+\item $\cyc{\kerK_3} = 1.68 \simeq 1\,\sfrac{2}{3}\,\text{cycles}$
 \end{itemize}
-which is consistent. We conclude that $\mucount i = 3\cyc{\kerK_3} = 4-2 =
-2$.
+which is consistent. We conclude that $\mucount i = 3\cyc{\kerK_2} - 2 =
+4-2 = 2$.
 \end{example}
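The arithmetic applied in these two examples can be spelled out. The worked form below is our reading of it, under two assumptions taken from the surrounding text rather than stated here explicitly: the front end issues at most three \uops{} per cycle (the three \uop{}-slots per cycle discussed later around \autoref{fig:frontend_nocross}), and each auxiliary basic instruction ($\basic{FP01}$, $\basic{Int01}$, $\basic{Ld}$) decodes to a single \uop{}. For a kernel $\kerK$ that is limited by the front end rather than by any port, made of the instruction $i$ under study plus auxiliaries totalling $n$ \uops{},

\[
    \cyc{\kerK} \simeq \frac{\mucount i + n}{3}
    \qquad\text{hence}\qquad
    \mucount i \simeq 3\,\cyc{\kerK} - n.
\]

In the first example, $\kerK_2$ gives $n = 2$ and $3 \times 1.01 - 2 \simeq 1$, while $\kerK_3$ gives $n = 3$ and $3 \times 1.35 - 3 \simeq 1$; the two kernels yielding the same estimate is presumably what "which is consistent" refers to. The same check on the second example yields $3 \times 1.35 - 2 \simeq 2$ and $3 \times 1.68 - 3 \simeq 2$.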
@@ -240,7 +240,7 @@ steady-state.
 On the x86-64 architectures they analyzed, \uica{}'s authors find that the
 CPU's predecoder might cause an instruction's \uops{} to be postponed to the
-next cycle if it is pre-decoded across a cycle boundary~\cite{uica} (§4.1).
+next cycle if it is pre-decoded across a cycle boundary~\cite[§4.1]{uica}.
 We hypothesize that the same kind of effect could postpone an instruction's
 \uops{} until the next cycle if its \uops{} would cross a cycle boundary
@@ -248,10 +248,10 @@ otherwise. This behaviour is illustrated in \autoref{fig:frontend_nocross},
 with a kernel composed of three instructions: the first two each decode to a
 single \uop{}, while the third one decodes to two \uops{}. In this figure, each
 row represents a CPU cycle, while each square represents a \uop{}-slot in the
-frontend; there are thus three squares in each row. In the no-cross case
-(right), the constraint forced the third instruction to start its decoding at
-the beginning of the second cycle, leaving a ``bubble'' in the frontend in the
-first cycle.
+frontend; there are thus at most three squares in each row. In the no-cross
+case (right), the constraint forced the third instruction to start its decoding
+at the beginning of the second cycle, leaving a ``bubble'' in the frontend on
+the first cycle.
 \medskip{}
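To make the hypothesised no-cross behaviour concrete, here is a small executable sketch of it. It is our own toy model, not the tooling used in the chapter: it only assumes a front end with three \uop{}-slots per cycle and a fixed \uop{} count per instruction, and the kernel [1, 1, 2] mirrors the three-instruction example of the figure; the commented figures are outputs of this toy model, not measurements.

def frontend_cycles(kernel_uops, width=3, no_cross=True, repeats=1000):
    """Toy model of the hypothesised no-cross front-end behaviour.

    kernel_uops: per-instruction uop counts for one kernel iteration,
                 e.g. [1, 1, 2] for the three-instruction kernel of the figure.
    width:       uop-slots available per front-end cycle (3 here).
    no_cross:    if True, an instruction whose uops would straddle a cycle
                 boundary is postponed to the next cycle, leaving a bubble.
    Returns the average number of front-end cycles per kernel iteration.
    """
    slots = 0  # uop-slots consumed so far, bubbles included
    for _ in range(repeats):
        for uops in kernel_uops:
            used = slots % width  # slots already filled in the current cycle
            if no_cross and used and used + uops > width:
                slots += width - used  # bubble: pad up to the next cycle boundary
            slots += uops
    return slots / width / repeats


# Kernel of the figure: two single-uop instructions followed by a two-uop one.
print(frontend_cycles([1, 1, 2], no_cross=False))  # 1.333... cycles/iteration
print(frontend_cycles([1, 1, 2], no_cross=True))   # 1.5 cycles/iteration

Under these assumptions, the no-cross constraint costs the example kernel 1.5 front-end cycles per iteration instead of 4/3, consistent with the bubble shown in the figure.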


@@ -1,4 +1,4 @@
-\section{Evaluation on Palmed}
+\section{Evaluation on Palmed}\label{sec:a40_eval}
 To evaluate the gain brought by each frontend model, we plug them successively
 on top of the \palmed{} backend model. The number of cycles for a kernel


@@ -1,6 +1,5 @@
 \section{A parametric model for future works of automatic frontend model
-generation}
-%\section{Future works: benchmarks-based automatic frontend model generation}
+generation}\label{sec:frontend_parametric_model}
 While this chapter was solely centered on the Cortex A72, we believe that this
 study paves the way for an automated frontend model synthesis akin to