Proof-read chapter 3 (A72 frontend)
parent 4e13835886
commit 24e3d4a817
5 changed files with 33 additions and 30 deletions
@@ -27,19 +27,23 @@ analysis tool, supports only x86-64.
 
 \smallskip{}
 
-In this context, modelling an ARM CPU ---~the Cortex A72~--- with Palmed seemed
-to be an important goal, especially meaningful as this particular CPU only has
-very few hardware counters. However, it yielded only mixed results, as shown in
-\autoref{sec:palmed_results}.
+In this context, modelling an ARM CPU ---~the Cortex A72~--- with \palmed{}
+seemed to be an important goal, especially meaningful as this particular CPU
+only has very few hardware counters. However, it yielded only mixed results, as
+we will see in \autoref{sec:a40_eval}.
 
 \bigskip{}
 
 In this chapter, we show that a major cause of imprecision in these results is
-the absence of a frontend model. We manually model the Cortex A72 frontend to
-compare a raw \palmed{}-generated model, to one naively augmented with a
-frontend model.
+the absence in \palmed{} of a frontend model. We manually model the Cortex A72
+frontend to compare a raw \palmed{}-generated model, to one naively augmented
+with a frontend model.
 
 While this chapter only documents a manual approach, we view it as a
 preliminary work towards an automation of the synthesis of a model that stems
 from benchmarks data, in the same way that \palmed{} synthesises a backend
-model.
+model. In this direction, we propose in \autoref{sec:frontend_parametric_model}
+a generic, parametric frontend that, we expect, could be used with good results
+on many architectures. We also offer methodologies that we expect to be able to
+automatically fill some of the parameters of this model for an arbitrary
+architecture.

@@ -18,7 +18,7 @@ SPEC2017 on the \texttt{SKL-SP} CPU, while the native measurement yielded only
 4 instructions per cycle. The same effect is visible on \pmevo{} and \llvmmca{}
 heatmaps.
 
-\begin{example}{High back-end throughput on \texttt{SKL-SP}}
+\begin{example}[High back-end throughput on \texttt{SKL-SP}]
 On the \texttt{SKL-SP} microarchitecture, assuming an infinitely large
 frontend, a number of instructions per cycle higher than 4 is easy to
 reach.

@@ -1,7 +1,7 @@
 \section{Manually modelling the A72 frontend}
 
 Our objective is now to manually construct a frontend model of the Cortex A72.
-We strive, however, to remain as close to an algorithmic methodology that is
+We strive, however, to remain as close to an algorithmic methodology as
 possible: while our model's structure is manually crafted, its data should come
 from experiments that can be later automated.
 
@@ -24,7 +24,7 @@ manual is only helpful to some extent to determine this.
 
 \medskip{}
 
-We instead use an approach akin to \palmed{}' saturating kernels, itself
+We instead use an approach akin to \palmed{}'s saturating kernels, itself
 inspired by Agner Fog's method to identify ports in the absence of hardware
 counters~\cite{AgnerFog}. To this end, we assume the availability of a port
 mapping for the backend ---~in the case of the Cortex A72, we use \palmed{}'s
@@ -146,20 +146,20 @@ model mapping each supported instruction of the ISA to its \uop{} count.
 basic instruction for the integer port.
 
 We measure $\cyc{\imath} = 0.51 \simeq \sfrac{1}{2}\,\text{cycle}$; hence,
-we consider $\kerK_3$ and $\kerK_4$. Our mapping indicates that this
+we consider $\kerK_2$ and $\kerK_3$. Our mapping indicates that this
 instruction loads only the \texttt{Int01} port with a load of
 $\sfrac{1}{2}$.
 
-We select \eg{} $\kerK_3 = i + 2\times \basic{FP01}$ and $\kerK_4 = i +
+We select \eg{} $\kerK_2 = i + 2\times \basic{FP01}$ and $\kerK_3 = i +
 \basic{FP01} + \basic{Ld} + \basic{FP01}$.
 
 We measure
 \begin{itemize}
-\item $\cyc{\kerK_3} = 1.01 \simeq 1\,\text{cycle}$
-\item $\cyc{\kerK_4} = 1.35 \simeq 1\,\sfrac{1}{3}\,\text{cycles}$
+\item $\cyc{\kerK_2} = 1.01 \simeq 1\,\text{cycle}$
+\item $\cyc{\kerK_3} = 1.35 \simeq 1\,\sfrac{1}{3}\,\text{cycles}$
 \end{itemize}
 which is consistent. We conclude that, as expected, $\mucount i =
-3\cyc{\kerK_3} = 3-2 = 1$.
+3\cyc{\kerK_2} - 2 = 3-2 = 1$.
 \end{example}
 
 \begin{example}[\uop{} count measure: \lstarmasm{ADDV_FD_H_VN_V_8H}]
@@ -168,7 +168,7 @@ model mapping each supported instruction of the ISA to its \uop{} count.
 operands.
 
 We measure $\cyc{\imath} = 1.01 \simeq 1\,\text{cycle}$; hence, we consider
-$\kerK_3$ and $\kerK_4$. Our mapping indicates that this instruction loads
+$\kerK_2$ and $\kerK_3$. Our mapping indicates that this instruction loads
 the \texttt{FP1} port with a load of $1$, and the \texttt{FP01} port with a
 load of $1$\footnote{The \texttt{FP01} port has a throughput of 2, hence a
 load of 1 means two \uops{}. As there is already a \uop{} loading the
@@ -176,16 +176,16 @@ model mapping each supported instruction of the ISA to its \uop{} count.
 this can be understood as one \uop{} on \texttt{FP1} exclusively, plus one
 on either \texttt{FP0} or \texttt{FP1}.}.
 
-We select \eg{} $\kerK_3 = i + 2\times \basic{Int01}$ and $\kerK_4 = i +
+We select \eg{} $\kerK_2 = i + 2\times \basic{Int01}$ and $\kerK_3 = i +
 \basic{Int01} + \basic{Ld} + \basic{Int01}$.
 
 We measure
 \begin{itemize}
-\item $\cyc{\kerK_3} = 1.35 \simeq 1\,\sfrac{1}{3}\,\text{cycles}$
-\item $\cyc{\kerK_4} = 1.68 \simeq 1\,\sfrac{2}{3}\,\text{cycles}$
+\item $\cyc{\kerK_2} = 1.35 \simeq 1\,\sfrac{1}{3}\,\text{cycles}$
+\item $\cyc{\kerK_3} = 1.68 \simeq 1\,\sfrac{2}{3}\,\text{cycles}$
 \end{itemize}
-which is consistent. We conclude that $\mucount i = 3\cyc{\kerK_3} = 4-2 =
-2$.
+which is consistent. We conclude that $\mucount i = 3\cyc{\kerK_2} - 2 =
+4-2 = 2$.
 \end{example}
 
 
@@ -240,7 +240,7 @@ steady-state.
 
 On the x86-64 architectures they analyzed, \uica{}'s authors find that the
 CPU's predecoder might cause an instruction's \uops{} to be postponed to the
-next cycle if it is pre-decoded across a cycle boundary~\cite{uica} (§4.1).
+next cycle if it is pre-decoded across a cycle boundary~\cite[§4.1]{uica}.
 
 We hypothesize that the same kind of effect could postpone an instruction's
 \uops{} until the next cycle if its \uops{} would cross a cycle boundary
@@ -248,10 +248,10 @@ otherwise. This behaviour is illustrated in \autoref{fig:frontend_nocross},
 with a kernel composed of three instructions: the first two each decode to a
 single \uop{}, while the third one decodes to two \uops{}. In this figure, each
 row represents a CPU cycle, while each square represents a \uop{}-slot in the
-frontend; there are thus three squares in each row. In the no-cross case
-(right), the constraint forced the third instruction to start its decoding at
-the beginning of the second cycle, leaving a ``bubble'' in the frontend in the
-first cycle.
+frontend; there are thus at most three squares in each row. In the no-cross
+case (right), the constraint forced the third instruction to start its decoding
+at the beginning of the second cycle, leaving a ``bubble'' in the frontend on
+the first cycle.
 
 \medskip{}
 

@@ -1,4 +1,4 @@
-\section{Evaluation on Palmed}
+\section{Evaluation on Palmed}\label{sec:a40_eval}
 
 To evaluate the gain brought by each frontend model, we plug them successively
 on top of the \palmed{} backend model. The number of cycles for a kernel

@@ -1,6 +1,5 @@
 \section{A parametric model for future works of automatic frontend model
-generation}
-%\section{Future works: benchmarks-based automatic frontend model generation}
+generation}\label{sec:frontend_parametric_model}
 
 While this chapter was solely centered on the Cortex A72, we believe that this
 study paves the way for an automated frontend model synthesis akin to
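
The arithmetic corrected in the two \uop{} count examples of this diff can be restated as a short worked check. This is only a sketch under assumptions that the diff does not spell out: the frontend is taken to issue at most three \uops{} per cycle (consistent with the three \uop{}-slots per row in the description of \autoref{fig:frontend_nocross}), every instruction added to $i$ in a kernel is taken to decode to a single \uop{}, and the kernels are taken to be frontend-bound. The index $n$ below, counting the instructions added to $i$ in $\kerK_n$, is a notation introduced here for illustration.

\[
  \cyc{\kerK_n} = \frac{\mucount i + n}{3}
  \quad\Longrightarrow\quad
  \mucount i = 3\,\cyc{\kerK_n} - n
\]

With the measurements of the first example, $3 \times 1.01 - 2 \simeq 1$ and $3 \times 1.35 - 3 \simeq 1$. With those of the second example, $3 \times 1.35 - 2 \simeq 2$ and $3 \times 1.68 - 3 \simeq 2$. In each example both kernels yield the same count, which matches the consistency the examples report, and the $-2$ term restored by this commit is the $n = 2$ case of the formula.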