Proof-read chapter 3 (A72 frontend)

parent 4e13835886
commit 24e3d4a817

5 changed files with 33 additions and 30 deletions
@@ -27,19 +27,23 @@ analysis tool, supports only x86-64.
 \smallskip{}

-In this context, modelling an ARM CPU ---~the Cortex A72~--- with Palmed seemed
-to be an important goal, especially meaningful as this particular CPU only has
-very few hardware counters. However, it yielded only mixed results, as shown in
-\autoref{sec:palmed_results}.
+In this context, modelling an ARM CPU ---~the Cortex A72~--- with \palmed{}
+seemed to be an important goal, especially meaningful as this particular CPU
+only has very few hardware counters. However, it yielded only mixed results, as
+we will see in \autoref{sec:a40_eval}.

 \bigskip{}

 In this chapter, we show that a major cause of imprecision in these results is
-the absence of a frontend model. We manually model the Cortex A72 frontend to
-compare a raw \palmed{}-generated model, to one naively augmented with a
-frontend model.
+the absence in \palmed{} of a frontend model. We manually model the Cortex A72
+frontend to compare a raw \palmed{}-generated model, to one naively augmented
+with a frontend model.

 While this chapter only documents a manual approach, we view it as a
 preliminary work towards an automation of the synthesis of a model that stems
 from benchmarks data, in the same way that \palmed{} synthesises a backend
-model.
+model. In this direction, we propose in \autoref{sec:frontend_parametric_model}
+a generic, parametric frontend that, we expect, could be used with good results
+on many architectures. We also offer methodologies that we expect to be able to
+automatically fill some of the parameters of this model for an arbitrary
+architecture.
@@ -18,7 +18,7 @@ SPEC2017 on the \texttt{SKL-SP} CPU, while the native measurement yielded only
 4 instructions per cycle. The same effect is visible on \pmevo{} and \llvmmca{}
 heatmaps.

-\begin{example}{High back-end throughput on \texttt{SKL-SP}}
+\begin{example}[High back-end throughput on \texttt{SKL-SP}]
 On the \texttt{SKL-SP} microarchitecture, assuming an infinitely large
 frontend, a number of instructions per cycle higher than 4 is easy to
 reach.
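The example is cut off by the hunk boundary above. For orientation only, relying on the publicly documented Skylake-SP layout rather than on this commit: the backend exposes eight execution ports behind an allocation stage about four \uops{} wide, so with the frontend assumed infinitely large, a kernel of, say, four independent single-\uop{} ALU instructions plus one independent load (five \uops{} on five distinct ports) completes in one cycle, i.e.\ 5 instructions per cycle, above the 4 that the real machine sustains.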
@@ -1,7 +1,7 @@
 \section{Manually modelling the A72 frontend}

 Our objective is now to manually construct a frontend model of the Cortex A72.
-We strive, however, to remain as close to an algorithmic methodology that is
+We strive, however, to remain as close to an algorithmic methodology as
 possible: while our model's structure is manually crafted, its data should come
 from experiments that can be later automated.

@@ -24,7 +24,7 @@ manual is only helpful to some extent to determine this.

 \medskip{}

-We instead use an approach akin to \palmed{}' saturating kernels, itself
+We instead use an approach akin to \palmed{}'s saturating kernels, itself
 inspired by Agner Fog's method to identify ports in the absence of hardware
 counters~\cite{AgnerFog}. To this end, we assume the availability of a port
 mapping for the backend ---~in the case of the Cortex A72, we use \palmed{}'s
@@ -146,20 +146,20 @@ model mapping each supported instruction of the ISA to its \uop{} count.
 basic instruction for the integer port.

 We measure $\cyc{\imath} = 0.51 \simeq \sfrac{1}{2}\,\text{cycle}$; hence,
-we consider $\kerK_3$ and $\kerK_4$. Our mapping indicates that this
+we consider $\kerK_2$ and $\kerK_3$. Our mapping indicates that this
 instruction loads only the \texttt{Int01} port with a load of
 $\sfrac{1}{2}$.

-We select \eg{} $\kerK_3 = i + 2\times \basic{FP01}$ and $\kerK_4 = i +
+We select \eg{} $\kerK_2 = i + 2\times \basic{FP01}$ and $\kerK_3 = i +
 \basic{FP01} + \basic{Ld} + \basic{FP01}$.

 We measure
 \begin{itemize}
-\item $\cyc{\kerK_3} = 1.01 \simeq 1\,\text{cycle}$
-\item $\cyc{\kerK_4} = 1.35 \simeq 1\,\sfrac{1}{3}\,\text{cycles}$
+\item $\cyc{\kerK_2} = 1.01 \simeq 1\,\text{cycle}$
+\item $\cyc{\kerK_3} = 1.35 \simeq 1\,\sfrac{1}{3}\,\text{cycles}$
 \end{itemize}
 which is consistent. We conclude that, as expected, $\mucount i =
-3\cyc{\kerK_3} = 3-2 = 1$.
+3\cyc{\kerK_2} - 2 = 3-2 = 1$.
 \end{example}

 \begin{example}[\uop{} count measure: \lstarmasm{ADDV_FD_H_VN_V_8H}]
@@ -168,7 +168,7 @@ model mapping each supported instruction of the ISA to its \uop{} count.
 operands.

 We measure $\cyc{\imath} = 1.01 \simeq 1\,\text{cycle}$; hence, we consider
-$\kerK_3$ and $\kerK_4$. Our mapping indicates that this instruction loads
+$\kerK_2$ and $\kerK_3$. Our mapping indicates that this instruction loads
 the \texttt{FP1} port with a load of $1$, and the \texttt{FP01} port with a
 load of $1$\footnote{The \texttt{FP01} port has a throughput of 2, hence a
 load of 1 means two \uops{}. As there is already a \uop{} loading the
@@ -176,16 +176,16 @@ model mapping each supported instruction of the ISA to its \uop{} count.
 this can be understood as one \uop{} on \texttt{FP1} exclusively, plus one
 on either \texttt{FP0} or \texttt{FP1}.}.

-We select \eg{} $\kerK_3 = i + 2\times \basic{Int01}$ and $\kerK_4 = i +
+We select \eg{} $\kerK_2 = i + 2\times \basic{Int01}$ and $\kerK_3 = i +
 \basic{Int01} + \basic{Ld} + \basic{Int01}$.

 We measure
 \begin{itemize}
-\item $\cyc{\kerK_3} = 1.35 \simeq 1\,\sfrac{1}{3}\,\text{cycles}$
-\item $\cyc{\kerK_4} = 1.68 \simeq 1\,\sfrac{2}{3}\,\text{cycles}$
+\item $\cyc{\kerK_2} = 1.35 \simeq 1\,\sfrac{1}{3}\,\text{cycles}$
+\item $\cyc{\kerK_3} = 1.68 \simeq 1\,\sfrac{2}{3}\,\text{cycles}$
 \end{itemize}
-which is consistent. We conclude that $\mucount i = 3\cyc{\kerK_3} = 4-2 =
-2$.
+which is consistent. We conclude that $\mucount i = 3\cyc{\kerK_2} - 2 =
+4-2 = 2$.
 \end{example}


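Both \uop{}-count examples above instantiate the same relation. As a sketch of the underlying arithmetic, under two assumptions of ours (suggested by the hunks but not stated in them): the A72 frontend issues at most three \uops{} per cycle, and each basic instruction \basic{B} decodes to a single \uop{}. A frontend-bound kernel $\kerK_2 = \imath + 2\times\basic{B}$ then satisfies

\[
\cyc{\kerK_2} = \frac{\mucount{\imath} + 2}{3}
\quad\Longleftrightarrow\quad
\mucount{\imath} = 3\,\cyc{\kerK_2} - 2,
\]

so $\cyc{\kerK_2} \simeq 1$ yields $\mucount{\imath} = 1$ and $\cyc{\kerK_2} \simeq \sfrac{4}{3}$ yields $\mucount{\imath} = 2$, matching the two conclusions.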
@@ -240,7 +240,7 @@ steady-state.

 On the x86-64 architectures they analyzed, \uica{}'s authors find that the
 CPU's predecoder might cause an instruction's \uops{} to be postponed to the
-next cycle if it is pre-decoded across a cycle boundary~\cite{uica} (§4.1).
+next cycle if it is pre-decoded across a cycle boundary~\cite[§4.1]{uica}.

 We hypothesize that the same kind of effect could postpone an instruction's
 \uops{} until the next cycle if its \uops{} would cross a cycle boundary
@@ -248,10 +248,10 @@ otherwise. This behaviour is illustrated in \autoref{fig:frontend_nocross},
 with a kernel composed of three instructions: the first two each decode to a
 single \uop{}, while the third one decodes to two \uops{}. In this figure, each
 row represents a CPU cycle, while each square represents a \uop{}-slot in the
-frontend; there are thus three squares in each row. In the no-cross case
-(right), the constraint forced the third instruction to start its decoding at
-the beginning of the second cycle, leaving a ``bubble'' in the frontend in the
-first cycle.
+frontend; there are thus at most three squares in each row. In the no-cross
+case (right), the constraint forced the third instruction to start its decoding
+at the beginning of the second cycle, leaving a ``bubble'' in the frontend on
+the first cycle.

 \medskip{}

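The hypothesised no-cross rule is simple enough to spell out as executable pseudo-code. The sketch below is not part of the commit: the three-slot dispatch width, the function name and the example kernel are assumptions taken from the figure's description, and the code only mimics the dispatch rule as hypothesised above.

# Minimal sketch (not from the commit): in-order frontend dispatching at most
# `slots_per_cycle` uop-slots per cycle, with or without the hypothesised
# "an instruction's uops may not cross a cycle boundary" rule.
def frontend_cycles(uop_counts, slots_per_cycle=3, no_cross=True):
    cycles, used = 0, 0  # completed cycles, slots used in the current cycle
    for uops in uop_counts:
        if no_cross and uops <= slots_per_cycle and used + uops > slots_per_cycle:
            cycles, used = cycles + 1, 0  # postpone: leave a bubble in this cycle
        used += uops
        while used >= slots_per_cycle:  # spill into the following cycle(s)
            cycles += 1
            used -= slots_per_cycle
    return cycles + (1 if used > 0 else 0)

# The figure's kernel: two single-uop instructions, then one two-uop instruction.
kernel = [1, 1, 2]
print(frontend_cycles(kernel, no_cross=False))  # 2: the 4 uops are packed tightly
print(frontend_cycles(kernel, no_cross=True))   # 2: same count, but cycle 1 ends with a bubble

On a single pass over this kernel both variants need two cycles; repeating the kernel lets the bubbles accumulate (with this sketch, three back-to-back iterations take 5 cycles with the rule against 4 without).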
@@ -1,4 +1,4 @@
-\section{Evaluation on Palmed}
+\section{Evaluation on Palmed}\label{sec:a40_eval}

 To evaluate the gain brought by each frontend model, we plug them successively
 on top of the \palmed{} backend model. The number of cycles for a kernel
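For orientation only: one plausible way to plug a frontend model on top of a backend model (an assumption on our part, not the commit's wording) is to take, for each kernel, the larger of the two predicted cycle counts, e.g.\ $C(\kerK) = \max\bigl(C_{\text{backend}}(\kerK),\, C_{\text{frontend}}(\kerK)\bigr)$, where $C$ is our own notation for a predicted cycles-per-iteration count.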
@@ -1,6 +1,5 @@
 \section{A parametric model for future works of automatic frontend model
-generation}
-%\section{Future works: benchmarks-based automatic frontend model generation}
+generation}\label{sec:frontend_parametric_model}

 While this chapter was solely centered on the Cortex A72, we believe that this
 study paves the way for an automated frontend model synthesis akin to