\section{Future work: benchmark-based automatic frontend model generation}
While this chapter was solely centered on the Cortex A72, we believe that this
study paves the way for an automated frontend model synthesis akin to
\palmed{}. Such a synthesis should stem solely from benchmarking data and a
description of the ISA, and should avoid the use of any specific hardware
counter.

As a scaffold for such future work, we propose the parametric model in
\autoref{fig:parametric_model}. Some of its parameters should be obtainable
with the methods used in this chapter, while for others, new methods must be
devised.

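To give a concrete idea of the shape such a parametric model could take, the following Python sketch encodes the per-frontend and per-instruction parameters of \autoref{fig:parametric_model} as plain data. All names and fields here are hypothetical illustrations, not the actual model's parameters.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FrontendParams:
    """Global frontend parameters (to be discovered per architecture)."""
    dispatch_width: int  # muops dispatched per cycle
    n_decoders: int      # instruction decoders available per cycle


@dataclass
class InsnParams:
    """Per-instruction parameters (to be discovered per instruction)."""
    muop_count: int      # muops this instruction is decoded into


@dataclass
class ParametricModel:
    frontend: FrontendParams
    insns: Dict[str, InsnParams] = field(default_factory=dict)

    def frontend_cycles(self, kernel: List[str]) -> float:
        """Frontend-only lower bound on cycles per kernel iteration:
        total muops to dispatch, divided by the dispatch width."""
        total_muops = sum(self.insns[i].muop_count for i in kernel)
        return total_muops / self.frontend.dispatch_width
```

A backend model would provide a second, independent bound per kernel; the predicted throughput would then be the maximum of the two.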
\begin{figure}
    \begin{subfigure}{\textwidth}
        \centering
        \includegraphics[width=0.9\textwidth]{parametric_model_sketch-frontend}
        \caption{Frontend model}\label{fig:parametric_model:front}
    \end{subfigure}
    \begin{subfigure}{\textwidth}
        \centering
        \includegraphics[width=0.9\textwidth]{parametric_model_sketch-insn}
        \caption{Instruction model}\label{fig:parametric_model:insn}
    \end{subfigure}
    \caption{A generic parametric model of a processor's frontend. In red, the
    parameters which must be discovered for each
    architecture.}\label{fig:parametric_model}
\end{figure}

\bigskip{}

A part of this model does not appear \emph{at all} in the present chapter, as
it is absent from the Cortex A72: the complexity of decoding. As AArch64
instructions are of fixed bit size, each instruction is equally difficult to
decode, and no ``complex'' decoder is needed ---~contrary to \eg{} x86-64.
Decoding complexity seems, however, to be crucial to accurate modeling on
other architectures~\cite{uica}. For much the same reason, the present chapter
does not distinguish between the \emph{number of decoders} and the \emph{number
of \uops{} dispatched per cycle}. Indeed, as there is no variability in
instruction decoding, designing a processor with different values for these two
microarchitectural parameters would have been inefficient.

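To illustrate the kind of additional bound that decoding complexity introduces, the sketch below models a hypothetical x86-like frontend with one ``complex'' decoder and several ``simple'' ones. The decoder counts and the dispatch rule are illustrative assumptions only, not measured characteristics of any processor.

```python
def decode_cycles(muop_counts, n_simple=4, n_complex=1):
    """Lower bound on decode cycles per kernel iteration, given the
    per-instruction muop counts of the kernel. Illustrative rule: simple
    decoders only handle single-muop instructions, while multi-muop
    instructions must go through a complex decoder."""
    complex_insns = sum(1 for m in muop_counts if m > 1)
    total_insns = len(muop_counts)
    return max(
        complex_insns / n_complex,             # complex decoders saturate...
        total_insns / (n_simple + n_complex),  # ...or raw decoder throughput does
    )
```

On a fixed-size ISA such as AArch64, every instruction would count as ``simple'', and this bound would collapse into the plain decoder-count bound, which is why the distinction is invisible on the A72.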
\todo{}

The core of the model presented in this chapter is the discovery, for each
instruction, of its \uop{} count. Assuming that a model of the backend is known
--~by taking for instance a model generated by \palmed{} or \uopsinfo{}~--, the
method described in \autoref{ssec:a72_insn_muop_count} should be generic enough
to be used on any processor. The basic instructions may be easily selected
using the backend model --~we assume their existence in most
microarchitectures, as pragmatic concerns guide the design of the ports.
Counting the \uops{} of an instruction then follows, using only elapsed-cycles
counters, assuming that $\cycF{\kerK}$ bottlenecks on a global dispatch queue
for $\kerK$. This assumption does not hold in general, but can be made to hold
by selecting well-chosen kernels~--- for instance, on the A72, care must be
taken to interleave instructions corresponding to diverse enough dispatch
pipelines.

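Once a kernel is known to bottleneck on dispatch, the counting step itself reduces to a short computation. The sketch below, with hypothetical names, recovers the \uop{} count of a tested instruction from the elapsed-cycles counter alone, assuming the padding basic instructions each decode to a single \uop{}.

```python
def muop_count(cycles_per_iter, dispatch_width, n_basic):
    """Recover the muop count of one tested instruction from a kernel made of
    that instruction plus n_basic basic (assumed single-muop) instructions.
    If the kernel bottlenecks on a dispatch queue of the given width, then
        cycles_per_iter = (muops_tested + n_basic) / dispatch_width,
    which can be solved for muops_tested. Names are illustrative."""
    total_muops = round(cycles_per_iter * dispatch_width)
    return total_muops - n_basic
```

For example, a kernel of one tested instruction and four basic instructions measured at 2 cycles per iteration on a 3-wide dispatch queue yields a 2-\uop{} instruction.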
In order to generalize this method to arbitrary microarchitectures, it is
first necessary to obtain a global view of the common design choices that vary
between processors' frontends. A comparative study of their respective
importance for accurate frontend modeling, along with ways to circumvent their
impact on the measurement of $\cycF{\kerK}$ when counting \uops{} per
instruction, would also be needed.

Such fully-automated methods would probably be unable to account for
``unusual'' frontend bottlenecks ---~at least not at the level of detail that,
\eg{}, the \uica{} authors gather for Intel frontends~\cite{uica}. This level
of detail, however, is possible precisely because the authors restricted their
scope to microarchitectures that share many similarities, as they come from
the same manufacturer. Assessing the extent of the loss of precision of an
automatically-generated model, and its gain of precision \wrt{} a model without
a frontend, remains to be done.