\section{Future works: benchmark-based automatic frontend model generation}

While this chapter was solely centered on the Cortex A72, we believe that this study paves the way for an automated frontend model synthesis akin to \palmed{}. This synthesis should be fully automated; stem solely from benchmarking data and a description of the ISA; and avoid the use of any specific hardware counter. As a scaffold for such future work, we propose the parametric model in \autoref{fig:parametric_model}. Some of its parameters should be obtainable with the methods used in this chapter, while for others, new methods must be devised.

\begin{figure}
    \begin{subfigure}{\textwidth}
        \centering
        \includegraphics[width=0.9\textwidth]{parametric_model_sketch-frontend}
        \caption{Frontend model}\label{fig:parametric_model:front}
    \end{subfigure}
    \begin{subfigure}{\textwidth}
        \centering
        \includegraphics[width=0.9\textwidth]{parametric_model_sketch-insn}
        \caption{Instruction model}\label{fig:parametric_model:insn}
    \end{subfigure}
    \caption{A generic parametric model of a processor's frontend. In red, the parameters which must be discovered for each architecture.}\label{fig:parametric_model}
\end{figure}

\bigskip{}

A part of this model does not appear \emph{at all} in the present chapter, as it is absent from the Cortex A72: the complexity of decoding. As AArch64 instructions are of fixed bit size, each instruction is equally difficult to decode, and no ``complex'' decoder is needed, contrary to \eg{} x86-64. Decoding complexity seems, however, to be crucial to accurate modeling on other architectures~\cite{uica}.

For much the same reason, the present chapter does not distinguish between the \emph{number of decoders} and the \emph{number of \uops{} dispatched per cycle}. Indeed, as there is no variability in instruction decoding, designing a processor with different values for these two microarchitectural parameters would have been inefficient.

The core of the model presented in this chapter is the discovery, for each instruction, of its \uop{} count. Assuming that a model of the backend is known ---~taking for instance a model generated by \palmed{} or \uopsinfo{}~---, the method described in \autoref{ssec:a72_insn_muop_count} should be generic enough to be used on any processor. The basic instructions may easily be selected using the backend model ---~we assume that they exist in most microarchitectures, as pragmatic concerns guide the design of ports. Counting the \uops{} of an instruction then follows, using only elapsed-cycles counters, under the assumption that $\kerK$ bottlenecks on a global dispatch queue, so that $\cycF{\kerK}$ is directly measured. This assumption does not hold for arbitrary kernels; it can however be enforced by selecting well-chosen ones ---~for instance, on the A72, care must be taken to interleave instructions corresponding to diverse enough dispatch pipelines. A minimal sketch of this counting procedure is given at the end of this section.

In order to generalize this method to arbitrary microarchitectures, it is first necessary to obtain a global view of the common design choices that vary between processors' frontends. A comparative study of their respective importance for accurate frontend modeling, and of ways to circumvent their impact on the measure of $\cycF{\kerK}$ used to count \uops{} per instruction, would also be needed.

Such fully-automated methods would probably be unable to account for ``unusual'' frontend bottlenecks ---~at least not at the level of detail that \eg{} the \uica{} authors gather for Intel frontends~\cite{uica}.
This level of detail, however, is attainable precisely because the authors restricted their scope to microarchitectures from a single manufacturer, which share many similarities. Assessing the extent of the precision lost by an automatically-generated model, as well as the precision gained \wrt{} a model with no frontend at all, remains to be done.
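\bigskip{}

To make the intended counting procedure concrete, we sketch one possible instantiation of it; the formulation below is only a sketch under explicit assumptions, not necessarily the exact procedure of \autoref{ssec:a72_insn_muop_count}. The notation is ad-hoc: $F$ denotes the dispatch width (the maximal number of \uops{} dispatched per cycle), assumed already known, and $\mu_i$ the unknown \uop{} count of instruction $i$. Assuming that a kernel $\kerK$ bottlenecks on the global dispatch queue and that dispatch proceeds seamlessly across loop iterations, the measured number of cycles per kernel iteration approaches
\[
    \cycF{\kerK} = \frac{1}{F} \sum_{i \in \kerK} \mu_i\,.
\]
Hence, if $\kerK$ is built from $n$ copies of the instruction $i$ under study and $m$ basic instructions, each assumed to carry a single \uop{} and chosen from the backend model so that the backend does not become the bottleneck, then
\[
    \mu_i = \frac{F \cdot \cycF{\kerK} - m}{n}\,.
\]
Repeating the measure for several values of $n$ and $m$, and keeping only the kernels whose measured cycles are consistent with a frontend bottleneck, would make such an estimate more robust on microarchitectures other than the A72.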