\section{A parametric model for future work on automatic frontend model generation}
%\section{Future works: benchmarks-based automatic frontend model generation}

While this chapter was solely centered on the Cortex A72, we believe that this study paves the way for an automated frontend model synthesis akin to \palmed{}. This synthesis should be fully automated; it should stem solely from benchmarking data and a description of the ISA; and it should avoid the use of any specific hardware counter. As a scaffold for such future work, we propose the parametric model in \autoref{fig:parametric_model}. Some of its parameters should be obtainable with the methods used in this chapter, while for others, new methods must be devised.

Such a model would probably be unable to account for ``unusual'' frontend bottlenecks ---~at least not at the level of detail that \eg{} the \uica{} authors gather for Intel frontends~\cite{uica}. This level of detail, however, is possible precisely because the authors restricted their scope to microarchitectures from a single manufacturer, which share many similarities. Assessing how much precision an automatically-generated model loses, and how much it gains \wrt{} a model with no frontend at all, remains to be done.

\medskip{}

Our model introduces a limited number of parameters, depicted in red italics in \autoref{fig:parametric_model}. It is composed of two parts: a model of the frontend itself, describing architectural parameters; and insights about each instruction. Its parameters are:
\begin{itemize}
    \item{} the number of \uops{} that can be dispatched overall per cycle;
    \item{} the number of distinct dispatch queues of the processor (\eg{} memory operations, integer operations, \ldots);
    \item{} for each of those queues, the number of \uops{} it can dispatch per cycle;
    \needspace{4\baselineskip}
    \item{} for each instruction $i$,
    \begin{itemize}
        \item{} its total number of \uops{} $\mu_i$;
        \item{} the number of \uops{} that get dispatched to each individual queue (summing up to $\mu_i$).
    \end{itemize}
\end{itemize}

\begin{figure}
    \begin{subfigure}{\textwidth}
        \centering
        \includegraphics[width=0.7\textwidth]{parametric_model-frontend.svg}
        \caption{Frontend model}\label{fig:parametric_model:front}
    \end{subfigure}

    \vspace{2em}

    \begin{subfigure}{\textwidth}
        \centering
        \includegraphics[width=0.75\textwidth]{parametric_model-insn.svg}
        \caption{Instruction model}\label{fig:parametric_model:insn}
    \end{subfigure}
    \caption{A generic parametric model of a processor's frontend. In red italics, the parameters that must be discovered for each architecture.}\label{fig:parametric_model}
\end{figure}

\bigskip{}

The first step in modeling a processor's frontend should certainly be to characterize the number of \uops{} that can be dispatched in a cycle. We assume that a model of the backend is known --~obtained for instance from \palmed{}, from the \uopsinfo{} tables, or by any other means. To the best of our knowledge, we can safely further assume that instructions loading a single backend port exactly once are also composed of a single \uop{}. Generating a few kernels that combine a diversity of such instructions, measuring their effective throughput --~using the backend model to make sure that the backend is not the bottleneck~--, and keeping the maximal throughput reached should provide a good estimate of this value.
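As an illustration of this first step, a minimal sketch of such an exploration is given below. It is only a sketch: the measurement helper \texttt{measure\_cycles} (running an unrolled kernel and returning elapsed cycles per iteration) and the backend-model interface are hypothetical placeholders for the surrounding benchmarking infrastructure, and the search strategy is deliberately naive.

\begin{verbatim}
import itertools

def estimate_dispatch_width(single_uop_insns, backend_model,
                            measure_cycles, combination_size=4, unroll=16):
    """Estimate the maximal number of uops dispatched per cycle.

    `single_uop_insns` lists instructions known, from the backend model,
    to load a single port exactly once (hence assumed to be one uop each);
    `measure_cycles` runs a kernel and returns cycles per iteration.
    """
    best = 0.0
    for combo in itertools.combinations(single_uop_insns, combination_size):
        kernel = [insn for _ in range(unroll) for insn in combo]
        measured = len(kernel) / measure_cycles(kernel)  # uops per cycle
        # Keep only measurements where the backend model predicts a higher
        # throughput than what was measured: the bottleneck was then the
        # frontend, and the measurement is a valid dispatch-width sample.
        if measured < backend_model.uops_per_cycle(kernel):
            best = max(best, measured)
    return best
\end{verbatim}

In practice, the combination size would be varied as well, and kernels would be pruned using the backend model before being run, so that only combinations for which the backend cannot be the bottleneck are measured.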
\medskip{}

In this chapter, we obtained the number of dispatch queues and their respective throughputs by reading the official documentation. Automating this step remains to be addressed in order to obtain a fully automatic model. It should be possible to make these parameters apparent by identifying ``simple'' instructions that conflict with one another beyond what the main dispatch limitation alone would explain, and by combining them.

\medskip{}

The core of the model presented in this chapter is the discovery, for each instruction, of its \uop{} count. Still assuming the knowledge of a backend model, the method described in \autoref{ssec:a72_insn_muop_count} should be generic enough to be used on any processor. The basic instructions may easily be selected using the backend model --~we assume their existence in most microarchitectures, as pragmatic concerns guide the design of the ports. Counting the \uops{} of an instruction then follows, using only elapsed-cycles counters. This method assumes that $\cycF{\kerK}$ bottlenecks on the global dispatch queue for $\kerK$, and not on specific dispatch queues. This must be ensured by selecting well-chosen kernels~--- for instance, on the A72, care must be taken to interleave instructions corresponding to diverse enough dispatch pipelines.

\medskip{}

Finally, the breakdown of each instruction's \uops{} into their respective dispatch queues should follow from the backend model, as each dispatch queue is tied to a subset of backend ports.
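This counting step, too, lends itself to a simple automated procedure. The sketch below illustrates it under the assumption stated above (that the kernel bottlenecks on the global dispatch queue, so that its cycle count is roughly its total \uop{} count divided by the dispatch width) and with the same hypothetical measurement helper as before; it is not the exact procedure of \autoref{ssec:a72_insn_muop_count}, only a schematic rendition of it.

\begin{verbatim}
def count_uops(insn, basic_insns, dispatch_width, measure_cycles, repeat=8):
    """Estimate the uop count of `insn`.

    `basic_insns` is a list of single-uop instructions chosen, using the
    backend model, to cover diverse enough dispatch pipelines so that the
    global dispatch queue (and not the backend or a single dispatch queue)
    is the bottleneck of the resulting kernel.
    """
    kernel = ([insn] + basic_insns) * repeat
    cycles = measure_cycles(kernel)
    # Under a global dispatch bottleneck, cycles ~= total_uops / width.
    total_uops = cycles * dispatch_width
    basic_uops = len(basic_insns) * repeat  # one uop per basic instruction
    return round((total_uops - basic_uops) / repeat)
\end{verbatim}

Care must of course be taken that \texttt{basic\_insns} does not saturate any single dispatch queue, mirroring the interleaving constraint mentioned above.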
\bigskip{}

\paragraph{The question of complex decoders.} While the ARM ISA is composed of fixed-length instructions, which makes decoding easier, this is not always the case. The x86 ISA, for one, uses instructions that vary in length from one to fifteen bytes~\cite{ref:intel64_software_dev_reference_vol1}. Longer instructions may prove to be a major frontend slowdown, especially when such an instruction crosses an instruction cache line boundary~\cite{uica}. Processors implementing ISAs subject to decoding bottlenecks typically also feature a decoded \uop{} cache, or \emph{decoded stream buffer} (DSB). The typical hit rate of this cache is about 80\%~\cites[Section B.5.7.2]{ref:intel64_software_dev_reference_vol1}{dead_uops}. Code analyzers, however, are concerned with loops and, more generally, hot code portions. Under such conditions, we expect this cache, once hot in steady-state, to be very close to a 100\% hit rate. In this case, only the dispatch throughput remains limiting, and modeling the decoding bottlenecks becomes irrelevant.

\bigskip{}

\paragraph{Points of vigilance and limitations.} This parametric model aims to be a compromise between simplicity of automation and good accuracy. Experimentation may prove that it lacks some important features to be accurate. Depending on the targeted architecture, the following points should be investigated if the model does not reach the expected accuracy.

\begin{itemize}
    \item{} We introduced the DSB (\uop{} cache) just above. This model considers that the DSB is never the cause of a bottleneck and that, instead, the number of dispatched \uops{} per cycle always bottlenecks first. This might not hold, as DSBs are already complex in themselves~\cite{uica}.
    \item{} Intel CPUs use a Loop Stream Detector (LSD) to keep a whole loop body's worth of \uops{} in the decode queue if the frontend detects that a small enough loop is repeated~\cite{uica, dead_uops}. In this case, \uops{} are repeatedly streamed from the decode queue, without even the need to hit a cache. We are unaware of similar features in other commercial processors. In embedded programming, however, \emph{hardware loops} --~which are set up explicitly by the programmer~-- achieve, among other things, the same goal~\cite{hardware_loops_patent,kavvadias2007hardware,talla2001hwloops}.
    \item{} The \emph{branch predictor} of a CPU is responsible for guessing, before the actual logic is computed, whether a conditional jump will be taken. A misprediction forces the frontend to re-populate its queues with instructions from the branch actually taken, and typically stalls the pipeline for several cycles~\cite{branch_pred_penalty}. Our model, however, does not include a branch predictor, for much the same reason that it does not include complex decoders: in steady-state, in a hot code portion, we expect the branch predictor to always predict correctly.
    \item{} In reality, there is an intermediate step between instructions and \uops{}: macro-ops. Although it serves design and semantic purposes, we omit this step in the current model as --~we believe~-- it is of little importance for predicting performance.
    \item{} On x86 architectures at least, common pairs of micro- or macro-operations may be ``fused'' into a single one up to various points of the pipeline, to save space in some queues or to artificially relax dispatch limitations. This mechanism is implemented in Intel architectures, and to some extent in AMD architectures since Zen~\cites[\S{}3.4.2]{ref:intel64_architectures_optim_reference_vol1}{uica}{Vishnekov_2021}. This may make some kernels seem to ``bypass'' dispatch limits.
\end{itemize}
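Finally, to make concrete how the parameters of \autoref{fig:parametric_model} would be used once discovered, the sketch below combines them into a frontend-induced throughput bound for a kernel, to be confronted with the bound given by the backend model. The data layout and the way the two bounds are combined are, again, only illustrative assumptions, valid under the steady-state hypotheses discussed above.

\begin{verbatim}
from dataclasses import dataclass

@dataclass
class FrontendModel:
    dispatch_width: float           # uops dispatched per cycle, overall
    queue_widths: dict[str, float]  # uops per cycle for each dispatch queue
    uop_breakdown: dict[str, dict[str, int]]  # insn -> queue -> uop count

    def cycles(self, kernel):
        """Frontend-induced bound on the cycles of one kernel iteration."""
        total = sum(sum(self.uop_breakdown[i].values()) for i in kernel)
        bounds = [total / self.dispatch_width]
        for queue, width in self.queue_widths.items():
            in_queue = sum(self.uop_breakdown[i].get(queue, 0) for i in kernel)
            bounds.append(in_queue / width)
        return max(bounds)

def predicted_cycles(kernel, frontend, backend_cycles):
    """The slower of the frontend and backend bounds dominates."""
    return max(frontend.cycles(kernel), backend_cycles(kernel))
\end{verbatim}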