\section{A parametric model for future work on automatic frontend model
generation}\label{sec:frontend_parametric_model}
While this chapter was centered solely on the Cortex A72, we believe that this
study paves the way for an automated frontend model synthesis akin to
\palmed{}. Such a synthesis should be fully automated; stem solely from
benchmarking data and a description of the ISA; and avoid the use of any
specific hardware counter.
As a scaffold for such future work, we propose the parametric model in
\autoref{fig:parametric_model}. Some of its parameters should be obtainable
with the methods used in this chapter, while for others, new methods must be
devised.
Such a model would probably be unable to account for ``unusual'' frontend
bottlenecks ---~at least not at the level of detail that \eg{} the \uica{}
authors gather for Intel frontends~\cite{uica}. This level of detail, however,
is attainable precisely because the authors restricted their scope to
microarchitectures that share many similarities, as they come from the same
manufacturer. Assessing the extent of the precision lost by an
automatically-generated model, and of the precision it gains \wrt{} a model
without a frontend, remains to be done.
\medskip{}
Our model introduces a limited number of parameters, depicted in red italics in
\autoref{fig:parametric_model}. It is composed of two parts: a model of the
frontend itself, describing architectural parameters; and per-instruction
data. Its parameters ---~a possible encoding of which is sketched right after
the list~--- are:
\begin{itemize}
\item{} the number of \uops{} that can be dispatched overall per cycle;
\item{} the number of distinct dispatch queues of the processor (\eg{}
memory operations, integer operations, \ldots);
\item{} for each of those queues, the number of \uops{} it can dispatch per
cycle;
\needspace{4\baselineskip}
\item{} for each instruction $i$,
\begin{itemize}
\item{} its total number of \uops{} $\mu_i$;
\item{} the number of \uops{} that get dispatched to each
individual queue (summing up to $\mu_i$).
\end{itemize}
\end{itemize}
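As an illustration only, the following Python-style sketch shows one possible
way of encoding these parameters; the structure and names
(\texttt{FrontendModel}, \texttt{InstructionModel}) are hypothetical and do
not refer to any existing tool:
\begin{verbatim}
# Hypothetical encoding of the parametric model; names are illustrative.
from dataclasses import dataclass

@dataclass
class FrontendModel:
    dispatch_per_cycle: int          # uops dispatched per cycle, overall
    queue_dispatch: dict[str, int]   # uops each queue can dispatch per cycle

@dataclass
class InstructionModel:
    total_uops: int                  # mu_i
    uops_per_queue: dict[str, int]   # per-queue breakdown, summing to mu_i
\end{verbatim}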
\begin{figure}
\begin{subfigure}{\textwidth}
\centering
\includegraphics[width=0.7\textwidth]{parametric_model-frontend.svg}
\caption{Frontend model}\label{fig:parametric_model:front}
\end{subfigure}
\vspace{2em}
\begin{subfigure}{\textwidth}
\centering
\includegraphics[width=0.75\textwidth]{parametric_model-insn.svg}
\caption{Instruction model}\label{fig:parametric_model:insn}
\end{subfigure}
\caption{A generic parametric model of a processor's frontend. In red
italics, the parameters which must be discovered for each
architecture.}\label{fig:parametric_model}
\end{figure}
\bigskip{}
The first step in modeling a processor's frontend should certainly be to
characterize the number of \uops{} that can be dispatched in a cycle. We assume
that a model of the backend is known ---~for instance a model generated by
\palmed{}, tables from \uopsinfo{}, or any other means. To the best of our
knowledge, we can safely further assume that instructions loading a single
backend port exactly once are also composed of a single \uop{}. Generating a
few diverse combinations of such instructions, measuring their effective
throughput ---~while making sure, using the backend model, that the backend is
not the bottleneck~--- and keeping the maximal throughput reached should
provide a good estimate.
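The following sketch outlines this procedure. It is only an outline:
\texttt{measure\_ipc} stands for an assumed benchmarking harness returning the
steady-state throughput of a kernel in instructions per cycle, and
\texttt{single\_uop\_insns} for the instructions identified as single-\uop{}
thanks to the backend model.
\begin{verbatim}
# Hypothetical sketch: estimating the global dispatch width.
import itertools

def estimate_dispatch_width(single_uop_insns, measure_ipc, combo_size=4):
    best = 0.0
    for combo in itertools.combinations(single_uop_insns, combo_size):
        # The backend model should be used here to discard combinations
        # for which the backend, rather than the frontend, would be the
        # bottleneck.
        ipc = measure_ipc(list(combo))
        best = max(best, ipc)
    # Single-uop instructions dispatch one uop each, so the maximal
    # instructions-per-cycle reached approximates the dispatch width.
    return best
\end{verbatim}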
\medskip{}
In this chapter, we obtained the number of dispatch queues and their
respective throughput by reading the official documentation. Automating this
step remains necessary to obtain a fully automatic model. It should be
possible to make these parameters apparent by identifying and combining
``simple'' instructions that conflict with each other beyond the main dispatch
limitation.
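One could imagine, for instance, a clustering strategy along the following
lines ---~again a mere sketch, where \texttt{measure\_ipc} and
\texttt{dispatch\_width} are assumed to be known from the previous step:
\begin{verbatim}
# Hypothetical sketch: grouping single-uop instructions into dispatch queues.
def infer_dispatch_queues(single_uop_insns, measure_ipc, dispatch_width):
    queues = []
    for insn in single_uop_insns:
        placed = False
        for queue in queues:
            # Mix the instruction with a representative of this queue: if
            # throughput saturates below what two independent queues would
            # allow, both instructions presumably share a dispatch queue.
            ipc = measure_ipc([insn, queue[0]])
            if ipc + 0.1 < min(2.0, dispatch_width):
                queue.append(insn)
                placed = True
                break
        if not placed:
            queues.append([insn])
    return queues
\end{verbatim}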
\medskip{}
The core of the model presented in this chapter is the discovery, for each
instruction, of its \uop{} count. Still assuming the knowledge of a backend
model, the method described in \autoref{ssec:a72_insn_muop_count} should be
generic enough to be used on any processor. The basic instructions may be
easily selected using the backend model ---~we assume their existence in most
microarchitectures, as pragmatic concerns guide the design of the ports.
Counting the \uops{} of an instruction thus follows, using only elapsed-cycles
counters. This method assumes that $\cycF{\kerK}$ bottlenecks on the global
dispatch queue for $\kerK$, and not on specific dispatch queues. This must be
ensured by selecting well-chosen kernels ---~for instance, on the A72, care
must be taken to interleave instructions corresponding to diverse enough
dispatch pipelines.
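A sketch of this step could look as follows. It is not the exact procedure of
\autoref{ssec:a72_insn_muop_count}, but conveys the idea, assuming a
frontend-bound kernel and an assumed \texttt{measure\_cycles} harness
returning the elapsed cycles per kernel iteration.
\begin{verbatim}
# Hypothetical sketch of the uop-counting step, assuming the kernel is
# bound by the global dispatch limit: with dispatch width D and n
# single-uop padding instructions, one kernel iteration dispatches
# mu_i + n uops, hence measured cycles ~= (mu_i + n) / D.
def count_uops(insn, basic_insns, measure_cycles, dispatch_width, n=16):
    # The padding instructions must be diverse enough (and interleaved)
    # so that no single dispatch queue becomes the bottleneck (see text).
    kernel = [insn] + basic_insns[:n]
    cycles = measure_cycles(kernel)   # elapsed cycles per kernel iteration
    return max(round(dispatch_width * cycles) - n, 1)
\end{verbatim}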
\medskip{}
Finally, the breakdown of each instruction's \uops{} into their respective
dispatch queues should follow from the backend model, as each dispatch queue is
tied to a subset of backend ports.
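Assuming the backend model exposes, for each \uop{} of an instruction, the set
of ports it may execute on, this breakdown could be computed as in the
following sketch, where the port-to-queue mapping is assumed known from the
architectural parameters above:
\begin{verbatim}
# Hypothetical sketch: per-queue uop breakdown derived from port usage.
def uops_per_queue(uop_port_sets, queue_of_port):
    # uop_port_sets: one set of candidate ports per uop of the instruction
    # queue_of_port: maps each backend port to the dispatch queue feeding it
    breakdown = {}
    for ports in uop_port_sets:
        queues = {queue_of_port[p] for p in ports}
        # Assumption: all candidate ports of a given uop sit behind the
        # same dispatch queue.
        (queue,) = queues
        breakdown[queue] = breakdown.get(queue, 0) + 1
    return breakdown
\end{verbatim}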
\bigskip{} \paragraph{The question of complex decoders.} While the ARM ISA is
composed of fixed-length instructions, making decoding easier, such is not
always the case.
The x86 ISA, for one, uses instructions varying in length from one to fifteen
bytes~\cite{ref:intel64_software_dev_reference_vol1}. Longer instructions may
severely slow down the frontend, especially when they cross an instruction
cache line boundary~\cite{uica}.
Processors implementing ISAs subject to decoding bottlenecks typically also
feature a decoded \uop{} cache, or \emph{decoded stream buffer} (DSB). The
typical hit rate of this cache is about 80\%~\cites[Section
B.5.7.2]{ref:intel64_software_dev_reference_vol1}{dead_uops}. However, code
analyzers are concerned with loops and, more generally, hot code portions.
Under such conditions, we expect this cache, once hot in steady state, to be
very close to a 100\% hit rate. In this case, only the dispatch throughput
remains limiting, and modeling the decoding bottlenecks becomes irrelevant.
\bigskip{} \paragraph{Points of vigilance and limitations.} This parametric
model aims to be a compromise between ease of automation and accuracy.
Experimentation may reveal that it lacks some features important for accuracy.
Depending on the targeted architecture, the following points should be
investigated if the model does not reach the expected accuracy.
\begin{itemize}
\item{} We introduced the DSB (\uop{} cache) just above. This model
assumes that the DSB is never the cause of a bottleneck and
that, instead, the number of dispatched \uops{} per cycle always
bottlenecks first. This might not hold, as DSBs are already complex by
themselves~\cite{uica}.
\item{} Intel CPUs use a Loop Stream Detector (LSD) to keep a whole loop
body's \uops{} in the decode queue when the frontend detects that a
small enough loop is repeated~\cite{uica, dead_uops}. In this case,
\uops{} are repeatedly streamed from the decode queue, without even
needing to hit a cache. We are unaware of similar features in other
commercial processors. In embedded programming, however, \emph{hardware
loops} ---~which are set up explicitly by the programmer~--- achieve,
among others, the same
goal~\cite{hardware_loops_patent,kavvadias2007hardware,talla2001hwloops}.
\item{} The \emph{branch predictor} of a CPU is responsible for guessing,
before the actual logic is computed, whether a conditional jump will be
taken. A misprediction forces the frontend to re-populate its queues
with instructions from the branch actually taken and typically stalls
the pipeline for several cycles~\cite{branch_pred_penalty}. Our model,
however, does not include a branch predictor, for much the same reason
that it does not include complex decoders: in steady state, in a hot
code portion, we expect the branch predictor to always predict
correctly.
\item{} In reality, there is an intermediate step between instructions and
\uops{}: macro-ops. Although it serves design and semantic
purposes, we omit this step in the current model as ---~we
believe~--- it is of little importance for performance prediction.
\item{} On x86 architectures at least, common pairs of micro- or
macro-operations may be ``fused'' into a single one up to various
stages of the pipeline, saving space in some queues or artificially
relaxing dispatch limitations. This mechanism is implemented in Intel
architectures, and to some extent in AMD architectures since
Zen~\cites[§3.4.2]{ref:intel64_architectures_optim_reference_vol1}{uica}{Vishnekov_2021}.
It may make some kernels appear to ``bypass'' dispatch limits.
\end{itemize}