\section{A parametric model for future work on automatic frontend model generation}

While this chapter was solely centered on the Cortex A72, we believe that this
study paves the way for an automated frontend model synthesis akin to
\palmed{}. Such a synthesis should be fully automated; stem solely from
benchmarking data and a description of the ISA; and avoid the use of any
specific hardware counter.

As a scaffold for such future work, we propose the parametric model in
\autoref{fig:parametric_model}. Some of its parameters should be obtainable
with the methods used in this chapter, while for others, new methods must be
devised.

Such a model would probably be unable to account for ``unusual'' frontend
bottlenecks --~at least not at the level of detail that, \eg{}, the \uica{}
authors gather for Intel frontends~\cite{uica}. This level of detail, however,
is attainable precisely because the authors restricted their scope to
microarchitectures that are strongly similar, as they all come from the same
manufacturer. Assessing the extent of the loss of precision of an
automatically generated model, and its gain of precision \wrt{} a model
without a frontend, remains to be done.

\begin{figure}
    \begin{subfigure}{\textwidth}
        \centering
        \includegraphics[width=0.9\textwidth]{parametric_model_sketch-frontend}
        \caption{Frontend model}\label{fig:parametric_model:front}
    \end{subfigure}

    \begin{subfigure}{\textwidth}
        \centering
        \includegraphics[width=0.9\textwidth]{parametric_model_sketch-insn}
        \caption{Instruction model}\label{fig:parametric_model:insn}
    \end{subfigure}

    \caption{A generic parametric model of a processor's frontend. In red, the
    parameters which must be discovered for each
    architecture.}\label{fig:parametric_model}
\end{figure}
\bigskip{}

The first step in modeling a processor's frontend should certainly be to
characterize the number of \uops{} that can be dispatched in a cycle. We
assume that a model of the backend is known, for instance one generated by
\palmed{}, derived from the \uopsinfo{} tables, or obtained by any other
means. To the best of our knowledge, it is safe to further assume that an
instruction loading a single backend port exactly once is also composed of a
single \uop{}. Generating a few combinations of a diverse selection of such
instructions, measuring their effective throughput --~using the backend model
to make sure that the backend is not the bottleneck~--, and keeping the
maximal throughput reached should provide a good value for the dispatch
width.
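
As an illustration, the following Python sketch outlines this search. The
helpers \texttt{backend\_model.cycles} (predicted backend cycles per kernel
iteration) and \texttt{measure\_cycles} (benchmarked cycles per iteration)
are hypothetical, as is the bound \texttt{MAX\_WIDTH} on plausible dispatch
widths.

\begin{verbatim}
import itertools

MAX_WIDTH = 8  # assumed upper bound on plausible dispatch widths

def find_dispatch_width(single_uop_insns, backend_model, measure_cycles):
    # `single_uop_insns`: instructions known from the backend model to
    # load a single port exactly once, hence assumed to be one uop each.
    best = 0.0
    for mix in itertools.combinations(single_uop_insns, 4):
        kernel = list(mix) * 2  # a few copies to smooth out loop overhead
        # Keep only kernels where, per the backend model, the backend
        # cannot be the bottleneck even for the widest plausible dispatch.
        if backend_model.cycles(kernel) * MAX_WIDTH > len(kernel):
            continue
        best = max(best, len(kernel) / measure_cycles(kernel))
    return round(best)  # the dispatch width is presumably an integer
\end{verbatim}
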
\medskip{}

In this chapter, we obtained the number of dispatch queues and their
respective throughput by reading the official documentation. Automating this
step remains to be addressed to obtain a fully automatic model. It should be
possible to make these parameters apparent by identifying ``simple''
instructions that conflict with each other beyond the main dispatch
limitation, and by combining them.
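
One possible automation of this search is sketched below, with the same
hypothetical helpers as before: pairs of single-\uop{} instructions that slow
each other down beyond both the global dispatch width and the backend model
are clustered into shared dispatch queues.

\begin{verbatim}
import itertools

def discover_dispatch_queues(insns, width, backend_model, measure_cycles):
    parent = {i: i for i in insns}  # naive union-find over conflicts

    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i

    for a, b in itertools.combinations(insns, 2):
        kernel = [a, b] * 4  # interleave to expose shared queues
        # Cycles explained by the global dispatch width or the backend:
        expected = max(len(kernel) / width, backend_model.cycles(kernel))
        if measure_cycles(kernel) > 1.1 * expected:  # 10% tolerance
            parent[find(a)] = find(b)  # a, b presumably share a queue

    # Instructions mapped to the same representative share a queue.
    return {i: find(i) for i in insns}
\end{verbatim}
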
\medskip{}

The core of the model presented in this chapter is the discovery, for each
instruction, of its \uop{} count. Still assuming that a backend model is
known, the method described in \autoref{ssec:a72_insn_muop_count} should be
generic enough to be used on any processor. The basic instructions may easily
be selected using the backend model --~we assume their existence in most
microarchitectures, as pragmatic concerns guide the design of the ports.
Counting the \uops{} of an instruction thus follows, using only elapsed-cycles
counters.
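
To make the measurement explicit, here is a sketch assuming --~as this
chapter's model does~-- that the frontend dispatches at most $W$ \uops{} per
cycle and that no other frontend effect interferes. Let $\mu(i)$ denote the
unknown \uop{} count of an instruction $i$ (notation local to this sketch). A
frontend-bound kernel $\kerK$ made of one instance of $i$ and $n$ instructions
known to be single-\uop{} then satisfies, in steady state,
\[
\cycF{\kerK} = \frac{\mu(i) + n}{W},
\qquad\text{hence}\qquad
\mu(i) = W \cdot \cycF{\kerK} - n.
\]
Measuring $\cycF{\kerK}$ for a few values of $n$, and checking consistency
across them, yields $\mu(i)$.
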
This method assumes that $\cycF{\kerK}$ is bottlenecked by the global dispatch
queue for $\kerK$, and not by specific dispatch queues. This must be ensured
by selecting well-chosen kernels: for instance, on the A72, care must be taken
to interleave instructions corresponding to diverse enough dispatch pipelines.
\medskip{}

Finally, the breakdown of each instruction's \uops{} into their respective
dispatch queues should follow from the backend model, as each dispatch queue
is tied to a subset of backend ports.
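
For instance, assuming the backend model exposes, for each \uop{}, the set of
ports it may execute on, and that the ports fed by each dispatch queue were
discovered beforehand, the mapping reduces to a set-inclusion test. The sketch
below is hypothetical; queue and port names are merely illustrative.

\begin{verbatim}
def assign_uops_to_queues(insn_port_sets, queue_ports):
    # `insn_port_sets`: one entry per uop of the instruction, each the
    # set of backend ports this uop may execute on (backend model).
    # `queue_ports`: dispatch-queue name -> set of backend ports it
    # feeds, as discovered previously.
    assignment = []
    for uop_ports in insn_port_sets:
        candidates = [q for q, ports in queue_ports.items()
                      if uop_ports <= ports]
        assignment.append(candidates[0] if candidates else None)
    return assignment

# Hypothetical example, loosely inspired by the A72:
queues = {"FP/SIMD": {"F0", "F1"}, "Integer": {"I0", "I1"},
          "Load": {"L"}}
print(assign_uops_to_queues([{"F0", "F1"}, {"L"}], queues))
# -> ['FP/SIMD', 'Load']
\end{verbatim}
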
\bigskip{} \paragraph{The question of complex decoders.} While the ARM ISA is
composed of fixed-length instructions, making decoding easier, such is not
always the case. The x86 ISA, for one, uses instructions varying in length
from one to fifteen bytes~\cite{ref:intel64_software_dev_reference_vol1}.
Longer instructions may severely slow down the frontend, especially when they
cross an instruction cache line boundary~\cite{uica}.

Processors implementing ISAs subject to decoding bottlenecks typically also
feature a decoded \uop{} cache. The typical hit rate of this cache is about
80\%~\cites[Section
B.5.7.2]{ref:intel64_software_dev_reference_vol1}{dead_uops}. However, code
analyzers are concerned with loops and, more generally, hot code portions.
Under such conditions, we expect this cache, once hot in steady state, to be
very close to a 100\% hit rate. In this case, only the dispatch throughput
remains limiting, and modeling the decoding bottlenecks becomes irrelevant.

\bigskip{} \paragraph{Points of vigilance and limitations.} This parametric
model aims at a compromise between ease of automation and good accuracy.
Experimentation may prove that it lacks some features important for accuracy.
Depending on the targeted architecture, the following points should be
investigated if the model does not reach the expected accuracy.

\begin{itemize}
\item{} Intel CPUs use a Loop Stream Detector (LSD) to keep the \uops{} of a
    whole loop body in the decode queue when the frontend detects that a
    small enough loop is repeated~\cite{uica, dead_uops}. In this case,
    \uops{} are repeatedly streamed from the decode queue, without even
    needing to hit a cache. We are unaware of other architectures with such
    a feature.

\item{} In reality, there is an intermediate step between instructions and
    \uops{}: macro-ops. Although it serves design and semantic purposes, we
    omit this step in the current model as --~we believe~-- it is of little
    importance for performance prediction.

\item{} On x86 architectures at least, common pairs of micro- or
    macro-operations may be ``fused'' into a single one, up to various points
    of the pipeline, to save space in some queues or to artificially raise
    dispatch limits. This mechanism is implemented in Intel architectures,
    and to some extent in AMD architectures since
    Zen~\cites[§3.4.2]{ref:intel64_architectures_optim_reference_vol1}{uica}{Vishnekov_2021}.
    This may make some kernels seem to ``bypass'' dispatch limits, as
    illustrated in the example after this list.
\end{itemize}
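
As an illustration of this last point: on recent Intel cores, a comparison
immediately followed by a conditional branch may macro-fuse into a single
\uop{}~\cite{uica}. In the hypothetical kernel below (NASM-style x86
assembly), eight instructions dispatch as only four fused \uops{}.

\begin{verbatim}
; Hypothetical x86 kernel.  Each cmp is immediately followed by a
; conditional branch; on recent Intel cores, each such pair may
; macro-fuse into a single uop (the branches are never taken here).
    cmp  rax, rbx
    jne  next1          ; fuses with the cmp above
next1:
    cmp  rcx, rdx
    jne  next2          ; idem
next2:
    cmp  rsi, rdi
    jne  next3          ; idem
next3:
    cmp  r8, r9
    jne  next4          ; idem
next4:
    ; ... surrounding benchmark loop omitted
\end{verbatim}

With a dispatch width of four fused \uops{} per cycle, these eight
instructions may dispatch in a single cycle --~that is, eight instructions
per cycle, which a model unaware of fusion would deem impossible.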