\section{Future work: benchmark-based automatic frontend model generation}

While this chapter focused solely on the Cortex A72, we believe that this
study paves the way for an automated frontend model synthesis akin to
\palmed{}. This synthesis should be fully automated; stem solely from
benchmarking data and a description of the ISA; and avoid relying on any
specific hardware counter.

As a scaffold for such future work, we propose the parametric model in
\autoref{fig:parametric_model}. Some of its parameters should be obtainable
with the methods used in this chapter, while new methods must be devised for
the others.

Such a model would probably be unable to account for ``unusual'' frontend
bottlenecks, at least not at the level of detail that \eg{} the \uica{} authors
gather for Intel frontends~\cite{uica}. This level of detail, however, is
possible precisely because the authors restricted their scope to
microarchitectures that share many similarities, coming from the same
manufacturer. Assessing how much precision an automatically-generated model
loses compared to such a detailed model, and how much it gains \wrt{} a model
without frontend, remains to be done.

\begin{figure}
    \begin{subfigure}{\textwidth}
        \centering
        \includegraphics[width=0.9\textwidth]{parametric_model_sketch-frontend}
        \caption{Frontend model}\label{fig:parametric_model:front}
    \end{subfigure}
    \begin{subfigure}{\textwidth}
        \centering
        \includegraphics[width=0.9\textwidth]{parametric_model_sketch-insn}
        \caption{Instruction model}\label{fig:parametric_model:insn}
    \end{subfigure}
    \caption{A generic parametric model of a processor's frontend. In red, the
    parameters that must be discovered for each
    architecture.}\label{fig:parametric_model}

    \textbf{NOTE:} In {\color{green}green}, my edits after scanning.\todo{}
\end{figure}

\bigskip{}

The first step in modeling a processor's frontend should certainly be to
characterize the number of \uops{} that can be dispatched in a cycle. We assume
that a model of the backend is known, obtained for instance from a model
generated by \palmed{}, from the tables of \uopsinfo{} or by any other means.
To the best of our knowledge, we can safely further assume that instructions
that load a single backend port only once are also composed of a single
\uop{}. Generating a few combinations of a diverse set of such instructions,
measuring their effective throughput --~while making sure, using the backend
model, that the backend is not the bottleneck~-- and keeping the maximal
throughput reached should provide a good estimate of this dispatch width.
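
As a sketch of this estimation, under notation that is ours and not the
chapter's: let $\mathcal{S}$ be the set of benchmarked kernels made only of
such single-\uop{} instructions and for which the backend model predicts fewer
cycles than measured; let $|\kerK|$ be the number of instructions of a kernel
$\kerK$ and $C(\kerK)$ its measured cycles per iteration. The dispatch width
$W$ may then be estimated as
\[
    \hat{W} = \max_{\kerK \in \mathcal{S}} \frac{|\kerK|}{C(\kerK)}
\]
For instance, a purely hypothetical kernel of twelve such instructions measured
at three cycles per iteration would yield $\hat{W} \geq 4$; this is a lower
bound, which only becomes tight if at least one kernel actually saturates the
dispatch stage.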

\medskip{}

In this chapter, we obtained the number of dispatch queues and their
respective throughputs by reading the official documentation. Automating this
step remains to be addressed to obtain a fully automatic model. It should be
possible to make these parameters apparent by identifying ``simple''
instructions that conflict beyond the main dispatch limitation, and
combining them.

\medskip{}

The core of the model presented in this chapter is the discovery, for each
instruction, of its \uop{} count. Still assuming that a backend model is
known, the method described in \autoref{ssec:a72_insn_muop_count} should
be generic enough to be used on any processor. The basic instructions may be
easily selected using the backend model; we assume that such instructions
exist in most microarchitectures, as pragmatic concerns guide the design of
ports. Counting the \uops{} of an instruction then only requires elapsed-cycle
counters, provided that $\cycF{\kerK}$ is bottlenecked by the global dispatch
queue for the measured kernel $\kerK$. This condition can be arranged by
choosing kernels carefully: on the A72, for instance, care must be taken to
interleave instructions that map to sufficiently diverse dispatch pipelines.
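
One way to read this counting step, as a sketch under our own notation rather
than the exact formulation of \autoref{ssec:a72_insn_muop_count}: assume a
dispatch width $W$ in \uops{} per cycle, a basic single-\uop{} instruction $b$,
and a kernel made of the instruction $i$ under study followed by $n$ copies of
$b$, whose measured cycles per iteration are $C(i + n \cdot b)$. If this kernel
is bottlenecked by the global dispatch queue only, then
\[
    C(i + n \cdot b) = \frac{\mu(i) + n}{W}
    \quad\Longrightarrow\quad
    \mu(i) = W \cdot C(i + n \cdot b) - n
\]
where $\mu(i)$ denotes the \uop{} count of $i$. With the hypothetical values
$W = 3$, $n = 4$ and a measured $C = 2$ cycles, this gives
$\mu(i) = 3 \times 2 - 4 = 2$ \uops{}.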

\medskip{}

Finally, the breakdown of each instruction's \uops{} into their respective
dispatch queues should follow from the backend model, as each dispatch queue is
tied to a subset of backend ports.
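
Put together, the parameters of this model could combine as follows; this is a
sketch under our own notation, assuming that the global dispatch limit and each
dispatch queue act as independent throughput limits, and that each \uop{} is
sent to exactly one queue. Let $W$ be the dispatch width, $Q$ the set of
dispatch queues, $t_q$ the throughput of queue $q$ in \uops{} per cycle, and
$\mu_q(i)$ the number of \uops{} of instruction $i$ sent to $q$, with
$\mu(i) = \sum_{q \in Q} \mu_q(i)$. The frontend cycles of a kernel $\kerK$
would then be bounded from below by
\[
    \max\left(
        \frac{1}{W} \sum_{i \in \kerK} \mu(i),\;
        \max_{q \in Q} \frac{1}{t_q} \sum_{i \in \kerK} \mu_q(i)
    \right)
\]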

\bigskip{} \paragraph{The question of complex decoders.} While the ARM ISA is
composed of fixed-length instructions, which makes decoding easier, such is not
always the case.
The x86 ISA, for one, uses instructions that vary in length from one to fifteen
bytes~\cite{ref:intel64_software_dev_reference_vol1}. Longer instructions
may cause significant frontend slowdowns, especially when they
cross an instruction cache line boundary~\cite{uica}.

Processors implementing ISAs subject to decoding bottlenecks typically also
feature a decoded \uop{} cache. The typical hit rate of this cache is about
80\%~\cite[Section
B.5.7.2]{ref:intel64_software_dev_reference_vol1}\cite{dead_uops}. However,
code analyzers are concerned with loops and, more generally, hot code portions.
Under such conditions, we expect this cache, once warm in steady state, to be
very close to a 100\% hit rate. In this case, only the dispatch throughput is
limiting, and modeling the decoding bottlenecks becomes irrelevant.
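
To make this expectation concrete under a simplifying assumption of ours: if
the \uops{} of a loop body fit entirely in the decoded \uop{} cache and only
the first of $n$ iterations misses, the hit rate over the whole loop execution
is
\[
    \frac{n - 1}{n} = 1 - \frac{1}{n}
\]
which already exceeds 99\% for $n > 100$ iterations.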


\bigskip{} \paragraph{Points of vigilance and limitations.} This parametric
model aims to be a compromise between ease of automation and good
accuracy. Experimentation may show that it lacks some features that are
important for accuracy. Depending on the targeted architecture, the following
points should be investigated if the model does not reach the expected
accuracy.

\begin{itemize}

    \item{} Intel CPUs use a Loop Stream Detector (LSD) to keep a whole loop
        body's \uops{} in the decode queue if the frontend detects that a
        small enough loop is repeated~\cite{uica, dead_uops}. In this case,
        \uops{} are repeatedly streamed from the decode queue, without even
        needing to hit a cache. We are unaware of
        other architectures with such a feature.

    \item{} macro-ops \todo{}

    \item{} fusion, lamination \todo{}

\end{itemize}