\section{Future work: benchmark-based automatic frontend model generation}
While this chapter focused solely on the Cortex A72, we believe that this
study paves the way for an automated frontend model synthesis akin to
\palmed{}. This synthesis should be fully automated, stem solely from
benchmarking data and a description of the ISA, and avoid the use of any
specific hardware counter.
As a scaffold for such future work, we propose the parametric model in
\autoref{fig:parametric_model}. It should be possible to obtain some of its
parameters with the methods used in this chapter, while for others, new
methods must be devised.
Such a model would probably be unable to account for ``unusual'' frontend
bottlenecks, at least not at the level of detail that \eg{} the authors of
\uica{} gather for Intel frontends~\cite{uica}. This level of detail, however,
is possible precisely because the authors restricted their scope to
microarchitectures that share many similarities, as they come from the same
manufacturer. Assessing how much precision an automatically generated model
loses, and how much it gains \wrt{} a model without frontend, remains to be
done.
\begin{figure}
\begin{subfigure}{\textwidth}
\centering
\includegraphics[width=0.9\textwidth]{parametric_model_sketch-frontend}
\caption{Frontend model}\label{fig:parametric_model:front}
\end{subfigure}
\begin{subfigure}{\textwidth}
\centering
\includegraphics[width=0.9\textwidth]{parametric_model_sketch-insn}
\caption{Instruction model}\label{fig:parametric_model:insn}
\end{subfigure}
\caption{A generic parametric model of a processor's frontend. In red, the
parameters which must be discovered for each
architecture.}\label{fig:parametric_model}
\textbf{NOTE:} In {\color{green}green}, my edits after the scan.\todo{}
\end{figure}
\bigskip{}
The first step in modeling a processor's frontend should certainly be to
characterize the number of \uops{} that can be dispatched in a cycle. We assume
that a model of the backend is known, for instance a model generated by
\palmed{}, tables from \uopsinfo{} or any other means. To the
best of our knowledge, we can safely further assume that instructions that load
a single backend port only once are also composed of a single \uop{}.
Generating a few diverse combinations of such instructions, measuring their
effective throughput --~while using the backend model to make sure that the
backend is not the bottleneck~-- and keeping the maximal throughput reached
should provide a good estimate of this value.
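As a minimal sketch of this estimate, with notation introduced here for
illustration only: let $\mathcal{S}$ be the set of benchmarked kernels made
solely of single-\uop{} instructions for which the backend model predicts no
backend bottleneck, let $n(\kerK)$ be the number of instructions of such a
kernel $\kerK$, and let $C(\kerK)$ be its measured number of cycles per
iteration in steady state. The dispatch width $D$, in \uops{} per cycle, could
then be estimated as
\[
    \hat{D} = \max_{\kerK \in \mathcal{S}} \frac{n(\kerK)}{C(\kerK)}.
\]
With purely hypothetical figures, if the best kernel found holds eight
instructions and runs in two cycles per iteration, then $\hat{D} = 4$. Since
measurements are noisy, rounding $\hat{D}$ to the nearest integer is likely
necessary; this is an assumption of the sketch, not a result of this chapter.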
\medskip{}
In this chapter, we obtained the number of dispatch queues and their
respective throughputs by reading the official documentation. Automating this
part remains to be addressed to obtain an automatic model. It should be
possible to expose these parameters by identifying and combining ``simple''
instructions that conflict beyond the main dispatch limitation.
\medskip{}
The core of the model presented in this chapter is the discovery, for each
instruction, of its \uop{} count. Still assuming that a backend model is
known, the method described in \autoref{ssec:a72_insn_muop_count} should
be generic enough to be used on any processor. The basic instructions may be
easily selected using the backend model --~we assume their existence in most
microarchitectures, as pragmatic concerns guide port design. Counting the
\uops{} of an instruction thus follows, using only elapsed cycles counters,
assuming $\cycF{\kerK}$ bottlenecks on a global dispatch queue for $\kerK$.
This can, however, be arranged by choosing kernels carefully: for instance,
on the A72, care must be taken to interleave instructions corresponding to
diverse enough dispatch pipelines.
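The arithmetic behind such a count can be sketched as follows, with notation
introduced here for illustration rather than taken from
\autoref{ssec:a72_insn_muop_count}. Consider a kernel $\kerK$ made of the
instruction $i$ under study and $n$ basic single-\uop{} instructions, and let
$D$ be the global dispatch width in \uops{} per cycle. If $\kerK$ bottlenecks
on the global dispatch queue, its measured number of cycles per iteration
$C(\kerK)$ in steady state satisfies
\[
    C(\kerK) = \frac{\mu(i) + n}{D}
    \quad\Longrightarrow\quad
    \mu(i) = D \cdot C(\kerK) - n,
\]
where $\mu(i)$ denotes the \uop{} count of $i$. With the hypothetical values
$D = 3$, $n = 4$ and $C(\kerK) = 2$, this yields $\mu(i) = 2$. The relation
only holds under the dispatch-bound assumption above; if another resource
limits $\kerK$, the right-hand side overestimates $\mu(i)$.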
\medskip{}
Finally, the breakdown of each instruction's \uops{} into their respective
dispatch queues should follow from the backend model, as each dispatch queue is
tied to a subset of backend ports.
\bigskip{} \paragraph{The question of complex decoders.} While the ARM ISA is
composed of instructions of fixed length, making decoding easier, such is not
always the case.
The x86 ISA, for one, uses instructions that vary in length from one to fifteen
bytes~\cite{ref:intel64_software_dev_reference_vol1}. Longer instructions
may cause a significant frontend slowdown, especially when such instructions
cross an instruction cache line boundary~\cite{uica}.
Processors implementing ISAs subject to decoding bottlenecks typically also
feature a decoded \uop{} cache. The typical hit rate of this cache is about
80\%~\cite[Section
B.5.7.2]{ref:intel64_software_dev_reference_vol1}\cite{dead_uops}. However,
code analyzers are concerned with loops and, more generally, hot code portions.
Under such conditions, we expect this cache, once warm and in steady state, to
reach a hit rate very close to 100\%. In this case, only the dispatch
throughput will be limiting, and modeling the decoding bottlenecks becomes
irrelevant.
\bigskip{} \paragraph{Caveats and limitations.} This parametric
model aims to be a compromise between simplicity of automation and good
accuracy. Experimentation may reveal that it lacks features that are important
for accuracy. Depending on the architecture targeted, the following points
should be investigated if the model does not reach the expected accuracy.
\begin{itemize}
    \item{} Intel CPUs use a Loop Stream Detector (LSD) to keep the \uops{} of
        a whole loop body in the decode queue when the frontend detects that a
        small enough loop is repeated~\cite{uica, dead_uops}. In this case,
        \uops{} are repeatedly streamed from the decode queue, without even
        needing to hit a cache. We are unaware of
        other architectures with such a feature.
\item{} macro-ops \todo{}
\item{} fusion, lamination \todo{}
\end{itemize}