\section{A parametric model for future work on automatic frontend model
generation}\label{sec:frontend_parametric_model}
While this chapter was solely centered on the Cortex A72, we believe that this
study paves the way for an automated frontend model synthesis akin to
\palmed{}. This synthesis should be fully automated, should stem solely from
benchmarking data and a description of the ISA, and should avoid the use of any
specific hardware counter.
As a scaffold for such future work, we propose the parametric model in
\autoref{fig:parametric_model}. Some of its parameters should be possible
to obtain with the methods used in this chapter, while for some others, new
methods must be devised.
Such a model would probably be unable to account for ``unusual'' frontend
bottlenecks ---~at least not at the level of detail that \eg{} \uica{} authors
gather for Intel frontends~\cite{uica}. This level of detail, however, is
possible precisely because the authors restricted their scope to
microarchitectures that share many similarities, as they come from the same
manufacturer. Assessing how much precision an
automatically generated model loses, and how much it gains \wrt{} a model without
frontend, remains to be done.
\medskip{}
Our model introduces a limited number of parameters, depicted in red italics in
\autoref{fig:parametric_model}. It is composed of two parts: a model of the
frontend itself, describing architectural parameters; and a description of
each instruction. Its parameters are:
\begin{itemize}
\item{} the number of \uops{} that can be dispatched overall per cycle;
\item{} the number of distinct dispatch queues of the processor (\eg{}
memory operations, integer operations, \ldots);
\item{} for each of those queues, the number of \uops{} it can dispatch per
cycle;
\needspace{4\baselineskip}
\item{} for each instruction $i$,
\begin{itemize}
\item{} its total number of \uops{} $\mu_i$;
\item{} the number of \uops{} that get dispatched to each
individual queue (summing up to $\mu_i$).
\end{itemize}
\end{itemize}
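As a sketch of how these parameters could combine (this is our reading of the
model, not a formula established in this chapter), write $D$ for the overall
number of \uops{} dispatched per cycle, $Q$ for the set of dispatch queues,
$d_q$ for the number of \uops{} that queue $q$ can dispatch per cycle, and
$\mu_{i,q}$ for the number of \uops{} of instruction $i$ sent to queue $q$.
Apart from $\mu_i$, these notations are introduced here for illustration only.
Assuming that dispatch is the only frontend limitation, that the kernel $\kerK$
runs in steady state and that no rounding effect applies, the frontend-bound
number of cycles per iteration would be approximated by
\[
    \cycF{\kerK} \approx \max\left(
        \frac{\sum_{i \in \kerK} \mu_i}{D},\;
        \max_{q \in Q} \frac{\sum_{i \in \kerK} \mu_{i,q}}{d_q}
    \right),
    \qquad \mbox{where} \quad \mu_i = \sum_{q \in Q} \mu_{i,q},
\]
the sums ranging over the instructions of $\kerK$ counted with multiplicity.
The first term expresses the overall dispatch limit; the second expresses the
limit of the most heavily loaded queue.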
\begin{figure}
\begin{subfigure}{\textwidth}
\centering
\includegraphics[width=0.7\textwidth]{parametric_model-frontend.svg}
\caption{Frontend model}\label{fig:parametric_model:front}
\end{subfigure}
\vspace{2em}
\begin{subfigure}{\textwidth}
\centering
\includegraphics[width=0.75\textwidth]{parametric_model-insn.svg}
\caption{Instruction model}\label{fig:parametric_model:insn}
\end{subfigure}
\caption{A generic parametric model of a processor's frontend. In red
italics, the parameters which must be discovered for each
architecture.}\label{fig:parametric_model}
\end{figure}
\bigskip{}
The first step in modeling a processor's frontend should certainly be to
characterize the number of \uops{} that can be dispatched in a cycle. We assume
that a model of the backend is known, for instance a model
generated by \palmed{}, tables from \uopsinfo{}, or any other means. To the
best of our knowledge, we can safely further assume that instructions that load
a single backend port only once are also composed of a single \uop{}.
Generating a few diverse combinations of such instructions, measuring their
effective throughput while using the backend model to make sure that the backend
is not the bottleneck, and keeping the maximal throughput reached should provide
a good value.
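This estimation can be sketched as follows; the notations are ours and purely
illustrative. Let $\mathcal{K}$ be the set of generated kernels for which the
backend model predicts no backend bottleneck, each kernel $\kerK$ being
composed of $n_{\kerK}$ single-\uop{} instructions, and let $C(\kerK)$ be the
measured number of cycles per iteration of $\kerK$. The overall dispatch width
$D$ would then be estimated as
\[
    D \approx \max_{\kerK \in \mathcal{K}} \frac{n_{\kerK}}{C(\kerK)}.
\]
Since a kernel can only under-use the dispatch stage, each ratio is a lower
bound on $D$, which is why the maximum is kept. For instance, with hypothetical
figures, if three such kernels reach $2.4$, $2.9$ and $3.0$ instructions per
cycle, the estimate is $D \approx 3$.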
\medskip{}
In this chapter, we obtained the number of dispatch queues and their
respective throughput by reading the official documentation. Automating this
part remains to be addressed to obtain a fully automatic model. It should be
possible to make these parameters apparent by identifying ``simple''
instructions that conflict with each other beyond the overall dispatch limit,
and by combining them.
\medskip{}
The core of the model presented in this chapter is the discovery, for each
instruction, of its \uop{} count. Still assuming the knowledge of a backend
model, the method described in \autoref{ssec:a72_insn_muop_count} should be
generic enough to be used on any processor. The basic instructions may be
easily selected using the backend model; we assume that they exist in most
microarchitectures, as pragmatic concerns guide the design of ports. Counting
the \uops{} of an instruction thus follows, using only elapsed-cycle counters.
This method assumes that $\cycF{\kerK}$ bottlenecks on the global dispatch queue
for $\kerK$, and not on a specific dispatch queue. This must be ensured by
selecting kernels carefully: for instance, on the A72, care must be taken
to interleave instructions corresponding to diverse enough dispatch pipelines.
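To illustrate the arithmetic behind such a count, here is a sketch under the
assumption just stated; its notations and figures are ours and do not restate
\autoref{ssec:a72_insn_muop_count}. Let $D$ be the overall number of \uops{}
dispatched per cycle, and let $\kerK$ be made of one occurrence of the
instruction $i$ under study and $n$ basic, single-\uop{} instructions. If
$\kerK$ bottlenecks on the global dispatch queue, then in steady state and
ignoring any rounding effect,
\[
    D \cdot \cycF{\kerK} = \mu_i + n
    \quad\Longrightarrow\quad
    \mu_i = D \cdot \cycF{\kerK} - n.
\]
For example, with the hypothetical values $D = 3$ and $n = 4$, a measurement of
$2$ cycles per iteration would yield $\mu_i = 3 \times 2 - 4 = 2$.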
\medskip{}
Finally, the break-down of each instruction's \uops{} into their respective
dispatch queues should follow from the backend model, as each dispatch queue is
tied to a subset of backend ports.
\bigskip{} \paragraph{The question of complex decoders.} While the ARM ISA is
composed of instructions of fixed length, making decoding easier, such is not
always the case.
The x86 ISA, for one, uses instructions that vary in length from one to fifteen
bytes~\cite{ref:intel64_software_dev_reference_vol1}. Longer instructions
may significantly slow down the frontend, especially when such instructions
cross an instruction cache line boundary~\cite{uica}.
Processors implementing ISAs subject to decoding bottlenecks typically also
feature a decoded \uop{} cache, or \emph{decoded stream buffer} (DSB). The
typical hit rate of this cache is about 80\%~\cites[Section
B.5.7.2]{ref:intel64_software_dev_reference_vol1}{dead_uops}. However, code
analyzers are concerned with loops and, more generally, hot code portions.
Under such conditions, we expect this cache, once warm and in steady state, to
be very close to a 100\% hit rate. In this case, only the dispatch throughput
is limiting, and modeling the decoding bottlenecks becomes irrelevant.
\bigskip{} \paragraph{Points of vigilance and limitations.} This parametric
model aims to be a compromise between simplicity of automation and good
accuracy. Experimentation may prove that it lacks some important features to be
accurate. Depending on the architecture targeted, the following points should
be investigated if the model does not reach the expected accuracy.
\begin{itemize}
\item{} We introduced the DSB (\uop{} cache) just above. This model
considers that the DSB will never be the cause of a bottleneck and
that, instead, the number of \uops{} dispatched per cycle will always
be the first bottleneck. This might not be true, as DSBs are already
complex in themselves~\cite{uica}.
\item{} Intel CPUs use a Loop Stream Detector (LSD) to keep in the decode
queue a whole loop's body of \uops{} if the frontend detects that a
small enough loop is repeated~\cite{uica, dead_uops}. In this case,
\uops{} are repeatedly streamed from the decode queue, without even the
necessity to hit a cache. We are unaware of similar features in other
commercial processors. In embedded programming, however, \emph{hardware
loops}, which are set up explicitly by the programmer, achieve,
among other things, the same
goal~\cite{hardware_loops_patent,kavvadias2007hardware,talla2001hwloops}.
\item{} The \emph{branch predictor} of a CPU is responsible for guessing,
before the actual logic is computed, whether a conditional jump will be
taken. A misprediction forces the frontend to re-populate its queues
with instructions from the branch actually taken and typically stalls
the pipeline for several cycles~\cite{branch_pred_penalty}. Our model,
however, does not include a branch predictor for much the same reason
that it does not include complex decoders: in steady state, in a hot
code portion, we expect the branch predictor to always predict
correctly.
\item{} In reality, there is an intermediary step between instructions and
\uops{}: macro-ops. Although this step serves design and semantic
purposes, we omit it in the current model as, we
believe, it is of little importance for predicting performance.
\item{} On x86 architectures at least, common pairs of micro- or
macro-operations may be ``fused'' into a single one through various
stages of the pipeline, to save space in some queues or to effectively
raise dispatch limits. This mechanism is implemented in Intel
architectures, and to some extent in AMD architectures since
Zen~\cites[§3.4.2]{ref:intel64_architectures_optim_reference_vol1}{uica}{Vishnekov_2021}.
This may make some kernels seem to ``bypass'' dispatch limits.
\end{itemize}