\section{A parametric model for future work on automatic frontend model
generation}
%\section{Future works: benchmarks-based automatic frontend model generation}
While this chapter was solely centered on the Cortex A72, we believe that this
study paves the way for an automated frontend model synthesis akin to
\palmed{}. This synthesis should be fully automated, stem solely from
benchmarking data and a description of the ISA, and avoid relying on any
specific hardware counter.

As a scaffold for such future work, we propose the parametric model in
\autoref{fig:parametric_model}. Some of its parameters should be obtainable
with the methods used in this chapter, while new methods must be devised for
the others.

Such a model would probably be unable to account for ``unusual'' frontend
bottlenecks, at least not at the level of detail that \eg{} the \uica{} authors
gather for Intel frontends~\cite{uica}. This level of detail, however, is
possible precisely because the authors restricted their scope to
microarchitectures that share many similarities, as they come from the same
manufacturer. Assessing how much precision an automatically generated model
loses in comparison, and how much it gains \wrt{} a model without frontend,
remains to be done.

\medskip{}
Our model introduces a limited number of parameters, depicted in red italics in
\autoref{fig:parametric_model}. It is composed of two parts: a model of the
frontend itself, describing architectural parameters; and insights about
each instruction. Its parameters are:
\begin{itemize}
    \item{} the overall number of \uops{} that can be dispatched per cycle;
    \item{} the number of distinct dispatch queues of the processor (\eg{}
        memory operations, integer operations, \ldots);
    \item{} for each of those queues, the number of \uops{} it can dispatch per
        cycle;
    \needspace{4\baselineskip}
    \item{} for each instruction $i$,
        \begin{itemize}
            \item{} its total number of \uops{} $\mu_i$;
            \item{} the number of \uops{} that get dispatched to each
                individual queue (summing up to $\mu_i$).
        \end{itemize}
\end{itemize}
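
As a sketch only, these parameters can be written down with notation of our
own, which is an assumption of this paragraph: $D$ is the overall number of
\uops{} dispatched per cycle, $Q$ the set of dispatch queues, $d_q$ the number
of \uops{} that queue $q \in Q$ can dispatch per cycle, and $\mu_{i,q}$ the
number of \uops{} of instruction $i$ dispatched to queue $q$. Reading
$\cycF{\kerK}$ as the number of cycles the frontend needs per iteration of a
kernel $\kerK$, seen as a multiset of instructions, and assuming that the
frontend is limited only by these dispatch widths in steady state, we would
expect
\[
    \sum_{q \in Q} \mu_{i,q} = \mu_i
    \qquad\mathrm{and}\qquad
    \cycF{\kerK} \ge \max\left(
        \frac{1}{D} \sum_{i \in \kerK} \mu_i,\;
        \max_{q \in Q} \frac{1}{d_q} \sum_{i \in \kerK} \mu_{i,q}
    \right).
\]
For instance, with the purely illustrative value $D = 3$ and a kernel of four
single-\uop{} instructions spread over queues that are not saturated, this
bound is $4/3$ cycles per iteration. Whether the bound is tight, or whether it
should for instance be rounded up to a whole number of cycles, depends on the
microarchitecture and would have to be checked experimentally.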
\begin{figure}
    \begin{subfigure}{\textwidth}
        \centering
        \includegraphics[width=0.7\textwidth]{parametric_model-frontend.svg}
        \caption{Frontend model}\label{fig:parametric_model:front}
    \end{subfigure}
    \vspace{2em}

    \begin{subfigure}{\textwidth}
        \centering
        \includegraphics[width=0.75\textwidth]{parametric_model-insn.svg}
        \caption{Instruction model}\label{fig:parametric_model:insn}
    \end{subfigure}
    \caption{A generic parametric model of a processor's frontend. In red
        italics, the parameters which must be discovered for each
    architecture.}\label{fig:parametric_model}
\end{figure}

\bigskip{}
The first step in modeling a processor's frontend should certainly be to
characterize the number of \uops{} that can be dispatched in a cycle. We assume
that a model of the backend is known, obtained for instance from \palmed{},
from the \uopsinfo{} tables, or by any other means. To the best of our
knowledge, we can further safely assume that instructions that load a single
backend port only once are also composed of a single \uop{}. Generating a few
diverse combinations of such instructions, measuring their effective
throughput while using the backend model to make sure that the backend is not
the bottleneck, and keeping the maximal throughput reached should provide a
good estimate of this value.
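
This estimation can be sketched as follows, with notation that is ours and
only illustrative. Let $\mathcal{B}$ be the set of benchmarked kernels made
only of single-\uop{} instructions for which the backend model predicts no
backend bottleneck, let $|\kerK|$ be the number of instructions in a kernel
$\kerK$, and let $C(\kerK)$ be the measured number of cycles per iteration of
$\kerK$. The overall dispatch width could then be estimated as
\[
    \hat{D} = \max_{\kerK \in \mathcal{B}} \frac{|\kerK|}{C(\kerK)},
\]
possibly rounded to the nearest integer to absorb measurement noise. This is
only a lower bound on the real value: it is reached only if at least one
kernel of $\mathcal{B}$ is limited by the overall dispatch width and by
nothing else.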

\medskip{}

In this chapter, we obtained the number of dispatch queues and their
respective throughput by reading the official documentation. This step must
still be automated before the whole model can be generated automatically. It
should be possible to make these parameters apparent by identifying ``simple''
instructions that conflict beyond the overall dispatch limit, and by combining
them.
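
One possible way to make such a conflict apparent, given here as a sketch
under our own assumptions, is to compare a measurement with the overall
dispatch limit. Let $D$ be the overall number of \uops{} dispatched per cycle,
and let $\kerK$ be a kernel made of $n$ ``simple'', single-\uop{} instructions
for which the backend model predicts no backend bottleneck, measured at
$C(\kerK)$ cycles per iteration (the names $D$, $n$ and $C$ are introduced for
this example). If
\[
    C(\kerK) > \frac{n}{D},
\]
then something other than the overall dispatch limit slows the kernel down. A
natural hypothesis is that these instructions share a dispatch queue $q$,
whose throughput would then be estimated as
\[
    \hat{d}_q = \frac{n}{C(\kerK)}.
\]
Such a candidate value should be cross-checked on several kernels, as other
frontend effects could produce the same symptom.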

\medskip{}

The core of the model presented in this chapter is the discovery, for each
instruction, of its \uop{} count. Still assuming that a backend model is
known, the method described in \autoref{ssec:a72_insn_muop_count} should be
generic enough to be used on any processor. The basic instructions may be
easily selected using the backend model --~we assume their existence in most
microarchitectures, as pragmatic concerns guide the design of ports. Counting
the \uops{} of an instruction thus follows, using only measurements of elapsed
cycles.

This method assumes that, for a kernel $\kerK$, $\cycF{\kerK}$ is bound by the
global dispatch limit and not by a specific dispatch queue. This must be
ensured by choosing the kernels carefully: for instance, on the A72, care must
be taken to interleave instructions corresponding to diverse enough dispatch
pipelines.
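
To illustrate why elapsed cycles are sufficient, consider the following
simplified reasoning, which is a sketch under our own assumptions and does not
reproduce the exact procedure of \autoref{ssec:a72_insn_muop_count}. Let $D$ be
the overall number of \uops{} dispatched per cycle, and let $\kerK$ be made of
one occurrence of the instruction $i$ under study and of $n$ basic,
single-\uop{} instructions ($D$ and $n$ are notation introduced here). If
$\kerK$ is bound by the global dispatch limit only, and if this limit is
reached exactly, then
\[
    \cycF{\kerK} = \frac{\mu_i + n}{D}
    \qquad\mathrm{hence}\qquad
    \mu_i = D \cdot \cycF{\kerK} - n.
\]
With the purely illustrative values $D = 3$ and $n = 4$, a measurement of
$\cycF{\kerK} = 2$ cycles would give $\mu_i = 3 \times 2 - 4 = 2$.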

\medskip{}

Finally, the breakdown of each instruction's \uops{} into their respective
dispatch queues should follow from the backend model, as each dispatch queue is
tied to a subset of backend ports.
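
In the simplest case, this breakdown can be sketched as follows; the notation
and the assumptions are ours. Suppose that the backend model gives, for each
instruction $i$ and each backend port $p$, the number $u_{i,p}$ of \uops{} of
$i$ executed on $p$, and that each dispatch queue $q$ feeds a set of ports
$P_q$, these sets being pairwise disjoint. The number $\mu_{i,q}$ of \uops{} of
$i$ dispatched to $q$ would then be
\[
    \mu_{i,q} = \sum_{p \in P_q} u_{i,p},
    \qquad\mathrm{with}\qquad
    \sum_{q} \mu_{i,q} = \mu_i.
\]
Backend models that only expose abstract resources or fractional port usage,
rather than such a per-port count, would require an additional mapping step
that we leave open.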

\bigskip{} \paragraph{The question of complex decoders.} While the ARM ISA is
composed of instructions of fixed length, making decoding easier, such is not
always the case.
The x86 ISA, for one, uses instructions that vary in length from one to fifteen
bytes~\cite{ref:intel64_software_dev_reference_vol1}. Longer instructions
may significantly slow down the frontend, especially when such instructions
cross an instruction cache line boundary~\cite{uica}.

Processors implementing ISAs subject to decoding bottlenecks typically also
feature a decoded \uop{} cache, or \emph{decoded stream buffer} (DSB). The
typical hit rate of this cache is about 80\%~\cites[Section
B.5.7.2]{ref:intel64_software_dev_reference_vol1}{dead_uops}. However, code
analyzers are concerned with loops and, more generally, hot code portions.
Under such conditions, we expect this cache, once warm and in steady state, to
be very close to a 100\% hit rate. In this case, only the dispatch throughput
is limiting, and modeling the decoding bottlenecks becomes irrelevant.

\bigskip{} \paragraph{Points of vigilance and limitations.} This parametric
model aims to be a compromise between ease of automation and good accuracy.
Experimentation may show that it lacks some features that are important for
accuracy. Depending on the targeted architecture, the following points should
be investigated if the model does not reach the expected accuracy.

\begin{itemize}
    \item{} We introduced the DSB (\uop{} cache) just above. This model
        considers that the DSB will never be the cause of a bottleneck and
        that, instead, the number of \uops{} dispatched per cycle will always
        be the limiting factor first. This might not be true, as DSBs are
        themselves complex~\cite{uica}.

    \item{} Intel CPUs use a Loop Stream Detector (LSD) to keep a whole loop
        body of \uops{} in the decode queue if the frontend detects that a
        small enough loop is repeated~\cite{uica, dead_uops}. In this case,
        \uops{} are repeatedly streamed from the decode queue, without even
        needing to hit a cache. We are unaware of similar features in other
        commercial processors. In embedded programming, however, \emph{hardware
        loops} --~which are set up explicitly by the programmer~-- achieve,
        among others, the same
        goal~\cite{hardware_loops_patent,kavvadias2007hardware,talla2001hwloops}.

    \item{} The \emph{branch predictor} of a CPU is responsible for guessing,
        before the condition is actually computed, whether a conditional jump
        will be taken. A misprediction forces the frontend to re-populate its
        queues with instructions from the branch actually taken and typically
        stalls the pipeline for several cycles~\cite{branch_pred_penalty}. Our
        model, however, does not include a branch predictor for much the same
        reason that it does not include complex decoders: in steady state, in
        a hot code portion, we expect the branch predictor to always predict
        correctly.

    \item{} In reality, there is an intermediary step between instructions and
        \uops{}: macro-ops. Although it serves a design and semantic purpose,
        we omit this step in the current model as --~we believe~-- it is of
        little importance for predicting performance.

    \item{} On x86 architectures at least, common pairs of micro- or
        macro-operations may be ``fused'' into a single one, up to various
        parts of the pipeline, to save space in some queues or artificially
        raise dispatch limits. This mechanism is implemented in Intel
        architectures, and to some extent in AMD architectures since
        Zen~\cites[§3.4.2]{ref:intel64_architectures_optim_reference_vol1}{uica}{Vishnekov_2021}.
        This may make some kernels seem to ``bypass'' dispatch limits.

\end{itemize}