\section{Necessity to go beyond ports}
The resource models produced by \palmed{} are mainly concerned with the backend
of the CPUs modeled. However, the importance of the frontend in the accuracy of
a model's prediction cannot be ignored. Its effect can be clearly seen in the
evaluation heatmaps of various code analyzers in \autoref{fig:palmed_heatmaps}.
Each heatmap has a clear-cut limit on the horizontal axis: independently of the
benchmark's content, it is impossible to reach more than a given number of
instructions per cycle for a given processor ---~4 instructions for the
\texttt{SKL-SP}, 5 for the \texttt{ZEN1}. This limit is imposed by the
frontend.
Some analyzers, such as \palmed{} and \iaca{}, model this limit: their
heatmaps show that the predicted IPC never surpasses it. The other three
analyzers studied, however, do not model this limit; for instance, \uopsinfo{}
has a high density of benchmarks predicted at 8 instructions per cycle on
SPEC2017 on the \texttt{SKL-SP} CPU, while the native measurement yielded only
4 instructions per cycle. The same effect is visible on \pmevo{} and \llvmmca{}
heatmaps.
\begin{example}{High backend throughput on \texttt{SKL-SP}}
On the \texttt{SKL-SP} microarchitecture, assuming an infinitely large
frontend, it is easy to exceed 4 instructions per cycle.
According to \uopsinfo{} data, a 64-bit
integer \lstxasm{addq} is processed as a single \uop{}, dispatched on
port 0, 1, 5 or 6. Meanwhile, a simple-form 64-bit register store
to an address held in a register ---~\eg{} a \lstxasm{movq \%rax,
(\%rbx)}~--- is also processed as a single \uop{}, dispatched on port 2
or 3.
Thus, backend-wise, the kernel $4\times \texttt{addq} + 2\times
\texttt{mov}$ has a throughput of 6 instructions per cycle. However, in
reality, this kernel would be frontend-bound, with a theoretical maximum throughput of 4
instructions per cycle ---~in fact, a \pipedream{} measurement yields only 3
instructions per cycle.
\end{example}
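The backend throughput computation of the example above can be sketched with a
toy linear port model. The port assignments follow the example; the perfect
load balancing across ports and the frontend width of 4 \uops{} per cycle are
simplifying assumptions made here for illustration:

```python
# Toy port-model throughput computation for the kernel
# 4x addq + 2x movq %rax,(%rbx) on SKL-SP (illustrative assumptions).

# Each entry: (set of ports the uop may dispatch to, number of such uops).
kernel = [
    ({"0", "1", "5", "6"}, 4),  # addq: 1 uop, port 0, 1, 5 or 6
    ({"2", "3"}, 2),            # store: 1 uop, port 2 or 3 (simplified)
]

n_insns = sum(count for _, count in kernel)

# The two port groups are disjoint, so each group balances independently:
# the backend needs max(count / group size) cycles per kernel iteration.
backend_cycles = max(count / len(ports) for ports, count in kernel)
backend_ipc = n_insns / backend_cycles
print(backend_ipc)  # 6.0: the backend alone allows 6 instructions per cycle

# Adding a frontend resource of width 4 caps the throughput.
frontend_cycles = n_insns / 4
ipc = n_insns / max(backend_cycles, frontend_cycles)
print(ipc)  # 4.0: the kernel is frontend-bound
```

Note that in this purely additive model, the order of the instructions inside
the kernel is irrelevant: only the multiset of \uops{} matters.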
\bigskip{}
To account for this, \palmed{} tries to detect an additional resource, apart
from the backend ports and combined ports, on which every \uop{} incurs a load.
This allows \palmed{} to avoid large errors on frontend-bound kernels.
The approach is, however, far from perfect. The clearest reason is
that the frontend, both on x86-64 and ARM architectures, works in-order, while
\palmed{} inherently models kernels as multisets of instructions, thus
completely ignoring ordering. This resource model is purely linear: an
instruction incurs a load on the frontend resource in a fully commutative way,
independently of the instructions previously executed in the same cycle and of
many other effects.
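Written out, such a linear model predicts the recurrent cycle count of a
kernel as a maximum of commutative sums ---~the notation below is illustrative,
not \palmed{}'s exact formulation:
\begin{equation*}
    C(K) \;=\; \max_{r \in \mathcal{R}} \; \sum_{i \in K} \rho_{i,r}
\end{equation*}
where $\mathcal{R}$ is the set of resources (ports, combined ports and the
frontend resource), $K$ the multiset of instructions of the kernel, and
$\rho_{i,r}$ the load of instruction $i$ on resource $r$. Every reordering of
$K$ yields the same $C(K)$; in-order frontend effects are thus structurally
out of reach for this model.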
The article introducing \uica{}~\cite{uica} explores this question in detail
for x86-64 Intel architectures. The authors, who previously developed
\uopsinfo{}, discuss the importance of correctly modeling the frontend to
accurately predict throughput. Their approach, based on the exploration and
reverse-engineering of the crucial parts of the frontend, showcases many
important and non-trivial aspects of frontends that are usually neglected, such
as the switching between the decoders and the \uop{} cache as the source of
instructions ---~which cannot be modeled linearly.