\section{Necessity to go beyond ports}

The resource models produced by \palmed{} are mainly concerned with the backend
of the CPUs modeled. However, the importance of the frontend in the accuracy of
a model's predictions cannot be ignored. Its effect can be clearly seen in the
evaluation heatmaps of various code analyzers in \autoref{fig:palmed_heatmaps}.
Each heatmap has a clear-cut limit on the horizontal axis: independently of the
benchmark's content, it is impossible to exceed a given number of instructions
per cycle on a given processor ---~4 instructions for the \texttt{SKL-SP}, 5
for the \texttt{ZEN1}. This limit is imposed by the frontend.
Some analyzers, such as \palmed{} and \iaca{}, model this limit: their heatmaps
show that the predicted IPC never surpasses it. The other three analyzers
studied, however, do not model this limit; for instance, \uopsinfo{} has a high
density of benchmarks predicted at 8 instructions per cycle on SPEC2017 on the
\texttt{SKL-SP} CPU, while the native measurement yielded only 4 instructions
per cycle. The same effect is visible on the \pmevo{} and \llvmmca{} heatmaps.
\begin{example}{High back-end throughput on \texttt{SKL-SP}}
On the \texttt{SKL-SP} microarchitecture, assuming an infinitely large
frontend, a throughput higher than 4 instructions per cycle is easy to
reach.
According to \uopsinfo{} data, a 64-bit integer \lstxasm{addq} is processed as
a single \uop{}, dispatched on port 0, 1, 5 or 6. Meanwhile, a simple 64-bit
register store to an address directly held in a register ---~\eg{} a
\lstxasm{movq \%rax, (\%rbx)}~--- is also processed as a single \uop{},
dispatched on port 2 or 3.
Thus, backend-wise, the kernel $4\times \texttt{addq} + 2\times \texttt{movq}$
has a throughput of 6 instructions per cycle. In reality, however, this kernel
is frontend-bound, with a theoretical maximum throughput of 4 instructions per
cycle ---~in fact, a \pipedream{} measurement yields only 3 instructions per
cycle.
\end{example}
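The port-level bound used in this example can be reproduced with a small
computation. The sketch below is a toy model based on the port assignments
quoted above, not \palmed{}'s actual solver: for any set $S$ of ports, the
\uop{}s that can only execute on ports of $S$ need at least $1/|S|$ cycle
each, which yields a lower bound on the cycles per kernel iteration.

```python
from itertools import combinations

# Toy port model for the example kernel, using the uops.info port
# assignments quoted above: addq -> one uop on ports {0,1,5,6},
# movq to memory -> one uop on ports {2,3}. Illustrative only.
kernel = [
    ({0, 1, 5, 6}, 4),  # 4x addq, one uop each
    ({2, 3}, 2),        # 2x movq store, one uop each
]

def backend_cycles_per_iteration(kernel):
    """Lower bound on cycles per iteration from ports alone: for every
    set S of ports, the uops that can only run on ports of S need at
    least (number of such uops) / |S| cycles."""
    ports = set().union(*(p for p, _ in kernel))
    bound = 0.0
    for r in range(1, len(ports) + 1):
        for subset in combinations(ports, r):
            s = set(subset)
            uops = sum(n for p, n in kernel if p <= s)  # p subset of s
            bound = max(bound, uops / len(s))
    return bound

cycles = backend_cycles_per_iteration(kernel)
ipc = sum(n for _, n in kernel) / cycles
print(cycles)  # 1.0 cycle per iteration
print(ipc)     # 6.0 instructions per cycle, backend-wise
```

Every port is saturated with exactly one \uop{} per cycle, hence the 6
instructions per cycle claimed in the example.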
\bigskip{}
To account for this, \palmed{} tries to detect an additional resource, apart
from the backend ports and combined ports, on which every \uop{} incurs a load.
This allows \palmed{} to avoid large errors on frontend-bound kernels.
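The effect of such an additional resource can be sketched on the kernel of the
example above. The capacities below are illustrative assumptions ---~notably a
frontend absorbing 4 \uop{}s per cycle, matching the limit observed on the
\texttt{SKL-SP} heatmaps~--- and not values computed by \palmed{}:

```python
# Toy linear resource model: each resource has a capacity (uops it can
# absorb per cycle); the predicted cycles per iteration is the load of
# the most saturated resource. Numbers are illustrative assumptions.
capacities = {
    "ports_0156": 4.0,  # four ALU ports absorb up to 4 uops/cycle
    "ports_23":   2.0,  # two store ports
    "frontend":   4.0,  # additional resource loaded by *every* uop
}

# Kernel of the example: 4x addq (ports 0/1/5/6) + 2x store (ports 2/3)
loads = {
    "ports_0156": 4,
    "ports_23":   2,
    "frontend":   6,    # all 6 uops load the frontend resource
}

cycles = max(loads[r] / capacities[r] for r in capacities)
ipc = (4 + 2) / cycles
print(cycles)  # 1.5 cycles per iteration, now set by the frontend
print(ipc)     # 4.0 instructions per cycle instead of 6.0
```

With the frontend resource, the prediction drops from 6 to 4 instructions per
cycle, in line with the hard limit visible on the heatmaps.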
The approach is, however, far from perfect. The clearest reason is that the
frontend, on both x86-64 and ARM architectures, works in order, while
\palmed{} inherently models kernels as multisets of instructions, thus
completely ignoring ordering. This resource model is purely linear: an
instruction incurs a load on the frontend resource in a fully commutative way,
independently of the instructions previously executed in the same cycle and of
many other effects.
The article introducing \uica{}~\cite{uica} explores this question in detail
for x86-64 Intel architectures. The authors, who previously developed
\uopsinfo{}, discuss the importance of correctly modeling the frontend to
accurately predict throughput. Their approach, based on the exploration and
reverse-engineering of the crucial parts of the frontend, showcases many
important and non-trivial aspects of frontends that are usually neglected,
such as the switching between the decoders and the \uop{}-cache as the source
of instructions ---~which cannot be modeled linearly.
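As a toy illustration of why a commutative model cannot capture such
behavior, consider a simplified, Intel-inspired decoder in which a single
decoder handles ``complex'' instructions, so that a complex instruction can
only start a new decode group. This rule is an assumption for the sake of the
sketch, loosely inspired by the decoder structure discussed in the \uica{}
article; under it, two orderings of the same multiset of instructions decode
in different numbers of cycles, while any multiset-based model necessarily
predicts the same throughput for both.

```python
def decode_cycles(instrs, simple_decoders=3):
    """Toy in-order decode model (illustrative assumption: one decoder
    handles 'complex' instructions and must start a decode group; the
    other decoders only accept 'simple' instructions).
    `instrs` is a sequence of 'S' (simple) / 'C' (complex)."""
    cycles = 0
    i = 0
    while i < len(instrs):
        cycles += 1
        i += 1  # first slot of the group accepts simple or complex
        taken = 0
        # remaining slots only accept simple instructions
        while i < len(instrs) and taken < simple_decoders and instrs[i] == "S":
            i += 1
            taken += 1
    return cycles

# Same multiset of instructions, two orderings:
print(decode_cycles("CSSSCSSS"))  # 2 cycles: each C starts a full group
print(decode_cycles("CCSSSSSS"))  # 3 cycles: the second C cuts a group short
```

A purely linear model assigns both sequences the same load, and therefore the
same predicted throughput, whereas this in-order model does not.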