\section{Palmed design}

Palmed is a tool that constructs a resource mapping for a CPU in a fully
automated way, based on the execution of well-chosen benchmarks. Since its goal
is to construct a resource mapping, its only concern is backend throughput; in
particular, dependencies are entirely ignored. In-order effects are not
modelled either: Palmed defines a kernel as a multiset of instructions,
discarding instruction ordering altogether.

The general idea behind Palmed is that, as we saw above, the execution time of
a kernel is described by a resource model through
\autoref{eqn:res_model_rthroughput}.
We can, however, reverse the problem: if we measure $\cyc{\kerK}$, the only
unknown parameters in \autoref{eqn:res_model_rthroughput} are the
$\rho_{i,r}$, that is, the weights of the edges in the resource model for the
CPU under scrutiny. Given enough well-chosen pairs
$\left(\kerK, \cyc{\kerK}\right)$, it should then be possible to solve the
system for the $\rho_{i,r}$ coefficients, thus building a resource model.
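
As a toy illustration of this inversion, assume that
\autoref{eqn:res_model_rthroughput} takes the form
$\cyc{\kerK} = \max_{r} \sum_{i \in \kerK} \rho_{i,r}$, and consider a
hypothetical CPU with a single resource~$r$ and two instructions $a$ and~$b$.
Both the shape of the equation and the measured values below are assumptions
made for the sake of the example. With a single resource, the maximum
disappears and two measurements suffice:
\[
    \cyc{a} = \rho_{a,r} = \frac{1}{2},
    \qquad
    \cyc{a+b} = \rho_{a,r} + \rho_{b,r} = \frac{3}{2},
\]
which yields $\rho_{b,r} = \frac{3}{2} - \frac{1}{2} = 1$. With several
resources, each measurement gives an equality only for the resource that is
the bottleneck of the measured kernel, and an inequality for the others.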

This section does not describe Palmed in full detail, but rather outlines the
general approach; the complete methodology can be found in the full
article~\cite{palmed}. Palmed's main steps and components are sketched in
\autoref{fig:palmed_big_picture}.

\begin{figure}
    \centering
    \includegraphics[width=\textwidth]{big_picture.svg}
    \caption{High-level view of Palmed's architecture}\label{fig:palmed_big_picture}
\end{figure}

Palmed starts from the list $\calI$ of instructions available in the ISA, as
well as a description of their legal parameters. This list can be obtained
using a decompiler.

The first block, Basic Instructions Selection, benchmarks every pair of
instructions, a step we call \emph{quadratic benchmarks}. These quadratic
benchmarks are used to group instructions into \emph{classes} that behave
identically from the backend's point of view.
Formally, the classes are the equivalence classes of the relation $\sim$:
\[
    a \sim b \iff \forall i \in \calI,\ \cyc{a+i} = \cyc{b+i}
\]
To accommodate measurement imprecision and fluctuations, this strict
equality is relaxed in practice: the classes are obtained by hierarchical
clustering~\cite{hcluster_ward}, and the resulting tree is cut into classes so
as to maximize the silhouette~\cite{hcluster_silhouette}. This clustering into
classes is reused later in \autoref{chap:frontend}.
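
As an illustration of the strict relation, consider a hypothetical ISA
$\calI = \{a, b, c\}$ with the following made-up measurements (these values
are assumptions chosen for the example, not results obtained with Palmed):
\[
    \cyc{a+a} = \cyc{a+b} = \cyc{b+b} = 1,
    \qquad
    \cyc{a+c} = \cyc{b+c} = 1,
    \qquad
    \cyc{c+c} = 2.
\]
Since kernels are multisets, $\cyc{b+a} = \cyc{a+b}$, and we have
$\cyc{a+i} = \cyc{b+i}$ for each $i \in \{a, b, c\}$, hence $a \sim b$. On the
other hand, $a \not\sim c$: taking $i = c$ gives
$\cyc{a+c} = 1 \neq 2 = \cyc{c+c}$. The classes are thus $\{a, b\}$
and~$\{c\}$.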

The first block then finishes by applying heuristics to select \emph{basic}
instructions, that is, instructions that stress as few resources as possible,
with the highest possible throughput. These basic instructions can later be
combined with other instructions to detect whether the latter stress a given
resource.

The second block, Core Mapping, builds benchmarks from these basic
instructions to discover, for each resource~$r$, a kernel, as simple as
possible, that saturates it: adding any instruction that uses $r$ to this
kernel should increase its execution time. These saturating kernels are found
through successive Linear Programming (LP) passes, solved with the Gurobi
Optimizer~\cite{tool:gurobi}.
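
In symbols, the saturation property can be sketched as follows, where
$\kerK_r$ is a notation introduced here for a saturating kernel of~$r$:
\[
    \forall i \in \calI,\quad
    \rho_{i,r} > 0 \Longrightarrow \cyc{\kerK_r + i} > \cyc{\kerK_r}.
\]
If we further assume that \autoref{eqn:res_model_rthroughput} takes the form
$\cyc{\kerK} = \max_{r'} \sum_{i \in \kerK} \rho_{i,r'}$, and that $r$ is the
bottleneck of both $\kerK_r$ and $\kerK_r + i$, then the increase is exactly
$\cyc{\kerK_r + i} - \cyc{\kerK_r} = \rho_{i,r}$. This is only an idealized
reading of the property; the actual procedure is the one described in the full
article.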

These kernels are then used in the final block, Complete Mapping, to find the
$\rho_{i,r}$ coefficients of every instruction, which together constitute the
final model.