
\section{Palmed design}

Palmed is a tool that aims to construct a resource mapping for a CPU, in a
fully automated way, based on the execution of well-chosen benchmarks. As its
goal is to construct a resource mapping, its only concern is backend
throughput; in particular, dependencies are entirely ignored. In-order effects
are not modelled either: Palmed defines a kernel as a multiset of
instructions, discarding instruction ordering altogether.

The general idea behind Palmed is that, as we saw above, the execution time of
a kernel is described by a resource model through
\autoref{eqn:res_model_rthroughput}.
We can, however, reverse the problem: if we measure $\cyc{\kerK}$, the only
unknown parameters in \autoref{eqn:res_model_rthroughput} become the
$\rho_{i,r}$; that is, the weights of the edges in the resource model for the
CPU under scrutiny. Given enough well-chosen pairs $\left(\kerK,
\cyc{\kerK}\right)$, it should then be possible to solve the system for the
$\rho_{i,r}$ coefficients, thus building a resource model.
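
As a toy illustration, with made-up measurements and assuming that
\autoref{eqn:res_model_rthroughput} expresses $\cyc{\kerK}$ as the maximum,
over all resources, of the load summed over the instructions of $\kerK$,
consider two instructions $a$ and $b$ and two hypothetical resources $r_1$
and $r_2$. Three benchmarks then yield three constraints on the unknown
weights:
\[
    \begin{array}{rcl}
    \cyc{a} &=& \max\left(\rho_{a,r_1},\, \rho_{a,r_2}\right) \\
    \cyc{b} &=& \max\left(\rho_{b,r_1},\, \rho_{b,r_2}\right) \\
    \cyc{a+b} &=& \max\left(\rho_{a,r_1} + \rho_{b,r_1},\,
                            \rho_{a,r_2} + \rho_{b,r_2}\right)
    \end{array}
\]
Suppose we measure $\cyc{a} = \cyc{b} = 1$. If $\cyc{a+b} = 2$, some resource
must carry both instructions with weight~$1$. If instead $\cyc{a+b} = 1$, the
only solution with non-negative weights is, up to renaming the resources,
$\rho_{a,r_1} = \rho_{b,r_2} = 1$ and $\rho_{a,r_2} = \rho_{b,r_1} = 0$: the
two instructions use disjoint resources.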

This section does not describe Palmed in full detail, but rather outlines the
general approach; the complete methodology can be found in the full
article~\cite{palmed}. Its main steps and components are sketched in
\autoref{fig:palmed_big_picture}.

\begin{figure}
    \centering
    \includegraphics[width=\textwidth]{big_picture.svg}
    \caption{High-level view of Palmed's
    architecture}\label{fig:palmed_big_picture}
\end{figure}

Palmed starts off with the list $\calI$ of instructions available in the ISA,
as well as a description of their legal parameters. This list can be obtained
using a decompiler.

The first block, Basic Instructions Selection, benchmarks every pair of
instructions, a step we call \emph{quadratic benchmarks}. These quadratic
benchmarks are used to group instructions into \emph{classes} of instructions
that behave identically from the backend's point of view. Formally, the
classes are built as the equivalence classes of the relation
$\sim$:
\[
    a \sim b \iff \forall i \in \calI, \cyc{a+i} = \cyc{b+i}
\]
To accommodate measurement imprecision and fluctuations, this strict equality
is relaxed in practice: the classes are obtained by hierarchical
clustering~\cite{hcluster_ward}, and the resulting tree is split into classes
so as to maximize the silhouette~\cite{hcluster_silhouette}. This clustering
into classes is reused later in \autoref{chap:frontend}.
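
One way to picture this relaxation, given here as an illustrative
reformulation rather than as Palmed's exact procedure, is to attach to each
instruction $a$ the vector of its quadratic benchmarks,
\[
    v_a = \left(\cyc{a+i}\right)_{i \in \calI}
\]
so that $a \sim b$ holds exactly when $v_a = v_b$. Clustering then amounts to
grouping instructions whose vectors are close, rather than equal, for a chosen
distance; the distance itself is left unspecified in this sketch.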

The first block then finishes by applying heuristics to select \emph{basic}
instructions, that is, instructions that stress as few resources as possible
while having the highest possible throughput. These basic instructions can
later be combined with other instructions to detect whether the latter stress
a given resource.

The second block, Core Mapping, builds benchmarks against these basic
instructions to discover, for each resource~$r$, a kernel ---~that should be as
simple as possible~--- that saturates it: adding any instruction that uses $r$
to this kernel should increase its execution time. These saturating kernels are
discovered with successive Linear Programming (LP) passes, using the Gurobi
Optimizer~\cite{tool:gurobi}.
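
In symbols, and writing $\kerK_r$ for such a kernel (a notation introduced
here only for illustration), the saturation property reads
\[
    \forall i \in \calI,\quad
    \rho_{i,r} > 0 \Longrightarrow \cyc{\kerK_r + i} > \cyc{\kerK_r}
\]
where $\rho_{i,r} > 0$ is taken to mean that $i$ uses~$r$.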

These kernels are then used in a final block, Complete Mapping, to find the
$\rho_{i,r}$ coefficients for every instruction, constituting the final model.
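
To sketch why saturating kernels help here, assume that
\autoref{eqn:res_model_rthroughput} takes the form of a maximum over resources
of summed loads, and let $\kerK_r$ denote a kernel saturating $r$, in the
sense that its execution time equals the load it places on $r$. For any
instruction $i$, the load placed on $r$ by $\kerK_r + i$ is then
$\cyc{\kerK_r} + \rho_{i,r}$, hence
\[
    \rho_{i,r} \le \cyc{\kerK_r + i} - \cyc{\kerK_r}
\]
with equality whenever $r$ remains the bottleneck of $\kerK_r + i$. This is
only an intuition under the stated assumptions; the actual procedure is
described in the full article~\cite{palmed}.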