\section{Palmed design}
Palmed is a tool that aims to construct a resource mapping for a CPU in a
fully automated way, based on the execution of well-chosen benchmarks. As its
goal is to construct a resource mapping, its only concern is backend
throughput ---~in particular, dependencies are entirely ignored. In-order
effects are not modelled either; in fact, Palmed defines a kernel as a
multiset of instructions, discarding instruction ordering altogether.
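
For instance, with this definition, the kernels $(a, b, a)$ and $(b, a, a)$
are indistinguishable: both reduce to the multiset $\left\{a, a, b\right\}$.
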
The general idea behind Palmed is that, as we saw above, the execution time of
a kernel is described by a resource model through
\autoref{eqn:res_model_rthroughput}.
We can, however, reverse the problem: if we measure $\cyc{\kerK}$, the only
unknown parameters in \autoref{eqn:res_model_rthroughput} become the
$\rho_{i,r}$; that is, the weights of the edges in the resource model of the
CPU under scrutiny. Given enough well-chosen pairs $\left(\kerK,
\cyc{\kerK}\right)$, it should then be possible to solve the system for the
$\rho_{i,r}$ coefficients, thus building a resource model.
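
As a toy illustration, on a hypothetical machine not drawn from the article:
consider two instructions $a$ and $b$ with measured $\cyc{a} = \cyc{b} = 1$.
If $a$ and $b$ are mapped to two disjoint resources, the model predicts
$\cyc{ab} = \max(1, 1) = 1$; if instead both fully use a single common
resource $r$, with $\rho_{a,r} = \rho_{b,r} = 1$, it predicts
$\cyc{ab} = 1 + 1 = 2$. A single measurement of $\cyc{ab}$ thus discriminates
between these two candidate mappings; at scale, many such measurements
constrain the $\rho_{i,r}$ enough to reconstruct a full model.
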
This section does not describe Palmed in full detail, but rather gives a
coarse overview of the general approach; the full methodology can be found in
the original article~\cite{palmed}. Its main steps and components are sketched in
\autoref{fig:palmed_big_picture}.
\begin{figure}
\centering
\includegraphics[width=\textwidth]{big_picture.svg}
\caption{High-level view of Palmed's
architecture}\label{fig:palmed_big_picture}
\end{figure}
Palmed starts off with a list of instructions available in the ISA, $\calI$, as
well as a description of their legal parameters. This list can be obtained
using a decompiler.
The first block, Basic Instructions Selection, benchmarks every pair of
instructions ---~a step we call \emph{quadratic benchmarks}. These quadratic
benchmarks are used to group instructions into \emph{classes} that behave
identically from the backend's point of view.
Formally, the classes are the equivalence classes of the relation
$\sim$:
\[
a \sim b \iff \forall i \in \calI, \cyc{a+i} = \cyc{b+i}
\]
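On x86, one may for instance expect \texttt{ADD} and \texttt{SUB} on
general-purpose registers to fall into the same class: they typically execute
on the same set of ports with the same throughput, and are therefore
indistinguishable from the backend's point of view.
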
To accommodate measurement imprecision and fluctuations, this strict
equality is relaxed in practice; the classes are obtained by hierarchical
clustering~\cite{hcluster_ward}, splitting the tree into classes by maximizing
the silhouette~\cite{hcluster_silhouette}. This clustering into classes is
reused later in \autoref{chap:frontend}.
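
As an illustration, here is a minimal sketch of such a clustering, assuming
the quadratic benchmark results are gathered in a matrix $M$ with one row per
instruction, where entry $(a, i)$ holds $\cyc{a+i}$; the library calls and
thresholds below are illustrative, not Palmed's actual implementation.

\begin{verbatim}
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score

def instruction_classes(M: np.ndarray) -> np.ndarray:
    """Group instructions into classes from quadratic benchmarks.

    M[a, i] holds the measured cycles of the kernel {a, i}; two
    instructions with near-identical rows behave identically
    against every co-scheduled instruction.
    """
    # Ward hierarchical clustering on the measurement rows.
    tree = linkage(M, method="ward")
    # Cut the tree at every possible number of classes and keep
    # the cut maximising the silhouette score.
    best_labels, best_score = None, -1.0
    for n_classes in range(2, len(M)):
        labels = fcluster(tree, t=n_classes, criterion="maxclust")
        if len(set(labels)) < 2:
            continue
        score = silhouette_score(M, labels)
        if score > best_score:
            best_labels, best_score = labels, score
    return best_labels
\end{verbatim}
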
The first block then finishes by applying heuristics to select \emph{basic}
instructions, that is, instructions that stress as few resources as possible
while having the highest possible throughput. These basic instructions can
later be combined with other instructions to detect which resources the
latter stress.
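
The heuristics themselves are beyond the scope of this overview; purely as an
illustration, one plausible selection criterion (not necessarily the one
Palmed uses) is to retain, in each class, a representative that contends with
as few other instructions as possible, breaking ties by throughput.

\begin{verbatim}
def select_basic(classes, cyc, pair_cyc, eps=1e-3):
    """Illustrative selection heuristic, one representative per
    class.  cyc[a] is the measured cycles of instruction a alone;
    pair_cyc[a, b] those of the two-instruction kernel {a, b}."""
    def conflicts(a):
        # b likely shares a resource with a if running both
        # together is slower than the slowest of the two alone.
        return sum(
            1
            for other in classes
            for b in other
            if pair_cyc[a, b] > max(cyc[a], cyc[b]) + eps
        )
    # Prefer few conflicts (few stressed resources), then high
    # throughput (low standalone cycles).
    return [min(cls, key=lambda a: (conflicts(a), cyc[a]))
            for cls in classes]
\end{verbatim}
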
The second block, Core Mapping, builds benchmarks out of these basic
instructions to discover, for each resource~$r$, a kernel ---~as simple as
possible~--- that saturates it: adding any instruction that uses $r$
to this kernel should increase its execution time. These saturating kernels are
discovered with successive Linear Programming (LP) passes, using the Gurobi
Optimizer~\cite{tool:gurobi}.
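
To give the flavour of such a pass, here is a schematic LP written with
Gurobi's Python bindings; it assumes usage coefficients for the basic
instructions have already been estimated, and its formulation is a simplified
reconstruction, not the actual set of constraints used by Palmed.

\begin{verbatim}
import gurobipy as gp
from gurobipy import GRB

def saturating_kernel(basics, resources, rho, r, margin=0.1):
    """Hypothetical LP: choose multiplicities n[i] of basic
    instructions so that resource r strictly dominates every
    other resource, i.e. it is the kernel's sole bottleneck."""
    m = gp.Model("saturate")
    n = m.addVars(basics, lb=0.0, name="n")
    load = {s: gp.quicksum(rho[i, s] * n[i] for i in basics)
            for s in resources}
    m.addConstr(load[r] == 1.0)  # normalise the bottleneck load
    for s in resources:
        if s != r:
            # every other resource stays strictly below r's load
            m.addConstr(load[s] <= 1.0 - margin)
    # prefer the simplest possible saturating kernel
    m.setObjective(gp.quicksum(n[i] for i in basics), GRB.MINIMIZE)
    m.optimize()
    return {i: n[i].X for i in basics if n[i].X > 1e-6}
\end{verbatim}
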
These kernels are then used in a final block, Complete Mapping, to find the
$\rho_{i,r}$ coefficients for every instruction, constituting the final model.
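
Schematically, and eliding the actual constraints used by Palmed, each
measured kernel $\kerK$ yields an equation of the form
\[
\cyc{\kerK} = \max_{r}\, \sum_{i \in \calI} \sigma_{\kerK}(i)\, \rho_{i,r},
\]
where $\sigma_{\kerK}(i)$ denotes the multiplicity of $i$ in the multiset
$\kerK$ (a notation local to this sketch). The $\max$ makes such equations
non-linear in general; intuitively, benchmarking against saturating kernels
helps precisely because the maximising resource is then known in advance,
turning each equation into linear constraints on the $\rho_{i,r}$.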