\section{Palmed design}

Palmed is a tool that aims to construct a resource mapping for a CPU in a fully automated way, based on the execution of well-chosen benchmarks. As its goal is to construct a resource mapping, its only concern is backend throughput ---~in particular, dependencies are entirely ignored. In-order effects are not modelled either; in fact, Palmed defines a kernel as a multiset of instructions, discarding instruction ordering altogether.

The general idea behind Palmed is that, as we saw above, the execution time of a kernel is described by a resource model through \autoref{eqn:res_model_rthroughput}. We can, however, reverse the problem: if we measure $\cyc{\kerK}$, the only unknown parameters in \autoref{eqn:res_model_rthroughput} become the $\rho_{i,r}$, that is, the weights of the edges in the resource model for the CPU under scrutiny. Given enough well-chosen pairs $\left(\kerK, \cyc{\kerK}\right)$, it should then be possible to solve the system for the $\rho_{i,r}$ coefficients, thus building a resource model.

This section does not detail Palmed entirely, but rather coarsely describes its general approach; the full methodology can be found in the original article~\cite{palmed}. Its main steps and components are sketched in \autoref{fig:palmed_big_picture}.

\begin{figure}
    \centering
    \includegraphics[width=\textwidth]{big_picture.svg}
    \caption{High-level view of Palmed's architecture}\label{fig:palmed_big_picture}
\end{figure}

Palmed starts off with a list of the instructions available in the ISA, $\calI$, as well as a description of their legal parameters. This list can be obtained using a decompiler. The first block, Basic Instructions Selection, benchmarks every pair of instructions ---~a step we call \emph{quadratic benchmarks}. These quadratic benchmarks are used to group instructions into \emph{classes} of instructions that behave identically from the backend's point of view.
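As a toy illustration of these quadratic benchmarks (a sketch only: the instruction names, the two-resource machine, and the \texttt{measure\_cycles} function are illustrative assumptions, not Palmed's actual code), instructions whose pairwise measurements coincide on every pair end up grouped together:

```python
# Hypothetical 2-resource machine: rho_{i,r} tells how much of
# resource r one instance of instruction i uses (assumed values).
RHO = {
    "add":  (1.0, 0.0),
    "sub":  (1.0, 0.0),   # backend-identical to "add"
    "mul":  (0.0, 1.0),
    "load": (0.5, 0.5),
}
INSTRUCTIONS = list(RHO)

def measure_cycles(kernel):
    """Cycles per iteration of a kernel (a multiset of instructions):
    under the resource model, the most-loaded resource dictates them."""
    return max(sum(RHO[i][r] for i in kernel) for r in range(2))

# Quadratic benchmarks: one measurement per pair (a, i).
quad = {a: tuple(measure_cycles([a, i]) for i in INSTRUCTIONS)
        for a in INSTRUCTIONS}

# Instructions with identical rows behave identically from the
# backend's point of view; group them into classes.
classes = {}
for a, row in quad.items():
    classes.setdefault(row, []).append(a)
print(sorted(classes.values()))  # → [['add', 'sub'], ['load'], ['mul']]
```

Here \texttt{add} and \texttt{sub} fall into the same class because no pairwise benchmark can distinguish them, which is exactly the situation the clustering step below exploits.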
Formally, the classes are built as equivalence classes for the relation $\sim$: \[ a \sim b \iff \forall i \in \calI,\ \cyc{a+i} = \cyc{b+i} \] To accommodate measurement imprecision and fluctuations, this strict equality is relaxed in practice; the classes are obtained by hierarchical clustering~\cite{hcluster_ward}, splitting the tree into classes by maximizing the silhouette~\cite{hcluster_silhouette}. This clustering into classes is reused later in \autoref{chap:frontend}.

The first block then finishes by applying heuristics to select \emph{basic} instructions, that is, instructions that stress as few resources as possible while having the highest possible throughput. These instructions can later be combined with others to detect whether the latter stress a given resource.

The second block, Core Mapping, builds benchmarks from these basic instructions to discover, for each resource~$r$, a kernel ---~as simple as possible~--- that saturates it: adding to this kernel any instruction that uses $r$ should increase its execution time. These saturating kernels are discovered with successive Linear Programming (LP) passes, using the Gurobi Optimizer~\cite{tool:gurobi}.

These kernels are then used in a final block, Complete Mapping, to find the $\rho_{i,r}$ coefficients for every instruction, constituting the final model.
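The role of saturating kernels in this last step can be sketched as follows (again a toy model, not Palmed's implementation: the LP passes are replaced here by direct cycle differences, and the machine, kernels, and names are illustrative assumptions). Once a kernel saturating resource~$r$ is known, the $\rho_{i,r}$ coefficient of any instruction can be read off as the extra cycles it adds to that kernel:

```python
# Assumed ground-truth model on a hypothetical 2-resource machine;
# the goal is to recover it from measurements alone.
RHO = {
    "add":  (1.0, 0.0),
    "sub":  (1.0, 0.0),
    "mul":  (0.0, 1.0),
    "load": (0.5, 0.5),
}

def measure_cycles(kernel):
    """Cycles per iteration: the most-loaded resource dictates them."""
    return max(sum(RHO[i][r] for i in kernel) for r in range(2))

# Hypothetical saturating kernels: "add" saturates resource 0 and
# "mul" resource 1, each repeated N times so that resource dominates.
N = 8
SATURATING = {0: ["add"] * N, 1: ["mul"] * N}

# When K_r saturates r, rho_{i,r} = cyc(K_r + i) - cyc(K_r).
recovered = {
    i: tuple(measure_cycles(SATURATING[r] + [i])
             - measure_cycles(SATURATING[r])
             for r in range(2))
    for i in RHO
}
assert recovered == RHO  # the full model is recovered exactly
```

In this idealized setting the recovery is exact; on real hardware, measurement noise and the difficulty of finding simple saturating kernels are precisely what the LP passes and heuristics above address.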