\section{Evaluating \palmed{}}\label{sec:palmed_results}

Evaluating \palmed{} on the previously gathered basic blocks requires, on the one hand, defining evaluation metrics and, on the other hand, implementing an evaluation harness that collects the throughput predictions of \palmed{} and of the other code analyzers considered, from which these metrics are derived.

\subsection{Evaluation harness}

We implement an evaluation harness in \palmed{} to evaluate it against both native measurements and other code analyzers. We first strip each gathered basic block of its dependencies using \pipedream{}, as we did previously, so that it falls within \palmed{}'s use case. This yields assembly code that can be run and measured natively; the body of the most nested loop can also be used as an assembly basic block by the other code analyzers. However, as \pipedream{} does not support some instructions (control flow, x86-64 divisions, \ldots), those are stripped from the original kernel, which may make the evaluated kernel diverge from the original basic block.

To evaluate \palmed{}, the throughput of the same kernel is then measured or predicted:
\begin{enumerate}
    \item{} natively on each CPU, using the \pipedream{} harness to measure its execution time;
    \item{} using the resource mapping \palmed{} produced on the evaluation machine;
    \item{} using the \uopsinfo{}~\cite{uopsinfo} port mapping, converted to its equivalent conjunctive resource mapping\footnote{When this evaluation was made, \uica{}~\cite{uica} was not yet published. Since \palmed{} only provides a resource mapping, but no frontend, the comparison to \uopsinfo{} is fair.};
    \item{} using \pmevo~\cite{PMEvo}, ignoring any instruction not supported by its provided mapping;
    \item{} using \iaca~\cite{iaca}, by inserting assembly markers around the kernel and running the tool;
    \item{} using \llvmmca{}~\cite{llvm-mca}, by inserting markers in the \pipedream{}-generated assembly code and running the tool.
\end{enumerate}

The raw results are saved (as a Python \pymodule{pickle} file) for reuse and archival.

\subsection{Metrics extracted}\label{ssec:palmed_eval_metrics}

As \palmed{} internally works with Instructions Per Cycle (IPC) metrics, and as all these tools are also able to provide results in IPC, the most natural metric to evaluate is the error on the predicted IPC. We measure this as a Root-Mean-Square (RMS) error over all basic blocks considered, weighted by each basic block's measured occurrences:
\[ \text{Err}_\text{RMS, tool} = \sqrt{\sum_{i \in \text{BBs}} \frac{\text{weight}_i}{\sum_j \text{weight}_j} \left( \frac{\text{IPC}_{i,\text{tool}} - \text{IPC}_{i,\text{native}}}{\text{IPC}_{i,\text{native}}} \right)^2 } \]

\medskip{}

This error metric measures the relative deviation of predictions with respect to a baseline. However, depending on how a prediction is used, the relative \emph{ordering} of predictions ---~that is, which basic block is faster~--- might matter more. For instance, a compiler might use such models for code selection; there, the goal is not to predict the performance of the selected kernel, but to accurately pick the fastest one. For this, we also provide Kendall's $\tau$ rank correlation coefficient~\cite{kendalltau}; a sketch of how both metrics can be computed from the raw results is given below.
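For concreteness, the listing below sketches how these two metrics could be computed from the collected IPC values. It is a minimal illustration, not \palmed{}'s actual implementation: the arrays \texttt{weights}, \texttt{ipc\_native} and \texttt{ipc\_tool} are hypothetical names for the per-basic-block data, and Kendall's $\tau$ is obtained from \texttt{scipy.stats.kendalltau}.

\begin{verbatim}
# Minimal sketch of the two metrics (illustrative, not Palmed's code).
# `weights`, `ipc_native` and `ipc_tool` are hypothetical NumPy arrays
# holding, for each basic block, its weight and its measured/predicted IPC.
import numpy as np
from scipy.stats import kendalltau

def rms_error(weights, ipc_native, ipc_tool):
    """Weighted RMS of the relative error on the predicted IPC."""
    rel_err = (ipc_tool - ipc_native) / ipc_native
    norm_weights = weights / weights.sum()
    return np.sqrt(np.sum(norm_weights * rel_err ** 2))

def ordering_agreement(ipc_native, ipc_tool):
    """Kendall's tau: 1 = same ordering, -1 = reversed ordering."""
    tau, _pvalue = kendalltau(ipc_native, ipc_tool)
    return tau
\end{verbatim}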
Kendall's $\tau$ varies between $-1$ (full anti-correlation) and $1$ (full correlation), and measures how many pairs of basic blocks $(i, j)$ were ordered correctly by a tool, that is, for which
\[ \text{IPC}_{i,\text{native}} \leq \text{IPC}_{j,\text{native}} \iff \text{IPC}_{i,\text{tool}} \leq \text{IPC}_{j,\text{tool}} \]

\medskip{}

Finally, we also provide a \emph{coverage} metric for each tool, that is, the proportion of basic blocks it was able to process. What counts as a failure, however, varies from tool to tool. For \iaca{} and \llvmmca{}, a basic block is counted as unsupported when the analyzer crashes or ends without yielding a result. For \uopsinfo{}, it is counted as unsupported when one of its instructions is absent from the port mapping. \pmevo{}, however, is evaluated in degraded mode when instructions are not mapped, simply ignoring them; it is considered to have failed only when \emph{no instruction at all} of the basic block is present in its model.

This notion of coverage is partial towards \palmed{}. As we use \pipedream{} as a baseline measurement, instructions that cannot be benchmarked by \pipedream{} are pruned from the benchmarks. Hence, \palmed{} has a 100\,\% coverage \emph{by construction} ---~which does not mean that it supports all the instructions found in the original basic blocks, but only that our methodology is unable to process basic blocks unsupported by \pipedream{}.

\subsection{Results}

\input{40-1_results_fig.tex}

We run the evaluation harness on two different machines:
\begin{itemize}
    \item{} an x86-64 Intel \texttt{SKL-SP}-based machine, with two Intel Xeon Silver 4114 CPUs, totalling 20 cores;
    \item{} an x86-64 AMD \texttt{ZEN1}-based machine, with a single 24-core AMD EPYC 7401P CPU.
\end{itemize}

As \iaca{} only supports Intel CPUs, and \uopsinfo{} only supports x86-64 machines and gives only very rough information for \texttt{ZEN} architectures ---~without port mapping~---, these two tools were only tested on the \texttt{SKL-SP} machine.

\medskip{}

The evaluation metrics for both architectures and all five tools are presented in \autoref{table:palmed_eval}. We further represent IPC prediction accuracy as heatmaps in \autoref{fig:palmed_heatmaps}: a dark area at coordinates $(x, y)$ means that the selected tool reaches a prediction accuracy of $y$ on a significant number of microkernels whose measured IPC is $x$. The closer a prediction lies to the red horizontal line, the more accurate it is. These results are analyzed in the full article~\cite{palmed}.
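For reference, the following listing sketches how such a heatmap can be rendered from the collected data. It is only an illustration: the array names are hypothetical, and the convention that the vertical axis shows the ratio of predicted to measured IPC ---~so that perfect predictions lie on the horizontal line $y = 1$~--- is an assumption.

\begin{verbatim}
# Illustrative sketch of one accuracy heatmap (not Palmed's actual code).
# `ipc_native`, `ipc_tool` and `weights` are hypothetical per-basic-block
# NumPy arrays; the y axis is assumed to be the predicted/measured ratio.
import matplotlib.pyplot as plt

def plot_heatmap(ipc_native, ipc_tool, weights, tool_name):
    ratio = ipc_tool / ipc_native
    plt.hist2d(ipc_native, ratio, bins=50, weights=weights, cmap="Greys")
    plt.axhline(y=1.0, color="red")  # perfect prediction: ratio of 1
    plt.xlabel("Measured IPC")
    plt.ylabel("Predicted IPC / measured IPC")
    plt.title(tool_name)
    plt.savefig("heatmap_{}.pdf".format(tool_name))
    plt.close()
\end{verbatim}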