\section{Evaluating \palmed{}}\label{sec:palmed_results}

Evaluating \palmed{} on the previously gathered basic blocks now requires, on
the one hand, defining evaluation metrics and, on the other hand, building an
evaluation harness to collect the throughput predictions from \palmed{} and
the other code analyzers considered, from which these metrics are then
derived.

\subsection{Evaluation harness}

We implement in \palmed{} an evaluation harness to evaluate it against both
native measurements and other code analyzers.

We first strip each gathered basic block of its dependencies using
\pipedream{}, as we did previously, so that it falls within the use case of
\palmed{}. This yields assembly code that can be run and measured natively.
The body of the innermost loop can also be used as an assembly basic block
for the other code analyzers. However, as \pipedream{} does not support some
instructions (control flow, x86-64 divisions, \ldots), those are stripped
from the original kernel, which might denature the original basic block.

To evaluate \palmed{}, the same kernel's run time is measured:

\begin{enumerate}
\item{} natively on each CPU, using the \pipedream{} harness to measure its
execution time;
\item{} using the resource mapping \palmed{} produced on the evaluation
machine;
\item{} using the \uopsinfo{}~\cite{uopsinfo} port mapping, converted to its
equivalent conjunctive resource mapping\footnote{When this evaluation was
made, \uica{}~\cite{uica} was not yet published. Since \palmed{} only
provides a resource mapping, but no frontend, the comparison to \uopsinfo{}
is fair.};
\item{} using \pmevo{}~\cite{PMEvo}, ignoring any instruction not supported
by its provided mapping;
\item{} using \iaca{}~\cite{iaca}, by inserting assembly markers around the
kernel and running the tool;
\item{} using \llvmmca{}~\cite{llvm-mca}, by inserting markers in the
\pipedream{}-generated assembly code and running the tool.
\end{enumerate}
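
Schematically, the harness behaves like the following driver (a simplified
sketch, not \palmed{}'s actual code: the kernel representation and the
per-method \texttt{ipc\_of} callables are hypothetical stand-ins for the six
measurement methods above):

\begin{verbatim}
# Simplified driver: collect, for each kernel, the IPC reported
# by every method (native measurement and code analyzers alike).
def run_harness(kernels, methods):
    # kernels: iterable of (name, weight, asm) tuples;
    # methods: dict mapping a method name ("native", "palmed",
    # ...) to a callable returning an IPC for an assembly
    # kernel, raising on unsupported input (names illustrative).
    results = {}
    for name, weight, asm in kernels:
        row = {"weight": weight}
        for method, ipc_of in methods.items():
            try:
                row[method] = ipc_of(asm)
            except Exception:
                row[method] = None  # counts against coverage
        results[name] = row
    return results
\end{verbatim}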

The raw results are saved (as a Python \pymodule{pickle} file) for reuse and
archival.
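
For instance (a minimal sketch; the file name is hypothetical):

\begin{verbatim}
import pickle

# Archive the raw results...
with open("palmed_eval.pkl", "wb") as handle:
    pickle.dump(results, handle)

# ...and reload them later for further analysis.
with open("palmed_eval.pkl", "rb") as handle:
    results = pickle.load(handle)
\end{verbatim}
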
\subsection{Metrics extracted}\label{ssec:palmed_eval_metrics}

As \palmed{} internally works with Instructions Per Cycle (IPC) metrics, and
as all these tools are also able to provide results in IPC, the most natural
metric to evaluate is the error on the predicted IPC. We measure this as a
Root-Mean-Square (RMS) error over all basic blocks considered, weighted by
each basic block's number of measured occurrences:
\[ \text{Err}_{\text{RMS, tool}} = \sqrt{\sum_{i \in \text{BBs}}
    \frac{\text{weight}_i}{\sum_j \text{weight}_j} \left(
    \frac{\text{IPC}_{i,\text{tool}} - \text{IPC}_{i,\text{native}}}{\text{IPC}_{i,\text{native}}}
    \right)^2}
\]
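
In terms of the data gathered by the harness, this amounts to the following
computation (a sketch over plain Python lists of per-block weights, native
IPCs, and predicted IPCs):

\begin{verbatim}
import math

def rms_error(weights, ipc_native, ipc_tool):
    # Weighted RMS of the relative IPC prediction error.
    total = sum(weights)
    return math.sqrt(sum(
        (w / total) * ((pred - ref) / ref) ** 2
        for w, ref, pred in zip(weights, ipc_native, ipc_tool)
    ))
\end{verbatim}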

\medskip{}

This error metric measures the relative deviation of predictions with respect
to a baseline. However, depending on how the prediction is used, the relative
\emph{ordering} of predictions ---~that is, which basic block is faster~---
might be more important. For instance, a compiler might use such models for
code selection; there, the goal is not to predict the performance of the
selected kernel, but to accurately pick the fastest one.

For this, we also provide Kendall's $\tau$ coefficient~\cite{kendalltau}. This
coefficient varies between $-1$ (full anti-correlation) and $1$ (full
correlation), and measures how many pairs of basic blocks $(i, j)$ were
correctly ordered by a tool, that is, whether
\[
\text{IPC}_{i,\text{native}} \leq \text{IPC}_{j,\text{native}}
\iff
\text{IPC}_{i,\text{tool}} \leq \text{IPC}_{j,\text{tool}}
\]
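
In practice, this coefficient can be obtained directly from
\pymodule{scipy} (a sketch; by default, \texttt{kendalltau} computes the
tie-adjusted $\tau_b$ variant):

\begin{verbatim}
from scipy.stats import kendalltau

def ordering_agreement(ipc_native, ipc_tool):
    # Kendall's tau between native and predicted IPCs: 1 means
    # all pairs are ordered alike, -1 all pairs are inverted.
    tau, _pvalue = kendalltau(ipc_native, ipc_tool)
    return tau
\end{verbatim}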

\medskip{}

Finally, we also provide a \emph{coverage} metric for each tool, that is, the
proportion of basic blocks it was able to process.

The definition of \emph{able to process}, however, varies from tool to tool.
For \iaca{} and \llvmmca{}, a basic block counts as unprocessed when the
analyzer crashed or ended without yielding a result. For \uopsinfo{}, it
counts as unprocessed when one of the instructions of the basic block is
absent from the port mapping. \pmevo{}, however, is evaluated in degraded
mode when instructions are not mapped, simply ignoring them; it is considered
to have failed only when \emph{no instruction at all} in the basic block was
present in the model.
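
With the driver sketched earlier, which records a failed analysis as
\texttt{None}, coverage reduces to the following (again a sketch over that
hypothetical data layout):

\begin{verbatim}
def coverage(results, tool):
    # Share of basic blocks for which the tool yielded an IPC.
    processed = sum(1 for row in results.values()
                    if row.get(tool) is not None)
    return processed / len(results)
\end{verbatim}
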
This notion of coverage is partial towards \palmed{}. As we use \pipedream{}
as a baseline measurement, instructions that cannot be benchmarked by
\pipedream{} are pruned from the benchmarks. Hence, \palmed{} has a 100\,\%
coverage \emph{by construction} ---~which does not mean that it supports all
the instructions found in the original basic blocks, but only that our
methodology is unable to process basic blocks unsupported by \pipedream{}.

\subsection{Results}

\input{40-1_results_fig.tex}

We run the evaluation harness on two different machines:
\begin{itemize}
\item{} an x86-64 Intel \texttt{SKL-SP}-based machine, with two Intel Xeon
Silver 4114 CPUs, totalling 20 cores;
\item{} an x86-64 AMD \texttt{ZEN1}-based machine, with a single 24-core AMD
EPYC 7401P CPU.
\end{itemize}

As \iaca{} only supports Intel CPUs, and as \uopsinfo{} only supports x86-64
machines and gives only very rough information for \texttt{ZEN}
architectures ---~without a port mapping~---, these two tools were only
tested on the \texttt{SKL-SP} machine.

\medskip{}

The evaluation metrics for both architectures and all five tools are
presented in \autoref{table:palmed_eval}. We further represent IPC prediction
accuracy as heatmaps in \autoref{fig:palmed_heatmaps}. A dark area at
coordinate $(x, y)$ means that the selected tool has a prediction accuracy of
$y$ for a significant number of microkernels with a measured IPC of $x$. The
closer a prediction is to the red horizontal line, the more accurate it is.

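For illustration, such a heatmap can be rendered along the following lines (a
hypothetical sketch with \texttt{matplotlib}; in particular, it assumes the
vertical axis is the ratio of predicted to measured IPC, with the red line at
$y = 1$ marking perfect predictions):

\begin{verbatim}
import matplotlib.pyplot as plt

def accuracy_heatmap(ipc_native, ipc_tool):
    # Darker bins hold more microkernels.
    ratios = [pred / ref
              for pred, ref in zip(ipc_tool, ipc_native)]
    plt.hist2d(ipc_native, ratios, bins=50, cmap="Greys")
    plt.axhline(1.0, color="red")  # perfect prediction
    plt.xlabel("measured IPC")
    plt.ylabel("predicted / measured IPC")
    plt.show()
\end{verbatim}
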
These results are analyzed in the full article~\cite{palmed}.