\section{Main contribution: evaluating \palmed{}}

The main contribution I made to \palmed{} is its evaluation harness and
procedure. \todo{}

\subsection{Basic blocks from benchmark suites}

Models generated by \palmed{} are meant to be used on basic blocks that are
computationally intensive ---~so that the backend is actually the relevant
resource to monitor, as opposed to \eg{} frontend- or input/output-bound
code~--- and running in steady-state ---~that is, forming the body of a loop
long enough to be reasonably considered infinite for performance modelling
purposes. The basic blocks used to evaluate \palmed{} should thus come
reasonably close to meeting these criteria.

Some tools, such as \pmevo{}~\cite{PMEvo}, use randomly-sampled basic blocks
for their evaluation. This approach, however, may yield basic blocks that do
not meet these criteria; furthermore, such blocks may not be representative of
the real-life code on which users expect the tool to be accurate.

For this reason, we evaluate \palmed{} on basic blocks extracted from
two well-known benchmark suites instead: Polybench and SPEC CPU 2017.

\paragraph{Polybench} is a suite of benchmarks built out of 30 kernels of
numerical computation~\cite{bench:polybench}. Its benchmarks are
domain-specific, centered on scientific and mathematical computation, image
processing, etc. As the computation kernels are clearly identifiable in the
source code, extracting the relevant basic blocks is easy, which fits our
purpose well. The suite is written in C\@. Although it is not under a
free/libre software license, it is free to use and open-source.

We compile multiple versions of each benchmark (\texttt{-O2}, \texttt{-O3},
and tiled using the Pluto optimizer~\cite{tool:pluto}), then extract the basic
blocks corresponding to the benchmarks' kernels using \qemu~\cite{tool:qemu},
gathering translation blocks and their occurrence statistics.
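
For illustration, the aggregation step amounts to something like the Python
sketch below. It assumes the \qemu{} trace has already been reduced to one
\texttt{<hex-address> <count>} pair per translation block and per line; this
input format, as well as the 95\,\% coverage threshold, are illustrative
assumptions, not \palmed{}'s actual implementation.

\begin{verbatim}
import collections

def load_tb_counts(path):
    """Map each translation block's start address to its execution
    count. Assumes a pre-processed trace with one "<hex-addr> <count>"
    pair per line (illustrative format)."""
    counts = collections.Counter()
    with open(path) as f:
        for line in f:
            addr, count = line.split()
            counts[int(addr, 16)] += int(count)
    return counts

def hottest_blocks(counts, fraction=0.95):
    """Smallest set of blocks covering `fraction` of all executions."""
    total = sum(counts.values())
    covered, hot = 0, []
    for addr, count in counts.most_common():
        if covered >= fraction * total:
            break
        hot.append((addr, count))
        covered += count
    return hot
\end{verbatim}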

\paragraph{SPEC CPU 2017} is a suite of benchmarks meant to be
CPU-intensive~\cite{bench:spec}. It is composed of both integer and
floating-point benchmarks, extracted from (mainly open-source) real-world
software, such as \texttt{gcc}, \texttt{imagemagick}, \ldots{} Its main
purpose is to obtain metrics and compare CPUs on a unified workload; it is,
however, commonly used throughout the literature to evaluate compilers,
optimizers, code analyzers, etc. It is split into four variants: integer and
floating-point, combined with speed ---~time to perform a single task~--- and
rate ---~throughput for performing a flow of tasks. Most benchmarks exist in
both speed and rate modes. The SPEC suite is under a paid license and cannot
be redistributed, which makes peer review and replication of experiments
---~\eg{} for artifact review~--- complicated.

In the case of SPEC, there is no clearly identifiable kernel for each
benchmark, so extracting basic blocks to evaluate \palmed{} is not trivial. We
manually extract the relevant basic blocks using a profiling-based approach
with Linux \perf{}, as the \qemu{}-based solution used for Polybench would be
too costly for SPEC\@. We automate this method and describe it in detail later
in \qtodo{ref}.
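
As a rough illustration of the profiling-based approach (the actual, automated
method is described later), one may record samples with \perf{} and keep the
symbols concentrating most of them. The sketch below stops at the symbol
level, and its parsing of the report output is simplified; the real pipeline
must still map hot symbols back to basic blocks.

\begin{verbatim}
import re
import subprocess

def hot_symbols(cmd, threshold=1.0):
    """Run `cmd` under perf and return (overhead %, symbol) pairs for
    symbols gathering at least `threshold` percent of CPU samples."""
    subprocess.run(["perf", "record", "-o", "perf.data", "--", *cmd],
                   check=True)
    report = subprocess.run(
        ["perf", "report", "--stdio", "-i", "perf.data"],
        check=True, capture_output=True, text=True).stdout
    symbols = []
    for line in report.splitlines():
        # Overhead lines look like: " 12.34%  bench  bench  [.] kernel"
        m = re.match(r"\s*(\d+\.\d+)%.*\[\.\]\s+(\S+)", line)
        if m and float(m.group(1)) >= threshold:
            symbols.append((float(m.group(1)), m.group(2)))
    return symbols
\end{verbatim}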

\bigskip{}

Altogether, this method generates, for x86-64 processors, 13\,778 SPEC-based
and 2\,664 Polybench-based basic blocks.

\subsection{Evaluation harness}

We implement in \palmed{} an evaluation harness that evaluates it both against
native measurements and against other code analyzers.

We first use \pipedream{}, as we did previously, to strip each gathered basic
block of its dependencies, so that it falls within the use case of \palmed{}.
This yields assembly code that can be run and measured natively. The body of
the innermost loop can also be used as an assembly basic block for other code
analyzers. However, as \pipedream{} does not support some instructions
(control flow, x86-64 divisions, \ldots), those are stripped from the original
kernel, which might distort the original basic block.
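
The stripping step itself is conceptually simple; a minimal sketch, with a
hypothetical list of unsupported mnemonics, could look like:

\begin{verbatim}
# Hypothetical prefixes of mnemonics Pipedream cannot benchmark
# (control flow, divisions, ...); the real list is larger.
UNSUPPORTED_PREFIXES = ("j", "call", "ret", "div", "idiv")

def strip_unsupported(kernel):
    """Drop unsupported instructions from a kernel, given as a list
    of assembly instruction strings."""
    return [ins for ins in kernel
            if not ins.lstrip().lower().startswith(UNSUPPORTED_PREFIXES)]
\end{verbatim}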

To evaluate \palmed{}, the same kernel is run (a simplified driver is sketched
after this list):

\begin{enumerate}
\item{} natively on each CPU, using the \pipedream{} harness to measure its
execution time;
\item{} using the resource mapping \palmed{} produced on the evaluation machine;
\item{} using the \uopsinfo{}~\cite{uopsinfo} port mapping, converted to its
equivalent conjunctive resource mapping\footnote{When this evaluation was
made, \uica{}~\cite{uica} was not yet published. Since \palmed{} provides a
resource mapping, the comparison to \uopsinfo{} is fair.};
\item{} using \pmevo~\cite{PMEvo}, ignoring any instruction not supported by
its provided mapping;
\item{} using \iaca~\cite{iaca}, by inserting assembly markers around the
kernel and running the tool;
\item{} using \llvmmca~\cite{llvm-mca}, by inserting markers in the
\pipedream{}-generated assembly code and running the tool.
\end{enumerate}
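
A driver for these runs may be sketched as follows; the tool names and
callables are hypothetical, but the overall logic mirrors the list above, a
failure of a tool being recorded and counted against its coverage.

\begin{verbatim}
def evaluate_kernel(kernel, tools, native_runner):
    """Collect the native IPC measurement plus one IPC prediction per
    tool. `tools` maps a tool name to a callable returning a predicted
    IPC, raising an exception if the kernel cannot be processed."""
    results = {"native": native_runner(kernel)}
    for name, predict in tools.items():
        try:
            results[name] = predict(kernel)
        except Exception:
            results[name] = None  # counted against the tool's coverage
    return results
\end{verbatim}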

The raw results are saved (as a Python \pymodule{pickle} file) for reuse and
archival.
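
Persisting those results only takes a few lines with the standard library (the
file name here is illustrative):

\begin{verbatim}
import pickle

def save_results(results, path="eval_results.pkl"):
    """Serialize the raw evaluation results for reuse and archival."""
    with open(path, "wb") as f:
        pickle.dump(results, f)

def load_results(path="eval_results.pkl"):
    with open(path, "rb") as f:
        return pickle.load(f)
\end{verbatim}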

\medskip{}

The evaluation is run on two different machines:
\begin{itemize}
\item{} an Intel \texttt{SKL-SP}-based machine, with two Intel Xeon Silver
4114 CPUs, totalling 20 cores;
\item{} an AMD \texttt{ZEN1}-based machine, with a single 24-core AMD EPYC
7401P CPU.
\end{itemize}

As \iaca{} only supports Intel CPUs, and \uopsinfo{} gives only very rough
information ---~without a port mapping~--- for \texttt{ZEN} architectures,
these two tools are not tested on the \texttt{ZEN1} machine.

\subsection{Metrics extracted}

As \palmed{} internally works with Instructions Per Cycle (IPC) metrics, and
as all these tools are also able to provide results in IPC, the most natural
metric to evaluate is the error on the predicted IPC\@. We measure this as a
Root-Mean-Square (RMS) error over all basic blocks considered, weighted by
each basic block's measured occurrences:

\[ \text{Err}_\text{RMS, tool} = \sqrt{\sum_{i \in \text{BBs}}
    \frac{\text{weight}_i}{\sum_j \text{weight}_j} \left(
    \frac{\text{IPC}_{i,\text{tool}} - \text{IPC}_{i,\text{native}}}{\text{IPC}_{i,\text{native}}}
    \right)^2
} \]
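
Transcribed directly into code, this metric reads as follows (the container
format is a hypothetical choice):

\begin{verbatim}
import math

def rms_error(bbs):
    """Weighted RMS of the relative IPC prediction error, where `bbs`
    is a list of (weight, ipc_tool, ipc_native) triples."""
    total_weight = sum(w for w, _, _ in bbs)
    return math.sqrt(sum(
        (w / total_weight)
        * ((ipc_tool - ipc_native) / ipc_native) ** 2
        for w, ipc_tool, ipc_native in bbs))
\end{verbatim}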

\medskip{}

This error metric measures the relative deviation of predictions with respect
to a baseline. However, depending on how the predictions are used, the
relative \emph{ordering} of predictions ---~that is, which basic block is
faster~--- might matter more. For instance, a compiler might use such models
for code selection; there, the goal is not to predict the performance of the
selected kernel, but to accurately pick the fastest one.

For this, we also provide Kendall's $\tau$ coefficient~\cite{kendalltau}. This
coefficient varies between $-1$ (full anti-correlation) and $1$ (full
correlation), and measures how many pairs of basic blocks $(i, j)$ were
correctly ordered by a tool, that is, whether

\[
\text{IPC}_{i,\text{native}} \leq \text{IPC}_{j,\text{native}}
\iff
\text{IPC}_{i,\text{tool}} \leq \text{IPC}_{j,\text{tool}}
\]
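
This coefficient does not need to be reimplemented by hand; for instance,
\pymodule{scipy} provides it directly:

\begin{verbatim}
from scipy.stats import kendalltau

def ordering_agreement(native_ipcs, tool_ipcs):
    """Kendall's tau between native measurements and a tool's
    predictions, given as two aligned sequences of IPC values."""
    tau, _pvalue = kendalltau(native_ipcs, tool_ipcs)
    return tau
\end{verbatim}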

\medskip{}

Finally, we also provide a \emph{coverage} metric for each tool, that is, the
proportion of basic blocks it was able to process.

The definition of \emph{able to process}, however, varies from tool to tool.
For \iaca{} and \llvmmca{}, a basic block is counted as unprocessed when the
analyzer crashes or ends without yielding a result. For \uopsinfo{}, it is
counted as unprocessed when one of the instructions of the basic block is
absent from the port mapping. \pmevo{}, however, is evaluated in degraded mode
when instructions are not mapped, simply ignoring them; it is considered to
have failed only when \emph{no instruction at all} in the basic block was
present in the model.
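
Once per-tool failures are recorded as missing predictions, the coverage
computation itself is uniform across tools; a sketch, reusing the
(hypothetical) result dictionaries of the driver above:

\begin{verbatim}
def coverage(results_per_bb, tool):
    """Fraction of basic blocks for which `tool` yielded a prediction.
    `results_per_bb` is a list of per-block result dictionaries; a
    None entry means the tool failed on that block."""
    processed = sum(1 for r in results_per_bb
                    if r.get(tool) is not None)
    return processed / len(results_per_bb)
\end{verbatim}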

This notion of coverage is biased towards \palmed{}: as we use \pipedream{} as
a baseline measurement, instructions that cannot be benchmarked by
\pipedream{} are pruned from the benchmarks; hence, \palmed{} has a 100\,\%
coverage \emph{by construction} ---~which does not mean that it supports all
the instructions found in the original basic blocks.

\subsection{Results}

\input{40-1_results_fig.tex}