\section{Main contribution: evaluating \palmed{}}
The main contribution I made to \palmed{} is its evaluation harness and
procedure. \todo{}
\subsection{Basic blocks from benchmark suites}
Models generated by \palmed{} are meant to be used on basic blocks that are
computationally intensive ---~so that the backend, rather than \eg{} the
frontend or input/output, is the relevant resource to monitor~--- and that run
in steady-state ---~that is, that form the body of a loop long enough to be
reasonably considered infinite for performance modelling purposes. The basic
blocks used to evaluate \palmed{} should thus come reasonably close to these
criteria.
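
For instance, the innermost loop of a naive matrix multiplication satisfies
both criteria once compiled with optimizations: its body is a short basic
block, dominated by floating-point arithmetic and memory accesses, executed
many times in a row. The kernel below is written in Python purely to
illustrate the shape of such a kernel; the benchmarks we actually use are C
programs compiled to native code.

\begin{verbatim}
# Illustration only: a naive matrix-multiply kernel. Compiled from C at -O3,
# the body of the innermost loop becomes a basic block that runs long enough
# to be modelled as if in steady-state.
def matmul(a, b, c, n):
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for k in range(n):            # innermost loop: steady-state region
                acc += a[i][k] * b[k][j]  # loop body of interest
            c[i][j] = acc
\end{verbatim}
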
Some tools, such as \pmevo{}~\cite{PMEvo}, use randomly-sampled basic blocks
for their evaluation. This approach, however, may yield basic blocks that do
not meet these criteria; furthermore, such blocks may not be representative of
the real-life code on which the tool's users expect it to be accurate.

For this reason, we instead evaluate \palmed{} on basic blocks extracted from
two well-known benchmark suites: Polybench and SPEC CPU 2017.
\paragraph{Polybench} is a suite of benchmarks built out of 30 kernels of
numerical computation~\cite{bench:polybench}. Its benchmarks are
domain-specific, centered on scientific computation, mathematical computation,
image processing, etc. As the computation kernels are clearly identifiable in
the source code, extracting the relevant basic blocks is easy, which fits our
purpose well. It is written in C\@. Although it is not distributed under a
free/libre software license, it is free to use and its source code is openly
available.

We compile multiple versions of each benchmark (\texttt{-O2}, \texttt{-O3},
and a version tiled using the Pluto optimizer~\cite{tool:pluto}), then extract
the basic blocks corresponding to the benchmarks' kernels using
\qemu~\cite{tool:qemu}, gathering translation blocks and occurrence
statistics.
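
The translation-block collection can be sketched as follows. This is a
simplified illustration only, assuming \qemu{}'s user-mode emulator and its
\texttt{-d in\_asm} tracing; the actual extraction pipeline may rely on more
precise instrumentation (\eg{} a TCG plugin) to obtain the occurrence
statistics.

\begin{verbatim}
# Sketch only: collect the translation blocks QEMU emits while running a
# benchmark binary. This counts how many times each block gets (re)translated,
# which is not the execution count used as occurrence statistics.
import subprocess
from collections import Counter

def translated_blocks(binary):
    subprocess.run(
        ["qemu-x86_64", "-d", "in_asm", "-D", "qemu.log", binary],
        check=True,
    )
    blocks, current = Counter(), []
    with open("qemu.log") as log:
        for line in log:
            if line.startswith("IN:"):           # header of a translation block
                if current:
                    blocks[tuple(current)] += 1
                current = []
            elif line.strip().startswith("0x"):  # one disassembled instruction
                current.append(line.split(":", 1)[1].strip())
    if current:
        blocks[tuple(current)] += 1
    return blocks
\end{verbatim}
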
\paragraph{SPEC CPU 2017} is a suite of benchmarks meant to be CPU
intensive~\cite{bench:spec}. It is composed of both integer and floating-point
benchmarks, extracted from (mainly open-source) real-world software such as
\texttt{gcc} or \texttt{imagemagick}. Its main purpose is to obtain metrics
and compare CPUs on a unified workload; it is however commonly used throughout
the literature to evaluate compilers, optimizers, code analyzers, etc. It is
split into four variants: integer and floating-point, each combined with speed
---~time to perform a single task~--- and rate ---~throughput for performing a
flow of tasks. Most benchmarks exist in both speed and rate mode. The SPEC
suite is under a paid license and cannot be redistributed, which makes
peer-review and replication of experiments ---~\eg{} for artifact review~---
complicated.

In the case of SPEC, there is no clearly identifiable kernel for each
benchmark; extracting basic blocks on which to evaluate \palmed{} is therefore
not trivial. We manually extract the relevant basic blocks using a
profiling-based approach with Linux \perf{}, as the \qemu{}-based solution
used for Polybench would be too costly for SPEC\@. We automate this method and
describe it in detail in \qtodo{ref}.
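
The manual, profiling-based selection can be sketched as follows. The \perf{}
invocations are standard, but the report parsing and the cut-off threshold are
illustrative choices, not the exact procedure we automate later.

\begin{verbatim}
# Sketch only: run a SPEC benchmark under perf and keep the symbols that
# together account for most of the execution time; their hot loops are then
# inspected by hand to extract the relevant basic blocks.
import subprocess

def hot_symbols(cmd, cutoff=95.0):
    # cmd: the benchmark command line, as a list of strings
    subprocess.run(["perf", "record", "-o", "perf.data", "--"] + cmd,
                   check=True)
    report = subprocess.run(
        ["perf", "report", "-i", "perf.data", "--stdio"],
        capture_output=True, text=True, check=True,
    ).stdout
    picked, total = [], 0.0
    for line in report.splitlines():
        fields = line.split()
        if fields and fields[0].endswith("%"):   # e.g. "42.13%  bench  [.] fn"
            pct = float(fields[0].rstrip("%"))
            picked.append((fields[-1], pct))
            total += pct
            if total >= cutoff:                  # keep the hottest symbols only
                break
    return picked
\end{verbatim}
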
\bigskip{}
Altogether, this method generates, for x86-64 processors, 13\,778 SPEC-based
and 2\,664 Polybench-based basic blocks.
\subsection{Evaluation harness}
We implement an evaluation harness within \palmed{} to evaluate it both
against native measurements and against other code analyzers.
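
Schematically, the harness iterates over the collected basic blocks and
gathers, for each of them, the native measurement together with each model's
prediction. The sketch below only illustrates this overall shape; the
\texttt{predictors} mapping and the function interfaces are hypothetical
stand-ins, not the actual \palmed{} or \pipedream{} APIs.

\begin{verbatim}
# Sketch only: overall shape of the evaluation loop. `measure_native` and the
# per-tool `predict` callables are hypothetical stand-ins for the real APIs.
def evaluate(basic_blocks, predictors, measure_native):
    results = {}
    for name, block in basic_blocks.items():
        row = {"native": measure_native(block)}   # measured cycles per iteration
        for tool, predict in predictors.items():
            try:
                row[tool] = predict(block)        # predicted cycles per iteration
            except NotImplementedError:           # e.g. unsupported instruction
                row[tool] = None
        results[name] = row
    return results
\end{verbatim}
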
We first strip each gathered basic block of its dependencies using
\pipedream{}, as done previously, so that it falls within \palmed{}'s
use-case. This yields assembly code that can be run and measured natively. The
body of the innermost loop can also be used as an assembly basic block for the
other code analyzers. However, as \pipedream{} does not support some
instructions (control flow, x86-64 divisions, \ldots), those are stripped from
the kernel, which might denature the original basic block.
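
What this stripping amounts to can be sketched as follows; the mnemonic
prefixes below are illustrative and are not the exact set of instructions
\pipedream{} rejects.

\begin{verbatim}
# Sketch only: drop instructions Pipedream cannot handle from a kernel before
# benchmarking it. The prefixes below are examples, not the exhaustive list.
UNSUPPORTED_PREFIXES = (
    "j",            # jumps and other control flow
    "call", "ret",
    "div", "idiv",  # x86-64 divisions
)

def strip_unsupported(kernel_lines):
    kept = []
    for line in kernel_lines:
        stripped = line.strip()
        mnemonic = stripped.split()[0].lower() if stripped else ""
        if mnemonic.startswith(UNSUPPORTED_PREFIXES):
            continue        # removing these may denature the basic block
        kept.append(line)
    return kept
\end{verbatim}
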
To evaluate \palmed{}, the same kernel is run:
\begin{enumerate}
\item{} natively on each CPU, using the \pipedream{} harness to measure its
execution time;

\item{} using the resource mapping \palmed{} produced on the evaluation
machine;

\item{} using the \uopsinfo{}~\cite{uopsinfo} port mapping, converted to its
equivalent conjunctive resource mapping\footnote{When this evaluation was
made, \uica{}~\cite{uica} was not yet published. Since \palmed{} provides a
resource mapping, the comparison to \uopsinfo{} is fair.};

\item{} using \pmevo~\cite{PMEvo}, ignoring any instruction not supported by
its provided mapping;

\item{} using \iaca~\cite{iaca}, by inserting assembly markers around the
kernel and running the tool;

\item{} using \llvmmca~\cite{llvm-mca}, by inserting markers in the
\pipedream{}-generated assembly code and running the tool (the marker
insertion for these last two analyzers is sketched after this list).
\end{enumerate}
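
For the last two analyzers, the marker insertion amounts to wrapping the
kernel in tool-specific delimiters before invoking the tool on the resulting
file. The sketch below illustrates this for \llvmmca{}, whose markers are
plain assembly comments, and also lists the \iaca{} byte markers; file names
and the target CPU are illustrative choices.

\begin{verbatim}
# Sketch only: wrap an assembly kernel with analyzer markers and run llvm-mca.
# The IACA markers are the magic sequences shipped with the tool; the llvm-mca
# markers are ordinary assembly comments.
import subprocess

LLVM_MCA_BEGIN = "# LLVM-MCA-BEGIN kernel"
LLVM_MCA_END   = "# LLVM-MCA-END"
IACA_START = "movl $111, %ebx\n.byte 0x64, 0x67, 0x90"
IACA_END   = "movl $222, %ebx\n.byte 0x64, 0x67, 0x90"

def run_llvm_mca(kernel_asm, mcpu="skylake"):
    with open("kernel.s", "w") as f:
        f.write("\n".join([LLVM_MCA_BEGIN, kernel_asm, LLVM_MCA_END, ""]))
    out = subprocess.run(["llvm-mca", "-mcpu=" + mcpu, "kernel.s"],
                         capture_output=True, text=True, check=True)
    return out.stdout   # textual report containing the predicted throughput
\end{verbatim}
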
% TODO: metrics extracted