\section{Main contribution: evaluating \palmed{}}
The main contribution I made to \palmed{} is its evaluation harness and
procedure. \todo{}
\subsection{Basic blocks from benchmark suites}
Models generated by \palmed{} are meant to be used on basic blocks that are
computationally intensive ---~so that the backend is actually the relevant
resource to monitor, as opposed to \eg{} frontend- or input/output-bound
code~---, and that run in steady-state ---~that is, that form the body of a
loop iterated enough times to be reasonably considered infinite for
performance modelling purposes. The basic blocks used to evaluate \palmed{}
should thus reasonably meet these criteria.

Some tools, such as \pmevo{}~\cite{PMEvo}, use randomly-sampled basic blocks
for their evaluation. This approach, however, may yield basic blocks that do
not meet these criteria; furthermore, it may not be representative of the
real-life code on which the tool's users expect it to be accurate.
For this reason, we instead evaluate \palmed{} on basic blocks extracted from
two well-known benchmark suites: Polybench and SPEC CPU 2017.

\paragraph{Polybench} is a suite of benchmarks built out of 30 kernels of
numerical computation~\cite{bench:polybench}. Its benchmarks are
domain-specific and centered on scientific and mathematical computation,
image processing, etc. As the computation kernels are clearly identifiable
in the source code, extracting the relevant basic blocks is easy, which fits
our purpose well. Polybench is written in C. Although it is not distributed
under a free/libre software license, it is free to use and open-source.
We compile multiple versions of each benchmark (\texttt{-O2}, \texttt{-O3} and
tiled using the Pluto optimizer~\cite{tool:pluto}), then extract the basic
blocks corresponding to the benchmarks' kernels using \qemu~\cite{tool:qemu},
gathering translation blocks and occurrence statistics.
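
As a rough, purely illustrative sketch of this step ---~the exact
\texttt{Trace} line format depends on the \qemu{} version, and the file names
below are placeholders, not our actual tooling~---, the execution counts of
translation blocks can be recovered from a \qemu{} execution trace obtained
with \texttt{-d exec}:
\begin{verbatim}
# Hypothetical post-processing of a QEMU execution trace: count how many
# times each translation block (identified by its guest PC) was executed.
# Assumes a log produced with `qemu-x86_64 -d exec -D exec.log ./kernel_bin`;
# the exact "Trace ..." line format depends on the QEMU version, so the
# regular expression below may need adjusting.
import re
from collections import Counter

TRACE_RE = re.compile(r"Trace \d+: \S+ \[[0-9a-fx]+/([0-9a-f]+)/")

def tb_occurrences(logfile="exec.log"):
    counts = Counter()
    with open(logfile) as f:
        for line in f:
            m = TRACE_RE.match(line)
            if m:
                counts[int(m.group(1), 16)] += 1
    return counts  # {guest PC of the translation block: execution count}

for pc, n in tb_occurrences().most_common(10):
    print(f"{pc:#x}: executed {n} times")
\end{verbatim}
The occurrence counts obtained this way are what later serves as the
per-basic-block weights in the evaluation.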
\paragraph{SPEC CPU 2017} is a suite of benchmarks meant to be CPU
intensive~\cite{bench:spec}. It is composed of both integer and floating-point
benchmarks, extracted from (mainly open-source) real-world software such as
\texttt{gcc}, \texttt{imagemagick}, \ldots{} Its main purpose is to obtain
metrics and compare CPUs on a unified workload; it is however commonly used
throughout the literature to evaluate compilers, optimizers, code analyzers,
etc. It is split into four variants: integer and floating-point, combined with
speed ---~time to perform a single task~--- and rate ---~throughput for
performing a flow of tasks. Most benchmarks exist in both speed and rate
modes.
The SPEC suite is under a paid license and cannot be redistributed, which
makes peer review and replication of experiments ---~\eg{} for artifact
review~--- complicated.
In the case of SPEC, there is no clear kernel available for each benchmark;
extracting basic blocks to evaluate \palmed{} is not trivial. We manually
extract the relevant basic blocks using a profiling-based approach with Linux
\perf{}, as the \qemu{}-based solution used for Polybench would be too costly
for SPEC\@.
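
As a rough, hypothetical illustration of this profiling-based approach
---~the \perf{} commands are standard ones, but the parsing and function
names below are ours, not \palmed{}'s actual tooling~---, the hottest symbols
of a benchmark can be listed as follows before extracting their loop bodies
by hand (\eg{} with \texttt{objdump -d}):
\begin{verbatim}
# Hypothetical sketch: profile a benchmark binary with Linux perf and list
# its hottest symbols, whose inner loops are then inspected manually.
import subprocess

def hottest_symbols(cmd, top=5):
    """Run `cmd` under perf; return its `top` hottest (overhead %, symbol)."""
    subprocess.run(["perf", "record", "-o", "perf.data", "--"] + cmd,
                   check=True)
    report = subprocess.run(
        ["perf", "report", "--stdio", "--sort", "symbol", "-i", "perf.data"],
        capture_output=True, text=True, check=True).stdout
    hits = []
    for line in report.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                       # skip headers and comment lines
        fields = line.split()
        if fields[0].endswith("%"):        # e.g. "45.32%  [.] hot_function"
            hits.append((float(fields[0].rstrip("%")), fields[-1]))
    return hits[:top]                      # perf already sorts by overhead

print(hottest_symbols(["./benchmark_binary"]))
\end{verbatim}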
\bigskip{}
Altogether, this method generates, for x86-64 processors, 13\,778 SPEC-based
and 2\,664 Polybench-based basic blocks.
\medskip{}
We automate the \perf{}-based method and describe it in detail later in
\qtodo{ref}; this automation also allows us to extract basic blocks from the
same benchmark suites compiled for ARMv8a.
\subsection{Evaluation harness}
We implement in \palmed{} an evaluation harness that compares it both against
native measurements and against other code analyzers.

We first strip each gathered basic block of its dependencies using
\pipedream{}, as described previously, so that it falls within \palmed{}'s use
case. This yields assembly code that can be run and measured natively. The
body of the innermost loop can also be used as an assembly basic block for the
other code analyzers.
However, as \pipedream{} does not support some instructions (control flow,
x86-64 divisions, \ldots), those are stripped from the original kernel, which
may distort the original basic block.
To evaluate \palmed{}, each kernel is then processed in six ways:
\begin{enumerate}
\item{} natively on each CPU, using the \pipedream{} harness to measure its
execution time;
\item{} using the resource mapping \palmed{} produced on the evaluation machine;
\item{} using the \uopsinfo{}~\cite{uopsinfo} port mapping, converted to its
equivalent conjunctive resource mapping\footnote{When this evaluation was
made, \uica{}~\cite{uica} was not yet published. Since \palmed{} provides a
resource mapping, the comparison to \uopsinfo{} is fair.};
\item{} using \pmevo~\cite{PMEvo}, ignoring any instruction not supported by
its provided mapping;
\item{} using \iaca~\cite{iaca}, by inserting assembly markers around the
kernel and running the tool;
\item{} using \llvmmca~\cite{llvm-mca}, by inserting markers in the
\pipedream{}-generated assembly code and running the tool.
\end{enumerate}
The raw results are saved (as a Python \pymodule{pickle} file) for reuse and
archival.
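
As a purely illustrative sketch ---~the actual file layout used by \palmed{}
differs, and the field names and dummy values below are placeholders~---,
this archival step may look like:
\begin{verbatim}
# Purely illustrative archival of raw evaluation results with pickle
# (dummy values; the actual layout of Palmed's result files differs).
import pickle

results = {
    "machine": "SKL-SP",
    "basic_blocks": {
        # bb identifier -> native measurement and per-tool IPC predictions
        "bb_0001": {"native": 0.0, "palmed": 0.0, "uops.info": 0.0,
                    "pmevo": 0.0, "iaca": 0.0, "llvm-mca": 0.0},
    },
}

with open("palmed_eval_skl_sp.pkl", "wb") as f:
    pickle.dump(results, f)            # archive the raw results...

with open("palmed_eval_skl_sp.pkl", "rb") as f:
    results = pickle.load(f)           # ...and reload them for later analysis
\end{verbatim}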
\subsection{Metrics extracted}
As \palmed{} internally works with Instructions Per Cycle (IPC) metrics, and as
all these tools are also able to provide results in IPC, the most natural
metric to evaluate is the error on the predicted IPC. We measure this as a
Root-Mean-Square (RMS) error over all basic blocks considered, weighted by each
basic block's measured occurrences:
\[ \text{Err}_\text{RMS, tool} = \sqrt{\sum_{i \in \text{BBs}}
\frac{\text{weight}_i}{\sum_j \text{weight}_j} \left(
\frac{\text{IPC}_{i,\text{tool}} - \text{IPC}_{i,\text{native}}}{\text{IPC}_{i,\text{native}}}
\right)^2
}
\]
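
As a minimal illustration (the array names are ours, not \palmed{}'s), this
weighted RMS error can be computed directly from the definition above:
\begin{verbatim}
# Weighted RMS relative error on the predicted IPC, as defined above.
# `ipc_tool`, `ipc_native` and `weights` are parallel arrays over basic blocks.
import numpy as np

def rms_error(ipc_tool, ipc_native, weights):
    ipc_tool = np.asarray(ipc_tool, dtype=float)
    ipc_native = np.asarray(ipc_native, dtype=float)
    w = np.asarray(weights, dtype=float)
    w /= w.sum()                             # normalise the weights
    rel_err = (ipc_tool - ipc_native) / ipc_native
    return np.sqrt(np.sum(w * rel_err ** 2))
\end{verbatim}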
\medskip{}
This error metric measures the relative deviation of predictions with respect
to a baseline. However, depending on how this prediction is used, the relative
\emph{ordering} of predictions ---~that is, which basic block is faster~---
might be more important. For instance, a compiler might use such models for
code selection; here, the goal would not be to predict the performance of the
kernel selected, but to accurately pick the fastest.
For this, we also provide Kendall's $\tau$ coefficient~\cite{kendalltau}. This
coefficient varies between $-1$ (full anti-correlation) and $1$ (full
correlation), and measures how many pairs of basic blocks $(i, j)$ were
correctly ordered by a tool, that is, whether
\[
\text{IPC}_{i,\text{native}} \leq \text{IPC}_{j,\text{native}}
\iff
\text{IPC}_{i,\text{tool}} \leq \text{IPC}_{j,\text{tool}}
\]
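
This coefficient need not be reimplemented by hand: \texttt{scipy}, for
instance, provides it directly. A minimal sketch, using the same illustrative
arrays as above:
\begin{verbatim}
# Kendall's tau between the native and predicted IPC orderings.
from scipy.stats import kendalltau

def ordering_agreement(ipc_tool, ipc_native):
    tau, _p_value = kendalltau(ipc_native, ipc_tool)
    return tau        # 1: identical ordering, -1: fully reversed ordering
\end{verbatim}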
\medskip{}
Finally, we also provide a \emph{coverage} metric for each tool, that is, the
proportion of basic blocks it was able to process.
The definition of \emph{able to process}, however, varies from tool to tool.
For \iaca{} and \llvmmca{}, a basic block is counted as unsupported when the
analyzer crashes or ends without yielding a result. For \uopsinfo{}, it is
counted as unsupported when one of the instructions of the basic block is
absent from the port mapping. \pmevo{}, however, is evaluated in degraded mode
when some instructions are not mapped, simply ignoring them; it is considered
to have failed only when \emph{no instruction at all} of the basic block is
present in the model.
This notion of coverage is biased in favour of \palmed{}: as we use
\pipedream{} as a baseline measurement, instructions that cannot be
benchmarked by \pipedream{} are pruned from the benchmarks; hence, \palmed{}
has a 100\,\% coverage \emph{by construction} ---~which does not mean that it
supports all the instructions found in the original basic blocks.
\subsection{Results}
\input{40-1_results_fig.tex}
We run the evaluation harness on three different machines:
\begin{itemize}
\item{} an x86-64 Intel \texttt{SKL-SP}-based machine, with two Intel Xeon Silver
4114 CPUs, totalling 20 cores;
\item{} an x86-64 AMD \texttt{ZEN1}-based machine, with a single AMD EPYC 7401P
CPU with 24 cores;
\item{} an ARMv8a Raspberry Pi 4 with 4 Cortex A72 cores.
\end{itemize}
As \iaca{} only supports Intel CPUs, and \uopsinfo{} only supports x86-64
machines and gives only very rough
information for \texttt{ZEN} architectures ---~without port mapping~---, these
two tools were only tested on the \texttt{SKL-SP} machine.
\medskip{}
The evaluation metrics for all three architectures and all five tools are
presented in \autoref{table:palmed_eval}. We further represent IPC prediction
accuracy as heatmaps in \autoref{fig:palmed_heatmaps}. A dark area at
coordinates $(x, y)$ means that, for a significant number of microkernels with
a measured IPC of $x$, the selected tool has a prediction accuracy of $y$. The
closer a prediction is to the red horizontal line, the more accurate it is.
These results are analyzed in the full article~\cite{palmed}.
\section{Other contributions}
\paragraph{Using a database to enhance reproducibility and usability.}
\palmed{}'s method is driven by a large number of \pipedream{} benchmarks. For
instance, generating a mapping for an x86-64 machine requires the execution of
about $10^6$ benchmarks on the CPU\@.
Each of these measures takes time: the multiset of instructions must be
transformed into assembly code, including the register mapping phase; this
assembly must be assembled and linked into an ELF file; and finally, the
benchmark must actually be executed, with multiple warm-up rounds and multiple
measures. On average, on the \texttt{SKL-SP} CPU, each benchmark requires half
to two-thirds of a second on a single core. The whole benchmarking phase on
the \texttt{SKL-SP} processor took roughly eight hours.
\medskip{}
As \palmed{} relies on the Gurobi optimizer, which is itself non-deterministic,
\palmed{} cannot be made truly reproducible. Furthermore, the slight
fluctuations in measured cycles between two executions of a benchmark are
another source of non-determinism in the execution of \palmed{}.
\medskip{}
For both these reasons, we implemented in \palmed{} a database-backed storage
of measurements. Whenever \palmed{} needs to measure a kernel, it first looks
for a corresponding measure in the database; if the measure does not exist
yet, the kernel is benchmarked and the result is stored in the database. A
minimal sketch of this lookup-or-measure logic is given after the list below.
For each measure, we further store for context:
\begin{itemize}
\item{} the time and date at which the measure was made;
\item{} the machine on which the measure was made;
\item{} how many times the measure was repeated;
\item{} how many warm-up rounds were performed;
\item{} how many instructions were in the unrolled loop;
\item{} how many instructions were executed per repetition in total;
\item{} the parameters for \pipedream{}'s assembly generation procedure;
\item{} how the final result was aggregated from the repeated measures;
\item{} the variance of the set of measures;
\item{} how many CPU cores were active when the measure was made;
\item{} which CPU core was used for this measure;
\item{} whether the kernel's scheduler was set to FIFO mode.
\end{itemize}
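
The sketch below illustrates the lookup-or-measure logic with
\texttt{sqlite3}; the actual storage backend, schema and function names used
in \palmed{} differ, and those below are purely illustrative:
\begin{verbatim}
# Hypothetical measurement cache: look up a kernel's measured IPC in a local
# database and only run the benchmark when the measure is missing.
import sqlite3

def get_or_measure(db, kernel_key, run_benchmark):
    """`kernel_key` identifies the kernel; `run_benchmark` measures its IPC."""
    db.execute("""CREATE TABLE IF NOT EXISTS measures
                  (kernel TEXT PRIMARY KEY, ipc REAL,
                   date TEXT DEFAULT CURRENT_TIMESTAMP)""")
    row = db.execute("SELECT ipc FROM measures WHERE kernel = ?",
                     (kernel_key,)).fetchone()
    if row is not None:
        return row[0]                    # cached measure: reuse it
    ipc = run_benchmark(kernel_key)      # otherwise, actually benchmark...
    db.execute("INSERT INTO measures(kernel, ipc) VALUES (?, ?)",
               (kernel_key, ipc))
    db.commit()                          # ...and store the result for reuse
    return ipc

db = sqlite3.connect("measures.sqlite")
\end{verbatim}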
\bigskip{}
We believe that, as a whole, the use of a database increases the usability of
\palmed{}: it is faster when some measures have already been made in the past,
and it recovers better upon error.
It also gives us better confidence in our results: we can easily archive and
back up our experimental data, and we can easily trace the origin of a measure
if needed. We can also reuse the exact same measures between two runs of
\palmed{}, to keep the results as consistent as possible.
\paragraph{General engineering contributions.} Apart from purely scientific
contributions, we worked on improving \palmed{} as a whole, from the
engineering point of view: code quality; reliable parallel measurements;
recovery upon error; logging; \ldots{} These improvements amount to about a
hundred merge-requests between \nderumig{} and myself.