\section{Main contribution: evaluating \palmed{}}\label{sec:palmed_results}

The main contribution I made to \palmed{} is its evaluation harness and
procedure. \todo{}
\subsection{Basic blocks from benchmark suites}

Models generated by \palmed{} are meant to be used on basic blocks that are
computationally intensive ---~so that the backend is actually the relevant
resource to monitor, compared to \eg{} frontend- or input/output-bound code~---,
running in steady-state ---~that is, bodies of loops long enough to be
reasonably considered infinite for performance modelling purposes. The basic
blocks used to evaluate \palmed{} should thus match these criteria reasonably
well.

Some tools, such as \pmevo{}~\cite{PMEvo}, use randomly-sampled basic blocks
for their evaluation. This approach, however, may yield basic blocks that do
not meet these criteria; furthermore, such blocks may not be representative of
the real-life code on which users expect the tool to be accurate.

For this reason, we evaluate \palmed{} on basic blocks extracted from
two well-known benchmark suites instead: Polybench and SPEC CPU 2017.
\paragraph{Polybench} is a suite of benchmarks built out of 30 kernels of
numerical computation~\cite{bench:polybench}. Its benchmarks are
domain-specific and centered around scientific computation, mathematical
computation, image processing, etc. As the computation kernels are
clearly identifiable in the source code, extracting the relevant basic blocks
is easy, which suits our purpose well. It is written in C. Although
it is not under a free/libre software license, it is free to use and
open-source.

We compile multiple versions of each benchmark (\texttt{-O2}, \texttt{-O3} and
tiled using the Pluto optimizer~\cite{tool:pluto}), then extract the basic
blocks corresponding to the benchmarks' kernels using \qemu~\cite{tool:qemu},
gathering translation blocks and occurrence statistics.
\paragraph{SPEC CPU 2017} is a suite of benchmarks meant to be CPU
intensive~\cite{bench:spec}. It is composed of both integer and floating-point
based benchmarks, extracted from (mainly open-source) real-world software, such
as \texttt{gcc}, \texttt{imagemagick}, \ldots{} Its main purpose is to obtain
metrics and compare CPUs on a unified workload; it is however commonly used
throughout the literature to evaluate compilers, optimizers, code analyzers,
etc. It is split into four variants: integer and floating-point, combined with
speed ---~time to perform a single task~--- and rate ---~throughput for
performing a flow of tasks. Most benchmarks exist in both speed and rate mode.
The SPEC suite is under a paid license, and cannot be redistributed, which
makes peer-review and replication of experiments ---~\eg{} for artifact
review~--- complicated.

In the case of SPEC, there is no clearly identifiable kernel for each
benchmark; extracting basic blocks to evaluate \palmed{} is thus not trivial.
We manually extract the relevant basic blocks using a profiling-based approach
with Linux \perf{}, as the \qemu{}-based solution used for Polybench would be
too costly for SPEC\@.

\bigskip{}
Altogether, this method generates, for x86-64 processors, 13\,778 SPEC-based
and 2\,664 Polybench-based basic blocks.

\medskip{}
We automate and describe in detail the \perf{}-based method later in
\qtodo{ref}; this automation also allows us to extract basic blocks
from these same benchmark suites compiled for ARMv8a.
\subsection{Evaluation harness}

We implement in \palmed{} an evaluation harness to evaluate it against both
native measurements and other code analyzers.

Using \pipedream{}, we first strip each gathered basic block of its
dependencies so that it falls within the use case of \palmed{}, as we did
previously. This yields assembly code that can be run and measured natively.
The body of the innermost loop can also be used as an assembly basic block for
other code analyzers.
However, as \pipedream{}
does not support some instructions (control flow, x86-64 divisions, \ldots),
those are stripped from the original kernel, which might distort the original
basic block.

To evaluate \palmed{}, the same kernel is run:
\begin{enumerate}

\item{} natively on each CPU, using the \pipedream{} harness to measure its
execution time;

\item{} using the resource mapping \palmed{} produced on the evaluation machine;

\item{} using the \uopsinfo{}~\cite{uopsinfo} port mapping, converted to its
equivalent conjunctive resource mapping\footnote{When this evaluation was
made, \uica{}~\cite{uica} was not yet published. Since \palmed{} provides a
resource mapping, the comparison to \uopsinfo{} is fair.};

\item{} using \pmevo~\cite{PMEvo}, ignoring any instruction not supported by
its provided mapping;

\item{} using \iaca~\cite{iaca}, by inserting assembly markers around the
kernel and running the tool;

\item{} using \llvmmca~\cite{llvm-mca}, by inserting markers in the
\pipedream{}-generated assembly code and running the tool.

\end{enumerate}

The raw results are saved (as a Python \pymodule{pickle} file) for reuse and
archival.
\subsection{Metrics extracted}

As \palmed{} internally works with Instructions Per Cycle (IPC) metrics, and as
all these tools are also able to provide results in IPC, the most natural
metric to evaluate is the error on the predicted IPC. We measure this as a
Root-Mean-Square (RMS) error over all basic blocks considered, weighted by each
basic block's measured occurrences:

\[ \text{Err}_{\text{RMS, tool}} = \sqrt{\sum_{i \in \text{BBs}}
    \frac{\text{weight}_i}{\sum_j \text{weight}_j} \left(
        \frac{\text{IPC}_{i,\text{tool}} - \text{IPC}_{i,\text{native}}}{\text{IPC}_{i,\text{native}}}
    \right)^2
}
\]

\medskip{}
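As a worked illustration of the weighted RMS error, consider two basic blocks
with purely hypothetical values (not taken from our measurements): weights $3$
and $1$, native IPCs $2.0$ and $4.0$, and predicted IPCs $1.8$ and $5.0$. Their
relative errors are $-0.1$ and $0.25$, which gives

\[ \text{Err}_{\text{RMS}}
    = \sqrt{\frac{3}{4} \left(-0.1\right)^2 + \frac{1}{4} \left(0.25\right)^2}
    = \sqrt{0.0075 + 0.015625}
    \approx 0.152
\]

that is, an error of about 15\,\%. Although the second block accounts for only
a quarter of the weight, its larger relative error makes it contribute about
two thirds of the sum under the square root.

\medskip{}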
This error metric measures the relative deviation of predictions with respect
to a baseline. However, depending on how this prediction is used, the relative
\emph{ordering} of predictions ---~that is, which basic block is faster~---
might be more important. For instance, a compiler might use such models for
code selection; here, the goal would not be to predict the performance of the
kernel selected, but to accurately pick the fastest.

For this, we also provide Kendall's $\tau$ coefficient~\cite{kendalltau}. This
coefficient varies between $-1$ (full anti-correlation) and $1$ (full
correlation), and measures how many pairs of basic blocks $(i, j)$ were
correctly ordered by a tool, that is, whether

\[
    \text{IPC}_{i,\text{native}} \leq \text{IPC}_{j,\text{native}}
    \iff
    \text{IPC}_{i,\text{tool}} \leq \text{IPC}_{j,\text{tool}}
\]

\medskip{}
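One common way to turn this pairwise criterion into a single coefficient is the
tie-free variant of Kendall's $\tau$, sketched here; which variant is actually
used is not stated in this section, so this choice is an assumption. Writing
$C$ for the number of correctly ordered pairs and $D$ for the number of
incorrectly ordered pairs among $n$ basic blocks,

\[
    \tau = \frac{C - D}{n (n - 1) / 2}
\]

For instance, with three hypothetical basic blocks of native IPC $1$, $2$ and
$3$, and predicted IPC $1.2$, $3.1$ and $2.5$, the pairs $(1, 2)$ and $(1, 3)$
are correctly ordered while $(2, 3)$ is not; hence $C = 2$, $D = 1$ and
$\tau = (2 - 1) / 3 = 1/3$.

\medskip{}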
Finally, we also provide a \emph{coverage} metric for each tool; that is,
the proportion of basic blocks it was able to process.

The definition of \emph{able to process}, however, varies from tool to tool.
For \iaca{} and \llvmmca{}, a basic block is considered unprocessed when the
analyzer crashed or ended without yielding a result. For \uopsinfo{}, it is
considered unprocessed when one of its instructions is absent from the port
mapping. \pmevo{}, however, is evaluated in degraded mode when instructions are
not mapped, simply ignoring them; it is considered as failed only when \emph{no
instruction at all} in the basic block was present in the model.

This notion of coverage is biased towards \palmed{}. As we use \pipedream{} as
a baseline measurement, instructions that cannot be benchmarked by \pipedream{}
are pruned from the benchmarks; hence, \palmed{} has a 100\,\% coverage
\emph{by construction} ---~which does not mean that it supports all the
instructions found in the original basic blocks.
\subsection{Results}

\input{40-1_results_fig.tex}

We run the evaluation harness on three different machines:
\begin{itemize}
    \item{} an x86-64 Intel \texttt{SKL-SP}-based machine, with two Intel Xeon Silver
        4114 CPUs, totalling 20 cores;
    \item{} an x86-64 AMD \texttt{ZEN1}-based machine, with a single AMD EPYC 7401P
        CPU with 24 cores;
    \item{} an ARMv8a Raspberry Pi 4 with 4 Cortex A72 cores.
\end{itemize}

As \iaca{} only supports Intel CPUs, and \uopsinfo{} only supports x86-64
machines and gives only very rough
information for \texttt{ZEN} architectures ---~without port mapping~---, these
two tools were only tested on the \texttt{SKL-SP} machine.

\medskip{}
The evaluation metrics for all three architectures and all five tools are
presented in \autoref{table:palmed_eval}. We further represent IPC prediction
accuracy as heatmaps in \autoref{fig:palmed_heatmaps}. A dark area at
coordinate $(x, y)$ means that the selected tool has a prediction accuracy of
$y$ for a significant number of microkernels with a measured IPC of $x$. The
closer a prediction is to the red horizontal line, the more accurate it is.

These results are analyzed in the full article~\cite{palmed}.
\section{Other contributions}

\paragraph{Using a database to enhance reproducibility and usability.}
\palmed{}'s method is driven by a large number of \pipedream{} benchmarks. For
instance, generating a mapping for an x86-64 machine requires the execution of
about $10^6$ benchmarks on the CPU\@.

Each of these measures takes time: the multiset of instructions must be
transformed into assembly code, including the register mapping phase; this
assembly must be assembled and linked into an ELF file; and finally, the
benchmark must actually be executed, with multiple warm-up rounds and multiple
measures. On average, on the \texttt{SKL-SP} CPU, each benchmark requires half
to two-thirds of a second on a single core. The whole benchmarking phase, on
the \texttt{SKL-SP} processor, took roughly eight hours.

\medskip{}
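As a rough consistency check of these figures, assume that the benchmarks are
spread evenly over the 20 cores of the \texttt{SKL-SP} machine; this is an
assumption, as the degree of parallelism actually used is not stated here. The
total single-core time then lies between

\[
    10^6 \times 0.5\,\text{s} \approx 139\,\text{h}
    \qquad \text{and} \qquad
    10^6 \times \tfrac{2}{3}\,\text{s} \approx 185\,\text{h}
\]

which, divided by 20 cores, amounts to roughly 7 to 9 hours of wall-clock time,
in line with the eight hours reported above.

\medskip{}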
As \palmed{} relies on the Gurobi optimizer, which is itself non-deterministic,
\palmed{} cannot be made truly reproducible. Moreover, the slight fluctuations
in measured cycles between two executions of a benchmark are also a source of
non-determinism in the execution of \palmed{}.

\medskip{}
For both these reasons, we implemented in \palmed{} a database-backed storage of
measurements. Whenever \palmed{} needs to measure a kernel, it will first try
to find a corresponding measure in the database; if the measure does not exist
yet, it will be run, then stored in the database.

For each measure, we further store for context:
\begin{itemize}
    \item{} the time and date at which the measure was made;
    \item{} the machine on which the measure was made;
    \item{} how many times the measure was repeated;
    \item{} how many warm-up rounds were performed;
    \item{} how many instructions were in the unrolled loop;
    \item{} how many instructions were executed per repetition in total;
    \item{} the parameters for \pipedream{}'s assembly generation procedure;
    \item{} how the final result was aggregated from the repeated measures;
    \item{} the variance of the set of measures;
    \item{} how many CPU cores were active when the measure was made;
    \item{} which CPU core was used for this measure;
    \item{} whether the kernel's scheduler was set to FIFO mode.
\end{itemize}

\bigskip{}
We believe that, as a whole, the use of a database increases the usability of
\palmed{}: it is faster if some measures were already made in the past and
recovers better upon error.

This also gives us greater confidence in our results: we can easily
archive and back up our experimental data, and we can easily trace the origin of
a measure if needed. We can also reuse the exact same measures between two runs
of \palmed{}, to ensure that the results are as consistent as possible.
\paragraph{General engineering contributions.} Apart from purely scientific
contributions, we worked on improving \palmed{} as a whole, from the
engineering point of view: code quality; reliable parallel measurements;
recovery upon error; logging; \ldots{} These improvements amount to about a
hundred merge requests between \nderumig{} and myself.