\section{Generating microbenchmarks}\label{sec:bench_gen}
Our framework aims to generate \emph{microbenchmarks} relevant to a specific
domain.
A microbenchmark is a piece of code simplified as much as possible while
still exposing the behaviour under consideration.
The specified computations should be representative of the considered domain,
and at the same time they should stress the different aspects of the
target architecture ---~which is modeled by code analyzers.
In practice, a microbenchmark's \textit{computational kernel} is a simple
\texttt{for} loop, whose
body contains no loops and whose bounds are statically known.
A \emph{measure} is a number of repetitions $n$ of this computational
kernel, $n$ being a user-specified parameter.
The measure may be repeated an arbitrary number of times to improve
stability.
Furthermore, such a microbenchmark should be a function whose computation
happens without leaving the L1 cache.
This requirement keeps measurements and analyses
undisturbed by memory accesses beyond the L1 cache, but it is also a matter
of comparability.
Indeed, most static analyzers assume that the code under
consideration is L1-resident; if it is not, their results are meaningless and
cannot be compared with an actual measurement.
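As an illustration, such a microbenchmark could take the following shape (a
minimal sketch: the kernel body, the identifiers and the size \texttt{N} are
hypothetical, \texttt{N} being chosen small enough for the arrays to stay
L1-resident):
\begin{verbatim}
/* Hypothetical computational kernel: a single for loop with a
 * loop-free body and statically known bounds. */
#define N 256   /* three arrays of 256 floats: 3 KiB, L1-resident */

void kernel(float a[N], const float b[N], const float c[N]) {
    for (int i = 0; i < N; i++)
        a[i] += b[i] * c[i];
}

/* A measure: n repetitions of the computational kernel. */
void measure(float a[N], float b[N], float c[N], int n) {
    for (int rep = 0; rep < n; rep++)
        kernel(a, b, c);
}
\end{verbatim}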
The generation of such microbenchmarks is achieved through four distinct
components, whose parameter variations are specified in configuration files:
a benchmark suite, C-to-C loop nest optimizers, a constraining utility
and a C-to-binary compiler.
\subsection{Benchmark suite}\label{ssec:bench_suite}
Our first component is an initial set of benchmarks which materializes
the human expertise we intend to exploit for the generation of relevant codes.
The considered suite must embed computation kernels
delimited by ad-hoc \texttt{\#pragma}s,
whose arrays are accessed
directly (no indirections) and whose loops are affine.
These constraints are necessary to ensure that the microkernelification phase,
presented below, generates segfault-free code.
In this case, we use Polybench~\cite{bench:polybench}, a suite of 30
benchmarks for polyhedral compilation, of which we keep only 26. The
\texttt{nussinov}, \texttt{ludcmp} and \texttt{deriche} benchmarks are
removed because they are incompatible with PoCC (introduced below). The
\texttt{lu} benchmark is left out as its execution alone takes longer than all
the others together, making its dynamic analysis (\eg{} with \gus) impractical.
Beyond its focus on linear algebra, an important feature of this suite is
that it contains no computational kernels with conditional control flow
(\eg{} \texttt{if-then-else}); it does, however, include conditional data
flow, using the ternary conditional operator of C.
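By way of example, a kernel satisfying the constraints above (delimiting
\texttt{\#pragma}s, direct array accesses, affine loop bounds) has the
following shape; this sketch is freely adapted from the suite's \texttt{gemm}
benchmark and simplified here:
\begin{verbatim}
#pragma scop
for (int i = 0; i < NI; i++)
    for (int j = 0; j < NJ; j++) {
        C[i][j] *= beta;
        for (int k = 0; k < NK; k++)
            C[i][j] += alpha * A[i][k] * B[k][j];
    }
#pragma endscop
\end{verbatim}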
\subsection{C-to-C loop nest optimizers}\label{ssec:loop_nest_optimizer}
Loop nest optimizers transform an initial benchmark in different ways
(generating different \textit{versions} of the same benchmark), varying the
stress on the resources of the target architecture and, by extension, on the
models on which the static analyzers are based.
In this case, we chose to use the
\textsc{Pluto}~\cite{tool:pluto} and PoCC~\cite{tool:pocc} polyhedral
compilers to easily access common loop nest optimizations: register tiling,
tiling, skewing, vectorization/simdization, loop unrolling, loop permutation
and loop fusion.
These transformations are meant to maximize variety within the initial
benchmark suite. Ultimately, the generated benchmarks are expected to
highlight the performance impact of the resulting behaviours.
For instance, \textit{skewing} introduces non-trivial pointer arithmetic,
increasing the pressure on address computation units; \textit{loop unrolling},
among other things, opens the way to register promotion, which exposes
dependencies and relieves pressure on the load-store units;
\textit{vectorization} stresses the SIMD units and decreases
pressure on the front-end; and so on.
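As a hypothetical illustration of the kind of code such transformations
produce, unrolling a reduction loop by a factor of four (assuming \texttt{N}
is a multiple of four) could yield:
\begin{verbatim}
/* Original loop. */
for (int i = 0; i < N; i++)
    s += a[i] * b[i];

/* Unrolled by a factor of 4: fewer branches, and the body now
 * exposes several independent multiplications to the back-end. */
for (int i = 0; i < N; i += 4) {
    s += a[i]     * b[i];
    s += a[i + 1] * b[i + 1];
    s += a[i + 2] * b[i + 2];
    s += a[i + 3] * b[i + 3];
}
\end{verbatim}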
\subsection{Constraining utility}\label{ssec:kernelify}
A constraining utility transforms the code so that it satisfies an arbitrary
number of non-functional properties.
In this case, we apply a pass of \emph{microkernelification}: we
extract a computational kernel from the arbitrarily deep and arbitrarily
long loop nest generated by the previous component.
The loop chosen to form the microkernel is the one considered to be
the \textit{hottest}; the \textit{hotness} of a loop being obtained by
multiplying the number of arithmetic operations it contains by the number of
times it is iterated. This metric allows us to prioritize the parts of the
code that have the greatest impact on performance.
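Denoting by $\mathrm{ops}(\ell)$ the number of arithmetic operations in the
body of a loop $\ell$ and by $\mathrm{trip}(\ell)$ the number of times it is
iterated (the notation is ours), this metric reads
\[
    \mathrm{hotness}(\ell) = \mathrm{ops}(\ell) \times \mathrm{trip}(\ell)\,;
\]
for instance, a loop containing four arithmetic operations and iterated $10^6$
times has a hotness of $4 \times 10^6$.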
At this point, the resulting code can
compute a different result from the initial code;
for instance, the composition of tiling and
kernelification reduces the number of loop iterations.
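For instance, in the following hypothetical sketch, a loop over \texttt{N}
elements has been tiled with a tile size \texttt{T} before the hottest loop is
extracted:
\begin{verbatim}
/* Tiled nest produced by the loop nest optimizer. */
for (int ii = 0; ii < N; ii += T)
    for (int i = ii; i < ii + T; i++)
        a[i] += b[i] * c[i];

/* Extracted microkernel: only the selected loop survives,
 * performing T iterations instead of the original N. */
for (int i = 0; i < T; i++)
    a[i] += b[i] * c[i];
\end{verbatim}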
Indeed, our framework is not meant to preserve the
functional semantics of the benchmarks.
Our goal is only to generate codes that are relevant from the point of view of
performance analysis.
\subsection{C-to-binary compiler}\label{ssec:compile}
A C-to-binary compiler varies binary optimization options by
enabling/disabling auto-vectorization, extended instruction
sets, \textit{etc}. We use \texttt{gcc}.
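Concretely, each version is compiled several times with varying sets of
flags; an illustrative (hypothetical) configuration could be:
\begin{verbatim}
gcc -c -O2 -fno-tree-vectorize kernel.c  # scalar baseline
gcc -c -O3 kernel.c                      # auto-vectorization enabled
gcc -c -O3 -march=native kernel.c        # extended instruction sets
\end{verbatim}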
\bigskip
Ultimately, the relevance of the set of microbenchmarks generated with this
approach derives not only from the initial benchmark suite and the relevance
of the transformations chosen at each stage, but also from the combinatorial
explosion resulting from the composition of the four stages. In our
experimental setup, this yields up to 144 microbenchmarks per benchmark of the
original suite.