\section{Generating microbenchmarks}\label{sec:bench_gen}
Our framework aims to generate \emph{microbenchmarks} relevant to a specific
domain.
A microbenchmark is a piece of code that is simplified as much as possible
while still exposing the behaviour under consideration.
The computations should be representative of the considered domain,
and at the same time they should stress the different aspects of the
target architecture ---~which is modeled by code analyzers.

In practice, a microbenchmark's \textit{computational kernel} is a simple
\texttt{for} loop whose
body contains no loops and whose bounds are statically known.
A \emph{measure} consists of $n$ repetitions of this computational
kernel, $n$ being a user-specified parameter.
The measure may be repeated an arbitrary number of times to improve
stability.
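
As an illustrative sketch (the kernel, array size and element type are made up
for the example, not taken from our framework), such a microbenchmark could
look like:

```c
#include <stddef.h>

#define SIZE 256 /* statically known bound, small enough to stay L1-resident */

/* Hypothetical computational kernel: a single for loop, no nested loops,
 * statically known bounds. */
static void kernel(float a[SIZE], const float b[SIZE], const float c[SIZE])
{
    for (size_t i = 0; i < SIZE; i++)
        a[i] = b[i] * c[i] + a[i];
}

/* A measure: n user-specified repetitions of the computational kernel. */
void measure(size_t n, float a[SIZE], const float b[SIZE], const float c[SIZE])
{
    for (size_t rep = 0; rep < n; rep++)
        kernel(a, b, c);
}
```

The outer repetition loop amplifies the kernel's runtime so that it dominates
measurement overhead; the measure itself may then be repeated for stability.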

Furthermore, such a microbenchmark should be a function whose computation
happens without leaving the L1 cache.
This requirement keeps measurements and analyses
undisturbed by memory accesses, but it is also a matter of comparability.
Indeed, most static analyzers assume that the code under
consideration is L1-resident; if it is not, their results are meaningless and
cannot be compared with an actual measurement.
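
As a back-of-the-envelope illustration of this constraint (the 32\,KiB L1d
capacity is an assumption typical of many x86 cores, not a parameter of our
framework):

```c
#include <stddef.h>

/* Rough check that a kernel's working set fits in the L1 data cache.
 * The 32 KiB capacity is an assumption (typical of many x86 cores),
 * not a value taken from the framework. */
enum { L1D_BYTES = 32 * 1024 };

int fits_in_l1d(size_t n_arrays, size_t n_elems, size_t elem_size)
{
    return n_arrays * n_elems * elem_size <= (size_t)L1D_BYTES;
}
```

For instance, three 256-element \texttt{float} arrays occupy 3\,KiB and fit
comfortably, whereas three 4096-element \texttt{double} arrays (96\,KiB) do
not.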

The generation of such microbenchmarks is achieved through four distinct
components, whose parameter variations are specified in configuration files:
a benchmark suite, C-to-C loop nest optimizers, a constraining utility
and a C-to-binary compiler.

\subsection{Benchmark suite}\label{ssec:bench_suite}
Our first component is an initial set of benchmarks which materializes
the human expertise we intend to exploit for the generation of relevant codes.
The considered suite must embed computational kernels
delimited by ad-hoc \texttt{\#pragma}s,
whose arrays are accessed
directly (no indirections) and whose loops are affine.
These constraints are necessary to ensure that the microkernelification phase,
presented below, generates segfault-free code.
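
A kernel satisfying these constraints might look like the following sketch
(Polybench delimits its kernels with \texttt{\#pragma scop} /
\texttt{\#pragma endscop}; the function and sizes here are illustrative):

```c
#define N 64

/* Illustrative matrix-vector product: affine loop bounds, direct array
 * accesses (no indirection), delimited by the suite's ad-hoc pragmas. */
void kernel_mv(double A[N][N], double x[N], double y[N])
{
#pragma scop
    for (int i = 0; i < N; i++) {
        y[i] = 0.0;
        for (int j = 0; j < N; j++)
            y[i] += A[i][j] * x[j];
    }
#pragma endscop
}
```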

In this case, we use Polybench~\cite{bench:polybench}, a suite of 30
benchmarks for polyhedral compilation ---~of which we use only 26. The
\texttt{nussinov}, \texttt{ludcmp} and \texttt{deriche} benchmarks are
removed because they are incompatible with PoCC (introduced below). The
\texttt{lu} benchmark is left out as its execution alone takes longer than all
the others together, making its dynamic analysis (\eg{} with \gus) impractical.
Besides its focus on linear algebra, an important feature of Polybench
is that it does not include computational
kernels with conditional control flow (\eg{} \texttt{if-then-else})
---~it does, however, include conditional data flow, using the ternary
conditional operator of C.

\subsection{C-to-C loop nest optimizers}\label{ssec:loop_nest_optimizer}
Loop nest optimizers transform an initial benchmark in different ways
(generating different \textit{versions} of the same benchmark), varying the
stress on the resources of the target architecture, and by extension on the
models on which the static analyzers are based.

In this case, we chose to use the
\textsc{Pluto}~\cite{tool:pluto} and PoCC~\cite{tool:pocc} polyhedral
compilers, to easily access common loop nest optimizations: register tiling,
tiling,
skewing, vectorization/simdization, loop unrolling, loop permutation and
loop fusion.
These transformations are meant to maximize variety within the initial
benchmark suite. Eventually, the generated benchmarks are expected to
highlight the impact of the resulting behaviours on performance.
For instance, \textit{skewing} introduces non-trivial pointer arithmetic,
increasing the pressure on address computation units; \textit{loop unrolling},
among other things, opens the way to register promotion, which exposes
dependencies and relieves load-store units;
\textit{vectorization} stresses SIMD units and decreases
pressure on the front-end; and so on.
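
To illustrate one of these transformations, here is a hand-written sketch of
4-way loop unrolling applied to a simple kernel (the kernel is made up for
the example; in our framework the transformations are performed automatically
by \textsc{Pluto} and PoCC):

```c
#define SIZE 256 /* assumed divisible by the unrolling factor */

/* Original loop. */
void saxpy(float y[SIZE], const float x[SIZE], float a)
{
    for (int i = 0; i < SIZE; i++)
        y[i] = a * x[i] + y[i];
}

/* The same loop after 4-way unrolling: the wider body lets the compiler
 * promote reuse into registers and reduces per-iteration branch overhead
 * on the front-end. */
void saxpy_unroll4(float y[SIZE], const float x[SIZE], float a)
{
    for (int i = 0; i < SIZE; i += 4) {
        y[i]     = a * x[i]     + y[i];
        y[i + 1] = a * x[i + 1] + y[i + 1];
        y[i + 2] = a * x[i + 2] + y[i + 2];
        y[i + 3] = a * x[i + 3] + y[i + 3];
    }
}
```

Both versions compute the same result; only the stress they place on the
microarchitecture differs.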

\subsection{Constraining utility}\label{ssec:kernelify}

A constraining utility transforms the code so that it respects an arbitrary
number of non-functional properties.
In this case, we apply a pass of \emph{microkernelification}: we
extract a computational kernel from the arbitrarily deep and arbitrarily
long loop nest generated by the previous component.
The loop chosen to form the microkernel is the one considered to be
the \textit{hottest}, the \textit{hotness} of a loop being obtained by
multiplying the number of arithmetic operations it contains by the number of
times it is iterated. This metric allows us to prioritize the parts of the
code that have the greatest impact on performance.
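
The selection can be sketched as follows (the \texttt{loop\_info} structure
and the counts are illustrative, not the framework's actual representation):

```c
#include <stddef.h>

/* Hotness metric: arithmetic operations in the loop body multiplied by
 * the loop's total iteration count. The loop_info type is hypothetical. */
typedef struct {
    unsigned long arith_ops;  /* arithmetic operations in the body     */
    unsigned long iterations; /* number of times the loop is iterated  */
} loop_info;

unsigned long hotness(const loop_info *l)
{
    return l->arith_ops * l->iterations;
}

/* The microkernel is extracted from the hottest loop of the nest. */
const loop_info *hottest(const loop_info *loops, size_t n)
{
    const loop_info *best = &loops[0];
    for (size_t i = 1; i < n; i++)
        if (hotness(&loops[i]) > hotness(best))
            best = &loops[i];
    return best;
}
```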

At this point, the resulting code can
compute a different result from the initial code;
for instance, the composition of tiling and
kernelification reduces the number of loop iterations.
Indeed, our framework is not meant to preserve the
functional semantics of the benchmarks.
Our goal is only to generate codes that are relevant from the point of view of
performance analysis.

\subsection{C-to-binary compiler}\label{ssec:compile}

A C-to-binary compiler varies binary optimization options by
enabling/disabling auto-vectorization, extended instruction
sets, \textit{etc}. We use \texttt{gcc}.

\bigskip

Eventually, the relevance of the set of microbenchmarks generated with this
approach derives not only from the initial benchmark suite and the relevance
of the transformations chosen at each
stage, but also from the combinatorial explosion generated by the composition
of the four stages. In our experimental setup, this yields up to 144
microbenchmarks per benchmark of the original suite.