\section{Benchmarking harness}\label{sec:bench_harness}

To compare full-kernel cycle measurements to throughput predictions on
individual basic blocks, we lift predictions by adding the weighted basic block
predictions:
\[
    \text{lifted\_pred}(\mathcal{K}) =
    \sum_{b \in \operatorname{BBs}(\mathcal{K})}
    \operatorname{occurrences}(b) \times \operatorname{pred}(b)
\]
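
As a purely illustrative instance (the numbers below are hypothetical and are
not taken from our measurements), consider a kernel $\mathcal{K}$ made of two
basic blocks: $b_1$, executed $1000$ times and predicted at $4$ cycles per
execution, and $b_2$, executed $10$ times and predicted at $12$ cycles per
execution. The lifted prediction would then be
\[
    \text{lifted\_pred}(\mathcal{K})
    = 1000 \times 4 + 10 \times 12
    = 4120 \text{ cycles}.
\]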

Our benchmarking harness works in three successive stages. It first
extracts the basic blocks constituting a computation kernel, and instruments it
to retrieve their respective occurrences in the original context. It then runs
all the studied tools on each basic block, while also running measurements on the
whole computation kernel. Finally, the block-level results are lifted to
kernel-level results thanks to the occurrences previously measured.

\subsection{Basic block extraction}\label{ssec:bb_extr}

Using the Capstone disassembler~\cite{tool:capstone}, we split the assembly
code at each control flow instruction (jump, call, return, \ldots) and each
jump site, as in \autoref{alg:bb_extr_procedure} from

To accurately obtain the occurrences of each basic block in the whole kernel's
execution, we then instrument it with \gdb{} by placing a breakpoint
at each basic block's first instruction in order to count the occurrences
of each basic block between two calls to the \perf{} counters\footnote{We
assume the program under analysis to be deterministic.}.  While this
instrumentation takes about 50 to 100$\times$ more time than a regular run, it
can safely be run in parallel, as the performance results are discarded.
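
The splitting rule described at the start of this subsection can be sketched
as follows. This is a minimal illustration and not the harness's actual code:
it assumes the kernel has already been disassembled into a list of
instructions, and the \texttt{Insn} record, the x86 mnemonic prefixes and the
function names are hypothetical choices made for this example. A new block
starts after each control flow instruction and at each jump target that lies
inside the kernel.

```python
from typing import List, NamedTuple, Optional


class Insn(NamedTuple):
    """One disassembled instruction (hypothetical record for this sketch)."""
    address: int
    size: int
    mnemonic: str
    target: Optional[int] = None  # statically known jump target, if any


# Assumption: x86 mnemonics; jumps start with "j", plus call/ret/loop variants.
CONTROL_FLOW_PREFIXES = ("j", "call", "ret", "loop")


def is_control_flow(insn: Insn) -> bool:
    return insn.mnemonic.startswith(CONTROL_FLOW_PREFIXES)


def split_basic_blocks(insns: List[Insn]) -> List[List[Insn]]:
    """Split a linear instruction list into basic blocks."""
    if not insns:
        return []
    addresses = {insn.address for insn in insns}
    # A "leader" is the address of the first instruction of a basic block.
    leaders = {insns[0].address}
    for insn in insns:
        if is_control_flow(insn):
            # The instruction following a control flow instruction starts a block.
            leaders.add(insn.address + insn.size)
            # A jump site inside the kernel also starts a block.
            if insn.target is not None and insn.target in addresses:
                leaders.add(insn.target)
    blocks: List[List[Insn]] = []
    current: List[Insn] = []
    for insn in insns:
        if insn.address in leaders and current:
            blocks.append(current)
            current = []
        current.append(insn)
    if current:
        blocks.append(current)
    return blocks
```

In this sketch, a backward jump into the middle of a straight-line sequence
splits that sequence in two, so that each resulting block has a single entry
point whose occurrences can be counted with one breakpoint.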

\subsection{Throughput predictions and measures}\label{ssec:throughput_pred_meas}

The harness leverages a variety of tools: actual CPU measurement; the \bhive{}
basic block profiler~\cite{bhive}; \llvmmca~\cite{llvm-mca}, \uica~\cite{uica}
and \iaca~\cite{iaca}, which leverage microarchitectural
models to predict a block's throughput; \ithemal~\cite{ithemal}, a machine
learning model; and \gus~\cite{phd:gruber}, a dynamic analyzer based on \qemu{}
that works at the whole binary level.

The execution time of the full kernel is measured using Linux
\perf~\cite{tool:perf} CPU counters around the full computation kernel. The
measurement is repeated four times and the smallest value is kept; this ensures that the
cache is warm and compensates for context switching or other measurement
artifacts. \gus{} instruments the whole function body. The other tools included
all work at basic block level; these are run on each basic block of each

We emphasize the importance, throughout the whole evaluation chain, of keeping the
exact same assembled binary. Indeed, recompiling the kernel from source
\emph{cannot} be assumed to produce the same assembly kernel. This is even more
important in the presence of slight changes: for instance, inserting \iaca{}
markers at the C-level ---~as is intended~--- around the kernel \emph{might}
change the compiled kernel, if only for alignment regions. We argue that, in
the case of \iaca{} markers, the problem is even more critical, as those
markers prevent a binary from being run by overwriting registers with arbitrary
values. This forces a user to run and measure a version which is different from
the analyzed one. In our harness, we circumvent this issue by adding markers
directly at the assembly level, editing the already compiled version.  Our
\gdb{} instrumentation procedure also respects this principle of
single-compilation. As \qemu{} breaks the \perf{} interface, we have to run
\gus{} with a preloaded stub shared library to be able to instrument binaries
containing calls to \perf.
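
The repeat-and-keep-the-minimum policy used for the full-kernel measurement,
described earlier in this subsection, can be sketched as follows. The
\texttt{run\_once} callable is a hypothetical stand-in for one execution of the
kernel bracketed by counter reads; it is not part of the \perf{} interface.

```python
from typing import Callable


def measure_min(run_once: Callable[[], int], repetitions: int = 4) -> int:
    """Run the measurement several times and keep the smallest cycle count.

    `run_once` is assumed to execute the kernel once and return the number
    of cycles elapsed between the two counter reads.
    """
    if repetitions < 1:
        raise ValueError("at least one repetition is required")
    return min(run_once() for _ in range(repetitions))
```

Keeping the minimum rather than the mean discards runs inflated by a cold
cache or by a context switch, which can only add cycles to a measurement.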

\subsection{Prediction lifting and filtering}\label{ssec:harness_lifting}

We finally lift single basic block predictions to a whole-kernel cycle
prediction by summing the block-level results, weighted by the occurrences of
the basic block in the original context (formula above). If an analyzer fails
on one of the basic blocks of a benchmark, the whole benchmark is discarded for
this analyzer.
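
The lifting and discarding rules of the paragraph above can be sketched as
follows. This is a minimal illustration under our own naming assumptions: the
function name, the dictionary-based inputs and the use of \texttt{None} to
encode an analyzer failure are hypothetical and do not describe the harness's
actual data structures.

```python
from typing import Dict, Optional


def lift_prediction(
    occurrences: Dict[str, int],
    block_pred: Dict[str, Optional[float]],
) -> Optional[float]:
    """Lift per-block predictions to a whole-kernel cycle prediction.

    `occurrences` maps each basic block of the kernel to its execution count;
    `block_pred` maps each block to the analyzer's prediction in cycles, or
    to None (or no entry at all) when the analyzer failed on that block.
    Returns None when the benchmark must be discarded for this analyzer.
    """
    total = 0.0
    for block, count in occurrences.items():
        pred = block_pred.get(block)
        if pred is None:
            # One failed block discards the whole benchmark for this analyzer.
            return None
        total += count * pred
    return total
```

For instance, with the hypothetical inputs
\texttt{\{"bb0": 1000, "bb1": 10\}} and \texttt{\{"bb0": 4.0, "bb1": 12.0\}},
the function returns \texttt{4120.0}; if the prediction for \texttt{bb1} is
missing, it returns \texttt{None}.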

In the presence of complex control flow, \eg{} with conditionals inside loops,
our approach based on basic block occurrences is arguably less precise than an
approach based on path occurrences, as we have less information available
---~for instance, whether a branch is taken with a regular pattern, whether we
have constraints on register values, etc. We however chose this block-based
approach, as most throughput prediction tools work at the basic block level, and are
thus readily available and can be directly plugged into our harness.

Finally, we check the proportion of cache misses in the program's execution
using \texttt{Cachegrind}~\cite{tool:valgrind} and \gus; programs that have more
than 15\,\% of cache misses on a warm cache are not considered L1-resident and
are discarded.
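
This filter can be sketched as follows, assuming that the proportion of cache
misses is computed as the number of misses divided by the number of cache
accesses reported by the tool; this definition, as well as the function and
parameter names, are assumptions made for the example.

```python
def is_l1_resident(accesses: int, misses: int, threshold: float = 0.15) -> bool:
    """Return True if a program is kept, i.e. considered L1-resident.

    Assumption: the miss proportion is misses / accesses on a warm cache.
    A program is discarded only when this proportion exceeds the threshold.
    """
    if accesses == 0:
        # No cache access at all: nothing can miss.
        return True
    return misses / accesses <= threshold
```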