\section{Benchmarking harness}\label{sec:bench_harness}

To compare full-kernel cycle measurements with throughput predictions made on
individual basic blocks, we lift the block-level predictions to the kernel
level by summing them, each weighted by the number of occurrences of its basic
block:

\[
    \text{lifted\_pred}(\mathcal{K}) =
    \sum_{b \in \operatorname{BBs}(\mathcal{K})}
    \operatorname{occurrences}(b) \times \operatorname{pred}(b)
\]

Our benchmarking harness works in three successive stages. It first
extracts the basic blocks constituting a computation kernel, and instruments
the kernel to count the occurrences of each block in the original context. It
then runs all the studied tools on each basic block, while also measuring the
whole computation kernel. Finally, the block-level results are lifted to
kernel-level results using the occurrences previously counted.

\subsection{Basic block extraction}\label{ssec:bb_extr}

Using the Capstone disassembler~\cite{tool:capstone}, we split the assembly
code at each control flow instruction (jump, call, return, \ldots) and each
jump site, as in \autoref{alg:bb_extr_procedure} from
\autoref{ssec:palmed_bb_extraction}.

To accurately obtain the occurrences of each basic block in the whole kernel's
computation, we then instrument the kernel with \gdb{}: we place a breakpoint
at the first instruction of each basic block and count how many times each
breakpoint is hit between two calls to the \perf{} counters\footnote{We
assume the program under analysis to be deterministic.}. While this
instrumentation takes about 50 to 100$\times$ more time than a regular run, it
can safely be run in parallel, as its performance results are discarded.

\subsection{Throughput predictions and measures}\label{ssec:throughput_pred_meas}

The harness leverages a variety of tools: actual CPU measurement; the \bhive{}
basic block profiler~\cite{bhive}; \llvmmca~\cite{llvm-mca}, \uica~\cite{uica}
and \iaca~\cite{iaca}, which rely on microarchitectural
models to predict a block's throughput; \ithemal~\cite{ithemal}, a machine
learning model; and \gus~\cite{phd:gruber}, a dynamic analyzer based on \qemu{}
that works at the whole binary level.

The execution time of the full kernel is measured using Linux
\perf~\cite{tool:perf} CPU counters placed around the full computation kernel.
The measurement is repeated four times and the smallest value is kept; this
ensures that the cache is warm and compensates for context switching or other
measurement artifacts. \gus{} instruments the whole function body. All the
other tools work at the basic block level; they are run on each basic block of
each benchmark.
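
As a sketch of the rule above, writing $t_1, \ldots, t_4$ for the four cycle
counts reported by \perf{} (a notation introduced here for illustration only),
the retained measurement is
\[
    \text{measured}(\mathcal{K}) = \min_{1 \le i \le 4} t_i
\]
For instance, hypothetical counts of $10\,400$, $10\,050$, $10\,020$ and
$10\,030$ cycles would yield $10\,020$ cycles; the first run, presumably made
on a cold cache, does not influence the result.
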

We emphasize the importance of keeping the exact same assembled binary
throughout the whole evaluation chain. Indeed, recompiling the kernel from
source \emph{cannot} be assumed to produce the same assembly kernel. This is
even more important in the presence of slight changes: for instance, inserting
\iaca{} markers at the C level, as is intended, around the kernel \emph{might}
change the compiled kernel, if only for alignment regions. We argue that, in
the case of \iaca{} markers, the problem is even more critical, as those
markers prevent a binary from being run by overwriting registers with arbitrary
values. This forces a user to run and measure a version which is different from
the analyzed one. In our harness, we circumvent this issue by adding markers
directly at the assembly level, editing the already compiled version. Our
\gdb{} instrumentation procedure also respects this principle of
single compilation. As \qemu{} breaks the \perf{} interface, we have to run
\gus{} with a preloaded stub shared library to be able to instrument binaries
containing calls to \perf.

\subsection{Prediction lifting and filtering}\label{ssec:harness_lifting}

We finally lift single basic block predictions to a whole-kernel cycle
prediction by summing the block-level results, weighted by the occurrences of
the basic block in the original context (see the formula at the beginning of
\autoref{sec:bench_harness}). If an analyzer fails on one of the basic blocks
of a benchmark, the whole benchmark is discarded for this analyzer.
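
As an illustration with made-up numbers, consider a hypothetical kernel
$\mathcal{K}$ made of a loop body $b_1$ executed $1000$ times and an epilogue
$b_2$ executed once. If an analyzer predicts $4$ cycles for $b_1$ and $10$
cycles for $b_2$, the lifted prediction is
\[
    \text{lifted\_pred}(\mathcal{K}) = 1000 \times 4 + 1 \times 10
    = 4010 \text{ cycles}
\]
which can then be compared directly with the cycle count measured on the whole
kernel. If the analyzer failed on $b_2$, no lifted prediction would be
produced for $\mathcal{K}$ with this analyzer, regardless of the small weight
of $b_2$.
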

In the presence of complex control flow, \eg{} with conditionals inside loops,
our approach based on basic block occurrences is arguably less precise than an
approach based on path occurrences, as we have less information available: for
instance, whether a branch is taken with a regular pattern, whether we have
constraints on register values, etc. We nevertheless chose this block-based
approach, as most throughput prediction tools work at the basic block level,
and are thus readily available and can be directly plugged into our harness.
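
To illustrate this loss of information with a hypothetical example, consider a
loop executed $N$ times whose body contains a conditional, with a \emph{then}
block $b_t$ and an \emph{else} block $b_e$ each executed $N/2$ times. Whether
the branch alternates at every iteration or is taken during the first $N/2$
iterations only, both blocks contribute the same term to the lifted
prediction,
\[
    \frac{N}{2} \times \operatorname{pred}(b_t)
    + \frac{N}{2} \times \operatorname{pred}(b_e)
\]
although the two executions may behave differently on real hardware, for
instance with respect to branch prediction.
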

Finally, we check the proportion of cache misses in the program's execution
using \texttt{Cachegrind}~\cite{tool:valgrind} and \gus; programs with more
than 15\,\% of cache misses on a warm cache are not considered L1-resident and
are discarded.
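
Written as a formula, and assuming that the proportion of cache misses is the
ratio of L1 misses to memory accesses (an assumption made here for
illustration, as are the operator names below), a kernel $\mathcal{K}$ is kept
only if
\[
    \frac{\operatorname{misses}(\mathcal{K})}
         {\operatorname{accesses}(\mathcal{K})} \le 0.15
\]
For instance, a hypothetical run with $2 \times 10^4$ misses out of $10^6$
accesses has a ratio of $0.02$ and is kept, while $3 \times 10^5$ misses out
of $10^6$ accesses gives $0.30$ and the program is discarded.
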