\section{Benchmarking harness}\label{sec:bench_harness}

To compare full-kernel cycle measurements with throughput predictions made on individual basic blocks, we lift the block-level predictions by summing them, weighted by their number of occurrences:
\[
\text{lifted\_pred}(\mathcal{K}) = \sum_{b \in \operatorname{BBs}(\mathcal{K})} \operatorname{occurrences}(b) \times \operatorname{pred}(b)
\]

Our benchmarking harness works in three successive stages. It first extracts the basic blocks constituting a computation kernel, and instruments the kernel to retrieve each block's number of occurrences in the original context. It then runs all the studied tools on each basic block, while also measuring the whole computation kernel. Finally, the block-level results are lifted to kernel-level results using the previously measured occurrences.

\subsection{Basic block extraction}\label{ssec:bb_extr}

Using the Capstone disassembler~\cite{tool:capstone}, we split the assembly code at each control flow instruction (jump, call, return, \ldots) and at each jump site, as in \autoref{alg:bb_extr_procedure} from \autoref{ssec:palmed_bb_extraction}.

To accurately obtain the number of occurrences of each basic block over the whole kernel's computation, we then instrument the kernel with \gdb{}, placing a breakpoint at the first instruction of each basic block and counting how many times each block is reached between two calls to the \perf{} counters\footnote{We assume the program under analysis to be deterministic.}. While this instrumentation takes about 50 to 100$\times$ longer than a regular run, it can safely be run in parallel, as the performance results are discarded. Illustrative sketches of both passes are given in \autoref{ssec:throughput_pred_meas} below.

\subsection{Throughput predictions and measures}\label{ssec:throughput_pred_meas}

The harness leverages a variety of tools: actual CPU measurement; the \bhive{} basic block profiler~\cite{bhive}; \llvmmca~\cite{llvm-mca}, \uica~\cite{uica} and \iaca~\cite{iaca}, which leverage microarchitectural models to predict a block's throughput; \ithemal~\cite{ithemal}, a machine learning model; and \gus~\cite{phd:gruber}, a dynamic analyzer based on \qemu{} that works at the whole-binary level.

The execution time of the full kernel is measured using Linux \perf~\cite{tool:perf} CPU counters around the full computation kernel. The measurement is repeated four times, keeping the smallest value; this ensures that the cache is warm and compensates for context switching and other measurement artifacts. \gus{} instruments the whole function body. The other tools all work at the basic block level; they are run on each basic block of each benchmark.

We emphasize the importance, throughout the whole evaluation chain, of keeping the exact same assembled binary. Indeed, recompiling the kernel from source \emph{cannot} be assumed to produce the same assembly kernel. This is even more important in the presence of slight changes: for instance, inserting \iaca{} markers at the C level ---~as is intended~--- around the kernel \emph{might} change the compiled kernel, if only through alignment regions. We argue that, in the case of \iaca{} markers, the problem is even more critical, as those markers prevent a binary from being run by overwriting registers with arbitrary values. This forces a user to run and measure a version that differs from the analyzed one. In our harness, we circumvent this issue by adding the markers directly at the assembly level, editing the already-compiled version. Our \gdb{} instrumentation procedure also respects this principle of single compilation.
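As an illustration, this assembly-level marker insertion can be performed as a purely textual pass over the already-compiled assembly file. The Python sketch below uses the \iaca{} marker byte sequences documented by Intel; the delimiting labels (\texttt{kernel\_start}, \texttt{kernel\_end}) and the file-based interface are hypothetical placeholders, not the actual interface of our harness.

\begin{verbatim}
# Hypothetical sketch: textually insert IACA markers into a
# compiled .s file, around a kernel delimited by known labels.
IACA_START = "\tmovl $111, %ebx\n\t.byte 0x64, 0x67, 0x90\n"
IACA_END   = "\tmovl $222, %ebx\n\t.byte 0x64, 0x67, 0x90\n"

def insert_iaca_markers(asm_in, asm_out,
                        start_label="kernel_start:",
                        end_label="kernel_end:"):
    lines = []
    with open(asm_in) as f:
        for line in f:
            if line.strip() == start_label:
                lines += [line, IACA_START]  # marker after kernel entry
            elif line.strip() == end_label:
                lines += [IACA_END, line]    # marker before kernel exit
            else:
                lines.append(line)
    with open(asm_out, "w") as f:
        f.writelines(lines)
\end{verbatim}

Since both the marked and unmarked binaries derive from the same compiled assembly, the kernel that \iaca{} analyzes matches, instruction for instruction, the one we measure.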
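Looking back at the first stage (\autoref{ssec:bb_extr}), the splitting pass can be sketched with Capstone's Python bindings. This simplified version only collects the block leaders: the instruction following each control flow instruction, and the immediate targets of direct jumps and calls; the complete procedure is the one given in \autoref{alg:bb_extr_procedure}.

\begin{verbatim}
# Simplified sketch of basic block splitting on raw x86-64 bytes;
# our actual extraction follows the algorithm referenced above.
from capstone import Cs, CS_ARCH_X86, CS_MODE_64
from capstone.x86 import (X86_GRP_JUMP, X86_GRP_CALL,
                          X86_GRP_RET, X86_OP_IMM)

def block_leaders(code: bytes, base_addr: int) -> list:
    md = Cs(CS_ARCH_X86, CS_MODE_64)
    md.detail = True     # required to query groups and operands
    leaders = {base_addr}
    for insn in md.disasm(code, base_addr):
        if any(insn.group(g) for g in
               (X86_GRP_JUMP, X86_GRP_CALL, X86_GRP_RET)):
            # the next instruction starts a new block...
            leaders.add(insn.address + insn.size)
            # ...and so does the target of a direct jump/call
            for op in insn.operands:
                if op.type == X86_OP_IMM:
                    leaders.add(op.imm)
    return sorted(leaders)
\end{verbatim}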
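The counting pass can similarly be sketched with \gdb{}'s Python API, using breakpoints that record a hit and immediately resume. Again, this is only a sketch under simplifying assumptions: it counts over the whole run rather than between the two \perf{} calls, and the block entry addresses are hypothetical.

\begin{verbatim}
# Sketch of occurrence counting, to be sourced from within gdb,
# e.g.  gdb -batch -x count_bbs.py ./kernel
import gdb
from collections import Counter

counts = Counter()

class CountingBreakpoint(gdb.Breakpoint):
    """Breakpoint that counts hits without halting execution."""
    def __init__(self, addr):
        super().__init__(f"*{addr:#x}", internal=True)
        self.addr = addr

    def stop(self):
        counts[self.addr] += 1
        return False   # never stop: just count and resume

for addr in (0x401000, 0x401020, 0x401048):  # hypothetical entries
    CountingBreakpoint(addr)

gdb.execute("run")
for addr in sorted(counts):
    print(f"{addr:#x}: {counts[addr]}")
\end{verbatim}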
As \qemu{} breaks the \perf{} interface, we have to run \gus{} with a preloaded stub shared library in order to instrument binaries containing calls to \perf.

\subsection{Prediction lifting and filtering}\label{ssec:harness_lifting}

We finally lift the single basic block predictions to a whole-kernel cycle prediction by summing the block-level results, weighted by the occurrences of each basic block in the original context (formula above). If an analyzer fails on one of the basic blocks of a benchmark, the whole benchmark is discarded for this analyzer.

In the presence of complex control flow, \eg{} with conditionals inside loops, our approach based on basic block occurrences is arguably less precise than an approach based on path occurrences, as we have less information available ---~for instance, whether a branch is taken with a regular pattern, or whether there are constraints on register values, etc. We nevertheless chose this block-based approach, as most throughput prediction tools work at the basic block level and can thus be directly plugged into our harness.

Finally, we control the proportion of cache misses in the program's execution using \texttt{Cachegrind}~\cite{tool:valgrind} and \gus: programs with more than 15\,\% cache misses on a warm cache are not considered L1-resident and are discarded.
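As a minimal sketch of this last stage, assuming hypothetical data structures (per-block occurrence counts, and per-block predictions for one tool), the lifting and filtering logic amounts to:

\begin{verbatim}
# Minimal sketch of prediction lifting; data structures hypothetical.
from typing import Optional

def lift_prediction(occurrences: dict,
                    preds: dict) -> Optional[float]:
    """Weighted sum of block-level predictions; None (benchmark
    discarded for this tool) if the tool failed on any block."""
    total = 0.0
    for block, count in occurrences.items():
        if preds.get(block) is None:   # analyzer failed on this block
            return None
        total += count * preds[block]
    return total

def is_l1_resident(cache_miss_ratio: float) -> bool:
    # Benchmarks above 15% cache misses on a warm cache are dropped.
    return cache_miss_ratio <= 0.15
\end{verbatim}

For instance, a kernel whose two blocks occur 1000 and 10 times, with per-block predictions of 3.0 and 12.0 cycles, lifts to $1000 \times 3.0 + 10 \times 12.0 = 3120$ cycles.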