\section{Experimental setup and evaluation}\label{sec:exp_setup}

Running the harness described above provides us with 3,500 benchmarks ---~after filtering out non-L1-resident ones~--- on which each throughput predictor is run. Before analyzing these results in Section~\ref{sec:results_analysis}, we evaluate the relevance of the methodology presented in Section~\ref{sec:bench_harness}, whose purpose is to make the tools' predictions comparable to baseline hardware-counter measurements.

\subsection{Experimental environment}

The experiments presented in this paper were all carried out on a Dell PowerEdge C6420 machine from the \textit{Dahu} cluster of Grid5000 in Grenoble~\cite{grid5000}. The server is equipped with 192\,GB of DDR4 SDRAM ---~only a small fraction of which was used~--- and two Intel Xeon Gold 6130 CPUs (x86-64, Skylake microarchitecture) with 16 cores each.

The experiments themselves were run inside a Docker environment based on Debian Bullseye. Hyperthreading was disabled to improve measurement stability. For tools whose output is based on a direct measurement (\perf, \bhive), the benchmarks were run sequentially on a single core, with no experiments running on the other cores. No such precaution was needed for \gus: although it is based on a dynamic run, its prediction is purely a function of recorded program events, not of timing measurements. All other tools were run in parallel.

We use \llvmmca{} \texttt{v13.0.1}, \iaca{} \texttt{v3.0-28-g1ba2cbb}, \bhive{} at commit \texttt{5f1d500}, \uica{} at commit \texttt{9cbbe93}, \gus{} at commit \texttt{87463c9}, and \ithemal{} at commit \texttt{b3c39a8}.

\subsection{Comparability of the results}

We define the relative error of a time prediction $C_\text{pred}$ (in cycles) with respect to a baseline $C_\text{baseline}$ as
\[
\operatorname{err} = \frac{\left| C_\text{pred} - C_\text{baseline} \right|}{C_\text{baseline}}
\]

We assess the comparability of whole-benchmark measurements, obtained with \perf{}, to lifted block-based results by studying the statistical distribution of the relative error of two series: the predictions made by \bhive, and, for each benchmark, the best block-based prediction.

We single out \bhive{} as it is the only tool able to \textit{measure} ---~instead of predicting~--- an isolated basic block's timing. This, however, is not sufficient: as discussed later in Section~\ref{ssec:bhive_errors}, \bhive{} fails to yield a result for about $37\,\%$ of the benchmarks, and is subject to large errors in some cases. For this reason, we also consider, for each benchmark, the best block-based prediction: we argue that if, for most benchmarks, at least one of these predictors yields a satisfactorily accurate result, then the lifting methodology is sound in practice.

The result of this analysis is presented in Table~\ref{table:exp_comparability} and in Figure~\ref{fig:exp_comparability}. The results are in a range compatible with typical results in the field: \eg{}~\cite{uica} reports a Mean Absolute Percentage Error (MAPE, corresponding to the ``Average'' row) of about 10--15\,\% in many cases. While lifted \bhive's average error is driven up by large errors on certain benchmarks, investigated later in this article, its median error remains comparable to the errors of state-of-the-art tools. From this, we conclude that lifted cycle measurements and predictions are consistent with whole-benchmark measurements; consequently, lifted predictions can reasonably be compared to one another.
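To make the aggregation concrete, the following Python sketch shows how the statistics of Table~\ref{table:exp_comparability} can be computed from the harness output; it is a minimal sketch, not our actual implementation. The \texttt{predictions} mapping (benchmark to per-tool predicted cycle counts, with \texttt{None} where a tool failed) and the \texttt{baselines} mapping (benchmark to \perf-measured cycles) are hypothetical stand-ins for our data structures.

\begin{verbatim}
from statistics import mean, quantiles

def relative_error(pred: float, baseline: float) -> float:
    """err = |C_pred - C_baseline| / C_baseline, as defined above."""
    return abs(pred - baseline) / baseline

def error_summary(errors: list[float]) -> dict[str, float]:
    """Distribution statistics of the comparability table, in percent."""
    q1, med, q3 = quantiles(errors, n=4)  # quartile cut points: Q1, median, Q3
    return {
        "datapoints": len(errors),
        "average": 100 * mean(errors),  # MAPE ("Average" row)
        "median": 100 * med,
        "q1": 100 * q1,
        "q3": 100 * q3,
    }

def best_block_based_errors(predictions, baselines):
    """For each benchmark, keep the relative error of the most accurate
    block-based prediction (the "best block-based" series)."""
    errors = []
    for bench, per_tool in predictions.items():
        errs = [relative_error(pred, baselines[bench])
                for pred in per_tool.values() if pred is not None]
        if errs:  # skip benchmarks where every tool failed
            errors.append(min(errs))
    return errors
\end{verbatim}

Taking the per-benchmark minimum over all tools is what makes the ``Best block-based'' column an upper bound on how well the lifting methodology can perform.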
\begin{figure}
    \centering
    \includegraphics[width=0.7\linewidth]{results_comparability_hist.pdf}
    \caption{Relative error distribution \wrt{} \perf}\label{fig:exp_comparability}
\end{figure}

\begin{table}
    \centering
    \begin{tabular}{l r r}
        \toprule
        & \textbf{Best block-based} & \textbf{BHive} \\
        \midrule
        Datapoints & 3,500 & 2,198 \\
        Errors & 0 & 1,302 \\
        & (0\,\%) & (37.20\,\%) \\
        Average (\%) & 11.60 & 27.95 \\
        Median (\%) & 5.81 & 7.78 \\
        Q1 (\%) & 1.99 & 3.01 \\
        Q3 (\%) & 15.41 & 23.01 \\
        \bottomrule
    \end{tabular}
    \caption{Relative error statistics \wrt{} \perf}\label{table:exp_comparability}
\end{table}

\begin{table}
    \centering
    \footnotesize
    \begin{tabular}{l | r r r | r r r | r r r}
        \toprule
        & \multicolumn{3}{c|}{\textbf{Frontend}} & \multicolumn{3}{c|}{\textbf{Ports}} & \multicolumn{3}{c}{\textbf{Dependencies}} \\
        & \textbf{yes} & \textbf{no} & \textbf{disagr.} & \textbf{yes} & \textbf{no} & \textbf{disagr.} & \textbf{yes} & \textbf{no} & \textbf{disagr.} \\
        \midrule
        2mm & 34 & 61 & 25.8\,\% & 25 & 13 & 70.3\,\% & 18 & 29 & 63.3\,\% \\
        3mm & 44 & 61 & 18.0\,\% & 30 & 13 & 66.4\,\% & 23 & 37 & 53.1\,\% \\
        atax & 13 & 72 & 41.0\,\% & 25 & 17 & 70.8\,\% & 23 & 30 & 63.2\,\% \\
        bicg & 19 & 59 & 45.8\,\% & 25 & 25 & 65.3\,\% & 21 & 37 & 59.7\,\% \\
        doitgen & 51 & 25 & 40.6\,\% & 36 & 30 & 48.4\,\% & 17 & 22 & 69.5\,\% \\
        mvt & 27 & 53 & 33.3\,\% & 9 & 18 & 77.5\,\% & 7 & 32 & 67.5\,\% \\
        gemver & 62 & 13 & 39.5\,\% & 2 & 48 & 59.7\,\% & 1 & 28 & 76.6\,\% \\
        gesummv & 16 & 69 & 41.0\,\% & 17 & 23 & 72.2\,\% & 24 & 28 & 63.9\,\% \\
        syr2k & 51 & 37 & 38.9\,\% & 8 & 42 & 65.3\,\% & 19 & 34 & 63.2\,\% \\
        trmm & 69 & 27 & 25.0\,\% & 16 & 30 & 64.1\,\% & 15 & 30 & 64.8\,\% \\
        symm & 0 & 121 & 11.0\,\% & 5 & 20 & 81.6\,\% & 9 & 5 & 89.7\,\% \\
        syrk & 54 & 46 & 30.6\,\% & 12 & 42 & 62.5\,\% & 20 & 48 & 52.8\,\% \\
        gemm & 42 & 41 & 42.4\,\% & 30 & 41 & 50.7\,\% & 16 & 57 & 49.3\,\% \\
        gramschmidt & 48 & 52 & 21.9\,\% & 16 & 20 & 71.9\,\% & 24 & 39 & 50.8\,\% \\
        cholesky & 24 & 72 & 33.3\,\% & 0 & 19 & 86.8\,\% & 5 & 14 & 86.8\,\% \\
        durbin & 49 & 52 & 29.9\,\% & 0 & 65 & 54.9\,\% & 2 & 39 & 71.5\,\% \\
        trisolv & 53 & 84 & 4.9\,\% & 6 & 22 & 80.6\,\% & 4 & 16 & 86.1\,\% \\
        jacobi-1d & 18 & 78 & 33.3\,\% & 66 & 9 & 47.9\,\% & 0 & 13 & 91.0\,\% \\
        heat-3d & 32 & 8 & 72.2\,\% & 26 & 0 & 81.9\,\% & 0 & 0 & 100.0\,\% \\
        seidel-2d & 0 & 112 & 22.2\,\% & 32 & 0 & 77.8\,\% & 0 & 0 & 100.0\,\% \\
        fdtd-2d & 52 & 22 & 47.1\,\% & 20 & 41 & 56.4\,\% & 0 & 40 & 71.4\,\% \\
        jacobi-2d & 6 & 31 & 73.6\,\% & 24 & 61 & 39.3\,\% & 0 & 44 & 68.6\,\% \\
        adi & 12 & 76 & 21.4\,\% & 40 & 0 & 64.3\,\% & 0 & 0 & 100.0\,\% \\
        correlation & 18 & 36 & 51.8\,\% & 19 & 30 & 56.2\,\% & 23 & 45 & 39.3\,\% \\
        covariance & 39 & 36 & 37.5\,\% & 4 & 34 & 68.3\,\% & 19 & 53 & 40.0\,\% \\
        floyd-warshall & 74 & 16 & 29.7\,\% & 16 & 24 & 68.8\,\% & 20 & 8 & 78.1\,\% \\
        \midrule
        \textbf{Total} & 907 & 1,360 & 35.2\,\% & 509 & 687 & 65.8\,\% & 310 & 728 & 70.3\,\% \\
        \bottomrule
    \end{tabular}
    \caption{Bottleneck reports from the studied tools}\label{table:coverage}
\end{table}

\subsection{Relevance and representativity (bottleneck analysis)}\label{ssec:bottleneck_diversity}

The results provided by our harness are only relevant for evaluating the parts of the tools' models that the generated benchmarks stress; it is hence critical that the benchmark generation procedure of Section~\ref{sec:bench_gen} yields diverse results. This should be true by construction, as the various polyhedral compilation techniques used stress different parts of the microarchitecture.
To assess this, we study the generated benchmarks' bottlenecks, \ie{} the architectural resources on which relieving pressure improves execution time. Note that a saturated resource is not necessarily a bottleneck: code that keeps \eg{} 100\,\% of the arithmetic units busy with computations outside of the critical path, at a point where a chain of dependencies is blocking, will not run faster if those arithmetic operations are removed; hence, hardware counters alone are not sufficient to find bottlenecks. However, some static analyzers report the bottlenecks they detect. To unify their results and keep things simple, we study three general kinds of bottlenecks.

\begin{itemize}
    \item{} \emph{Frontend:} the CPU's frontend is not able to issue micro-operations to the backend fast enough. \iaca{} and \uica{} are able to detect this.
    \item{} \emph{Ports:} at least one of the backend ports has too much work; reducing its pressure would accelerate the computation. \llvmmca, \iaca{} and \uica{} are able to detect this.
    \item{} \emph{Dependencies:} there is a chain of data dependencies slowing down the computation. \llvmmca, \iaca{} and \uica{} are able to detect this.
\end{itemize}

For each source benchmark from Polybench and each type of bottleneck, we report in Table~\ref{table:coverage} the number of derived benchmarks on which all the tools agree that the bottleneck is present or absent. We also report the proportion of cases in which the tools fail to agree. We analyze those results later, in Section~\ref{ssec:bottleneck_pred_analysis}.

As we have no source of truth indicating whether a bottleneck is effectively present in a microbenchmark, we adopt a conservative approach and consider only the subset of microbenchmarks on which the tools agree on the status of all three resources; for those, we have good confidence in the reported bottlenecks. Obviously, this approach is limited: it excludes microbenchmarks that might be worth considering, and is most probably subject to selection bias.

Of the 3,500 microbenchmarks we have generated, 261 (7.5\,\%) reach the above-mentioned consensus. This sample is made up of microbenchmarks generated from 21 benchmarks ---~\ie{} for 5 benchmarks, none of the derived microbenchmarks reached a consensus among the tools~--- yielding a wide variety of calculations, including floating-point, pointer, and Boolean arithmetic. Of these, 200 (76.6\,\%) are bottlenecked on the CPU frontend, 19 (7.3\,\%) on backend ports, and 81 (31.0\,\%) on latency introduced by dependencies; a microbenchmark may exhibit several bottlenecks at once, hence the sum exceeding 100\,\%. As mentioned above, this distribution probably does not reflect the distribution among the 3,500 original benchmarks, as the 261 were not uniformly sampled. However, we argue that, as all categories are represented in the sample, the initial hypothesis that the generated benchmarks are diverse and representative is confirmed, thanks to the transformations described in Section~\ref{sec:bench_gen}.

\subsection{Carbon footprint}

Generating and running the full suite of benchmarks required about 30\,h of continuous computation on a single machine. During the experiments, the power supply units reported a near-constant consumption of about 350\,W. The carbon intensity of the power grid in France, where the experiment was run, was about 29\,g\coeq/kWh at the time of the experiments~\cite{electricitymaps}. The electricity consumed directly by the server thus amounts to about 10.5\,kWh.
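This direct-consumption figure is simply the measured power draw integrated over the duration of the run:
\[
350\,\text{W} \times 30\,\text{h} = 10.5\,\text{kWh}
\]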
Assuming a Power Usage Effectiveness (PUE) of 1.5, the total electricity consumption roughly amounts to 15.75\,kWh, or about 450\,g\coeq. The manufacturer published a carbon footprint estimate covering the manufacture of the machine itself~\cite{poweredgeC6420lca}. Additionally accounting for the extra 160\,GB of DDR4 SDRAM~\cite{meta_ACT}, the hardware's manufacturing, transport and end-of-life are estimated at 1,266\,kg\coeq. In 2023, this computation cluster's usage rate was 35\,\%. Assuming a product life of 6 years, our 30\,h of usage represent about 2,050\,g\coeq{}. The whole experiment thus amounts to about 2.5\,kg\coeq.
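For transparency, the figures above can be derived explicitly, rounding as in the text and counting $8{,}760$ hours per year:
\begin{align*}
    \text{usage:} \quad & 10.5\,\text{kWh} \times 1.5 \times 29\,\text{g\coeq/kWh} \approx 450\,\text{g\coeq} \\
    \text{manufacturing:} \quad & 1{,}266\,\text{kg\coeq} \times \frac{30\,\text{h}}{6 \times 8{,}760\,\text{h} \times 0.35} \approx 2{,}050\,\text{g\coeq} \\
    \text{total:} \quad & 450\,\text{g\coeq} + 2{,}050\,\text{g\coeq} = 2.5\,\text{kg\coeq}
\end{align*}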