\section{Experimental setup and evaluation}\label{sec:exp_setup}

Running the harness described above yields 3,500 benchmarks (after filtering
out those that are not L1-resident), on which each throughput predictor is run.
Before analyzing these results in Section~\ref{sec:results_analysis}, we
evaluate whether the methodology presented in Section~\ref{sec:bench_harness}
makes the tools' predictions comparable to baseline hardware-counter
measurements.
\subsection{Experimental environment}

All the experiments presented in this paper were run on a Dell PowerEdge
C6420 machine from the \textit{Dahu} cluster of Grid5000 in
Grenoble~\cite{grid5000}. The server is equipped with 192\,GB of DDR4 SDRAM
(only a small fraction of which was used) and two Intel Xeon Gold 6130
CPUs (x86-64, Skylake microarchitecture) with 16 cores each.

The experiments themselves were run inside a Docker environment based on Debian
Bullseye. Hyperthreading was disabled to improve measurement stability. For
tools whose output is based on a direct measurement (\perf, \bhive), the
benchmarks were run sequentially on a single core, with no experiments running
on the other cores. No such care was taken for \gus{} as, although it is based
on a dynamic run, its prediction is purely a function of recorded program
events and not of program measurements. All other tools were run in parallel.

We use \llvmmca{} \texttt{v13.0.1}, \iaca{} \texttt{v3.0-28-g1ba2cbb}, \bhive{}
at commit \texttt{5f1d500}, \uica{} at commit \texttt{9cbbe93}, \gus{} at
commit \texttt{87463c9}, and \ithemal{} at commit \texttt{b3c39a8}.
\subsection{Comparability of the results}

We define the relative error of a time prediction
$C_\text{pred}$ (in cycles) with respect to a baseline $C_\text{baseline}$ as
\[
    \operatorname{err} = \frac{\left| C_\text{pred} - C_\text{baseline}
    \right|}{C_\text{baseline}}.
\]
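As a worked illustration of this definition (the numbers are hypothetical and
are not taken from our measurements), a prediction of 110 cycles against a
baseline of 100 cycles gives
\[
    \operatorname{err} = \frac{\left| 110 - 100 \right|}{100} = 10\,\%.
\]
Averaging this quantity over $N$ benchmarks, where $C_{\text{pred},i}$ and
$C_{\text{baseline},i}$ denote the prediction and the baseline for the $i$-th
benchmark (an indexing introduced here for illustration only), gives the usual
Mean Absolute Percentage Error:
\[
    \operatorname{MAPE} = \frac{1}{N} \sum_{i=1}^{N}
    \frac{\left| C_{\text{pred},i} - C_{\text{baseline},i}
    \right|}{C_{\text{baseline},i}}.
\]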

We assess the comparability of whole-benchmark measurements, made with
\perf{}, to lifted block-based results by studying the statistical distribution
of the relative error of two series: the predictions made by \bhive, and the
best block-based prediction for each benchmark.

We single out \bhive{} as it is the only tool able to \textit{measure} (rather
than predict) an isolated basic block's timing. This, however, is not
sufficient: as discussed later in Section~\ref{ssec:bhive_errors}, \bhive{}
is not able to yield a result for about $37\,\%$ of the benchmarks (see
Table~\ref{table:exp_comparability}), and is subject to large errors in some
cases. For this reason, we also consider, for each benchmark, the best
block-based prediction: we argue that if, for most benchmarks, at least one of
these predictors is able to yield a satisfactorily accurate result, then the
lifting methodology is sound in practice.

The result of this analysis is presented in Table~\ref{table:exp_comparability}
and in Figure~\ref{fig:exp_comparability}. The errors fall in a range
consistent with results commonly reported in the field: for instance,
\cite{uica} reports a Mean Absolute Percentage Error (MAPE, corresponding to
the ``Average'' row) of about 10--15\,\% in many cases. While the average error
of lifted \bhive{} is driven high by large errors on certain benchmarks,
investigated later in this article, its median error is still comparable to the
errors of state-of-the-art tools. From this, we conclude that lifted cycle
measurements and predictions are consistent with whole-benchmark measurements
and, consequently, that lifted predictions can reasonably be compared to one
another.
\begin{figure}
    \centering
    \includegraphics[width=0.7\linewidth]{results_comparability_hist.pdf}
    \caption{Relative error distribution \wrt{} \perf}\label{fig:exp_comparability}
\end{figure}
\begin{table}
    \centering
    \begin{tabular}{l r r}
        \toprule
        & \textbf{Best block-based} & \textbf{BHive} \\
        \midrule
        Datapoints & 3500 & 2198 \\
        Errors & 0 & 1302 \\
               & (0\,\%) & (37.20\,\%) \\
        Average (\%) & 11.60 & 27.95 \\
        Median (\%) & 5.81 & 7.78 \\
        Q1 (\%) & 1.99 & 3.01 \\
        Q3 (\%) & 15.41 & 23.01 \\
        \bottomrule
    \end{tabular}
    \caption{Relative error statistics \wrt{} \perf}\label{table:exp_comparability}
\end{table}
\begin{table}
    \centering
    \footnotesize
    \begin{tabular}{l | r r r | r r r | r r r}
        \toprule
        & \multicolumn{3}{c|}{\textbf{Frontend}}
        & \multicolumn{3}{c|}{\textbf{Ports}}
        & \multicolumn{3}{c}{\textbf{Dependencies}} \\
        & \textbf{yes} & \textbf{no} & \textbf{disagr.} & \textbf{yes} & \textbf{no} & \textbf{disagr.} & \textbf{yes} & \textbf{no} & \textbf{disagr.} \\
        \midrule
        2mm & 34 & 61 & 25.8 \% & 25 & 13 & 70.3 \% & 18 & 29 & 63.3 \% \\
        3mm & 44 & 61 & 18.0 \% & 30 & 13 & 66.4 \% & 23 & 37 & 53.1 \% \\
        atax & 13 & 72 & 41.0 \% & 25 & 17 & 70.8 \% & 23 & 30 & 63.2 \% \\
        bicg & 19 & 59 & 45.8 \% & 25 & 25 & 65.3 \% & 21 & 37 & 59.7 \% \\
        doitgen & 51 & 25 & 40.6 \% & 36 & 30 & 48.4 \% & 17 & 22 & 69.5 \% \\
        mvt & 27 & 53 & 33.3 \% & 9 & 18 & 77.5 \% & 7 & 32 & 67.5 \% \\
        gemver & 62 & 13 & 39.5 \% & 2 & 48 & 59.7 \% & 1 & 28 & 76.6 \% \\
        gesummv & 16 & 69 & 41.0 \% & 17 & 23 & 72.2 \% & 24 & 28 & 63.9 \% \\
        syr2k & 51 & 37 & 38.9 \% & 8 & 42 & 65.3 \% & 19 & 34 & 63.2 \% \\
        trmm & 69 & 27 & 25.0 \% & 16 & 30 & 64.1 \% & 15 & 30 & 64.8 \% \\
        symm & 0 & 121 & 11.0 \% & 5 & 20 & 81.6 \% & 9 & 5 & 89.7 \% \\
        syrk & 54 & 46 & 30.6 \% & 12 & 42 & 62.5 \% & 20 & 48 & 52.8 \% \\
        gemm & 42 & 41 & 42.4 \% & 30 & 41 & 50.7 \% & 16 & 57 & 49.3 \% \\
        gramschmidt & 48 & 52 & 21.9 \% & 16 & 20 & 71.9 \% & 24 & 39 & 50.8 \% \\
        cholesky & 24 & 72 & 33.3 \% & 0 & 19 & 86.8 \% & 5 & 14 & 86.8 \% \\
        durbin & 49 & 52 & 29.9 \% & 0 & 65 & 54.9 \% & 2 & 39 & 71.5 \% \\
        trisolv & 53 & 84 & 4.9 \% & 6 & 22 & 80.6 \% & 4 & 16 & 86.1 \% \\
        jacobi-1d & 18 & 78 & 33.3 \% & 66 & 9 & 47.9 \% & 0 & 13 & 91.0 \% \\
        heat-3d & 32 & 8 & 72.2 \% & 26 & 0 & 81.9 \% & 0 & 0 & 100.0 \% \\
        seidel-2d & 0 & 112 & 22.2 \% & 32 & 0 & 77.8 \% & 0 & 0 & 100.0 \% \\
        fdtd-2d & 52 & 22 & 47.1 \% & 20 & 41 & 56.4 \% & 0 & 40 & 71.4 \% \\
        jacobi-2d & 6 & 31 & 73.6 \% & 24 & 61 & 39.3 \% & 0 & 44 & 68.6 \% \\
        adi & 12 & 76 & 21.4 \% & 40 & 0 & 64.3 \% & 0 & 0 & 100.0 \% \\
        correlation & 18 & 36 & 51.8 \% & 19 & 30 & 56.2 \% & 23 & 45 & 39.3 \% \\
        covariance & 39 & 36 & 37.5 \% & 4 & 34 & 68.3 \% & 19 & 53 & 40.0 \% \\
        floyd-warshall & 74 & 16 & 29.7 \% & 16 & 24 & 68.8 \% & 20 & 8 & 78.1 \% \\
        \textbf{Total} & 907 & 1360 & 35.2 \% & 509 & 687 & 65.8 \% & 310 & 728 & 70.3 \% \\
        \bottomrule
    \end{tabular}
    \caption{Bottleneck reports from the studied tools}\label{table:coverage}
\end{table}
\subsection{Relevance and representativity (bottleneck analysis)}\label{ssec:bottleneck_diversity}

The results provided by our harness are only relevant for evaluating the parts
of the tools' models that are stressed by the generated benchmarks; it is hence
critical that the benchmark generation procedure of
Section~\ref{sec:bench_gen} yields diverse benchmarks. This should be true by
construction, as the various polyhedral compilation techniques used stress
different parts of the microarchitecture.

To assess this, we study the generated benchmarks' bottlenecks, \ie{}
architectural resources on which a release of pressure improves execution time.
Note that a saturated resource is not necessarily a bottleneck: a code that
uses \eg{} 100\,\% of the available arithmetic units for computations outside
of the critical path, at a point where a chain of dependencies is blocking,
will not run faster if the arithmetic operations are removed. Hence, hardware
counters alone are not sufficient to find bottlenecks.

However, some static analyzers report the bottlenecks they detect. To unify
their results and keep things simple, we study three general kinds of
bottlenecks.
\begin{itemize}
    \item{} \emph{Frontend:} the CPU's frontend is not able to issue
        micro-operations to the backend fast enough. \iaca{} and \uica{} are
        able to detect this.
    \item{} \emph{Ports:} at least one of the backend ports has too much work;
        reducing its pressure would accelerate the computation.
        \llvmmca, \iaca{} and \uica{} are able to detect this.
    \item{} \emph{Dependencies:} there is a chain of data dependencies slowing
        down the computation.
        \llvmmca, \iaca{} and \uica{} are able to detect this.
\end{itemize}

For each source benchmark from Polybench and each type of bottleneck, we report
in Table~\ref{table:coverage} the number of derived benchmarks on which all the
tools agree that the bottleneck is present (``yes'') or absent (``no''), as
well as the proportion of cases in which the tools failed to agree
(``disagr.''). We analyze those results later in
Section~\ref{ssec:bottleneck_pred_analysis}.

As we have no source of truth indicating whether a bottleneck is actually
present in a microbenchmark, we adopt a conservative approach and consider
only the subset of the microbenchmarks on which the tools agree on the status
of all three resources; for those, we have good confidence in the bottlenecks
reported. Obviously, this approach is limited, because it excludes
microbenchmarks that might be worth considering, and is most probably subject
to selection bias.

Of the 3,500 microbenchmarks we have generated, 261 (7.5\,\%) are the subject
of the above-mentioned consensus. This sample is made up of microbenchmarks
generated from 21 Polybench benchmarks (\ie{} for 5 of the 26 source
benchmarks, none of the derived microbenchmarks reached a consensus among the
tools), yielding a wide variety of calculations, including floating-point
arithmetic, pointer arithmetic and Boolean arithmetic. Of these, 200
(76.6\,\%) are bottlenecked on the CPU frontend, 19 (7.3\,\%) on backend
ports, and 81 (31.0\,\%) on latency introduced by dependencies. As mentioned
above, this distribution probably does not reflect the distribution among the
3,500 original benchmarks, as the 261 were not uniformly sampled. However, we
argue that, as all categories are represented in the sample, the initial
hypothesis that the generated benchmarks are diverse and representative is
confirmed, thanks to the transformations described in
Section~\ref{sec:bench_gen}.
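
The figures of this section can be cross-checked with a few lines of
arithmetic. The following is only a sketch: it assumes that the ``disagr.''
column of Table~\ref{table:coverage} is the share of derived benchmarks counted
in neither the ``yes'' nor the ``no'' column, and the symbols
$n_\text{yes}$, $n_\text{no}$ and $N$ are introduced here for illustration.
Under this assumption, the disagreement rate is
$1 - (n_\text{yes} + n_\text{no}) / N$, which for the ``Total'' row, with
$N = 3500$, gives
\[
    1 - \frac{907 + 1360}{3500} \approx 35.2\,\%, \qquad
    1 - \frac{509 + 687}{3500} \approx 65.8\,\%, \qquad
    1 - \frac{310 + 728}{3500} \approx 70.3\,\%
\]
for the frontend, ports and dependencies respectively. Likewise, for the
consensus sample,
\[
    \frac{261}{3500} \approx 7.5\,\%, \qquad
    \frac{200}{261} \approx 76.6\,\%, \qquad
    \frac{19}{261} \approx 7.3\,\%, \qquad
    \frac{81}{261} \approx 31.0\,\%.
\]
Since $200 + 19 + 81 = 300 > 261$, the three categories necessarily overlap:
some microbenchmarks of the sample are reported as bottlenecked on several
resources at once.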
\subsection{Carbon footprint}

Generating and running the full suite of benchmarks required about 30\,h of
continuous computation on a single machine. During the experiments, the power
supply units reported a near-constant consumption of about 350\,W. The carbon
intensity of the French power grid, where the experiments were run, was about
29\,g\coeq/kWh at the time of the experiments~\cite{electricitymaps}.

The electricity consumed directly by the server thus amounts to about
10.50\,kWh. Assuming a Power Usage Effectiveness (PUE) of 1.5, the total
electricity consumption roughly amounts to 15.75\,kWh, or about 450\,g\coeq.

A carbon footprint estimate of the machine's manufacture itself was conducted
by the manufacturer~\cite{poweredgeC6420lca}. Additionally accounting for the
extra 160\,GB of DDR4 SDRAM~\cite{meta_ACT}, the hardware manufacturing,
transport and end-of-life are evaluated at 1,266\,kg\coeq. In 2023, this
computation cluster's usage rate was 35\,\%. Assuming 6 years of product life,
30\,h of usage represents about 2,050\,g\coeq{}. The whole experiment thus
amounts to about 2.5\,kg\coeq.
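
The estimate above can be retraced step by step. This is a sketch under two
assumptions that are not stated explicitly above: the manufacturing footprint
is amortized linearly over the hours during which the machine is actually in
use, and a year is taken to be 365 days. The symbols $E$, $T$ and $m$ are
introduced here for illustration, and all masses are expressed in \coeq.
\[
    E_\text{server} = 350\,\text{W} \times 30\,\text{h} = 10.5\,\text{kWh},
    \qquad
    E_\text{total} = 1.5 \times 10.5\,\text{kWh} = 15.75\,\text{kWh}
\]
\[
    m_\text{elec} = 15.75\,\text{kWh} \times 29\,\text{g/kWh}
    \approx 457\,\text{g}
\]
\[
    T_\text{use} = 6 \times 365 \times 24\,\text{h} \times 0.35
    = 18{,}396\,\text{h},
    \qquad
    m_\text{hw} = 1{,}266\,\text{kg} \times \frac{30\,\text{h}}{T_\text{use}}
    \approx 2.06\,\text{kg}
\]
\[
    m_\text{total} = m_\text{elec} + m_\text{hw}
    \approx 0.46\,\text{kg} + 2.06\,\text{kg} \approx 2.5\,\text{kg}
\]
Up to rounding, these values are in line with the figures of about 450\,g and
2,050\,g quoted above. Under these assumptions, the hardware share outweighs
the electricity share by a factor of more than four.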