\section{Experimental setup and evaluation}\label{sec:exp_setup}
Running the harness described above provides us with 3,500
benchmarks ---~after filtering out non-L1-resident
benchmarks~---, on which each throughput predictor is run.
Before analyzing these results in
Section~\ref{sec:results_analysis}, we evaluate the ability of the
methodology presented in Section~\ref{sec:bench_harness} to make the tools'
predictions comparable to baseline hardware counter measurements.
\subsection{Experimental environment}
The experiments presented in this chapter, unless stated otherwise, were all
carried out on a Dell PowerEdge C6420 machine from the \textit{Dahu} cluster of
Grid5000 in Grenoble~\cite{grid5000}. The server is equipped with 192\,GB of
DDR4 SDRAM ---~only a small fraction of which was used~--- and two Intel Xeon
Gold 6130 CPUs (x86-64, Skylake microarchitecture) with 16 cores each.

The experiments themselves were run inside a Docker environment based on Debian
Bullseye. Care was taken to disable hyperthreading to improve measurement
stability. For tools whose output is based on a direct measurement (\perf,
\bhive), the benchmarks were run sequentially on a single core, with no other
experiment running on the remaining cores. No such care was taken for \gus{}
as, although based on a dynamic run, its prediction is purely a function of
recorded program events and not of timing measurements. All other tools were
run in parallel.
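
As an illustration, the measurement-based runs (\perf, \bhive) can be pinned to
a single core along the following lines (a simplified sketch only; the actual
harness, binary names and command lines differ):
\begin{verbatim}
import subprocess

def run_pinned(cmd, core=0):
    """Run a measurement command alone on a fixed core (Linux taskset)."""
    return subprocess.run(["taskset", "-c", str(core)] + cmd,
                          capture_output=True, text=True, check=True)

# Example: counting cycles with perf on core 0 (hypothetical binary name).
result = run_pinned(["perf", "stat", "-e", "cycles", "./benchmark_binary"])
\end{verbatim}
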
We use \llvmmca{} \texttt{v13.0.1}, \iaca{} \texttt{v3.0-28-g1ba2cbb}, \bhive{}
at commit \texttt{5f1d500}, \uica{} at commit \texttt{9cbbe93}, \gus{} at
commit \texttt{87463c9}, \ithemal{} at commit \texttt{b3c39a8}.
\subsection{Comparability of the results}
We define the relative error of a time prediction
$C_\text{pred}$ (in cycles) with respect to a baseline $C_\text{baseline}$ as
\[
\operatorname{err} = \frac{\left| C_\text{pred} - C_\text{baseline}
\right|}{C_\text{baseline}}.
\]

We assess the comparability of the whole benchmark, measured with \perf{}, to
lifted block-based results by measuring the statistical distribution of the
relative error of two series: the predictions made by \bhive, and the series of
the best block-based prediction for each benchmark.

We single out \bhive{} as it is the only tool able to \textit{measure}
---~instead of predicting~--- an isolated basic block's timing. This alone,
however, is not sufficient: as discussed later in
Section~\ref{ssec:bhive_errors}, \bhive{} fails to yield a result for about
$40\,\%$ of the benchmarks, and is subject to large errors in some cases. For
this reason, we also consider, for each benchmark, the best block-based
prediction: we argue that if, for most benchmarks, at least one of these
predictors is able to yield a satisfactorily accurate result, then the lifting
methodology is sound in practice.
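
For illustration, the statistics of Table~\ref{table:exp_comparability} can be
derived from the raw predictions along the following lines (a minimal sketch;
the data layout and function names are hypothetical, not those of the actual
harness):
\begin{verbatim}
import numpy as np

def relative_error(pred, baseline):
    """err = |pred - baseline| / baseline, as defined above."""
    return abs(pred - baseline) / baseline

def error_stats(errors):
    """Summary rows of the comparability table, in percent."""
    errs = np.asarray(errors) * 100
    return {"datapoints": len(errs),
            "average": errs.mean(),             # MAPE
            "median": np.median(errs),
            "Q1": np.percentile(errs, 25),
            "Q3": np.percentile(errs, 75)}

def best_block_based_errors(predictions, perf_cycles, block_tools):
    """predictions[tool][bench]: lifted cycle count;
    perf_cycles[bench]: whole-benchmark baseline.
    Keep, per benchmark, the most accurate block-based prediction."""
    best = []
    for bench, baseline in perf_cycles.items():
        errs = [relative_error(predictions[t][bench], baseline)
                for t in block_tools if bench in predictions[t]]
        if errs:
            best.append(min(errs))
    return best
\end{verbatim}
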

The result of this analysis is presented in Table~\ref{table:exp_comparability}
and in Figure~\ref{fig:exp_comparability}. The results are in a range
compatible with common results of the field, as seen \eg{} in~\cite{uica},
which reports a Mean Absolute Percentage Error (MAPE, corresponding to the
``Average'' row) of about 10--15\,\% in many cases. While lifted \bhive's
average error is driven high by large errors on certain benchmarks,
investigated later in this chapter, its median error remains comparable to the
errors of state-of-the-art tools. From this, we conclude that lifted cycle
measurements and predictions are consistent with whole-benchmark measurements;
consequently, lifted predictions can reasonably be compared to one another.
\begin{figure}
\centering
\includegraphics[width=0.7\linewidth]{results_comparability_hist.pdf}
\caption{Relative error distribution \wrt{} \perf}\label{fig:exp_comparability}
\end{figure}
\begin{table}
\centering
\begin{tabular}{l r r}
\toprule
& \textbf{Best block-based} & \textbf{BHive} \\
\midrule
Datapoints & 3500 & 2198 \\
Errors & 0 & 1302 \\
& (0\,\%) & (37.20\,\%) \\
Average (\%) & 11.60 & 27.95 \\
Median (\%) & 5.81 & 7.78 \\
Q1 (\%) & 1.99 & 3.01 \\
Q3 (\%) & 15.41 & 23.01 \\
\bottomrule
\end{tabular}
\caption{Relative error statistics \wrt{} \perf}\label{table:exp_comparability}
\end{table}
\begin{table}
\centering
\footnotesize
\begin{tabular}{l | r r r | r r r | r r r}
\toprule
\multicolumn{1}{c|}{\textbf{Polybench}}
& \multicolumn{3}{c|}{\textbf{Frontend}}
& \multicolumn{3}{c|}{\textbf{Ports}}
& \multicolumn{3}{c}{\textbf{Dependencies}} \\
\multicolumn{1}{c|}{\textbf{benchmark}}
& \textbf{yes} & \textbf{no} & \textbf{disagr.} & \textbf{yes} & \textbf{no} & \textbf{disagr.} & \textbf{yes} & \textbf{no} & \textbf{disagr.} \\
\midrule
2mm & 34 & 61 & 25.8 \% & 25 & 13 & 70.3 \% & 18 & 29 & 63.3 \% \\
3mm & 44 & 61 & 18.0 \% & 30 & 13 & 66.4 \% & 23 & 37 & 53.1 \% \\
atax & 13 & 72 & 41.0 \% & 25 & 17 & 70.8 \% & 23 & 30 & 63.2 \% \\
bicg & 19 & 59 & 45.8 \% & 25 & 25 & 65.3 \% & 21 & 37 & 59.7 \% \\
doitgen & 51 & 25 & 40.6 \% & 36 & 30 & 48.4 \% & 17 & 22 & 69.5 \% \\
mvt & 27 & 53 & 33.3 \% & 9 & 18 & 77.5 \% & 7 & 32 & 67.5 \% \\
gemver & 62 & 13 & 39.5 \% & 2 & 48 & 59.7 \% & 1 & 28 & 76.6 \% \\
gesummv & 16 & 69 & 41.0 \% & 17 & 23 & 72.2 \% & 24 & 28 & 63.9 \% \\
syr2k & 51 & 37 & 38.9 \% & 8 & 42 & 65.3 \% & 19 & 34 & 63.2 \% \\
trmm & 69 & 27 & 25.0 \% & 16 & 30 & 64.1 \% & 15 & 30 & 64.8 \% \\
symm & 0 & 121 & 11.0 \% & 5 & 20 & 81.6 \% & 9 & 5 & 89.7 \% \\
syrk & 54 & 46 & 30.6 \% & 12 & 42 & 62.5 \% & 20 & 48 & 52.8 \% \\
gemm & 42 & 41 & 42.4 \% & 30 & 41 & 50.7 \% & 16 & 57 & 49.3 \% \\
gramschmidt & 48 & 52 & 21.9 \% & 16 & 20 & 71.9 \% & 24 & 39 & 50.8 \% \\
cholesky & 24 & 72 & 33.3 \% & 0 & 19 & 86.8 \% & 5 & 14 & 86.8 \% \\
durbin & 49 & 52 & 29.9 \% & 0 & 65 & 54.9 \% & 2 & 39 & 71.5 \% \\
trisolv & 53 & 84 & 4.9 \% & 6 & 22 & 80.6 \% & 4 & 16 & 86.1 \% \\
jacobi-1d & 18 & 78 & 33.3 \% & 66 & 9 & 47.9 \% & 0 & 13 & 91.0 \% \\
heat-3d & 32 & 8 & 72.2 \% & 26 & 0 & 81.9 \% & 0 & 0 & 100.0 \% \\
seidel-2d & 0 & 112 & 22.2 \% & 32 & 0 & 77.8 \% & 0 & 0 & 100.0 \% \\
fdtd-2d & 52 & 22 & 47.1 \% & 20 & 41 & 56.4 \% & 0 & 40 & 71.4 \% \\
jacobi-2d & 6 & 31 & 73.6 \% & 24 & 61 & 39.3 \% & 0 & 44 & 68.6 \% \\
adi & 12 & 76 & 21.4 \% & 40 & 0 & 64.3 \% & 0 & 0 & 100.0 \% \\
correlation & 18 & 36 & 51.8 \% & 19 & 30 & 56.2 \% & 23 & 45 & 39.3 \% \\
covariance & 39 & 36 & 37.5 \% & 4 & 34 & 68.3 \% & 19 & 53 & 40.0 \% \\
floyd-warshall & 74 & 16 & 29.7 \% & 16 & 24 & 68.8 \% & 20 & 8 & 78.1 \% \\
\textbf{Total} & 907 & 1360 & 35.2 \% & 509 & 687 & 65.8 \% & 310 & 728 & 70.3 \% \\
\bottomrule
\end{tabular}
\caption{Bottleneck reports from the studied tools}\label{table:coverage}
\end{table}
\subsection{Relevance and representativeness (bottleneck
analysis)}\label{ssec:bottleneck_diversity}
The results provided by our harness are only relevant to evaluate the parts of
the tools' models that are stressed by the generated benchmarks; it is hence
critical that the benchmark generation procedure of Section~\ref{sec:bench_gen}
yields diverse benchmarks. This should be true by construction, as the various
polyhedral compilation techniques used stress different parts of the
microarchitecture.

To assess this, we study the generated benchmarks' bottlenecks, \ie{} the
architectural resources on which relieving pressure improves execution time.
Note that a saturated resource is not necessarily a bottleneck: a code that
uses \eg{} 100\,\% of the available arithmetic units for computations outside
of the critical path, at a point where a chain of dependencies is blocking,
will not run faster if these arithmetic operations are removed; hence, hardware
counters alone are not sufficient to find bottlenecks.

However, some static analyzers report the bottlenecks they detect. To unify
their results and keep things simple, we study three general kinds of
bottlenecks.
\begin{itemize}
\item{} \emph{Frontend:} the CPU's frontend is not able to issue
micro-operations to the backend fast enough. \iaca{} and \uica{} are
able to detect this.
\item{} \emph{Ports:} at least one of the backend ports has too much work;
reducing its pressure would accelerate the computation.
\llvmmca, \iaca{} and \uica{} are able to detect this.
\item{} \emph{Dependencies:} there is a chain of data dependencies slowing
down the computation.
\llvmmca, \iaca{} and \uica{} are able to detect this.
\end{itemize}
For each source benchmark from Polybench and each type of bottleneck, we report
in Table~\ref{table:coverage} the number of derived benchmarks on which all the
tools agree that the bottleneck is present or absent. We also report the
proportion of cases in which the tools failed to agree. We analyze those
results later in Section~\ref{ssec:bottleneck_pred_analysis}.
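
The agreement columns of Table~\ref{table:coverage} can be computed along the
following lines (a sketch with a hypothetical data layout, where
\texttt{reports[tool][bench]} is the set of bottleneck kinds flagged by a tool
on a given microbenchmark):
\begin{verbatim}
from collections import Counter

BOTTLENECKS = ("frontend", "ports", "dependencies")

def coverage_row(reports, benchmarks):
    """For each bottleneck kind, count the microbenchmarks on which all
    tools answer yes, all answer no, and the share of disagreements."""
    row = {}
    for kind in BOTTLENECKS:
        tally = Counter()
        for bench in benchmarks:
            answers = {kind in reports[tool][bench] for tool in reports}
            if answers == {True}:
                tally["yes"] += 1
            elif answers == {False}:
                tally["no"] += 1
            else:
                tally["disagr"] += 1
        total = sum(tally.values())
        row[kind] = (tally["yes"], tally["no"],
                     100 * tally["disagr"] / total if total else 0.0)
    return row
\end{verbatim}
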

As we have no ground truth indicating whether a bottleneck is effectively
present in a microbenchmark, we adopt a conservative approach and consider
only the subset of the microbenchmarks on which the tools agree on the status
of all three resources; for those, we have good confidence in the reported
bottlenecks. This approach is obviously limited, as it excludes
microbenchmarks that might be worth considering, and is most probably subject
to selection bias.
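
This consensus subset can be extracted directly from the per-tool reports
(continuing the hypothetical sketch above):
\begin{verbatim}
def consensus_subset(reports, benchmarks):
    """Keep only the microbenchmarks on which every tool gives the same
    yes/no answer for all three bottleneck kinds."""
    kept = []
    for bench in benchmarks:
        if all(len({kind in reports[tool][bench] for tool in reports}) == 1
               for kind in BOTTLENECKS):
            kept.append(bench)
    return kept
\end{verbatim}
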

Of the 3,500 microbenchmarks we have generated, 261 (7.5\,\%) are the subject
of the above-mentioned consensus. This sample is made up of microbenchmarks
generated from 21 benchmarks ---~\ie{} for 5 benchmarks, none of the derived
microbenchmarks reached a consensus among the tools~---, yielding a wide
variety of computations, including floating-point arithmetic, pointer
arithmetic and Boolean arithmetic. Of these, 200 (76.6\,\%) are bottlenecked
on the CPU front-end, 19 (7.3\,\%) on back-end ports, and 81 (31.0\,\%) on
latency introduced by dependencies ---~a single microbenchmark may be
bottlenecked on several resources at once. As mentioned above, this
distribution probably does not reflect the distribution among the 3,500
original benchmarks, as the 261 were not uniformly sampled. However, as all
categories are represented in the sample, we argue that the initial hypothesis
that the generated benchmarks are diverse and representative is confirmed
---~thanks to the transformations described in Section~\ref{sec:bench_gen}.
\subsection{Carbon footprint}
Generating and running the full suite of benchmarks required about 30\,h of
continuous computation on a single machine. During the experiments, the power
supply units reported a near-constant consumption of about 350\,W. The carbon
intensity of the power grid in France, where the experiment was run, was about
29\,g\coeq/kWh at the time of the experiments~\cite{electricitymaps}.

The electricity consumed directly by the server thus amounts to about
10.50\,kWh. Assuming a Power Usage Effectiveness of 1.5, the total electricity
consumption roughly amounts to 15.75\,kWh, or about 450\,g\coeq.

A carbon footprint estimate of the machine's manufacture itself was conducted
by the manufacturer~\cite{poweredgeC6420lca}. Additionally accounting for the
extra 160\,GB of DDR4 SDRAM~\cite{meta_ACT}, the hardware manufacturing,
transport and end-of-life are estimated at 1,266\,kg\coeq. In 2023, this
computation cluster's usage rate was 35\,\%. Assuming 6 years of product life,
30\,h of usage represents about 2,050\,g\coeq{}. The whole experiment thus
amounts to 2.5\,kg\coeq.
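
These orders of magnitude can be reproduced with the following
back-of-the-envelope computation, pro-rating the embodied footprint over six
years of life (about $8766$ hours per year) at a 35\,\% usage rate:
\[
  \frac{350\,\text{W} \times 30\,\text{h}}{1000} \times 1.5 \times
  29\,\text{g/kWh} \approx 457\,\text{g},
  \qquad
  1266\,\text{kg} \times \frac{30\,\text{h}}{6 \times 8766\,\text{h} \times 0.35}
  \approx 2.06\,\text{kg}.
\]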