\section{Results analysis}\label{sec:results_analysis}
The complete raw output of our benchmarking harness ---~roughly speaking, a
large table containing, for each benchmark, the measured cycle count, the cycle
count predicted by each throughput analyzer, the resulting relative error, and
a synthesis of the bottlenecks reported by each tool~--- enables many analyses
that, we believe, can be useful both to the developers and to the users of
throughput analysis tools. Tool designers can draw insights into their tool's
strengths and weaknesses, and work towards improving it with a clearer vision.
Users can gain a better understanding of which tool is best suited to each
situation.
\subsection{Throughput results}\label{ssec:overall_results}
\begin{table}
\centering
\footnotesize
\begin{tabular}{l r r r r r r r r r}
\toprule
\textbf{Bencher} & \textbf{Datapoints} &
\multicolumn{2}{c}{\textbf{Failures}} &
\textbf{MAPE} & \textbf{Median} & \textbf{Q1} & \textbf{Q3} & \textbf{$K_\tau$} & \textbf{Time}\\
& & (Count) & (\%) & (\%) & (\%) & (\%) & (\%) & & (CPU$\cdot$h) \\
\midrule
BHive & 2198 & 1302 & (37.20) & 27.95 & 7.78 & 3.01 & 23.01 & 0.81 & 1.37\\
llvm-mca & 3500 & 0 & (0.00) & 36.71 & 27.80 & 12.92 & 59.80 & 0.57 & 0.96 \\
UiCA & 3500 & 0 & (0.00) & 29.59 & 18.26 & 7.11 & 52.99 & 0.58 & 2.12 \\
Ithemal & 3500 & 0 & (0.00) & 57.04 & 48.70 & 22.92 & 75.69 & 0.39 & 0.38 \\
Iaca & 3500 & 0 & (0.00) & 30.23 & 18.51 & 7.13 & 57.18 & 0.59 & 1.31 \\
Gus & 3500 & 0 & (0.00) & 20.37 & 15.01 & 7.82 & 30.59 & 0.82 & 188.04 \\
\bottomrule
\end{tabular}
\caption{Statistical analysis of overall results}\label{table:overall_analysis_stats}
\end{table}
The distribution of relative errors for each tool is presented as a
box plot in Figure~\ref{fig:overall_analysis_boxplot}. Statistical indicators
are given in Table~\ref{table:overall_analysis_stats}, including, for each
tool, the Kendall's $\tau$ correlation coefficient~\cite{kendalltau} already
used in \autoref{chap:palmed} and \autoref{chap:frontend}.
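The mean absolute percentage error (MAPE) is taken here in its usual sense:
writing $\mathit{pred}_i$ for a tool's prediction and $\mathit{meas}_i$ for the
\perf{} measurement on the $i$-th of the $n$ benchmarks it successfully
processed,
\[
\text{MAPE} = \frac{100}{n} \sum_{i=1}^{n}
  \frac{\left|\mathit{pred}_i - \mathit{meas}_i\right|}{\mathit{meas}_i}.
\]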
\begin{figure}
\centering
\includegraphics[width=0.5\linewidth]{overall_analysis_boxplot.pdf}
\caption{Statistical distribution of relative errors}\label{fig:overall_analysis_boxplot}
\end{figure}
These results are, overall, significantly worse than the results reported in
each tool's original article. We attribute this difference mostly to the
specificities of Polybench: being composed of computation kernels, it
intrinsically stresses the CPU more than basic blocks extracted from the Spec
benchmark suite. This difference is clearly reflected in the experimental
section of Palmed in \autoref{chap:palmed}: the accuracy of most tools is
worse on Polybench than on Spec, often by more than a factor of two.
As \bhive{} and \ithemal{} do not support control flow instructions
(\eg{} \texttt{jump} instructions), those had
to be removed from the blocks before analysis. While none of the tools, apart
from \gus{} ---~which is dynamic~---, is able to account for branching costs,
these two analyzers are furthermore unable to account for the front- and
back-end cost of the control flow instructions themselves ---~corresponding to
the $TP_U$ mode introduced by \uica~\cite{uica}, while the others measure
$TP_L$.
\subsection{Understanding \bhive's results}\label{ssec:bhive_errors}
The distribution of \bhive's relative error against \perf{}, plotted on the
right of Figure~\ref{fig:exp_comparability}, puts forward irregularities in
\bhive's results. Since \bhive{} is based on measurements through hardware
counters ---~instead of predictions~---, an excellent accuracy is expected.
Its lack of support for control flow instructions can be held accountable for
a portion of this accuracy drop; our lifting method, based on block
occurrences instead of paths, can explain another portion. We also find that
\bhive{} fails to produce a result on about 40\,\% of the kernels explored
---~meaning that, for those cases, it failed on at least one of the
constituent basic blocks. This is due to the difficulties mentioned earlier in
\autoref{sec:redefine_exec_time}, related to the need to reconstruct the
context of each basic block \textit{ex nihilo}.
The basis of \bhive's method is to run the code to be measured, unrolled a
number of times depending on the code size, with all memory pages but the
code itself unmapped. As the code tries to access memory, it raises segfaults,
caught by \bhive's harness, which allocates a single shared memory page filled
with a repeated constant and maps it wherever segfaults occur, before
restarting the program.
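As an illustration only ---~a simplified sketch, not \bhive's actual
implementation~---, the core of this recovery strategy amounts to a
\texttt{SIGSEGV} handler that maps the single shared page at the faulting
address:
\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[ANSI]C}]
#include <signal.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Sketch only: shared_fd is assumed to back one page filled with
 * the repeated constant. */
static int shared_fd;

static void on_segfault(int sig, siginfo_t *info, void *ctx) {
    (void) sig; (void) ctx;
    /* Map the shared page at the (page-aligned) faulting address. */
    void *page = (void *) ((uintptr_t) info->si_addr & ~(uintptr_t) 0xfff);
    if (mmap(page, 4096, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_FIXED, shared_fd, 0) == MAP_FAILED)
        _exit(1);   /* unrecoverable, e.g. below mmap_min_addr */
    /* Upon return, the harness restarts the measurement. */
}

static void install_handler(void) {
    struct sigaction sa = {0};
    sa.sa_sigaction = on_segfault;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);
}
\end{lstlisting}
\end{minipage}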
The main causes of \bhive{} failure are bad code behaviour (\eg{} control flow
not reaching the exit point of the measurement if a bad jump is inserted), too
many segfaults to be handled, or a segfault that occurs even after mapping a
page at the problematic address.
The registers are also initialized, at the beginning of the measurement, to the
fixed constant \texttt{0x2324000}. We show through two examples that this
initial value can be of crucial importance.
The following experiments are executed on an Intel(R) Xeon(R) Gold 6230R CPU
(Cascade Lake), with hyperthreading disabled.
\paragraph{Imprecise analysis.} We consider the following x86-64 kernel.
\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
vmulsd (%rax), %xmm3, %xmm0
vmovsd %xmm0, (%r10)
\end{lstlisting}
\end{minipage}
When executed with all the general purpose registers initialized to the default
constant, \bhive{} reports 9 cycles per iteration, since \reg{rax} and
\reg{r10} hold the same value, inducing a read-after-write dependency between
the two instructions. If, however, \bhive{} is tweaked to initialize \reg{r10}
to a value that aliases (\wrt{} physical addresses) with the value in
\reg{rax}, \eg{} between \texttt{0x10000} and \texttt{0x10007} (inclusive), it
reports 19 cycles per iteration instead, while a value between \texttt{0x10008}
and \texttt{0x1009f} (inclusive) yields the expected 1 cycle ---~except for
values in \texttt{0x10039}--\texttt{0x1003f} and
\texttt{0x10079}--\texttt{0x1007f}, which yield 2 cycles as the store crosses a
cache line boundary.
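The 2-cycle measurements can be cross-checked against the cache geometry: with
64-byte cache lines, an 8-byte store starting at the offsets above is exactly
one that does not fit within a single line, as the following illustrative
predicate (ours, not part of \bhive) shows:
\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[ANSI]C}]
#include <stdbool.h>
#include <stdint.h>

/* An 8-byte store starting at addr crosses a 64-byte cache line iff it
 * does not fit in the current line: true for in-line offsets 0x39 to 0x3f,
 * matching the address ranges observed above. */
static bool store8_crosses_line(uint64_t addr) {
    return (addr % 64) + 8 > 64;
}
\end{lstlisting}
\end{minipage}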
In the same way, the value used to initialize the shared memory page can
influence the results whenever it gets loaded into registers.
\vspace{0.5em}
\paragraph{Failed analysis.} Some memory accesses will always result in an
error; for instance, it is impossible to \texttt{mmap} at an address lower
than \texttt{/proc/sys/vm/mmap\_min\_addr}, which defaults to \texttt{0x10000}.
Thus, with equal initial values for all registers, the following kernel fails,
since the second instruction attempts to load from address 0:
\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
subq %r11, %r10
movq (%r10), %rax
\end{lstlisting}
\end{minipage}
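This limitation can be checked directly: a fixed mapping requested at
address 0 is rejected by the kernel, so the recovery strategy described above
cannot succeed for such accesses. The following stand-alone snippet (ours, for
illustration) exhibits the failure:
\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[ANSI]C}]
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    /* Request a fixed mapping at address 0, below mmap_min_addr. */
    void *p = mmap((void *) 0, 4096, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    if (p == MAP_FAILED)
        perror("mmap at address 0");  /* EPERM for an unprivileged process */
    return 0;
}
\end{lstlisting}
\end{minipage}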
Such errors can occur in more convoluted ways. The following x86-64 kernel,
for instance, is extracted from a version of the \texttt{durbin}
kernel\footnote{\texttt{durbin.pocc.noopt.default.unroll8.MEDIUM.kernel21.s}
in the full results}.
\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
vmovsd 0x10(%r8, %rcx), %xmm6
subl %eax, %esi
movslq %esi, %rsi
vfmadd231sd -8(%r9, %rsi, 8), %xmm6, %xmm0
\end{lstlisting}
\end{minipage}
Here, when run with the general purpose registers initialized to the default
constant, \bhive{} fails to measure the kernel: at the 2\textsuperscript{nd}
occurrence of the unrolled loop body, it cannot recover from the error raised
by the \texttt{vfmadd231sd} instruction with its \texttt{mmap} strategy.
Indeed, after the first iteration the value in \reg{rsi} becomes zero, then
negative at the second iteration; the second occurrence of the last
instruction thus loads from address \texttt{0xfffffffff0a03ff8}, which lies in
kernel space. This microkernel can be benchmarked with \bhive{} \eg{} by
initializing \reg{rax} to 1.
Some other microkernels fail in a similar way when trying to access addresses
that are not in \emph{canonical form} for x86-64 with 48-bit virtual
addresses, as defined in Section~3.3.7.1 of Intel's Software
Developer's Manual~\cite{ref:intel64_software_dev_reference_vol1} and
Section~5.3.1 of the AMD64 Architecture Programmer's
Manual~\cite{ref:amd64_architecture_dev_reference_vol2}. Others still fail on
accesses relative to the instruction pointer, as \bhive{} read-protects the
page containing the unrolled microkernel's instructions.
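For reference, the canonical-form condition (with 48-bit virtual addresses)
only requires bits 48 to 63 to replicate bit 47; the following helper, ours
and purely illustrative, checks it:
\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[ANSI]C}]
#include <stdbool.h>
#include <stdint.h>

/* An address is canonical (48-bit virtual addresses) iff sign-extending
 * bit 47 into bits 48..63 leaves it unchanged. */
static bool is_canonical_48(uint64_t addr) {
    return (uint64_t) (((int64_t) (addr << 16)) >> 16) == addr;
}
\end{lstlisting}
\end{minipage}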
\subsection{Bottleneck prediction}\label{ssec:bottleneck_pred_analysis}
We noted earlier, in Section~\ref{ssec:bottleneck_diversity}, that some of the
tools studied are also able to report suspected bottlenecks for the evaluated
program; the corresponding results are presented in Table~\ref{table:coverage}.
To users willing to optimize their program, this feature might be even more
useful than the raw throughput prediction, as it strongly hints towards what
needs to be improved.
In the majority of the cases studied, the tools do not agree on the presence
or absence of a given type of bottleneck. Although the tools might seem to
perform better on frontend bottleneck detection, it must be recalled that only
two tools (versus three in the other cases) report frontend bottlenecks,
making it easier for them to agree.
\begin{table}
\centering
\begin{tabular}{l r r r r}
\toprule
\textbf{Tool}
& \multicolumn{2}{c}{\textbf{Ports}}
& \multicolumn{2}{c}{\textbf{Dependencies}} \\
\midrule
\llvmmca{} & 567 & (24.6\,\%) & 1032 & (41.9\,\%) \\
\uica{} & 516 & (22.4\,\%) & 530 & (21.5\,\%) \\
\iaca{} & 1221 & (53.0\,\%) & 900 & (36.6\,\%) \\
\bottomrule
\end{tabular}
\caption{Diverging bottleneck prediction per tool}\label{table:bottleneck_diverging_pred}
\end{table}
Table~\ref{table:bottleneck_diverging_pred}, in turn, breaks down the cases
on which the three tools disagree into the number of times each tool makes a
diverging prediction ---~\ie{} predicts differently than the two others. In
the case of ports, \iaca{} is responsible for half of the divergences
---~which is not sufficient to conclude that the other tools' predictions are
correct. In the case of dependencies, there is no clear outlier, even though
\uica{} seems to fare better than the others.
In no case does a single tool seem responsible for the vast majority of
disagreements, which would have hinted towards it failing to predict this kind
of bottleneck correctly. In the absence of a source of truth indicating
whether a bottleneck is effectively present, and with no clear-cut result for
(a subset of) the tools' predictions, we cannot conclude on the quality of
each tool's predictions for each kind of bottleneck.
\subsection{Impact of dependency-boundness}\label{ssec:memlatbound}
\begin{table}
\centering
\footnotesize
\begin{tabular}{l r r r r r r r r}
\toprule
\textbf{Bencher} & \textbf{Datapoints} &
\multicolumn{2}{c}{\textbf{Failures}} &
\textbf{MAPE} & \textbf{Median} & \textbf{Q1} & \textbf{Q3} & \textbf{$K_\tau$}\\
& & (Count) & (\%) & (\%) & (\%) & (\%) & (\%) & \\
\midrule
BHive & 1365 & 1023 & (42.84) & 34.07 & 8.62 & 4.30 & 24.25 & 0.76\\
llvm-mca & 2388 & 0 & (0.00) & 27.06 & 21.04 & 9.69 & 32.73 & 0.79\\
UiCA & 2388 & 0 & (0.00) & 18.42 & 11.96 & 5.42 & 23.32 & 0.80\\
Ithemal & 2388 & 0 & (0.00) & 62.66 & 53.84 & 24.12 & 81.95 & 0.40\\
Iaca & 2388 & 0 & (0.00) & 17.55 & 12.17 & 4.64 & 22.35 & 0.82\\
Gus & 2388 & 0 & (0.00) & 23.18 & 20.23 & 8.78 & 32.73 & 0.83\\
\bottomrule
\end{tabular}
\caption{Statistical analysis of the results, with the rows latency-bound
through memory-carried dependencies removed}\label{table:nomemdeps_stats}
\end{table}
An overview of the full results table hints towards two main tendencies: on a
significant number of rows, the static tools ---~thus setting \gus{} and
\bhive{} aside~---, except \ithemal, often yield comparatively bad throughput
predictions \emph{together}; and many of these rows use the \texttt{O1} and
\texttt{O1autovect} compilation settings (\texttt{gcc} with \texttt{-O1}, plus
vectorisation options for the latter).
To confirm the first observation, we consider the 30\,\% worst benchmarks
---~in terms of MAPE relative to \perf~--- for \llvmmca, \uica{} and \iaca,
yielding 1050 rows each. These three sets share 869 rows (82.8\,\%), which we
call \textit{jointly bad rows}.
Among these 869 jointly bad rows, we further find that respectively 342
(39.4\,\%) and 337 (38.8\,\%) are compiled with the \texttt{O1} and
\texttt{O1autovect} settings, totalling 679 (78.1\,\%) \texttt{O1}-based rows,
against 129 (14.8\,\%) for \texttt{default} and 61 (7.0\,\%) for
\texttt{O3nosimd}. This result is significant enough to be used as a hint for
investigating the issue.
\begin{figure}
\centering
\includegraphics[width=0.5\linewidth]{nomemdeps_boxplot.pdf}
\caption{Statistical distribution of relative errors, with and without pruning
the rows latency-bound through memory-carried dependencies}\label{fig:nomemdeps_boxplot}
\end{figure}
Insofar as our approach maintains a strong link between the basic blocks
studied and the source code from which they are extracted, it is possible to
identify the high-level characteristics of the microbenchmarks concerned.
In the overwhelming majority (97.5\,\%) of those jointly bad rows, the tools
predicted fewer cycles than measured, meaning that a bottleneck is either
missed or underestimated.
Manual investigation of a few simple benchmarks (no polyhedral transformation
applied, \texttt{O1} mode, not unrolled) further hints towards dependencies:
for instance, the \texttt{gemver} benchmark, which is \emph{not} among the
badly predicted benchmarks, has this kernel:
\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[ANSI]C}]
for(c3)
A[c1][c3] += u1[c1] * v1[c3]
+ u2[c1] * v2[c3];
\end{lstlisting}
\end{minipage}
while the \texttt{atax} benchmark, which is among the badly predicted ones, has
this kernel:
\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[ANSI]C}]
for(c3)
tmp[c1] += A[c1][c3] * x[c3];
\end{lstlisting}
\end{minipage}
The first one exhibits no obvious dependency-boundness, while the second,
accumulating into \texttt{tmp[c1]} (which does not depend on the innermost
iteration variable), lacks instruction-level parallelism. Among the simple
benchmarks (as described above), 8 are in the badly predicted list, all of
which exhibit a read-after-write data dependency on the preceding iteration.
Looking at the assembly code generated for those in the \texttt{O1} modes, it
appears that the dependencies exhibited at the C level are compiled to
\emph{memory-carried} dependencies: the read-after-write happens on a given
memory address instead of a register. This kind of dependency, prone to
aliasing and dependent on the values held in registers, is hard for a static
tool to infer and is not supported by the analyzers under scrutiny in the
general case; it could thus reasonably explain the results observed.
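As an illustration of the difference ---~a sketch of the assumed compiler
behaviour, not the exact generated code~---, an accumulation into
\texttt{tmp[c1]} must be reloaded and stored back at every iteration as long
as the compiler cannot rule out aliasing, whereas accumulating into a local
variable lets the value live in a register:
\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[ANSI]C}]
/* Memory-carried dependency: tmp[c1] may alias A_row or x, so each
 * iteration loads, updates and stores it back, as observed at -O1. */
void acc_memory(double *tmp, const double *A_row,
                const double *x, int n, int c1) {
    for (int c3 = 0; c3 < n; c3++)
        tmp[c1] += A_row[c3] * x[c3];
}

/* Register-carried dependency only: the accumulator is a local variable,
 * loaded once and stored once, so iterations chain through a register. */
void acc_register(double *tmp, const double *A_row,
                  const double *x, int n, int c1) {
    double t = tmp[c1];
    for (int c3 = 0; c3 < n; c3++)
        t += A_row[c3] * x[c3];
    tmp[c1] = t;
}
\end{lstlisting}
\end{minipage}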
There is no easy way, however, to know for certain which of the 3500
benchmarks are latency bound: no hardware counter reports this. We investigate
this further using \gus's sensitivity analysis: in addition to \gus's
``normal'' throughput estimation, we run it a second time, disabling the
accounting of latencies induced by memory-carried dependencies. By
construction, this second estimation should be either very close to the first
one, or significantly below it. We then assume a benchmark to be latency bound
due to memory-carried dependencies when it is at least 40\,\% faster when this
latency is disabled; there are 1112 (31.8\,\%) such benchmarks.
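Writing $C_{\mathit{full}}$ for \gus's normal prediction and
$C_{\mathit{nolat}}$ for its prediction with this latency accounting disabled,
and reading ``at least 40\,\% faster'' as a drop of at least 40\,\% in
predicted cycles, this criterion amounts to
\[
C_{\mathit{nolat}} \leq 0.6 \cdot C_{\mathit{full}}.
\]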
Of the 869 jointly bad rows, 745 (85.7\,\%) are declared latency
bound through memory-carried dependencies by \gus. We conclude that the main
reason for these jointly badly predicted benchmarks is that the predictors
under scrutiny failed to correctly detect these dependencies.
In Section~\ref{ssec:overall_results}, we presented in
Figure~\ref{fig:overall_analysis_boxplot} and
Table~\ref{table:overall_analysis_stats} general statistics on the tools
on the full set of benchmarks. We now remove the 1112 benchmarks
flagged as latency bound through memory-carried dependencies by \gus{} from the
dataset, and present in Figure~\ref{fig:nomemdeps_boxplot} a comparative
box plot for the tools under scrutiny. We also present in
Table~\ref{table:nomemdeps_stats} the same statistics on this pruned dataset.
While the results for \llvmmca, \uica{} and \iaca{} globally improve
significantly, the most noticeable improvements are the reduced spread of the
results and the increase in Kendall's $\tau$ correlation coefficient.
From this,
we argue that detecting memory-carried dependencies is a weak point in current
state-of-the-art static analyzers, and that their results could be
significantly more accurate if improvements are made in this direction.