\section{Results analysis}\label{sec:results_analysis}

The complete raw output from our benchmarking harness ---~roughly speaking, a large table with, for each benchmark, a cycle measurement, a cycle count for each throughput analyzer, the resulting relative error, and a synthesis of the bottlenecks reported by each tool~--- enables many analyses that, we believe, could be useful to both developers and users of throughput analysis tools. Tool designers can gain insight into their tool's strengths and weaknesses, and work towards improving it with a clearer vision. Users can gain a better understanding of which tool is best suited to each situation.

\subsection{Throughput results}\label{ssec:overall_results}

\begin{table}
    \centering
    \footnotesize
    \begin{tabular}{l r r r r r r r r r}
        \toprule
        \textbf{Bencher} & \textbf{Datapoints} & \multicolumn{2}{c}{\textbf{Failures}} & \textbf{MAPE} & \textbf{Median} & \textbf{Q1} & \textbf{Q3} & \textbf{$K_\tau$} & \textbf{Time}\\
        & & (Count) & (\%) & (\%) & (\%) & (\%) & (\%) & & (CPU$\cdot$h) \\
        \midrule
        BHive & 2198 & 1302 & (37.20) & 27.95 & 7.78 & 3.01 & 23.01 & 0.81 & 1.37\\
        llvm-mca & 3500 & 0 & (0.00) & 36.71 & 27.80 & 12.92 & 59.80 & 0.57 & 0.96 \\
        UiCA & 3500 & 0 & (0.00) & 29.59 & 18.26 & 7.11 & 52.99 & 0.58 & 2.12 \\
        Ithemal & 3500 & 0 & (0.00) & 57.04 & 48.70 & 22.92 & 75.69 & 0.39 & 0.38 \\
        Iaca & 3500 & 0 & (0.00) & 30.23 & 18.51 & 7.13 & 57.18 & 0.59 & 1.31 \\
        Gus & 3500 & 0 & (0.00) & 20.37 & 15.01 & 7.82 & 30.59 & 0.82 & 188.04 \\
        \bottomrule
    \end{tabular}
    \caption{Statistical analysis of overall results}\label{table:overall_analysis_stats}
\end{table}

The distribution of the relative errors, for each tool, is presented as a box plot in Figure~\ref{fig:overall_analysis_boxplot}. Statistical indicators are also given in Table~\ref{table:overall_analysis_stats}.
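As a reminder, and assuming the usual definitions (the notation below is introduced here for illustration only and is not taken from the harness), let $C_{\mathrm{perf}}(b)$ be the cycle count measured by \perf{} for a benchmark $b$, $C_t(b)$ the cycle count reported by tool~$t$, and $B_t$ the set of benchmarks on which $t$ produced a result. The relative error and the MAPE would then read
\[
    \mathrm{err}_t(b) = \frac{\left| C_t(b) - C_{\mathrm{perf}}(b) \right|}{C_{\mathrm{perf}}(b)},
    \qquad
    \mathrm{MAPE}_t = \frac{100}{|B_t|} \sum_{b \in B_t} \mathrm{err}_t(b).
\]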
We also give, for each tool, the Kendall's tau indicator~\cite{kendalltau} already used in \autoref{chap:palmed} and \autoref{chap:frontend}.

\begin{figure}
    \centering
    \includegraphics[width=0.5\linewidth]{overall_analysis_boxplot.pdf}
    \caption{Statistical distribution of relative errors}\label{fig:overall_analysis_boxplot}
\end{figure}

These results are, overall, significantly worse than what each tool's article presents. We attribute this difference mostly to the specificities of Polybench: being composed of computation kernels, it intrinsically stresses the CPU more than basic blocks extracted from the Spec benchmark suite. This difference is clearly reflected in the experimental section of Palmed in \autoref{chap:palmed}: the accuracy of most tools is worse on Polybench than on Spec, often by more than a factor of two.

As \bhive{} and \ithemal{} do not support control flow instructions (\eg{} \texttt{jump} instructions), those had to be removed from the blocks before analysis. While none of these tools, apart from \gus{} ---~which is dynamic~---, is able to account for branching costs, these two analyzers were also unable to account for the front- and backend cost of the control flow instructions themselves ---~corresponding to the $TP_U$ mode introduced by \uica~\cite{uica}, while the others measure $TP_L$.

\subsection{Understanding \bhive's results}\label{ssec:bhive_errors}

The error distribution of \bhive{} against \perf{}, plotted on the right in Figure~\ref{fig:exp_comparability}, puts forward irregularities in \bhive's results. Since \bhive{} is based on measurements ---~instead of predictions~--- through hardware counters, an excellent accuracy is expected. Its lack of support for control flow instructions can be held accountable for a portion of this accuracy drop; our lifting method, based on block occurrences instead of paths, can explain another portion.
We also find that \bhive{} fails to produce a result in about 37\,\% of the kernels explored ---~which means that, for those cases, \bhive{} failed to produce a result on at least one of the constituent basic blocks. In fact, this is due to the difficulties we mentioned in \autoref{sec:redefine_exec_time} earlier, related to the need to reconstruct the context of each basic block \textit{ex nihilo}.

The basis of \bhive's method is to run the code to be measured, unrolled a number of times depending on the code size, with all memory pages but the code unmapped. As the code tries to access memory, it will raise segfaults, caught by \bhive's harness, which allocates a single shared-memory page, filled with a repeated constant, that it will map wherever segfaults occur before restarting the program. The main causes of \bhive{} failure are bad code behaviour (\eg{} control flow not reaching the exit point of the measure if a bad jump is inserted), too many segfaults to be handled, or a segfault that occurs even after mapping a page at the problematic address.

The registers are also initialized, at the beginning of the measurement, to the fixed constant \texttt{0x2324000}. We show through two examples that this initial value can be of crucial importance.

The following experiments are executed on an Intel(R) Xeon(R) Gold 6230R CPU (Cascade Lake), with hyperthreading disabled.

\paragraph{Imprecise analysis.} We consider the following x86-64 kernel.

\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
vmulsd (%rax), %xmm3, %xmm0
vmovsd %xmm0, (%r10)
\end{lstlisting}
\end{minipage}

When executed with all the general purpose registers initialized to the default constant, \bhive{} reports 9 cycles per iteration, since \reg{rax} and \reg{r10} hold the same value, inducing a read-after-write dependency between the two instructions.
If, however, \bhive{} is tweaked to initialize \reg{r10} to a value that aliases (\wrt{} physical addresses) with the value in \reg{rax}, \eg{} between \texttt{0x10000} and \texttt{0x10007} (inclusive), it reports 19 cycles per iteration instead; while a value between \texttt{0x10008} and \texttt{0x1009f} (inclusive) yields the expected 1 cycle ---~except for values in \texttt{0x10039}-\texttt{0x1003f} and \texttt{0x10079}-\texttt{0x1007f}, yielding 2 cycles as the store crosses a cache line boundary.

In the same way, the value used to initialize the shared memory page can influence the results whenever it gets loaded into registers.

\vspace{0.5em}

\paragraph{Failed analysis.} Some memory accesses will always result in an error; for instance, it is impossible to \texttt{mmap} at an address lower than the value of \texttt{/proc/sys/vm/mmap\_min\_addr}, which defaults to \texttt{0x10000}. Thus, with equal initial values for all registers, the following kernel would fail, since the second operation attempts to load at address 0:

\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
subq %r11, %r10
movq (%r10), %rax
\end{lstlisting}
\end{minipage}

Such errors can occur in more convoluted ways. The following x86-64 kernel, for instance, is extracted from a version of the \texttt{durbin} kernel\footnote{\texttt{durbin.pocc.noopt.default.unroll8.MEDIUM.kernel21.s} in the full results.}.

\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
vmovsd 0x10(%r8, %rcx), %xmm6
subl %eax, %esi
movslq %esi, %rsi
vfmadd231sd -8(%r9, %rsi, 8), \
    %xmm6, %xmm0
\end{lstlisting}
\end{minipage}

Here, when run with the general purpose registers initialized to the default constant, \bhive{} fails to measure the kernel at the 2\textsuperscript{nd} occurrence of the unrolled loop body: it cannot recover from an error at the \texttt{vfmadd231sd} instruction with the \texttt{mmap} strategy.
Indeed, after the first iteration the value in \reg{rsi} becomes zero, then negative at the second iteration; thus, the second occurrence of the last instruction fetches at address \texttt{0xfffffffff0a03ff8}, which is in kernel space. This microkernel can be benchmarked with \bhive{} \eg{} by initializing \reg{rax} to 1.

Some other microkernels fail in a similar way when trying to access addresses that are not in \emph{canonical form} for x86-64 with 48-bit virtual addresses, as defined in Section~3.3.7.1 of Intel's Software Developer's Manual~\cite{ref:intel64_software_dev_reference_vol1} and Section~5.3.1 of the AMD64 Architecture Programmer's Manual~\cite{ref:amd64_architecture_dev_reference_vol2}. Others still fail with accesses relative to the instruction pointer, as \bhive{} read-protects the unrolled microkernel's instructions page.

\subsection{Bottleneck prediction}\label{ssec:bottleneck_pred_analysis}

As introduced in Section~\ref{ssec:bottleneck_diversity}, some of the tools studied are also able to report suspected bottlenecks for the evaluated program; the results are presented in Table~\ref{table:coverage}. This feature might be even more useful than raw throughput predictions to the users of these tools willing to optimize their program, as it strongly hints at what needs to be improved.

In the majority of the cases studied, the tools are not able to agree on the presence or absence of a type of bottleneck. Although it might seem that the tools are performing better on frontend bottleneck detection, it must be recalled that only two tools (versus three in the other cases) are reporting frontend bottlenecks, thus making it easier for them to agree.
\begin{table}
    \centering
    \begin{tabular}{l r r r r}
        \toprule
        \textbf{Tool} & \multicolumn{2}{c}{\textbf{Ports}} & \multicolumn{2}{c}{\textbf{Dependencies}} \\
        \midrule
        \llvmmca{} & 567 & (24.6 \%) & 1032 & (41.9 \%) \\
        \uica{} & 516 & (22.4 \%) & 530 & (21.5 \%) \\
        \iaca{} & 1221 & (53.0 \%) & 900 & (36.6 \%) \\
        \bottomrule
    \end{tabular}
    \caption{Diverging bottleneck prediction per tool}\label{table:bottleneck_diverging_pred}
\end{table}

Table~\ref{table:bottleneck_diverging_pred}, in turn, breaks down the cases on which the three tools disagree into the number of times each tool makes a diverging prediction ---~\ie{} the tool predicts differently from the two others. In the case of ports, \iaca{} is responsible for half of the divergences ---~which is not sufficient to conclude that the prediction of the other tools is correct. In the case of dependencies, however, there is no clear outlier, even though \uica{} seems to fare better than the others.

In no case does a single tool seem to be responsible for the vast majority of disagreements, which would have hinted at it failing to correctly predict this bottleneck. In the absence of a source of truth indicating whether a bottleneck is effectively present, and with no clear-cut result for (a subset of) tool predictions, we cannot conclude on the quality of the predictions from each tool for each kind of bottleneck.
\subsection{Impact of dependency-boundness}\label{ssec:memlatbound}

\begin{table}
    \centering
    \footnotesize
    \begin{tabular}{l r r r r r r r r}
        \toprule
        \textbf{Bencher} & \textbf{Datapoints} & \textbf{Failures} & \textbf{(\%)} & \textbf{MAPE} & \textbf{Median} & \textbf{Q1} & \textbf{Q3} & \textbf{$K_\tau$}\\
        \midrule
        BHive & 1365 & 1023 & (42.84\,\%) & 34.07\,\% & 8.62\,\% & 4.30\,\% & 24.25\,\% & 0.76\\
        llvm-mca & 2388 & 0 & (0.00\,\%) & 27.06\,\% & 21.04\,\% & 9.69\,\% & 32.73\,\% & 0.79\\
        UiCA & 2388 & 0 & (0.00\,\%) & 18.42\,\% & 11.96\,\% & 5.42\,\% & 23.32\,\% & 0.80\\
        Ithemal & 2388 & 0 & (0.00\,\%) & 62.66\,\% & 53.84\,\% & 24.12\,\% & 81.95\,\% & 0.40\\
        Iaca & 2388 & 0 & (0.00\,\%) & 17.55\,\% & 12.17\,\% & 4.64\,\% & 22.35\,\% & 0.82\\
        Gus & 2388 & 0 & (0.00\,\%) & 23.18\,\% & 20.23\,\% & 8.78\,\% & 32.73\,\% & 0.83\\
        \bottomrule
    \end{tabular}
    \caption{Statistical analysis of overall results, excluding the rows that are latency bound through memory-carried dependencies}\label{table:nomemdeps_stats}
\end{table}

An overview of the full results table hints towards two main tendencies: on a significant number of rows, the static tools ---~thus leaving \gus{} and \bhive{} apart~---, except \ithemal, often yield comparatively bad throughput predictions \emph{together}; and many of these rows are those using the \texttt{O1} and \texttt{O1autovect} compilation settings (\texttt{gcc} with \texttt{-O1}, plus vectorisation options for the latter).

To confirm the first observation, we look at the 30\,\% worst benchmarks ---~in terms of MAPE relative to \perf~--- for \llvmmca, \uica{} and \iaca{} ---~yielding 1050 rows each. All of these share 869 rows (82.8\,\%), which we call \textit{jointly bad rows}.
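In set notation, with symbols introduced here purely for illustration (they are not part of the harness), let $W_t$ denote the set of the 1050 worst-predicted rows of tool~$t$. The jointly bad rows then form the set
\[
    J = W_{\mathrm{mca}} \cap W_{\mathrm{uica}} \cap W_{\mathrm{iaca}},
    \qquad |J| = 869,
    \qquad \frac{|J|}{|W_t|} = \frac{869}{1050} \approx 82.8\,\%.
\]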
Among these 869 jointly bad rows, we further find that respectively 342 (39.4\,\%) and 337 (38.8\,\%) are compiled using the \texttt{O1} and \texttt{O1autovect} settings, totalling 679 (78.1\,\%) \texttt{O1}-based rows, against 129 (14.8\,\%) for \texttt{default} and 61 (7.0\,\%) for \texttt{O3nosimd}. This result is significant enough to be used as a hint to investigate the issue.

\begin{figure}
    \centering
    \includegraphics[width=0.5\linewidth]{nomemdeps_boxplot.pdf}
    \caption{Statistical distribution of relative errors, with and without pruning the rows that are latency bound through memory-carried dependencies}\label{fig:nomemdeps_boxplot}
\end{figure}

Insofar as our approach maintains a strong link between the basic blocks studied and the source codes from which they are extracted, it is possible to identify the high-level characteristics of the concerned microbenchmarks. In the overwhelming majority (97.5\,\%) of those jointly bad rows, the tools predicted fewer cycles than measured, meaning that a bottleneck is either missed or underestimated.

Manual investigation of a few simple benchmarks (no polyhedral transformation applied, \texttt{O1} mode, not unrolled) further hints towards dependencies: for instance, the \texttt{gemver} benchmark, which is \emph{not} among the badly predicted benchmarks, has this kernel:

\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[ANSI]C}]
for(c3)
    A[c1][c3] += u1[c1] * v1[c3] + u2[c1] * v2[c3];
\end{lstlisting}
\end{minipage}

while the \texttt{atax} benchmark, which is among the badly predicted ones, has this kernel:

\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[ANSI]C}]
for(c3)
    tmp[c1] += A[c1][c3] * x[c3];
\end{lstlisting}
\end{minipage}

The first one exhibits no obvious dependency-boundness, while the second, accumulating on \texttt{tmp[c1]} (independent of the iteration variable), lacks instruction-level parallelism.
Among the simple benchmarks (as described above), 8 are in the badly predicted list, all of which exhibit a read-after-write data dependency to the preceding iteration.

Looking at the assembly code generated for those in \texttt{O1} modes, it appears that the dependencies exhibited at the C level are compiled to \emph{memory-carried} dependencies: the read-after-write happens for a given memory address, instead of for a register. This kind of dependency, prone to aliasing and dependent on the values of the registers, is hard to infer for a static tool and is not supported by the analyzers under scrutiny in the general case; it could thus reasonably explain the results observed.

There is no easy way, however, to know for certain which of the 3500 benchmarks are latency bound: no hardware counter reports this. We investigate this further using \gus's sensitivity analysis: in addition to the ``normal'' throughput estimation of \gus, we run it a second time, disabling the accounting for latency through memory dependencies. By construction, this second measurement should be either very close to the first one, or significantly lower. We then assume a benchmark to be latency bound due to memory-carried dependencies when it is at least 40\,\% faster when this latency is disabled; there are 1112 (31.8\,\%) such benchmarks.

Of the 869 jointly bad rows, 745 (85.7\,\%) are declared latency bound through memory-carried dependencies by \gus. We conclude that the main reason for these jointly badly predicted benchmarks is that the predictors under scrutiny failed to correctly detect these dependencies.

In Section~\ref{ssec:overall_results}, we presented in Figure~\ref{fig:overall_analysis_boxplot} and Table~\ref{table:overall_analysis_stats} general statistics on the tools on the full set of benchmarks.
We now remove the 1112 benchmarks flagged as latency bound through memory-carried dependencies by \gus{} from the dataset, and present in Figure~\ref{fig:nomemdeps_boxplot} a comparative box plot for the tools under scrutiny. We also present in Table~\ref{table:nomemdeps_stats} the same statistics on this pruned dataset. While the results for \llvmmca, \uica{} and \iaca{} improved significantly overall, the most noticeable improvements are the reduced spread of the results and the increase of Kendall's $\tau$ correlation coefficient.

From this, we argue that detecting memory-carried dependencies is a weak point in current state-of-the-art static analyzers, and that their results could be significantly more accurate if improvements were made in this direction.