\section{Results analysis}\label{sec:results_analysis}

The complete raw output from our benchmarking harness ---~roughly speaking, a
large table with, for each benchmark, a measured cycle count, the cycle count
predicted by each throughput analyzer, the resulting relative error, and a
synthesis of the bottlenecks reported by each tool~--- enables many analyses
that, we believe, could be useful both to throughput analysis tool developers
and users. Tool designers can draw insights on their tool's main strengths and
weaknesses, and work towards improving them with a clearer vision. Users can
gain a better understanding of which tool is best suited to each situation.

\subsection{Throughput results}\label{ssec:overall_results}

\begin{table}
    \centering
    \footnotesize
    \begin{tabular}{l r r r r r r r r r}
        \toprule
        \textbf{Bencher} & \textbf{Datapoints} &
        \multicolumn{2}{c}{\textbf{Failures}} &
        \textbf{MAPE} & \textbf{Median} & \textbf{Q1} & \textbf{Q3} & \textbf{$K_\tau$} & \textbf{Time}\\
        & & (Count) & (\%) & (\%) & (\%) & (\%) & (\%) & & (CPU$\cdot$h) \\
        \midrule
        BHive & 2198 & 1302 & (37.20) & 27.95 & 7.78 & 3.01 & 23.01 & 0.81 & 1.37\\
        llvm-mca & 3500 & 0 & (0.00) & 36.71 & 27.80 & 12.92 & 59.80 & 0.57 & 0.96 \\
        UiCA & 3500 & 0 & (0.00) & 29.59 & 18.26 & 7.11 & 52.99 & 0.58 & 2.12 \\
        Ithemal & 3500 & 0 & (0.00) & 57.04 & 48.70 & 22.92 & 75.69 & 0.39 & 0.38 \\
        Iaca & 3500 & 0 & (0.00) & 30.23 & 18.51 & 7.13 & 57.18 & 0.59 & 1.31 \\
        Gus & 3500 & 0 & (0.00) & 20.37 & 15.01 & 7.82 & 30.59 & 0.82 & 188.04 \\
        \bottomrule
    \end{tabular}
    \caption{Statistical analysis of overall results}\label{table:overall_analysis_stats}
\end{table}

The distribution of relative errors, for each tool, is presented as a
box plot in Figure~\ref{fig:overall_analysis_boxplot}. Statistical indicators
are given in Table~\ref{table:overall_analysis_stats}, along with, for
each tool, the Kendall's tau indicator~\cite{kendalltau} that we used earlier
in \autoref{chap:palmed} and \autoref{chap:frontend}.

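
As an illustration of how the indicators of
Table~\ref{table:overall_analysis_stats} can be derived from the raw results,
the following Python sketch computes them for one tool from two lists of cycle
counts. It is not the code of our harness: the function names are hypothetical,
the quartiles rely on the default method of Python's \texttt{statistics}
module, and we assume the tau-a variant of Kendall's tau (no correction for
ties), which may differ from the variant actually used.

\begin{lstlisting}[language=Python]
from statistics import mean, median, quantiles


def relative_errors(measured, predicted):
    """Relative error |predicted - measured| / measured, per benchmark."""
    return [abs(p - m) / m for m, p in zip(measured, predicted)]


def kendall_tau(xs, ys):
    """Kendall's tau-a (assumed variant): (concordant - discordant) / pairs."""
    n = len(xs)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (xs[i] - xs[j]) * (ys[i] - ys[j])
            score += (prod > 0) - (prod < 0)  # +1, -1, or 0 for a tie
    return score / (n * (n - 1) / 2)


def summarize(measured, predicted):
    """Indicators for one tool, as percentages where applicable."""
    errs = relative_errors(measured, predicted)
    q1, _, q3 = quantiles(errs, n=4)
    return {
        "MAPE": 100 * mean(errs),
        "Median": 100 * median(errs),
        "Q1": 100 * q1,
        "Q3": 100 * q3,
        "K_tau": kendall_tau(measured, predicted),
    }
\end{lstlisting}

The quadratic pair enumeration in \texttt{kendall\_tau} is chosen for
readability; it remains tractable for a few thousand rows.
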
\begin{figure}
    \centering
    \includegraphics[width=0.5\linewidth]{overall_analysis_boxplot.pdf}
    \caption{Statistical distribution of relative errors}\label{fig:overall_analysis_boxplot}
\end{figure}

These results are, overall, significantly worse than those reported in each
tool's article. We attribute this difference mostly to the specificities of
Polybench: being composed of computation kernels, it intrinsically stresses the
CPU more than basic blocks extracted out of \eg{} the Spec benchmark suite.
This difference is clearly reflected in the experimental section of Palmed in
\autoref{chap:palmed}: the accuracy of most tools is worse on Polybench than on
Spec, often by more than a factor of two.

As \bhive{} and \ithemal{} do not support control flow instructions
(\eg{} \texttt{jump} instructions), those had
to be removed from the blocks before analysis. While none of these tools, apart
from \gus{} ---~which is dynamic~---, is able to account for branching costs,
these two analyzers were also unable to account for the front- and backend cost
of the control flow instructions themselves ---~corresponding to the
$TP_U$ mode introduced by \uica~\cite{uica}, while others
measure $TP_L$.

\subsection{Understanding \bhive's results}\label{ssec:bhive_errors}

The error distribution of \bhive{} against \perf{}, plotted on the right of
Figure~\ref{fig:exp_comparability}, highlights irregularities in \bhive's
results. Since \bhive{} is based on measurements ---~instead of predictions~---
through hardware counters, an excellent accuracy is expected. Its lack of
support for control flow instructions can be held accountable for a portion of
this accuracy drop; our lifting method, based on block occurrences instead of
paths, can explain another portion. We also find that \bhive{} fails to produce
a result in about 37\,\% of the kernels explored (see
Table~\ref{table:overall_analysis_stats}) ---~which means that, for
those cases, \bhive{} failed to produce a result on at least one of the
constituent basic blocks. In fact, this is due to the difficulties we mentioned
in \autoref{sec:redefine_exec_time} earlier, related to the need to reconstruct
the context of each basic block \textit{ex nihilo}.

The basis of \bhive's method is to run the code to be measured, unrolled a
number of times depending on the code size, with all memory pages but the
code unmapped. As the code tries to access memory, it will raise segfaults,
caught by \bhive's harness, which allocates a single shared-memory page, filled
with a repeated constant, that it will map wherever segfaults occur before
restarting the program.
The main causes of \bhive{} failure are bad code behaviour (\eg{} control flow
not reaching the exit point of the measurement if a bad jump is inserted), too
many segfaults to be handled, or a segfault that occurs even after mapping a
page at the problematic address.

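
The map-and-restart loop described above, together with two of its failure
modes, can be pictured with the following toy model in Python. It is only a
sketch of the principle, not \bhive's implementation: the kernel is reduced to
the list of addresses it accesses, and the fault budget
\texttt{max\_faults}, the 4\,KiB page size and the 47-bit user address space
are assumptions made for the example.

\begin{lstlisting}[language=Python]
MMAP_MIN_ADDR = 0x10000   # lowest mappable address (Linux default)
PAGE_SIZE = 0x1000        # assumption: 4 KiB pages
USER_SPACE_END = 1 << 47  # assumption: 48-bit virtual addresses


def run_with_fault_mapping(accesses, max_faults=16):
    """Replay `accesses`, mapping a page at each fault and restarting.

    Returns the number of faults handled, or None if the measurement
    fails. Accesses straddling two pages are ignored in this toy model.
    """
    mapped = set()
    faults = 0
    while True:
        for addr in accesses:
            page = addr // PAGE_SIZE
            if page in mapped:
                continue
            # The access faults: can a page be mapped there at all?
            if not MMAP_MIN_ADDR <= addr < USER_SPACE_END:
                return None  # fault persists even after trying to map
            if faults == max_faults:
                return None  # too many segfaults to be handled
            mapped.add(page)
            faults += 1
            break  # restart the program from the beginning
        else:
            return faults  # the whole run completed without faulting
\end{lstlisting}

For instance, a run touching only addresses around \texttt{0x2324000} succeeds
after a single fault in this model, whereas any access to address 0 makes it
fail.
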
The registers are also initialized, at the beginning of the measurement, to the
fixed constant \texttt{0x2324000}. We show through two examples that this
initial value can be of crucial importance.

The following experiments are executed on an Intel(R) Xeon(R) Gold 6230R CPU
(Cascade Lake), with hyperthreading disabled.

\paragraph{Imprecise analysis.} We consider the following x86-64 kernel.

\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
    vmulsd (%rax), %xmm3, %xmm0
    vmovsd %xmm0, (%r10)
\end{lstlisting}
\end{minipage}

Note here that, in the AT\&T syntax used above, the
\lstxasm{vmulsd in1, in2, out} instruction is the scalar
double-precision float multiplication of values from \lstxasm{in1} and
\lstxasm{in2}, storing the result in \lstxasm{out}; while
\lstxasm{vmovsd in, out} is a simple \lstxasm{mov} operation from
\lstxasm{in} to \lstxasm{out}
operating on double-precision floats in \reg{xmm} registers.

When executed with all the general purpose registers initialized to the default
constant, \bhive{} reports 9 cycles per iteration, since \reg{rax} and
\reg{r10} hold the same value, inducing a read-after-write dependency between
the two instructions. If, however, \bhive{} is tweaked to initialize \reg{r10}
to a value that aliases (\wrt{} physical addresses) with the value in
\reg{rax}, \eg{} between \texttt{0x10000} and \texttt{0x10007} (inclusive), it
reports 19 cycles per iteration instead; while a value between \texttt{0x10008}
and \texttt{0x1009f} (inclusive) yields the expected 1 cycle ---~except for
values in \texttt{0x10039}-\texttt{0x1003f} and
\texttt{0x10079}-\texttt{0x1007f}, yielding 2 cycles as the store crosses a
cache line boundary.

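
The address ranges above follow from simple arithmetic, which the Python sketch
below reproduces. It assumes 8-byte accesses, 64-byte cache lines and 4\,KiB
pages, and relies on the fact that every virtual page is backed by the same
physical page, so that two accesses alias as soon as their offsets within the
page overlap. The function name and the three labels are hypothetical; the
sketch classifies addresses and does not predict cycle counts.

\begin{lstlisting}[language=Python]
CACHE_LINE = 64    # bytes (assumption)
ACCESS_SIZE = 8    # bytes: one double-precision float
PAGE_SIZE = 0x1000 # bytes (assumption)


def classify_store(r10, rax=0x2324000):
    """Classify the store to (%r10) relative to the load from (%rax).

    All pages share one physical page, so only the offset within the
    page matters. Accesses straddling a page boundary are not handled.
    """
    load_off = rax % PAGE_SIZE
    store_off = r10 % PAGE_SIZE
    if abs(store_off - load_off) < ACCESS_SIZE:
        return "dependency"   # store and load overlap physically
    if store_off % CACHE_LINE + ACCESS_SIZE > CACHE_LINE:
        return "split store"  # store crosses a cache line boundary
    return "independent"
\end{lstlisting}

With the default value of \reg{rax}, this classification yields a dependency
for \texttt{0x10000}-\texttt{0x10007}, a split store for
\texttt{0x10039}-\texttt{0x1003f} and \texttt{0x10079}-\texttt{0x1007f}, and
independent accesses for the rest of the range up to \texttt{0x1009f}.
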
In the same way, the value used to initialize the shared memory page can
influence the results whenever it gets loaded into registers.

\vspace{0.5em}

\paragraph{Failed analysis.} Some memory accesses will always result in an
error; for instance, on Linux, it is impossible to \texttt{mmap} at an address
lower than the value of \texttt{/proc/sys/vm/mmap\_min\_addr}, which defaults to
\texttt{0x10000}. Thus, with equal initial values for all registers, the
following kernel would fail, since the second operation attempts to load at
address 0:

\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
    subq %r11, %r10
    movq (%r10), %rax
\end{lstlisting}
\end{minipage}

Such errors can occur in more convoluted ways. The following x86-64 kernel,
for instance, is extracted from a version of the \texttt{durbin}
kernel\footnote{\texttt{durbin.pocc.noopt.default.unroll8.MEDIUM.kernel21.s}
in the full results}.

\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
    vmovsd 0x10(%r8, %rcx), %xmm6
    subl %eax, %esi
    movslq %esi, %rsi
    vfmadd231sd -8(%r9, %rsi, 8), \
        %xmm6, %xmm0
\end{lstlisting}
\end{minipage}

Here, when run with the general purpose registers initialized to the default
constant, \bhive{} fails to measure the kernel at the 2\textsuperscript{nd}
occurrence of the unrolled loop body: the \texttt{mmap} strategy cannot
recover from the error raised by the \texttt{vfmadd231sd} instruction. Indeed,
the value in \reg{rsi} becomes zero at the first iteration, then negative at
the second; thus, the second occurrence of the last instruction loads from
address \texttt{0xfffffffff0a03ff8}, which is in kernel space. This
microkernel can be benchmarked with \bhive{} \eg{} by initializing \reg{rax}
to 1.

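
The faulting address can be recomputed by hand; the Python sketch below does so
by emulating only the address-related instructions of the kernel
(\texttt{subl}, \texttt{movslq} and the memory operand of
\texttt{vfmadd231sd}), with every register starting at the default constant.
The helper names are hypothetical and the floating-point part of the kernel is
ignored.

\begin{lstlisting}[language=Python]
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1
INIT = 0x2324000  # default initial register value


def sign_extend32(value):
    """Interpret the low 32 bits of `value` as a signed integer."""
    value &= MASK32
    return value - (1 << 32) if value & (1 << 31) else value


def fma_load_addresses(iterations, rax=INIT, rsi=INIT, r9=INIT):
    """Addresses read by vfmadd231sd over successive loop bodies."""
    addrs = []
    for _ in range(iterations):
        esi = (rsi - rax) & MASK32  # subl %eax, %esi
        rsi = sign_extend32(esi)    # movslq %esi, %rsi
        # -8(%r9, %rsi, 8), wrapped to 64 bits
        addrs.append((r9 + 8 * rsi - 8) & MASK64)
    return addrs


if __name__ == "__main__":
    print([hex(a) for a in fma_load_addresses(2)])
\end{lstlisting}

With the default constant, the first address is \texttt{0x2323ff8} and the
second one is the kernel-space address quoted above; with \reg{rax}
initialized to 1, \reg{rsi} only decreases by one at each iteration and the
addresses stay in user space.
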
Some other microkernels fail in a similar way when trying to access virtual
addresses that are not in \emph{canonical form} for x86-64 with
48-bit virtual addresses, as defined in Section~3.3.7.1 of Intel's Software
Developer's Manual~\cite{ref:intel64_software_dev_reference_vol1} and
Section~5.3.1 of the AMD64 Architecture Programmer's
Manual~\cite{ref:amd64_architecture_dev_reference_vol2}. Others still fail with
accesses relative to the instruction pointer, as \bhive{} read-protects the
unrolled microkernel's instructions page.

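
For reference, an address is in canonical form when its upper bits, from bit 63
down to the most significant implemented bit, are all equal. A minimal Python
check, with a hypothetical function name and the 48-bit case as default, could
read as follows.

\begin{lstlisting}[language=Python]
def is_canonical(addr, va_bits=48):
    """True if bits 63 down to va_bits - 1 of `addr` are all equal."""
    top = addr >> (va_bits - 1)            # the 64 - va_bits + 1 upper bits
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return top == 0 or top == all_ones
\end{lstlisting}

Note that a canonical address is not necessarily mappable from user space:
\texttt{0xfffffffff0a03ff8} is canonical, but lies in kernel space.
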
\subsection{Bottleneck prediction}\label{ssec:bottleneck_pred_analysis}

As introduced earlier in Section~\ref{ssec:bottleneck_diversity}, some of
the tools studied are also able to report suspected bottlenecks for the
evaluated program; the corresponding results are presented in
Table~\ref{table:coverage}.
This feature might be even more useful than raw throughput predictions to the
users of these tools willing to optimize their program, as it strongly hints
towards what needs to be improved.

In the majority of the cases studied, the tools are not able to agree on the
presence or absence of a type of bottleneck. Although it might seem that the
tools are performing better on frontend bottleneck detection, it must be
recalled that only two tools (versus three in the other cases) are reporting
frontend bottlenecks, thus making it more likely for them to agree.

\begin{table}
    \centering
    \begin{tabular}{l r r r r}
        \toprule
        \textbf{Tool}
        & \multicolumn{2}{c}{\textbf{Ports}}
        & \multicolumn{2}{c}{\textbf{Dependencies}} \\
        \midrule
        \llvmmca{} & 567 & (24.6 \%) & 1032 & (41.9 \%) \\
        \uica{} & 516 & (22.4 \%) & 530 & (21.5 \%) \\
        \iaca{} & 1221 & (53.0 \%) & 900 & (36.6 \%) \\
        \bottomrule
    \end{tabular}
    \caption{Diverging bottleneck prediction per tool}\label{table:bottleneck_diverging_pred}
\end{table}

Table~\ref{table:bottleneck_diverging_pred}, in turn, breaks down the cases
on which three tools disagree into the number of times one tool makes a
diverging prediction ---~\ie{} the tool predicts differently than the two
others. In the case of ports, \iaca{} is responsible for about half
(53.0\,\%) of the divergences ---~which is not sufficient to conclude that the
prediction of the other tools is correct. In the case of dependencies, however,
there is no clear outlier, even though \uica{} seems to fare better than
others.

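
The counting behind Table~\ref{table:bottleneck_diverging_pred} can be sketched
as follows in Python. The data layout is an assumption made for the example:
each row maps a tool name to a boolean telling whether this tool reports the
bottleneck under consideration, and exactly three tools are compared, so that
any disagreement has a single diverging tool.

\begin{lstlisting}[language=Python]
from collections import Counter


def count_divergences(rows):
    """Count, per tool, the rows on which it disagrees with the others.

    `rows` is an iterable of dicts {tool name: bool}, with exactly
    three tools per row (assumed layout).
    """
    diverging = Counter()
    for row in rows:
        votes = Counter(row.values())
        if len(votes) == 1:
            continue  # all three tools agree
        minority = min(votes, key=votes.get)  # verdict held by one tool
        for tool, verdict in row.items():
            if verdict == minority:
                diverging[tool] += 1
    return diverging
\end{lstlisting}

Dividing each count by the total number of disagreements gives the percentages
reported in the table.
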
In no case does one tool seem to be responsible for the vast majority of
disagreements, which could have hinted towards it failing to correctly predict
this bottleneck. In the absence of a source of truth indicating whether a
bottleneck is effectively present, and with no clear-cut result for (a subset
of) tool predictions, we cannot conclude on the quality of the predictions from
each tool for each kind of bottleneck.

\subsection{Impact of dependency-boundness}\label{ssec:memlatbound}

\begin{table}
    \centering
    \footnotesize
    \begin{tabular}{l r r r r r r r r}
        \toprule
        \textbf{Bencher} & \textbf{Datapoints} & \textbf{Failures} & \textbf{(\%)} &
        \textbf{MAPE} & \textbf{Median} & \textbf{Q1} & \textbf{Q3} & \textbf{$K_\tau$}\\
        \midrule
        BHive & 1365 & 1023 & (42.84\,\%) & 34.07\,\% & 8.62\,\% & 4.30\,\% & 24.25\,\% & 0.76\\
        llvm-mca & 2388 & 0 & (0.00\,\%) & 27.06\,\% & 21.04\,\% & 9.69\,\% & 32.73\,\% & 0.79\\
        UiCA & 2388 & 0 & (0.00\,\%) & 18.42\,\% & 11.96\,\% & 5.42\,\% & 23.32\,\% & 0.80\\
        Ithemal & 2388 & 0 & (0.00\,\%) & 62.66\,\% & 53.84\,\% & 24.12\,\% & 81.95\,\% & 0.40\\
        Iaca & 2388 & 0 & (0.00\,\%) & 17.55\,\% & 12.17\,\% & 4.64\,\% & 22.35\,\% & 0.82\\
        Gus & 2388 & 0 & (0.00\,\%) & 23.18\,\% & 20.23\,\% & 8.78\,\% & 32.73\,\% & 0.83\\
        \bottomrule
    \end{tabular}
    \caption{Statistical analysis of overall results, without the rows that are
    latency bound through memory-carried dependencies}\label{table:nomemdeps_stats}
\end{table}

An overview of the full results table hints towards two main tendencies: on a
significant number of rows, the static tools ---~thus leaving \gus{} and
\bhive{} apart~---, except \ithemal, often yield comparatively bad throughput
predictions \emph{together}; and many of these rows are those using the
\texttt{O1} and \texttt{O1autovect} compilation settings (\texttt{gcc} with
\texttt{-O1}, plus vectorisation options for the latter).

To confirm the first observation, we look at the 30\,\% worst benchmarks ---~in
terms of MAPE relative to \perf~--- for \llvmmca, \uica{} and \iaca{}
---~yielding 1050 rows each. These three sets share 869 rows (82.8\,\%), which
we call \textit{jointly bad rows}.

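
This selection amounts to intersecting, across tools, the sets of worst-ranked
rows. The Python sketch below illustrates it under the assumption that the
relative error of each tool is available as a list indexed by benchmark row;
the function names are hypothetical, and ties at the cut-off are broken by row
order.

\begin{lstlisting}[language=Python]
def worst_rows(errors, fraction=0.30):
    """Indices of the `fraction` of rows with the highest error."""
    count = round(len(errors) * fraction)
    ranked = sorted(range(len(errors)),
                    key=lambda i: errors[i], reverse=True)
    return set(ranked[:count])


def jointly_bad_rows(errors_by_tool, fraction=0.30):
    """Rows that are among the worst `fraction` for every tool."""
    worst = [worst_rows(errs, fraction)
             for errs in errors_by_tool.values()]
    return set.intersection(*worst)
\end{lstlisting}

On the 3500 rows of the full results, \texttt{worst\_rows} keeps 1050 rows per
tool with the default fraction.
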
Among these 869 jointly bad rows, we further find that respectively 342
(39.4\,\%) and 337 (38.8\,\%) are compiled using the \texttt{O1} and
\texttt{O1autovect} settings, totalling 679 (78.1\,\%) \texttt{O1}-based rows,
against 129 (14.8\,\%) for \texttt{default} and 61 (7.0\,\%) for
\texttt{O3nosimd}. This result is significant enough to be used as a hint to
investigate the issue.

\begin{figure}
    \centering
    \includegraphics[width=0.5\linewidth]{nomemdeps_boxplot.pdf}
    \caption{Statistical distribution of relative errors, with and without
    pruning the rows that are latency bound through memory-carried
    dependencies}\label{fig:nomemdeps_boxplot}
\end{figure}

Insofar as our approach maintains a strong link between the basic blocks studied and
the source codes from which they are extracted, it is possible to identify the
high-level characteristics of the concerned microbenchmarks.
In the overwhelming majority (97.5\,\%) of those jointly bad rows, the tools predicted
fewer cycles than measured, meaning that a bottleneck is either missed or
underestimated.
Manual investigation of a few simple benchmarks (no polyhedral transformation
applied, \texttt{O1} mode, not unrolled) further hints towards dependencies:
for instance, the \texttt{gemver} benchmark, which is \emph{not} among the
badly predicted benchmarks, has this kernel:

\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[ANSI]C}]
for(c3)
    A[c1][c3] += u1[c1] * v1[c3]
                 + u2[c1] * v2[c3];
\end{lstlisting}
\end{minipage}

while the \texttt{atax} benchmark, which is among the badly predicted ones, has
this kernel:

\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[ANSI]C}]
for(c3)
    tmp[c1] += A[c1][c3] * x[c3];
\end{lstlisting}
\end{minipage}

The first one exhibits no obvious dependency-boundness, while the second,
accumulating on \texttt{tmp[c1]} (independent of the iteration variable), lacks
instruction-level parallelism. Among the simple benchmarks (as described
above), 8 are in the badly predicted list, all of which exhibit a
read-after-write data dependency to the preceding iteration.

Looking at the assembly code generated for those in \texttt{O1} modes, it
appears that the dependencies exhibited at the C level are compiled to
\emph{memory-carried} dependencies: the read-after-write happens for a given
memory address, instead of for a register. This kind of dependency, prone to
aliasing and dependent on the values of the registers, is hard to infer for a
static tool and is not supported by the analyzers under scrutiny in the general
case; it could thus reasonably explain the results observed.

There is no easy way, however, to know for certain which of the 3500 benchmarks
are latency bound: no hardware counter reports this. We investigate this
further using \gus's sensitivity analysis: in addition to the ``normal''
throughput estimation of \gus, we run it a second time, disabling the
accounting for latency through memory dependencies. By construction, this
second measurement should be either very close to the first one, or
significantly below. We then assume a
benchmark to be latency bound due to memory-carried dependencies when it is at
least 40\,\% faster when this latency is disabled; there are 1112 (31.8\,\%) such
benchmarks.

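
This criterion can be written as a one-line test, sketched below in Python.
We assume here that ``at least 40\,\% faster'' means that the second run
reports at least 40\,\% fewer cycles than the first one; other readings of the
threshold (\eg{} a speed ratio of 1.4) would select a different set of rows.
The function and parameter names are hypothetical.

\begin{lstlisting}[language=Python]
def is_memory_latency_bound(cycles_full, cycles_no_mem_latency,
                            threshold=0.40):
    """Flag a benchmark as latency bound through memory dependencies.

    `cycles_full` is the normal estimation; `cycles_no_mem_latency`
    is the estimation with memory-carried latency disabled.
    """
    saved = cycles_full - cycles_no_mem_latency
    return saved >= threshold * cycles_full


def count_flagged(pairs, threshold=0.40):
    """Number of (full, no_mem_latency) pairs flagged as latency bound."""
    return sum(is_memory_latency_bound(full, no_mem, threshold)
               for full, no_mem in pairs)
\end{lstlisting}
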
Of the 869 jointly bad rows, 745 (85.7\,\%) are declared latency
bound through memory-carried dependencies by \gus. We conclude that the main
reason for these jointly badly predicted benchmarks is that the predictors
under scrutiny failed to correctly detect these dependencies.

In Section~\ref{ssec:overall_results}, we presented general statistics on the
tools, on the full set of benchmarks, in
Figure~\ref{fig:overall_analysis_boxplot} and
Table~\ref{table:overall_analysis_stats}. We now remove the 1112 benchmarks
flagged as latency bound through memory-carried dependencies by \gus{} from the
dataset, and present in Figure~\ref{fig:nomemdeps_boxplot} a comparative
box plot for the tools under scrutiny. We also present in
Table~\ref{table:nomemdeps_stats} the same statistics on this pruned dataset.
While the results for \llvmmca, \uica{} and \iaca{} globally improved
significantly, the most noticeable improvements are the reduced spread of the
results and the increase in Kendall's $\tau$ correlation coefficient.

\medskip{}

From this,
we argue that detecting memory-carried dependencies is a weak point in current
state-of-the-art static analyzers, and that their results could be
significantly more accurate if improvements were made in this direction.