\section{Evaluation}

We evaluate the relevance of \staticdeps{}' results in two ways. First, we
compare the dependencies it detects to those extracted at runtime by
\depsim{}, to measure the proportion of dependencies actually found. Then, we
assess the relevance of our static analysis from a performance debugging
point of view, by enriching \uica{}'s model with \staticdeps{} and measuring,
using \cesasme{}, the benefits brought to the model.

We finally evaluate our claim that using a static model instead of a dynamic
analysis, such as \gus{}, allows \staticdeps{} to yield a result in a
reasonable amount of time.

\subsection{Comparison to \depsim{} results}\label{ssec:staticdeps_eval_depsim}

The main contribution of the \staticdeps{} model resides in its ability to
track memory-carried dependencies, including loop-carried ones. We thus focus
on evaluating this aspect, and restrict both \depsim{} and \staticdeps{} to
memory-carried dependencies.

We use the binaries produced by \cesasme{} as a dataset: we have already
assessed its relevance, and it contains enough benchmarks to be statistically
meaningful. We also already have tooling and basic-block segmentation
available for those benchmarks, making the analysis more convenient.

\medskip{}

For each binary previously generated by \cesasme{}, we use its cached
basic-block splitting and occurrence counts. Within each binary, we discard
any basic block with fewer than 10\,\% of the occurrence count of the
most-hit basic block; this avoids considering basic blocks which were not
originally inside loops, for which loop-carried dependencies would make no
sense ---~and could possibly create false positives.

For each of the considered binaries, we run our dynamic analysis, \depsim{},
and record its results.

For each of the considered basic blocks, we run our static analysis,
\staticdeps{}. We translate the detected dependencies back to original ELF
addresses, and discard the $\Delta{}k$ parameter, as our dynamic analysis
does not report an equivalent parameter, but only a pair of program counters.
Each dependency reported by \depsim{} whose source and destination addresses
both belong to the considered basic block is then classified as either
detected or missed by \staticdeps{}. Dynamically detected dependencies
spanning across basic blocks are discarded, as \staticdeps{} cannot detect
them by construction.

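For illustration, this classification step can be sketched as below. This is
a simplified, hypothetical reconstruction ---~the types and the
\lstc{classify} helper are not our tooling's actual implementation~---
assuming dependencies are compared as plain (source, destination) address
pairs.

\begin{lstlisting}[language=C]
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* A dependency, as a pair of program counters (ELF addresses). */
typedef struct { uint64_t src, dst; } dep_t;

/* A basic block, as an address range [start, end). */
typedef struct { uint64_t start, end; } bb_t;

static bool in_bb(uint64_t addr, const bb_t *bb) {
    return addr >= bb->start && addr < bb->end;
}

/* Classify depsim's dependencies against staticdeps' results for one
 * basic block; `stat` holds staticdeps' dependencies, with the delta_k
 * parameter already discarded. */
static void classify(const dep_t *dyn, size_t n_dyn,
                     const dep_t *stat, size_t n_stat,
                     const bb_t *bb, size_t *found, size_t *missed) {
    for (size_t i = 0; i < n_dyn; i++) {
        /* Dependencies spanning across basic blocks are discarded:
         * staticdeps cannot detect them by construction. */
        if (!in_bb(dyn[i].src, bb) || !in_bb(dyn[i].dst, bb))
            continue;
        bool detected = false;
        for (size_t j = 0; j < n_stat; j++)
            if (stat[j].src == dyn[i].src && stat[j].dst == dyn[i].dst)
                detected = true;
        if (detected) (*found)++;
        else          (*missed)++;
    }
}
\end{lstlisting}
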
\medskip{}
We consider two metrics: the unweighted dependency coverage,
\[
\cov_u = \dfrac{\card{\text{found}}}{\card{\text{found}} + \card{\text{missed}}}
\]
as well as the weighted dependency coverage,
\[
\cov_w =
\dfrac{
    \sum_{d \in \text{found}} \rho_d
}{\sum_{d \in \text{found}\,\cup\,\text{missed}} \rho_d}
\]
where $\rho_d$ is the number of occurrences of the dependency $d$, as
dynamically detected by \depsim{}.

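As a hypothetical example, if \staticdeps{} finds two dependencies occurring
respectively $\rho = 90$ and $\rho = 5$ times, and misses one occurring
$\rho = 5$ times, we obtain
\[
\cov_u = \dfrac{2}{3} \simeq 66.7\,\%
\qquad
\cov_w = \dfrac{90 + 5}{90 + 5 + 5} = 95\,\%
\]
the weighted metric thus emphasizes the dependencies most often hit at
runtime.
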
\begin{table}
    \centering
    \begin{tabular}{r r r}
        \toprule
        \textbf{Lifetime (instructions)} & $\cov_u$ & $\cov_w$ \\
        \midrule
        $\infty$ & 38.1\,\% & 44.0\,\% \\
        1024 & 57.6\,\% & 58.2\,\% \\
        512 & 56.4\,\% & 63.5\,\% \\
        \bottomrule
    \end{tabular}
    \caption{Unweighted and weighted coverage of \staticdeps{} on \cesasme{}'s
    binaries}\label{table:cov_staticdeps}
\end{table}

These metrics are presented for the 3\,500 binaries of \cesasme{} in the first
data row of \autoref{table:cov_staticdeps}. The obtained coverage, of about
40\,\%, is lower than expected.

\bigskip{}

Manual investigation of the missed dependencies revealed some surprising
dependencies dynamically detected by \depsim{} that did not appear to
actually be read-after-write dependencies. In the following (simplified)
example, roughly implementing $A[i] = C\times{}A[i] + B[i]$,
\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
loop:
    vmulsd (%rax,%rdi), %xmm0, %xmm1   # xmm1 <- C * A[i]
    vaddsd (%rbx,%rdi), %xmm1, %xmm1   # xmm1 <- C * A[i] + B[i]
    vmovsd %xmm1, (%rax,%rdi)          # A[i] <- xmm1
    add $8, %rdi
    cmp %rdi, %r10
    jne loop
\end{lstlisting}
\end{minipage}\\
a read-after-write dependency from the store at line 4 to the load at line 2
was reported ---~while no such dependency actually exists.

The reason is that, in \cesasme{}'s benchmarks, the whole program roughly
looks like
\begin{lstlisting}[language=C]
/* Initialize A, B, C here */
for(int measure=0; measure < NUM_MEASURES; ++measure) {
    measure_start();
    for(int repeat=0; repeat < NUM_REPEATS; ++repeat) {
        for(int i=0; i < ARRAY_SIZE; ++i) {
            A[i] = C * A[i] + B[i];
        }
    }
    measure_stop();
}
\end{lstlisting}

Thus, the previously reported dependency did not come from within the kernel,
but \emph{from one outer iteration to the next} (\ie{}, an iteration on
\lstc{repeat} or \lstc{measure} in the code above). Such a dependency is not
relevant in practice: under the usual assumptions of code analyzers, the
innermost loop is long enough to be considered infinite in steady state;
thus, two outer-loop iterations are too far apart in time for this dependency
to have any relevance, as the source iteration has long finished executing
when the destination iteration is scheduled.

\medskip{}

To filter out such irrelevant dependencies, we introduce in \depsim{} a
notion of \emph{dependency lifetime}. As we cannot access the elapsed cycle
count in \valgrind{} without a heavy runtime slowdown, we define a
\emph{timestamp} as the number of instructions executed since the beginning
of the program's execution; to avoid excessive instrumentation slowdown, we
only update this count at each branch instruction.

We further annotate every write to the shadow memory with the timestamp at
which it occurred. Whenever a dependency should be added, we first check that
it has not expired ---~that is, that it is not older than a given threshold.

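A minimal sketch of this mechanism is given below, in plain C rather than as
\valgrind{} instrumentation code; the structures and helpers are hypothetical
simplifications of \depsim{}'s internals, with \lstc{LIFETIME} standing for
the expiration threshold.

\begin{lstlisting}[language=C]
#include <stdbool.h>
#include <stdint.h>

#define LIFETIME 1024  /* dependency lifetime, in instructions */

/* Shadow cell for a memory location: last writer, and when it wrote. */
typedef struct {
    uint64_t writer_pc;  /* address of the last writing instruction */
    uint64_t timestamp;  /* timestamp of that write */
} shadow_cell_t;

static uint64_t current_time = 0;

/* Called at each branch: advance time by the number of instructions
 * executed since the previous branch, instead of instrumenting every
 * single instruction. */
static void on_branch(uint64_t instrs_since_last_branch) {
    current_time += instrs_since_last_branch;
}

/* On a memory write: record the writer and timestamp the cell. */
static void shadow_write(shadow_cell_t *cell, uint64_t pc) {
    cell->writer_pc = pc;
    cell->timestamp = current_time;
}

/* On a memory read: report a dependency only if the matching write has
 * not expired ---~older writes are outer-iteration noise. */
static bool shadow_read_dep(const shadow_cell_t *cell, uint64_t *src_pc) {
    if (current_time - cell->timestamp > LIFETIME)
        return false;
    *src_pc = cell->writer_pc;
    return true;
}
\end{lstlisting}
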
We re-run the previous experiments with lifetimes of respectively 1\,024 and
512 instructions, which roughly correspond to the order of magnitude of the
size of a reorder buffer (see \autoref{ssec:staticdeps_detection} above); the
results can also be found in \autoref{table:cov_staticdeps}. While the
introduction of a 1\,024-instruction lifetime greatly improves the coverage
rates, both unweighted and weighted, further reducing this lifetime to 512
does not yield significant enhancements.

\bigskip{}

The final coverage results, with a detection rate of roughly 60\,\%, are
reasonable: a significant proportion of the dependencies is detected;
however, many are still missed.

This may be explained by the limitations studied in
\autoref{ssec:staticdeps_limits} above, and especially the inability of
\staticdeps{} to detect dependencies through aliasing pointers. More broadly,
this falls into the problem of lack of context that we raised before and
emphasized in \autoref{chap:CesASMe}: we expect that an analysis at the scale
of the whole program, able to integrate constraints stemming from outside of
the loop body, would capture many more dependencies.

\subsection{Enriching \uica{}'s model}

To estimate the real gain in performance debugging scenarios, we integrate
\staticdeps{} into \uica{}.

There is, however, a discrepancy between the two tools: while \staticdeps{}
works at the assembly instruction level, \uica{} works at the \uop{} level.
In real hardware, dependencies indeed occur between \uops{}; but we are not
aware of any \uop{}-level semantic description of the x86-64 ISA, which makes
this level of detail unsuitable for the \staticdeps{} analysis.

We bridge this gap in a conservative way: whenever two instructions $i_1,
i_2$ are found to be dependent, we add a dependency between each pair
$(\mu_1, \mu_2)$ of \uops{} with $\mu_1 \in i_1, \mu_2 \in i_2$. This
approximation is pessimistic, and should bias the predicted execution times
towards a slower computation kernel. A finer model, or a finer (conservative)
filtering of which \uops{} must be considered dependent ---~\eg{} a memory
dependency can only come from a memory-related \uop{}~--- may enhance the
accuracy of our integration.

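This pessimistic expansion can be sketched as below; the types and the
edge-recording helper are hypothetical stand-ins for \uica{}'s internal
structures, not its actual API.

\begin{lstlisting}[language=C]
#include <stddef.h>
#include <stdio.h>

typedef struct { int id; } uop_t;

/* Stand-in: record a dependency edge between two uops in the model. */
static void add_dep(const uop_t *from, const uop_t *to) {
    printf("uop %d -> uop %d\n", from->id, to->id);
}

/* Conservative bridging: an instruction-level dependency i1 -> i2
 * becomes a dependency from every uop of i1 to every uop of i2. */
static void add_instr_dep(const uop_t *i1, size_t n1,
                          const uop_t *i2, size_t n2) {
    for (size_t a = 0; a < n1; a++)
        for (size_t b = 0; b < n2; b++)
            add_dep(&i1[a], &i2[b]);
}
\end{lstlisting}
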
\medskip{}

We then evaluate our gains by running \cesasme{}'s harness as we did in
\autoref{chap:CesASMe}, running both \uica{} and \uicadeps{}, on two
datasets: first, the full set of 3\,500 binaries from the previous chapter;
then, the pruned set introduced in \autoref{ssec:memlatbound}, which excludes
the benchmarks heavily relying on memory-carried dependencies. If
\staticdeps{} is beneficial to \uica{}, we expect \uicadeps{} to yield
significantly better results than \uica{} alone on the first dataset. On the
second dataset, however, \staticdeps{} should provide no significant
contribution, as the dataset was pruned precisely to avoid significant
memory-carried latency-boundness. We present these results in
\autoref{table:staticdeps_uica_cesasme}, along with the corresponding box
plots in \autoref{fig:staticdeps_uica_cesasme_boxplot}.

\begin{table}
    \centering
    \footnotesize
    \begin{tabular}{l l r r r r r r}
        \toprule
        \textbf{Dataset} & \textbf{Bencher} & \textbf{Datapoints} & \textbf{MAPE} & \textbf{Median} & \textbf{Q1} & \textbf{Q3} & \textbf{$K_\tau$}\\
        \midrule
        \multirow{2}{*}{Full} & \uica{} & 3500 & 29.59\,\% & 18.26\,\% & 7.11\,\% & 52.99\,\% & 0.58\\
        & + \staticdeps{} & 3500 & 19.15\,\% & 14.44\,\% & 5.86\,\% & 23.96\,\% & 0.81\\
        \midrule
        \multirow{2}{*}{Pruned} & \uica{} & 2388 & 18.42\,\% & 11.96\,\% & 5.42\,\% & 23.32\,\% & 0.80\\
        & + \staticdeps{} & 2388 & 18.77\,\% & 12.18\,\% & 5.31\,\% & 23.55\,\% & 0.80\\
        \bottomrule
    \end{tabular}
    \caption{Evaluation through \cesasme{} of the integration of \staticdeps{}
    into \uica{}}\label{table:staticdeps_uica_cesasme}
\end{table}

\begin{figure}
    \centering
    \includegraphics[width=0.5\linewidth]{uica_cesasme_boxplot.svg}
    \caption{Statistical distribution of the relative errors of \uica{}, with
    and without \staticdeps{} hints, with and without pruning the rows that
    are latency-bound through memory-carried
    dependencies}\label{fig:staticdeps_uica_cesasme_boxplot}
\end{figure}

\medskip{}

The full-dataset \uicadeps{} row is extremely close, on every metric, to the
pruned, \uica{}-only row. On this basis, we argue that the addition of
\staticdeps{} to \uica{} is conclusive: the hints provided by \staticdeps{}
are sufficient to make \uica{}'s results on the full dataset as good as they
previously were on a dataset pruned of precisely the kind of dependencies we
aim to detect. Furthermore, \uica{}'s and \uicadeps{}' results on the pruned
dataset are extremely close: the pessimistic hints do not degrade predictions
where few memory-carried dependencies exist, which further supports the
accuracy of \staticdeps{}.

\medskip{}

While the results obtained against \depsim{} in
\autoref{ssec:staticdeps_eval_depsim} above were reasonable, they were not
excellent either, and showed that many kinds of dependencies were still
missed by \staticdeps{}. However, our evaluation on \cesasme{} through the
enrichment of \uica{} shows that, at least on the considered workloads, the
dependencies that actually matter from a performance debugging point of view
are properly found.

This, however, might not hold for other kinds of applications that would
require a dependency analysis.

\subsection{Analysis speed}

The main advantage of a static dependency analysis over a dynamic one is its
execution time ---~we should expect from \staticdeps{} an analysis time far
lower than \depsim{}'s.

To assess this, we evaluate four data sequences on the same \cesasme{}
kernels:
\begin{enumerate}[(i)]
    \item{}\label{messeq:depsim} the execution time of \depsim{} on each of
        \cesasme{}'s kernels;
    \item{}\label{messeq:staticdeps_one} the execution time of \staticdeps{}
        on each of the basic blocks of each of \cesasme{}'s kernels;
    \item{}\label{messeq:staticdeps_sum} for each of those kernels, the sum
        of the execution times of \staticdeps{} on the kernel's constituent
        basic blocks;
    \item{}\label{messeq:staticdeps_speedup} for each basic block of each of
        \cesasme{}'s kernels, \staticdeps{}' speedup \wrt{} \depsim{}, that
        is, \depsim{}'s execution time divided by \staticdeps{}', as
        formalized below.
\end{enumerate}

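Explicitly, for a basic block $b$ of a kernel $\kappa$ ---~writing $t$ for
the execution time of each tool, a notation introduced here for clarity~---,
the speedup of sequence~(\ref{messeq:staticdeps_speedup}) is
\[
\text{speedup}_b =
\dfrac{t_{\text{\depsim{}}}(\kappa)}{t_{\text{\staticdeps{}}}(b)}
\]
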
As \staticdeps{} is likely to be used at the scale of a single basic block,
we argue that sequence~(\ref{messeq:staticdeps_one}) is more relevant than
sequence~(\ref{messeq:staticdeps_sum}); the latter, however, might be seen as
fairer, as a single run of \depsim{} yields the dependencies of all of the
kernel's constituent basic blocks.

\begin{figure}
    \centering
    \begin{minipage}{0.48\linewidth}
        \centering
        \includegraphics[width=\linewidth]{staticdeps_time_boxplot.svg}
        \captionof{figure}{Statistical distribution of \staticdeps{} and
        \depsim{} run times on \cesasme{}'s kernels ---~log y
        scale}\label{fig:staticdeps_cesasme_runtime_boxplot}
    \end{minipage}\hfill\begin{minipage}{0.48\linewidth}
        \centering
        \includegraphics[width=\linewidth]{staticdeps_speedup_boxplot.svg}
        \captionof{figure}{Statistical distribution of \staticdeps{}' speedup
        over \depsim{} on \cesasme{}'s
        kernels}\label{fig:staticdeps_cesasme_speedup_boxplot}
    \end{minipage}
\end{figure}

\begin{table}
    \centering
    \footnotesize
    \begin{tabular}{l r r r r}
        \toprule
        \textbf{Sequence} & \textbf{Average} & \textbf{Median} & \textbf{Q1} & \textbf{Q3} \\
        \midrule
        Seq.\ (\ref{messeq:depsim}) --~\depsim{}
            & 18\,083\,ms & 17\,645\,ms & 17\,080\,ms & 18\,650\,ms \\
        Seq.\ (\ref{messeq:staticdeps_sum}) --~\staticdeps{} (sum)
            & 2\,307\,ms & 677\,ms & 557\,ms & 2\,700\,ms \\
        Seq.\ (\ref{messeq:staticdeps_one}) --~\staticdeps{} (single)
            & 529\,ms & 545\,ms & 425\,ms & 588\,ms \\
        \midrule
        Seq.\ (\ref{messeq:staticdeps_speedup}) --~speedup
            & $\times$36.1 & $\times$33.5 & $\times$30.1 & $\times$41.7 \\
        \bottomrule
    \end{tabular}
    \caption{Statistical distribution of \staticdeps{} and \depsim{} run
    times and speedup on \cesasme{}'s
    kernels}\label{table:staticdeps_cesasme_time_eval}
\end{table}

\bigskip{}

We plot the statistical distribution of these sequences in
\autoref{fig:staticdeps_cesasme_runtime_boxplot} and
\autoref{fig:staticdeps_cesasme_speedup_boxplot}, and give numerical data for
some statistical indicators in \autoref{table:staticdeps_cesasme_time_eval}.
We note that \staticdeps{} is 30 to 40 times faster than \depsim{}.
Furthermore, \staticdeps{} is written in Python, more as a proof of concept
than as production-ready software, while \depsim{} is written in C on top of
\valgrind{}, an efficient and production-ready framework. We expect that
optimization efforts, along with a rewrite in a compiled language, would
bring the speedup to two to three orders of magnitude.