\section{Evaluation}
We evaluate \staticdeps{} in three ways. First, we compare the dependencies
it detects to those extracted at runtime by \depsim{}, to evaluate the
proportion of dependencies actually found. Then, we evaluate the relevance of
our static analysis from a performance debugging point of view, by enriching
\uica{}'s model with \staticdeps{} and assessing, using \cesasme{}, the
benefits brought to the model.
We finally evaluate our claim that using a static model, instead of a dynamic
analysis such as \gus{}, makes \staticdeps{} yield a result in a reasonable
amount of time.
\subsection{Comparison to \depsim{} results}\label{ssec:staticdeps_eval_depsim}
The contribution of \staticdeps{}'s model largely resides in its ability to
track memory-carried dependencies, including loop-carried ones. We thus focus
on evaluating this aspect, and restrict both \depsim{} and \staticdeps{} to
memory-carried dependencies.
We use the binaries produced by \cesasme{} as a dataset: we have already
assessed its relevance, and it contains enough benchmarks to be statistically
meaningful. We also already have tooling and basic-block segmentation
available for those benchmarks, making the analysis more convenient.
\subsubsection{Recompiling \cesasme{}'s dataset}
In practice, benchmarks from \cesasme{} are roughly of the following form:
\begin{lstlisting}[language=C]
for(int measure=0; measure < NUM_MEASURES; ++measure) {
    measure_start();
    for(int repeat=0; repeat < NUM_REPEATS; ++repeat) {
        for(int i=0; i < BENCHMARK_SIZE; ++i) {
            /* Some kernel, independent of measure, repeat */
        }
    }
    measure_stop();
}
\end{lstlisting}
While this is sensible for conducting throughput measures, it also introduces
unwanted dependencies. If, for instance, the kernel computes
$A[i] = C\times{}A[i] + B[i]$, implemented by\\
\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
loop:
    vmulsd (%rax,%rdi), %xmm0, %xmm1
    vaddsd (%rbx,%rdi), %xmm1, %xmm1
    vmovsd %xmm1, (%rax,%rdi)
    add $8, %rdi
    cmp %rdi, %r10
    jne loop
\end{lstlisting}
\end{minipage}\\
a read-after-write dependency from line 4 to line 2 is reported by \depsim{}
---~although no such dependency is inherent to the kernel. Indeed, each
iteration of the \texttt{measure} (outer) and \texttt{repeat} (inner) loops
reads each \texttt{A[i]} (\ie{} \lstxasm{(\%rax,\%rdi)} in the assembly) value
written during the previous run of the innermost loop, and writes it back.
This creates a dependency towards the previous run of the innermost loop,
which should in practice be meaningless if \lstc{BENCHMARK_SIZE} is large
enough. Such dependencies, however, pollute the evaluation results: as \depsim{}
does not report a dependency's distance, they are considered meaningful; and as
\staticdeps{} ---~which is unaware of the outer and inner loops~--- cannot
detect them by construction, they would unfairly lower its measured coverage.
The actual loss of precision incurred by missing such dependencies is assessed
later, by enriching \uica{} with \staticdeps{}.
\medskip{}
To avoid detecting these dependencies with \depsim{}, we \textbf{recompile
\cesasme{}'s benchmarks} from the C source code of each benchmark with
\lstc{NUM_MEASURES = NUM_REPEATS = 1}. We use these recompiled benchmarks only
in the current section. While we do not re-run the code transformations on the
original Polybench benchmarks, we do recompile the benchmarks from C source
code. The results from this section are thus \emph{not comparable} with those
from other sections, as the compiler may have selected different
optimisations, instructions, etc.
\subsubsection{Dependency coverage}
For each binary generated by \cesasme{}, we use its cached basic-block
splitting and occurrence counts. Within each binary, we discard any basic
block with fewer than 10\,\% of the occurrence count of the most-hit basic
block; this avoids considering basic blocks which were not originally inside
loops, and for which loop-carried dependencies would make no sense ---~and
could possibly create false positives.
For each of the considered binaries, we run our dynamic analysis, \depsim{},
and record its results. We use a lifetime of 512 instructions for this
analysis, as this is roughly the size of recent Intel reorder
buffers~\cite{wikichip_intel_rob_size}; as discussed in
\autoref{ssec:staticdeps_detection}, dependencies spanning farther than the
size of the ROB are not microarchitecturally relevant. Dependencies whose
source and destination program counters are not in the same basic block are
discarded, as \staticdeps{} cannot detect them by construction.
For each of the considered basic blocks, we run our static analysis,
\staticdeps{}. We discard the $\Delta{}k$ parameter ---~how many loop iterations
the dependency spans~---, as our dynamic analysis does not report an equivalent
parameter, but only a pair of program counters.
Dynamic dependencies from \depsim{} are converted to
\emph{periodic dependencies} in the sense of \staticdeps{} as described in
\autoref{ssec:staticdeps:practical_implem}: only dependencies occurring on at
least 80\,\% of the block's iterations are kept ---~otherwise, they are
considered measurement artifacts. The \emph{periodic coverage}
of \staticdeps{} dependencies for this basic block \wrt{} \depsim{} is the
proportion of dependencies found by \staticdeps{} among the periodic
dependencies extracted from \depsim{}:
\[
\cov_p = \dfrac{\card{\text{found}}}{\card{\text{found}} + \card{\text{missed}}}
\]
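For illustration, this filtering and the resulting coverage may be sketched
as follows ---~a minimal sketch, assuming dependencies are reported as pairs
of program counters along with occurrence counts; the names and data layout
are illustrative, not \staticdeps{}'s actual code:
\begin{lstlisting}[language=Python]
def periodic_coverage(dynamic_deps, static_deps, iterations):
    """cov_p for one basic block.

    dynamic_deps: dict mapping a (source PC, destination PC) pair
                  to its occurrence count, as reported by depsim;
    static_deps:  set of (source PC, destination PC) pairs found
                  by staticdeps;
    iterations:   number of executions of the basic block."""
    # Keep only dependencies occurring on at least 80% of the
    # block's iterations; the others are treated as artifacts.
    periodic = {dep for dep, count in dynamic_deps.items()
                if count >= 0.8 * iterations}
    if not periodic:
        return 1.0  # nothing to find, nothing missed
    # found / (found + missed), over the periodic dependencies.
    return len(periodic & static_deps) / len(periodic)
\end{lstlisting}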
\smallskip{}
We also keep the raw dependencies from \depsim{} ---~that is, without converting
them to periodic dependencies. From these, we consider two metrics:
the unweighted dependency coverage, \[
\cov_u = \dfrac{\card{\text{found}}}{\card{\text{found}} + \card{\text{missed}}}
\]
identical to $\cov_p$ but based on unfiltered dependencies,
as well as the weighted dependency coverage, \[
\cov_w =
\dfrac{
\sum_{d \in \text{found}} \rho_d
}{\sum_{d \in \text{found}\,\cup\,\text{missed}} \rho_d}
\]
where $\rho_d$ is the number of occurrences of the dependency $d$, dynamically
detected by \depsim. Note that such a metric is not meaningful for periodic
dependencies as, by construction, each dependency occurs as many times
as the loop iterates.
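Under the same illustrative assumptions as the sketch above, these two
metrics may be computed as:
\begin{lstlisting}[language=Python]
def raw_coverages(dynamic_deps, static_deps):
    """(cov_u, cov_w) for one basic block, computed on the raw,
    unfiltered dependencies reported by depsim."""
    found = [dep for dep in dynamic_deps if dep in static_deps]
    cov_u = len(found) / len(dynamic_deps)
    # cov_w weights each dependency by its occurrence count rho_d.
    cov_w = (sum(dynamic_deps[dep] for dep in found)
             / sum(dynamic_deps.values()))
    return cov_u, cov_w
\end{lstlisting}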
\begin{table}
\centering
\begin{tabular}{r r r}
\toprule
$\cov_p$ (\%) & $\cov_u$ (\%) & $\cov_w$ (\%) \\
\midrule
96.0 & 94.4 & 98.3 \\
\bottomrule
\end{tabular}
\caption{Periodic, unweighted and weighted coverage of \staticdeps{} on
\cesasme{}'s binaries recompiled without repetitions, with a lifetime of
512.}\label{table:cov_staticdeps}
\end{table}
These metrics are presented for the 3\,500 binaries of \cesasme{} in
\autoref{table:cov_staticdeps}. The obtained coverage is consistent across the
three metrics used ($\cov_p$, $\cov_u$, $\cov_w$), and the reported coverage is
very close to 100\,\%, giving us good confidence in the accuracy of
\staticdeps.
\subsubsection{``Points-to'' aliasing analysis}
The same methodology can be reused as a proxy to estimate the rate of
aliasing among supposedly independent pointers in our dataset. Indeed, a major
approximation made by \staticdeps{} is to assume that any newly encountered
pointer ---~function parameters, values read from memory, \ldots~--- does
\emph{not} alias with previously encountered values. This is implemented by
using a fresh random value for each yet-unknown value.
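This mechanism may be sketched as follows ---~a minimal illustration with
hypothetical names, not \staticdeps{}'s actual implementation:
\begin{lstlisting}[language=Python]
import random

class AbstractValues:
    """Maps yet-unknown inputs (function parameters, values read
    from unresolved memory locations, ...) to abstract values."""

    def __init__(self):
        self.known = {}

    def value_of(self, source):
        if source not in self.known:
            # Each unknown value is a fresh 64-bit random value:
            # two independently drawn values collide with
            # negligible probability, so two distinct unknown
            # pointers are implicitly assumed never to alias.
            self.known[source] = random.getrandbits(64)
        return self.known[source]
\end{lstlisting}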
Determining which pointers may point to which other pointers ---~and, by
extension, may point to the same memory region~--- is called a \emph{points-to
analysis}~\cite{points_to}. In the context of \staticdeps{}, it characterizes
the pointers for which taking a fresh value was \emph{not} representative of
the reality.
If we detect, through dynamic analysis, that a value derived from a
pointer \lstc{a} shares a value with one derived from a pointer \lstc{b}
---~say, \lstc{a + k == b + l}~---, we can deduce that \lstc{a} \emph{points-to}
\lstc{b}. This is true even if \lstc{a + k} at the very beginning of the
execution is equal to \lstc{b + l} at the very end of the execution: although
the pointers will not alias (that is, share the same value at the same moment),
they still point to the same memory region and should not be treated as
independent.
Our dynamic analyzer, \depsim{}, does not have this granularity, as it only
reports dependencies between two program counters. A dependency from a PC
$p$ to a PC $q$ however implies that a value written to memory at $q$ was read
from memory at $p$, and thus that one of the pointers used at $p$ aliases with
one of the pointers used at $q$.
\medskip{}
We thus conduct the same analysis as before, but with an infinite lifetime,
to account for far-ranging dependencies. We then use $\cov_u$ and $\cov_w$ as
proxies to measure whether assuming that pointers are independent was
reasonable: a bad coverage would be a clear indication of non-independent
pointers treated as independent. A good coverage is not, formally, an
indication of the absence of non-independent pointers: the detected static
dependencies may come from other pointers at the same PC. We however believe
it a reasonable proxy for this metric, as a single assembly line often reads a
single value, and usually at most two. We do not use the $\cov_p$ metric here,
as we want to keep every detected dependency in order to detect possible
aliasing.
\begin{table}
\centering
\begin{tabular}{r r}
\toprule
$\cov_u$ (\%) & $\cov_w$ (\%) \\
\midrule
95.0 & 93.7 \\
\bottomrule
\end{tabular}
\caption{Unweighted and weighted coverage of \staticdeps{} on \cesasme{}'s
binaries recompiled without repetitions, with an infinite
lifetime, as a proxy for points-to analysis.}\label{table:cov_staticdeps_pointsto}
\end{table}
The results of this analysis are presented in
\autoref{table:cov_staticdeps_pointsto}. The very high coverage rate gives us
good confidence that our hypothesis of independent pointers is reasonable, at
least within the scope of Polybench, which we believe to be representative of
scientific computation ---~one of the prominent use cases of tools such as
code analyzers.
\subsection{Enriching \uica{}'s model}
To estimate the real gain in performance debugging scenarios, we integrate
\staticdeps{} into \uica{}.
There is, however, a discrepancy between the two tools: while \staticdeps{}
works at the assembly instruction level, \uica{} works at the \uop{} level. In
real hardware, dependencies indeed occur between \uops{}; however, we are not
aware of any \uop{}-level semantic description of the x86-64 ISA (which, by
nature, would have to be specialised for each specific processor, as the ISA
itself is not concerned with \uops{}). This level of detail was thus
unsuitable for the \staticdeps{} analysis.
We bridge this gap in a conservative way: whenever two instructions $i_1,
i_2$ are found to be dependent, we add a dependency between each pair $\mu_1
\in i_1, \mu_2 \in i_2$. This approximation is largely pessimistic, and should
bias predicted execution times towards a slower computation kernel. A finer
model, or a finer (conservative) filtering of which \uops{} must be considered
dependent ---~\eg{} a memory-carried dependency can only stem from a
memory-related \uop{}~---, may enhance the accuracy of our integration.
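This conservative lifting may be sketched as follows, assuming a hypothetical
\texttt{uops(i)} helper exposing \uica{}'s decomposition of an instruction
into \uops{}:
\begin{lstlisting}[language=Python]
import itertools

def lift_to_uops(instr_deps, uops):
    """Lift instruction-level dependencies to the uop level.

    instr_deps: pairs (i1, i2) of instructions found dependent
                by staticdeps;
    uops:       hypothetical helper mapping an instruction to
                its uops."""
    uop_deps = set()
    for i1, i2 in instr_deps:
        # Conservatively, every uop of i1 depends on every uop
        # of i2, as discussed above.
        uop_deps.update(itertools.product(uops(i1), uops(i2)))
    return uop_deps
\end{lstlisting}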
\begin{table}
\centering
\footnotesize
\begin{tabular}{l l r r r r r r}
\toprule
\textbf{Dataset} & \textbf{Bencher} & \textbf{Datapoints} & \textbf{MAPE} & \textbf{Median} & \textbf{Q1} & \textbf{Q3} & \textbf{$K_\tau$}\\
\midrule
\multirow{2}{*}{Full} & \uica{} & 3500 & 29.59\,\% & 18.26\,\% & 7.11\,\% & 52.99\,\% & 0.58\\
& + \staticdeps{} & 3500 & 19.15\,\% & 14.44\,\% & 5.86\,\% & 23.96\,\% & 0.81\\
\midrule
\multirow{2}{*}{Pruned} & \uica{} & 2388 & 18.42\,\% & 11.96\,\% & 5.42\,\% & 23.32\,\% & 0.80\\
& + \staticdeps{} & 2388 & 18.77\,\% & 12.18\,\% & 5.31\,\% & 23.55\,\% & 0.80\\
\bottomrule
\end{tabular}
\caption{Evaluation through \cesasme{} of the integration of \staticdeps{}
to \uica{}}\label{table:staticdeps_uica_cesasme}
\end{table}
\begin{figure}
\centering
\includegraphics[width=0.5\linewidth]{uica_cesasme_boxplot.svg}
\caption{Statistical distribution of the relative errors of \uica{}, with and
without \staticdeps{} hints, with and without pruning the rows latency-bound
through memory-carried dependencies}\label{fig:staticdeps_uica_cesasme_boxplot}
\end{figure}
\medskip{}
We then evaluate our gains by running \cesasme{}'s harness as we did in
\autoref{chap:CesASMe}, running both \uica{} and \uicadeps{}, on two datasets:
first, the full set of 3\,500 binaries from the previous chapter; then, the
set of binaries introduced in \autoref{ssec:memlatbound}, pruned to exclude
benchmarks heavily relying on memory-carried dependencies. If \staticdeps{} is
beneficial to \uica{}, we expect \uicadeps{} to yield significantly better
results than \uica{} alone on the first dataset. On the second dataset,
however, \staticdeps{} should provide no significant contribution, as this
dataset was pruned precisely so as not to exhibit significant memory-carried
latency-boundness.
We present these results in \autoref{table:staticdeps_uica_cesasme}, as well as
the corresponding box-plots in \autoref{fig:staticdeps_uica_cesasme_boxplot}.
\medskip{}
We deduce two things from this experiment.
First, the full-dataset \uicadeps{} row is extremely close, on every metric,
to the pruned, \uica{}-only row. On this basis, we argue that adding
\staticdeps{} to \uica{} is clearly beneficial: the hints provided by
\staticdeps{} are sufficient to make \uica{}'s results on the full dataset as
good as they previously were on a dataset pruned of precisely the kind of
dependencies we aim to detect. Thus, at least on workloads similar to
Polybench, \staticdeps{} is able to resolve the issue of memory-carried
dependencies for \uica{}'s throughput analysis.
Furthermore, the results of \uica{} and \uicadeps{} on the pruned dataset are
extremely close. From this, we argue that \staticdeps{} does not introduce
false positives when no dependency is to be found: its addition to \uica{}
does not degrade accuracy where it is not relevant.
\subsection{Analysis speed}
The main advantage of a static analysis of dependencies over a dynamic one is
its execution time ---~we should expect from \staticdeps{} an analysis time far
lower than \depsim{}'s.
To assess this, we collect four data sequences on the same \cesasme{} kernels:
\begin{enumerate}[(i)]
\item{}\label{messeq:depsim} the execution time of \depsim{} on each of
\cesasme{}'s kernels;
\item{}\label{messeq:staticdeps_one} the execution time of \staticdeps{} on
each of the basic blocks of each of \cesasme{}'s kernels;
\item{}\label{messeq:staticdeps_sum} for each of those kernels, the sum of
the execution times of \staticdeps{} on the kernel's constituent basic
blocks;
\item{}\label{messeq:staticdeps_speedup} for each basic block of each of
\cesasme{}'s kernels, \staticdeps' speedup \wrt{} \depsim{}, that is,
\depsim{}'s execution time divided by \staticdeps{}'.
\end{enumerate}
As \staticdeps{} is likely to be used at the scale of a basic block, we argue
that sequence~(\ref{messeq:staticdeps_one}) is more relevant than
sequence~(\ref{messeq:staticdeps_sum}); the latter might however be seen as
fairer, as a single run of \depsim{} yields the dependencies of all of the
kernel's constituent basic blocks.
\begin{figure}
\centering
\begin{minipage}{0.48\linewidth}
\centering
\includegraphics[width=\linewidth]{staticdeps_time_boxplot.svg}
\captionof{figure}{Statistical distribution of \staticdeps{} and \depsim{} run times
on \cesasme{}'s kernels ---~log y scale}\label{fig:staticdeps_cesasme_runtime_boxplot}
\end{minipage}\hfill\begin{minipage}{0.48\linewidth}
\centering
\includegraphics[width=\linewidth]{staticdeps_speedup_boxplot.svg}
\captionof{figure}{Statistical distribution of \staticdeps{}' speedup over \depsim{} on \cesasme{}'s kernels}\label{fig:staticdeps_cesasme_speedup_boxplot}
\end{minipage}
\end{figure}
\begin{table}
\centering
\footnotesize
\begin{tabular}{l r r r r}
\toprule
\textbf{Sequence} & \textbf{Average} & \textbf{Median} & \textbf{Q1} & \textbf{Q3} \\
\midrule
Seq.\ (\ref{messeq:depsim}) ---~\depsim{}
& 18083 ms & 17645 ms & 17080 ms & 18650 ms \\
Seq.\ (\ref{messeq:staticdeps_sum}) ---~\staticdeps{} (sum)
& 2307 ms & 677 ms & 557 ms & 2700 ms \\
Seq.\ (\ref{messeq:staticdeps_one}) ---~\staticdeps{} (single)
& 529 ms & 545 ms & 425 ms & 588 ms \\
\midrule
Seq.\ (\ref{messeq:staticdeps_speedup}) ---~speedup
& $\times$36.1 & $\times$33.5 & $\times$30.1 &
$\times$41.7 \\
\bottomrule
\end{tabular}
\caption{Statistical distribution of \staticdeps{} and \depsim{} run times
and speedup on \cesasme{}'s kernels}\label{table:staticdeps_cesasme_time_eval}
\end{table}
\bigskip{}
We plot the statistical distribution of these sequences in
\autoref{fig:staticdeps_cesasme_runtime_boxplot} and
\autoref{fig:staticdeps_cesasme_speedup_boxplot}, and give numerical data for
some statistical indicators in \autoref{table:staticdeps_cesasme_time_eval}. We
note that \staticdeps{} is 30 to 40 times faster than \depsim{}. Furthermore,
\staticdeps{} is written in Python, more as a proof of concept than as
production-ready software, while \depsim{} is written in C on top of
\valgrind{}, an efficient, production-ready framework. We expect that
optimisation efforts and a rewrite in a compiled language would bring the
speedup to two to three orders of magnitude.