\section{Evaluation}

We evaluate the relevance of \staticdeps{}' results in two ways. First, we
compare the dependencies it detects to those extracted at runtime by
\depsim{}, to measure the proportion of dependencies actually found. Then, we
assess the relevance of our static analysis from a performance debugging
point of view, by enriching \uica{}'s model with \staticdeps{} and measuring,
using \cesasme{}, the benefits brought to the model.

We finally evaluate our claim that using a static model instead of a dynamic
analysis, such as \gus{}, allows \staticdeps{} to yield a result in a
reasonable amount of time.

\subsection{Comparison to \depsim{} results}\label{ssec:staticdeps_eval_depsim}

The main contribution of the \staticdeps{} model resides in its ability to
track memory-carried dependencies, including loop-carried ones. We thus focus
on evaluating this aspect, and restrict both \depsim{} and \staticdeps{} to
memory-carried dependencies.

We use the binaries produced by \cesasme{} as a dataset: we have already
assessed its relevance, and it contains enough benchmarks to be statistically
meaningful. We also already have tooling and basic-block segmentation
available for those benchmarks, making the analysis more convenient.

\medskip{}

For each binary previously generated by \cesasme{}, we use its cached
basic-block splitting and occurrence counts. Within each binary, we discard
any basic block with fewer than 10\,\% of the occurrence count of the
most-hit basic block; this avoids considering basic blocks which were not
originally inside loops, for which loop-carried dependencies would make no
sense ---~and could possibly create false positives.

For each of the considered binaries, we run our dynamic analysis, \depsim{},
and record its results.

For each of the considered basic blocks, we run our static analysis,
\staticdeps{}. We translate the detected dependencies back to original ELF
addresses, and discard the $\Delta{}k$ parameter, as our dynamic analysis
does not report an equivalent parameter, but only a pair of program counters.
Each dependency reported by \depsim{} whose source and destination addresses
both belong to the considered basic block is then classified as either
detected or missed by \staticdeps{}. Dynamically detected dependencies
spanning across basic blocks are discarded, as \staticdeps{} cannot detect
them by construction.

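For illustration, this classification step can be sketched as below. This is
a simplified, hypothetical reconstruction ---~the types and the
\lstc{classify} helper are not our tooling's actual implementation~---
assuming dependencies are compared as plain (source, destination) address
pairs.

\begin{lstlisting}[language=C]
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* A dependency, as a pair of program counters (ELF addresses). */
typedef struct { uint64_t src, dst; } dep_t;

/* A basic block, as an address range [start, end). */
typedef struct { uint64_t start, end; } bb_t;

static bool in_bb(uint64_t addr, const bb_t *bb) {
    return addr >= bb->start && addr < bb->end;
}

/* Classify depsim's dependencies against staticdeps' results for one
 * basic block; `stat` holds staticdeps' dependencies, with the delta_k
 * parameter already discarded. */
static void classify(const dep_t *dyn, size_t n_dyn,
                     const dep_t *stat, size_t n_stat,
                     const bb_t *bb, size_t *found, size_t *missed) {
    for (size_t i = 0; i < n_dyn; i++) {
        /* Dependencies spanning across basic blocks are discarded:
         * staticdeps cannot detect them by construction. */
        if (!in_bb(dyn[i].src, bb) || !in_bb(dyn[i].dst, bb))
            continue;
        bool detected = false;
        for (size_t j = 0; j < n_stat; j++)
            if (stat[j].src == dyn[i].src && stat[j].dst == dyn[i].dst)
                detected = true;
        if (detected) (*found)++;
        else          (*missed)++;
    }
}
\end{lstlisting}
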
\medskip{}
We consider two metrics: the unweighted dependency coverage,
\[
\cov_u = \dfrac{\card{\text{found}}}{\card{\text{found}} + \card{\text{missed}}}
\]
as well as the weighted dependency coverage,
\[
\cov_w =
\dfrac{
    \sum_{d \in \text{found}} \rho_d
}{\sum_{d \in \text{found}\,\cup\,\text{missed}} \rho_d}
\]
where $\rho_d$ is the number of occurrences of the dependency $d$, as
dynamically detected by \depsim{}.

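As a hypothetical example, if \staticdeps{} finds two dependencies occurring
respectively $\rho = 90$ and $\rho = 5$ times, and misses one occurring
$\rho = 5$ times, we obtain
\[
\cov_u = \dfrac{2}{3} \simeq 66.7\,\%
\qquad
\cov_w = \dfrac{90 + 5}{90 + 5 + 5} = 95\,\%
\]
the weighted metric thus emphasizes the dependencies most often hit at
runtime.
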
\begin{table}
    \centering
    \begin{tabular}{r r r}
        \toprule
        \textbf{Lifetime (instructions)} & $\cov_u$ & $\cov_w$ \\
        \midrule
        $\infty$ & 38.1\,\% & 44.0\,\% \\
        1024 & 57.6\,\% & 58.2\,\% \\
        512 & 56.4\,\% & 63.5\,\% \\
        \bottomrule
    \end{tabular}
    \caption{Unweighted and weighted coverage of \staticdeps{} on \cesasme{}'s
    binaries}\label{table:cov_staticdeps}
\end{table}

These metrics are presented for the 3\,500 binaries of \cesasme{} in the first
data row of \autoref{table:cov_staticdeps}. The obtained coverage, of about
40\,\%, is lower than expected.

\bigskip{}

Manual investigation of the missed dependencies revealed some surprising
dependencies dynamically detected by \depsim{} that did not appear to
actually be read-after-write dependencies. In the following (simplified)
example, roughly implementing $A[i] = C\times{}A[i] + B[i]$,
\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
loop:
    vmulsd (%rax,%rdi), %xmm0, %xmm1   # xmm1 <- C * A[i]
    vaddsd (%rbx,%rdi), %xmm1, %xmm1   # xmm1 <- C * A[i] + B[i]
    vmovsd %xmm1, (%rax,%rdi)          # A[i] <- xmm1
    add $8, %rdi
    cmp %rdi, %r10
    jne loop
\end{lstlisting}
\end{minipage}\\
a read-after-write dependency from the store at line 4 to the load at line 2
was reported ---~while no such dependency actually exists.

The reason is that, in \cesasme{}'s benchmarks, the whole program roughly
looks like
\begin{lstlisting}[language=C]
/* Initialize A, B, C here */
for(int measure=0; measure < NUM_MEASURES; ++measure) {
    measure_start();
    for(int repeat=0; repeat < NUM_REPEATS; ++repeat) {
        for(int i=0; i < ARRAY_SIZE; ++i) {
            A[i] = C * A[i] + B[i];
        }
    }
    measure_stop();
}
\end{lstlisting}

Thus, the previously reported dependency did not come from within the kernel,
but \emph{from one outer iteration to the next} (\ie{}, an iteration on
\lstc{repeat} or \lstc{measure} in the code above). Such a dependency is not
relevant in practice: under the usual assumptions of code analyzers, the
innermost loop is long enough to be considered infinite in steady state;
thus, two outer-loop iterations are too far apart in time for this dependency
to have any relevance, as the source iteration has long finished executing
when the destination iteration is scheduled.

\medskip{}

To filter out such irrelevant dependencies, we introduce in \depsim{} a
notion of \emph{dependency lifetime}. As we cannot access the elapsed cycle
count in \valgrind{} without a heavy runtime slowdown, we define a
\emph{timestamp} as the number of instructions executed since the beginning
of the program's execution; to avoid excessive instrumentation slowdown, we
only update this count at each branch instruction.

We further annotate every write to the shadow memory with the timestamp at
which it occurred. Whenever a dependency should be added, we first check that
it has not expired ---~that is, that it is not older than a given threshold.

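A minimal sketch of this mechanism is given below, in plain C rather than as
\valgrind{} instrumentation code; the structures and helpers are hypothetical
simplifications of \depsim{}'s internals, with \lstc{LIFETIME} standing for
the expiration threshold.

\begin{lstlisting}[language=C]
#include <stdbool.h>
#include <stdint.h>

#define LIFETIME 1024  /* dependency lifetime, in instructions */

/* Shadow cell for a memory location: last writer, and when it wrote. */
typedef struct {
    uint64_t writer_pc;  /* address of the last writing instruction */
    uint64_t timestamp;  /* timestamp of that write */
} shadow_cell_t;

static uint64_t current_time = 0;

/* Called at each branch: advance time by the number of instructions
 * executed since the previous branch, instead of instrumenting every
 * single instruction. */
static void on_branch(uint64_t instrs_since_last_branch) {
    current_time += instrs_since_last_branch;
}

/* On a memory write: record the writer and timestamp the cell. */
static void shadow_write(shadow_cell_t *cell, uint64_t pc) {
    cell->writer_pc = pc;
    cell->timestamp = current_time;
}

/* On a memory read: report a dependency only if the matching write has
 * not expired ---~older writes are outer-iteration noise. */
static bool shadow_read_dep(const shadow_cell_t *cell, uint64_t *src_pc) {
    if (current_time - cell->timestamp > LIFETIME)
        return false;
    *src_pc = cell->writer_pc;
    return true;
}
\end{lstlisting}
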
We re-run the previous experiments with lifetimes of respectively 1\,024 and
512 instructions, which roughly correspond to the order of magnitude of the
size of a reorder buffer (see \autoref{ssec:staticdeps_detection} above); the
results can also be found in \autoref{table:cov_staticdeps}. While the
introduction of a 1\,024-instruction lifetime greatly improves the coverage
rates, both unweighted and weighted, further reducing this lifetime to 512
does not yield significant enhancements.

\bigskip{}

The final coverage results, with a detection rate of roughly 60\,\%, are
reasonable: a significant proportion of the dependencies is detected;
however, many are still missed.

This may be explained by the limitations studied in
\autoref{ssec:staticdeps_limits} above, and especially the inability of
\staticdeps{} to detect dependencies through aliasing pointers. More broadly,
this falls into the problem of lack of context that we raised before and
emphasized in \autoref{chap:CesASMe}: we expect that an analysis at the scale
of the whole program, able to integrate constraints stemming from outside of
the loop body, would capture many more dependencies.

\subsection{Enriching \uica{}'s model}

To estimate the real gain in performance debugging scenarios, we integrate
\staticdeps{} into \uica{}.

There is, however, a discrepancy between the two tools: while \staticdeps{}
works at the assembly instruction level, \uica{} works at the \uop{} level.
In real hardware, dependencies indeed occur between \uops{}; but we are not
aware of any \uop{}-level semantic description of the x86-64 ISA, which makes
this level of detail unsuitable for the \staticdeps{} analysis.

We bridge this gap in a conservative way: whenever two instructions $i_1,
i_2$ are found to be dependent, we add a dependency between each pair
$(\mu_1, \mu_2)$ of \uops{} with $\mu_1 \in i_1, \mu_2 \in i_2$. This
approximation is pessimistic, and should bias the predicted execution times
towards a slower computation kernel. A finer model, or a finer (conservative)
filtering of which \uops{} must be considered dependent ---~\eg{} a memory
dependency can only come from a memory-related \uop{}~--- may enhance the
accuracy of our integration.

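This pessimistic expansion can be sketched as below; the types and the
edge-recording helper are hypothetical stand-ins for \uica{}'s internal
structures, not its actual API.

\begin{lstlisting}[language=C]
#include <stddef.h>
#include <stdio.h>

typedef struct { int id; } uop_t;

/* Stand-in: record a dependency edge between two uops in the model. */
static void add_dep(const uop_t *from, const uop_t *to) {
    printf("uop %d -> uop %d\n", from->id, to->id);
}

/* Conservative bridging: an instruction-level dependency i1 -> i2
 * becomes a dependency from every uop of i1 to every uop of i2. */
static void add_instr_dep(const uop_t *i1, size_t n1,
                          const uop_t *i2, size_t n2) {
    for (size_t a = 0; a < n1; a++)
        for (size_t b = 0; b < n2; b++)
            add_dep(&i1[a], &i2[b]);
}
\end{lstlisting}
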
\medskip{}

We then evaluate our gains by running \cesasme{}'s harness as we did in
\autoref{chap:CesASMe}, running both \uica{} and \uicadeps{}, on two
datasets: first, the full set of 3\,500 binaries from the previous chapter;
then, the pruned set introduced in \autoref{ssec:memlatbound}, which excludes
the benchmarks heavily relying on memory-carried dependencies. If
\staticdeps{} is beneficial to \uica{}, we expect \uicadeps{} to yield
significantly better results than \uica{} alone on the first dataset. On the
second dataset, however, \staticdeps{} should provide no significant
contribution, as the dataset was pruned precisely to avoid significant
memory-carried latency-boundness. We present these results in
\autoref{table:staticdeps_uica_cesasme}, along with the corresponding box
plots in \autoref{fig:staticdeps_uica_cesasme_boxplot}.

\begin{table}
    \centering
    \footnotesize
    \begin{tabular}{l l r r r r r r}
        \toprule
        \textbf{Dataset} & \textbf{Bencher} & \textbf{Datapoints} & \textbf{MAPE} & \textbf{Median} & \textbf{Q1} & \textbf{Q3} & \textbf{$K_\tau$}\\
        \midrule
        \multirow{2}{*}{Full} & \uica{} & 3500 & 29.59\,\% & 18.26\,\% & 7.11\,\% & 52.99\,\% & 0.58\\
        & + \staticdeps{} & 3500 & 19.15\,\% & 14.44\,\% & 5.86\,\% & 23.96\,\% & 0.81\\
        \midrule
        \multirow{2}{*}{Pruned} & \uica{} & 2388 & 18.42\,\% & 11.96\,\% & 5.42\,\% & 23.32\,\% & 0.80\\
        & + \staticdeps{} & 2388 & 18.77\,\% & 12.18\,\% & 5.31\,\% & 23.55\,\% & 0.80\\
        \bottomrule
    \end{tabular}
    \caption{Evaluation through \cesasme{} of the integration of \staticdeps{}
    into \uica{}}\label{table:staticdeps_uica_cesasme}
\end{table}

\begin{figure}
    \centering
    \includegraphics[width=0.5\linewidth]{uica_cesasme_boxplot.svg}
    \caption{Statistical distribution of the relative errors of \uica{}, with
    and without \staticdeps{} hints, with and without pruning the rows that
    are latency-bound through memory-carried
    dependencies}\label{fig:staticdeps_uica_cesasme_boxplot}
\end{figure}

\medskip{}

The full-dataset \uicadeps{} row is extremely close, on every metric, to the
pruned, \uica{}-only row. On this basis, we argue that the addition of
\staticdeps{} to \uica{} is conclusive: the hints provided by \staticdeps{}
are sufficient to make \uica{}'s results on the full dataset as good as they
previously were on a dataset pruned of precisely the kind of dependencies we
aim to detect. Furthermore, \uica{}'s and \uicadeps{}' results on the pruned
dataset are extremely close: the pessimistic hints do not degrade predictions
where few memory-carried dependencies exist, which further supports the
accuracy of \staticdeps{}.

\medskip{}

While the results obtained against \depsim{} in
\autoref{ssec:staticdeps_eval_depsim} above were reasonable, they were not
excellent either, and showed that many kinds of dependencies were still
missed by \staticdeps{}. However, our evaluation on \cesasme{} through the
enrichment of \uica{} shows that, at least on the considered workloads, the
dependencies that actually matter from a performance debugging point of view
are properly found.

This, however, might not hold for other kinds of applications that would
require a dependency analysis.

\subsection{Analysis speed}

The main advantage of a static dependency analysis over a dynamic one is its
execution time ---~we should expect from \staticdeps{} an analysis time far
lower than \depsim{}'s.

To assess this, we evaluate four data sequences on the same \cesasme{}
kernels:
\begin{enumerate}[(i)]
    \item{}\label{messeq:depsim} the execution time of \depsim{} on each of
        \cesasme{}'s kernels;
    \item{}\label{messeq:staticdeps_one} the execution time of \staticdeps{}
        on each of the basic blocks of each of \cesasme{}'s kernels;
    \item{}\label{messeq:staticdeps_sum} for each of those kernels, the sum
        of the execution times of \staticdeps{} on the kernel's constituent
        basic blocks;
    \item{}\label{messeq:staticdeps_speedup} for each basic block of each of
        \cesasme{}'s kernels, \staticdeps{}' speedup \wrt{} \depsim{}, that
        is, \depsim{}'s execution time divided by \staticdeps{}', as
        formalized below.
\end{enumerate}

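Explicitly, for a basic block $b$ of a kernel $\kappa$ ---~writing $t$ for
the execution time of each tool, a notation introduced here for clarity~---,
the speedup of sequence~(\ref{messeq:staticdeps_speedup}) is
\[
\text{speedup}_b =
\dfrac{t_{\text{\depsim{}}}(\kappa)}{t_{\text{\staticdeps{}}}(b)}
\]
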
As \staticdeps{} is likely to be used at the scale of a single basic block,
we argue that sequence~(\ref{messeq:staticdeps_one}) is more relevant than
sequence~(\ref{messeq:staticdeps_sum}); the latter, however, might be seen as
fairer, as a single run of \depsim{} yields the dependencies of all of the
kernel's constituent basic blocks.

\begin{figure}
    \centering
    \begin{minipage}{0.48\linewidth}
        \centering
        \includegraphics[width=\linewidth]{staticdeps_time_boxplot.svg}
        \captionof{figure}{Statistical distribution of \staticdeps{} and
        \depsim{} run times on \cesasme{}'s kernels ---~log y
        scale}\label{fig:staticdeps_cesasme_runtime_boxplot}
    \end{minipage}\hfill\begin{minipage}{0.48\linewidth}
        \centering
        \includegraphics[width=\linewidth]{staticdeps_speedup_boxplot.svg}
        \captionof{figure}{Statistical distribution of \staticdeps{}' speedup
        over \depsim{} on \cesasme{}'s
        kernels}\label{fig:staticdeps_cesasme_speedup_boxplot}
    \end{minipage}
\end{figure}

\begin{table}
    \centering
    \footnotesize
    \begin{tabular}{l r r r r}
        \toprule
        \textbf{Sequence} & \textbf{Average} & \textbf{Median} & \textbf{Q1} & \textbf{Q3} \\
        \midrule
        Seq.\ (\ref{messeq:depsim}) --~\depsim{}
            & 18\,083\,ms & 17\,645\,ms & 17\,080\,ms & 18\,650\,ms \\
        Seq.\ (\ref{messeq:staticdeps_sum}) --~\staticdeps{} (sum)
            & 2\,307\,ms & 677\,ms & 557\,ms & 2\,700\,ms \\
        Seq.\ (\ref{messeq:staticdeps_one}) --~\staticdeps{} (single)
            & 529\,ms & 545\,ms & 425\,ms & 588\,ms \\
        \midrule
        Seq.\ (\ref{messeq:staticdeps_speedup}) --~speedup
            & $\times$36.1 & $\times$33.5 & $\times$30.1 & $\times$41.7 \\
        \bottomrule
    \end{tabular}
    \caption{Statistical distribution of \staticdeps{} and \depsim{} run
    times and speedup on \cesasme{}'s
    kernels}\label{table:staticdeps_cesasme_time_eval}
\end{table}

\bigskip{}

We plot the statistical distribution of these sequences in
\autoref{fig:staticdeps_cesasme_runtime_boxplot} and
\autoref{fig:staticdeps_cesasme_speedup_boxplot}, and give numerical data for
some statistical indicators in \autoref{table:staticdeps_cesasme_time_eval}.
We note that \staticdeps{} is 30 to 40 times faster than \depsim{}.
Furthermore, \staticdeps{} is written in Python, more as a proof of concept
than as production-ready software, while \depsim{} is written in C on top of
\valgrind{}, an efficient and production-ready framework. We expect that
optimization efforts, along with a rewrite in a compiled language, would
bring the speedup to two to three orders of magnitude.