\section{Evaluation}

We evaluate the relevance of \staticdeps{}'s results in two ways: first, we
compare the detected dependencies to those extracted at runtime by \depsim{},
to evaluate the proportion of dependencies actually detected. Then, we
evaluate the relevance of our static analysis from a performance debugging
point of view, by enriching \uica{}'s model with \staticdeps{} and assessing,
using \cesasme{}, the benefits brought to the model.

We finally evaluate our claim that using a static model instead of a dynamic
analysis, such as \gus{}, makes \staticdeps{} yield a result in a reasonable
amount of time.

\subsection{Comparison to \depsim{} results}\label{ssec:staticdeps_eval_depsim}

The main contribution of \staticdeps{}'s model lies in its ability to track
memory-carried dependencies, including loop-carried ones. We thus focus on
evaluating this aspect, and restrict both \depsim{} and \staticdeps{} to
memory-carried dependencies.

We use the binaries produced by \cesasme{} as a dataset: we have already
assessed its relevance, and it contains enough benchmarks to be statistically
meaningful. We also already have tooling and basic-block segmentation
available for those benchmarks, making the analysis more convenient.

\subsubsection{Recompiling \cesasme{}'s dataset}

In practice, benchmarks from \cesasme{} are roughly of the following form:

\begin{lstlisting}[language=C]
for(int measure=0; measure < NUM_MEASURES; ++measure) {
    measure_start();
    for(int repeat=0; repeat < NUM_REPEATS; ++repeat) {
        for(int i=0; i < BENCHMARK_SIZE; ++i) {
            /* Some kernel, independent of measure, repeat */
        }
    }
    measure_stop();
}
\end{lstlisting}

While this is sensible for conducting throughput measures, it also introduces
unwanted dependencies. If, for instance, the kernel consists of
$A[i] = C\times{}A[i] + B[i]$, implemented by\\
\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
loop:
    vmulsd (%rax,%rdi), %xmm0, %xmm1
    vaddsd (%rbx,%rdi), %xmm1, %xmm1
    vmovsd %xmm1, (%rax,%rdi)
    add $8, %rdi
    cmp %rdi, %r10
    jne loop
\end{lstlisting}
\end{minipage}\\
a read-after-write dependency from line 4 to line 2 is reported by \depsim{}
---~although there is no such dependency inherent to the kernel.

However, each iteration of the \texttt{measure} (outer) loop and each
iteration of the \texttt{repeat} (inner) loop will read again each
\texttt{A[i]} value (\ie{} \lstxasm{(\%rax,\%rdi)} in the assembly) from the
previous inner loop, and write it back. This creates a dependency on the
previous iteration of the inner loop, which should in practice be meaningless
if \lstc{BENCHMARK_SIZE} is large enough. Such dependencies, however, pollute
the evaluation results: as \depsim{} does not report a dependency's distance,
they are considered meaningful; and as they cannot be detected by
\staticdeps{} ---~which is unaware of the outer and inner loops~---, they
introduce unfairness into the evaluation. The actual loss of precision
introduced by not discovering such dependencies is instead assessed later by
enriching \uica{} with \staticdeps{}.

\medskip{}

To avoid detecting these dependencies with \depsim{}, we \textbf{recompile
\cesasme{}'s benchmarks} from the C source code of each benchmark with
\lstc{NUM_MEASURES = NUM_REPEATS = 1}. We use these recompiled benchmarks only
in the current section. While we do not re-run the code transformations from
the original Polybench benchmarks, we do recompile the benchmarks from C
source. Thus, the results from this section \emph{are not comparable} with
results from other sections, as the compiler may have used different
optimisations, instructions, etc.

\subsubsection{Dependency coverage}

For each binary generated by \cesasme{}, we use its cached basic block
splitting and occurrence count. Within each binary, we discard any basic
block with fewer than 10\,\% of the occurrence count of the most-hit basic
block; this avoids considering basic blocks which were not originally inside
loops, and for which loop-carried dependencies would make no sense ---~and
could possibly create false positives.

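For illustration, this filtering heuristic amounts to the following (a
minimal sketch; the \texttt{occurrences} mapping from basic blocks to
execution counts is a hypothetical representation of \cesasme{}'s cached
data):

\begin{lstlisting}[language=Python]
def keep_hot_blocks(occurrences):
    """Keep only blocks hit at least 10% as often as the hottest one.

    `occurrences` maps a basic-block identifier to its execution count.
    """
    threshold = 0.10 * max(occurrences.values())
    return {block for block, count in occurrences.items()
            if count >= threshold}
\end{lstlisting}
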
For each of the considered binaries, we run our dynamic analysis, \depsim{},
and record its results. We use a lifetime of 512 instructions for this
analysis, as this is roughly the size of recent Intel reorder
buffers~\cite{wikichip_intel_rob_size}; as discussed in
\autoref{ssec:staticdeps_detection}, dependencies spanning farther than the
size of the ROB are not microarchitecturally relevant. Dependencies whose
source and destination program counters are not in the same basic block are
discarded, as \staticdeps{} cannot detect them by construction.

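This restriction of \depsim{}'s output to block-local dependencies can be
sketched as follows (a minimal sketch; the \texttt{Dep} records with
\texttt{src_pc} and \texttt{dst_pc} fields and the \texttt{block_of} helper
are hypothetical):

\begin{lstlisting}[language=Python]
def same_block_deps(deps, block_of, kept_blocks):
    """Keep dependencies whose source and destination PCs lie in the
    same basic block, among the previously kept blocks."""
    return [d for d in deps
            if block_of(d.src_pc) == block_of(d.dst_pc)
            and block_of(d.src_pc) in kept_blocks]
\end{lstlisting}
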
For each of the considered basic blocks, we run our static analysis,
\staticdeps{}. We discard the $\Delta{}k$ parameter ---~how many loop
iterations the dependency spans~---, as our dynamic analysis does not report
an equivalent parameter, but only a pair of program counters.

Dynamic dependencies from \depsim{} are converted to \emph{periodic
dependencies} in the sense of \staticdeps{}, as described in
\autoref{ssec:staticdeps:practical_implem}: only dependencies occurring on at
least 80\,\% of the block's iterations are kept ---~else, dependencies are
considered measurement artifacts. The \emph{periodic coverage} of
\staticdeps{} dependencies for this basic block \wrt{} \depsim{} is the
proportion of dependencies found by \staticdeps{} among the periodic
dependencies extracted from \depsim{}:
\[
    \cov_p = \dfrac{\card{\text{found}}}{\card{\text{found}} + \card{\text{missed}}}
\]

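The conversion and the coverage computation can be sketched as follows (a
minimal sketch; \texttt{dep_counts}, mapping each dependency ---~a pair of
PCs~--- to its number of dynamic occurrences, and \texttt{iterations}, the
block's iteration count, are hypothetical inputs):

\begin{lstlisting}[language=Python]
def periodic_deps(dep_counts, iterations):
    """Keep dependencies seen on at least 80% of the iterations."""
    return {dep for dep, count in dep_counts.items()
            if count >= 0.8 * iterations}

def coverage(static_deps, dynamic_deps):
    """Proportion of dynamic dependencies found statically."""
    found = dynamic_deps & static_deps
    return len(found) / len(dynamic_deps)
\end{lstlisting}
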
\smallskip{}

We also keep the raw dependencies from \depsim{} ---~that is, without
converting them to periodic dependencies. From these, we consider two
metrics: the unweighted dependency coverage,
\[
    \cov_u = \dfrac{\card{\text{found}}}{\card{\text{found}} + \card{\text{missed}}}
\]
identical to $\cov_p$ but based on unfiltered dependencies, as well as the
weighted dependency coverage,
\[
    \cov_w = \dfrac{
        \sum_{d \in \text{found}} \rho_d
    }{
        \sum_{d \in \text{found}\,\cup\,\text{missed}} \rho_d
    }
\]
where $\rho_d$ is the number of occurrences of the dependency $d$ dynamically
detected by \depsim{}. Note that such a metric is not meaningful for periodic
dependencies as, by construction, each dependency occurs as many times as the
loop iterates.

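The weighted variant follows the same pattern (a minimal sketch;
\texttt{dep_counts} is the same hypothetical mapping as above):

\begin{lstlisting}[language=Python]
def weighted_coverage(static_deps, dep_counts):
    """Coverage where each dynamic dependency is weighted by its
    number of dynamic occurrences."""
    total = sum(dep_counts.values())
    found = sum(count for dep, count in dep_counts.items()
                if dep in static_deps)
    return found / total
\end{lstlisting}
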
\begin{table}
    \centering
    \begin{tabular}{r r r}
        \toprule
        $\cov_p$ (\%) & $\cov_u$ (\%) & $\cov_w$ (\%) \\
        \midrule
        96.0 & 94.4 & 98.3 \\
        \bottomrule
    \end{tabular}
    \caption{Periodic, unweighted and weighted coverage of \staticdeps{} on
    \cesasme{}'s binaries recompiled without repetitions, with a lifetime of
    512.}\label{table:cov_staticdeps}
\end{table}

These metrics are presented for the 3\,500 binaries of \cesasme{} in
\autoref{table:cov_staticdeps}. The obtained coverage is consistent between
the three metrics used ($\cov_p$, $\cov_u$, $\cov_w$), and the reported
coverage is very close to 100\,\%, giving us good confidence in the accuracy
of \staticdeps{}.

\subsubsection{``Points-to'' aliasing analysis}

The same methodology can be reused as a proxy for estimating the rate of
aliasing between supposedly independent pointers in our dataset. Indeed, a
major approximation made by \staticdeps{} is to assume that any newly
encountered pointer ---~function parameters, values read from memory,
\ldots~--- does \emph{not} alias with previously encountered values. This is
implemented by the use of a fresh random value for each yet-unknown value.

Determining which pointers may point to which other pointers ---~and, by
extension, may point to the same memory region~--- is called a
\emph{points-to analysis}~\cite{points_to}. In the context of \staticdeps{},
it characterizes the pointers for which taking a fresh value was \emph{not}
representative of reality.

If we detect, through dynamic analysis, that a value derived from a pointer
\lstc{a} shares a value with one derived from a pointer \lstc{b} ---~say,
\lstc{a + k == b + l}~---, we can deduce that \lstc{a} \emph{points to}
\lstc{b}. This holds even if \lstc{a + k} at the very beginning of the
execution is equal to \lstc{b + l} at the very end of the execution: although
the pointers will not alias (that is, share the same value at the same
moment), they still point to the same memory region and should not be treated
as independent.

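This criterion can be sketched as follows (a minimal sketch; the
\texttt{derived_values} instrumentation output, mapping each base pointer to
the set of addresses derived from it over the whole run, is hypothetical
---~and, as noted below, not something \depsim{} actually records):

\begin{lstlisting}[language=Python]
def points_to_pairs(derived_values):
    """Report pointer pairs whose derived-address sets intersect,
    i.e. for which some a + k equals some b + l during the run."""
    pointers = list(derived_values)
    return {(a, b)
            for i, a in enumerate(pointers)
            for b in pointers[i + 1:]
            if derived_values[a] & derived_values[b]}
\end{lstlisting}
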
Our dynamic analyzer, \depsim{}, does not have this granularity, as it only
reports dependencies between two program counters. A dependency from a PC $p$
to a PC $q$, however, implies that a value written to memory at $q$ was read
from memory at $p$, and thus that one of the pointers used at $p$ aliases
with one of the pointers used at $q$.

\medskip{}

We thus conduct the same analysis as before, but with an infinite lifetime,
to account for far-ranging dependencies. We then use $\cov_u$ and $\cov_w$ as
a proxy to measure whether assuming the pointers independent was reasonable:
a bad coverage would be a clear indication of non-independent pointers
treated as independent. A good coverage is not, formally, an indication of
the absence of non-independent pointers: the detected static dependencies may
come from other pointers at the same PC. We however believe it reasonable to
consider it a good proxy for this metric, as a single assembly line often
reads a single value, and usually at most two. We do not use the $\cov_p$
metric here, as we want to keep every detected dependency to detect possible
aliasing.

\begin{table}
    \centering
    \begin{tabular}{r r}
        \toprule
        $\cov_u$ (\%) & $\cov_w$ (\%) \\
        \midrule
        95.0 & 93.7 \\
        \bottomrule
    \end{tabular}
    \caption{Unweighted and weighted coverage of \staticdeps{} on \cesasme{}'s
    binaries recompiled without repetitions, with an infinite lifetime, as a
    proxy for points-to analysis.}\label{table:cov_staticdeps_pointsto}
\end{table}

The results of this analysis are presented in
\autoref{table:cov_staticdeps_pointsto}. The very high coverage rate gives us
good confidence that our hypothesis of independent pointers is reasonable, at
least within the scope of Polybench, which we believe representative of
scientific computation ---~one of the prominent use cases of tools such as
code analyzers.

\subsection{Enriching \uica{}'s model}

To estimate the real gain in performance debugging scenarios, we integrate
\staticdeps{} into \uica{}.

There is, however, a discrepancy between the two tools: while \staticdeps{}
works at the assembly instruction level, \uica{} works at the \uop{} level.
In real hardware, dependencies indeed occur between \uops{}; however, we are
not aware of any \uop{}-level semantic description of the x86-64 ISA (which,
by nature, would have to be specialized for each specific processor, as the
ISA itself is not concerned with \uops{}). This level of detail was thus
unsuitable for the \staticdeps{} analysis.

We bridge this gap in a conservative way: whenever two instructions
$i_1, i_2$ are found to be dependent, we add a dependency between each pair
$\mu_1 \in i_1, \mu_2 \in i_2$. This approximation is thus largely
pessimistic, and should predict execution times biased towards a slower
computation kernel. A finer model, or a finer (conservative) filtering of
which \uops{} must be considered dependent ---~\eg{} a memory dependency can
only come from a memory-related \uop{}~---, may enhance the accuracy of our
integration.

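This conservative expansion can be sketched as follows (a minimal sketch; the
\texttt{uops_of} decomposition and the edge representation of dependencies
are hypothetical):

\begin{lstlisting}[language=Python]
def expand_to_uops(insn_deps, uops_of):
    """Conservatively lift instruction-level dependencies to the uop
    level: every uop of one instruction depends on every uop of the
    other."""
    uop_deps = set()
    for (i1, i2) in insn_deps:
        for u1 in uops_of(i1):
            for u2 in uops_of(i2):
                uop_deps.add((u1, u2))
    return uop_deps
\end{lstlisting}
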
\begin{table}
    \centering
    \footnotesize
    \begin{tabular}{l l r r r r r r}
        \toprule
        \textbf{Dataset} & \textbf{Bencher} & \textbf{Datapoints} &
        \textbf{MAPE} & \textbf{Median} & \textbf{Q1} & \textbf{Q3} &
        \textbf{$K_\tau$} \\
        \midrule
        \multirow{2}{*}{Full} & \uica{} & 3500 & 29.59\,\% & 18.26\,\% & 7.11\,\% & 52.99\,\% & 0.58 \\
        & + \staticdeps{} & 3500 & 19.15\,\% & 14.44\,\% & 5.86\,\% & 23.96\,\% & 0.81 \\
        \midrule
        \multirow{2}{*}{Pruned} & \uica{} & 2388 & 18.42\,\% & 11.96\,\% & 5.42\,\% & 23.32\,\% & 0.80 \\
        & + \staticdeps{} & 2388 & 18.77\,\% & 12.18\,\% & 5.31\,\% & 23.55\,\% & 0.80 \\
        \bottomrule
    \end{tabular}
    \caption{Evaluation through \cesasme{} of the integration of \staticdeps{}
    into \uica{}}\label{table:staticdeps_uica_cesasme}
\end{table}

\begin{figure}
    \centering
    \includegraphics[width=0.5\linewidth]{uica_cesasme_boxplot.svg}
    \caption{Statistical distribution of the relative errors of \uica{}, with
    and without \staticdeps{} hints, with and without pruning the rows that
    are latency-bound through memory-carried
    dependencies}\label{fig:staticdeps_uica_cesasme_boxplot}
\end{figure}

\medskip{}

We then evaluate our gains by running \cesasme{}'s harness as we did in
\autoref{chap:CesASMe}, running both \uica{} and \uicadeps{} on two datasets:
first, the full set of 3\,500 binaries from the previous chapter; then, the
set of binaries pruned to exclude benchmarks heavily relying on
memory-carried dependencies, introduced in \autoref{ssec:memlatbound}. If
\staticdeps{} is beneficial to \uica{}, we expect \uicadeps{} to yield
significantly better results than \uica{} alone on the first dataset. On the
second dataset, however, \staticdeps{} should provide no significant
contribution, as the dataset was pruned to not exhibit significant
memory-carried latency-boundness. We present these results in
\autoref{table:staticdeps_uica_cesasme}, as well as the corresponding box
plots in \autoref{fig:staticdeps_uica_cesasme_boxplot}.

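For reference, the statistics reported in
\autoref{table:staticdeps_uica_cesasme} can be computed as follows (a minimal
sketch; the \texttt{predicted} and \texttt{measured} arrays of per-benchmark
cycle counts are hypothetical inputs):

\begin{lstlisting}[language=Python]
import numpy as np
from scipy.stats import kendalltau

def error_stats(predicted, measured):
    """Relative-error statistics (in %) and Kendall's tau between
    predicted and measured per-benchmark cycle counts."""
    rel_err = np.abs(predicted - measured) / measured
    tau, _pvalue = kendalltau(predicted, measured)
    return {
        "MAPE":   100 * rel_err.mean(),
        "Median": 100 * np.median(rel_err),
        "Q1":     100 * np.percentile(rel_err, 25),
        "Q3":     100 * np.percentile(rel_err, 75),
        "K_tau":  tau,
    }
\end{lstlisting}
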
\medskip{}

We deduce two things from this experiment.

First, the full-dataset \uicadeps{} row is extremely close, on every metric,
to the pruned, \uica{}-only row. On this basis, we argue that \staticdeps{}'
addition to \uica{} is very conclusive: the hints provided by \staticdeps{}
are sufficient to make \uica{}'s results as good on the full dataset as they
previously were on a dataset pruned of precisely the kind of dependencies we
aim to detect. Thus, at least on workloads similar to Polybench,
\staticdeps{} is able to resolve the issue of memory-carried dependencies for
\uica{}'s throughput analysis.

Furthermore, \uica{}'s and \uicadeps{}' results on the pruned dataset are
extremely close. From this, we argue that \staticdeps{} does not introduce
false positives when no dependency should be found; its addition to \uica{}
does not negatively impact its accuracy whenever it is not relevant.

\subsection{Analysis speed}

The main advantage of a static analysis of dependencies over a dynamic one is
its execution time ---~we should expect from \staticdeps{} an analysis time
far lower than \depsim{}'s.

To assess this, we evaluate on the same \cesasme{} kernels four data
sequences:
\begin{enumerate}[(i)]
    \item{}\label{messeq:depsim} the execution time of \depsim{} on each of
        \cesasme{}'s kernels;
    \item{}\label{messeq:staticdeps_one} the execution time of \staticdeps{}
        on each of the basic blocks of each of \cesasme{}'s kernels;
    \item{}\label{messeq:staticdeps_sum} for each of those kernels, the sum
        of the execution times of \staticdeps{} on the kernel's constituting
        basic blocks;
    \item{}\label{messeq:staticdeps_speedup} for each basic block of each of
        \cesasme{}'s kernels, \staticdeps{}' speedup \wrt{} \depsim{}, that
        is, \depsim{}'s execution time divided by \staticdeps{}' (see the
        sketch after this list).
\end{enumerate}

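The derived sequences (\ref{messeq:staticdeps_one}),
(\ref{messeq:staticdeps_sum}) and (\ref{messeq:staticdeps_speedup}) can be
built from the raw timings as follows (a minimal sketch;
\texttt{depsim_times}, mapping each kernel to its \depsim{} run time, and
\texttt{block_times}, mapping each kernel to its per-block \staticdeps{} run
times, are hypothetical inputs):

\begin{lstlisting}[language=Python]
def derived_sequences(depsim_times, block_times):
    """Build sequences (ii), (iii) and (iv) from raw run times."""
    # (ii): one data point per basic block
    single = [t for ts in block_times.values() for t in ts]
    # (iii): one data point per kernel
    summed = {kernel: sum(ts) for kernel, ts in block_times.items()}
    # (iv): per-block speedup w.r.t. the kernel's depsim run
    speedups = [depsim_times[kernel] / t
                for kernel, ts in block_times.items() for t in ts]
    return single, summed, speedups
\end{lstlisting}
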
As \staticdeps{} is likely to be used at the scale of a basic block, we argue
that sequence~(\ref{messeq:staticdeps_one}) is more relevant than
sequence~(\ref{messeq:staticdeps_sum}); however, the latter might be seen as
fairer, as one run of \depsim{} yields the dependencies of all of the
kernel's constituting basic blocks.

\begin{figure}
    \centering
    \begin{minipage}{0.48\linewidth}
        \centering
        \includegraphics[width=\linewidth]{staticdeps_time_boxplot.svg}
        \captionof{figure}{Statistical distribution of \staticdeps{} and
        \depsim{} run times on \cesasme{}'s kernels ---~log y
        scale}\label{fig:staticdeps_cesasme_runtime_boxplot}
    \end{minipage}\hfill\begin{minipage}{0.48\linewidth}
        \centering
        \includegraphics[width=\linewidth]{staticdeps_speedup_boxplot.svg}
        \captionof{figure}{Statistical distribution of \staticdeps{}' speedup
        over \depsim{} on \cesasme{}'s
        kernels}\label{fig:staticdeps_cesasme_speedup_boxplot}
    \end{minipage}
\end{figure}

\begin{table}
    \centering
    \footnotesize
    \begin{tabular}{l r r r r}
        \toprule
        \textbf{Sequence} & \textbf{Average} & \textbf{Median} & \textbf{Q1} & \textbf{Q3} \\
        \midrule
        Seq.\ (\ref{messeq:depsim}) ---~\depsim{}
            & 18\,083\,ms & 17\,645\,ms & 17\,080\,ms & 18\,650\,ms \\
        Seq.\ (\ref{messeq:staticdeps_sum}) ---~\staticdeps{} (sum)
            & 2\,307\,ms & 677\,ms & 557\,ms & 2\,700\,ms \\
        Seq.\ (\ref{messeq:staticdeps_one}) ---~\staticdeps{} (single)
            & 529\,ms & 545\,ms & 425\,ms & 588\,ms \\
        \midrule
        Seq.\ (\ref{messeq:staticdeps_speedup}) ---~speedup
            & $\times$36.1 & $\times$33.5 & $\times$30.1 & $\times$41.7 \\
        \bottomrule
    \end{tabular}
    \caption{Statistical distribution of \staticdeps{} and \depsim{} run
    times and speedup on \cesasme{}'s
    kernels}\label{table:staticdeps_cesasme_time_eval}
\end{table}

\bigskip{}

We plot the statistical distribution of these series in
\autoref{fig:staticdeps_cesasme_runtime_boxplot} and
\autoref{fig:staticdeps_cesasme_speedup_boxplot}, and give numerical data for
some statistical indicators in \autoref{table:staticdeps_cesasme_time_eval}.
We note that \staticdeps{} is 30 to 40 times faster than \depsim{}.
Furthermore, \staticdeps{} is written in Python, more as a proof of concept
than as production-ready software; meanwhile, \depsim{} is written in C on
top of \valgrind{}, an efficient, production-ready framework. We expect that,
with optimization efforts and a rewrite in a compiled language, the speedup
would reach two to three orders of magnitude.