Staticdeps: evaluation part
This commit is contained in:
parent
3511d27516
commit
a9cfaef2f9
5 changed files with 2066 additions and 12 deletions
@ -1,4 +1,4 @@
\section{Finding basic blocks to evaluate \palmed{}}
\section{Finding basic blocks to evaluate \palmed{}}\label{sec:benchsuite_bb}

In the context of all that is described above, my main task in the environment
of \palmed{} was to build a system able to evaluate a produced mapping on a
@ -1,4 +1,4 @@
\section{The \staticdeps{} heuristic}
\section{Staticdeps}

The static analyzer we present, \staticdeps{}, only aims to tackle the
difficulty~\ref{memcarried_difficulty_arith} mentioned above: tracking
@ -12,7 +12,8 @@ This problem could be solved using symbolic calculus algorithms. However, those
algorithms are not straightforward to implement, and the equality test between
two arbitrary expressions can be costly.

\medskip{}
\subsection{The \staticdeps{} heuristic}

Instead, we use a heuristic based on random values. We consider the set $\calR
= \left\{0, 1, \ldots, 2^{64}-1\right\}$ of values representable by a 64-bit
unsigned integer; we extend this set to $\bar\calR = \calR \cup \{\bot\}$,
@ -45,20 +46,54 @@ register-carried dependencies, applying the following principles.
known).
\end{itemize}

The semantics needed to compute encountered operations are obtained by lifting
the kernel's assembly to \valgrind{}'s \vex{} intermediate representation.
\subsection{Practical implementation}

We implement \staticdeps{} in Python, using \texttt{pyelftools} and the
\texttt{capstone} disassembler ---~which we already introduced in
\autoref{sec:benchsuite_bb}~--- to extract and disassemble the targeted basic
block. The semantics needed to compute encountered operations are obtained by
lifting the kernel's assembly to \valgrind{}'s \vex{} intermediate
representation.
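To make the heuristic more concrete, here is a minimal, self-contained sketch
of a random-value dependency analysis on a toy three-address code. The
instruction format, helper names and simplifications (memory dependencies only,
a handful of opcodes) are ours for illustration; the actual \staticdeps{}
implementation instead works on \vex{} statements, as described above.

\begin{lstlisting}[language=Python]
import random

BOT = None          # "bottom": unknown value
MASK = 2**64 - 1

def find_memory_deps(kernel, unroll=16, seed=0):
    """Detect read-after-write memory dependencies on `unroll` copies of a
    toy kernel.  Each instruction is a tuple:
      ("add",   dst, src, imm)    -- dst <- src + imm
      ("load",  dst, base, disp)  -- dst <- mem[base + disp]
      ("store", src, base, disp)  -- mem[base + disp] <- src
      ("op",    dst, ...)         -- any other operation: result unknown
    Registers get random 64-bit initial values; addresses are computed
    concretely, so aliasing accesses collide on the same concrete value."""
    rng = random.Random(seed)
    regs = {}              # register -> value in R, or BOT
    writer_of = {}         # concrete address -> index of its last store
    deps = set()           # (store_index, load_index) pairs

    def val(reg):          # registers get a random value on first use
        return regs.setdefault(reg, rng.randrange(2**64))

    def address(base, disp):
        b = val(base)
        return BOT if b is BOT else (b + disp) & MASK

    for i, insn in enumerate(kernel * unroll):
        kind = insn[0]
        if kind == "add":
            _, dst, src, imm = insn
            v = val(src)
            regs[dst] = BOT if v is BOT else (v + imm) & MASK
        elif kind == "load":
            _, dst, base, disp = insn
            addr = address(base, disp)
            if addr is not BOT and addr in writer_of:
                deps.add((writer_of[addr], i))    # read-after-write
            regs[dst] = rng.randrange(2**64)      # loaded data: fresh value
        elif kind == "store":
            _, src, base, disp = insn
            addr = address(base, disp)
            if addr is not BOT:
                writer_of[addr] = i
        else:                                     # unsupported operation
            regs[insn[1]] = BOT
    return deps

# Toy kernel akin to A[i+1] = A[i] (+) B[i]: each store is read back by the
# load of the following iteration (a loop-carried RAW dependency).
kernel = [
    ("load",  "x1", "rax", 0),    # x1 <- A[i]
    ("load",  "x2", "rbx", 0),    # x2 <- B[i]
    ("op",    "x3", "x1", "x2"),  # x3 <- x1 (+) x2
    ("store", "x3", "rax", 8),    # A[i+1] <- x3
    ("add",   "rax", "rax", 8),
    ("add",   "rbx", "rbx", 8),
]
print(sorted(find_memory_deps(kernel, unroll=3)))   # [(3, 6), (9, 12)]
\end{lstlisting}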
\medskip{}

This first analysis provides us with a raw list of dependencies across
iterations of the considered basic block. We then ``re-roll'' the unrolled
kernel by transcribing each dependency to a triplet $(\texttt{source\_insn},
\texttt{dest\_insn}, \Delta{}k)$, where the first two elements are the source
and destination instruction of the dependency \emph{in the original,
non-unrolled kernel}, and $\Delta{}k$ is the number of iterations of the kernel
between the source and destination instruction of the dependency.
The implementation of the heuristic detailed above provides us with a raw list
of dependencies across iterations of the considered basic block. We then
``re-roll'' the unrolled kernel by transcribing each dependency to a triplet
$(\texttt{source\_insn}, \texttt{dest\_insn}, \Delta{}k)$, where the first two
elements are the source and destination instruction of the dependency \emph{in
the original, non-unrolled kernel}, and $\Delta{}k$ is the number of iterations
of the kernel between the source and destination instruction of the dependency.
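As an illustration, this re-rolling amounts to simple index arithmetic on the
unrolled instruction indices returned by the sketch above (again a hypothetical
helper, not \staticdeps{}'s actual code):

\begin{lstlisting}[language=Python]
def reroll(deps, kernel_len):
    """Map dependencies between unrolled instruction indices back to
    (source_insn, dest_insn, delta_k) triplets in the original kernel."""
    rerolled = set()
    for writer, reader in deps:
        src_insn = writer % kernel_len    # position in the original kernel
        dst_insn = reader % kernel_len
        delta_k = reader // kernel_len - writer // kernel_len
        rerolled.add((src_insn, dst_insn, delta_k))
    return rerolled

# With the toy kernel above, {(3, 6), (9, 12)} becomes {(3, 0, 1)}.
\end{lstlisting}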
Finally, we filter out spurious dependencies: each dependency found should
occur for each kernel iteration $i$ at which $i + \Delta{}k$ is within bounds.
If the dependency is found for less than $80\,\%$ of those iterations, the
dependency is declared spurious and is dropped.
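Concretely, this filtering step can be sketched as follows (the $80\,\%$
threshold is the one stated above; the helper itself and its input format are
ours for illustration):

\begin{lstlisting}[language=Python]
from collections import Counter

def filter_spurious(raw_deps, kernel_len, unroll, threshold=0.8):
    """Keep a (src, dst, delta_k) dependency only if it was observed for at
    least `threshold` of the iterations at which it could have occurred."""
    counts = Counter(
        (w % kernel_len, r % kernel_len, r // kernel_len - w // kernel_len)
        for w, r in raw_deps
    )
    kept = set()
    for (src, dst, dk), seen in counts.items():
        possible = unroll - dk    # iterations i with i + delta_k in bounds
        if possible > 0 and seen / possible >= threshold:
            kept.add((src, dst, dk))
    return kept
\end{lstlisting}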
\subsection{Limitations}\label{ssec:staticdeps_limits}

In \autoref{chap:CesASMe}, we argued that one of the shortcomings that most
crippled state-of-the-art tools was that analyses were conducted
out-of-context, considering only the basic block at hand. This criticism also
applies to \staticdeps{}, as it is still focused on a single basic block in
isolation; in particular, any aliasing that stems from outside of the analyzed
basic block is not visible to \staticdeps{}.

Work towards a broader analysis range, \eg{} at the scale of a function, or at
least initializing values with gathered assertions ---~maybe based on abstract
interpretation techniques~--- could be beneficial to the quality of dependency
detection.

\medskip{}

As \staticdeps{}'s heuristic is based on randomness in a Monte Carlo sense, it
may yield false positives: two registers could theoretically be assigned the
same value sampled at random, making the addresses they hold spuriously alias.
This is, however, very improbable, as values are sampled from a set of
cardinality $2^{64}$. If necessary, the error can be reduced by amplification:
running the algorithm multiple times with different random seeds reduces the
error exponentially.
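To give an order of magnitude (a back-of-the-envelope bound on our side, not a
measured figure): if the analyzed kernel manipulates $n$ distinct values, a
union bound on pairwise collisions gives
\[
    \Pr\left[\text{collision}\right]
    \le \binom{n}{2} \cdot 2^{-64}
    = \frac{n(n-1)}{2^{65}},
\]
which stays below $3 \times 10^{-14}$ even for $n = 1\,000$. Running the
analysis $r$ times with independent seeds and keeping only the dependencies
reported by every run would further lower the probability of a given false
positive to roughly $\Pr\left[\text{collision}\right]^r$.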
Conversely, \staticdeps{} should not present false negatives due to randomness.
Dependencies may go undetected, \eg{} because of out-of-scope aliasing or
unsupported operations. However, no dependency that falls into the scope of
\depsim{}'s analysis should be missed because of random initializations.
@ -1 +1,244 @@
\section{Evaluation}

We evaluate the relevance of \staticdeps{}' results in two ways: first, we
compare the detected dependencies to those extracted at runtime by \depsim{},
to evaluate the proportion of dependencies actually detected. Then, we evaluate
the relevance of our static analysis from a performance debugging point of
view, by enriching \uica{}'s model with \staticdeps{} and assessing, using
\cesasme{}, the benefits brought to the model.

We finally evaluate our claim that using a static model instead of a dynamic
analysis, such as \gus{}, makes \staticdeps{} yield a result in a reasonable
amount of time.

\subsection{Comparison to \depsim{} results}\label{ssec:staticdeps_eval_depsim}

The contribution of \staticdeps{} to a performance model largely resides in its
ability to track memory-carried dependencies, including loop-carried ones. We
thus focus on evaluating this aspect, and restrict both \depsim{} and
\staticdeps{} to memory-carried dependencies.
We use the binaries produced by \cesasme{} as a dataset, as we have already
assessed its relevance and it contains enough benchmarks to be statistically
meaningful. We also already have tooling and basic-block segmentation available
for those benchmarks, making the analysis more convenient.

\medskip{}
For each binary previously generated by \cesasme{}, we use its cached basic
block splitting and occurrence count. Within each binary, we discard any basic
block with fewer than 10\,\% of the occurrence count of the most-hit basic
block; this avoids considering basic blocks which were not originally inside
loops, and for which loop-carried dependencies would make no sense ---~and
could possibly create false positives.
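This selection step boils down to a simple threshold on the profile data; a
minimal sketch, assuming occurrence counts are available as a mapping from
basic-block address to hit count (the data layout here is hypothetical):

\begin{lstlisting}[language=Python]
def select_hot_blocks(occurrences, ratio=0.1):
    """Keep only basic blocks hit at least `ratio` times as often as the
    most-hit block of the binary."""
    if not occurrences:
        return []
    cutoff = ratio * max(occurrences.values())
    return [bb for bb, count in occurrences.items() if count >= cutoff]
\end{lstlisting}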
For each of the considered binaries, we run our dynamic analysis, \depsim{},
and record its results.

For each of the considered basic blocks, we run our static analysis,
\staticdeps{}. We translate the detected dependencies back to the original ELF
addresses, and discard the $\Delta{}k$ parameter, as our dynamic analysis does
not report an equivalent parameter, but only a pair of program counters. Each
of the dependencies reported by \depsim{} whose source and destination
addresses belong to the considered basic block is then classified as either
detected or missed by \staticdeps{}. Dynamically detected dependencies spanning
across basic blocks are discarded, as \staticdeps{} cannot detect them by
construction.
\medskip{}

We consider two metrics: the unweighted dependency coverage, \[
\cov_u = \dfrac{\card{\text{found}}}{\card{\text{found}} + \card{\text{missed}}}
\]
as well as the weighted dependency coverage, \[
\cov_w =
    \dfrac{
        \sum_{d \in \text{found}} \rho_d
    }{\sum_{d \in \text{found}\,\cup\,\text{missed}} \rho_d}
\]
where $\rho_d$ is the number of occurrences of the dependency $d$, as
dynamically detected by \depsim{}.
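Both metrics can be computed directly from the classification above; a minimal
sketch, assuming \texttt{found} and \texttt{missed} map each dynamically
detected dependency to its occurrence count $\rho_d$ (this data layout is ours,
for illustration):

\begin{lstlisting}[language=Python]
def coverages(found, missed):
    """Return (cov_u, cov_w) from {dependency: occurrence_count} dicts."""
    total = len(found) + len(missed)
    if total == 0:
        return 1.0, 1.0            # nothing to detect: full coverage
    cov_u = len(found) / total
    rho_found = sum(found.values())
    cov_w = rho_found / (rho_found + sum(missed.values()))
    return cov_u, cov_w
\end{lstlisting}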
\begin{table}
    \centering
    \begin{tabular}{r r r}
        \toprule
        \textbf{Lifetime} & $\cov_u$ (\%) & $\cov_w$ (\%) \\
        \midrule
        $\infty$ & 38.1\,\% & 44.0\,\% \\
        1024 & 57.6\,\% & 58.2\,\% \\
        512 & 56.4\,\% & 63.5\,\% \\
        \bottomrule
    \end{tabular}
    \caption{Unweighted and weighted coverage of \staticdeps{} on \cesasme{}'s
    binaries}\label{table:cov_staticdeps}
\end{table}
These metrics are presented for the 3\,500 binaries of \cesasme{} in the first
data row of \autoref{table:cov_staticdeps}. The obtained coverage, of about
40\,\%, is lower than expected.

\bigskip{}
Manual investigation of the missed dependencies revealed some surprising
dependencies dynamically detected by \depsim{}, which did not appear to be
actual read-after-write dependencies. In the following (simplified) example,
roughly implementing $A[i] = C\times{}A[i] + B[i]$,
\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
loop:
    vmulsd (%rax,%rdi), %xmm0, %xmm1
    vaddsd (%rbx,%rdi), %xmm1, %xmm1
    vmovsd %xmm1, (%rax,%rdi)
    add $8, %rdi
    cmp %rdi, %r10
    jne loop
\end{lstlisting}
\end{minipage}\\
a read-after-write dependency from line 4 to line 2 was reported ---~while no
such dependency actually exists.
The reason is that, in \cesasme{}'s benchmarks, the whole program roughly looks
like
\begin{lstlisting}[language=C]
/* Initialize A, B, C here */
for(int measure=0; measure < NUM_MEASURES; ++measure) {
    measure_start();
    for(int repeat=0; repeat < NUM_REPEATS; ++repeat) {
        for(int i=0; i < ARRAY_SIZE; ++i) {
            A[i] = C * A[i] + B[i];
        }
    }
    measure_stop();
}
\end{lstlisting}
Thus, the previously reported dependency did not come from within the kernel,
but \emph{from one outer iteration to the next} (\ie{}, iteration on
\lstc{repeat} or \lstc{measure} in the code above). Such a dependency is not,
in practice, relevant: under the common assumptions of code analyzers, the
innermost loop is long enough to be considered infinite in steady state; thus,
two outer loop iterations are too far apart in time for this dependency to have
any relevance, as the source iteration has long finished executing by the time
the destination iteration is scheduled.

\medskip{}
To filter out such irrelevant dependencies, we introduce in \depsim{} a notion
of \emph{dependency lifetime}. As we cannot access the elapsed cycle count in
\valgrind{} without a heavy runtime slowdown, we define a \emph{timestamp} as
the number of instructions executed since the beginning of the program's
execution; we increment this count only at each branch instruction, to avoid
excessive instrumentation slowdown.

We further annotate every write to the shadow memory with the timestamp at
which it occurred. Whenever a dependency should be added, we first check that
the dependency has not expired ---~that is, that it is not older than a given
threshold.
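The mechanism can be modelled as follows (a sketch in Python for consistency
with the previous examples, even though \depsim{} itself is implemented as a
\valgrind{} tool; names and data layout are illustrative only):

\begin{lstlisting}[language=Python]
class ShadowMemory:
    """Maps each address to the (writer_pc, timestamp) of its last write."""
    def __init__(self, lifetime):
        self.lifetime = lifetime    # e.g. 1024 or 512 instructions
        self.last_write = {}        # address -> (writer_pc, timestamp)

    def on_write(self, addr, writer_pc, now):
        self.last_write[addr] = (writer_pc, now)

    def on_read(self, addr, reader_pc, now, deps):
        """Record a RAW dependency only if the write is recent enough."""
        entry = self.last_write.get(addr)
        if entry is None:
            return
        writer_pc, written_at = entry
        if now - written_at <= self.lifetime:     # dependency not expired
            key = (writer_pc, reader_pc)
            deps[key] = deps.get(key, 0) + 1      # occurrence count rho_d
\end{lstlisting}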
We re-run the previous experiments with lifetimes of respectively 1\,024 and
512 instructions, which roughly correspond to the order of magnitude of the
size of a reorder buffer; the results can also be found in
\autoref{table:cov_staticdeps}. While the introduction of a 1\,024-instruction
lifetime greatly improves the coverage rates, both unweighted and weighted,
further reducing this lifetime to 512 does not yield significant enhancements.

\bigskip{}

The final coverage results, with a detection rate of roughly 60\,\%, are
reasonable and capture a significant proportion of dependencies; however, many
are still not detected.
This may be explained by the limitations studied in
\autoref{ssec:staticdeps_limits} above, and especially the inability of
\staticdeps{} to detect dependencies through aliasing pointers. This falls,
more broadly, within the problem of lack of context that we pointed out before
and emphasized in \autoref{chap:CesASMe}: we expect that an analysis at the
scale of the whole program, able to integrate constraints stemming from outside
of the loop body, would capture many more dependencies.
\subsection{Enriching \uica{}'s model}

To estimate the real gain in performance debugging scenarios, we integrate
\staticdeps{} into \uica{}.

There is, however, a discrepancy between the two tools: while \staticdeps{}
works at the assembly instruction level, \uica{} works at the \uop{} level. In
real hardware, dependencies indeed occur between \uops{}; however, we are not
aware of any \uop{}-level semantic description of the x86-64 ISA, which makes
this level of detail unsuitable for the \staticdeps{} analysis.
We bridge this gap in a conservative way: whenever two instructions $i_1, i_2$
are found to be dependent, we add a dependency between each pair $\mu_1 \in
i_1, \mu_2 \in i_2$. This approximation is thus pessimistic, and should predict
execution times biased towards a slower computation kernel. A finer model, or a
finer (conservative) filtering of which \uops{} must be considered dependent
---~\eg{} a memory dependency can only come from a memory-related \uop{}~---
may enhance the accuracy of our integration.
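This conservative expansion is straightforward; a sketch of it, treating
\uops{} abstractly (the actual hook into \uica{}'s data structures is more
involved, and the helper names are ours):

\begin{lstlisting}[language=Python]
from itertools import product

def expand_to_uops(insn_deps, uops_of):
    """Turn instruction-level dependencies into uop-level ones by
    pessimistically linking every uop of the source instruction to every
    uop of the destination instruction."""
    uop_deps = set()
    for src_insn, dst_insn in insn_deps:
        for mu1, mu2 in product(uops_of(src_insn), uops_of(dst_insn)):
            uop_deps.add((mu1, mu2))
    return uop_deps
\end{lstlisting}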
\medskip{}

We then evaluate our gains by running \cesasme{}'s harness as we did in
\autoref{chap:CesASMe}, running both \uica{} and \uicadeps{}, on two datasets:
first, the full set of 3\,500 binaries from the previous chapter; then, the
set of binaries pruned to exclude benchmarks heavily relying on memory-carried
dependencies introduced in \autoref{ssec:memlatbound}. If \staticdeps{} is
beneficial to \uica{}, we expect \uicadeps{} to yield significantly better
results than \uica{} alone on the first dataset. On the second dataset,
however, \staticdeps{} should provide no significant contribution, as the
dataset was pruned to not exhibit significant memory-carried latency-boundness.
We present these results in \autoref{table:staticdeps_uica_cesasme}, as well as
the corresponding box-plots in \autoref{fig:staticdeps_uica_cesasme_boxplot}.
\begin{table}
    \centering
    \footnotesize
    \begin{tabular}{l l r r r r r r}
        \toprule
        \textbf{Dataset} & \textbf{Bencher} & \textbf{Datapoints} & \textbf{MAPE} & \textbf{Median} & \textbf{Q1} & \textbf{Q3} & \textbf{$K_\tau$}\\
        \midrule
        \multirow{2}{*}{Full} & \uica{} & 3500 & 29.59\,\% & 18.26\,\% & 7.11\,\% & 52.99\,\% & 0.58\\
        & + \staticdeps{} & 3500 & 19.15\,\% & 14.44\,\% & 5.86\,\% & 23.96\,\% & 0.81\\
        \midrule
        \multirow{2}{*}{Pruned} & \uica{} & 2388 & 18.42\,\% & 11.96\,\% & 5.42\,\% & 23.32\,\% & 0.80\\
        & + \staticdeps{} & 2388 & 18.77\,\% & 12.18\,\% & 5.31\,\% & 23.55\,\% & 0.80\\
        \bottomrule
    \end{tabular}
    \caption{Evaluation through \cesasme{} of the integration of \staticdeps{}
    into \uica{}}\label{table:staticdeps_uica_cesasme}
\end{table}
\begin{figure}
    \centering
    \includegraphics[width=0.5\linewidth]{uica_cesasme_boxplot.svg}
    \caption{Statistical distribution of the relative errors of \uica{}, with
    and without \staticdeps{} hints, with and without pruning the rows that are
    latency bound through memory-carried
    dependencies}\label{fig:staticdeps_uica_cesasme_boxplot}
\end{figure}

\medskip{}
The full-dataset \uicadeps{} row is extremely close, on every metric, to the
pruned, \uica{}-only row. On this basis, we argue that \staticdeps{}' addition
to \uica{} is very conclusive: the hints provided by \staticdeps{} are
sufficient to make \uica{}'s results as good on the full dataset as they were
before on a dataset pruned of precisely the kind of dependencies we aim to
detect. Furthermore, \uica{}'s and \uicadeps{}' results on the pruned dataset
are extremely close: this further supports the accuracy of \staticdeps{}.

\medskip{}
While the results obtained against \depsim{} in
\autoref{ssec:staticdeps_eval_depsim} above were reasonable, they were not
excellent either, and showed that many kinds of dependencies were still missed
by \staticdeps{}. However, our evaluation on \cesasme{} by enriching \uica{}
shows that, at least on the workload considered, the dependencies that actually
matter from a performance debugging point of view are properly found.

This, however, might not be true for other kinds of applications that would
require a dependency analysis.
\subsection{Analysis speed}

\todo{}
1772 manuscrit/assets/imgs/60_staticdeps/uica_cesasme_boxplot.svg (new file)
File diff suppressed because it is too large. Size: 87 KiB
@ -52,6 +52,8 @@
\newcommand{\valgrind}{\texttt{valgrind}}
\newcommand{\vex}{\texttt{VEX}}

\newcommand{\uicadeps}{\uica{}~+~\staticdeps{}}

\newcommand{\gdb}{\texttt{gdb}}

\newcommand{\coeq}{CO$_{2}$eq}
@ -60,5 +62,7 @@
\newcommand{\reg}[1]{\texttt{\%#1}}

\newcommand{\cov}{\operatorname{cov}}

% Hyperlinks
\newcommand{\pymodule}[1]{\href{https://docs.python.org/3/library/#1.html}{\lstpython{#1}}}