Staticdeps: evaluation part

Théophile Bastian 2023-09-29 16:04:48 +02:00
parent 3511d27516
commit a9cfaef2f9
5 changed files with 2066 additions and 12 deletions

View file

@ -1,4 +1,4 @@
\section{Finding basic blocks to evaluate \palmed{}}
\section{Finding basic blocks to evaluate \palmed{}}\label{sec:benchsuite_bb}
In the context of all that is described above, my main task in the environment
of \palmed{} was to build a system able to evaluate a produced mapping on a

View file

@ -1,4 +1,4 @@
\section{The \staticdeps{} heuristic}
\section{Staticdeps}
The static analyzer we present, \staticdeps{}, only aims to tackle the
difficulty~\ref{memcarried_difficulty_arith} mentioned above: tracking
@ -12,7 +12,8 @@ This problem could be solved using symbolic calculus algorithms. However, those
algorithms are not straightforward to implement, and the equality test between
two arbitrary expressions can be costly.
\medskip{}
\subsection{The \staticdeps{} heuristic}
Instead, we use a heuristic based on random values. We consider the set $\calR
= \left\{0, 1, \ldots, 2^{64}-1\right\}$ of values representable by a 64-bit
unsigned integer; we extend this set to $\bar\calR = \calR \cup \{\bot\}$,
@ -45,20 +46,54 @@ register-carried dependencies, applying the following principles.
known).
\end{itemize}
The semantics needed to compute encountered operations are obtained by lifting
the kernel's assembly to \valgrind{}'s \vex{} intermediate representation.
\subsection{Practical implementation}
We implement \staticdeps{} in Python, using \texttt{pyelftools} and the
\texttt{capstone} disassembler ---~which we already introduced in
\autoref{sec:benchsuite_bb}~--- to extract and disassemble the targeted basic
block. The semantics needed to compute encountered operations are obtained by
lifting the kernel's assembly to \valgrind{}'s \vex{} intermediate
representation.
\medskip{}
This first analysis provides us with a raw list of dependencies across
iterations of the considered basic block. We then ``re-roll'' the unrolled
kernel by transcribing each dependency to a triplet $(\texttt{source\_insn},
\texttt{dest\_insn}, \Delta{}k)$, where the first two elements are the source
and destination instruction of the dependency \emph{in the original,
non-unrolled kernel}, and $\Delta{}k$ is the number of iterations of the kernel
between the source and destination instruction of the dependency.
The implementation of the heuristic detailed above provides us with a raw list
of dependencies across iterations of the considered basic block. We then
``re-roll'' the unrolled kernel by transcribing each dependency to a triplet
$(\texttt{source\_insn}, \texttt{dest\_insn}, \Delta{}k)$, where the first two
elements are the source and destination instruction of the dependency \emph{in
the original, non-unrolled kernel}, and $\Delta{}k$ is the number of iterations
of the kernel between the source and destination instruction of the dependency.
Finally, we filter out spurious dependencies: each dependency found should
occur for each kernel iteration $i$ at which $i + \Delta{}k$ is within bounds.
If the dependency is found for fewer than $80\,\%$ of those iterations, it is
declared spurious and dropped.
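As an illustration, the re-rolling and filtering steps described above could be
sketched as follows in Python. This is a minimal sketch, not \staticdeps{}'s
actual code: the function name \texttt{reroll\_and\_filter}, the encoding of raw
dependencies as pairs of instruction positions in the unrolled kernel, and the
parameter names are assumptions made for this example. It also assumes that the
source of a dependency never comes after its destination, so that
$\Delta{}k \geq 0$.

```python
from collections import Counter


def reroll_and_filter(raw_deps, kernel_len, n_unroll, threshold=0.8):
    """Re-roll raw dependencies and drop spurious ones.

    raw_deps: iterable of (src_pos, dst_pos), positions in the unrolled kernel
              (hypothetical encoding: pos = iteration * kernel_len + insn index).
    Returns a set of (source_insn, dest_insn, delta_k) triplets.
    """
    counts = Counter()
    for src_pos, dst_pos in set(raw_deps):  # ignore duplicate reports
        src_iter, src_insn = divmod(src_pos, kernel_len)
        dst_iter, dst_insn = divmod(dst_pos, kernel_len)
        counts[(src_insn, dst_insn, dst_iter - src_iter)] += 1

    kept = set()
    for (src_insn, dst_insn, delta_k), found in counts.items():
        # Iterations i for which i + delta_k is still within the unrolled kernel.
        expected = n_unroll - delta_k
        if expected > 0 and found >= threshold * expected:
            kept.add((src_insn, dst_insn, delta_k))
    return kept
```

With a kernel of 3 instructions unrolled 5 times, a dependency observed at
every possible iteration is kept, while one observed only once out of three
possible iterations is dropped.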
\subsection{Limitations}\label{ssec:staticdeps_limits}
In \autoref{chap:CesASMe}, we argued that one of the shortcomings that most
crippled state-of-the-art tools was that analyses were conducted
out-of-context, considering only the basic block at hand. This criticism also
applies to \staticdeps{}, as it still focuses on a single basic block in
isolation; in particular, any aliasing that stems from outside of the analyzed
basic block is not visible to \staticdeps{}.
Work towards a broader analysis scope, \eg{} at the scale of a function, or at
least towards initializing values using gathered assertions ---~maybe based on
abstract interpretation techniques~--- could improve the quality of dependency
detection.
\medskip{}
As \staticdeps{}'s heuristic is based on randomness in a Monte Carlo sense, it
may yield false positives: two registers could theoretically be assigned the
same value sampled at random, making them aliasing addresses. This is, however,
very improbable, as values are sampled from a set of cardinality $2^{64}$. If
necessary, the error can be reduced by amplification: running the algorithm
multiple times with different random seeds reduces the error exponentially.
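To give an order of magnitude for this argument, consider the following
back-of-the-envelope bound. It rests on assumptions made for this illustration
only: that a false positive requires two of the $n$ values sampled during one
run to collide, that these values are sampled independently and uniformly from
$\calR$, and that, when amplifying, only the dependencies reported by all $k$
independent runs are kept. Under these assumptions, a union bound over all
pairs of sampled values gives
\[
    p = \Pr[\text{collision in one run}]
    \leq \binom{n}{2} \cdot 2^{-64} = \frac{n(n-1)}{2^{65}},
    \qquad
    \Pr[\text{same false positive in all $k$ runs}] \leq p^k
\]
For instance, $n = 100$ sampled values yield $p \leq 4950 \cdot 2^{-64} \approx
2.7 \times 10^{-16}$ for a single run.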
Conversely, \staticdeps{} should not present false negatives due to randomness.
Dependencies may go undetected, \eg{} because of out-of-scope aliasing or
unsupported operations. However, no dependency that falls into the scope of
\depsim{}'s analysis should be missed because of random initialisations.

View file

@ -1 +1,244 @@
\section{Evaluation}
We evaluate the relevance of \staticdeps{} results in two ways: first, we
compare the detected dependencies to those extracted at runtime by \depsim{},
to evaluate the proportion of dependencies actually detected. Then, we evaluate
the relevance of our static analysis from a performance debugging point of
view, by enriching \uica{}'s model with \staticdeps{} and assessing, using
\cesasme{}, the benefits brought to the model.
We finally evaluate our claim that using a static model instead of a dynamic
analysis, such as \gus{}, makes \staticdeps{} yield a result in a reasonable
amount of time.
\subsection{Comparison to \depsim{} results}\label{ssec:staticdeps_eval_depsim}
The contribution of \staticdeps{}'s model largely resides in its ability to track
memory-carried dependencies, including loop-carried ones. We thus focus on
evaluating this aspect, and restrict both \depsim{} and \staticdeps{} to
memory-carried dependencies.
We use the binaries produced by \cesasme{} as a dataset, as we already assessed
its relevance and it contains enough benchmarks to be statistically meaningful. We
also already have tooling and basic-block segmentation available for those
benchmarks, making the analysis more convenient.
\medskip{}
For each binary previously generated by \cesasme{}, we use its cached basic
block splitting and occurrence count. Within each binary, we discard any basic
block with fewer than 10\,\% of the occurrence count of the most-hit basic
block; this avoids considering basic blocks which were not originally inside
loops, and for which loop-carried dependencies would make no sense ---~and
could possibly create false positives.
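The selection of basic blocks described above amounts to a simple threshold on
occurrence counts. A minimal sketch, assuming the cached occurrence counts are
available as a dictionary (the function name and data layout are hypothetical,
not those of \cesasme{}'s tooling):

```python
def select_hot_blocks(occurrences, percent=10):
    """Keep basic blocks hit at least `percent` % as often as the most-hit one.

    occurrences: dict mapping a basic-block identifier to its occurrence count
                 (hypothetical layout for this example).
    """
    if not occurrences:
        return {}
    max_count = max(occurrences.values())
    # Integer arithmetic avoids floating-point rounding at the threshold.
    return {
        bb: count
        for bb, count in occurrences.items()
        if count * 100 >= percent * max_count
    }
```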
For each of the considered binaries, we run our dynamic analysis, \depsim{},
and record its results.
For each of the considered basic blocks, we run our static analysis,
\staticdeps{}. We translate the detected dependencies back to original ELF
addresses, and discard the $\Delta{}k$ parameter, as our dynamic analysis does
not report an equivalent parameter, but only a pair of program counters. Each
dependency reported by \depsim{} whose source and destination addresses both
belong to the considered basic block is then classified as either
detected or missed by \staticdeps{}. Dynamically detected dependencies spanning
across basic blocks are discarded, as \staticdeps{} cannot detect them by
construction.
\medskip{}
We consider two metrics: the unweighted dependency coverage, \[
\cov_u = \dfrac{\card{\text{found}}}{\card{\text{found}} + \card{\text{missed}}}
\]
as well as the weighted dependency coverage, \[
\cov_w =
\dfrac{
\sum_{d \in \text{found}} \rho_d
}{\sum_{d \in \text{found}\,\cup\,\text{missed}} \rho_d}
\]
where $\rho_d$ is the number of occurrences of the dependency $d$, as
dynamically detected by \depsim{}.
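Both metrics can be computed directly from the two sets of dependencies. The
following is a minimal sketch under assumed data layouts: dynamic dependencies
are given as a dictionary mapping a (source address, destination address) pair
to its occurrence count $\rho_d$, already restricted to the considered basic
block, and static dependencies as a set of such pairs. The function name is
hypothetical.

```python
def coverage(dynamic_deps, static_deps):
    """Return (cov_u, cov_w), or None if there is no dynamic dependency.

    dynamic_deps: dict {(src_addr, dst_addr): occurrence count rho_d}
    static_deps:  set of (src_addr, dst_addr) pairs found statically
    """
    total_weight = sum(dynamic_deps.values())
    if not dynamic_deps or total_weight == 0:
        return None
    # A dynamic dependency is "found" if the static analysis also reports it;
    # every other dynamic dependency is "missed".
    found = {d: rho for d, rho in dynamic_deps.items() if d in static_deps}
    cov_u = len(found) / len(dynamic_deps)
    cov_w = sum(found.values()) / total_weight
    return cov_u, cov_w
```

Note that dependencies reported only by the static analysis do not enter either
metric, as both are defined relative to the dynamically detected dependencies.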
\begin{table}
\centering
\begin{tabular}{r r r}
\toprule
\textbf{Lifetime} (instructions) & $\cov_u$ (\%) & $\cov_w$ (\%) \\
\midrule
$\infty$ & 38.1 & 44.0 \\
1\,024 & 57.6 & 58.2 \\
512 & 56.4 & 63.5 \\
\bottomrule
\end{tabular}
\caption{Unweighted and weighted coverage of \staticdeps{} on \cesasme{}'s
binaries}\label{table:cov_staticdeps}
\end{table}
These metrics are presented for the 3\,500 binaries of \cesasme{} in the first
data row of \autoref{table:cov_staticdeps}. The obtained coverage, of about
40\,\%, is lower than expected.
\bigskip{}
Manual investigation of the missed dependencies revealed that some
dependencies dynamically detected by \depsim{} surprisingly did not appear to
be actual read-after-write dependencies. In the following (simplified) example,
roughly implementing $A[i] = C\times{}A[i] + B[i]$,
\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
loop:
    vmulsd (%rax,%rdi), %xmm0, %xmm1
    vaddsd (%rbx,%rdi), %xmm1, %xmm1
    vmovsd %xmm1, (%rax,%rdi)
    add $8, %rdi
    cmp %rdi, %r10
    jne loop
\end{lstlisting}
\end{minipage}\\
a read-after-write dependency from line 4 to line 2 was reported ---~while no
such dependency actually exists.
This is because, in \cesasme{}'s benchmarks, the whole program roughly looks
like
\begin{lstlisting}[language=C]
/* Initialize A, B, C here */
for(int measure=0; measure < NUM_MEASURES; ++measure) {
    measure_start();
    for(int repeat=0; repeat < NUM_REPEATS; ++repeat) {
        for(int i=0; i < ARRAY_SIZE; ++i) {
            A[i] = C * A[i] + B[i];
        }
    }
    measure_stop();
}
\end{lstlisting}
Thus, the previously reported dependency did not come from within the kernel,
but \emph{from one outer iteration to the next} (\ie{}, iteration on
\lstc{repeat} or \lstc{measure} in the code above). Such a dependency is not,
in practice, relevant: under the usual assumptions of code analyzers, the
innermost loop is long enough to be considered infinite in steady state; thus,
two outer loop iterations are too far apart in time for this dependency to have
any relevance, as the source iteration has long finished executing when the
destination iteration is scheduled.
\medskip{}
To discard such irrelevant dependencies, we introduce in \depsim{} a notion of
\emph{dependency lifetime}. As we cannot access elapsed cycles in \valgrind{}
without a heavy runtime slowdown, we define a \emph{timestamp} as the number of
instructions executed since the beginning of the program's execution; we
increment this count only at branch instructions to avoid excessive
instrumentation slowdown.
We further annotate every write to the shadow memory with the timestamp at
which it occurred. Whenever a dependency should be added, we first check that
the dependency has not expired ---~that is, that it is not older than a given
threshold.
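The lifetime mechanism can be modelled in a few lines of Python. This is a
simplified model written for illustration, not \depsim{}'s actual
instrumentation code: the class and method names are hypothetical, addresses
are tracked at a single granularity, and \texttt{tick} stands for the timestamp
update performed at each branch.

```python
from collections import Counter


class ShadowMemory:
    """Toy model of timestamped shadow memory with dependency lifetime."""

    def __init__(self, lifetime=None):
        self.lifetime = lifetime  # None models an infinite lifetime
        self.timestamp = 0        # instructions executed so far
        self.last_write = {}      # address -> (writer pc, timestamp of write)
        self.deps = Counter()     # (writer pc, reader pc) -> occurrence count

    def tick(self, n_insns):
        """Called at each branch: account for the instructions just executed."""
        self.timestamp += n_insns

    def write(self, pc, addr):
        self.last_write[addr] = (pc, self.timestamp)

    def read(self, pc, addr):
        entry = self.last_write.get(addr)
        if entry is None:
            return
        writer_pc, written_at = entry
        # Drop the dependency if the write is older than the lifetime.
        if self.lifetime is not None and self.timestamp - written_at > self.lifetime:
            return
        self.deps[(writer_pc, pc)] += 1
```

With a lifetime of 4 instructions, a read occurring 3 instructions after the
write is recorded as a dependency, while a later read of the same address, 13
instructions after the write, is ignored.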
We re-run the previous experiments with lifetimes of respectively 1\,024 and
512 instructions, which roughly correspond to the order of magnitude of the
size of a reorder buffer; results can also be found in
\autoref{table:cov_staticdeps}. While the introduction of a 1\,024-instruction
lifetime greatly improves the coverage rates, both unweighted and weighted,
further reducing this lifetime to 512 does not yield significant improvements.
\bigskip{}
The final coverage results, with a rough 60\,\% detection rate, are reasonable:
\staticdeps{} detects a significant proportion of dependencies; however, many
are still not detected.
This may be explained by the limitations studied in
\autoref{ssec:staticdeps_limits} above, and especially the inability of
\staticdeps{} to detect dependencies through aliasing pointers. This falls,
more broadly, into the problem of lack of context that we expressed before and
emphasized in \autoref{chap:CesASMe}: we expect that an analysis at the scale
of the whole program, that would be able to integrate constraints stemming from
outside of the loop body, would capture many more dependencies.
\subsection{Enriching \uica{}'s model}
To estimate the real gain in performance debugging scenarios, we
integrate \staticdeps{} into \uica{}.
There is, however, a discrepancy between the two tools: while \staticdeps{}
works at the assembly instruction level, \uica{} works at the \uop{} level. In
real hardware, dependencies indeed occur between \uops{}; however, we are not
aware of the existence of a \uop{}-level semantic description of the x86-64
ISA, which made this level of detail unsuitable for the \staticdeps{} analysis.
We bridge this gap in a conservative way: whenever two instructions $i_1, i_2$
are found to be dependent, we add a dependency between each pair of \uops{} $\mu_1 \in
i_1, \mu_2 \in i_2$. This approximation is thus pessimistic, and should predict
execution times biased towards a slower computation kernel. A finer model, or a
finer (conservative) filtering of which \uops{} must be considered dependent
---~\eg{} a memory dependency can only come from a memory-related \uop{}~---
may enhance the accuracy of our integration.
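This conservative lifting is a Cartesian product over the \uops{} of both
instructions. A minimal sketch, assuming a mapping from each instruction to its
\uops{} is available (the function name and the mapping are hypothetical, and
do not reflect \uica{}'s internal interfaces):

```python
from itertools import product


def lift_to_uops(insn_deps, uops_of):
    """Lift instruction-level dependencies to uop-level ones, conservatively.

    insn_deps: iterable of (i1, i2) dependent instruction pairs
    uops_of:   dict mapping an instruction to the list of its uops
    """
    return {
        (mu1, mu2)
        for i1, i2 in insn_deps
        for mu1, mu2 in product(uops_of[i1], uops_of[i2])
    }
```

For a store decoded into two \uops{} and a load decoded into one, a single
instruction-level dependency thus becomes two \uop{}-level dependencies, which
illustrates why the approximation is pessimistic.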
\medskip{}
We then evaluate our gains by running \cesasme{}'s harness as we did in
\autoref{chap:CesASMe}, running both \uica{} and \uicadeps{}, on two datasets:
first, the full set of 3\,500 binaries from the previous chapter; then, the
pruned set of binaries introduced in \autoref{ssec:memlatbound}, which excludes
benchmarks heavily relying on memory-carried dependencies. If \staticdeps{} is
beneficial to \uica{}, we expect \uicadeps{} to yield significantly better
results than \uica{} alone on the first dataset. On the second dataset,
however, \staticdeps{} should provide no significant contribution, as the
dataset was pruned to not exhibit significant memory-carried latency-boundness.
We present these results in \autoref{table:staticdeps_uica_cesasme}, as well as
the corresponding box-plots in \autoref{fig:staticdeps_uica_cesasme_boxplot}.
\begin{table}
\centering
\footnotesize
\begin{tabular}{l l r r r r r r}
\toprule
\textbf{Dataset} & \textbf{Bencher} & \textbf{Datapoints} & \textbf{MAPE} & \textbf{Median} & \textbf{Q1} & \textbf{Q3} & \textbf{$K_\tau$}\\
\midrule
\multirow{2}{*}{Full} & \uica{} & 3500 & 29.59\,\% & 18.26\,\% & 7.11\,\% & 52.99\,\% & 0.58\\
& + \staticdeps{} & 3500 & 19.15\,\% & 14.44\,\% & 5.86\,\% & 23.96\,\% & 0.81\\
\midrule
\multirow{2}{*}{Pruned} & \uica{} & 2388 & 18.42\,\% & 11.96\,\% & 5.42\,\% & 23.32\,\% & 0.80\\
& + \staticdeps{} & 2388 & 18.77\,\% & 12.18\,\% & 5.31\,\% & 23.55\,\% & 0.80\\
\bottomrule
\end{tabular}
\caption{Evaluation through \cesasme{} of the integration of \staticdeps{}
to \uica{}}\label{table:staticdeps_uica_cesasme}
\end{table}
\begin{figure}
\centering
\includegraphics[width=0.5\linewidth]{uica_cesasme_boxplot.svg}
\caption{Statistical distribution of relative errors of \uica{}, with and
without \staticdeps{} hints, with and without pruning the rows that are
latency-bound through memory-carried dependencies}\label{fig:staticdeps_uica_cesasme_boxplot}
\end{figure}
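For reference, the error statistics reported in the table above can be computed
from predicted and measured values as sketched below. This sketch assumes that
the relative error of a prediction is $|\text{predicted} -
\text{measured}| / \text{measured}$ and that MAPE denotes the mean of these
errors; the function names are hypothetical, the quartiles rely on the default
method of Python's \texttt{statistics.quantiles}, which may differ slightly
from the one used to produce the table, and Kendall's $\tau$ is left out.

```python
from statistics import mean, median, quantiles


def relative_errors(predicted, measured):
    """Relative error of each prediction with respect to its measurement."""
    return [abs(p - m) / m for p, m in zip(predicted, measured)]


def summarize(predicted, measured):
    """Mean, median and quartiles of the relative errors (as ratios, not %)."""
    errors = relative_errors(predicted, measured)
    q1, _, q3 = quantiles(errors, n=4)  # requires at least two data points
    return {"MAPE": mean(errors), "median": median(errors), "Q1": q1, "Q3": q3}
```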
\medskip{}
The \uicadeps{} row on the full dataset is extremely close, on every metric, to
the \uica{}-only row on the pruned dataset. On this basis, we argue that \staticdeps{}' addition
to \uica{} is very conclusive: the hints provided by \staticdeps{} are
sufficient to make \uica{}'s results as good on the full dataset as they were
before on a dataset pruned of precisely the kind of dependencies we aim to
detect. Furthermore, \uica{} and \uicadeps{}' results on the pruned dataset are
extremely close: this further supports the accuracy of \staticdeps{}.
\medskip{}
While the results obtained against \depsim{} in
\autoref{ssec:staticdeps_eval_depsim} above were reasonable, they were not
excellent either, and showed that many kinds of dependencies were still missed
by \staticdeps{}. However, our evaluation on \cesasme{} by enriching \uica{}
shows that, at least on the workload considered, the dependencies that actually
matter from a performance debugging point of view are properly found.
This, however, might not be true for other kinds of applications that would
require a dependency analysis.
\subsection{Analysis speed}
\todo{}

File diff suppressed because it is too large Load diff


View file

@ -52,6 +52,8 @@
\newcommand{\valgrind}{\texttt{valgrind}}
\newcommand{\vex}{\texttt{VEX}}
\newcommand{\uicadeps}{\uica{}~+~\staticdeps{}}
\newcommand{\gdb}{\texttt{gdb}}
\newcommand{\coeq}{CO$_{2}$eq}
@ -60,5 +62,7 @@
\newcommand{\reg}[1]{\texttt{\%#1}}
\newcommand{\cov}{\operatorname{cov}}
% Hyperlinks
\newcommand{\pymodule}[1]{\href{https://docs.python.org/3/library/#1.html}{\lstpython{#1}}}