Staticdeps: evaluation part

2023-09-29 16:04:48 +02:00
commit a9cfaef2f9
parent 3511d27516
5 changed files with 2066 additions and 12 deletions
--- a/manuscrit/30_palmed/35_benchsuite_bb.tex
+++ b/manuscrit/30_palmed/35_benchsuite_bb.tex
@@ -1,4 +1,4 @@
-\section{Finding basic blocks to evaluate \palmed{}}
+\section{Finding basic blocks to evaluate \palmed{}}\label{sec:benchsuite_bb}

 In the context of all that is described above, my main task in the environment
 of \palmed{} was to build a system able to evaluate a produced mapping on a
--- a/manuscrit/60_staticdeps/40_staticdeps.tex
+++ b/manuscrit/60_staticdeps/40_staticdeps.tex
@@ -1,4 +1,4 @@
-\section{The \staticdeps{} heuristic}
+\section{Staticdeps}

 The static analyzer we present, \staticdeps{}, only aims to tackle the
 difficulty~\ref{memcarried_difficulty_arith} mentioned above: tracking
@@ -12,7 +12,8 @@ This problem could be solved using symbolic calculus algorithms. However, those
 algorithms are not straightforward to implement, and the equality test between
 two arbitrary expressions can be costly.

-\medskip{}
+\subsection{The \staticdeps{} heuristic}
+
 Instead, we use an heuristic based on random values. We consider the set $\calR
 = \left\{0, 1, \ldots, 2^{64}-1\right\}$ of values representable by a 64-bits
 unsigned integer; we extend this set to $\bar\calR = \calR \cup \{\bot\}$,
@@ -45,20 +46,54 @@ register-carried dependencies, applying the following principles.
        known).
 \end{itemize}

-The semantics needed to compute encountered operations are obtained by lifting
-the kernel's assembly to \valgrind{}'s \vex{} intermediary representation.
+\subsection{Practical implementation}
+
+We implement \staticdeps{} in Python, using \texttt{pyelftools} and the
+\texttt{capstone} disassembler ---~which we already introduced in
+\autoref{sec:benchsuite_bb}~--- to extract and disassemble the targeted basic
+block. The semantics needed to compute encountered operations are obtained by
+lifting the kernel's assembly to \valgrind{}'s \vex{} intermediary
+representation.

 \medskip{}

-This first analysis provides us with a raw list of dependencies across
-iterations of the considered basic block. We then ``re-roll'' the unrolled
-kernel by transcribing each dependency to a triplet $(\texttt{source\_insn},
-\texttt{dest\_insn}, \Delta{}k)$, where the first two elements are the source
-and destination instruction of the dependency \emph{in the original,
-non-unrolled kernel}, and $\Delta{}k$ is the number of iterations of the kernel
-between the source and destination instruction of the dependency.
+The implementation of the heuristic detailed above provides us with a raw list
+of dependencies across iterations of the considered basic block. We then
+``re-roll'' the unrolled kernel by transcribing each dependency to a triplet
+$(\texttt{source\_insn}, \texttt{dest\_insn}, \Delta{}k)$, where the first two
+elements are the source and destination instruction of the dependency \emph{in
+the original, non-unrolled kernel}, and $\Delta{}k$ is the number of iterations
+of the kernel between the source and destination instruction of the dependency.

 Finally, we filter out spurious dependencies: each dependency found should
 occur for each kernel iteration $i$ at which $i + \Delta{}k$ is within bounds.
 If the dependency is found for less than $80\,\%$ of those iterations, the
 dependency is declared spurious and is dropped.
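+
+To make this filtering criterion concrete, here is a sketch under assumed
+notation: the unrolling factor $N$ and the sets $I_{\Delta{}k}$ and $F_d$ are
+introduced here for illustration only. Suppose the kernel is unrolled $N$
+times, with iterations numbered from $0$ to $N-1$. For a candidate dependency
+$d$ of distance $0 \leq \Delta{}k < N$, the iterations at which $d$ is
+expected are \[
+    I_{\Delta{}k} = \left\{ i \,\middle|\, 0 \leq i \leq N - 1 - \Delta{}k \right\},
+    \qquad \card{I_{\Delta{}k}} = N - \Delta{}k
+\]
+and, noting $F_d \subseteq I_{\Delta{}k}$ the iterations at which $d$ was
+actually found, $d$ is kept if and only if \[
+    \dfrac{\card{F_d}}{\card{I_{\Delta{}k}}} \geq 0.8
+\]
+For instance, with $N = 10$ and $\Delta{}k = 1$, there are $9$ expected
+iterations: $d$ is dropped if found at only $7$ of them ($7/9 \approx
+77.8\,\%$), and kept if found at $8$ or more ($8/9 \approx 88.9\,\%$).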
+
+\subsection{Limitations}\label{ssec:staticdeps_limits}
+
+In \autoref{chap:CesASMe}, we argued that one of the shortcomings that most
+crippled state-of-the-art tools was that analyses were conducted
+out-of-context, considering only the basic block at hand. This criticism also
+applies to \staticdeps{}, as it is still focused on a single basic block in
+isolation; in particular, any aliasing that stems from outside of the analyzed
+basic block is not visible to \staticdeps{}.
+
+Work towards a broader analysis range, \eg{} at the scale of a function, or at
+least initializing values with gathered assertions (possibly based on abstract
+interpretation techniques), could improve the quality of dependency
+detection.
+
+\medskip{}
+
+As \staticdeps{}'s heuristic is based on randomness in a Monte Carlo sense, it
+may yield false positives: two registers could theoretically be assigned the
+same value sampled at random, making them aliasing addresses. This is, however,
+very improbable, as values are sampled from a set of cardinality $2^{64}$. If
+necessary, the error can be reduced by amplification: running the algorithm
+multiple times with different random seeds reduces the error exponentially.
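+
+As a rough, back-of-the-envelope sketch of this amplification argument, assume
+that the two colliding values $v_1, v_2$ are independent and uniformly sampled
+from $\calR$, and that a dependency is kept only if it is reported by every
+run; the number of runs $n$ is notation introduced here for illustration. The
+probability of a collision in a single run is then \[
+    \Pr\left[v_1 = v_2\right] = \dfrac{1}{\card{\calR}} = 2^{-64}
+\]
+so that a given spurious dependency survives $n$ independent runs with
+probability \[
+    \left(2^{-64}\right)^n = 2^{-64n}
+\]
+which decreases exponentially with $n$; two runs already bring this
+probability down to $2^{-128}$.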
+
+Conversely, \staticdeps{} should not present false negatives due to randomness.
+Dependencies may go undetected, \eg{} because of out-of-scope aliasing or
+unsupported operations. However, no dependency that falls into the scope of
+\staticdeps{}'s analysis should be missed because of random initialisations.
--- a/manuscrit/60_staticdeps/50_eval.tex
+++ b/manuscrit/60_staticdeps/50_eval.tex
@@ -1 +1,244 @@
 \section{Evaluation}
+
+We evaluate the relevance of \staticdeps{} results in two ways: first, we
+compare the detected dependencies to those extracted at runtime by \depsim{},
+to evaluate the proportion of dependencies actually detected. Then, we evaluate
+the relevance of our static analysis from a performance debugging point of
+view, by enriching \uica{}'s model with \staticdeps{} and assessing, using
+\cesasme{}, the benefits brought to the model.
+
+We finally evaluate our claim that using a static model instead of a dynamic
+analysis, such as \gus{}, makes \staticdeps{} yield a result in a reasonable
+amount of time.
+
+\subsection{Comparison to \depsim{} results}\label{ssec:staticdeps_eval_depsim}
+
+The contribution of \staticdeps{}'s model largely resides in its ability to
+track memory-carried dependencies, including loop-carried ones. We thus focus on
+evaluating this aspect, and restrict both \depsim{} and \staticdeps{} to
+memory-carried dependencies.
+
+We use the binaries produced by \cesasme{} as a dataset, as we already assessed
+its relevance and it contains enough benchmarks to be statistically meaningful. We
+also already have tooling and basic-block segmentation available for those
+benchmarks, making the analysis more convenient.
+
+\medskip{}
+
+For each binary previously generated by \cesasme{}, we use its cached basic
+block splitting and occurrence count. Within each binary, we discard any basic
+block with fewer than 10\,\% of the occurrence count of the most-hit basic
+block; this avoids considering basic blocks which were not originally inside
+loops, and for which loop-carried dependencies would make no sense ---~and
+could possibly create false positives.
+
+For each of the considered binaries, we run our dynamic analysis, \depsim{},
+and record its results.
+
+For each of the considered basic blocks, we run our static analysis,
+\staticdeps{}. We translate the detected dependencies back to original ELF
+addresses, and discard the $\Delta{}k$ parameter, as our dynamic analysis does
+not report an equivalent parameter, but only a pair of program counters. Each
+of the dependencies reported by \depsim{} whose source and destination
+addresses belong to the basic block considered is then classified as either
+detected or missed by \staticdeps{}. Dynamically detected dependencies spanning
+across basic blocks are discarded, as \staticdeps{} cannot detect them by
+construction.
+
+\medskip{}
+
+We consider two metrics: the unweighted dependency coverage, \[
+    \cov_u = \dfrac{\card{\text{found}}}{\card{\text{found}} + \card{\text{missed}}}
+\]
+
+as well as the weighted dependency coverage, \[
+    \cov_w = 
+        \dfrac{
+            \sum_{d \in \text{found}} \rho_d
+        }{\sum_{d \in \text{found}\,\cup\,\text{missed}} \rho_d}
+\]
+where $\rho_d$ is the number of occurrences of the dependency $d$, dynamically
+detected by \depsim.
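+
+As a purely illustrative example (the figures below are made up and do not
+come from our dataset), consider a basic block for which \depsim{} reports
+three dependencies $d_1, d_2, d_3$, occurring respectively $\rho_{d_1} = 100$,
+$\rho_{d_2} = 10$ and $\rho_{d_3} = 1$ times, and of which \staticdeps{} finds
+only $d_1$ and $d_3$. Then, \[
+    \cov_u = \dfrac{2}{2 + 1} \approx 66.7\,\%
+    \qquad
+    \cov_w = \dfrac{100 + 1}{100 + 10 + 1} = \dfrac{101}{111} \approx 91.0\,\%
+\]
+The weighted metric thus gives more importance to the dependencies that occur
+most often at runtime.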
+
+\begin{table}
+    \centering
+    \begin{tabular}{r r r}
+        \toprule
+        \textbf{Lifetime (instructions)} & $\cov_u$ & $\cov_w$ \\
+        \midrule
+        $\infty$ & 38.1\,\% & 44.0\,\% \\
+        1024 & 57.6\,\% & 58.2\,\% \\
+        512 & 56.4\,\% & 63.5\,\% \\
+        \bottomrule
+    \end{tabular}
+    \caption{Unweighted and weighted coverage of \staticdeps{} on \cesasme{}'s
+    binaries}\label{table:cov_staticdeps}
+\end{table}
+
+
+These metrics are presented for the 3\,500 binaries of \cesasme{} in the first
+data row of \autoref{table:cov_staticdeps}. The obtained coverage, of about
+40\,\%, is lower than expected.
+
+\bigskip{}
+
+Manual investigation of the missed dependencies revealed some surprising
+dependencies dynamically detected by \depsim{}, which did not appear to
+actually be read-after-write dependencies. In the following (simplified) example,
+roughly implementing $A[i] = C\times{}A[i] + B[i]$,
+\begin{minipage}{0.95\linewidth}
+\begin{lstlisting}[language={[x86masm]Assembler}]
+loop:
+    vmulsd (%rax,%rdi), %xmm0, %xmm1
+    vaddsd (%rbx,%rdi), %xmm1, %xmm1
+    vmovsd %xmm1, (%rax,%rdi)
+    add $8, %rdi
+    cmp %rdi, %r10
+    jne loop
+\end{lstlisting}
+\end{minipage}\\
+a read-after-write dependency from line 4 to line 2 was reported ---~while no
+such dependency actually exists.
+
+This is because, in \cesasme{}'s benchmarks, the whole program roughly looks
+like
+\begin{lstlisting}[language=C]
+/* Initialize A, B, C here */
+for(int measure=0; measure < NUM_MEASURES; ++measure) {
+    measure_start();
+    for(int repeat=0; repeat < NUM_REPEATS; ++repeat) {
+        for(int i=0; i < ARRAY_SIZE; ++i) {
+            A[i] = C * A[i] + B[i];
+        }
+    }
+    measure_stop();
+}
+\end{lstlisting}
+
+Thus, the previously reported dependency did not come from within the kernel,
+but \emph{from one outer iteration to the next} (\ie{}, iteration on
+\lstc{repeat} or \lstc{measure} in the code above). Such a dependency is not,
+in practice, relevant: under the usual assumptions of code analyzers, the
+innermost loop is long enough to be considered infinite in steady state; thus,
+two outer loop iterations are too far apart in time for this dependency to
+have any relevance, as the source iteration has long since completed when the
+destination iteration is scheduled.
+
+\medskip{}
+
+To filter out such dependencies, we introduce in \depsim{} a notion of
+\emph{dependency lifetime}. As we cannot access elapsed cycles in \valgrind{}
+without a heavy runtime slowdown, we define a \emph{timestamp} as the number of
+instructions executed since the beginning of the program's execution; we only
+update this count at each branch instruction to avoid excessive
+instrumentation slowdown.
+
+We further annotate every write to the shadow memory with the timestamp at
+which it occurred. Whenever a dependency should be added, we first check that
+the dependency has not expired ---~that is, that it is not older than a given
+threshold.
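+
+This check can be sketched as follows, where the symbols $t_w$, $t_r$ and $L$
+are notation introduced here for illustration, and the exact boundary
+convention is an assumption. If the last write to an address was annotated
+with timestamp $t_w$, and this address is read at timestamp $t_r$, the
+dependency is recorded only if \[
+    t_r - t_w \leq L
+\]
+where $L$ is the lifetime threshold. In the example above, the kernel contains
+six instructions, so the write to $A[i]$ in one outer iteration and its read
+in the next one are separated by roughly $6 \times \texttt{ARRAY\_SIZE}$
+instructions. Assuming, for instance, $\texttt{ARRAY\_SIZE} = 1\,000$, this
+amounts to about $6\,000$ instructions, which exceeds \eg{} a threshold of $L =
+1\,024$: the dependency has expired and is not reported.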
+
+We re-run the previous experiments with lifetimes of respectively 1\,024 and
+512 instructions, which roughly correspond to the order of magnitude of the
+size of a reorder buffer; results can also be found in
+\autoref{table:cov_staticdeps}. While the introduction of a 1\,024-instruction
+lifetime greatly improves the coverage rates, both unweighted and weighted,
+further reducing this lifetime to 512 does not yield significant improvements.
+
+\bigskip{}
+
+The final coverage results, with a detection rate of roughly 60\,\%, are
+reasonable: \staticdeps{} detects a significant proportion of dependencies;
+however, many are still not detected.
+
+This may be explained by the limitations studied in
+\autoref{ssec:staticdeps_limits} above, and especially the inability of
+\staticdeps{} to detect dependencies through aliasing pointers. This falls,
+more broadly, into the problem of lack of context that we expressed before and
+emphasized in \autoref{chap:CesASMe}: we expect that an analysis at the scale
+of the whole program, that would be able to integrate constraints stemming from
+outside of the loop body, would capture many more dependencies.
+
+\subsection{Enriching \uica{}'s model}
+
+To estimate the real gain in performance debugging scenarios, we integrate
+\staticdeps{} into \uica{}.
+
+There is, however, a discrepancy between the two tools: while \staticdeps{}
+works at the assembly instruction level, \uica{} works at the \uop{} level. In
+real hardware, dependencies indeed occur between \uops{}; however, we are not
+aware of the existence of a \uop{}-level semantic description of the x86-64
+ISA, which made this level of detail unsuitable for the \staticdeps{} analysis.
+
+We bridge this gap in a conservative way: whenever two instructions $i_1, i_2$
+are found to be dependent, we add a dependency between each pair of \uops{}
+$\mu_1 \in i_1, \mu_2 \in i_2$. This approximation is thus pessimistic, and
+should predict
+execution times biased towards a slower computation kernel. A finer model, or a
+finer (conservative) filtering of which \uops{} must be considered dependent
+---~\eg{} a memory dependency can only come from a memory-related \uop{}~---
+may enhance the accuracy of our integration.
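+
+In other words, as a sketch using notation introduced here for illustration
+($D$ being the set of instruction-level dependencies found by \staticdeps{},
+and $D_\mu$ the set of \uop{}-level dependencies passed to \uica{}), \[
+    D_\mu = \left\{ (\mu_1, \mu_2) \,\middle|\,
+        (i_1, i_2) \in D,\ \mu_1 \in i_1,\ \mu_2 \in i_2 \right\}
+\]
+so that a dependency between two instructions decoded into respectively $m$
+and $n$ \uops{} yields $m \times n$ \uop{}-level dependencies. For instance,
+if both instructions are decoded into two \uops{}, four dependencies are
+added, even though possibly only one of them exists in hardware.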
+
+\medskip{}
+
+We then evaluate our gains by running \cesasme{}'s harness as we did in
+\autoref{chap:CesASMe}, running both \uica{} and \uicadeps{}, on two datasets:
+first, the full set of 3\,500 binaries from the previous chapter; then, the
+pruned set of binaries introduced in \autoref{ssec:memlatbound}, which excludes
+benchmarks heavily relying on memory-carried dependencies. If \staticdeps{} is
+beneficial to \uica{}, we expect \uicadeps{} to yield significantly better
+results than \uica{} alone on the first dataset. On the second dataset,
+however, \staticdeps{} should provide no significant contribution, as the
+dataset was pruned to not exhibit significant memory-carried latency-boundness.
+We present these results in \autoref{table:staticdeps_uica_cesasme}, as well as
+the corresponding box-plots in \autoref{fig:staticdeps_uica_cesasme_boxplot}.
+
+\begin{table}
+    \centering
+    \footnotesize
+    \begin{tabular}{l l r r r r r r}
+        \toprule
+        \textbf{Dataset} & \textbf{Bencher} & \textbf{Datapoints} & \textbf{MAPE} & \textbf{Median} & \textbf{Q1} & \textbf{Q3} & \textbf{$K_\tau$}\\
+\midrule
+        \multirow{2}{*}{Full} & \uica{} & 3500 & 29.59\,\% & 18.26\,\% & 7.11\,\% & 52.99\,\% & 0.58\\
+                              & + \staticdeps{} & 3500 & 19.15\,\% & 14.44\,\% & 5.86\,\% & 23.96\,\% & 0.81\\
+\midrule
+        \multirow{2}{*}{Pruned} & \uica{} & 2388 & 18.42\,\% & 11.96\,\% & 5.42\,\% & 23.32\,\% & 0.80\\
+                                & + \staticdeps{} & 2388 & 18.77\,\% & 12.18\,\% & 5.31\,\% & 23.55\,\% & 0.80\\
+\bottomrule
+    \end{tabular}
+    \caption{Evaluation through \cesasme{} of the integration of \staticdeps{}
+    to \uica{}}\label{table:staticdeps_uica_cesasme}
+\end{table}
+
+\begin{figure}
+    \centering
+    \includegraphics[width=0.5\linewidth]{uica_cesasme_boxplot.svg}
+    \caption{Statistical distribution of relative errors of \uica{}, with and
+    without \staticdeps{} hints, with and without pruning the rows that are
+    latency-bound through memory-carried
+    dependencies}\label{fig:staticdeps_uica_cesasme_boxplot}
+\end{figure}
+
+\medskip{}
+
+The full-dataset \uicadeps{} row is extremely close, on every metric, to the
+pruned, \uica{}-only row. On this basis, we argue that the addition of
+\staticdeps{} to \uica{} is very conclusive: the hints provided by
+\staticdeps{} are sufficient to make \uica{}'s results as good on the full
+dataset as they were before on a dataset pruned of precisely the kind of
+dependencies we aim to detect. Furthermore, the results of \uica{} and
+\uicadeps{} on the pruned dataset are extremely close: this further supports
+the accuracy of \staticdeps{}.
+
+\medskip{}
+
+While the results obtained against \depsim{} in
+\autoref{ssec:staticdeps_eval_depsim} above were reasonable, they were not
+excellent either, and showed that many kinds of dependencies were still missed
+by \staticdeps{}. However, our evaluation on \cesasme{} by enriching \uica{}
+shows that, at least on the workload considered, the dependencies that actually
+matter from a performance debugging point of view are properly found.
+
+This, however, might not be true for other kinds of applications that would
+require a dependency analysis.
+
+\subsection{Analysis speed}
+
+\todo{}
--- a/manuscrit/assets/imgs/60_staticdeps/uica_cesasme_boxplot.svg
+++ b/manuscrit/assets/imgs/60_staticdeps/uica_cesasme_boxplot.svg
--- a/manuscrit/include/macros.tex
+++ b/manuscrit/include/macros.tex
@@ -52,6 +52,8 @@
 \newcommand{\valgrind}{\texttt{valgrind}}
 \newcommand{\vex}{\texttt{VEX}}

+\newcommand{\uicadeps}{\uica{}~+~\staticdeps{}}
+
 \newcommand{\gdb}{\texttt{gdb}}

 \newcommand{\coeq}{CO$_{2}$eq}
@@ -60,5 +62,7 @@

 \newcommand{\reg}[1]{\texttt{\%#1}}

+\newcommand{\cov}{\operatorname{cov}}
+
 % Hyperlinks
 \newcommand{\pymodule}[1]{\href{https://docs.python.org/3/library/#1.html}{\lstpython{#1}}}