From a9cfaef2f925e9165dc85ed76253654e7e9f96a8 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Th=C3=A9ophile=20Bastian?= Date: Fri, 29 Sep 2023 16:04:48 +0200 Subject: [PATCH] Staticdeps: evaluation part --- manuscrit/30_palmed/35_benchsuite_bb.tex | 2 +- manuscrit/60_staticdeps/40_staticdeps.tex | 57 +- manuscrit/60_staticdeps/50_eval.tex | 243 +++ .../60_staticdeps/uica_cesasme_boxplot.svg | 1772 +++++++++++++++++ manuscrit/include/macros.tex | 4 + 5 files changed, 2066 insertions(+), 12 deletions(-) create mode 100644 manuscrit/assets/imgs/60_staticdeps/uica_cesasme_boxplot.svg diff --git a/manuscrit/30_palmed/35_benchsuite_bb.tex b/manuscrit/30_palmed/35_benchsuite_bb.tex index b47c1ae..53c067f 100644 --- a/manuscrit/30_palmed/35_benchsuite_bb.tex +++ b/manuscrit/30_palmed/35_benchsuite_bb.tex @@ -1,4 +1,4 @@ -\section{Finding basic blocks to evaluate \palmed{}} +\section{Finding basic blocks to evaluate \palmed{}}\label{sec:benchsuite_bb} In the context of all that is described above, my main task in the environment of \palmed{} was to build a system able to evaluate a produced mapping on a diff --git a/manuscrit/60_staticdeps/40_staticdeps.tex b/manuscrit/60_staticdeps/40_staticdeps.tex index 2ccef69..389f1a3 100644 --- a/manuscrit/60_staticdeps/40_staticdeps.tex +++ b/manuscrit/60_staticdeps/40_staticdeps.tex @@ -1,4 +1,4 @@ -\section{The \staticdeps{} heuristic} +\section{Staticdeps} The static analyzer we present, \staticdeps{}, only aims to tackle the difficulty~\ref{memcarried_difficulty_arith} mentioned above: tracking @@ -12,7 +12,8 @@ This problem could be solved using symbolic calculus algorithms. However, those algorithms are not straightforward to implement, and the equality test between two arbitrary expressions can be costly. -\medskip{} +\subsection{The \staticdeps{} heuristic} + Instead, we use an heuristic based on random values. 
We consider the set $\calR = \left\{0, 1, \ldots, 2^{64}-1\right\}$ of values representable by a 64-bits unsigned integer; we extend this set to $\bar\calR = \calR \cup \{\bot\}$, @@ -45,20 +46,54 @@ register-carried dependencies, applying the following principles. known). \end{itemize} -The semantics needed to compute encountered operations are obtained by lifting -the kernel's assembly to \valgrind{}'s \vex{} intermediary representation. +\subsection{Practical implementation} + +We implement \staticdeps{} in Python, using \texttt{pyelftools} and the +\texttt{capstone} disassembler ---~which we already introduced in +\autoref{sec:benchsuite_bb}~--- to extract and disassemble the targeted basic +block. The semantics needed to compute encountered operations are obtained by +lifting the kernel's assembly to \valgrind{}'s \vex{} intermediary +representation. \medskip{} -This first analysis provides us with a raw list of dependencies across -iterations of the considered basic block. We then ``re-roll'' the unrolled -kernel by transcribing each dependency to a triplet $(\texttt{source\_insn}, -\texttt{dest\_insn}, \Delta{}k)$, where the first two elements are the source -and destination instruction of the dependency \emph{in the original, -non-unrolled kernel}, and $\Delta{}k$ is the number of iterations of the kernel -between the source and destination instruction of the dependency. +The implementation of the heuristic detailed above provides us with a raw list +of dependencies across iterations of the considered basic block. We then +``re-roll'' the unrolled kernel by transcribing each dependency to a triplet +$(\texttt{source\_insn}, \texttt{dest\_insn}, \Delta{}k)$, where the first two +elements are the source and destination instruction of the dependency \emph{in +the original, non-unrolled kernel}, and $\Delta{}k$ is the number of iterations +of the kernel between the source and destination instruction of the dependency. 
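The re-rolling step described above can be sketched as follows. This is only a minimal illustration, not the actual \staticdeps{} code: the function name \texttt{reroll} is hypothetical, and we assume that each raw dependency is given as a pair of instruction indices in the unrolled kernel, the unrolled kernel being plain back-to-back copies of the original one.

```python
from collections import defaultdict


def reroll(raw_deps, kernel_len):
    """Transcribe raw dependencies to (source_insn, dest_insn, delta_k) triplets.

    `raw_deps` is an iterable of (src, dst) instruction indices in the
    unrolled kernel (assumed layout: iteration k occupies indices
    [k * kernel_len, (k + 1) * kernel_len)).

    Returns a dict mapping each triplet to the set of source iterations at
    which it was observed.
    """
    rerolled = defaultdict(set)
    for src, dst in raw_deps:
        # Split an unrolled index into (iteration, index in original kernel).
        src_iter, src_insn = divmod(src, kernel_len)
        dst_iter, dst_insn = divmod(dst, kernel_len)
        delta_k = dst_iter - src_iter
        rerolled[(src_insn, dst_insn, delta_k)].add(src_iter)
    return dict(rerolled)
```

Keeping the set of source iterations for each triplet, rather than a bare list of triplets, is a design choice of this sketch: it records how often each re-rolled dependency was actually observed.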
Finally, we filter out spurious dependencies: each dependency found should occur for each kernel iteration $i$ at which $i + \Delta{}k$ is within bounds. If the dependency is found for less than $80\,\%$ of those iterations, the dependency is declared spurious and is dropped. + +\subsection{Limitations}\label{ssec:staticdeps_limits} + +In \autoref{chap:CesASMe}, we argued that one of the shortcomings that most +crippled state-of-the-art tools was that analyses were conducted +out-of-context, considering only the basic block at hand. This criticism also +applies to \staticdeps{}, as it is still focused on a single basic block in +isolation; in particular, any aliasing that stems from outside of the analyzed +basic block is not visible to \staticdeps{}. + +Work towards a broader analysis range, \eg{} at the scale of a function, or at +least initializing values with gathered assertions ---~maybe based on abstract +interpretation techniques~--- could be beneficial to the quality of +dependency detection. + +\medskip{} + +As \staticdeps{}'s heuristic is based on randomness in a Monte Carlo sense, it +may yield false positives: two registers could theoretically be assigned the +same value sampled at random, making them aliasing addresses. This is, however, +very improbable, as values are sampled from a set of cardinality $2^{64}$. If +necessary, the error can be reduced by amplification: running the algorithm +multiple times with different random seeds reduces the error exponentially. + +Conversely, \staticdeps{} should not present false negatives due to randomness. +Dependencies may go undetected, \eg{} because of out-of-scope aliasing or +unsupported operations. However, no dependency that falls into the scope of +\depsim{}'s analysis should be missed because of random initialisations. 
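To give an order of magnitude for the false positive rate discussed above, here is a rough estimate under assumptions that are ours, not the implementation's: the $n$ values drawn during one run are independent and uniform over $\calR$, and a false positive can only stem from two of these values being equal (coincidences between values \emph{derived} from them by arithmetic are ignored). A union bound over all pairs then gives
\[
    p = \Pr[\text{two sampled values collide}]
    \leq \binom{n}{2} \cdot 2^{-64}
    < \frac{n^2}{2^{65}}
\]
so that, for instance, $n = 2^{10}$ sampled values yield $p < 2^{-45}$. One possible amplification scheme, again only an assumption, keeps a dependency only if it is reported by each of $k$ independent runs: a given spurious dependency then survives with probability at most $p^k < 2^{-45k}$, which is the exponential decrease mentioned above.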
diff --git a/manuscrit/60_staticdeps/50_eval.tex b/manuscrit/60_staticdeps/50_eval.tex index cd3fc7f..ae62939 100644 --- a/manuscrit/60_staticdeps/50_eval.tex +++ b/manuscrit/60_staticdeps/50_eval.tex @@ -1 +1,244 @@ \section{Evaluation} + +We evaluate the relevance of \staticdeps{} results in two ways: first, we +compare the detected dependencies to those extracted at runtime by \depsim{}, +to evaluate the proportion of dependencies actually detected. Then, we evaluate +the relevance of our static analysis from a performance debugging point of +view, by enriching \uica{}'s model with \staticdeps{} and assessing, using +\cesasme{}, the benefits brought to the model. + +We finally evaluate our claim that using a static model instead of a dynamic +analysis, such as \gus{}, makes \staticdeps{} yield a result in a reasonable +amount of time. + +\subsection{Comparison to \depsim{} results}\label{ssec:staticdeps_eval_depsim} + +The contribution of \staticdeps{}'s model largely resides in its ability to +track memory-carried dependencies, including loop-carried ones. We thus focus on +evaluating this aspect, and restrict both \depsim{} and \staticdeps{} to +memory-carried dependencies. + +We use the binaries produced by \cesasme{} as a dataset, as we already assessed +its relevance and it contains enough benchmarks to be statistically meaningful. +We also already have tooling and basic-block segmentation available for those +benchmarks, making the analysis more convenient. + +\medskip{} + +For each binary previously generated by \cesasme{}, we use its cached basic +block splitting and occurrence count. Within each binary, we discard any basic +block with fewer than 10\,\% of the occurrence count of the most-hit basic +block; this avoids considering basic blocks which were not originally inside +loops, and for which loop-carried dependencies would make no sense ---~and +could possibly create false positives. 
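The basic block selection rule above can be sketched as follows. This is an illustration only: the function name \texttt{select\_hot\_blocks} and the representation of occurrence counts as a dictionary keyed by basic block address are assumptions, not \cesasme{}'s actual cache format.

```python
def select_hot_blocks(occurrences, min_percent=10):
    """Keep basic blocks hit at least `min_percent` % as often as the hottest.

    `occurrences` maps a basic block identifier (e.g. its start address) to
    its occurrence count within one binary.
    """
    if not occurrences:
        return {}
    hottest = max(occurrences.values())
    # Integer arithmetic avoids floating-point rounding at the threshold.
    return {
        block: count
        for block, count in occurrences.items()
        if 100 * count >= min_percent * hottest
    }
```

A block whose count is exactly 10\,\% of the maximum is kept here, since the text only discards blocks with \emph{fewer} occurrences than that threshold.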
+ +For each of the considered binaries, we run our dynamic analysis, \depsim{}, +and record its results. + +For each of the considered basic blocks, we run our static analysis, +\staticdeps{}. We translate the detected dependencies back to original ELF +addresses, and discard the $\Delta{}k$ parameter, as our dynamic analysis does +not report an equivalent parameter, but only a pair of program counters. Each +of the dependencies reported by \depsim{} whose source and destination +addresses belong to the basic block considered is then classified as either +detected or missed by \staticdeps{}. Dynamically detected dependencies spanning +across basic blocks are discarded, as \staticdeps{} cannot detect them by +construction. + +\medskip{} + +We consider two metrics: the unweighted dependencies coverage, \[ + \cov_u = \dfrac{\card{\text{found}}}{\card{\text{found}} + \card{\text{missed}}} +\] + +as well as the weighted dependencies coverage, \[ + \cov_w = + \dfrac{ + \sum_{d \in \text{found}} \rho_d + }{\sum_{d \in \text{found}\,\cup\,\text{missed}} \rho_d} +\] +where $\rho_d$ is the number of occurrences of the dependency $d$, dynamically +detected by \depsim{}. + +\begin{table} + \centering + \begin{tabular}{r r r} + \toprule + \textbf{Lifetime} & $\cov_u$ (\%) & $\cov_w$ (\%) \\ + \midrule + $\infty$ & 38.1\,\% & 44.0\,\% \\ + 1024 & 57.6\,\% & 58.2\,\% \\ + 512 & 56.4\,\% & 63.5\,\% \\ + \bottomrule + \end{tabular} + \caption{Unweighted and weighted coverage of \staticdeps{} on \cesasme{}'s + binaries}\label{table:cov_staticdeps} +\end{table} + + +These metrics are presented for the 3\,500 binaries of \cesasme{} in the first +data row of \autoref{table:cov_staticdeps}. The obtained coverage, of about +40\,\%, is lower than expected. + +\bigskip{} + +Manual investigation of the missed dependencies revealed some surprising +dependencies dynamically detected by \depsim{}, that did not appear to actually +be read-after-write dependencies. 
In the following (simplified) example, +roughly implementing $A[i] = C\times{}A[i] + B[i]$, +\begin{minipage}{0.95\linewidth} +\begin{lstlisting}[language={[x86masm]Assembler}] +loop: + vmulsd (%rax,%rdi), %xmm0, %xmm1 + vaddsd (%rbx,%rdi), %xmm1, %xmm1 + vmovsd %xmm1, (%rax,%rdi) + add $8, %rdi + cmp %rdi, %r10 + jne loop +\end{lstlisting} +\end{minipage}\\ +a read-after-write dependency from line 4 to line 2 was reported ---~while no +such dependency actually exists. + +The reason is that, in \cesasme{}'s benchmarks, the whole program +would roughly look like +\begin{lstlisting}[language=C] +/* Initialize A, B, C here */ +for(int measure=0; measure < NUM_MEASURES; ++measure) { + measure_start(); + for(int repeat=0; repeat < NUM_REPEATS; ++repeat) { + for(int i=0; i < ARRAY_SIZE; ++i) { + A[i] = C * A[i] + B[i]; + } + } + measure_stop(); +} +\end{lstlisting} + +Thus, the previously reported dependency did not come from within the kernel, +but \emph{from one outer iteration to the next} (\ie{}, iteration on +\lstc{repeat} or \lstc{measure} in the code above). Such a dependency is not, +in practice, relevant: under the common assumptions of code analyzers, the +innermost loop is long enough to be considered infinite in steady state; thus, +two outer loop iterations are too far separated in time for this dependency to +have any relevance, as the source iteration has long finished executing when the +destination iteration is scheduled. + +\medskip{} + +To discard such dependencies, we introduce in \depsim{} a notion of +\emph{dependency lifetime}. As we cannot access elapsed cycles in \valgrind{} +without a heavy runtime slowdown, we define a \emph{timestamp} as the number of +instructions executed since the beginning of the program's execution; we +increment this count at each branch instruction to avoid excessive +instrumentation slowdown. + +We further annotate every write to the shadow memory with the timestamp at +which it occurred. 
Whenever a dependency should be added, we first check that +the dependency has not expired ---~that is, that it is not older than a given +threshold. + +We re-run the previous experiments with lifetimes of 1\,024 and 512 +instructions respectively, which roughly corresponds to the order of magnitude +of the size of a reorder buffer; results can also be found in +\autoref{table:cov_staticdeps}. While the introduction of a 1\,024-instruction +lifetime greatly improves the coverage rates, both unweighted and weighted, +further reducing this lifetime to 512 does not yield significant enhancements. + +\bigskip{} + +The final coverage results, with a detection rate of roughly 60\,\%, are +reasonable: a significant proportion of dependencies is detected; however, many +are still missed. + +This may be explained by the limitations studied in +\autoref{ssec:staticdeps_limits} above, and especially the inability of +\staticdeps{} to detect dependencies through aliasing pointers. This falls, +more broadly, into the problem of lack of context that we expressed before and +emphasized in \autoref{chap:CesASMe}: we expect that an analysis at the scale +of the whole program, that would be able to integrate constraints stemming from +outside of the loop body, would capture many more dependencies. + +\subsection{Enriching \uica{}'s model} + +To estimate the real gain in performance debugging scenarios, we +integrate \staticdeps{} into \uica{}. + +There is, however, a discrepancy between the two tools: while \staticdeps{} +works at the assembly instruction level, \uica{} works at the \uop{} level. In +real hardware, dependencies indeed occur between \uops{}; however, we are not +aware of the existence of a \uop{}-level semantic description of the x86-64 +ISA, which made this level of detail unsuitable for the \staticdeps{} analysis. 
+ +We bridge this gap in a conservative way: whenever two instructions $i_1, i_2$ +are found to be dependent, we add a dependency between each pair $\mu_1 \in +i_1, \mu_2 \in i_2$. This approximation is thus pessimistic, and should predict +execution times biased towards a slower computation kernel. A finer model, or a +finer (conservative) filtering of which \uops{} must be considered dependent +---~\eg{} a memory dependency can only come from a memory-related \uop{}~--- +may enhance the accuracy of our integration. + +\medskip{} + +We then evaluate our gains by running \cesasme{}'s harness as we did in +\autoref{chap:CesASMe}, running both \uica{} and \uicadeps{}, on two datasets: +first, the full set of 3\,500 binaries from the previous chapter; then, the +pruned set of binaries introduced in \autoref{ssec:memlatbound}, which excludes +benchmarks heavily relying on memory-carried dependencies. If \staticdeps{} is +beneficial to \uica{}, we expect \uicadeps{} to yield significantly better +results than \uica{} alone on the first dataset. On the second dataset, +however, \staticdeps{} should provide no significant contribution, as the +dataset was pruned to not exhibit significant memory-carried latency-boundness. +We present these results in \autoref{table:staticdeps_uica_cesasme}, as well as +the corresponding box-plots in \autoref{fig:staticdeps_uica_cesasme_boxplot}. 
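The conservative bridging between instruction-level and \uop{}-level dependencies described at the start of this subsection amounts to a Cartesian product. It can be sketched as follows; the function name \texttt{uop\_dependencies} and the dictionary \texttt{uops\_of} are hypothetical and do not reflect \uica{}'s internal data structures.

```python
from itertools import product


def uop_dependencies(insn_deps, uops_of):
    """Expand instruction-level dependencies into uop-level ones.

    `insn_deps` is an iterable of (source_insn, dest_insn) pairs;
    `uops_of` maps each instruction to the list of its uops.
    Every uop of the source is conservatively linked to every uop of the
    destination.
    """
    edges = set()
    for src_insn, dst_insn in insn_deps:
        edges.update(product(uops_of[src_insn], uops_of[dst_insn]))
    return edges
```

For a store decoded into two \uops{} and a load decoded into one, this yields two edges, although only one of them may matter in hardware: this is the pessimism mentioned above.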
+ +\begin{table} + \centering + \footnotesize + \begin{tabular}{l l r r r r r r} + \toprule + \textbf{Dataset} & \textbf{Bencher} & \textbf{Datapoints} & \textbf{MAPE} & \textbf{Median} & \textbf{Q1} & \textbf{Q3} & \textbf{$K_\tau$}\\ +\midrule + \multirow{2}{*}{Full} & \uica{} & 3500 & 29.59\,\% & 18.26\,\% & 7.11\,\% & 52.99\,\% & 0.58\\ + & + \staticdeps{} & 3500 & 19.15\,\% & 14.44\,\% & 5.86\,\% & 23.96\,\% & 0.81\\ +\midrule + \multirow{2}{*}{Pruned} & \uica{} & 2388 & 18.42\,\% & 11.96\,\% & 5.42\,\% & 23.32\,\% & 0.80\\ + & + \staticdeps{} & 2388 & 18.77\,\% & 12.18\,\% & 5.31\,\% & 23.55\,\% & 0.80\\ +\bottomrule + \end{tabular} + \caption{Evaluation through \cesasme{} of the integration of \staticdeps{} + to \uica{}}\label{table:staticdeps_uica_cesasme} +\end{table} + +\begin{figure} + \centering + \includegraphics[width=0.5\linewidth]{uica_cesasme_boxplot.svg} + \caption{Statistical distribution of relative errors of \uica{}, with and + without \staticdeps{} hints, with and without pruning the rows that are +latency-bound through memory-carried +dependencies}\label{fig:staticdeps_uica_cesasme_boxplot} +\end{figure} + +\medskip{} + +The full dataset \uicadeps{} row is extremely close, on every metric, to the +pruned, \uica{}-only row. On this basis, we argue that the addition of +\staticdeps{} to \uica{} is very conclusive: the hints provided by \staticdeps{} +are sufficient to make \uica{}'s results as good on the full dataset as they +were before on a dataset pruned of precisely the kind of dependencies we aim to +detect. Furthermore, the results of \uica{} and \uicadeps{} on the pruned +dataset are extremely close: this further supports the accuracy of +\staticdeps{}. + +\medskip{} + +While the results obtained against \depsim{} in +\autoref{ssec:staticdeps_eval_depsim} above were reasonable, they were not +excellent either, and showed that many kinds of dependencies were still missed +by \staticdeps{}. 
However, our evaluation with \cesasme{}, which enriches \uica{} with +\staticdeps{}, shows that, at least on the workload considered, the +dependencies that actually matter from a performance debugging point of view +are properly found. + +This, however, might not be true for other kinds of applications that would +require dependency analysis. + +\subsection{Analysis speed} + +\todo{}
diff --git a/manuscrit/assets/imgs/60_staticdeps/uica_cesasme_boxplot.svg b/manuscrit/assets/imgs/60_staticdeps/uica_cesasme_boxplot.svg new file mode 100644 index 0000000..faf2088 --- /dev/null +++ b/manuscrit/assets/imgs/60_staticdeps/uica_cesasme_boxplot.svg @@ -0,0 +1,1772 @@
+[SVG box-plot content omitted: image/svg+xml, generated 2023-09-29T15:53:45.676503 by Matplotlib v3.7.1, https://matplotlib.org]
diff --git a/manuscrit/include/macros.tex b/manuscrit/include/macros.tex index 7aa4105..ca6ad8a 100644 --- a/manuscrit/include/macros.tex +++ b/manuscrit/include/macros.tex @@ -52,6 +52,8 @@ \newcommand{\valgrind}{\texttt{valgrind}} \newcommand{\vex}{\texttt{VEX}} +\newcommand{\uicadeps}{\uica{}~+~\staticdeps{}} + \newcommand{\gdb}{\texttt{gdb}} \newcommand{\coeq}{CO$_{2}$eq} @@ -60,5 +62,7 @@ \newcommand{\reg}[1]{\texttt{\%#1}} +\newcommand{\cov}{\operatorname{cov}} + % Hyperlinks \newcommand{\pymodule}[1]{\href{https://docs.python.org/3/library/#1.html}{\lstpython{#1}}}