Staticdeps: evaluate with recompiled CesASMe

Théophile Bastian 2024-06-19 11:54:38 +02:00
parent bf6fd7c3f5
commit 726763f895
4 changed files with 212 additions and 116 deletions


@ -78,3 +78,19 @@ location. The shadow memory is instead implemented as a hash table.
At the end of the run, all the dependencies retrieved are reported. Care is
taken to translate back the runtime program counters to addresses in the
original ELF files, using the running process' memory map.
\medskip{}
Dependencies are mostly relevant if their source and destination are close
enough to be computationally meaningful. To this end, we also introduce in
\depsim{} a notion of \emph{dependency lifetime}. As elapsed cycles cannot be
accessed in \valgrind{} without a heavy runtime slowdown, we define a
\emph{timestamp} as the number of instructions executed since the beginning of
the program's execution; to avoid excessive instrumentation slowdown, we only
update this count at each branch instruction.
We further annotate every write to the shadow memory with the timestamp at
which it occurred. Whenever a dependency should be added, we first check that
the dependency has not expired ---~that is, that it is not older than a given
threshold. This threshold is tunable for each run --~and may be set to infinity
to keep every dependency.
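The mechanism can be sketched as follows ---~a minimal Python model of the
shadow memory with timestamps; the names and data layout are ours, for
illustration only, and not \depsim{}'s actual implementation:
\begin{lstlisting}[language=Python]
# Illustrative model of the timestamped shadow memory (invented
# names; the actual tool is a Valgrind plugin).
LIFETIME = 512       # tunable per run; None stands for infinity

timestamp = 0        # only updated at branch instructions
shadow = {}          # address -> (writer PC, write timestamp)
dependencies = {}    # (writer PC, reader PC) -> occurrence count

def on_branch():
    """Cheap proxy for elapsed time: count branch instructions."""
    global timestamp
    timestamp += 1

def on_write(addr, pc):
    """Annotate the shadow write with the current timestamp."""
    shadow[addr] = (pc, timestamp)

def on_read(addr, pc):
    """Report a read-after-write dependency unless it expired."""
    if addr not in shadow:
        return
    writer_pc, written_at = shadow[addr]
    if LIFETIME is not None and timestamp - written_at > LIFETIME:
        return  # dependency older than the threshold: expired
    key = (writer_pc, pc)
    dependencies[key] = dependencies.get(key, 0) + 1
\end{lstlisting}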


@ -46,7 +46,7 @@ register-carried dependencies, applying the following principles.
known).
\end{itemize}
\subsection{Practical implementation}\label{ssec:staticdeps:practical_implem}
We implement \staticdeps{} in Python, using \texttt{pyelftools} and the
\texttt{capstone} disassembler ---~which we already introduced in


@ -23,69 +23,26 @@ its relevance and contains enough benchmarks to be statistically meaningful. We
also already have tooling and basic-block segmentation available for those
benchmarks, making the analysis more convenient.
\medskip{}
\subsubsection{Recompiling \cesasme{}'s dataset}
In practice, benchmarks from \cesasme{} are roughly of the following form:
\begin{lstlisting}[language=C]
/* Initialize A, B, C here */
for(int measure=0; measure < NUM_MEASURES; ++measure) {
    measure_start();
    for(int repeat=0; repeat < NUM_REPEATS; ++repeat) {
        for(int i=0; i < BENCHMARK_SIZE; ++i) {
            /* Some kernel, independent of measure, repeat */
        }
    }
    measure_stop();
}
\end{lstlisting}
While this is sensible for conducting throughput measurements, it also
introduces unwanted dependencies. If, for instance, the kernel computes
$A[i] = C\times{}A[i] + B[i]$, implemented by\\
\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
loop:
@ -97,66 +54,169 @@ loop:
jne loop
\end{lstlisting}
\end{minipage}\\
a read-after-write dependency from line 4 to line 2 is reported by \depsim{}
---~although there is no such dependency inherent to the kernel.
However, each iteration of the
\texttt{measure} (outer) loop and each iteration of the \texttt{repeat} (inner)
loop will read again each \texttt{A[i]} (\ie{} \lstxasm{(\%rax,\%rdi)} in the
assembly) value from the previous inner loop, and
write it back. This creates a dependency on the previous iteration of the inner
loop, which should in practice be meaningless if \lstc{BENCHMARK_SIZE} is large
enough. Such dependencies, however, pollute the evaluation results: as \depsim{}
does not report a dependency's distance, they are considered meaningful; and as
they cannot be detected by \staticdeps{} --~which is unaware of the outer and
inner loop~--, they introduce unfairness in the evaluation. The actual loss of
precision introduced by not discovering such dependencies is instead assessed
later by enriching \uica{} with \staticdeps{}.
\medskip{}
To avoid detecting these dependencies with \depsim{}, we \textbf{recompile
\cesasme{}'s benchmarks} from their C source code with
\lstc{NUM_MEASURES = NUM_REPEATS = 1}. We use these recompiled benchmarks only
in the current section. While we do not re-run code transformations from the
original Polybenchs, we do recompile the benchmarks from C source. Thus, the
results from this section \emph{are not comparable} with results from other
sections, as the compiler may have used different optimisations, instructions,
etc.
\subsubsection{Dependency coverage}
For each binary generated by \cesasme{}, we use its cached basic
block splitting and occurrence count. Within each binary, we discard any basic
block with fewer than 10\,\% of the occurrence count of the most-hit basic
block; this avoids considering basic blocks which were not originally inside
loops, and for which loop-carried dependencies would make no sense ---~and
could possibly create false positives.
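This filtering step can be sketched as follows ---~a short Python helper of our
own; the function name and data layout are hypothetical, not \cesasme{}'s
actual code:
\begin{lstlisting}[language=Python]
# Keep only basic blocks hit at least 10% as often as the hottest
# block of the binary (illustrative helper).
def filter_hot_blocks(occurrences, threshold=0.10):
    """occurrences maps a basic block address to its hit count."""
    if not occurrences:
        return {}
    hottest = max(occurrences.values())
    return {addr: count
            for addr, count in occurrences.items()
            if count >= threshold * hottest}
\end{lstlisting}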
\bigskip{}
For each of the considered binaries, we run our dynamic analysis, \depsim{},
and record its results. We use a lifetime of 512 instructions for this
analysis, as this is roughly the size of recent Intel reorder
buffers~\cite{wikichip_intel_rob_size}; as discussed in
\autoref{ssec:staticdeps_detection}, dependencies spanning farther than the
size of the ROB are not microarchitecturally relevant. Dependencies whose
source and destination program counters are not in the same basic block are
discarded, as \staticdeps{} cannot detect them by construction.
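The same-block restriction amounts to a simple filter; sketched in Python, with
an assumed representation of each dynamic dependency as a pair of source and
destination program counters:
\begin{lstlisting}[language=Python]
# Discard dynamic dependencies spanning across basic blocks
# (illustrative sketch; the dependency representation is assumed).
def within_block(deps, block_start, block_end):
    """Keep (src, dst) PC pairs lying in [block_start, block_end)."""
    return [(src, dst) for (src, dst) in deps
            if block_start <= src < block_end
            and block_start <= dst < block_end]
\end{lstlisting}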
For each of the considered basic blocks, we run our static analysis,
\staticdeps{}. We discard the $\Delta{}k$ parameter, as our dynamic analysis does
not report an equivalent parameter, but only a pair of program counters.
Dynamic dependencies from \depsim{} are converted to
\emph{periodic dependencies} in the sense of \staticdeps{} as described in
\autoref{ssec:staticdeps:practical_implem}: only dependencies occurring on at
least 80\,\% of the block's iterations are kept; otherwise, they are
considered measurement artifacts. The \emph{periodic coverage}
of \staticdeps{} dependencies for this basic block \wrt{} \depsim{} is the
proportion of dependencies found by \staticdeps{} among the periodic
dependencies extracted from \depsim{}:
\[
\cov_p = \dfrac{\card{\text{found}}}{\card{\text{found}} + \card{\text{missed}}}
\]
\smallskip{}
We also keep the raw dependencies from \depsim{} --~that is, without converting
them to periodic dependencies. From these, we consider two metrics:
the unweighted dependencies coverage, \[
\cov_u = \dfrac{\card{\text{found}}}{\card{\text{found}} + \card{\text{missed}}}
\]
identical to $\cov_p$ but based on unfiltered dependencies,
as well as the weighted dependencies coverage, \[
\cov_w =
\dfrac{
\sum_{d \in \text{found}} \rho_d
}{\sum_{d \in \text{found}\,\cup\,\text{missed}} \rho_d}
\]
where $\rho_d$ is the number of occurrences of the dependency $d$, dynamically
detected by \depsim. Note that such a metric is not meaningful for periodic
dependencies as, by construction, each dependency occurs as many times
as the loop iterates.
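The three metrics can be sketched as follows ---~in Python, under an assumed
data layout where each dynamic dependency is reduced to its occurrence count
$\rho_d$ and a flag telling whether \staticdeps{} found it:
\begin{lstlisting}[language=Python]
# Coverage metrics over depsim's reported dependencies
# (assumed layout: each dependency is a (rho, found) pair).
def cov_unweighted(deps):
    """Unweighted coverage: fraction of dependencies found."""
    return sum(1 for (_, found) in deps if found) / len(deps)

def cov_weighted(deps):
    """Weighted coverage: occurrence counts rho_d as weights."""
    return (sum(rho for (rho, found) in deps if found)
            / sum(rho for (rho, _) in deps))

def cov_periodic(deps, iterations, ratio=0.80):
    """Periodic coverage: unweighted coverage restricted to
    dependencies occurring on at least 80% of iterations."""
    return cov_unweighted(
        [d for d in deps if d[0] >= ratio * iterations])
\end{lstlisting}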
\begin{table}
\centering
\begin{tabular}{r r r}
\toprule
$\cov_p$ (\%) & $\cov_u$ (\%) & $\cov_w$ (\%) \\
\midrule
96.0 & 94.4 & 98.3 \\
\bottomrule
\end{tabular}
\caption{Periodic, unweighted and weighted coverage of \staticdeps{} on
\cesasme{}'s binaries recompiled without repetitions, with a lifetime of
512 instructions.}\label{table:cov_staticdeps}
\end{table}
These metrics are presented for the 3\,500 binaries of \cesasme{} in
\autoref{table:cov_staticdeps}. The obtained coverage is consistent between the
three metrics used ($\cov_p$, $\cov_u$, $\cov_w$), and the reported coverage is
very close to 100\,\%, giving us good confidence in the accuracy of
\staticdeps.
\subsubsection{``Points-to'' aliasing analysis}
The same methodology can be re-used as a proxy to estimate how often
supposedly-independent pointers actually alias in our dataset. Indeed, a major
approximation made by \staticdeps{} is to assume that any newly encountered
pointer --~function parameters, values read from memory, \ldots~-- does
\emph{not} alias with previously encountered values. This is implemented by
using a fresh random value for each yet-unknown value.
Determining which pointers may point to which other pointers --~and, by
extension, may point to the same memory region~-- is called a \emph{points-to
analysis}~\cite{points_to}. In the context of \staticdeps{}, it characterizes
the pointers for which taking a fresh value was \emph{not} representative of
the reality.
If we detect, through dynamic analysis, that a value derived from a
pointer \lstc{a} shares a value with one derived from a pointer \lstc{b}
--~say, \lstc{a + k == b + l}~--, we can deduce that \lstc{a} \emph{points-to}
\lstc{b}. This is true even if \lstc{a + k} at the very beginning of the
execution is equal to \lstc{b + l} at the very end of the execution: although
the pointers will not alias (that is, share the same value at the same moment),
they still point to the same memory region and should not be treated as
independent.
Our dynamic analyzer, \depsim{}, does not have this granularity, as it only
reports dependencies between two program counters. A dependency from a PC
$p$ to a PC $q$ however implies that a value written to memory at $q$ was read
from memory at $p$, and thus that one of the pointers used at $p$ aliases with
one of the pointers used at $q$.
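The underlying inference can be illustrated with a toy Python sketch ---~our
own illustration, not \depsim{}'s code: whenever values derived from two
distinct base pointers coincide, the bases must point to the same region.
\begin{lstlisting}[language=Python]
# Toy points-to inference: two base pointers whose derived values
# ever coincide point to the same memory region.
from collections import defaultdict

derived_from = defaultdict(set)  # runtime value -> base pointers
points_to = set()                # pairs of related base pointers

def observe(base, value):
    """Record that `value` was derived from pointer `base`."""
    for other in derived_from[value]:
        if other != base:
            points_to.add(frozenset((base, other)))
    derived_from[value].add(base)
\end{lstlisting}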
\medskip{}
We thus conduct the same analysis as before, but with an infinite lifetime to
account for far-ranging dependencies. We then use $\cov_u$ and $\cov_w$ as a
proxy to measure whether assuming the pointers independent was reasonable: a
bad coverage would be a clear indication of non-independent pointers treated as
independent. A good coverage is not, formally, an indication of the absence of
non-independent pointers: the detected static dependencies may come from other
pointers at the same PC. We however believe it reasonable to consider it a good
proxy for this metric, as a single assembly line often reads a single value,
and usually at most two. We do not use the $\cov_p$ metric here, as we want to
keep every detected dependency to detect possible aliasing.
\begin{table}
\centering
\begin{tabular}{r r}
\toprule
$\cov_u$ (\%) & $\cov_w$ (\%) \\
\midrule
95.0 & 93.7 \\
\bottomrule
\end{tabular}
\caption{Unweighted and weighted coverage of \staticdeps{} on \cesasme{}'s
binaries recompiled without repetitions, with an infinite
lifetime, as a proxy for points-to analysis.}\label{table:cov_staticdeps_pointsto}
\end{table}
The results of this analysis are presented in
\autoref{table:cov_staticdeps_pointsto}. The very high coverage rate gives us
good confidence that our hypothesis of independent pointers is reasonable, at
least within the scope of Polybench, which we believe representative of
scientific computation --~one of the prominent use-cases of tools such as code
analyzers.
\subsection{Enriching \uica{}'s model}
@ -166,8 +226,10 @@ integrate \staticdeps{} into \uica{}.
There is, however, a discrepancy between the two tools: while \staticdeps{}
works at the assembly instruction level, \uica{} works at the \uop{} level. In
real hardware, dependencies indeed occur between \uops{}; however, we are not
aware of the existence of a \uop{}-level semantic description of the x86-64 ISA
(which, by essence, would be declined for each specific processor, as the ISA
itself is not concerned with \uops{}). This level of detail was thus unsuitable
for the \staticdeps{} analysis.
We bridge this gap in a conservative way: whenever two instructions $i_1, i_2$
are found to be dependent, we add a dependency between each pair $\mu_1 \in


@ -257,8 +257,8 @@
@inproceedings{talla2001hwloops,
author={Talla, D. and John, L.K.},
booktitle={Proceedings 2001 IEEE International Conference on Computer Design: VLSI in Computers and Processors. ICCD 2001},
title={Cost-effective hardware acceleration of multimedia applications},
year={2001},
volume={},
number={},
@ -266,3 +266,21 @@
keywords={Hardware;Acceleration;Computer aided instruction;Streaming media;Parallel processing;Throughput;Concurrent computing;Application software;Microprocessors;Feeds},
doi={10.1109/ICCD.2001.955060}}
@article{points_to,
author = {Emami, Maryam and Ghiya, Rakesh and Hendren, Laurie J.},
title = {Context-sensitive interprocedural points-to analysis in the presence of function pointers},
year = {1994},
issue_date = {June 1994},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {29},
number = {6},
issn = {0362-1340},
url = {https://doi.org/10.1145/773473.178264},
doi = {10.1145/773473.178264},
abstract = {This paper reports on the design, implementation, and empirical results of a new method for dealing with the aliasing problem in C. The method is based on approximating the points-to relationships between accessible stack locations, and can be used to generate alias pairs, or used directly for other analyses and transformations. Our method provides context-sensitive interprocedural information based on analysis over invocation graphs that capture all calling contexts including recursive and mutually-recursive calling contexts. Furthermore, the method allows the smooth integration for handling general function pointers in C. We illustrate the effectiveness of the method with empirical results from an implementation in the McCAT optimizing/parallelizing C compiler.},
journal = {SIGPLAN Not.},
month = {jun},
pages = {242--256},
numpages = {15}
}