Staticdeps: evaluate with recompiled CesASMe

Théophile Bastian 2024-06-19 11:54:38 +02:00
parent bf6fd7c3f5
commit 726763f895
4 changed files with 212 additions and 116 deletions


@ -78,3 +78,19 @@ location. The shadow memory is instead implemented as a hash table.
At the end of the run, all the dependencies retrieved are reported. Care is
taken to translate back the runtime program counters to addresses in the
original ELF files, using the running process' memory map.
\medskip{}
Dependencies are mostly relevant if their source and destination are close
enough to be computationally meaningful. To this end, we also introduce in
\depsim{} a notion of \emph{dependency lifetime}. As elapsed cycles cannot be
accessed in \valgrind{} without a heavy runtime slowdown, we define a
\emph{timestamp} as the number of instructions executed since the beginning of
the program's execution; to avoid excessive instrumentation slowdown, we only
update this count at each branch instruction.
We further annotate every write to the shadow memory with the timestamp at
which it occurred. Whenever a dependency should be added, we first check that
the dependency has not expired ---~that is, that it is not older than a given
threshold. This threshold is tunable for each run --~and may be set to infinity
to keep every dependency.
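The mechanism can be sketched as follows ---~a minimal Python model of the
shadow memory with timestamps; the names and data layout are ours, for
illustration only, and not \depsim{}'s actual implementation:
\begin{lstlisting}[language=Python]
# Illustrative model of the timestamped shadow memory (invented
# names; the actual tool is a Valgrind plugin).
LIFETIME = 512       # tunable per run; None stands for infinity

timestamp = 0        # only updated at branch instructions
shadow = {}          # address -> (writer PC, write timestamp)
dependencies = {}    # (writer PC, reader PC) -> occurrence count

def on_branch():
    """Cheap proxy for elapsed time: count branch instructions."""
    global timestamp
    timestamp += 1

def on_write(addr, pc):
    """Annotate the shadow write with the current timestamp."""
    shadow[addr] = (pc, timestamp)

def on_read(addr, pc):
    """Report a read-after-write dependency unless it expired."""
    if addr not in shadow:
        return
    writer_pc, written_at = shadow[addr]
    if LIFETIME is not None and timestamp - written_at > LIFETIME:
        return  # dependency older than the threshold: expired
    key = (writer_pc, pc)
    dependencies[key] = dependencies.get(key, 0) + 1
\end{lstlisting}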


@ -46,7 +46,7 @@ register-carried dependencies, applying the following principles.
known).
\end{itemize}
\subsection{Practical implementation}\label{ssec:staticdeps:practical_implem}
We implement \staticdeps{} in Python, using \texttt{pyelftools} and the
\texttt{capstone} disassembler ---~which we already introduced in


@ -23,69 +23,26 @@ its relevance and contains enough benchmarks to be statistically meaningful. We
also already have tooling and basic-block segmentation available for those
benchmarks, making the analysis more convenient.
\medskip{}
\subsubsection{Recompiling \cesasme{}'s dataset}
In practice, benchmarks from \cesasme{} are roughly of the following form:
\begin{lstlisting}[language=C]
/* Initialize A, B, C here */
for(int measure=0; measure < NUM_MEASURES; ++measure) {
    measure_start();
    for(int repeat=0; repeat < NUM_REPEATS; ++repeat) {
        for(int i=0; i < BENCHMARK_SIZE; ++i) {
            /* Some kernel, independent of measure, repeat */
        }
    }
    measure_stop();
}
\end{lstlisting}
While this is sensible for conducting throughput measurements, it also
introduces unwanted dependencies. If, for instance, the kernel computes
$A[i] = C\times{}A[i] + B[i]$, implemented by\\
\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
loop:
@ -97,66 +54,169 @@ loop:
jne loop
\end{lstlisting}
\end{minipage}\\
a read-after-write dependency from line 4 to line 2 is reported by \depsim{}
---~although there is no such dependency inherent to the kernel.
However, each iteration of the
\texttt{measure} (outer) loop and each iteration of the \texttt{repeat} (inner)
loop will read again each \texttt{A[i]} (\ie{} \lstxasm{(\%rax,\%rdi)} in the
assembly) value from the previous inner loop, and
write it back. This creates a dependency on the previous iteration of the inner
loop, which should in practice be meaningless if \lstc{BENCHMARK_SIZE} is large
enough. Such dependencies, however, pollute the evaluation results: as \depsim{}
does not report a dependency's distance, they are considered meaningful; and as
they cannot be detected by \staticdeps{} --~which is unaware of the outer and
inner loop~--, they introduce unfairness in the evaluation. The actual loss of
precision introduced by not discovering such dependencies is instead assessed
later by enriching \uica{} with \staticdeps{}.
\medskip{}
To avoid detecting these dependencies with \depsim{}, we \textbf{recompile
\cesasme{}'s benchmarks} from their C source code with
\lstc{NUM_MEASURES = NUM_REPEATS = 1}. We use these recompiled benchmarks only
in the current section. While we do not re-run code transformations from the
original Polybenchs, we do recompile the benchmarks from C source. Thus, the
results from this section \emph{are not comparable} with results from other
sections, as the compiler may have used different optimisations, instructions,
etc.
\subsubsection{Dependency coverage}
For each binary generated by \cesasme{}, we use its cached basic
block splitting and occurrence count. Within each binary, we discard any basic
block with fewer than 10\,\% of the occurrence count of the most-hit basic
block; this avoids considering basic blocks which were not originally inside
loops, and for which loop-carried dependencies would make no sense ---~and
could possibly create false positives.
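This filtering step can be sketched as follows ---~a short Python helper of our
own; the function name and data layout are hypothetical, not \cesasme{}'s
actual code:
\begin{lstlisting}[language=Python]
# Keep only basic blocks hit at least 10% as often as the hottest
# block of the binary (illustrative helper).
def filter_hot_blocks(occurrences, threshold=0.10):
    """occurrences maps a basic block address to its hit count."""
    if not occurrences:
        return {}
    hottest = max(occurrences.values())
    return {addr: count
            for addr, count in occurrences.items()
            if count >= threshold * hottest}
\end{lstlisting}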
\bigskip{}
For each of the considered binaries, we run our dynamic analysis, \depsim{},
and record its results. We use a lifetime of 512 instructions for this
analysis, as this is roughly the size of recent Intel reorder
buffers~\cite{wikichip_intel_rob_size}; as discussed in
\autoref{ssec:staticdeps_detection}, dependencies spanning farther than the
size of the ROB are not microarchitecturally relevant. Dependencies whose
source and destination program counters are not in the same basic block are
discarded, as \staticdeps{} cannot detect them by construction.
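The same-block restriction amounts to a simple filter; sketched in Python, with
an assumed representation of each dynamic dependency as a pair of source and
destination program counters:
\begin{lstlisting}[language=Python]
# Discard dynamic dependencies spanning across basic blocks
# (illustrative sketch; the dependency representation is assumed).
def within_block(deps, block_start, block_end):
    """Keep (src, dst) PC pairs lying in [block_start, block_end)."""
    return [(src, dst) for (src, dst) in deps
            if block_start <= src < block_end
            and block_start <= dst < block_end]
\end{lstlisting}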
For each of the considered basic blocks, we run our static analysis,
\staticdeps{}. We discard the $\Delta{}k$ parameter, as our dynamic analysis does
not report an equivalent parameter, but only a pair of program counters.
Dynamic dependencies from \depsim{} are converted to
\emph{periodic dependencies} in the sense of \staticdeps{} as described in
\autoref{ssec:staticdeps:practical_implem}: only dependencies occurring on at
least 80\,\% of the block's iterations are kept; otherwise, they are
considered measurement artifacts. The \emph{periodic coverage}
of \staticdeps{} dependencies for this basic block \wrt{} \depsim{} is the
proportion of dependencies found by \staticdeps{} among the periodic
dependencies extracted from \depsim{}:
\[
\cov_p = \dfrac{\card{\text{found}}}{\card{\text{found}} + \card{\text{missed}}}
\]
\smallskip{}
We also keep the raw dependencies from \depsim{} --~that is, without converting
them to periodic dependencies. From these, we consider two metrics:
the unweighted dependencies coverage, \[
\cov_u = \dfrac{\card{\text{found}}}{\card{\text{found}} + \card{\text{missed}}}
\]
identical to $\cov_p$ but based on unfiltered dependencies,
as well as the weighted dependencies coverage, \[
\cov_w =
\dfrac{
\sum_{d \in \text{found}} \rho_d
}{\sum_{d \in \text{found}\,\cup\,\text{missed}} \rho_d}
\]
where $\rho_d$ is the number of occurrences of the dependency $d$, dynamically
detected by \depsim. Note that such a metric is not meaningful for periodic
dependencies as, by construction, each dependency occurs as many times
as the loop iterates.
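The three metrics can be sketched as follows ---~in Python, under an assumed
data layout where each dynamic dependency is reduced to its occurrence count
$\rho_d$ and a flag telling whether \staticdeps{} found it:
\begin{lstlisting}[language=Python]
# Coverage metrics over depsim's reported dependencies
# (assumed layout: each dependency is a (rho, found) pair).
def cov_unweighted(deps):
    """Unweighted coverage: fraction of dependencies found."""
    return sum(1 for (_, found) in deps if found) / len(deps)

def cov_weighted(deps):
    """Weighted coverage: occurrence counts rho_d as weights."""
    return (sum(rho for (rho, found) in deps if found)
            / sum(rho for (rho, _) in deps))

def cov_periodic(deps, iterations, ratio=0.80):
    """Periodic coverage: unweighted coverage restricted to
    dependencies occurring on at least 80% of iterations."""
    return cov_unweighted(
        [d for d in deps if d[0] >= ratio * iterations])
\end{lstlisting}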
\begin{table}
\centering
\begin{tabular}{r r r}
\toprule
$\cov_p$ (\%) & $\cov_u$ (\%) & $\cov_w$ (\%) \\
\midrule
96.0 & 94.4 & 98.3 \\
\bottomrule
\end{tabular}
\caption{Periodic, unweighted and weighted coverage of \staticdeps{} on
\cesasme{}'s binaries recompiled without repetitions, with a lifetime of
512 instructions.}\label{table:cov_staticdeps}
\end{table}
These metrics are presented for the 3\,500 binaries of \cesasme{} in
\autoref{table:cov_staticdeps}. The obtained coverage is consistent between the
three metrics used ($\cov_p$, $\cov_u$, $\cov_w$), and the reported coverage is
very close to 100\,\%, giving us good confidence in the accuracy of
\staticdeps.
\subsubsection{``Points-to'' aliasing analysis}
The same methodology can be re-used as a proxy to estimate how often
supposedly-independent pointers actually alias in our dataset. Indeed, a major
approximation made by \staticdeps{} is to assume that any newly encountered
pointer --~function parameters, values read from memory, \ldots~-- does
\emph{not} alias with previously encountered values. This is implemented by
using a fresh random value for each yet-unknown value.
Determining which pointers may point to which other pointers --~and, by
extension, may point to the same memory region~-- is called a \emph{points-to
analysis}~\cite{points_to}. In the context of \staticdeps{}, it characterizes
the pointers for which taking a fresh value was \emph{not} representative of
the reality.
If we detect, through dynamic analysis, that a value derived from a
pointer \lstc{a} shares a value with one derived from a pointer \lstc{b}
--~say, \lstc{a + k == b + l}~--, we can deduce that \lstc{a} \emph{points-to}
\lstc{b}. This is true even if \lstc{a + k} at the very beginning of the
execution is equal to \lstc{b + l} at the very end of the execution: although
the pointers will not alias (that is, share the same value at the same moment),
they still point to the same memory region and should not be treated as
independent.
Our dynamic analyzer, \depsim{}, does not have this granularity, as it only
reports dependencies between two program counters. A dependency from a PC
$p$ to a PC $q$ however implies that a value written to memory at $q$ was read
from memory at $p$, and thus that one of the pointers used at $p$ aliases with
one of the pointers used at $q$.
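The underlying inference can be illustrated with a toy Python sketch ---~our
own illustration, not \depsim{}'s code: whenever values derived from two
distinct base pointers coincide, the bases must point to the same region.
\begin{lstlisting}[language=Python]
# Toy points-to inference: two base pointers whose derived values
# ever coincide point to the same memory region.
from collections import defaultdict

derived_from = defaultdict(set)  # runtime value -> base pointers
points_to = set()                # pairs of related base pointers

def observe(base, value):
    """Record that `value` was derived from pointer `base`."""
    for other in derived_from[value]:
        if other != base:
            points_to.add(frozenset((base, other)))
    derived_from[value].add(base)
\end{lstlisting}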
\medskip{}
We thus conduct the same analysis as before, but with an infinite lifetime to
account for far-ranging dependencies. We then use $\cov_u$ and $\cov_w$ as a
proxy to measure whether assuming the pointers independent was reasonable: a
bad coverage would be a clear indication of non-independent pointers treated as
independent. A good coverage is not, formally, an indication of the absence of
non-independent pointers: the detected static dependencies may come from other
pointers at the same PC. We however believe it reasonable to consider it a good
proxy for this metric, as a single assembly line often reads a single value,
and usually at most two. We do not use the $\cov_p$ metric here, as we want to
keep every detected dependency to detect possible aliasing.
\begin{table}
\centering
\begin{tabular}{r r}
\toprule
$\cov_u$ (\%) & $\cov_w$ (\%) \\
\midrule
95.0 & 93.7 \\
\bottomrule
\end{tabular}
\caption{Unweighted and weighted coverage of \staticdeps{} on \cesasme{}'s
binaries recompiled without repetitions, with an infinite
lifetime, as a proxy for points-to analysis.}\label{table:cov_staticdeps_pointsto}
\end{table}
The results of this analysis are presented in
\autoref{table:cov_staticdeps_pointsto}. The very high coverage rate gives us
good confidence that our hypothesis of independent pointers is reasonable, at
least within the scope of Polybench, which we believe representative of
scientific computation --~one of the prominent use-cases of tools such as code
analyzers.
\subsection{Enriching \uica{}'s model}
@ -166,8 +226,10 @@ integrate \staticdeps{} into \uica{}.
There is, however, a discrepancy between the two tools: while \staticdeps{}
works at the assembly instruction level, \uica{} works at the \uop{} level. In
real hardware, dependencies indeed occur between \uops{}; however, we are not
aware of the existence of a \uop{}-level semantic description of the x86-64 ISA
(which, by essence, would be declined for each specific processor, as the ISA
itself is not concerned with \uops{}). This level of detail was thus unsuitable
for the \staticdeps{} analysis.
We bridge this gap in a conservative way: whenever two instructions $i_1, i_2$
are found to be dependent, we add a dependency between each pair $\mu_1 \in


@ -257,8 +257,8 @@
@inproceedings{talla2001hwloops,
author={Talla, D. and John, L.K.},
booktitle={Proceedings 2001 IEEE International Conference on Computer Design: VLSI in Computers and Processors. ICCD 2001},
title={Cost-effective hardware acceleration of multimedia applications},
year={2001},
volume={},
number={},
@ -266,3 +266,21 @@
keywords={Hardware;Acceleration;Computer aided instruction;Streaming media;Parallel processing;Throughput;Concurrent computing;Application software;Microprocessors;Feeds},
doi={10.1109/ICCD.2001.955060}}
@article{points_to,
author = {Emami, Maryam and Ghiya, Rakesh and Hendren, Laurie J.},
title = {Context-sensitive interprocedural points-to analysis in the presence of function pointers},
year = {1994},
issue_date = {June 1994},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {29},
number = {6},
issn = {0362-1340},
url = {https://doi.org/10.1145/773473.178264},
doi = {10.1145/773473.178264},
abstract = {This paper reports on the design, implementation, and empirical results of a new method for dealing with the aliasing problem in C. The method is based on approximating the points-to relationships between accessible stack locations, and can be used to generate alias pairs, or used directly for other analyses and transformations. Our method provides context-sensitive interprocedural information based on analysis over invocation graphs that capture all calling contexts including recursive and mutually-recursive calling contexts. Furthermore, the method allows the smooth integration for handling general function pointers in C. We illustrate the effectiveness of the method with empirical results from an implementation in the McCAT optimizing/parallelizing C compiler.},
journal = {SIGPLAN Not.},
month = {jun},
pages = {242--256},
numpages = {15}
}