\section{Evaluation}

We evaluate the relevance of \staticdeps{}'s results in two ways: first, we compare the detected dependencies to those extracted at runtime by \depsim{}, to evaluate the proportion of dependencies actually detected. Then, we evaluate the relevance of our static analysis from a performance debugging point of view, by enriching \uica{}'s model with \staticdeps{} and assessing, using \cesasme{}, the benefits brought to the model. We finally evaluate our claim that relying on a static model instead of a dynamic analysis, such as \gus{}, allows \staticdeps{} to yield results in a reasonable amount of time.

\subsection{Comparison to \depsim{} results}\label{ssec:staticdeps_eval_depsim}

The main contribution of the \staticdeps{} model resides in its ability to track memory-carried dependencies, including loop-carried ones. We thus focus on evaluating this aspect, and restrict both \depsim{} and \staticdeps{} to memory-carried dependencies.

We use the binaries produced by \cesasme{} as a dataset, as we have already assessed its relevance and it contains enough benchmarks to be statistically meaningful. We also already have tooling and basic-block segmentation available for those benchmarks, making the analysis more convenient.

\medskip{}

For each binary previously generated by \cesasme{}, we use its cached basic-block splitting and occurrence counts. Within each binary, we discard any basic block with fewer than 10\,\% of the occurrence count of the most-hit basic block; this avoids considering basic blocks which were not originally inside loops, for which loop-carried dependencies would make no sense ---~and could even create false positives.

For each of the considered binaries, we run our dynamic analysis, \depsim{}, and record its results. For each of the considered basic blocks, we run our static analysis, \staticdeps{}. We translate the detected dependencies back to the original ELF addresses and discard the $\Delta{}k$ parameter, as our dynamic analysis does not report an equivalent parameter, but only a pair of program counters. Each dependency reported by \depsim{} whose source and destination addresses both belong to the considered basic block is then classified as either detected or missed by \staticdeps{}. Dynamically detected dependencies spanning across basic blocks are discarded, as \staticdeps{} cannot detect them by construction.

\medskip{}

We consider two metrics: the unweighted dependency coverage,
\[ \cov_u = \dfrac{\card{\text{found}}}{\card{\text{found}} + \card{\text{missed}}} \]
as well as the weighted dependency coverage,
\[ \cov_w = \dfrac{ \sum_{d \in \text{found}} \rho_d }{\sum_{d \in \text{found}\,\cup\,\text{missed}} \rho_d} \]
where $\rho_d$ is the number of dynamic occurrences of the dependency $d$, as reported by \depsim{}.

\begin{table}
    \centering
    \begin{tabular}{r r r}
        \toprule
        \textbf{Lifetime} & $\cov_u$ (\%) & $\cov_w$ (\%) \\
        \midrule
        $\infty$ & 38.1\,\% & 44.0\,\% \\
        1024 & 57.6\,\% & 58.2\,\% \\
        512 & 56.4\,\% & 63.5\,\% \\
        \bottomrule
    \end{tabular}
    \caption{Unweighted and weighted coverage of \staticdeps{} on \cesasme{}'s binaries}\label{table:cov_staticdeps}
\end{table}

These metrics are presented for the 3\,500 binaries of \cesasme{} in the first data row of \autoref{table:cov_staticdeps}. The obtained coverage, of about 40\,\%, is lower than expected.
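For reference, the classification of \depsim{}-reported dependencies and the computation of both coverage metrics boil down to the sketch below. The data structures and function names are hypothetical and only illustrate the procedure described above, not the actual tooling.

\begin{lstlisting}[language=C]
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Dependency reported by depsim: a pair of ELF addresses (program
 * counters), plus its number of dynamic occurrences rho. */
struct dyn_dep {
    uint64_t src_addr, dst_addr;
    uint64_t rho;
};

/* Dependency reported by staticdeps, translated back to ELF addresses;
 * the Delta-k parameter has already been discarded. */
struct static_dep {
    uint64_t src_addr, dst_addr;
};

/* A dynamically-found dependency is "detected" iff staticdeps reports a
 * dependency between the same pair of addresses. */
static bool is_found(const struct dyn_dep *d,
                     const struct static_dep *sdeps, size_t n_sdeps)
{
    for(size_t i = 0; i < n_sdeps; ++i)
        if(sdeps[i].src_addr == d->src_addr
           && sdeps[i].dst_addr == d->dst_addr)
            return true;
    return false;
}

/* Computes cov_u and cov_w over the depsim dependencies whose source and
 * destination both lie in the considered basic block; assumes at least
 * one such dependency, with a non-zero occurrence count. */
void coverage(const struct dyn_dep *ddeps, size_t n_ddeps,
              const struct static_dep *sdeps, size_t n_sdeps,
              double *cov_u, double *cov_w)
{
    size_t n_found = 0;
    uint64_t rho_found = 0, rho_total = 0;
    for(size_t i = 0; i < n_ddeps; ++i) {
        rho_total += ddeps[i].rho;
        if(is_found(&ddeps[i], sdeps, n_sdeps)) {
            n_found++;
            rho_found += ddeps[i].rho;
        }
    }
    *cov_u = (double)n_found / (double)n_ddeps;
    *cov_w = (double)rho_found / (double)rho_total;
}
\end{lstlisting}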
\bigskip{}

Manual investigation of the missed dependencies revealed some surprising dependencies dynamically detected by \depsim{}, which did not appear to actually be read-after-write dependencies. In the following (simplified) example, roughly implementing $A[i] = C\times{}A[i] + B[i]$,
\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
loop:
    vmulsd (%rax,%rdi), %xmm0, %xmm1
    vaddsd (%rbx,%rdi), %xmm1, %xmm1
    vmovsd %xmm1, (%rax,%rdi)
    add $8, %rdi
    cmp %rdi, %r10
    jne loop
\end{lstlisting}
\end{minipage}\\
a read-after-write dependency from line 4 to line 2 was reported ---~while no such dependency actually exists. The reason is that, in \cesasme{}'s benchmarks, the whole program roughly looks like
\begin{lstlisting}[language=C]
/* Initialize A, B, C here */
for(int measure=0; measure < NUM_MEASURES; ++measure) {
    measure_start();
    for(int repeat=0; repeat < NUM_REPEATS; ++repeat) {
        for(int i=0; i < ARRAY_SIZE; ++i) {
            A[i] = C * A[i] + B[i];
        }
    }
    measure_stop();
}
\end{lstlisting}

Thus, the previously reported dependency did not come from within the kernel, but \emph{from one outer iteration to the next} (\ie{}, an iteration of the \lstc{repeat} or \lstc{measure} loops in the code above). Such a dependency is not relevant in practice: under the usual assumptions of code analyzers, the innermost loop is long enough to be considered infinite in steady state; thus, two outer loop iterations are too far apart in time for this dependency to matter, as the source iteration has long finished executing by the time the destination iteration is scheduled.

\medskip{}

To filter out such irrelevant dependencies, we introduce in \depsim{} a notion of \emph{dependency lifetime}. As elapsed cycles cannot be accessed from \valgrind{} without a heavy runtime slowdown, we define a \emph{timestamp} as the number of instructions executed since the beginning of the program's execution; to avoid excessive instrumentation slowdown, this count is only updated at each branch instruction. We further annotate every write to the shadow memory with the timestamp at which it occurred. Whenever a dependency is about to be added, we first check that it has not expired ---~that is, that the write is not older than a given threshold.

We re-run the previous experiments with lifetimes of respectively 1\,024 and 512 instructions, which roughly corresponds to the order of magnitude of a reorder buffer's size; these results are also reported in \autoref{table:cov_staticdeps}. While the introduction of a 1\,024-instruction lifetime greatly improves both the unweighted and weighted coverage rates, further reducing this lifetime to 512 instructions does not yield significant further improvement.

\bigskip{}

The final coverage results, with a detection rate of roughly 60\,\%, are reasonable: a significant proportion of dependencies is detected, yet many are still missed. This may be explained by the limitations studied in \autoref{ssec:staticdeps_limits} above, and especially by the inability of \staticdeps{} to detect dependencies through aliasing pointers, illustrated below. This falls, more broadly, within the problem of lack of context that we discussed before and emphasized in \autoref{chap:CesASMe}: we expect that an analysis at the scale of the whole program, able to integrate constraints stemming from outside the loop body, would capture many more dependencies.
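As an illustration of this limitation, consider the following C function ---~a made-up example, not part of \cesasme{}'s benchmarks:

\begin{lstlisting}[language=C]
/* If the caller passes p == q (or overlapping arrays), the store to
 * q[i+1] at iteration i is read back as p[i+1] at iteration i+1,
 * creating a loop-carried read-after-write dependency through memory. */
void scale_shifted(double *p, double *q, double c, int n) {
    for(int i=0; i+1 < n; ++i)
        q[i+1] = c * p[i];
}
\end{lstlisting}

At the basic-block level, the load through \lstc{p} and the store through \lstc{q} use unrelated base registers; nothing in the loop body indicates that they may refer to the same array, so \staticdeps{} cannot report the dependency, while \depsim{} observes it whenever the pointers actually alias at runtime.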
\subsection{Enriching \uica{}'s model}

To estimate the real gain in performance debugging scenarios, however, we integrate \staticdeps{} into \uica{}.

A discrepancy between the two tools must first be bridged: while \staticdeps{} works at the assembly instruction level, \uica{} works at the \uop{} level. In real hardware, dependencies indeed occur between \uops{}; however, we are not aware of any \uop{}-level semantic description of the x86-64 ISA, which made this level of detail unsuitable for the \staticdeps{} analysis. We bridge this gap in a conservative way: whenever two instructions $i_1, i_2$ are found to be dependent, we add a dependency between every pair of \uops{} $(\mu_1, \mu_2)$ with $\mu_1 \in i_1$ and $\mu_2 \in i_2$, as sketched below. This approximation is thus pessimistic, and should bias predicted execution times towards a slower computation kernel. A finer model, or a finer (conservative) filtering of which \uops{} must be considered dependent ---~\eg{} a memory-carried dependency can only originate from a memory-related \uop{}~--- may enhance the accuracy of our integration.
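The conservative expansion itself is straightforward; the sketch below illustrates it with hypothetical types, which do not reflect \uica{}'s actual internals.

\begin{lstlisting}[language=C]
#include <stddef.h>

struct uop;                   /* opaque micro-operation */

struct instr {                /* decoded instruction */
    struct uop **uops;        /* micro-operations it decodes into */
    size_t n_uops;
};

/* For an instruction-level dependency i1 -> i2 reported by staticdeps,
 * conservatively add a uop-level dependency for every pair (u1, u2). */
void add_uop_dependencies(const struct instr *i1, const struct instr *i2,
                          void (*add_dep)(struct uop *, struct uop *)) {
    for(size_t a = 0; a < i1->n_uops; ++a)
        for(size_t b = 0; b < i2->n_uops; ++b)
            add_dep(i1->uops[a], i2->uops[b]);
}
\end{lstlisting}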
\medskip{}

We then evaluate our gains by running \cesasme{}'s harness as we did in \autoref{chap:CesASMe}, running both \uica{} and \uicadeps{}, on two datasets: first, the full set of 3\,500 binaries from the previous chapter; then, the set of binaries pruned to exclude benchmarks heavily relying on memory-carried dependencies, introduced in \autoref{ssec:memlatbound}.

If \staticdeps{} is beneficial to \uica{}, we expect \uicadeps{} to yield significantly better results than \uica{} alone on the first dataset. On the second dataset, however, \staticdeps{} should provide no significant contribution, as the dataset was pruned to not exhibit significant memory-carried latency-boundness.

We present these results in \autoref{table:staticdeps_uica_cesasme}, as well as the corresponding box plots in \autoref{fig:staticdeps_uica_cesasme_boxplot}.

\begin{table}
    \centering
    \footnotesize
    \begin{tabular}{l l r r r r r r}
        \toprule
        \textbf{Dataset} & \textbf{Bencher} & \textbf{Datapoints} & \textbf{MAPE} & \textbf{Median} & \textbf{Q1} & \textbf{Q3} & \textbf{$K_\tau$}\\
        \midrule
        \multirow{2}{*}{Full}
            & \uica{} & 3500 & 29.59\,\% & 18.26\,\% & 7.11\,\% & 52.99\,\% & 0.58\\
            & + \staticdeps{} & 3500 & 19.15\,\% & 14.44\,\% & 5.86\,\% & 23.96\,\% & 0.81\\
        \midrule
        \multirow{2}{*}{Pruned}
            & \uica{} & 2388 & 18.42\,\% & 11.96\,\% & 5.42\,\% & 23.32\,\% & 0.80\\
            & + \staticdeps{} & 2388 & 18.77\,\% & 12.18\,\% & 5.31\,\% & 23.55\,\% & 0.80\\
        \bottomrule
    \end{tabular}
    \caption{Evaluation through \cesasme{} of the integration of \staticdeps{} into \uica{}}\label{table:staticdeps_uica_cesasme}
\end{table}

\begin{figure}
    \centering
    \includegraphics[width=0.5\linewidth]{uica_cesasme_boxplot.svg}
    \caption{Statistical distribution of the relative errors of \uica{}, with and without \staticdeps{} hints, with and without pruning the rows latency-bound through memory-carried dependencies}\label{fig:staticdeps_uica_cesasme_boxplot}
\end{figure}

\medskip{}

The \uicadeps{} row for the full dataset is extremely close, on every metric, to the \uica{}-only row for the pruned dataset. On this basis, we argue that the addition of \staticdeps{} to \uica{} is conclusive: the hints provided by \staticdeps{} are sufficient to make \uica{}'s results on the full dataset as good as they previously were on a dataset pruned of precisely the kind of dependencies we aim to detect. Furthermore, the results of \uica{} and \uicadeps{} on the pruned dataset are extremely close, which further supports the accuracy of \staticdeps{}: the dependencies it adds on this dataset do not degrade the predictions.

\medskip{}

While the results obtained against \depsim{} in \autoref{ssec:staticdeps_eval_depsim} above were reasonable, they were not excellent either, and showed that many kinds of dependencies were still missed by \staticdeps{}. However, our evaluation on \cesasme{} by enriching \uica{} shows that, at least on the workloads considered, the dependencies that actually matter from a performance debugging point of view are properly found. This, however, might not hold for other kinds of applications requiring a dependency analysis.

\subsection{Analysis speed}

\todo{}