From ceea025b65cb28532097bfed1ef2b5d580760896 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Th=C3=A9ophile=20Bastian?= Date: Wed, 24 Apr 2024 20:08:50 +0200 Subject: [PATCH] Conclusion: mostly written, maybe lacks a conclusive paragraph --- manuscrit/99_conclusion/main.tex | 69 ++++++++++++++++++++++++++------ 1 file changed, 56 insertions(+), 13 deletions(-) diff --git a/manuscrit/99_conclusion/main.tex b/manuscrit/99_conclusion/main.tex index 06cd632..ba79415 100644 --- a/manuscrit/99_conclusion/main.tex +++ b/manuscrit/99_conclusion/main.tex @@ -12,21 +12,21 @@ analyzing the low-level performance of a microkernel: prevent the backend from being saturated; the latter is stalled awaiting previous results (\autoref{chap:staticdeps}). \end{itemize} -We also conduced in \autoref{chap:CesASMe} a systematic comparative study of a +We also conducted in \autoref{chap:CesASMe} a systematic comparative study of a variety of state-of-the-art code analyzers. \bigskip{} State-of-the-art code analyzers such as \llvmmca{} or \uica{} already -boast a good accuracy. Both of these models ---~and most of the others also~--- +boast good accuracy. Both of these tools ---~and most of the others also~--- are however based on models obtained by various degrees of manual -investigation, and are unable to scale without further manual effort to future +investigation, and cannot be adapted without further manual effort to future or uncharted microprocessors. The field of microarchitectural models for code analysis emerged with fundamentally manual methods, such as Agner Fog's tables. Such tables, however, may now be produced in a more automated way using -\uopsinfo{} ---~at least for certain microarchitectures~---; \pmevo{} pushes +\uopsinfo{} ---~at least for certain microarchitectures; \pmevo{} pushes further in this direction by automatically computing a frontend model from benchmarks ---~but still has trouble scaling to a full instruction set. In its own way, \ithemal{}, a machine-learning based approach, could also be @@ -38,28 +38,28 @@ supercomputer area. \medskip{} -We investigate this direction by exploring the three major bottlenecks +We investigated this direction by exploring the three major bottlenecks mentioned earlier in the perspective of providing fully-automated, benchmarks-based models for each of them. Optimally, these models should be generated by simply executing a program on a machine running on top of the targeted microarchitecture. \begin{itemize} - \item We contribute to \palmed{}, a framework able to extract a + \item We contributed to \palmed{}, a framework able to extract a port-mapping of a processor, serving as a backend model. - \item We manually extract a frontend model for the Cortex A72 processor. We - believe that the foundation of our methodology works on most + \item We manually extracted a frontend model for the Cortex A72 processor. + We believe that the foundation of our methodology works on most processors. The main characteristics of a frontend, apart from their instructions' \uops{} decomposition and issue width, must however still be investigated, and their relative importance evaluated. - \item We provide with \staticdeps{} a method to to extract data + \item We provided with \staticdeps{} a method to extract data dependencies between instructions.
It is able to detect \textit{loop-carried} dependencies (dependencies that span across multiple loop iterations), as well as \textit{memory-carried} dependencies (dependencies based on reading at a memory address written by another instruction). While the former is widely implemented, the latter is, to the best of our knowledge, an original contribution. We - bundle this method in a processor-independent tool, based on semantics + bundled this method in a processor-independent tool, based on semantics of the ISA provided by \valgrind{}, which supports a variety of ISAs. \end{itemize} @@ -76,11 +76,12 @@ together ---~or, even better, when each of them could be combined with any other model of the other parts. To the best of our knowledge, however, no such modular tool exists; nor is there any standardized approach to interact with such models. The usual approach of the domain to try a new idea, instead, is to -create a full analyzer implementing this idea, such as we did with \palmed{} -for backend models, or such as \uica{}'s implementation. +create a full analyzer implementing this idea, as we did with \palmed{} +for backend models, or as \uica{}'s implementation does, focusing on frontend +analysis. In hindsight, we advocate for the emergence of such a modular code analyzer. -It would maybe not be as convenient or well-packaged as ``production-ready'' +It might not be as convenient or well-integrated as ``production-ready'' code analyzers, such as \llvmmca{} ---~which is packaged for Debian. It could, however, greatly simplify the academic process of trying a new idea on any of the three main models, by decorrelating them. It would also ease the @@ -98,4 +99,46 @@ comparative experiments with \cesasme{}. \smallskip{} +First, none of the state-of-the-art tools has good support for dependencies +across memory. Such dependencies were present in about a third of \cesasme{}'s +benchmark set. While we built this benchmark set aiming for representative +data, there is no clear evidence that these dependencies are so strongly +present in the codes analyzed in real use cases. We believe, however, that such +cases regularly occur, and we also saw that the accuracy of code analyzers +drops sharply in their presence. +\smallskip{} + +We also found the bottleneck prediction offered by some code analyzers to be of +uncertain quality. In our experiments, the tools disagreed more often than not on the +presence or absence of a bottleneck, with no tool standing out; we are thus +unable to conclude on the relative performance of tools on this aspect. On the +other hand, sensitivity analysis, as implemented \eg{} by \gus{}, seems a +theoretically sound way to evaluate the presence or absence of a bottleneck in +a microkernel; it is, however, prohibitively slow for many use cases. In this +respect, a study of code analyzers' predictions against results from +sensitivity analysis would certainly bring more conclusive results. + +\smallskip{} + +Finally, we observed in \bhive{}'s results the effects of a \emph{lack of +context} for an analysis. \bhive{} measures a real execution, on real hardware, +of a kernel; as such, it yields excellent accuracy in many cases, with a median +error of about 8\%. Yet, it still lacks accuracy in many other cases, with +its third quartile (23\%) above \uica{}'s or \iaca{}'s median results (about +18\%), and far-reaching outliers bringing its mean error on par with \uica{}'s.
+Indeed, what precedes a loop nest and the real values present in registers +impact the performance of the loop nest. The effects can be fairly +high-level, such as pointer aliasing, which leads to false positives or negatives in +dependency detection. They can also be microarchitectural, such as +the observable performance loss of memory accesses ---~even with cache hits~--- +when memory reads cross a cache line boundary. + +This lack of context incurs a significant loss of accuracy for +static analyzers: we saw in \autoref{ssec:bhive_errors} that the same +instruction, depending on its registers' values, can be twice as slow even +without aliasing, or 19 times slower upon aliasing. With \cesasme{}, we sketched +the embryo of a solution: a simple and fast pass of dynamic analysis +through instrumentation, gathering data for a subsequent pass of static +analysis. Such a method might help recreate the context needed for an +accurate analysis.
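+
+\smallskip{}
+
+As a purely illustrative example of how much this context matters (the kernel
+below is hypothetical and not taken from our benchmark suite), consider the
+following C loop: whether it carries a memory-carried, loop-carried dependency
+depends entirely on the runtime values of its two pointers, information that a
+purely static analyzer does not have.
+
+\begin{verbatim}
+/* Hypothetical kernel, for illustration only.
+ * If dst and src point to disjoint arrays, all iterations are
+ * independent and the loop is limited by frontend and backend
+ * throughput only.
+ * If instead dst == src + 1, the store to dst[i] writes src[i+1],
+ * which the next iteration loads: every iteration then depends on
+ * the previous one through memory, and the loop becomes a serial
+ * dependency chain. */
+void scale(double *dst, double *src, double k, long n) {
+    for (long i = 0; i < n; i++)
+        dst[i] = k * src[i];
+}
+\end{verbatim}
+
+Faced with this loop alone, a static analyzer must either assume aliasing,
+risking a false positive dependency, or ignore it, risking a false negative;
+a single instrumented run, in the spirit of the dynamic pass sketched above,
+is enough to observe the actual pointer values and settle the question.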