CesASMe: brutal paper import. Not compiling yet.

Théophile Bastian 2023-09-25 17:00:07 +02:00
parent 0b089085e0
commit fc9182428d
14 changed files with 1143 additions and 0 deletions

\begin{abstract}
A variety of code analyzers, such as \iaca, \uica, \llvmmca{} or
\ithemal{}, strive to statically predict the throughput of a computation
kernel. Each analyzer is based on its own simplified CPU model
reasoning at the scale of an isolated basic block.
Given this diversity, evaluating their strengths and
weaknesses is important to guide both their usage and their enhancement.
We argue that reasoning at the scale of a single basic block is not
always sufficient and that a lack of context can mislead analyses. We present \tool, a fully-tooled
solution to evaluate code analyzers on C-level benchmarks. It is composed of a
benchmark derivation procedure that feeds an evaluation harness. We use it to
evaluate state-of-the-art code analyzers and to provide insights on their
precision. We use \tool's results to show that memory-carried data
dependencies are a major source of imprecision for these tools.
\end{abstract}
\section{Introduction}\label{sec:intro}
At a time when software is expected to perform more computations, faster and in
more constrained environments, tools that statically predict the resources (and
in particular the CPU resources) they consume are very useful to guide their
optimization. This need is reflected in the diversity of binary or assembly
code analyzers following the deprecation of \iaca~\cite{iaca}, which Intel has
maintained through 2019. Whether it is \llvmmca{}~\cite{llvm-mca},
\uica{}~\cite{uica}, \ithemal~\cite{ithemal} or \gus~\cite{phd:gruber}, all these tools strive to extract various
performance metrics, including the number of CPU cycles a computation kernel will take
---~which roughly translates to execution time.
In addition to raw measurements (relying on hardware counters), these model-based analyses provide
higher-level and refined data, to expose the bottlenecks and guide the
optimization of a given code. This feedback is useful to experts optimizing
computation kernels, including scientific simulations and deep-learning
kernels.
An exact throughput prediction would require a cycle-accurate simulator of the
processor, based on microarchitectural data that is most often not publicly
available, and would be prohibitively slow in any case. These tools thus each
solve in their own way the challenge of modeling complex CPUs while remaining
simple enough to yield a prediction in a reasonable time, ending up with
different models. For instance, on the following x86-64 basic block computing a
general matrix multiplication,
\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
movsd (%rcx, %rax), %xmm0
mulsd %xmm1, %xmm0
addsd (%rdx, %rax), %xmm0
movsd %xmm0, (%rdx, %rax)
addq $8, %rax
cmpq $0x2260, %rax
jne 0x16e0
\end{lstlisting}
\end{minipage}
\noindent\llvmmca{} predicts 1.5 cycles, \iaca{} and \ithemal{} predict 2 cycles, while \uica{}
predicts 3 cycles. One may wonder which tool is correct.
The obvious solution to assess their predictions is to compare them to an
actual measure. However, as these tools reason at the basic block level, this
is not as trivially defined as it would seem. Take for instance the following
kernel:
\begin{minipage}{0.90\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
mov (%rax, %rcx, 1), %r10
mov %r10, (%rbx, %rcx, 1)
add $8, %rcx
\end{lstlisting}
\end{minipage}
\input{overview}
\noindent{}At first, it looks like an array copy from location \reg{rax} to
\reg{rbx}. Yet, if before the loop, \reg{rbx} is initialized to
\reg{rax}\texttt{+8}, there is a read-after-write dependency between the first
instruction and the second instruction of the previous iteration, which makes
the throughput drop significantly. As we shall see in
Section~\ref{ssec:bhive_errors}, \emph{without proper context, a basic
block's throughput is not well-defined}.
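To make this concrete, a hypothetical C-level counterpart of this block could be
the following (a minimal sketch of ours, not extracted from an actual benchmark;
\texttt{dst} and \texttt{src} play the roles of \reg{rbx} and \reg{rax}):

\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[ANSI]C}]
/* If dst and src are disjoint, iterations are
 * independent. If dst == src + 1, each load reads
 * the value stored by the previous iteration:
 * a loop-carried read-after-write dependency. */
void copy(long *dst, const long *src, long n)
{
    for (long i = 0; i < n; i++)
        dst[i] = src[i];
}
\end{lstlisting}
\end{minipage}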
To recover the context of each basic block, we reason instead at the scale of
a C source code. This
makes the measures unambiguous: one can use hardware counters to measure the
elapsed cycles during a loop nest. This requires a suite of benchmarks, in C,
that is both representative of the domain studied and wide enough to cover
that domain well. However, this is not in itself sufficient to
evaluate static tools: on the preceding matrix multiplication kernel, counters
report 80,059 elapsed cycles ---~for the total loop.
This number can hardly be compared to the basic block-level predictions of
\llvmmca{}, \iaca{}, \ithemal{}, and \uica{} seen above.
A common practice to make these numbers comparable is to renormalize them to
instructions per cycle (IPC). Here, \llvmmca{} reports an IPC of
$\frac{7}{1.5}~\approx~4.67$, \iaca{} and \ithemal{} report an IPC of
$\frac{7}{2}~=~3.5$, and \uica{} reports an IPC of $\frac{7}{3}~\approx~2.33$. In this
case, the measured IPC is 3.45, which is closest to \iaca{} and \ithemal. Yet,
IPC is a metric for microarchitectural load, and \textit{tells nothing about a
kernel's efficiency}. Indeed, the static number of instructions is affected by
many compiler passes, such as scalar evolution, strength reduction, register
allocation, instruction selection\ldots{} Thus, when comparing two compiled
versions of the same code, IPC alone does not necessarily point to the most
efficient version. For instance, a kernel using SIMD instructions will use
fewer instructions than one using only scalars, and thus exhibit a lower or
constant IPC; yet, its performance will unquestionably increase.
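As a purely hypothetical illustration (the numbers are ours, not measured): a
scalar version of a kernel retiring $8{,}000$ instructions at an IPC of 4 takes
$2{,}000$ cycles, while a vectorized version of the same kernel retiring only
$2{,}500$ instructions at an IPC of 2 takes $1{,}250$ cycles; the vectorized
version is the faster one despite its lower IPC.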
The total cycles elapsed to solve a given problem, on the other
hand, is a sound metric of the efficiency of an implementation. We thus
instead \emph{lift} the predictions at basic-block level to a total number of
cycles. In simple cases, this amounts to multiplying the block-level
prediction by the number of loop iterations; in general, however, this bound
may not be known. More importantly, the compiler may apply any number of
transformations: unrolling, for instance, changes this number. Control flow may
also be complicated by code versioning.
%In the general case, instrumenting the generated code to obtain the number of
%occurrences of the basic block yields accurate results.
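As a back-of-the-envelope illustration on the matrix multiplication kernel
above (our own estimate, assuming that essentially all retired instructions
belong to this 7-instruction block): the 80,059 measured cycles at an IPC of
3.45 correspond to roughly
\[
\frac{80{,}059 \times 3.45}{7} \approx 39{,}500
\]
executions of the block, so that \eg{} \iaca's 2-cycle prediction lifts to
about $2 \times 39{,}500 \approx 79{,}000$ cycles, a figure directly comparable
to the measured total.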
\bigskip
In this article, we present a fully-tooled solution to evaluate and compare the
diversity of static throughput predictors. Our tool, \tool, solves two main
issues in this direction. In Section~\ref{sec:bench_gen}, we describe how
\tool{} generates a wide variety of computation kernels stressing different
parameters of the architecture, and thus of the predictors' models, while
staying close to representative workloads. To achieve this, we use
Polybench~\cite{polybench}, a C-level benchmark suite representative of
scientific computation workloads, that we combine with a variety of
optimizations, including polyhedral loop transformations.
In Section~\ref{sec:bench_harness}, we describe how \tool{} is able to
evaluate throughput predictors on this set of benchmarks by lifting their
predictions to a total number of cycles that can be compared to a hardware
counter-based measure. A
high-level view of \tool{} is shown in Figure~\ref{fig:contrib}.
In Section~\ref{sec:exp_setup}, we detail our experimental setup and assess our
methodology. In Section~\ref{sec:results_analysis}, we compare the tools' predictions and
analyze \tool's results.
In addition to statistical studies, we use \tool's results
to investigate analyzers' flaws. We show that code
analyzers do not always correctly model data dependencies through memory
accesses, substantially impacting their precision.

\section{Related works}
The static throughput analyzers studied rely on a variety of models.
\iaca~\cite{iaca}, developed by Intel and now deprecated, is closed-source and
relies on Intel's expertise on their own processors.
To guide its optimization passes, the LLVM compiler ecosystem maintains models of many
architectures. These models are used in the LLVM Machine Code Analyzer,
\llvmmca~\cite{llvm-mca}, to statically evaluate the performance of a segment
of assembly.
Independently, Abel and Reineke used an automated microbenchmark generation
approach to generate port mappings of many architectures in
\texttt{uops.info}~\cite{nanobench, uopsinfo} to model processors' backends.
This work was continued with \uica~\cite{uica}, extending this model with an
extensive frontend description.
Following a completely different approach, \ithemal~\cite{ithemal} uses a deep
neural network to predict basic block throughput. To obtain enough data to
train its model, the authors also developed \bhive~\cite{bhive}, a profiling
tool working on basic blocks.
Another static tool, \osaca~\cite{osaca2}, provides lower and
upper bounds on the execution time of a basic block. As this kind of
information cannot be fairly compared with tools yielding an exact throughput
prediction, we exclude it from our scope.
All these tools statically predict the number of cycles taken by a piece of
assembly or binary that is assumed to be the body of an infinite ---~or
sufficiently large~--- loop in steady state, all its data being L1-resident. As
discussed by \uica's authors~\cite{uica}, hypotheses can subtly vary between
analyzers; \eg{} by assuming that the loop is either unrolled or has control
instructions, with non-negligible impact. Some of the tools, \eg{} \ithemal,
necessarily work on a single basic block, while some others, \eg{} \iaca, work
on a section of code delimited by markers. However, even in the second case,
the code is assumed to be \emph{straight-line code}: branch instructions, if
any, are assumed not taken.
\smallskip
Throughput prediction tools, however, are not all static.
\gus~\cite{phd:gruber} dynamically predicts the throughput of a whole program
region, instrumenting it to retrieve the exact events occurring through its
execution. This way, \gus{} can more finely detect bottlenecks by
sensitivity analysis, at the cost of a significantly longer run time.
\smallskip
The \bhive{} profiler~\cite{bhive} takes another approach to basic block
throughput measurement: by mapping memory at any address accessed by a basic
block, it can effectively run and measure arbitrary code without context, often
---~but not always, as we discuss later~--- yielding good results.
\smallskip
The \anica{} framework~\cite{anica} also attempts to evaluate throughput
predictors by finding examples on which they are inaccurate. \anica{} starts
with randomly generated assembly snippets, and refines them through a process
derived from abstract interpretation to reach general categories of problems.

\section{Generating microbenchmarks}\label{sec:bench_gen}
Our framework aims to generate \emph{microbenchmarks} relevant to a specific
domain.
A microbenchmark is a piece of code that is as simple as possible while still
exposing the behaviour under consideration.
The specified computations should be representative of the considered domain,
and at the same time they should stress the different aspects of the
target architecture ---~which is modeled by code analyzers.
In practice, a microbenchmark's \textit{computational kernel} is a simple
\texttt{for} loop, whose
body contains no loops and whose bounds are statically known.
A \emph{measure} is a number of repetitions $n$ of this computational
kernel, $n$ being a user-specified parameter.
The measure may be repeated an arbitrary number of times to improve
stability.
Furthermore, such a microbenchmark should be a function whose computation
happens without leaving the L1 cache.
This requirement helps measurements and analyses to be
undisturbed by memory accesses, but it is also a matter of comparability.
Indeed, most of the static analyzers make the assumption that the code under
consideration is L1-resident; if it is not, their results are meaningless and
cannot be compared with an actual measurement.
The generation of such microbenchmarks is achieved through four distinct
components, whose parameter variations are specified in configuration files:
a benchmark suite, C-to-C loop nest optimizers, a constraining utility
and a C-to-binary compiler.
\subsection{Benchmark suite}\label{ssec:bench_suite}
Our first component is an initial set of benchmarks which materializes
the human expertise we intend to exploit for the generation of relevant codes.
The considered suite must embed computation kernels
delimited by ad-hoc \texttt{\#pragma}s,
whose arrays are accessed
directly (no indirections) and whose loops are affine.
These constraints are necessary to ensure that the microkernelification phase,
presented below, generates segfault-free code.
In this case, we use Polybench~\cite{polybench}, a suite of 30
benchmarks for polyhedral compilation ---~of which we keep only 26. The
\texttt{nussinov}, \texttt{ludcmp} and \texttt{deriche} benchmarks are
removed because they are incompatible with PoCC (introduced below). The
\texttt{lu} benchmark is left out as its execution alone takes longer than all
others together, making its dynamic analysis (\eg{} with \gus) impractical.
Beyond the importance of linear algebra within
this suite, one of its important features is that it does not include computational
kernels with conditional control flow (\eg{} \texttt{if-then-else})
---~however, it does include conditional data flow, using the ternary
conditional operator of C.
\subsection{C-to-C loop nest optimizers}\label{ssec:loop_nest_optimizer}
Loop nest optimizers transform an initial benchmark in different ways (generating different
\textit{versions} of the same benchmark), varying the stress on the
resources of the target architecture, and by extension on the models on which the
static analyzers are based.
In this case, we chose to use the
\textsc{Pluto}~\cite{pluto} and PoCC~\cite{pocc} polyhedral compilers to easily access common loop nest optimizations: register tiling, tiling,
skewing, vectorization/simdization, loop unrolling, loop permutation and
loop fusion.
These transformations are meant to maximize variety within the initial
benchmark suite. Eventually, the generated benchmarks are expected to
highlight the impact on performance of the resulting behaviours.
For instance, \textit{skewing} introduces non-trivial pointer arithmetic,
increasing the pressure on address computation units; \textit{loop unrolling},
among many things, opens the way to register promotion, which exposes dependencies
and relieves load-store units;
\textit{vectorization} stresses SIMD units and decreases
pressure on the front-end; and so on.
\subsection{Constraining utility}\label{ssec:kernelify}
A constraining utility transforms the code in order to respect an arbitrary number of non-functional
properties.
In this case, we apply a pass of \emph{microkernelification}: we
extract a computational kernel from the arbitrarily deep and arbitrarily
long loop nest generated by the previous component.
The loop chosen to form the microkernel is the one considered to be
the \textit{hottest}, the \textit{hotness} of a loop being obtained by
multiplying the number of arithmetic operations it contains by the number of
times it is iterated. This metric allows us to prioritize the parts of the
code that have the greatest impact on performance.
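Written out (our notation), with $\operatorname{ops}(\ell)$ the number of
arithmetic operations contained in a loop $\ell$ and $\operatorname{iter}(\ell)$
the number of times it is iterated:
\[
\operatorname{hotness}(\ell) = \operatorname{ops}(\ell) \times \operatorname{iter}(\ell)
\]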
At this point, the resulting code can
compute a different result from the initial code;
for instance, the composition of tiling and
kernelification reduces the number of loop iterations.
Indeed, our framework is not meant to preserve the
functional semantics of the benchmarks.
Our goal is only to generate codes that are relevant from the point of view of
performance analysis.
\subsection{C-to-binary compiler}\label{ssec:compile}
A C-to-binary compiler varies binary optimization options by
enabling/disabling auto-vectorization, extended instruction
sets, \textit{etc}. We use \texttt{gcc}.
\bigskip
Eventually, the relevance of the set of microbenchmarks generated using this approach
derives not only from the initial benchmark suite and the relevance of the
transformations chosen at each
stage, but also from the combinatorial explosion generated by the composition
of the four stages. In our experimental setup, this yields up to 144
microbenchmarks per benchmark of the original suite.

\section{Benchmarking harness}\label{sec:bench_harness}
To compare full-kernel cycle measurements to throughput predictions on
individual basic blocks, we lift predictions by adding the weighted basic block
predictions:
\[
\text{lifted\_pred}(\mathcal{K}) =
\sum_{b \in \operatorname{BBs}(\mathcal{K})}
\operatorname{occurrences}(b) \times \operatorname{pred}(b)
\]
Our benchmarking harness works in three successive stages. It first
extracts the basic blocks constituting a computation kernel and instruments the
kernel to retrieve their respective occurrence counts in the original context. It then runs
all the studied tools on each basic block, while also running measures on the
whole computation kernel. Finally, the block-level results are lifted to
kernel-level results thanks to the occurrences previously measured.
\subsection{Basic block extraction}\label{ssec:bb_extr}
Using the Capstone disassembler~\cite{tool:capstone}, we split the assembly
code at each control flow instruction (jump, call, return, \ldots) and at each
jump target.
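As an illustration, a minimal sketch of this splitting step using Capstone's C
API is given below (our own code, not the actual harness; for brevity it only
opens a new block after each control flow instruction and omits the additional
splits at jump targets):
\begin{lstlisting}[language={[ANSI]C}]
#include <inttypes.h>
#include <stdio.h>
#include <capstone/capstone.h>

/* Print the start address of each basic block. */
void split_blocks(const uint8_t *code, size_t size,
                  uint64_t addr)
{
    csh h;
    cs_insn *insn;
    if (cs_open(CS_ARCH_X86, CS_MODE_64, &h) != CS_ERR_OK)
        return;
    /* detail mode is needed to query insn groups */
    cs_option(h, CS_OPT_DETAIL, CS_OPT_ON);
    size_t n = cs_disasm(h, code, size, addr, 0, &insn);
    printf("block at 0x%" PRIx64 "\n", addr);
    for (size_t i = 0; i < n; i++) {
        int cf = cs_insn_group(h, &insn[i], CS_GRP_JUMP)
              || cs_insn_group(h, &insn[i], CS_GRP_CALL)
              || cs_insn_group(h, &insn[i], CS_GRP_RET);
        if (cf && i + 1 < n)
            printf("block at 0x%" PRIx64 "\n",
                   insn[i + 1].address);
    }
    if (n > 0)
        cs_free(insn, n);
    cs_close(&h);
}
\end{lstlisting}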
To accurately obtain the number of occurrences of each basic block in the whole
kernel's computation,
we then instrument the kernel with \texttt{gdb}, placing a breakpoint
at the first instruction of each basic block and counting how many times each
breakpoint is hit between two calls to the \perf{} counters\footnote{We
assume the program under analysis to be deterministic.}. While this
instrumentation takes about 50 to 100$\times$ longer than a regular run, it
can safely be run in parallel, as the performance results are discarded.
\subsection{Throughput predictions and measures}\label{ssec:throughput_pred_meas}
The harness leverages a variety of tools: actual CPU measurement; the \bhive{}
basic block profiler~\cite{bhive}; \llvmmca~\cite{llvm-mca}, \uica~\cite{uica}
and \iaca~\cite{iaca}, which leverage microarchitectural
models to predict a block's throughput; \ithemal~\cite{ithemal}, a machine
learning model; and \gus~\cite{phd:gruber}, a dynamic analyzer based on \qemu{}
that works at the whole binary level.
The execution time of the full kernel is measured using Linux
\perf~\cite{tool:perf} CPU counters around the full computation kernel. The
measure is repeated four times and the smallest is kept; this ensures that the
cache is warm and compensates for context switching or other measurement
artifacts. \gus{} instruments the whole function body. The other tools included
all work at basic block level; these are run on each basic block of each
benchmark.
We emphasize the importance, throughout the whole evaluation chain, of keeping the
exact same assembled binary. Indeed, recompiling the kernel from source
\emph{cannot} be assumed to produce the same assembly kernel. This is even more
important in the presence of slight changes: for instance, inserting \iaca{}
markers at the C-level ---~as is intended~--- around the kernel \emph{might}
change the compiled kernel, if only through alignment regions. We argue that, in
the case of \iaca{} markers, the problem is even more critical: as the
markers overwrite registers with arbitrary values, they prevent the binary from
being run at all. This forces a user to run and measure a version different from
the analyzed one. In our harness, we circumvent this issue by adding markers
directly at the assembly level, editing the already compiled version. Our
\texttt{gdb} instrumentation procedure also respects this principle of
single-compilation. As \qemu{} breaks the \perf{} interface, we have to run
\gus{} with a preloaded stub shared library to be able to instrument binaries
containing calls to \perf.
\subsection{Prediction lifting and filtering}\label{ssec:harness_lifting}
We finally lift single basic block predictions to a whole-kernel cycle
prediction by summing the block-level results, weighted by the occurrences of
the basic block in the original context (formula above). If an analyzer fails
on one of the basic blocks of a benchmark, the whole benchmark is discarded for
this analyzer.
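For concreteness, this lifting and failure handling can be sketched as follows
(our own illustrative code, not the actual harness; a \texttt{NaN} prediction
marks a block on which the analyzer failed):
\begin{lstlisting}[language={[ANSI]C}]
#include <math.h>
#include <stddef.h>

/* Lift per-block predictions (in cycles) to a
 * whole-kernel cycle count, weighting each block by
 * its measured occurrence count. Returns NAN if the
 * analyzer failed on any block, in which case the
 * benchmark is discarded for this analyzer. */
double lift_prediction(const double *pred,
                       const unsigned long *occurrences,
                       size_t n_blocks)
{
    double total = 0.0;
    for (size_t i = 0; i < n_blocks; i++) {
        if (isnan(pred[i]))
            return NAN;
        total += (double)occurrences[i] * pred[i];
    }
    return total;
}
\end{lstlisting}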
In the presence of complex control flow, \eg{} with conditionals inside loops,
our approach based on basic block occurrences is arguably less precise than an
approach based on path occurrences, as we have less information available
---~for instance, whether a branch is taken with a regular pattern, or whether we
have constraints on register values, etc. We nevertheless chose this block-based
approach, as most throughput prediction tools work at the basic block level, and are
thus readily available and can be directly plugged into our harness.
Finally, we control the proportion of cache misses in the program's execution
using \texttt{Cachegrind}~\cite{valgrind} and \gus; programs that have more
than 15\,\% of cache misses on a warm cache are not considered L1-resident and
are discarded.

\section{Experimental setup and evaluation}\label{sec:exp_setup}
Running the harness described above provides us with 3,500
benchmarks ---~after filtering out non-L1-resident
benchmarks~---, on which each throughput predictor is run. We make the full
output of our tool available in our artifact. Before analyzing these results in
Section~\ref{sec:results_analysis}, we evaluate the relevance of the
methodology presented in Section~\ref{sec:bench_harness} to make the tools'
predictions comparable to baseline hardware counter measures.
\subsection{Experimental environment}
The experiments presented in this paper were all carried out on a Dell PowerEdge
C6420 machine from the Grid5000 cluster~\cite{grid5000}, equipped with 192\,GB
of DDR4 SDRAM ---~only a small fraction of which was used~--- and two Intel
Xeon Gold 6130 CPUs (x86-64, Skylake microarchitecture) with 16 cores each.
The experiments themselves were run inside a Docker environment very close to
our artifact, based on Debian Bullseye. Care was taken to disable
hyperthreading to improve measurement stability. For tools whose output is
based on a direct measurement (\perf, \bhive), the benchmarks were run
sequentially on a single core, with no experiment running on the other cores. No
such care was taken for \gus{} since, although based on a dynamic run, its
prediction is purely a function of recorded program events and not of timing
measurements. All other tools were run in parallel.
We use \llvmmca{} \texttt{v13.0.1}, \iaca{} \texttt{v3.0-28-g1ba2cbb}, \bhive{}
at commit \texttt{5f1d500}, \uica{} at commit \texttt{9cbbe93}, \gus{} at
commit \texttt{87463c9}, \ithemal{} at commit \texttt{b3c39a8}.
\subsection{Comparability of the results}
We define the relative error of a time prediction
$C_\text{pred}$ (in cycles) with respect to a baseline $C_\text{baseline}$ as
\[
\operatorname{err} = \frac{\left| C_\text{pred} - C_\text{baseline}
\right|}{C_\text{baseline}}
\]
We assess the comparability of whole-benchmark measurements, obtained with \perf{}, to
lifted block-based results by studying the statistical distribution of the
relative error of two series: the predictions made by \bhive, and the series of
the best block-based prediction for each benchmark.
We single out \bhive{} as it is the only tool able to \textit{measure}
---~instead of predicting~--- an isolated basic block's timing. This, however, is
not sufficient: as discussed later in Section~\ref{ssec:bhive_errors}, \bhive{}
is not able to yield a result for about $40\,\%$ of the benchmarks, and is
subject to large errors in some cases. For this reason, we also consider, for
each benchmark, the best block-based prediction: we argue that if, for most
benchmarks, at least one of these predictors is able to yield a satisfactorily
accurate result, then the lifting methodology is sound in practice.
The result of this analysis is presented in Table~\ref{table:exp_comparability}
and in Figure~\ref{fig:exp_comparability}. The results are within the range
commonly reported in the field: see \eg{}~\cite{uica}, which reports a
Mean Absolute Percentage Error (MAPE, corresponding to the
``Average'' row) of about 10--15\,\% in many cases. While lifted \bhive's
average error is driven high by large errors on certain benchmarks,
investigated later in this article, its median error is still comparable to the
errors of state-of-the-art tools. From this, we conclude that lifted cycle
measures and predictions are consistent with whole-benchmark measures; and
consequently, lifted predictions can reasonably be compared to one another.
\begin{figure}
\centering
\includegraphics[width=\linewidth]{figs/results_comparability_hist.pdf}
\caption{Relative error distribution \wrt{} \perf}\label{fig:exp_comparability}
\end{figure}
\begin{table}
\centering
\caption{Relative error statistics \wrt{} \perf}\label{table:exp_comparability}
\begin{tabular}{l r r}
\toprule
& \textbf{Best block-based} & \textbf{BHive} \\
\midrule
Datapoints & 3500 & 2198 \\
Errors & 0 & 1302 \\
& (0\,\%) & (37.20\,\%) \\
Average (\%) & 11.60 & 27.95 \\
Median (\%) & 5.81 & 7.78 \\
Q1 (\%) & 1.99 & 3.01 \\
Q3 (\%) & 15.41 & 23.01 \\
\bottomrule
\end{tabular}
\end{table}
\begin{table*}[!htbp]
\centering
\caption{Bottleneck reports from the studied tools}\label{table:coverage}
\begin{tabular}{l | r r r | r r r | r r r}
\toprule
& \multicolumn{3}{c|}{\textbf{Frontend}}
& \multicolumn{3}{c|}{\textbf{Ports}}
& \multicolumn{3}{c}{\textbf{Dependencies}} \\
& \textbf{yes} & \textbf{no} & \textbf{disagr.} & \textbf{yes} & \textbf{no} & \textbf{disagr.} & \textbf{yes} & \textbf{no} & \textbf{disagr.} \\
\midrule
2mm & 34 & 61 & 25.8 \% & 25 & 13 & 70.3 \% & 18 & 29 & 63.3 \% \\
3mm & 44 & 61 & 18.0 \% & 30 & 13 & 66.4 \% & 23 & 37 & 53.1 \% \\
atax & 13 & 72 & 41.0 \% & 25 & 17 & 70.8 \% & 23 & 30 & 63.2 \% \\
bicg & 19 & 59 & 45.8 \% & 25 & 25 & 65.3 \% & 21 & 37 & 59.7 \% \\
doitgen & 51 & 25 & 40.6 \% & 36 & 30 & 48.4 \% & 17 & 22 & 69.5 \% \\
mvt & 27 & 53 & 33.3 \% & 9 & 18 & 77.5 \% & 7 & 32 & 67.5 \% \\
gemver & 62 & 13 & 39.5 \% & 2 & 48 & 59.7 \% & 1 & 28 & 76.6 \% \\
gesummv & 16 & 69 & 41.0 \% & 17 & 23 & 72.2 \% & 24 & 28 & 63.9 \% \\
syr2k & 51 & 37 & 38.9 \% & 8 & 42 & 65.3 \% & 19 & 34 & 63.2 \% \\
trmm & 69 & 27 & 25.0 \% & 16 & 30 & 64.1 \% & 15 & 30 & 64.8 \% \\
symm & 0 & 121 & 11.0 \% & 5 & 20 & 81.6 \% & 9 & 5 & 89.7 \% \\
syrk & 54 & 46 & 30.6 \% & 12 & 42 & 62.5 \% & 20 & 48 & 52.8 \% \\
gemm & 42 & 41 & 42.4 \% & 30 & 41 & 50.7 \% & 16 & 57 & 49.3 \% \\
gramschmidt & 48 & 52 & 21.9 \% & 16 & 20 & 71.9 \% & 24 & 39 & 50.8 \% \\
cholesky & 24 & 72 & 33.3 \% & 0 & 19 & 86.8 \% & 5 & 14 & 86.8 \% \\
durbin & 49 & 52 & 29.9 \% & 0 & 65 & 54.9 \% & 2 & 39 & 71.5 \% \\
trisolv & 53 & 84 & 4.9 \% & 6 & 22 & 80.6 \% & 4 & 16 & 86.1 \% \\
jacobi-1d & 18 & 78 & 33.3 \% & 66 & 9 & 47.9 \% & 0 & 13 & 91.0 \% \\
heat-3d & 32 & 8 & 72.2 \% & 26 & 0 & 81.9 \% & 0 & 0 & 100.0 \% \\
seidel-2d & 0 & 112 & 22.2 \% & 32 & 0 & 77.8 \% & 0 & 0 & 100.0 \% \\
fdtd-2d & 52 & 22 & 47.1 \% & 20 & 41 & 56.4 \% & 0 & 40 & 71.4 \% \\
jacobi-2d & 6 & 31 & 73.6 \% & 24 & 61 & 39.3 \% & 0 & 44 & 68.6 \% \\
adi & 12 & 76 & 21.4 \% & 40 & 0 & 64.3 \% & 0 & 0 & 100.0 \% \\
correlation & 18 & 36 & 51.8 \% & 19 & 30 & 56.2 \% & 23 & 45 & 39.3 \% \\
covariance & 39 & 36 & 37.5 \% & 4 & 34 & 68.3 \% & 19 & 53 & 40.0 \% \\
floyd-warshall & 74 & 16 & 29.7 \% & 16 & 24 & 68.8 \% & 20 & 8 & 78.1 \% \\
\textbf{Total} & 907 & 1360 & 35.2 \% & 509 & 687 & 65.8 \% & 310 & 728 & 70.3 \% \\
\bottomrule
\end{tabular}
\end{table*}
\subsection{Relevance and representativity (bottleneck
analysis)}\label{ssec:bottleneck_diversity}
The results provided by our harness are only relevant to evaluate the parts of
the tools' models that are stressed by the benchmarks generated; it is hence
critical that our benchmark generation procedure in Section~\ref{sec:bench_gen}
yields diverse results. This should be true by construction, as the various
polyhedral compilation techniques used stress different parts of the
microarchitecture.
To assess this, we study the generated benchmarks' bottlenecks, \ie{}
architectural resources on which a release of pressure improves execution time.
Note that a saturated resource is not necessarily a bottleneck: a code that
uses \eg{} 100\,\% of the available arithmetic units for computations outside
of the critical path, at a point where a chain of dependencies is blocking,
will not run faster if those arithmetic operations are removed; hence, hardware
counters alone are not sufficient to find bottlenecks.
However, some static analyzers report the bottlenecks they detect. To unify
their results and keep things simple, we study three general kinds of
bottlenecks.
\begin{itemize}
\item{} \emph{Frontend:} the CPU's frontend is not able to issue
micro-operations to the backend fast enough. \iaca{} and \uica{} are
able to detect this.
\item{} \emph{Ports:} at least one of the backend ports has too much work;
reducing its pressure would accelerate the computation.
\llvmmca, \iaca{} and \uica{} are able to detect this.
\item{} \emph{Dependencies:} there is a chain of data dependencies slowing
down the computation.
\llvmmca, \iaca{} and \uica{} are able to detect this.
\end{itemize}
For each source benchmark from Polybench and each type of bottleneck, we report
in Table~\ref{table:coverage} the number of derived benchmarks on which all the
tools agree that the bottleneck is present or absent. We also report the
proportion of cases in which the tools failed to agree. We analyze those
results later in Section~\ref{ssec:bottleneck_pred_analysis}.
As we have no source of truth indicating whether a bottleneck is effectively
present in a microbenchmark, we adopt a conservative approach, and consider
only the subset of the microbenchmarks on which the tools agree on the status
of all three resources; for those, we have good confidence in the bottlenecks
reported. Obviously, this approach is limited, because it excludes
microbenchmarks that might be worth considering, and is most probably subject
to selection bias.
Of the 3,500 microbenchmarks we have generated, 261 (7.5\,\%) are the subject
of the above-mentioned consensus. This sample is made up of microbenchmarks
generated from 21 benchmarks ---~\ie{} for 5 benchmarks, none of the derived
microbenchmarks reached a consensus among the tools~---, yielding a wide
variety of calculations, including floating-point arithmetic, pointer
arithmetic or Boolean arithmetic. Of these, 200 (76.6\,\%) are bottlenecked on
the CPU front-end, 19 (7.3\,\%) on back-end ports, and 81 (31.0\,\%) on latency
introduced by dependencies (a single benchmark may exhibit several bottlenecks
at once). As mentioned above, this distribution
probably does not reflect the distribution among the 3,500 original
benchmarks, as the 261 were not uniformly sampled. However, we argue that, as
all categories are represented in the sample, the initial hypothesis that the
generated benchmarks are diverse and representative is supported ---~thanks to
the transformations described in Section~\ref{sec:bench_gen}.
\subsection{Carbon footprint}
Generating and running the full suite of benchmarks required about 30\,h of
continuous computation on a single machine. During the experiments, the power
supply units reported a near-constant consumption of about 350\,W. The carbon
intensity of the power grid for the region where the experiment was run, at the
time of the experiments, was about 29\,g\coeq/kWh~\cite{electricitymaps}.
The electricity consumed directly by the server thus amounts to about
10.50\,kWh. Assuming a Power Usage Efficiency of 1.5, the total electricity
consumption roughly amounts to 15.75\,kWh, or about 450\,g\coeq.
A carbon footprint estimate of the machine's manufacture itself was conducted
by the manufacturer~\cite{poweredgeC6420lca}. Additionally accounting for the
extra 160\,GB of DDR4 SDRAM~\cite{meta_ACT}, the hardware manufacturing,
transport and end-of-life is evaluated to 1,266\,kg\coeq. In 2023, this
computation cluster's usage rate was 35\,\%. Assuming 6 years of product life,
30\,h of usage represents about 2,050\,g\coeq{}. The whole experiment thus amounts to
2.5\,kg\coeq.
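Making the intermediate arithmetic explicit (our own reconstruction, using a
365.25-day year): $350 \times 30 = 10{,}500$\,Wh; $10.5 \times 1.5 \times 29
\approx 457$\,g\coeq; $1{,}266{,}000 \times \frac{30}{0.35 \times 6 \times
8{,}766} \approx 2{,}063$\,g\coeq; and $0.46 + 2.06 \approx 2.5$\,kg\coeq.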

\section{Results analysis}\label{sec:results_analysis}
The raw complete output from our benchmarking harness ---~roughly speaking, a
large table with, for each benchmark, a cycle measurement, cycle count for each
throughput analyzer, the resulting relative error, and a synthesis of the
bottlenecks reported by each tool~--- enables many analyses that, we believe,
could be useful both to throughput analysis tool developers and to their users. Tool
designers can gain insight into their tool's strengths and weaknesses, and
work towards improving it with a clearer vision. Users can gain a better
understanding of which tool is more suited for each situation.
\subsection{Throughput results}\label{ssec:overall_results}
\begin{table*}
\centering
\caption{Statistical analysis of overall results}\label{table:overall_analysis_stats}
\begin{tabular}{l r r r r r r r r r}
\toprule
\textbf{Bencher} & \textbf{Datapoints} & \textbf{Failures} & \textbf{(\%)} &
\textbf{MAPE} & \textbf{Median} & \textbf{Q1} & \textbf{Q3} & \textbf{K. tau} & \textbf{Time (CPU$\cdot$h)}\\
\midrule
BHive & 2198 & 1302 & (37.20\,\%) & 27.95\,\% & 7.78\,\% & 3.01\,\% & 23.01\,\% & 0.81 & 1.37\\
llvm-mca & 3500 & 0 & (0.00\,\%) & 36.71\,\% & 27.80\,\% & 12.92\,\% & 59.80\,\% & 0.57 & 0.96 \\
UiCA & 3500 & 0 & (0.00\,\%) & 29.59\,\% & 18.26\,\% & 7.11\,\% & 52.99\,\% & 0.58 & 2.12 \\
Ithemal & 3500 & 0 & (0.00\,\%) & 57.04\,\% & 48.70\,\% & 22.92\,\% & 75.69\,\% & 0.39 & 0.38 \\
Iaca & 3500 & 0 & (0.00\,\%) & 30.23\,\% & 18.51\,\% & 7.13\,\% & 57.18\,\% & 0.59 & 1.31 \\
Gus & 3500 & 0 & (0.00\,\%) & 20.37\,\% & 15.01\,\% & 7.82\,\% & 30.59\,\% & 0.82 & 188.04 \\
\bottomrule
\end{tabular}
\end{table*}
The distribution of relative errors, for each tool, is presented as a
box plot in Figure~\ref{fig:overall_analysis_boxplot}. Statistical indicators
are also given in Table~\ref{table:overall_analysis_stats}. We also give, for
each tool, its Kendall's tau indicator~\cite{kendall1938tau}: this indicator,
used to evaluate \eg{} \uica~\cite{uica} and Palmed~\cite{palmed}, measures how
well the pair-wise ordering of benchmarks is preserved, $-1$ being a full
anti-correlation and $1$ a full correlation. This is especially useful when one
is not interested in a program's absolute throughput, but rather in comparing
which program has a better throughput.
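For reference, in its simplest form (ignoring tie corrections, which
implementations may apply), Kendall's $\tau$ over $n$ benchmarks compares every
pair of benchmarks under both orderings:
\[
\tau = \frac{\#\{\text{concordant pairs}\} - \#\{\text{discordant pairs}\}}{n(n-1)/2}
\]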
\begin{figure}
\includegraphics[width=\linewidth]{figs/overall_analysis_boxplot.pdf}
\caption{Statistical distribution of relative errors}\label{fig:overall_analysis_boxplot}
\end{figure}
These results are, overall, significantly worse than what each tool's article
presents. We attribute this difference mostly to the specificities of
Polybench: being composed of computation kernels, it intrinsically stresses the
CPU more than basic blocks extracted from the SPEC benchmark suite. This
difference is clearly reflected in the experimental section of the Palmed
article~\cite{palmed}: the accuracy of most tools is worse on Polybench than on
SPEC, often by more than a factor of two.
As \bhive{} and \ithemal{} do not support control flow instructions
(\eg{} \texttt{jump} instructions), those had
to be removed from the blocks before analysis. While none of these tools, apart
from \gus{} ---~which is dynamic~---, is able to account for branching costs,
these two analyzers are additionally unable to account for the front- and back-end
cost of the control flow instructions themselves ---~corresponding to the
$TP_U$ mode introduced by \uica~\cite{uica}, while the others
measure $TP_L$.
\subsection{Understanding \bhive's results}\label{ssec:bhive_errors}
The error distribution of \bhive{} against \perf{}, plotted on the right of
Figure~\ref{fig:exp_comparability}, reveals irregularities in \bhive's
results. Since \bhive{} is based on measures ---~instead of predictions~---
through hardware counters, one would expect excellent accuracy. Its lack of
support for control flow instructions can account for a portion of
this accuracy drop; our lifting method, based on block occurrences instead of
paths, can explain another portion. We also find that \bhive{} fails to produce
a result for about 40\,\% of the kernels explored ---~which means that, for those
cases, \bhive{} failed to produce a result on at least one of the constituent
basic blocks. This is due to the difficulties, mentioned in
Section~\ref{sec:intro}, related to the need to reconstruct the context of each
basic block \textit{ex nihilo}.
The basis of \bhive's method is to run the code to be measured, unrolled a
number of times depending on the code size, with all memory pages but the
code unmapped. As the code tries to access memory, it will raise segfaults,
caught by \bhive's harness, which allocates a single shared-memory page, filled
with a repeated constant, that it will map wherever segfaults occur before
restarting the program.
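A heavily simplified sketch of this mechanism is given below (our own
illustration, not \bhive's actual code, which maps a single shared page and
restarts the measurement after each fault):
\begin{lstlisting}[language={[ANSI]C}]
#include <signal.h>
#include <string.h>
#include <stdint.h>
#include <sys/mman.h>

#define PAGE 4096

/* Map a constant-filled page over whichever page
 * just faulted; 0x23 stands in for the repeated
 * constant used to fill the page. */
static void on_segv(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)ctx;
    void *page = (void *)((uintptr_t)si->si_addr
                          & ~(uintptr_t)(PAGE - 1));
    void *p = mmap(page, PAGE, PROT_READ | PROT_WRITE,
                   MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS,
                   -1, 0);
    if (p != MAP_FAILED)
        memset(p, 0x23, PAGE);
    /* if mmap fails (e.g. below mmap_min_addr),
     * the measurement has to be aborted */
}

void install_segv_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);
}
\end{lstlisting}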
The main causes of \bhive{} failure are bad code behaviour (\eg{} control flow
not reaching the exit point of the measure if a bad jump is inserted), too many
segfaults to be handled, or a segfault that occurs even after mapping a page at
the problematic address.
The registers are also initialized, at the beginning of the measurement, to the
fixed constant \texttt{0x2324000}. We show through two examples that this
initial value can be of crucial importance.
The following experiments are executed on an Intel(R) Xeon(R) Gold 6230R CPU
(Cascade Lake), with hyperthreading disabled.
\paragraph{Imprecise analysis} We consider the following x86-64 kernel.
\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
vmulsd (%rax), %xmm3, %xmm0
vmovsd %xmm0, (%r10)
\end{lstlisting}
\end{minipage}
When executed with all the general purpose registers initialized to the default
constant, \bhive{} reports 9 cycles per iteration, since \reg{rax} and
\reg{r10} hold the same value, inducing a read-after-write dependency between
the two instructions. If, however, \bhive{} is tweaked to initialize \reg{r10}
to a value that aliases (\wrt{} physical addresses) with the value in
\reg{rax}, \eg{} between \texttt{0x10000} and \texttt{0x10007} (inclusive), it
reports 19 cycles per iteration instead; while a value between \texttt{0x10008}
and \texttt{0x1009f} (inclusive) yields the expected 1 cycle ---~except for
values in \texttt{0x10039}-\texttt{0x1003f} and
\texttt{0x10079}-\texttt{0x1007f}, yielding 2 cycles as the store crosses a
cache line boundary.
In the same way, the value used to initialize the shared memory page can
influence the results whenever it gets loaded into registers.
\vspace{0.5em}
\paragraph{Failed analysis} Some memory accesses will always result in an
error; for instance, it is impossible to \texttt{mmap} at an address lower
than \texttt{/proc/sys/vm/mmap\_min\_addr}, defaulting to \texttt{0x10000}. Thus,
with equal initial values for all registers, the following kernel would fail,
since the second operation attempts to load at address 0:
\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
subq %r11, %r10
movq (%r10), %rax
\end{lstlisting}
\end{minipage}
Such errors can occur in more convoluted ways. The following x86-64 kernel,
for instance, is extracted from a version of the \texttt{durbin}
kernel\footnote{\texttt{durbin.pocc.noopt.default.unroll8.MEDIUM.kernel21.s}
in the full results}.
\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
vmovsd 0x10(%r8, %rcx), %xmm6
subl %eax, %esi
movslq %esi, %rsi
vfmadd231sd -8(%r9, %rsi, 8), \
%xmm6, %xmm0
\end{lstlisting}
\end{minipage}
Here, \bhive{} fails to measure the kernel when run with the general purpose
registers initialized to the default constant at the 2\textsuperscript{nd}
occurrence of the unrolled loop body, failing to recover from an error at the
\texttt{vfmadd231sd} instruction with the \texttt{mmap} strategy. Indeed, after
the first iteration the value in \reg{rsi} becomes zero, then negative at the
second iteration; thus, the second occurrence of the last instruction fetches
at address \texttt{0xfffffffff0a03ff8}, which is in kernel space. This
microkernel can be benchmarked with \bhive{} \eg{} by initializing \reg{rax} to 1.
Some other microkernels fail in a similar way when trying to access addresses
that are not in the \emph{canonical form} virtual address space for x86-64 with
48-bit virtual addresses, as defined in Section~3.3.7.1 of Intel's Software
Developer's Manual~\cite{ref:intel64_software_dev_reference_vol1} and
Section~5.3.1 of the AMD64 Architecture Programmer's
Manual~\cite{ref:amd64_architecture_dev_reference_vol2}. Others still fail with
accesses relative to the instruction pointer, as \bhive{} read-protects the
unrolled microkernel's instructions page.
\subsection{Bottleneck prediction}\label{ssec:bottleneck_pred_analysis}
As introduced in Section~\ref{ssec:bottleneck_diversity}, some of
the tools studied are also able to report suspected bottlenecks for the
evaluated program; these reports are summarized in Table~\ref{table:coverage}.
This feature might be even more useful than raw throughput predictions to the
users of these tools willing to optimize their program, as it strongly hints
towards what needs to be enhanced.
In the majority of the cases studied, the tools are not able to agree on the
presence or absence of a type of bottleneck. Although it might seem that the
tools are performing better on frontend bottleneck detection, it must be
recalled that only two tools (versus three in the other cases) are reporting
frontend bottlenecks, thus making it easier for them to agree.
\begin{table}
\centering
\caption{Diverging bottleneck prediction per tool}\label{table:bottleneck_diverging_pred}
\begin{tabular}{l r r r r}
\toprule
\textbf{Tool}
& \multicolumn{2}{c}{\textbf{Ports}}
& \multicolumn{2}{c}{\textbf{Dependencies}} \\
\midrule
\llvmmca{} & 567 & (24.6 \%) & 1032 & (41.9 \%) \\
\uica{} & 516 & (22.4 \%) & 530 & (21.5 \%) \\
\iaca{} & 1221 & (53.0 \%) & 900 & (36.6 \%) \\
\bottomrule
\end{tabular}
\end{table}
Table~\ref{table:bottleneck_diverging_pred}, in turn, breaks down the cases
on which the three tools disagree into the number of times each tool makes a
diverging prediction ---~\ie{} predicts differently from the other
two. In the case of ports, \iaca{} is responsible for half of the
divergences ---~which is not sufficient to conclude that the prediction of the
other tools is correct. In the case of dependencies, however, there is no clear
outlier, even though \uica{} seems to fare better than others.
In no case does a single tool seem to be responsible for the vast majority of
disagreements, which would have hinted towards it failing to predict this
bottleneck correctly. In the absence of a source of truth indicating whether a bottleneck
is effectively present, and with no clear-cut result for (a subset of) tool
predictions, we cannot conclude on the quality of the predictions from each
tool for each kind of bottleneck.
\subsection{Impact of dependency-boundness}\label{ssec:memlatbound}
\begin{table*}
\centering
\caption{Statistical analysis of overall results, excluding rows latency-bound
through memory-carried dependencies}\label{table:nomemdeps_stats}
\begin{tabular}{l r r r r r r r r}
\toprule
\textbf{Bencher} & \textbf{Datapoints} & \textbf{Failures} & \textbf{(\%)} &
\textbf{MAPE} & \textbf{Median} & \textbf{Q1} & \textbf{Q3} & \textbf{K. tau}\\
\midrule
BHive & 1365 & 1023 & (42.84\,\%) & 34.07\,\% & 8.62\,\% & 4.30\,\% & 24.25\,\% & 0.76\\
llvm-mca & 2388 & 0 & (0.00\,\%) & 27.06\,\% & 21.04\,\% & 9.69\,\% & 32.73\,\% & 0.79\\
UiCA & 2388 & 0 & (0.00\,\%) & 18.42\,\% & 11.96\,\% & 5.42\,\% & 23.32\,\% & 0.80\\
Ithemal & 2388 & 0 & (0.00\,\%) & 62.66\,\% & 53.84\,\% & 24.12\,\% & 81.95\,\% & 0.40\\
Iaca & 2388 & 0 & (0.00\,\%) & 17.55\,\% & 12.17\,\% & 4.64\,\% & 22.35\,\% & 0.82\\
Gus & 2388 & 0 & (0.00\,\%) & 23.18\,\% & 20.23\,\% & 8.78\,\% & 32.73\,\% & 0.83\\
\bottomrule
\end{tabular}
\end{table*}
An overview of the full results table (available in our artifact) hints towards
two main tendencies: on a significant number of rows, the static tools
---~thus leaving \gus{} and \bhive{} apart~---, except for \ithemal, often yield
comparatively bad throughput predictions \emph{together}; and many of these
rows are those using the \texttt{O1} and \texttt{O1autovect} compilation
settings (\texttt{gcc} with \texttt{-O1}, plus vectorization options for the
latter).
To confirm the first observation, we look at the 30\,\% worst benchmarks ---~in
terms of MAPE relative to \perf~--- for \llvmmca, \uica{} and \iaca{}
---~yielding 1050 rows each. These three sets share 869 rows in common (82.8\,\%), which we
call \textit{jointly bad rows}.
Among these 869 jointly bad rows, we further find that respectively 342
(39.4\,\%) and 337 (38.8\,\%) are compiled using the \texttt{O1} and
\texttt{O1autovect} settings, totalling 679 (78.1\,\%) \texttt{O1}-based rows,
against 129 (14.8\,\%) for \texttt{default} and 61 (7.0\,\%) for
\texttt{O3nosimd}. This result is significant enough to be used as a hint to
investigate the issue.
\begin{figure}
\includegraphics[width=\linewidth]{figs/nomemdeps_boxplot.pdf}
\caption{Statistical distribution of relative errors, with and without
pruning the rows latency-bound through memory-carried dependencies}\label{fig:nomemdeps_boxplot}
\end{figure}
Insofar as our approach maintains a strong link between the basic blocks studied and
the source codes from which they are extracted, it is possible to identify the
high-level characteristics of the concerned microbenchmarks.
In the overwhelming majority (97.5\,\%) of those jointly bad rows, the tools predicted
fewer cycles than measured, meaning that a bottleneck is either missed or
underestimated.
Manual investigation of a few simple benchmarks (no polyhedral transformation
applied, \texttt{O1} mode, not unrolled) further hints towards dependencies:
for instance, the \texttt{gemver} benchmark, which is \emph{not} among the
badly predicted benchmarks, has this kernel:
\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[ANSI]C}]
for(c3)
A[c1][c3] += u1[c1] * v1[c3]
+ u2[c1] * v2[c3];
\end{lstlisting}
\end{minipage}
while the \texttt{atax} benchmark, which is among the badly predicted ones, has
this kernel:
\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[ANSI]C}]
for(c3)
tmp[c1] += A[c1][c3] * x[c3];
\end{lstlisting}
\end{minipage}
The first one exhibits no obvious dependency-boundness, while the second,
accumulating on \texttt{tmp[c1]} (independent of the iteration variable), lacks
instruction-level parallelism. Among the simple benchmarks (as described
above), 8 are in the badly predicted list, all of which exhibit a
read-after-write data dependency to the preceding iteration.
Looking at the assembly code generated for those in \texttt{O1} modes, it
appears that the dependencies exhibited at the C level are compiled to
\emph{memory-carried} dependencies: the read-after-write happens for a given
memory address, instead of for a register. This kind of dependency, prone to
aliasing and dependent on the values of the registers, is hard to infer for a
static tool and is not supported by the analyzers under scrutiny in the general
case; it could thus reasonably explain the results observed.
There is no easy way, however, to know for certain which of the 3,500 benchmarks
are latency bound: no hardware counter reports this. We investigate this
further using \gus's sensitivity analysis: in addition to the ``normal''
throughput estimation of \gus, we run it a second time, disabling the
accounting of latency through memory dependencies. By construction, this second
measurement should be
either very close to the first one, or significantly below. We then assume a
benchmark to be latency bound due to memory-carried dependencies when it is at
least 40\,\% faster when this latency is disabled; there are 1112 (31.8\,\%) such
benchmarks.
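Formally, writing $C$ for the baseline \gus{} estimate of a benchmark and
$C_{\text{nolat}}$ for its estimate with memory-carried dependency latencies
disabled, we flag the benchmark as latency bound through memory-carried
dependencies when (our formalisation of ``at least 40\,\% faster''):
\[
\frac{C}{C_{\text{nolat}}} \geq 1.4
\]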
Of the 869 jointly bad rows, 745 (85.7\,\%) are declared latency
bound through memory-carried dependencies by \gus. We conclude that the main
reason for these jointly badly predicted benchmarks is that the predictors
under scrutiny failed to correctly detect these dependencies.
In Section~\ref{ssec:overall_results}, we presented in
Figure~\ref{fig:overall_analysis_boxplot} and
Table~\ref{table:overall_analysis_stats} general statistics on the tools
on the full set of benchmarks. We now remove the 1112 benchmarks
flagged as latency bound through memory-carried dependencies by \gus{} from the
dataset, and present in Figure~\ref{fig:nomemdeps_boxplot} a comparative
box plot for the tools under scrutiny. We also present in
Table~\ref{table:nomemdeps_stats} the same statistics on this pruned dataset.
While the results for \llvmmca, \uica{} and \iaca{} globally improve
significantly, the most noticeable changes are the reduced spread of the
results and the increase of the Kendall's $\tau$ correlation coefficient.
From this,
we argue that detecting memory-carried dependencies is a weak point in current
state-of-the-art static analyzers, and that their results could be
significantly more accurate if improvements are made in this direction.

\section{Conclusion and future works}
In this article, we have presented a fully-tooled approach that enables:
\begin{itemize}
\item the generation of a wide variety of microbenchmarks, reflecting both the
expertise contained in an initial benchmark suite, and the diversity of
code transformations that stress different aspects of a performance model
---~or even a measurement environment, \eg{} \bhive; and
\item the comparability of various measurements and
analyses applied to each of these microbenchmarks.
\end{itemize}
Thanks to this tooling, we were able to show the limits and strengths of
various performance models in relation to the expertise contained in the
Polybench suite. We discuss throughput results in
Section~\ref{ssec:overall_results} and bottleneck prediction in
Section~\ref{ssec:bottleneck_pred_analysis}.
We were also able to demonstrate the difficulties of reasoning at the level of
a basic block isolated from its context. We specifically study those
difficulties in the case of \bhive{} in Section~\ref{ssec:bhive_errors}.
Indeed, the actual values ---~both in registers and in memory~--- involved in a
basic block's computation determine not only its functional
properties (\ie{} the result of the calculation), but also some of its
non-functional properties (\eg{} latency, throughput).
We were also able to show in Section~\ref{ssec:memlatbound}
that state-of-the-art static analyzers struggle to
account for memory-carried dependencies; a weakness significantly impacting
their overall results on our benchmarks. We believe that detecting
and accounting for these dependencies is an important direction for future work.
Moreover, we present this work in the form of a modular software package, each
component of which exposes numerous adjustable parameters. These components can
also be replaced by others fulfilling the same abstract function: another
initial benchmark suite in place of Polybench, other loop nest
optimizers in place of \textsc{Pluto} and PoCC, other code
analyzers, and so on. This software modularity reflects the fact that our
contribution lies in the interfacing of, and communication between, distinct concerns.
\medskip
Furthermore, we believe that the contributions we made in the course of this work
may eventually be used to address different, yet neighbouring, issues.
These perspectives can also be seen as future works:
\smallskip
\paragraph{Program optimization} The whole program processing we have designed
can be used not only to evaluate the performance model underlying a static
analyzer, but also to guide program optimization itself. In such a perspective,
we would generate different versions of the same program using the
transformations discussed in Section~\ref{sec:bench_gen} and colored blue in
Figure~\ref{fig:contrib}. These different versions would then feed the
execution and measurement environment outlined in
Section~\ref{sec:bench_harness} and colored orange in Figure~\ref{fig:contrib}.
Indeed, thanks to our previous work, we know that the results of these
comparable analyses and measurements would make it possible to identify which
version is the most efficient, and even to reconstruct information indicating
why (which bottlenecks, etc.).
However, this approach would require that these different versions of the same
program are functionally equivalent, \ie{} that they compute the same
result from the same inputs; yet we saw in Section~\ref{sec:bench_gen}
that, as it stands, the transformations we apply are not concerned with
preserving the semantics of the input codes. To recover this semantic
preservation property, it suffices to abandon the kernelification pass we have
presented; this, however, would require controlling L1-residence by other means.
\smallskip
\paragraph{Dataset building} Our microbenchmark generation phase outputs a
large, diverse and representative dataset of microkernels. In addition to our
harness, we believe that such a dataset could be used to improve existing
data-dependent solutions.
%the measurement and execution environment we
%propose is not the only type of tool whose function is to process a large
%dataset (\ie{} the microbenchmarks generated earlier) to automatically
%abstract its characteristics. We can also think of:
Inductive methods, for instance in \anica, strive to preserve the properties of a basic
block through successive abstractions of the instructions it contains, so as to
draw the most general conclusions possible from a particular experiment.
Currently, \anica{} starts off from randomly generated basic blocks. This
approach guarantees a certain variety, and avoids
over-specialization, which would prevent it from finding interesting cases too
far from an initial dataset. However, it may well lead to the sample under
consideration being systematically outside the relevant area of the search
space ---~\ie{} having no relation to real-life programs or those in the user's
field.
On the other hand, machine learning methods based on neural networks, for
instance in \ithemal, seek to correlate the result of a function with the
characteristics of its input ---~in this case to correlate a throughput
prediction with the instructions making up a basic block~--- by backpropagating
the gradient of a cost function. In the case of \ithemal{}, the model is trained on
basic blocks originating from an existing benchmark suite. As opposed to random generation,
this approach offers representative samples, but comes with a risk of lack of
variety and of over-specialization.
Comparatively, our microbenchmark generation method is natively meant to
produce a representative, varied and large dataset. We believe that
enriching the dataset of the above-mentioned methods with our benchmarks might
extend their results and reach.

%% \section*{Conclusion}
%% \todo{}

\chapter{A more systematic approach to throughput prediction performance analysis}
\input{00_intro.tex}
\input{05_related_works.tex}
\input{10_bench_gen.tex}
\input{20_evaluation.tex}
\input{25_results_analysis.tex}
\input{30_future_works.tex}
\input{99_conclusion.tex}

\begin{figure*}[ht!]
\definecolor{col_bench_gen}{HTML}{5a7eff}
\definecolor{col_bench_gen_bg}{HTML}{dbeeff}
\definecolor{col_bench_harness}{HTML}{ffa673}
\definecolor{col_results}{HTML}{000000}
\centerline{
\begin{tikzpicture}[
hiddennode/.style={rectangle,draw=white, very thick, minimum size=5mm, align=center, font=\footnotesize},
normnode/.style={rectangle,draw=black, very thick, minimum size=5mm, align=center, font=\footnotesize},
resultnode/.style={rectangle,draw=col_results, fill=black!2, very thick, minimum size=5mm, align=center, font=\footnotesize},
bluenode/.style={rectangle, draw=col_bench_gen, fill=col_bench_gen_bg, very thick, minimum height=5mm, minimum width=4cm, align=center, font=\footnotesize},
rednode/.style={rectangle, draw=col_bench_harness, fill=orange!5, very thick, minimum size=5mm, align=center, font=\footnotesize},
bencher/.style={rednode, minimum width=2.5cm, minimum height=5mm},
genarrow/.style={draw=col_bench_gen},
harnarrow/.style={draw=col_bench_harness},
]
\centering
%Nodes
\node[bluenode] (bench) {Benchmark suite \figref{ssec:bench_suite}};
\node[bluenode] (pocc) [below=of bench] {Loop nest optimizers \figref{ssec:loop_nest_optimizer}};
\node[bluenode] (kernel) [below=of pocc] {Constraining utility \figref{ssec:kernelify}};
\node[bluenode] (gcc) [below=of kernel] {Compilations \figref{ssec:compile}};
\node[rednode] (gdb) [right=0.1\textwidth of gcc] {Basic block \\extraction \figref{ssec:bb_extr}};
\node[bencher] (ithemal) [right=4cm of gdb] {Ithemal};
\node[bencher] (iaca) [above=0.5em of ithemal] {IACA};
\node[bencher] (uica) [above=0.5em of iaca] {uiCA};
\node[bencher] (llvm) [above=0.5em of uica] {llvm-mca};
\node[bencher] (bhive) [above=0.5em of llvm] {BHive (measure)};
\node[rednode] (ppapi) [left=1cm of bhive] {perf (measure)};
\node[rednode] (gus) [below=0.5em of ppapi] {Gus};
%% \node[rednode] (uica) [below=of gdb] {uiCA};
\node[rednode] (lifting) [right=of bhive] {
Prediction lifting\\\figref{ssec:harness_lifting}};
\node[
draw=black,
very thick,
dotted,
fit=(ppapi) (gus) (bhive) (llvm) (uica) (iaca) (ithemal)
] (comps) {};
\node (throughput_label) [above=0.2em of comps,align=center] {
\footnotesize Throughput predictions \\\footnotesize \& measures
\figref{ssec:throughput_pred_meas}};
\node[draw=black,
very thick,
dotted,
%% label={below:\footnotesize Variations},
label={[above,xshift=1cm]\footnotesize Variations},
fit=(pocc) (kernel) (gcc)
] (vars) {};
\node[resultnode] (bench2) [below=of lifting] {Evaluation metrics \\ for
code analyzers};
% Key
\node[] (keyblue1) [below left=0.7cm and 0cm of vars] {};
\node[hiddennode] (keyblue2) [right=0.5cm of keyblue1] {Section~\ref{sec:bench_gen}: generating microbenchmarks};
\node[] (keyred1) [right=0.6cm of keyblue2] {};
\node[hiddennode] (keyred2) [right=0.5cm of keyred1] {Section~\ref{sec:bench_harness}: benchmarking harness};
\node[] (keyresult1) [right=0.6cm of keyred2] {};
\node[hiddennode] (keyresult2) [right=0.5cm of keyresult1]
{Section~\ref{sec:results_analysis}: results analysis};
%Lines
\draw[-, very thick, harnarrow] (keyred1.east) -- (keyred2.west);
\draw[-, very thick, genarrow] (keyblue1.east) -- (keyblue2.west);
\draw[-, very thick] (keyresult1.east) -- (keyresult2.west);
\draw[->, very thick, genarrow] (bench.south) -- (pocc.north);
\draw[->, very thick, genarrow] (pocc.south) -- (kernel.north);
\draw[->, very thick, genarrow] (kernel.south) -- (gcc.north);
\draw[->, very thick, genarrow] (gcc.east) -- (gdb.west);
\draw[->, very thick, genarrow] (gcc.east) -- (ppapi.west);
\draw[->, very thick, genarrow] (gcc.east) -- (gus.west);
\draw[->, very thick, harnarrow] (gdb.east) -- (uica.west);
\draw[->, very thick, harnarrow] (gdb.east) -- (iaca.west);
\draw[->, very thick, harnarrow] (gdb.east) -- (ithemal.west);
\draw[->, very thick, harnarrow] (gdb.east) -- (bhive.west);
\draw[->, very thick, harnarrow] (gdb.east) -- (llvm.west);
\draw[->, very thick, harnarrow] (comps.east|-lifting) -- (lifting.west);
\draw[->, very thick] (lifting.south) -- (bench2.north);
\end{tikzpicture}
}
\caption{Our analysis and measurement environment.\label{fig:contrib}}
\end{figure*}

!*.pdf
