From f6f0336b34a9fd371efc9ad11a38a9d44e279677 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Th=C3=A9ophile=20Bastian?= <contact@tobast.fr>
Date: Mon, 25 Sep 2023 17:41:37 +0200
Subject: [PATCH] CesASMe: first adaptations

---
 manuscrit/50_CesASMe/00_intro.tex            | 51 ++++++--------------
 manuscrit/50_CesASMe/10_bench_gen.tex        |  6 ++-
 manuscrit/50_CesASMe/15_harness.tex          |  2 +-
 manuscrit/50_CesASMe/20_evaluation.tex       | 11 ++---
 manuscrit/50_CesASMe/25_results_analysis.tex | 46 ++++++++++--------
 manuscrit/50_CesASMe/main.tex                |  1 +
 manuscrit/50_CesASMe/overview.tex            | 12 ++---
 manuscrit/include/macros.tex                 | 16 ++++++
 manuscrit/include/packages.tex               |  4 ++
 9 files changed, 79 insertions(+), 70 deletions(-)

diff --git a/manuscrit/50_CesASMe/00_intro.tex b/manuscrit/50_CesASMe/00_intro.tex
index 8e70c60..02afaf0 100644
--- a/manuscrit/50_CesASMe/00_intro.tex
+++ b/manuscrit/50_CesASMe/00_intro.tex
@@ -1,20 +1,3 @@
-\begin{abstract}
-    A variety of code analyzers, such as \iaca, \uica, \llvmmca{} or
-    \ithemal{}, strive to statically predict the throughput of a computation
-    kernel. Each analyzer is based on its own simplified CPU model
-    reasoning at the scale of an isolated basic block.
-    Facing this diversity, evaluating their strengths and
-    weaknesses is important to guide both their usage and their enhancement.
-
-    We argue that reasoning at the scale of a single basic block is not
-    always sufficient and that a lack of context can mislead analyses. We present \tool, a fully-tooled
-    solution to evaluate code analyzers on C-level benchmarks. It is composed of a
-    benchmark derivation procedure that feeds an evaluation harness. We use it to
-    evaluate state-of-the-art code analyzers and to provide insights on their
-    precision. We use \tool's results to show that memory-carried data
-    dependencies are a major source of imprecision for these tools.
-\end{abstract}
-
 \section{Introduction}\label{sec:intro}
 
 At a time when software is expected to perform more computations, faster and in
@@ -23,14 +6,14 @@ in particular the CPU resources) they consume are very useful to guide their
 optimization. This need is reflected in the diversity of binary or assembly
 code analyzers following the deprecation of \iaca~\cite{iaca}, which Intel has
 maintained through 2019. Whether it is \llvmmca{}~\cite{llvm-mca},
-\uica{}~\cite{uica}, \ithemal~\cite{ithemal} or \gus~\cite{phd:gruber}, all these tools strive to extract various
-performance metrics, including the number of CPU cycles a computation kernel will take
----~which roughly translates to execution time.
-In addition to raw measurements (relying on hardware counters), these model-based analyses provide
-higher-level and refined data, to expose the bottlenecks and guide the
-optimization of a given code. This feedback is useful to experts optimizing
-computation kernels, including scientific simulations and deep-learning
-kernels.
+\uica{}~\cite{uica}, \ithemal~\cite{ithemal} or \gus~\cite{phd:gruber}, all
+these tools strive to extract various performance metrics, including the number
+of CPU cycles a computation kernel will take ---~which roughly translates to
+execution time. In addition to raw measurements (relying on hardware
+counters), these model-based analyses provide higher-level, refined data to
+expose the bottlenecks and guide the optimization of a given code. This
+feedback is useful to experts optimizing computation kernels, including
+scientific simulations and deep-learning kernels.
 
 An exact throughput prediction would require a cycle-accurate simulator of the
 processor, based on microarchitectural data that is most often not publicly
@@ -39,6 +22,7 @@ solve in their own way the challenge of modeling complex CPUs while remaining
 simple enough to yield a prediction in a reasonable time, ending up with
 different models. For instance, on the following x86-64 basic block computing a
 general matrix multiplication,
+
 \begin{minipage}{0.95\linewidth}
 \begin{lstlisting}[language={[x86masm]Assembler}]
 movsd (%rcx, %rax), %xmm0
@@ -112,30 +96,27 @@ generally be known. More importantly, the compiler may apply any number of
 transformations: unrolling, for instance, changes this number. Control flow may
 also be complicated by code versioning.
 
-%In the general case, instrumenting the generated code to obtain the number of
-%occurrences of the basic block yields accurate results.
-
 \bigskip
 
 In this article, we present a fully-tooled solution to evaluate and compare the
-diversity of static throughput predictors. Our tool, \tool, solves two main
+diversity of static throughput predictors. Our tool, \cesasme, solves two main
 issues in this direction. In Section~\ref{sec:bench_gen}, we describe how
-\tool{} generates a wide variety of computation kernels stressing different
+\cesasme{} generates a wide variety of computation kernels stressing different
 parameters of the architecture, and thus of the predictors' models, while
 staying close to representative workloads. To achieve this, we use
-Polybench~\cite{polybench}, a C-level benchmark suite representative of
+Polybench~\cite{bench:polybench}, a C-level benchmark suite representative of
 scientific computation workloads, that we combine with a variety of
 optimisations, including polyhedral loop transformations.
-In Section~\ref{sec:bench_harness}, we describe how \tool{} is able to
+In Section~\ref{sec:bench_harness}, we describe how \cesasme{} is able to
 evaluate throughput predictors on this set of benchmarks by lifting their
 predictions to a total number of cycles that can be compared to a hardware
 counters-based measure. A
-high-level view of \tool{} is shown in Figure~\ref{fig:contrib}.
+high-level view of \cesasme{} is shown in Figure~\ref{fig:contrib}.
 
 In Section~\ref{sec:exp_setup}, we detail our experimental setup and assess our
 methodology. In Section~\ref{sec:results_analysis}, we compare the predictors' results and
-analyze the results of \tool{}.
- In addition to statistical studies, we use \tool's results
+analyze the results of \cesasme{}.
+In addition to statistical studies, we use \cesasme's results
 to investigate analyzers' flaws. We show that code
 analyzers do not always correctly model data dependencies through memory
 accesses, substantially impacting their precision.
diff --git a/manuscrit/50_CesASMe/10_bench_gen.tex b/manuscrit/50_CesASMe/10_bench_gen.tex
index c2e0c23..258c39f 100644
--- a/manuscrit/50_CesASMe/10_bench_gen.tex
+++ b/manuscrit/50_CesASMe/10_bench_gen.tex
@@ -39,7 +39,7 @@ directly (no indirections) and whose loops are affine.
 These constraints are necessary to ensure that the microkernelification phase,
 presented below, generates segfault-free code.
 
-In this case, we use Polybench~\cite{polybench}, a suite of 30
+In this case, we use Polybench~\cite{bench:polybench}, a suite of 30
 benchmarks for polyhedral compilation ---~of which we use only 26. The
 \texttt{nussinov}, \texttt{ludcmp} and \texttt{deriche} benchmarks are
 removed because they are incompatible with PoCC (introduced below). The
@@ -58,7 +58,9 @@ resources of the target architecture, and by extension the models on which the
 static analyzers are based.
 
 In this case, we chose to use the
-\textsc{Pluto}~\cite{pluto} and PoCC~\cite{pocc} polyhedral compilers, to easily access common loop nest optimizations~: register tiling, tiling,
+\textsc{Pluto}~\cite{tool:pluto} and PoCC~\cite{tool:pocc} polyhedral
+compilers, to easily access common loop nest optimizations~: register tiling,
+tiling,
 skewing, vectorization/simdization, loop unrolling, loop permutation,
 loop fusion.
 These transformations are meant to maximize variety within the initial
diff --git a/manuscrit/50_CesASMe/15_harness.tex b/manuscrit/50_CesASMe/15_harness.tex
index da55ae7..9e6c9ce 100644
--- a/manuscrit/50_CesASMe/15_harness.tex
+++ b/manuscrit/50_CesASMe/15_harness.tex
@@ -82,6 +82,6 @@ approach, as most throughput prediction tools work a basic block-level, and are
 thus readily available and can be directly plugged into our harness.
 
 Finally, we control the proportion of cache misses in the program's execution
-using \texttt{Cachegrind}~\cite{valgrind} and \gus; programs that have more
+using \texttt{Cachegrind}~\cite{tool:valgrind} and \gus; programs that have more
 than 15\,\% of cache misses on a warm cache are not considered L1-resident and
 are discarded.
diff --git a/manuscrit/50_CesASMe/20_evaluation.tex b/manuscrit/50_CesASMe/20_evaluation.tex
index 755fe76..7b5fbae 100644
--- a/manuscrit/50_CesASMe/20_evaluation.tex
+++ b/manuscrit/50_CesASMe/20_evaluation.tex
@@ -64,13 +64,12 @@ consequently, lifted predictions can reasonably be compared to one another.
 
 \begin{figure}
     \centering
-    \includegraphics[width=\linewidth]{figs/results_comparability_hist.pdf}
+    \includegraphics[width=0.7\linewidth]{results_comparability_hist.pdf}
     \caption{Relative error distribution \wrt{} \perf}\label{fig:exp_comparability}
 \end{figure}
 
 \begin{table}
     \centering
-    \caption{Relative error statistics \wrt{} \perf}\label{table:exp_comparability}
     \begin{tabular}{l r r}
         \toprule
         & \textbf{Best block-based} & \textbf{BHive} \\
@@ -84,13 +83,12 @@ consequently, lifted predictions can reasonably be compared to one another.
         Q3 (\%) & 15.41 & 23.01 \\
         \bottomrule
     \end{tabular}
+    \caption{Relative error statistics \wrt{} \perf}\label{table:exp_comparability}
 \end{table}
 
 
-\begin{table*}[!htbp]
+\begin{table}[!htbp]
     \centering
-    \caption{Bottleneck reports from the studied tools}\label{table:coverage}
-
     \begin{tabular}{l | r r r | r r r | r r r}
         \toprule
  & \multicolumn{3}{c|}{\textbf{Frontend}}
@@ -128,7 +126,8 @@ floyd-warshall       &   74 &   16 &  29.7 \% &   16 &   24 &  68.8 \% &   20 &
 \textbf{Total}       &  907 & 1360 &  35.2 \% &  509 &  687 &  65.8 \% &  310 &  728 &  70.3 \% \\
 \bottomrule
     \end{tabular}
-\end{table*}
+    \caption{Bottleneck reports from the studied tools}\label{table:coverage}
+\end{table}
 
 \subsection{Relevance and representativity (bottleneck
 analysis)}\label{ssec:bottleneck_diversity}
diff --git a/manuscrit/50_CesASMe/25_results_analysis.tex b/manuscrit/50_CesASMe/25_results_analysis.tex
index 5a37fe2..7554fc9 100644
--- a/manuscrit/50_CesASMe/25_results_analysis.tex
+++ b/manuscrit/50_CesASMe/25_results_analysis.tex
@@ -11,28 +11,31 @@ understanding of which tool is more suited for each situation.
 
 \subsection{Throughput results}\label{ssec:overall_results}
 
-\begin{table*}
+\begin{table}
     \centering
-    \caption{Statistical analysis of overall results}\label{table:overall_analysis_stats}
+    \footnotesize
     \begin{tabular}{l r r r r r r r r r}
         \toprule
-\textbf{Bencher} & \textbf{Datapoints} & \textbf{Failures} & \textbf{(\%)} &
-\textbf{MAPE} & \textbf{Median} & \textbf{Q1} & \textbf{Q3} & \textbf{K. tau} & \textbf{Time (CPU$\cdot$h)}\\
+        \textbf{Bencher} & \textbf{Datapoints} &
+        \multicolumn{2}{c}{\textbf{Failures}} &
+\textbf{MAPE} & \textbf{Median} & \textbf{Q1} & \textbf{Q3} & {\boldmath$K_\tau$} & \textbf{Time}\\
+              & & (Count) & (\%) & (\%) & (\%) & (\%) & (\%) & & (CPU$\cdot$h) \\
 \midrule
-BHive & 2198 & 1302 & (37.20\,\%) & 27.95\,\% & 7.78\,\% & 3.01\,\% & 23.01\,\% & 0.81 & 1.37\\
-llvm-mca & 3500 & 0 & (0.00\,\%) & 36.71\,\% & 27.80\,\% & 12.92\,\% & 59.80\,\% & 0.57 & 0.96 \\
-UiCA & 3500 & 0 & (0.00\,\%) & 29.59\,\% & 18.26\,\% & 7.11\,\% & 52.99\,\% & 0.58 & 2.12 \\
-Ithemal & 3500 & 0 & (0.00\,\%) & 57.04\,\% & 48.70\,\% & 22.92\,\% & 75.69\,\% & 0.39 & 0.38 \\
-Iaca & 3500 & 0 & (0.00\,\%) & 30.23\,\% & 18.51\,\% & 7.13\,\% & 57.18\,\% & 0.59 & 1.31 \\
-Gus & 3500 & 0 & (0.00\,\%) & 20.37\,\% & 15.01\,\% & 7.82\,\% & 30.59\,\% & 0.82 & 188.04 \\
+BHive & 2198 & 1302 & (37.20) & 27.95 & 7.78 & 3.01 & 23.01 & 0.81 & 1.37\\
+llvm-mca & 3500 & 0 & (0.00) & 36.71 & 27.80 & 12.92 & 59.80 & 0.57 & 0.96 \\
+UiCA & 3500 & 0 & (0.00) & 29.59 & 18.26 & 7.11 & 52.99 & 0.58 & 2.12 \\
+Ithemal & 3500 & 0 & (0.00) & 57.04 & 48.70 & 22.92 & 75.69 & 0.39 & 0.38 \\
+Iaca & 3500 & 0 & (0.00) & 30.23 & 18.51 & 7.13 & 57.18 & 0.59 & 1.31 \\
+Gus & 3500 & 0 & (0.00) & 20.37 & 15.01 & 7.82 & 30.59 & 0.82 & 188.04 \\
 \bottomrule
     \end{tabular}
-\end{table*}
+    \caption{Statistical analysis of overall results}\label{table:overall_analysis_stats}
+\end{table}
 
 The error distribution of the relative errors, for each tool, is presented as a
 box plot in Figure~\ref{fig:overall_analysis_boxplot}. Statistical indicators
 are also given in Table~\ref{table:overall_analysis_stats}. We also give, for
-each tool, its Kendall's tau indicator~\cite{kendall1938tau}: this indicator,
+each tool, its Kendall's tau indicator~\cite{kendalltau}: this indicator,
 used to evaluate \eg{} uiCA~\cite{uica} and Palmed~\cite{palmed}, measures how
 well the pair-wise ordering of benchmarks is preserved, $-1$ being a full
 anti-correlation and $1$ a full correlation. This is especially useful when one
@@ -40,7 +43,8 @@ is not interested in a program's absolute throughput, but rather in comparing
 which program has a better throughput.
 
 \begin{figure}
-    \includegraphics[width=\linewidth]{figs/overall_analysis_boxplot.pdf}
+    \centering
+    \includegraphics[width=0.5\linewidth]{overall_analysis_boxplot.pdf}
     \caption{Statistical distribution of relative errors}\label{fig:overall_analysis_boxplot}
 \end{figure}
 
@@ -185,7 +189,6 @@ frontend bottlenecks, thus making it easier for them to agree.
 
 \begin{table}
     \centering
-    \caption{Diverging bottleneck prediction per tool}\label{table:bottleneck_diverging_pred}
     \begin{tabular}{l r r r r}
         \toprule
         \textbf{Tool}
@@ -197,6 +200,7 @@ frontend bottlenecks, thus making it easier for them to agree.
         \iaca{} & 1221 & (53.0 \%) & 900 & (36.6 \%) \\
         \bottomrule
     \end{tabular}
+    \caption{Diverging bottleneck prediction per tool}\label{table:bottleneck_diverging_pred}
 \end{table}
 
 The Table~\ref{table:bottleneck_diverging_pred}, in turn, breaks down the cases
@@ -216,14 +220,13 @@ tool for each kind of bottleneck.
 
 \subsection{Impact of dependency-boundness}\label{ssec:memlatbound}
 
-\begin{table*}
+\begin{table}
     \centering
-    \caption{Statistical analysis of overall results, without latency bound
-    through memory-carried dependencies rows}\label{table:nomemdeps_stats}
+    \footnotesize
     \begin{tabular}{l r r r r r r r r r}
         \toprule
 \textbf{Bencher} & \textbf{Datapoints} & \textbf{Failures} & \textbf{(\%)} &
-\textbf{MAPE} & \textbf{Median} & \textbf{Q1} & \textbf{Q3} & \textbf{K. tau}\\
+\textbf{MAPE} & \textbf{Median} & \textbf{Q1} & \textbf{Q3} & {\boldmath$K_\tau$}\\
 \midrule
 BHive & 1365 & 1023 & (42.84\,\%) & 34.07\,\% & 8.62\,\% & 4.30\,\% & 24.25\,\% & 0.76\\
 llvm-mca & 2388 & 0 & (0.00\,\%) & 27.06\,\% & 21.04\,\% & 9.69\,\% & 32.73\,\% & 0.79\\
@@ -233,7 +236,9 @@ Iaca & 2388 & 0 & (0.00\,\%) & 17.55\,\% & 12.17\,\% & 4.64\,\% & 22.35\,\% & 0.
 Gus & 2388 & 0 & (0.00\,\%) & 23.18\,\% & 20.23\,\% & 8.78\,\% & 32.73\,\% & 0.83\\
 \bottomrule
     \end{tabular}
-\end{table*}
+    \caption{Statistical analysis of overall results, without rows that are
+    latency-bound through memory-carried dependencies}\label{table:nomemdeps_stats}
+\end{table}
 
 An overview of the full results table (available in our artifact) hints towards
 two main tendencies: on a significant number of rows, the static tools
@@ -256,7 +261,8 @@ against 129 (14.8\,\%) for \texttt{default} and 61 (7.0\,\%) for
 investigate the issue.
 
 \begin{figure}
-    \includegraphics[width=\linewidth]{figs/nomemdeps_boxplot.pdf}
+    \centering
+    \includegraphics[width=0.5\linewidth]{nomemdeps_boxplot.pdf}
     \caption{Statistical distribution of relative errors, with and without
     pruning latency bound through memory-carried dependencies rows}\label{fig:nomemdeps_boxplot}
 \end{figure}
diff --git a/manuscrit/50_CesASMe/main.tex b/manuscrit/50_CesASMe/main.tex
index da40a41..bf41cdb 100644
--- a/manuscrit/50_CesASMe/main.tex
+++ b/manuscrit/50_CesASMe/main.tex
@@ -3,6 +3,7 @@
 \input{00_intro.tex}
 \input{05_related_works.tex}
 \input{10_bench_gen.tex}
+\input{15_harness.tex}
 \input{20_evaluation.tex}
 \input{25_results_analysis.tex}
 \input{30_future_works.tex}
diff --git a/manuscrit/50_CesASMe/overview.tex b/manuscrit/50_CesASMe/overview.tex
index 55285f6..c049262 100644
--- a/manuscrit/50_CesASMe/overview.tex
+++ b/manuscrit/50_CesASMe/overview.tex
@@ -29,7 +29,7 @@
   \node[rednode]   (ppapi)    [left=1cm of bhive]   {perf (measure)};
   \node[rednode]   (gus)      [below=0.5em of ppapi] {Gus};
   %% \node[rednode]   (uica)     [below=of gdb]     {uiCA};
-  \node[rednode]   (lifting)  [right=of bhive]   {
+  \node[rednode]   (lifting)  [below right=1em and 0.2cm of gdb]   {
       Prediction lifting\\\figref{ssec:harness_lifting}};
   \node[
     draw=black,
@@ -47,15 +47,15 @@
     label={[above,xshift=1cm]\footnotesize Variations},
     fit=(pocc) (kernel) (gcc)
   ] (vars) {};
-\node[resultnode]  (bench2) [below=of lifting] {Evaluation metrics \\ for
+\node[resultnode]  (bench2) [right=of lifting] {Evaluation metrics \\ for
         code analyzers};
 
   % Key
   \node[]  (keyblue1)     [below left=0.7cm and 0cm of vars]   {};
   \node[hiddennode]  (keyblue2)     [right=0.5cm of keyblue1]   {Section~\ref{sec:bench_gen}~: generating microbenchmarks};
-  \node[]  (keyred1)     [right=0.6cm of keyblue2]   {};
+  \node[]  (keyred1)     [below=.5em of keyblue1]   {};
   \node[hiddennode]  (keyred2)     [right=0.5cm of keyred1] {Section~\ref{sec:bench_harness}~: benchmarking harness};
-  \node[]  (keyresult1)     [right=0.6cm of keyred2]   {};
+  \node[]  (keyresult1)     [below=.5em of keyred1]   {};
   \node[hiddennode]  (keyresult2)     [right=0.5cm of keyresult1]
       {Section~\ref{sec:results_analysis}~: results analysis};
 
@@ -74,8 +74,8 @@
   \draw[->, very thick, harnarrow]  (gdb.east)     -- (ithemal.west);
   \draw[->, very thick, harnarrow]  (gdb.east)     -- (bhive.west);
   \draw[->, very thick, harnarrow]  (gdb.east)     -- (llvm.west);
-  \draw[->, very thick, harnarrow]  (comps.east|-lifting)   -- (lifting.west);
-  \draw[->, very thick]            (lifting.south)   -- (bench2.north);
+  \draw[->, very thick, harnarrow]  (comps.south-|lifting)   -- (lifting.north);
+  \draw[->, very thick]            (lifting.east)   -- (bench2.west);
   \end{tikzpicture}
 }
 \caption{Our analysis and measurement environment.\label{fig:contrib}}
diff --git a/manuscrit/include/macros.tex b/manuscrit/include/macros.tex
index 91854cc..0c2f98f 100644
--- a/manuscrit/include/macros.tex
+++ b/manuscrit/include/macros.tex
@@ -2,6 +2,8 @@
 \newcommand{\uops}{\uop{}s}
 
 \newcommand{\eg}{\textit{eg.}}
+\newcommand{\ie}{\textit{i.e.}}
+\newcommand{\wrt}{\textit{w.r.t.}}
 
 \newcommand{\kerK}{\mathcal{K}}
 \newcommand{\calR}{\mathcal{R}}
@@ -36,6 +38,20 @@
 \newcommand{\pipedream}{\texttt{Pipedream}}
 \newcommand{\palmed}{\texttt{Palmed}}
 \newcommand{\pmevo}{\texttt{PMEvo}}
+\newcommand{\gus}{\texttt{Gus}}
+\newcommand{\ithemal}{\texttt{Ithemal}}
+\newcommand{\osaca}{\texttt{OSACA}}
+\newcommand{\bhive}{\texttt{BHive}}
+\newcommand{\anica}{\texttt{AnICA}}
+\newcommand{\cesasme}{\texttt{CesASMe}}
+
+\newcommand{\gdb}{\texttt{gdb}}
+
+\newcommand{\coeq}{CO$_{2}$eq}
+
+\newcommand{\figref}[1]{[\ref{#1}]}
+
+\newcommand{\reg}[1]{\texttt{\%#1}}
 
 % Hyperlinks
 \newcommand{\pymodule}[1]{\href{https://docs.python.org/3/library/#1.html}{\lstpython{#1}}}
diff --git a/manuscrit/include/packages.tex b/manuscrit/include/packages.tex
index aeeddda..bcce0f1 100644
--- a/manuscrit/include/packages.tex
+++ b/manuscrit/include/packages.tex
@@ -25,9 +25,13 @@
 \usepackage{import}
 \usepackage{wrapfig}
 \usepackage{float}
+\usepackage{tikz}
 \usepackage[bottom]{footmisc}  % footnotes are below floats
 \usepackage[final]{microtype}
 
+\usetikzlibrary{positioning}
+\usetikzlibrary{fit}
+
 \emergencystretch=1em
 
 % Local sty files