First pass on CesASMe -- intro is still a mess
This commit is contained in:
parent
f6f0336b34
commit
853f792023
7 changed files with 45 additions and 87 deletions
@@ -1 +1 @@
-\chapter{Foundations}
+\chapter{Foundations}\label{chap:foundations}
@@ -1,21 +1,18 @@
 \section{Introduction}\label{sec:intro}
+In the previous chapters, we focused on two of the main bottleneck factors for
+computation kernels: \autoref{chap:palmed} investigated the backend aspect of
+throughput prediction, while \autoref{chap:frontend} delved into the frontend
+aspects.
 
-At a time when software is expected to perform more computations, faster and in
-more constrained environments, tools that statically predict the resources (and
-in particular the CPU resources) they consume are very useful to guide their
-optimization. This need is reflected in the diversity of binary or assembly
-code analyzers following the deprecation of \iaca~\cite{iaca}, which Intel has
-maintained through 2019. Whether it is \llvmmca{}~\cite{llvm-mca},
-\uica{}~\cite{uica}, \ithemal~\cite{ithemal} or \gus~\cite{phd:gruber}, all
-these tools strive to extract various performance metrics, including the number
-of CPU cycles a computation kernel will take ---~which roughly translates to
-execution time. In addition to raw measurements (relying on hardware
-counters), these model-based analyses provide higher-level and refined data, to
-expose the bottlenecks and guide the optimization of a given code. This
-feedback is useful to experts optimizing computation kernels, including
-scientific simulations and deep-learning kernels.
+Throughout those two chapters, we entirely left out another crucial
+factor: dependencies, and the latency they induce between instructions. We
+could do so because our baseline of native execution was \pipedream{}
+measurements, \emph{designed} to suppress any dependency.
 
-An exact throughput prediction would require a cycle-accurate simulator of the
+However, state-of-the-art tools strive to provide an estimate of the
+execution time $\cyc{\kerK}$ of a given kernel $\kerK$ that is \emph{as precise
+as possible}, and as such, cannot neglect this third major bottleneck.
+An exact
+throughput prediction would require a cycle-accurate simulator of the
 processor, based on microarchitectural data that is most often not publicly
 available, and would be prohibitively slow in any case. These tools thus each
 solve in their own way the challenge of modeling complex CPUs while remaining
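The three-bottleneck view in this hunk (backend resources, frontend, and now dependency latencies) can be made concrete with a toy steady-state model. Everything below — instruction names, port sets, latencies, and the max-of-bounds rule — is invented for illustration and is not the actual model of any of the cited tools:

```python
# Toy steady-state model (illustrative only): per-iteration cycles are bounded
# below both by the busiest execution port and by the latency of the longest
# loop-carried dependency chain; a kernel runs no faster than the larger bound.
instrs = [
    # (assembly text, usable ports, loop-carried latency) -- hypothetical values
    ("add  %rax, %rbx",   {"p0", "p1"}, 1),  # carried through %rbx
    ("imul %rbx, %rcx",   {"p1"},       3),  # carried through %rcx
    ("mov  (%rdx), %rsi", {"p2"},       0),  # no loop-carried dependency
]

# Resource (backend) bound: spread each instruction evenly over its ports and
# take the most pressured port.
pressure = {}
for _, ports, _ in instrs:
    for p in ports:
        pressure[p] = pressure.get(p, 0.0) + 1.0 / len(ports)
resource_bound = max(pressure.values())  # 1.5 cycles (port p1)

# Dependency bound: total latency along the carried chain (here a single chain
# through %rbx then %rcx; the load is off-chain and contributes 0).
dependency_bound = sum(lat for _, _, lat in instrs)  # 4 cycles

cycles_per_iter = max(resource_bound, dependency_bound)
print(cycles_per_iter)
```

With these invented numbers the kernel is dependency-bound (4 cycles per iteration), which is exactly the situation the Pipedream baseline of the earlier chapters was designed to suppress.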
@@ -40,9 +37,9 @@ predicts 3 cycles. One may wonder which tool is correct.
 
 
 The obvious solution to assess their predictions is to compare them to an
-actual measure. However, as these tools reason at the basic block level, this
-is not as trivially defined as it would seem. Take for instance the following
-kernel:
+actual measure. However, accounting for dependencies at the scale of a basic
+block makes this \emph{actual measure} not as trivially defined as it would
+seem. Take for instance the following kernel:
 
 \begin{minipage}{0.90\linewidth}
 \begin{lstlisting}[language={[x86masm]Assembler}]
@@ -98,7 +95,7 @@ also be complicated by code versioning.
 
 \bigskip
 
-In this article, we present a fully-tooled solution to evaluate and compare the
+In this chapter, we present a fully-tooled solution to evaluate and compare the
 diversity of static throughput predictors. Our tool, \cesasme, solves two main
 issues in this direction. In Section~\ref{sec:bench_gen}, we describe how
 \cesasme{} generates a wide variety of computation kernels stressing different
@@ -1,40 +1,4 @@
 \section{Related works}
-
-The static throughput analyzers studied rely on a variety of models.
-\iaca~\cite{iaca}, developed by Intel and now deprecated, is closed-source and
-relies on Intel's expertise on their own processors.
-The LLVM compiling ecosystem, to guide optimization passes, maintains models of many
-architectures. These models are used in the LLVM Machine Code Analyzer,
-\llvmmca~\cite{llvm-mca}, to statically evaluate the performance of a segment
-of assembly.
-Independently, Abel and Reineke used an automated microbenchmark generation
-approach to generate port mappings of many architectures in
-\texttt{uops.info}~\cite{nanobench, uopsinfo} to model processors' backends.
-This work was continued with \uica~\cite{uica}, extending this model with an
-extensive frontend description.
-Following a completely different approach, \ithemal~\cite{ithemal} uses a deep
-neural network to predict basic blocks throughput. To obtain enough data to
-train its model, the authors also developed \bhive~\cite{bhive}, a profiling
-tool working on basic blocks.
-
-Another static tool, \osaca~\cite{osaca2}, provides lower- and
-upper-bounds to the execution time of a basic block. As this kind of
-information cannot be fairly compared with tools yielding an exact throughput
-prediction, we exclude it from our scope.
-
-All these tools statically predict the number of cycles taken by a piece of
-assembly or binary that is assumed to be the body of an infinite ---~or
-sufficiently large~--- loop in steady state, all its data being L1-resident. As
-discussed by \uica's authors~\cite{uica}, hypotheses can subtly vary between
-analyzers; \eg{} by assuming that the loop is either unrolled or has control
-instructions, with non-negligible impact. Some of the tools, \eg{} \ithemal,
-necessarily work on a single basic block, while some others, \eg{} \iaca, work
-on a section of code delimited by markers. However, even in the second case,
-the code is assumed to be \emph{straight-line code}: branch instructions, if
-any, are assumed not taken.
-
-\smallskip
-
 Throughput prediction tools, however, are not all static.
 \gus~\cite{phd:gruber} dynamically predicts the throughput of a whole program
 region, instrumenting it to retrieve the exact events occurring through its
@@ -2,8 +2,8 @@
 
 Running the harness described above provides us with 3500
 benchmarks ---~after filtering out non-L1-resident
-benchmarks~---, on which each throughput predictor is run. We make the full
-output of our tool available in our artifact. Before analyzing these results in
+benchmarks~---, on which each throughput predictor is run.
+Before analyzing these results in
 Section~\ref{sec:results_analysis}, we evaluate the relevance of the
 methodology presented in Section~\ref{sec:bench_harness} to make the tools'
 predictions comparable to baseline hardware counter measurements.
@@ -15,14 +15,13 @@ C6420 machine from the Grid5000 cluster~\cite{grid5000}, equipped with 192\,GB
 of DDR4 SDRAM ---~only a small fraction of which was used~--- and two Intel
 Xeon Gold 6130 CPUs (x86-64, Skylake microarchitecture) with 16 cores each.
 
-The experiments themselves were run inside a Docker environment very close to
-our artifact, based on Debian Bullseye. Care was taken to disable
-hyperthreading to improve measurements stability. For tools whose output is
-based on a direct measurement (\perf, \bhive), the benchmarks were run
-sequentially on a single core with no experiments on the other cores. No such
-care was taken for \gus{} as, although based on a dynamic run, its prediction
-is purely function of recorded program events and not of program measures. All
-other tools were run in parallel.
+The experiments themselves were run inside a Docker environment based on Debian
+Bullseye. Care was taken to disable hyperthreading to improve measurement
+stability. For tools whose output is based on a direct measurement (\perf,
+\bhive), the benchmarks were run sequentially on a single core with no
+experiments on the other cores. No such care was taken for \gus{} as, although
+based on a dynamic run, its prediction is purely a function of recorded program
+events and not of program measurements. All other tools were run in parallel.
 
 We use \llvmmca{} \texttt{v13.0.1}, \iaca{} \texttt{v3.0-28-g1ba2cbb}, \bhive{}
 at commit \texttt{5f1d500}, \uica{} at commit \texttt{9cbbe93}, \gus{} at
@@ -87,8 +86,9 @@ consequently, lifted predictions can reasonably be compared to one another.
 \end{table}
 
-\begin{table}[!htbp]
+
+\begin{table}
 \centering
 \footnotesize
 \begin{tabular}{l | r r r | r r r | r r r}
 \toprule
 & \multicolumn{3}{c|}{\textbf{Frontend}}
@@ -196,7 +196,7 @@ the transformations described in Section~\ref{sec:bench_gen}.
 Generating and running the full suite of benchmarks required about 30\,h of
 continuous computation on a single machine. During the experiments, the power
 supply units reported a near-constant consumption of about 350\,W. The carbon
-intensity of the power grid for the region where the experiment was run, at the
+intensity of the power grid in France, where the experiment was run, at the
 time of the experiments, was about 29\,g\coeq/kWh~\cite{electricitymaps}.
 
 The electricity consumed directly by the server thus amounts to about
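The figures in this hunk (about 30 h at a near-constant 350 W, with a grid carbon intensity of 29 g CO2eq/kWh) admit a quick back-of-the-envelope check. This is plain arithmetic on the quoted values, not output of the artifact:

```python
# Sanity check of the energy figures quoted above (all inputs from the text).
hours = 30              # total generation + benchmarking time
power_w = 350           # near-constant draw reported by the PSUs
carbon_g_per_kwh = 29   # grid carbon intensity (France, at the time)

energy_kwh = power_w * hours / 1000       # 350 W * 30 h = 10.5 kWh
co2_g = energy_kwh * carbon_g_per_kwh     # 10.5 kWh * 29 g/kWh = 304.5 g CO2eq

print(energy_kwh, co2_g)
```

So the direct server consumption is on the order of 10.5 kWh, i.e. roughly 300 g CO2eq, consistent with the truncated sentence that follows.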
@@ -35,12 +35,8 @@ Gus & 3500 & 0 & (0.00) & 20.37 & 15.01 & 7.82 & 30.59 & 0.82 & 188.04 \\
 The distribution of the relative errors, for each tool, is presented as a
 box plot in Figure~\ref{fig:overall_analysis_boxplot}. Statistical indicators
 are also given in Table~\ref{table:overall_analysis_stats}. We also give, for
-each tool, its Kendall's tau indicator~\cite{kendalltau}: this indicator,
-used to evaluate \eg{} uiCA~\cite{uica} and Palmed~\cite{palmed}, measures how
-well the pair-wise ordering of benchmarks is preserved, $-1$ being a full
-anti-correlation and $1$ a full correlation. This is especially useful when one
-is not interested in a program's absolute throughput, but rather in comparing
-which program has a better throughput.
+each tool, its Kendall's tau indicator~\cite{kendalltau} that we used earlier
+in \autoref{chap:palmed} and \autoref{chap:frontend}.
 
 \begin{figure}
 \centering
@@ -53,8 +49,8 @@ These results are, overall, significantly worse than what each tool's article
 presents. We attribute this difference mostly to the specificities of
 Polybench: being composed of computation kernels, it intrinsically stresses the
 CPU more than basic blocks extracted out of the Spec benchmark suite. This
-difference is clearly reflected in the experimental section of the Palmed
-article~\cite{palmed}: the accuracy of most tools is worse on Polybench than on
+difference is clearly reflected in the experimental section of Palmed in
+\autoref{chap:palmed}: the accuracy of most tools is worse on Polybench than on
 Spec, often by more than a factor of two.
 
 As \bhive{} and \ithemal{} do not support control flow instructions
@@ -79,7 +75,7 @@ paths, can explain another portion. We also find that \bhive{} fails to produce
 a result in about 40\,\% of the kernels explored ---~which means that, for those
 cases, \bhive{} failed to produce a result on at least one of the constituent
 basic blocks. In fact, this is due to the difficulties we mentioned in
-Section \ref{sec:intro} related to the need to reconstruct the context of each
+\qtodo{[ref intro]} related to the need to reconstruct the context of each
 basic block \textit{ex nihilo}.
 
 The basis of \bhive's method is to run the code to be measured, unrolled a
@@ -240,13 +236,12 @@ Gus & 2388 & 0 & (0.00\,\%) & 23.18\,\% & 20.23\,\% & 8.78\,\% & 32.73\,\% & 0.8
 through memory-carried dependencies rows}\label{table:nomemdeps_stats}
 \end{table}
 
-An overview of the full results table (available in our artifact) hints towards
-two main tendencies: on a significant number of rows, the static tools
----~thus leaving \gus{} and \bhive{} apart~---, excepted \ithemal, often yield
-comparatively bad throughput predictions \emph{together}; and many of these
-rows are those using the \texttt{O1} and \texttt{O1autovect} compilation
-setting (\texttt{gcc} with \texttt{-O1}, plus vectorisation options for the
-latter).
+An overview of the full results table hints at two main tendencies: on a
+significant number of rows, the static tools ---~thus leaving \gus{} and
+\bhive{} apart~---, except \ithemal, often yield comparatively bad throughput
+predictions \emph{together}; and many of these rows are those using the
+\texttt{O1} and \texttt{O1autovect} compilation settings (\texttt{gcc} with
+\texttt{-O1}, plus vectorisation options for the latter).
 
 To confirm the first observation, we look at the 30\,\% worst benchmarks ---~in
 terms of MAPE relative to \perf~--- for \llvmmca, \uica{} and \iaca{}
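The selection described above — the worst 30 % of benchmarks by error relative to the perf baseline — can be sketched as follows. Benchmark names and cycle counts are hypothetical, and the per-benchmark criterion shown is the plain relative error against the baseline:

```python
# Illustrative sketch (hypothetical data): rank benchmarks by relative error
# against a baseline measurement and keep the worst 30 % for closer inspection.
baseline  = {"b1": 4.0, "b2": 6.0, "b3": 8.0, "b4": 3.0, "b5": 10.0,
             "b6": 5.0, "b7": 7.0, "b8": 2.0, "b9": 9.0, "b10": 6.5}
predicted = {"b1": 4.2, "b2": 5.0, "b3": 8.1, "b4": 4.5, "b5": 9.0,
             "b6": 5.1, "b7": 7.2, "b8": 2.1, "b9": 12.0, "b10": 6.4}

def rel_err(name):
    """Relative error of the prediction against the baseline measurement."""
    return abs(predicted[name] - baseline[name]) / baseline[name]

worst = sorted(baseline, key=rel_err, reverse=True)
worst_30pct = worst[: int(0.3 * len(worst))]  # 3 benchmarks out of 10
print(worst_30pct)
```

Intersecting such worst-case sets across several tools is what supports the "bad together" observation above.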
@@ -1,6 +1,6 @@
-\section{Conclusion and future works}
+\section*{Conclusion and future works}
 
-In this article, we have presented a fully-tooled approach that enables:
+In this chapter, we have presented a fully-tooled approach that enables:
 
 \begin{itemize}
 \item the generation of a wide variety of microbenchmarks, reflecting both the
@@ -48,6 +48,8 @@
 
 @misc{tool:pocc,
   title={{PoCC}, the Polyhedral Compiler Collection},
   author={Pouchet, Louis-No{\"e}l},
+  year=2009,
+  note={\url{https://www.cs.colostate.edu/~pouchet/software/pocc/}},
 }
 