First pass on CesASMe -- intro is still a mess

Théophile Bastian 2023-09-25 18:45:35 +02:00
parent f6f0336b34
commit 853f792023
7 changed files with 45 additions and 87 deletions

View file

@@ -1 +1 @@
-\chapter{Foundations}
+\chapter{Foundations}\label{chap:foundations}

View file

@@ -1,21 +1,18 @@
-\section{Introduction}\label{sec:intro}
-
-At a time when software is expected to perform more computations, faster and in
-more constrained environments, tools that statically predict the resources (and
-in particular the CPU resources) they consume are very useful to guide their
-optimization. This need is reflected in the diversity of binary or assembly
-code analyzers following the deprecation of \iaca~\cite{iaca}, which Intel has
-maintained through 2019. Whether it is \llvmmca{}~\cite{llvm-mca},
-\uica{}~\cite{uica}, \ithemal~\cite{ithemal} or \gus~\cite{phd:gruber}, all
-these tools strive to extract various performance metrics, including the number
-of CPU cycles a computation kernel will take ---~which roughly translates to
-execution time. In addition to raw measurements (relying on hardware
-counters), these model-based analyses provide higher-level and refined data, to
-expose the bottlenecks and guide the optimization of a given code. This
-feedback is useful to experts optimizing computation kernels, including
-scientific simulations and deep-learning kernels.
-
-An exact throughput prediction would require a cycle-accurate simulator of the
+In the previous chapters, we focused on two of the main bottleneck factors for
+computation kernels: \autoref{chap:palmed} investigated the backend aspect of
+throughput prediction, while \autoref{chap:frontend} dived into the frontend
+aspects.
+
+Throughout those two chapters, we entirely left out another crucial
+factor: dependencies, and the latency they induce between instructions. We
+managed to do so, because our baseline of native execution was \pipedream{}
+measures, \emph{designed} to suppress any dependency.
+
+However, state-of-the-art tools strive to provide an estimation of the
+execution time $\cyc{\kerK}$ of a given kernel $\kerK$ that is \emph{as precise
+as possible}, and as such, cannot neglect this third major bottleneck.
+An exact
+throughput prediction would require a cycle-accurate simulator of the
 processor, based on microarchitectural data that is most often not publicly
 available, and would be prohibitively slow in any case. These tools thus each
 solve in their own way the challenge of modeling complex CPUs while remaining
@@ -40,9 +37,9 @@ predicts 3 cycles. One may wonder which tool is correct.
 The obvious solution to assess their predictions is to compare them to an
-actual measure. However, as these tools reason at the basic block level, this
-is not as trivially defined as it would seem. Take for instance the following
-kernel:
+actual measure. However, accounting for dependencies at the scale of a basic
+block makes this \emph{actual measure} not as trivially defined as it would
+seem. Take for instance the following kernel:
 \begin{minipage}{0.90\linewidth}
 \begin{lstlisting}[language={[x86masm]Assembler}]
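The kernel listing itself falls outside this hunk's context. As a purely illustrative stand-in ---~a hypothetical example, not the file's actual listing~--- the body of a sufficiently large loop exhibiting the kind of dependency chain under discussion might look like:

\begin{lstlisting}[language={[x86masm]Assembler}]
mov (%rax, %rcx, 1), %r10   # load: feeds the store below
mov %r10, (%rbx, %rcx, 1)   # store: must wait for the load
add $8, %rcx                # index increment, carried across iterations
\end{lstlisting}

For such a kernel, whether its \emph{actual measure} accounts for the load-to-store dependency, and under which initial register and memory state, depends entirely on how the surrounding execution context is reconstructed ---~precisely the ambiguity the new text raises.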
@@ -98,7 +95,7 @@ also be complicated by code versioning.
 \bigskip
-In this article, we present a fully-tooled solution to evaluate and compare the
+In this chapter, we present a fully-tooled solution to evaluate and compare the
 diversity of static throughput predictors. Our tool, \cesasme, solves two main
 issues in this direction. In Section~\ref{sec:bench_gen}, we describe how
 \cesasme{} generates a wide variety of computation kernels stressing different

View file

@@ -1,40 +1,4 @@
 \section{Related works}
-The static throughput analyzers studied rely on a variety of models.
-\iaca~\cite{iaca}, developed by Intel and now deprecated, is closed-source and
-relies on Intel's expertise on their own processors.
-
-The LLVM compiling ecosystem, to guide optimization passes, maintains models of many
-architectures. These models are used in the LLVM Machine Code Analyzer,
-\llvmmca~\cite{llvm-mca}, to statically evaluate the performance of a segment
-of assembly.
-
-Independently, Abel and Reineke used an automated microbenchmark generation
-approach to generate port mappings of many architectures in
-\texttt{uops.info}~\cite{nanobench, uopsinfo} to model processors' backends.
-This work was continued with \uica~\cite{uica}, extending this model with an
-extensive frontend description.
-
-Following a completely different approach, \ithemal~\cite{ithemal} uses a deep
-neural network to predict basic blocks throughput. To obtain enough data to
-train its model, the authors also developed \bhive~\cite{bhive}, a profiling
-tool working on basic blocks.
-
-Another static tool, \osaca~\cite{osaca2}, provides lower- and
-upper-bounds to the execution time of a basic block. As this kind of
-information cannot be fairly compared with tools yielding an exact throughput
-prediction, we exclude it from our scope.
-
-All these tools statically predict the number of cycles taken by a piece of
-assembly or binary that is assumed to be the body of an infinite ---~or
-sufficiently large~--- loop in steady state, all its data being L1-resident. As
-discussed by \uica's authors~\cite{uica}, hypotheses can subtly vary between
-analyzers; \eg{} by assuming that the loop is either unrolled or has control
-instructions, with non-negligible impact. Some of the tools, \eg{} \ithemal,
-necessarily work on a single basic block, while some others, \eg{} \iaca, work
-on a section of code delimited by markers. However, even in the second case,
-the code is assumed to be \emph{straight-line code}: branch instructions, if
-any, are assumed not taken.
-\smallskip
 Throughput prediction tools, however, are not all static.
 \gus~\cite{phd:gruber} dynamically predicts the throughput of a whole program
 region, instrumenting it to retrieve the exact events occurring through its

View file

@@ -2,8 +2,8 @@
 Running the harness described above provides us with 3500
 benchmarks ---~after filtering out non-L1-resident
-benchmarks~---, on which each throughput predictor is run. We make the full
-output of our tool available in our artifact. Before analyzing these results in
+benchmarks~---, on which each throughput predictor is run.
+Before analyzing these results in
 Section~\ref{sec:results_analysis}, we evaluate the relevance of the
 methodology presented in Section~\ref{sec:bench_harness} to make the tools'
 predictions comparable to baseline hardware counter measures.
@@ -15,14 +15,13 @@ C6420 machine from the Grid5000 cluster~\cite{grid5000}, equipped with 192\,GB
 of DDR4 SDRAM ---~only a small fraction of which was used~--- and two Intel
 Xeon Gold 6130 CPUs (x86-64, Skylake microarchitecture) with 16 cores each.

-The experiments themselves were run inside a Docker environment very close to
-our artifact, based on Debian Bullseye. Care was taken to disable
-hyperthreading to improve measurements stability. For tools whose output is
-based on a direct measurement (\perf, \bhive), the benchmarks were run
-sequentially on a single core with no experiments on the other cores. No such
-care was taken for \gus{} as, although based on a dynamic run, its prediction
-is purely function of recorded program events and not of program measures. All
-other tools were run in parallel.
+The experiments themselves were run inside a Docker environment based on Debian
+Bullseye. Care was taken to disable hyperthreading to improve measurement
+stability. For tools whose output is based on a direct measurement (\perf,
+\bhive), the benchmarks were run sequentially on a single core with no
+experiments on the other cores. No such care was taken for \gus{} as, although
+based on a dynamic run, its prediction is purely a function of recorded program
+events and not of program measures. All other tools were run in parallel.

 We use \llvmmca{} \texttt{v13.0.1}, \iaca{} \texttt{v3.0-28-g1ba2cbb}, \bhive{}
 at commit \texttt{5f1d500}, \uica{} at commit \texttt{9cbbe93}, \gus{} at
@@ -87,8 +86,9 @@ consequently, lifted predictions can reasonably be compared to one another.
 \end{table}

-\begin{table}[!htbp]
+\begin{table}
 \centering
+\footnotesize
 \begin{tabular}{l | r r r | r r r | r r r}
 \toprule
 & \multicolumn{3}{c|}{\textbf{Frontend}}
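The hunk's leading context mentions \emph{lifted} predictions, whose definition lies outside this diff. A plausible reading ---~our sketch, under the assumption that $n_b$ counts the executions of basic block $b$~--- lifts per-block predictions to a whole kernel $\kerK$ by weighting each block by its execution count:

\[
    \cyc{\kerK} \approx \sum_{b \in \kerK} n_b \cdot \cyc{b}
\]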
@@ -196,7 +196,7 @@ the transformations described in Section~\ref{sec:bench_gen}.

 Generating and running the full suite of benchmarks required about 30h of
 continuous computation on a single machine. During the experiments, the power
 supply units reported a near-constant consumption of about 350W. The carbon
-intensity of the power grid for the region where the experiment was run, at the
+intensity of the power grid in France, where the experiment was run, at the
 time of the experiments, was of about 29g\coeq/kWh~\cite{electricitymaps}.
 The electricity consumed directly by the server thus amounts to about
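The resulting figure lies past the hunk's last line, but it follows directly from the numbers quoted:

\[
    30\,\text{h} \times 0.35\,\text{kW} = 10.5\,\text{kWh},
    \qquad
    10.5\,\text{kWh} \times 29\,\text{g\coeq{}/kWh} \approx 305\,\text{g\coeq{}}
\]

that is, on the order of 0.3\,kg of \coeq{} for the server's direct electricity consumption.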

View file

@@ -35,12 +35,8 @@ Gus & 3500 & 0 & (0.00) & 20.37 & 15.01 & 7.82 & 30.59 & 0.82 & 188.04 \\
 The error distribution of the relative errors, for each tool, is presented as a
 box plot in Figure~\ref{fig:overall_analysis_boxplot}. Statistical indicators
 are also given in Table~\ref{table:overall_analysis_stats}. We also give, for
-each tool, its Kendall's tau indicator~\cite{kendalltau}: this indicator,
-used to evaluate \eg{} uiCA~\cite{uica} and Palmed~\cite{palmed}, measures how
-well the pair-wise ordering of benchmarks is preserved, $-1$ being a full
-anti-correlation and $1$ a full correlation. This is especially useful when one
-is not interested in a program's absolute throughput, but rather in comparing
-which program has a better throughput.
+each tool, its Kendall's tau indicator~\cite{kendalltau} that we used earlier
+in \autoref{chap:palmed} and \autoref{chap:frontend}.

 \begin{figure}
 \centering
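The deleted lines carried the only in-chapter description of Kendall's tau, which the new text now defers to earlier chapters. For reference, the standard coefficient over $n$ benchmarks (textbook definition, not taken from the file) is

\[
    \tau = \frac{n_{\text{concordant}} - n_{\text{discordant}}}{\binom{n}{2}},
\]

where a pair of benchmarks is \emph{concordant} when the tool and the baseline order them identically; $\tau = 1$ denotes full correlation and $\tau = -1$ full anti-correlation.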
@@ -53,8 +49,8 @@ These results are, overall, significantly worse than what each tool's article
 presents. We attribute this difference mostly to the specificities of
 Polybench: being composed of computation kernels, it intrinsically stresses the
 CPU more than basic blocks extracted out of the Spec benchmark suite. This
-difference is clearly reflected in the experimental section of the Palmed
-article~\cite{palmed}: the accuracy of most tools is worse on Polybench than on
+difference is clearly reflected in the experimental section of Palmed in
+\autoref{chap:palmed}: the accuracy of most tools is worse on Polybench than on
 Spec, often by more than a factor of two.

 As \bhive{} and \ithemal{} do not support control flow instructions
@@ -79,7 +75,7 @@ paths, can explain another portion. We also find that \bhive{} fails to produce
 a result in about 40\,\% of the kernels explored ---~which means that, for those
 cases, \bhive{} failed to produce a result on at least one of the constituent
 basic blocks. In fact, this is due to the difficulties we mentioned in
-Section \ref{sec:intro} related to the need to reconstruct the context of each
+\qtodo{[ref intro]} related to the need to reconstruct the context of each
 basic block \textit{ex nihilo}.

 The basis of \bhive's method is to run the code to be measured, unrolled a
@@ -240,13 +236,12 @@ Gus & 2388 & 0 & (0.00\,\%) & 23.18\,\% & 20.23\,\% & 8.78\,\% & 32.73\,\% & 0.8
 through memory-carried dependencies rows}\label{table:nomemdeps_stats}
 \end{table}

-An overview of the full results table (available in our artifact) hints towards
-two main tendencies: on a significant number of rows, the static tools
----~thus leaving \gus{} and \bhive{} apart~---, excepted \ithemal, often yield
-comparatively bad throughput predictions \emph{together}; and many of these
-rows are those using the \texttt{O1} and \texttt{O1autovect} compilation
-setting (\texttt{gcc} with \texttt{-O1}, plus vectorisation options for the
-latter).
+An overview of the full results table hints towards two main tendencies: on a
+significant number of rows, the static tools ---~thus leaving \gus{} and
+\bhive{} apart~---, except \ithemal, often yield comparatively bad throughput
+predictions \emph{together}; and many of these rows are those using the
+\texttt{O1} and \texttt{O1autovect} compilation setting (\texttt{gcc} with
+\texttt{-O1}, plus vectorisation options for the latter).

 To confirm the first observation, we look at the 30\,\% worst benchmarks ---~in
 terms of MAPE relative to \perf~--- for \llvmmca, \uica{} and \iaca{}
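The selection criterion here is the MAPE relative to \perf; its usual definition (standard formula, with our notation: $C_i$ the cycle count measured by \perf{} for benchmark $i$, and $\hat{C}_i$ a tool's prediction) is

\[
    \text{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{\hat{C}_i - C_i}{C_i} \right| \times 100\,\%
\]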

View file

@@ -1,6 +1,6 @@
-\section{Conclusion and future works}
+\section*{Conclusion and future works}

-In this article, we have presented a fully-tooled approach that enables:
+In this chapter, we have presented a fully-tooled approach that enables:
 \begin{itemize}
 \item the generation of a wide variety of microbenchmarks, reflecting both the

View file

@@ -48,6 +48,8 @@
 @misc{tool:pocc,
   title={{PoCC}, the Polyhedral Compiler Collection},
+  author={Pouchet, Louis-No{\"e}l},
+  year=2009,
   note={\url{https://www.cs.colostate.edu/~pouchet/software/pocc/}},
 }