Proof-read chapter 4 (CesASMe)
This commit is contained in:
parent 24e3d4a817
commit 9cfeddeef7
11 changed files with 43 additions and 31 deletions
@@ -72,7 +72,7 @@ details below.
 Altogether, this method generates, for x86-64 processors, 13\,778 SPEC-based
 and 2\,664 polybench-based basic blocks.
 
-\subsection{Automating basic block extraction}
+\subsection{Automating basic block extraction}\label{ssec:palmed_bb_extraction}
 
 This manual method, however, has multiple drawbacks. It is, obviously, tedious
 to manually compile and run a benchmark suite, then extract basic blocks using
@@ -52,15 +52,17 @@ advocate for the measurement of the total execution time of a computation
 kernel in its original context, coupled with a precise measure of its number of
 iterations to normalize the measure.
 
-We then present a fully-tooled solution to evaluate and compare the
-diversity of static throughput predictors. Our tool, \cesasme, solves two main
-issues in this direction. In Section~\ref{sec:bench_gen}, we describe how
-\cesasme{} generates a wide variety of computation kernels stressing different
-parameters of the architecture, and thus of the predictors' models, while
-staying close to representative workloads. To achieve this, we use
-Polybench~\cite{bench:polybench}, a C-level benchmark suite representative of
-scientific computation workloads, that we combine with a variety of
-optimisations, including polyhedral loop transformations.
+We then present a fully-tooled solution to evaluate and compare the diversity
+of static throughput predictors. Our tool, \cesasme, solves two main issues in
+this direction. In Section~\ref{sec:bench_gen}, we describe how \cesasme{}
+generates a wide variety of computation kernels stressing different parameters
+of the architecture, and thus of the predictors' models, while staying close to
+representative workloads. To achieve this, we use
+Polybench~\cite{bench:polybench}, a C-level benchmark suite that we already
+introduced for \palmed{} in \autoref{sec:benchsuite_bb}. Polybench is composed
+of benchmarks representative of scientific computation workloads, that we
+combine with a variety of optimisations, including polyhedral loop
+transformations.
 
 In Section~\ref{sec:bench_harness}, we describe how \cesasme{} is able to
 evaluate throughput predictors on this set of benchmarks by lifting their
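The lifting step mentioned here — aggregating per-basic-block predictions into a kernel-level figure using measured occurrence counts — boils down to a weighted sum. A minimal sketch, where the function name and the dictionary shapes are illustrative placeholders rather than \cesasme{}'s actual data structures:

```python
def lift_to_kernel(bb_cycles, occurrences):
    """Lift per-basic-block throughput predictions (cycles per execution of
    each block) to a whole-kernel prediction, weighting each block by the
    number of times it was observed to execute."""
    return sum(bb_cycles[bb] * occurrences[bb] for bb in occurrences)

# Hypothetical example: a kernel made of two blocks.
total = lift_to_kernel({"bb0": 2.0, "bb1": 3.0}, {"bb0": 10, "bb1": 5})
```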
@@ -12,7 +12,7 @@ SIMD arithmetic operation''.
 \paragraph{A dynamic code analyzer: \gus{}.}
 So far, this manuscript was mostly concerned with static code analyzers.
 Throughput prediction tools, however, are not all static.
-\gus is a dynamic tool first introduced in \fgruber{}'s PhD
+\gus{} is a dynamic tool first introduced in \fgruber{}'s PhD
 thesis~\cite{phd:gruber}. It leverages \qemu{}'s instrumentation capabilities to
 dynamically predict the throughput of user-defined regions of interest in whole
 programs.
@@ -12,7 +12,7 @@ In practice, a microbenchmark's \textit{computational kernel} is a simple
 \texttt{for} loop, whose
 body contains no loops and whose bounds are statically known.
 A \emph{measure} is a number of repetitions $n$ of this computational
-kernel, $n$ being an user-specified parameter.
+kernel, $n$ being a user-specified parameter.
 The measure may be repeated an arbitrary number of times to improve
 stability.
 
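The measurement scheme just described — a kernel repeated $n$ times per measure, the measure itself repeated for stability, and the result normalized by $n$ — can be sketched as follows. The helper name and the choice of keeping the minimum timing are illustrative assumptions, not the chapter's actual harness:

```python
import time

def measure(kernel, n, repeats=10):
    """One measure = n back-to-back runs of the kernel, timed as a whole.
    The measure is repeated `repeats` times; the minimum total time is kept
    (least perturbed by system noise) and normalized by n."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        for _ in range(n):
            kernel()
        timings.append(time.perf_counter() - start)
    return min(timings) / n

# Hypothetical usage: time a tiny arithmetic kernel.
per_iteration = measure(lambda: sum(range(100)), n=1000)
```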
@@ -46,7 +46,7 @@ removed because they are incompatible with PoCC (introduced below). The
 \texttt{lu} benchmark is left out as its execution alone takes longer than all
 others together, making its dynamic analysis (\eg{} with \gus) impractical.
 In addition to the importance of linear algebra within
-it, one of its important features is that it does not include computational
+Polybench, one of its important features is that it does not include computational
 kernels with conditional control flow (\eg{} \texttt{if-then-else})
 ---~however, it does include conditional data flow, using the ternary
 conditional operator of C.
@@ -21,7 +21,8 @@ kernel-level results thanks to the occurrences previously measured.
 
 Using the Capstone disassembler~\cite{tool:capstone}, we split the assembly
 code at each control flow instruction (jump, call, return, \ldots) and each
-jump site.
+jump site, as in \autoref{alg:bb_extr_procedure} from
+\autoref{ssec:palmed_bb_extraction}.
 
 To accurately obtain the occurrences of each basic block in the whole kernel's
 computation,
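The splitting procedure referenced here can be sketched in a few lines, assuming the code has already been disassembled (\eg{} by Capstone) into (address, mnemonic, operand) triples. The mnemonic set and the tuple format are simplifying assumptions for illustration; this is a sketch of the general leader-based algorithm, not the actual \autoref{alg:bb_extr_procedure}:

```python
# Illustrative subset of control-flow mnemonics; a real pass would query the
# disassembler (e.g. Capstone's CS_GRP_JUMP / CS_GRP_CALL / CS_GRP_RET groups).
CONTROL_FLOW = {"jmp", "je", "jne", "call", "ret"}

def split_basic_blocks(insns):
    """Split a list of (address, mnemonic, operand) triples into basic blocks.

    Pass 1 collects the block leaders: the entry point, every direct
    jump/call target, and every instruction following a control-flow
    instruction. Pass 2 cuts the instruction stream at each leader."""
    leaders = {insns[0][0]}
    for i, (addr, mnemonic, op) in enumerate(insns):
        if mnemonic in CONTROL_FLOW:
            if i + 1 < len(insns):
                leaders.add(insns[i + 1][0])   # fall-through successor
            if mnemonic != "ret" and op.startswith("0x"):
                leaders.add(int(op, 16))       # direct jump/call target
    blocks, current = [], []
    for insn in insns:
        if insn[0] in leaders and current:
            blocks.append(current)
            current = []
        current.append(insn)
    if current:
        blocks.append(current)
    return blocks
```

On a toy stream with one conditional jump and one backward jump, this yields three blocks, cut after each control-flow instruction and at each jump target.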
@@ -10,11 +10,11 @@ predictions comparable to baseline hardware counter measures.
 
 \subsection{Experimental environment}
 
-The experiments presented in this paper were all realized on a Dell PowerEdge
-C6420 machine, from the \textit{Dahu} cluster of Grid5000 in
-Grenoble~\cite{grid5000}. The server is equipped with 192\,GB of DDR4 SDRAM
----~only a small fraction of which was used~--- and two Intel Xeon Gold 6130
-CPUs (x86-64, Skylake microarchitecture) with 16 cores each.
+The experiments presented in this chapter, unless stated otherwise, were all
+realized on a Dell PowerEdge C6420 machine, from the \textit{Dahu} cluster of
+Grid5000 in Grenoble~\cite{grid5000}. The server is equipped with 192\,GB of
+DDR4 SDRAM ---~only a small fraction of which was used~--- and two Intel Xeon
+Gold 6130 CPUs (x86-64, Skylake microarchitecture) with 16 cores each.
 
 The experiments themselves were run inside a Docker environment based on Debian
 Bullseye. Care was taken to disable hyperthreading to improve measurements
@@ -92,9 +92,11 @@ consequently, lifted predictions can reasonably be compared to one another.
 \footnotesize
 \begin{tabular}{l | r r r | r r r | r r r}
 \toprule
+\multicolumn{1}{c|}{\textbf{Polybench}}
 & \multicolumn{3}{c|}{\textbf{Frontend}}
 & \multicolumn{3}{c|}{\textbf{Ports}}
 & \multicolumn{3}{c}{\textbf{Dependencies}} \\
+\multicolumn{1}{c|}{\textbf{benchmark}}
 & \textbf{yes} & \textbf{no} & \textbf{disagr.} & \textbf{yes} & \textbf{no} & \textbf{disagr.} & \textbf{yes} & \textbf{no} & \textbf{disagr.} \\
 
 \midrule
@@ -48,8 +48,8 @@ in \autoref{chap:palmed} and \autoref{chap:frontend}.
 These results are, overall, significantly worse than what each tool's article
 presents. We attribute this difference mostly to the specificities of
 Polybench: being composed of computation kernels, it intrinsically stresses the
-CPU more than basic blocks extracted out of the Spec benchmark suite. This
-difference is clearly reflected in the experimental section of Palmed in
+CPU more than basic blocks extracted out of \eg{} the Spec benchmark suite.
+This difference is clearly reflected in the experimental section of Palmed in
 \autoref{chap:palmed}: the accuracy of most tools is worse on Polybench than on
 Spec, often by more than a factor of two.
 
@@ -105,6 +105,12 @@ The following experiments are executed on an Intel(R) Xeon(R) Gold 6230R CPU
 \end{lstlisting}
 \end{minipage}
 
+Note here that the \lstxasm{vmulsd out, in1, in2} instruction is the scalar
+double-precision float multiplication of values from \lstxasm{in1} and
+\lstxasm{in2}, storing the result in \lstxasm{out}; while \lstxasm{vmovsd out,
+in} is a simple \lstxasm{mov} operation from \lstxasm{in} to \lstxasm{out}
+operating on double-precision floats in \reg{xmm} registers.
+
 When executed with all the general purpose registers initialized to the default
 constant, \bhive{} reports 9 cycles per iteration, since \reg{rax} and
 \reg{r10} hold the same value, inducing a read-after-write dependency between
@@ -123,10 +129,11 @@ influence the results whenever it gets loaded into registers.
 \vspace{0.5em}
 
 \paragraph{Failed analysis.} Some memory accesses will always result in an
-error; for instance, it is impossible to \texttt{mmap} at an address lower
-than \texttt{/proc/sys/vm/mmap\_min\_addr}, defaulting to \texttt{0x10000}. Thus,
-with equal initial values for all registers, the following kernel would fail,
-since the second operation attempts to load at address 0:
+error; for instance, on Linux, it is impossible to \texttt{mmap} at an address
+lower than \texttt{/proc/sys/vm/mmap\_min\_addr}, defaulting to
+\texttt{0x10000}. Thus, with equal initial values for all registers, the
+following kernel would fail, since the second operation attempts to load at
+address 0:
 
 \begin{minipage}{0.95\linewidth}
 \begin{lstlisting}[language={[x86masm]Assembler}]
@@ -181,7 +188,7 @@ In the majority of the cases studied, the tools are not able to agree on the
 presence or absence of a type of bottleneck. Although it might seem that the
 tools are performing better on frontend bottleneck detection, it must be
 recalled that only two tools (versus three in the other cases) are reporting
-frontend bottlenecks, thus making it easier for them to agree.
+frontend bottlenecks, thus making it more likely for them to agree.
 
 \begin{table}
 \centering
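The three columns of the agreement table (yes / no / disagr.) amount to a unanimity test over the tools' boolean bottleneck reports; a sketch, with an illustrative function name:

```python
def classify_agreement(votes):
    """Classify a list of per-tool boolean bottleneck reports into the
    table's three columns: unanimous "yes", unanimous "no", or "disagr."."""
    if all(votes):
        return "yes"
    if not any(votes):
        return "no"
    return "disagr."
```

With two voters instead of three, the two unanimous outcomes cover 2 of 4 possible vote combinations rather than 2 of 8, which is why agreement is mechanically more likely in the frontend column.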
@@ -333,6 +340,8 @@ While the results for \llvmmca, \uica{} and \iaca{} globally improved
 significantly, the most noticeable improvements are the reduced spread of the
 results and the Kendall's $\tau$ correlation coefficient's increase.
 
+\medskip{}
+
 From this,
 we argue that detecting memory-carried dependencies is a weak point in current
 state-of-the-art static analyzers, and that their results could be
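Kendall's $\tau$, used in this hunk as a ranking-quality metric, depends only on the relative order of predictions, not their magnitude. A pure-Python sketch of the tau-a variant (no tie correction), sufficient to convey the definition — concordant minus discordant pairs, normalized by the total number of pairs:

```python
def kendall_tau(xs, ys):
    """Kendall's tau-a between two equal-length sequences: the fraction of
    concordant pairs minus the fraction of discordant pairs."""
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if s > 0:
                concordant += 1    # pair ordered the same way in both
            elif s < 0:
                discordant += 1    # pair ordered oppositely
    return (concordant - discordant) / (n * (n - 1) / 2)
```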
@@ -29,7 +29,8 @@ We were also able to show in Section~\ref{ssec:memlatbound}
 that state-of-the-art static analyzers struggle to
 account for memory-carried dependencies; a weakness significantly impacting
 their overall results on our benchmarks. We believe that detecting
-and accounting for these dependencies is an important future works direction.
+and accounting for these dependencies is an important topic ---~which we will
+tackle in the following chapter.
 
 Moreover, we present this work in the form of a modular software package, each
 component of which exposes numerous adjustable parameters. These components can
@@ -1,2 +0,0 @@
-%% \section*{Conclusion}
-%% \todo{}
@@ -9,4 +9,3 @@ analysis: \cesasme{}}\label{chap:CesASMe}
 \input{20_evaluation.tex}
 \input{25_results_analysis.tex}
 \input{30_future_works.tex}
-\input{99_conclusion.tex}
@@ -68,7 +68,7 @@
 
 \newcommand{\coeq}{CO$_{2}$eq}
 
-\newcommand{\figref}[1]{[\ref{#1}]}
+\newcommand{\figref}[1]{[§\ref{#1}]}
 
 \newcommand{\reg}[1]{\texttt{\%#1}}
 