Proof-read chapter 4 (CesASMe)

Théophile Bastian 2024-08-18 17:42:44 +02:00
parent 24e3d4a817
commit 9cfeddeef7
11 changed files with 43 additions and 31 deletions


@@ -72,7 +72,7 @@ details below.
 Altogether, this method generates, for x86-64 processors, 13\,778 SPEC-based
 and 2\,664 polybench-based basic blocks.
-\subsection{Automating basic block extraction}
+\subsection{Automating basic block extraction}\label{ssec:palmed_bb_extraction}
 This manual method, however, has multiple drawbacks. It is, obviously, tedious
 to manually compile and run a benchmark suite, then extract basic blocks using


@@ -52,15 +52,17 @@ advocate for the measurement of the total execution time of a computation
 kernel in its original context, coupled with a precise measure of its number of
 iterations to normalize the measure.
-We then present a fully-tooled solution to evaluate and compare the
-diversity of static throughput predictors. Our tool, \cesasme, solves two main
-issues in this direction. In Section~\ref{sec:bench_gen}, we describe how
-\cesasme{} generates a wide variety of computation kernels stressing different
-parameters of the architecture, and thus of the predictors' models, while
-staying close to representative workloads. To achieve this, we use
-Polybench~\cite{bench:polybench}, a C-level benchmark suite representative of
-scientific computation workloads, that we combine with a variety of
-optimisations, including polyhedral loop transformations.
+We then present a fully-tooled solution to evaluate and compare the diversity
+of static throughput predictors. Our tool, \cesasme, solves two main issues in
+this direction. In Section~\ref{sec:bench_gen}, we describe how \cesasme{}
+generates a wide variety of computation kernels stressing different parameters
+of the architecture, and thus of the predictors' models, while staying close to
+representative workloads. To achieve this, we use
+Polybench~\cite{bench:polybench}, a C-level benchmark suite that we already
+introduced for \palmed{} in \autoref{sec:benchsuite_bb}. Polybench is composed
+of benchmarks representative of scientific computation workloads, that we
+combine with a variety of optimisations, including polyhedral loop
+transformations.
 In Section~\ref{sec:bench_harness}, we describe how \cesasme{} is able to
 evaluate throughput predictors on this set of benchmarks by lifting their


@@ -12,7 +12,7 @@ SIMD arithmetic operation''.
 \paragraph{A dynamic code analyzer: \gus{}.}
 So far, this manuscript was mostly concerned with static code analyzers.
 Throughput prediction tools, however, are not all static.
-\gus is a dynamic tool first introduced in \fgruber{}'s PhD
+\gus{} is a dynamic tool first introduced in \fgruber{}'s PhD
 thesis~\cite{phd:gruber}. It leverages \qemu{}'s instrumentation capabilities to
 dynamically predict the throughput of user-defined regions of interest in whole
 programs.


@@ -12,7 +12,7 @@ In practice, a microbenchmark's \textit{computational kernel} is a simple
 \texttt{for} loop, whose
 body contains no loops and whose bounds are statically known.
 A \emph{measure} is a number of repetitions $n$ of this computational
-kernel, $n$ being an user-specified parameter.
+kernel, $n$ being a user-specified parameter.
 The measure may be repeated an arbitrary number of times to improve
 stability.
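
For readers taking this hunk out of context, the structure it describes (a flat computational kernel plus a measure of n repetitions) could be sketched in C roughly as follows. This is purely illustrative, not CesASMe's actual code; names and bounds are made up:

    /* Illustrative sketch only: a computational kernel that is a single flat
     * loop with statically known bounds, and a measure that repeats it n
     * times, n being user-specified. */
    #include <stddef.h>

    #define SIZE 1024                      /* statically known bound */

    static double a[SIZE], b[SIZE], acc;

    static void kernel(void)               /* kernel body: no nested loops */
    {
        for (size_t i = 0; i < SIZE; i++)
            acc += a[i] * b[i];
    }

    void measure(unsigned long n)          /* one measure = n kernel repetitions */
    {
        for (unsigned long r = 0; r < n; r++)
            kernel();
    }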
@@ -46,7 +46,7 @@ removed because they are incompatible with PoCC (introduced below). The
 \texttt{lu} benchmark is left out as its execution alone takes longer than all
 others together, making its dynamic analysis (\eg{} with \gus) impractical.
 In addition to the importance of linear algebra within
-it, one of its important features is that it does not include computational
+Polybench, one of its important features is that it does not include computational
 kernels with conditional control flow (\eg{} \texttt{if-then-else})
 ---~however, it does include conditional data flow, using the ternary
 conditional operator of C.


@@ -21,7 +21,8 @@ kernel-level results thanks to the occurrences previously measured.
 Using the Capstone disassembler~\cite{tool:capstone}, we split the assembly
 code at each control flow instruction (jump, call, return, \ldots) and each
-jump site.
+jump site, as in \autoref{alg:bb_extr_procedure} from
+\autoref{ssec:palmed_bb_extraction}.
 To accurately obtain the occurrences of each basic block in the whole kernel's
 computation,
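
The splitting step mentioned in this hunk can be sketched with Capstone's C API roughly as below. This is only an illustration of the idea, not the actual CesASMe implementation; it marks a boundary after every control-flow instruction and omits the extra pass that would also split at each jump target:

    #include <inttypes.h>
    #include <stdio.h>
    #include <capstone/capstone.h>

    /* Illustrative only: print a basic-block boundary after every
     * control-flow instruction (jump, call, return) found in `code`. */
    void split_basic_blocks(const uint8_t *code, size_t size, uint64_t addr)
    {
        csh handle;
        cs_insn *insn;

        if (cs_open(CS_ARCH_X86, CS_MODE_64, &handle) != CS_ERR_OK)
            return;
        cs_option(handle, CS_OPT_DETAIL, CS_OPT_ON);  /* needed to query groups */

        size_t count = cs_disasm(handle, code, size, addr, 0, &insn);
        for (size_t i = 0; i < count; i++) {
            printf("0x%" PRIx64 ":\t%s\t%s\n", insn[i].address,
                   insn[i].mnemonic, insn[i].op_str);
            if (cs_insn_group(handle, &insn[i], CS_GRP_JUMP) ||
                cs_insn_group(handle, &insn[i], CS_GRP_CALL) ||
                cs_insn_group(handle, &insn[i], CS_GRP_RET))
                printf("---- basic block boundary ----\n");
        }
        cs_free(insn, count);
        cs_close(&handle);
    }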


@@ -10,11 +10,11 @@ predictions comparable to baseline hardware counter measures.
 \subsection{Experimental environment}
-The experiments presented in this paper were all realized on a Dell PowerEdge
-C6420 machine, from the \textit{Dahu} cluster of Grid5000 in
-Grenoble~\cite{grid5000}. The server is equipped with 192\,GB of DDR4 SDRAM
----~only a small fraction of which was used~--- and two Intel Xeon Gold 6130
-CPUs (x86-64, Skylake microarchitecture) with 16 cores each.
+The experiments presented in this chapter, unless stated otherwise, were all
+realized on a Dell PowerEdge C6420 machine, from the \textit{Dahu} cluster of
+Grid5000 in Grenoble~\cite{grid5000}. The server is equipped with 192\,GB of
+DDR4 SDRAM ---~only a small fraction of which was used~--- and two Intel Xeon
+Gold 6130 CPUs (x86-64, Skylake microarchitecture) with 16 cores each.
 The experiments themselves were run inside a Docker environment based on Debian
 Bullseye. Care was taken to disable hyperthreading to improve measurements
@@ -92,9 +92,11 @@ consequently, lifted predictions can reasonably be compared to one another.
 \footnotesize
 \begin{tabular}{l | r r r | r r r | r r r}
 \toprule
+\multicolumn{1}{c|}{\textbf{Polybench}}
 & \multicolumn{3}{c|}{\textbf{Frontend}}
 & \multicolumn{3}{c|}{\textbf{Ports}}
 & \multicolumn{3}{c}{\textbf{Dependencies}} \\
+\multicolumn{1}{c|}{\textbf{benchmark}}
 & \textbf{yes} & \textbf{no} & \textbf{disagr.} & \textbf{yes} & \textbf{no} & \textbf{disagr.} & \textbf{yes} & \textbf{no} & \textbf{disagr.} \\
 \midrule


@@ -48,8 +48,8 @@ in \autoref{chap:palmed} and \autoref{chap:frontend}.
 These results are, overall, significantly worse than what each tool's article
 presents. We attribute this difference mostly to the specificities of
 Polybench: being composed of computation kernels, it intrinsically stresses the
-CPU more than basic blocks extracted out of the Spec benchmark suite. This
-difference is clearly reflected in the experimental section of Palmed in
+CPU more than basic blocks extracted out of \eg{} the Spec benchmark suite.
+This difference is clearly reflected in the experimental section of Palmed in
 \autoref{chap:palmed}: the accuracy of most tools is worse on Polybench than on
 Spec, often by more than a factor of two.
@@ -105,6 +105,12 @@ The following experiments are executed on an Intel(R) Xeon(R) Gold 6230R CPU
 \end{lstlisting}
 \end{minipage}
+Note here that the \lstxasm{vmulsd out, in1, in2} instruction is the scalar
+double-precision float multiplication of values from \lstxasm{in1} and
+\lstxasm{in2}, storing the result in \lstxasm{out}; while \lstxasm{vmovsd out,
+in} is a simple \lstxasm{mov} operation from \lstxasm{in} to \lstxasm{out}
+operating on double-precision floats in \reg{xmm} registers.
 When executed with all the general purpose registers initialized to the default
 constant, \bhive{} reports 9 cycles per iteration, since \reg{rax} and
 \reg{r10} hold the same value, inducing a read-after-write dependency between
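
The kernel discussed here is not reproduced in this hunk; a hypothetical C analogue of the aliasing effect it describes might look as follows. This is illustrative only, and the pointer names merely echo the registers mentioned above:

    /* Hypothetical analogue: if %rax and %r10 are initialized to the same
     * constant, the load and the store below alias, so each iteration waits
     * on the previous one (read-after-write dependency); with distinct
     * addresses, the iterations are independent and run faster. */
    void iteration(double *rax_ptr, double *r10_ptr, double factor)
    {
        double v = *rax_ptr;   /* load through %rax    */
        v *= factor;           /* vmulsd-like multiply */
        *r10_ptr = v;          /* store through %r10   */
    }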
@@ -123,10 +129,11 @@ influence the results whenever it gets loaded into registers.
 \vspace{0.5em}
 \paragraph{Failed analysis.} Some memory accesses will always result in an
-error; for instance, it is impossible to \texttt{mmap} at an address lower
-than \texttt{/proc/sys/vm/mmap\_min\_addr}, defaulting to \texttt{0x10000}. Thus,
-with equal initial values for all registers, the following kernel would fail,
-since the second operation attempts to load at address 0:
+error; for instance, on Linux, it is impossible to \texttt{mmap} at an address
+lower than \texttt{/proc/sys/vm/mmap\_min\_addr}, defaulting to
+\texttt{0x10000}. Thus, with equal initial values for all registers, the
+following kernel would fail, since the second operation attempts to load at
+address 0:
 \begin{minipage}{0.95\linewidth}
 \begin{lstlisting}[language={[x86masm]Assembler}]
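
As an aside, the mmap_min_addr restriction invoked here can be checked with a small standalone C program such as the sketch below; it is independent of CesASMe and BHive:

    #define _DEFAULT_SOURCE
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    /* On a default Linux configuration, mapping a page at a fixed address
     * below mmap_min_addr (0x10000 by default) is refused, typically EPERM. */
    int main(void)
    {
        void *p = mmap((void *)0x1000, 4096, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
        if (p == MAP_FAILED)
            printf("mmap at 0x1000 failed: %s\n", strerror(errno));
        else
            printf("unexpectedly mapped at %p\n", p);
        return 0;
    }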
@@ -181,7 +188,7 @@ In the majority of the cases studied, the tools are not able to agree on the
 presence or absence of a type of bottleneck. Although it might seem that the
 tools are performing better on frontend bottleneck detection, it must be
 recalled that only two tools (versus three in the other cases) are reporting
-frontend bottlenecks, thus making it easier for them to agree.
+frontend bottlenecks, thus making it more likely for them to agree.
 \begin{table}
 \centering
@@ -333,6 +340,8 @@ While the results for \llvmmca, \uica{} and \iaca{} globally improved
 significantly, the most noticeable improvements are the reduced spread of the
 results and the Kendall's $\tau$ correlation coefficient's increase.
+\medskip{}
 From this,
 we argue that detecting memory-carried dependencies is a weak point in current
 state-of-the-art static analyzers, and that their results could be


@@ -29,7 +29,8 @@ We were also able to show in Section~\ref{ssec:memlatbound}
 that state-of-the-art static analyzers struggle to
 account for memory-carried dependencies; a weakness significantly impacting
 their overall results on our benchmarks. We believe that detecting
-and accounting for these dependencies is an important future works direction.
+and accounting for these dependencies is an important topic --~which we will
+tackle in the following chapter.
 Moreover, we present this work in the form of a modular software package, each
 component of which exposes numerous adjustable parameters. These components can


@@ -1,2 +0,0 @@
-%% \section*{Conclusion}
-%% \todo{}


@@ -9,4 +9,3 @@ analysis: \cesasme{}}\label{chap:CesASMe}
 \input{20_evaluation.tex}
 \input{25_results_analysis.tex}
 \input{30_future_works.tex}
-\input{99_conclusion.tex}


@@ -68,7 +68,7 @@
 \newcommand{\coeq}{CO$_{2}$eq}
-\newcommand{\figref}[1]{[\ref{#1}]}
+\newcommand{\figref}[1]{[§\ref{#1}]}
 \newcommand{\reg}[1]{\texttt{\%#1}}