Proof-read chapter 4 (CesASMe)
This commit is contained in:
parent 24e3d4a817
commit 9cfeddeef7
11 changed files with 43 additions and 31 deletions
@@ -72,7 +72,7 @@ details below.
 Altogether, this method generates, for x86-64 processors, 13\,778 SPEC-based
 and 2\,664 polybench-based basic blocks.
 
-\subsection{Automating basic block extraction}
+\subsection{Automating basic block extraction}\label{ssec:palmed_bb_extraction}
 
 This manual method, however, has multiple drawbacks. It is, obviously, tedious
 to manually compile and run a benchmark suite, then extract basic blocks using
@@ -52,15 +52,17 @@ advocate for the measurement of the total execution time of a computation
 kernel in its original context, coupled with a precise measure of its number of
 iterations to normalize the measure.
 
-We then present a fully-tooled solution to evaluate and compare the
-diversity of static throughput predictors. Our tool, \cesasme, solves two main
-issues in this direction. In Section~\ref{sec:bench_gen}, we describe how
-\cesasme{} generates a wide variety of computation kernels stressing different
-parameters of the architecture, and thus of the predictors' models, while
-staying close to representative workloads. To achieve this, we use
-Polybench~\cite{bench:polybench}, a C-level benchmark suite representative of
-scientific computation workloads, that we combine with a variety of
-optimisations, including polyhedral loop transformations.
+We then present a fully-tooled solution to evaluate and compare the diversity
+of static throughput predictors. Our tool, \cesasme, solves two main issues in
+this direction. In Section~\ref{sec:bench_gen}, we describe how \cesasme{}
+generates a wide variety of computation kernels stressing different parameters
+of the architecture, and thus of the predictors' models, while staying close to
+representative workloads. To achieve this, we use
+Polybench~\cite{bench:polybench}, a C-level benchmark suite that we already
+introduced for \palmed{} in \autoref{sec:benchsuite_bb}. Polybench is composed
+of benchmarks representative of scientific computation workloads, that we
+combine with a variety of optimisations, including polyhedral loop
+transformations.
 
 In Section~\ref{sec:bench_harness}, we describe how \cesasme{} is able to
 evaluate throughput predictors on this set of benchmarks by lifting their
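The lifting step mentioned here — aggregating per-basic-block predictions into a kernel-level figure using measured occurrence counts — boils down to a weighted sum. A minimal sketch, where the function name and the dictionary shapes are illustrative placeholders rather than \cesasme{}'s actual data structures:

```python
def lift_to_kernel(bb_cycles, occurrences):
    """Lift per-basic-block throughput predictions (cycles per execution of
    each block) to a whole-kernel prediction, weighting each block by the
    number of times it was observed to execute."""
    return sum(bb_cycles[bb] * occurrences[bb] for bb in occurrences)

# Hypothetical example: a kernel made of two blocks.
total = lift_to_kernel({"bb0": 2.0, "bb1": 3.0}, {"bb0": 10, "bb1": 5})
```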
@@ -12,7 +12,7 @@ SIMD arithmetic operation''.
 \paragraph{A dynamic code analyzer: \gus{}.}
 So far, this manuscript was mostly concerned with static code analyzers.
 Throughput prediction tools, however, are not all static.
-\gus is a dynamic tool first introduced in \fgruber{}'s PhD
+\gus{} is a dynamic tool first introduced in \fgruber{}'s PhD
 thesis~\cite{phd:gruber}. It leverages \qemu{}'s instrumentation capabilities to
 dynamically predict the throughput of user-defined regions of interest in whole
 programs.
@@ -12,7 +12,7 @@ In practice, a microbenchmark's \textit{computational kernel} is a simple
 \texttt{for} loop, whose
 body contains no loops and whose bounds are statically known.
 A \emph{measure} is a number of repetitions $n$ of this computational
-kernel, $n$ being an user-specified parameter.
+kernel, $n$ being a user-specified parameter.
 The measure may be repeated an arbitrary number of times to improve
 stability.
 
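The measurement scheme just described — a kernel repeated $n$ times per measure, the measure itself repeated for stability, and the result normalized by $n$ — can be sketched as follows. The helper name and the choice of keeping the minimum timing are illustrative assumptions, not the chapter's actual harness:

```python
import time

def measure(kernel, n, repeats=10):
    """One measure = n back-to-back runs of the kernel, timed as a whole.
    The measure is repeated `repeats` times; the minimum total time is kept
    (least perturbed by system noise) and normalized by n."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        for _ in range(n):
            kernel()
        timings.append(time.perf_counter() - start)
    return min(timings) / n

# Hypothetical usage: time a tiny arithmetic kernel.
per_iteration = measure(lambda: sum(range(100)), n=1000)
```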
@@ -46,7 +46,7 @@ removed because they are incompatible with PoCC (introduced below). The
 \texttt{lu} benchmark is left out as its execution alone takes longer than all
 others together, making its dynamic analysis (\eg{} with \gus) impractical.
 In addition to the importance of linear algebra within
-it, one of its important features is that it does not include computational
+Polybench, one of its important features is that it does not include computational
 kernels with conditional control flow (\eg{} \texttt{if-then-else})
 ---~however, it does include conditional data flow, using the ternary
 conditional operator of C.
@@ -21,7 +21,8 @@ kernel-level results thanks to the occurrences previously measured.
 
 Using the Capstone disassembler~\cite{tool:capstone}, we split the assembly
 code at each control flow instruction (jump, call, return, \ldots) and each
-jump site.
+jump site, as in \autoref{alg:bb_extr_procedure} from
+\autoref{ssec:palmed_bb_extraction}.
 
 To accurately obtain the occurrences of each basic block in the whole kernel's
 computation,
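The splitting procedure referenced here can be sketched in a few lines, assuming the code has already been disassembled (\eg{} by Capstone) into (address, mnemonic, operand) triples. The mnemonic set and the tuple format are simplifying assumptions for illustration; this is a sketch of the general leader-based algorithm, not the actual \autoref{alg:bb_extr_procedure}:

```python
# Illustrative subset of control-flow mnemonics; a real pass would query the
# disassembler (e.g. Capstone's CS_GRP_JUMP / CS_GRP_CALL / CS_GRP_RET groups).
CONTROL_FLOW = {"jmp", "je", "jne", "call", "ret"}

def split_basic_blocks(insns):
    """Split a list of (address, mnemonic, operand) triples into basic blocks.

    Pass 1 collects the block leaders: the entry point, every direct
    jump/call target, and every instruction following a control-flow
    instruction. Pass 2 cuts the instruction stream at each leader."""
    leaders = {insns[0][0]}
    for i, (addr, mnemonic, op) in enumerate(insns):
        if mnemonic in CONTROL_FLOW:
            if i + 1 < len(insns):
                leaders.add(insns[i + 1][0])   # fall-through successor
            if mnemonic != "ret" and op.startswith("0x"):
                leaders.add(int(op, 16))       # direct jump/call target
    blocks, current = [], []
    for insn in insns:
        if insn[0] in leaders and current:
            blocks.append(current)
            current = []
        current.append(insn)
    if current:
        blocks.append(current)
    return blocks
```

On a toy stream with one conditional jump and one backward jump, this yields three blocks, cut after each control-flow instruction and at each jump target.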
@@ -10,11 +10,11 @@ predictions comparable to baseline hardware counter measures.
 
 \subsection{Experimental environment}
 
-The experiments presented in this paper were all realized on a Dell PowerEdge
-C6420 machine, from the \textit{Dahu} cluster of Grid5000 in
-Grenoble~\cite{grid5000}. The server is equipped with 192\,GB of DDR4 SDRAM
----~only a small fraction of which was used~--- and two Intel Xeon Gold 6130
-CPUs (x86-64, Skylake microarchitecture) with 16 cores each.
+The experiments presented in this chapter, unless stated otherwise, were all
+realized on a Dell PowerEdge C6420 machine, from the \textit{Dahu} cluster of
+Grid5000 in Grenoble~\cite{grid5000}. The server is equipped with 192\,GB of
+DDR4 SDRAM ---~only a small fraction of which was used~--- and two Intel Xeon
+Gold 6130 CPUs (x86-64, Skylake microarchitecture) with 16 cores each.
 
 The experiments themselves were run inside a Docker environment based on Debian
 Bullseye. Care was taken to disable hyperthreading to improve measurements
@@ -92,9 +92,11 @@ consequently, lifted predictions can reasonably be compared to one another.
 \footnotesize
 \begin{tabular}{l | r r r | r r r | r r r}
 \toprule
+\multicolumn{1}{c|}{\textbf{Polybench}}
 & \multicolumn{3}{c|}{\textbf{Frontend}}
 & \multicolumn{3}{c|}{\textbf{Ports}}
 & \multicolumn{3}{c}{\textbf{Dependencies}} \\
+\multicolumn{1}{c|}{\textbf{benchmark}}
 & \textbf{yes} & \textbf{no} & \textbf{disagr.} & \textbf{yes} & \textbf{no} & \textbf{disagr.} & \textbf{yes} & \textbf{no} & \textbf{disagr.} \\
 
 \midrule
@@ -48,8 +48,8 @@ in \autoref{chap:palmed} and \autoref{chap:frontend}.
 These results are, overall, significantly worse than what each tool's article
 presents. We attribute this difference mostly to the specificities of
 Polybench: being composed of computation kernels, it intrinsically stresses the
-CPU more than basic blocks extracted out of the Spec benchmark suite. This
-difference is clearly reflected in the experimental section of Palmed in
+CPU more than basic blocks extracted out of \eg{} the Spec benchmark suite.
+This difference is clearly reflected in the experimental section of Palmed in
 \autoref{chap:palmed}: the accuracy of most tools is worse on Polybench than on
 Spec, often by more than a factor of two.
 
@@ -105,6 +105,12 @@ The following experiments are executed on an Intel(R) Xeon(R) Gold 6230R CPU
 \end{lstlisting}
 \end{minipage}
 
+Note here that the \lstxasm{vmulsd out, in1, in2} instruction is the scalar
+double-precision float multiplication of values from \lstxasm{in1} and
+\lstxasm{in2}, storing the result in \lstxasm{out}; while \lstxasm{vmovsd out,
+in} is a simple \lstxasm{mov} operation from \lstxasm{in} to \lstxasm{out}
+operating on double-precision floats in \reg{xmm} registers.
+
 When executed with all the general purpose registers initialized to the default
 constant, \bhive{} reports 9 cycles per iteration, since \reg{rax} and
 \reg{r10} hold the same value, inducing a read-after-write dependency between
@@ -123,10 +129,11 @@ influence the results whenever it gets loaded into registers.
 \vspace{0.5em}
 
 \paragraph{Failed analysis.} Some memory accesses will always result in an
-error; for instance, it is impossible to \texttt{mmap} at an address lower
-than \texttt{/proc/sys/vm/mmap\_min\_addr}, defaulting to \texttt{0x10000}. Thus,
-with equal initial values for all registers, the following kernel would fail,
-since the second operation attempts to load at address 0:
+error; for instance, on Linux, it is impossible to \texttt{mmap} at an address
+lower than \texttt{/proc/sys/vm/mmap\_min\_addr}, defaulting to
+\texttt{0x10000}. Thus, with equal initial values for all registers, the
+following kernel would fail, since the second operation attempts to load at
+address 0:
 
 \begin{minipage}{0.95\linewidth}
 \begin{lstlisting}[language={[x86masm]Assembler}]
@@ -181,7 +188,7 @@ In the majority of the cases studied, the tools are not able to agree on the
 presence or absence of a type of bottleneck. Although it might seem that the
 tools are performing better on frontend bottleneck detection, it must be
 recalled that only two tools (versus three in the other cases) are reporting
-frontend bottlenecks, thus making it easier for them to agree.
+frontend bottlenecks, thus making it more likely for them to agree.
 
 \begin{table}
 \centering
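The three columns of the agreement table (yes / no / disagr.) amount to a unanimity test over the tools' boolean bottleneck reports; a sketch, with an illustrative function name:

```python
def classify_agreement(votes):
    """Classify a list of per-tool boolean bottleneck reports into the
    table's three columns: unanimous "yes", unanimous "no", or "disagr."."""
    if all(votes):
        return "yes"
    if not any(votes):
        return "no"
    return "disagr."
```

With two voters instead of three, the two unanimous outcomes cover 2 of 4 possible vote combinations rather than 2 of 8, which is why agreement is mechanically more likely in the frontend column.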
@@ -333,6 +340,8 @@ While the results for \llvmmca, \uica{} and \iaca{} globally improved
 significantly, the most noticeable improvements are the reduced spread of the
 results and the Kendall's $\tau$ correlation coefficient's increase.
 
+\medskip{}
+
 From this,
 we argue that detecting memory-carried dependencies is a weak point in current
 state-of-the-art static analyzers, and that their results could be
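Kendall's $\tau$, used in this hunk as a ranking-quality metric, depends only on the relative order of predictions, not their magnitude. A pure-Python sketch of the tau-a variant (no tie correction), sufficient to convey the definition — concordant minus discordant pairs, normalized by the total number of pairs:

```python
def kendall_tau(xs, ys):
    """Kendall's tau-a between two equal-length sequences: the fraction of
    concordant pairs minus the fraction of discordant pairs."""
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if s > 0:
                concordant += 1    # pair ordered the same way in both
            elif s < 0:
                discordant += 1    # pair ordered oppositely
    return (concordant - discordant) / (n * (n - 1) / 2)
```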
@@ -29,7 +29,8 @@ We were also able to show in Section~\ref{ssec:memlatbound}
 that state-of-the-art static analyzers struggle to
 account for memory-carried dependencies; a weakness significantly impacting
 their overall results on our benchmarks. We believe that detecting
-and accounting for these dependencies is an important future works direction.
+and accounting for these dependencies is an important topic ---~which we will
+tackle in the following chapter.
 
 Moreover, we present this work in the form of a modular software package, each
 component of which exposes numerous adjustable parameters. These components can
@@ -1,2 +0,0 @@
-%% \section*{Conclusion}
-%% \todo{}
@@ -9,4 +9,3 @@ analysis: \cesasme{}}\label{chap:CesASMe}
 \input{20_evaluation.tex}
 \input{25_results_analysis.tex}
 \input{30_future_works.tex}
-\input{99_conclusion.tex}
@@ -68,7 +68,7 @@
 
 \newcommand{\coeq}{CO$_{2}$eq}
 
-\newcommand{\figref}[1]{[\ref{#1}]}
+\newcommand{\figref}[1]{[§\ref{#1}]}
 
 \newcommand{\reg}[1]{\texttt{\%#1}}
 