Proof-read chapter 4 (CesASMe)
parent 24e3d4a817
commit 9cfeddeef7
11 changed files with 43 additions and 31 deletions
@@ -72,7 +72,7 @@ details below.
 Altogether, this method generates, for x86-64 processors, 13\,778 SPEC-based
 and 2\,664 polybench-based basic blocks.
 
-\subsection{Automating basic block extraction}
+\subsection{Automating basic block extraction}\label{ssec:palmed_bb_extraction}
 
 This manual method, however, has multiple drawbacks. It is, obviously, tedious
 to manually compile and run a benchmark suite, then extract basic blocks using
@@ -52,15 +52,17 @@ advocate for the measurement of the total execution time of a computation
 kernel in its original context, coupled with a precise measure of its number of
 iterations to normalize the measure.
 
-We then present a fully-tooled solution to evaluate and compare the
-diversity of static throughput predictors. Our tool, \cesasme, solves two main
-issues in this direction. In Section~\ref{sec:bench_gen}, we describe how
-\cesasme{} generates a wide variety of computation kernels stressing different
-parameters of the architecture, and thus of the predictors' models, while
-staying close to representative workloads. To achieve this, we use
-Polybench~\cite{bench:polybench}, a C-level benchmark suite representative of
-scientific computation workloads, that we combine with a variety of
-optimisations, including polyhedral loop transformations.
+We then present a fully-tooled solution to evaluate and compare the diversity
+of static throughput predictors. Our tool, \cesasme, solves two main issues in
+this direction. In Section~\ref{sec:bench_gen}, we describe how \cesasme{}
+generates a wide variety of computation kernels stressing different parameters
+of the architecture, and thus of the predictors' models, while staying close to
+representative workloads. To achieve this, we use
+Polybench~\cite{bench:polybench}, a C-level benchmark suite that we already
+introduced for \palmed{} in \autoref{sec:benchsuite_bb}. Polybench is composed
+of benchmarks representative of scientific computation workloads, that we
+combine with a variety of optimisations, including polyhedral loop
+transformations.
 
 In Section~\ref{sec:bench_harness}, we describe how \cesasme{} is able to
 evaluate throughput predictors on this set of benchmarks by lifting their
@@ -12,7 +12,7 @@ SIMD arithmetic operation''.
 \paragraph{A dynamic code analyzer: \gus{}.}
 So far, this manuscript was mostly concerned with static code analyzers.
 Throughput prediction tools, however, are not all static.
-\gus is a dynamic tool first introduced in \fgruber{}'s PhD
+\gus{} is a dynamic tool first introduced in \fgruber{}'s PhD
 thesis~\cite{phd:gruber}. It leverages \qemu{}'s instrumentation capabilities to
 dynamically predict the throughput of user-defined regions of interest in whole
 programs.
@@ -12,7 +12,7 @@ In practice, a microbenchmark's \textit{computational kernel} is a simple
 \texttt{for} loop, whose
 body contains no loops and whose bounds are statically known.
 A \emph{measure} is a number of repetitions $n$ of this computational
-kernel, $n$ being an user-specified parameter.
+kernel, $n$ being a user-specified parameter.
 The measure may be repeated an arbitrary number of times to improve
 stability.
 
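The measurement scheme described in this hunk can be sketched as follows. This is only an illustration, not \cesasme{}'s actual harness; the names `measure`, `kernel` and `repetitions` are hypothetical:

```python
# A sketch (not CesASMe's actual harness) of the measurement scheme above:
# a measure runs the computational kernel n times and normalizes the total
# elapsed time by n; repeating the whole measure improves stability.
import time

def measure(kernel, n, repetitions=5):
    """Return the best per-iteration time over several repeated measures."""
    per_iteration = []
    for _ in range(repetitions):
        start = time.perf_counter()
        for _ in range(n):                 # n is the user-specified parameter
            kernel()
        elapsed = time.perf_counter() - start
        per_iteration.append(elapsed / n)  # normalize by the iteration count
    return min(per_iteration)              # keep the most stable estimate
```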
@@ -46,7 +46,7 @@ removed because they are incompatible with PoCC (introduced below). The
 \texttt{lu} benchmark is left out as its execution alone takes longer than all
 others together, making its dynamic analysis (\eg{} with \gus) impractical.
 In addition to the importance of linear algebra within
-it, one of its important features is that it does not include computational
+Polybench, one of its important features is that it does not include computational
 kernels with conditional control flow (\eg{} \texttt{if-then-else})
 ---~however, it does include conditional data flow, using the ternary
 conditional operator of C.
@@ -21,7 +21,8 @@ kernel-level results thanks to the occurrences previously measured.
 
 Using the Capstone disassembler~\cite{tool:capstone}, we split the assembly
 code at each control flow instruction (jump, call, return, \ldots) and each
-jump site.
+jump site, as in \autoref{alg:bb_extr_procedure} from
+\autoref{ssec:palmed_bb_extraction}.
 
 To accurately obtain the occurrences of each basic block in the whole kernel's
 computation,
@@ -10,11 +10,11 @@ predictions comparable to baseline hardware counter measures.
 
 \subsection{Experimental environment}
 
-The experiments presented in this paper were all realized on a Dell PowerEdge
-C6420 machine, from the \textit{Dahu} cluster of Grid5000 in
-Grenoble~\cite{grid5000}. The server is equipped with 192\,GB of DDR4 SDRAM
----~only a small fraction of which was used~--- and two Intel Xeon Gold 6130
-CPUs (x86-64, Skylake microarchitecture) with 16 cores each.
+The experiments presented in this chapter, unless stated otherwise, were all
+realized on a Dell PowerEdge C6420 machine, from the \textit{Dahu} cluster of
+Grid5000 in Grenoble~\cite{grid5000}. The server is equipped with 192\,GB of
+DDR4 SDRAM ---~only a small fraction of which was used~--- and two Intel Xeon
+Gold 6130 CPUs (x86-64, Skylake microarchitecture) with 16 cores each.
 
 The experiments themselves were run inside a Docker environment based on Debian
 Bullseye. Care was taken to disable hyperthreading to improve measurements
@@ -92,9 +92,11 @@ consequently, lifted predictions can reasonably be compared to one another.
 \footnotesize
 \begin{tabular}{l | r r r | r r r | r r r}
 \toprule
+\multicolumn{1}{c|}{\textbf{Polybench}}
 & \multicolumn{3}{c|}{\textbf{Frontend}}
 & \multicolumn{3}{c|}{\textbf{Ports}}
 & \multicolumn{3}{c}{\textbf{Dependencies}} \\
+\multicolumn{1}{c|}{\textbf{benchmark}}
 & \textbf{yes} & \textbf{no} & \textbf{disagr.} & \textbf{yes} & \textbf{no} & \textbf{disagr.} & \textbf{yes} & \textbf{no} & \textbf{disagr.} \\
 
 \midrule
@@ -48,8 +48,8 @@ in \autoref{chap:palmed} and \autoref{chap:frontend}.
 These results are, overall, significantly worse than what each tool's article
 presents. We attribute this difference mostly to the specificities of
 Polybench: being composed of computation kernels, it intrinsically stresses the
-CPU more than basic blocks extracted out of the Spec benchmark suite. This
-difference is clearly reflected in the experimental section of Palmed in
+CPU more than basic blocks extracted out of \eg{} the Spec benchmark suite.
+This difference is clearly reflected in the experimental section of Palmed in
 \autoref{chap:palmed}: the accuracy of most tools is worse on Polybench than on
 Spec, often by more than a factor of two.
 
@@ -105,6 +105,12 @@ The following experiments are executed on an Intel(R) Xeon(R) Gold 6230R CPU
 \end{lstlisting}
 \end{minipage}
 
+Note here that the \lstxasm{vmulsd out, in1, in2} instruction is the scalar
+double-precision float multiplication of values from \lstxasm{in1} and
+\lstxasm{in2}, storing the result in \lstxasm{out}; while \lstxasm{vmovsd out,
+in} is a simple \lstxasm{mov} operation from \lstxasm{in} to \lstxasm{out}
+operating on double-precision floats in \reg{xmm} registers.
+
 When executed with all the general purpose registers initialized to the default
 constant, \bhive{} reports 9 cycles per iteration, since \reg{rax} and
 \reg{r10} hold the same value, inducing a read-after-write dependency between
@@ -123,10 +129,11 @@ influence the results whenever it gets loaded into registers.
 \vspace{0.5em}
 
 \paragraph{Failed analysis.} Some memory accesses will always result in an
-error; for instance, it is impossible to \texttt{mmap} at an address lower
-than \texttt{/proc/sys/vm/mmap\_min\_addr}, defaulting to \texttt{0x10000}. Thus,
-with equal initial values for all registers, the following kernel would fail,
-since the second operation attempts to load at address 0:
+error; for instance, on Linux, it is impossible to \texttt{mmap} at an address
+lower than \texttt{/proc/sys/vm/mmap\_min\_addr}, defaulting to
+\texttt{0x10000}. Thus, with equal initial values for all registers, the
+following kernel would fail, since the second operation attempts to load at
+address 0:
 
 \begin{minipage}{0.95\linewidth}
 \begin{lstlisting}[language={[x86masm]Assembler}]
@@ -181,7 +188,7 @@ In the majority of the cases studied, the tools are not able to agree on the
 presence or absence of a type of bottleneck. Although it might seem that the
 tools are performing better on frontend bottleneck detection, it must be
 recalled that only two tools (versus three in the other cases) are reporting
-frontend bottlenecks, thus making it easier for them to agree.
+frontend bottlenecks, thus making it more likely for them to agree.
 
 \begin{table}
 \centering
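The remark in this hunk, that fewer reporting tools agree more easily, can be made concrete with a toy model that is not part of the chapter's methodology: if each of $k$ tools independently answered "bottleneck" or "no bottleneck" uniformly at random, the chance that they all agree halves with each additional tool.

```python
# Toy model (illustrative only, not the chapter's methodology): k tools
# answering "yes"/"no" independently and uniformly at random all agree
# with probability 2 * (1/2)**k: either all say "yes" or all say "no".
def chance_agreement(k):
    return 2 * 0.5 ** k

# Two reporting tools agree by chance half of the time; three, a quarter.
```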
@@ -333,6 +340,8 @@ While the results for \llvmmca, \uica{} and \iaca{} globally improved
 significantly, the most noticeable improvements are the reduced spread of the
 results and the Kendall's $\tau$ correlation coefficient's increase.
 
+\medskip{}
+
 From this,
 we argue that detecting memory-carried dependencies is a weak point in current
 state-of-the-art static analyzers, and that their results could be
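For reference, the Kendall's $\tau$ coefficient mentioned in this hunk measures how often two rankings, such as predicted versus measured throughputs, order pairs of benchmarks consistently. Below is a minimal pure-Python sketch of the $\tau$-a variant; the chapter does not specify which variant it uses, so this is an illustration only:

```python
# Minimal Kendall tau-a: (concordant - discordant) / total pairs. Ties are
# not handled specially here (that would be tau-b), which is enough for an
# illustration of the metric's meaning.
from itertools import combinations

def kendall_tau(xs, ys):
    assert len(xs) == len(ys) and len(xs) >= 2
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        s = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if s > 0:
            concordant += 1   # the pair is ordered the same way in xs and ys
        elif s < 0:
            discordant += 1   # the pair is ordered oppositely
    pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / pairs
```

A value of 1 means the two rankings agree on every pair, and -1 that they are fully reversed.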
@@ -29,7 +29,8 @@ We were also able to show in Section~\ref{ssec:memlatbound}
 that state-of-the-art static analyzers struggle to
 account for memory-carried dependencies; a weakness significantly impacting
 their overall results on our benchmarks. We believe that detecting
-and accounting for these dependencies is an important future works direction.
+and accounting for these dependencies is an important topic --~which we will
+tackle in the following chapter.
 
 Moreover, we present this work in the form of a modular software package, each
 component of which exposes numerous adjustable parameters. These components can
@@ -1,2 +0,0 @@
-%% \section*{Conclusion}
-%% \todo{}
@@ -9,4 +9,3 @@ analysis: \cesasme{}}\label{chap:CesASMe}
 \input{20_evaluation.tex}
 \input{25_results_analysis.tex}
 \input{30_future_works.tex}
-\input{99_conclusion.tex}
@@ -68,7 +68,7 @@
 
 \newcommand{\coeq}{CO$_{2}$eq}
 
-\newcommand{\figref}[1]{[\ref{#1}]}
+\newcommand{\figref}[1]{[§\ref{#1}]}
 
 \newcommand{\reg}[1]{\texttt{\%#1}}
 