\chapter{Foundations}\label{chap:foundations}

\section{Introduction}\label{sec:intro}

In the previous chapters, we focused on two of the main bottleneck factors for
computation kernels: \autoref{chap:palmed} investigated the backend aspect of
throughput prediction, while \autoref{chap:frontend} delved into the frontend
aspects.

Throughout those two chapters, we entirely left out another crucial
factor: dependencies, and the latency they induce between instructions. We
managed to do so because our baseline of native execution consisted of
\pipedream{} measures, \emph{designed} to suppress any dependency.

However, state-of-the-art tools strive to provide an estimation of the
execution time $\cyc{\kerK}$ of a given kernel $\kerK$ that is \emph{as precise
as possible}, and as such, cannot neglect this third major bottleneck. An exact
throughput prediction would require a cycle-accurate simulator of the
processor, based on microarchitectural data that is most often not publicly
available, and would be prohibitively slow in any case. These tools thus each
solve in their own way the challenge of modeling complex CPUs while remaining

predicts 3 cycles. One may wonder which tool is correct.

The obvious solution to assess their predictions is to compare them to an
actual measure. However, accounting for dependencies at the scale of a basic
block makes this \emph{actual measure} not as trivially defined as it would
seem. Take for instance the following kernel:

\begin{minipage}{0.90\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]

also be complicated by code versioning.

\bigskip

In this chapter, we present a fully-tooled solution to evaluate and compare the
diversity of static throughput predictors. Our tool, \cesasme, solves two main
issues in this direction. In Section~\ref{sec:bench_gen}, we describe how
\cesasme{} generates a wide variety of computation kernels stressing different

\section{Related works}

Throughput prediction tools, however, are not all static.
\gus~\cite{phd:gruber} dynamically predicts the throughput of a whole program
region, instrumenting it to retrieve the exact events occurring through its

Running the harness described above provides us with 3500
benchmarks ---~after filtering out non-L1-resident ones~---, on which each
throughput predictor is run. Before analyzing these results in
Section~\ref{sec:results_analysis}, we evaluate the relevance of the
methodology presented in Section~\ref{sec:bench_harness} to make the tools'
predictions comparable to baseline hardware counter measures.

C6420 machine from the Grid5000 cluster~\cite{grid5000}, equipped with 192\,GB
of DDR4 SDRAM ---~only a small fraction of which was used~--- and two Intel
Xeon Gold 6130 CPUs (x86-64, Skylake microarchitecture) with 16 cores each.

The experiments themselves were run inside a Docker environment based on Debian
Bullseye. Care was taken to disable hyperthreading to improve measurement
stability. For tools whose output is based on a direct measurement (\perf,
\bhive), the benchmarks were run sequentially on a single core with no
experiments on the other cores. No such care was taken for \gus{} as, although
based on a dynamic run, its prediction is purely a function of recorded program
events and not of program measures. All other tools were run in parallel.

We use \llvmmca{} \texttt{v13.0.1}, \iaca{} \texttt{v3.0-28-g1ba2cbb}, \bhive{}
at commit \texttt{5f1d500}, \uica{} at commit \texttt{9cbbe93}, \gus{} at

consequently, lifted predictions can reasonably be compared to one another.
\end{table}

\begin{table}
\centering
\footnotesize
\begin{tabular}{l | r r r | r r r | r r r}
\toprule
& \multicolumn{3}{c|}{\textbf{Frontend}}

the transformations described in Section~\ref{sec:bench_gen}.

Generating and running the full suite of benchmarks required about 30h of
continuous computation on a single machine. During the experiments, the power
supply units reported a near-constant consumption of about 350W. The carbon
intensity of the power grid in France, where the experiment was run, was, at
the time, about 29g\coeq/kWh~\cite{electricitymaps}.

The electricity consumed directly by the server thus amounts to about
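The order of magnitude involved can be cross-checked directly from the figures above, assuming the stated 30\,h runtime and a constant 350\,W draw (a back-of-the-envelope sketch, not an official accounting):

```latex
\[
  30\,\mathrm{h} \times 350\,\mathrm{W} = 10.5\,\mathrm{kWh},
  \qquad
  10.5\,\mathrm{kWh} \times 29\,\mathrm{g}\coeq/\mathrm{kWh}
  \approx 305\,\mathrm{g}\coeq \approx 0.3\,\mathrm{kg}\,\coeq.
\]
```

That is, on the order of a few hundred grams of \coeq{} for the direct electricity use alone.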

Gus & 3500 & 0 & (0.00) & 20.37 & 15.01 & 7.82 & 30.59 & 0.82 & 188.04 \\

The distribution of the relative errors, for each tool, is presented as a
box plot in Figure~\ref{fig:overall_analysis_boxplot}. Statistical indicators
are also given in Table~\ref{table:overall_analysis_stats}. We also give, for
each tool, its Kendall's tau indicator~\cite{kendalltau}, which we used earlier
in \autoref{chap:palmed} and \autoref{chap:frontend}.
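As a reminder, Kendall's tau measures how well the pair-wise ordering of two series is preserved, $-1$ being a full anti-correlation and $1$ a full correlation. A minimal pure-Python sketch of the pairwise definition follows (ties ignored; this is an illustration, not the estimator used in our experiments):

```python
def kendall_tau(xs, ys):
    """Kendall's tau over paired samples: (concordant - discordant) pairs,
    normalised by the total number of pairs n(n-1)/2. Ties are not handled."""
    assert len(xs) == len(ys)
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            # Same sign in both series -> concordant pair; opposite -> discordant
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Ordering fully preserved -> 1.0; fully reversed -> -1.0
print(kendall_tau([1.0, 2.0, 3.0, 4.0], [10, 20, 30, 40]))  # 1.0
print(kendall_tau([1.0, 2.0, 3.0, 4.0], [40, 30, 20, 10]))  # -1.0
```

This indicator is especially useful when one is not interested in a program's absolute throughput, but rather in comparing which of two programs has the better throughput.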
\begin{figure}
\centering

These results are, overall, significantly worse than what each tool's article
presents. We attribute this difference mostly to the specificities of
Polybench: being composed of computation kernels, it intrinsically stresses the
CPU more than basic blocks extracted out of the Spec benchmark suite. This
difference is clearly reflected in the experimental section of
\autoref{chap:palmed}: the accuracy of most tools is worse on Polybench than on
Spec, often by more than a factor of two.

As \bhive{} and \ithemal{} do not support control flow instructions

paths, can explain another portion. We also find that \bhive{} fails to produce
a result in about 40\,\% of the kernels explored ---~which means that, for those
cases, it failed on at least one of the constituent basic blocks. In fact, this
is due to the difficulties we mentioned in Section~\ref{sec:intro} related to
the need to reconstruct the context of each basic block \textit{ex nihilo}.

The basis of \bhive's method is to run the code to be measured, unrolled a

Gus & 2388 & 0 & (0.00\,\%) & 23.18\,\% & 20.23\,\% & 8.78\,\% & 32.73\,\% & 0.8
through memory-carried dependencies rows}\label{table:nomemdeps_stats}
\end{table}

An overview of the full results table hints towards two main tendencies: on a
significant number of rows, the static tools ---~thus leaving \gus{} and
\bhive{} apart~---, except \ithemal, often yield comparatively bad throughput
predictions \emph{together}; and many of these rows are those using the
\texttt{O1} and \texttt{O1autovect} compilation settings (\texttt{gcc} with
\texttt{-O1}, plus vectorisation options for the latter).

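The MAPE indicator used throughout these tables is the mean absolute percentage error of a tool's predictions against the \perf{} baseline. A minimal sketch of this computation (the cycle counts below are hypothetical, for illustration only):

```python
def mape(predictions, references):
    """Mean Absolute Percentage Error, in %: the mean of |pred - ref| / ref
    over all benchmarks, relative to the reference (baseline) measures."""
    assert len(predictions) == len(references)
    return 100.0 * sum(abs(p - r) / r
                       for p, r in zip(predictions, references)) / len(predictions)

# Hypothetical per-benchmark cycle counts: a tool's predictions vs. a baseline
perf_cycles = [100.0, 200.0, 400.0]
tool_cycles = [110.0, 180.0, 400.0]
print(mape(tool_cycles, perf_cycles))  # (10% + 10% + 0%) / 3 ≈ 6.67
```

A lower MAPE means predictions closer to the baseline; ranking benchmarks by this per-tool error is how the worst 30\,\% discussed below are selected.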
To confirm the first observation, we look at the 30\,\% worst benchmarks ---~in
terms of MAPE relative to \perf~--- for \llvmmca, \uica{} and \iaca{}

\section*{Conclusion and future works}

In this chapter, we have presented a fully-tooled approach that enables:

\begin{itemize}
\item the generation of a wide variety of microbenchmarks, reflecting both the

@misc{tool:pocc,
  title={{PoCC}, the Polyhedral Compiler Collection},
  author={Pouchet, Louis-No{\"e}l},
  year=2009,
  note={\url{https://www.cs.colostate.edu/~pouchet/software/pocc/}},
}