First pass on CesASMe -- intro is still a mess
This commit is contained in:
parent
f6f0336b34
commit
853f792023
7 changed files with 45 additions and 87 deletions
@@ -1 +1 @@
-\chapter{Foundations}
+\chapter{Foundations}\label{chap:foundations}
@@ -1,21 +1,18 @@
 \section{Introduction}\label{sec:intro}
+In the previous chapters, we focused on two of the main bottleneck factors for
+computation kernels: \autoref{chap:palmed} investigated the backend aspect of
+throughput prediction, while \autoref{chap:frontend} delved into the frontend
+aspects.
 
-At a time when software is expected to perform more computations, faster and in
-more constrained environments, tools that statically predict the resources (and
-in particular the CPU resources) they consume are very useful to guide their
-optimization. This need is reflected in the diversity of binary or assembly
-code analyzers following the deprecation of \iaca~\cite{iaca}, which Intel has
-maintained through 2019. Whether it is \llvmmca{}~\cite{llvm-mca},
-\uica{}~\cite{uica}, \ithemal~\cite{ithemal} or \gus~\cite{phd:gruber}, all
-these tools strive to extract various performance metrics, including the number
-of CPU cycles a computation kernel will take ---~which roughly translates to
-execution time. In addition to raw measurements (relying on hardware
-counters), these model-based analyses provide higher-level and refined data, to
-expose the bottlenecks and guide the optimization of a given code. This
-feedback is useful to experts optimizing computation kernels, including
-scientific simulations and deep-learning kernels.
+Throughout those two chapters, we entirely left out another crucial
+factor: dependencies, and the latency they induce between instructions. We
+could do so because our baseline of native execution was \pipedream{}
+measurements, \emph{designed} to suppress any dependency.
 
-An exact throughput prediction would require a cycle-accurate simulator of the
+However, state-of-the-art tools strive to provide an estimate of the
+execution time $\cyc{\kerK}$ of a given kernel $\kerK$ that is \emph{as precise
+as possible}, and as such, cannot neglect this third major bottleneck.
+An exact
+throughput prediction would require a cycle-accurate simulator of the
 processor, based on microarchitectural data that is most often not publicly
 available, and would be prohibitively slow in any case. These tools thus each
 solve in their own way the challenge of modeling complex CPUs while remaining
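The three-bottleneck view in this hunk (backend resources, frontend, and now dependency latencies) can be made concrete with a toy steady-state model. Everything below — instruction names, port sets, latencies, and the max-of-bounds rule — is invented for illustration and is not the actual model of any of the cited tools:

```python
# Toy steady-state model (illustrative only): per-iteration cycles are bounded
# below both by the busiest execution port and by the latency of the longest
# loop-carried dependency chain; a kernel runs no faster than the larger bound.
instrs = [
    # (assembly text, usable ports, loop-carried latency) -- hypothetical values
    ("add  %rax, %rbx",   {"p0", "p1"}, 1),  # carried through %rbx
    ("imul %rbx, %rcx",   {"p1"},       3),  # carried through %rcx
    ("mov  (%rdx), %rsi", {"p2"},       0),  # no loop-carried dependency
]

# Resource (backend) bound: spread each instruction evenly over its ports and
# take the most pressured port.
pressure = {}
for _, ports, _ in instrs:
    for p in ports:
        pressure[p] = pressure.get(p, 0.0) + 1.0 / len(ports)
resource_bound = max(pressure.values())  # 1.5 cycles (port p1)

# Dependency bound: total latency along the carried chain (here a single chain
# through %rbx then %rcx; the load is off-chain and contributes 0).
dependency_bound = sum(lat for _, _, lat in instrs)  # 4 cycles

cycles_per_iter = max(resource_bound, dependency_bound)
print(cycles_per_iter)
```

With these invented numbers the kernel is dependency-bound (4 cycles per iteration), which is exactly the situation the Pipedream baseline of the earlier chapters was designed to suppress.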
@@ -40,9 +37,9 @@ predicts 3 cycles. One may wonder which tool is correct.
 
 
 The obvious solution to assess their predictions is to compare them to an
-actual measure. However, as these tools reason at the basic block level, this
-is not as trivially defined as it would seem. Take for instance the following
-kernel:
+actual measure. However, accounting for dependencies at the scale of a basic
+block makes this \emph{actual measure} not as trivially defined as it would
+seem. Take for instance the following kernel:
 
 \begin{minipage}{0.90\linewidth}
 \begin{lstlisting}[language={[x86masm]Assembler}]
@@ -98,7 +95,7 @@ also be complicated by code versioning.
 
 \bigskip
 
-In this article, we present a fully-tooled solution to evaluate and compare the
+In this chapter, we present a fully-tooled solution to evaluate and compare the
 diversity of static throughput predictors. Our tool, \cesasme, solves two main
 issues in this direction. In Section~\ref{sec:bench_gen}, we describe how
 \cesasme{} generates a wide variety of computation kernels stressing different
@@ -1,40 +1,4 @@
 \section{Related works}
-
-The static throughput analyzers studied rely on a variety of models.
-\iaca~\cite{iaca}, developed by Intel and now deprecated, is closed-source and
-relies on Intel's expertise on their own processors.
-The LLVM compiling ecosystem, to guide optimization passes, maintains models of many
-architectures. These models are used in the LLVM Machine Code Analyzer,
-\llvmmca~\cite{llvm-mca}, to statically evaluate the performance of a segment
-of assembly.
-Independently, Abel and Reineke used an automated microbenchmark generation
-approach to generate port mappings of many architectures in
-\texttt{uops.info}~\cite{nanobench, uopsinfo} to model processors' backends.
-This work was continued with \uica~\cite{uica}, extending this model with an
-extensive frontend description.
-Following a completely different approach, \ithemal~\cite{ithemal} uses a deep
-neural network to predict basic blocks throughput. To obtain enough data to
-train its model, the authors also developed \bhive~\cite{bhive}, a profiling
-tool working on basic blocks.
-
-Another static tool, \osaca~\cite{osaca2}, provides lower- and
-upper-bounds to the execution time of a basic block. As this kind of
-information cannot be fairly compared with tools yielding an exact throughput
-prediction, we exclude it from our scope.
-
-All these tools statically predict the number of cycles taken by a piece of
-assembly or binary that is assumed to be the body of an infinite ---~or
-sufficiently large~--- loop in steady state, all its data being L1-resident. As
-discussed by \uica's authors~\cite{uica}, hypotheses can subtly vary between
-analyzers; \eg{} by assuming that the loop is either unrolled or has control
-instructions, with non-negligible impact. Some of the tools, \eg{} \ithemal,
-necessarily work on a single basic block, while some others, \eg{} \iaca, work
-on a section of code delimited by markers. However, even in the second case,
-the code is assumed to be \emph{straight-line code}: branch instructions, if
-any, are assumed not taken.
-
-\smallskip
-
 Throughput prediction tools, however, are not all static.
 \gus~\cite{phd:gruber} dynamically predicts the throughput of a whole program
 region, instrumenting it to retrieve the exact events occurring through its
@@ -2,8 +2,8 @@
 
 Running the harness described above provides us with 3500
 benchmarks ---~after filtering out non-L1-resident
-benchmarks~---, on which each throughput predictor is run. We make the full
-output of our tool available in our artifact. Before analyzing these results in
+benchmarks~---, on which each throughput predictor is run.
+Before analyzing these results in
 Section~\ref{sec:results_analysis}, we evaluate the relevance of the
 methodology presented in Section~\ref{sec:bench_harness} to make the tools'
 predictions comparable to baseline hardware counter measurements.
@@ -15,14 +15,13 @@ C6420 machine from the Grid5000 cluster~\cite{grid5000}, equipped with 192\,GB
 of DDR4 SDRAM ---~only a small fraction of which was used~--- and two Intel
 Xeon Gold 6130 CPUs (x86-64, Skylake microarchitecture) with 16 cores each.
 
-The experiments themselves were run inside a Docker environment very close to
-our artifact, based on Debian Bullseye. Care was taken to disable
-hyperthreading to improve measurements stability. For tools whose output is
-based on a direct measurement (\perf, \bhive), the benchmarks were run
-sequentially on a single core with no experiments on the other cores. No such
-care was taken for \gus{} as, although based on a dynamic run, its prediction
-is purely function of recorded program events and not of program measures. All
-other tools were run in parallel.
+The experiments themselves were run inside a Docker environment based on Debian
+Bullseye. Care was taken to disable hyperthreading to improve measurement
+stability. For tools whose output is based on a direct measurement (\perf,
+\bhive), the benchmarks were run sequentially on a single core with no
+experiments on the other cores. No such care was taken for \gus{} as, although
+based on a dynamic run, its prediction is purely a function of recorded program
+events and not of program measurements. All other tools were run in parallel.
 
 We use \llvmmca{} \texttt{v13.0.1}, \iaca{} \texttt{v3.0-28-g1ba2cbb}, \bhive{}
 at commit \texttt{5f1d500}, \uica{} at commit \texttt{9cbbe93}, \gus{} at
@@ -87,8 +86,9 @@ consequently, lifted predictions can reasonably be compared to one another.
 \end{table}
 
-\begin{table}[!htbp]
+
+\begin{table}
 \centering
 \footnotesize
 \begin{tabular}{l | r r r | r r r | r r r}
 \toprule
 & \multicolumn{3}{c|}{\textbf{Frontend}}
@@ -196,7 +196,7 @@ the transformations described in Section~\ref{sec:bench_gen}.
 Generating and running the full suite of benchmarks required about 30\,h of
 continuous computation on a single machine. During the experiments, the power
 supply units reported a near-constant consumption of about 350\,W. The carbon
-intensity of the power grid for the region where the experiment was run, at the
+intensity of the power grid in France, where the experiment was run, at the
 time of the experiments, was about 29\,g\coeq/kWh~\cite{electricitymaps}.
 
 The electricity consumed directly by the server thus amounts to about
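The figures in this hunk (about 30 h at a near-constant 350 W, with a grid carbon intensity of 29 g CO2eq/kWh) admit a quick back-of-the-envelope check. This is plain arithmetic on the quoted values, not output of the artifact:

```python
# Sanity check of the energy figures quoted above (all inputs from the text).
hours = 30              # total generation + benchmarking time
power_w = 350           # near-constant draw reported by the PSUs
carbon_g_per_kwh = 29   # grid carbon intensity (France, at the time)

energy_kwh = power_w * hours / 1000       # 350 W * 30 h = 10.5 kWh
co2_g = energy_kwh * carbon_g_per_kwh     # 10.5 kWh * 29 g/kWh = 304.5 g CO2eq

print(energy_kwh, co2_g)
```

So the direct server consumption is on the order of 10.5 kWh, i.e. roughly 300 g CO2eq, consistent with the truncated sentence that follows.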
@@ -35,12 +35,8 @@ Gus & 3500 & 0 & (0.00) & 20.37 & 15.01 & 7.82 & 30.59 & 0.82 & 188.04 \\
 The distribution of the relative errors, for each tool, is presented as a
 box plot in Figure~\ref{fig:overall_analysis_boxplot}. Statistical indicators
 are also given in Table~\ref{table:overall_analysis_stats}. We also give, for
-each tool, its Kendall's tau indicator~\cite{kendalltau}: this indicator,
-used to evaluate \eg{} uiCA~\cite{uica} and Palmed~\cite{palmed}, measures how
-well the pair-wise ordering of benchmarks is preserved, $-1$ being a full
-anti-correlation and $1$ a full correlation. This is especially useful when one
-is not interested in a program's absolute throughput, but rather in comparing
-which program has a better throughput.
+each tool, its Kendall's tau indicator~\cite{kendalltau} that we used earlier
+in \autoref{chap:palmed} and \autoref{chap:frontend}.
 
 \begin{figure}
 \centering
@@ -53,8 +49,8 @@ These results are, overall, significantly worse than what each tool's article
 presents. We attribute this difference mostly to the specificities of
 Polybench: being composed of computation kernels, it intrinsically stresses the
 CPU more than basic blocks extracted out of the Spec benchmark suite. This
-difference is clearly reflected in the experimental section of the Palmed
-article~\cite{palmed}: the accuracy of most tools is worse on Polybench than on
+difference is clearly reflected in the experimental section of Palmed in
+\autoref{chap:palmed}: the accuracy of most tools is worse on Polybench than on
 Spec, often by more than a factor of two.
 
 As \bhive{} and \ithemal{} do not support control flow instructions
@@ -79,7 +75,7 @@ paths, can explain another portion. We also find that \bhive{} fails to produce
 a result in about 40\,\% of the kernels explored ---~which means that, for those
 cases, \bhive{} failed to produce a result on at least one of the constituent
 basic blocks. In fact, this is due to the difficulties we mentioned in
-Section \ref{sec:intro} related to the need to reconstruct the context of each
+\qtodo{[ref intro]} related to the need to reconstruct the context of each
 basic block \textit{ex nihilo}.
 
 The basis of \bhive's method is to run the code to be measured, unrolled a
@@ -240,13 +236,12 @@ Gus & 2388 & 0 & (0.00\,\%) & 23.18\,\% & 20.23\,\% & 8.78\,\% & 32.73\,\% & 0.8
 through memory-carried dependencies rows}\label{table:nomemdeps_stats}
 \end{table}
 
-An overview of the full results table (available in our artifact) hints towards
-two main tendencies: on a significant number of rows, the static tools
----~thus leaving \gus{} and \bhive{} apart~---, excepted \ithemal, often yield
-comparatively bad throughput predictions \emph{together}; and many of these
-rows are those using the \texttt{O1} and \texttt{O1autovect} compilation
-setting (\texttt{gcc} with \texttt{-O1}, plus vectorisation options for the
-latter).
+An overview of the full results table hints at two main tendencies: on a
+significant number of rows, the static tools ---~thus leaving \gus{} and
+\bhive{} apart~---, except \ithemal, often yield comparatively bad throughput
+predictions \emph{together}; and many of these rows are those using the
+\texttt{O1} and \texttt{O1autovect} compilation settings (\texttt{gcc} with
+\texttt{-O1}, plus vectorisation options for the latter).
 
 To confirm the first observation, we look at the 30\,\% worst benchmarks ---~in
 terms of MAPE relative to \perf~--- for \llvmmca, \uica{} and \iaca{}
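The selection described above — the worst 30 % of benchmarks by error relative to the perf baseline — can be sketched as follows. Benchmark names and cycle counts are hypothetical, and the per-benchmark criterion shown is the plain relative error against the baseline:

```python
# Illustrative sketch (hypothetical data): rank benchmarks by relative error
# against a baseline measurement and keep the worst 30 % for closer inspection.
baseline  = {"b1": 4.0, "b2": 6.0, "b3": 8.0, "b4": 3.0, "b5": 10.0,
             "b6": 5.0, "b7": 7.0, "b8": 2.0, "b9": 9.0, "b10": 6.5}
predicted = {"b1": 4.2, "b2": 5.0, "b3": 8.1, "b4": 4.5, "b5": 9.0,
             "b6": 5.1, "b7": 7.2, "b8": 2.1, "b9": 12.0, "b10": 6.4}

def rel_err(name):
    """Relative error of the prediction against the baseline measurement."""
    return abs(predicted[name] - baseline[name]) / baseline[name]

worst = sorted(baseline, key=rel_err, reverse=True)
worst_30pct = worst[: int(0.3 * len(worst))]  # 3 benchmarks out of 10
print(worst_30pct)
```

Intersecting such worst-case sets across several tools is what supports the "bad together" observation above.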
@@ -1,6 +1,6 @@
-\section{Conclusion and future works}
+\section*{Conclusion and future works}
 
-In this article, we have presented a fully-tooled approach that enables:
+In this chapter, we have presented a fully-tooled approach that enables:
 
 \begin{itemize}
 \item the generation of a wide variety of microbenchmarks, reflecting both the
@@ -48,6 +48,8 @@
 
 @misc{tool:pocc,
   title={{PoCC}, the Polyhedral Compiler Collection},
   author={Pouchet, Louis-No{\"e}l},
+  year=2009,
+  note={\url{https://www.cs.colostate.edu/~pouchet/software/pocc/}},
 }
 