First pass on CesASMe -- intro is still a mess

Théophile Bastian 2023-09-25 18:45:35 +02:00
parent f6f0336b34
commit 853f792023
7 changed files with 45 additions and 87 deletions


@ -1 +1 @@
\chapter{Foundations}\label{chap:foundations}


@ -1,21 +1,18 @@
\section{Introduction}\label{sec:intro}
In the previous chapters, we focused on two of the main bottleneck factors for
computation kernels: \autoref{chap:palmed} investigated the backend aspects of
throughput prediction, while \autoref{chap:frontend} delved into its frontend
aspects.
At a time when software is expected to perform more computations, faster and in
more constrained environments, tools that statically predict the resources (and
in particular the CPU resources) they consume are very useful to guide their
optimization. This need is reflected in the diversity of binary or assembly
code analyzers that followed the deprecation of \iaca~\cite{iaca}, which Intel
maintained through 2019. Whether it is \llvmmca{}~\cite{llvm-mca},
\uica{}~\cite{uica}, \ithemal~\cite{ithemal} or \gus~\cite{phd:gruber}, all
these tools strive to extract various performance metrics, including the number
of CPU cycles a computation kernel will take ---~which roughly translates to
execution time. In addition to raw measurements (relying on hardware
counters), these model-based analyses provide higher-level and refined data, to
expose the bottlenecks and guide the optimization of a given code. This
feedback is useful to experts optimizing computation kernels, including
scientific simulations and deep-learning kernels.
Throughout those two chapters, we entirely left out another crucial
factor: dependencies, and the latency they induce between instructions. We
could afford to do so because our baseline for native execution consisted of
\pipedream{} measurements, which are \emph{designed} to suppress any dependency.
However, state-of-the-art tools strive to provide an estimate of the execution
time $\cyc{\kerK}$ of a given kernel $\kerK$ that is \emph{as precise as
possible}, and as such, they cannot neglect this third major bottleneck. An
exact throughput prediction would require a cycle-accurate simulator of the
processor, based on microarchitectural data that is most often not publicly
available, and would be prohibitively slow in any case. These tools thus each
solve in their own way the challenge of modeling complex CPUs while remaining
@ -40,9 +37,9 @@ predicts 3 cycles. One may wonder which tool is correct.
The obvious solution to assess their predictions is to compare them to an
actual measure. However, accounting for dependencies at the scale of a basic
block makes this \emph{actual measure} not as trivially defined as it would
seem. Take for instance the following kernel:
\begin{minipage}{0.90\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
@ -98,7 +95,7 @@ also be complicated by code versioning.
\bigskip
In this chapter, we present a fully-tooled solution to evaluate and compare the
diversity of static throughput predictors. Our tool, \cesasme, solves two main
issues in this direction. In Section~\ref{sec:bench_gen}, we describe how
\cesasme{} generates a wide variety of computation kernels stressing different


@ -1,40 +1,4 @@
\section{Related works}
The static throughput analyzers studied rely on a variety of models.
\iaca~\cite{iaca}, developed by Intel and now deprecated, is closed-source and
relies on Intel's expertise on their own processors.
To guide its optimization passes, the LLVM compiler infrastructure maintains
models of many architectures. These models are used in the LLVM Machine Code Analyzer,
\llvmmca~\cite{llvm-mca}, to statically evaluate the performance of a segment
of assembly.
Independently, Abel and Reineke used an automated microbenchmark generation
approach to generate port mappings of many architectures in
\texttt{uops.info}~\cite{nanobench, uopsinfo} to model processors' backends.
This work was continued with \uica~\cite{uica}, extending this model with an
extensive frontend description.
Following a completely different approach, \ithemal~\cite{ithemal} uses a deep
neural network to predict the throughput of basic blocks. To obtain enough data to
train its model, the authors also developed \bhive~\cite{bhive}, a profiling
tool working on basic blocks.
Another static tool, \osaca~\cite{osaca2}, provides lower- and
upper-bounds to the execution time of a basic block. As this kind of
information cannot be fairly compared with tools yielding an exact throughput
prediction, we exclude it from our scope.
All these tools statically predict the number of cycles taken by a piece of
assembly or binary that is assumed to be the body of an infinite ---~or
sufficiently large~--- loop in steady state, all its data being L1-resident. As
discussed by \uica's authors~\cite{uica}, hypotheses can subtly vary between
analyzers, \eg{} on whether the loop is assumed to be unrolled or to keep its
control instructions, which has a non-negligible impact. Some of the tools, \eg{} \ithemal,
necessarily work on a single basic block, while some others, \eg{} \iaca, work
on a section of code delimited by markers. However, even in the second case,
the code is assumed to be \emph{straight-line code}: branch instructions, if
any, are assumed not taken.
\smallskip
Throughput prediction tools, however, are not all static.
\gus~\cite{phd:gruber} dynamically predicts the throughput of a whole program
region, instrumenting it to retrieve the exact events occurring through its


@ -2,8 +2,8 @@
Running the harness described above provides us with 3500
benchmarks ---~after filtering out non-L1-resident
benchmarks~---, on which each throughput predictor is run.
Before analyzing these results in
Section~\ref{sec:results_analysis}, we evaluate the relevance of the
methodology presented in Section~\ref{sec:bench_harness} to make the tools'
predictions comparable to baseline hardware counter measures.
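Throughout the analysis, predictions are compared to the baseline through
relative errors; assuming the usual definitions (restated here for readability,
with $C_t(\kerK)$ denoting the cycle count reported by tool $t$ on kernel
$\kerK$ and $C_0(\kerK)$ the \perf{} baseline measurement, notations introduced
only for this formula):
\[
  \mathrm{err}_t(\kerK) = \frac{\left|C_t(\kerK) - C_0(\kerK)\right|}{C_0(\kerK)},
  \qquad
  \mathrm{MAPE}_t = \frac{100}{N} \sum_{\kerK} \mathrm{err}_t(\kerK),
\]
where $N$ is the number of benchmarks considered.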
@ -15,14 +15,13 @@ C6420 machine from the Grid5000 cluster~\cite{grid5000}, equipped with 192\,GB
of DDR4 SDRAM ---~only a small fraction of which was used~--- and two Intel
Xeon Gold 6130 CPUs (x86-64, Skylake microarchitecture) with 16 cores each.
The experiments themselves were run inside a Docker environment based on Debian
Bullseye. Care was taken to disable hyperthreading to improve measurement
stability. For tools whose output is based on a direct measurement (\perf,
\bhive), the benchmarks were run sequentially on a single core, with no
experiments running on the other cores. No such care was taken for \gus{}:
although it is based on a dynamic run, its prediction is purely a function of
recorded program events and not of program measurements. All other tools were
run in parallel.
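For concreteness, a minimal sketch of this kind of sequential, core-pinned run
is given below; it is only illustrative, and the directory layout and file
names it assumes are hypothetical rather than part of \cesasme{}.
\begin{lstlisting}[language=Python]
# Illustrative sketch: run each benchmark pinned to core 0, one at a time,
# so that no other experiment disturbs the measured core.
# The "benchmarks/*.bin" layout is an assumption made for this example.
import subprocess
from pathlib import Path

BENCH_DIR = Path("benchmarks")  # hypothetical directory of compiled kernels

def run_pinned(binary: Path) -> str:
    # `taskset -c 0` pins the process to CPU core 0; `perf stat -x ,` collects
    # hardware counters in CSV form (perf writes its report to stderr).
    result = subprocess.run(
        ["taskset", "-c", "0", "perf", "stat", "-x", ",", str(binary)],
        capture_output=True, text=True, check=True,
    )
    return result.stderr

if __name__ == "__main__":
    for bench in sorted(BENCH_DIR.glob("*.bin")):
        report = run_pinned(bench)
        print(bench.name, report.splitlines()[:3])
\end{lstlisting}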
We use \llvmmca{} \texttt{v13.0.1}, \iaca{} \texttt{v3.0-28-g1ba2cbb}, \bhive{}
at commit \texttt{5f1d500}, \uica{} at commit \texttt{9cbbe93}, \gus{} at
@ -87,8 +86,9 @@ consequently, lifted predictions can reasonably be compared to one another.
\end{table}
\begin{table}
\centering
\footnotesize
\begin{tabular}{l | r r r | r r r | r r r}
\toprule
& \multicolumn{3}{c|}{\textbf{Frontend}}
@ -196,7 +196,7 @@ the transformations described in Section~\ref{sec:bench_gen}.
Generating and running the full suite of benchmarks required about 30h of
continuous computation on a single machine. During the experiments, the power
supply units reported a near-constant consumption of about 350W. The carbon
intensity of the power grid in France, where the experiment was run, at the
time of the experiments, was about 29\,g\coeq/kWh~\cite{electricitymaps}.
The electricity consumed directly by the server thus amounts to about
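For reference, a quick back-of-the-envelope check from the figures above:
$0.35\,\mathrm{kW} \times 30\,\mathrm{h} = 10.5\,\mathrm{kWh}$ of electricity,
hence about $10.5 \times 29 \approx 305$\,g\coeq{}, or roughly 0.3\,kg\coeq{}.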


@ -35,12 +35,8 @@ Gus & 3500 & 0 & (0.00) & 20.37 & 15.01 & 7.82 & 30.59 & 0.82 & 188.04 \\
The error distribution of the relative errors, for each tool, is presented as a
box plot in Figure~\ref{fig:overall_analysis_boxplot}. Statistical indicators
are also given in Table~\ref{table:overall_analysis_stats}. We also give, for
each tool, its Kendall's tau indicator~\cite{kendalltau}, which we already used
in \autoref{chap:palmed} and \autoref{chap:frontend}: this indicator, also used
to evaluate \eg{} \uica~\cite{uica}, measures how well the pair-wise ordering
of benchmarks is preserved, $-1$ being a full anti-correlation and $1$ a full
correlation. This is especially useful when one is not interested in a
program's absolute throughput, but rather in comparing which of two programs
has the better throughput.
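As a reminder, in its simplest form (ignoring ties), Kendall's tau over $n$
benchmarks can be written as
\[
  \tau = \frac{n_c - n_d}{\binom{n}{2}},
\]
where $n_c$ (resp.\ $n_d$) is the number of pairs of benchmarks ranked in the
same (resp.\ opposite) order by the tool and by the baseline.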
\begin{figure}
\centering
@ -53,8 +49,8 @@ These results are, overall, significantly worse than what each tool's article
presents. We attribute this difference mostly to the specificities of
Polybench: being composed of computation kernels, it intrinsically stresses the
CPU more than basic blocks extracted from the Spec benchmark suite. This
difference is clearly reflected in the experimental evaluation of Palmed in
\autoref{chap:palmed}: the accuracy of most tools is worse on Polybench than on
Spec, often by more than a factor of two.
As \bhive{} and \ithemal{} do not support control flow instructions
@ -79,7 +75,7 @@ paths, can explain another portion. We also find that \bhive{} fails to produce
a result in about 40\,\% of the kernels explored ---~which means that, for those
cases, \bhive{} failed to produce a result on at least one of the constituent
basic blocks. In fact, this is due to the difficulties we mentioned in
Section~\ref{sec:intro}, related to the need to reconstruct the context of each
basic block \textit{ex nihilo}.
The basis of \bhive's method is to run the code to be measured, unrolled a
@ -240,13 +236,12 @@ Gus & 2388 & 0 & (0.00\,\%) & 23.18\,\% & 20.23\,\% & 8.78\,\% & 32.73\,\% & 0.8
through memory-carried dependencies rows}\label{table:nomemdeps_stats}
\end{table}
An overview of the full results table hints at two main tendencies: on a
significant number of rows, the static tools ---~that is, all tools but \gus{}
and \bhive~---, with the exception of \ithemal, often yield comparatively bad
throughput predictions \emph{together}; and many of these rows are those using
the \texttt{O1} and \texttt{O1autovect} compilation settings (\texttt{gcc} with
\texttt{-O1}, plus vectorisation options for the latter).
To confirm the first observation, we look at the 30\,\% worst benchmarks ---~in
terms of MAPE relative to \perf~--- for \llvmmca, \uica{} and \iaca{}
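A minimal sketch of this kind of cross-tool comparison is given below; it is
purely illustrative, and the aggregated \texttt{results.csv} layout and column
names it assumes are hypothetical rather than \cesasme's actual output format.
\begin{lstlisting}[language=Python]
# Illustrative sketch: check whether the static tools mispredict the same
# benchmarks, by intersecting their respective worst-30% sets.
# The results.csv layout and column names are assumptions for this example.
import csv

TOOLS = ["llvm-mca", "uica", "iaca"]

def worst_30_percent(rows, tool):
    # Rank benchmarks by relative error of `tool` against the perf baseline,
    # then keep the worst 30% of them.
    errors = []
    for row in rows:
        perf, pred = float(row["perf"]), float(row[tool])
        errors.append((abs(pred - perf) / perf, row["benchmark"]))
    errors.sort(reverse=True)
    return {name for _, name in errors[: int(0.3 * len(errors))]}

with open("results.csv") as f:
    rows = list(csv.DictReader(f))

worst = {tool: worst_30_percent(rows, tool) for tool in TOOLS}
for a in TOOLS:
    for b in TOOLS:
        if a < b:
            shared = len(worst[a] & worst[b]) / len(worst[a])
            print(f"{a} / {b}: {shared:.0%} of worst benchmarks shared")
\end{lstlisting}
Overlaps close to the 30\,\% expected by chance would indicate independent
failures; much larger overlaps support the first tendency noted above.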


@ -1,6 +1,6 @@
\section*{Conclusion and future work}
In this chapter, we have presented a fully-tooled approach that enables:
\begin{itemize}
\item the generation of a wide variety of microbenchmarks, reflecting both the


@ -48,6 +48,8 @@
@misc{tool:pocc,
title={{PoCC}, the Polyhedral Compiler Collection},
author={Pouchet, Louis-No{\"e}l},
year=2009,
note={\url{https://www.cs.colostate.edu/~pouchet/software/pocc/}},
}