First pass on CesASMe -- intro is still a mess

Théophile Bastian 2023-09-25 18:45:35 +02:00
parent f6f0336b34
commit 853f792023
7 changed files with 45 additions and 87 deletions

View file

@@ -1 +1 @@
-\chapter{Foundations}
+\chapter{Foundations}\label{chap:foundations}

View file

@@ -1,21 +1,18 @@
-\section{Introduction}\label{sec:intro}
-
-At a time when software is expected to perform more computations, faster and in
-more constrained environments, tools that statically predict the resources (and
-in particular the CPU resources) they consume are very useful to guide their
-optimization. This need is reflected in the diversity of binary or assembly
-code analyzers following the deprecation of \iaca~\cite{iaca}, which Intel has
-maintained through 2019. Whether it is \llvmmca{}~\cite{llvm-mca},
-\uica{}~\cite{uica}, \ithemal~\cite{ithemal} or \gus~\cite{phd:gruber}, all
-these tools strive to extract various performance metrics, including the number
-of CPU cycles a computation kernel will take ---~which roughly translates to
-execution time. In addition to raw measurements (relying on hardware
-counters), these model-based analyses provide higher-level and refined data, to
-expose the bottlenecks and guide the optimization of a given code. This
-feedback is useful to experts optimizing computation kernels, including
-scientific simulations and deep-learning kernels.
-
-An exact throughput prediction would require a cycle-accurate simulator of the
+In the previous chapters, we focused on two of the main bottleneck factors for
+computation kernels: \autoref{chap:palmed} investigated the backend aspect of
+throughput prediction, while \autoref{chap:frontend} dived into the frontend
+aspects.
+
+Throughout those two chapters, we entirely left out another crucial
+factor: dependencies, and the latency they induce between instructions. We
+managed to do so, because our baseline of native execution was \pipedream{}
+measures, \emph{designed} to suppress any dependency.
+
+However, state-of-the-art tools strive to provide an estimation of the
+execution time $\cyc{\kerK}$ of a given kernel $\kerK$ that is \emph{as precise
+as possible}, and as such, cannot neglect this third major bottleneck.
+An exact
+throughput prediction would require a cycle-accurate simulator of the
 processor, based on microarchitectural data that is most often not publicly
 available, and would be prohibitively slow in any case. These tools thus each
 solve in their own way the challenge of modeling complex CPUs while remaining
@@ -40,9 +37,9 @@ predicts 3 cycles. One may wonder which tool is correct.
 The obvious solution to assess their predictions is to compare them to an
-actual measure. However, as these tools reason at the basic block level, this
-is not as trivially defined as it would seem. Take for instance the following
-kernel:
+actual measure. However, accounting for dependencies at the scale of a basic
+block makes this \emph{actual measure} not as trivially defined as it would
+seem. Take for instance the following kernel:
 \begin{minipage}{0.90\linewidth}
 \begin{lstlisting}[language={[x86masm]Assembler}]
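The kernel listing itself falls outside this hunk's context. As a purely illustrative stand-in ---~a hypothetical example, not the file's actual listing~--- the body of a sufficiently large loop exhibiting the kind of dependency chain under discussion might look like:

\begin{lstlisting}[language={[x86masm]Assembler}]
mov (%rax, %rcx, 1), %r10   # load: feeds the store below
mov %r10, (%rbx, %rcx, 1)   # store: must wait for the load
add $8, %rcx                # index increment, carried across iterations
\end{lstlisting}

For such a kernel, whether its \emph{actual measure} accounts for the load-to-store dependency, and under which initial register and memory state, depends entirely on how the surrounding execution context is reconstructed ---~precisely the ambiguity the new text raises.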
@@ -98,7 +95,7 @@ also be complicated by code versioning.
 \bigskip
-In this article, we present a fully-tooled solution to evaluate and compare the
+In this chapter, we present a fully-tooled solution to evaluate and compare the
 diversity of static throughput predictors. Our tool, \cesasme, solves two main
 issues in this direction. In Section~\ref{sec:bench_gen}, we describe how
 \cesasme{} generates a wide variety of computation kernels stressing different

View file

@@ -1,40 +1,4 @@
 \section{Related works}
-The static throughput analyzers studied rely on a variety of models.
-\iaca~\cite{iaca}, developed by Intel and now deprecated, is closed-source and
-relies on Intel's expertise on their own processors.
-
-The LLVM compiling ecosystem, to guide optimization passes, maintains models of many
-architectures. These models are used in the LLVM Machine Code Analyzer,
-\llvmmca~\cite{llvm-mca}, to statically evaluate the performance of a segment
-of assembly.
-
-Independently, Abel and Reineke used an automated microbenchmark generation
-approach to generate port mappings of many architectures in
-\texttt{uops.info}~\cite{nanobench, uopsinfo} to model processors' backends.
-This work was continued with \uica~\cite{uica}, extending this model with an
-extensive frontend description.
-
-Following a completely different approach, \ithemal~\cite{ithemal} uses a deep
-neural network to predict basic blocks throughput. To obtain enough data to
-train its model, the authors also developed \bhive~\cite{bhive}, a profiling
-tool working on basic blocks.
-
-Another static tool, \osaca~\cite{osaca2}, provides lower- and
-upper-bounds to the execution time of a basic block. As this kind of
-information cannot be fairly compared with tools yielding an exact throughput
-prediction, we exclude it from our scope.
-
-All these tools statically predict the number of cycles taken by a piece of
-assembly or binary that is assumed to be the body of an infinite ---~or
-sufficiently large~--- loop in steady state, all its data being L1-resident. As
-discussed by \uica's authors~\cite{uica}, hypotheses can subtly vary between
-analyzers; \eg{} by assuming that the loop is either unrolled or has control
-instructions, with non-negligible impact. Some of the tools, \eg{} \ithemal,
-necessarily work on a single basic block, while some others, \eg{} \iaca, work
-on a section of code delimited by markers. However, even in the second case,
-the code is assumed to be \emph{straight-line code}: branch instructions, if
-any, are assumed not taken.
-\smallskip
 Throughput prediction tools, however, are not all static.
 \gus~\cite{phd:gruber} dynamically predicts the throughput of a whole program
 region, instrumenting it to retrieve the exact events occurring through its

View file

@@ -2,8 +2,8 @@
 Running the harness described above provides us with 3500
 benchmarks ---~after filtering out non-L1-resident
-benchmarks~---, on which each throughput predictor is run. We make the full
-output of our tool available in our artifact. Before analyzing these results in
+benchmarks~---, on which each throughput predictor is run.
+Before analyzing these results in
 Section~\ref{sec:results_analysis}, we evaluate the relevance of the
 methodology presented in Section~\ref{sec:bench_harness} to make the tools'
 predictions comparable to baseline hardware counter measures.
@@ -15,14 +15,13 @@ C6420 machine from the Grid5000 cluster~\cite{grid5000}, equipped with 192\,GB
 of DDR4 SDRAM ---~only a small fraction of which was used~--- and two Intel
 Xeon Gold 6130 CPUs (x86-64, Skylake microarchitecture) with 16 cores each.

-The experiments themselves were run inside a Docker environment very close to
-our artifact, based on Debian Bullseye. Care was taken to disable
-hyperthreading to improve measurements stability. For tools whose output is
-based on a direct measurement (\perf, \bhive), the benchmarks were run
-sequentially on a single core with no experiments on the other cores. No such
-care was taken for \gus{} as, although based on a dynamic run, its prediction
-is purely function of recorded program events and not of program measures. All
-other tools were run in parallel.
+The experiments themselves were run inside a Docker environment based on Debian
+Bullseye. Care was taken to disable hyperthreading to improve measurement
+stability. For tools whose output is based on a direct measurement (\perf,
+\bhive), the benchmarks were run sequentially on a single core with no
+experiments on the other cores. No such care was taken for \gus{} as, although
+based on a dynamic run, its prediction is purely a function of recorded program
+events and not of program measures. All other tools were run in parallel.

 We use \llvmmca{} \texttt{v13.0.1}, \iaca{} \texttt{v3.0-28-g1ba2cbb}, \bhive{}
 at commit \texttt{5f1d500}, \uica{} at commit \texttt{9cbbe93}, \gus{} at
@@ -87,8 +86,9 @@ consequently, lifted predictions can reasonably be compared to one another.
 \end{table}

-\begin{table}[!htbp]
+\begin{table}
 \centering
+\footnotesize
 \begin{tabular}{l | r r r | r r r | r r r}
 \toprule
 & \multicolumn{3}{c|}{\textbf{Frontend}}
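The hunk's leading context mentions \emph{lifted} predictions, whose definition lies outside this diff. A plausible reading ---~our sketch, under the assumption that $n_b$ counts the executions of basic block $b$~--- lifts per-block predictions to a whole kernel $\kerK$ by weighting each block by its execution count:

\[
    \cyc{\kerK} \approx \sum_{b \in \kerK} n_b \cdot \cyc{b}
\]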
@@ -196,7 +196,7 @@ the transformations described in Section~\ref{sec:bench_gen}.

 Generating and running the full suite of benchmarks required about 30h of
 continuous computation on a single machine. During the experiments, the power
 supply units reported a near-constant consumption of about 350W. The carbon
-intensity of the power grid for the region where the experiment was run, at the
+intensity of the power grid in France, where the experiment was run, at the
 time of the experiments, was of about 29g\coeq/kWh~\cite{electricitymaps}.
 The electricity consumed directly by the server thus amounts to about
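The resulting figure lies past the hunk's last line, but it follows directly from the numbers quoted:

\[
    30\,\text{h} \times 0.35\,\text{kW} = 10.5\,\text{kWh},
    \qquad
    10.5\,\text{kWh} \times 29\,\text{g\coeq{}/kWh} \approx 305\,\text{g\coeq{}}
\]

that is, on the order of 0.3\,kg of \coeq{} for the server's direct electricity consumption.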

View file

@@ -35,12 +35,8 @@ Gus & 3500 & 0 & (0.00) & 20.37 & 15.01 & 7.82 & 30.59 & 0.82 & 188.04 \\
 The error distribution of the relative errors, for each tool, is presented as a
 box plot in Figure~\ref{fig:overall_analysis_boxplot}. Statistical indicators
 are also given in Table~\ref{table:overall_analysis_stats}. We also give, for
-each tool, its Kendall's tau indicator~\cite{kendalltau}: this indicator,
-used to evaluate \eg{} uiCA~\cite{uica} and Palmed~\cite{palmed}, measures how
-well the pair-wise ordering of benchmarks is preserved, $-1$ being a full
-anti-correlation and $1$ a full correlation. This is especially useful when one
-is not interested in a program's absolute throughput, but rather in comparing
-which program has a better throughput.
+each tool, its Kendall's tau indicator~\cite{kendalltau} that we used earlier
+in \autoref{chap:palmed} and \autoref{chap:frontend}.

 \begin{figure}
 \centering
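The deleted lines carried the only in-chapter description of Kendall's tau, which the new text now defers to earlier chapters. For reference, the standard coefficient over $n$ benchmarks (textbook definition, not taken from the file) is

\[
    \tau = \frac{n_{\text{concordant}} - n_{\text{discordant}}}{\binom{n}{2}},
\]

where a pair of benchmarks is \emph{concordant} when the tool and the baseline order them identically; $\tau = 1$ denotes full correlation and $\tau = -1$ full anti-correlation.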
@@ -53,8 +49,8 @@ These results are, overall, significantly worse than what each tool's article
 presents. We attribute this difference mostly to the specificities of
 Polybench: being composed of computation kernels, it intrinsically stresses the
 CPU more than basic blocks extracted out of the Spec benchmark suite. This
-difference is clearly reflected in the experimental section of the Palmed
-article~\cite{palmed}: the accuracy of most tools is worse on Polybench than on
+difference is clearly reflected in the experimental section of Palmed in
+\autoref{chap:palmed}: the accuracy of most tools is worse on Polybench than on
 Spec, often by more than a factor of two.

 As \bhive{} and \ithemal{} do not support control flow instructions
@@ -79,7 +75,7 @@ paths, can explain another portion. We also find that \bhive{} fails to produce
 a result in about 40\,\% of the kernels explored ---~which means that, for those
 cases, \bhive{} failed to produce a result on at least one of the constituent
 basic blocks. In fact, this is due to the difficulties we mentioned in
-Section \ref{sec:intro} related to the need to reconstruct the context of each
+\qtodo{[ref intro]} related to the need to reconstruct the context of each
 basic block \textit{ex nihilo}.

 The basis of \bhive's method is to run the code to be measured, unrolled a
@@ -240,13 +236,12 @@ Gus & 2388 & 0 & (0.00\,\%) & 23.18\,\% & 20.23\,\% & 8.78\,\% & 32.73\,\% & 0.8
 through memory-carried dependencies rows}\label{table:nomemdeps_stats}
 \end{table}

-An overview of the full results table (available in our artifact) hints towards
-two main tendencies: on a significant number of rows, the static tools
----~thus leaving \gus{} and \bhive{} apart~---, excepted \ithemal, often yield
-comparatively bad throughput predictions \emph{together}; and many of these
-rows are those using the \texttt{O1} and \texttt{O1autovect} compilation
-setting (\texttt{gcc} with \texttt{-O1}, plus vectorisation options for the
-latter).
+An overview of the full results table hints towards two main tendencies: on a
+significant number of rows, the static tools ---~thus leaving \gus{} and
+\bhive{} apart~---, except \ithemal, often yield comparatively bad throughput
+predictions \emph{together}; and many of these rows are those using the
+\texttt{O1} and \texttt{O1autovect} compilation setting (\texttt{gcc} with
+\texttt{-O1}, plus vectorisation options for the latter).

 To confirm the first observation, we look at the 30\,\% worst benchmarks ---~in
 terms of MAPE relative to \perf~--- for \llvmmca, \uica{} and \iaca{}
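The selection criterion here is the MAPE relative to \perf; its usual definition (standard formula, with our notation: $C_i$ the cycle count measured by \perf{} for benchmark $i$, and $\hat{C}_i$ a tool's prediction) is

\[
    \text{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{\hat{C}_i - C_i}{C_i} \right| \times 100\,\%
\]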

View file

@@ -1,6 +1,6 @@
-\section{Conclusion and future works}
+\section*{Conclusion and future works}

-In this article, we have presented a fully-tooled approach that enables:
+In this chapter, we have presented a fully-tooled approach that enables:
 \begin{itemize}
 \item the generation of a wide variety of microbenchmarks, reflecting both the

View file

@@ -48,6 +48,8 @@
 @misc{tool:pocc,
   title={{PoCC}, the Polyhedral Compiler Collection},
+  author={Pouchet, Louis-No{\"e}l},
+  year=2009,
   note={\url{https://www.cs.colostate.edu/~pouchet/software/pocc/}},
 }