CesASMe: first adaptations

This commit is contained in:
Théophile Bastian 2023-09-25 17:41:37 +02:00
parent fc9182428d
commit f6f0336b34
9 changed files with 79 additions and 70 deletions

\begin{abstract}
A variety of code analyzers, such as \iaca, \uica, \llvmmca{} or
\ithemal{}, strive to statically predict the throughput of a computation
kernel. Each analyzer is based on its own simplified CPU model
reasoning at the scale of an isolated basic block.
Facing this diversity, evaluating their strengths and
weaknesses is important to guide both their usage and their enhancement.
We argue that reasoning at the scale of a single basic block is not
always sufficient and that a lack of context can mislead analyses. We present
\cesasme{}, a fully-tooled solution to evaluate code analyzers on C-level
benchmarks. It is composed of a benchmark derivation procedure that feeds an
evaluation harness. We use it to evaluate state-of-the-art code analyzers and
to provide insights on their precision. We use \cesasme's results to show that
memory-carried data dependencies are a major source of imprecision for these
tools.
\end{abstract}
\section{Introduction}\label{sec:intro}
At a time when software is expected to perform more computations, faster and
in more constrained environments, tools to predict the resources (and
in particular the CPU resources) that programs consume are very useful to guide their
optimization. This need is reflected in the diversity of binary or assembly
code analyzers following the deprecation of \iaca~\cite{iaca}, which Intel
maintained until 2019. Whether it is \llvmmca{}~\cite{llvm-mca},
\uica{}~\cite{uica}, \ithemal~\cite{ithemal} or \gus~\cite{phd:gruber}, all
these tools strive to extract various performance metrics, including the number
of CPU cycles a computation kernel will take ---~which roughly translates to
execution time. In addition to raw measurements (relying on hardware
counters), these model-based analyses provide higher-level and refined data, to
expose the bottlenecks and guide the optimization of a given code. This
feedback is useful to experts optimizing computation kernels, including
scientific simulations and deep-learning kernels.
An exact throughput prediction would require a cycle-accurate simulator of the
processor, based on microarchitectural data that is most often not publicly
available. The tools mentioned above thus each
solve in their own way the challenge of modeling complex CPUs while remaining
simple enough to yield a prediction in a reasonable time, ending up with
different models. For instance, consider the following x86-64 basic block,
computing a step of a general matrix multiplication:
\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
movsd (%rcx, %rax), %xmm0
\end{lstlisting}
\end{minipage}

From the source code alone, the number of times such a basic block executes cannot
generally be known. More importantly, the compiler may apply any number of
transformations: unrolling, for instance, changes this number. Control flow may
also be complicated by code versioning.
%In the general case, instrumenting the generated code to obtain the number of
%occurrences of the basic block yields accurate results.
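The effect of such transformations on occurrence counts can be sketched with a toy model (ours, purely illustrative; not part of \cesasme{}):

```python
# Toy illustration: the body of a rolled loop over n elements executes n
# times, while a k-fold unrolled version executes its (larger) body block
# only n // k times, plus an epilogue block for the remainder. A per-block
# throughput prediction must therefore be combined with an occurrence
# count that depends on how the compiler transformed the loop.

def block_occurrences(n: int, unroll: int = 1) -> tuple[int, int]:
    """Return (main-body executions, epilogue executions) for a trip count n."""
    return n // unroll, n % unroll

# Rolled loop: the basic block runs once per iteration.
assert block_occurrences(1000, unroll=1) == (1000, 0)
# 4-fold unrolled: same work, but the block executes only 250 times.
assert block_occurrences(1000, unroll=4) == (250, 0)
# Non-multiple trip counts add an epilogue block.
assert block_occurrences(1002, unroll=4) == (250, 2)
```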
\bigskip
In this article, we present a fully-tooled solution to evaluate and compare the
diversity of static throughput predictors. Our tool, \cesasme, solves two main
issues in this direction. In Section~\ref{sec:bench_gen}, we describe how
\cesasme{} generates a wide variety of computation kernels stressing different
parameters of the architecture, and thus of the predictors' models, while
staying close to representative workloads. To achieve this, we use
Polybench~\cite{bench:polybench}, a C-level benchmark suite representative of
scientific computation workloads, that we combine with a variety of
optimizations, including polyhedral loop transformations.
In Section~\ref{sec:bench_harness}, we describe how \cesasme{} is able to
evaluate throughput predictors on this set of benchmarks by lifting their
predictions to a total number of cycles that can be compared to a hardware
counters-based measure. A
high-level view of \cesasme{} is shown in Figure~\ref{fig:contrib}.
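The lifting step can be sketched as follows (a simplified model with our own hypothetical naming; the actual harness is more involved):

```python
# Minimal sketch of prediction lifting (simplified; names are ours). Each
# analyzer predicts the cycle cost of one execution of a basic block;
# lifting scales that cost by the measured execution count of the block
# and sums over all blocks of the kernel, yielding a whole-kernel cycle
# count that can be compared against a hardware-counter measurement.

def lift_predictions(blocks: list[dict]) -> float:
    """blocks: [{'pred_cycles': per-execution prediction,
                 'occurrences': measured execution count}, ...]"""
    return sum(b["pred_cycles"] * b["occurrences"] for b in blocks)

kernel = [
    {"pred_cycles": 6.5, "occurrences": 10_000},  # hot inner-loop block
    {"pred_cycles": 12.0, "occurrences": 100},    # outer-loop bookkeeping
]
assert lift_predictions(kernel) == 66_200.0
```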
In Section~\ref{sec:exp_setup}, we detail our experimental setup and assess our
methodology. In Section~\ref{sec:results_analysis}, we compare the predictors' results and
analyze the results of \cesasme{}.
In addition to statistical studies, we use \cesasme's results
to investigate analyzers' flaws. We show that code
analyzers do not always correctly model data dependencies through memory
accesses, substantially impacting their precision.

directly (no indirections) and whose loops are affine.
These constraints are necessary to ensure that the microkernelification phase,
presented below, generates segfault-free code.
In this case, we use Polybench~\cite{bench:polybench}, a suite of 30
benchmarks for polyhedral compilation ---~of which we use only 26. The
\texttt{nussinov}, \texttt{ludcmp} and \texttt{deriche} benchmarks are
removed because they are incompatible with PoCC (introduced below). The
resources of the target architecture, and by extension the models on which the
static analyzers are based.
In this case, we chose to use the
\textsc{Pluto}~\cite{tool:pluto} and PoCC~\cite{tool:pocc} polyhedral
compilers, to easily access common loop nest optimizations: register tiling,
tiling,
skewing, vectorization/simdization, loop unrolling, loop permutation,
loop fusion.
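As a sketch of one such transformation, tiling reorders an iteration space into blocks without changing the set of iterations performed (illustrative Python, independent of the C code Pluto and PoCC actually emit):

```python
# Toy sketch of loop tiling: the tiled nest visits exactly the same
# (i, j) iterations as the direct nest, but in blocks, which changes
# locality and the structure of the generated basic blocks.

def direct_iterations(n: int, m: int):
    return [(i, j) for i in range(n) for j in range(m)]

def tiled_iterations(n: int, m: int, tile: int):
    out = []
    for ii in range(0, n, tile):            # tile row
        for jj in range(0, m, tile):        # tile column
            for i in range(ii, min(ii + tile, n)):
                for j in range(jj, min(jj + tile, m)):
                    out.append((i, j))
    return out

# Same iteration set, different order: semantics preserved, locality changed.
assert sorted(tiled_iterations(7, 5, tile=3)) == sorted(direct_iterations(7, 5))
assert tiled_iterations(7, 5, tile=3) != direct_iterations(7, 5)
```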
These transformations are meant to maximize variety within the initial
benchmark suite.

approach, as most throughput prediction tools work at basic-block level, and are
thus readily available and can be directly plugged into our harness.
Finally, we control the proportion of cache misses in the program's execution
using \texttt{Cachegrind}~\cite{tool:valgrind} and \gus; programs that have more
than 15\,\% of cache misses on a warm cache are not considered L1-resident and
are discarded.
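That filter can be sketched as follows (field names are hypothetical; the real harness parses Cachegrind and \gus{} reports):

```python
# Minimal sketch of the L1-residency filter (assumed interface). A
# benchmark is kept only if, on a warm cache, at most 15% of its data
# accesses miss the cache; otherwise it is discarded as not L1-resident.

L1_MISS_THRESHOLD = 0.15

def is_l1_resident(accesses: int, misses: int) -> bool:
    return misses / accesses <= L1_MISS_THRESHOLD

assert is_l1_resident(1_000_000, 20_000)        # 2% misses: kept
assert not is_l1_resident(1_000_000, 300_000)   # 30% misses: discarded
```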

consequently, lifted predictions can reasonably be compared to one another.
\begin{figure}
\centering
\includegraphics[width=0.7\linewidth]{results_comparability_hist.pdf}
\caption{Relative error distribution \wrt{} \perf}\label{fig:exp_comparability}
\end{figure}
\begin{table}
\centering
\caption{Relative error statistics \wrt{} \perf}\label{table:exp_comparability}
\begin{tabular}{l r r}
\toprule
& \textbf{Best block-based} & \textbf{BHive} \\
Q3 (\%) & 15.41 & 23.01 \\
\bottomrule
\end{tabular}
\end{table}
\begin{table}[!htbp]
\centering
\caption{Bottleneck reports from the studied tools}\label{table:coverage}
\begin{tabular}{l | r r r | r r r | r r r}
\toprule
& \multicolumn{3}{c|}{\textbf{Frontend}}
\textbf{Total} & 907 & 1360 & 35.2 \% & 509 & 687 & 65.8 \% & 310 & 728 & 70.3 \% \\
\bottomrule
\end{tabular}
\end{table}
\subsection{Relevance and representativity (bottleneck
analysis)}\label{ssec:bottleneck_diversity}

understanding of which tool is more suited for each situation.
\subsection{Throughput results}\label{ssec:overall_results}
\begin{table}
\centering
\caption{Statistical analysis of overall results}\label{table:overall_analysis_stats}
\footnotesize
\begin{tabular}{l r r r r r r r r r}
\toprule
\textbf{Bencher} & \textbf{Datapoints} &
\multicolumn{2}{c}{\textbf{Failures}} &
\textbf{MAPE} & \textbf{Median} & \textbf{Q1} & \textbf{Q3} & \textbf{$K_\tau$} & \textbf{Time}\\
& & (Count) & (\%) & (\%) & (\%) & (\%) & (\%) & & (CPU$\cdot$h) \\
\midrule
BHive & 2198 & 1302 & (37.20) & 27.95 & 7.78 & 3.01 & 23.01 & 0.81 & 1.37\\
llvm-mca & 3500 & 0 & (0.00) & 36.71 & 27.80 & 12.92 & 59.80 & 0.57 & 0.96 \\
UiCA & 3500 & 0 & (0.00) & 29.59 & 18.26 & 7.11 & 52.99 & 0.58 & 2.12 \\
Ithemal & 3500 & 0 & (0.00) & 57.04 & 48.70 & 22.92 & 75.69 & 0.39 & 0.38 \\
Iaca & 3500 & 0 & (0.00) & 30.23 & 18.51 & 7.13 & 57.18 & 0.59 & 1.31 \\
Gus & 3500 & 0 & (0.00) & 20.37 & 15.01 & 7.82 & 30.59 & 0.82 & 188.04 \\
\bottomrule
\end{tabular}
\end{table}
The error distribution of the relative errors, for each tool, is presented as a
box plot in Figure~\ref{fig:overall_analysis_boxplot}. Statistical indicators
are also given in Table~\ref{table:overall_analysis_stats}. We also give, for
each tool, its Kendall's tau indicator~\cite{kendalltau}: this indicator,
used to evaluate \eg{} uiCA~\cite{uica} and Palmed~\cite{palmed}, measures how
well the pair-wise ordering of benchmarks is preserved, $-1$ being a full
anti-correlation and $1$ a full correlation. This is especially useful when one
is not interested in a program's absolute throughput, but rather in comparing
which program has a better throughput.
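For concreteness, a direct $O(n^2)$ implementation of the coefficient (the simple $\tau_a$ variant, ignoring ties) can be sketched as:

```python
from itertools import combinations

# Kendall's tau (tau-a variant, O(n^2) for clarity): the normalized
# difference between concordant and discordant pairs. A pair of
# benchmarks is concordant when the predictor orders them the same way
# the hardware measurement does.

def kendall_tau(measured: list[float], predicted: list[float]) -> float:
    concordant = discordant = 0
    for (m1, p1), (m2, p2) in combinations(zip(measured, predicted), 2):
        s = (m1 - m2) * (p1 - p2)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n = len(measured)
    return (concordant - discordant) / (n * (n - 1) / 2)

assert kendall_tau([1, 2, 3, 4], [10, 20, 30, 40]) == 1.0   # full correlation
assert kendall_tau([1, 2, 3, 4], [40, 30, 20, 10]) == -1.0  # full anti-correlation
assert kendall_tau([1, 2, 3, 4], [10, 20, 40, 30]) == 2 / 3  # one swapped pair
```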
\begin{figure}
\centering
\includegraphics[width=0.5\linewidth]{overall_analysis_boxplot.pdf}
\caption{Statistical distribution of relative errors}\label{fig:overall_analysis_boxplot}
\end{figure}
frontend bottlenecks, thus making it easier for them to agree.
\begin{table}
\centering
\caption{Diverging bottleneck prediction per tool}\label{table:bottleneck_diverging_pred}
\begin{tabular}{l r r r r}
\toprule
\textbf{Tool}
\iaca{} & 1221 & (53.0 \%) & 900 & (36.6 \%) \\
\bottomrule
\end{tabular}
\end{table}
Table~\ref{table:bottleneck_diverging_pred}, in turn, breaks down the cases
tool for each kind of bottleneck.
\subsection{Impact of dependency-boundness}\label{ssec:memlatbound}
\begin{table}
\centering
\caption{Statistical analysis of overall results, without latency bound
through memory-carried dependencies rows}\label{table:nomemdeps_stats}
\footnotesize
\begin{tabular}{l r r r r r r r r r}
\toprule
\textbf{Bencher} & \textbf{Datapoints} & \textbf{Failures} & \textbf{(\%)} &
\textbf{MAPE} & \textbf{Median} & \textbf{Q1} & \textbf{Q3} & \textbf{$K_\tau$}\\
\midrule
BHive & 1365 & 1023 & (42.84\,\%) & 34.07\,\% & 8.62\,\% & 4.30\,\% & 24.25\,\% & 0.76\\
llvm-mca & 2388 & 0 & (0.00\,\%) & 27.06\,\% & 21.04\,\% & 9.69\,\% & 32.73\,\% & 0.79\\
Gus & 2388 & 0 & (0.00\,\%) & 23.18\,\% & 20.23\,\% & 8.78\,\% & 32.73\,\% & 0.83\\
\bottomrule
\end{tabular}
\end{table}
An overview of the full results table (available in our artifact) hints towards
two main tendencies: on a significant number of rows, the static tools
against 129 (14.8\,\%) for \texttt{default} and 61 (7.0\,\%) for
investigate the issue.
\begin{figure}
\centering
\includegraphics[width=0.5\linewidth]{nomemdeps_boxplot.pdf}
\caption{Statistical distribution of relative errors, with and without
pruning latency bound through memory-carried dependencies rows}\label{fig:nomemdeps_boxplot}
\end{figure}

\input{00_intro.tex}
\input{05_related_works.tex}
\input{10_bench_gen.tex}
\input{15_harness.tex}
\input{20_evaluation.tex}
\input{25_results_analysis.tex}
\input{30_future_works.tex}

\node[rednode] (ppapi) [left=1cm of bhive] {perf (measure)};
\node[rednode] (gus) [below=0.5em of ppapi] {Gus};
%% \node[rednode] (uica) [below=of gdb] {uiCA};
\node[rednode] (lifting) [below right=1em and 0.2cm of gdb] {
Prediction lifting\\\figref{ssec:harness_lifting}};
\node[
draw=black,
label={[above,xshift=1cm]\footnotesize Variations},
fit=(pocc) (kernel) (gcc)
] (vars) {};
\node[resultnode] (bench2) [right=of lifting] {Evaluation metrics \\ for
code analyzers};
% Key
\node[] (keyblue1) [below left=0.7cm and 0cm of vars] {};
\node[hiddennode] (keyblue2) [right=0.5cm of keyblue1] {Section~\ref{sec:bench_gen}: generating microbenchmarks};
\node[] (keyred1) [below=.5em of keyblue1] {};
\node[hiddennode] (keyred2) [right=0.5cm of keyred1] {Section~\ref{sec:bench_harness}: benchmarking harness};
\node[] (keyresult1) [below=.5em of keyred1] {};
\node[hiddennode] (keyresult2) [right=0.5cm of keyresult1]
{Section~\ref{sec:results_analysis}: results analysis};
\draw[->, very thick, harnarrow] (gdb.east) -- (ithemal.west);
\draw[->, very thick, harnarrow] (gdb.east) -- (bhive.west);
\draw[->, very thick, harnarrow] (gdb.east) -- (llvm.west);
\draw[->, very thick, harnarrow] (comps.south-|lifting) -- (lifting.north);
\draw[->, very thick] (lifting.east) -- (bench2.west);
\end{tikzpicture}
}
\caption{Our analysis and measurement environment.\label{fig:contrib}}

\newcommand{\uops}{\uop{}s}
\newcommand{\eg}{\textit{e.g.}}
\newcommand{\ie}{\textit{i.e.}}
\newcommand{\wrt}{\textit{wrt.}}
\newcommand{\kerK}{\mathcal{K}}
\newcommand{\calR}{\mathcal{R}}
\newcommand{\pipedream}{\texttt{Pipedream}}
\newcommand{\palmed}{\texttt{Palmed}}
\newcommand{\pmevo}{\texttt{PMEvo}}
\newcommand{\gus}{\texttt{Gus}}
\newcommand{\ithemal}{\texttt{Ithemal}}
\newcommand{\osaca}{\texttt{Osaca}}
\newcommand{\bhive}{\texttt{BHive}}
\newcommand{\anica}{\texttt{AnICA}}
\newcommand{\cesasme}{\texttt{CesASMe}}
\newcommand{\gdb}{\texttt{gdb}}
\newcommand{\coeq}{CO$_{2}$eq}
\newcommand{\figref}[1]{[\ref{#1}]}
\newcommand{\reg}[1]{\texttt{\%#1}}
% Hyperlinks
\newcommand{\pymodule}[1]{\href{https://docs.python.org/3/library/#1.html}{\lstpython{#1}}}

\usepackage{import}
\usepackage{wrapfig}
\usepackage{float}
\usepackage{tikz}
\usepackage[bottom]{footmisc} % footnotes are below floats
\usepackage[final]{microtype}
\usetikzlibrary{positioning}
\usetikzlibrary{fit}
\emergencystretch=1em
% Local sty files