CesASMe: first adaptations
parent fc9182428d
commit f6f0336b34
9 changed files with 79 additions and 70 deletions
@ -1,20 +1,3 @@
\begin{abstract}
A variety of code analyzers, such as \iaca, \uica, \llvmmca{} or
\ithemal{}, strive to statically predict the throughput of a computation
kernel. Each analyzer is based on its own simplified CPU model
reasoning at the scale of an isolated basic block.
Facing this diversity, evaluating their strengths and
weaknesses is important to guide both their usage and their enhancement.

We argue that reasoning at the scale of a single basic block is not
always sufficient and that a lack of context can mislead analyses. We present \tool, a fully-tooled
solution to evaluate code analyzers on C-level benchmarks. It is composed of a
benchmark derivation procedure that feeds an evaluation harness. We use it to
evaluate state-of-the-art code analyzers and to provide insights on their
precision. We use \tool's results to show that memory-carried data
dependencies are a major source of imprecision for these tools.
\end{abstract}

\section{Introduction}\label{sec:intro}

At a time when software is expected to perform more computations, faster and in
@ -23,14 +6,14 @@ in particular the CPU resources) they consume are very useful to guide their
optimization. This need is reflected in the diversity of binary or assembly
code analyzers following the deprecation of \iaca~\cite{iaca}, which Intel has
maintained through 2019. Whether it is \llvmmca{}~\cite{llvm-mca},
\uica{}~\cite{uica}, \ithemal~\cite{ithemal} or \gus~\cite{phd:gruber}, all these tools strive to extract various
performance metrics, including the number of CPU cycles a computation kernel will take
---~which roughly translates to execution time.
In addition to raw measurements (relying on hardware counters), these model-based analyses provide
higher-level and refined data, to expose the bottlenecks and guide the
optimization of a given code. This feedback is useful to experts optimizing
computation kernels, including scientific simulations and deep-learning
kernels.
\uica{}~\cite{uica}, \ithemal~\cite{ithemal} or \gus~\cite{phd:gruber}, all
these tools strive to extract various performance metrics, including the number
of CPU cycles a computation kernel will take ---~which roughly translates to
execution time. In addition to raw measurements (relying on hardware
counters), these model-based analyses provide higher-level and refined data, to
expose the bottlenecks and guide the optimization of a given code. This
feedback is useful to experts optimizing computation kernels, including
scientific simulations and deep-learning kernels.

An exact throughput prediction would require a cycle-accurate simulator of the
processor, based on microarchitectural data that is most often not publicly
@ -39,6 +22,7 @@ solve in their own way the challenge of modeling complex CPUs while remaining
simple enough to yield a prediction in a reasonable time, ending up with
different models. For instance, on the following x86-64 basic block computing a
general matrix multiplication,

\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
movsd (%rcx, %rax), %xmm0
@ -112,30 +96,27 @@ generally be known. More importantly, the compiler may apply any number of
transformations: unrolling, for instance, changes this number. Control flow may
also be complicated by code versioning.

%In the general case, instrumenting the generated code to obtain the number of
%occurrences of the basic block yields accurate results.

\bigskip

In this article, we present a fully-tooled solution to evaluate and compare the
diversity of static throughput predictors. Our tool, \tool, solves two main
diversity of static throughput predictors. Our tool, \cesasme, solves two main
issues in this direction. In Section~\ref{sec:bench_gen}, we describe how
\tool{} generates a wide variety of computation kernels stressing different
\cesasme{} generates a wide variety of computation kernels stressing different
parameters of the architecture, and thus of the predictors' models, while
staying close to representative workloads. To achieve this, we use
Polybench~\cite{polybench}, a C-level benchmark suite representative of
Polybench~\cite{bench:polybench}, a C-level benchmark suite representative of
scientific computation workloads, that we combine with a variety of
optimisations, including polyhedral loop transformations.
In Section~\ref{sec:bench_harness}, we describe how \tool{} is able to
In Section~\ref{sec:bench_harness}, we describe how \cesasme{} is able to
evaluate throughput predictors on this set of benchmarks by lifting their
predictions to a total number of cycles that can be compared to a hardware
counters-based measure. A
high-level view of \tool{} is shown in Figure~\ref{fig:contrib}.
high-level view of \cesasme{} is shown in Figure~\ref{fig:contrib}.

In Section~\ref{sec:exp_setup}, we detail our experimental setup and assess our
methodology. In Section~\ref{sec:results_analysis}, we compare the predictors' results and
analyze the results of \tool{}.
In addition to statistical studies, we use \tool's results
analyze the results of \cesasme{}.
In addition to statistical studies, we use \cesasme's results
to investigate analyzers' flaws. We show that code
analyzers do not always correctly model data dependencies through memory
accesses, substantially impacting their precision.
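The prediction-lifting step described above can be sketched in a few lines. This is a hypothetical illustration, not the article's actual implementation: it assumes each basic block's predicted cycles-per-iteration is simply scaled by a measured occurrence count, and the function and variable names are invented for the example.

```python
# Hypothetical sketch of "prediction lifting": per-block throughput
# predictions (cycles per iteration of the block) are scaled by the
# number of times each block executes, yielding a whole-program cycle
# count comparable to a hardware-counter measurement.
def lift_predictions(block_predictions: dict[str, float],
                     block_occurrences: dict[str, int]) -> float:
    """Lift basic-block cycle predictions to a total cycle count."""
    return sum(cycles * block_occurrences[block]
               for block, cycles in block_predictions.items())

# Illustrative numbers only (not taken from the article):
preds = {"loop_body": 12.5, "epilogue": 4.0}
counts = {"loop_body": 1_000_000, "epilogue": 1}
total = lift_predictions(preds, counts)
print(total)  # 12500004.0
```

The key design point is that lifting produces one comparable scalar per whole benchmark, so tools with different block-level metrics can be evaluated against the same measured baseline.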
@ -39,7 +39,7 @@ directly (no indirections) and whose loops are affine.
These constraints are necessary to ensure that the microkernelification phase,
presented below, generates segfault-free code.

In this case, we use Polybench~\cite{polybench}, a suite of 30
In this case, we use Polybench~\cite{bench:polybench}, a suite of 30
benchmarks for polyhedral compilation ---~of which we use only 26. The
\texttt{nussinov}, \texttt{ludcmp} and \texttt{deriche} benchmarks are
removed because they are incompatible with PoCC (introduced below). The
@ -58,7 +58,9 @@ resources of the target architecture, and by extension the models on which the
static analyzers are based.

In this case, we chose to use the
\textsc{Pluto}~\cite{pluto} and PoCC~\cite{pocc} polyhedral compilers, to easily access common loop nest optimizations: register tiling, tiling,
\textsc{Pluto}~\cite{tool:pluto} and PoCC~\cite{tool:pocc} polyhedral
compilers, to easily access common loop nest optimizations: register tiling,
tiling,
skewing, vectorization/simdization, loop unrolling, loop permutation,
loop fusion.
These transformations are meant to maximize variety within the initial
@ -82,6 +82,6 @@ approach, as most throughput prediction tools work at basic block level, and are
thus readily available and can be directly plugged into our harness.

Finally, we control the proportion of cache misses in the program's execution
using \texttt{Cachegrind}~\cite{valgrind} and \gus; programs that have more
using \texttt{Cachegrind}~\cite{tool:valgrind} and \gus; programs that have more
than 15\,\% of cache misses on a warm cache are not considered L1-resident and
are discarded.
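The L1-residency filter above amounts to a simple threshold on the warm-cache miss rate. The sketch below is hypothetical (the names and the `benchmarks` data are invented); only the 15\,\% threshold comes from the text.

```python
# Hypothetical sketch of the L1-residency filter: keep a benchmark only
# if its warm-cache miss rate is at most 15%, as stated in the text.
CACHE_MISS_THRESHOLD = 0.15

def is_l1_resident(cache_accesses: int, cache_misses: int) -> bool:
    """True if the measured miss rate qualifies the program as L1-resident."""
    if cache_accesses == 0:
        return False
    return cache_misses / cache_accesses <= CACHE_MISS_THRESHOLD

# `benchmarks` maps names to (accesses, misses) pairs, as a tool like
# Cachegrind would report them; numbers are illustrative only.
benchmarks = {"gemm": (1_000_000, 20_000), "stencil": (1_000_000, 400_000)}
kept = {name for name, (acc, miss) in benchmarks.items()
        if is_l1_resident(acc, miss)}
print(kept)  # {'gemm'}: 2% misses passes, 40% is discarded
```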
@ -64,13 +64,12 @@ consequently, lifted predictions can reasonably be compared to one another.

\begin{figure}
\centering
\includegraphics[width=\linewidth]{figs/results_comparability_hist.pdf}
\includegraphics[width=0.7\linewidth]{results_comparability_hist.pdf}
\caption{Relative error distribution \wrt{} \perf}\label{fig:exp_comparability}
\end{figure}

\begin{table}
\centering
\caption{Relative error statistics \wrt{} \perf}\label{table:exp_comparability}
\begin{tabular}{l r r}
\toprule
& \textbf{Best block-based} & \textbf{BHive} \\
@ -84,13 +83,12 @@ consequently, lifted predictions can reasonably be compared to one another.
Q3 (\%) & 15.41 & 23.01 \\
\bottomrule
\end{tabular}
\caption{Relative error statistics \wrt{} \perf}\label{table:exp_comparability}
\end{table}


\begin{table*}[!htbp]
\begin{table}[!htbp]
\centering
\caption{Bottleneck reports from the studied tools}\label{table:coverage}

\begin{tabular}{l | r r r | r r r | r r r}
\toprule
& \multicolumn{3}{c|}{\textbf{Frontend}}
@ -128,7 +126,8 @@ floyd-warshall & 74 & 16 & 29.7 \% & 16 & 24 & 68.8 \% & 20 &
\textbf{Total} & 907 & 1360 & 35.2 \% & 509 & 687 & 65.8 \% & 310 & 728 & 70.3 \% \\
\bottomrule
\end{tabular}
\end{table*}
\caption{Bottleneck reports from the studied tools}\label{table:coverage}
\end{table}

\subsection{Relevance and representativity (bottleneck
analysis)}\label{ssec:bottleneck_diversity}
@ -11,28 +11,31 @@ understanding of which tool is more suited for each situation.

\subsection{Throughput results}\label{ssec:overall_results}

\begin{table*}
\begin{table}
\centering
\caption{Statistical analysis of overall results}\label{table:overall_analysis_stats}
\footnotesize
\begin{tabular}{l r r r r r r r r r}
\toprule
\textbf{Bencher} & \textbf{Datapoints} & \textbf{Failures} & \textbf{(\%)} &
\textbf{MAPE} & \textbf{Median} & \textbf{Q1} & \textbf{Q3} & \textbf{K. tau} & \textbf{Time (CPU$\cdot$h)}\\
\textbf{Bencher} & \textbf{Datapoints} &
\multicolumn{2}{c}{\textbf{Failures}} &
\textbf{MAPE} & \textbf{Median} & \textbf{Q1} & \textbf{Q3} & \textbf{$K_\tau$} & \textbf{Time}\\
& & (Count) & (\%) & (\%) & (\%) & (\%) & (\%) & & (CPU$\cdot$h) \\
\midrule
BHive & 2198 & 1302 & (37.20\,\%) & 27.95\,\% & 7.78\,\% & 3.01\,\% & 23.01\,\% & 0.81 & 1.37\\
llvm-mca & 3500 & 0 & (0.00\,\%) & 36.71\,\% & 27.80\,\% & 12.92\,\% & 59.80\,\% & 0.57 & 0.96 \\
UiCA & 3500 & 0 & (0.00\,\%) & 29.59\,\% & 18.26\,\% & 7.11\,\% & 52.99\,\% & 0.58 & 2.12 \\
Ithemal & 3500 & 0 & (0.00\,\%) & 57.04\,\% & 48.70\,\% & 22.92\,\% & 75.69\,\% & 0.39 & 0.38 \\
Iaca & 3500 & 0 & (0.00\,\%) & 30.23\,\% & 18.51\,\% & 7.13\,\% & 57.18\,\% & 0.59 & 1.31 \\
Gus & 3500 & 0 & (0.00\,\%) & 20.37\,\% & 15.01\,\% & 7.82\,\% & 30.59\,\% & 0.82 & 188.04 \\
BHive & 2198 & 1302 & (37.20) & 27.95 & 7.78 & 3.01 & 23.01 & 0.81 & 1.37\\
llvm-mca & 3500 & 0 & (0.00) & 36.71 & 27.80 & 12.92 & 59.80 & 0.57 & 0.96 \\
UiCA & 3500 & 0 & (0.00) & 29.59 & 18.26 & 7.11 & 52.99 & 0.58 & 2.12 \\
Ithemal & 3500 & 0 & (0.00) & 57.04 & 48.70 & 22.92 & 75.69 & 0.39 & 0.38 \\
Iaca & 3500 & 0 & (0.00) & 30.23 & 18.51 & 7.13 & 57.18 & 0.59 & 1.31 \\
Gus & 3500 & 0 & (0.00) & 20.37 & 15.01 & 7.82 & 30.59 & 0.82 & 188.04 \\
\bottomrule
\end{tabular}
\end{table*}
\caption{Statistical analysis of overall results}\label{table:overall_analysis_stats}
\end{table}

The distribution of relative errors, for each tool, is presented as a
box plot in Figure~\ref{fig:overall_analysis_boxplot}. Statistical indicators
are also given in Table~\ref{table:overall_analysis_stats}. We also give, for
each tool, its Kendall's tau indicator~\cite{kendall1938tau}: this indicator,
each tool, its Kendall's tau indicator~\cite{kendalltau}: this indicator,
used to evaluate \eg{} uiCA~\cite{uica} and Palmed~\cite{palmed}, measures how
well the pair-wise ordering of benchmarks is preserved, $-1$ being a full
anti-correlation and $1$ a full correlation. This is especially useful when one
@ -40,7 +43,8 @@ is not interested in a program's absolute throughput, but rather in comparing
which program has a better throughput.
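For reference, the standard definition of Kendall's tau over $n$ items (a generic formula, not specific to this article's pipeline) counts concordant pairs $n_c$ and discordant pairs $n_d$ among all $\binom{n}{2}$ pairs:

```latex
% Kendall's tau: n_c concordant pairs, n_d discordant pairs out of
% \binom{n}{2} pairs of benchmarks; tau = 1 iff both rankings agree fully.
\[
  \tau = \frac{n_c - n_d}{\binom{n}{2}}, \qquad \tau \in [-1, 1]
\]
```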
\begin{figure}
\includegraphics[width=\linewidth]{figs/overall_analysis_boxplot.pdf}
\centering
\includegraphics[width=0.5\linewidth]{overall_analysis_boxplot.pdf}
\caption{Statistical distribution of relative errors}\label{fig:overall_analysis_boxplot}
\end{figure}
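The error statistics reported in the tables above (MAPE, median, quartiles of relative error) follow standard definitions; a minimal sketch, using invented data and the standard-library `statistics` module, could look like this (the article's exact pipeline may differ):

```python
import statistics

# Relative error of each prediction against the measured value, then the
# summary statistics used in the tables: MAPE (mean of relative errors),
# median, and the first and third quartiles.
def relative_errors(predicted: list[float], measured: list[float]) -> list[float]:
    return [abs(p - m) / m for p, m in zip(predicted, measured)]

def summarize(errors: list[float]) -> dict[str, float]:
    q1, med, q3 = statistics.quantiles(errors, n=4)  # exclusive quartiles
    return {"MAPE": statistics.fmean(errors), "Q1": q1, "Median": med, "Q3": q3}

# Illustrative numbers only:
errs = relative_errors([110, 95, 130, 80], [100, 100, 100, 100])
print(summarize(errs))
```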
@ -185,7 +189,6 @@ frontend bottlenecks, thus making it easier for them to agree.

\begin{table}
\centering
\caption{Diverging bottleneck prediction per tool}\label{table:bottleneck_diverging_pred}
\begin{tabular}{l r r r r}
\toprule
\textbf{Tool}
@ -197,6 +200,7 @@ frontend bottlenecks, thus making it easier for them to agree.
\iaca{} & 1221 & (53.0 \%) & 900 & (36.6 \%) \\
\bottomrule
\end{tabular}
\caption{Diverging bottleneck prediction per tool}\label{table:bottleneck_diverging_pred}
\end{table}

Table~\ref{table:bottleneck_diverging_pred}, in turn, breaks down the cases
@ -216,14 +220,13 @@ tool for each kind of bottleneck.

\subsection{Impact of dependency-boundness}\label{ssec:memlatbound}

\begin{table*}
\begin{table}
\centering
\caption{Statistical analysis of overall results, without latency bound
through memory-carried dependencies rows}\label{table:nomemdeps_stats}
\footnotesize
\begin{tabular}{l r r r r r r r r r}
\toprule
\textbf{Bencher} & \textbf{Datapoints} & \textbf{Failures} & \textbf{(\%)} &
\textbf{MAPE} & \textbf{Median} & \textbf{Q1} & \textbf{Q3} & \textbf{K. tau}\\
\textbf{MAPE} & \textbf{Median} & \textbf{Q1} & \textbf{Q3} & \textbf{$K_\tau$}\\
\midrule
BHive & 1365 & 1023 & (42.84\,\%) & 34.07\,\% & 8.62\,\% & 4.30\,\% & 24.25\,\% & 0.76\\
llvm-mca & 2388 & 0 & (0.00\,\%) & 27.06\,\% & 21.04\,\% & 9.69\,\% & 32.73\,\% & 0.79\\
@ -233,7 +236,9 @@ Iaca & 2388 & 0 & (0.00\,\%) & 17.55\,\% & 12.17\,\% & 4.64\,\% & 22.35\,\% & 0.
Gus & 2388 & 0 & (0.00\,\%) & 23.18\,\% & 20.23\,\% & 8.78\,\% & 32.73\,\% & 0.83\\
\bottomrule
\end{tabular}
\end{table*}
\caption{Statistical analysis of overall results, without latency bound
through memory-carried dependencies rows}\label{table:nomemdeps_stats}
\end{table}

An overview of the full results table (available in our artifact) hints towards
two main tendencies: on a significant number of rows, the static tools
@ -256,7 +261,8 @@ against 129 (14.8\,\%) for \texttt{default} and 61 (7.0\,\%) for
investigate the issue.

\begin{figure}
\includegraphics[width=\linewidth]{figs/nomemdeps_boxplot.pdf}
\centering
\includegraphics[width=0.5\linewidth]{nomemdeps_boxplot.pdf}
\caption{Statistical distribution of relative errors, with and without
pruning latency bound through memory-carried dependencies rows}\label{fig:nomemdeps_boxplot}
\end{figure}
@ -3,6 +3,7 @@
\input{00_intro.tex}
\input{05_related_works.tex}
\input{10_bench_gen.tex}
\input{15_harness.tex}
\input{20_evaluation.tex}
\input{25_results_analysis.tex}
\input{30_future_works.tex}
@ -29,7 +29,7 @@
\node[rednode] (ppapi) [left=1cm of bhive] {perf (measure)};
\node[rednode] (gus) [below=0.5em of ppapi] {Gus};
%% \node[rednode] (uica) [below=of gdb] {uiCA};
\node[rednode] (lifting) [right=of bhive] {
\node[rednode] (lifting) [below right=1em and 0.2cm of gdb] {
Prediction lifting\\\figref{ssec:harness_lifting}};
\node[
draw=black,
@ -47,15 +47,15 @@
label={[above,xshift=1cm]\footnotesize Variations},
fit=(pocc) (kernel) (gcc)
] (vars) {};
\node[resultnode] (bench2) [below=of lifting] {Evaluation metrics \\ for
\node[resultnode] (bench2) [right=of lifting] {Evaluation metrics \\ for
code analyzers};

% Key
\node[] (keyblue1) [below left=0.7cm and 0cm of vars] {};
\node[hiddennode] (keyblue2) [right=0.5cm of keyblue1] {Section~\ref{sec:bench_gen}: generating microbenchmarks};
\node[] (keyred1) [right=0.6cm of keyblue2] {};
\node[] (keyred1) [below=.5em of keyblue1] {};
\node[hiddennode] (keyred2) [right=0.5cm of keyred1] {Section~\ref{sec:bench_harness}: benchmarking harness};
\node[] (keyresult1) [right=0.6cm of keyred2] {};
\node[] (keyresult1) [below=.5em of keyred1] {};
\node[hiddennode] (keyresult2) [right=0.5cm of keyresult1]
{Section~\ref{sec:results_analysis}: results analysis};
@ -74,8 +74,8 @@
\draw[->, very thick, harnarrow] (gdb.east) -- (ithemal.west);
\draw[->, very thick, harnarrow] (gdb.east) -- (bhive.west);
\draw[->, very thick, harnarrow] (gdb.east) -- (llvm.west);
\draw[->, very thick, harnarrow] (comps.east|-lifting) -- (lifting.west);
\draw[->, very thick] (lifting.south) -- (bench2.north);
\draw[->, very thick, harnarrow] (comps.south-|lifting) -- (lifting.north);
\draw[->, very thick] (lifting.east) -- (bench2.west);
\end{tikzpicture}
}
\caption{Our analysis and measurement environment.\label{fig:contrib}}
@ -2,6 +2,8 @@
\newcommand{\uops}{\uop{}s}

\newcommand{\eg}{\textit{e.g.}}
\newcommand{\ie}{\textit{i.e.}}
\newcommand{\wrt}{\textit{w.r.t.}}

\newcommand{\kerK}{\mathcal{K}}
\newcommand{\calR}{\mathcal{R}}
@ -36,6 +38,20 @@
\newcommand{\pipedream}{\texttt{Pipedream}}
\newcommand{\palmed}{\texttt{Palmed}}
\newcommand{\pmevo}{\texttt{PMEvo}}
\newcommand{\gus}{\texttt{Gus}}
\newcommand{\ithemal}{\texttt{Ithemal}}
\newcommand{\osaca}{\texttt{Osaca}}
\newcommand{\bhive}{\texttt{BHive}}
\newcommand{\anica}{\texttt{AnICA}}
\newcommand{\cesasme}{\texttt{CesASMe}}

\newcommand{\gdb}{\texttt{gdb}}

\newcommand{\coeq}{CO$_{2}$eq}

\newcommand{\figref}[1]{[\ref{#1}]}

\newcommand{\reg}[1]{\texttt{\%#1}}

% Hyperlinks
\newcommand{\pymodule}[1]{\href{https://docs.python.org/3/library/#1.html}{\lstpython{#1}}}
@ -25,9 +25,13 @@
\usepackage{import}
\usepackage{wrapfig}
\usepackage{float}
\usepackage{tikz}
\usepackage[bottom]{footmisc} % footnotes are below floats
\usepackage[final]{microtype}

\usetikzlibrary{positioning}
\usetikzlibrary{fit}

\emergencystretch=1em

% Local sty files