diff --git a/manuscrit/50_CesASMe/00_intro.tex b/manuscrit/50_CesASMe/00_intro.tex index 8e70c60..02afaf0 100644 --- a/manuscrit/50_CesASMe/00_intro.tex +++ b/manuscrit/50_CesASMe/00_intro.tex @@ -1,20 +1,3 @@ -\begin{abstract} - A variety of code analyzers, such as \iaca, \uica, \llvmmca{} or - \ithemal{}, strive to statically predict the throughput of a computation - kernel. Each analyzer is based on its own simplified CPU model - reasoning at the scale of an isolated basic block. - Facing this diversity, evaluating their strengths and - weaknesses is important to guide both their usage and their enhancement. - - We argue that reasoning at the scale of a single basic block is not - always sufficient and that a lack of context can mislead analyses. We present \tool, a fully-tooled - solution to evaluate code analyzers on C-level benchmarks. It is composed of a - benchmark derivation procedure that feeds an evaluation harness. We use it to - evaluate state-of-the-art code analyzers and to provide insights on their - precision. We use \tool's results to show that memory-carried data - dependencies are a major source of imprecision for these tools. -\end{abstract} - \section{Introduction}\label{sec:intro} At a time when software is expected to perform more computations, faster and in @@ -23,14 +6,14 @@ in particular the CPU resources) they consume are very useful to guide their optimization. This need is reflected in the diversity of binary or assembly code analyzers following the deprecation of \iaca~\cite{iaca}, which Intel has maintained through 2019. Whether it is \llvmmca{}~\cite{llvm-mca}, -\uica{}~\cite{uica}, \ithemal~\cite{ithemal} or \gus~\cite{phd:gruber}, all these tools strive to extract various -performance metrics, including the number of CPU cycles a computation kernel will take ----~which roughly translates to execution time. -In addition to raw measurements (relying on hardware counters), these model-based analyses provide -higher-level and refined data, to expose the bottlenecks and guide the -optimization of a given code. This feedback is useful to experts optimizing -computation kernels, including scientific simulations and deep-learning -kernels. +\uica{}~\cite{uica}, \ithemal~\cite{ithemal} or \gus~\cite{phd:gruber}, all +these tools strive to extract various performance metrics, including the number +of CPU cycles a computation kernel will take ---~which roughly translates to +execution time. In addition to raw measurements (relying on hardware +counters), these model-based analyses provide higher-level and refined data, to +expose the bottlenecks and guide the optimization of a given code. This +feedback is useful to experts optimizing computation kernels, including +scientific simulations and deep-learning kernels. An exact throughput prediction would require a cycle-accurate simulator of the processor, based on microarchitectural data that is most often not publicly @@ -39,6 +22,7 @@ solve in their own way the challenge of modeling complex CPUs while remaining simple enough to yield a prediction in a reasonable time, ending up with different models. For instance, on the following x86-64 basic block computing a general matrix multiplication, + \begin{minipage}{0.95\linewidth} \begin{lstlisting}[language={[x86masm]Assembler}] movsd (%rcx, %rax), %xmm0 @@ -112,30 +96,27 @@ generally be known. More importantly, the compiler may apply any number of transformations: unrolling, for instance, changes this number. Control flow may also be complicated by code versioning. 
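To make the point about occurrence counts concrete, the following hand-written C sketch (illustrative only; the function names and the 4-way unrolling factor are ours, not taken from Polybench or from the generated benchmarks) shows how a single compiler transformation changes the number of times the hot basic block executes, and therefore the factor by which any block-level cycle prediction would have to be scaled:

\begin{lstlisting}[language=C]
/* Illustrative sketch: same computation, two codegen outcomes. */
double dot(const double *a, const double *b, long n) {
    double acc = 0.;
    for (long i = 0; i < n; i++)      /* hot block executes n times */
        acc += a[i] * b[i];
    return acc;
}

/* After 4-way unrolling, the (larger) hot block executes only n/4
 * times; a block-level prediction must now be multiplied by n/4
 * instead of n to estimate the whole loop's execution time. */
double dot_unrolled(const double *a, const double *b, long n) {
    double a0 = 0., a1 = 0., a2 = 0., a3 = 0.;
    long i;
    for (i = 0; i + 3 < n; i += 4) {
        a0 += a[i]     * b[i];
        a1 += a[i + 1] * b[i + 1];
        a2 += a[i + 2] * b[i + 2];
        a3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; i++)                /* remainder (epilogue) loop */
        a0 += a[i] * b[i];
    return a0 + a1 + a2 + a3;
}
\end{lstlisting}

The basic block alone does not reveal how many times it runs; that information only exists at the loop-nest (or C) level, which is precisely the context a block-scale analyzer lacks.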
-%In the general case, instrumenting the generated code to obtain the number of -%occurrences of the basic block yields accurate results. - \bigskip In this article, we present a fully-tooled solution to evaluate and compare the -diversity of static throughput predictors. Our tool, \tool, solves two main +diversity of static throughput predictors. Our tool, \cesasme, solves two main issues in this direction. In Section~\ref{sec:bench_gen}, we describe how -\tool{} generates a wide variety of computation kernels stressing different +\cesasme{} generates a wide variety of computation kernels stressing different parameters of the architecture, and thus of the predictors' models, while staying close to representative workloads. To achieve this, we use -Polybench~\cite{polybench}, a C-level benchmark suite representative of +Polybench~\cite{bench:polybench}, a C-level benchmark suite representative of scientific computation workloads, that we combine with a variety of optimisations, including polyhedral loop transformations. -In Section~\ref{sec:bench_harness}, we describe how \tool{} is able to +In Section~\ref{sec:bench_harness}, we describe how \cesasme{} is able to evaluate throughput predictors on this set of benchmarks by lifting their predictions to a total number of cycles that can be compared to a hardware counters-based measure. A -high-level view of \tool{} is shown in Figure~\ref{fig:contrib}. +high-level view of \cesasme{} is shown in Figure~\ref{fig:contrib}. In Section~\ref{sec:exp_setup}, we detail our experimental setup and assess our methodology. In Section~\ref{sec:results_analysis}, we compare the predictors' results and -analyze the results of \tool{}. - In addition to statistical studies, we use \tool's results +analyze the results of \cesasme{}. + In addition to statistical studies, we use \cesasme's results to investigate analyzers' flaws. We show that code analyzers do not always correctly model data dependencies through memory accesses, substantially impacting their precision. diff --git a/manuscrit/50_CesASMe/10_bench_gen.tex b/manuscrit/50_CesASMe/10_bench_gen.tex index c2e0c23..258c39f 100644 --- a/manuscrit/50_CesASMe/10_bench_gen.tex +++ b/manuscrit/50_CesASMe/10_bench_gen.tex @@ -39,7 +39,7 @@ directly (no indirections) and whose loops are affine. These constraints are necessary to ensure that the microkernelification phase, presented below, generates segfault-free code. -In this case, we use Polybench~\cite{polybench}, a suite of 30 +In this case, we use Polybench~\cite{bench:polybench}, a suite of 30 benchmarks for polyhedral compilation ---~of which we use only 26. The \texttt{nussinov}, \texttt{ludcmp} and \texttt{deriche} benchmarks are removed because they are incompatible with PoCC (introduced below). The @@ -58,7 +58,9 @@ resources of the target architecture, and by extension the models on which the static analyzers are based. In this case, we chose to use the -\textsc{Pluto}~\cite{pluto} and PoCC~\cite{pocc} polyhedral compilers, to easily access common loop nest optimizations~: register tiling, tiling, +\textsc{Pluto}~\cite{tool:pluto} and PoCC~\cite{tool:pocc} polyhedral +compilers, to easily access common loop nest optimizations~: register tiling, +tiling, skewing, vectorization/simdization, loop unrolling, loop permutation, loop fusion. 
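As a hand-written illustration of the transformations just listed (the code below is ours and only sketches the shape of the output; the code actually emitted by Pluto or PoCC differs, notably in bound handling and prevectorization), loop tiling rewrites an affine loop nest to iterate over cache-sized blocks, yielding a semantically equivalent variant that stresses the memory hierarchy, and hence the analyzers' models, differently:

\begin{lstlisting}[language=C]
#define N  1024
#define TS 32   /* tile size: an arbitrary illustrative value */

/* Original affine loop nest. */
void transpose(double dst[N][N], const double src[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            dst[i][j] = src[j][i];
}

/* Tiled variant: same computation, performed tile by tile so that
 * each tile's working set stays cache-resident (N is a multiple of
 * TS here, so no boundary min()/max() handling is needed). */
void transpose_tiled(double dst[N][N], const double src[N][N]) {
    for (int ii = 0; ii < N; ii += TS)
        for (int jj = 0; jj < N; jj += TS)
            for (int i = ii; i < ii + TS; i++)
                for (int j = jj; j < jj + TS; j++)
                    dst[i][j] = src[j][i];
}
\end{lstlisting}

Register tiling, skewing, unrolling, permutation and fusion likewise preserve semantics while changing which hardware resource the resulting kernel stresses.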
These transformations are meant to maximize variety within the initial diff --git a/manuscrit/50_CesASMe/15_harness.tex b/manuscrit/50_CesASMe/15_harness.tex index da55ae7..9e6c9ce 100644 --- a/manuscrit/50_CesASMe/15_harness.tex +++ b/manuscrit/50_CesASMe/15_harness.tex @@ -82,6 +82,6 @@ approach, as most throughput prediction tools work a basic block-level, and are thus readily available and can be directly plugged into our harness. Finally, we control the proportion of cache misses in the program's execution -using \texttt{Cachegrind}~\cite{valgrind} and \gus; programs that have more +using \texttt{Cachegrind}~\cite{tool:valgrind} and \gus; programs that have more than 15\,\% of cache misses on a warm cache are not considered L1-resident and are discarded. diff --git a/manuscrit/50_CesASMe/20_evaluation.tex b/manuscrit/50_CesASMe/20_evaluation.tex index 755fe76..7b5fbae 100644 --- a/manuscrit/50_CesASMe/20_evaluation.tex +++ b/manuscrit/50_CesASMe/20_evaluation.tex @@ -64,13 +64,12 @@ consequently, lifted predictions can reasonably be compared to one another. \begin{figure} \centering - \includegraphics[width=\linewidth]{figs/results_comparability_hist.pdf} + \includegraphics[width=0.7\linewidth]{results_comparability_hist.pdf} \caption{Relative error distribution \wrt{} \perf}\label{fig:exp_comparability} \end{figure} \begin{table} \centering - \caption{Relative error statistics \wrt{} \perf}\label{table:exp_comparability} \begin{tabular}{l r r} \toprule & \textbf{Best block-based} & \textbf{BHive} \\ @@ -84,13 +83,12 @@ consequently, lifted predictions can reasonably be compared to one another. Q3 (\%) & 15.41 & 23.01 \\ \bottomrule \end{tabular} + \caption{Relative error statistics \wrt{} \perf}\label{table:exp_comparability} \end{table} -\begin{table*}[!htbp] +\begin{table}[!htbp] \centering - \caption{Bottleneck reports from the studied tools}\label{table:coverage} - \begin{tabular}{l | r r r | r r r | r r r} \toprule & \multicolumn{3}{c|}{\textbf{Frontend}} @@ -128,7 +126,8 @@ floyd-warshall & 74 & 16 & 29.7 \% & 16 & 24 & 68.8 \% & 20 & \textbf{Total} & 907 & 1360 & 35.2 \% & 509 & 687 & 65.8 \% & 310 & 728 & 70.3 \% \\ \bottomrule \end{tabular} -\end{table*} + \caption{Bottleneck reports from the studied tools}\label{table:coverage} +\end{table} \subsection{Relevance and representativity (bottleneck analysis)}\label{ssec:bottleneck_diversity} diff --git a/manuscrit/50_CesASMe/25_results_analysis.tex b/manuscrit/50_CesASMe/25_results_analysis.tex index 5a37fe2..7554fc9 100644 --- a/manuscrit/50_CesASMe/25_results_analysis.tex +++ b/manuscrit/50_CesASMe/25_results_analysis.tex @@ -11,28 +11,31 @@ understanding of which tool is more suited for each situation. \subsection{Throughput results}\label{ssec:overall_results} -\begin{table*} +\begin{table} \centering - \caption{Statistical analysis of overall results}\label{table:overall_analysis_stats} + \footnotesize \begin{tabular}{l r r r r r r r r r} \toprule -\textbf{Bencher} & \textbf{Datapoints} & \textbf{Failures} & \textbf{(\%)} & -\textbf{MAPE} & \textbf{Median} & \textbf{Q1} & \textbf{Q3} & \textbf{K. 
tau} & \textbf{Time (CPU$\cdot$h)}\\ + \textbf{Bencher} & \textbf{Datapoints} & + \multicolumn{2}{c}{\textbf{Failures}} & +\textbf{MAPE} & \textbf{Median} & \textbf{Q1} & \textbf{Q3} & \textbf{$K_\tau$} & \textbf{Time}\\ + & & (Count) & (\%) & (\%) & (\%) & (\%) & (\%) & & (CPU$\cdot$h) \\ \midrule -BHive & 2198 & 1302 & (37.20\,\%) & 27.95\,\% & 7.78\,\% & 3.01\,\% & 23.01\,\% & 0.81 & 1.37\\ -llvm-mca & 3500 & 0 & (0.00\,\%) & 36.71\,\% & 27.80\,\% & 12.92\,\% & 59.80\,\% & 0.57 & 0.96 \\ -UiCA & 3500 & 0 & (0.00\,\%) & 29.59\,\% & 18.26\,\% & 7.11\,\% & 52.99\,\% & 0.58 & 2.12 \\ -Ithemal & 3500 & 0 & (0.00\,\%) & 57.04\,\% & 48.70\,\% & 22.92\,\% & 75.69\,\% & 0.39 & 0.38 \\ -Iaca & 3500 & 0 & (0.00\,\%) & 30.23\,\% & 18.51\,\% & 7.13\,\% & 57.18\,\% & 0.59 & 1.31 \\ -Gus & 3500 & 0 & (0.00\,\%) & 20.37\,\% & 15.01\,\% & 7.82\,\% & 30.59\,\% & 0.82 & 188.04 \\ +BHive & 2198 & 1302 & (37.20) & 27.95 & 7.78 & 3.01 & 23.01 & 0.81 & 1.37\\ +llvm-mca & 3500 & 0 & (0.00) & 36.71 & 27.80 & 12.92 & 59.80 & 0.57 & 0.96 \\ +UiCA & 3500 & 0 & (0.00) & 29.59 & 18.26 & 7.11 & 52.99 & 0.58 & 2.12 \\ +Ithemal & 3500 & 0 & (0.00) & 57.04 & 48.70 & 22.92 & 75.69 & 0.39 & 0.38 \\ +Iaca & 3500 & 0 & (0.00) & 30.23 & 18.51 & 7.13 & 57.18 & 0.59 & 1.31 \\ +Gus & 3500 & 0 & (0.00) & 20.37 & 15.01 & 7.82 & 30.59 & 0.82 & 188.04 \\ \bottomrule \end{tabular} -\end{table*} + \caption{Statistical analysis of overall results}\label{table:overall_analysis_stats} +\end{table} The error distribution of the relative errors, for each tool, is presented as a box plot in Figure~\ref{fig:overall_analysis_boxplot}. Statistical indicators are also given in Table~\ref{table:overall_analysis_stats}. We also give, for -each tool, its Kendall's tau indicator~\cite{kendall1938tau}: this indicator, +each tool, its Kendall's tau indicator~\cite{kendalltau}: this indicator, used to evaluate \eg{} uiCA~\cite{uica} and Palmed~\cite{palmed}, measures how well the pair-wise ordering of benchmarks is preserved, $-1$ being a full anti-correlation and $1$ a full correlation. This is especially useful when one @@ -40,7 +43,8 @@ is not interested in a program's absolute throughput, but rather in comparing which program has a better throughput. \begin{figure} - \includegraphics[width=\linewidth]{figs/overall_analysis_boxplot.pdf} + \centering + \includegraphics[width=0.5\linewidth]{overall_analysis_boxplot.pdf} \caption{Statistical distribution of relative errors}\label{fig:overall_analysis_boxplot} \end{figure} @@ -185,7 +189,6 @@ frontend bottlenecks, thus making it easier for them to agree. \begin{table} \centering - \caption{Diverging bottleneck prediction per tool}\label{table:bottleneck_diverging_pred} \begin{tabular}{l r r r r} \toprule \textbf{Tool} @@ -197,6 +200,7 @@ frontend bottlenecks, thus making it easier for them to agree. \iaca{} & 1221 & (53.0 \%) & 900 & (36.6 \%) \\ \bottomrule \end{tabular} + \caption{Diverging bottleneck prediction per tool}\label{table:bottleneck_diverging_pred} \end{table} The Table~\ref{table:bottleneck_diverging_pred}, in turn, breaks down the cases @@ -216,14 +220,13 @@ tool for each kind of bottleneck. 
\subsection{Impact of dependency-boundness}\label{ssec:memlatbound} -\begin{table*} +\begin{table} \centering - \caption{Statistical analysis of overall results, without latency bound - through memory-carried dependencies rows}\label{table:nomemdeps_stats} + \footnotesize \begin{tabular}{l r r r r r r r r r} \toprule \textbf{Bencher} & \textbf{Datapoints} & \textbf{Failures} & \textbf{(\%)} & -\textbf{MAPE} & \textbf{Median} & \textbf{Q1} & \textbf{Q3} & \textbf{K. tau}\\ +\textbf{MAPE} & \textbf{Median} & \textbf{Q1} & \textbf{Q3} & \textbf{$K_\tau$}\\ \midrule BHive & 1365 & 1023 & (42.84\,\%) & 34.07\,\% & 8.62\,\% & 4.30\,\% & 24.25\,\% & 0.76\\ llvm-mca & 2388 & 0 & (0.00\,\%) & 27.06\,\% & 21.04\,\% & 9.69\,\% & 32.73\,\% & 0.79\\ @@ -233,7 +236,9 @@ Iaca & 2388 & 0 & (0.00\,\%) & 17.55\,\% & 12.17\,\% & 4.64\,\% & 22.35\,\% & 0. Gus & 2388 & 0 & (0.00\,\%) & 23.18\,\% & 20.23\,\% & 8.78\,\% & 32.73\,\% & 0.83\\ \bottomrule \end{tabular} -\end{table*} + \caption{Statistical analysis of overall results, without latency bound + through memory-carried dependencies rows}\label{table:nomemdeps_stats} +\end{table} An overview of the full results table (available in our artifact) hints towards two main tendencies: on a significant number of rows, the static tools @@ -256,7 +261,8 @@ against 129 (14.8\,\%) for \texttt{default} and 61 (7.0\,\%) for investigate the issue. \begin{figure} - \includegraphics[width=\linewidth]{figs/nomemdeps_boxplot.pdf} + \centering + \includegraphics[width=0.5\linewidth]{nomemdeps_boxplot.pdf} \caption{Statistical distribution of relative errors, with and without pruning latency bound through memory-carried dependencies rows}\label{fig:nomemdeps_boxplot} \end{figure} diff --git a/manuscrit/50_CesASMe/main.tex b/manuscrit/50_CesASMe/main.tex index da40a41..bf41cdb 100644 --- a/manuscrit/50_CesASMe/main.tex +++ b/manuscrit/50_CesASMe/main.tex @@ -3,6 +3,7 @@ \input{00_intro.tex} \input{05_related_works.tex} \input{10_bench_gen.tex} +\input{15_harness.tex} \input{20_evaluation.tex} \input{25_results_analysis.tex} \input{30_future_works.tex} diff --git a/manuscrit/50_CesASMe/overview.tex b/manuscrit/50_CesASMe/overview.tex index 55285f6..c049262 100644 --- a/manuscrit/50_CesASMe/overview.tex +++ b/manuscrit/50_CesASMe/overview.tex @@ -29,7 +29,7 @@ \node[rednode] (ppapi) [left=1cm of bhive] {perf (measure)}; \node[rednode] (gus) [below=0.5em of ppapi] {Gus}; %% \node[rednode] (uica) [below=of gdb] {uiCA}; - \node[rednode] (lifting) [right=of bhive] { + \node[rednode] (lifting) [below right=1em and 0.2cm of gdb] { Prediction lifting\\\figref{ssec:harness_lifting}}; \node[ draw=black, @@ -47,15 +47,15 @@ label={[above,xshift=1cm]\footnotesize Variations}, fit=(pocc) (kernel) (gcc) ] (vars) {}; -\node[resultnode] (bench2) [below=of lifting] {Evaluation metrics \\ for +\node[resultnode] (bench2) [right=of lifting] {Evaluation metrics \\ for code analyzers}; % Key \node[] (keyblue1) [below left=0.7cm and 0cm of vars] {}; \node[hiddennode] (keyblue2) [right=0.5cm of keyblue1] {Section~\ref{sec:bench_gen}~: generating microbenchmarks}; - \node[] (keyred1) [right=0.6cm of keyblue2] {}; + \node[] (keyred1) [below=.5em of keyblue1] {}; \node[hiddennode] (keyred2) [right=0.5cm of keyred1] {Section~\ref{sec:bench_harness}~: benchmarking harness}; - \node[] (keyresult1) [right=0.6cm of keyred2] {}; + \node[] (keyresult1) [below=.5em of keyred1] {}; \node[hiddennode] (keyresult2) [right=0.5cm of keyresult1] {Section~\ref{sec:results_analysis}~: results analysis}; @@ -74,8 +74,8 
@@ \draw[->, very thick, harnarrow] (gdb.east) -- (ithemal.west); \draw[->, very thick, harnarrow] (gdb.east) -- (bhive.west); \draw[->, very thick, harnarrow] (gdb.east) -- (llvm.west); - \draw[->, very thick, harnarrow] (comps.east|-lifting) -- (lifting.west); - \draw[->, very thick] (lifting.south) -- (bench2.north); + \draw[->, very thick, harnarrow] (comps.south-|lifting) -- (lifting.north); + \draw[->, very thick] (lifting.east) -- (bench2.west); \end{tikzpicture} } \caption{Our analysis and measurement environment.\label{fig:contrib}} diff --git a/manuscrit/include/macros.tex b/manuscrit/include/macros.tex index 91854cc..0c2f98f 100644 --- a/manuscrit/include/macros.tex +++ b/manuscrit/include/macros.tex @@ -2,6 +2,8 @@ \newcommand{\uops}{\uop{}s} \newcommand{\eg}{\textit{eg.}} +\newcommand{\ie}{\textit{ie.}} +\newcommand{\wrt}{\textit{wrt.}} \newcommand{\kerK}{\mathcal{K}} \newcommand{\calR}{\mathcal{R}} @@ -36,6 +38,20 @@ \newcommand{\pipedream}{\texttt{Pipedream}} \newcommand{\palmed}{\texttt{Palmed}} \newcommand{\pmevo}{\texttt{PMEvo}} +\newcommand{\gus}{\texttt{Gus}} +\newcommand{\ithemal}{\texttt{Ithemal}} +\newcommand{\osaca}{\texttt{Osaca}} +\newcommand{\bhive}{\texttt{BHive}} +\newcommand{\anica}{\texttt{AnICA}} +\newcommand{\cesasme}{\texttt{CesASMe}} + +\newcommand{\gdb}{\texttt{gdb}} + +\newcommand{\coeq}{CO$_{2}$eq} + +\newcommand{\figref}[1]{[\ref{#1}]} + +\newcommand{\reg}[1]{\texttt{\%#1}} % Hyperlinks \newcommand{\pymodule}[1]{\href{https://docs.python.org/3/library/#1.html}{\lstpython{#1}}} diff --git a/manuscrit/include/packages.tex b/manuscrit/include/packages.tex index aeeddda..bcce0f1 100644 --- a/manuscrit/include/packages.tex +++ b/manuscrit/include/packages.tex @@ -25,9 +25,13 @@ \usepackage{import} \usepackage{wrapfig} \usepackage{float} +\usepackage{tikz} \usepackage[bottom]{footmisc} % footnotes are below floats \usepackage[final]{microtype} +\usetikzlibrary{positioning} +\usetikzlibrary{fit} + \emergencystretch=1em % Local sty files
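Finally, as a minimal hand-written illustration of the memory-carried dependencies discussed in Section~\ref{ssec:memlatbound} (the kernels below are ours and are not part of the benchmark set), consider two loops whose bodies compile to nearly identical basic blocks:

\begin{lstlisting}[language=C]
/* Iterations are independent: the loop is throughput-bound and
 * block-level analyzers model it well. */
void scale_indep(double *a, long n) {
    for (long i = 0; i < n; i++)
        a[i] = 0.5 * a[i] + 1.0;
}

/* Each iteration loads the value stored by the previous one: the
 * loop is latency-bound through a store-to-load (memory-carried)
 * dependency, which a purely block-level model may miss. */
void scale_carried(double *a, long n) {
    for (long i = 1; i < n; i++)
        a[i] = 0.5 * a[i - 1] + 1.0;
}
\end{lstlisting}

A predictor that ignores the store-to-load chain will typically report similar throughputs for both, while the second loop in fact runs at the latency of its dependency chain; rows of this kind are the ones pruned in Table~\ref{table:nomemdeps_stats}.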