diff --git a/manuscrit/50_CesASMe/00_intro.tex b/manuscrit/50_CesASMe/00_intro.tex new file mode 100644 index 0000000..8e70c60 --- /dev/null +++ b/manuscrit/50_CesASMe/00_intro.tex @@ -0,0 +1,141 @@ +\begin{abstract} + A variety of code analyzers, such as \iaca, \uica, \llvmmca{} or + \ithemal{}, strive to statically predict the throughput of a computation + kernel. Each analyzer is based on its own simplified CPU model + reasoning at the scale of an isolated basic block. + Facing this diversity, evaluating their strengths and + weaknesses is important to guide both their usage and their enhancement. + + We argue that reasoning at the scale of a single basic block is not + always sufficient and that a lack of context can mislead analyses. We present \tool, a fully-tooled + solution to evaluate code analyzers on C-level benchmarks. It is composed of a + benchmark derivation procedure that feeds an evaluation harness. We use it to + evaluate state-of-the-art code analyzers and to provide insights on their + precision. We use \tool's results to show that memory-carried data + dependencies are a major source of imprecision for these tools. +\end{abstract} + +\section{Introduction}\label{sec:intro} + +At a time when software is expected to perform more computations, faster and in +more constrained environments, tools that statically predict the resources (and +in particular the CPU resources) they consume are very useful to guide their +optimization. This need is reflected in the diversity of binary or assembly +code analyzers following the deprecation of \iaca~\cite{iaca}, which Intel has +maintained through 2019. Whether it is \llvmmca{}~\cite{llvm-mca}, +\uica{}~\cite{uica}, \ithemal~\cite{ithemal} or \gus~\cite{phd:gruber}, all these tools strive to extract various +performance metrics, including the number of CPU cycles a computation kernel will take +---~which roughly translates to execution time. +In addition to raw measurements (relying on hardware counters), these model-based analyses provide +higher-level and refined data, to expose the bottlenecks and guide the +optimization of a given code. This feedback is useful to experts optimizing +computation kernels, including scientific simulations and deep-learning +kernels. + +An exact throughput prediction would require a cycle-accurate simulator of the +processor, based on microarchitectural data that is most often not publicly +available, and would be prohibitively slow in any case. These tools thus each +solve in their own way the challenge of modeling complex CPUs while remaining +simple enough to yield a prediction in a reasonable time, ending up with +different models. For instance, on the following x86-64 basic block computing a +general matrix multiplication, +\begin{minipage}{0.95\linewidth} +\begin{lstlisting}[language={[x86masm]Assembler}] +movsd (%rcx, %rax), %xmm0 +mulsd %xmm1, %xmm0 +addsd (%rdx, %rax), %xmm0 +movsd %xmm0, (%rdx, %rax) +addq $8, %rax +cmpq $0x2260, %rax +jne 0x16e0 +\end{lstlisting} +\end{minipage} + +\noindent\llvmmca{} predicts 1.5 cycles, \iaca{} and \ithemal{} predict 2 cycles, while \uica{} +predicts 3 cycles. One may wonder which tool is correct. + + +The obvious solution to assess their predictions is to compare them to an +actual measure. However, as these tools reason at the basic block level, this +is not as trivially defined as it would seem. 
Take for instance the following +kernel: + +\begin{minipage}{0.90\linewidth} +\begin{lstlisting}[language={[x86masm]Assembler}] +mov (%rax, %rcx, 1), %r10 +mov %r10, (%rbx, %rcx, 1) +add $8, %rcx +\end{lstlisting} +\end{minipage} + +\input{overview} + +\noindent{}At first, it looks like an array copy from location \reg{rax} to +\reg{rbx}. Yet, if before the loop, \reg{rbx} is initialized to +\reg{rax}\texttt{+8}, there is a read-after-write dependency between the first +instruction and the second instruction at the previous iteration, which makes +the throughput drop significantly. As we shall see in +Section~\ref{ssec:bhive_errors}, \emph{without proper context, a basic +block's throughput is not well-defined}. + +To recover the context of each basic block, we reason instead at the scale of +a C source code. This +makes the measurements unambiguous: one can use hardware counters to measure the +elapsed cycles during a loop nest. This requires a suite of benchmarks, in C, +that is both representative of the domain studied and wide enough to provide +good coverage of that domain. However, this is not in itself sufficient to +evaluate static tools: on the preceding matrix multiplication kernel, counters +report 80,059 elapsed cycles ---~for the total loop. +This number can hardly be compared to the \llvmmca{}, \iaca{}, \ithemal{}, and \uica{} +basic block-level predictions seen above. + +A common practice to make these numbers comparable is to renormalize them to +instructions per cycle (IPC). Here, \llvmmca{} reports an IPC of +$\frac{7}{1.5}~=~4.67$, \iaca{} and \ithemal{} report an IPC of +$\frac{7}{2}~=~3.5$, and \uica{} reports an IPC of $\frac{7}{3}~=~2.3$. In this +case, the measured IPC is 3.45, which is closest to \iaca{} and \ithemal. Yet, +IPC is a metric for microarchitectural load, and \textit{tells nothing about a +kernel's efficiency}. Indeed, the static number of instructions is affected by +many compiler passes, such as scalar evolution, strength reduction, register +allocation, instruction selection\ldots{} Thus, when comparing two compiled +versions of the same code, IPC alone does not necessarily point to the most +efficient version. For instance, a kernel using SIMD instructions will use +fewer instructions than one using only scalars, and thus exhibit a lower or +constant IPC; yet, its performance will unquestionably increase. + +The total cycles elapsed to solve a given problem, on the other +hand, is a sound metric of the efficiency of an implementation. We thus +instead \emph{lift} the predictions at basic-block level to a total number of +cycles. In simple cases, this simply means multiplying the block-level +prediction by the number of loop iterations; however, this bound is not +known in the general case. More importantly, the compiler may apply any number of +transformations: unrolling, for instance, changes this number. Control flow may +also be complicated by code versioning. + +%In the general case, instrumenting the generated code to obtain the number of +%occurrences of the basic block yields accurate results. + +\bigskip + +In this article, we present a fully-tooled solution to evaluate and compare the +diversity of static throughput predictors. Our tool, \tool, solves two main +issues in this direction. In Section~\ref{sec:bench_gen}, we describe how +\tool{} generates a wide variety of computation kernels stressing different +parameters of the architecture, and thus of the predictors' models, while +staying close to representative workloads.
To achieve this, we use +Polybench~\cite{polybench}, a C-level benchmark suite representative of +scientific computation workloads, that we combine with a variety of +optimisations, including polyhedral loop transformations. +In Section~\ref{sec:bench_harness}, we describe how \tool{} is able to +evaluate throughput predictors on this set of benchmarks by lifting their +predictions to a total number of cycles that can be compared to a hardware +counters-based measure. A +high-level view of \tool{} is shown in Figure~\ref{fig:contrib}. + +In Section~\ref{sec:exp_setup}, we detail our experimental setup and assess our +methodology. In Section~\ref{sec:results_analysis}, we compare the predictors' results and +analyze the results of \tool{}. + In addition to statistical studies, we use \tool's results +to investigate analyzers' flaws. We show that code +analyzers do not always correctly model data dependencies through memory +accesses, substantially impacting their precision. diff --git a/manuscrit/50_CesASMe/05_related_works.tex b/manuscrit/50_CesASMe/05_related_works.tex new file mode 100644 index 0000000..2abed1b --- /dev/null +++ b/manuscrit/50_CesASMe/05_related_works.tex @@ -0,0 +1,56 @@ +\section{Related works} + +The static throughput analyzers studied rely on a variety of models. +\iaca~\cite{iaca}, developed by Intel and now deprecated, is closed-source and +relies on Intel's expertise on their own processors. +The LLVM compiling ecosystem, to guide optimization passes, maintains models of many +architectures. These models are used in the LLVM Machine Code Analyzer, +\llvmmca~\cite{llvm-mca}, to statically evaluate the performance of a segment +of assembly. +Independently, Abel and Reineke used an automated microbenchmark generation +approach to generate port mappings of many architectures in +\texttt{uops.info}~\cite{nanobench, uopsinfo} to model processors' backends. +This work was continued with \uica~\cite{uica}, extending this model with an +extensive frontend description. +Following a completely different approach, \ithemal~\cite{ithemal} uses a deep +neural network to predict basic blocks throughput. To obtain enough data to +train its model, the authors also developed \bhive~\cite{bhive}, a profiling +tool working on basic blocks. + +Another static tool, \osaca~\cite{osaca2}, provides lower- and +upper-bounds to the execution time of a basic block. As this kind of +information cannot be fairly compared with tools yielding an exact throughput +prediction, we exclude it from our scope. + +All these tools statically predict the number of cycles taken by a piece of +assembly or binary that is assumed to be the body of an infinite ---~or +sufficiently large~--- loop in steady state, all its data being L1-resident. As +discussed by \uica's authors~\cite{uica}, hypotheses can subtly vary between +analyzers; \eg{} by assuming that the loop is either unrolled or has control +instructions, with non-negligible impact. Some of the tools, \eg{} \ithemal, +necessarily work on a single basic block, while some others, \eg{} \iaca, work +on a section of code delimited by markers. However, even in the second case, +the code is assumed to be \emph{straight-line code}: branch instructions, if +any, are assumed not taken. + +\smallskip + +Throughput prediction tools, however, are not all static. +\gus~\cite{phd:gruber} dynamically predicts the throughput of a whole program +region, instrumenting it to retrieve the exact events occurring through its +execution. 
This way, \gus{} can more finely detect bottlenecks by +sensitivity analysis, at the cost of a significantly longer run time. + +\smallskip + +The \bhive{} profiler~\cite{bhive} takes another approach to basic block +throughput measurement: by mapping memory at any address accessed by a basic +block, it can effectively run and measure arbitrary code without context, often +---~but not always, as we discuss later~--- yielding good results. + +\smallskip + +The \anica{} framework~\cite{anica} also attempts to evaluate throughput +predictors by finding examples on which they are inaccurate. \anica{} starts +with randomly generated assembly snippets, and refines them through a process +derived from abstract interpretation to reach general categories of problems. diff --git a/manuscrit/50_CesASMe/10_bench_gen.tex b/manuscrit/50_CesASMe/10_bench_gen.tex new file mode 100644 index 0000000..c2e0c23 --- /dev/null +++ b/manuscrit/50_CesASMe/10_bench_gen.tex @@ -0,0 +1,109 @@ +\section{Generating microbenchmarks}\label{sec:bench_gen} + +Our framework aims to generate \emph{microbenchmarks} relevant to a specific +domain. +A microbenchmark is a code that is as simplified as possible to expose the +behaviour under consideration. +The specified computations should be representative of the considered domain, +and at the same time they should stress the different aspects of the +target architecture ---~which is modeled by code analyzers. + +In practice, a microbenchmark's \textit{computational kernel} is a simple +\texttt{for} loop, whose +body contains no loops and whose bounds are statically known. +A \emph{measure} is a number of repetitions $n$ of this computational +kernel, $n$ being an user-specified parameter. +The measure may be repeated an arbitrary number of times to improve +stability. + +Furthermore, such a microbenchmark should be a function whose computation +happens without leaving the L1 cache. +This requirement helps measurements and analyses to be +undisturbed by memory accesses, but it is also a matter of comparability. +Indeed, most of the static analyzers make the assumption that the code under +consideration is L1-resident; if it is not, their results are meaningless, and +can not be compared with an actual measurement. + +The generation of such microbenchmarks is achieved through four distinct +components, whose parameter variations are specified in configuration files~: +a benchmark suite, C-to-C loop nest optimizers, a constraining utility +and a C-to-binary compiler. + +\subsection{Benchmark suite}\label{ssec:bench_suite} +Our first component is an initial set of benchmarks which materializes +the human expertise we intend to exploit for the generation of relevant codes. +The considered suite must embed computation kernels +delimited by ad-hoc \texttt{\#pragma}s, +whose arrays are accessed +directly (no indirections) and whose loops are affine. +These constraints are necessary to ensure that the microkernelification phase, +presented below, generates segfault-free code. + +In this case, we use Polybench~\cite{polybench}, a suite of 30 +benchmarks for polyhedral compilation ---~of which we use only 26. The +\texttt{nussinov}, \texttt{ludcmp} and \texttt{deriche} benchmarks are +removed because they are incompatible with PoCC (introduced below). The +\texttt{lu} benchmark is left out as its execution alone takes longer than all +others together, making its dynamic analysis (\eg{} with \gus) impractical. 
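For illustration, the snippet below is a simplified, hypothetical sketch (not an actual Polybench source) of the shape of kernel our pipeline expects: an affine loop nest over directly indexed, statically sized arrays, delimited by the ad-hoc pragmas mentioned above (the \texttt{scop}/\texttt{endscop} markers used by Polybench and PoCC).

\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[ANSI]C}]
/* Hypothetical sketch of a Polybench-style kernel: affine loops,
 * direct array accesses, statically known bounds, and ad-hoc
 * pragmas delimiting the computational kernel. */
#define N 1024
void kernel_example(double A[N][N], double x[N], double y[N])
{
  int i, j;
#pragma scop
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      y[i] += A[i][j] * x[j];
#pragma endscop
}
\end{lstlisting}
\end{minipage}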
+In addition to the importance of linear algebra within +this suite, one of its key features is that it does not include computational +kernels with conditional control flow (\eg{} \texttt{if-then-else}) +---~however, it does include conditional data flow, using the ternary +conditional operator of C. + +\subsection{C-to-C loop nest optimizers}\label{ssec:loop_nest_optimizer} +Loop nest optimizers transform an initial benchmark in different ways (generating different +\textit{versions} of the same benchmark), varying the stress on +resources of the target architecture, and by extension the models on which the +static analyzers are based. + +In this case, we chose to use the +\textsc{Pluto}~\cite{pluto} and PoCC~\cite{pocc} polyhedral compilers, to easily access common loop nest optimizations: register tiling, tiling, +skewing, vectorization/simdization, loop unrolling, loop permutation, +loop fusion. +These transformations are meant to maximize variety within the initial +benchmark suite. Ultimately, the generated benchmarks are expected to +highlight the performance impact of the resulting behaviours. +For instance, \textit{skewing} introduces non-trivial pointer arithmetic, +increasing the pressure on address computation units; \textit{loop unrolling}, +among many things, opens the way to register promotion, which exposes dependencies +and relieves load-store units; +\textit{vectorization} stresses SIMD units and decreases +pressure on the front-end; and so on. + +\subsection{Constraining utility}\label{ssec:kernelify} + +A constraining utility transforms the code in order to respect an arbitrary number of non-functional +properties. +In this case, we apply a pass of \emph{microkernelification}: we +extract a computational kernel from the arbitrarily deep and arbitrarily +long loop nest generated by the previous component. +The loop chosen to form the microkernel is the one considered to be +the \textit{hottest}; the \textit{hotness} of a loop is obtained by +multiplying the number of arithmetic operations it contains by the number of +times it is iterated. This metric allows us to prioritize the parts of the +code that have the greatest impact on performance. A sketch of this selection +heuristic is given at the end of this section. + +At this point, the resulting code can +compute a different result from the initial code; +for instance, the composition of tiling and +kernelification reduces the number of loop iterations. +Indeed, our framework is not meant to preserve the +functional semantics of the benchmarks. +Our goal is only to generate codes that are relevant from the point of view of +performance analysis. + +\subsection{C-to-binary compiler}\label{ssec:compile} + +A C-to-binary compiler varies binary optimization options by +enabling/disabling auto-vectorization, extended instruction +sets, \textit{etc}. We use \texttt{gcc}. + +\bigskip + +Ultimately, the relevance of the set of microbenchmarks generated using this approach +derives not only from the initial benchmark suite and the relevance of the +transformations chosen at each +stage, but also from the combinatorial explosion generated by the composition +of the four stages. In our experimental setup, this yields up to 144 +microbenchmarks per benchmark of the original suite.
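Before moving on to the benchmarking harness, we give the sketch announced in Section~\ref{ssec:kernelify}. The code below is illustrative only (the names and data structures are ours, not those of the actual implementation), but it captures the \textit{hotness}-based loop selection.

\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[ANSI]C}]
/* Illustrative sketch (not the actual implementation): the hotness of
 * a loop is the number of arithmetic operations in its body multiplied
 * by its trip count; the hottest loop becomes the microkernel. */
struct loop_info {
    long arith_ops;   /* arithmetic operations in the loop body */
    long trip_count;  /* number of iterations of the loop       */
};

static long hotness(const struct loop_info *l)
{
    return l->arith_ops * l->trip_count;
}

static int hottest_loop(const struct loop_info *loops, int n)
{
    int best = 0;
    for (int i = 1; i < n; i++)
        if (hotness(&loops[i]) > hotness(&loops[best]))
            best = i;
    return best;
}
\end{lstlisting}
\end{minipage}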
diff --git a/manuscrit/50_CesASMe/15_harness.tex b/manuscrit/50_CesASMe/15_harness.tex new file mode 100644 index 0000000..da55ae7 --- /dev/null +++ b/manuscrit/50_CesASMe/15_harness.tex @@ -0,0 +1,87 @@ +\section{Benchmarking harness}\label{sec:bench_harness} + +To compare full-kernel cycle measurements to throughput predictions on +individual basic blocks, we lift predictions by summing the block-level +predictions, weighted by their occurrences: + +\[ +\text{lifted\_pred}(\mathcal{K}) = + \sum_{b \in \operatorname{BBs}(\mathcal{K})} + \operatorname{occurrences}(b) \times \operatorname{pred}(b) +\] + +Our benchmarking harness works in three successive stages. It first +extracts the basic blocks constituting a computation kernel, and instruments it +to retrieve their respective occurrences in the original context. It then runs +all the studied tools on each basic block, while also running measurements on the +whole computation kernel. Finally, the block-level results are lifted to +kernel-level results thanks to the occurrences previously measured. + +\subsection{Basic block extraction}\label{ssec:bb_extr} + +Using the Capstone disassembler~\cite{tool:capstone}, we split the assembly +code at each control flow instruction (jump, call, return, \ldots) and each +jump target. + +To accurately obtain the occurrences of each basic block in the whole kernel's +computation, +we then instrument it with \texttt{gdb} by placing a +breakpoint at each basic block's first instruction in order to count the occurrences +of each basic block between two calls to the \perf{} counters\footnote{We +assume the program under analysis to be deterministic.}. While this +instrumentation takes about 50 to 100$\times$ more time than a regular run, it +can safely be run in parallel, as the performance results are discarded. + +\subsection{Throughput predictions and measures}\label{ssec:throughput_pred_meas} + +The harness leverages a variety of tools: actual CPU measurement; the \bhive{} +basic block profiler~\cite{bhive}; \llvmmca~\cite{llvm-mca}, \uica~\cite{uica} +and \iaca~\cite{iaca}, which rely on microarchitectural +models to predict a block's throughput; \ithemal~\cite{ithemal}, a machine +learning model; and \gus~\cite{phd:gruber}, a dynamic analyzer based on \qemu{} +that works at the whole binary level. + +The execution time of the full kernel is measured using Linux +\perf~\cite{tool:perf} CPU counters around the full computation kernel. The +measurement is repeated four times and the smallest value is kept; this ensures that the +cache is warm and compensates for context switching or other measurement +artifacts. \gus{} instruments the whole function body. The other tools included +all work at basic block level; these are run on each basic block of each +benchmark. + +We emphasize the importance, throughout the whole evaluation chain, of keeping the +exact same assembled binary. Indeed, recompiling the kernel from source +\emph{cannot} be assumed to produce the same assembly kernel. This is even more +important in the presence of slight changes: for instance, inserting \iaca{} +markers at the C level ---~as intended~--- around the kernel \emph{might} +change the compiled kernel, if only for alignment regions. We argue that, in +the case of \iaca{} markers, the problem is even more critical, as those +markers prevent a binary from being run by overwriting registers with arbitrary +values. This forces a user to run and measure a version which is different from +the analyzed one.
In our harness, we circumvent this issue by adding markers +directly at the assembly level, editing the already compiled version. Our +\texttt{gdb} instrumentation procedure also respects this principle of +single-compilation. As \qemu{} breaks the \perf{} interface, we have to run +\gus{} with a preloaded stub shared library to be able to instrument binaries +containing calls to \perf. + +\subsection{Prediction lifting and filtering}\label{ssec:harness_lifting} + +We finally lift single basic block predictions to a whole-kernel cycle +prediction by summing the block-level results, weighted by the occurrences of +the basic block in the original context (formula above). If an analyzer fails +on one of the basic blocks of a benchmark, the whole benchmark is discarded for +this analyzer. + +In the presence of complex control flow, \eg{} with conditionals inside loops, +our approach based on basic block occurrences is arguably less precise than an +approach based on paths occurrences, as we have less information available +---~for instance, whether a branch is taken with a regular pattern, whether we +have constraints on register values, etc. We however chose this block-based +approach, as most throughput prediction tools work a basic block-level, and are +thus readily available and can be directly plugged into our harness. + +Finally, we control the proportion of cache misses in the program's execution +using \texttt{Cachegrind}~\cite{valgrind} and \gus; programs that have more +than 15\,\% of cache misses on a warm cache are not considered L1-resident and +are discarded. diff --git a/manuscrit/50_CesASMe/20_evaluation.tex b/manuscrit/50_CesASMe/20_evaluation.tex new file mode 100644 index 0000000..755fe76 --- /dev/null +++ b/manuscrit/50_CesASMe/20_evaluation.tex @@ -0,0 +1,213 @@ +\section{Experimental setup and evaluation}\label{sec:exp_setup} + +Running the harness described above provides us with 3500 +benchmarks ---~after filtering out non-L1-resident +benchmarks~---, on which each throughput predictor is run. We make the full +output of our tool available in our artifact. Before analyzing these results in +Section~\ref{sec:results_analysis}, we evaluate the relevance of the +methodology presented in Section~\ref{sec:bench_harness} to make the tools' +predictions comparable to baseline hardware counter measures. + +\subsection{Experimental environment} + +The experiments presented in this paper were all realized on a Dell PowerEdge +C6420 machine from the Grid5000 cluster~\cite{grid5000}, equipped with 192\,GB +of DDR4 SDRAM ---~only a small fraction of which was used~--- and two Intel +Xeon Gold 6130 CPUs (x86-64, Skylake microarchitecture) with 16 cores each. + +The experiments themselves were run inside a Docker environment very close to +our artifact, based on Debian Bullseye. Care was taken to disable +hyperthreading to improve measurements stability. For tools whose output is +based on a direct measurement (\perf, \bhive), the benchmarks were run +sequentially on a single core with no experiments on the other cores. No such +care was taken for \gus{} as, although based on a dynamic run, its prediction +is purely function of recorded program events and not of program measures. All +other tools were run in parallel. + +We use \llvmmca{} \texttt{v13.0.1}, \iaca{} \texttt{v3.0-28-g1ba2cbb}, \bhive{} +at commit \texttt{5f1d500}, \uica{} at commit \texttt{9cbbe93}, \gus{} at +commit \texttt{87463c9}, \ithemal{} at commit \texttt{b3c39a8}. 
+ +\subsection{Comparability of the results} + +We define the relative error of a time prediction +$C_\text{pred}$ (in cycles) with respect to a baseline $C_\text{baseline}$ as +\[ + \operatorname{err} = \frac{\left| C_\text{pred} - C_\text{baseline} + \right|}{C_\text{baseline}} +\] + +We assess the comparability of the whole benchmark, measured with \perf{}, to +lifted block-based results by measuring the statistical distribution of the +relative error of two series: the predictions made by \bhive, and the series of +the best block-based prediction for each benchmark. + +We single out \bhive{} as it is the only tool able to \textit{measure} +---~instead of predicting~--- an isolated basic block's timing. This, however, is +not sufficient: as discussed later in Section~\ref{ssec:bhive_errors}, \bhive{} +is not able to yield a result for about $40\,\%$ of the benchmarks, and is +subject to large errors in some cases. For this purpose, we also consider, for +each benchmark, the best block-based prediction: we argue that if, for most +benchmarks, at least one of these predictors is able to yield a satisfyingly +accurate result, then the lifting methodology is sound in practice. + +The result of this analysis is presented in Table~\ref{table:exp_comparability} +and in Figure~\ref{fig:exp_comparability}. The results are in a range +compatible with common results of the field, as seen \eg{} in~\cite{uica} +reporting Mean Absolute Percentage Error (MAPE, corresponding to the +``Average'' row) of about 10-15\,\% in many cases. While lifted \bhive's +average error is driven high by large errors on certain benchmarks, +investigated later in this article, its median error is still comparable to the +errors of state-of-the-art tools. From this, we conclude that lifted cycle +measures and predictions are consistent with whole-benchmark measures; and +consequently, lifted predictions can reasonably be compared to one another. 
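For reference, the ``Average'' row of Table~\ref{table:exp_comparability} is the MAPE of each series, \ie{} the mean over the $n$ datapoints available for that series of the relative error defined above, expressed as a percentage:
\[
\operatorname{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \operatorname{err}_i
\]
where $\operatorname{err}_i$ is the relative error of the $i$-th datapoint with respect to the \perf{} baseline.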
+ +\begin{figure} + \centering + \includegraphics[width=\linewidth]{figs/results_comparability_hist.pdf} + \caption{Relative error distribution \wrt{} \perf}\label{fig:exp_comparability} +\end{figure} + +\begin{table} + \centering + \caption{Relative error statistics \wrt{} \perf}\label{table:exp_comparability} + \begin{tabular}{l r r} + \toprule + & \textbf{Best block-based} & \textbf{BHive} \\ + \midrule + Datapoints & 3500 & 2198 \\ + Errors & 0 & 1302 \\ + & (0\,\%) & (37.20\,\%) \\ + Average (\%) & 11.60 & 27.95 \\ + Median (\%) & 5.81 & 7.78 \\ + Q1 (\%) & 1.99 & 3.01 \\ + Q3 (\%) & 15.41 & 23.01 \\ + \bottomrule + \end{tabular} +\end{table} + + +\begin{table*}[!htbp] + \centering + \caption{Bottleneck reports from the studied tools}\label{table:coverage} + + \begin{tabular}{l | r r r | r r r | r r r} + \toprule + & \multicolumn{3}{c|}{\textbf{Frontend}} + & \multicolumn{3}{c|}{\textbf{Ports}} + & \multicolumn{3}{c}{\textbf{Dependencies}} \\ + & \textbf{yes} & \textbf{no} & \textbf{disagr.} & \textbf{yes} & \textbf{no} & \textbf{disagr.} & \textbf{yes} & \textbf{no} & \textbf{disagr.} \\ + + \midrule +2mm & 34 & 61 & 25.8 \% & 25 & 13 & 70.3 \% & 18 & 29 & 63.3 \% \\ +3mm & 44 & 61 & 18.0 \% & 30 & 13 & 66.4 \% & 23 & 37 & 53.1 \% \\ +atax & 13 & 72 & 41.0 \% & 25 & 17 & 70.8 \% & 23 & 30 & 63.2 \% \\ +bicg & 19 & 59 & 45.8 \% & 25 & 25 & 65.3 \% & 21 & 37 & 59.7 \% \\ +doitgen & 51 & 25 & 40.6 \% & 36 & 30 & 48.4 \% & 17 & 22 & 69.5 \% \\ +mvt & 27 & 53 & 33.3 \% & 9 & 18 & 77.5 \% & 7 & 32 & 67.5 \% \\ +gemver & 62 & 13 & 39.5 \% & 2 & 48 & 59.7 \% & 1 & 28 & 76.6 \% \\ +gesummv & 16 & 69 & 41.0 \% & 17 & 23 & 72.2 \% & 24 & 28 & 63.9 \% \\ +syr2k & 51 & 37 & 38.9 \% & 8 & 42 & 65.3 \% & 19 & 34 & 63.2 \% \\ +trmm & 69 & 27 & 25.0 \% & 16 & 30 & 64.1 \% & 15 & 30 & 64.8 \% \\ +symm & 0 & 121 & 11.0 \% & 5 & 20 & 81.6 \% & 9 & 5 & 89.7 \% \\ +syrk & 54 & 46 & 30.6 \% & 12 & 42 & 62.5 \% & 20 & 48 & 52.8 \% \\ +gemm & 42 & 41 & 42.4 \% & 30 & 41 & 50.7 \% & 16 & 57 & 49.3 \% \\ +gramschmidt & 48 & 52 & 21.9 \% & 16 & 20 & 71.9 \% & 24 & 39 & 50.8 \% \\ +cholesky & 24 & 72 & 33.3 \% & 0 & 19 & 86.8 \% & 5 & 14 & 86.8 \% \\ +durbin & 49 & 52 & 29.9 \% & 0 & 65 & 54.9 \% & 2 & 39 & 71.5 \% \\ +trisolv & 53 & 84 & 4.9 \% & 6 & 22 & 80.6 \% & 4 & 16 & 86.1 \% \\ +jacobi-1d & 18 & 78 & 33.3 \% & 66 & 9 & 47.9 \% & 0 & 13 & 91.0 \% \\ +heat-3d & 32 & 8 & 72.2 \% & 26 & 0 & 81.9 \% & 0 & 0 & 100.0 \% \\ +seidel-2d & 0 & 112 & 22.2 \% & 32 & 0 & 77.8 \% & 0 & 0 & 100.0 \% \\ +fdtd-2d & 52 & 22 & 47.1 \% & 20 & 41 & 56.4 \% & 0 & 40 & 71.4 \% \\ +jacobi-2d & 6 & 31 & 73.6 \% & 24 & 61 & 39.3 \% & 0 & 44 & 68.6 \% \\ +adi & 12 & 76 & 21.4 \% & 40 & 0 & 64.3 \% & 0 & 0 & 100.0 \% \\ +correlation & 18 & 36 & 51.8 \% & 19 & 30 & 56.2 \% & 23 & 45 & 39.3 \% \\ +covariance & 39 & 36 & 37.5 \% & 4 & 34 & 68.3 \% & 19 & 53 & 40.0 \% \\ +floyd-warshall & 74 & 16 & 29.7 \% & 16 & 24 & 68.8 \% & 20 & 8 & 78.1 \% \\ +\textbf{Total} & 907 & 1360 & 35.2 \% & 509 & 687 & 65.8 \% & 310 & 728 & 70.3 \% \\ +\bottomrule + \end{tabular} +\end{table*} + +\subsection{Relevance and representativity (bottleneck +analysis)}\label{ssec:bottleneck_diversity} + +The results provided by our harness are only relevant to evaluate the parts of +the tools' models that are stressed by the benchmarks generated; it is hence +critical that our benchmark generation procedure in Section~\ref{sec:bench_gen} +yields diverse results. 
This should be true by construction, as the various +polyhedral compilation techniques used stress different parts of the +microarchitecture. + +To assess this, we study the generated benchmarks' bottlenecks, \ie{} +architectural resources on which a release of pressure improves execution time. +Note that a saturated resource is not necessarily a bottleneck: a code that +uses \eg{} 100\,\% of the arithmetic units available for computations outside +of the critical path, at a point where a chain of dependencies is blocking, +will not run faster if these arithmetic operations are removed; hence, hardware +counters alone are not sufficient to find bottlenecks. + +However, some static analyzers report the bottlenecks they detect. To unify +their results and keep things simple, we study three general kinds of +bottlenecks. + +\begin{itemize} +\item{} \emph{Frontend:} the CPU's frontend is not able to issue + micro-operations to the backend fast enough. \iaca{} and \uica{} are + able to detect this. +\item{} \emph{Ports:} at least one of the backend ports has too much work; + reducing its pressure would accelerate the computation. + \llvmmca, \iaca{} and \uica{} are able to detect this. +\item{} \emph{Dependencies:} there is a chain of data dependencies slowing + down the computation. + \llvmmca, \iaca{} and \uica{} are able to detect this. +\end{itemize} + +For each source benchmark from Polybench and each type of bottleneck, we report +in Table~\ref{table:coverage} the number of derived benchmarks on which all the +tools agree that the bottleneck is present or absent. We also report the +proportion of cases in which the tools failed to agree. We analyze those +results later in Section~\ref{ssec:bottleneck_pred_analysis}. + +As we have no source of truth indicating whether a bottleneck is effectively +present in a microbenchmark, we adopt a conservative approach, and consider +only the subset of the microbenchmarks on which the tools agree on the status +of all three resources; for those, we have good confidence in the reported +bottlenecks. Obviously, this approach is limited, because it excludes +microbenchmarks that might be worth considering, and is most probably subject +to selection bias. + +Of the 3,500 microbenchmarks we have generated, 261 (7.5\,\%) are the subject +of the above-mentioned consensus. This sample is made up of microbenchmarks +generated from 21 benchmarks ---~\ie{} for 5 benchmarks, none of the derived +microbenchmarks reached a consensus among the tools~---, yielding a wide +variety of calculations, including floating-point arithmetic, pointer +arithmetic or Boolean arithmetic. Of these, 200 (76.6\,\%) are bottlenecked on +the CPU front-end, 19 (7.3\,\%) on back-end ports, and 81 (31.0\,\%) on latency +introduced by dependencies (a microbenchmark may exhibit several bottlenecks at +once, so these proportions overlap). As mentioned above, this distribution +probably does not reflect the distribution among the 3,500 original +benchmarks, as the 261 were not uniformly sampled. However, we argue that, as +all categories are represented in the sample, the initial hypothesis that the +generated benchmarks are diverse and representative is confirmed ---~thanks to +the transformations described in Section~\ref{sec:bench_gen}. + +\subsection{Carbon footprint} + +Generating and running the full suite of benchmarks required about 30h of +continuous computation on a single machine. During the experiments, the power +supply units reported a near-constant consumption of about 350W.
The carbon +intensity of the power grid for the region where the experiment was run, at the +time of the experiments, was of about 29g\coeq/kWh~\cite{electricitymaps}. + +The electricity consumed directly by the server thus amounts to about +10.50\,kWh. Assuming a Power Usage Efficiency of 1.5, the total electricity +consumption roughly amounts to 15.75\,kWh, or about 450\,g\coeq. + +A carbon footprint estimate of the machine's manufacture itself was conducted +by the manufacturer~\cite{poweredgeC6420lca}. Additionally accounting for the +extra 160\,GB of DDR4 SDRAM~\cite{meta_ACT}, the hardware manufacturing, +transport and end-of-life is evaluated to 1,266\,kg\coeq. In 2023, this +computation cluster's usage rate was 35\,\%. Assuming 6 years of product life, +30h of usage represents about 2,050\,g\coeq{}. The whole experiment thus amounts to +2.5\,kg\coeq. diff --git a/manuscrit/50_CesASMe/25_results_analysis.tex b/manuscrit/50_CesASMe/25_results_analysis.tex new file mode 100644 index 0000000..5a37fe2 --- /dev/null +++ b/manuscrit/50_CesASMe/25_results_analysis.tex @@ -0,0 +1,338 @@ +\section{Results analysis}\label{sec:results_analysis} + +The raw complete output from our benchmarking harness ---~roughly speaking, a +large table with, for each benchmark, a cycle measurement, cycle count for each +throughput analyzer, the resulting relative error, and a synthesis of the +bottlenecks reported by each tool~--- enables many analyses that, we believe, +could be useful both to throughput analysis tool developers and users. Tool +designers can draw insights on their tool's best strengths and weaknesses, and +work towards improving them with a clearer vision. Users can gain a better +understanding of which tool is more suited for each situation. + +\subsection{Throughput results}\label{ssec:overall_results} + +\begin{table*} + \centering + \caption{Statistical analysis of overall results}\label{table:overall_analysis_stats} + \begin{tabular}{l r r r r r r r r r} + \toprule +\textbf{Bencher} & \textbf{Datapoints} & \textbf{Failures} & \textbf{(\%)} & +\textbf{MAPE} & \textbf{Median} & \textbf{Q1} & \textbf{Q3} & \textbf{K. tau} & \textbf{Time (CPU$\cdot$h)}\\ +\midrule +BHive & 2198 & 1302 & (37.20\,\%) & 27.95\,\% & 7.78\,\% & 3.01\,\% & 23.01\,\% & 0.81 & 1.37\\ +llvm-mca & 3500 & 0 & (0.00\,\%) & 36.71\,\% & 27.80\,\% & 12.92\,\% & 59.80\,\% & 0.57 & 0.96 \\ +UiCA & 3500 & 0 & (0.00\,\%) & 29.59\,\% & 18.26\,\% & 7.11\,\% & 52.99\,\% & 0.58 & 2.12 \\ +Ithemal & 3500 & 0 & (0.00\,\%) & 57.04\,\% & 48.70\,\% & 22.92\,\% & 75.69\,\% & 0.39 & 0.38 \\ +Iaca & 3500 & 0 & (0.00\,\%) & 30.23\,\% & 18.51\,\% & 7.13\,\% & 57.18\,\% & 0.59 & 1.31 \\ +Gus & 3500 & 0 & (0.00\,\%) & 20.37\,\% & 15.01\,\% & 7.82\,\% & 30.59\,\% & 0.82 & 188.04 \\ +\bottomrule + \end{tabular} +\end{table*} + +The error distribution of the relative errors, for each tool, is presented as a +box plot in Figure~\ref{fig:overall_analysis_boxplot}. Statistical indicators +are also given in Table~\ref{table:overall_analysis_stats}. We also give, for +each tool, its Kendall's tau indicator~\cite{kendall1938tau}: this indicator, +used to evaluate \eg{} uiCA~\cite{uica} and Palmed~\cite{palmed}, measures how +well the pair-wise ordering of benchmarks is preserved, $-1$ being a full +anti-correlation and $1$ a full correlation. This is especially useful when one +is not interested in a program's absolute throughput, but rather in comparing +which program has a better throughput. 
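Concretely, for $n$ benchmarks, writing $n_c$ for the number of \emph{concordant} pairs (pairs of benchmarks ordered identically by the tool's predictions and by the baseline measurements) and $n_d$ for the number of \emph{discordant} pairs, Kendall's coefficient is, up to the handling of ties,
\[
\tau = \frac{n_c - n_d}{\binom{n}{2}}.
\]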
+ +\begin{figure} + \includegraphics[width=\linewidth]{figs/overall_analysis_boxplot.pdf} + \caption{Statistical distribution of relative errors}\label{fig:overall_analysis_boxplot} +\end{figure} + + +These results are, overall, significantly worse than what each tool's article +presents. We attribute this difference mostly to the specificities of +Polybench: being composed of computation kernels, it intrinsically stresses the +CPU more than basic blocks extracted out of the Spec benchmark suite. This +difference is clearly reflected in the experimental section of the Palmed +article~\cite{palmed}: the accuracy of most tools is worse on Polybench than on +Spec, often by more than a factor of two. + +As \bhive{} and \ithemal{} do not support control flow instructions +(\eg{} \texttt{jump} instructions), those had +to be removed from the blocks before analysis. While none of these tools, apart +from \gus{} ---~which is dynamic~---, is able to account for branching costs, +these two analyzers were also unable to account for the front- and backend cost +of the control flow instructions themselves as well ---~corresponding to the +$TP_U$ mode introduced by \uica~\cite{uica}, while others +measure $TP_L$. + + +\subsection{Understanding \bhive's results}\label{ssec:bhive_errors} + +The error distribution of \bhive{} against \perf{}, plotted right in +Figure~\ref{fig:exp_comparability}, puts forward irregularities in \bhive's +results. Since \bhive{} is based on measures ---~instead of predictions~--- +through hardware counters, an excellent accuracy is expected. Its lack of +support for control flow instructions can be held accountable for a portion of +this accuracy drop; our lifting method, based on block occurrences instead of +paths, can explain another portion. We also find that \bhive{} fails to produce +a result in about 40\,\% of the kernels explored ---~which means that, for those +cases, \bhive{} failed to produce a result on at least one of the constituent +basic blocks. In fact, this is due to the difficulties we mentioned in +Section \ref{sec:intro} related to the need to reconstruct the context of each +basic block \textit{ex nihilo}. + +The basis of \bhive's method is to run the code to be measured, unrolled a +number of times depending on the code size, with all memory pages but the +code unmapped. As the code tries to access memory, it will raise segfaults, +caught by \bhive's harness, which allocates a single shared-memory page, filled +with a repeated constant, that it will map wherever segfaults occur before +restarting the program. +The main causes of \bhive{} failure are bad code behaviour (\eg{} control flow +not reaching the exit point of the measure if a bad jump is inserted), too many +segfaults to be handled, or a segfault that occurs even after mapping a page at +the problematic address. + +The registers are also initialized, at the beginning of the measurement, to the +fixed constant \texttt{0x2324000}. We show through two examples that this +initial value can be of crucial importance. + +The following experiments are executed on an Intel(R) Xeon(R) Gold 6230R CPU +(Cascade Lake), with hyperthreading disabled. + +\paragraph{Imprecise analysis} we consider the following x86-64 kernel. 
+ +\begin{minipage}{0.95\linewidth} +\begin{lstlisting}[language={[x86masm]Assembler}] + vmulsd (%rax), %xmm3, %xmm0 + vmovsd %xmm0, (%r10) +\end{lstlisting} +\end{minipage} + +When executed with all the general-purpose registers initialized to the default +constant, \bhive{} reports 9 cycles per iteration, since \reg{rax} and +\reg{r10} hold the same value, inducing a read-after-write dependency between +the two instructions. If, however, \bhive{} is tweaked to initialize \reg{r10} +to a value that aliases (\wrt{} physical addresses) with the value in +\reg{rax}, \eg{} between \texttt{0x10000} and \texttt{0x10007} (inclusive), it +reports 19 cycles per iteration instead, while a value between \texttt{0x10008} +and \texttt{0x1009f} (inclusive) yields the expected 1 cycle ---~except for +values in \texttt{0x10039}-\texttt{0x1003f} and +\texttt{0x10079}-\texttt{0x1007f}, yielding 2 cycles as the store crosses a +cache line boundary. + +In the same way, the value used to initialize the shared memory page can +influence the results whenever it gets loaded into registers. + +\vspace{0.5em} + +\paragraph{Failed analysis} some memory accesses will always result in an +error; for instance, it is impossible to \texttt{mmap} at an address lower +than \texttt{/proc/sys/vm/mmap\_min\_addr}, defaulting to \texttt{0x10000}. Thus, +with equal initial values for all registers, the following kernel would fail, +since the second operation attempts to load at address 0: + +\begin{minipage}{0.95\linewidth} +\begin{lstlisting}[language={[x86masm]Assembler}] + subq %r11, %r10 + movq (%r10), %rax +\end{lstlisting} +\end{minipage} + +Such errors can occur in more convoluted ways. The following x86-64 kernel, +for instance, is extracted from a version of the \texttt{durbin} +kernel\footnote{\texttt{durbin.pocc.noopt.default.unroll8.MEDIUM.kernel21.s} +in the full results}. + +\begin{minipage}{0.95\linewidth} +\begin{lstlisting}[language={[x86masm]Assembler}] + vmovsd 0x10(%r8, %rcx), %xmm6 + subl %eax, %esi + movslq %esi, %rsi + vfmadd231sd -8(%r9, %rsi, 8), \ + %xmm6, %xmm0 +\end{lstlisting} +\end{minipage} + +Here, when run with the general-purpose registers initialized to the default +constant, \bhive{} fails to measure the kernel: at the 2\textsuperscript{nd} +occurrence of the unrolled loop body, it cannot recover from an error at the +\texttt{vfmadd231sd} instruction with the \texttt{mmap} strategy. Indeed, after +the first iteration the value in \reg{rsi} becomes zero, then negative at the +second iteration; thus, the second occurrence of the last instruction loads +from address \texttt{0xfffffffff0a03ff8}, which is in kernel space. This +microkernel can be benchmarked with \bhive{} \eg{} by initializing \reg{rax} to 1. + +Some other microkernels fail in a similar way when trying to access addresses +that are not virtual addresses in \emph{canonical form} for x86-64 with +48-bit virtual addresses, as defined in Section~3.3.7.1 of Intel's Software +Developer's Manual~\cite{ref:intel64_software_dev_reference_vol1} and +Section~5.3.1 of the AMD64 Architecture Programmer's +Manual~\cite{ref:amd64_architecture_dev_reference_vol2}. Others still fail with +accesses relative to the instruction pointer, as \bhive{} read-protects the +unrolled microkernel's instruction page.
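For reference, the helper below is a hypothetical sketch of ours, not code from \bhive{}: an address is in canonical form under 48-bit virtual addressing when bits~63 down to~48 replicate bit~47, \ie{} when its upper 17 bits are either all zeros or all ones.

\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[ANSI]C}]
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not part of BHive): true iff addr is a
 * canonical x86-64 virtual address under 48-bit addressing. */
static bool is_canonical_48(uint64_t addr)
{
    uint64_t upper = addr >> 47;            /* bits 63..47 */
    return upper == 0 || upper == 0x1ffff;  /* all zeros or all ones */
}
\end{lstlisting}
\end{minipage}

The address \texttt{0xfffffffff0a03ff8} reached above is thus canonical, but lies in the kernel half of the address space; a non-canonical address cannot even be mapped, whatever the recovery strategy.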
+ +\subsection{Bottleneck prediction}\label{ssec:bottleneck_pred_analysis} + +We mentioned in Section~\ref{ssec:bottleneck_diversity} that some of +the tools studied are also able to report suspected bottlenecks for the +evaluated program; the corresponding results are presented in Table~\ref{table:coverage}. +This feature might be even more useful than raw throughput predictions to +users of these tools who wish to optimize their program, as such reports strongly hint +at what needs to be improved. + +In the majority of the cases studied, the tools are not able to agree on the +presence or absence of a type of bottleneck. Although it might seem that the +tools are performing better on frontend bottleneck detection, it must be +recalled that only two tools (versus three in the other cases) are reporting +frontend bottlenecks, thus making it easier for them to agree. + +\begin{table} + \centering + \caption{Diverging bottleneck prediction per tool}\label{table:bottleneck_diverging_pred} + \begin{tabular}{l r r r r} + \toprule + \textbf{Tool} + & \multicolumn{2}{c}{\textbf{Ports}} + & \multicolumn{2}{c}{\textbf{Dependencies}} \\ + \midrule + \llvmmca{} & 567 & (24.6 \%) & 1032 & (41.9 \%) \\ + \uica{} & 516 & (22.4 \%) & 530 & (21.5 \%) \\ + \iaca{} & 1221 & (53.0 \%) & 900 & (36.6 \%) \\ + \bottomrule + \end{tabular} +\end{table} + +Table~\ref{table:bottleneck_diverging_pred}, in turn, breaks down the cases +on which the three tools disagree into the number of times each tool makes a +diverging prediction ---~\ie{} the tool predicts differently from the other +two. In the case of ports, \iaca{} is responsible for half of the +divergences ---~which is not sufficient to conclude that the prediction of the +other tools is correct. In the case of dependencies, however, there is no clear +outlier, even though \uica{} seems to fare better than the others. + +In no case does a single tool seem to be responsible for the vast majority of +disagreements, which would have hinted that it fails to predict this +bottleneck correctly. In the absence of a source of truth indicating whether a bottleneck +is effectively present, and with no clear-cut result for (a subset of) tool +predictions, we cannot conclude on the quality of the predictions from each +tool for each kind of bottleneck. + +\subsection{Impact of dependency-boundness}\label{ssec:memlatbound} + +\begin{table*} + \centering + \caption{Statistical analysis of overall results, without the rows that are + latency-bound through memory-carried dependencies}\label{table:nomemdeps_stats} + \begin{tabular}{l r r r r r r r r} + \toprule +\textbf{Bencher} & \textbf{Datapoints} & \textbf{Failures} & \textbf{(\%)} & +\textbf{MAPE} & \textbf{Median} & \textbf{Q1} & \textbf{Q3} & \textbf{K. tau}\\ +\midrule +BHive & 1365 & 1023 & (42.84\,\%) & 34.07\,\% & 8.62\,\% & 4.30\,\% & 24.25\,\% & 0.76\\ +llvm-mca & 2388 & 0 & (0.00\,\%) & 27.06\,\% & 21.04\,\% & 9.69\,\% & 32.73\,\% & 0.79\\ +UiCA & 2388 & 0 & (0.00\,\%) & 18.42\,\% & 11.96\,\% & 5.42\,\% & 23.32\,\% & 0.80\\ +Ithemal & 2388 & 0 & (0.00\,\%) & 62.66\,\% & 53.84\,\% & 24.12\,\% & 81.95\,\% & 0.40\\ +Iaca & 2388 & 0 & (0.00\,\%) & 17.55\,\% & 12.17\,\% & 4.64\,\% & 22.35\,\% & 0.82\\ +Gus & 2388 & 0 & (0.00\,\%) & 23.18\,\% & 20.23\,\% & 8.78\,\% & 32.73\,\% & 0.83\\ +\bottomrule + \end{tabular} +\end{table*} + +An overview of the full results table (available in our artifact) hints towards +two main tendencies: on a significant number of rows, the static tools +---~thus leaving \gus{} and \bhive{} apart~---, except for \ithemal, often yield +comparatively bad throughput predictions \emph{together}; and many of these +rows are those using the \texttt{O1} and \texttt{O1autovect} compilation +settings (\texttt{gcc} with \texttt{-O1}, plus vectorization options for the +latter). + +To confirm the first observation, we look at the 30\,\% worst benchmarks ---~in +terms of MAPE relative to \perf~--- for \llvmmca, \uica{} and \iaca{} +---~yielding 1050 rows each. All of these share 869 rows (82.8\,\%), which we +call \textit{jointly bad rows}. + +Among these 869 jointly bad rows, we further find that respectively 342 +(39.4\,\%) and 337 (38.8\,\%) are compiled using the \texttt{O1} and +\texttt{O1autovect} settings, totalling 679 (78.1\,\%) \texttt{O1}-based rows, +against 129 (14.8\,\%) for \texttt{default} and 61 (7.0\,\%) for +\texttt{O3nosimd}. This result is significant enough to be used as a hint to +investigate the issue. + +\begin{figure} + \includegraphics[width=\linewidth]{figs/nomemdeps_boxplot.pdf} + \caption{Statistical distribution of relative errors, with and without + pruning the rows that are latency-bound through memory-carried dependencies}\label{fig:nomemdeps_boxplot} +\end{figure} + + +Insofar as our approach maintains a strong link between the basic blocks studied and +the source codes from which they are extracted, it is possible to identify the +high-level characteristics of the microbenchmarks concerned. +In the overwhelming majority (97.5\,\%) of those jointly bad rows, the tools predicted +fewer cycles than measured, meaning that a bottleneck is either missed or +underestimated. +Manual investigation of a few simple benchmarks (no polyhedral transformation +applied, \texttt{O1} mode, not unrolled) further hints towards dependencies: +for instance, the \texttt{gemver} benchmark, which is \emph{not} among the +badly predicted benchmarks, has this kernel: + +\begin{minipage}{0.95\linewidth} +\begin{lstlisting}[language={[ANSI]C}] +for(c3) + A[c1][c3] += u1[c1] * v1[c3] + + u2[c1] * v2[c3]; +\end{lstlisting} +\end{minipage} + +while the \texttt{atax} benchmark, which is among the badly predicted ones, has +this kernel: + +\begin{minipage}{0.95\linewidth} +\begin{lstlisting}[language={[ANSI]C}] +for(c3) + tmp[c1] += A[c1][c3] * x[c3]; +\end{lstlisting} +\end{minipage} + +The first one exhibits no obvious dependency-boundness, while the second, +accumulating into \texttt{tmp[c1]} (which does not depend on the iteration variable), lacks +instruction-level parallelism. Among the simple benchmarks (as described +above), 8 are in the badly predicted list, all of which exhibit a +read-after-write data dependency to the preceding iteration.
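The following hypothetical rewrite (ours, for illustration; it is not a transformation performed by our pipeline) makes the contrast concrete; the next paragraph examines what the compiler actually emits at \texttt{-O1}.

\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[ANSI]C}]
/* Hypothetical illustration: accumulating into a local variable lets
 * the compiler keep the running sum in a register, so the loop-carried
 * dependency is register-carried.  When tmp[c1] is instead updated in
 * memory on every iteration, each load depends on the store of the
 * previous iteration at the same address. */
void atax_row(int n, const double *A_row, const double *x, double *tmp_c1)
{
    double acc = *tmp_c1;
    for (int c3 = 0; c3 < n; c3++)
        acc += A_row[c3] * x[c3];
    *tmp_c1 = acc;
}
\end{lstlisting}
\end{minipage}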
+ +Looking at the assembly code generated for those in \texttt{O1} modes, it +appears that the dependencies exhibited at the C level are compiled to +\emph{memory-carried} dependencies: the read-after-write happens for a given +memory address, instead of for a register. This kind of dependency, prone to +aliasing and dependent on the values of the registers, is hard to infer for a +static tool and is not supported by the analyzers under scrutiny in the general +case; it could thus reasonably explain the results observed. + +There is no easy way, however, to know for certain which of the 3500 benchmarks +are latency bound: no hardware counter reports this. We investigate this +further using \gus's sensitivity analysis: in complement of the ``normal'' +throughput estimation of \gus, we run it a second time, disabling the +accounting for latency through memory dependencies. By construction, this second measurement should be +either very close to the first one, or significantly below. We then assume a +benchmark to be latency bound due to memory-carried dependencies when it is at +least 40\,\% faster when this latency is disabled; there are 1112 (31.8\,\%) such +benchmarks. + +Of the 869 jointly bad rows, 745 (85.7\,\%) are declared latency +bound through memory-carried dependencies by \gus. We conclude that the main +reason for these jointly badly predicted benchmarks is that the predictors +under scrutiny failed to correctly detect these dependencies. + +In Section~\ref{ssec:overall_results}, we presented in +Figure~\ref{fig:overall_analysis_boxplot} and +Table~\ref{table:overall_analysis_stats} general statistics on the tools +on the full set of benchmarks. We now remove the 1112 benchmarks +flagged as latency bound through memory-carried dependencies by \gus{} from the +dataset, and present in Figure~\ref{fig:nomemdeps_boxplot} a comparative +box plot for the tools under scrutiny. We also present in +Table~\ref{table:nomemdeps_stats} the same statistics on this pruned dataset. +While the results for \llvmmca, \uica{} and \iaca{} globally improved +significantly, the most noticeable improvements are the reduced spread of the +results and the Kendall's $\tau$ correlation coefficient's increase. + +From this, +we argue that detecting memory-carried dependencies is a weak point in current +state-of-the-art static analyzers, and that their results could be +significantly more accurate if improvements are made in this direction. diff --git a/manuscrit/50_CesASMe/30_future_works.tex b/manuscrit/50_CesASMe/30_future_works.tex new file mode 100644 index 0000000..7f25683 --- /dev/null +++ b/manuscrit/50_CesASMe/30_future_works.tex @@ -0,0 +1,106 @@ +\section{Conclusion and future works} + +In this article, we have presented a fully-tooled approach that enables: + +\begin{itemize} +\item the generation of a wide variety of microbenchmarks, reflecting both the + expertise contained in an initial benchmark suite, and the diversity of + code transformations allowing to stress different aspects of a performance model + ---~or even a measurement environment, \eg{} \bhive; and +\item the comparability of various measurements and + analyses applied to each of these microbenchmarks. +\end{itemize} + +Thanks to this tooling, we were able to show the limits and strengths of +various performance models in relation to the expertise contained in the +Polybench suite. We discuss throughput results in +Section~\ref{ssec:overall_results} and bottleneck prediction in +Section~\ref{ssec:bottleneck_pred_analysis}. 
+ +We were also able to demonstrate the difficulties of reasoning at the level of +a basic block isolated from its context. We specifically study those +difficulties in the case of \bhive{} in Section~\ref{ssec:bhive_errors}. +Indeed, the actual values ---~both from registers and memory~--- involved in a +basic block's computation are constitutive not only of its functional +properties (\ie{} the result of the calculation), but also of some of its +non-functional properties (\eg{} latency, throughput). + +We were also able to show in Section~\ref{ssec:memlatbound} +that state-of-the-art static analyzers struggle to +account for memory-carried dependencies; a weakness significantly impacting +their overall results on our benchmarks. We believe that detecting +and accounting for these dependencies is an important future works direction. + +Moreover, we present this work in the form of a modular software package, each +component of which exposes numerous adjustable parameters. These components can +also be replaced by others fulfilling the same abstract function: another +initial benchmark suite in place of Polybench, other loop nest +optimizers in place of PLUTO and PoCC, other code +analyzers, and so on. This software modularity reflects the fact that our +contribution is about interfacing and communication between distinct issues. + +\medskip + +Furthermore, we believe that the contributions we made in the course of this work +may eventually be used to face different, yet neighbouring issues. +These perspectives can also be seen as future works: + +\smallskip + +\paragraph{Program optimization} the whole program processing we have designed +can be used not only to evaluate the performance model underlying a static +analyzer, but also to guide program optimization itself. In such a perspective, +we would generate different versions of the same program using the +transformations discussed in Section~\ref{sec:bench_gen} and colored blue in +Figure~\ref{fig:contrib}. These different versions would then feed the +execution and measurement environment outlined in +Section~\ref{sec:bench_harness} and colored orange in Figure~\ref{fig:contrib}. +Indeed, thanks to our previous work, we know that the results of these +comparable analyses and measurements would make it possible to identify which +version is the most efficient, and even to reconstruct information indicating +why (which bottlenecks, etc.). + +However, this approach would require that these different versions of the same +program are functionally equivalent, \ie{} that they compute the same +result from the same inputs; yet we saw in Section~\ref{sec:bench_harness} +that, as it stands, the transformations we apply are not concerned with +preserving the semantics of the input codes. To recover this semantic +preservation property, abandoning the kernelification pass we have presented +suffices; this however would require to control L1-residence otherwise. + +\smallskip + +\paragraph{Dataset building} our microbenchmarks generation phase outputs a +large, diverse and representative dataset of microkernels. In addition to our +harness, we believe that such a dataset could be used to improve existing +data-dependant solutions. + +%the measurement and execution environment we +%propose is not the only type of tool whose function is to process a large +%dataset (\ie{} the microbenchmarks generated earlier) to automatically +%abstract its characteristics. 
We can also think of: + +Inductive methods, for instance in \anica, strive to preserve the properties of a basic +block through successive abstractions of the instructions it contains, so as to +draw the most general conclusions possible from a particular experiment. +Currently, \anica{} starts off from randomly generated basic blocks. This +approach guarantees a certain variety, and avoids +over-specialization, which would prevent it from finding interesting cases too +far from an initial dataset. However, it may well lead to the sample under +consideration being systematically outside the relevant area of the search +space ---~\ie{} having no relation to real-life programs or those in the user's +field. + +On the other hand, machine learning methods based on neural networks, for +instance in \ithemal, seek to correlate the result of a function with the +characteristics of its input ---~in this case to correlate a throughput +prediction with the instructions making up a basic block~--- by backpropagating +the gradient of a cost function. In the case of \ithemal{}, it is trained on +benchmarks originating from a data suite. As opposed to random generation, +this approach offers representative samples, but comes with a risk of lack of +variety and over-specialization. + +Comparatively, our microbenchmark generation method is natively meant to +produce a representative, varied and large dataset. We believe that +enriching the dataset of the above-mentioned methods with our benchmarks might +extend their results and reach. diff --git a/manuscrit/50_CesASMe/99_conclusion.tex b/manuscrit/50_CesASMe/99_conclusion.tex new file mode 100644 index 0000000..3815954 --- /dev/null +++ b/manuscrit/50_CesASMe/99_conclusion.tex @@ -0,0 +1,2 @@ +%% \section*{Conclusion} +%% \todo{} diff --git a/manuscrit/50_CesASMe/main.tex b/manuscrit/50_CesASMe/main.tex index 0ebb0ef..da40a41 100644 --- a/manuscrit/50_CesASMe/main.tex +++ b/manuscrit/50_CesASMe/main.tex @@ -1 +1,9 @@ \chapter{A more systematic approach to throughput prediction performance analysis} + +\input{00_intro.tex} +\input{05_related_works.tex} +\input{10_bench_gen.tex} +\input{20_evaluation.tex} +\input{25_results_analysis.tex} +\input{30_future_works.tex} +\input{99_conclusion.tex} diff --git a/manuscrit/50_CesASMe/overview.tex b/manuscrit/50_CesASMe/overview.tex new file mode 100644 index 0000000..55285f6 --- /dev/null +++ b/manuscrit/50_CesASMe/overview.tex @@ -0,0 +1,82 @@ +\begin{figure*}[ht!] 
+ \definecolor{col_bench_gen}{HTML}{5a7eff} + \definecolor{col_bench_gen_bg}{HTML}{dbeeff} + \definecolor{col_bench_harness}{HTML}{ffa673} + \definecolor{col_results}{HTML}{000000} +\centerline{ + \begin{tikzpicture}[ + hiddennode/.style={rectangle,draw=white, very thick, minimum size=5mm, align=center, font=\footnotesize}, + normnode/.style={rectangle,draw=black, very thick, minimum size=5mm, align=center, font=\footnotesize}, + resultnode/.style={rectangle,draw=col_results, fill=black!2, very thick, minimum size=5mm, align=center, font=\footnotesize}, + bluenode/.style={rectangle, draw=col_bench_gen, fill=col_bench_gen_bg, very thick, minimum height=5mm, minimum width=4cm, align=center, font=\footnotesize}, + rednode/.style={rectangle, draw=col_bench_harness, fill=orange!5, very thick, minimum size=5mm, align=center, font=\footnotesize}, + bencher/.style={rednode, minimum width=2.5cm, minimum height=5mm}, + genarrow/.style={draw=col_bench_gen}, + harnarrow/.style={draw=col_bench_harness}, + ] + \centering + %Nodes + \node[bluenode] (bench) {Benchmark suite \figref{ssec:bench_suite}}; + \node[bluenode] (pocc) [below=of bench] {Loop nest optimizers \figref{ssec:loop_nest_optimizer}}; + \node[bluenode] (kernel) [below=of pocc] {Constraining utility \figref{ssec:kernelify}}; + \node[bluenode] (gcc) [below=of kernel] {Compilations \figref{ssec:compile}}; + \node[rednode] (gdb) [right=0.1\textwidth of gcc] {Basic block \\extraction \figref{ssec:bb_extr}}; + \node[bencher] (ithemal) [right=4cm of gdb] {Ithemal}; + \node[bencher] (iaca) [above=0.5em of ithemal] {IACA}; + \node[bencher] (uica) [above=0.5em of iaca] {uiCA}; + \node[bencher] (llvm) [above=0.5em of uica] {llvm-mca}; + \node[bencher] (bhive) [above=0.5em of llvm] {BHive (measure)}; + \node[rednode] (ppapi) [left=1cm of bhive] {perf (measure)}; + \node[rednode] (gus) [below=0.5em of ppapi] {Gus}; + %% \node[rednode] (uica) [below=of gdb] {uiCA}; + \node[rednode] (lifting) [right=of bhive] { + Prediction lifting\\\figref{ssec:harness_lifting}}; + \node[ + draw=black, + very thick, + dotted, + fit=(ppapi) (gus) (bhive) (llvm) (uica) (iaca) (ithemal) + ] (comps) {}; + \node (throughput_label) [above=0.2em of comps,align=center] { + \footnotesize Throughput predictions \\\footnotesize \& measures + \figref{ssec:throughput_pred_meas}}; + \node[draw=black, + very thick, + dotted, + %% label={below:\footnotesize Variations}, + label={[above,xshift=1cm]\footnotesize Variations}, + fit=(pocc) (kernel) (gcc) + ] (vars) {}; +\node[resultnode] (bench2) [below=of lifting] {Evaluation metrics \\ for + code analyzers}; + + % Key + \node[] (keyblue1) [below left=0.7cm and 0cm of vars] {}; + \node[hiddennode] (keyblue2) [right=0.5cm of keyblue1] {Section~\ref{sec:bench_gen}~: generating microbenchmarks}; + \node[] (keyred1) [right=0.6cm of keyblue2] {}; + \node[hiddennode] (keyred2) [right=0.5cm of keyred1] {Section~\ref{sec:bench_harness}~: benchmarking harness}; + \node[] (keyresult1) [right=0.6cm of keyred2] {}; + \node[hiddennode] (keyresult2) [right=0.5cm of keyresult1] + {Section~\ref{sec:results_analysis}~: results analysis}; + + %Lines + \draw[-, very thick, harnarrow] (keyred1.east) -- (keyred2.west); + \draw[-, very thick, genarrow] (keyblue1.east) -- (keyblue2.west); + \draw[-, very thick] (keyresult1.east) -- (keyresult2.west); + \draw[->, very thick, genarrow] (bench.south) -- (pocc.north); + \draw[->, very thick, genarrow] (pocc.south) -- (kernel.north); + \draw[->, very thick, genarrow] (kernel.south) -- (gcc.north); + \draw[->, very 
thick, genarrow] (gcc.east) -- (gdb.west); + \draw[->, very thick, genarrow] (gcc.east) -- (ppapi.west); + \draw[->, very thick, genarrow] (gcc.east) -- (gus.west); + \draw[->, very thick, harnarrow] (gdb.east) -- (uica.west); + \draw[->, very thick, harnarrow] (gdb.east) -- (iaca.west); + \draw[->, very thick, harnarrow] (gdb.east) -- (ithemal.west); + \draw[->, very thick, harnarrow] (gdb.east) -- (bhive.west); + \draw[->, very thick, harnarrow] (gdb.east) -- (llvm.west); + \draw[->, very thick, harnarrow] (comps.east|-lifting) -- (lifting.west); + \draw[->, very thick] (lifting.south) -- (bench2.north); + \end{tikzpicture} +} +\caption{Our analysis and measurement environment.\label{fig:contrib}} +\end{figure*} diff --git a/manuscrit/assets/imgs/50_CesASMe/.gitignore b/manuscrit/assets/imgs/50_CesASMe/.gitignore new file mode 100644 index 0000000..13f0a16 --- /dev/null +++ b/manuscrit/assets/imgs/50_CesASMe/.gitignore @@ -0,0 +1 @@ +!*.pdf diff --git a/manuscrit/assets/imgs/50_CesASMe/nomemdeps_boxplot.pdf b/manuscrit/assets/imgs/50_CesASMe/nomemdeps_boxplot.pdf new file mode 100644 index 0000000..019a570 Binary files /dev/null and b/manuscrit/assets/imgs/50_CesASMe/nomemdeps_boxplot.pdf differ diff --git a/manuscrit/assets/imgs/50_CesASMe/overall_analysis_boxplot.pdf b/manuscrit/assets/imgs/50_CesASMe/overall_analysis_boxplot.pdf new file mode 100644 index 0000000..e898951 Binary files /dev/null and b/manuscrit/assets/imgs/50_CesASMe/overall_analysis_boxplot.pdf differ diff --git a/manuscrit/assets/imgs/50_CesASMe/results_comparability_hist.pdf b/manuscrit/assets/imgs/50_CesASMe/results_comparability_hist.pdf new file mode 100644 index 0000000..0fda8f5 Binary files /dev/null and b/manuscrit/assets/imgs/50_CesASMe/results_comparability_hist.pdf differ