CesASMe: brutal paper import. Not compiling yet.

Théophile Bastian 2023-09-25 17:00:07 +02:00
parent 0b089085e0
commit fc9182428d
14 changed files with 1143 additions and 0 deletions

\begin{abstract}
A variety of code analyzers, such as \iaca, \uica, \llvmmca{} or
\ithemal{}, strive to statically predict the throughput of a computation
kernel. Each analyzer is based on its own simplified CPU model
reasoning at the scale of an isolated basic block.
Given this diversity, evaluating their strengths and
weaknesses is important to guide both their usage and their enhancement.
We argue that reasoning at the scale of a single basic block is not
always sufficient and that a lack of context can mislead analyses. We present \tool, a fully-tooled
solution to evaluate code analyzers on C-level benchmarks. It is composed of a
benchmark derivation procedure that feeds an evaluation harness. We use it to
evaluate state-of-the-art code analyzers and to provide insights on their
precision. We use \tool's results to show that memory-carried data
dependencies are a major source of imprecision for these tools.
\end{abstract}
\section{Introduction}\label{sec:intro}
At a time when software is expected to perform more computations, faster and in
more constrained environments, tools that statically predict the resources (and
in particular the CPU resources) they consume are very useful to guide their
optimization. This need is reflected in the diversity of binary or assembly
code analyzers following the deprecation of \iaca~\cite{iaca}, which Intel has
maintained through 2019. Whether it is \llvmmca{}~\cite{llvm-mca},
\uica{}~\cite{uica}, \ithemal~\cite{ithemal} or \gus~\cite{phd:gruber}, all these tools strive to extract various
performance metrics, including the number of CPU cycles a computation kernel will take
---~which roughly translates to execution time.
In addition to raw measurements (relying on hardware counters), these model-based analyses provide
higher-level and refined data, to expose the bottlenecks and guide the
optimization of a given code. This feedback is useful to experts optimizing
computation kernels, including scientific simulations and deep-learning
kernels.
An exact throughput prediction would require a cycle-accurate simulator of the
processor, based on microarchitectural data that is most often not publicly
available, and would be prohibitively slow in any case. These tools thus each
solve in their own way the challenge of modeling complex CPUs while remaining
simple enough to yield a prediction in a reasonable time, ending up with
different models. For instance, on the following x86-64 basic block computing a
general matrix multiplication,
\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
movsd (%rcx, %rax), %xmm0
mulsd %xmm1, %xmm0
addsd (%rdx, %rax), %xmm0
movsd %xmm0, (%rdx, %rax)
addq $8, %rax
cmpq $0x2260, %rax
jne 0x16e0
\end{lstlisting}
\end{minipage}
\noindent\llvmmca{} predicts 1.5 cycles, \iaca{} and \ithemal{} predict 2 cycles, while \uica{}
predicts 3 cycles. One may wonder which tool is correct.
The obvious solution to assess their predictions is to compare them to an
actual measure. However, as these tools reason at the basic block level, this
is not as trivially defined as it would seem. Take for instance the following
kernel:
\begin{minipage}{0.90\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
mov (%rax, %rcx, 1), %r10
mov %r10, (%rbx, %rcx, 1)
add $8, %rcx
\end{lstlisting}
\end{minipage}
\input{overview}
\noindent{}At first, it looks like an array copy from location \reg{rax} to
\reg{rbx}. Yet, if before the loop, \reg{rbx} is initialized to
\reg{rax}\texttt{+8}, there is a read-after-write dependency between the first
instruction and the second instruction of the previous iteration, which makes
the throughput drop significantly. As we shall see in
Section~\ref{ssec:bhive_errors}, \emph{without proper context, a basic
block's throughput is not well-defined}.
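To make this concrete, a hypothetical C-level counterpart of this block could be
the following (a minimal sketch of ours, not extracted from an actual benchmark;
\texttt{dst} and \texttt{src} play the roles of \reg{rbx} and \reg{rax}):

\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[ANSI]C}]
/* If dst and src are disjoint, iterations are
 * independent. If dst == src + 1, each load reads
 * the value stored by the previous iteration:
 * a loop-carried read-after-write dependency. */
void copy(long *dst, const long *src, long n)
{
    for (long i = 0; i < n; i++)
        dst[i] = src[i];
}
\end{lstlisting}
\end{minipage}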
To recover the context of each basic block, we reason instead at the scale of
a C source code. This
makes the measures unambiguous: one can use hardware counters to measure the
elapsed cycles during a loop nest. This requires a suite of benchmarks, in C,
that is both representative of the domain studied and wide enough to cover
that domain well. However, this is not in itself sufficient to
evaluate static tools: on the preceding matrix multiplication kernel, counters
report 80,059 elapsed cycles ---~for the total loop.
This number can hardly be compared to the basic block-level predictions of
\llvmmca{}, \iaca{}, \ithemal{}, and \uica{} seen above.
A common practice to make these numbers comparable is to renormalize them to
instructions per cycle (IPC). Here, \llvmmca{} reports an IPC of
$\frac{7}{1.5}~\approx~4.67$, \iaca{} and \ithemal{} report an IPC of
$\frac{7}{2}~=~3.5$, and \uica{} reports an IPC of $\frac{7}{3}~\approx~2.33$. In this
case, the measured IPC is 3.45, which is closest to \iaca{} and \ithemal. Yet,
IPC is a metric for microarchitectural load, and \textit{tells nothing about a
kernel's efficiency}. Indeed, the static number of instructions is affected by
many compiler passes, such as scalar evolution, strength reduction, register
allocation, instruction selection\ldots{} Thus, when comparing two compiled
versions of the same code, IPC alone does not necessarily point to the most
efficient version. For instance, a kernel using SIMD instructions will use
fewer instructions than one using only scalars, and thus exhibit a lower or
constant IPC; yet, its performance will unquestionably increase.
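As a purely hypothetical illustration (the numbers are ours, not measured): a
scalar version of a kernel retiring $8{,}000$ instructions at an IPC of 4 takes
$2{,}000$ cycles, while a vectorized version of the same kernel retiring only
$2{,}500$ instructions at an IPC of 2 takes $1{,}250$ cycles; the vectorized
version is the faster one despite its lower IPC.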
The total cycles elapsed to solve a given problem, on the other
hand, is a sound metric of the efficiency of an implementation. We thus
instead \emph{lift} the predictions at basic-block level to a total number of
cycles. In simple cases, this amounts to multiplying the block-level
prediction by the number of loop iterations; in general, however, this bound
may not be known. More importantly, the compiler may apply any number of
transformations: unrolling, for instance, changes this number. Control flow may
also be complicated by code versioning.
%In the general case, instrumenting the generated code to obtain the number of
%occurrences of the basic block yields accurate results.
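As a back-of-the-envelope illustration on the matrix multiplication kernel
above (our own estimate, assuming that essentially all retired instructions
belong to this 7-instruction block): the 80,059 measured cycles at an IPC of
3.45 correspond to roughly
\[
\frac{80{,}059 \times 3.45}{7} \approx 39{,}500
\]
executions of the block, so that \eg{} \iaca's 2-cycle prediction lifts to
about $2 \times 39{,}500 \approx 79{,}000$ cycles, a figure directly comparable
to the measured total.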
\bigskip
In this article, we present a fully-tooled solution to evaluate and compare the
diversity of static throughput predictors. Our tool, \tool, solves two main
issues in this direction. In Section~\ref{sec:bench_gen}, we describe how
\tool{} generates a wide variety of computation kernels stressing different
parameters of the architecture, and thus of the predictors' models, while
staying close to representative workloads. To achieve this, we use
Polybench~\cite{polybench}, a C-level benchmark suite representative of
scientific computation workloads, that we combine with a variety of
optimizations, including polyhedral loop transformations.
In Section~\ref{sec:bench_harness}, we describe how \tool{} is able to
evaluate throughput predictors on this set of benchmarks by lifting their
predictions to a total number of cycles that can be compared to a hardware
counter-based measure. A
high-level view of \tool{} is shown in Figure~\ref{fig:contrib}.
In Section~\ref{sec:exp_setup}, we detail our experimental setup and assess our
methodology. In Section~\ref{sec:results_analysis}, we compare the tools' predictions and
analyze \tool's results.
In addition to statistical studies, we use \tool's results
to investigate analyzers' flaws. We show that code
analyzers do not always correctly model data dependencies through memory
accesses, substantially impacting their precision.

\section{Related works}
The static throughput analyzers studied rely on a variety of models.
\iaca~\cite{iaca}, developed by Intel and now deprecated, is closed-source and
relies on Intel's expertise on their own processors.
To guide its optimization passes, the LLVM compiler ecosystem maintains models of many
architectures. These models are used in the LLVM Machine Code Analyzer,
\llvmmca~\cite{llvm-mca}, to statically evaluate the performance of a segment
of assembly.
Independently, Abel and Reineke used an automated microbenchmark generation
approach to generate port mappings of many architectures in
\texttt{uops.info}~\cite{nanobench, uopsinfo} to model processors' backends.
This work was continued with \uica~\cite{uica}, extending this model with an
extensive frontend description.
Following a completely different approach, \ithemal~\cite{ithemal} uses a deep
neural network to predict basic block throughput. To obtain enough data to
train its model, the authors also developed \bhive~\cite{bhive}, a profiling
tool working on basic blocks.
Another static tool, \osaca~\cite{osaca2}, provides lower and
upper bounds on the execution time of a basic block. As this kind of
information cannot be fairly compared with tools yielding an exact throughput
prediction, we exclude it from our scope.
All these tools statically predict the number of cycles taken by a piece of
assembly or binary that is assumed to be the body of an infinite ---~or
sufficiently large~--- loop in steady state, all its data being L1-resident. As
discussed by \uica's authors~\cite{uica}, hypotheses can subtly vary between
analyzers; \eg{} by assuming that the loop is either unrolled or has control
instructions, with non-negligible impact. Some of the tools, \eg{} \ithemal,
necessarily work on a single basic block, while some others, \eg{} \iaca, work
on a section of code delimited by markers. However, even in the second case,
the code is assumed to be \emph{straight-line code}: branch instructions, if
any, are assumed not taken.
\smallskip
Throughput prediction tools, however, are not all static.
\gus~\cite{phd:gruber} dynamically predicts the throughput of a whole program
region, instrumenting it to retrieve the exact events occurring through its
execution. This way, \gus{} can more finely detect bottlenecks by
sensitivity analysis, at the cost of a significantly longer run time.
\smallskip
The \bhive{} profiler~\cite{bhive} takes another approach to basic block
throughput measurement: by mapping memory at any address accessed by a basic
block, it can effectively run and measure arbitrary code without context, often
---~but not always, as we discuss later~--- yielding good results.
\smallskip
The \anica{} framework~\cite{anica} also attempts to evaluate throughput
predictors by finding examples on which they are inaccurate. \anica{} starts
with randomly generated assembly snippets, and refines them through a process
derived from abstract interpretation to reach general categories of problems.

\section{Generating microbenchmarks}\label{sec:bench_gen}
Our framework aims to generate \emph{microbenchmarks} relevant to a specific
domain.
A microbenchmark is a piece of code that is as simple as possible while still
exposing the behaviour under consideration.
The specified computations should be representative of the considered domain,
and at the same time they should stress the different aspects of the
target architecture ---~which is modeled by code analyzers.
In practice, a microbenchmark's \textit{computational kernel} is a simple
\texttt{for} loop, whose
body contains no loops and whose bounds are statically known.
A \emph{measure} is a number of repetitions $n$ of this computational
kernel, $n$ being a user-specified parameter.
The measure may be repeated an arbitrary number of times to improve
stability.
Furthermore, such a microbenchmark should be a function whose computation
happens without leaving the L1 cache.
This requirement helps measurements and analyses to be
undisturbed by memory accesses, but it is also a matter of comparability.
Indeed, most of the static analyzers make the assumption that the code under
consideration is L1-resident; if it is not, their results are meaningless and
cannot be compared with an actual measurement.
The generation of such microbenchmarks is achieved through four distinct
components, whose parameter variations are specified in configuration files:
a benchmark suite, C-to-C loop nest optimizers, a constraining utility
and a C-to-binary compiler.
\subsection{Benchmark suite}\label{ssec:bench_suite}
Our first component is an initial set of benchmarks which materializes
the human expertise we intend to exploit for the generation of relevant codes.
The considered suite must embed computation kernels
delimited by ad-hoc \texttt{\#pragma}s,
whose arrays are accessed
directly (no indirections) and whose loops are affine.
These constraints are necessary to ensure that the microkernelification phase,
presented below, generates segfault-free code.
In this case, we use Polybench~\cite{polybench}, a suite of 30
benchmarks for polyhedral compilation ---~of which we keep only 26. The
\texttt{nussinov}, \texttt{ludcmp} and \texttt{deriche} benchmarks are
removed because they are incompatible with PoCC (introduced below). The
\texttt{lu} benchmark is left out as its execution alone takes longer than all
others together, making its dynamic analysis (\eg{} with \gus) impractical.
Beyond the importance of linear algebra within
this suite, one of its important features is that it does not include computational
kernels with conditional control flow (\eg{} \texttt{if-then-else})
---~however, it does include conditional data flow, using the ternary
conditional operator of C.
\subsection{C-to-C loop nest optimizers}\label{ssec:loop_nest_optimizer}
Loop nest optimizers transform an initial benchmark in different ways (generating different
\textit{versions} of the same benchmark), varying the stress on the
resources of the target architecture, and by extension on the models on which the
static analyzers are based.
In this case, we chose to use the
\textsc{Pluto}~\cite{pluto} and PoCC~\cite{pocc} polyhedral compilers to easily access common loop nest optimizations: register tiling, tiling,
skewing, vectorization/simdization, loop unrolling, loop permutation and
loop fusion.
These transformations are meant to maximize variety within the initial
benchmark suite. Eventually, the generated benchmarks are expected to
highlight the impact on performance of the resulting behaviours.
For instance, \textit{skewing} introduces non-trivial pointer arithmetic,
increasing the pressure on address computation units; \textit{loop unrolling},
among many things, opens the way to register promotion, which exposes dependencies
and relieves load-store units;
\textit{vectorization} stresses SIMD units and decreases
pressure on the front-end; and so on.
\subsection{Constraining utility}\label{ssec:kernelify}
A constraining utility transforms the code in order to respect an arbitrary number of non-functional
properties.
In this case, we apply a pass of \emph{microkernelification}: we
extract a computational kernel from the arbitrarily deep and arbitrarily
long loop nest generated by the previous component.
The loop chosen to form the microkernel is the one considered to be
the \textit{hottest}, the \textit{hotness} of a loop being obtained by
multiplying the number of arithmetic operations it contains by the number of
times it is iterated. This metric allows us to prioritize the parts of the
code that have the greatest impact on performance.
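Written out (our notation), with $\operatorname{ops}(\ell)$ the number of
arithmetic operations contained in a loop $\ell$ and $\operatorname{iter}(\ell)$
the number of times it is iterated:
\[
\operatorname{hotness}(\ell) = \operatorname{ops}(\ell) \times \operatorname{iter}(\ell)
\]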
At this point, the resulting code can
compute a different result from the initial code;
for instance, the composition of tiling and
kernelification reduces the number of loop iterations.
Indeed, our framework is not meant to preserve the
functional semantics of the benchmarks.
Our goal is only to generate codes that are relevant from the point of view of
performance analysis.
\subsection{C-to-binary compiler}\label{ssec:compile}
A C-to-binary compiler varies binary optimization options by
enabling/disabling auto-vectorization, extended instruction
sets, \textit{etc}. We use \texttt{gcc}.
\bigskip
Eventually, the relevance of the set of microbenchmarks generated using this approach
derives not only from the initial benchmark suite and the relevance of the
transformations chosen at each
stage, but also from the combinatorial explosion generated by the composition
of the four stages. In our experimental setup, this yields up to 144
microbenchmarks per benchmark of the original suite.

\section{Benchmarking harness}\label{sec:bench_harness}
To compare full-kernel cycle measurements to throughput predictions on
individual basic blocks, we lift predictions by adding the weighted basic block
predictions:
\[
\text{lifted\_pred}(\mathcal{K}) =
\sum_{b \in \operatorname{BBs}(\mathcal{K})}
\operatorname{occurrences}(b) \times \operatorname{pred}(b)
\]
Our benchmarking harness works in three successive stages. It first
extracts the basic blocks constituting a computation kernel and instruments the
kernel to retrieve their respective occurrence counts in the original context. It then runs
all the studied tools on each basic block, while also running measures on the
whole computation kernel. Finally, the block-level results are lifted to
kernel-level results thanks to the occurrences previously measured.
\subsection{Basic block extraction}\label{ssec:bb_extr}
Using the Capstone disassembler~\cite{tool:capstone}, we split the assembly
code at each control flow instruction (jump, call, return, \ldots) and at each
jump target.
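As an illustration, a minimal sketch of this splitting step using Capstone's C
API is given below (our own code, not the actual harness; for brevity it only
opens a new block after each control flow instruction and omits the additional
splits at jump targets):
\begin{lstlisting}[language={[ANSI]C}]
#include <inttypes.h>
#include <stdio.h>
#include <capstone/capstone.h>

/* Print the start address of each basic block. */
void split_blocks(const uint8_t *code, size_t size,
                  uint64_t addr)
{
    csh h;
    cs_insn *insn;
    if (cs_open(CS_ARCH_X86, CS_MODE_64, &h) != CS_ERR_OK)
        return;
    /* detail mode is needed to query insn groups */
    cs_option(h, CS_OPT_DETAIL, CS_OPT_ON);
    size_t n = cs_disasm(h, code, size, addr, 0, &insn);
    printf("block at 0x%" PRIx64 "\n", addr);
    for (size_t i = 0; i < n; i++) {
        int cf = cs_insn_group(h, &insn[i], CS_GRP_JUMP)
              || cs_insn_group(h, &insn[i], CS_GRP_CALL)
              || cs_insn_group(h, &insn[i], CS_GRP_RET);
        if (cf && i + 1 < n)
            printf("block at 0x%" PRIx64 "\n",
                   insn[i + 1].address);
    }
    if (n > 0)
        cs_free(insn, n);
    cs_close(&h);
}
\end{lstlisting}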
To accurately obtain the number of occurrences of each basic block in the whole
kernel's computation,
we then instrument the kernel with \texttt{gdb}, placing a breakpoint
at the first instruction of each basic block and counting how many times each
breakpoint is hit between two calls to the \perf{} counters\footnote{We
assume the program under analysis to be deterministic.}. While this
instrumentation takes about 50 to 100$\times$ longer than a regular run, it
can safely be run in parallel, as the performance results are discarded.
\subsection{Throughput predictions and measures}\label{ssec:throughput_pred_meas}
The harness leverages a variety of tools: actual CPU measurement; the \bhive{}
basic block profiler~\cite{bhive}; \llvmmca~\cite{llvm-mca}, \uica~\cite{uica}
and \iaca~\cite{iaca}, which leverage microarchitectural
models to predict a block's throughput; \ithemal~\cite{ithemal}, a machine
learning model; and \gus~\cite{phd:gruber}, a dynamic analyzer based on \qemu{}
that works at the whole binary level.
The execution time of the full kernel is measured using Linux
\perf~\cite{tool:perf} CPU counters around the full computation kernel. The
measure is repeated four times and the smallest is kept; this ensures that the
cache is warm and compensates for context switching or other measurement
artifacts. \gus{} instruments the whole function body. The other tools included
all work at basic block level; these are run on each basic block of each
benchmark.
We emphasize the importance, throughout the whole evaluation chain, of keeping the
exact same assembled binary. Indeed, recompiling the kernel from source
\emph{cannot} be assumed to produce the same assembly kernel. This is even more
important in the presence of slight changes: for instance, inserting \iaca{}
markers at the C-level ---~as is intended~--- around the kernel \emph{might}
change the compiled kernel, if only through alignment regions. We argue that, in
the case of \iaca{} markers, the problem is even more critical: as the
markers overwrite registers with arbitrary values, they prevent the binary from
being run at all. This forces a user to run and measure a version different from
the analyzed one. In our harness, we circumvent this issue by adding markers
directly at the assembly level, editing the already compiled version. Our
\texttt{gdb} instrumentation procedure also respects this principle of
single-compilation. As \qemu{} breaks the \perf{} interface, we have to run
\gus{} with a preloaded stub shared library to be able to instrument binaries
containing calls to \perf.
\subsection{Prediction lifting and filtering}\label{ssec:harness_lifting}
We finally lift single basic block predictions to a whole-kernel cycle
prediction by summing the block-level results, weighted by the occurrences of
the basic block in the original context (formula above). If an analyzer fails
on one of the basic blocks of a benchmark, the whole benchmark is discarded for
this analyzer.
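For concreteness, this lifting and failure handling can be sketched as follows
(our own illustrative code, not the actual harness; a \texttt{NaN} prediction
marks a block on which the analyzer failed):
\begin{lstlisting}[language={[ANSI]C}]
#include <math.h>
#include <stddef.h>

/* Lift per-block predictions (in cycles) to a
 * whole-kernel cycle count, weighting each block by
 * its measured occurrence count. Returns NAN if the
 * analyzer failed on any block, in which case the
 * benchmark is discarded for this analyzer. */
double lift_prediction(const double *pred,
                       const unsigned long *occurrences,
                       size_t n_blocks)
{
    double total = 0.0;
    for (size_t i = 0; i < n_blocks; i++) {
        if (isnan(pred[i]))
            return NAN;
        total += (double)occurrences[i] * pred[i];
    }
    return total;
}
\end{lstlisting}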
In the presence of complex control flow, \eg{} with conditionals inside loops,
our approach based on basic block occurrences is arguably less precise than an
approach based on path occurrences, as we have less information available
---~for instance, whether a branch is taken with a regular pattern, or whether we
have constraints on register values, etc. We nevertheless chose this block-based
approach, as most throughput prediction tools work at the basic block level, and are
thus readily available and can be directly plugged into our harness.
Finally, we control the proportion of cache misses in the program's execution
using \texttt{Cachegrind}~\cite{valgrind} and \gus; programs that have more
than 15\,\% of cache misses on a warm cache are not considered L1-resident and
are discarded.

\section{Experimental setup and evaluation}\label{sec:exp_setup}
Running the harness described above provides us with 3,500
benchmarks ---~after filtering out non-L1-resident
benchmarks~---, on which each throughput predictor is run. We make the full
output of our tool available in our artifact. Before analyzing these results in
Section~\ref{sec:results_analysis}, we evaluate the relevance of the
methodology presented in Section~\ref{sec:bench_harness} to make the tools'
predictions comparable to baseline hardware counter measures.
\subsection{Experimental environment}
The experiments presented in this paper were all carried out on a Dell PowerEdge
C6420 machine from the Grid5000 cluster~\cite{grid5000}, equipped with 192\,GB
of DDR4 SDRAM ---~only a small fraction of which was used~--- and two Intel
Xeon Gold 6130 CPUs (x86-64, Skylake microarchitecture) with 16 cores each.
The experiments themselves were run inside a Docker environment very close to
our artifact, based on Debian Bullseye. Care was taken to disable
hyperthreading to improve measurement stability. For tools whose output is
based on a direct measurement (\perf, \bhive), the benchmarks were run
sequentially on a single core, with no experiment running on the other cores. No
such care was taken for \gus{} since, although based on a dynamic run, its
prediction is purely a function of recorded program events and not of timing
measurements. All other tools were run in parallel.
We use \llvmmca{} \texttt{v13.0.1}, \iaca{} \texttt{v3.0-28-g1ba2cbb}, \bhive{}
at commit \texttt{5f1d500}, \uica{} at commit \texttt{9cbbe93}, \gus{} at
commit \texttt{87463c9}, \ithemal{} at commit \texttt{b3c39a8}.
\subsection{Comparability of the results}
We define the relative error of a time prediction
$C_\text{pred}$ (in cycles) with respect to a baseline $C_\text{baseline}$ as
\[
\operatorname{err} = \frac{\left| C_\text{pred} - C_\text{baseline}
\right|}{C_\text{baseline}}
\]
We assess the comparability of whole-benchmark measurements, obtained with \perf{}, to
lifted block-based results by studying the statistical distribution of the
relative error of two series: the predictions made by \bhive, and the series of
the best block-based prediction for each benchmark.
We single out \bhive{} as it is the only tool able to \textit{measure}
---~instead of predicting~--- an isolated basic block's timing. This, however, is
not sufficient: as discussed later in Section~\ref{ssec:bhive_errors}, \bhive{}
is not able to yield a result for about $40\,\%$ of the benchmarks, and is
subject to large errors in some cases. For this reason, we also consider, for
each benchmark, the best block-based prediction: we argue that if, for most
benchmarks, at least one of these predictors is able to yield a satisfactorily
accurate result, then the lifting methodology is sound in practice.
The result of this analysis is presented in Table~\ref{table:exp_comparability}
and in Figure~\ref{fig:exp_comparability}. The results are within the range
commonly reported in the field: see \eg{}~\cite{uica}, which reports a
Mean Absolute Percentage Error (MAPE, corresponding to the
``Average'' row) of about 10--15\,\% in many cases. While lifted \bhive's
average error is driven high by large errors on certain benchmarks,
investigated later in this article, its median error is still comparable to the
errors of state-of-the-art tools. From this, we conclude that lifted cycle
measures and predictions are consistent with whole-benchmark measures; and
consequently, lifted predictions can reasonably be compared to one another.
\begin{figure}
\centering
\includegraphics[width=\linewidth]{figs/results_comparability_hist.pdf}
\caption{Relative error distribution \wrt{} \perf}\label{fig:exp_comparability}
\end{figure}
\begin{table}
\centering
\caption{Relative error statistics \wrt{} \perf}\label{table:exp_comparability}
\begin{tabular}{l r r}
\toprule
& \textbf{Best block-based} & \textbf{BHive} \\
\midrule
Datapoints & 3500 & 2198 \\
Errors & 0 & 1302 \\
& (0\,\%) & (37.20\,\%) \\
Average (\%) & 11.60 & 27.95 \\
Median (\%) & 5.81 & 7.78 \\
Q1 (\%) & 1.99 & 3.01 \\
Q3 (\%) & 15.41 & 23.01 \\
\bottomrule
\end{tabular}
\end{table}
\begin{table*}[!htbp]
\centering
\caption{Bottleneck reports from the studied tools}\label{table:coverage}
\begin{tabular}{l | r r r | r r r | r r r}
\toprule
& \multicolumn{3}{c|}{\textbf{Frontend}}
& \multicolumn{3}{c|}{\textbf{Ports}}
& \multicolumn{3}{c}{\textbf{Dependencies}} \\
& \textbf{yes} & \textbf{no} & \textbf{disagr.} & \textbf{yes} & \textbf{no} & \textbf{disagr.} & \textbf{yes} & \textbf{no} & \textbf{disagr.} \\
\midrule
2mm & 34 & 61 & 25.8 \% & 25 & 13 & 70.3 \% & 18 & 29 & 63.3 \% \\
3mm & 44 & 61 & 18.0 \% & 30 & 13 & 66.4 \% & 23 & 37 & 53.1 \% \\
atax & 13 & 72 & 41.0 \% & 25 & 17 & 70.8 \% & 23 & 30 & 63.2 \% \\
bicg & 19 & 59 & 45.8 \% & 25 & 25 & 65.3 \% & 21 & 37 & 59.7 \% \\
doitgen & 51 & 25 & 40.6 \% & 36 & 30 & 48.4 \% & 17 & 22 & 69.5 \% \\
mvt & 27 & 53 & 33.3 \% & 9 & 18 & 77.5 \% & 7 & 32 & 67.5 \% \\
gemver & 62 & 13 & 39.5 \% & 2 & 48 & 59.7 \% & 1 & 28 & 76.6 \% \\
gesummv & 16 & 69 & 41.0 \% & 17 & 23 & 72.2 \% & 24 & 28 & 63.9 \% \\
syr2k & 51 & 37 & 38.9 \% & 8 & 42 & 65.3 \% & 19 & 34 & 63.2 \% \\
trmm & 69 & 27 & 25.0 \% & 16 & 30 & 64.1 \% & 15 & 30 & 64.8 \% \\
symm & 0 & 121 & 11.0 \% & 5 & 20 & 81.6 \% & 9 & 5 & 89.7 \% \\
syrk & 54 & 46 & 30.6 \% & 12 & 42 & 62.5 \% & 20 & 48 & 52.8 \% \\
gemm & 42 & 41 & 42.4 \% & 30 & 41 & 50.7 \% & 16 & 57 & 49.3 \% \\
gramschmidt & 48 & 52 & 21.9 \% & 16 & 20 & 71.9 \% & 24 & 39 & 50.8 \% \\
cholesky & 24 & 72 & 33.3 \% & 0 & 19 & 86.8 \% & 5 & 14 & 86.8 \% \\
durbin & 49 & 52 & 29.9 \% & 0 & 65 & 54.9 \% & 2 & 39 & 71.5 \% \\
trisolv & 53 & 84 & 4.9 \% & 6 & 22 & 80.6 \% & 4 & 16 & 86.1 \% \\
jacobi-1d & 18 & 78 & 33.3 \% & 66 & 9 & 47.9 \% & 0 & 13 & 91.0 \% \\
heat-3d & 32 & 8 & 72.2 \% & 26 & 0 & 81.9 \% & 0 & 0 & 100.0 \% \\
seidel-2d & 0 & 112 & 22.2 \% & 32 & 0 & 77.8 \% & 0 & 0 & 100.0 \% \\
fdtd-2d & 52 & 22 & 47.1 \% & 20 & 41 & 56.4 \% & 0 & 40 & 71.4 \% \\
jacobi-2d & 6 & 31 & 73.6 \% & 24 & 61 & 39.3 \% & 0 & 44 & 68.6 \% \\
adi & 12 & 76 & 21.4 \% & 40 & 0 & 64.3 \% & 0 & 0 & 100.0 \% \\
correlation & 18 & 36 & 51.8 \% & 19 & 30 & 56.2 \% & 23 & 45 & 39.3 \% \\
covariance & 39 & 36 & 37.5 \% & 4 & 34 & 68.3 \% & 19 & 53 & 40.0 \% \\
floyd-warshall & 74 & 16 & 29.7 \% & 16 & 24 & 68.8 \% & 20 & 8 & 78.1 \% \\
\textbf{Total} & 907 & 1360 & 35.2 \% & 509 & 687 & 65.8 \% & 310 & 728 & 70.3 \% \\
\bottomrule
\end{tabular}
\end{table*}
\subsection{Relevance and representativity (bottleneck
analysis)}\label{ssec:bottleneck_diversity}
The results provided by our harness are only relevant to evaluate the parts of
the tools' models that are stressed by the benchmarks generated; it is hence
critical that our benchmark generation procedure in Section~\ref{sec:bench_gen}
yields diverse results. This should be true by construction, as the various
polyhedral compilation techniques used stress different parts of the
microarchitecture.
To assess this, we study the generated benchmarks' bottlenecks, \ie{}
architectural resources on which a release of pressure improves execution time.
Note that a saturated resource is not necessarily a bottleneck: a code that
uses \eg{} 100\,\% of the available arithmetic units for computations outside
of the critical path, at a point where a chain of dependencies is blocking,
will not run faster if those arithmetic operations are removed; hence, hardware
counters alone are not sufficient to find bottlenecks.
However, some static analyzers report the bottlenecks they detect. To unify
their results and keep things simple, we study three general kinds of
bottlenecks.
\begin{itemize}
\item{} \emph{Frontend:} the CPU's frontend is not able to issue
micro-operations to the backend fast enough. \iaca{} and \uica{} are
able to detect this.
\item{} \emph{Ports:} at least one of the backend ports has too much work;
reducing its pressure would accelerate the computation.
\llvmmca, \iaca{} and \uica{} are able to detect this.
\item{} \emph{Dependencies:} there is a chain of data dependencies slowing
down the computation.
\llvmmca, \iaca{} and \uica{} are able to detect this.
\end{itemize}
For each source benchmark from Polybench and each type of bottleneck, we report
in Table~\ref{table:coverage} the number of derived benchmarks on which all the
tools agree that the bottleneck is present or absent. We also report the
proportion of cases in which the tools failed to agree. We analyze those
results later in Section~\ref{ssec:bottleneck_pred_analysis}.
As we have no source of truth indicating whether a bottleneck is effectively
present in a microbenchmark, we adopt a conservative approach, and consider
only the subset of the microbenchmarks on which the tools agree on the status
of all three resources; for those, we have good confidence in the bottlenecks
reported. Obviously, this approach is limited, because it excludes
microbenchmarks that might be worth considering, and is most probably subject
to selection bias.
Of the 3,500 microbenchmarks we have generated, 261 (7.5\,\%) are the subject
of the above-mentioned consensus. This sample is made up of microbenchmarks
generated from 21 benchmarks ---~\ie{} for 5 benchmarks, none of the derived
microbenchmarks reached a consensus among the tools~---, yielding a wide
variety of calculations, including floating-point arithmetic, pointer
arithmetic or Boolean arithmetic. Of these, 200 (76.6\,\%) are bottlenecked on
the CPU front-end, 19 (7.3\,\%) on back-end ports, and 81 (31.0\,\%) on latency
introduced by dependencies (a single benchmark may exhibit several bottlenecks
at once). As mentioned above, this distribution
probably does not reflect the distribution among the 3,500 original
benchmarks, as the 261 were not uniformly sampled. However, we argue that, as
all categories are represented in the sample, the initial hypothesis that the
generated benchmarks are diverse and representative is supported ---~thanks to
the transformations described in Section~\ref{sec:bench_gen}.
\subsection{Carbon footprint}
Generating and running the full suite of benchmarks required about 30\,h of
continuous computation on a single machine. During the experiments, the power
supply units reported a near-constant consumption of about 350\,W. The carbon
intensity of the power grid for the region where the experiment was run, at the
time of the experiments, was about 29\,g\coeq/kWh~\cite{electricitymaps}.
The electricity consumed directly by the server thus amounts to about
10.50\,kWh. Assuming a Power Usage Efficiency of 1.5, the total electricity
consumption roughly amounts to 15.75\,kWh, or about 450\,g\coeq.
A carbon footprint estimate of the machine's manufacture itself was conducted
by the manufacturer~\cite{poweredgeC6420lca}. Additionally accounting for the
extra 160\,GB of DDR4 SDRAM~\cite{meta_ACT}, the hardware manufacturing,
transport and end-of-life is evaluated to 1,266\,kg\coeq. In 2023, this
computation cluster's usage rate was 35\,\%. Assuming 6 years of product life,
30\,h of usage represents about 2,050\,g\coeq{}. The whole experiment thus amounts to
2.5\,kg\coeq.
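Making the intermediate arithmetic explicit (our own reconstruction, using a
365.25-day year): $350 \times 30 = 10{,}500$\,Wh; $10.5 \times 1.5 \times 29
\approx 457$\,g\coeq; $1{,}266{,}000 \times \frac{30}{0.35 \times 6 \times
8{,}766} \approx 2{,}063$\,g\coeq; and $0.46 + 2.06 \approx 2.5$\,kg\coeq.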

\section{Results analysis}\label{sec:results_analysis}
The raw complete output from our benchmarking harness ---~roughly speaking, a
large table with, for each benchmark, a cycle measurement, cycle count for each
throughput analyzer, the resulting relative error, and a synthesis of the
bottlenecks reported by each tool~--- enables many analyses that, we believe,
could be useful both to throughput analysis tool developers and to their users. Tool
designers can gain insight into their tool's strengths and weaknesses, and
work towards improving it with a clearer vision. Users can gain a better
understanding of which tool is more suited for each situation.
\subsection{Throughput results}\label{ssec:overall_results}
\begin{table*}
\centering
\caption{Statistical analysis of overall results}\label{table:overall_analysis_stats}
\begin{tabular}{l r r r r r r r r r}
\toprule
\textbf{Bencher} & \textbf{Datapoints} & \textbf{Failures} & \textbf{(\%)} &
\textbf{MAPE} & \textbf{Median} & \textbf{Q1} & \textbf{Q3} & \textbf{K. tau} & \textbf{Time (CPU$\cdot$h)}\\
\midrule
BHive & 2198 & 1302 & (37.20\,\%) & 27.95\,\% & 7.78\,\% & 3.01\,\% & 23.01\,\% & 0.81 & 1.37\\
llvm-mca & 3500 & 0 & (0.00\,\%) & 36.71\,\% & 27.80\,\% & 12.92\,\% & 59.80\,\% & 0.57 & 0.96 \\
UiCA & 3500 & 0 & (0.00\,\%) & 29.59\,\% & 18.26\,\% & 7.11\,\% & 52.99\,\% & 0.58 & 2.12 \\
Ithemal & 3500 & 0 & (0.00\,\%) & 57.04\,\% & 48.70\,\% & 22.92\,\% & 75.69\,\% & 0.39 & 0.38 \\
Iaca & 3500 & 0 & (0.00\,\%) & 30.23\,\% & 18.51\,\% & 7.13\,\% & 57.18\,\% & 0.59 & 1.31 \\
Gus & 3500 & 0 & (0.00\,\%) & 20.37\,\% & 15.01\,\% & 7.82\,\% & 30.59\,\% & 0.82 & 188.04 \\
\bottomrule
\end{tabular}
\end{table*}
The distribution of relative errors, for each tool, is presented as a
box plot in Figure~\ref{fig:overall_analysis_boxplot}. Statistical indicators
are also given in Table~\ref{table:overall_analysis_stats}. We also give, for
each tool, its Kendall's tau indicator~\cite{kendall1938tau}: this indicator,
used to evaluate \eg{} \uica~\cite{uica} and Palmed~\cite{palmed}, measures how
well the pair-wise ordering of benchmarks is preserved, $-1$ being a full
anti-correlation and $1$ a full correlation. This is especially useful when one
is not interested in a program's absolute throughput, but rather in comparing
which program has a better throughput.
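For reference, in its simplest form (ignoring tie corrections, which
implementations may apply), Kendall's $\tau$ over $n$ benchmarks compares every
pair of benchmarks under both orderings:
\[
\tau = \frac{\#\{\text{concordant pairs}\} - \#\{\text{discordant pairs}\}}{n(n-1)/2}
\]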
\begin{figure}
\includegraphics[width=\linewidth]{figs/overall_analysis_boxplot.pdf}
\caption{Statistical distribution of relative errors}\label{fig:overall_analysis_boxplot}
\end{figure}
These results are, overall, significantly worse than what each tool's article
presents. We attribute this difference mostly to the specificities of
Polybench: being composed of computation kernels, it intrinsically stresses the
CPU more than basic blocks extracted from the SPEC benchmark suite. This
difference is clearly reflected in the experimental section of the Palmed
article~\cite{palmed}: the accuracy of most tools is worse on Polybench than on
SPEC, often by more than a factor of two.
As \bhive{} and \ithemal{} do not support control flow instructions
(\eg{} \texttt{jump} instructions), those had
to be removed from the blocks before analysis. While none of these tools, apart
from \gus{} ---~which is dynamic~---, is able to account for branching costs,
these two analyzers are additionally unable to account for the front- and back-end
cost of the control flow instructions themselves ---~corresponding to the
$TP_U$ mode introduced by \uica~\cite{uica}, while the others
measure $TP_L$.
\subsection{Understanding \bhive's results}\label{ssec:bhive_errors}
The error distribution of \bhive{} against \perf{}, plotted on the right of
Figure~\ref{fig:exp_comparability}, reveals irregularities in \bhive's
results. Since \bhive{} is based on measures ---~instead of predictions~---
through hardware counters, one would expect excellent accuracy. Its lack of
support for control flow instructions can account for a portion of
this accuracy drop; our lifting method, based on block occurrences instead of
paths, can explain another portion. We also find that \bhive{} fails to produce
a result for about 40\,\% of the kernels explored ---~which means that, for those
cases, \bhive{} failed to produce a result on at least one of the constituent
basic blocks. This is due to the difficulties, mentioned in
Section~\ref{sec:intro}, related to the need to reconstruct the context of each
basic block \textit{ex nihilo}.
The basis of \bhive's method is to run the code to be measured, unrolled a
number of times depending on the code size, with all memory pages but the
code unmapped. As the code tries to access memory, it will raise segfaults,
caught by \bhive's harness, which allocates a single shared-memory page, filled
with a repeated constant, that it will map wherever segfaults occur before
restarting the program.
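A heavily simplified sketch of this mechanism is given below (our own
illustration, not \bhive's actual code, which maps a single shared page and
restarts the measurement after each fault):
\begin{lstlisting}[language={[ANSI]C}]
#include <signal.h>
#include <string.h>
#include <stdint.h>
#include <sys/mman.h>

#define PAGE 4096

/* Map a constant-filled page over whichever page
 * just faulted; 0x23 stands in for the repeated
 * constant used to fill the page. */
static void on_segv(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)ctx;
    void *page = (void *)((uintptr_t)si->si_addr
                          & ~(uintptr_t)(PAGE - 1));
    void *p = mmap(page, PAGE, PROT_READ | PROT_WRITE,
                   MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS,
                   -1, 0);
    if (p != MAP_FAILED)
        memset(p, 0x23, PAGE);
    /* if mmap fails (e.g. below mmap_min_addr),
     * the measurement has to be aborted */
}

void install_segv_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);
}
\end{lstlisting}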
The main causes of \bhive{} failure are bad code behaviour (\eg{} control flow
not reaching the exit point of the measure if a bad jump is inserted), too many
segfaults to be handled, or a segfault that occurs even after mapping a page at
the problematic address.
The registers are also initialized, at the beginning of the measurement, to the
fixed constant \texttt{0x2324000}. We show through two examples that this
initial value can be of crucial importance.
The following experiments are executed on an Intel(R) Xeon(R) Gold 6230R CPU
(Cascade Lake), with hyperthreading disabled.
\paragraph{Imprecise analysis} We consider the following x86-64 kernel.
\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
vmulsd (%rax), %xmm3, %xmm0
vmovsd %xmm0, (%r10)
\end{lstlisting}
\end{minipage}
When executed with all the general purpose registers initialized to the default
constant, \bhive{} reports 9 cycles per iteration, since \reg{rax} and
\reg{r10} hold the same value, inducing a read-after-write dependency between
the two instructions. If, however, \bhive{} is tweaked to initialize \reg{r10}
to a value that aliases (\wrt{} physical addresses) with the value in
\reg{rax}, \eg{} between \texttt{0x10000} and \texttt{0x10007} (inclusive), it
reports 19 cycles per iteration instead; while a value between \texttt{0x10008}
and \texttt{0x1009f} (inclusive) yields the expected 1 cycle ---~except for
values in \texttt{0x10039}-\texttt{0x1003f} and
\texttt{0x10079}-\texttt{0x1007f}, yielding 2 cycles as the store crosses a
cache line boundary.
In the same way, the value used to initialize the shared memory page can
influence the results whenever it gets loaded into registers.
\vspace{0.5em}
\paragraph{Failed analysis} Some memory accesses will always result in an
error; for instance, it is impossible to \texttt{mmap} at an address lower
than \texttt{/proc/sys/vm/mmap\_min\_addr}, defaulting to \texttt{0x10000}. Thus,
with equal initial values for all registers, the following kernel would fail,
since the second operation attempts to load at address 0:
\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
subq %r11, %r10
movq (%r10), %rax
\end{lstlisting}
\end{minipage}
Such errors can occur in more convoluted ways. The following x86-64 kernel,
for instance, is extracted from a version of the \texttt{durbin}
kernel\footnote{\texttt{durbin.pocc.noopt.default.unroll8.MEDIUM.kernel21.s}
in the full results}.
\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
vmovsd 0x10(%r8, %rcx), %xmm6
subl %eax, %esi
movslq %esi, %rsi
vfmadd231sd -8(%r9, %rsi, 8), \
%xmm6, %xmm0
\end{lstlisting}
\end{minipage}
Here, \bhive{} fails to measure the kernel when run with the general purpose
registers initialized to the default constant at the 2\textsuperscript{nd}
occurrence of the unrolled loop body, failing to recover from an error at the
\texttt{vfmadd231sd} instruction with the \texttt{mmap} strategy. Indeed, after
the first iteration the value in \reg{rsi} becomes zero, then negative at the
second iteration; thus, the second occurrence of the last instruction fetches
at address \texttt{0xfffffffff0a03ff8}, which is in kernel space. This
microkernel can be benchmarked with \bhive{} \eg{} by initializing \reg{rax} to 1.
Some other microkernels fail in a similar way when trying to access addresses
that are not in the \emph{canonical form} virtual address space for x86-64 with
48-bit virtual addresses, as defined in Section~3.3.7.1 of Intel's Software
Developer's Manual~\cite{ref:intel64_software_dev_reference_vol1} and
Section~5.3.1 of the AMD64 Architecture Programmer's
Manual~\cite{ref:amd64_architecture_dev_reference_vol2}. Others still fail with
accesses relative to the instruction pointer, as \bhive{} read-protects the
unrolled microkernel's instructions page.
\subsection{Bottleneck prediction}\label{ssec:bottleneck_pred_analysis}
As introduced in Section~\ref{ssec:bottleneck_diversity}, some of
the tools studied are also able to report suspected bottlenecks for the
evaluated program; these reports are summarized in Table~\ref{table:coverage}.
This feature might be even more useful than raw throughput predictions to the
users of these tools willing to optimize their program, as it strongly hints
towards what needs to be enhanced.
In the majority of the cases studied, the tools are not able to agree on the
presence or absence of a type of bottleneck. Although it might seem that the
tools are performing better on frontend bottleneck detection, it must be
recalled that only two tools (versus three in the other cases) are reporting
frontend bottlenecks, thus making it easier for them to agree.
\begin{table}
\centering
\caption{Diverging bottleneck prediction per tool}\label{table:bottleneck_diverging_pred}
\begin{tabular}{l r r r r}
\toprule
\textbf{Tool}
& \multicolumn{2}{c}{\textbf{Ports}}
& \multicolumn{2}{c}{\textbf{Dependencies}} \\
\midrule
\llvmmca{} & 567 & (24.6 \%) & 1032 & (41.9 \%) \\
\uica{} & 516 & (22.4 \%) & 530 & (21.5 \%) \\
\iaca{} & 1221 & (53.0 \%) & 900 & (36.6 \%) \\
\bottomrule
\end{tabular}
\end{table}
Table~\ref{table:bottleneck_diverging_pred}, in turn, breaks down the cases
on which the three tools disagree into the number of times each tool makes a
diverging prediction ---~\ie{} predicts differently from the other
two. In the case of ports, \iaca{} is responsible for half of the
divergences ---~which is not sufficient to conclude that the prediction of the
other tools is correct. In the case of dependencies, however, there is no clear
outlier, even though \uica{} seems to fare better than others.
In no case does a single tool seem to be responsible for the vast majority of
disagreements, which would have hinted towards it failing to predict this
bottleneck correctly. In the absence of a source of truth indicating whether a bottleneck
is effectively present, and with no clear-cut result for (a subset of) tool
predictions, we cannot conclude on the quality of the predictions from each
tool for each kind of bottleneck.
\subsection{Impact of dependency-boundness}\label{ssec:memlatbound}
\begin{table*}
\centering
\caption{Statistical analysis of overall results, excluding rows latency-bound
through memory-carried dependencies}\label{table:nomemdeps_stats}
\begin{tabular}{l r r r r r r r r}
\toprule
\textbf{Bencher} & \textbf{Datapoints} & \textbf{Failures} & \textbf{(\%)} &
\textbf{MAPE} & \textbf{Median} & \textbf{Q1} & \textbf{Q3} & \textbf{K. tau}\\
\midrule
BHive & 1365 & 1023 & (42.84\,\%) & 34.07\,\% & 8.62\,\% & 4.30\,\% & 24.25\,\% & 0.76\\
llvm-mca & 2388 & 0 & (0.00\,\%) & 27.06\,\% & 21.04\,\% & 9.69\,\% & 32.73\,\% & 0.79\\
UiCA & 2388 & 0 & (0.00\,\%) & 18.42\,\% & 11.96\,\% & 5.42\,\% & 23.32\,\% & 0.80\\
Ithemal & 2388 & 0 & (0.00\,\%) & 62.66\,\% & 53.84\,\% & 24.12\,\% & 81.95\,\% & 0.40\\
Iaca & 2388 & 0 & (0.00\,\%) & 17.55\,\% & 12.17\,\% & 4.64\,\% & 22.35\,\% & 0.82\\
Gus & 2388 & 0 & (0.00\,\%) & 23.18\,\% & 20.23\,\% & 8.78\,\% & 32.73\,\% & 0.83\\
\bottomrule
\end{tabular}
\end{table*}
An overview of the full results table (available in our artifact) hints towards
two main tendencies: on a significant number of rows, the static tools
---~thus leaving \gus{} and \bhive{} apart~---, except for \ithemal, often yield
comparatively bad throughput predictions \emph{together}; and many of these
rows are those using the \texttt{O1} and \texttt{O1autovect} compilation
settings (\texttt{gcc} with \texttt{-O1}, plus vectorization options for the
latter).
To confirm the first observation, we look at the 30\,\% worst benchmarks ---~in
terms of MAPE relative to \perf~--- for \llvmmca, \uica{} and \iaca{}
---~yielding 1050 rows each. These three sets share 869 rows in common (82.8\,\%), which we
call \textit{jointly bad rows}.
Among these 869 jointly bad rows, we further find that respectively 342
(39.4\,\%) and 337 (38.8\,\%) are compiled using the \texttt{O1} and
\texttt{O1autovect} settings, totalling 679 (78.1\,\%) \texttt{O1}-based rows,
against 129 (14.8\,\%) for \texttt{default} and 61 (7.0\,\%) for
\texttt{O3nosimd}. This result is significant enough to be used as a hint to
investigate the issue.
\begin{figure}
\includegraphics[width=\linewidth]{figs/nomemdeps_boxplot.pdf}
\caption{Statistical distribution of relative errors, with and without
pruning the rows latency-bound through memory-carried dependencies}\label{fig:nomemdeps_boxplot}
\end{figure}
Insofar as our approach maintains a strong link between the basic blocks studied and
the source codes from which they are extracted, it is possible to identify the
high-level characteristics of the concerned microbenchmarks.
In the overwhelming majority (97.5\,\%) of those jointly bad rows, the tools predicted
fewer cycles than measured, meaning that a bottleneck is either missed or
underestimated.
Manual investigation of a few simple benchmarks (no polyhedral transformation
applied, \texttt{O1} mode, not unrolled) further hints towards dependencies:
for instance, the \texttt{gemver} benchmark, which is \emph{not} among the
badly predicted benchmarks, has this kernel:
\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[ANSI]C}]
for(c3)
A[c1][c3] += u1[c1] * v1[c3]
+ u2[c1] * v2[c3];
\end{lstlisting}
\end{minipage}
while the \texttt{atax} benchmark, which is among the badly predicted ones, has
this kernel:
\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[ANSI]C}]
for(c3)
tmp[c1] += A[c1][c3] * x[c3];
\end{lstlisting}
\end{minipage}
The first one exhibits no obvious dependency-boundness, while the second,
accumulating on \texttt{tmp[c1]} (independent of the iteration variable), lacks
instruction-level parallelism. Among the simple benchmarks (as described
above), 8 are in the badly predicted list, all of which exhibit a
read-after-write data dependency to the preceding iteration.
Looking at the assembly code generated for those in \texttt{O1} modes, it
appears that the dependencies exhibited at the C level are compiled to
\emph{memory-carried} dependencies: the read-after-write happens for a given
memory address, instead of for a register. This kind of dependency, prone to
aliasing and dependent on the values of the registers, is hard to infer for a
static tool and is not supported by the analyzers under scrutiny in the general
case; it could thus reasonably explain the results observed.
There is no easy way, however, to know for certain which of the 3,500 benchmarks
are latency bound: no hardware counter reports this. We investigate this
further using \gus's sensitivity analysis: in addition to the ``normal''
throughput estimation of \gus, we run it a second time, disabling the
accounting of latency through memory dependencies. By construction, this second
measurement should be
either very close to the first one, or significantly below. We then assume a
benchmark to be latency bound due to memory-carried dependencies when it is at
least 40\,\% faster when this latency is disabled; there are 1112 (31.8\,\%) such
benchmarks.
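Formally, writing $C$ for the baseline \gus{} estimate of a benchmark and
$C_{\text{nolat}}$ for its estimate with memory-carried dependency latencies
disabled, we flag the benchmark as latency bound through memory-carried
dependencies when (our formalisation of ``at least 40\,\% faster''):
\[
\frac{C}{C_{\text{nolat}}} \geq 1.4
\]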
Of the 869 jointly bad rows, 745 (85.7\,\%) are declared latency
bound through memory-carried dependencies by \gus. We conclude that the main
reason for these jointly badly predicted benchmarks is that the predictors
under scrutiny failed to correctly detect these dependencies.
In Section~\ref{ssec:overall_results}, we presented in
Figure~\ref{fig:overall_analysis_boxplot} and
Table~\ref{table:overall_analysis_stats} general statistics on the tools
on the full set of benchmarks. We now remove the 1112 benchmarks
flagged as latency bound through memory-carried dependencies by \gus{} from the
dataset, and present in Figure~\ref{fig:nomemdeps_boxplot} a comparative
box plot for the tools under scrutiny. We also present in
Table~\ref{table:nomemdeps_stats} the same statistics on this pruned dataset.
While the results for \llvmmca, \uica{} and \iaca{} globally improve
significantly, the most noticeable changes are the reduced spread of the
results and the increase of the Kendall's $\tau$ correlation coefficient.
From this,
we argue that detecting memory-carried dependencies is a weak point in current
state-of-the-art static analyzers, and that their results could be
significantly more accurate if improvements are made in this direction.

\section{Conclusion and future works}
In this article, we have presented a fully-tooled approach that enables:
\begin{itemize}
\item the generation of a wide variety of microbenchmarks, reflecting both the
expertise contained in an initial benchmark suite, and the diversity of
code transformations that stress different aspects of a performance model
---~or even a measurement environment, \eg{} \bhive; and
\item the comparability of various measurements and
analyses applied to each of these microbenchmarks.
\end{itemize}
Thanks to this tooling, we were able to show the limits and strengths of
various performance models in relation to the expertise contained in the
Polybench suite. We discuss throughput results in
Section~\ref{ssec:overall_results} and bottleneck prediction in
Section~\ref{ssec:bottleneck_pred_analysis}.
We were also able to demonstrate the difficulties of reasoning at the level of
a basic block isolated from its context. We specifically study those
difficulties in the case of \bhive{} in Section~\ref{ssec:bhive_errors}.
Indeed, the actual values ---~both in registers and in memory~--- involved in a
basic block's computation determine not only its functional
properties (\ie{} the result of the calculation), but also some of its
non-functional properties (\eg{} latency, throughput).
We were also able to show in Section~\ref{ssec:memlatbound}
that state-of-the-art static analyzers struggle to
account for memory-carried dependencies; a weakness significantly impacting
their overall results on our benchmarks. We believe that detecting
and accounting for these dependencies is an important direction for future work.
Moreover, we present this work in the form of a modular software package, each
component of which exposes numerous adjustable parameters. These components can
also be replaced by others fulfilling the same abstract function: another
initial benchmark suite in place of Polybench, other loop nest
optimizers in place of \textsc{Pluto} and PoCC, other code
analyzers, and so on. This software modularity reflects the fact that our
contribution lies in the interfacing of, and communication between, distinct concerns.
\medskip
Furthermore, we believe that the contributions we made in the course of this work
may eventually be used to address different, yet neighbouring, issues.
These perspectives can also be seen as future works:
\smallskip
\paragraph{Program optimization} The whole program processing we have designed
can be used not only to evaluate the performance model underlying a static
analyzer, but also to guide program optimization itself. In such a perspective,
we would generate different versions of the same program using the
transformations discussed in Section~\ref{sec:bench_gen} and colored blue in
Figure~\ref{fig:contrib}. These different versions would then feed the
execution and measurement environment outlined in
Section~\ref{sec:bench_harness} and colored orange in Figure~\ref{fig:contrib}.
Indeed, thanks to our previous work, we know that the results of these
comparable analyses and measurements would make it possible to identify which
version is the most efficient, and even to reconstruct information indicating
why (which bottlenecks, etc.).
However, this approach would require that these different versions of the same
program are functionally equivalent, \ie{} that they compute the same
result from the same inputs; yet we saw in Section~\ref{sec:bench_gen}
that, as it stands, the transformations we apply are not concerned with
preserving the semantics of the input codes. To recover this semantic
preservation property, it suffices to abandon the kernelification pass we have
presented; this, however, would require controlling L1-residence by other means.
\smallskip
\paragraph{Dataset building} Our microbenchmark generation phase outputs a
large, diverse and representative dataset of microkernels. In addition to our
harness, we believe that such a dataset could be used to improve existing
data-dependent solutions.
%the measurement and execution environment we
%propose is not the only type of tool whose function is to process a large
%dataset (\ie{} the microbenchmarks generated earlier) to automatically
%abstract its characteristics. We can also think of:
Inductive methods, for instance in \anica, strive to preserve the properties of a basic
block through successive abstractions of the instructions it contains, so as to
draw the most general conclusions possible from a particular experiment.
Currently, \anica{} starts off from randomly generated basic blocks. This
approach guarantees a certain variety, and avoids
over-specialization, which would prevent it from finding interesting cases too
far from an initial dataset. However, it may well lead to the sample under
consideration being systematically outside the relevant area of the search
space ---~\ie{} having no relation to real-life programs or those in the user's
field.
On the other hand, machine learning methods based on neural networks, for
instance in \ithemal, seek to correlate the result of a function with the
characteristics of its input ---~in this case to correlate a throughput
prediction with the instructions making up a basic block~--- by backpropagating
the gradient of a cost function. In the case of \ithemal{}, the model is trained on
basic blocks originating from an existing benchmark suite. As opposed to random generation,
this approach offers representative samples, but comes with a risk of lack of
variety and of over-specialization.
Comparatively, our microbenchmark generation method is natively meant to
produce a representative, varied and large dataset. We believe that
enriching the dataset of the above-mentioned methods with our benchmarks might
extend their results and reach.

%% \section*{Conclusion}
%% \todo{}

\chapter{A more systematic approach to throughput prediction performance analysis}
\input{00_intro.tex}
\input{05_related_works.tex}
\input{10_bench_gen.tex}
\input{20_evaluation.tex}
\input{25_results_analysis.tex}
\input{30_future_works.tex}
\input{99_conclusion.tex}

\begin{figure*}[ht!]
\definecolor{col_bench_gen}{HTML}{5a7eff}
\definecolor{col_bench_gen_bg}{HTML}{dbeeff}
\definecolor{col_bench_harness}{HTML}{ffa673}
\definecolor{col_results}{HTML}{000000}
\centerline{
\begin{tikzpicture}[
hiddennode/.style={rectangle,draw=white, very thick, minimum size=5mm, align=center, font=\footnotesize},
normnode/.style={rectangle,draw=black, very thick, minimum size=5mm, align=center, font=\footnotesize},
resultnode/.style={rectangle,draw=col_results, fill=black!2, very thick, minimum size=5mm, align=center, font=\footnotesize},
bluenode/.style={rectangle, draw=col_bench_gen, fill=col_bench_gen_bg, very thick, minimum height=5mm, minimum width=4cm, align=center, font=\footnotesize},
rednode/.style={rectangle, draw=col_bench_harness, fill=orange!5, very thick, minimum size=5mm, align=center, font=\footnotesize},
bencher/.style={rednode, minimum width=2.5cm, minimum height=5mm},
genarrow/.style={draw=col_bench_gen},
harnarrow/.style={draw=col_bench_harness},
]
\centering
%Nodes
\node[bluenode] (bench) {Benchmark suite \figref{ssec:bench_suite}};
\node[bluenode] (pocc) [below=of bench] {Loop nest optimizers \figref{ssec:loop_nest_optimizer}};
\node[bluenode] (kernel) [below=of pocc] {Constraining utility \figref{ssec:kernelify}};
\node[bluenode] (gcc) [below=of kernel] {Compilations \figref{ssec:compile}};
\node[rednode] (gdb) [right=0.1\textwidth of gcc] {Basic block \\extraction \figref{ssec:bb_extr}};
\node[bencher] (ithemal) [right=4cm of gdb] {Ithemal};
\node[bencher] (iaca) [above=0.5em of ithemal] {IACA};
\node[bencher] (uica) [above=0.5em of iaca] {uiCA};
\node[bencher] (llvm) [above=0.5em of uica] {llvm-mca};
\node[bencher] (bhive) [above=0.5em of llvm] {BHive (measure)};
\node[rednode] (ppapi) [left=1cm of bhive] {perf (measure)};
\node[rednode] (gus) [below=0.5em of ppapi] {Gus};
%% \node[rednode] (uica) [below=of gdb] {uiCA};
\node[rednode] (lifting) [right=of bhive] {
Prediction lifting\\\figref{ssec:harness_lifting}};
\node[
draw=black,
very thick,
dotted,
fit=(ppapi) (gus) (bhive) (llvm) (uica) (iaca) (ithemal)
] (comps) {};
\node (throughput_label) [above=0.2em of comps,align=center] {
\footnotesize Throughput predictions \\\footnotesize \& measures
\figref{ssec:throughput_pred_meas}};
\node[draw=black,
very thick,
dotted,
%% label={below:\footnotesize Variations},
label={[above,xshift=1cm]\footnotesize Variations},
fit=(pocc) (kernel) (gcc)
] (vars) {};
\node[resultnode] (bench2) [below=of lifting] {Evaluation metrics \\ for
code analyzers};
% Key
\node[] (keyblue1) [below left=0.7cm and 0cm of vars] {};
\node[hiddennode] (keyblue2) [right=0.5cm of keyblue1] {Section~\ref{sec:bench_gen}: generating microbenchmarks};
\node[] (keyred1) [right=0.6cm of keyblue2] {};
\node[hiddennode] (keyred2) [right=0.5cm of keyred1] {Section~\ref{sec:bench_harness}: benchmarking harness};
\node[] (keyresult1) [right=0.6cm of keyred2] {};
\node[hiddennode] (keyresult2) [right=0.5cm of keyresult1]
{Section~\ref{sec:results_analysis}: results analysis};
%Lines
\draw[-, very thick, harnarrow] (keyred1.east) -- (keyred2.west);
\draw[-, very thick, genarrow] (keyblue1.east) -- (keyblue2.west);
\draw[-, very thick] (keyresult1.east) -- (keyresult2.west);
\draw[->, very thick, genarrow] (bench.south) -- (pocc.north);
\draw[->, very thick, genarrow] (pocc.south) -- (kernel.north);
\draw[->, very thick, genarrow] (kernel.south) -- (gcc.north);
\draw[->, very thick, genarrow] (gcc.east) -- (gdb.west);
\draw[->, very thick, genarrow] (gcc.east) -- (ppapi.west);
\draw[->, very thick, genarrow] (gcc.east) -- (gus.west);
\draw[->, very thick, harnarrow] (gdb.east) -- (uica.west);
\draw[->, very thick, harnarrow] (gdb.east) -- (iaca.west);
\draw[->, very thick, harnarrow] (gdb.east) -- (ithemal.west);
\draw[->, very thick, harnarrow] (gdb.east) -- (bhive.west);
\draw[->, very thick, harnarrow] (gdb.east) -- (llvm.west);
\draw[->, very thick, harnarrow] (comps.east|-lifting) -- (lifting.west);
\draw[->, very thick] (lifting.south) -- (bench2.north);
\end{tikzpicture}
}
\caption{Our analysis and measurement environment.\label{fig:contrib}}
\end{figure*}

!*.pdf
