\begin{abstract} A variety of code analyzers, such as \iaca, \uica, \llvmmca{} or \ithemal{}, strive to statically predict the throughput of a computation kernel. Each analyzer is based on its own simplified CPU model reasoning at the scale of an isolated basic block. Given this diversity, evaluating their strengths and weaknesses is important to guide both their usage and their enhancement. We argue that reasoning at the scale of a single basic block is not always sufficient and that a lack of context can mislead analyses. We present \tool, a fully-tooled solution to evaluate code analyzers on C-level benchmarks. It is composed of a benchmark derivation procedure that feeds an evaluation harness. We use it to evaluate state-of-the-art code analyzers and to provide insights on their precision. We use \tool's results to show that memory-carried data dependencies are a major source of imprecision for these tools. \end{abstract}

\section{Introduction}\label{sec:intro}

At a time when software is expected to perform more computations, faster and in more constrained environments, tools that statically predict the resources (in particular the CPU resources) it consumes are very useful to guide its optimization. This need is reflected in the diversity of binary or assembly code analyzers that followed the deprecation of \iaca~\cite{iaca}, which Intel maintained until 2019. Whether it is \llvmmca{}~\cite{llvm-mca}, \uica{}~\cite{uica}, \ithemal~\cite{ithemal} or \gus~\cite{phd:gruber}, all these tools strive to extract various performance metrics, including the number of CPU cycles a computation kernel will take ---~which roughly translates to execution time. In addition to raw measurements (which rely on hardware counters), these model-based analyses provide higher-level, refined data that expose the bottlenecks of a given code and guide its optimization. This feedback is useful to experts optimizing computation kernels, such as scientific simulations and deep-learning kernels.

An exact throughput prediction would require a cycle-accurate simulator of the processor, based on microarchitectural data that is most often not publicly available, and would be prohibitively slow in any case. These tools thus each solve, in their own way, the challenge of modeling complex CPUs while remaining simple enough to yield a prediction in a reasonable time, ending up with different models. For instance, on the following x86-64 basic block, extracted from the inner loop of a general matrix multiplication,

\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
movsd (%rcx, %rax), %xmm0
mulsd %xmm1, %xmm0
addsd (%rdx, %rax), %xmm0
movsd %xmm0, (%rdx, %rax)
addq $8, %rax
cmpq $0x2260, %rax
jne 0x16e0
\end{lstlisting}
\end{minipage}

\noindent\llvmmca{} predicts 1.5 cycles, \iaca{} and \ithemal{} predict 2 cycles, while \uica{} predicts 3 cycles. One may wonder which tool is correct.

The obvious solution to assess their predictions is to compare them to an actual measure. However, as these tools reason at the basic block level, such a measure is not as trivially defined as it would seem. Take for instance the following kernel:

\begin{minipage}{0.90\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
mov (%rax, %rcx, 1), %r10
mov %r10, (%rbx, %rcx, 1)
add $8, %rcx
\end{lstlisting}
\end{minipage}

\input{overview}

\noindent{}At first sight, it looks like an array copy from the location pointed to by \reg{rax} to the one pointed to by \reg{rbx}. Yet, if before the loop \reg{rbx} is initialized to \reg{rax}\texttt{+8}, there is a read-after-write dependency between the first instruction of an iteration and the second instruction of the previous iteration, which makes the throughput drop significantly. As we shall see in Section~\ref{ssec:bhive_errors}, \emph{without proper context, a basic block's throughput is not well-defined}.
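For illustration, the following two C contexts (hypothetical fragments, written here only to make the aliasing explicit, with \texttt{src} and \texttt{dst} pointing to 8-byte elements) could both plausibly compile to a loop around the block above:

\begin{minipage}{0.90\linewidth}
\begin{lstlisting}[language=C]
/* Context 1: distinct arrays. Iterations are
   independent; no load waits on a prior store. */
for (size_t i = 0; i < n; i++)
    dst[i] = src[i];

/* Context 2: dst aliases src + 1, i.e.
   rbx = rax + 8. Each load reads the value
   stored by the previous iteration: a
   loop-carried read-after-write dependency. */
for (size_t i = 0; i < n; i++)
    src[i + 1] = src[i];
\end{lstlisting}
\end{minipage}

In the first context, the block's throughput is limited only by the available load and store resources; in the second, it is bound by the latency of the store-to-load chain. The very same block thus exhibits two different throughputs.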
To recover the context of each basic block, we instead reason at the scale of the C source code. This makes the measures unambiguous: one can use hardware counters to measure the cycles elapsed during a whole loop nest. This requires a suite of benchmarks, in C, that is both representative of the domain studied and wide enough to cover it well. However, this is not in itself sufficient to evaluate static tools: on the preceding matrix multiplication kernel, counters report 80,059 elapsed cycles ---~for the total loop. This number can hardly be compared with the basic block-level predictions of \llvmmca{}, \iaca{}, \ithemal{} and \uica{} seen above.

A common practice to make these numbers comparable is to renormalize them to instructions per cycle (IPC). Here, \llvmmca{} reports an IPC of $\frac{7}{1.5} \approx 4.67$, \iaca{} and \ithemal{} report an IPC of $\frac{7}{2} = 3.5$, and \uica{} reports an IPC of $\frac{7}{3} \approx 2.33$. In this case, the measured IPC is 3.45, which is closest to \iaca{} and \ithemal. Yet, IPC is a metric for microarchitectural load, and \textit{tells nothing about a kernel's efficiency}. Indeed, the static number of instructions is affected by many compiler passes, such as scalar evolution, strength reduction, register allocation, instruction selection\ldots{} Thus, when comparing two compiled versions of the same code, IPC alone does not necessarily point to the most efficient version. For instance, a kernel using SIMD instructions will execute fewer instructions than one using only scalars, and thus exhibit an equal or lower IPC; yet, its performance will unquestionably increase.

The total number of cycles elapsed to solve a given problem, on the other hand, is a sound metric of the efficiency of an implementation. We thus instead \emph{lift} the basic block-level predictions to a total number of cycles. In simple cases, this simply means multiplying the block-level prediction by the number of loop iterations; however, this trip count might not be known in general. More importantly, the compiler may apply any number of transformations: unrolling, for instance, changes this number. Control flow may also be complicated by code versioning.
%In the general case, instrumenting the generated code to obtain the number of
%occurrences of the basic block yields accurate results.
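One way to make this lifting precise is the following (our notation, given here for exposition only; the actual procedure is detailed in Section~\ref{sec:bench_harness}). If the compiled kernel consists of basic blocks $b_1, \ldots, b_m$, if block $b_i$ is executed $n_i$ times during the run, and if the analyzer predicts $\hat{c}_i$ cycles per occurrence of $b_i$, the lifted prediction is
\[
  \hat{C} \;=\; \sum_{i=1}^{m} n_i \, \hat{c}_i ,
\]
which, for a kernel dominated by a single block iterated $n$ times, reduces to $\hat{C} = n\,\hat{c}$ ---~for instance, $1.5\,n$ cycles for \llvmmca{} on the matrix multiplication block above.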
\bigskip

In this article, we present a fully-tooled solution to evaluate and compare the diversity of static throughput predictors. Our tool, \tool, addresses two main issues. In Section~\ref{sec:bench_gen}, we describe how \tool{} generates a wide variety of computation kernels stressing different parameters of the architecture, and thus of the predictors' models, while staying close to representative workloads. To achieve this, we use Polybench~\cite{polybench}, a C-level benchmark suite representative of scientific computation workloads, which we combine with a variety of optimizations, including polyhedral loop transformations. In Section~\ref{sec:bench_harness}, we describe how \tool{} evaluates throughput predictors on this set of benchmarks by lifting their predictions to a total number of cycles that can be compared to a measure based on hardware counters (a minimal sketch of such a measurement is given at the end of this introduction). A high-level view of \tool{} is shown in Figure~\ref{fig:contrib}. In Section~\ref{sec:exp_setup}, we detail our experimental setup and assess our methodology. In Section~\ref{sec:results_analysis}, we compare the predictors and analyze \tool's results. In addition to statistical studies, we use these results to investigate the analyzers' flaws. We show that code analyzers do not always correctly model data dependencies through memory accesses, which substantially impacts their precision.
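For concreteness, the sketch below shows the kind of counter-based measurement that lifted predictions are compared against: it counts the CPU cycles elapsed during a loop through Linux's \texttt{perf\_event} interface. It illustrates the general technique only; the loop body and array sizes are placeholders, and \tool's actual harness is described in Section~\ref{sec:bench_harness}.

\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language=C]
#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

static double a[1100], b[1100]; /* placeholder data */

int main(void) {
    /* Request a CPU-cycle counter for this thread. */
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) return 1;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    for (int i = 0; i < 1100; i++)  /* kernel under measurement */
        b[i] += 2.0 * a[i];
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t cycles;
    read(fd, &cycles, sizeof(cycles));
    /* Print a value depending on b so the loop is not
       optimized away; in practice, repeat and aggregate. */
    printf("%lu cycles (b[0]=%f)\n", (unsigned long)cycles, b[0]);
    close(fd);
    return 0;
}
\end{lstlisting}
\end{minipage}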