\begin{abstract} A variety of code analyzers, such as \iaca, \uica, \llvmmca{} or \ithemal{}, strive to statically predict the throughput of a computation kernel. Each analyzer is based on its own simplified CPU model reasoning at the scale of an isolated basic block. Given this diversity, evaluating their strengths and weaknesses is important to guide both their usage and their enhancement. We argue that reasoning at the scale of a single basic block is not always sufficient and that a lack of context can mislead analyses. We present \tool, a fully-tooled solution to evaluate code analyzers on C-level benchmarks. It is composed of a benchmark derivation procedure that feeds an evaluation harness. We use it to evaluate state-of-the-art code analyzers and to provide insights on their precision. We use \tool's results to show that memory-carried data dependencies are a major source of imprecision for these tools. \end{abstract}

\section{Introduction}\label{sec:intro}

At a time when software is expected to perform more computations, faster and in more constrained environments, tools that statically predict the resources (in particular the CPU resources) it consumes are very useful to guide its optimization. This need is reflected in the diversity of binary or assembly code analyzers that followed the deprecation of \iaca~\cite{iaca}, which Intel maintained until 2019. Whether it is \llvmmca{}~\cite{llvm-mca}, \uica{}~\cite{uica}, \ithemal~\cite{ithemal} or \gus~\cite{phd:gruber}, all these tools strive to extract various performance metrics, including the number of CPU cycles a computation kernel will take ---~which roughly translates to execution time. In addition to raw measurements (which rely on hardware counters), these model-based analyses provide higher-level, refined data that expose the bottlenecks of a given code and guide its optimization. This feedback is useful to experts optimizing computation kernels, such as scientific simulations and deep-learning kernels.

An exact throughput prediction would require a cycle-accurate simulator of the processor, based on microarchitectural data that is most often not publicly available, and would be prohibitively slow in any case. These tools thus each solve, in their own way, the challenge of modeling complex CPUs while remaining simple enough to yield a prediction in a reasonable time, ending up with different models. For instance, on the following x86-64 basic block, extracted from the inner loop of a general matrix multiplication,

\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
movsd (%rcx, %rax), %xmm0
mulsd %xmm1, %xmm0
addsd (%rdx, %rax), %xmm0
movsd %xmm0, (%rdx, %rax)
addq $8, %rax
cmpq $0x2260, %rax
jne 0x16e0
\end{lstlisting}
\end{minipage}

\noindent\llvmmca{} predicts 1.5 cycles, \iaca{} and \ithemal{} predict 2 cycles, while \uica{} predicts 3 cycles. One may wonder which tool is correct.

The obvious solution to assess their predictions is to compare them to an actual measure. However, as these tools reason at the basic block level, such a measure is not as trivially defined as it would seem. Take for instance the following kernel:

\begin{minipage}{0.90\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
mov (%rax, %rcx, 1), %r10
mov %r10, (%rbx, %rcx, 1)
add $8, %rcx
\end{lstlisting}
\end{minipage}

\input{overview}

\noindent{}At first sight, it looks like an array copy from the location pointed to by \reg{rax} to the one pointed to by \reg{rbx}. Yet, if before the loop \reg{rbx} is initialized to \reg{rax}\texttt{+8}, there is a read-after-write dependency between the first instruction of an iteration and the second instruction of the previous iteration, which makes the throughput drop significantly. As we shall see in Section~\ref{ssec:bhive_errors}, \emph{without proper context, a basic block's throughput is not well-defined}.
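For illustration, the following two C contexts (hypothetical fragments, written here only to make the aliasing explicit, with \texttt{src} and \texttt{dst} pointing to 8-byte elements) could both plausibly compile to a loop around the block above:

\begin{minipage}{0.90\linewidth}
\begin{lstlisting}[language=C]
/* Context 1: distinct arrays. Iterations are
   independent; no load waits on a prior store. */
for (size_t i = 0; i < n; i++)
    dst[i] = src[i];

/* Context 2: dst aliases src + 1, i.e.
   rbx = rax + 8. Each load reads the value
   stored by the previous iteration: a
   loop-carried read-after-write dependency. */
for (size_t i = 0; i < n; i++)
    src[i + 1] = src[i];
\end{lstlisting}
\end{minipage}

In the first context, the block's throughput is limited only by the available load and store resources; in the second, it is bound by the latency of the store-to-load chain. The very same block thus exhibits two different throughputs.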
To recover the context of each basic block, we instead reason at the scale of the C source code. This makes the measures unambiguous: one can use hardware counters to measure the cycles elapsed during a whole loop nest. This requires a suite of benchmarks, in C, that is both representative of the domain studied and wide enough to cover it well. However, this is not in itself sufficient to evaluate static tools: on the preceding matrix multiplication kernel, counters report 80,059 elapsed cycles ---~for the total loop. This number can hardly be compared with the basic block-level predictions of \llvmmca{}, \iaca{}, \ithemal{} and \uica{} seen above.

A common practice to make these numbers comparable is to renormalize them to instructions per cycle (IPC). Here, \llvmmca{} reports an IPC of $\frac{7}{1.5} \approx 4.67$, \iaca{} and \ithemal{} report an IPC of $\frac{7}{2} = 3.5$, and \uica{} reports an IPC of $\frac{7}{3} \approx 2.33$. In this case, the measured IPC is 3.45, which is closest to \iaca{} and \ithemal. Yet, IPC is a metric for microarchitectural load, and \textit{tells nothing about a kernel's efficiency}. Indeed, the static number of instructions is affected by many compiler passes, such as scalar evolution, strength reduction, register allocation, instruction selection\ldots{} Thus, when comparing two compiled versions of the same code, IPC alone does not necessarily point to the most efficient version. For instance, a kernel using SIMD instructions will execute fewer instructions than one using only scalars, and thus exhibit an equal or lower IPC; yet, its performance will unquestionably increase.

The total number of cycles elapsed to solve a given problem, on the other hand, is a sound metric of the efficiency of an implementation. We thus instead \emph{lift} the basic block-level predictions to a total number of cycles. In simple cases, this simply means multiplying the block-level prediction by the number of loop iterations; however, this trip count might not be known in general. More importantly, the compiler may apply any number of transformations: unrolling, for instance, changes this number. Control flow may also be complicated by code versioning.
%In the general case, instrumenting the generated code to obtain the number of
%occurrences of the basic block yields accurate results.
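One way to make this lifting precise is the following (our notation, given here for exposition only; the actual procedure is detailed in Section~\ref{sec:bench_harness}). If the compiled kernel consists of basic blocks $b_1, \ldots, b_m$, if block $b_i$ is executed $n_i$ times during the run, and if the analyzer predicts $\hat{c}_i$ cycles per occurrence of $b_i$, the lifted prediction is
\[
  \hat{C} \;=\; \sum_{i=1}^{m} n_i \, \hat{c}_i ,
\]
which, for a kernel dominated by a single block iterated $n$ times, reduces to $\hat{C} = n\,\hat{c}$ ---~for instance, $1.5\,n$ cycles for \llvmmca{} on the matrix multiplication block above.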
\bigskip

In this article, we present a fully-tooled solution to evaluate and compare the diversity of static throughput predictors. Our tool, \tool, addresses two main issues. In Section~\ref{sec:bench_gen}, we describe how \tool{} generates a wide variety of computation kernels stressing different parameters of the architecture, and thus of the predictors' models, while staying close to representative workloads. To achieve this, we use Polybench~\cite{polybench}, a C-level benchmark suite representative of scientific computation workloads, which we combine with a variety of optimizations, including polyhedral loop transformations. In Section~\ref{sec:bench_harness}, we describe how \tool{} evaluates throughput predictors on this set of benchmarks by lifting their predictions to a total number of cycles that can be compared to a measure based on hardware counters (a minimal sketch of such a measurement is given at the end of this introduction). A high-level view of \tool{} is shown in Figure~\ref{fig:contrib}. In Section~\ref{sec:exp_setup}, we detail our experimental setup and assess our methodology. In Section~\ref{sec:results_analysis}, we compare the predictors and analyze \tool's results. In addition to statistical studies, we use these results to investigate the analyzers' flaws. We show that code analyzers do not always correctly model data dependencies through memory accesses, which substantially impacts their precision.
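For concreteness, the sketch below shows the kind of counter-based measurement that lifted predictions are compared against: it counts the CPU cycles elapsed during a loop through Linux's \texttt{perf\_event} interface. It illustrates the general technique only; the loop body and array sizes are placeholders, and \tool's actual harness is described in Section~\ref{sec:bench_harness}.

\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language=C]
#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

static double a[1100], b[1100]; /* placeholder data */

int main(void) {
    /* Request a CPU-cycle counter for this thread. */
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) return 1;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    for (int i = 0; i < 1100; i++)  /* kernel under measurement */
        b[i] += 2.0 * a[i];
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t cycles;
    read(fd, &cycles, sizeof(cycles));
    /* Print a value depending on b so the loop is not
       optimized away; in practice, repeat and aggregate. */
    printf("%lu cycles (b[0]=%f)\n", (unsigned long)cycles, b[0]);
    close(fd);
    return 0;
}
\end{lstlisting}
\end{minipage}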