\section{Re-defining the execution time of a kernel}\label{sec:redefine_exec_time}

We saw above that state-of-the-art code analyzers disagreed by up to 100\,\% on the execution time of a relatively simple kernel. The obvious way to assess their predictions is to compare them to an actual measurement. However, accounting for dependencies at the scale of a basic block makes this \emph{actual measurement} less trivially defined than it would seem. Take for instance the following kernel:

\begin{minipage}{0.90\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
mov (%rax, %rcx, 1), %r10
mov %r10, (%rbx, %rcx, 1)
add $8, %rcx
\end{lstlisting}
\end{minipage}

\noindent{}At first glance, it looks like an array copy from location \reg{rax} to \reg{rbx}. Yet, if \reg{rbx} is initialized to \reg{rax}\texttt{+8} before the loop, the first instruction has a read-after-write dependency on the second instruction of the previous iteration, which makes the throughput drop significantly. As we shall see in Section~\ref{ssec:bhive_errors}, \emph{without proper context, a basic block's throughput is not well-defined}.

To recover the context of each basic block, we reason instead at the scale of a whole C source file. This makes the measurements unambiguous: one can use hardware counters to measure the cycles elapsed during a loop nest. This requires a suite of benchmarks, in C, that is both representative of the domain studied and wide enough to cover it well.

However, this is not in itself sufficient to evaluate static tools: on the preceding matrix multiplication kernel, counters report 80,059 elapsed cycles for the loop as a whole. This number is hard to compare with the basic block-level predictions of \llvmmca{}, \iaca{}, \ithemal{}, and \uica{} seen above. A common practice to make these numbers comparable is to renormalize them to instructions per cycle (IPC).
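To make the aliasing hazard concrete, the kernel above can be reproduced at the C level. The following sketch is purely illustrative (the function name and values are ours, not taken from the benchmark suite): when \texttt{dst} happens to equal \texttt{src + 1}, the store of iteration $i$ writes the very word that iteration $i+1$ loads, creating the loop-carried read-after-write dependency just described.

\begin{minipage}{0.90\linewidth}
\begin{lstlisting}[language={C}]
#include <stddef.h>
#include <stdint.h>

/* Word-by-word copy, compiling to the block above.
 * If dst == src + 1, the store of iteration i writes
 * the word loaded by iteration i+1: a loop-carried
 * read-after-write dependency serializes the loop. */
void copy_words(int64_t *src, int64_t *dst, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
\end{lstlisting}
\end{minipage}

With overlapping pointers, the value semantics change as well: on \texttt{\{1, 2, 3, 4, 5\}}, calling \texttt{copy\_words(a, a + 1, 4)} propagates \texttt{a[0]} through the whole array; the hardware dependency chain mirrors this value flow.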
Here, \llvmmca{} reports an IPC of $\sfrac{7}{1.5}~=~4.67$, \iaca{} and \ithemal{} report an IPC of $\sfrac{7}{2}~=~3.5$, and \uica{} reports an IPC of $\sfrac{7}{3}~=~2.33$. In this case, the measured IPC is 3.45, which is closest to \iaca{} and \ithemal{}. Yet, IPC is a metric of microarchitectural load, and \textit{tells nothing about a kernel's efficiency}. Indeed, the static number of instructions is affected by many compiler passes, such as scalar evolution, strength reduction, register allocation, instruction selection\ldots{} Thus, when comparing two compiled versions of the same code, IPC alone does not necessarily point to the most efficient version. For instance, a kernel using SIMD instructions will execute fewer instructions than one using only scalars, and thus exhibit a lower, or at best equal, IPC; yet, its performance will unquestionably be better.

The total number of cycles elapsed to solve a given problem, on the other hand, is a sound metric of the efficiency of an implementation. We thus instead \emph{lift} basic block-level predictions to a total number of cycles. In simple cases, this amounts to multiplying the block-level prediction by the number of loop iterations; in general, however, this iteration count is not statically known. More importantly, the compiler may apply any number of transformations: unrolling, for instance, changes this number, and control flow may be further complicated by code versioning. Instead of guessing this final number of iterations at the assembly level, a sounder alternative is to measure it on the final binary. In \autoref{sec:bench_harness}, we present our solution to do so, using \gdb{} to instrument an execution of the binary.
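The lifting step itself is simple arithmetic; the difficulty lies entirely in obtaining the iteration count. The sketch below uses the per-iteration predictions quoted above together with a \emph{hypothetical} trip count of 10,000 (in practice, this count is measured on the binary, never assumed):

\begin{minipage}{0.90\linewidth}
\begin{lstlisting}[language={C}]
/* Lift a basic block-level prediction (cycles per
 * iteration) to a whole-kernel cycle count, given the
 * iteration count measured on the final binary. */
double lift(double cycles_per_iter, long iterations) {
    return cycles_per_iter * iterations;
}
\end{lstlisting}
\end{minipage}

With 10,000 iterations, the 1.5, 2 and 3 cycles-per-iteration predictions above would lift to 15,000, 20,000 and 30,000 total cycles respectively, figures directly comparable to a hardware counter measurement of the whole loop nest.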