\section{Re-defining the execution time of a kernel}\label{sec:redefine_exec_time}

We saw above that state-of-the-art code analyzers disagreed by up to 100\,\% on
the execution time of a relatively simple kernel. The obvious solution to
assess their predictions is to compare them against an actual measurement.
However, accounting for dependencies at the scale of a basic block makes this
\emph{actual measurement} less trivially defined than it may seem. Take for
instance the following kernel:

\begin{minipage}{0.90\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
mov (%rax, %rcx, 1), %r10
mov %r10, (%rbx, %rcx, 1)
add $8, %rcx
\end{lstlisting}
\end{minipage}

\noindent{}At first sight, it looks like an array copy from location \reg{rax}
to \reg{rbx}. Yet, if, before the loop, \reg{rbx} is initialized to
\reg{rax}\texttt{+8}, there is a read-after-write dependency between the first
instruction and the second instruction of the previous iteration, which makes
the throughput drop significantly. As we shall see in
Section~\ref{ssec:bhive_errors}, \emph{without proper context, a basic block's
throughput is not well-defined}.

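To illustrate, the following C-level sketch (the function and array names here
are purely illustrative and are not taken from our benchmarks) shows two source
contexts that may both compile down to the loop body above:

\begin{minipage}{0.90\linewidth}
\begin{lstlisting}[language=C]
#include <stdint.h>
#include <stddef.h>

/* Disjoint arrays: iterations are independent, the loop is a
   plain copy and consecutive iterations may overlap freely. */
void copy_disjoint(int64_t *restrict dst,
                   const int64_t *restrict src, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Aliased case: the destination starts one element after the
   source (rbx = rax + 8), so every load reads the value stored
   by the previous iteration: a read-after-write dependency that
   serializes the loop. */
void copy_aliased(int64_t *a, size_t n) {
    for (size_t i = 0; i < n; i++)
        a[i + 1] = a[i];
}
\end{lstlisting}
\end{minipage}

In the first context, consecutive iterations can execute in parallel; in the
second, each iteration must wait for the previous store, even though the
instructions executed are identical.
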
To recover the context of each basic block, we reason instead at the scale of a
full C source file. This makes the measurements unambiguous: one can use
hardware counters to count the cycles elapsed during a loop nest. This requires
a suite of benchmarks, in C, that is both representative of the studied domain
and wide enough to cover it well. However, this is not in itself sufficient to
evaluate static tools: on the preceding matrix multiplication kernel, counters
report 80,059 elapsed cycles~---~for the loop as a whole. This number hardly
compares to the basic block-level predictions of \llvmmca{}, \iaca{},
\ithemal{}, and \uica{} seen above.

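As an illustration of such a measurement (a minimal, self-contained sketch
rather than the harness of \autoref{sec:bench_harness}; the matrix size and the
event configuration are arbitrary), the cycles elapsed in a loop nest can be
read from a hardware counter through Linux's \texttt{perf\_event\_open}:

\begin{lstlisting}[language=C]
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

#define N 256
static double a[N][N], b[N][N], c[N][N];

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags) {
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void) {
    /* Count CPU cycles, restricted to user space. */
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    int fd = perf_event_open(&attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* The loop nest under measurement: a naive matrix product. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t cycles = 0;
    if (read(fd, &cycles, sizeof(cycles)) != (ssize_t) sizeof(cycles)) {
        perror("read");
        return 1;
    }
    printf("elapsed cycles: %lu\n", (unsigned long) cycles);
    return 0;
}
\end{lstlisting}
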
A common practice to make these numbers comparable is to normalize them to
instructions per cycle (IPC). Here, \llvmmca{} reports an IPC of
$\sfrac{7}{1.5}~=~4.67$, \iaca{} and \ithemal{} report an IPC of
$\sfrac{7}{2}~=~3.5$, and \uica{} reports an IPC of $\sfrac{7}{3}~=~2.3$. In
this case, the measured IPC is 3.45, which is closest to \iaca{} and
\ithemal{}. Yet, IPC is a metric of microarchitectural load, and \textit{tells
nothing about a kernel's efficiency}. Indeed, the static number of instructions
is affected by many compiler passes, such as scalar evolution, strength
reduction, register allocation, instruction selection\ldots{} Thus, when
comparing two compiled versions of the same code, IPC alone does not
necessarily point to the most efficient version. For instance, a kernel using
SIMD instructions will execute fewer instructions than one using only scalar
instructions, and will thus exhibit a lower or unchanged IPC; yet, its
performance will unquestionably be better.

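To make this last point concrete with hypothetical figures: a scalar loop body
retiring 8 instructions in 2 cycles per iteration has an IPC of
$\sfrac{8}{2}~=~4$, while a vectorized version performing the same work with 3
instructions in a single cycle has an IPC of only $\sfrac{3}{1}~=~3$; the
latter is nonetheless twice as fast.
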
The total number of cycles elapsed to solve a given problem, on the other hand,
is a sound metric of the efficiency of an implementation. We thus instead
\emph{lift} the basic block-level predictions to a total number of cycles. In
simple cases, this amounts to multiplying the block-level prediction by the
number of loop iterations; however, this iteration count might not be known in
general. More importantly, the compiler may apply any number of
transformations: unrolling, for instance, changes this number. Control flow may
also be complicated by code versioning.

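Written out explicitly (with notation introduced here only for illustration),
lifting amounts to an occurrence-weighted sum over the kernel's basic blocks:
\[
  \widehat{C}_{\mathrm{total}} = \sum_{b \in \mathcal{B}} n_b \, \hat{c}_b,
\]
where $\mathcal{B}$ is the set of basic blocks of the kernel, $n_b$ the number
of times block $b$ is executed by the actual program, and $\hat{c}_b$ the
analyzer's predicted cycle count for one execution of $b$.
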
Instead of guessing this final number of iterations at the assembly level, a
sounder alternative is to measure it on the final binary. In
\autoref{sec:bench_harness}, we present our solution to do so, using \gdb{} to
instrument an execution of the binary.