\section{Re-defining the execution time of a
kernel}\label{sec:redefine_exec_time}
We saw above that state-of-the-art code analyzers disagreed by up to 100\,\% on
the execution time of a relatively simple kernel. The obvious way to assess
their predictions is to compare them to an actual measurement. However,
accounting for dependencies at the scale of a basic block makes this
\emph{actual measurement} less trivially defined than it may seem. Take for
instance the following kernel:
\begin{minipage}{0.90\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
mov (%rax, %rcx, 1), %r10
mov %r10, (%rbx, %rcx, 1)
add $8, %rcx
\end{lstlisting}
\end{minipage}
\noindent{}At first, it looks like an array copy from location \reg{rax} to
\reg{rbx}. Yet, if \reg{rbx} is initialized to \reg{rax}\texttt{+8} before the
loop, there is a read-after-write dependency between the first instruction and
the second instruction of the previous iteration, which makes the throughput
drop significantly. As we shall see in
Section~\ref{ssec:bhive_errors}, \emph{without proper context, a basic
block's throughput is not well-defined}.
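
To make the role of the context concrete, the following C fragment (a purely
illustrative example, not taken from our benchmark suite) can compile down to
essentially this block; whether consecutive iterations are independent is
decided entirely by the arguments of the call, which the basic block alone
does not reveal.

\begin{minipage}{0.90\linewidth}
\begin{lstlisting}[language=C]
#include <stdint.h>
#include <stddef.h>

/* The loop body corresponds to the block above. */
void copy(int64_t *dst, const int64_t *src, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Independent arrays: iterations do not depend on each other.
 *     copy(b, a, n);
 * Overlapping arrays, dst = src + 1 (i.e. rbx = rax + 8): each load
 * reads the value stored by the previous iteration, so the
 * read-after-write dependency throttles the throughput.
 *     copy(a + 1, a, n);
 */
\end{lstlisting}
\end{minipage}
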
To recover the context of each basic block, we reason instead at the scale of
a C source code. This makes the measurements unambiguous: one can use hardware
counters to measure the elapsed cycles during a loop nest. This requires a
suite of benchmarks, in C, that is both representative of the domain studied
and wide enough to cover it well. However, this is not in itself sufficient to
evaluate static tools: on the preceding matrix multiplication kernel, counters
report 80,059 elapsed cycles ---~for the total loop.
This number can hardly be compared with the basic block-level predictions of
\llvmmca{}, \iaca{}, \ithemal{}, and \uica{} seen above.
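
For concreteness, the kind of cycle measurement involved can be sketched as
follows. This is only an illustration of the principle, using the Linux
\texttt{perf\_event} interface; it is not the harness actually used in the
rest of this chapter, and the kernel to measure is left as a placeholder.

\begin{minipage}{0.90\linewidth}
\begin{lstlisting}[language=C]
/* Sketch: count the CPU cycles elapsed around a loop nest using the
 * Linux perf_event interface (illustration of the principle only). */
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    /* this process, any CPU, no event group, no flags */
    int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) return 1;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    /* ... run the kernel under study (e.g. the loop nest) here ... */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t cycles;
    read(fd, &cycles, sizeof(cycles));
    printf("%llu cycles elapsed\n", (unsigned long long) cycles);
    return 0;
}
\end{lstlisting}
\end{minipage}
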
A common practice to make these numbers comparable is to renormalize them to
instructions per cycle (IPC). Here, \llvmmca{} reports an IPC of
$\sfrac{7}{1.5}~\simeq~4.67$, \iaca{} and \ithemal{} report an IPC of
$\sfrac{7}{2}~=~3.5$, and \uica{} reports an IPC of $\sfrac{7}{3}~\simeq~2.33$.
In this case, the measured IPC is 3.45, which is closest to \iaca{} and
\ithemal{}. Yet,
IPC is a metric for microarchitectural load, and \textit{tells nothing about a
kernel's efficiency}. Indeed, the static number of instructions is affected by
many compiler passes, such as scalar evolution, strength reduction, register
allocation, instruction selection\ldots{} Thus, when comparing two compiled
versions of the same code, IPC alone does not necessarily point to the most
efficient version. For instance, a kernel using SIMD instructions needs fewer
instructions than one using only scalars to do the same work, and may thus
exhibit an equal or even lower IPC; yet, its performance will unquestionably
increase.
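
As a purely hypothetical numerical illustration (the figures below are made up
for the sake of the argument), consider a scalar loop executing 8 instructions
per iteration in 2 cycles, one element being processed per iteration, against
a vectorized version executing 3 instructions per iteration in 1.5 cycles,
four elements being processed per iteration:
\[
  \mathrm{IPC}_\mathrm{scalar} = \sfrac{8}{2} = 4,
  \qquad
  \mathrm{IPC}_\mathrm{SIMD} = \sfrac{3}{1.5} = 2 .
\]
The scalar version, however, needs $\sfrac{2}{1}~=~2$ cycles per element,
while the vectorized one needs only $\sfrac{1.5}{4}~\simeq~0.4$ cycles per
element: its IPC is halved, yet it is about five times faster.
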
The total number of cycles elapsed to solve a given problem, on the other
hand, is a sound metric of the efficiency of an implementation. We thus
instead \emph{lift} basic block-level predictions to a total number of cycles.
In simple cases, this simply means multiplying the block-level prediction by
the number of loop iterations; however, this iteration count is not generally
known. More importantly, the compiler may apply any number of
transformations: unrolling, for instance, changes this number. Control flow may
also be complicated by code versioning.
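
Symbolically, writing $C_\mathrm{block}$ for the cycles a tool predicts per
occurrence of a basic block and $n$ for the number of times this block
executes (notation introduced only for this paragraph), the lifted prediction
in the simple case is
\[
  T_\mathrm{lifted} = C_\mathrm{block} \times n ,
\]
a quantity directly comparable to the 80,059 cycles measured above; the whole
difficulty lies in obtaining $n$.
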
Instead of guessing this final number of iterations at the assembly level, a
sounder alternative is to measure it on the final binary. In
\autoref{sec:bench_harness}, we present our solution to do so, using \gdb{} to
instrument an execution of the binary.
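
As a first intuition of what such an instrumentation may look like (this is
only a sketch, not the harness of \autoref{sec:bench_harness}), \gdb{} alone
can already count how many times a basic block executes: set a breakpoint on
the block's first instruction (the address below is hypothetical), tell
\gdb{} to ignore its next $10^8$ hits so that the program never actually
stops, run the program, then read the hit count reported by
\texttt{info breakpoints}.

\begin{minipage}{0.90\linewidth}
\begin{lstlisting}
(gdb) break *0x401180
(gdb) ignore 1 100000000
(gdb) run
(gdb) info breakpoints
\end{lstlisting}
\end{minipage}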