\section{Re-defining the execution time of a kernel}\label{sec:redefine_exec_time}

We saw above that state-of-the-art code analyzers disagreed by up to 100\,\% on
the execution time of a relatively simple kernel. The obvious solution to
assess their predictions is to compare them against an actual measurement.
However, accounting for dependencies at the scale of a basic block makes this
\emph{actual measurement} less trivially defined than it may seem. Take for
instance the following kernel:

\begin{minipage}{0.90\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
mov (%rax, %rcx, 1), %r10
mov %r10, (%rbx, %rcx, 1)
add $8, %rcx
\end{lstlisting}
\end{minipage}

\noindent{}At first sight, it looks like an array copy from location \reg{rax}
to \reg{rbx}. Yet, if, before the loop, \reg{rbx} is initialized to
\reg{rax}\texttt{+8}, there is a read-after-write dependency between the first
instruction and the second instruction of the previous iteration, which makes
the throughput drop significantly. As we shall see in
Section~\ref{ssec:bhive_errors}, \emph{without proper context, a basic block's
throughput is not well-defined}.

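To illustrate, the following C-level sketch (the function and array names here
are purely illustrative and are not taken from our benchmarks) shows two source
contexts that may both compile down to the loop body above:

\begin{minipage}{0.90\linewidth}
\begin{lstlisting}[language=C]
#include <stdint.h>
#include <stddef.h>

/* Disjoint arrays: iterations are independent, the loop is a
   plain copy and consecutive iterations may overlap freely. */
void copy_disjoint(int64_t *restrict dst,
                   const int64_t *restrict src, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Aliased case: the destination starts one element after the
   source (rbx = rax + 8), so every load reads the value stored
   by the previous iteration: a read-after-write dependency that
   serializes the loop. */
void copy_aliased(int64_t *a, size_t n) {
    for (size_t i = 0; i < n; i++)
        a[i + 1] = a[i];
}
\end{lstlisting}
\end{minipage}

In the first context, consecutive iterations can execute in parallel; in the
second, each iteration must wait for the previous store, even though the
instructions executed are identical.
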
To recover the context of each basic block, we reason instead at the scale of a
full C source file. This makes the measurements unambiguous: one can use
hardware counters to count the cycles elapsed during a loop nest. This requires
a suite of benchmarks, in C, that is both representative of the studied domain
and wide enough to cover it well. However, this is not in itself sufficient to
evaluate static tools: on the preceding matrix multiplication kernel, counters
report 80,059 elapsed cycles~---~for the loop as a whole. This number hardly
compares to the basic block-level predictions of \llvmmca{}, \iaca{},
\ithemal{}, and \uica{} seen above.

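As an illustration of such a measurement (a minimal, self-contained sketch
rather than the harness of \autoref{sec:bench_harness}; the matrix size and the
event configuration are arbitrary), the cycles elapsed in a loop nest can be
read from a hardware counter through Linux's \texttt{perf\_event\_open}:

\begin{lstlisting}[language=C]
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

#define N 256
static double a[N][N], b[N][N], c[N][N];

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags) {
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void) {
    /* Count CPU cycles, restricted to user space. */
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    int fd = perf_event_open(&attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* The loop nest under measurement: a naive matrix product. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t cycles = 0;
    if (read(fd, &cycles, sizeof(cycles)) != (ssize_t) sizeof(cycles)) {
        perror("read");
        return 1;
    }
    printf("elapsed cycles: %lu\n", (unsigned long) cycles);
    return 0;
}
\end{lstlisting}
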
A common practice to make these numbers comparable is to normalize them to
instructions per cycle (IPC). Here, \llvmmca{} reports an IPC of
$\sfrac{7}{1.5}~=~4.67$, \iaca{} and \ithemal{} report an IPC of
$\sfrac{7}{2}~=~3.5$, and \uica{} reports an IPC of $\sfrac{7}{3}~=~2.3$. In
this case, the measured IPC is 3.45, which is closest to \iaca{} and
\ithemal{}. Yet, IPC is a metric of microarchitectural load, and \textit{tells
nothing about a kernel's efficiency}. Indeed, the static number of instructions
is affected by many compiler passes, such as scalar evolution, strength
reduction, register allocation, instruction selection\ldots{} Thus, when
comparing two compiled versions of the same code, IPC alone does not
necessarily point to the most efficient version. For instance, a kernel using
SIMD instructions will execute fewer instructions than one using only scalar
instructions, and will thus exhibit a lower or unchanged IPC; yet, its
performance will unquestionably be better.

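To make this last point concrete with hypothetical figures: a scalar loop body
retiring 8 instructions in 2 cycles per iteration has an IPC of
$\sfrac{8}{2}~=~4$, while a vectorized version performing the same work with 3
instructions in a single cycle has an IPC of only $\sfrac{3}{1}~=~3$; the
latter is nonetheless twice as fast.
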
The total number of cycles elapsed to solve a given problem, on the other hand,
is a sound metric of the efficiency of an implementation. We thus instead
\emph{lift} the basic block-level predictions to a total number of cycles. In
simple cases, this amounts to multiplying the block-level prediction by the
number of loop iterations; however, this iteration count might not be known in
general. More importantly, the compiler may apply any number of
transformations: unrolling, for instance, changes this number. Control flow may
also be complicated by code versioning.

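Written out explicitly (with notation introduced here only for illustration),
lifting amounts to an occurrence-weighted sum over the kernel's basic blocks:
\[
  \widehat{C}_{\mathrm{total}} = \sum_{b \in \mathcal{B}} n_b \, \hat{c}_b,
\]
where $\mathcal{B}$ is the set of basic blocks of the kernel, $n_b$ the number
of times block $b$ is executed by the actual program, and $\hat{c}_b$ the
analyzer's predicted cycle count for one execution of $b$.
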
Instead of guessing this final number of iterations at the assembly level, a
sounder alternative is to measure it on the final binary. In
\autoref{sec:bench_harness}, we present our solution to do so, using \gdb{} to
instrument an execution of the binary.