\section{Introduction}\label{sec:intro}
At a time when software is expected to perform more computations, faster and
in more constrained environments, tools that statically predict the resources
(in particular, the CPU resources) a program consumes are very useful to guide
its optimization. This need is reflected in the diversity of binary or
assembly code analyzers that followed the deprecation of \iaca~\cite{iaca},
which Intel maintained through 2019. Whether it is \llvmmca{}~\cite{llvm-mca},
\uica{}~\cite{uica}, \ithemal~\cite{ithemal} or \gus~\cite{phd:gruber}, all
these tools strive to extract various performance metrics, including the
number of CPU cycles a computation kernel will take ---~which roughly
translates to execution time. In addition to raw measurements (relying on
hardware counters), these model-based analyses provide higher-level and
refined data to expose bottlenecks and guide the optimization of a given code.
This feedback is useful to experts optimizing computation kernels, including
scientific simulations and deep-learning kernels.

An exact throughput prediction would require a cycle-accurate simulator of the
processor, based on microarchitectural data that is most often not publicly
available, and would be prohibitively slow in any case. These tools thus each
solve, in their own way, the challenge of modeling complex CPUs while
remaining simple enough to yield a prediction in a reasonable time, ending up
with different models. For instance, on the following x86-64 basic block
computing a general matrix multiplication,
\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
movsd (%rcx, %rax), %xmm0
mulsd %xmm1, %xmm0
addsd (%rdx, %rax), %xmm0
movsd %xmm0, (%rdx, %rax)
addq $8, %rax
cmpq $0x2260, %rax
jne 0x16e0
\end{lstlisting}
\end{minipage}

\noindent\llvmmca{} predicts 1.5 cycles, \iaca{} and \ithemal{} predict 2
cycles, while \uica{} predicts 3 cycles. One may wonder which tool is correct.
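To fix intuitions, this basic block can be read as the innermost loop of a
GEMM kernel in which the left-hand element is loop-invariant and kept in
\reg{xmm1}; the C function below is a plausible reconstruction ---~names are
assumptions for illustration, and the bound 1100 (from the 0x2260-byte limit
on 8-byte doubles) assumes \reg{rax} starts at zero ---~not the exact
benchmark source:

\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language=C]
/* Hypothetical source for the block above: c_i[j] += a_ik * b_k[j],
 * with a_ik held in a register (%xmm1 in the assembly). */
static void gemm_inner(double *restrict c_i, const double *restrict b_k,
                       double a_ik)
{
    for (long j = 0; j < 1100; j++)
        c_i[j] += a_ik * b_k[j];
}
\end{lstlisting}
\end{minipage}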
The obvious solution to assess their predictions is to compare them to an
actual measure. However, as these tools reason at the basic block level, this
is not as trivially defined as it may seem. Take for instance the following
kernel:

\begin{minipage}{0.90\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
mov (%rax, %rcx, 1), %r10
mov %r10, (%rbx, %rcx, 1)
add $8, %rcx
\end{lstlisting}
\end{minipage}
\input{overview}

\noindent{}At first, it looks like an array copy from location \reg{rax} to
\reg{rbx}. Yet, if before the loop, \reg{rbx} is initialized to
\reg{rax}\texttt{+8}, there is a read-after-write dependency between the first
instruction and the second instruction of the previous iteration, which makes
the throughput drop significantly. As we shall see in
Section~\ref{ssec:bhive_errors}, \emph{without proper context, a basic
block's throughput is not well-defined}.
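Translated back to C, the ambiguity is easy to reproduce; the sketch below
(with hypothetical names) has the same loop body:

\begin{minipage}{0.90\linewidth}
\begin{lstlisting}[language=C]
#include <stddef.h>

/* Plain array copy of 8-byte elements. If b == a + 1, iteration i reads
 * a[i], which is exactly b[i-1], written by the previous iteration: a
 * loop-carried read-after-write dependency. */
void copy8(long *a, long *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        b[i] = a[i];
}
\end{lstlisting}
\end{minipage}

Called on disjoint arrays, the loop is throughput-bound; called as
\texttt{copy8(buf, buf + 1, n)}, each iteration waits for the previous store,
and the very same basic block runs significantly slower.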
To recover the context of each basic block, we reason instead at the scale of
a full C source code. This makes the measurements unambiguous: one can use
hardware counters to measure the cycles elapsed during a loop nest. This
requires a suite of benchmarks, in C, that is both representative of the
studied domain and wide enough to cover it well. However, this is not in
itself sufficient to evaluate static tools: on the preceding matrix
multiplication kernel, counters report 80,059 elapsed cycles ---~for the whole
loop nest. This number can hardly be compared to the basic block-level
predictions of \llvmmca{}, \iaca{}, \ithemal{}, and \uica{} seen above.
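For illustration, such a counter-based measurement around a compiled kernel
can be obtained through Linux's \texttt{perf\_event} interface; the sketch
below is a minimal example (error handling omitted,
\texttt{kernel\_under\_test} is a placeholder), not the actual harness, which
Section~\ref{sec:bench_harness} describes:

\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language=C]
#include <linux/perf_event.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Count elapsed CPU cycles around one run of the kernel under test. */
long long measure_cycles(void (*kernel_under_test)(void))
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    kernel_under_test();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    long long cycles = 0;
    read(fd, &cycles, sizeof(cycles));
    close(fd);
    return cycles;
}
\end{lstlisting}
\end{minipage}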
A common practice to make these numbers comparable is to renormalize them to
instructions per cycle (IPC). Here, \llvmmca{} reports an IPC of
$\frac{7}{1.5}~=~4.67$, \iaca{} and \ithemal{} report an IPC of
$\frac{7}{2}~=~3.5$, and \uica{} reports an IPC of $\frac{7}{3}~=~2.3$. In
this case, the measured IPC is 3.45, which is closest to \iaca{} and
\ithemal. Yet, IPC is a metric of microarchitectural load, and \textit{tells
nothing about a kernel's efficiency}. Indeed, the static number of
instructions is affected by many compiler passes, such as scalar evolution,
strength reduction, register allocation, instruction selection\ldots{} Thus,
when comparing two compiled versions of the same code, IPC alone does not
necessarily point to the most efficient version. For instance, a kernel using
SIMD instructions will use fewer instructions than one using only scalars,
and will thus exhibit a lower or unchanged IPC; yet, its performance will
unquestionably be better.
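The following sketch makes this effect concrete (using AVX intrinsics; the
names and the simplifying assumption that the length is a multiple of~4 are
ours): both functions perform the same additions, but the vectorized one
executes roughly a quarter as many instructions, so an equal or even lower IPC
can hide a large speedup.

\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language=C]
#include <immintrin.h>
#include <stddef.h>

/* Scalar version: one double per iteration. */
void add_scalar(double *c, const double *a, const double *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* AVX version: four doubles per iteration, hence roughly 4x fewer dynamic
 * instructions for the same work; its IPC may be no higher than the scalar
 * version's, yet it finishes much sooner. (n is assumed to be a multiple
 * of 4 to keep the sketch short.) */
void add_avx(double *c, const double *a, const double *b, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        __m256d va = _mm256_loadu_pd(a + i);
        __m256d vb = _mm256_loadu_pd(b + i);
        _mm256_storeu_pd(c + i, _mm256_add_pd(va, vb));
    }
}
\end{lstlisting}
\end{minipage}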
The total number of cycles elapsed to solve a given problem, on the other
hand, is a sound metric of the efficiency of an implementation. We thus
instead \emph{lift} the predictions made at the basic block level to a total
number of cycles. In simple cases, this amounts to multiplying the block-level
prediction by the number of loop iterations; however, this iteration count may
not be known in general. More importantly, the compiler may apply any number
of transformations: unrolling, for instance, changes this number. Control flow
may also be complicated by code versioning.
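In its simplest form, this lifting can be summarized by the following formula
---~a sketch of the principle, assuming the execution decomposes into basic
blocks $b$ whose occurrence counts $n_b$ are known; the actual procedure
implemented in \cesasme{} is described in Section~\ref{sec:bench_harness}:
\[
  \widehat{C}_{\mathrm{total}} = \sum_{b} n_b \cdot \widehat{c}(b),
\]
where $\widehat{c}(b)$ is the predictor's per-iteration estimate for basic
block $b$, in cycles, and $\widehat{C}_{\mathrm{total}}$ can be compared
directly to the measured total cycle count.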
\bigskip

In this article, we present a fully-tooled solution to evaluate and compare
the diversity of static throughput predictors. Our tool, \cesasme, solves two
main issues in this direction. In Section~\ref{sec:bench_gen}, we describe how
\cesasme{} generates a wide variety of computation kernels stressing different
parameters of the architecture, and thus of the predictors' models, while
staying close to representative workloads. To achieve this, we use
Polybench~\cite{bench:polybench}, a C-level benchmark suite representative of
scientific computation workloads, which we combine with a variety of
optimizations, including polyhedral loop transformations.

In Section~\ref{sec:bench_harness}, we describe how \cesasme{} evaluates
throughput predictors on this set of benchmarks by lifting their predictions
to a total number of cycles that can be compared to a hardware-counter-based
measurement. A high-level view of \cesasme{} is shown in
Figure~\ref{fig:contrib}.

In Section~\ref{sec:exp_setup}, we detail our experimental setup and assess
our methodology. In Section~\ref{sec:results_analysis}, we compare the
predictors' results and analyze the data gathered by \cesasme{}. In addition
to statistical studies, we use \cesasme's results to investigate analyzers'
flaws. We show that code analyzers do not always correctly model data
dependencies through memory accesses, substantially impacting their precision.