\section{Introduction}\label{sec:intro}
At a time when software is expected to perform more computations, faster and in
more constrained environments, tools that statically predict the resources a
program consumes (in particular CPU resources) are valuable guides for
optimization. This need is reflected in the diversity of binary or assembly
code analyzers that followed the deprecation of \iaca~\cite{iaca}, which Intel
maintained until 2019. Whether it is \llvmmca{}~\cite{llvm-mca},
\uica{}~\cite{uica}, \ithemal~\cite{ithemal} or \gus~\cite{phd:gruber}, all
these tools strive to extract various performance metrics, including the number
of CPU cycles a computation kernel will take, which roughly translates to
execution time. Beyond raw measurements relying on hardware counters, these
model-based analyses provide higher-level, refined data that exposes
bottlenecks and guides the optimization of a given code. This
feedback is useful to experts optimizing computation kernels, including
scientific simulations and deep-learning kernels.

An exact throughput prediction would require a cycle-accurate simulator of the
processor, based on microarchitectural data that is most often not publicly
available, and would be prohibitively slow in any case. These tools thus each
address, in their own way, the challenge of modeling complex CPUs while
remaining simple enough to yield a prediction in reasonable time, and end up
with different models. For instance, on the following x86-64 basic block
computing a general matrix multiplication,

\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
movsd (%rcx, %rax), %xmm0
mulsd %xmm1, %xmm0
addsd (%rdx, %rax), %xmm0
movsd %xmm0, (%rdx, %rax)
addq $8, %rax
cmpq $0x2260, %rax
jne 0x16e0
\end{lstlisting}
\end{minipage}

\noindent\llvmmca{} predicts 1.5 cycles per iteration, \iaca{} and \ithemal{}
predict 2 cycles, while \uica{} predicts 3 cycles. One may wonder which tool is
correct.

The obvious way to assess their predictions is to compare them to an
actual measurement. However, as these tools reason at the basic-block level,
such a measurement is not as trivially defined as it would seem. Take for
instance the following kernel:

\begin{minipage}{0.90\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
mov (%rax, %rcx, 1), %r10
mov %r10, (%rbx, %rcx, 1)
add $8, %rcx
\end{lstlisting}
\end{minipage}

\input{overview}

\noindent{}At first glance, this looks like an array copy from location
\reg{rax} to \reg{rbx}. Yet if \reg{rbx} is initialized to
\reg{rax}\texttt{+8} before the loop, the load performed by the first
instruction reads the value stored by the second instruction of the previous
iteration: this read-after-write dependency makes
the throughput drop significantly. As we shall see in
Section~\ref{ssec:bhive_errors}, \emph{without proper context, a basic
block's throughput is not well-defined}.

To recover the context of each basic block, we reason instead at the scale of
C source code. This
makes measurements unambiguous: one can use hardware counters to measure the
cycles elapsed during a loop nest. This requires a suite of benchmarks, in C,
that is both representative of the domain studied and wide enough to cover it
well. However, this is not in itself sufficient to
evaluate static tools: on the preceding matrix multiplication kernel, counters
report 80,059 elapsed cycles for the whole loop.
This number can hardly be compared to the basic-block-level predictions of
\llvmmca{}, \iaca{}, \ithemal{}, and \uica{} seen above.

A common practice to make these numbers comparable is to renormalize them to
instructions per cycle (IPC). Here, \llvmmca{} reports an IPC of
$\frac{7}{1.5} \approx 4.67$, \iaca{} and \ithemal{} report an IPC of
$\frac{7}{2} = 3.5$, and \uica{} reports an IPC of $\frac{7}{3} \approx 2.33$.
In this case, the measured IPC is 3.45, which is closest to \iaca{} and
\ithemal{}. Yet,
IPC is a metric for microarchitectural load, and \textit{tells nothing about a
kernel's efficiency}. Indeed, the static number of instructions is affected by
many compiler passes, such as scalar evolution, strength reduction, register
allocation or instruction selection. Thus, when comparing two compiled
versions of the same code, IPC alone does not necessarily point to the most
efficient version. For instance, a kernel using SIMD instructions executes
fewer instructions than one using only scalars, and may thus exhibit a lower or
constant IPC; yet it typically completes the same work in fewer cycles.
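To make this concrete, consider a purely hypothetical pair of compiled versions
of the same kernel (the figures below are illustrative assumptions, not
measurements from this work): a scalar version executing $4000$ instructions in
$1000$ cycles, and a SIMD version executing $1000$ instructions in $400$
cycles. Their IPC would be
\[
  \mathrm{IPC}_{\mathrm{scalar}} = \frac{4000}{1000} = 4
  \qquad\mbox{and}\qquad
  \mathrm{IPC}_{\mathrm{SIMD}} = \frac{1000}{400} = 2.5,
\]
so the SIMD version would exhibit the lower IPC while running
$\frac{1000}{400} = 2.5$ times faster.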

The total number of cycles elapsed to solve a given problem, on the other
hand, is a sound metric of the efficiency of an implementation. We thus
instead \emph{lift} the basic-block-level predictions to a total number of
cycles. In simple cases, this merely means multiplying the block-level
prediction by the number of loop iterations; however, this iteration count is
not known in general. More importantly, the compiler may apply any number of
transformations: unrolling, for instance, changes this count. Control flow may
also be complicated by code versioning.
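As a sketch of the simple case only, with notation introduced here for
illustration rather than taken from the remainder of this article, let $c$ be
the predicted number of cycles per iteration of a loop body and $n$ its number
of iterations. The lifted prediction is then
\[
  C_{\mathrm{lifted}} = n \times c .
\]
For instance, assuming a hypothetical $n = 1000$ iterations, predictions of
$1.5$, $2$ and $3$ cycles per iteration would lift to $1500$, $2000$ and $3000$
cycles respectively. If the compiler unrolled the loop by a factor $u$ dividing
$n$, the unrolled block would be executed only $\frac{n}{u}$ times, so the
count to multiply by would change accordingly.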

\bigskip

In this article, we present a fully tooled solution to evaluate and compare a
variety of static throughput predictors. Our tool, \cesasme, solves two main
issues in this direction. In Section~\ref{sec:bench_gen}, we describe how
\cesasme{} generates a wide variety of computation kernels stressing different
parameters of the architecture, and thus of the predictors' models, while
staying close to representative workloads. To achieve this, we use
Polybench~\cite{bench:polybench}, a C-level benchmark suite representative of
scientific computation workloads, which we combine with a variety of
optimizations, including polyhedral loop transformations.
In Section~\ref{sec:bench_harness}, we describe how \cesasme{} is able to
evaluate throughput predictors on this set of benchmarks by lifting their
predictions to a total number of cycles that can be compared to a
hardware-counter-based measurement. A
high-level view of \cesasme{} is shown in Figure~\ref{fig:contrib}.

In Section~\ref{sec:exp_setup}, we detail our experimental setup and assess our
methodology. In Section~\ref{sec:results_analysis}, we compare the predictors'
results and analyze the output of \cesasme{}.
In addition to statistical studies, we use \cesasme{}'s results
to investigate the analyzers' flaws. We show that code
analyzers do not always correctly model data dependencies through memory
accesses, which substantially impacts their precision.