In the previous chapters, we focused on two of the main bottleneck factors for
computation kernels: \autoref{chap:palmed} investigated the backend aspect of
throughput prediction, while \autoref{chap:frontend} dived into the frontend
aspects.
Throughout those two chapters, we entirely left out another crucial
factor: dependencies, and the latency they induce between instructions. We
could afford to do so because our baseline for native execution consisted of
\pipedream{} measurements, which are \emph{designed} to suppress any
dependency.
However, state-of-the-art tools strive to provide an estimation of the
execution time $\cyc{\kerK}$ of a given kernel $\kerK$ that is \emph{as precise
as possible}, and as such, cannot neglect this third major bottleneck.
An exact
throughput prediction would require a cycle-accurate simulator of the
processor, based on microarchitectural data that is most often not publicly
available, and would be prohibitively slow in any case. Each of these tools
thus solves, in its own way, the challenge of modeling complex CPUs while
remaining simple enough to yield a prediction in a reasonable time; they
consequently end up with different models. For instance, consider the following
x86-64 basic block, computing a general matrix multiplication:
\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
movsd (%rcx, %rax), %xmm0   # load a double from memory
mulsd %xmm1, %xmm0          # multiply it by %xmm1
addsd (%rdx, %rax), %xmm0   # accumulate a second memory operand
movsd %xmm0, (%rdx, %rax)   # store the result back
addq $8, %rax               # advance the index by 8 bytes (one double)
cmpq $0x2260, %rax          # compare against the loop bound
jne 0x16e0                  # loop back until the bound is reached
\end{lstlisting}
\end{minipage}
\noindent{}On this block, \llvmmca{} predicts 1.5 cycles, \iaca{} and
\ithemal{} predict 2 cycles, while \uica{} predicts 3 cycles. One may wonder
which of these tools is correct.
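Such predictions can be reproduced by running each analyzer directly on the
assembly. For instance, assuming the block above is saved as \texttt{kernel.s}
and targeting a Skylake core, \llvmmca's figure can be obtained with an
invocation along the following lines (a sketch: the per-iteration cost is
reported as the block's \emph{reverse throughput}, and exact flags may vary
across versions):
\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language=bash]
# Statically analyze kernel.s for a Skylake CPU; among other
# metrics, llvm-mca reports the block's reverse throughput,
# i.e. its steady-state cost in cycles per iteration.
llvm-mca -mcpu=skylake kernel.s
\end{lstlisting}
\end{minipage}
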
In this chapter, we take a step back from our previous contributions, and
assess more generally the landscape of code analyzers. What are the key
bottlenecks to account for if one aims to predict the execution time of a
kernel correctly? Are some of these poorly accounted for by state-of-the-art
code analyzers? By conducting a broad experimental analysis of these tools,
this chapter strives to answer these questions.
\input{overview}
\bigskip{}
In \autoref{sec:redefine_exec_time}, we investigate how a kernel's execution time
may be measured if we want to correctly account for its dependencies. We
advocate for measuring the total execution time of a computation kernel in
its original context, coupled with a precise count of its number of iterations
to normalize the measurement.
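Schematically, if $C_{\text{total}}$ denotes the total time, in cycles, spent
executing the kernel $\kerK$ in its original context, and $n$ its number of
iterations, this amounts to estimating
\[
    \cyc{\kerK} = \frac{C_{\text{total}}}{n}\,,
\]
where $C_{\text{total}}$ and $n$ are notations introduced here for illustration
only.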
We then present a fully-tooled solution to evaluate and compare the
diversity of static throughput predictors. Our tool, \cesasme{}, addresses two
main issues towards this goal. In Section~\ref{sec:bench_gen}, we describe how
\cesasme{} generates a wide variety of computation kernels stressing different
parameters of the architecture, and thus of the predictors' models, while
staying close to representative workloads. To achieve this, we use
Polybench~\cite{bench:polybench}, a C-level benchmark suite representative of
scientific computation workloads, which we combine with a variety of
optimizations, including polyhedral loop transformations.
In Section~\ref{sec:bench_harness}, we describe how \cesasme{} is able to
evaluate throughput predictors on this set of benchmarks by lifting their
predictions to a total number of cycles that can be compared to a measurement
based on hardware counters. A high-level view of \cesasme{} is shown in
Figure~\ref{fig:contrib}.
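Schematically, if a predictor estimates a per-iteration cost $\cyc{B}$ for each
basic block $B$ of a benchmark, and $n_B$ denotes the number of times $B$ is
executed, the lifted prediction takes the form
\[
    C_{\text{pred}} = \sum_{B} n_B \cdot \cyc{B}\,,
\]
with $C_{\text{pred}}$ and $n_B$ again purely illustrative notations; the
actual procedure is detailed in Section~\ref{sec:bench_harness}.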
In Section~\ref{sec:exp_setup}, we detail our experimental setup and assess our
methodology. In Section~\ref{sec:results_analysis}, we compare the predictors
on these benchmarks and analyze the results.
In addition to statistical studies, we use \cesasme's results
to investigate analyzers' flaws. We show that code
analyzers do not always correctly model data dependencies through memory
accesses, substantially impacting their precision.