In the previous chapters, we focused on two of the main bottleneck factors for
computation kernels: \autoref{chap:palmed} investigated the backend aspect of
throughput prediction, while \autoref{chap:frontend} delved into the frontend
aspects.

Throughout those two chapters, we entirely left out another crucial
factor: dependencies, and the latency they induce between instructions. We
were able to do so because our baseline of native execution consisted of
\pipedream{} measurements, \emph{designed} to suppress any dependency.

However, state-of-the-art tools strive to provide an estimate of the
execution time $\cyc{\kerK}$ of a given kernel $\kerK$ that is \emph{as precise
as possible}, and as such, cannot neglect this third major bottleneck.
An exact throughput prediction would require a cycle-accurate simulator of the
processor, based on microarchitectural data that is most often not publicly
available, and would be prohibitively slow in any case. These tools thus each
solve in their own way the challenge of modeling complex CPUs while remaining
simple enough to yield a prediction in a reasonable time, ending up with
different models. For instance, on the following x86-64 basic block computing a
general matrix multiplication,

\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
movsd (%rcx, %rax), %xmm0
mulsd %xmm1, %xmm0
addsd (%rdx, %rax), %xmm0
movsd %xmm0, (%rdx, %rax)
addq $8, %rax
cmpq $0x2260, %rax
jne 0x16e0
\end{lstlisting}
\end{minipage}

\noindent\llvmmca{} predicts 1.5 cycles, \iaca{} and \ithemal{} predict 2 cycles, while \uica{}
predicts 3 cycles. One may wonder which tool is correct.

In this chapter, we take a step back from our previous contributions, and
assess more generally the landscape of code analyzers. What are the key
bottlenecks to account for if one aims to predict the execution time of a
kernel correctly? Are some of these poorly accounted for by state-of-the-art
code analyzers? This chapter, by conducting a broad experimental analysis of
these tools, strives to answer these questions.

\input{overview}

\bigskip{}

In \autoref{sec:redefine_exec_time}, we investigate how a kernel's execution time
may be measured if we want to correctly account for its dependencies. We
advocate measuring the total execution time of a computation
kernel in its original context, coupled with a precise count of its
iterations to normalize the measurement.
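As a sketch, with notation introduced here for illustration only: writing
$C_{\mathrm{total}}$ for the total number of cycles measured for the kernel in
its original context, and $n_{\mathrm{iter}}$ for its measured number of
iterations, the normalized execution time would read
\[
    \cyc{\kerK} \approx \frac{C_{\mathrm{total}}}{n_{\mathrm{iter}}}.
\]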
%

We then present a fully tooled solution to evaluate and compare a
diverse set of static throughput predictors. Our tool, \cesasme{}, solves two main
issues in this direction. In Section~\ref{sec:bench_gen}, we describe how
\cesasme{} generates a wide variety of computation kernels stressing different
parameters of the architecture, and thus of the predictors' models, while
staying close to representative workloads. To achieve this, we use
Polybench~\cite{bench:polybench}, a C-level benchmark suite representative of
scientific computation workloads, which we combine with a variety of
optimizations, including polyhedral loop transformations.

In Section~\ref{sec:bench_harness}, we describe how \cesasme{} is able to
evaluate throughput predictors on this set of benchmarks by lifting their
predictions to a total number of cycles that can be compared to a
hardware-counter-based measurement. A
high-level view of \cesasme{} is shown in Figure~\ref{fig:contrib}.
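One plausible form of such a lifting, given here as a sketch with notation
introduced for illustration only, weights the prediction of each basic block
by the number of times it is executed: writing $\mathcal{B}$ for the set of
basic blocks of a benchmark, $\mathrm{occ}(b)$ for the measured number of
executions of a block $b$, and $\mathrm{pred}_T(b)$ for the number of cycles
that a predictor $T$ reports for one execution of $b$, the lifted prediction
would be
\[
    \mathrm{lifted}_T = \sum_{b \in \mathcal{B}} \mathrm{occ}(b) \times \mathrm{pred}_T(b),
\]
a quantity that can be compared directly to a measured total number of cycles.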
%

In Section~\ref{sec:exp_setup}, we detail our experimental setup and assess our
methodology. In Section~\ref{sec:results_analysis}, we compare the predictors' results and
analyze the results of \cesasme{}.
In addition to statistical studies, we use \cesasme's results
to investigate analyzers' flaws. We show that code
analyzers do not always correctly model data dependencies through memory
accesses, which substantially impacts their precision.
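As a purely hypothetical illustration, which is not taken from our benchmarks
and whose loop bound and jump target are placeholders, consider the following
x86-64 loop body, in which each iteration stores to the address that the next
iteration loads from:

\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
# hypothetical example, for illustration only
movsd (%rdx, %rax), %xmm0
addsd %xmm1, %xmm0
movsd %xmm0, 8(%rdx, %rax)
addq $8, %rax
cmpq $0x2260, %rax
jne 0x16e0
\end{lstlisting}
\end{minipage}

\noindent Here, \texttt{\%xmm0} is overwritten by the load at the start of each
iteration, so the value produced by one iteration reaches the next one only
through memory. A model that tracks dependencies through registers alone would
presumably treat successive iterations as independent.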