In the previous chapters, we focused on two of the main bottleneck factors for
computation kernels: \autoref{chap:palmed} investigated the backend aspects of
throughput prediction, while \autoref{chap:frontend} delved into its frontend
aspects.

Throughout those two chapters, we entirely left out another crucial factor:
dependencies, and the latency they induce between instructions. We were able
to do so because our baseline for native execution consisted of \pipedream{}
measurements, \emph{designed} to suppress any dependency.

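To see concretely what is at stake, consider the C sketch below; it is purely
illustrative and is not an actual \pipedream{} benchmark. In the first loop,
each iteration reads the value produced by the previous one, so the
multiply-adds execute serially and the loop is bound by the instruction's
\emph{latency}. In the second loop, iterations are independent, so the
processor may overlap them, and the loop is bound by the instruction's
\emph{throughput}, the quantity that \pipedream{} isolates by construction.

\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language=C]
/* Illustration only: the same multiply-add, dependent vs. independent. */
double chains(double *y, const double *t, double a, double b, long n)
{
    /* Latency-bound: iteration i needs the x computed at iteration
       i-1, so the operations form a serial dependency chain. */
    double x = 1.0;
    for (long i = 0; i < n; i++)
        x = a * x + b;

    /* Throughput-bound: iterations are independent, so the
       out-of-order pipeline can execute several of them at once. */
    for (long i = 0; i < n; i++)
        y[i] = a * t[i] + b;

    return x; /* keep the first loop's result live */
}
\end{lstlisting}
\end{minipage}

\noindent{}On a hypothetical unit able to start one such operation per cycle
but taking four cycles to produce its result, the first loop would run roughly
four times slower than the second, despite executing the same number of
instructions.
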
However, state-of-the-art tools strive to provide an estimate of the execution
time $\cyc{\kerK}$ of a given kernel $\kerK$ that is \emph{as precise as
possible} and, as such, cannot neglect this third major bottleneck. An exact
throughput prediction would require a cycle-accurate simulator of the
processor, built on microarchitectural data that is most often not publicly
available, and it would be prohibitively slow in any case. Each tool thus
solves, in its own way, the challenge of modeling complex CPUs while remaining
simple enough to yield a prediction in reasonable time, and each ends up with
a different model. For instance, consider the following x86-64 basic block,
computing a general matrix multiplication:

\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
movsd (%rcx, %rax), %xmm0    # load the element at (%rcx, %rax)
mulsd %xmm1, %xmm0           # multiply it by the scalar in %xmm1
addsd (%rdx, %rax), %xmm0    # accumulate the element at (%rdx, %rax)
movsd %xmm0, (%rdx, %rax)    # store the result back
addq $8, %rax                # advance by one double (8 bytes)
cmpq $0x2260, %rax           # 0x2260 = 8800 bytes, i.e. 1100 doubles
jne 0x16e0                   # loop back
\end{lstlisting}
\end{minipage}

\noindent{}On this basic block, \llvmmca{} predicts 1.5 cycles, \iaca{} and
\ithemal{} predict 2 cycles, while \uica{} predicts 3 cycles. One may wonder
which tool is correct.

In this chapter, we take a step back from our previous contributions and
assess the landscape of code analyzers more generally. Which bottlenecks are
key to account for when predicting the execution time of a kernel? Are some of
them poorly handled by state-of-the-art code analyzers? By conducting a broad
experimental analysis of these tools, this chapter strives to answer these
questions.

\input{overview}

\bigskip{}

In \autoref{sec:redefine_exec_time}, we investigate how a kernel's execution
time may be measured if we want to correctly account for its dependencies. We
advocate for measuring the total execution time of a computation kernel in its
original context, coupled with a precise count of its number of iterations,
which we use to normalize the measurement.

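Concretely, with notation that is ours for this illustration, the measured
baseline is
\[
  \cyc{\kerK} \simeq \frac{C_{\text{total}}}{N_{\text{iter}}},
\]
where $C_{\text{total}}$ is the total cycle count of the kernel running in its
original context, read from hardware counters, and $N_{\text{iter}}$ is the
measured number of iterations of $\kerK$.
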
We then present a fully-tooled solution to evaluate and compare the diversity
of static throughput predictors. Our tool, \cesasme{}, solves two main issues
in this direction. In Section~\ref{sec:bench_gen}, we describe how \cesasme{}
generates a wide variety of computation kernels stressing different parameters
of the architecture, and thus of the predictors' models, while staying close
to representative workloads. To achieve this, we use
Polybench~\cite{bench:polybench}, a C-level benchmark suite representative of
scientific computation workloads, which we combine with a variety of
optimisations, including polyhedral loop transformations.

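As an illustration of the kind of transformation involved, the sketch below
tiles the loops of a matrix-multiplication kernel. It is a generic example of
a polyhedral loop transformation, not code taken from \cesasme{} or Polybench,
and the tile size \lstinline{T} is an arbitrary choice.

\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language=C]
/* Tiled variant of C += A * B: the same computation, but a different
   memory-access pattern, hence different bottlenecks exercised. */
#define T 64 /* tile size, arbitrary for this sketch */

void matmul_tiled(int n, double A[n][n], double B[n][n], double C[n][n])
{
    for (int ii = 0; ii < n; ii += T)
        for (int kk = 0; kk < n; kk += T)
            for (int i = ii; i < ii + T && i < n; i++)
                for (int k = kk; k < kk + T && k < n; k++)
                    for (int j = 0; j < n; j++)
                        C[i][j] += A[i][k] * B[k][j];
}
\end{lstlisting}
\end{minipage}

\noindent{}Varying such transformations and their parameters over each
Polybench kernel yields many variants of the same computation, each stressing
the frontend, the backend, and the memory accesses of the processor
differently.
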
In Section~\ref{sec:bench_harness}, we describe how \cesasme{} evaluates
throughput predictors on this set of benchmarks by lifting their predictions
to a total number of cycles, which can then be compared to a
hardware-counter-based measurement. A high-level view of \cesasme{} is shown
in Figure~\ref{fig:contrib}.

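For instance, still with illustrative notation rather than \cesasme's exact
formula: if a code analyzer predicts $c$ cycles per iteration for the kernel
of a benchmark that runs $N_{\text{iter}}$ iterations, the lifted prediction
\[
  \widehat{C}_{\text{total}} = N_{\text{iter}} \cdot c
\]
can be compared directly with the hardware-counter measurement
$C_{\text{total}}$ introduced above.
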
In Section~\ref{sec:exp_setup}, we detail our experimental setup and assess
our methodology. In Section~\ref{sec:results_analysis}, we compare the results
of the predictors and analyze those of \cesasme{}. In addition to statistical
studies, we use \cesasme's results to investigate the analyzers' flaws. We
show that code analyzers do not always correctly model data dependencies
through memory accesses, which substantially impacts their precision.