CesASMe: Try to remodel the introduction. WIP.

This commit is contained in:
Théophile Bastian 2023-09-25 19:05:42 +02:00
parent 853f792023
commit b44b6fcf10
2 changed files with 47 additions and 32 deletions

@@ -35,11 +35,52 @@ jne 0x16e0
\noindent\llvmmca{} predicts 1.5 cycles, \iaca{} and \ithemal{} predict 2 cycles, while \uica{}
predicts 3 cycles. One may wonder which tool is correct.
In this chapter, we take a step back from our previous contributions and
assess the broader landscape of code analyzers. What are the key
bottlenecks to account for if one aims to predict the execution time of a
kernel correctly? Are some of them poorly modelled by state-of-the-art
code analyzers? By conducting a broad experimental analysis of these tools,
this chapter strives to answer these questions.
The obvious way to assess their predictions is to compare them to an
actual measure. However, accounting for dependencies at the scale of a basic
block makes this \emph{actual measure} harder to define than it would seem.
Take for instance the following kernel:
\input{overview}
\bigskip{}
We present a fully-tooled solution to evaluate and compare the
diverse landscape of static throughput predictors. Our tool, \cesasme, tackles two
main issues in this direction. In Section~\ref{sec:bench_gen}, we describe how
\cesasme{} generates a wide variety of computation kernels stressing different
parameters of the architecture, and thus of the predictors' models, while
staying close to representative workloads. To achieve this, we use
Polybench~\cite{bench:polybench}, a C-level benchmark suite representative of
scientific computation workloads, which we combine with a variety of
optimisations, including polyhedral loop transformations; a generic sketch of
such a transformation is given at the end of this introduction.
In Section~\ref{sec:redefine_exec_time}, we argue that the very notion of a
kernel's execution time must be re-defined before predictions can be
meaningfully compared to a measure.
In Section~\ref{sec:bench_harness}, we describe how \cesasme{} evaluates
throughput predictors on this set of benchmarks by lifting their predictions to
a total number of cycles that can be compared to a hardware-counter-based
measure; a minimal sketch of this lifting step is likewise given at the end of
this introduction. A high-level view of \cesasme{} is shown in
Figure~\ref{fig:contrib}.
In Section~\ref{sec:exp_setup}, we detail our experimental setup and assess our
methodology. In Section~\ref{sec:results_analysis}, we compare the predictors
and analyze \cesasme's results.
In addition to statistical studies, we use \cesasme's results
to investigate analyzers' flaws. We show that code
analyzers do not always correctly model data dependencies through memory
accesses, substantially impacting their precision.
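As an illustration of the variants produced at the generation stage, the
following C sketch shows a Polybench-style kernel next to a tiled version of
it. It is a generic example of a polyhedral loop transformation, with arbitrary
sizes \texttt{N} and \texttt{TILE}; it is not taken from \cesasme's actual
generation pipeline.

\begin{minipage}{0.90\linewidth}
\begin{lstlisting}[language=C]
#define N 1024
#define TILE 64

/* Original kernel: dense matrix-vector product. */
void mvt(double A[N][N], double x[N], double y[N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            y[i] += A[i][j] * x[j];
}

/* Tiled variant: the same computation (up to floating-point
 * reassociation), but a different loop structure, hence a
 * different instruction mix and a different pressure on the
 * microarchitecture. */
void mvt_tiled(double A[N][N], double x[N], double y[N]) {
    for (int jj = 0; jj < N; jj += TILE)
        for (int i = 0; i < N; i++)
            for (int j = jj; j < jj + TILE; j++)
                y[i] += A[i][j] * x[j];
}
\end{lstlisting}
\end{minipage}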
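The lifting step of Section~\ref{sec:bench_harness} can be summarised by the
following minimal C sketch. The data structure, and the assumption that each
basic block's execution count is known (e.g.\ through instrumentation), are
illustrative and do not reflect \cesasme's actual interface.

\begin{minipage}{0.90\linewidth}
\begin{lstlisting}[language=C]
#include <stdio.h>

struct block_result {
    double predicted_cycles_per_iter; /* predictor's output  */
    long   executions;                /* times the block ran */
};

/* Lift per-block predictions to a whole-benchmark cycle count,
 * which can then be compared to a hardware-counter measure of
 * the total number of cycles. */
double lift_to_total_cycles(const struct block_result *blocks,
                            int n) {
    double total = 0.0;
    for (int i = 0; i < n; i++)
        total += blocks[i].predicted_cycles_per_iter
               * blocks[i].executions;
    return total;
}

int main(void) {
    struct block_result blocks[] = { {1.5, 1000000},
                                     {2.0, 250000} };
    printf("predicted total: %.0f cycles\n",
           lift_to_total_cycles(blocks, 2));
    return 0;
}
\end{lstlisting}
\end{minipage}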
\section{Re-defining the execution time of a
kernel}\label{sec:redefine_exec_time}
We saw above that state-of-the-art code analyzers disagree by up to 100\,\% on
the execution time of a relatively simple kernel. The obvious way to assess
their predictions is to compare them to an actual measure. However, accounting
for dependencies at the scale of a basic block makes this \emph{actual measure}
harder to define than it would seem. Take for instance the following kernel:
\begin{minipage}{0.90\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
@@ -49,8 +90,6 @@ add $8, %rcx
\end{lstlisting}
\end{minipage}
\input{overview}
\noindent{}At first, it looks like an array copy from location \reg{rax} to
\reg{rbx}. Yet, if \reg{rbx} is initialized to \reg{rax}\texttt{+8} before the
loop, there is a read-after-write dependency between the first
@@ -92,28 +131,3 @@ prediction by the number of loop iterations; however, this bound might not
generally be known. More importantly, the compiler may apply any number of
transformations: unrolling, for instance, changes this number. Control flow may
also be complicated by code versioning.
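To make the dependency issue of the kernel above concrete, the following C
sketch gives an illustrative rendition of it; the function and pointer names,
and the calling context, are assumptions made for illustration rather than
part of the original listing.

\begin{minipage}{0.90\linewidth}
\begin{lstlisting}[language=C]
#include <stdint.h>

/* C rendition of the copy loop above: whether iterations are
 * independent depends on how dst was initialised before the
 * loop, information that is absent from the basic block
 * itself. */
void copy_kernel(int64_t *src, int64_t *dst, long n) {
    for (long i = 0; i < n; i++)
        dst[i] = src[i];
}

/* If the caller passes dst == src + 1 (i.e. rbx = rax + 8),
 * each store feeds the next iteration's load: a
 * read-after-write chain serialises the loop and changes its
 * throughput. */
void aliased_call(int64_t *buf, long n) {
    copy_kernel(buf, buf + 1, n);
}
\end{lstlisting}
\end{minipage}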
\bigskip

@@ -1,4 +1,5 @@
\chapter{A more systematic approach to throughput prediction performance analysis}
\chapter{A more systematic approach to throughput prediction performance
analysis: \cesasme{}}
\input{00_intro.tex}
\input{05_related_works.tex}