CesASMe: Try to remodel the introduction. WIP.

This commit is contained in:
Théophile Bastian 2023-09-25 19:05:42 +02:00
parent 853f792023
commit b44b6fcf10
2 changed files with 47 additions and 32 deletions

@@ -35,11 +35,52 @@ jne 0x16e0
\noindent\llvmmca{} predicts 1.5 cycles, \iaca{} and \ithemal{} predict 2 cycles, while \uica{}
predicts 3 cycles. One may wonder which tool is correct.
In this chapter, we take a step back from our previous contributions and
assess the broader landscape of code analyzers. What are the key
bottlenecks to account for if one aims to predict the execution time of a
kernel correctly? Are some of them poorly modelled by state-of-the-art
code analyzers? By conducting a broad experimental analysis of these tools,
this chapter strives to answer these questions.
The obvious way to assess their predictions is to compare them to an
actual measure. However, accounting for dependencies at the scale of a basic
block makes this \emph{actual measure} harder to define than it would seem.
Take for instance the following kernel:
\input{overview}
\bigskip{}
We present a fully-tooled solution to evaluate and compare the
diverse landscape of static throughput predictors. Our tool, \cesasme, tackles two
main issues in this direction. In Section~\ref{sec:bench_gen}, we describe how
\cesasme{} generates a wide variety of computation kernels stressing different
parameters of the architecture, and thus of the predictors' models, while
staying close to representative workloads. To achieve this, we use
Polybench~\cite{bench:polybench}, a C-level benchmark suite representative of
scientific computation workloads, which we combine with a variety of
optimisations, including polyhedral loop transformations; a generic sketch of
such a transformation is given at the end of this introduction.
In Section~\ref{sec:redefine_exec_time}, we argue that the very notion of a
kernel's execution time must be re-defined before predictions can be
meaningfully compared to a measure.
In Section~\ref{sec:bench_harness}, we describe how \cesasme{} evaluates
throughput predictors on this set of benchmarks by lifting their predictions to
a total number of cycles that can be compared to a hardware-counter-based
measure; a minimal sketch of this lifting step is likewise given at the end of
this introduction. A high-level view of \cesasme{} is shown in
Figure~\ref{fig:contrib}.
In Section~\ref{sec:exp_setup}, we detail our experimental setup and assess our
methodology. In Section~\ref{sec:results_analysis}, we compare the predictors
and analyze \cesasme's results.
In addition to statistical studies, we use \cesasme's results
to investigate analyzers' flaws. We show that code
analyzers do not always correctly model data dependencies through memory
accesses, substantially impacting their precision.
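As an illustration of the variants produced at the generation stage, the
following C sketch shows a Polybench-style kernel next to a tiled version of
it. It is a generic example of a polyhedral loop transformation, with arbitrary
sizes \texttt{N} and \texttt{TILE}; it is not taken from \cesasme's actual
generation pipeline.

\begin{minipage}{0.90\linewidth}
\begin{lstlisting}[language=C]
#define N 1024
#define TILE 64

/* Original kernel: dense matrix-vector product. */
void mvt(double A[N][N], double x[N], double y[N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            y[i] += A[i][j] * x[j];
}

/* Tiled variant: the same computation (up to floating-point
 * reassociation), but a different loop structure, hence a
 * different instruction mix and a different pressure on the
 * microarchitecture. */
void mvt_tiled(double A[N][N], double x[N], double y[N]) {
    for (int jj = 0; jj < N; jj += TILE)
        for (int i = 0; i < N; i++)
            for (int j = jj; j < jj + TILE; j++)
                y[i] += A[i][j] * x[j];
}
\end{lstlisting}
\end{minipage}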
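The lifting step of Section~\ref{sec:bench_harness} can be summarised by the
following minimal C sketch. The data structure, and the assumption that each
basic block's execution count is known (e.g.\ through instrumentation), are
illustrative and do not reflect \cesasme's actual interface.

\begin{minipage}{0.90\linewidth}
\begin{lstlisting}[language=C]
#include <stdio.h>

struct block_result {
    double predicted_cycles_per_iter; /* predictor's output  */
    long   executions;                /* times the block ran */
};

/* Lift per-block predictions to a whole-benchmark cycle count,
 * which can then be compared to a hardware-counter measure of
 * the total number of cycles. */
double lift_to_total_cycles(const struct block_result *blocks,
                            int n) {
    double total = 0.0;
    for (int i = 0; i < n; i++)
        total += blocks[i].predicted_cycles_per_iter
               * blocks[i].executions;
    return total;
}

int main(void) {
    struct block_result blocks[] = { {1.5, 1000000},
                                     {2.0, 250000} };
    printf("predicted total: %.0f cycles\n",
           lift_to_total_cycles(blocks, 2));
    return 0;
}
\end{lstlisting}
\end{minipage}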
\section{Re-defining the execution time of a
kernel}\label{sec:redefine_exec_time}
We saw above that state-of-the-art code analyzers disagree by up to 100\,\% on
the execution time of a relatively simple kernel. The obvious way to assess
their predictions is to compare them to an actual measure. However, accounting
for dependencies at the scale of a basic block makes this \emph{actual measure}
harder to define than it would seem. Take for instance the following kernel:
\begin{minipage}{0.90\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
@@ -49,8 +90,6 @@ add $8, %rcx
\end{lstlisting}
\end{minipage}
\input{overview}
\noindent{}At first, it looks like an array copy from location \reg{rax} to
\reg{rbx}. Yet, if \reg{rbx} is initialized to \reg{rax}\texttt{+8} before the
loop, there is a read-after-write dependency between the first
@@ -92,28 +131,3 @@ prediction by the number of loop iterations; however, this bound might not
generally be known. More importantly, the compiler may apply any number of
transformations: unrolling, for instance, changes this number. Control flow may
also be complicated by code versioning.
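To make the dependency issue of the kernel above concrete, the following C
sketch gives an illustrative rendition of it; the function and pointer names,
and the calling context, are assumptions made for illustration rather than
part of the original listing.

\begin{minipage}{0.90\linewidth}
\begin{lstlisting}[language=C]
#include <stdint.h>

/* C rendition of the copy loop above: whether iterations are
 * independent depends on how dst was initialised before the
 * loop, information that is absent from the basic block
 * itself. */
void copy_kernel(int64_t *src, int64_t *dst, long n) {
    for (long i = 0; i < n; i++)
        dst[i] = src[i];
}

/* If the caller passes dst == src + 1 (i.e. rbx = rax + 8),
 * each store feeds the next iteration's load: a
 * read-after-write chain serialises the loop and changes its
 * throughput. */
void aliased_call(int64_t *buf, long n) {
    copy_kernel(buf, buf + 1, n);
}
\end{lstlisting}
\end{minipage}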
\bigskip

@@ -1,4 +1,5 @@
\chapter{A more systematic approach to throughput prediction performance analysis}
\chapter{A more systematic approach to throughput prediction performance
analysis: \cesasme{}}
\input{00_intro.tex}
\input{05_related_works.tex}