diff --git a/manuscrit/50_CesASMe/00_intro.tex b/manuscrit/50_CesASMe/00_intro.tex
index f498d71..35d9ed5 100644
--- a/manuscrit/50_CesASMe/00_intro.tex
+++ b/manuscrit/50_CesASMe/00_intro.tex
@@ -35,11 +35,52 @@ jne 0x16e0
 \noindent\llvmmca{} predicts 1.5 cycles, \iaca{} and \ithemal{} predict 2
 cycles, while \uica{} predicts 3 cycles. One may wonder which tool is correct.
 
+In this chapter, we take a step back from our previous contributions and
+assess more generally the landscape of code analyzers. What are the key
+bottlenecks to account for if one aims to predict the execution time of a
+kernel correctly? Are some of these badly accounted for by state-of-the-art
+code analyzers? By conducting a broad experimental analysis of these tools,
+this chapter strives to answer these questions.
 
-The obvious solution to assess their predictions is to compare them to an
-actual measure. However, accounting for dependencies at the scale of a basic
-block makes this \emph{actual measure} not as trivially defined as it would
-seem. Take for instance the following kernel:
+\input{overview}
+
+\bigskip{}
+
+We present a fully-tooled solution to evaluate and compare the
+diversity of static throughput predictors. Our tool, \cesasme, solves two main
+issues in this direction. In Section~\ref{sec:bench_gen}, we describe how
+\cesasme{} generates a wide variety of computation kernels stressing different
+parameters of the architecture, and thus of the predictors' models, while
+staying close to representative workloads. To achieve this, we use
+Polybench~\cite{bench:polybench}, a C-level benchmark suite representative of
+scientific computation workloads, which we combine with a variety of
+optimisations, including polyhedral loop transformations.
+
+In \autoref{sec:redefine_exec_time}, we re-define the execution time of a
+kernel as a total number of cycles that can be both measured and predicted.
+
+In Section~\ref{sec:bench_harness}, we describe how \cesasme{} is able to
+evaluate throughput predictors on this set of benchmarks by lifting their
+predictions to a total number of cycles that can be compared to a hardware
+counters-based measure. A high-level view of \cesasme{} is shown in
+Figure~\ref{fig:contrib}.
+
+In Section~\ref{sec:exp_setup}, we detail our experimental setup and assess
+our methodology. In Section~\ref{sec:results_analysis}, we compare the
+predictors and analyze the results of \cesasme{}. In addition to statistical
+studies, we use \cesasme's results to investigate analyzers' flaws. We show
+that code analyzers do not always correctly model data dependencies through
+memory accesses, substantially impacting their precision.
+
+\section{Re-defining the execution time of a
+kernel}\label{sec:redefine_exec_time}
+
+We saw above that state-of-the-art code analyzers disagreed by up to 100\,\% on
+the execution time of a relatively simple kernel. The obvious solution to
+assess their predictions is to compare them to an actual measure. However,
+accounting for dependencies at the scale of a basic block makes this
+\emph{actual measure} not as trivially defined as it would seem. Take for
+instance the following kernel:
 
 \begin{minipage}{0.90\linewidth}
 \begin{lstlisting}[language={[x86masm]Assembler}]
@@ -49,8 +90,6 @@ add $8, %rcx
 \end{lstlisting}
 \end{minipage}
 
-\input{overview}
-
 \noindent{}At first, it looks like an array copy from location \reg{rax} to
 \reg{rbx}. Yet, if before the loop, \reg{rbx} is initialized to
 \reg{rax}\texttt{+8}, there is a read-after-write dependency between the first
@@ -92,28 +131,3 @@ prediction by the number of loop iterations; however, this bound might not
 generally be known. More importantly, the compiler may apply any number of
 transformations: unrolling, for instance, changes this number. Control flow may
 also be complicated by code versioning.
-
-\bigskip
-
-In this chapter, we present a fully-tooled solution to evaluate and compare the
-diversity of static throughput predictors. Our tool, \cesasme, solves two main
-issues in this direction. In Section~\ref{sec:bench_gen}, we describe how
-\cesasme{} generates a wide variety of computation kernels stressing different
-parameters of the architecture, and thus of the predictors' models, while
-staying close to representative workloads. To achieve this, we use
-Polybench~\cite{bench:polybench}, a C-level benchmark suite representative of
-scientific computation workloads, that we combine with a variety of
-optimisations, including polyhedral loop transformations.
-In Section~\ref{sec:bench_harness}, we describe how \cesasme{} is able to
-evaluate throughput predictors on this set of benchmarks by lifting their
-predictions to a total number of cycles that can be compared to a hardware
-counters-based measure. A
-high-level view of \cesasme{} is shown in Figure~\ref{fig:contrib}.
-
-In Section~\ref{sec:exp_setup}, we detail our experimental setup and assess our
-methodology. In Section~\ref{sec:results_analysis}, we compare the predictors' results and
-analyze the results of \cesasme{}.
- In addition to statistical studies, we use \cesasme's results
-to investigate analyzers' flaws. We show that code
-analyzers do not always correctly model data dependencies through memory
-accesses, substantially impacting their precision.
diff --git a/manuscrit/50_CesASMe/main.tex b/manuscrit/50_CesASMe/main.tex
index bf41cdb..875530e 100644
--- a/manuscrit/50_CesASMe/main.tex
+++ b/manuscrit/50_CesASMe/main.tex
@@ -1,4 +1,5 @@
-\chapter{A more systematic approach to throughput prediction performance analysis}
+\chapter{A more systematic approach to throughput prediction performance
+analysis: \cesasme{}}
 \input{00_intro.tex}
 \input{05_related_works.tex}
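A note on the lifting mentioned above for Section~\ref{sec:bench_harness}: as a minimal sketch of the idea only, and not the procedure actually implemented, one natural way to turn per-block throughput predictions into a whole-program figure is a weighted sum, assuming (hypothetical notation) that each basic block $b$ receives a steady-state prediction $\hat{t}_b$ in cycles per occurrence and is observed to execute $n_b$ times.

```latex
% Illustration only, with hypothetical notation: lift per-block steady-state
% predictions \hat{t}_b, weighted by measured occurrence counts n_b, summing
% over the kernel's basic blocks \mathcal{B}, into a whole-program estimate
% comparable with a hardware counters-based measurement C.
\begin{equation*}
  \hat{C} \;=\; \sum_{b \in \mathcal{B}} n_b \, \hat{t}_b
  \qquad\text{to be compared against the measured cycle count } C.
\end{equation*}
```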