CesASMe: Try to remodel the introduction. WIP.

parent 853f792023
commit b44b6fcf10

2 changed files with 47 additions and 32 deletions

@ -35,11 +35,52 @@ jne 0x16e0
\noindent\llvmmca{} predicts 1.5 cycles, \iaca{} and \ithemal{} predict 2 cycles, while \uica{}
predicts 3 cycles. One may wonder which tool is correct.
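
These figures are each obtained by running the corresponding analyzer on the
kernel's assembly. As a minimal sketch of such an invocation, for \llvmmca{}
only and with an illustrative target CPU (the helper itself is ours, not part
of the tool):

\begin{lstlisting}[language=Python]
import re
import subprocess

def llvm_mca_rthroughput(asm_file: str, cpu: str = "skylake") -> float:
    """Run llvm-mca on an assembly file and return the block's reciprocal
    throughput, i.e. its predicted steady-state cycles per iteration."""
    out = subprocess.run(
        ["llvm-mca", f"-mcpu={cpu}", asm_file],
        capture_output=True, text=True, check=True,
    ).stdout
    # llvm-mca's summary contains a line such as "Block RThroughput: 1.5".
    return float(re.search(r"Block RThroughput:\s*([0-9.]+)", out).group(1))
\end{lstlisting}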

In this chapter, we take a step back from our previous contributions and
assess more generally the landscape of code analyzers. What are the key
bottlenecks to account for if one aims to correctly predict the execution
time of a kernel? Are some of these badly accounted for by state-of-the-art
code analyzers? By conducting a broad experimental analysis of these tools,
this chapter strives to answer these questions.

\bigskip{}

We present a fully-tooled solution to evaluate and compare the
diversity of static throughput predictors. Our tool, \cesasme{}, solves two
main issues in this direction. In Section~\ref{sec:bench_gen}, we describe how
\cesasme{} generates a wide variety of computation kernels stressing different
parameters of the architecture, and thus of the predictors' models, while
staying close to representative workloads. To achieve this, we use
Polybench~\cite{bench:polybench}, a C-level benchmark suite representative of
scientific computation workloads, which we combine with a variety of
optimisations, including polyhedral loop transformations.
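
As an illustration of this generation step, here is a minimal sketch; the
kernel names are real Polybench benchmarks, but the flag sets, file layout
and function are illustrative assumptions, not \cesasme's actual interface:

\begin{lstlisting}[language=Python]
import itertools
import pathlib
import subprocess

BENCHMARKS = ["gemm", "cholesky", "jacobi-2d"]  # Polybench kernels (subset)
FLAG_SETS = {                                   # per-variant GCC flag sets
    "base":   ["-O2"],
    "unroll": ["-O2", "-funroll-loops"],
    "vector": ["-O3", "-ftree-vectorize"],
}

def generate_variants(src_dir: pathlib.Path, out_dir: pathlib.Path) -> None:
    """Compile each (kernel, flag set) pair into a distinct binary, turning
    a few source kernels into many microbenchmark variants."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for bench, (name, flags) in itertools.product(BENCHMARKS,
                                                  FLAG_SETS.items()):
        src = src_dir / f"{bench}.c"
        out = out_dir / f"{bench}.{name}"
        subprocess.run(["gcc", *flags, str(src), "-o", str(out)], check=True)
\end{lstlisting}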

In Section~\ref{sec:redefine_exec_time}, we argue that, because of
dependencies at the scale of a basic block, the \emph{actual} execution time
of a kernel is not as trivially defined as it would seem, and we re-define it
as a total number of cycles over a full execution of the kernel.

In Section~\ref{sec:bench_harness}, we describe how \cesasme{} is able to
evaluate throughput predictors on this set of benchmarks by lifting their
predictions to a total number of cycles that can be compared to a
hardware-counter-based measure. A high-level view of \cesasme{} is shown in
Figure~\ref{fig:contrib}.
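
Conceptually, this lifting weights each block's predicted cycles per
iteration by the number of times the block actually executes. A minimal
sketch, assuming per-block occurrence counts have already been collected (the
functions and data layout below are ours, for illustration only):

\begin{lstlisting}[language=Python]
def lift_prediction(block_cycles: dict[str, float],
                    block_occurrences: dict[str, int]) -> float:
    """Lift per-iteration basic-block predictions to whole-kernel cycles:
    sum, over all blocks, of (cycles per execution) * (execution count)."""
    return sum(block_cycles[b] * block_occurrences[b] for b in block_cycles)

def relative_error(predicted_cycles: float, measured_cycles: float) -> float:
    """Error of a lifted prediction against the hardware-counter measure."""
    return abs(predicted_cycles - measured_cycles) / measured_cycles
\end{lstlisting}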

In Section~\ref{sec:exp_setup}, we detail our experimental setup and assess
our methodology. In Section~\ref{sec:results_analysis}, we compare the
predictors and analyze the results of \cesasme{}. In addition to statistical
studies, we use \cesasme's results to investigate analyzers' flaws. We show
that code analyzers do not always correctly model data dependencies through
memory accesses, substantially impacting their precision.
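
As a sketch of the kind of per-tool aggregation we mean, assuming paired
series of predictions and measures are available (the metric choices below
are common ones, not necessarily \cesasme's exact output):

\begin{lstlisting}[language=Python]
import statistics
from scipy.stats import kendalltau

def tool_statistics(predicted: list[float], measured: list[float]) -> dict:
    """Aggregate error statistics for one throughput predictor."""
    rel_err = [abs(p - m) / m for p, m in zip(predicted, measured)]
    tau, _pvalue = kendalltau(predicted, measured)
    return {
        "mape":        100 * statistics.mean(rel_err),   # mean abs. % error
        "median_err":  100 * statistics.median(rel_err),
        "kendall_tau": tau,  # does the tool rank kernels consistently?
    }
\end{lstlisting}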

\section{Re-defining the execution time of a
kernel}\label{sec:redefine_exec_time}

We saw above that state-of-the-art code analyzers disagreed by up to 100\,\%
on the execution time of a relatively simple kernel. The obvious solution to
assess their predictions is to compare them to an actual measure. However,
accounting for dependencies at the scale of a basic block makes this
\emph{actual measure} not as trivially defined as it would seem. Take for
instance the following kernel:

\begin{minipage}{0.90\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
@ -49,8 +90,6 @@ add $8, %rcx
\end{lstlisting}
\end{minipage}

\input{overview}

\noindent{}At first, it looks like an array copy from location \reg{rax} to
\reg{rbx}. Yet, if before the loop, \reg{rbx} is initialized to
\reg{rax}\texttt{+8}, there is a read-after-write dependency between the first

@ -92,28 +131,3 @@ prediction by the number of loop iterations; however, this bound might not
generally be known. More importantly, the compiler may apply any number of
transformations: unrolling, for instance, changes this number. Control flow may
also be complicated by code versioning.
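
As a toy illustration of why this matters, with made-up numbers:

\begin{lstlisting}[language=Python]
pred_cycles = 2.0  # predicted cycles per execution of the compiled block
src_iters = 1000   # trip count of the loop in the source program

# If the compiler unrolled the loop 4 times, the compiled block covers 4
# source iterations, hence executes only src_iters // 4 times:
unroll = 4
block_execs = src_iters // unroll

lifted = pred_cycles * block_execs  # 500.0 cycles
naive = pred_cycles * src_iters     # 2000.0: off by the unroll factor
\end{lstlisting}
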
@ -1,4 +1,5 @@
-\chapter{A more systematic approach to throughput prediction performance analysis}
+\chapter{A more systematic approach to throughput prediction performance
+analysis: \cesasme{}}
 
 \input{00_intro.tex}
 \input{05_related_works.tex}