CesASMe: another integration pass

This commit is contained in:
Théophile Bastian 2023-09-26 11:39:26 +02:00
parent b44b6fcf10
commit 4c4f70c246
8 changed files with 133 additions and 93 deletions


@ -40,13 +40,19 @@ assess more generally the landscape of code analyzers. What are the key
bottlenecks to account for if one aims to predict the execution time of a
kernel correctly? Are some of these badly accounted for by state-of-the-art
code analyzers? This chapter, by conducting a broad experimental analysis of
these tools, strives to answer these questions.

\input{overview}

\bigskip{}

In \autoref{sec:redefine_exec_time}, we investigate how a kernel's execution
time may be measured if we want to correctly account for its dependencies. We
advocate for the measurement of the total execution time of a computation
kernel in its original context, coupled with a precise count of its
iterations to normalize the measurement.

We then present a fully-tooled solution to evaluate and compare the
diversity of static throughput predictors. Our tool, \cesasme, solves two main
issues in this direction. In Section~\ref{sec:bench_gen}, we describe how
\cesasme{} generates a wide variety of computation kernels stressing different
@ -56,8 +62,6 @@ Polybench~\cite{bench:polybench}, a C-level benchmark suite representative of
scientific computation workloads, that we combine with a variety of
optimisations, including polyhedral loop transformations.

In Section~\ref{sec:bench_harness}, we describe how \cesasme{} is able to
evaluate throughput predictors on this set of benchmarks by lifting their
predictions to a total number of cycles that can be compared to a hardware
@ -71,63 +75,3 @@ analyze the results of \cesasme{}.
to investigate analyzers' flaws. We show that code
analyzers do not always correctly model data dependencies through memory
accesses, substantially impacting their precision.
\section{Re-defining the execution time of a
kernel}\label{sec:redefine_exec_time}
We saw above that state-of-the-art code analyzers disagreed by up to 100\,\% on
the execution time of a relatively simple kernel. The obvious solution to
assess their predictions is to compare them to an actual measure. However,
accounting for dependencies at the scale of a basic block makes this
\emph{actual measure} not as trivially defined as it would seem. Take for
instance the following kernel:
\begin{minipage}{0.90\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
mov (%rax, %rcx, 1), %r10
mov %r10, (%rbx, %rcx, 1)
add $8, %rcx
\end{lstlisting}
\end{minipage}
\noindent{}At first, it looks like an array copy from location \reg{rax} to
\reg{rbx}. Yet, if before the loop, \reg{rbx} is initialized to
\reg{rax}\texttt{+8}, there is a read-after-write dependency between the first
instruction and the second instruction of the previous iteration, which makes
the throughput drop significantly. As we shall see in
Section~\ref{ssec:bhive_errors}, \emph{without proper context, a basic
block's throughput is not well-defined}.
To recover the context of each basic block, we reason instead at the scale of
the full C source code. This
makes the measurements unambiguous: one can use hardware counters to measure
the elapsed cycles during a loop nest. This requires a suite of benchmarks,
in C, that is both representative of the studied domain and wide enough to
cover it well. However, this is not in itself sufficient to
evaluate static tools: on the preceding matrix multiplication kernel, counters
report 80,059 elapsed cycles ---~for the total loop.
This number is hard to compare with the \llvmmca{}, \iaca{}, \ithemal{}, and
\uica{} basic block-level predictions seen above.

A common practice to make these numbers comparable is to renormalize them to
instructions per cycle (IPC). Here, \llvmmca{} reports an IPC of
$\frac{7}{1.5}~\approx~4.67$, \iaca{} and \ithemal{} report an IPC of
$\frac{7}{2}~=~3.5$, and \uica{} reports an IPC of
$\frac{7}{3}~\approx~2.33$. In this case, the measured IPC is 3.45, which is
closest to \iaca{} and \ithemal{}. Yet,
IPC is a metric for microarchitectural load, and \textit{tells nothing about a
kernel's efficiency}. Indeed, the static number of instructions is affected by
many compiler passes, such as scalar evolution, strength reduction, register
allocation, instruction selection\ldots{} Thus, when comparing two compiled
versions of the same code, IPC alone does not necessarily point to the most
efficient version. For instance, a kernel using SIMD instructions will use
fewer instructions than one using only scalars, and thus exhibit a lower or
constant IPC; yet, its performance will unquestionably increase.
The total cycles elapsed to solve a given problem, on the other
hand, is a sound metric of the efficiency of an implementation. We therefore
\emph{lift} predictions from the basic-block level to a total number of
cycles. In simple cases, this amounts to multiplying the block-level
prediction by the number of loop iterations; however, this bound might not
generally be known. More importantly, the compiler may apply any number of
transformations: unrolling, for instance, changes this number. Control flow may
also be complicated by code versioning.


@ -0,0 +1,64 @@
\section{Re-defining the execution time of a
kernel}\label{sec:redefine_exec_time}
We saw above that state-of-the-art code analyzers disagreed by up to 100\,\% on
the execution time of a relatively simple kernel. The obvious solution to
assess their predictions is to compare them to an actual measure. However,
accounting for dependencies at the scale of a basic block makes this
\emph{actual measure} not as trivially defined as it would seem. Take for
instance the following kernel:
\begin{minipage}{0.90\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
mov (%rax, %rcx, 1), %r10
mov %r10, (%rbx, %rcx, 1)
add $8, %rcx
\end{lstlisting}
\end{minipage}
\noindent{}At first, it looks like an array copy from location \reg{rax} to
\reg{rbx}. Yet, if before the loop, \reg{rbx} is initialized to
\reg{rax}\texttt{+8}, there is a read-after-write dependency between the first
instruction and the second instruction of the previous iteration, which makes
the throughput drop significantly. As we shall see in
Section~\ref{ssec:bhive_errors}, \emph{without proper context, a basic
block's throughput is not well-defined}.
To recover the context of each basic block, we reason instead at the scale of
the full C source code. This
makes the measurements unambiguous: one can use hardware counters to measure
the elapsed cycles during a loop nest. This requires a suite of benchmarks,
in C, that is both representative of the studied domain and wide enough to
cover it well. However, this is not in itself sufficient to
evaluate static tools: on the preceding matrix multiplication kernel, counters
report 80,059 elapsed cycles ---~for the total loop.
This number is hard to compare with the \llvmmca{}, \iaca{}, \ithemal{}, and
\uica{} basic block-level predictions seen above.

A common practice to make these numbers comparable is to renormalize them to
instructions per cycle (IPC). Here, \llvmmca{} reports an IPC of
$\sfrac{7}{1.5}~\approx~4.67$, \iaca{} and \ithemal{} report an IPC of
$\sfrac{7}{2}~=~3.5$, and \uica{} reports an IPC of
$\sfrac{7}{3}~\approx~2.33$. In this case, the measured IPC is 3.45, which is
closest to \iaca{} and \ithemal{}. Yet,
IPC is a metric for microarchitectural load, and \textit{tells nothing about a
kernel's efficiency}. Indeed, the static number of instructions is affected by
many compiler passes, such as scalar evolution, strength reduction, register
allocation, instruction selection\ldots{} Thus, when comparing two compiled
versions of the same code, IPC alone does not necessarily point to the most
efficient version. For instance, a kernel using SIMD instructions will use
fewer instructions than one using only scalars, and thus exhibit a lower or
constant IPC; yet, its performance will unquestionably increase.
The total cycles elapsed to solve a given problem, on the other
hand, is a sound metric of the efficiency of an implementation. We therefore
\emph{lift} predictions from the basic-block level to a total number of
cycles. In simple cases, this amounts to multiplying the block-level
prediction by the number of loop iterations; however, this bound might not
generally be known. More importantly, the compiler may apply any number of
transformations: unrolling, for instance, changes this number. Control flow may
also be complicated by code versioning.
Instead of guessing this final number of iterations at the assembly level, a
sounder alternative is to measure it on the final binary. In
\autoref{sec:bench_harness}, we present our solution to do so, using \gdb{} to
instrument an execution of the binary.


@ -1,20 +1,50 @@
\section{Related works}

\paragraph{Another comparative study: \anica{}.} The \anica{}
framework~\cite{anica} also attempts to comparatively evaluate various
throughput predictors by finding examples on which they are inaccurate.
\anica{} starts with randomly generated assembly snippets fed to various code
analyzers. Once it finds a snippet on which (some) code analyzers yield
unsatisfying results, it refines it through a process derived from abstract
interpretation to reach a more general category of input, \eg{} ``a load to a
SIMD register followed by a SIMD arithmetic operation''.
\paragraph{A dynamic code analyzer: \gus{}.}
So far, this manuscript was mostly concerned with static code analyzers.
Throughput prediction tools, however, are not all static.
\gus{} is a dynamic tool first introduced in \fgruber{}'s PhD
thesis~\cite{phd:gruber}. It leverages \qemu{}'s instrumentation capabilities
to dynamically predict the throughput of user-defined regions of interest in
whole programs.
In these regions, it instruments every instruction, memory access, \ldots{} in
order to retrieve the exact events occurring through the program's
execution. \gus{} then leverages throughput, latency and microarchitectural
models to analyze resource usage and produce an accurate theoretical
prediction of elapsed cycles.
Its main strength, however, resides in its \emph{sensitivity analysis}
capabilities: by applying an arbitrary factor to some parts of the model
(\eg{} latencies, arithmetic ports, \ldots{}), it is possible to investigate
the impact of a specific resource on the final execution time of a region of
interest. It can also accurately determine whether a resource is actually a
bottleneck for a region, \ie{} whether increasing this resource's
capabilities would reduce the execution time. The output of \gus{} on a
region of interest provides very detailed insight into each instruction's
resource consumption and its contribution to the final execution time. As a
dynamic analysis tool, it is also able to extract the dependencies an
instruction exhibits on a real run.
The main downside of \gus{}, however, is its slowness. Like most dynamic
tools, it suffers from a heavy slowdown compared to a native execution of the
binary, oftentimes about $100\times$ slower. While it remains a precious tool
for a user willing to deeply optimize an execution kernel, this makes \gus{}
highly impractical to run on a large collection of execution kernels.
\paragraph{An isolated basic-block profiler: \bhive{}.} In
\autoref{sec:redefine_exec_time} above, we advocated for measuring a basic
block's execution time \emph{in-context}. The \bhive{} profiler~\cite{bhive},
initially written by \ithemal{}'s authors~\cite{ithemal} to provide their
model with sufficient ---~and sufficiently accurate~--- training data, takes
an orthogonal approach to basic block throughput measurement. By mapping
memory at any address accessed by a basic block, it can effectively run and
measure arbitrary code without context, often ---~but not always, as we
discuss later~--- yielding good results.


@ -25,7 +25,7 @@ jump site.
To accurately obtain the occurrences of each basic block in the whole
kernel's computation,
we then instrument it with \gdb{} by placing a breakpoint
at each basic block's first instruction in order to count the occurrences
of each basic block between two calls to the \perf{} counters\footnote{We
assume the program under analysis to be deterministic.}. While this
@ -60,7 +60,7 @@ markers prevent a binary from being run by overwriting registers with arbitrary
values. This forces a user to run and measure a version which is different
from the analyzed one. In our harness, we circumvent this issue by adding
markers directly at the assembly level, editing the already compiled version.
Our \gdb{} instrumentation procedure also respects this principle of
single compilation. As \qemu{} breaks the \perf{} interface, we have to run
\gus{} with a preloaded stub shared library to be able to instrument binaries
containing calls to \perf.


@ -11,9 +11,10 @@ predictions comparable to baseline hardware counter measures.
\subsection{Experimental environment}

The experiments presented in this paper were all realized on a Dell PowerEdge
C6420 machine from the \textit{Dahu} cluster of Grid5000 in
Grenoble~\cite{grid5000}. The server is equipped with 192\,GB of DDR4 SDRAM
---~only a small fraction of which was used~--- and two Intel Xeon Gold 6130
CPUs (x86-64, Skylake microarchitecture) with 16 cores each.

The experiments themselves were run inside a Docker environment based on
Debian Bullseye. Care was taken to disable hyperthreading to improve
measurements


@ -72,11 +72,11 @@ through hardware counters, an excellent accuracy is expected. Its lack of
support for control flow instructions can be held accountable for a portion
of this accuracy drop; our lifting method, based on block occurrences instead
of paths, can explain another portion. We also find that \bhive{} fails to
produce a result in about 40\,\% of the kernels explored ---~which means
that, for those cases, \bhive{} failed to produce a result on at least one of
the constituent basic blocks. In fact, this is due to the difficulties we
mentioned in \autoref{sec:redefine_exec_time} earlier, related to the need to
reconstruct the context of each basic block \textit{ex nihilo}.
The basis of \bhive's method is to run the code to be measured, unrolled a
number of times depending on the code size, with all memory pages but the
@ -96,7 +96,7 @@ initial value can be of crucial importance.
The following experiments are executed on an Intel(R) Xeon(R) Gold 6230R CPU
(Cascade Lake), with hyperthreading disabled.

\paragraph{Imprecise analysis.} We consider the following x86-64 kernel.

\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
@ -122,7 +122,7 @@ influence the results whenever it gets loaded into registers.
\vspace{0.5em}

\paragraph{Failed analysis.} Some memory accesses will always result in an
error; for instance, it is impossible to \texttt{mmap} at an address lower
than \texttt{/proc/sys/vm/mmap\_min\_addr}, which defaults to
\texttt{0x10000}. Thus, with equal initial values for all registers, the
following kernel would fail,


@ -47,7 +47,7 @@ These perspectives can also be seen as future works:
\smallskip

\paragraph{Program optimization.} The whole program processing we have
designed can be used not only to evaluate the performance model underlying a
static analyzer, but also to guide program optimization itself. In such a
perspective, we would generate different versions of the same program using
the
@ -70,7 +70,7 @@ suffices; this however would require to control L1-residence otherwise.
\smallskip

\paragraph{Dataset building.} Our microbenchmark generation phase outputs a
large, diverse and representative dataset of microkernels. In addition to our
harness, we believe that such a dataset could be used to improve existing
data-dependent solutions.


@ -2,6 +2,7 @@
analysis: \cesasme{}}

\input{00_intro.tex}
\input{02_measuring_exec_time.tex}
\input{05_related_works.tex}
\input{10_bench_gen.tex}
\input{15_harness.tex}