diff --git a/manuscrit/50_CesASMe/00_intro.tex b/manuscrit/50_CesASMe/00_intro.tex
index 35d9ed5..d743ecd 100644
--- a/manuscrit/50_CesASMe/00_intro.tex
+++ b/manuscrit/50_CesASMe/00_intro.tex
@@ -40,13 +40,19 @@ assess more generally the landscape of code analyzers.
 What are the key bottlenecks to account for if one aims to predict the
 execution time of a kernel correctly? Are some of these badly accounted for by
 state-of-the-art code analyzers? This chapter, by conducting a broad experimental analysis of
-these tools, strives to answer to these questions.
+these tools, strives to answer these questions.

 \input{overview}

 \bigskip{}

-We present a fully-tooled solution to evaluate and compare the
+In \autoref{sec:redefine_exec_time}, we investigate how a kernel's execution time
+may be measured while correctly accounting for its dependencies. We
+advocate for the measurement of the total execution time of a computation
+kernel in its original context, coupled with a precise count of its number of
+iterations, used to normalize this measurement.
+
+We then present a fully-tooled solution to evaluate and compare the
 diversity of static throughput predictors. Our tool, \cesasme, solves two main
 issues in this direction. In Section~\ref{sec:bench_gen}, we describe how
 \cesasme{} generates a wide variety of computation kernels stressing different
@@ -56,8 +62,6 @@ Polybench~\cite{bench:polybench}, a C-level benchmark suite representative of
 scientific computation workloads, that we combine with a variety of
 optimisations, including polyhedral loop transformations.

-In \autoref{sec:redefine_exec_time}, we \qtodo{blabla full cycles}.
-
 In Section~\ref{sec:bench_harness}, we describe how \cesasme{} is able to
 evaluate throughput predictors on this set of benchmarks by lifting their
 predictions to a total number of cycles that can be compared to a hardware
@@ -71,63 +75,3 @@ analyze the results of \cesasme{}.
 to investigate analyzers' flaws. We show that code analyzers do not always
 correctly model data dependencies through memory accesses, substantially
 impacting their precision.
-
-\section{Re-defining the execution time of a
-kernel}\label{sec:redefine_exec_time}
-
-We saw above that state-of-the-art code analyzers disagreed by up to 100\,\% on
-the execution time of a relatively simple kernel. The obvious solution to
-assess their predictions is to compare them to an actual measure. However,
-accounting for dependencies at the scale of a basic block makes this
-\emph{actual measure} not as trivially defined as it would seem. Take for
-instance the following kernel:
-
-\begin{minipage}{0.90\linewidth}
-\begin{lstlisting}[language={[x86masm]Assembler}]
-mov (%rax, %rcx, 1), %r10
-mov %r10, (%rbx, %rcx, 1)
-add $8, %rcx
-\end{lstlisting}
-\end{minipage}
-
-\noindent{}At first, it looks like an array copy from location \reg{rax} to
-\reg{rbx}. Yet, if before the loop, \reg{rbx} is initialized to
-\reg{rax}\texttt{+8}, there is a read-after-write dependency between the first
-instruction and the second instruction at the previous iteration; which makes
-the throughput drop significantly. As we shall see in
-Section~\ref{ssec:bhive_errors}, \emph{without proper context, a basic
-block's throughput is not well-defined}.
-
-To recover the context of each basic block, we reason instead at the scale of
-a C source code. This
-makes the measures unambiguous: one can use hardware counters to measure the
-elapsed cycles during a loop nest. This requires a suite of benchmarks, in C,
-that both is representative of the domain studied, and wide enough to have a
-good coverage of the domain. However, this is not in itself sufficient to
-evaluate static tools: on the preceding matrix multiplication kernel, counters
-report 80,059 elapsed cycles ---~for the total loop.
-This number compares hardly to \llvmmca{}, \iaca{}, \ithemal{}, and \uica{}
-basic block-level predictions seen above.
-
-A common practice to make these numbers comparable is to renormalize them to
-instructions per cycles (IPC). Here, \llvmmca{} reports an IPC of
-$\frac{7}{1.5}~=~4.67$, \iaca{} and \ithemal{} report an IPC of
-$\frac{7}{2}~=~3.5$, and \uica{} reports an IPC of $\frac{7}{3}~=~2.3$. In this
-case, the measured IPC is 3.45, which is closest to \iaca{} and \ithemal. Yet,
-IPC is a metric for microarchitectural load, and \textit{tells nothing about a
-kernel's efficiency}. Indeed, the static number of instructions is affected by
-many compiler passes, such as scalar evolution, strength reduction, register
-allocation, instruction selection\ldots{} Thus, when comparing two compiled
-versions of the same code, IPC alone does not necessarily point to the most
-efficient version. For instance, a kernel using SIMD instructions will use
-fewer instructions than one using only scalars, and thus exhibit a lower or
-constant IPC; yet, its performance will unquestionably increase.
-
-The total cycles elapsed to solve a given problem, on the other
-hand, is a sound metric of the efficiency of an implementation. We thus
-instead \emph{lift} the predictions at basic-block level to a total number of
-cycles. In simple cases, this simply means multiplying the block-level
-prediction by the number of loop iterations; however, this bound might not
-generally be known. More importantly, the compiler may apply any number of
-transformations: unrolling, for instance, changes this number. Control flow may
-also be complicated by code versioning.
diff --git a/manuscrit/50_CesASMe/02_measuring_exec_time.tex b/manuscrit/50_CesASMe/02_measuring_exec_time.tex
new file mode 100644
index 0000000..6b794b4
--- /dev/null
+++ b/manuscrit/50_CesASMe/02_measuring_exec_time.tex
@@ -0,0 +1,64 @@
+\section{Re-defining the execution time of a
+kernel}\label{sec:redefine_exec_time}
+
+We saw above that state-of-the-art code analyzers disagreed by up to 100\,\% on
+the execution time of a relatively simple kernel. The obvious solution to
+assess their predictions is to compare them to an actual measure. However,
+accounting for dependencies at the scale of a basic block makes this
+\emph{actual measure} less trivial to define than it may seem. Take for
+instance the following kernel:
+
+\begin{minipage}{0.90\linewidth}
+\begin{lstlisting}[language={[x86masm]Assembler}]
+mov (%rax, %rcx, 1), %r10
+mov %r10, (%rbx, %rcx, 1)
+add $8, %rcx
+\end{lstlisting}
+\end{minipage}
+
+\noindent{}At first, it looks like an array copy from location \reg{rax} to
+\reg{rbx}. Yet, if \reg{rbx} is initialized to \reg{rax}\texttt{+8} before
+the loop, there is a read-after-write dependency between the first
+instruction and the second instruction of the previous iteration, which makes
+the throughput drop significantly. As we shall see in
+Section~\ref{ssec:bhive_errors}, \emph{without proper context, a basic
+block's throughput is not well-defined}.
+
+To recover the context of each basic block, we reason instead at the scale of
+a C source code. This
+makes the measures unambiguous: one can use hardware counters to measure the
+elapsed cycles during a loop nest. This requires a suite of benchmarks, in C,
+that is both representative of the studied domain and wide enough to provide
+good coverage of it. However, this is not in itself sufficient to
+evaluate static tools: on the preceding matrix multiplication kernel, counters
+report 80,059 elapsed cycles ---~for the total loop.
+This number can hardly be compared to the \llvmmca{}, \iaca{}, \ithemal{}, and
+\uica{} basic block-level predictions seen above.
+
+A common practice to make these numbers comparable is to renormalize them to
+instructions per cycle (IPC). Here, \llvmmca{} reports an IPC of
+$\sfrac{7}{1.5}~=~4.67$, \iaca{} and \ithemal{} report an IPC of
+$\sfrac{7}{2}~=~3.5$, and \uica{} reports an IPC of $\sfrac{7}{3}~=~2.3$. In this
+case, the measured IPC is 3.45, which is closest to \iaca{} and \ithemal. Yet,
+IPC is a metric for microarchitectural load, and \textit{tells nothing about a
+kernel's efficiency}. Indeed, the static number of instructions is affected by
+many compiler passes, such as scalar evolution, strength reduction, register
+allocation, instruction selection\ldots{} Thus, when comparing two compiled
+versions of the same code, IPC alone does not necessarily point to the most
+efficient version. For instance, a kernel using SIMD instructions will use
+fewer instructions than one using only scalars, and thus exhibit a lower or
+constant IPC; yet, its performance will unquestionably increase.
+
+The total cycles elapsed to solve a given problem, on the other
+hand, is a sound metric of the efficiency of an implementation. We thus
+\emph{lift} the predictions at basic-block level to a total number of
+cycles. In simple cases, this amounts to multiplying the block-level
+prediction by the number of loop iterations; however, this number might not
+be known in general. More importantly, the compiler may apply any number of
+transformations: unrolling, for instance, changes this number. Control flow may
+also be complicated by code versioning.
+
+Instead of guessing this final number of iterations at the assembly level, a
+sounder alternative is to measure it on the final binary. In
+\autoref{sec:bench_harness}, we present our solution to do so, using \gdb{} to
+instrument an execution of the binary.
diff --git a/manuscrit/50_CesASMe/05_related_works.tex b/manuscrit/50_CesASMe/05_related_works.tex
index 6b6b14c..ef857b8 100644
--- a/manuscrit/50_CesASMe/05_related_works.tex
+++ b/manuscrit/50_CesASMe/05_related_works.tex
@@ -1,20 +1,50 @@
 \section{Related works}
+
+\paragraph{Another comparative study: \anica{}.} The \anica{}
+framework~\cite{anica} also attempts to comparatively evaluate various throughput predictors by
+finding examples on which they are inaccurate. \anica{} starts with randomly
+generated assembly snippets fed to various code analyzers. Once it finds a
+snippet on which (some) code analyzers yield unsatisfactory results, it refines
+it through a process derived from abstract interpretation to reach a
+more general category of input, \eg{} ``a load to a SIMD register followed by a
+SIMD arithmetic operation''.
+
+\paragraph{A dynamic code analyzer: \gus{}.}
+So far, this manuscript has mostly been concerned with static code analyzers. Throughput prediction tools, however, are not all static.

-\gus~\cite{phd:gruber} dynamically predicts the throughput of a whole program
-region, instrumenting it to retrieve the exact events occurring through its
-execution. This way, \gus{} can more finely detect bottlenecks by
-sensitivity analysis, at the cost of a significantly longer run time.
+\gus{} is a dynamic tool first introduced in \fgruber{}'s PhD
+thesis~\cite{phd:gruber}. It leverages \qemu{}'s instrumentation capabilities to
+dynamically predict the throughput of user-defined regions of interest in whole
+programs.
+In these regions, it instruments every instruction, memory access, \ldots{} in
+order to retrieve the exact events occurring during the program's
+execution. \gus{} then leverages throughput, latency and microarchitectural
+models to analyze resource usage and produce an accurate prediction of the
+theoretical number of elapsed cycles.

-\smallskip
+Its main strength, however, resides in its \emph{sensitivity analysis}
+capabilities: by applying an arbitrary factor to some parts of the model (\eg{}
+latencies, arithmetic ports, \ldots{}), it is possible to investigate the
+impact of a specific resource on the final execution time of a region of
+interest. It can also accurately determine whether a resource is actually a
+bottleneck for a region, \ie{} whether increasing this resource's capabilities
+would reduce the execution time. The output of \gus{} on a region of interest
+provides a very detailed insight into each instruction's resource consumption
+and its contribution to the final execution time. As a dynamic analysis tool, it
+is also able to extract the dependencies an instruction exhibits during a real run.

-The \bhive{} profiler~\cite{bhive} takes another approach to basic block
-throughput measurement: by mapping memory at any address accessed by a basic
-block, it can effectively run and measure arbitrary code without context, often
----~but not always, as we discuss later~--- yielding good results.
+The main downside of \gus{}, however, is its slowness. Like most dynamic tools,
+it suffers from a heavy slowdown compared to a native execution of the binary,
+oftentimes of about $100\times$. While \gus{} remains a precious tool for a
+user willing to deeply optimize a computation kernel, this slowness makes it
+highly impractical to run on a large collection of kernels.

-\smallskip
-
-The \anica{} framework~\cite{anica} also attempts to evaluate throughput
-predictors by finding examples on which they are inaccurate. \anica{} starts
-with randomly generated assembly snippets, and refines them through a process
-derived from abstract interpretation to reach general categories of problems.
+\paragraph{An isolated basic-block profiler: \bhive{}.} In
+\autoref{sec:redefine_exec_time} above, we advocated for measuring a basic
+block's execution time \emph{in-context}. The \bhive{} profiler~\cite{bhive},
+initially written by \ithemal{}'s authors~\cite{ithemal} to provide their model
+with sufficient ---~and sufficiently accurate~--- training data, takes an
+orthogonal approach to basic block throughput measurement. By mapping memory at
+any address accessed by a basic block, it can effectively run and measure
+arbitrary code without context, often ---~but not always, as we discuss
+later~--- yielding good results.
diff --git a/manuscrit/50_CesASMe/15_harness.tex b/manuscrit/50_CesASMe/15_harness.tex
index 9e6c9ce..18ed49b 100644
--- a/manuscrit/50_CesASMe/15_harness.tex
+++ b/manuscrit/50_CesASMe/15_harness.tex
@@ -25,7 +25,7 @@ jump site.
To accurately obtain the occurrences of each basic block in the whole kernel's computation, -we then instrument it with \texttt{gdb} by placing a break +we then instrument it with \gdb{} by placing a break point at each basic block's first instruction in order to count the occurrences of each basic block between two calls to the \perf{} counters\footnote{We assume the program under analysis to be deterministic.}. While this @@ -60,7 +60,7 @@ markers prevent a binary from being run by overwriting registers with arbitrary values. This forces a user to run and measure a version which is different from the analyzed one. In our harness, we circumvent this issue by adding markers directly at the assembly level, editing the already compiled version. Our -\texttt{gdb} instrumentation procedure also respects this principle of +\gdb{} instrumentation procedure also respects this principle of single-compilation. As \qemu{} breaks the \perf{} interface, we have to run \gus{} with a preloaded stub shared library to be able to instrument binaries containing calls to \perf. diff --git a/manuscrit/50_CesASMe/20_evaluation.tex b/manuscrit/50_CesASMe/20_evaluation.tex index 6f831fe..7d2fe8c 100644 --- a/manuscrit/50_CesASMe/20_evaluation.tex +++ b/manuscrit/50_CesASMe/20_evaluation.tex @@ -11,9 +11,10 @@ predictions comparable to baseline hardware counter measures. \subsection{Experimental environment} The experiments presented in this paper were all realized on a Dell PowerEdge -C6420 machine from the Grid5000 cluster~\cite{grid5000}, equipped with 192\,GB -of DDR4 SDRAM ---~only a small fraction of which was used~--- and two Intel -Xeon Gold 6130 CPUs (x86-64, Skylake microarchitecture) with 16 cores each. +C6420 machine, from the \textit{Dahu} cluster of Grid5000 in +Grenoble~\cite{grid5000}. The server is equipped with 192\,GB of DDR4 SDRAM +---~only a small fraction of which was used~--- and two Intel Xeon Gold 6130 +CPUs (x86-64, Skylake microarchitecture) with 16 cores each. The experiments themselves were run inside a Docker environment based on Debian Bullseye. Care was taken to disable hyperthreading to improve measurements diff --git a/manuscrit/50_CesASMe/25_results_analysis.tex b/manuscrit/50_CesASMe/25_results_analysis.tex index cf14554..8901b20 100644 --- a/manuscrit/50_CesASMe/25_results_analysis.tex +++ b/manuscrit/50_CesASMe/25_results_analysis.tex @@ -72,11 +72,11 @@ through hardware counters, an excellent accuracy is expected. Its lack of support for control flow instructions can be held accountable for a portion of this accuracy drop; our lifting method, based on block occurrences instead of paths, can explain another portion. We also find that \bhive{} fails to produce -a result in about 40\,\% of the kernels explored ---~which means that, for those -cases, \bhive{} failed to produce a result on at least one of the constituent -basic blocks. In fact, this is due to the difficulties we mentioned in -\qtodo{[ref intro]} related to the need to reconstruct the context of each -basic block \textit{ex nihilo}. +a result in about 40\,\% of the kernels explored ---~which means that, for +those cases, \bhive{} failed to produce a result on at least one of the +constituent basic blocks. In fact, this is due to the difficulties we mentioned +in \autoref{sec:redefine_exec_time} earlier, related to the need to reconstruct +the context of each basic block \textit{ex nihilo}. 
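+
+As a rough illustration of what reconstructing this context involves, the
+following sketch (simplified and hypothetical, not \bhive{}'s actual
+implementation) maps a page of memory on demand at the faulting address from a
+\texttt{SIGSEGV} handler, so that an isolated basic block can keep executing
+whatever addresses it dereferences:
+
+\begin{minipage}{0.95\linewidth}
+\begin{lstlisting}[language=C]
+#define _GNU_SOURCE
+#include <signal.h>
+#include <stdint.h>
+#include <stdlib.h>
+#include <sys/mman.h>
+#include <unistd.h>
+
+#define PAGE_SIZE 4096UL   /* page size assumed to be 4 KiB */
+
+/* Map a fresh zero page at the faulting address, then let the kernel
+ * retry the faulting instruction. Gives up if the mapping is refused,
+ * e.g. below /proc/sys/vm/mmap_min_addr. */
+static void map_on_fault(int sig, siginfo_t *info, void *ctx)
+{
+    (void)sig; (void)ctx;
+    uintptr_t base = (uintptr_t)info->si_addr & ~(PAGE_SIZE - 1);
+    void *page = mmap((void *)base, PAGE_SIZE, PROT_READ | PROT_WRITE,
+                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
+    if (page == MAP_FAILED)
+        _exit(EXIT_FAILURE);
+}
+
+int main(void)
+{
+    struct sigaction sa = {0};
+    sa.sa_sigaction = map_on_fault;
+    sa.sa_flags = SA_SIGINFO;
+    sigaction(SIGSEGV, &sa, NULL);
+    /* ... run and measure the isolated basic block here ... */
+    return 0;
+}
+\end{lstlisting}
+\end{minipage}
+
+This is precisely the kind of mapping that can fail, \eg{} when the accessed
+address falls below \texttt{/proc/sys/vm/mmap\_min\_addr}, as detailed below.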
The basis of \bhive's method is to run the code to be measured, unrolled a number of times depending on the code size, with all memory pages but the @@ -96,7 +96,7 @@ initial value can be of crucial importance. The following experiments are executed on an Intel(R) Xeon(R) Gold 6230R CPU (Cascade Lake), with hyperthreading disabled. -\paragraph{Imprecise analysis} we consider the following x86-64 kernel. +\paragraph{Imprecise analysis.} We consider the following x86-64 kernel. \begin{minipage}{0.95\linewidth} \begin{lstlisting}[language={[x86masm]Assembler}] @@ -122,7 +122,7 @@ influence the results whenever it gets loaded into registers. \vspace{0.5em} -\paragraph{Failed analysis} some memory accesses will always result in an +\paragraph{Failed analysis.} Some memory accesses will always result in an error; for instance, it is impossible to \texttt{mmap} at an address lower than \texttt{/proc/sys/vm/mmap\_min\_addr}, defaulting to \texttt{0x10000}. Thus, with equal initial values for all registers, the following kernel would fail, diff --git a/manuscrit/50_CesASMe/30_future_works.tex b/manuscrit/50_CesASMe/30_future_works.tex index 85c056a..e0d2ae0 100644 --- a/manuscrit/50_CesASMe/30_future_works.tex +++ b/manuscrit/50_CesASMe/30_future_works.tex @@ -47,7 +47,7 @@ These perspectives can also be seen as future works: \smallskip -\paragraph{Program optimization} the whole program processing we have designed +\paragraph{Program optimization.} The whole program processing we have designed can be used not only to evaluate the performance model underlying a static analyzer, but also to guide program optimization itself. In such a perspective, we would generate different versions of the same program using the @@ -70,7 +70,7 @@ suffices; this however would require to control L1-residence otherwise. \smallskip -\paragraph{Dataset building} our microbenchmarks generation phase outputs a +\paragraph{Dataset building.} Our microbenchmarks generation phase outputs a large, diverse and representative dataset of microkernels. In addition to our harness, we believe that such a dataset could be used to improve existing data-dependant solutions. diff --git a/manuscrit/50_CesASMe/main.tex b/manuscrit/50_CesASMe/main.tex index 875530e..f67f3d9 100644 --- a/manuscrit/50_CesASMe/main.tex +++ b/manuscrit/50_CesASMe/main.tex @@ -2,6 +2,7 @@ analysis: \cesasme{}} \input{00_intro.tex} +\input{02_measuring_exec_time.tex} \input{05_related_works.tex} \input{10_bench_gen.tex} \input{15_harness.tex}