diff --git a/manuscrit/30_palmed/35_benchsuite_bb.tex b/manuscrit/30_palmed/35_benchsuite_bb.tex
index 4da2947..95625c4 100644
--- a/manuscrit/30_palmed/35_benchsuite_bb.tex
+++ b/manuscrit/30_palmed/35_benchsuite_bb.tex
@@ -72,7 +72,7 @@ details below.
 Altogether, this method generates, for x86-64 processors, 13\,778 SPEC-based
 and 2\,664 polybench-based basic blocks.
 
-\subsection{Automating basic block extraction}
+\subsection{Automating basic block extraction}\label{ssec:palmed_bb_extraction}
 
 This manual method, however, has multiple drawbacks. It is, obviously, tedious
 to manually compile and run a benchmark suite, then extract basic blocks using
diff --git a/manuscrit/50_CesASMe/00_intro.tex b/manuscrit/50_CesASMe/00_intro.tex
index d743ecd..a12d783 100644
--- a/manuscrit/50_CesASMe/00_intro.tex
+++ b/manuscrit/50_CesASMe/00_intro.tex
@@ -52,15 +52,17 @@ advocate for the measurement of the total execution time of a computation
 kernel in its original context, coupled with a precise measure of its number of
 iterations to normalize the measure.
 
-We then present a fully-tooled solution to evaluate and compare the
-diversity of static throughput predictors. Our tool, \cesasme, solves two main
-issues in this direction. In Section~\ref{sec:bench_gen}, we describe how
-\cesasme{} generates a wide variety of computation kernels stressing different
-parameters of the architecture, and thus of the predictors' models, while
-staying close to representative workloads. To achieve this, we use
-Polybench~\cite{bench:polybench}, a C-level benchmark suite representative of
-scientific computation workloads, that we combine with a variety of
-optimisations, including polyhedral loop transformations.
+We then present a fully-tooled solution to evaluate and compare the diversity
+of static throughput predictors. Our tool, \cesasme, solves two main issues in
+this direction. In Section~\ref{sec:bench_gen}, we describe how \cesasme{}
+generates a wide variety of computation kernels stressing different parameters
+of the architecture, and thus of the predictors' models, while staying close to
+representative workloads. To achieve this, we use
+Polybench~\cite{bench:polybench}, a C-level benchmark suite that we already
+introduced for \palmed{} in \autoref{sec:benchsuite_bb}. Polybench is composed
+of benchmarks representative of scientific computation workloads, that we
+combine with a variety of optimisations, including polyhedral loop
+transformations.
 
 In Section~\ref{sec:bench_harness}, we describe how \cesasme{} is able to
 evaluate throughput predictors on this set of benchmarks by lifting their
diff --git a/manuscrit/50_CesASMe/05_related_works.tex b/manuscrit/50_CesASMe/05_related_works.tex
index ef857b8..2f74ddb 100644
--- a/manuscrit/50_CesASMe/05_related_works.tex
+++ b/manuscrit/50_CesASMe/05_related_works.tex
@@ -12,7 +12,7 @@ SIMD arithmetic operation''.
 \paragraph{A dynamic code analyzer: \gus{}.}
 So far, this manuscript was mostly concerned with static code analyzers.
 Throughput prediction tools, however, are not all static.
-\gus is a dynamic tool first introduced in \fgruber{}'s PhD
+\gus{} is a dynamic tool first introduced in \fgruber{}'s PhD
 thesis~\cite{phd:gruber}. It leverages \qemu{}'s instrumentation capabilities
 to dynamically predict the throughput of user-defined regions of interest in
 whole program.
diff --git a/manuscrit/50_CesASMe/10_bench_gen.tex b/manuscrit/50_CesASMe/10_bench_gen.tex
index 258c39f..4d2e2fa 100644
--- a/manuscrit/50_CesASMe/10_bench_gen.tex
+++ b/manuscrit/50_CesASMe/10_bench_gen.tex
@@ -12,7 +12,7 @@ In practice, a microbenchmark's \textit{computational kernel} is a simple
 \texttt{for} loop, whose body contains no loops and whose bounds are
 statically known. A \emph{measure} is a number of repetitions $n$ of this computational
-kernel, $n$ being an user-specified parameter.
+kernel, $n$ being a user-specified parameter.
 The measure may be repeated an arbitrary number of times to improve stability.
@@ -46,7 +46,7 @@ removed because they are incompatible with PoCC (introduced below).
 The \texttt{lu} benchmark is left out as its execution alone takes longer than
 all others together, making its dynamic analysis (\eg{} with \gus)
 impractical. In addition to the importance of linear algebra within
-it, one of its important features is that it does not include computational
+Polybench, one of its important features is that it does not include computational
 kernels with conditional control flow (\eg{} \texttt{if-then-else})
 ---~however, it does includes conditional data flow, using the ternary
 conditional operator of C.
diff --git a/manuscrit/50_CesASMe/15_harness.tex b/manuscrit/50_CesASMe/15_harness.tex
index 18ed49b..f8f8f4d 100644
--- a/manuscrit/50_CesASMe/15_harness.tex
+++ b/manuscrit/50_CesASMe/15_harness.tex
@@ -21,7 +21,8 @@ kernel-level results thanks to the occurrences previously measured.
 Using the Capstone disassembler~\cite{tool:capstone}, we split the assembly
 code at each control flow instruction (jump, call, return, \ldots) and each
-jump site.
+jump site, as in \autoref{alg:bb_extr_procedure} from
+\autoref{ssec:palmed_bb_extraction}.
 
 To accurately obtain the occurrences of each basic block in the whole kernel's
 computation,
diff --git a/manuscrit/50_CesASMe/20_evaluation.tex b/manuscrit/50_CesASMe/20_evaluation.tex
index 7d2fe8c..461823c 100644
--- a/manuscrit/50_CesASMe/20_evaluation.tex
+++ b/manuscrit/50_CesASMe/20_evaluation.tex
@@ -10,11 +10,11 @@ predictions comparable to baseline hardware counter measures.
 
 \subsection{Experimental environment}
 
-The experiments presented in this paper were all realized on a Dell PowerEdge
-C6420 machine, from the \textit{Dahu} cluster of Grid5000 in
-Grenoble~\cite{grid5000}. The server is equipped with 192\,GB of DDR4 SDRAM
----~only a small fraction of which was used~--- and two Intel Xeon Gold 6130
-CPUs (x86-64, Skylake microarchitecture) with 16 cores each.
+The experiments presented in this chapter, unless stated otherwise, were all
+realized on a Dell PowerEdge C6420 machine, from the \textit{Dahu} cluster of
+Grid5000 in Grenoble~\cite{grid5000}. The server is equipped with 192\,GB of
+DDR4 SDRAM ---~only a small fraction of which was used~--- and two Intel Xeon
+Gold 6130 CPUs (x86-64, Skylake microarchitecture) with 16 cores each.
 
 The experiments themselves were run inside a Docker environment based on Debian
 Bullseye. Care was taken to disable hyperthreading to improve measurements
@@ -92,9 +92,11 @@ consequently, lifted predictions can reasonably be compared to one another.
     \footnotesize
     \begin{tabular}{l | r r r | r r r | r r r}
         \toprule
+        \multicolumn{1}{c|}{\textbf{Polybench}} &
         \multicolumn{3}{c|}{\textbf{Frontend}} & \multicolumn{3}{c|}{\textbf{Ports}} & \multicolumn{3}{c}{\textbf{Dependencies}} \\
+        \multicolumn{1}{c|}{\textbf{benchmark}} &
         \textbf{yes} & \textbf{no} & \textbf{disagr.} & \textbf{yes} & \textbf{no} & \textbf{disagr.} & \textbf{yes} & \textbf{no} & \textbf{disagr.} \\
         \midrule
diff --git a/manuscrit/50_CesASMe/25_results_analysis.tex b/manuscrit/50_CesASMe/25_results_analysis.tex
index 8901b20..138ddbc 100644
--- a/manuscrit/50_CesASMe/25_results_analysis.tex
+++ b/manuscrit/50_CesASMe/25_results_analysis.tex
@@ -48,8 +48,8 @@ in \autoref{chap:palmed} and \autoref{chap:frontend}.
 These results are, overall, significantly worse than what each tool's article
 presents. We attribute this difference mostly to the specificities of
 Polybench: being composed of computation kernels, it intrinsically stresses the
-CPU more than basic blocks extracted out of the Spec benchmark suite. This
-difference is clearly reflected in the experimental section of Palmed in
+CPU more than basic blocks extracted out of \eg{} the Spec benchmark suite.
+This difference is clearly reflected in the experimental section of Palmed in
 \autoref{chap:palmed}: the accuracy of most tools is worse on Polybench than
 on Spec, often by more than a factor of two.
@@ -105,6 +105,12 @@ The following experiments are executed on an Intel(R) Xeon(R) Gold 6230R CPU
 \end{lstlisting}
 \end{minipage}
 
+Note here that the \lstxasm{vmulsd out, in1, in2} instruction is the scalar
+double-precision float multiplication of values from \lstxasm{in1} and
+\lstxasm{in2}, storing the result in \lstxasm{out}; while \lstxasm{vmovsd out,
+in} is a simple \lstxasm{mov} operation from \lstxasm{in} to \lstxasm{out}
+operating on double-precision floats in \reg{xmm} registers.
+
 When executed with all the general purpose registers initialized to the default
 constant, \bhive{} reports 9 cycles per iteration, since \reg{rax} and
 \reg{r10} hold the same value, inducing a read-after-write dependency between
@@ -123,10 +129,11 @@ influence the results whenever it gets loaded into registers.
 \vspace{0.5em}
 
 \paragraph{Failed analysis.} Some memory accesses will always result in an
-error; for instance, it is impossible to \texttt{mmap} at an address lower
-than \texttt{/proc/sys/vm/mmap\_min\_addr}, defaulting to \texttt{0x10000}. Thus,
-with equal initial values for all registers, the following kernel would fail,
-since the second operation attempts to load at address 0:
+error; for instance, on Linux, it is impossible to \texttt{mmap} at an address
+lower than \texttt{/proc/sys/vm/mmap\_min\_addr}, defaulting to
+\texttt{0x10000}. Thus, with equal initial values for all registers, the
+following kernel would fail, since the second operation attempts to load at
+address 0:
 
 \begin{minipage}{0.95\linewidth}
 \begin{lstlisting}[language={[x86masm]Assembler}]
@@ -181,7 +188,7 @@ In the majority of the cases studied, the tools are not able to agree on the
 presence or absence of a type of bottleneck. Although it might seem that the
 tools are performing better on frontend bottleneck detection, it must be
 recalled that only two tools (versus three in the other cases) are reporting
-frontend bottlenecks, thus making it easier for them to agree.
+frontend bottlenecks, thus making it more likely for them to agree.
 
 \begin{table}
 \centering
@@ -333,6 +340,8 @@ While the results for \llvmmca, \uica{} and \iaca{} globally improved
 significantly, the most noticeable improvements are the reduced spread of the
 results and the Kendall's $\tau$ correlation coefficient's increase.
 
+\medskip{}
+
 From this, we argue that detecting memory-carried dependencies is a weak point
 in current state-of-the-art static analyzers, and that their results could be
diff --git a/manuscrit/50_CesASMe/30_future_works.tex b/manuscrit/50_CesASMe/30_future_works.tex
index e0d2ae0..a3dc9e6 100644
--- a/manuscrit/50_CesASMe/30_future_works.tex
+++ b/manuscrit/50_CesASMe/30_future_works.tex
@@ -29,7 +29,8 @@ We were also able to show in Section~\ref{ssec:memlatbound} that
 state-of-the-art static analyzers struggle to account for memory-carried
 dependencies; a weakness significantly impacting their overall results on our
 benchmarks. We believe that detecting
-and accounting for these dependencies is an important future works direction.
+and accounting for these dependencies is an important topic ---~which we will
+tackle in the following chapter.
 
 Moreover, we present this work in the form of a modular software package, each
 component of which exposes numerous adjustable parameters. These components can
diff --git a/manuscrit/50_CesASMe/99_conclusion.tex b/manuscrit/50_CesASMe/99_conclusion.tex
deleted file mode 100644
index 3815954..0000000
--- a/manuscrit/50_CesASMe/99_conclusion.tex
+++ /dev/null
@@ -1,2 +0,0 @@
-%% \section*{Conclusion}
-%% \todo{}
diff --git a/manuscrit/50_CesASMe/main.tex b/manuscrit/50_CesASMe/main.tex
index 0c75c78..7e31666 100644
--- a/manuscrit/50_CesASMe/main.tex
+++ b/manuscrit/50_CesASMe/main.tex
@@ -9,4 +9,3 @@ analysis: \cesasme{}}\label{chap:CesASMe}
 \input{20_evaluation.tex}
 \input{25_results_analysis.tex}
 \input{30_future_works.tex}
-\input{99_conclusion.tex}
diff --git a/manuscrit/include/macros.tex b/manuscrit/include/macros.tex
index 380a828..0702d2c 100644
--- a/manuscrit/include/macros.tex
+++ b/manuscrit/include/macros.tex
@@ -68,7 +68,7 @@
 \newcommand{\coeq}{CO$_{2}$eq}
 
-\newcommand{\figref}[1]{[\ref{#1}]}
+\newcommand{\figref}[1]{[\S\ref{#1}]}
 
 \newcommand{\reg}[1]{\texttt{\%#1}}
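Editor's note (illustration, not part of the patch): the `15_harness.tex` hunk above describes splitting assembly code at each control-flow instruction and each jump site. The sketch below shows that splitting logic in minimal Python over a pre-decoded instruction list; the real harness uses the Capstone disassembler, and the tuple format and mnemonic set here are simplified assumptions, not \cesasme{}'s actual code.

```python
# Hypothetical pre-decoded instruction stream: (address, mnemonic, jump_target),
# jump_target being None for non-branching instructions.
CONTROL_FLOW = {"jmp", "je", "jne", "call", "ret"}

def split_basic_blocks(insns):
    """Split an instruction list into basic blocks: a block starts at the
    first instruction, after every control-flow instruction (fall-through),
    and at every jump target (jump site)."""
    if not insns:
        return []
    leaders = {insns[0][0]}
    for i, (addr, mnem, target) in enumerate(insns):
        if mnem in CONTROL_FLOW:
            if target is not None:
                leaders.add(target)            # jump site opens a block
            if i + 1 < len(insns):
                leaders.add(insns[i + 1][0])   # fall-through opens a block
    blocks, current = [], []
    for insn in insns:
        if insn[0] in leaders and current:
            blocks.append(current)
            current = []
        current.append(insn)
    blocks.append(current)
    return blocks

kernel = [
    (0x00, "mov", None), (0x01, "cmp", None), (0x02, "je", 0x05),
    (0x03, "add", None), (0x04, "jmp", 0x06),
    (0x05, "sub", None), (0x06, "ret", None),
]
blocks = split_basic_blocks(kernel)
# four blocks: 0x00-0x02, 0x03-0x04, 0x05, 0x06
```

This mirrors the classic "leader" construction: block boundaries are derived first, then the stream is cut at each leader address.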
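Editor's note (illustration, not part of the patch): the `10_bench_gen.tex` hunk above defines a measure as $n$ repetitions of the computational kernel, with $n$ user-specified and the measure itself repeated to improve stability, while the intro hunk advocates normalizing total execution time by the iteration count. A minimal Python sketch of such a normalized, repeated measure; the function name and structure are assumptions, not \cesasme{}'s actual harness.

```python
import time

def measure(kernel, n, repeats=5):
    """Run `kernel` n times per measure, repeat the measure `repeats`
    times, and return the best per-repetition (normalized) duration
    in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for _ in range(n):
            kernel()
        elapsed = time.perf_counter() - start
        best = min(best, elapsed / n)  # normalize by iteration count
    return best

# Example: per-iteration cost of a trivial kernel.
per_iter = measure(lambda: sum(range(100)), n=1000, repeats=3)
```

Taking the minimum over repeats is one common stabilization choice; a median is another reasonable option.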