Proof-read chapter 2 (Palmed)
parent 596950a835
commit 4e13835886
6 changed files with 119 additions and 93 deletions

manuscrit
@@ -542,7 +542,7 @@ Given this property, we will use $\cyc{\kerK}$ to refer to $\cycmes{\kerK}{n}$
 for large values of $n$ in this manuscript whenever it is clear that this value
 is a measure.
 
-\subsubsection{Basic block of an assembly-level program}
+\subsubsection{Basic block of an assembly-level program}\label{sssec:def:bbs}
 
 Code analyzers are meant to analyze sections of straight-line code, that is,
 portions of code which do not contain control flow. As such, it is convenient
@@ -25,5 +25,6 @@ project during the first period of my own PhD.
 
 In this chapter, sections~\ref{sec:palmed_resource_models}
 through~\ref{sec:palmed_pipedream} describe \palmed{}, and present what is
-mostly not my own work. Sections~\ref{sec:benchsuite_bb} and later describe my
-own work on this project.
+mostly not my own work, but introduce important concepts for this manuscript.
+Sections~\ref{sec:benchsuite_bb} and later describe my own work on this
+project.
@@ -21,6 +21,17 @@ instruction's mapping is described as a string, \eg{}
 \texttt{VCVTT}\footnote{The precise variant is \texttt{VCVTTSD2SI (R32, XMM)}}
 is described as \texttt{1*p0+1*p01}.
 
+The two layers of such a model play a very different role. Indeed, the
+top layer (instructions to \uops{}) can be seen as an \emph{and}, or
+\emph{conjunctive} layer: an instruction is decomposed into each of its
+\uops{}, which must all be executed for the instruction to be completed. The
+bottom layer (\uops{} to ports), however, can be seen as an \emph{or}, or
+\emph{disjunctive} layer: a \uop{} must be executed on \emph{one} of those
+ports, each able to execute this \uop{}. This can be seen in the example from
+\uopsinfo{} above: \texttt{VCVTT} is decomposed into two \uops{}, the first
+necessarily executed on port 0, the second on port either 0 or 1.
+
+\medskip{}
 
 \begin{figure}
 \centering
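The two-layer structure added in this hunk can be illustrated in code. As a sketch only, assuming the simple `count*ports` grammar of the \uopsinfo{} example string (each digit after `p` naming one port; `parse_mapping` is a hypothetical helper, not part of \palmed{}):

```python
def parse_mapping(s):
    """Parse a uops.info-style mapping string such as "1*p0+1*p01" into a
    list of per-uop port sets (the disjunctive bottom layer).

    Assumption: each '+'-separated term is "count*pDD..." where every digit
    after 'p' names one compatible port.
    """
    uops = []
    for term in s.split("+"):
        count, ports = term.split("*")
        port_set = frozenset(int(d) for d in ports.lstrip("p"))
        # The conjunctive top layer: the instruction contributes `count`
        # uops, all of which must execute.
        uops.extend([port_set] * int(count))
    return uops

# VCVTT: two uops, the first bound to port 0, the second to port 0 or 1.
print(parse_mapping("1*p0+1*p01"))
```
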
@@ -38,14 +49,14 @@ dependencies in steady-state, and a port mapping is sufficient.
 As some \uops{} are compatible with multiple ports, the number of cycles
 required to run one occurrence of a kernel is not trivial. An assignment, for a
 given kernel, of its constitutive \uops{} to ports, is a \emph{schedule}
----~the number of cycles taken by a kernel with a fixed schedule is
+---~the number of cycles taken by a kernel given a fixed schedule is
 well-defined. The throughput of a kernel is defined as the throughput under an
 optimal schedule for this kernel.
 
 \begin{example}[Kernel throughputs with port mappings]
 The kernel $\kerK_1 = \texttt{DIVPS} + \texttt{BSR} + \texttt{JMP}$ can
 complete in one cycle: $\cyc{\kerK_1} = 1$. Indeed, according to the port
-mapping in \autoref{fig:sample_resource_mapping}, each of those
+mapping in \autoref{fig:sample_port_mapping}, each of those
 instructions is decoded into a single \uop{}, each compatible with a
 single, distinct port. Thus, the three instructions can be issued in
 parallel in one cycle.
@@ -59,7 +70,7 @@ optimal schedule for this kernel.
 
 The kernel $\kerK_3 = \texttt{ADDSS} + 2\times\texttt{BSR}$, however, needs
 at least two cycles to be executed: \texttt{BSR} can only be executed on
-port $p_1$, which can execute at most a \uop{} per cycle. $\cyc{\kerK_3} =
+port $p_1$, which can execute at most one \uop{} per cycle. $\cyc{\kerK_3} =
 2$.
 
 The instruction \texttt{ADDSS} alone, however, can be executed twice per
@@ -197,9 +208,12 @@ $\kerK$, and
 \texttt{ADDSS} & & & $\sfrac{1}{2}$ \\
 \texttt{BSR} & & 1 & $\sfrac{1}{2}$ \\
 \midrule
-Total & 0 & 1 & 1 \\
+Total & 0 & \textbf{1} & \textbf{1} \\
 \bottomrule
 \end{tabular}
 
+\smallskip{}
+
+$\implies{} \cyc{\kerK_2} = 1$
 \end{minipage}\hfill\begin{minipage}[t]{0.3\textwidth}
 \centering
 $\kerK_3$
@@ -212,9 +226,13 @@ $\kerK$, and
 \texttt{ADDSS} & & & $\sfrac{1}{2}$ \\
 $2\times$\texttt{BSR} & & 2 & 1 \\
 \midrule
-Total & 0 & 2 & 1.5 \\
+Total & 0 & \textbf{2} & 1.5 \\
 \bottomrule
 \end{tabular}
 
+\smallskip{}
+
+$\implies{} \cyc{\kerK_3} = 2$
+
 \end{minipage}\hfill\begin{minipage}[t]{0.3\textwidth}
 \centering
 $\kerK_4$
@@ -227,9 +245,13 @@ $\kerK$, and
 $2\times$\texttt{ADDSS} & & & 1 \\
 \texttt{BSR} & & 1 & $\sfrac{1}{2}$ \\
 \midrule
-Total & 0 & 1 & 1.5 \\
+Total & 0 & 1 & \textbf{1.5} \\
 \bottomrule
 \end{tabular}
 
+\smallskip{}
+
+$\implies{} \cyc{\kerK_4} = 1.5$
+
 \end{minipage}
 \end{example}
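The throughput-under-optimal-schedule notion edited in the hunks above admits a compact combinatorial characterization: with a purely disjunctive \uop{}-to-port layer, the cycle count is the maximum, over every subset $S$ of ports, of the number of \uops{} executable only on ports of $S$, divided by $|S|$. A brute-force sketch; the per-instruction port sets below are hypothetical stand-ins for the figure's mapping, chosen only to reproduce $\cyc{\kerK_1}$, $\cyc{\kerK_3}$ and $\cyc{\kerK_4}$ from the example:

```python
from itertools import combinations

def throughput(uop_ports, ports):
    """Cycles per kernel occurrence under an optimal schedule.

    uop_ports: one frozenset of compatible ports per uop of the kernel.
    Uops restricted to a subset S of ports share |S| slots per cycle, so
    max over S of load(S)/|S| lower-bounds the cycle count; for a plain
    port mapping this bound is attained by an optimal schedule.
    """
    best = 0.0
    for r in range(1, len(ports) + 1):
        for S in combinations(ports, r):
            # Uops that can ONLY run on ports of S.
            load = sum(1 for u in uop_ports if u <= set(S))
            best = max(best, load / len(S))
    return best

# Hypothetical single-uop instructions: DIVPS on p0, BSR on p1 only,
# ADDSS on p0 or p1, JMP on a distinct port p5.
DIVPS, BSR = frozenset({0}), frozenset({1})
ADDSS, JMP = frozenset({0, 1}), frozenset({5})
PORTS = [0, 1, 5]
print(throughput([DIVPS, BSR, JMP], PORTS))    # K1 -> 1.0
print(throughput([ADDSS, BSR, BSR], PORTS))    # K3 -> 2.0
print(throughput([ADDSS, ADDSS, BSR], PORTS))  # K4 -> 1.5
```
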
@@ -9,8 +9,8 @@ for their evaluation. However, random generation may yield basic blocks that
 are not representative of the various workloads our model might be used on.
 Thus, while arbitrarily or randomly generated microbenchmarks were well suited
 to the data acquisition phase needed to generate the model, the kernels on
-which the model would be evaluated could not be arbitrary, but must come from
-real-world programs.
+which the model would be evaluated could not be arbitrary, and must instead
+come from real-world programs.
 
 \subsection{Benchmark suites}
 
@@ -23,7 +23,7 @@ blocks used to evaluate \palmed{} should thus be reasonably close from these
 criteria.
 
 For this reason, we evaluate \palmed{} on basic blocks extracted from
-two well-known benchmark suites instead: Polybench and SPEC CPU 2017.
+two well-known benchmark suites: Polybench and SPEC CPU 2017.
 
 \paragraph{Polybench} is a suite of benchmarks built out of 30 kernels of
 numerical computation~\cite{bench:polybench}. Its benchmarks are
@@ -49,8 +49,8 @@ review~--- complicated.
 
 \subsection{Manually extracting basic blocks}
 
-Our first approach, that we used to extract basic blocks from the two benchmark
-suites introduced above for the evaluation included in our article for
+The first approach that we used to extract basic blocks from the two benchmark
+suites introduced above, for the evaluation included in our article for
 \palmed{}~\cite{palmed}, was very manual. We use different ---~though
 similar~--- approaches for Polybench and SPEC\@.
 
@@ -85,10 +85,11 @@ Most importantly, this manual extraction is not reproducible. This comes with
 two problems.
 \begin{itemize}
 \item If the dataset was to be lost, or if another researcher wanted to
-reproduce our results, the exact same dataset could not be recreated.
-The same general procedure could be followed again, but code and
-scripts would have to be re-written, manually typed and undocumented
-shell lines re-written, etc.
+reproduce our results, the exact same dataset could not be identically
+recreated. The same general procedure could be followed again, but code
+and scripts would have to be re-written, manually typed and
+undocumented shell lines re-written, etc. Most importantly, the
+re-extracted basic blocks may well be slightly different.
 \item The same consideration applies to porting the dataset to another ISA.
 Indeed, as the dataset consists of assembly-level basic-blocks, it
 cannot be transferred to another ISA: it has to be re-generated from
@@ -119,7 +120,7 @@ The \perf{} profiler~\cite{tool:perf} is part of the Linux kernel. It works by
 sampling the current program counter (as well as the stack, if requested, to
 obtain a stack trace) upon either event occurrences, such as number of elapsed
 CPU cycles, context switches, cache misses, \ldots, or simply at a fixed,
-user-defined frequency.
+user-defined time frequency.
 
 In our case, we use this second mode to uniformly sample the program counter
 across a run. We recover the output of the profiling as a \textit{raw trace}
@@ -195,11 +196,13 @@ memoize this step to do it only once per symbol. We then bissect the basic
 block corresponding to the current PC from the list of obtained basic blocks to
 count the occurrences of each block.
 
-To split a symbol into basic blocks, we determine using \texttt{capstone} its
-set of \emph{flow sites} and \emph{jump sites}. The former is the set of
-addresses just after a control flow instruction, while the latter is the set of
-addresses to which jump instructions may jump. We then split the
-straight-line code of the symbol using the union of both sets as boundaries.
+To split a symbol into basic blocks, we follow the procedure introduced by our
+formal definition in \autoref{sssec:def:bbs}. We determine using
+\texttt{capstone} its set of \emph{flow sites} and \emph{jump sites}. The
+former is the set of addresses just after a control flow instruction, while the
+latter is the set of addresses to which jump instructions may jump. We then
+split the straight-line code of the symbol using the union of both sets as
+boundaries.
 
 \medskip
 
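The splitting and PC-bisection steps rewritten in this hunk can be sketched as follows. This is an illustrative reimplementation over pre-decoded instructions: the `(addr, size, jump_target)` tuples stand in for what a disassembler such as \texttt{capstone} would yield, and the helper names are hypothetical:

```python
import bisect

def split_blocks(instructions):
    """Split a symbol's straight-line code into basic blocks.

    instructions: address-sorted list of (addr, size, jump_target_or_None).
    Flow sites are the addresses just after a control-flow instruction;
    jump sites are the addresses a jump may target. The union of both
    sets gives the basic-block boundaries.
    """
    flow_sites, jump_sites = set(), set()
    for addr, size, target in instructions:
        if target is not None:           # control-flow instruction
            flow_sites.add(addr + size)  # address right after it
            jump_sites.add(target)       # address it may jump to
    boundaries = flow_sites | jump_sites
    blocks = []
    for addr, size, target in instructions:
        if addr in boundaries or not blocks:
            blocks.append([])            # start a new basic block
        blocks[-1].append(addr)
    return blocks

def block_of(pc, block_starts):
    """Bisect the sorted block start addresses to find the block
    containing a sampled program counter."""
    return block_starts[bisect.bisect_right(block_starts, pc) - 1]
```

For instance, three two-byte instructions at 0, 2, 4 where the one at 2 jumps back to 0 yield boundaries {0, 4} and the blocks [0, 2] and [4]; `block_of` then maps each sampled PC to its block to accumulate occurrence counts.
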
@@ -20,7 +20,7 @@ does not support some instructions (control flow, x86-64 divisions, \ldots),
 those are stripped from the original kernel, which might denature the original
 basic block.
 
-To evaluate \palmed{}, the same kernel is run:
+To evaluate \palmed{}, the same kernel's run time is measured:
 
 \begin{enumerate}
 
@@ -31,8 +31,9 @@ To evaluate \palmed{}, the same kernel is run:
 
 \item{} using the \uopsinfo{}~\cite{uopsinfo} port mapping, converted to its
 equivalent conjunctive resource mapping\footnote{When this evaluation was
-made, \uica{}~\cite{uica} was not yet published. Since \palmed{} provides a
-resource mapping, the comparison to \uopsinfo{} is fair.};
+made, \uica{}~\cite{uica} was not yet published. Since \palmed{} only
+provides a resource mapping, but no frontend, the comparison to \uopsinfo{}
+is fair.};
 
 \item{} using \pmevo~\cite{PMEvo}, ignoring any instruction not supported by
 its provided mapping;
@@ -98,21 +99,21 @@ all} in the basic block was present in the model.
 
 This notion of coverage is partial towards \palmed{}. As we use \pipedream{} as
 a baseline measurement, instructions that cannot be benchmarked by \pipedream{}
-are pruned from the benchmarks; hence, \palmed{} has a 100\,\% coverage
+are pruned from the benchmarks. Hence, \palmed{} has a 100\,\% coverage
 \emph{by construction} --- which does not mean that is supports all the
-instructions found in the original basic blocks.
+instructions found in the original basic blocks, but only that our methodology
+is unable to process basic blocks unsupported by Pipedream.
 
 \subsection{Results}
 
 \input{40-1_results_fig.tex}
 
-We run the evaluation harness on three different machines:
+We run the evaluation harness on two different machines:
 \begin{itemize}
 \item{} an x86-64 Intel \texttt{SKL-SP}-based machine, with two Intel Xeon Silver
 4114 CPU, totalling 20 cores;
 \item{} an x86-64 AMD \texttt{ZEN1}-based machine, with a single AMD EPYC 7401P
-CPU with 24 cores;
-\item{} an ARMv8a Raspberry Pi 4 with 4 Cortex A72 cores.
+CPU with 24 cores.
 \end{itemize}
 
 As \iaca{} only supports Intel CPUs, and \uopsinfo{} only supports x86-64
@@ -130,64 +131,3 @@ $y$ for a significant number of microkernels with a measured IPC of $x$. The
 closer a prediction is to the red horizontal line, the more accurate it is.
 
 These results are analyzed in the full article~\cite{palmed}.
-
-\section{Other contributions}
-
-\paragraph{Using a database to enhance reproducibility and usability.}
-\palmed{}'s method is driven by a large number of \pipedream{} benchmarks. For
-instance, generating a mapping for an x86-64 machine requires the execution of
-about $10^6$ benchmarks on the CPU\@.
-
-Each of these measures takes time: the multiset of instructions must be
-transformed into an assembly code, including the register mapping phrase; this
-assembly must be assembled and linked into an ELF file; and finally, the
-benchmark must be actually executed, with multiple warm-up rounds and multiple
-measures. On average, on the \texttt{SKL-SP} CPU, each benchmark requires half
-to two-thirds of a second on a single core. The whole benchmarking phase, on
-the \texttt{SKL-SP} processor, roughly took eight hours.
-
-\medskip{}
-
-As \palmed{} relies on the Gurobi optimizer, which is itself non-deterministic,
-\palmed{} cannot be made truly reproducible. However, the slight fluctuations
-in measured cycles between two executions of a benchmark are also a source of
-non-determinism in the execution of Palmed.
-
-\medskip{}
-
-For both these reasons, we implemented into \palmed{} a database-backed storage of
-measurements. Whenever \palmed{} needs to measure a kernel, it will first try
-to find a corresponding measure in the database; if the measure does not exist
-yet, it will be run, then stored in database.
-
-For each measure, we further store for context:
-the time and date at which the measure was made;
-the machine on which the measure was made;
-how many times the measure was repeated;
-how many warm-up rounds were performed;
-how many instructions were in the unrolled loop;
-how many instructions were executed per repetition in total;
-the parameters for \pipedream{}'s assembly generation procedure;
-how the final result was aggregated from the repeated measures;
-the variance of the set of measures;
-how many CPU cores were active when the measure was made;
-which CPU core was used for this measure;
-whether the kernel's scheduler was set to FIFO mode.
-
-\bigskip{}
-
-We believe that, as a whole, the use of a database increases the usability of
-\palmed{}: it is faster if some measures were already made in the past and
-recovers better upon error.
-
-This also gives us a better confidence towards our results: we can easily
-archive and backup our experimental data, and we can easily trace the origin of
-a measure if needed. We can also reuse the exact same measures between two runs
-of \palmed{}, to ensure that the results are as consistent as possible.
-
-
-\paragraph{General engineering contributions.} Apart from purely scientific
-contributions, we worked on improving \palmed{} as a whole, from the
-engineering point of view: code quality; reliable parallel measurements;
-recovery upon error; logging; \ldots{} These improvements amount to about a
-hundred merge-requests between \nderumig{} and myself.
@@ -0,0 +1,60 @@
+\section{Other contributions}
+
+\paragraph{Using a database to enhance reproducibility and usability.}
+\palmed{}'s method is driven by a large number of \pipedream{} benchmarks. For
+instance, generating a mapping for an x86-64 machine requires the execution of
+about $10^6$ benchmarks on the CPU\@.
+
+Each of these measures takes time: the multiset of instructions must be
+transformed into an assembly code, including the register mapping phrase; this
+assembly must be assembled and linked into an ELF file; and finally, the
+benchmark must be actually executed, with multiple warm-up rounds and multiple
+measures. On average, on the \texttt{SKL-SP} CPU, each benchmark requires half
+to two-thirds of a second on a single core. The whole benchmarking phase, on
+the \texttt{SKL-SP} processor, roughly took eight hours.
+
+\medskip{}
+
+As \palmed{} relies on the Gurobi optimizer, which is itself non-deterministic,
+\palmed{} cannot be made truly reproducible. However, the slight fluctuations
+in measured cycles between two executions of a benchmark are also a major
+source of non-determinism in the execution of Palmed.
+
+\medskip{}
+
+For both these reasons, we implemented into \palmed{} a database-backed storage of
+measurements. Whenever \palmed{} needs to measure a kernel, it will first try
+to find a corresponding measure in the database; if the measure does not exist
+yet, it will be run, then stored in database.
+
+For each measure, we further store for context:
+the time and date at which the measure was made;
+the machine on which the measure was made;
+how many times the measure was repeated;
+how many warm-up rounds were performed;
+how many instructions were in the unrolled loop;
+how many instructions were executed per repetition in total;
+the parameters for \pipedream{}'s assembly generation procedure;
+how the final result was aggregated from the repeated measures;
+the variance of the set of measures;
+how many CPU cores were active when the measure was made;
+which CPU core was used for this measure;
+whether the kernel's scheduler was set to FIFO mode.
+
+\bigskip{}
+
+We believe that, as a whole, the use of a database increases the usability of
+\palmed{}: it is faster if some measures were already made in the past and
+recovers better upon error.
+
+This also gives us a better confidence towards our results: we can easily
+archive and backup our experimental data, and we can easily trace the origin of
+a measure if needed. We can also reuse the exact same measures between two runs
+of \palmed{}, to ensure that the results are as consistent as possible.
+
+
+\paragraph{General engineering contributions.} Apart from purely scientific
+contributions, we worked on improving \palmed{} as a whole, from the
+engineering point of view: code quality; reliable parallel measurements;
+recovery upon error; logging; \ldots{} These improvements amount to about a
+hundred merge-requests between \nderumig{} and myself.
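The database-backed measurement storage described in this new file can be sketched as a minimal memoization layer. This is an assumption-laden sketch, not \palmed{}'s actual code: the one-column schema is hypothetical (the real database also stores the contextual fields listed above), and `run_benchmark` stands in for a \pipedream{} invocation:

```python
import sqlite3

def measure(db, kernel, run_benchmark):
    """Return the measured cycles for `kernel`, running the benchmark
    only if no stored measure exists yet (sketch; hypothetical schema)."""
    db.execute("CREATE TABLE IF NOT EXISTS measures "
               "(kernel TEXT PRIMARY KEY, cycles REAL)")
    row = db.execute("SELECT cycles FROM measures WHERE kernel = ?",
                     (kernel,)).fetchone()
    if row is not None:
        return row[0]                  # reuse the stored measure
    cycles = run_benchmark(kernel)     # actually run the benchmark once
    db.execute("INSERT INTO measures VALUES (?, ?)", (kernel, cycles))
    return cycles

db = sqlite3.connect(":memory:")
runs = []
fake_bench = lambda k: runs.append(k) or 3.5  # stand-in for pipedream
print(measure(db, "ADDSS+BSR", fake_bench))   # first call: runs it
print(measure(db, "ADDSS+BSR", fake_bench))   # second call: cached
print(len(runs))                              # the benchmark ran once
```

Keying measures on the kernel (plus, in practice, the machine and measurement parameters) is what lets two runs of the tool reuse the exact same measures.
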