Proof-read chapter 2 (Palmed)

Théophile Bastian 2024-08-17 13:03:32 +02:00
parent 596950a835
commit 4e13835886
6 changed files with 119 additions and 93 deletions


@ -542,7 +542,7 @@ Given this property, we will use $\cyc{\kerK}$ to refer to $\cycmes{\kerK}{n}$
for large values of $n$ in this manuscript whenever it is clear that this value
is a measure.
\subsubsection{Basic block of an assembly-level program}\label{sssec:def:bbs}
Code analyzers are meant to analyze sections of straight-line code, that is,
portions of code which do not contain control flow. As such, it is convenient


@ -25,5 +25,6 @@ project during the first period of my own PhD.
In this chapter, sections~\ref{sec:palmed_resource_models}
through~\ref{sec:palmed_pipedream} describe \palmed{}, and present what is
mostly not my own work, but introduce important concepts for this manuscript.
Sections~\ref{sec:benchsuite_bb} and later describe my own work on this
project.


@ -21,6 +21,17 @@ instruction's mapping is described as a string, \eg{}
\texttt{VCVTT}\footnote{The precise variant is \texttt{VCVTTSD2SI (R32, XMM)}}
is described as \texttt{1*p0+1*p01}.
The two layers of such a model play very different roles. Indeed, the
top layer (instructions to \uops{}) can be seen as an \emph{and}, or
\emph{conjunctive}, layer: an instruction is decomposed into each of its
\uops{}, which must all be executed for the instruction to be completed. The
bottom layer (\uops{} to ports), however, can be seen as an \emph{or}, or
\emph{disjunctive}, layer: a \uop{} must be executed on any \emph{one} of the
ports able to execute it. This can be seen in the example from
\uopsinfo{} above: \texttt{VCVTT} is decomposed into two \uops{}, the first
necessarily executed on port 0, the second on either port 0 or port 1.
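
Concretely, such a two-level mapping can be represented, for each instruction,
as a conjunction (a list) of disjunctions (sets of ports). The following
Python snippet is only an illustrative sketch, and is neither \uopsinfo{}'s
file format nor \palmed{}'s internal representation:

\begin{verbatim}
# Outer list: conjunctive layer (every uop must be executed);
# inner sets: disjunctive layer (any one of the listed ports suffices).
port_mapping = {
    # VCVTTSD2SI (R32, XMM), noted "1*p0+1*p01" by uops.info:
    "VCVTT": [{0}, {0, 1}],
}
\end{verbatim}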
\medskip{}
\begin{figure}
\centering
@ -38,14 +49,14 @@ dependencies in steady-state, and a port mapping is sufficient.
As some \uops{} are compatible with multiple ports, the number of cycles
required to run one occurrence of a kernel is not trivial to determine. An
assignment, for a given kernel, of its constituent \uops{} to ports is a
\emph{schedule}
---~the number of cycles taken by a kernel given a fixed schedule is
well-defined. The throughput of a kernel is defined as the throughput under an
optimal schedule for this kernel.
\begin{example}[Kernel throughputs with port mappings]
The kernel $\kerK_1 = \texttt{DIVPS} + \texttt{BSR} + \texttt{JMP}$ can
complete in one cycle: $\cyc{\kerK_1} = 1$. Indeed, according to the port
mapping in \autoref{fig:sample_port_mapping}, each of those
instructions is decoded into a single \uop{}, each compatible with a
single, distinct port. Thus, the three instructions can be issued in
parallel in one cycle.
@ -59,7 +70,7 @@ optimal schedule for this kernel.
The kernel $\kerK_3 = \texttt{ADDSS} + 2\times\texttt{BSR}$, however, needs
at least two cycles to be executed: \texttt{BSR} can only be executed on
port $p_1$, which can execute at most one \uop{} per cycle. $\cyc{\kerK_3} =
2$.
The instruction \texttt{ADDSS} alone, however, can be executed twice per
@ -197,9 +208,12 @@ $\kerK$, and
\texttt{ADDSS} & & & $\sfrac{1}{2}$ \\
\texttt{BSR} & & 1 & $\sfrac{1}{2}$ \\
\midrule
Total & 0 & \textbf{1} & \textbf{1} \\
\bottomrule
\end{tabular}
\smallskip{}
$\implies{} \cyc{\kerK_2} = 1$
\end{minipage}\hfill\begin{minipage}[t]{0.3\textwidth}
\centering
$\kerK_3$
@ -212,9 +226,13 @@ $\kerK$, and
\texttt{ADDSS} & & & $\sfrac{1}{2}$ \\
$2\times$\texttt{BSR} & & 2 & 1 \\
\midrule
Total & 0 & \textbf{2} & 1.5 \\
\bottomrule
\end{tabular}
\smallskip{}
$\implies{} \cyc{\kerK_3} = 2$
\end{minipage}\hfill\begin{minipage}[t]{0.3\textwidth}
\centering
$\kerK_4$
@ -227,9 +245,13 @@ $\kerK$, and
$2\times$\texttt{ADDSS} & & & 1 \\
\texttt{BSR} & & 1 & $\sfrac{1}{2}$ \\
\midrule
Total & 0 & 1 & \textbf{1.5} \\
\bottomrule
\end{tabular}
\smallskip{}
$\implies{} \cyc{\kerK_4} = 1.5$
\end{minipage}
\end{example}
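
Given a port mapping, such throughputs can for instance be computed by solving
a small linear program: each \uop{} of the kernel is split, possibly
fractionally, among its compatible ports, and the objective is to minimise the
maximal load over all ports. The Python sketch below (using \texttt{scipy},
assumed available) is purely illustrative: it is not \palmed{}'s
implementation, and the port sets assumed for \texttt{ADDSS}, \texttt{BSR} and
\texttt{JMP} are merely chosen to match the examples above.

\begin{verbatim}
from scipy.optimize import linprog

# Assumed disjunctive layer of a port mapping: each instruction maps to a
# list of uops, each uop being the set of ports able to execute it.  The
# port numbers are illustrative, chosen to match the example kernels.
PORT_MAPPING = {
    "ADDSS": [{1, 5}],   # one uop, on p1 or p5 (two ADDSS per cycle alone)
    "BSR":   [{1}],      # one uop, only on p1
    "JMP":   [{0}],      # one uop, only on p0
}

def cycles(kernel):
    """Cycles per kernel iteration under an optimal steady-state schedule."""
    uops = [(count, ports)
            for instr, count in kernel.items()
            for ports in PORT_MAPPING[instr]]
    all_ports = sorted(set().union(*(ports for _, ports in uops)))
    n_x = len(uops) * len(all_ports)
    var = lambda u, p: 1 + u * len(all_ports) + p  # x[u][p]; variable 0 is t

    c = [1.0] + [0.0] * n_x          # objective: minimise t

    # For every port p: sum_u x[u][p] - t <= 0.
    A_ub, b_ub = [], []
    for p in range(len(all_ports)):
        row = [0.0] * (1 + n_x)
        row[0] = -1.0
        for u in range(len(uops)):
            row[var(u, p)] = 1.0
        A_ub.append(row)
        b_ub.append(0.0)

    # For every uop u: all its occurrences are split among its allowed ports.
    A_eq, b_eq = [], []
    bounds = [(0, None)] * (1 + n_x)
    for u, (count, allowed) in enumerate(uops):
        row = [0.0] * (1 + n_x)
        for p, port in enumerate(all_ports):
            if port in allowed:
                row[var(u, p)] = 1.0
            else:
                bounds[var(u, p)] = (0, 0)   # forbidden port
        A_eq.append(row)
        b_eq.append(count)

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.fun

print(cycles({"ADDSS": 1, "BSR": 1}))   # K_2: 1.0
print(cycles({"ADDSS": 1, "BSR": 2}))   # K_3: 2.0
print(cycles({"ADDSS": 2, "BSR": 1}))   # K_4: 1.5
\end{verbatim}

On this assumed mapping, the sketch recovers $\cyc{\kerK_2} = 1$,
$\cyc{\kerK_3} = 2$ and $\cyc{\kerK_4} = 1.5$, matching the tables above.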


@ -9,8 +9,8 @@ for their evaluation. However, random generation may yield basic blocks that
are not representative of the various workloads our model might be used on.
Thus, while arbitrarily or randomly generated microbenchmarks were well suited
to the data acquisition phase needed to generate the model, the kernels on
which the model would be evaluated could not be arbitrary, and must instead
come from real-world programs.
\subsection{Benchmark suites}
blocks used to evaluate \palmed{} should thus be reasonably close to these
criteria.
For this reason, we evaluate \palmed{} on basic blocks extracted from
two well-known benchmark suites: Polybench and SPEC CPU 2017.
\paragraph{Polybench} is a suite of benchmarks built out of 30 kernels of
numerical computation~\cite{bench:polybench}. Its benchmarks are
@ -49,8 +49,8 @@ review~--- complicated.
\subsection{Manually extracting basic blocks}
The first approach that we used to extract basic blocks from the two benchmark
suites introduced above, for the evaluation included in our article for
\palmed{}~\cite{palmed}, was very manual. We used different ---~though
similar~--- approaches for Polybench and SPEC\@.
@ -85,10 +85,11 @@ Most importantly, this manual extraction is not reproducible. This comes with
two problems.
\begin{itemize}
\item If the dataset were to be lost, or if another researcher wanted to
reproduce our results, the exact same dataset could not be identically
recreated. The same general procedure could be followed again, but code
and scripts would have to be re-written, manually typed and
undocumented shell lines re-written, etc. Most importantly, the
re-extracted basic blocks may well be slightly different.
\item The same consideration applies to porting the dataset to another ISA.
Indeed, as the dataset consists of assembly-level basic blocks, it
cannot be transferred to another ISA: it has to be re-generated from
@ -119,7 +120,7 @@ The \perf{} profiler~\cite{tool:perf} is part of the Linux kernel. It works by
sampling the current program counter (as well as the stack, if requested, to
obtain a stack trace) either upon event occurrences, such as elapsed CPU
cycles, context switches or cache misses, or simply at a fixed, user-defined
time frequency.
In our case, we use this second mode to uniformly sample the program counter
across a run. We recover the output of the profiling as a \textit{raw trace}
@ -195,11 +196,13 @@ memoize this step to do it only once per symbol. We then bisect the list of
obtained basic blocks to find the one containing the current PC, and thus
count the occurrences of each block.
To split a symbol into basic blocks, we follow the procedure introduced by our
formal definition in \autoref{sssec:def:bbs}. Using \texttt{capstone}, we
determine its set of \emph{flow sites} and \emph{jump sites}. The
former is the set of addresses just after a control flow instruction, while the
latter is the set of addresses to which jump instructions may jump. We then
split the straight-line code of the symbol using the union of both sets as
boundaries.
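
For illustration, a simplified version of this splitting step can be written
with \texttt{capstone}'s Python bindings as follows. This is only a sketch,
not our actual extraction script: ELF parsing, error handling and indirect
jump targets are left out, and the helper names are hypothetical. The last
function shows the bisection-based lookup mentioned above.

\begin{verbatim}
from bisect import bisect_right
from capstone import (Cs, CS_ARCH_X86, CS_MODE_64,
                      CS_GRP_JUMP, CS_GRP_CALL, CS_GRP_RET)
from capstone.x86 import X86_OP_IMM

def split_basic_blocks(code, addr):
    """Split a symbol (raw bytes loaded at address addr) into basic blocks."""
    md = Cs(CS_ARCH_X86, CS_MODE_64)
    md.detail = True   # needed to query instruction groups and operands

    flow_sites, jump_sites = set(), set()
    for insn in md.disasm(code, addr):
        if (insn.group(CS_GRP_JUMP) or insn.group(CS_GRP_CALL)
                or insn.group(CS_GRP_RET)):
            # Address right after a control-flow instruction: a flow site.
            flow_sites.add(insn.address + insn.size)
            # Immediate branch targets: jump sites.
            for op in insn.operands:
                if op.type == X86_OP_IMM:
                    jump_sites.add(op.imm)

    # Split the symbol's straight-line code on the union of both sets.
    cuts = sorted(s for s in flow_sites | jump_sites
                  if addr < s < addr + len(code))
    starts = [addr] + cuts
    ends = cuts + [addr + len(code)]
    return list(zip(starts, ends))   # half-open address ranges

def block_of(blocks, pc):
    """Find, by bisection, the basic block containing a sampled PC."""
    starts = [start for start, _ in blocks]
    return blocks[bisect_right(starts, pc) - 1]
\end{verbatim}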
\medskip


@ -20,7 +20,7 @@ does not support some instructions (control flow, x86-64 divisions, \ldots),
those are stripped from the original kernel, which might distort the original
basic block.
To evaluate \palmed{}, the same kernel's run time is measured:
\begin{enumerate}
@ -31,8 +31,9 @@ To evaluate \palmed{}, the same kernel is run:
\item{} using the \uopsinfo{}~\cite{uopsinfo} port mapping, converted to its
equivalent conjunctive resource mapping\footnote{When this evaluation was
made, \uica{}~\cite{uica} was not yet published. Since \palmed{} only
provides a resource mapping, but no frontend, the comparison to \uopsinfo{}
is fair.};
\item{} using \pmevo~\cite{PMEvo}, ignoring any instruction not supported by
its provided mapping;
@ -98,21 +99,21 @@ all} in the basic block was present in the model.
This notion of coverage is partial towards \palmed{}. As we use \pipedream{} as
a baseline measurement, instructions that cannot be benchmarked by \pipedream{}
are pruned from the benchmarks. Hence, \palmed{} has a 100\,\% coverage
\emph{by construction} --- which does not mean that it supports all the
instructions found in the original basic blocks, but only that our methodology
is unable to process basic blocks unsupported by \pipedream{}.
\subsection{Results}
\input{40-1_results_fig.tex}
We run the evaluation harness on two different machines:
\begin{itemize}
\item{} an x86-64 Intel \texttt{SKL-SP}-based machine, with two Intel Xeon Silver
4114 CPUs, totalling 20 cores;
\item{} an x86-64 AMD \texttt{ZEN1}-based machine, with a single AMD EPYC 7401P
CPU with 24 cores.
\end{itemize}
As \iaca{} only supports Intel CPUs, and \uopsinfo{} only supports x86-64
@ -130,64 +131,3 @@ $y$ for a significant number of microkernels with a measured IPC of $x$. The
closer a prediction is to the red horizontal line, the more accurate it is.
These results are analyzed in the full article~\cite{palmed}.


@ -0,0 +1,60 @@
\section{Other contributions}
\paragraph{Using a database to enhance reproducibility and usability.}
\palmed{}'s method is driven by a large number of \pipedream{} benchmarks. For
instance, generating a mapping for an x86-64 machine requires the execution of
about $10^6$ benchmarks on the CPU\@.
Each of these measurements takes time: the multiset of instructions must be
transformed into assembly code, including the register mapping phase; this
assembly must be assembled and linked into an ELF file; and finally, the
benchmark must actually be executed, with multiple warm-up rounds and multiple
measurements. On average, on the \texttt{SKL-SP} CPU, each benchmark requires
half to two-thirds of a second on a single core. The whole benchmarking phase,
on the \texttt{SKL-SP} processor, took roughly eight hours.
\medskip{}
As \palmed{} relies on the Gurobi optimizer, which is itself non-deterministic,
\palmed{} cannot be made truly reproducible. However, the slight fluctuations
in measured cycles between two executions of a benchmark are also a major
source of non-determinism in the execution of \palmed{}.
\medskip{}
For both these reasons, we implemented in \palmed{} a database-backed storage
of measurements. Whenever \palmed{} needs to measure a kernel, it first tries
to find a corresponding measurement in the database; if no such measurement
exists yet, the kernel is measured and the result stored in the database.
For each measurement, we further store for context:
\begin{itemize}
    \item the time and date at which the measurement was made;
    \item the machine on which the measurement was made;
    \item how many times the measurement was repeated;
    \item how many warm-up rounds were performed;
    \item how many instructions were in the unrolled loop;
    \item how many instructions were executed per repetition in total;
    \item the parameters for \pipedream{}'s assembly generation procedure;
    \item how the final result was aggregated from the repeated measurements;
    \item the variance of the set of measurements;
    \item how many CPU cores were active when the measurement was made;
    \item which CPU core was used for this measurement;
    \item whether the OS scheduler was set to FIFO mode.
\end{itemize}
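
As an illustration, the lookup-or-measure logic and a few of these context
fields could be sketched as follows, using SQLite for brevity. \palmed{}'s
actual schema, table names and measurement helper differ;
\texttt{run\_pipedream\_benchmark} below is a hypothetical placeholder.

\begin{verbatim}
import sqlite3

db = sqlite3.connect("measurements.db")
db.execute("""
    CREATE TABLE IF NOT EXISTS measurement (
        kernel      TEXT,     -- textual form of the measured kernel
        machine     TEXT,     -- machine on which the measurement was made
        date        TEXT,     -- time and date of the measurement
        repetitions INTEGER,  -- how many times it was repeated
        warmups     INTEGER,  -- how many warm-up rounds were performed
        cycles      REAL,     -- aggregated result (cycles per iteration)
        variance    REAL      -- variance of the set of measurements
    )""")

def cycles_of(kernel, machine):
    """Return the cached measurement if it exists; otherwise run and store it."""
    row = db.execute(
        "SELECT cycles FROM measurement WHERE kernel = ? AND machine = ?",
        (kernel, machine)).fetchone()
    if row is not None:
        return row[0]
    # Hypothetical helper standing in for the actual Pipedream measurement.
    cycles, variance, date = run_pipedream_benchmark(kernel)
    db.execute("INSERT INTO measurement VALUES (?, ?, ?, ?, ?, ?, ?)",
               (kernel, machine, date,
                10, 3,            # illustrative repetition / warm-up counts
                cycles, variance))
    db.commit()
    return cycles
\end{verbatim}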
\bigskip{}
We believe that, as a whole, the use of a database increases the usability of
\palmed{}: it runs faster when some measurements have already been made in the
past, and it recovers better from errors.
This also gives us better confidence in our results: we can easily archive and
back up our experimental data, and we can easily trace the origin of a
measurement if needed. We can also reuse the exact same measurements between
two runs of \palmed{}, to ensure that the results are as consistent as
possible.
\paragraph{General engineering contributions.} Apart from purely scientific
contributions, we worked on improving \palmed{} as a whole, from the
engineering point of view: code quality; reliable parallel measurements;
recovery from errors; logging; \ldots{} These improvements amount to about a
hundred merge requests between \nderumig{} and myself.