Proof-read chapter 2 (Palmed)

Théophile Bastian 2024-08-17 13:03:32 +02:00
parent 596950a835
commit 4e13835886
6 changed files with 119 additions and 93 deletions


@ -542,7 +542,7 @@ Given this property, we will use $\cyc{\kerK}$ to refer to $\cycmes{\kerK}{n}$
for large values of $n$ in this manuscript whenever it is clear that this value
is a measure.
\subsubsection{Basic block of an assembly-level program}\label{sssec:def:bbs}
Code analyzers are meant to analyze sections of straight-line code, that is,
portions of code which do not contain control flow. As such, it is convenient


@ -25,5 +25,6 @@ project during the first period of my own PhD.
In this chapter, sections~\ref{sec:palmed_resource_models}
through~\ref{sec:palmed_pipedream} describe \palmed{}, and present what is
mostly not my own work, but introduce important concepts for this manuscript.
Sections~\ref{sec:benchsuite_bb} and later describe my own work on this
project.


@ -21,6 +21,17 @@ instruction's mapping is described as a string, \eg{}
\texttt{VCVTT}\footnote{The precise variant is \texttt{VCVTTSD2SI (R32, XMM)}}
is described as \texttt{1*p0+1*p01}.
The two layers of such a model play very different roles. Indeed, the
top layer (instructions to \uops{}) can be seen as an \emph{and}, or
\emph{conjunctive}, layer: an instruction is decomposed into each of its
\uops{}, which must all be executed for the instruction to be completed. The
bottom layer (\uops{} to ports), however, can be seen as an \emph{or}, or
\emph{disjunctive}, layer: a \uop{} must be executed on any \emph{one} of the
ports able to execute it. This can be seen in the example from
\uopsinfo{} above: \texttt{VCVTT} is decomposed into two \uops{}, the first
necessarily executed on port 0, the second on either port 0 or port 1.
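
Concretely, such a two-level mapping can be represented, for each instruction,
as a conjunction (a list) of disjunctions (sets of ports). The following
Python snippet is only an illustrative sketch, and is neither \uopsinfo{}'s
file format nor \palmed{}'s internal representation:

\begin{verbatim}
# Outer list: conjunctive layer (every uop must be executed);
# inner sets: disjunctive layer (any one of the listed ports suffices).
port_mapping = {
    # VCVTTSD2SI (R32, XMM), noted "1*p0+1*p01" by uops.info:
    "VCVTT": [{0}, {0, 1}],
}
\end{verbatim}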
\medskip{}
\begin{figure}
\centering
@ -38,14 +49,14 @@ dependencies in steady-state, and a port mapping is sufficient.
As some \uops{} are compatible with multiple ports, the number of cycles
required to run one occurrence of a kernel is not trivial to determine. An
assignment, for a given kernel, of its constituent \uops{} to ports is a
\emph{schedule}
---~the number of cycles taken by a kernel given a fixed schedule is
well-defined. The throughput of a kernel is defined as the throughput under an
optimal schedule for this kernel.
\begin{example}[Kernel throughputs with port mappings]
The kernel $\kerK_1 = \texttt{DIVPS} + \texttt{BSR} + \texttt{JMP}$ can
complete in one cycle: $\cyc{\kerK_1} = 1$. Indeed, according to the port
mapping in \autoref{fig:sample_port_mapping}, each of those
instructions is decoded into a single \uop{}, each compatible with a
single, distinct port. Thus, the three instructions can be issued in
parallel in one cycle.
@ -59,7 +70,7 @@ optimal schedule for this kernel.
The kernel $\kerK_3 = \texttt{ADDSS} + 2\times\texttt{BSR}$, however, needs
at least two cycles to be executed: \texttt{BSR} can only be executed on
port $p_1$, which can execute at most one \uop{} per cycle. $\cyc{\kerK_3} =
2$.
The instruction \texttt{ADDSS} alone, however, can be executed twice per
@ -197,9 +208,12 @@ $\kerK$, and
\texttt{ADDSS} & & & $\sfrac{1}{2}$ \\
\texttt{BSR} & & 1 & $\sfrac{1}{2}$ \\
\midrule
Total & 0 & \textbf{1} & \textbf{1} \\
\bottomrule
\end{tabular}
\smallskip{}
$\implies{} \cyc{\kerK_2} = 1$
\end{minipage}\hfill\begin{minipage}[t]{0.3\textwidth}
\centering
$\kerK_3$
@ -212,9 +226,13 @@ $\kerK$, and
\texttt{ADDSS} & & & $\sfrac{1}{2}$ \\
$2\times$\texttt{BSR} & & 2 & 1 \\
\midrule
Total & 0 & \textbf{2} & 1.5 \\
\bottomrule
\end{tabular}
\smallskip{}
$\implies{} \cyc{\kerK_3} = 2$
\end{minipage}\hfill\begin{minipage}[t]{0.3\textwidth}
\centering
$\kerK_4$
@ -227,9 +245,13 @@ $\kerK$, and
$2\times$\texttt{ADDSS} & & & 1 \\
\texttt{BSR} & & 1 & $\sfrac{1}{2}$ \\
\midrule
Total & 0 & 1 & \textbf{1.5} \\
\bottomrule
\end{tabular}
\smallskip{}
$\implies{} \cyc{\kerK_4} = 1.5$
\end{minipage}
\end{example}
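
Given a port mapping, such throughputs can for instance be computed by solving
a small linear program: each \uop{} of the kernel is split, possibly
fractionally, among its compatible ports, and the objective is to minimise the
maximal load over all ports. The Python sketch below (using \texttt{scipy},
assumed available) is purely illustrative: it is not \palmed{}'s
implementation, and the port sets assumed for \texttt{ADDSS}, \texttt{BSR} and
\texttt{JMP} are merely chosen to match the examples above.

\begin{verbatim}
from scipy.optimize import linprog

# Assumed disjunctive layer of a port mapping: each instruction maps to a
# list of uops, each uop being the set of ports able to execute it.  The
# port numbers are illustrative, chosen to match the example kernels.
PORT_MAPPING = {
    "ADDSS": [{1, 5}],   # one uop, on p1 or p5 (two ADDSS per cycle alone)
    "BSR":   [{1}],      # one uop, only on p1
    "JMP":   [{0}],      # one uop, only on p0
}

def cycles(kernel):
    """Cycles per kernel iteration under an optimal steady-state schedule."""
    uops = [(count, ports)
            for instr, count in kernel.items()
            for ports in PORT_MAPPING[instr]]
    all_ports = sorted(set().union(*(ports for _, ports in uops)))
    n_x = len(uops) * len(all_ports)
    var = lambda u, p: 1 + u * len(all_ports) + p  # x[u][p]; variable 0 is t

    c = [1.0] + [0.0] * n_x          # objective: minimise t

    # For every port p: sum_u x[u][p] - t <= 0.
    A_ub, b_ub = [], []
    for p in range(len(all_ports)):
        row = [0.0] * (1 + n_x)
        row[0] = -1.0
        for u in range(len(uops)):
            row[var(u, p)] = 1.0
        A_ub.append(row)
        b_ub.append(0.0)

    # For every uop u: all its occurrences are split among its allowed ports.
    A_eq, b_eq = [], []
    bounds = [(0, None)] * (1 + n_x)
    for u, (count, allowed) in enumerate(uops):
        row = [0.0] * (1 + n_x)
        for p, port in enumerate(all_ports):
            if port in allowed:
                row[var(u, p)] = 1.0
            else:
                bounds[var(u, p)] = (0, 0)   # forbidden port
        A_eq.append(row)
        b_eq.append(count)

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.fun

print(cycles({"ADDSS": 1, "BSR": 1}))   # K_2: 1.0
print(cycles({"ADDSS": 1, "BSR": 2}))   # K_3: 2.0
print(cycles({"ADDSS": 2, "BSR": 1}))   # K_4: 1.5
\end{verbatim}

On this assumed mapping, the sketch recovers $\cyc{\kerK_2} = 1$,
$\cyc{\kerK_3} = 2$ and $\cyc{\kerK_4} = 1.5$, matching the tables above.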


@ -9,8 +9,8 @@ for their evaluation. However, random generation may yield basic blocks that
are not representative of the various workloads our model might be used on.
Thus, while arbitrarily or randomly generated microbenchmarks were well suited
to the data acquisition phase needed to generate the model, the kernels on
which the model would be evaluated could not be arbitrary, and must instead
come from real-world programs.
\subsection{Benchmark suites}
blocks used to evaluate \palmed{} should thus be reasonably close to these
criteria.
For this reason, we evaluate \palmed{} on basic blocks extracted from
two well-known benchmark suites: Polybench and SPEC CPU 2017.
\paragraph{Polybench} is a suite of benchmarks built out of 30 kernels of
numerical computation~\cite{bench:polybench}. Its benchmarks are
@ -49,8 +49,8 @@ review~--- complicated.
\subsection{Manually extracting basic blocks}
The first approach that we used to extract basic blocks from the two benchmark
suites introduced above, for the evaluation included in our article for
\palmed{}~\cite{palmed}, was very manual. We used different ---~though
similar~--- approaches for Polybench and SPEC\@.
@ -85,10 +85,11 @@ Most importantly, this manual extraction is not reproducible. This comes with
two problems.
\begin{itemize}
\item If the dataset were to be lost, or if another researcher wanted to
reproduce our results, the exact same dataset could not be identically
recreated. The same general procedure could be followed again, but code
and scripts would have to be re-written, manually typed and
undocumented shell lines re-written, etc. Most importantly, the
re-extracted basic blocks may well be slightly different.
\item The same consideration applies to porting the dataset to another ISA.
Indeed, as the dataset consists of assembly-level basic blocks, it
cannot be transferred to another ISA: it has to be re-generated from
@ -119,7 +120,7 @@ The \perf{} profiler~\cite{tool:perf} is part of the Linux kernel. It works by
sampling the current program counter (as well as the stack, if requested, to
obtain a stack trace) either upon event occurrences, such as elapsed CPU
cycles, context switches or cache misses, or simply at a fixed, user-defined
time frequency.
In our case, we use this second mode to uniformly sample the program counter
across a run. We recover the output of the profiling as a \textit{raw trace}
@ -195,11 +196,13 @@ memoize this step to do it only once per symbol. We then bisect the list of
obtained basic blocks to find the one containing the current PC, and thus
count the occurrences of each block.
To split a symbol into basic blocks, we follow the procedure introduced by our
formal definition in \autoref{sssec:def:bbs}. Using \texttt{capstone}, we
determine its set of \emph{flow sites} and \emph{jump sites}. The
former is the set of addresses just after a control flow instruction, while the
latter is the set of addresses to which jump instructions may jump. We then
split the straight-line code of the symbol using the union of both sets as
boundaries.
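
For illustration, a simplified version of this splitting step can be written
with \texttt{capstone}'s Python bindings as follows. This is only a sketch,
not our actual extraction script: ELF parsing, error handling and indirect
jump targets are left out, and the helper names are hypothetical. The last
function shows the bisection-based lookup mentioned above.

\begin{verbatim}
from bisect import bisect_right
from capstone import (Cs, CS_ARCH_X86, CS_MODE_64,
                      CS_GRP_JUMP, CS_GRP_CALL, CS_GRP_RET)
from capstone.x86 import X86_OP_IMM

def split_basic_blocks(code, addr):
    """Split a symbol (raw bytes loaded at address addr) into basic blocks."""
    md = Cs(CS_ARCH_X86, CS_MODE_64)
    md.detail = True   # needed to query instruction groups and operands

    flow_sites, jump_sites = set(), set()
    for insn in md.disasm(code, addr):
        if (insn.group(CS_GRP_JUMP) or insn.group(CS_GRP_CALL)
                or insn.group(CS_GRP_RET)):
            # Address right after a control-flow instruction: a flow site.
            flow_sites.add(insn.address + insn.size)
            # Immediate branch targets: jump sites.
            for op in insn.operands:
                if op.type == X86_OP_IMM:
                    jump_sites.add(op.imm)

    # Split the symbol's straight-line code on the union of both sets.
    cuts = sorted(s for s in flow_sites | jump_sites
                  if addr < s < addr + len(code))
    starts = [addr] + cuts
    ends = cuts + [addr + len(code)]
    return list(zip(starts, ends))   # half-open address ranges

def block_of(blocks, pc):
    """Find, by bisection, the basic block containing a sampled PC."""
    starts = [start for start, _ in blocks]
    return blocks[bisect_right(starts, pc) - 1]
\end{verbatim}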
\medskip


@ -20,7 +20,7 @@ does not support some instructions (control flow, x86-64 divisions, \ldots),
those are stripped from the original kernel, which might distort the original
basic block.
To evaluate \palmed{}, the same kernel's run time is measured:
\begin{enumerate}
@ -31,8 +31,9 @@ To evaluate \palmed{}, the same kernel is run:
\item{} using the \uopsinfo{}~\cite{uopsinfo} port mapping, converted to its
equivalent conjunctive resource mapping\footnote{When this evaluation was
made, \uica{}~\cite{uica} was not yet published. Since \palmed{} only
provides a resource mapping, but no frontend, the comparison to \uopsinfo{}
is fair.};
\item{} using \pmevo~\cite{PMEvo}, ignoring any instruction not supported by
its provided mapping;
@ -98,21 +99,21 @@ all} in the basic block was present in the model.
This notion of coverage is partial towards \palmed{}. As we use \pipedream{} as
a baseline measurement, instructions that cannot be benchmarked by \pipedream{}
are pruned from the benchmarks. Hence, \palmed{} has a 100\,\% coverage
\emph{by construction} --- which does not mean that it supports all the
instructions found in the original basic blocks, but only that our methodology
is unable to process basic blocks unsupported by \pipedream{}.
\subsection{Results}
\input{40-1_results_fig.tex}
We run the evaluation harness on two different machines:
\begin{itemize}
\item{} an x86-64 Intel \texttt{SKL-SP}-based machine, with two Intel Xeon Silver
4114 CPUs, totalling 20 cores;
\item{} an x86-64 AMD \texttt{ZEN1}-based machine, with a single AMD EPYC 7401P
CPU with 24 cores.
\end{itemize}
As \iaca{} only supports Intel CPUs, and \uopsinfo{} only supports x86-64
@ -130,64 +131,3 @@ $y$ for a significant number of microkernels with a measured IPC of $x$. The
closer a prediction is to the red horizontal line, the more accurate it is.
These results are analyzed in the full article~\cite{palmed}.


@ -0,0 +1,60 @@
\section{Other contributions}
\paragraph{Using a database to enhance reproducibility and usability.}
\palmed{}'s method is driven by a large number of \pipedream{} benchmarks. For
instance, generating a mapping for an x86-64 machine requires the execution of
about $10^6$ benchmarks on the CPU\@.
Each of these measurements takes time: the multiset of instructions must be
transformed into assembly code, including the register mapping phase; this
assembly must be assembled and linked into an ELF file; and finally, the
benchmark must actually be executed, with multiple warm-up rounds and multiple
measurements. On average, on the \texttt{SKL-SP} CPU, each benchmark requires
half to two-thirds of a second on a single core. The whole benchmarking phase,
on the \texttt{SKL-SP} processor, took roughly eight hours.
\medskip{}
As \palmed{} relies on the Gurobi optimizer, which is itself non-deterministic,
\palmed{} cannot be made truly reproducible. However, the slight fluctuations
in measured cycles between two executions of a benchmark are also a major
source of non-determinism in the execution of \palmed{}.
\medskip{}
For both these reasons, we implemented in \palmed{} a database-backed storage
of measurements. Whenever \palmed{} needs to measure a kernel, it first tries
to find a corresponding measurement in the database; if no such measurement
exists yet, the kernel is measured and the result stored in the database.
For each measurement, we further store for context:
\begin{itemize}
    \item the time and date at which the measurement was made;
    \item the machine on which the measurement was made;
    \item how many times the measurement was repeated;
    \item how many warm-up rounds were performed;
    \item how many instructions were in the unrolled loop;
    \item how many instructions were executed per repetition in total;
    \item the parameters for \pipedream{}'s assembly generation procedure;
    \item how the final result was aggregated from the repeated measurements;
    \item the variance of the set of measurements;
    \item how many CPU cores were active when the measurement was made;
    \item which CPU core was used for this measurement;
    \item whether the OS scheduler was set to FIFO mode.
\end{itemize}
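
As an illustration, the lookup-or-measure logic and a few of these context
fields could be sketched as follows, using SQLite for brevity. \palmed{}'s
actual schema, table names and measurement helper differ;
\texttt{run\_pipedream\_benchmark} below is a hypothetical placeholder.

\begin{verbatim}
import sqlite3

db = sqlite3.connect("measurements.db")
db.execute("""
    CREATE TABLE IF NOT EXISTS measurement (
        kernel      TEXT,     -- textual form of the measured kernel
        machine     TEXT,     -- machine on which the measurement was made
        date        TEXT,     -- time and date of the measurement
        repetitions INTEGER,  -- how many times it was repeated
        warmups     INTEGER,  -- how many warm-up rounds were performed
        cycles      REAL,     -- aggregated result (cycles per iteration)
        variance    REAL      -- variance of the set of measurements
    )""")

def cycles_of(kernel, machine):
    """Return the cached measurement if it exists; otherwise run and store it."""
    row = db.execute(
        "SELECT cycles FROM measurement WHERE kernel = ? AND machine = ?",
        (kernel, machine)).fetchone()
    if row is not None:
        return row[0]
    # Hypothetical helper standing in for the actual Pipedream measurement.
    cycles, variance, date = run_pipedream_benchmark(kernel)
    db.execute("INSERT INTO measurement VALUES (?, ?, ?, ?, ?, ?, ?)",
               (kernel, machine, date,
                10, 3,            # illustrative repetition / warm-up counts
                cycles, variance))
    db.commit()
    return cycles
\end{verbatim}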
\bigskip{}
We believe that, as a whole, the use of a database increases the usability of
\palmed{}: it runs faster when some measurements have already been made in the
past, and it recovers better from errors.
This also gives us better confidence in our results: we can easily archive and
back up our experimental data, and we can easily trace the origin of a
measurement if needed. We can also reuse the exact same measurements between
two runs of \palmed{}, to ensure that the results are as consistent as
possible.
\paragraph{General engineering contributions.} Apart from purely scientific
contributions, we worked on improving \palmed{} as a whole, from the
engineering point of view: code quality; reliable parallel measurements;
recovery from errors; logging; \ldots{} These improvements amount to about a
hundred merge requests between \nderumig{} and myself.