Proof-read chapter 2 (Palmed)
This commit is contained in: parent 596950a835, commit 4e13835886.
6 changed files with 119 additions and 93 deletions.
@@ -542,7 +542,7 @@ Given this property, we will use $\cyc{\kerK}$ to refer to $\cycmes{\kerK}{n}$
for large values of $n$ in this manuscript whenever it is clear that this value
is a measure.

\subsubsection{Basic block of an assembly-level program}\label{sssec:def:bbs}

Code analyzers are meant to analyze sections of straight-line code, that is,
portions of code which do not contain control flow. As such, it is convenient

@@ -25,5 +25,6 @@ project during the first period of my own PhD.

In this chapter, sections~\ref{sec:palmed_resource_models}
through~\ref{sec:palmed_pipedream} describe \palmed{}, and present what is
mostly not my own work, but introduce important concepts for this manuscript.
Sections~\ref{sec:benchsuite_bb} and later describe my own work on this
project.

@@ -21,6 +21,17 @@ instruction's mapping is described as a string, \eg{}
\texttt{VCVTT}\footnote{The precise variant is \texttt{VCVTTSD2SI (R32, XMM)}}
is described as \texttt{1*p0+1*p01}.

The two layers of such a model play very different roles. Indeed, the
top layer (instructions to \uops{}) can be seen as an \emph{and}, or
\emph{conjunctive} layer: an instruction is decomposed into each of its
\uops{}, which must all be executed for the instruction to be completed. The
bottom layer (\uops{} to ports), however, can be seen as an \emph{or}, or
\emph{disjunctive} layer: a \uop{} must be executed on \emph{one} of the
ports able to execute it. This can be seen in the example from
\uopsinfo{} above: \texttt{VCVTT} is decomposed into two \uops{}, the first
necessarily executed on port 0, the second on either port 0 or 1.
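
To make the shape of such a mapping concrete, the sketch below (ours, in
Python; neither \uopsinfo{}'s nor \palmed{}'s actual code) parses a string
like \texttt{1*p0+1*p01} into one port set per \uop{}:

\begin{verbatim}
# A hedged sketch: parse a uops.info-style mapping string into a list
# of port sets, one entry per uop of the instruction.
def parse_mapping(s):
    uops = []
    for term in s.split("+"):             # conjunctive layer: all uops
        count, ports = term.split("*")    # e.g. ("1", "p01")
        portset = frozenset(int(c) for c in ports[1:])  # "p01" -> {0, 1}
        uops += [portset] * int(count)    # disjunctive layer: one of these
    return uops

assert parse_mapping("1*p0+1*p01") == [frozenset({0}), frozenset({0, 1})]
\end{verbatim}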

\medskip{}

\begin{figure}
\centering

@@ -38,14 +49,14 @@ dependencies in steady-state, and a port mapping is sufficient.
As some \uops{} are compatible with multiple ports, the number of cycles
required to run one occurrence of a kernel is not trivial. An assignment, for a
given kernel, of its constitutive \uops{} to ports is a \emph{schedule}
---~the number of cycles taken by a kernel given a fixed schedule is
well-defined. The throughput of a kernel is defined as the throughput under an
optimal schedule for this kernel.

\begin{example}[Kernel throughputs with port mappings]
The kernel $\kerK_1 = \texttt{DIVPS} + \texttt{BSR} + \texttt{JMP}$ can
complete in one cycle: $\cyc{\kerK_1} = 1$. Indeed, according to the port
mapping in \autoref{fig:sample_port_mapping}, each of those
instructions is decoded into a single \uop{}, each compatible with a
single, distinct port. Thus, the three instructions can be issued in
parallel in one cycle.

@@ -59,7 +70,7 @@ optimal schedule for this kernel.

The kernel $\kerK_3 = \texttt{ADDSS} + 2\times\texttt{BSR}$, however, needs
at least two cycles to be executed: \texttt{BSR} can only be executed on
port $p_1$, which can execute at most one \uop{} per cycle. $\cyc{\kerK_3} =
2$.

The instruction \texttt{ADDSS} alone, however, can be executed twice per

@@ -197,9 +208,12 @@ $\kerK$, and
\texttt{ADDSS} & & & $\sfrac{1}{2}$ \\
\texttt{BSR} & & 1 & $\sfrac{1}{2}$ \\
\midrule
Total & 0 & \textbf{1} & \textbf{1} \\
\bottomrule
\end{tabular}

\smallskip{}
$\implies{} \cyc{\kerK_2} = 1$
\end{minipage}\hfill\begin{minipage}[t]{0.3\textwidth}
\centering
$\kerK_3$

@@ -212,9 +226,13 @@ $\kerK$, and
\texttt{ADDSS} & & & $\sfrac{1}{2}$ \\
$2\times$\texttt{BSR} & & 2 & 1 \\
\midrule
Total & 0 & \textbf{2} & 1.5 \\
\bottomrule
\end{tabular}

\smallskip{}
$\implies{} \cyc{\kerK_3} = 2$

\end{minipage}\hfill\begin{minipage}[t]{0.3\textwidth}
\centering
$\kerK_4$

@@ -227,9 +245,13 @@ $\kerK$, and
$2\times$\texttt{ADDSS} & & & 1 \\
\texttt{BSR} & & 1 & $\sfrac{1}{2}$ \\
\midrule
Total & 0 & 1 & \textbf{1.5} \\
\bottomrule
\end{tabular}

\smallskip{}
$\implies{} \cyc{\kerK_4} = 1.5$

\end{minipage}
\end{example}
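
The optima above can be checked mechanically: under the disjunctive layer, an
optimal (fractional) steady-state schedule is pinned by the most loaded subset
of ports. The sketch below is our illustration, not \palmed{}'s algorithm;
consistently with the totals above, it assumes \texttt{ADDSS} may run on port
0 or 1, while \texttt{BSR} runs only on port 1.

\begin{verbatim}
from itertools import combinations

# Cycles per kernel iteration under an optimal schedule: the uops that
# can only run inside a port subset S need (their count) / |S| cycles,
# and the worst subset gives the fractional optimum.
def cycles(uops, ports):
    return max(sum(1 for u in uops if u <= set(S)) / len(S)
               for r in range(1, len(ports) + 1)
               for S in combinations(ports, r))

ADDSS, BSR = frozenset({0, 1}), frozenset({1})
assert cycles([ADDSS, BSR, BSR], [0, 1]) == 2      # K_3
assert cycles([ADDSS, ADDSS, BSR], [0, 1]) == 1.5  # K_4
\end{verbatim}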

@@ -9,8 +9,8 @@ for their evaluation. However, random generation may yield basic blocks that
are not representative of the various workloads our model might be used on.
Thus, while arbitrarily or randomly generated microbenchmarks were well suited
to the data acquisition phase needed to generate the model, the kernels on
which the model would be evaluated could not be arbitrary, and had to come
from real-world programs.

\subsection{Benchmark suites}

@@ -23,7 +23,7 @@ blocks used to evaluate \palmed{} should thus be reasonably close from these
criteria.

For this reason, we evaluate \palmed{} on basic blocks extracted from
two well-known benchmark suites: Polybench and SPEC CPU 2017.

\paragraph{Polybench} is a suite of benchmarks built out of 30 kernels of
numerical computation~\cite{bench:polybench}. Its benchmarks are

@@ -49,8 +49,8 @@ review~--- complicated.

\subsection{Manually extracting basic blocks}

The first approach that we used to extract basic blocks from the two benchmark
suites introduced above, for the evaluation included in our article for
\palmed{}~\cite{palmed}, was very manual. We used different ---~though
similar~--- approaches for Polybench and SPEC\@.

@@ -85,10 +85,11 @@ Most importantly, this manual extraction is not reproducible. This comes with
two problems.
\begin{itemize}
\item If the dataset were to be lost, or if another researcher wanted to
reproduce our results, the dataset could not be identically
recreated. The same general procedure could be followed again, but code
and scripts would have to be re-written, manually typed and
undocumented shell lines re-written, etc. Most importantly, the
re-extracted basic blocks may well be slightly different.
\item The same consideration applies to porting the dataset to another ISA.
Indeed, as the dataset consists of assembly-level basic blocks, it
cannot be transferred to another ISA: it has to be re-generated from

@@ -119,7 +120,7 @@ The \perf{} profiler~\cite{tool:perf} is part of the Linux kernel. It works by
sampling the current program counter (as well as the stack, if requested, to
obtain a stack trace) either upon event occurrences ---~such as elapsed CPU
cycles, context switches, or cache misses~--- or simply at a fixed,
user-defined time frequency.
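
As an illustration, such a fixed-frequency profile can be gathered and dumped
with something like the sketch below (ours; the exact options and fields our
harness uses may differ):

\begin{verbatim}
import subprocess

# Sample the program counter ~1000 times per second across a full run,
# then dump one line per sample (address and symbol).
subprocess.run(["perf", "record", "-F", "999", "-o", "perf.data",
                "--", "./benchmark"], check=True)
raw_trace = subprocess.run(
    ["perf", "script", "-i", "perf.data", "-F", "ip,sym"],
    capture_output=True, text=True, check=True).stdout
\end{verbatim}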

In our case, we use this second mode to uniformly sample the program counter
across a run. We recover the output of the profiling as a \textit{raw trace}

@@ -195,11 +196,13 @@ memoize this step to do it only once per symbol. We then bisect the basic
block corresponding to the current PC from the list of obtained basic blocks to
count the occurrences of each block.
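
A sketch of this lookup (our illustration, assuming the sorted start addresses
of the symbol's basic blocks are already known):

\begin{verbatim}
import bisect
from collections import Counter

# Map each sampled program counter to its enclosing basic block by
# binary search over the sorted block start addresses of its symbol.
def count_blocks(samples, block_starts):
    hits = Counter()
    for pc in samples:              # assumes pc falls within the symbol
        i = bisect.bisect_right(block_starts, pc) - 1
        hits[block_starts[i]] += 1  # a block is identified by its start
    return hits
\end{verbatim}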

To split a symbol into basic blocks, we follow the procedure introduced by our
formal definition in \autoref{sssec:def:bbs}. Using \texttt{capstone}, we
determine the symbol's set of \emph{flow sites} and \emph{jump sites}. The
former is the set of addresses just after a control flow instruction, while the
latter is the set of addresses to which jump instructions may jump. We then
split the straight-line code of the symbol using the union of both sets as
boundaries.
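
A condensed sketch of this computation on x86-64, using \texttt{capstone}'s
Python bindings (the identifiers are ours, not our actual extraction
script's):

\begin{verbatim}
from capstone import (Cs, CS_ARCH_X86, CS_MODE_64,
                      CS_GRP_JUMP, CS_GRP_CALL, CS_GRP_RET)
from capstone.x86 import X86_OP_IMM

def block_boundaries(code, addr):
    md = Cs(CS_ARCH_X86, CS_MODE_64)
    md.detail = True                   # needed for groups and operands
    flow, jump = set(), set()
    for insn in md.disasm(code, addr):
        if any(insn.group(g) for g in
               (CS_GRP_JUMP, CS_GRP_CALL, CS_GRP_RET)):
            flow.add(insn.address + insn.size)   # flow site
        if insn.group(CS_GRP_JUMP):
            op = insn.operands[0]
            if op.type == X86_OP_IMM:            # direct jump only
                jump.add(op.imm)                 # jump site
    return flow | jump     # boundaries at which the symbol is split
\end{verbatim}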

\medskip

@@ -20,7 +20,7 @@ does not support some instructions (control flow, x86-64 divisions, \ldots),
those are stripped from the original kernel, which might denature the original
basic block.

To evaluate \palmed{}, the same kernel's run time is measured:

\begin{enumerate}

@@ -31,8 +31,9 @@ To evaluate \palmed{}, the same kernel is run:

\item{} using the \uopsinfo{}~\cite{uopsinfo} port mapping, converted to its
equivalent conjunctive resource mapping\footnote{When this evaluation was
made, \uica{}~\cite{uica} was not yet published. Since \palmed{} only
provides a resource mapping, but no frontend, the comparison to \uopsinfo{}
is fair.};

\item{} using \pmevo~\cite{PMEvo}, ignoring any instruction not supported by
its provided mapping;

@@ -98,21 +99,21 @@ all} in the basic block was present in the model.

This notion of coverage is partial towards \palmed{}. As we use \pipedream{} as
a baseline measurement, instructions that cannot be benchmarked by \pipedream{}
are pruned from the benchmarks. Hence, \palmed{} has a 100\,\% coverage
\emph{by construction} ---~which does not mean that it supports all the
instructions found in the original basic blocks, but only that our methodology
is unable to process basic blocks unsupported by \pipedream{}.

\subsection{Results}

\input{40-1_results_fig.tex}

We run the evaluation harness on two different machines:
\begin{itemize}
\item{} an x86-64 Intel \texttt{SKL-SP}-based machine, with two Intel Xeon
Silver 4114 CPUs, totalling 20 cores;
\item{} an x86-64 AMD \texttt{ZEN1}-based machine, with a single AMD EPYC 7401P
CPU with 24 cores.
\end{itemize}

As \iaca{} only supports Intel CPUs, and \uopsinfo{} only supports x86-64

@@ -130,64 +131,3 @@ $y$ for a significant number of microkernels with a measured IPC of $x$. The
closer a prediction is to the red horizontal line, the more accurate it is.

These results are analyzed in the full article~\cite{palmed}.

@@ -0,0 +1,60 @@
\section{Other contributions}

\paragraph{Using a database to enhance reproducibility and usability.}
\palmed{}'s method is driven by a large number of \pipedream{} benchmarks. For
instance, generating a mapping for an x86-64 machine requires the execution of
about $10^6$ benchmarks on the CPU\@.

Each of these measures takes time: the multiset of instructions must be
transformed into assembly code, including the register mapping phase; this
assembly must be assembled and linked into an ELF file; and finally, the
benchmark must actually be executed, with multiple warm-up rounds and multiple
measures. On average, on the \texttt{SKL-SP} CPU, each benchmark requires half
to two-thirds of a second on a single core. Run in parallel across the
machine's cores, the whole benchmarking phase on the \texttt{SKL-SP} processor
roughly took eight hours.

\medskip{}

As \palmed{} relies on the Gurobi optimizer, which is itself non-deterministic,
\palmed{} cannot be made truly reproducible. Moreover, the slight fluctuations
in measured cycles between two executions of a benchmark are another major
source of non-determinism in the execution of \palmed{}.

\medskip{}

For both these reasons, we implemented into \palmed{} a database-backed storage
of measurements. Whenever \palmed{} needs to measure a kernel, it first tries
to find a corresponding measure in the database; if the measure does not exist
yet, the kernel is run and the result stored in the database.

For each measure, we further store for context:
\begin{itemize}
\item{} the time and date at which the measure was made;
\item{} the machine on which the measure was made;
\item{} how many times the measure was repeated;
\item{} how many warm-up rounds were performed;
\item{} how many instructions were in the unrolled loop;
\item{} how many instructions were executed per repetition in total;
\item{} the parameters for \pipedream{}'s assembly generation procedure;
\item{} how the final result was aggregated from the repeated measures;
\item{} the variance of the set of measures;
\item{} how many CPU cores were active when the measure was made;
\item{} which CPU core was used for this measure;
\item{} whether the kernel's scheduler was set to FIFO mode.
\end{itemize}
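
A minimal sketch of this memoization (our illustration; \palmed{}'s actual
schema is richer and stores all of the context fields above):

\begin{verbatim}
import sqlite3

db = sqlite3.connect("measures.sqlite")
db.execute("""CREATE TABLE IF NOT EXISTS measure
              (kernel TEXT, machine TEXT, date TEXT, cycles REAL)""")

def measure(kernel, machine, run_benchmark):
    row = db.execute("SELECT cycles FROM measure "
                     "WHERE kernel = ? AND machine = ?",
                     (kernel, machine)).fetchone()
    if row is not None:                # hit: reuse the stored measure
        return row[0]
    cycles = run_benchmark(kernel)     # miss: actually run it...
    db.execute("INSERT INTO measure VALUES (?, ?, datetime('now'), ?)",
               (kernel, machine, cycles))
    db.commit()                        # ...then persist it for later runs
    return cycles
\end{verbatim}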

\bigskip{}

We believe that, as a whole, the use of a database increases the usability of
\palmed{}: it is faster if some measures were already made in the past, and it
recovers better upon error.

This also gives us better confidence in our results: we can easily archive and
back up our experimental data, and we can easily trace the origin of a measure
if needed. We can also reuse the exact same measures between two runs of
\palmed{}, to ensure that the results are as consistent as possible.

\paragraph{General engineering contributions.} Apart from purely scientific
contributions, we worked on improving \palmed{} as a whole from an engineering
point of view: code quality; reliable parallel measurements; recovery upon
error; logging; \ldots{} These improvements amount to about a hundred merge
requests between \nderumig{} and myself.