Proof-read chapter 2 (Palmed)
parent 596950a835
commit 4e13835886
6 changed files with 119 additions and 93 deletions

manuscrit
@@ -542,7 +542,7 @@ Given this property, we will use $\cyc{\kerK}$ to refer to $\cycmes{\kerK}{n}$
 for large values of $n$ in this manuscript whenever it is clear that this value
 is a measure.
 
-\subsubsection{Basic block of an assembly-level program}
+\subsubsection{Basic block of an assembly-level program}\label{sssec:def:bbs}
 
 Code analyzers are meant to analyze sections of straight-line code, that is,
 portions of code which do not contain control flow. As such, it is convenient
@@ -25,5 +25,6 @@ project during the first period of my own PhD.
 
 In this chapter, sections~\ref{sec:palmed_resource_models}
 through~\ref{sec:palmed_pipedream} describe \palmed{}, and present what is
-mostly not my own work. Sections~\ref{sec:benchsuite_bb} and later describe my
-own work on this project.
+mostly not my own work, but introduce important concepts for this manuscript.
+Sections~\ref{sec:benchsuite_bb} and later describe my own work on this
+project.
@@ -21,6 +21,17 @@ instruction's mapping is described as a string, \eg{}
 \texttt{VCVTT}\footnote{The precise variant is \texttt{VCVTTSD2SI (R32, XMM)}}
 is described as \texttt{1*p0+1*p01}.
 
+The two layers of such a model play a very different role. Indeed, the
+top layer (instructions to \uops{}) can be seen as an \emph{and}, or
+\emph{conjunctive} layer: an instruction is decomposed into each of its
+\uops{}, which must all be executed for the instruction to be completed. The
+bottom layer (\uops{} to ports), however, can be seen as an \emph{or}, or
+\emph{disjunctive} layer: a \uop{} must be executed on \emph{one} of those
+ports, each able to execute this \uop{}. This can be seen in the example from
+\uopsinfo{} above: \texttt{VCVTT} is decomposed into two \uops{}, the first
+necessarily executed on port 0, the second on port either 0 or 1.
+
+\medskip{}
 
 \begin{figure}
 \centering
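The two-layer structure added in this hunk can be illustrated in code. As a sketch only, assuming the simple `count*ports` grammar of the \uopsinfo{} example string (each digit after `p` naming one port; `parse_mapping` is a hypothetical helper, not part of \palmed{}):

```python
def parse_mapping(s):
    """Parse a uops.info-style mapping string such as "1*p0+1*p01" into a
    list of per-uop port sets (the disjunctive bottom layer).

    Assumption: each '+'-separated term is "count*pDD..." where every digit
    after 'p' names one compatible port.
    """
    uops = []
    for term in s.split("+"):
        count, ports = term.split("*")
        port_set = frozenset(int(d) for d in ports.lstrip("p"))
        # The conjunctive top layer: the instruction contributes `count`
        # uops, all of which must execute.
        uops.extend([port_set] * int(count))
    return uops

# VCVTT: two uops, the first bound to port 0, the second to port 0 or 1.
print(parse_mapping("1*p0+1*p01"))
```
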
@@ -38,14 +49,14 @@ dependencies in steady-state, and a port mapping is sufficient.
 As some \uops{} are compatible with multiple ports, the number of cycles
 required to run one occurrence of a kernel is not trivial. An assignment, for a
 given kernel, of its constitutive \uops{} to ports, is a \emph{schedule}
----~the number of cycles taken by a kernel with a fixed schedule is
+---~the number of cycles taken by a kernel given a fixed schedule is
 well-defined. The throughput of a kernel is defined as the throughput under an
 optimal schedule for this kernel.
 
 \begin{example}[Kernel throughputs with port mappings]
 The kernel $\kerK_1 = \texttt{DIVPS} + \texttt{BSR} + \texttt{JMP}$ can
 complete in one cycle: $\cyc{\kerK_1} = 1$. Indeed, according to the port
-mapping in \autoref{fig:sample_resource_mapping}, each of those
+mapping in \autoref{fig:sample_port_mapping}, each of those
 instructions is decoded into a single \uop{}, each compatible with a
 single, distinct port. Thus, the three instructions can be issued in
 parallel in one cycle.
@@ -59,7 +70,7 @@ optimal schedule for this kernel.
 
 The kernel $\kerK_3 = \texttt{ADDSS} + 2\times\texttt{BSR}$, however, needs
 at least two cycles to be executed: \texttt{BSR} can only be executed on
-port $p_1$, which can execute at most a \uop{} per cycle. $\cyc{\kerK_3} =
+port $p_1$, which can execute at most one \uop{} per cycle. $\cyc{\kerK_3} =
 2$.
 
 The instruction \texttt{ADDSS} alone, however, can be executed twice per
@@ -197,9 +208,12 @@ $\kerK$, and
 \texttt{ADDSS} & & & $\sfrac{1}{2}$ \\
 \texttt{BSR} & & 1 & $\sfrac{1}{2}$ \\
 \midrule
-Total & 0 & 1 & 1 \\
+Total & 0 & \textbf{1} & \textbf{1} \\
 \bottomrule
 \end{tabular}
 
+\smallskip{}
+
+$\implies{} \cyc{\kerK_2} = 1$
 \end{minipage}\hfill\begin{minipage}[t]{0.3\textwidth}
 \centering
 $\kerK_3$
@@ -212,9 +226,13 @@ $\kerK$, and
 \texttt{ADDSS} & & & $\sfrac{1}{2}$ \\
 $2\times$\texttt{BSR} & & 2 & 1 \\
 \midrule
-Total & 0 & 2 & 1.5 \\
+Total & 0 & \textbf{2} & 1.5 \\
 \bottomrule
 \end{tabular}
 
+\smallskip{}
+
+$\implies{} \cyc{\kerK_3} = 2$
+
 \end{minipage}\hfill\begin{minipage}[t]{0.3\textwidth}
 \centering
 $\kerK_4$
@@ -227,9 +245,13 @@ $\kerK$, and
 $2\times$\texttt{ADDSS} & & & 1 \\
 \texttt{BSR} & & 1 & $\sfrac{1}{2}$ \\
 \midrule
-Total & 0 & 1 & 1.5 \\
+Total & 0 & 1 & \textbf{1.5} \\
 \bottomrule
 \end{tabular}
 
+\smallskip{}
+
+$\implies{} \cyc{\kerK_4} = 1.5$
+
 \end{minipage}
 \end{example}
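The throughput-under-optimal-schedule notion edited in the hunks above admits a compact combinatorial characterization: with a purely disjunctive \uop{}-to-port layer, the cycle count is the maximum, over every subset $S$ of ports, of the number of \uops{} executable only on ports of $S$, divided by $|S|$. A brute-force sketch; the per-instruction port sets below are hypothetical stand-ins for the figure's mapping, chosen only to reproduce $\cyc{\kerK_1}$, $\cyc{\kerK_3}$ and $\cyc{\kerK_4}$ from the example:

```python
from itertools import combinations

def throughput(uop_ports, ports):
    """Cycles per kernel occurrence under an optimal schedule.

    uop_ports: one frozenset of compatible ports per uop of the kernel.
    Uops restricted to a subset S of ports share |S| slots per cycle, so
    max over S of load(S)/|S| lower-bounds the cycle count; for a plain
    port mapping this bound is attained by an optimal schedule.
    """
    best = 0.0
    for r in range(1, len(ports) + 1):
        for S in combinations(ports, r):
            # Uops that can ONLY run on ports of S.
            load = sum(1 for u in uop_ports if u <= set(S))
            best = max(best, load / len(S))
    return best

# Hypothetical single-uop instructions: DIVPS on p0, BSR on p1 only,
# ADDSS on p0 or p1, JMP on a distinct port p5.
DIVPS, BSR = frozenset({0}), frozenset({1})
ADDSS, JMP = frozenset({0, 1}), frozenset({5})
PORTS = [0, 1, 5]
print(throughput([DIVPS, BSR, JMP], PORTS))    # K1 -> 1.0
print(throughput([ADDSS, BSR, BSR], PORTS))    # K3 -> 2.0
print(throughput([ADDSS, ADDSS, BSR], PORTS))  # K4 -> 1.5
```
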
@@ -9,8 +9,8 @@ for their evaluation. However, random generation may yield basic blocks that
 are not representative of the various workloads our model might be used on.
 Thus, while arbitrarily or randomly generated microbenchmarks were well suited
 to the data acquisition phase needed to generate the model, the kernels on
-which the model would be evaluated could not be arbitrary, but must come from
-real-world programs.
+which the model would be evaluated could not be arbitrary, and must instead
+come from real-world programs.
 
 \subsection{Benchmark suites}
 
@@ -23,7 +23,7 @@ blocks used to evaluate \palmed{} should thus be reasonably close from these
 criteria.
 
 For this reason, we evaluate \palmed{} on basic blocks extracted from
-two well-known benchmark suites instead: Polybench and SPEC CPU 2017.
+two well-known benchmark suites: Polybench and SPEC CPU 2017.
 
 \paragraph{Polybench} is a suite of benchmarks built out of 30 kernels of
 numerical computation~\cite{bench:polybench}. Its benchmarks are
@@ -49,8 +49,8 @@ review~--- complicated.
 
 \subsection{Manually extracting basic blocks}
 
-Our first approach, that we used to extract basic blocks from the two benchmark
-suites introduced above for the evaluation included in our article for
+The first approach that we used to extract basic blocks from the two benchmark
+suites introduced above, for the evaluation included in our article for
 \palmed{}~\cite{palmed}, was very manual. We use different ---~though
 similar~--- approaches for Polybench and SPEC\@.
 
@@ -85,10 +85,11 @@ Most importantly, this manual extraction is not reproducible. This comes with
 two problems.
 \begin{itemize}
 \item If the dataset was to be lost, or if another researcher wanted to
-reproduce our results, the exact same dataset could not be recreated.
-The same general procedure could be followed again, but code and
-scripts would have to be re-written, manually typed and undocumented
-shell lines re-written, etc.
+reproduce our results, the exact same dataset could not be identically
+recreated. The same general procedure could be followed again, but code
+and scripts would have to be re-written, manually typed and
+undocumented shell lines re-written, etc. Most importantly, the
+re-extracted basic blocks may well be slightly different.
 \item The same consideration applies to porting the dataset to another ISA.
 Indeed, as the dataset consists of assembly-level basic-blocks, it
 cannot be transferred to another ISA: it has to be re-generated from
@@ -119,7 +120,7 @@ The \perf{} profiler~\cite{tool:perf} is part of the Linux kernel. It works by
 sampling the current program counter (as well as the stack, if requested, to
 obtain a stack trace) upon either event occurrences, such as number of elapsed
 CPU cycles, context switches, cache misses, \ldots, or simply at a fixed,
-user-defined frequency.
+user-defined time frequency.
 
 In our case, we use this second mode to uniformly sample the program counter
 across a run. We recover the output of the profiling as a \textit{raw trace}
@@ -195,11 +196,13 @@ memoize this step to do it only once per symbol. We then bissect the basic
 block corresponding to the current PC from the list of obtained basic blocks to
 count the occurrences of each block.
 
-To split a symbol into basic blocks, we determine using \texttt{capstone} its
-set of \emph{flow sites} and \emph{jump sites}. The former is the set of
-addresses just after a control flow instruction, while the latter is the set of
-addresses to which jump instructions may jump. We then split the
-straight-line code of the symbol using the union of both sets as boundaries.
+To split a symbol into basic blocks, we follow the procedure introduced by our
+formal definition in \autoref{sssec:def:bbs}. We determine using
+\texttt{capstone} its set of \emph{flow sites} and \emph{jump sites}. The
+former is the set of addresses just after a control flow instruction, while the
+latter is the set of addresses to which jump instructions may jump. We then
+split the straight-line code of the symbol using the union of both sets as
+boundaries.
 
 \medskip
 
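The splitting and PC-bisection steps rewritten in this hunk can be sketched as follows. This is an illustrative reimplementation over pre-decoded instructions: the `(addr, size, jump_target)` tuples stand in for what a disassembler such as \texttt{capstone} would yield, and the helper names are hypothetical:

```python
import bisect

def split_blocks(instructions):
    """Split a symbol's straight-line code into basic blocks.

    instructions: address-sorted list of (addr, size, jump_target_or_None).
    Flow sites are the addresses just after a control-flow instruction;
    jump sites are the addresses a jump may target. The union of both
    sets gives the basic-block boundaries.
    """
    flow_sites, jump_sites = set(), set()
    for addr, size, target in instructions:
        if target is not None:           # control-flow instruction
            flow_sites.add(addr + size)  # address right after it
            jump_sites.add(target)       # address it may jump to
    boundaries = flow_sites | jump_sites
    blocks = []
    for addr, size, target in instructions:
        if addr in boundaries or not blocks:
            blocks.append([])            # start a new basic block
        blocks[-1].append(addr)
    return blocks

def block_of(pc, block_starts):
    """Bisect the sorted block start addresses to find the block
    containing a sampled program counter."""
    return block_starts[bisect.bisect_right(block_starts, pc) - 1]
```

For instance, three two-byte instructions at 0, 2, 4 where the one at 2 jumps back to 0 yield boundaries {0, 4} and the blocks [0, 2] and [4]; `block_of` then maps each sampled PC to its block to accumulate occurrence counts.
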
@@ -20,7 +20,7 @@ does not support some instructions (control flow, x86-64 divisions, \ldots),
 those are stripped from the original kernel, which might denature the original
 basic block.
 
-To evaluate \palmed{}, the same kernel is run:
+To evaluate \palmed{}, the same kernel's run time is measured:
 
 \begin{enumerate}
 
@@ -31,8 +31,9 @@ To evaluate \palmed{}, the same kernel is run:
 
 \item{} using the \uopsinfo{}~\cite{uopsinfo} port mapping, converted to its
 equivalent conjunctive resource mapping\footnote{When this evaluation was
-made, \uica{}~\cite{uica} was not yet published. Since \palmed{} provides a
-resource mapping, the comparison to \uopsinfo{} is fair.};
+made, \uica{}~\cite{uica} was not yet published. Since \palmed{} only
+provides a resource mapping, but no frontend, the comparison to \uopsinfo{}
+is fair.};
 
 \item{} using \pmevo~\cite{PMEvo}, ignoring any instruction not supported by
 its provided mapping;
@@ -98,21 +99,21 @@ all} in the basic block was present in the model.
 
 This notion of coverage is partial towards \palmed{}. As we use \pipedream{} as
 a baseline measurement, instructions that cannot be benchmarked by \pipedream{}
-are pruned from the benchmarks; hence, \palmed{} has a 100\,\% coverage
+are pruned from the benchmarks. Hence, \palmed{} has a 100\,\% coverage
 \emph{by construction} --- which does not mean that is supports all the
-instructions found in the original basic blocks.
+instructions found in the original basic blocks, but only that our methodology
+is unable to process basic blocks unsupported by Pipedream.
 
 \subsection{Results}
 
 \input{40-1_results_fig.tex}
 
-We run the evaluation harness on three different machines:
+We run the evaluation harness on two different machines:
 \begin{itemize}
 \item{} an x86-64 Intel \texttt{SKL-SP}-based machine, with two Intel Xeon Silver
 4114 CPU, totalling 20 cores;
 \item{} an x86-64 AMD \texttt{ZEN1}-based machine, with a single AMD EPYC 7401P
-CPU with 24 cores;
-\item{} an ARMv8a Raspberry Pi 4 with 4 Cortex A72 cores.
+CPU with 24 cores.
 \end{itemize}
 
 As \iaca{} only supports Intel CPUs, and \uopsinfo{} only supports x86-64
@@ -130,64 +131,3 @@ $y$ for a significant number of microkernels with a measured IPC of $x$. The
 closer a prediction is to the red horizontal line, the more accurate it is.
 
 These results are analyzed in the full article~\cite{palmed}.
-
-\section{Other contributions}
-
-\paragraph{Using a database to enhance reproducibility and usability.}
-\palmed{}'s method is driven by a large number of \pipedream{} benchmarks. For
-instance, generating a mapping for an x86-64 machine requires the execution of
-about $10^6$ benchmarks on the CPU\@.
-
-Each of these measures takes time: the multiset of instructions must be
-transformed into an assembly code, including the register mapping phrase; this
-assembly must be assembled and linked into an ELF file; and finally, the
-benchmark must be actually executed, with multiple warm-up rounds and multiple
-measures. On average, on the \texttt{SKL-SP} CPU, each benchmark requires half
-to two-thirds of a second on a single core. The whole benchmarking phase, on
-the \texttt{SKL-SP} processor, roughly took eight hours.
-
-\medskip{}
-
-As \palmed{} relies on the Gurobi optimizer, which is itself non-deterministic,
-\palmed{} cannot be made truly reproducible. However, the slight fluctuations
-in measured cycles between two executions of a benchmark are also a source of
-non-determinism in the execution of Palmed.
-
-\medskip{}
-
-For both these reasons, we implemented into \palmed{} a database-backed storage of
-measurements. Whenever \palmed{} needs to measure a kernel, it will first try
-to find a corresponding measure in the database; if the measure does not exist
-yet, it will be run, then stored in database.
-
-For each measure, we further store for context:
-the time and date at which the measure was made;
-the machine on which the measure was made;
-how many times the measure was repeated;
-how many warm-up rounds were performed;
-how many instructions were in the unrolled loop;
-how many instructions were executed per repetition in total;
-the parameters for \pipedream{}'s assembly generation procedure;
-how the final result was aggregated from the repeated measures;
-the variance of the set of measures;
-how many CPU cores were active when the measure was made;
-which CPU core was used for this measure;
-whether the kernel's scheduler was set to FIFO mode.
-
-\bigskip{}
-
-We believe that, as a whole, the use of a database increases the usability of
-\palmed{}: it is faster if some measures were already made in the past and
-recovers better upon error.
-
-This also gives us a better confidence towards our results: we can easily
-archive and backup our experimental data, and we can easily trace the origin of
-a measure if needed. We can also reuse the exact same measures between two runs
-of \palmed{}, to ensure that the results are as consistent as possible.
-
-
-\paragraph{General engineering contributions.} Apart from purely scientific
-contributions, we worked on improving \palmed{} as a whole, from the
-engineering point of view: code quality; reliable parallel measurements;
-recovery upon error; logging; \ldots{} These improvements amount to about a
-hundred merge-requests between \nderumig{} and myself.
@@ -0,0 +1,60 @@
+\section{Other contributions}
+
+\paragraph{Using a database to enhance reproducibility and usability.}
+\palmed{}'s method is driven by a large number of \pipedream{} benchmarks. For
+instance, generating a mapping for an x86-64 machine requires the execution of
+about $10^6$ benchmarks on the CPU\@.
+
+Each of these measures takes time: the multiset of instructions must be
+transformed into an assembly code, including the register mapping phrase; this
+assembly must be assembled and linked into an ELF file; and finally, the
+benchmark must be actually executed, with multiple warm-up rounds and multiple
+measures. On average, on the \texttt{SKL-SP} CPU, each benchmark requires half
+to two-thirds of a second on a single core. The whole benchmarking phase, on
+the \texttt{SKL-SP} processor, roughly took eight hours.
+
+\medskip{}
+
+As \palmed{} relies on the Gurobi optimizer, which is itself non-deterministic,
+\palmed{} cannot be made truly reproducible. However, the slight fluctuations
+in measured cycles between two executions of a benchmark are also a major
+source of non-determinism in the execution of Palmed.
+
+\medskip{}
+
+For both these reasons, we implemented into \palmed{} a database-backed storage of
+measurements. Whenever \palmed{} needs to measure a kernel, it will first try
+to find a corresponding measure in the database; if the measure does not exist
+yet, it will be run, then stored in database.
+
+For each measure, we further store for context:
+the time and date at which the measure was made;
+the machine on which the measure was made;
+how many times the measure was repeated;
+how many warm-up rounds were performed;
+how many instructions were in the unrolled loop;
+how many instructions were executed per repetition in total;
+the parameters for \pipedream{}'s assembly generation procedure;
+how the final result was aggregated from the repeated measures;
+the variance of the set of measures;
+how many CPU cores were active when the measure was made;
+which CPU core was used for this measure;
+whether the kernel's scheduler was set to FIFO mode.
+
+\bigskip{}
+
+We believe that, as a whole, the use of a database increases the usability of
+\palmed{}: it is faster if some measures were already made in the past and
+recovers better upon error.
+
+This also gives us a better confidence towards our results: we can easily
+archive and backup our experimental data, and we can easily trace the origin of
+a measure if needed. We can also reuse the exact same measures between two runs
+of \palmed{}, to ensure that the results are as consistent as possible.
+
+
+\paragraph{General engineering contributions.} Apart from purely scientific
+contributions, we worked on improving \palmed{} as a whole, from the
+engineering point of view: code quality; reliable parallel measurements;
+recovery upon error; logging; \ldots{} These improvements amount to about a
+hundred merge-requests between \nderumig{} and myself.
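The database-backed measurement storage described in this new file can be sketched as a minimal memoization layer. This is an assumption-laden sketch, not \palmed{}'s actual code: the one-column schema is hypothetical (the real database also stores the contextual fields listed above), and `run_benchmark` stands in for a \pipedream{} invocation:

```python
import sqlite3

def measure(db, kernel, run_benchmark):
    """Return the measured cycles for `kernel`, running the benchmark
    only if no stored measure exists yet (sketch; hypothetical schema)."""
    db.execute("CREATE TABLE IF NOT EXISTS measures "
               "(kernel TEXT PRIMARY KEY, cycles REAL)")
    row = db.execute("SELECT cycles FROM measures WHERE kernel = ?",
                     (kernel,)).fetchone()
    if row is not None:
        return row[0]                  # reuse the stored measure
    cycles = run_benchmark(kernel)     # actually run the benchmark once
    db.execute("INSERT INTO measures VALUES (?, ?)", (kernel, cycles))
    return cycles

db = sqlite3.connect(":memory:")
runs = []
fake_bench = lambda k: runs.append(k) or 3.5  # stand-in for pipedream
print(measure(db, "ADDSS+BSR", fake_bench))   # first call: runs it
print(measure(db, "ADDSS+BSR", fake_bench))   # second call: cached
print(len(runs))                              # the benchmark ran once
```

Keying measures on the kernel (plus, in practice, the machine and measurement parameters) is what lets two runs of the tool reuse the exact same measures.
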