From 276e1d50c2ccb7d1761e134bb13c53261c740065 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Th=C3=A9ophile=20Bastian?= <contact@tobast.fr>
Date: Mon, 18 Sep 2023 17:53:06 +0200
Subject: [PATCH] First version of Palmed chapter

---
 manuscrit/30_palmed/40_palmed_results.tex | 63 ++++++++++++++++++++++-
 1 file changed, 62 insertions(+), 1 deletion(-)

diff --git a/manuscrit/30_palmed/40_palmed_results.tex b/manuscrit/30_palmed/40_palmed_results.tex
index 41ab5f9..360a217 100644
--- a/manuscrit/30_palmed/40_palmed_results.tex
+++ b/manuscrit/30_palmed/40_palmed_results.tex
@@ -165,6 +165,8 @@ instructions found in the original basic blocks.
 
 \subsection{Results}
 
+\input{40-1_results_fig.tex}
+
 We run the evaluation harness on three different machines:
 \begin{itemize}
     \item{} an x86-64 Intel \texttt{SKL-SP}-based machine, with two Intel Xeon Silver
@@ -190,4 +192,63 @@ closer a prediction is to the red horizontal line, the more accurate it is.
 
 These results are analyzed in the full article~\cite{palmed}.
 
-\input{40-1_results_fig.tex}
+\section{Other contributions}
+
+\paragraph{Using a database to enhance reproducibility and usability.}
+\palmed{}'s method is driven by a large number of \pipedream{} benchmarks. For
+instance, generating a mapping for an x86-64 machine requires the execution of
+about $10^6$ benchmarks on the CPU\@.
+
+Each of these measurements takes time: the multiset of instructions must be
+transformed into assembly code, including the register mapping phase; this
+assembly must be assembled and linked into an ELF file; and finally, the
+benchmark must actually be executed, with multiple warm-up rounds and multiple
+measurements. On average, on the \texttt{SKL-SP} CPU, each benchmark requires half
+to two-thirds of a second on a single core. The whole benchmarking phase, on
+the \texttt{SKL-SP} processor, took roughly eight hours.
+
+\medskip{}
+
+As \palmed{} relies on the Gurobi optimizer, which is itself non-deterministic,
+\palmed{} cannot be made truly reproducible. The slight fluctuations in measured
+cycles between two executions of a benchmark are, however, a further source of
+non-determinism in the execution of \palmed{}, and this one can be avoided.
+
+\medskip{}
+
+For both these reasons, we implemented in \palmed{} a database-backed storage of
+measurements. Whenever \palmed{} needs to measure a kernel, it first tries
+to find a corresponding measurement in the database; if the measurement does not
+exist yet, the benchmark is run and its result is stored in the database.
+
+For each measurement, we further store, for context:
+the time and date at which the measurement was made;
+the machine on which the measurement was made;
+how many times the measurement was repeated;
+how many warm-up rounds were performed;
+how many instructions were in the unrolled loop;
+how many instructions were executed per repetition in total;
+the parameters for \pipedream{}'s assembly generation procedure;
+how the final result was aggregated from the repeated measurements;
+the variance of the set of measurements;
+how many CPU cores were active when the measurement was made;
+which CPU core was used for this measurement;
+whether the operating system kernel's scheduler was set to FIFO mode.
+
+\bigskip{}
+
+We believe that, as a whole, the use of a database increases the usability of
+\palmed{}: it is faster if some measurements were already made in the past, and
+it recovers better upon error.
+
+This also gives us greater confidence in our results: we can easily
+archive and back up our experimental data, and we can easily trace the origin of
+a measurement if needed. We can also reuse the exact same measurements between two
+runs of \palmed{}, to ensure that the results are as consistent as possible.
+
+
+\paragraph{General engineering contributions.} Apart from purely scientific
+contributions, we worked on improving \palmed{} as a whole from an
+engineering point of view: code quality, reliable parallel measurements,
+recovery upon error, logging, \ldots{} These improvements amount to about a
+hundred merge requests between \nderumig{} and myself.