From 276e1d50c2ccb7d1761e134bb13c53261c740065 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Th=C3=A9ophile=20Bastian?= <contact@tobast.fr>
Date: Mon, 18 Sep 2023 17:53:06 +0200
Subject: [PATCH] First version of Palmed chapter

---
 manuscrit/30_palmed/40_palmed_results.tex | 63 ++++++++++++++++++++++-
 1 file changed, 62 insertions(+), 1 deletion(-)

diff --git a/manuscrit/30_palmed/40_palmed_results.tex b/manuscrit/30_palmed/40_palmed_results.tex
index 41ab5f9..360a217 100644
--- a/manuscrit/30_palmed/40_palmed_results.tex
+++ b/manuscrit/30_palmed/40_palmed_results.tex
@@ -165,6 +165,8 @@ instructions found in the original basic blocks.
 
 \subsection{Results}
 
+\input{40-1_results_fig.tex}
+
 We run the evaluation harness on three different machines:
 \begin{itemize}
 \item{} an x86-64 Intel \texttt{SKL-SP}-based machine, with two Intel Xeon Silver
@@ -190,4 +192,63 @@ closer a prediction is to the red horizontal line, the more accurate it is.
 
 These results are analyzed in the full article~\cite{palmed}.
 
-\input{40-1_results_fig.tex}
+\section{Other contributions}
+
+\paragraph{Using a database to enhance reproducibility and usability.}
+\palmed{}'s method is driven by a large number of \pipedream{} benchmarks. For
+instance, generating a mapping for an x86-64 machine requires the execution of
+about $10^6$ benchmarks on the CPU\@.
+
+Each of these measures takes time: the multiset of instructions must be
+transformed into assembly code, including the register mapping phase; this
+assembly must be assembled and linked into an ELF file; and finally, the
+benchmark must actually be executed, with multiple warm-up rounds and multiple
+measures. On average, on the \texttt{SKL-SP} CPU, each benchmark requires half
+to two-thirds of a second on a single core. The whole benchmarking phase, on
+the \texttt{SKL-SP} processor, took roughly eight hours.
+
+\medskip{}
+
+As \palmed{} relies on the Gurobi optimizer, which is itself non-deterministic,
+\palmed{} cannot be made truly reproducible. However, the slight fluctuations
+in measured cycles between two executions of a benchmark are another source of
+non-determinism in the execution of \palmed{}, and one that can be controlled.
+
+\medskip{}
+
+For both these reasons, we implemented in \palmed{} a database-backed storage of
+measurements. Whenever \palmed{} needs to measure a kernel, it first tries
+to find a corresponding measure in the database; if the measure does not exist
+yet, it is run, then stored in the database.
+
+For each measure, we further store for context:
+the time and date at which the measure was made;
+the machine on which the measure was made;
+how many times the measure was repeated;
+how many warm-up rounds were performed;
+how many instructions were in the unrolled loop;
+how many instructions were executed per repetition in total;
+the parameters for \pipedream{}'s assembly generation procedure;
+how the final result was aggregated from the repeated measures;
+the variance of the set of measures;
+how many CPU cores were active when the measure was made;
+which CPU core was used for this measure;
+whether the kernel's scheduler was set to FIFO mode.
+
+\bigskip{}
+
+We believe that, as a whole, the use of a database increases the usability of
+\palmed{}: it is faster if some measures were already made in the past and
+recovers better upon error.
+
+This also gives us greater confidence in our results: we can easily
+archive and back up our experimental data, and we can easily trace the origin of
+a measure if needed. We can also reuse the exact same measures between two runs
+of \palmed{}, to ensure that the results are as consistent as possible.
+
+
+\paragraph{General engineering contributions.} Apart from purely scientific
+contributions, we worked on improving \palmed{} as a whole, from the
+engineering point of view: code quality; reliable parallel measurements;
+recovery upon error; logging; \ldots{} These improvements amount to about a
+hundred merge requests between \nderumig{} and myself.
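
The lookup-then-measure behaviour described in the database paragraph of this patch can be sketched as follows. This is a minimal illustration and not Palmed's actual code: the SQLite backend, the table layout, the string kernel key, the median aggregation, and the names `open_db`, `cached_measure` and `run_benchmark` are all assumptions made for the example. Only a subset of the context fields listed in the text is stored.

```python
import sqlite3
import statistics
from typing import Callable

# Hypothetical schema: a few of the context fields named in the text.
SCHEMA = """
CREATE TABLE IF NOT EXISTS measures (
    kernel        TEXT NOT NULL,     -- assumed canonical string for the instruction multiset
    machine       TEXT NOT NULL,
    repetitions   INTEGER NOT NULL,
    warmup_rounds INTEGER NOT NULL,
    aggregation   TEXT NOT NULL,
    variance      REAL NOT NULL,
    measured_at   TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    cycles        REAL NOT NULL,
    PRIMARY KEY (kernel, machine)
)
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the measurement database."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def cached_measure(
    conn: sqlite3.Connection,
    kernel: str,
    machine: str,
    run_benchmark: Callable[[str], float],
    repetitions: int = 5,
    warmup_rounds: int = 2,
) -> float:
    """Return the stored measure for `kernel`, running the benchmark only on a miss."""
    row = conn.execute(
        "SELECT cycles FROM measures WHERE kernel = ? AND machine = ?",
        (kernel, machine),
    ).fetchone()
    if row is not None:
        return row[0]  # cache hit: reuse the exact same measure as before

    # Cache miss: warm up, measure several times, aggregate, then store.
    for _ in range(warmup_rounds):
        run_benchmark(kernel)
    samples = [run_benchmark(kernel) for _ in range(repetitions)]
    cycles = statistics.median(samples)
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO measures "
            "(kernel, machine, repetitions, warmup_rounds, aggregation, variance, cycles) "
            "VALUES (?, ?, ?, ?, ?, ?, ?)",
            (kernel, machine, repetitions, warmup_rounds, "median",
             statistics.pvariance(samples), cycles),
        )
    return cycles
```

A second call with the same kernel and machine returns the stored value without re-running the benchmark. This is what makes repeated runs both faster and consistent with each other.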