\section{Measuring a kernel's throughput: \pipedream{}}\label{sec:palmed_pipedream} \todo{Introduce pipedream notation for instructions} To build a mapping of a CPU, Palmed fundamentally depends on the ability to measure the execution time $\cyc{\kerK}$ of a kernel $\kerK$. However, as we saw above, Palmed defines a kernel as a multiset of instructions, and makes hypotheses on the measures accordingly. Specifically, this measure should reflect only out-of-order behaviours, without any dependency between instructions. The behaviour should be purely the CPU's; specifically, no memory effect should be accounted for, and thus, data should be L1-resident. Finally, even if hardware counters could be used to provide such a metric, they should be ignored, as Palmed aims to be as universal as possible and should avoid depending on vendor-specific counters. For this purpose, we use \pipedream{} as a benchmarking backend, a tool initially written by \fgruber{} and described in his PhD thesis~\cite{phd:gruber}, with further developments from \nderumig{} and \cguillon{}. Originally written for a broader purpose, \pipedream{} is able to generate assembly code to benchmark a multiset of instructions, fitting the constraints mentioned above. The generated assembly uses the following high-level shape: \begin{lstlisting}[language=Python] for NUM_MEASURES: measure_start() for NUM_ITER: kernel kernel ... kernel measure_stop() \end{lstlisting} The kernel is unrolled enough times so that the body of the innermost loop has at least \lstc{UNROLL_SIZE} instructions; and \lstc{NUM_ITER} is defined so that $\texttt{\footnotesize{}UNROLL\_SIZE} \times \texttt{\footnotesize{}NUM\_ITER} \geq \texttt{\footnotesize{}TOTAL\_INSN}$. \lstc{UNROLL_SIZE}, \lstc{TOTAL_INSN} and \lstc{NUM_MEASURES} are parameters of the benchmark generation. As \pipedream{} gets a multiset of instructions as a kernel, these instructions' arguments must be instantiated to turn them into actual assembly code ---~that is, turn \eg{} \lstxasm{ADD_GPR64i64_IMMi8} (the x86-64 variant of \texttt{add} taking as arguments a general purpose regsiter of 64 bits and an immediate of 8 bits) into \lstxasm{addq \$0x10, \%rdx}. There is no particular issue to instantiate immediates; however, allocating registers and memory operands requires some care, as no dependency must be created between instructions. \paragraph{Register operands.} For each type of register of the ISA (general purpose, vector, \ldots{}), the registers are split into a read and a write pool. The read pool only contains the maximum number of registers of this type needed by one instruction of the kernel; while the write pool contains the rest. The registers are then allocated for each instruction as follows. \begin{itemize} \item{} The read operands are allocated registers from the read pool without specific care, as read-after-read does not create a dependency between two instructions. \item{} The written operands are allocated registers from the write pool in a round-robin fashion, to maximize the distance between two reuses of the same register. Indeed, on some architectures, write-after-write dependencies do not allow full parallelism; while on some others, some instructions' operands are both read and written, resulting in read-after-write dependencies. \end{itemize} \paragraph{Memory operands.} At startup, a memory arena small enough to fit into the L1 cache is allocated. This arena is again split into a read and a write pool. \begin{itemize} \item In direct register addressing mode ---~\eg{} \lstxasm{movq \%rax, (\%rbx)} in x86-64~---, the same address is always reused (one for reads and one for writes). \item In base-index-displacement mode, the same base address is always reused; displacements are used in a round-robin fashion\footnote{On ARM, a displacement of 0 is always used, resulting in the same accesses as direct register addressing.}. \end{itemize} \bigskip{} These two allocation policies are meant to ensure that, whenever possible, no dependency will be created between two instructions: a dependency should only appear when write-after-write dependencies matter, and not enough registers of a kind are available in the architecture ---~in this case, the same register may be reused too early. To finally ensure that data is always L1-resident, warm-up rounds are performed before actually measuring the inner loop. As the memory area is small enough, and no other memory access is made during the measure, the memory area is subsequently L1-resident. This also has the effect to warm up the branch predictor. \medskip{} Finally, the generated assembly is assembled into a shared library object, and invoked with a lightweight C harness. Using the \papi{}~\cite{tool:papi} library, it measures and records two standard and omnipresent hardware events: the number of elapsed cycles (\lstxasm{PAPI_TOT_CYC}) and the number of completed instructions (\lstxasm{PAPI_TOT_INS}).