\section{Finding basic blocks to evaluate \palmed{}}\label{sec:benchsuite_bb} In the context of all that is described above, my main task in the environment of \palmed{} was to build a system able to evaluate a produced mapping on a given architecture. Some tools, such as \pmevo{}~\cite{PMEvo}, use randomly-sampled basic blocks for their evaluation. However, random generation may yield basic blocks that are not representative of the various workloads our model might be used on. Thus, while arbitrarily or randomly generated microbenchmarks were well suited to the data acquisition phase needed to generate the model, the kernels on which the model would be evaluated could not be arbitrary, and must instead come from real-world programs. \subsection{Benchmark suites} Models generated by \palmed{} are meant to be used on basic blocks that are computationally intensive ---~so that the backend is actually the relevant resource to monitor, compared to \eg{} frontend- or input/output-bound code~---, running in steady-state ---~that is, which is the body of a loop long enough to be reasonably considered infinite for performance modelling purposes. The basic blocks used to evaluate \palmed{} should thus be reasonably close from these criteria. For this reason, we evaluate \palmed{} on basic blocks extracted from two well-known benchmark suites: Polybench and SPEC CPU 2017. \paragraph{Polybench} is a suite of benchmarks built out of 30 kernels of numerical computation~\cite{bench:polybench}. Its benchmarks are domain-specific and centered around scientific computation, mathematical computation, image processing, etc. As the computation kernels are clearly identifiable in the source code, extracting the relevant basic blocks is easy, and fits well for our purpose. It is written in C language. Although it is not under a free/libre software license, it is free to use and open-source. \paragraph{SPEC CPU 2017} is a suite of benchmarks meant to be CPU intensive~\cite{bench:spec}. It is composed of both integer and floating-point based benchmarks, extracted from (mainly open source) real-world software, such as \texttt{gcc}, \texttt{imagemagick}, \ldots{} Its main purpose is to obtain metrics and compare CPUs on a unified workload; it is however commonly used throughout the literature to evaluate compilers, optimizers, code analyzers, etc. It is split into four variants: integer and floating-point, combined with speed ---~time to perform a single task~--- and rate ---~throughput for performing a flow of tasks. Most benchmarks exist in both speed and rate mode. The SPEC suite is under a paid license, and cannot be redistributed, which makes peer-review and replication of experiments ---~\eg{} for artifact review~--- complicated. \subsection{Manually extracting basic blocks} The first approach that we used to extract basic blocks from the two benchmark suites introduced above, for the evaluation included in our article for \palmed{}~\cite{palmed}, was very manual. We use different ---~though similar~--- approaches for Polybench and SPEC\@. In the case of Polybench, we compile multiple versions of each benchmark (\texttt{-O2}, \texttt{-O3} and tiled using the Pluto optimizer~\cite{tool:pluto}). We then use \qemu~\cite{tool:qemu} to extract \textit{translation blocks} ---~very akin to basic blocks~--- and an occurrence count for each of those. We finally select the basic blocks that have enough occurrences to be body loops. In the case of SPEC, we replace \qemu{} with the Linux \perf{} profiler, as individual benchmarks of the suite are heavier than Polybench benchmarks, making \qemu{}'s instrumentation overhead impractical. While \perf{} provides us with occurrence statistics, it does not chunk the program into basic blocks; we use an external disassembler and heuristics on the instructions to do this chunking. We describe both aspects ---~profiling and chunking~--- with more details below. Altogether, this method generates, for x86-64 processors, 13\,778 SPEC-based and 2\,664 polybench-based basic blocks. \subsection{Automating basic block extraction}\label{ssec:palmed_bb_extraction} This manual method, however, has multiple drawbacks. It is, obviously, tedious to manually compile and run a benchmark suite, then extract basic blocks using two collections of scripts depending on which suite is used. It is also impractical that the two benchmark suites are scrapped using very similar, yet different techniques: as the basic blocks are not chunked using the same code, they might have slightly different properties. Most importantly, this manual extraction is not reproducible. This comes with two problems. \begin{itemize} \item If the dataset was to be lost, or if another researcher wanted to reproduce our results, the exact same dataset could not be identically recreated. The same general procedure could be followed again, but code and scripts would have to be re-written, manually typed and undocumented shell lines re-written, etc. Most importantly, the re-extracted basic blocks may well be slightly different. \item The same consideration applies to porting the dataset to another ISA. Indeed, as the dataset consists of assembly-level basic-blocks, it cannot be transferred to another ISA: it has to be re-generated from source-level benchmarks. This poses the same problems as the first point. \end{itemize} This second point particularly motivated us to automate the basic block extraction procedure when \palmed{} ---~and the underlying \pipedream{}~--- were extended to produce mappings for ARM processors. \medskip{} Our automated extraction tool, \benchsuitebb{}, is able to extract basic blocks from Polybench and SPEC. Although we do not use it to evaluate \palmed{}, it also supports the extraction of basic blocks from Rodinia~\cite{bench:rodinia}, a benchmark suite targeted towards heterogeneous computing, and exhibiting various usual kernels, such as K-means, backpropagation, BFS, \ldots{} For the most part, \benchsuitebb{} implements the manual approach used for SPEC\@. On top of an abstraction layer meant to unify the interface to all benchmark suites, it executes the various compiled binaries while profiling them through \perf{}, and chunks the relevant parts into basic blocks using a disassembler. \paragraph{Profiling with \perf{}.} The \perf{} profiler~\cite{tool:perf} is part of the Linux kernel. It works by sampling the current program counter (as well as the stack, if requested, to obtain a stack trace) upon either event occurrences, such as number of elapsed CPU cycles, context switches, cache misses, \ldots, or simply at a fixed, user-defined time frequency. In our case, we use this second mode to uniformly sample the program counter across a run. We recover the output of the profiling as a \textit{raw trace} with \lstbash{perf report -D}. \paragraph{ELF natigation: \texttt{pyelftools} and \texttt{capstone}.} To trace this program counter samplings back to basic blocks, we then need to chunk the relevant sections of the ELF binary down to basic blocks. For this, we use two tools: \texttt{pyelftools} and \texttt{capstone}. The \texttt{pyelftools} Python library is able to parse and decode many informations contained in an ELF file. In our case, it allows us to find the \texttt{.text} section of the input binary, search for symbols, find the symbol containing a given program counter, extract the raw assembled bytes between two addresses, etc. The \texttt{capstone} disassembler, on the other hand, allows to disassemble a portion of assembled binary back to assembly. It supports many ISAs, among which x86-64 and ARM, the two ISAs we investigate in this manuscript. It is able to extract relevant details out of an instruction: which operands, registers, \ldots{} it uses; which broader group of instruction it belongs to; etc. These groups of instructions, in our case, are particularly useful, as it allows us to find control flow instructions without writing code specific to an ISA. These control-altering instructions are jumps, calls and returns. We are also able to trace a (relative) jump to its jump site, enabling us later to have a finer definition of basic blocks. \begin{algorithm} \begin{algorithmic} \Function{bbsOfSymbol}{symbol} \Comment{Memoized, computed only once per symbol} \State instructions $\gets$ disassemble(bytesFor(symbol)) \Comment{Uses both pyelftools and capstone} \State flowSites $\gets \emptyset$ \State jumpSites $\gets \emptyset$ \For{instr $\in$ instructions} \If{isControlFlow(instr)} \State flowSites $\gets \text{flowSites} \cup \{\text{next(instr).addr}\}$ \If{isJump(instr)} \State jumpSites $\gets \text{jumpSites} \cup \{\text{instr.jump\_addr}\}$ \EndIf \EndIf \EndFor \State \textbf{return} instructions.splitAt($\text{flowSites} \cup \text{jumpSites}$) \EndFunction \medskip{} \Function{bbsOfPcs}{pcs} \State occurences $\gets \{\}$ \For{pc $\in$ pcs} \State bbs $\gets$ bbsOfSymbol(symbolOfPc(pc)) \State bb $\gets$ bissect(pc, bbs) \State $\text{occurrences}[\text{bb}] ++$ \EndFor \State \textbf{return} occurrences \EndFunction \end{algorithmic} \caption{Basic block extraction procedure, given a \perf{}-obtain list of program counters.}\label{alg:bb_extr_procedure} \end{algorithm} \paragraph{Extracting basic blocks.} We describe the basic block extraction, given the \perf{}-provided list of sampled program counters, in \autoref{alg:bb_extr_procedure}. For each program counter, we find the ELF symbol it belongs to, and decompose this whole symbol into basic blocks ---~we memoize this step to do it only once per symbol. We then bissect the basic block corresponding to the current PC from the list of obtained basic blocks to count the occurrences of each block. To split a symbol into basic blocks, we follow the procedure introduced by our formal definition in \autoref{sssec:def:bbs}. We determine using \texttt{capstone} its set of \emph{flow sites} and \emph{jump sites}. The former is the set of addresses just after a control flow instruction, while the latter is the set of addresses to which jump instructions may jump. We then split the straight-line code of the symbol using the union of both sets as boundaries. \medskip Altogether, on x86-64, we retrieve 2\,499 basic blocks on Polybench, and 13\,383 basic blocks on SPEC. The structure, dataset and runtime of both benchmark suite, however, makes it significantly harder to gather real ``kernels'' on SPEC: while 724 of the Polybench basic blocks were sampled more than 1\,000 times, only 75 were so sampled in SPEC, and 668 were sampled more than 100 times.