\section{Finding basic blocks to evaluate \palmed{}}

In the context of the work described above, my main task in the \palmed{}
environment was to build a system able to evaluate a produced mapping on a
given architecture.

Some tools, such as \pmevo{}~\cite{PMEvo}, use randomly-sampled basic blocks
for their evaluation. However, random generation may yield basic blocks that
are not representative of the various workloads our model might be used on.
Thus, while arbitrarily or randomly generated microbenchmarks were well suited
to the data acquisition phase needed to generate the model, the kernels on
which the model would be evaluated could not be arbitrary: they had to come
from real-world programs.

\subsection{Benchmark suites}

Models generated by \palmed{} are meant to be used on basic blocks that are
computationally intensive ---~so that the backend is actually the relevant
resource to monitor, as opposed to \eg{} frontend- or input/output-bound
code~--- and running in steady-state ---~that is, forming the body of a loop
long enough to be reasonably considered infinite for performance modelling
purposes. The basic blocks used to evaluate \palmed{} should thus be
reasonably close to these criteria.

For this reason, we instead evaluate \palmed{} on basic blocks extracted from
two well-known benchmark suites: Polybench and SPEC CPU 2017.

\paragraph{Polybench} is a suite of benchmarks built out of 30 kernels of
numerical computation~\cite{bench:polybench}. Its benchmarks are
domain-specific and centered around scientific computation, mathematical
computation, image processing, etc. As the computation kernels are clearly
identifiable in the source code, extracting the relevant basic blocks is
easy, which fits our purpose well. It is written in C\@. Although it is not
under a free/libre software license, it is free to use and open-source.

\paragraph{SPEC CPU 2017} is a suite of benchmarks meant to be CPU
intensive~\cite{bench:spec}. It is composed of both integer and
floating-point benchmarks, extracted from (mainly open-source) real-world
software, such as \texttt{gcc}, \texttt{imagemagick}, \ldots{} Its main
purpose is to obtain metrics for, and compare, CPUs on a unified workload; it
is however commonly used throughout the literature to evaluate compilers,
optimizers, code analyzers, etc. It is split into four variants: integer and
floating-point, combined with speed ---~time to perform a single task~--- and
rate ---~throughput for performing a flow of tasks. Most benchmarks exist in
both speed and rate mode. The SPEC suite is under a paid license and cannot
be redistributed, which makes peer-review and replication of experiments
---~\eg{} for artifact review~--- complicated.

\subsection{Manually extracting basic blocks}

Our first approach, which we used to extract basic blocks from the two
benchmark suites introduced above for the evaluation in our \palmed{}
article~\cite{palmed}, was very manual. We use different ---~though
similar~--- approaches for Polybench and SPEC\@.

In the case of Polybench, we compile multiple versions of each benchmark
(\texttt{-O2}, \texttt{-O3} and tiled using the Pluto
optimizer~\cite{tool:pluto}). We then use \qemu~\cite{tool:qemu} to extract
\textit{translation blocks} ---~very akin to basic blocks~--- along with an
occurrence count for each of them. We finally select the basic blocks that
occur often enough to be loop bodies.
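
This counting step can be sketched as follows. The snippet below is a minimal
illustration, assuming \qemu{} was run with \texttt{-d exec -D qemu.log} so
that each executed translation block leaves a log line containing its guest
program counter; the exact line layout varies across \qemu{} versions, so the
regular expression is an assumption to adapt.

\begin{lstlisting}[language=Python]
import re
from collections import Counter

# Assumed shape of QEMU's `-d exec` log lines: a "Trace" line whose
# bracketed fields contain the guest PC of the translation block.
TB_LINE = re.compile(r"^Trace.*?\[.*?(0x[0-9a-f]+)", re.IGNORECASE)

def tb_occurrences(log_path: str) -> Counter:
    """Count how many times each translation block was entered."""
    counts = Counter()
    with open(log_path) as log:
        for line in log:
            match = TB_LINE.search(line)
            if match:
                counts[int(match.group(1), 16)] += 1
    return counts

# Keep only blocks executed often enough to be loop bodies;
# the threshold here is arbitrary.
hot_blocks = {pc for pc, n in tb_occurrences("qemu.log").items()
              if n > 1000}
\end{lstlisting}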

In the case of SPEC, we replace \qemu{} with the Linux \perf{} profiler, as
the individual benchmarks of the suite are heavier than those of Polybench,
making \qemu{}'s instrumentation overhead impractical. While \perf{} provides
us with occurrence statistics, it does not chunk the program into basic
blocks; we use an external disassembler and heuristics on the instructions to
do this chunking. We describe both aspects ---~profiling and chunking~--- in
more detail below.

Altogether, this method yields, for x86-64 processors, 13\,778 SPEC-based and
2\,664 Polybench-based basic blocks.

\subsection{Automating basic block extraction}

This manual method, however, has multiple drawbacks. It is, obviously,
tedious to manually compile and run a benchmark suite, then extract basic
blocks using two collections of scripts depending on which suite is used. It
is also impractical that the two benchmark suites are scraped using very
similar, yet different techniques: as the basic blocks are not chunked using
the same code, they might have slightly different properties.

Most importantly, this manual extraction is not reproducible. This comes with
two problems.

\begin{itemize}
    \item If the dataset were to be lost, or if another researcher wanted to
        reproduce our results, the exact same dataset could not be recreated.
        The same general procedure could be followed again, but code and
        scripts would have to be rewritten, and manually-typed, undocumented
        shell commands re-created, etc.
    \item The same consideration applies to porting the dataset to another
        ISA\@. Indeed, as the dataset consists of assembly-level basic
        blocks, it cannot be transferred to another ISA: it has to be
        re-generated from source-level benchmarks. This poses the same
        problems as the first point.
\end{itemize}

This second point particularly motivated us to automate the basic block
extraction procedure when \palmed{} ---~and the underlying \pipedream{}~---
were extended to produce mappings for ARM processors.

\medskip{}

Our automated extraction tool, \benchsuitebb{}, is able to extract basic
blocks from Polybench and SPEC\@. Although we do not use it to evaluate
\palmed{}, it also supports the extraction of basic blocks from
Rodinia~\cite{bench:rodinia}, a benchmark suite targeted towards
heterogeneous computing and exhibiting various common kernels, such as
K-means, backpropagation, BFS, \ldots{}

For the most part, \benchsuitebb{} implements the manual approach we used for
SPEC\@. On top of an abstraction layer meant to unify the interface to all
benchmark suites, it executes the various compiled binaries while profiling
them through \perf{}, and chunks the relevant parts into basic blocks using a
disassembler.
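
This abstraction layer can be pictured as follows. This is a hypothetical
sketch of such an interface, not \benchsuitebb{}'s actual class hierarchy;
all names are ours.

\begin{lstlisting}[language=Python]
from abc import ABC, abstractmethod
from pathlib import Path

class BenchmarkSuite(ABC):
    """Unified view of a benchmark suite (hypothetical interface)."""

    @abstractmethod
    def compile(self) -> list[Path]:
        """Build the suite; return the compiled benchmark binaries."""

    @abstractmethod
    def run_command(self, binary: Path) -> list[str]:
        """Command line to run one benchmark with its dataset."""

class Polybench(BenchmarkSuite):
    ...  # builds the -O2, -O3 and Pluto-tiled variants

class Spec2017(BenchmarkSuite):
    ...  # drives SPEC's own build-and-run harness
\end{lstlisting}

The profiling and chunking code then only manipulates
\texttt{BenchmarkSuite} objects, regardless of the underlying suite.
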
\paragraph{Profiling with \perf{}.}
The \perf{} profiler~\cite{tool:perf} is part of the Linux kernel. It works
by sampling the current program counter (as well as the stack, if requested,
to obtain a stack trace) either upon event occurrences ---~such as a number
of elapsed CPU cycles, context switches, cache misses, \ldots~--- or simply
at a fixed, user-defined frequency.

In our case, we use this second mode to uniformly sample the program counter
across a run. We recover the output of the profiling as a \textit{raw trace}
with \lstbash{perf report -D}.
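
Concretely, this profiling step amounts to something like the snippet below.
This is a minimal sketch: the sampling frequency is an arbitrary choice, and
the layout of \texttt{PERF\_RECORD\_SAMPLE} lines in the raw dump varies
across \perf{} versions, so the regular expression is an assumption to adapt.

\begin{lstlisting}[language=Python]
import re
import subprocess

# Assumed shape of a raw-dump sample line:
# "... PERF_RECORD_SAMPLE(IP, ...): pid/tid: 0x401234 ..."
SAMPLE = re.compile(r"PERF_RECORD_SAMPLE.*?\d+/\d+: (0x[0-9a-f]+)")

def sample_pcs(cmd: list[str], freq: int = 1000) -> list[int]:
    """Run `cmd` under perf at a fixed sampling frequency, then
    extract the sampled program counters from the raw trace."""
    subprocess.run(["perf", "record", "-F", str(freq),
                    "-o", "perf.data", "--"] + cmd, check=True)
    dump = subprocess.run(["perf", "report", "-D", "-i", "perf.data"],
                          capture_output=True, text=True).stdout
    return [int(pc, 16) for pc in SAMPLE.findall(dump)]
\end{lstlisting}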

\paragraph{ELF navigation: \texttt{pyelftools} and \texttt{capstone}.}
To trace these sampled program counters back to basic blocks, we then need to
chunk the relevant sections of the ELF binary down to basic blocks. For this,
we use two tools: \texttt{pyelftools} and \texttt{capstone}.

The \texttt{pyelftools} Python library is able to parse and decode much of
the information contained in an ELF file. In our case, it allows us to find
the \texttt{.text} section of the input binary, search for symbols, find the
symbol containing a given program counter, extract the raw assembled bytes
between two addresses, etc.
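
For instance, locating the function symbol that contains a program counter
and extracting its bytes may be sketched as follows ---~error handling and
the relocation of runtime addresses for position-independent executables are
omitted.

\begin{lstlisting}[language=Python]
from elftools.elf.elffile import ELFFile

def load_symbols(path: str):
    """Return (name, start, code bytes) for each function symbol,
    sorted by start address."""
    with open(path, "rb") as f:
        elf = ELFFile(f)
        text = elf.get_section_by_name(".text")
        base, data = text["sh_addr"], text.data()
        symbols = []
        for sym in elf.get_section_by_name(".symtab").iter_symbols():
            if sym["st_info"]["type"] != "STT_FUNC" or sym["st_size"] == 0:
                continue
            start, size = sym["st_value"], sym["st_size"]
            # Raw assembled bytes of the symbol, sliced out of .text
            symbols.append((sym.name, start,
                            data[start - base : start - base + size]))
        return sorted(symbols, key=lambda s: s[1])

def symbol_of_pc(symbols, pc: int):
    """Find the function symbol whose address range contains `pc`."""
    for name, start, code in symbols:
        if start <= pc < start + len(code):
            return name, start, code
    return None
\end{lstlisting}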

The \texttt{capstone} disassembler, on the other hand, allows us to
disassemble a portion of assembled binary back to assembly. It supports many
ISAs, among which x86-64 and ARM, the two ISAs we investigate in this
manuscript. It is able to extract relevant details out of an instruction:
which operands, registers, \ldots{} it uses; which broader group of
instructions it belongs to; etc. These instruction groups are particularly
useful in our case, as they allow us to find control flow instructions
without writing ISA-specific code. These control-altering instructions are
jumps, calls and returns. We are also able to trace a (relative) jump to its
jump site, enabling us later to use a finer definition of basic blocks.
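
The group-based detection of control-flow instructions may look like the
following sketch ---~shown for x86-64; the instruction groups themselves are
ISA-agnostic, but reading a jump target out of an immediate operand is
slightly ISA-specific.

\begin{lstlisting}[language=Python]
from capstone import (Cs, CS_ARCH_X86, CS_MODE_64,
                      CS_GRP_JUMP, CS_GRP_CALL, CS_GRP_RET)
from capstone.x86 import X86_OP_IMM

md = Cs(CS_ARCH_X86, CS_MODE_64)
md.detail = True  # required to query instruction groups and operands

def control_flow_info(code: bytes, base_addr: int):
    """Yield (instruction, is_control_flow, jump_target) triples."""
    for insn in md.disasm(code, base_addr):
        is_cf = any(insn.group(g)
                    for g in (CS_GRP_JUMP, CS_GRP_CALL, CS_GRP_RET))
        target = None
        if insn.group(CS_GRP_JUMP):
            ops = insn.operands
            # Direct (PC-relative) jumps expose their target as an
            # immediate operand; indirect jumps do not.
            if ops and ops[0].type == X86_OP_IMM:
                target = ops[0].imm
        yield insn, is_cf, target
\end{lstlisting}
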
\begin{algorithm}
    \begin{algorithmic}
        \Function{bbsOfSymbol}{symbol} \Comment{Memoized, computed only once
        per symbol}
            \State instructions $\gets$ disassemble(bytesFor(symbol))
            \Comment{Uses both pyelftools and capstone}

            \State flowSites $\gets \emptyset$
            \State jumpSites $\gets \emptyset$

            \For{instr $\in$ instructions}
                \If{isControlFlow(instr)}
                    \State flowSites $\gets \text{flowSites} \cup \{\text{next(instr).addr}\}$
                    \If{isJump(instr)}
                        \State jumpSites $\gets \text{jumpSites} \cup \{\text{instr.jump\_addr}\}$
                    \EndIf
                \EndIf
            \EndFor

            \State \textbf{return} instructions.splitAt($\text{flowSites} \cup \text{jumpSites}$)
        \EndFunction

        \medskip{}

        \Function{bbsOfPcs}{pcs}
            \State occurrences $\gets \{\}$
            \For{pc $\in$ pcs}
                \State bbs $\gets$ bbsOfSymbol(symbolOfPc(pc))
                \State bb $\gets$ bisect(pc, bbs)
                \State $\text{occurrences}[\text{bb}]++$
            \EndFor
            \State \textbf{return} occurrences
        \EndFunction
    \end{algorithmic}

    \caption{Basic block extraction procedure, given a \perf{}-obtained list
    of program counters.}\label{alg:bb_extr_procedure}
\end{algorithm}

\paragraph{Extracting basic blocks.} We describe the basic block extraction,
given the \perf{}-provided list of sampled program counters, in
\autoref{alg:bb_extr_procedure}. For each program counter, we find the ELF
symbol it belongs to, and decompose this whole symbol into basic blocks ---~we
memoize this step to do it only once per symbol. We then bisect the basic
block corresponding to the current PC from the list of obtained basic blocks
to count the occurrences of each block.

To split a symbol into basic blocks, we determine, using \texttt{capstone},
its sets of \emph{flow sites} and \emph{jump sites}. The former is the set of
addresses just after a control flow instruction, while the latter is the set
of addresses to which jump instructions may jump. We then split the
straight-line code of the symbol using the union of both sets as boundaries.
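
On top of the \texttt{control\_flow\_info} and \texttt{symbol\_of\_pc}
helpers sketched above, this splitting and the bisection of
\autoref{alg:bb_extr_procedure} could look as follows.

\begin{lstlisting}[language=Python]
import bisect
from collections import Counter
from functools import lru_cache

@lru_cache(maxsize=None)  # memoized: computed only once per symbol
def bb_starts(code: bytes, base_addr: int) -> list[int]:
    """Basic-block start addresses of a symbol: the union of flow
    sites (address following a control-flow instruction) and jump
    sites, plus the symbol's entry point."""
    sites = {base_addr}
    for insn, is_cf, target in control_flow_info(code, base_addr):
        if is_cf:
            sites.add(insn.address + insn.size)  # flow site
        if target is not None:
            sites.add(target)                    # jump site
    return sorted(sites)

def count_bbs(pcs, symbols) -> Counter:
    """Occurrence count per basic block, bisecting each sampled PC."""
    occurrences = Counter()
    for pc in pcs:
        name, start, code = symbol_of_pc(symbols, pc)
        starts = bb_starts(code, start)
        # The block containing pc is the last one starting at or
        # before pc.
        bb = starts[bisect.bisect_right(starts, pc) - 1]
        occurrences[(name, bb)] += 1
    return occurrences
\end{lstlisting}
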
\medskip

Altogether, on x86-64, we retrieve 2\,499 basic blocks from Polybench and
13\,383 basic blocks from SPEC\@. The structure, datasets and runtime of the
two benchmark suites, however, make it significantly harder to gather real
``kernels'' from SPEC: while 724 of the Polybench basic blocks were sampled
more than 1\,000 times, only 75 SPEC basic blocks were, and just 668 were
sampled more than 100 times.