Palmed: first writeup of Benchsuite-bb
parent 4c4f70c246
commit 625f21739a
6 changed files with 220 additions and 63 deletions
202 manuscrit/30_palmed/35_benchsuite_bb.tex Normal file
@@ -0,0 +1,202 @@
\section{Finding basic blocks to evaluate \palmed{}}

In this context, my main task within the \palmed{} project was to build a
system able to evaluate a produced mapping on a given architecture.

Some tools, such as \pmevo{}~\cite{PMEvo}, use randomly-sampled basic blocks
for their evaluation. However, random generation may yield basic blocks that
are not representative of the various workloads our model might be used on.
Thus, while arbitrarily or randomly generated microbenchmarks were well suited
to the data acquisition phase needed to generate the model, the kernels on
which the model is evaluated cannot be arbitrary: they must come from
real-world programs.

\subsection{Benchmark suites}

Models generated by \palmed{} are meant to be used on basic blocks that are
computationally intensive ---~so that the backend is actually the relevant
resource to monitor, as opposed to \eg{} frontend- or input/output-bound
code~---, running in steady-state ---~that is, forming the body of a loop long
enough to be reasonably considered infinite for performance modelling
purposes. The basic blocks used to evaluate \palmed{} should thus be
reasonably close to these criteria.

For this reason, we evaluate \palmed{} on basic blocks extracted from
two well-known benchmark suites instead: Polybench and SPEC CPU 2017.

\paragraph{Polybench} is a suite of benchmarks built out of 30 kernels of
numerical computation~\cite{bench:polybench}. Its benchmarks are
domain-specific and centered around scientific computation, mathematical
computation, image processing, etc. As the computation kernels are clearly
identifiable in the source code, extracting the relevant basic blocks is easy,
which fits our purpose well. The suite is written in C\@. Although it is not
under a free/libre software license, it is free to use and open-source.

\paragraph{SPEC CPU 2017} is a suite of benchmarks meant to be CPU
intensive~\cite{bench:spec}. It is composed of both integer and floating-point
benchmarks, extracted from (mainly open-source) real-world software, such as
\texttt{gcc}, \texttt{imagemagick}, \ldots{} Its main purpose is to obtain
metrics and compare CPUs on a unified workload; it is however commonly used
throughout the literature to evaluate compilers, optimizers, code analyzers,
etc. It is split into four variants: integer and floating-point, combined with
speed ---~time to perform a single task~--- and rate ---~throughput for
performing a flow of tasks. Most benchmarks exist in both speed and rate mode.
The SPEC suite is under a paid license and cannot be redistributed, which
makes peer-review and replication of experiments ---~\eg{} for artifact
review~--- complicated.

\subsection{Manually extracting basic blocks}

Our first approach, which we used to extract basic blocks from the two
benchmark suites introduced above for the evaluation included in our article
on \palmed{}~\cite{palmed}, was largely manual. We use different ---~though
similar~--- approaches for Polybench and SPEC\@.

In the case of Polybench, we compile multiple versions of each benchmark
(\texttt{-O2}, \texttt{-O3} and tiled using the Pluto
optimizer~\cite{tool:pluto}). We then use \qemu~\cite{tool:qemu} to extract
\textit{translation blocks} ---~closely akin to basic blocks~--- along with an
occurrence count for each of them. We finally select the basic blocks that
occur often enough to be loop bodies.

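As an illustration, the occurrence counts can be recovered from \qemu{}'s
debug logs. The following minimal Python sketch assumes a log recorded with
\lstbash{qemu-x86_64 -d in_asm,exec -D qemu.log ./kernel}; the exact format of
the execution-trace lines varies across \qemu{} versions, so the regular
expression below is an assumption of ours.

\begin{lstlisting}[language=Python]
# Minimal sketch: count how often each QEMU translation block is executed.
# Assumes executed blocks appear on "Trace ..." lines ending in [<addr>];
# this log format is version-dependent.
import re
from collections import Counter

def tb_occurrences(log_path):
    occurrences = Counter()
    trace_re = re.compile(r"^Trace .*\[([0-9a-fx/]+)\]")
    with open(log_path) as log:
        for line in log:
            match = trace_re.match(line)
            if match:
                addr = match.group(1).split("/")[-1]
                occurrences[int(addr, 16)] += 1
    return occurrences
\end{lstlisting}

Selecting the hot blocks then amounts to keeping the translation blocks whose
counter exceeds a threshold.
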
In the case of SPEC, we replace \qemu{} with the Linux \perf{} profiler, as
individual benchmarks of the suite are heavier than Polybench benchmarks,
making \qemu{}'s instrumentation overhead impractical. While \perf{} provides
us with occurrence statistics, it does not chunk the program into basic
blocks; we use an external disassembler and heuristics on the instructions to
do this chunking. We describe both aspects ---~profiling and chunking~--- in
more detail below.

Altogether, this method generates, for x86-64 processors, 13\,778 SPEC-based
and 2\,664 Polybench-based basic blocks.

\subsection{Automating basic block extraction}

This manual method, however, has multiple drawbacks. It is, obviously, tedious
to manually compile and run a benchmark suite, then extract basic blocks using
two collections of scripts depending on which suite is used. It is also
impractical that the two benchmark suites are scraped using very similar, yet
different techniques: as the basic blocks are not chunked using the same code,
they might have slightly different properties.

Most importantly, this manual extraction is not reproducible. This comes with
two problems.
\begin{itemize}
    \item If the dataset was to be lost, or if another researcher wanted to
        reproduce our results, the exact same dataset could not be recreated.
        The same general procedure could be followed again, but the scripts
        would have to be rewritten, and the manually-typed, undocumented
        shell commands recreated.
    \item The same consideration applies to porting the dataset to another
        ISA\@. Indeed, as the dataset consists of assembly-level basic
        blocks, it cannot be transferred to another ISA: it has to be
        re-generated from source-level benchmarks. This poses the same
        problems as the first point.
\end{itemize}

This second point particularly motivated us to automate the basic block
extraction procedure when \palmed{} ---~and the underlying \pipedream{}~---
were extended to produce mappings for ARM processors.

\medskip{}

Our automated extraction tool, \benchsuitebb{}, is able to extract basic
blocks from Polybench and SPEC\@. Although we do not use it to evaluate
\palmed{}, it also supports the extraction of basic blocks from
Rodinia~\cite{bench:rodinia}, a benchmark suite targeting heterogeneous
computing and featuring various classic kernels, such as K-means,
backpropagation, BFS, \ldots{}

For the most part, \benchsuitebb{} implements the manual approach used for
SPEC\@. On top of an abstraction layer meant to unify the interface to all
benchmark suites, it executes the various compiled binaries while profiling
them through \perf{}, and chunks the relevant parts into basic blocks using a
disassembler.

\paragraph{Profiling with \perf{}.}
The \perf{} profiler~\cite{tool:perf} is part of the Linux kernel. It works by
sampling the current program counter (as well as the stack, if requested, to
obtain a stack trace) upon either event occurrences ---~such as elapsed CPU
cycles, context switches or cache misses~--- or simply at a fixed,
user-defined frequency.

In our case, we use this second mode to uniformly sample the program counter
across a run. We recover the output of the profiling as a \textit{raw trace}
with \lstbash{perf report -D}.

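For illustration, this sampling step could be driven from Python as sketched
below. The \lstbash{perf record -F} and \lstbash{perf report -D} invocations
are \perf{}'s actual interface, but the format of the raw-dump lines varies
across \perf{} versions, so the regular expression is an assumption of ours.

\begin{lstlisting}[language=Python]
# Minimal sketch: sample a benchmark's program counter at a fixed frequency,
# then pull the sampled PCs out of perf's raw trace.
import re
import subprocess

def sample_pcs(command, freq=1000):
    subprocess.run(["perf", "record", "-F", str(freq),
                    "-o", "perf.data", "--"] + command, check=True)
    dump = subprocess.run(["perf", "report", "-D", "-i", "perf.data"],
                          capture_output=True, text=True, check=True)
    # PERF_RECORD_SAMPLE lines carry "<pid>/<tid>: <pc>"; version-dependent.
    sample_re = re.compile(r"PERF_RECORD_SAMPLE.*?\d+/\d+: (0x[0-9a-f]+)")
    return [int(m.group(1), 16) for m in sample_re.finditer(dump.stdout)]
\end{lstlisting}
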
\paragraph{ELF navigation: \texttt{pyelftools} and \texttt{capstone}.}
To trace these program counter samples back to basic blocks, we then need to
chunk the relevant sections of the ELF binary down to basic blocks. For this,
we use two tools: \texttt{pyelftools} and \texttt{capstone}.

The \texttt{pyelftools} Python library is able to parse and decode much of
the information contained in an ELF file. In our case, it allows us to find
the \texttt{.text} section of the input binary, search for symbols, find the
symbol containing a given program counter, extract the raw assembled bytes
between two addresses, etc.

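For instance, finding the symbol that encloses a sampled program counter might
look as follows. The helper name is ours, while the \texttt{pyelftools} calls
are the library's actual API; for a position-independent executable, the
runtime load offset would first have to be subtracted from the sampled PC.

\begin{lstlisting}[language=Python]
# Minimal sketch: map a program counter back to the function symbol that
# contains it.  Assumes a non-stripped binary (a .symtab section exists).
from elftools.elf.elffile import ELFFile

def symbol_of_pc(binary_path, pc):
    with open(binary_path, "rb") as stream:
        elf = ELFFile(stream)
        symtab = elf.get_section_by_name(".symtab")
        for sym in symtab.iter_symbols():
            start, size = sym["st_value"], sym["st_size"]
            if (sym["st_info"]["type"] == "STT_FUNC"
                    and start <= pc < start + size):
                return sym.name, start, size
    return None  # PC falls outside any known function symbol
\end{lstlisting}
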
The \texttt{capstone} disassembler, on the other hand, can disassemble a
portion of machine code back to assembly. It supports many ISAs, among which
x86-64 and ARM, the two ISAs we investigate in this manuscript. It is able to
extract relevant details out of an instruction: which operands, registers,
\ldots{} it uses; which broader group of instructions it belongs to; etc.
These instruction groups are particularly useful in our case, as they allow
us to find control flow instructions without writing ISA-specific code. These
control-altering instructions are jumps, calls and returns. We are also able
to trace a (relative) jump to its jump site, enabling us later to have a
finer definition of basic blocks.

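A minimal sketch of this group-based detection is given below; the group
constants are \texttt{capstone}'s own ISA-agnostic identifiers, while the
helper names are ours.

\begin{lstlisting}[language=Python]
# Minimal sketch: ISA-agnostic control-flow detection through capstone's
# instruction groups, here instantiated for x86-64.
from capstone import (Cs, CS_ARCH_X86, CS_MODE_64,
                      CS_GRP_JUMP, CS_GRP_CALL, CS_GRP_RET)

md = Cs(CS_ARCH_X86, CS_MODE_64)
md.detail = True  # detailed mode populates insn.groups

CONTROL_FLOW = {CS_GRP_JUMP, CS_GRP_CALL, CS_GRP_RET}

def is_control_flow(insn):
    return any(group in CONTROL_FLOW for group in insn.groups)

def is_jump(insn):
    return CS_GRP_JUMP in insn.groups
\end{lstlisting}

Iterating over \texttt{md.disasm(code, vaddr)} then yields instruction
objects carrying their address, operands and groups.
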
\begin{algorithm}
    \begin{algorithmic}
        \Function{bbsOfSymbol}{symbol} \Comment{Memoized, computed only once
        per symbol}
            \State instructions $\gets$ disassemble(bytesFor(symbol))
            \Comment{Uses both pyelftools and capstone}

            \State flowSites $\gets \emptyset$
            \State jumpSites $\gets \emptyset$

            \For{instr $\in$ instructions}
                \If{isControlFlow(instr)}
                    \State flowSites $\gets \text{flowSites} \cup \{\text{next(instr).addr}\}$
                    \If{isJump(instr)}
                        \State jumpSites $\gets \text{jumpSites} \cup \{\text{instr.jump\_addr}\}$
                    \EndIf
                \EndIf
            \EndFor

            \State \textbf{return} instructions.splitAt($\text{flowSites} \cup
            \text{jumpSites}$)
        \EndFunction

        \medskip{}

        \Function{bbsOfPcs}{pcs}
            \State occurrences $\gets \{\}$
            \For{pc $\in$ pcs}
                \State bbs $\gets$ bbsOfSymbol(symbolOfPc(pc))
                \State bb $\gets$ bisect(pc, bbs)
                \State $\text{occurrences}[\text{bb}] ++$
            \EndFor
            \State \textbf{return} occurrences
        \EndFunction
    \end{algorithmic}

    \caption{Basic block extraction procedure, given a \perf{}-obtained list
    of program counters.}\label{alg:bb_extr_procedure}
\end{algorithm}

\paragraph{Extracting basic blocks.} We describe the basic block extraction,
given the \perf{}-provided list of sampled program counters, in
\autoref{alg:bb_extr_procedure}. For each program counter, we find the ELF
symbol it belongs to, and decompose this whole symbol into basic blocks ---~we
memoize this step to do it only once per symbol. We then bisect the list of
obtained basic blocks to find the block containing the current PC, and count
the occurrences of each block.

To split a symbol into basic blocks, we determine using \texttt{capstone} its
set of \emph{flow sites} and \emph{jump sites}. The former is the set of
addresses just after a control flow instruction, while the latter is the set
of addresses to which jump instructions may jump. We then split the
straight-line code of the symbol using the union of both sets as boundaries.

@@ -3,69 +3,6 @@
The main contribution I made to \palmed{} is its evaluation harness and
procedure. \todo{}

\subsection{Basic blocks from benchmark suites}

Models generated by \palmed{} are meant to be used on basic blocks that are
computationally intensive ---~so that the backend is actually the relevant
resource to monitor, compared to \eg{} frontend- or input/output-bound code~---,
running in steady-state ---~that is, which is the body of a loop long enough to
be reasonably considered infinite for performance modelling purposes. The basic
blocks used to evaluate \palmed{} should thus be reasonably close from these
criteria.

Some tools, such as \pmevo{}~\cite{PMEvo}, use randomly-sampled basic blocks
for their evaluation. This approach, however, may yield basic blocks that do
not fit in those criteria; furthermore, it may not be representative of
real-life code on which the users of the tool expect it to be accurate.

For this reason, we evaluate \palmed{} on basic blocks extracted from
two well-known benchmark suites instead: Polybench and SPEC CPU 2017.

\paragraph{Polybench} is a suite of benchmarks built out of 30 kernels of
numerical computation~\cite{bench:polybench}. Its benchmarks are
domain-specific and centered around scientific computation, mathematical
computation, image processing, etc. As the computation kernels are
clearly identifiable in the source code, extracting the relevant basic blocks
is easy, and fits well for our purpose. It is written in C language. Although
it is not under a free/libre software license, it is free to use and
open-source.

We compile multiple versions of each benchmark (\texttt{-O2}, \texttt{-O3} and
tiled using the Pluto optimizer~\cite{tool:pluto}), then extract the basic
block corresponding to the benchmarks' kernels using \qemu~\cite{tool:qemu},
gathering translation blocks and occurrence statistics.

\paragraph{SPEC CPU 2017} is a suite of benchmarks meant to be CPU
intensive~\cite{bench:spec}. It is composed of both integer and floating-point
based benchmarks, extracted from (mainly open source) real-world software, such
as \texttt{gcc}, \texttt{imagemagick}, \ldots{} Its main purpose is to obtain
metrics and compare CPUs on a unified workload; it is however commonly used
throughout the literature to evaluate compilers, optimizers, code analyzers,
etc. It is split into four variants: integer and floating-point, combined with
speed ---~time to perform a single task~--- and rate ---~throughput for
performing a flow of tasks. Most benchmarks exist in both speed and rate mode.
The SPEC suite is under a paid license, and cannot be redistributed, which
makes peer-review and replication of experiments ---~\eg{} for artifact
review~--- complicated.

In the case of SPEC, there is no clear kernel available for each benchmark;
extracting basic blocks to evaluate \palmed{} is not trivial. We manually
extract the relevant basic blocks using a profiling-based approach with Linux
\perf{}, as the \qemu{}-based solution used for Polybench would be too costly
for SPEC\@.

\bigskip{}

Altogether, this method generates, for x86-64 processors, 13\,778 SPEC-based
and 2\,664 polybench-based basic blocks.

\medskip{}

We automate and describe in detail the \perf{}-based method later in
\qtodo{ref}; however, this automation allows us to also extract basic blocks
from these same benchmark suites, compiled for ARMv8a.

\subsection{Evaluation harness}

We implement into \palmed{} an evaluation harness to evaluate it both against
@@ -4,5 +4,6 @@
\input{10_resource_models.tex}
\input{20_palmed_design.tex}
\input{30_pipedream.tex}
\input{35_benchsuite_bb.tex}
\input{40_palmed_results.tex}
\input{50_other_contributions.tex}
@@ -45,3 +45,14 @@ keywords = {program characterization, dynamic analysis, code optimization},
location = {Montr\'{e}al, QC, Canada},
series = {CGO 2023}
}

@INPROCEEDINGS{bench:rodinia,
  author={Che, Shuai and Boyer, Michael and Meng, Jiayuan and Tarjan, David and Sheaffer, Jeremy W. and Lee, Sang-Ha and Skadron, Kevin},
  booktitle={2009 IEEE International Symposium on Workload Characterization (IISWC)},
  title={Rodinia: A benchmark suite for heterogeneous computing},
  year={2009},
  volume={},
  number={},
  pages={44-54},
  doi={10.1109/IISWC.2009.5306797}
}
@@ -44,6 +44,7 @@
\newcommand{\bhive}{\texttt{BHive}}
\newcommand{\anica}{\texttt{AnICA}}
\newcommand{\cesasme}{\texttt{CesASMe}}
\newcommand{\benchsuitebb}{\texttt{benchsuite-bb}}

\newcommand{\gdb}{\texttt{gdb}}
@@ -26,6 +26,7 @@
\usepackage{wrapfig}
\usepackage{float}
\usepackage{tikz}
\usepackage{algpseudocode}
\usepackage[bottom]{footmisc} % footnotes are below floats
\usepackage[final]{microtype}
@@ -85,3 +86,7 @@
\newfloat{lstfloat}{htbp}{lop}
\floatname{lstfloat}{Listing}
\def\lstfloatautorefname{Listing}

\newfloat{algorithm}{htbp}{lop}
\floatname{algorithm}{Algorithm}
\def\algorithmautorefname{Algorithm}