Palmed: first writeup of Benchsuite-bb
parent
4c4f70c246
commit
625f21739a
6 changed files with 220 additions and 63 deletions
202
manuscrit/30_palmed/35_benchsuite_bb.tex
Normal file
@ -0,0 +1,202 @@
\section{Finding basic blocks to evaluate \palmed{}}
Within the \palmed{} project described above, my main task was to build a
system able to evaluate a produced mapping on a given architecture.
Some tools, such as \pmevo{}~\cite{PMEvo}, use randomly-sampled basic blocks
for their evaluation. However, random generation may yield basic blocks that
are not representative of the various workloads our model might be used on.
Thus, while arbitrarily or randomly generated microbenchmarks were well suited
to the data acquisition phase needed to generate the model, the kernels on
which the model would be evaluated could not be arbitrary: they had to come
from real-world programs.
\subsection{Benchmark suites}
Models generated by \palmed{} are meant to be used on basic blocks that are
computationally intensive ---~so that the backend is actually the relevant
resource to monitor, as opposed to \eg{} frontend- or input/output-bound
code~--- and running in steady-state ---~that is, forming the body of a loop
long enough to be reasonably considered infinite for performance modelling
purposes. The basic blocks used to evaluate \palmed{} should thus be
reasonably close to these criteria.
For this reason, we evaluate \palmed{} on basic blocks extracted from
two well-known benchmark suites instead: Polybench and SPEC CPU 2017.
\paragraph{Polybench} is a suite of benchmarks built out of 30 kernels of
numerical computation~\cite{bench:polybench}. Its benchmarks are
domain-specific and centered on scientific computation, mathematical
computation, image processing, etc. As the computation kernels are clearly
identifiable in the source code, extracting the relevant basic blocks is
easy, which fits our purpose well. The suite is written in C\@. Although it
is not under a free/libre software license, it is free to use and
open-source.
\paragraph{SPEC CPU 2017} is a suite of benchmarks meant to be
CPU-intensive~\cite{bench:spec}. It is composed of both integer and
floating-point benchmarks, extracted from (mainly open-source) real-world
software such as \texttt{gcc}, \texttt{imagemagick}, \ldots{} Its main
purpose is to obtain metrics and compare CPUs on a unified workload; it is,
however, commonly used throughout the literature to evaluate compilers,
optimizers, code analyzers, etc. It is split into four variants: integer and
floating-point, combined with speed ---~time to perform a single task~--- and
rate ---~throughput for performing a flow of tasks. Most benchmarks exist in
both speed and rate mode. The SPEC suite is under a paid license and cannot
be redistributed, which makes peer-review and replication of experiments
---~\eg{} for artifact review~--- complicated.
\subsection{Manually extracting basic blocks}
Our first approach, which we used to extract basic blocks from the two
benchmark suites introduced above for the evaluation included in our
\palmed{} article~\cite{palmed}, was largely manual. We use different
---~though similar~--- approaches for Polybench and SPEC\@.
In the case of Polybench, we compile multiple versions of each benchmark
(\texttt{-O2}, \texttt{-O3} and tiled using the Pluto
optimizer~\cite{tool:pluto}). We then use \qemu~\cite{tool:qemu} to extract
\textit{translation blocks} ---~very akin to basic blocks~--- along with an
occurrence count for each of them. We finally select the blocks that occur
often enough to be loop bodies.
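For illustration, this collection step can be sketched with \qemu{}'s TCG
logging flags. This is a minimal sketch only: the exact log format varies
across \qemu{} versions, and the benchmark binary name is hypothetical.

\begin{lstlisting}[language=bash]
# Log each translated block's assembly ("in_asm") and each execution
# of a translated block ("exec") while emulating the benchmark.
qemu-x86_64 -d in_asm,exec -D qemu.log ./gemm_O3
# Counting how often each block's start address appears among the
# execution records gives a per-block occurrence count.
grep '^Trace' qemu.log | sort | uniq -c | sort -rn | head
\end{lstlisting}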
In the case of SPEC, we replace \qemu{} with the Linux \perf{} profiler, as
individual benchmarks of the suite are heavier than Polybench's, making
\qemu{}'s instrumentation overhead impractical. While \perf{} provides us
with occurrence statistics, it does not chunk the program into basic blocks;
we use an external disassembler and heuristics on the instructions to do this
chunking. We describe both aspects ---~profiling and chunking~--- in more
detail below.
Altogether, this method generates, for x86-64 processors, 13\,778 SPEC-based
and 2\,664 Polybench-based basic blocks.
\subsection{Automating basic block extraction}
This manual method, however, has multiple drawbacks. It is, obviously,
tedious to manually compile and run a benchmark suite, then extract basic
blocks using two collections of scripts depending on which suite is used. It
is also impractical that the two benchmark suites are processed using very
similar, yet different techniques: as the basic blocks are not chunked by the
same code, they might have slightly different properties.
Most importantly, this manual extraction is not reproducible, which raises
two problems.
\begin{itemize}
    \item If the dataset were to be lost, or if another researcher wanted to
          reproduce our results, the exact same dataset could not be
          recreated. The same general procedure could be followed again, but
          the code and scripts would have to be rewritten, and the manually
          typed, undocumented shell commands re-created.
    \item The same consideration applies to porting the dataset to another
          ISA\@. Indeed, as the dataset consists of assembly-level basic
          blocks, it cannot be transferred to another ISA: it has to be
          re-generated from source-level benchmarks, which poses the same
          problems as the first point.
\end{itemize}
This second point particularly motivated us to automate the basic block
extraction procedure when \palmed{} ---~and the underlying \pipedream{}~---
were extended to produce mappings for ARM processors.
\medskip{}
Our automated extraction tool, \benchsuitebb{}, is able to extract basic
blocks from Polybench and SPEC\@. Although we do not use it to evaluate
\palmed{}, it also supports the extraction of basic blocks from
Rodinia~\cite{bench:rodinia}, a benchmark suite targeted at heterogeneous
computing and featuring various common kernels, such as K-means,
backpropagation, BFS, \ldots{}
For the most part, \benchsuitebb{} implements the manual approach used for
SPEC\@. On top of an abstraction layer meant to unify the interface to all
benchmark suites, it executes the various compiled binaries while profiling
them through \perf{}, and chunks the relevant parts into basic blocks using a
disassembler.
\paragraph{Profiling with \perf{}.}
The \perf{} profiler~\cite{tool:perf} is part of the Linux kernel. It works
by sampling the current program counter (as well as the stack, if requested,
to obtain a stack trace) either upon event occurrences ---~such as a number
of elapsed CPU cycles, context switches, cache misses, \ldots~--- or simply
at a fixed, user-defined frequency.
In our case, we use the latter mode to uniformly sample the program counter
across a run. We recover the output of the profiling as a \textit{raw trace}
with \lstbash{perf report -D}.
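Concretely, this profiling step amounts to something like the following
sketch, where the sampling frequency is an arbitrary choice and the binary
name is hypothetical.

\begin{lstlisting}[language=bash]
# Sample the program counter at a fixed frequency (here 1000 Hz)
# instead of sampling upon hardware events.
perf record -F 1000 -o perf.data ./benchmark_binary
# Dump the raw trace; each PERF_RECORD_SAMPLE record contains one
# sampled program counter value.
perf report -D -i perf.data > raw_trace.txt
\end{lstlisting}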
\paragraph{ELF navigation: \texttt{pyelftools} and \texttt{capstone}.}
To trace these program counter samples back to basic blocks, we then need to
chunk the relevant sections of the ELF binary down to basic blocks. For this,
we use two tools: \texttt{pyelftools} and \texttt{capstone}.
The \texttt{pyelftools} Python library is able to parse and decode much of
the information contained in an ELF file. In our case, it allows us to find
the \texttt{.text} section of the input binary, search for symbols, find the
symbol containing a given program counter, extract the raw assembled bytes
between two addresses, etc.
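For instance, finding the symbol that contains a given program counter can be
sketched as follows; the function name is ours, not part of \benchsuitebb{}'s
actual interface.

\begin{lstlisting}[language=Python]
from elftools.elf.elffile import ELFFile

def symbol_of_pc(elf_path, pc):
    """Return the symbol whose address range contains pc, if any."""
    with open(elf_path, 'rb') as stream:
        elf = ELFFile(stream)
        symtab = elf.get_section_by_name('.symtab')
        for sym in symtab.iter_symbols():
            start, size = sym['st_value'], sym['st_size']
            if start <= pc < start + size:
                return sym.name
    return None
\end{lstlisting}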
The \texttt{capstone} disassembler, on the other hand, allows us to
disassemble a portion of assembled binary back to assembly. It supports many
ISAs, among which x86-64 and ARM, the two ISAs we investigate in this
manuscript. It is able to extract relevant details out of an instruction:
which operands, registers, \ldots{} it uses; which broader group of
instructions it belongs to; etc. These instruction groups are particularly
useful in our case, as they allow us to find control flow instructions
without writing ISA-specific code. These control-altering instructions are
jumps, calls and returns. We are also able to trace a (relative) jump to its
jump site, enabling us later to have a finer definition of basic blocks.
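A minimal sketch of this group-based, ISA-agnostic classification follows;
the helper name is ours, while the \texttt{CS\_GRP\_*} constants belong to
\texttt{capstone}'s Python bindings.

\begin{lstlisting}[language=Python]
from capstone import (Cs, CS_ARCH_X86, CS_MODE_64,
                      CS_GRP_JUMP, CS_GRP_CALL, CS_GRP_RET)

md = Cs(CS_ARCH_X86, CS_MODE_64)  # CS_ARCH_ARM64 for ARMv8-A
md.detail = True  # required to populate the instruction groups

def is_control_flow(insn):
    """True for jumps, calls and returns, on any supported ISA."""
    return any(insn.group(grp)
               for grp in (CS_GRP_JUMP, CS_GRP_CALL, CS_GRP_RET))
\end{lstlisting}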
\begin{algorithm}
    \begin{algorithmic}
        \Function{bbsOfSymbol}{symbol} \Comment{Memoized, computed only once
                per symbol}
            \State instructions $\gets$ disassemble(bytesFor(symbol))
                \Comment{Uses both pyelftools and capstone}

            \State flowSites $\gets \emptyset$
            \State jumpSites $\gets \emptyset$

            \For{instr $\in$ instructions}
                \If{isControlFlow(instr)}
                    \State flowSites $\gets \text{flowSites} \cup \{\text{next(instr).addr}\}$
                    \If{isJump(instr)}
                        \State jumpSites $\gets \text{jumpSites} \cup \{\text{instr.jump\_addr}\}$
                    \EndIf
                \EndIf
            \EndFor

            \State \textbf{return} instructions.splitAt($\text{flowSites} \cup
                \text{jumpSites}$)
        \EndFunction

        \medskip{}

        \Function{bbsOfPcs}{pcs}
            \State occurrences $\gets \{\}$
            \For{pc $\in$ pcs}
                \State bbs $\gets$ bbsOfSymbol(symbolOfPc(pc))
                \State bb $\gets$ bisect(pc, bbs)
                \State $\text{occurrences}[\text{bb}] ++$
            \EndFor
            \State \textbf{return} occurrences
        \EndFunction
    \end{algorithmic}

    \caption{Basic block extraction procedure, given a \perf{}-obtained list
    of program counters.}\label{alg:bb_extr_procedure}
\end{algorithm}
\paragraph{Extracting basic blocks.} We describe the basic block extraction,
given the \perf{}-provided list of sampled program counters, in
\autoref{alg:bb_extr_procedure}. For each program counter, we find the ELF
symbol it belongs to, and decompose this whole symbol into basic blocks ---~we
memoize this step to do it only once per symbol. We then find, by bisection,
the basic block corresponding to the current PC among the obtained blocks,
and count the occurrences of each block.
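The counting step of \autoref{alg:bb_extr_procedure} can be sketched in
Python as follows, assuming the basic blocks of a symbol are sorted by start
address; the names are ours.

\begin{lstlisting}[language=Python]
import bisect
from collections import Counter

def count_occurrences(pcs, block_starts):
    """Attribute each sampled PC to the block whose start address is
    the greatest one not exceeding it, and count the hits."""
    occurrences = Counter()
    for pc in pcs:
        idx = bisect.bisect_right(block_starts, pc) - 1
        occurrences[block_starts[idx]] += 1
    return occurrences
\end{lstlisting}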
To split a symbol into basic blocks, we determine, using \texttt{capstone},
its set of \emph{flow sites} and \emph{jump sites}. The former is the set of
addresses just after a control flow instruction, while the latter is the set
of addresses to which jump instructions may jump. We then split the
straight-line code of the symbol using the union of both sets as boundaries.
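The splitting step itself then reduces to a single pass over the symbol's
instructions, sketched below with our own names; \texttt{boundaries} is the
union of flow sites and jump sites computed above.

\begin{lstlisting}[language=Python]
def split_at(instructions, boundaries):
    """Split a list of disassembled instructions into basic blocks,
    cutting before every address listed in boundaries."""
    blocks, current = [], []
    for insn in instructions:
        if insn.address in boundaries and current:
            blocks.append(current)
            current = []
        current.append(insn)
    if current:
        blocks.append(current)
    return blocks
\end{lstlisting}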
@ -3,69 +3,6 @@
The main contribution I made to \palmed{} is its evaluation harness and
procedure. \todo{}

\subsection{Basic blocks from benchmark suites}

Models generated by \palmed{} are meant to be used on basic blocks that are
computationally intensive ---~so that the backend is actually the relevant
resource to monitor, compared to \eg{} frontend- or input/output-bound code~---,
running in steady-state ---~that is, which is the body of a loop long enough to
be reasonably considered infinite for performance modelling purposes. The basic
blocks used to evaluate \palmed{} should thus be reasonably close from these
criteria.

Some tools, such as \pmevo{}~\cite{PMEvo}, use randomly-sampled basic blocks
for their evaluation. This approach, however, may yield basic blocks that do
not fit in those criteria; furthermore, it may not be representative of
real-life code on which the users of the tool expect it to be accurate.

For this reason, we evaluate \palmed{} on basic blocks extracted from
two well-known benchmark suites instead: Polybench and SPEC CPU 2017.

\paragraph{Polybench} is a suite of benchmarks built out of 30 kernels of
numerical computation~\cite{bench:polybench}. Its benchmarks are
domain-specific and centered around scientific computation, mathematical
computation, image processing, etc. As the computation kernels are
clearly identifiable in the source code, extracting the relevant basic blocks
is easy, and fits well for our purpose. It is written in C language. Although
it is not under a free/libre software license, it is free to use and
open-source.

We compile multiple versions of each benchmark (\texttt{-O2}, \texttt{-O3} and
tiled using the Pluto optimizer~\cite{tool:pluto}), then extract the basic
block corresponding to the benchmarks' kernels using \qemu~\cite{tool:qemu},
gathering translation blocks and occurrence statistics.

\paragraph{SPEC CPU 2017} is a suite of benchmarks meant to be CPU
intensive~\cite{bench:spec}. It is composed of both integer and floating-point
based benchmarks, extracted from (mainly open source) real-world software, such
as \texttt{gcc}, \texttt{imagemagick}, \ldots{} Its main purpose is to obtain
metrics and compare CPUs on a unified workload; it is however commonly used
throughout the literature to evaluate compilers, optimizers, code analyzers,
etc. It is split into four variants: integer and floating-point, combined with
speed ---~time to perform a single task~--- and rate ---~throughput for
performing a flow of tasks. Most benchmarks exist in both speed and rate mode.
The SPEC suite is under a paid license, and cannot be redistributed, which
makes peer-review and replication of experiments ---~\eg{} for artifact
review~--- complicated.

In the case of SPEC, there is no clear kernel available for each benchmark;
extracting basic blocks to evaluate \palmed{} is not trivial. We manually
extract the relevant basic blocks using a profiling-based approach with Linux
\perf{}, as the \qemu{}-based solution used for Polybench would be too costly
for SPEC\@.

\bigskip{}

Altogether, this method generates, for x86-64 processors, 13\,778 SPEC-based
and 2\,664 polybench-based basic blocks.

\medskip{}

We automate and describe in detail the \perf{}-based method later in
\qtodo{ref}; however, this automation allows us to also extract basic blocks
from these same benchmark suites, compiled for ARMv8a.

\subsection{Evaluation harness}

We implement into \palmed{} an evaluation harness to evaluate it both against
@ -4,5 +4,6 @@
\input{10_resource_models.tex}
\input{20_palmed_design.tex}
\input{30_pipedream.tex}
\input{35_benchsuite_bb.tex}
\input{40_palmed_results.tex}
\input{50_other_contributions.tex}
@ -45,3 +45,14 @@ keywords = {program characterization, dynamic analysis, code optimization},
location = {Montr\'{e}al, QC, Canada},
series = {CGO 2023}
}

@INPROCEEDINGS{bench:rodinia,
  author={Che, Shuai and Boyer, Michael and Meng, Jiayuan and Tarjan, David and Sheaffer, Jeremy W. and Lee, Sang-Ha and Skadron, Kevin},
  booktitle={2009 IEEE International Symposium on Workload Characterization (IISWC)},
  title={Rodinia: A benchmark suite for heterogeneous computing},
  year={2009},
  pages={44-54},
  doi={10.1109/IISWC.2009.5306797}
}
@ -44,6 +44,7 @@
\newcommand{\bhive}{\texttt{BHive}}
\newcommand{\anica}{\texttt{AnICA}}
\newcommand{\cesasme}{\texttt{CesASMe}}
\newcommand{\benchsuitebb}{\texttt{benchsuite-bb}}
\newcommand{\gdb}{\texttt{gdb}}
@ -26,6 +26,7 @@
\usepackage{wrapfig}
\usepackage{float}
\usepackage{tikz}
\usepackage{algpseudocode}
\usepackage[bottom]{footmisc} % footnotes are below floats
\usepackage[final]{microtype}

@ -85,3 +86,7 @@
\newfloat{lstfloat}{htbp}{lop}
\floatname{lstfloat}{Listing}
\def\lstfloatautorefname{Listing}

\newfloat{algorithm}{htbp}{lop}
\floatname{algorithm}{Algorithm}
\def\algorithmautorefname{Algorithm}