
\title{DWARF debugging data, compilation and optimization}

\author{Théophile Bastian\\
Under supervision of Francesco Zappa-Nardelli\\
{\textsc{parkas}, \'Ecole Normale Supérieure de Paris}}

\date{March -- August 2018\\August 20, 2018}

\documentclass[11pt]{article}

\usepackage[left=2cm,right=2cm,top=2cm,bottom=2cm]{geometry}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{stmaryrd}
\usepackage{mathtools}
\usepackage{indentfirst}
\usepackage[utf8]{inputenc}
\usepackage{makecell}
\usepackage{booktabs}
\usepackage{wrapfig}
\usepackage{pgfplots}
%\usepackage[backend=biber,style=alphabetic]{biblatex}
\usepackage[backend=biber]{biblatex}

\usepackage{../shared/my_listings}
\usepackage{../shared/my_hyperref}
\usepackage{../shared/specific}
\usepackage{../shared/common}
\usepackage{../shared/todo}

\addbibresource{../shared/report.bib}

\renewcommand\theadalign{c}
\renewcommand\theadfont{\bfseries}
%\renewcommand\theadgape{\Gape[4pt]}
%\renewcommand\cellgape{\Gape[4pt]}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{document}

%% Main title %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\maketitle

%% Fiche de synthèse %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\input{fiche_synthese}

%% Table of contents %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\tableofcontents

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%% Main text content %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection*{Source code}\label{ssec:source_code}

The source code of all the implementations made during this internship is
available at \url{https://git.tobast.fr/m2-internship/} under free software
licenses, in various repositories.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Stack unwinding data presentation}

The compilation process presented in this section is implemented in
\prog{dwarf-assembly}.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Stack frames and x86\_64 calling conventions}

On most platforms, programs make use of a \emph{call stack} to store
information about the nested function calls at the current execution point, and
keep track of their nesting. This call stack is conventionally a contiguous
memory space mapped close to the top of the address space. Each function
call has its own \emph{stack frame}, an entry of the call stack, whose precise
contents are often specified in the Application Binary Interface (ABI) of the
platform, and left to various extents up to the compiler. Those frames are
typically used for storing function arguments, machine registers that must be
restored before returning, the function's return address and local variables.

On the x86\_64 platform, with which this report is mostly concerned, the
calling convention that is followed is defined in the System V
ABI~\cite{systemVabi} for the Unix-like operating systems --~among which Linux.
Under this calling convention, the first six integer or pointer arguments of a
function are passed in the registers \reg{rdi}, \reg{rsi}, \reg{rdx},
\reg{rcx}, \reg{r8}, \reg{r9}, while additional arguments are pushed onto the
stack. It also defines which registers may be overwritten by the callee, and
which registers must be restored before returning. This restoration, most of
the time, is done by pushing the register value onto the stack in the function
prologue, and popping it back just before returning. Those preserved registers
are \reg{rbx}, \reg{rsp}, \reg{rbp}, \reg{r12}, \reg{r13}, \reg{r14},
\reg{r15}.

\begin{wrapfigure}{r}{0.4\textwidth}
    \centering
    \includegraphics[width=0.9\linewidth]{imgs/call_stack/call_stack.png}
    \caption{Program stack with x86\_64 calling
    conventions}\label{fig:call_stack}
\end{wrapfigure}

The register \reg{rsp} is supposed to always point to the last used memory cell
in the stack. Thus, when the process just enters a new function, \reg{rsp}
points right to the location of the return address. Then, the compiler might
use \reg{rbp} (``base pointer'') to save this value of \reg{rsp}, by writing
the old value of \reg{rbp} just below the return address on the stack, then
copying \reg{rsp} to \reg{rbp}. This makes it easy to find the return address
from anywhere within the function, and also allows for easy addressing of local
variables. Yet, using \reg{rbp} to save \reg{rsp} is not always done, since it
somehow ``wastes'' a register. This decision is, on x86\_64 System V, up to the
compiler.

Often, a function will start by subtracting some value from \reg{rsp}, allocating
some space in the stack frame for its local variables. Then, it will push on
the stack the values of the callee-saved registers that are overwritten later,
effectively saving them. Before returning, it will pop the values of the saved
registers back to their original registers and restore \reg{rsp} to its former
value.

\subsection{Stack unwinding}

For various reasons, it might be interesting, at some point of the execution of
a program, to glance at its program stack and be able to extract information
from it. For instance, when running a debugger such as \prog{gdb}, a frequent
usage is to obtain a \emph{backtrace}, that is, the list of all nested function
calls at this point. This actually reads the stack to find the different stack
frames, and decode them to identify the function names, parameter values, etc.

This operation is far from trivial. Often, a stack frame will only make sense
with correct machine register values, which can only be restored from the
previous stack frame. This forces one to \emph{walk} the stack, reading the
entries one after the other, instead of peeking at some frame directly.
Moreover, the size of one stack frame is often not that easy to determine when
looking at some instruction other than \texttt{ret}, making it hard to extract
single frames from the whole stack.

Interpreting a frame in order to get the machine state \emph{before} this
frame, and thus be able to decode the next frame recursively, is called
\emph{unwinding} a frame.

Let us consider a stack with x86\_64 calling conventions, such as shown in
Figure~\ref{fig:call_stack}. Assuming the compiler decided here \emph{not} to
use \reg{rbp}, and assuming the function \eg{} allocates a buffer of 8
integers, the area allocated for local variables should be at least $32$ bytes
long (for 4-byte integers), and \reg{rsp} will be pointing below this area.
Short of analyzing the assembly code produced, there is no way to find where
the return address is stored, relative to \reg{rsp}, at some arbitrary point
of the function. Even when \reg{rbp} is used, there is no easy way to guess
where each callee-saved register is stored in the stack frame, and worse, which
callee-saved registers were saved, since it is optional to save a register
that the function never touches.
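
To make the favourable case concrete, the sketch below walks a stack in which
\emph{every} function uses \reg{rbp} as a frame pointer, so that return
addresses can be recovered by following the chain of saved \reg{rbp} values.
It works on an array standing for the stack memory rather than on a real
stack. The function \lstc{walk_frame_pointers}, its parameters and the
simulated layout are hypothetical names chosen for this illustration only.

\begin{lstlisting}[language=C]
#include <stddef.h>
#include <stdint.h>

/* Simulated stack memory: one array cell per 8-byte word.
 * Assumed layout for a frame whose frame pointer is the index fp:
 *   stack[fp]     = saved frame pointer of the caller (0 ends the chain),
 *   stack[fp + 1] = return address into the caller. */
size_t walk_frame_pointers(const uint64_t *stack, size_t stack_len,
                           size_t fp, uint64_t *ret_addrs, size_t max_frames)
{
    size_t count = 0;
    while (fp != 0 && fp + 1 < stack_len && count < max_frames) {
        ret_addrs[count++] = stack[fp + 1];
        size_t caller_fp = (size_t) stack[fp];
        if (caller_fp != 0 && caller_fp <= fp)
            break; /* malformed chain: callers live at higher addresses */
        fp = caller_fp;
    }
    return count;
}
\end{lstlisting}

As soon as one function in the call path omits the frame pointer, this chain
is broken, and the callee-saved registers are not recovered in any case, which
is why additional unwinding data is needed.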

With this example, it seems pretty clear that it is often necessary to have
additional data to perform stack unwinding. This data is often stored among the
debugging information of a program, and one common format of debugging data is
DWARF\@.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Unwinding usage and frequency}

Stack unwinding is a more common operation than one might think at first. The
use case that first comes to mind is simply to get a stack trace of a program,
and provide a debugger with the information it needs: for instance, when
inspecting a stack trace in \prog{gdb}, it is quite common to jump to a
previous frame:

\lstinputlisting{src/segfault/gdb_session}

To be able to do this, \prog{gdb} must be able to restore \lstc{fct_a}'s
context, by unwinding \lstc{fct_b}'s frame.

\medskip

Yet, stack unwinding, and thus, debugging data, \emph{is not limited to
debugging}.

Another common usage is profiling. A profiling tool, such as \prog{perf} under
Linux --~see Section~\ref{ssec:perf} --, is used to measure and analyze in
which functions a program spends its time, identify bottlenecks and find out
which parts are critical to optimize.  To do so, modern profilers pause the
traced program at regular, short intervals, inspect its stack, and determine
which function is currently being run. They also often perform a stack
unwinding to determine the call path to this function, in order to find out
which functions indirectly take time: \eg, a function \lstc{fct_a} can call
both \lstc{fct_b} and \lstc{fct_c}, which are quite heavy; the program then
spends practically no time directly in \lstc{fct_a}, but a lot of time in calls
to the other two functions that were made from \lstc{fct_a}.

Exception handling also requires a stack unwinding mechanism in most languages.
Indeed, an exception is completely different from a \lstinline{return}: while the
latter returns to the previous function, the former can be caught by virtually
any function in the call path, at any point of the function. It is thus
necessary to be able to unwind frames, one by one, until a suitable
\lstc{catch} block is found. The C++ language, for one, includes a
stack-unwinding library similar to \prog{libunwind} in its runtime.

Technically, exception handling could be implemented without any stack
unwinding, by using \lstc{setjmp}/\lstc{longjmp} mechanics~\cite{niditoexn}.
However, it is not possible to implement exceptions this way directly in C++
(and some other languages), because the stack needs to be properly unwound in
order to trigger the destructors of stack-allocated objects. Furthermore, this
is often undesirable: \lstc{setjmp} has a rather large overhead, which is paid
whenever a \lstc{try} block is entered. Instead, it is often preferred to
have strictly no overhead when no exception happens, at the cost of a greater
overhead when an exception is actually fired --~after all, they are supposed to
be \emph{exceptional}. For more details on C++ exception handling,
see~\cite{koening1990exception} (especially Section~16.5). Possible
implementation mechanisms are also presented in~\cite{dinechin2000exn}.
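
The \lstc{setjmp}/\lstc{longjmp} approach mentioned above can be sketched in
plain C as follows. This is only a minimal illustration, not the mechanism of
any actual runtime: the names \lstc{handler}, \lstc{checked_div} and
\lstc{try_div} are hypothetical. It shows that \lstc{setjmp} is executed each
time the ``try'' is entered, whether or not an error occurs, and that the jump
back skips the intermediate frames without running any cleanup code.

\begin{lstlisting}[language=C]
#include <setjmp.h>

/* Hypothetical "exception" state: where to jump back to on error. */
static jmp_buf handler;

/* Deep in the call path: "throws" by jumping back to the handler. */
static int checked_div(int a, int b)
{
    if (b == 0)
        longjmp(handler, 1); /* does not return */
    return a / b;
}

/* Returns 0 and stores a / b in *out, or -1 if the "exception" was raised. */
int try_div(int a, int b, int *out)
{
    if (setjmp(handler) == 0) { /* "try": setjmp runs on every entry */
        *out = checked_div(a, b);
        return 0;
    }
    return -1;                  /* "catch": reached through longjmp */
}
\end{lstlisting}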

In both of these cases, performance \emph{can} be a problem. In
the latter, a slow unwinding directly impacts the overall program performance,
particularly if a lot of exceptions are thrown and caught far away in their
call path. In the former, profiling \emph{is} performance-heavy and often quite
slow when analyzing large programs anyway.

One of the motivations for this internship was also Stephen Kell's
\prog{libcrunch}~\cite{kell2016libcrunch}, which makes heavy use of stack
unwinding through \prog{libunwind} and had to force \prog{gcc} to use a
frame pointer (\reg{rbp}) everywhere through \lstbash{-fno-omit-frame-pointer}
in order to mitigate the slowness.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{DWARF format}

The DWARF format was first standardized as the format for debugging information
of the ELF executable binaries, which are standard on most UNIX-like systems,
including Linux and the BSDs --~but neither on Windows nor on macOS, which uses
the Mach-O format. It is now commonly used across a wide variety of binary
formats to store debugging information. As of now, the latest DWARF standard is
DWARF 5~\cite{dwarf5std}, which is openly accessible.

The DWARF data commonly includes type information about the variables in the
original programming language, correspondence of assembly instructions with a
line in the original source file, \ldots
The format also specifies a way to represent unwinding data, as described in
the previous paragraph, in an ELF section originally called
\lstc{.debug_frame}, most often found as \ehframe.

For any binary, debugging information can easily get quite large if no
attention is paid to keeping it as compact as possible. In this matter, DWARF
does an excellent job, and everything is stored in a very compact way. This,
however, as we will see, makes it both difficult to parse correctly and quite
slow to interpret.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{DWARF unwinding data}

The unwinding data, which we will call from now on the \ehframe, contains, for
each possible instruction pointer (that is, an instruction address within the
program), a set of ``registers'' that can be unwound, and a rule describing how
to do so.

The DWARF language is completely agnostic of the platform and ABI, and in
particular, is completely agnostic of a particular platform's registers. Thus,
when talking about DWARF, a register is merely a numerical identifier that is
often, but not necessarily, mapped to a real machine register by the ABI\@.

In practice, this data takes the form of a collection of tables, one table per
Frame Description Entry (FDE). An FDE, in turn, is a DWARF entry describing such
a table, that has a range of IPs on which it has authority. Most often, but not
necessarily, it corresponds to a single function in the original source code.
Each column of the table is a register (\eg{} \reg{rsp}), with two additional
special registers, CFA (Canonical Frame Address) and RA (Return Address),
containing respectively the base pointer of the current stack frame and the
return address of the current function. For instance, on an x86\_64
architecture, RA would contain the unwound value of \reg{rip}, the instruction
pointer. Each row has a certain validity interval, on which it describes
accurate unwinding data. This range starts at the instruction pointer it is
associated with, and ends at the start IP of the next table row (or the end IP
of the current FDE if it was the last row). In particular, there can be no ``IP
hole'' within an FDE --~unlike FDEs themselves, which can leave holes between
them.
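
The validity intervals described above can be made explicit with a short
sketch. It assumes a hypothetical in-memory form of an FDE whose rows have
already been decoded (\lstc{struct fde_table} and \lstc{find_row} are names
invented for this example, not part of any library), and returns the row that
applies to a given IP\@.

\begin{lstlisting}[language=C]
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory form of an FDE whose rows are already decoded. */
struct fde_table {
    uint64_t end_ip;         /* first address past the FDE's range */
    size_t nb_rows;
    const uint64_t *row_ips; /* start IP of each row, strictly increasing */
};

/* Returns the index of the row whose validity interval contains ip,
 * or -1 if ip falls outside the FDE. */
long find_row(const struct fde_table *fde, uint64_t ip)
{
    if (fde->nb_rows == 0 || ip < fde->row_ips[0] || ip >= fde->end_ip)
        return -1;
    long found = 0;
    /* A row stays valid until the start IP of the next row. */
    for (size_t i = 1; i < fde->nb_rows && fde->row_ips[i] <= ip; ++i)
        found = (long) i;
    return found;
}
\end{lstlisting}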

\begin{figure}[h]
    \begin{minipage}{0.45\textwidth}
        \lstinputlisting[language=C, firstline=3, lastline=12,
                         caption={Original C},label={lst:ex1_c}]
            {src/fib7/fib7.c}
    \end{minipage} \hfill \begin{minipage}{0.45\textwidth}
        \lstinputlisting[language=C,caption={Processed DWARF},
                         label={lst:ex1_dw}]
            {src/fib7/fib7.fde}
        \lstinputlisting[language=C,caption={Raw DWARF},label={lst:ex1_dwraw}]
            {src/fib7/fib7.raw_fde}
    \end{minipage}
\end{figure}

\begin{figure}[h]
    \begin{minipage}{0.45\textwidth}
        \lstinputlisting[language={[x86masm]Assembler},lastline=11,
                         caption={Generated assembly},label={lst:ex1_asm}]
            {src/fib7/fib7.s}
    \end{minipage} \hfill \begin{minipage}{0.45\textwidth}
        \lstinputlisting[language={[x86masm]Assembler},firstline=12,
                         firstnumber=last]
            {src/fib7/fib7.s}
    \end{minipage}
\end{figure}

\begin{table}[h]
    \centering
    \begin{tabular}{|c|c|c|c|c|c}
        \stackfhead{+ \mhex{30}}
            & \stackfhead{+ \mhex{28}}
            & \stackfhead{+ \mhex{20}}
            & \stackfhead{+ \mhex{1c}}
            & \stackfhead{+ \mhex{4}}
            & \stackfhead{}
            \\
        \hline{}
            Return Address & \textit{Alignment space}
                & \spaced{2ex}{\lstc{fibo[7]}}
                & \spaced{4ex}{\ldots}
                & \spaced{2ex}{\lstc{fibo[0]}}
                & \textit{Next frame}
                \\
        \hline
    \end{tabular}
    \caption{Stack frame schema}\label{table:ex1_stack_schema}
\end{table}

For instance, the C source code in Listing~\ref{lst:ex1_c} above, when compiled
with \lstbash{gcc -O1 -fomit-frame-pointer -fno-stack-protector}, yields the
assembly code in Listing~\ref{lst:ex1_asm}. The memory layout of the stack
frame is presented in Table~\ref{table:ex1_stack_schema}, to help understand
how the stack frame is constructed. When interpreting the generated \ehframe{}
with \lstbash{readelf -wF}, we obtain the (slightly edited)
Listing~\ref{lst:ex1_dw}. During the function prologue, \ie{} for $\mhex{615}
\leq \reg{rip} < \mhex{619}$, the stack frame only contains the return address,
thus the CFA is 8 bytes above \reg{rsp}, and the return address is precisely at
\reg{rsp} --~that is, stored between \reg{rsp} and $\reg{rsp} + 8$. Then, the
contents of \lstc{fibo}, 8 integers of 4 bytes each, are allocated on the
stack, which takes 32 bytes. The variable \lstc{pos} is optimized out in the
generated assembly code, thus no stack space is allocated for it. Yet,
\prog{gcc} decided to allocate a total space of 48 bytes for the stack frame
for memory alignment reasons, which means subtracting 40 bytes from \reg{rsp}
(address $\mhex{615}$ in the assembly). This puts the CFA 48 bytes above
\reg{rsp}, the return address still being 8 bytes below the CFA\@. Then, by the
end of the function, the local variables are discarded and \reg{rsp} is reset
to its value from the first row.

However, DWARF data isn't actually stored as a table in the binary files, but
is instead stored as in Listing~\ref{lst:ex1_dwraw}. The first row has the
location of the first IP in the FDE, and must define at least its CFA\@. Then,
when all relevant registers are defined, it is possible to define a new row by
providing a location offset (\eg{} here $4$), and the new row is defined as a
clone of the previous one, which can then be altered (\eg{} here by setting
\lstc{CFA} to $\reg{rsp} + 48$). This means that every row is defined \wrt{}
the previous one, and that the IPs of the successive rows cannot be determined
without first evaluating every row that comes before. Thus, unwinding a frame
from an IP close to the end of the FDE will require evaluating pretty much
every DWARF row in the table before reaching the relevant information,
drastically slowing down the unwinding process.
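
The linear cost of this encoding can be illustrated by a small interpreter
restricted to two operations. This is a deliberately simplified sketch: the
types \lstc{cfa_op} and \lstc{cfa_insn} and the function \lstc{cfa_offset_at}
are hypothetical, the CFA is assumed to always be relative to \reg{rsp}, and
the queried IP is assumed to lie within the FDE\@.

\begin{lstlisting}[language=C]
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified instruction set: only two operations. */
enum cfa_op { ADVANCE_LOC, DEF_CFA_OFFSET };

struct cfa_insn {
    enum cfa_op op;
    int64_t operand; /* delta for ADVANCE_LOC, offset for DEF_CFA_OFFSET */
};

/* Replays the instructions from the start of the FDE until the row covering
 * ip is reached. Returns the CFA offset (relative to rsp) of that row, and
 * stores in *steps the number of instructions that had to be evaluated. */
int64_t cfa_offset_at(uint64_t start_ip, int64_t initial_offset,
                      const struct cfa_insn *insns, size_t nb_insns,
                      uint64_t ip, size_t *steps)
{
    uint64_t row_ip = start_ip;
    int64_t offset = initial_offset;
    size_t i;
    for (i = 0; i < nb_insns; ++i) {
        if (insns[i].op == ADVANCE_LOC) {
            if (row_ip + (uint64_t) insns[i].operand > ip)
                break; /* the next row starts after ip: stop here */
            row_ip += (uint64_t) insns[i].operand;
        } else {
            offset = insns[i].operand;
        }
    }
    *steps = i;
    return offset;
}
\end{lstlisting}

The number of evaluated instructions grows with the position of the IP inside
the FDE, since there is no way to jump directly to the relevant row.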

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{How big are FDEs?}

\begin{figure}[h]
    \centering
    \begin{tikzpicture}
        \begin{axis}[
                width=0.9\linewidth, height=4cm,
                grid=major,
                grid style={dashed,gray!30},
                xlabel=FDE row count,
                ylabel=Proportion,
                %legend style={at={(0.5,-0.2)},anchor=north},
                xtick distance=5,
                ybar, %added here
            ]
            \addplot[blue,fill] table[x=lines,y=proportion, col sep=comma]
            {data/fde_line_count.csv};

        \end{axis}
    \end{tikzpicture}
    \caption{FDE line count density}\label{fig:fde_line_density}
\end{figure}

Since evaluating an \lstc{.eh_frame} FDE takes, as seen in the previous
section, a time roughly linear in its number of rows, we must wonder what the
distribution of FDE row counts is. The histogram in
Figure~\ref{fig:fde_line_density} was generated on a random sample of around
2000 ELF files present on an ArchLinux system.

Most of the FDEs seem to be quite small, which only reflects that most
functions found in the wild are relatively small and do not allocate on the
stack many times. Yet, the median value is at $8$ rows per FDE, and the average
is at $9.7$, which is already not that fast to unwind. Values up to $50$ are
not that uncommon, since some commonly used functions have such large FDEs, and
often end up in the call stack.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Unwinding state-of-the-art}

The most commonly used library to perform stack unwinding, in the Linux
ecosystem, is \prog{libunwind}~\cite{libunwind}. While it is very robust and
quite efficient, most of its optimization comes from fine-tuned code and good
caching mechanisms. While parsing DWARF, \prog{libunwind} is forced to parse
the relevant FDE from its start, until it finds the row it was seeking.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{DWARF semantics}\label{sec:semantics}

We will now define semantics covering most of the operations used for FDEs
described in the DWARF standard~\cite{dwarf5std}, such as seen in
Listing~\ref{lst:ex1_dwraw}, with the exception of DWARF expressions. These are
not exhaustively treated because they are quite rich and would take a lot of
time and space to formalize, while being only seldom used in practice (see the
instruction coverage analysis in Section~\ref{ssec:instr_cov}).

These semantics are defined with respect to the well-formalized C language, and
go through an intermediary language. The DWARF language can read the
whole memory, as well as registers, and is always executed for some instruction
pointer. The C function representing it will thus take as parameters an array
of the registers' values as well as an IP, and will return another array of
register values, which will represent the evaluated DWARF row.

\subsection{Original language: DWARF instructions}

These are the DWARF instructions used for CFI description, that is, the
instructions that contain the stack unwinding table information. The following
list is an exhaustive list of instructions from the DWARF5
specification~\cite{dwarf5std} concerning CFI, with reworded descriptions for
brevity and clarity. All these instructions are given up to variants --~most
instructions exist in multiple formats to handle various operand formats,
to optimize space. Since we won't be talking about the underlying file format
here, the variations between, \eg{}, \dwcfa{advance\_loc1} and
\dwcfa{advance\_loc2} --~which differ only in the number of bytes of their
operand~-- are irrelevant and will be elided.

\begin{itemize}
    \item{} \dwcfa{set\_loc(loc)}:
        start a new table row from address $loc$
    \item{} \dwcfa{advance\_loc(delta)}:
        start a new table row at address $prev\_loc + delta$
    \item{} \dwcfa{def\_cfa(reg, offset)}:
        sets this row's CFA at $(\reg{reg} + \textit{offset})$
    \item{} \dwcfa{def\_cfa\_register(reg)}:
        sets CFA at $(\reg{reg} + \textit{prev\_offset})$
    \item{} \dwcfa{def\_cfa\_offset(offset)}:
        sets CFA at $(\reg{prev\_reg} + \textit{offset})$
    \item{} \dwcfa{def\_cfa\_expression(expr)}:
        sets CFA as the result of $expr$
    \item{} \dwcfa{undefined(reg)}:
        sets the register \reg{reg} as undefined in this row
    \item{} \dwcfa{same\_value(reg)}:
        declares that the register \reg{reg} hasn't been touched, or was
        restored to its previous value, in this row. An unwinding procedure can
        leave it as-is.
    \item{} \dwcfa{offset(reg, offset)}:
        the value of the register \reg{reg} is stored in memory at the address
        $CFA + \textit{offset}$.
    \item{} \dwcfa{val\_offset(reg, offset)}:
        the value of the register \reg{reg} is the value $CFA + \textit{offset}$
    \item{} \dwcfa{register(reg, model)}:
        the register \reg{reg} has, in this row, the value that $\reg{model}$
        had in the previous row
    \item{} \dwcfa{expression(reg, expr)}:
        the value of \reg{reg} is stored in memory at the address defined by
        $expr$
    \item{} \dwcfa{val\_expression(reg, expr)}:
        \reg{reg} has the value of $expr$
    \item{} \dwcfa{restore(reg)}:
        \reg{reg} has the same value as in this FDE's preamble (CIE) in this
        row. This is \emph{not implemented in this semantics} for simplicity
        and brevity (we would have to introduce CIE (preamble) and FDE (body)
        independently). This is also not much used in actual ELF
        files: the analysis in Section~\ref{ssec:instr_cov} found no such
        instruction, on a random uniform sample of 4000 ELF files.
    \item{} \dwcfa{remember\_state()}:
        push the state of all the registers of this row on an implicit stack
    \item{} \dwcfa{restore\_state()}:
        pop an entry of the implicit stack, and restore all registers in this
        row to the value held in the stack record.
    \item{} \dwcfa{nop()}:
        do nothing (padding)
\end{itemize}

\subsection{Intermediary language $\intermedlang$}

A first pass will translate DWARF instructions into this intermediary language
$\intermedlang$. It is designed to be more mathematical, representing the same
thing, but abstracting all the data compression of the DWARF format away, so
that we can better reason about it and transform it into C code.

Its grammar is as follows:

\begin{align*}
    \FDE &::= {\left(\mathbb{Z} \times \dwrow \right)}^{\ast}
        & \text{FDE (set of rows)} \\
    \dwrow &::= \values ^ \regs
        & \text{A single table row} \\
    \regs &::= \left\{0, 1, \ldots, \operatorname{NB\_REGS - 1} \right\}
        & \text{Machine registers} \\
    \values &::= \bot & \text{Values: undefined,}\\
        &\quad\vert~\valaddr{\spexpr} & \text{at address $x$},\\
        &\quad\vert~\valval{\spexpr} & \text{of value $x$} \\
    \spexpr &::= \regs \times \mathbb{Z}
        & \text{A ``simple'' expression $\reg{reg} + \textit{offset}$} \\
\end{align*}

The entry point of the grammar is a $\FDE$, which is a sequence of rows, each
annotated with a machine address, the address from which it is valid. Note that
the addresses are necessarily increasing within an FDE\@.

Each row then represents, as a function mapping registers to values, a row of
the unwinding table.

We implicitly consider that $\reg{reg}$ maps to a number, and we use here
\texttt{x86\_64} names for convenience, but actually in DWARF registers are
only handled as register identifiers, so we can safely state that $\reg{reg}
\in \regs$.

A value can then be undefined, stored at memory address $x$ or be directly a
value $x$, $x$ being here a simple expression consisting of $\reg{reg} +
\textit{offset}$. The CFA is considered a simple register here. For instance,
to define $\reg{rax}$ to the value contained in memory 16 bytes below the CFA,
we would have $\reg{rax} \mapsto \valaddr{\reg{CFA}, -16}$, since the stack
grows downwards.
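
As an illustration, the first two rows of the example of
Listing~\ref{lst:ex1_dw}, as described earlier (CFA at $\reg{rsp} + 8$ from
$\mhex{615}$, then at $\reg{rsp} + 48$ from $\mhex{619}$, the return address
being stored 8 bytes below the CFA in both cases), could be written in
$\intermedlang$ as the sequence below. This is only a sketch: writing
$\reg{RA}$ for the return address column and mapping every other register to
$\bot$ are assumptions made for this example.

\begin{align*}
    & \left(\mhex{615},\ \left\{
        \reg{CFA} \mapsto \valval{\reg{rsp}, 8},\
        \reg{RA} \mapsto \valaddr{\reg{CFA}, -8},\
        \text{others} \mapsto \bot
        \right\}\right) \\
    \cdot\ & \left(\mhex{619},\ \left\{
        \reg{CFA} \mapsto \valval{\reg{rsp}, 48},\
        \reg{RA} \mapsto \valaddr{\reg{CFA}, -8},\
        \text{others} \mapsto \bot
        \right\}\right)
\end{align*}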

\subsection{Target language: a C function body}

The target language of these semantics is a C function, to be interpreted with
respect to the C11 standard~\cite{c11std}. The function is supposed to be run
in the context of the program being unwound. In particular, it must be able to
dereference some pointer derived from DWARF instructions that will point to the
execution stack, or even the heap.

This function takes as arguments an instruction pointer --~supposedly
extracted from $\reg{rip}$~-- and an array of register values; and returns a
fresh array of register values after unwinding this call frame. The function is
compositional: it can be called twice in a row to unwind two stack frames,
unless the IP obtained after the first unwinding comes from another shared
object file, for instance a call to \prog{libc}. In this case, unwinding the
second frame will require loading the corresponding DWARF information.

The function is the following:

\lstinputlisting[language=C]{src/dw_semantics/c_context.c}

The translation of $\intermedlang$, as produced by the function defined later,
is then to be inserted in this context, where the comment states so.

\subsection{From DWARF to $\intermedlang$}

To define the interpretation of $\DWARF$ to $\intermedlang$, we will need to
proceed forward, but, as the language inherently depends on the previous
instructions to give a meaning to the following ones, each step will depend on
what was computed before. At a point of the interpretation $h \vert t$, where
$h$ is what has already been interpreted, $t$ what remains to be interpreted,
and $H$ the result of the interpretation of $h$, the next step would thus look
like $\llbracket t \rrbracket (H)$.

But we also need to keep track of this implicit stack DWARF uses, which will be
kept in subscript.

\medskip

Thus, we define $\semI{\bullet}{s}(\bullet) : \DWARF \times \FDE \to \FDE$, for
$s$ a stack of $\dwrow$, that is,
\[
    s \in \rowstack := \dwrow^\ast
\]

Implicitly, $\semI{\bullet}{} := \semI{\bullet}{\varepsilon}$.

\medskip

For convenience, we define $\insarrow{reg}$, the operator changing the value of
a register for a given value in the last row, as

\[
    \left(f \in \FDE\right) \insarrow{$r \in \regs$} (v \in \values)
    \quad := \quad
    \left( f\left[0 \ldots |f| - 2\right] \right) \cdot \left\{
        \begin{array}{r l}
            r' \neq r &\mapsto \left(f[-1]\right)(r') \\
            r &\mapsto v \\
        \end{array} \right.
\]

The same way, we define $\extrarrow{reg}$ that \emph{extracts} the rule
currently applied for $\reg{reg}$, \eg{} $F \extrarrow{CFA} \valval{\reg{reg} +
\text{off}}$. If the rule currently applied in such a case is \emph{not} of the
form $\reg{reg} + \text{off}$, then the program is considered erroneous. One
can see this $\extrarrow{reg}$ somehow as a \lstc{match} statement in OCaml,
but with only one case, allowing to retrieve packed data.

More generally, we define ${\extrarrow{reg}}^{-k}$ as the same operation, but
extracting from the row $k$ positions before the last one, \ie{}
${\extrarrow{reg}}^{0}$ is the same as $\extrarrow{reg}$, and $F
{\extrarrow{reg}}^{-1} \bullet$ is the same as
$F\left[0 \ldots |F|-2\right] \extrarrow{reg} \bullet$.

\begin{align*}
    \semI{\varepsilon}{s}(F) &:= F \\
    \semI{\dwcfa{set\_loc(loc)} \cdot d}{s}(F) &:=
        \contsem{F \cdot \left(loc, F[-1].row \right)} \\
    \semI{\dwcfa{advance\_loc(delta)} \cdot d}{s}(F) &:=
        \contsem{F \cdot \left(F[-1].addr + delta, F[-1].row \right)} \\
    \semI{\dwcfa{def\_cfa(reg, offset)} \cdot d}{s}(F) &:=
        \contsem{F \insarrow{CFA} \valval{\reg{reg} + offset}} \\
    \semI{\dwcfa{def\_cfa\_register(reg)} \cdot d}{s}(F) &:=
        \text{let F }\extrarrow{CFA} \valval{\reg{oldreg} + \text{oldoffset}}
        \text{ in} \\
        &\quad \contsem{F \insarrow{CFA} \valval{\reg{reg} + oldoffset}} \\
    \semI{\dwcfa{def\_cfa\_offset(offset)} \cdot d}{s}(F) &:=
        \text{let F }\extrarrow{CFA} \valval{\reg{oldreg} + \text{oldoffset}}
        \text{ in} \\
        &\quad \contsem{F \insarrow{CFA} \valval{\reg{oldreg} + offset}} \\
    \semI{\dwcfa{def\_cfa\_expression(expr)} \cdot d}{s}(F) &:=
        \text{TO BE DEFINED} &\qtodo{CHECK ME?} \\
    \semI{\dwcfa{undefined(reg)} \cdot d}{s}(F) &:=
        \contsem{F \insarrow{reg} \bot} \\
    \semI{\dwcfa{same\_value(reg)} \cdot d}{s}(F) &:=
        \contsem{F \insarrow{reg} \valval{\reg{reg} + 0}} \\
    \semI{\dwcfa{offset(reg, offset)} \cdot d}{s}(F) &:=
        \contsem{F \insarrow{reg} \valaddr{\reg{CFA} + \textit{offset}}} \\
    \semI{\dwcfa{val\_offset(reg, offset)} \cdot d}{s}(F) &:=
        \contsem{F \insarrow{reg} \valval{\reg{CFA} + \textit{offset}}} \\
    \semI{\dwcfa{register(reg, model)} \cdot d}{s}(F) &:=
        \text{let } F {\extrarrow{model}}^{-1} r \text{ in }
        \contsem{F \insarrow{reg} r} \\
    \semI{\dwcfa{expression(reg, expr)} \cdot d}{s}(F) &:=
        \text{TO BE DEFINED} &\qtodo{CHECK ME?}\\
    \semI{\dwcfa{val\_expression(reg, expr)} \cdot d}{s}(F) &:=
        \text{TO BE DEFINED} &\qtodo{CHECK ME?}\\
%    \semI{\dwcfa{restore(reg)} \cdot d}{s}(F) &:= \\  %% NOT IMPLEMENTED
    \semI{\dwcfa{remember\_state()} \cdot d}{s}(F) &:=
        \semI{d}{s \cdot F[-1].row}\left(F\right) \\
    \semI{\dwcfa{restore\_state()} \cdot d}{s \cdot t}(F) &:=
        \semI{d}{s}\left(F\left[0 \ldots |F|-2\right] \cdot
        \left(F[-1].addr, t\right) \right) \\
    \semI{\dwcfa{nop()} \cdot d}{s}(F) &:= \contsem{F}\\
\end{align*}

The stack is used for \texttt{remember\_state} and \texttt{restore\_state}. If
we omit those two operations, we can plainly remove the stack.
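
As a worked example of these rules, consider the two instructions discussed
around Listing~\ref{lst:ex1_dwraw}. We assume that the FDE interpreted so far
is the single row $F_0 = \left(\mhex{615}, \rho_0\right)$, with
$\rho_0(\reg{CFA}) = \valval{\reg{rsp} + 8}$, and we write
$\rho_0\left[\reg{CFA} \mapsto v\right]$ for the row $\rho_0$ in which only the
rule for $\reg{CFA}$ is replaced by $v$. Both $\rho_0$ and this bracket
notation are introduced here for illustration only. Interpreting
\dwcfa{advance\_loc(4)} followed by \dwcfa{def\_cfa\_offset(48)} then unfolds
as:

\begin{align*}
    & \semI{\dwcfa{advance\_loc(4)} \cdot
        \dwcfa{def\_cfa\_offset(48)}}{\varepsilon}\left(F_0\right) \\
    &\quad= \semI{\dwcfa{def\_cfa\_offset(48)}}{\varepsilon}
        \left(F_0 \cdot \left(\mhex{615} + 4, \rho_0\right)\right) \\
    &\quad= \semI{\varepsilon}{\varepsilon}
        \left(F_0 \cdot \left(\mhex{619},
        \rho_0\left[\reg{CFA} \mapsto \valval{\reg{rsp} + 48}\right]
        \right)\right) \\
    &\quad= F_0 \cdot \left(\mhex{619},
        \rho_0\left[\reg{CFA} \mapsto \valval{\reg{rsp} + 48}\right]\right)
\end{align*}

The first step clones the last row at the new address; the second one extracts
$\valval{\reg{rsp} + 8}$ with $\extrarrow{CFA}$, keeps the register and
replaces the offset.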


\subsection{From $\intermedlang$ to C}

\textit{This only defines the semantics, with respect to standard C, of DWARF
as interpreted by \ehelf\@. The actual DWARF to C compiler is not implemented
this way.}

\medskip

We now define $\semC{\bullet} : \intermedlang \to C$, in the context presented
earlier. The translation from $\intermedlang$ to C is defined as follows:

\begin{itemize}
    \item $\semC{\varepsilon} =$ \\
        \begin{lstlisting}[language=C, mathescape=true]
            else {
                for(int reg=0; reg < NB_REGS; ++reg)
                    new_ctx[reg] = $\semR{\bot}$;
            }
        \end{lstlisting}

    \item $\semC{(\text{loc}, \text{row}) \cdot t} = C\_code \cdot \semC{t}$,
        where $C\_code$ is
        \begin{lstlisting}[language=C, mathescape=true]
            if(ip >= $loc$) {
                for(int reg=0; reg < NB_REGS; ++reg)
                    new_ctx[reg] = $\semR{row[reg]}$;
                goto end_ifs; // Avoid if/else if problems
            }
        \end{lstlisting}
\end{itemize}

and $\semR{\bullet}$ is defined as
\begin{align*}
    \semR{\bot} &= \text{\lstc{ERROR_VALUE}} \\
    \semR{\valaddr{\text{reg}, \textit{offset}}} &=
        \text{\lstc{*(old_ctx[reg] + offset)}} \\
    \semR{\valval{\text{reg}, \textit{offset}}} &=
        \text{\lstc{(old_ctx[reg] + offset)}} \\
\end{align*}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Stack unwinding data compilation}

The approach chosen to obtain better unwinding speeds, at a reasonable cost in
space, was to compile the \ehframe{} directly into native machine code on the
x86\_64 platform.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Compilation: \ehelfs}\label{ssec:ehelfs}

The rough idea of the compilation is to produce, out of the \ehframe{} section
of a binary, C code that resembles the code shown in the DWARF semantics from
Section~\ref{sec:semantics} above. This C code is then compiled by GCC with
\lstbash{-O2}, which already provides a good level of optimization, whereas
compiling with \lstbash{-O3} takes far too long. This saves us the trouble of
optimizing the generated C code wherever GCC already does so by itself.

The generated code consists of a single monolithic function, \lstc{_eh_elf},
taking as arguments an instruction pointer and a memory context (\ie{} the
value of the various machine registers) as defined in
Listing~\ref{lst:unw_ctx}. The function will then return a fresh memory
context, containing the values the registers hold after unwinding this frame.

The body of the function itself is mostly a huge switch, taking advantage of
the non-standard --~yet widely implemented in C compilers~-- syntax for range
switches, in which each \lstinline{case} can refer to a range. All the FDEs are
merged together into this switch, each row of an FDE being a switch case.
Separating the various FDEs in the C code --~other than with comments~-- is,
unlike what is done in DWARF, pointless, since accessing a ``row'' has a linear
cost, and the C code is not meant to be read, except maybe for debugging
purposes. The bodies of the switch cases fill a context with the unwound
values, then return it.
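
To make the shape of such a range switch concrete, here is a minimal sketch.
The addresses, the two rows, the simplified \lstc{ctx_t} type and the function
name \lstc{unwind_switch_sketch} are made up for the example and do not come
from the generated code; the sketch assumes a 64-bit target and relies on the
GNU C case range extension (\lstc{case low ... high:}).

\begin{lstlisting}[language=C]
#include <stdint.h>

/* Hypothetical, simplified context (not the one of the actual eh_elfs). */
typedef struct {
    uintptr_t rip, rsp, rbp;
    int error;
} ctx_t;

/* One switch case per row; each case covers the range of instruction
 * pointers for which the row is valid. Assumes 8-byte stack slots. */
ctx_t unwind_switch_sketch(ctx_t old, uintptr_t ip) {
    ctx_t out = old;
    uintptr_t cfa;
    out.error = 0;
    switch (ip) {
        case 0x1000 ... 0x1003:   /* CFA = rsp + 8 */
            cfa = old.rsp + 8;
            out.rip = *(uintptr_t *)(cfa - 8);   /* return address */
            out.rsp = cfa;
            return out;
        case 0x1004 ... 0x1020:   /* CFA = rsp + 16, rbp at CFA - 16 */
            cfa = old.rsp + 16;
            out.rbp = *(uintptr_t *)(cfa - 16);
            out.rip = *(uintptr_t *)(cfa - 8);
            out.rsp = cfa;
            return out;
        default:                  /* no row covers this address */
            out.error = 1;
            return out;
    }
}
\end{lstlisting}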

A setting of the compiler also optionally enables another parameter to the
\lstc{_eh_elf} function, \lstc{deref}, which is a function pointer. This
\lstc{deref} function, when present, replaces the dereferencing \lstc{*}
operator everywhere, and can be used to generate \ehelfs{} that work on
remote address spaces, that is, whenever the unwinding is not performed on the
process reading the \ehelf{} itself, but on some other process, or even on a
stack dump of a long-terminated process.
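
The following sketch illustrates the idea. The signature \lstc{deref_func_t},
the global \lstc{g_dump} and all the function names are assumptions made for
the example; the actual \lstc{deref} parameter may well have a different
signature.

\begin{lstlisting}[language=C]
#include <stddef.h>
#include <stdint.h>

/* Hypothetical signature: read one machine word at a given address. */
typedef uintptr_t (*deref_func_t)(uintptr_t addr);

/* Local unwinding: a plain memory read in the current process. */
uintptr_t deref_local(uintptr_t addr) {
    return *(const uintptr_t *)addr;
}

/* Hypothetical stack dump: words[i] is the word that was stored at
 * address base + i * sizeof(uintptr_t) in the dumped process. */
typedef struct {
    uintptr_t base;
    const uintptr_t *words;
    size_t nb_words;
} stack_dump_t;

stack_dump_t g_dump;

/* Remote unwinding: read from the dump; 0 if the address is outside it. */
uintptr_t deref_dump(uintptr_t addr) {
    size_t idx = (addr - g_dump.base) / sizeof(uintptr_t);
    return idx < g_dump.nb_words ? g_dump.words[idx] : 0;
}

/* What the generated code would do instead of reading the word just
 * below the CFA with the * operator. */
uintptr_t load_return_address(uintptr_t cfa, deref_func_t deref) {
    return deref(cfa - sizeof(uintptr_t));
}
\end{lstlisting}

With such a design, the same generated code serves both local and remote
unwinding: only the function passed as \lstc{deref} changes.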

Unlike in the \ehframe, and unlike what should be done in a release-quality,
real-world-proof version of the \ehelfs, the choice was made to keep this
implementation simple, and to handle only the few registers needed to
unwind the stack. Thus, the only registers handled in \ehelfs{} are
\reg{rip}, \reg{rbp}, \reg{rsp} and \reg{rbx}, the latter being used quite
often in \prog{libc} to hold the CFA address. This is enough to unwind the
stack reliably, and thus enough for profiling, but is not sufficient to analyze
every stack frame as \prog{gdb} would do after a \lstbash{frame n} command.
Yet, extending the code to handle every register would not be much harder, and
would probably require only a few hours of code refactoring and rewriting.

\lstinputlisting[language=C, caption={Unwinding context}, label={lst:unw_ctx}]
    {src/dwarf_assembly_context/unwind_context.c}

In the unwind context from Listing~\ref{lst:unw_ctx}, the values of type
\lstc{uintptr_t} are the values of the corresponding registers, and
\lstc{flags} is an 8-bit value, indicating for each register whether it is
present or not in this context, plus an error bit, indicating whether an error
occurred during unwinding. Such errors can be due \eg{} to an unsupported
operation in the original DWARF\@.
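
As an illustration of how such a flags field can be manipulated, here is a
short sketch. The bit assignment, the type \lstc{sketch_context_t} and the
helper names are hypothetical; the actual layout is the one given in
Listing~\ref{lst:unw_ctx}.

\begin{lstlisting}[language=C]
#include <stdint.h>

/* Hypothetical bit assignment: one presence bit per register, plus an
 * error bit. */
enum {
    FLAG_RIP   = 1u << 0,
    FLAG_RSP   = 1u << 1,
    FLAG_RBP   = 1u << 2,
    FLAG_RBX   = 1u << 3,
    FLAG_ERROR = 1u << 7
};

/* Hypothetical context, mirroring the description in the text. */
typedef struct {
    uint8_t flags;
    uintptr_t rip, rsp, rbp, rbx;
} sketch_context_t;

/* Returns 1 if the given flag bit is set in the context, 0 otherwise. */
int has_flag(const sketch_context_t *ctx, unsigned flag) {
    return (ctx->flags & flag) != 0;
}

/* Stores a value for rbp and marks it as present. */
void set_rbp(sketch_context_t *ctx, uintptr_t value) {
    ctx->rbp = value;
    ctx->flags = (uint8_t)(ctx->flags | FLAG_RBP);
}

/* Marks the context as erroneous, e.g. on an unsupported operation. */
void set_error(sketch_context_t *ctx) {
    ctx->flags = (uint8_t)(ctx->flags | FLAG_ERROR);
}
\end{lstlisting}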

This generated data is stored in separate shared object files, which we call
\ehelfs. It would have been possible to alter the original ELF file to embed
this data as a new section, but getting it to be executed just as any
portion of the \lstc{.text} section would probably have been painful, and
keeping it separate during the experimental phase is quite convenient. It is
possible to have multiple versions of \ehelfs{} files in parallel, with various
options turned on or off, and it does not require altering the base system by
editing \eg{} \texttt{/usr/lib/libc-*.so}. Instead, when the \ehelf{} data is
required, those files can simply be \lstc{dlopen}'d. It is also possible to
imagine, in a future production environment, packaging \ehelfs{} files
separately, so that people interested in heavy computation can choose to
install them.

\medskip

\lstinputlisting[language=C, caption={\ehelf{} for the previous example},
                 label={lst:fib7_eh_elf_basic}]
                 {src/fib7/fib7.eh_elf_basic.c}

The C code in Listing~\ref{lst:fib7_eh_elf_basic} is a part of what was
generated for the C code in Listing~\ref{lst:ex1_c}.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{First results}

Without any particular care for efficiency or compactness, it is already
possible to produce a compiled version very close to the one described in
Section~\ref{sec:semantics}. Although the unwinding speed cannot actually be
benchmarked yet, it is already possible to write, in a few hundred lines of
C code, a simple stack walker printing the functions traversed. It already works
without any problem on the easily tested cases, since corner cases are mostly
found in standard and highly optimized libraries, and it is not that easy to get
the program to stop and print a stack trace from within a system library
without using a debugger.

The major drawback of this approach, when no particular care is taken, is its
space overhead. The space taken by those tentative \ehelfs{} is analyzed in
Table~\ref{table:basic_eh_elf_space} for \prog{hackbench}, a small program
introduced later in Section~\ref{ssec:bench_perf}, and the libraries on which
it depends.


\begin{table}[h]
    \centering
    \begin{tabular}{r r r r r r}
        \toprule
        \thead{Shared object} & \thead{Original \\ program size}
            & \thead{Original \\ \lstc{.eh\_frame}}
            & \thead{Generated \\ \ehelf{} \lstc{.text}}
            & \thead{\% of original \\ program size}
            & \thead{Growth \\ factor} \\
        \midrule
        libc-2.27.so & 1.4 MiB & 130.1 KiB & 914.9 KiB & 63.92 & 7.03 \\
        libpthread-2.27.so & 58.1 KiB & 11.6 KiB & 70.5 KiB & 121.48 & 6.09 \\
        ld-2.27.so & 129.6 KiB & 9.6 KiB & 71.7 KiB & 55.34 & 7.44 \\
        hackbench & 2.9 KiB & 568.0 B & 2.1 KiB & 74.78 & 3.97 \\
        Total & 1.6 MiB & 151.8 KiB & 1.0 MiB & 65.32 & 6.98 \\
        \bottomrule
    \end{tabular}

    \caption{Basic \ehelfs{} space usage}\label{table:basic_eh_elf_space}
\end{table}

The original program size only includes the sizes of the ELF sections
\lstc{.text} (the program itself) and \lstc{.rodata}, the read-only data (such
as static strings, etc.). Only the size of the \lstc{.text} section of the
generated \ehelfs{} is considered, because it is self-contained (little or no
data is stored in \lstc{.rodata}), and the other sections could be removed if
the \ehelfs{} \lstc{.text} was somehow embedded in the original shared object.

This first tentative version of \ehelfs{} is roughly 7 times heavier than the
original \lstc{.eh_frame}, and represents a far too significant proportion of
the original program size.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Space optimization}\label{ssec:space_optim}

A lot of small space optimizations, such as filtering out empty FDEs, merging
together the rows that are equivalent on all the registers kept, etc.\ were
made in order to shrink the \ehelfs.

\medskip

The major optimization, which reduced the output size the most, was to use an
if/else tree implementing a binary search on the relevant program counter
intervals, instead of a huge switch. In the process, we also \emph{outline} a
lot of code, that is, find identical ``switch case'' bodies --~which are not
switch cases anymore, but if bodies~--, move them outside of the if/else tree,
identify them by a label, and jump to them using a \lstc{goto}, which
de-duplicates a lot of code and contributes greatly to the shrinking. In doing
so, we noticed that the vast majority of FDE rows are actually taken among
very few ``common'' FDE rows.

This makes this optimization really efficient, as seen later in
Section~\ref{ssec:results_size}, but also makes it an interesting question
--~not investigated during this internship~-- to find out whether standard
DWARF data could be efficiently compressed in this way.
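
The combination of a binary search tree and outlining can be sketched as
follows. The four intervals, the two shared row bodies, the simplified
\lstc{ctx_t} type and the function name \lstc{unwind_tree_sketch} are made up
for the example (64-bit target assumed); actual generated code is of course
much larger.

\begin{lstlisting}[language=C]
#include <stdint.h>

/* Hypothetical, simplified context (not the one of the actual eh_elfs). */
typedef struct {
    uintptr_t rip, rsp, rbp;
    int error;
} ctx_t;

ctx_t unwind_tree_sketch(ctx_t old, uintptr_t ip) {
    ctx_t out = old;
    uintptr_t cfa;
    out.error = 0;

    /* If/else tree: binary search on the interval containing ip. */
    if (ip < 0x2000) {
        if (ip < 0x1000) goto no_row;
        if (ip < 0x1004) goto cfa_rsp_8;
        goto cfa_rsp_16;
    } else {
        if (ip < 0x2004) goto cfa_rsp_8;
        if (ip < 0x2040) goto cfa_rsp_16;
        goto no_row;
    }

    /* Outlined bodies: each distinct row is emitted only once. */
cfa_rsp_8:                       /* CFA = rsp + 8 */
    cfa = old.rsp + 8;
    out.rip = *(uintptr_t *)(cfa - 8);
    out.rsp = cfa;
    return out;
cfa_rsp_16:                      /* CFA = rsp + 16, rbp at CFA - 16 */
    cfa = old.rsp + 16;
    out.rbp = *(uintptr_t *)(cfa - 16);
    out.rip = *(uintptr_t *)(cfa - 8);
    out.rsp = cfa;
    return out;
no_row:                          /* no row covers this address */
    out.error = 1;
    return out;
}
\end{lstlisting}

Here, four intervals share two row bodies; the more often a row is repeated
across FDEs, the more the outlining pays off.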

\begin{minipage}{0.45\textwidth}
    \lstinputlisting[language=C, caption={\ehelf{} for the previous example, with outlining},
                     label={lst:fib7_eh_elf_outline},
                     lastline=18]
                     {src/fib7/fib7.eh_elf_outline.c}
\end{minipage} \hfill \begin{minipage}{0.45\textwidth}
    \lstinputlisting[language=C, firstnumber=last, firstline=19]
                     {src/fib7/fib7.eh_elf_outline.c}
\end{minipage}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Benchmarking}

Benchmarking turned out to be, quite surprisingly, the hardest part of the
project. It ended up requiring a lot of investigation to find a working
protocol, and afterwards, a good deal of code reading and coding to get the
solution working.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Requirements}\label{ssec:bench_req}

To provide relevant benchmarks of the \ehelfs{} performance, one must sample at
least a few hundred or thousand stack unwindings, since a single frame
unwinding with regular DWARF takes on the order of $10\,\mu s$, and
\ehelfs{} were expected to have significantly better performance.

However, unwinding over and over again from the same program point would have
been pointless, since \prog{libunwind} would simply have cached the
relevant DWARF row. Conversely, making sure that the various unwindings
are made from different locations is somewhat cheating, since it defeats
\prog{libunwind}'s caching and does not reproduce a ``real-world'' unwinding
distribution. All in all, the benchmarking method must have a ``natural''
distribution of unwindings.

Another requirement is to also distribute quite evenly the unwinding points
across the program: we would like to benchmark stack unwindings crossing some
standard library functions, starting from inside them, etc.

Finally, the unwound program must be interesting enough to enter and exit a lot
of functions, nest function calls, have FDEs that are not as simple as in
Listing~\ref{lst:ex1_dw}, etc.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Presentation of \prog{perf}}\label{ssec:perf}

\prog{Perf} is a \emph{profiler} that comes with the Linux ecosystem, and is
even developed within the Linux kernel source tree. A profiler is an important
tool from the developer's toolbox that analyzes the performance of programs by
recording the time spent in each function, including within nested calls. This
analysis often enables programmers to optimize critical paths and functions in
their programs, while leaving unoptimized functions that are seldom traversed.

\prog{Perf} is a \emph{polling} profiler, as opposed to
\emph{instrumenting} profilers. This means that with \prog{perf}, the basic
idea is to stop the traced program at regular intervals, unwind its stack,
write down the current nested function calls, and integrate the sampled data in
the end. Instrumenting profilers, on the other hand, do not interrupt the
program, but instead inject code in it.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Benchmarking with \prog{perf}}\label{ssec:bench_perf}

In the context of this internship, the main advantage of \prog{perf} is that it
does a lot of stack unwinding. It also meets all the requirements from
Section~\ref{ssec:bench_req} above: since it stops at regular intervals and
unwinds, the unwindings are evenly distributed \wrt{} the frequency of
execution of the code, which is a natural enough setup for the benchmarks to be
meaningful, while still unwinding from diversified locations, preventing
caching from being overwhelming. It also has the ability to unwind from
within any function, including functions of linked shared libraries. It can also
be applied to virtually any program, which allows unwinding ``interesting''
code.

The program that was chosen for \prog{perf}-benchmarking is
\prog{hackbench}~\cite{hackbenchsrc}. This small program is designed to
stress-test and benchmark the Linux scheduler by spawning processes or threads
that communicate with each other. It has the advantage of generating stack
activity, being linked against \prog{libc} and \prog{pthread}, and being very
light.

\medskip

Interfacing \ehelfs{} with \prog{perf} first required forking
\prog{libunwind} and implementing \ehelfs{} support in it. In the process, it
turned out to be necessary to slightly modify \prog{libunwind}'s interface to
add a parameter to an initialization function, since \prog{libunwind} is made
to be agnostic of the system and process as much as possible, to be able to
unwind in any context. The very restricted information it receives lacked a
\emph{memory map}, a table indicating which shared object is mapped at which
address in memory, which is needed to use \ehelfs. Apart from this, the
modified version of \prog{libunwind}
produced is entirely compatible with the vanilla version. This means that the
only modifications required to use \ehelfs{} within any project using
\prog{libunwind} should be changing one line of code to add one parameter to a
function call and linking against the modified version of \prog{libunwind}
instead of the system version.

Once this was done, plugging it into \prog{perf} was a matter of a few lines of
code only, apart from the benchmarking code. The major problem encountered was
to understand how \prog{perf} works. In order to avoid perturbing the traced
program, \prog{perf} does not unwind at runtime, but rather records at regular
intervals the program's stack, and all the auxiliary information that is needed
to unwind later. This is done when running \lstbash{perf record}. Then,
\lstbash{perf report} unwinds the stack to analyze it; but at this point in
time, the traced process is long dead, thus any PID-based approach, or any
approach using \texttt{/proc} information, will fail. However, as this was the
easiest method, the first version of \ehelfs{} used those mechanisms, and some
code thus had to be rewritten.

The modified versions of both \prog{perf} and \prog{libunwind} are present in
the repositories \prog{perf-eh\_elf} and \prog{libunwind-eh\_elf}.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Other explored methods}

The first benchmarking approach tried was to create some specific C code
that would meet the requirements from Section~\ref{ssec:bench_req}, while
itself calling a benchmarking procedure from time to time. This was abandoned
quite quickly, because generating C code interesting enough to be unwound
turned out to be hard, and the generated FDEs invariably ended up
uninteresting. It would also never have met the requirement of unwinding from
fairly distributed locations anyway.

Another attempt was made using CSmith~\cite{csmith}, a random C code generator
initially made for the random testing of C compilers. The idea was still to craft an
interesting C program that would unwind on its own frequently, but to integrate
CSmith-randomly generated C code within hand-written C snippets that
would generate large enough FDEs and nested calls. This was abandoned as well
as the call graph of a CSmith-generated code is often far too small, and the
CSmith code is notoriously hard to understand and edit.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Results}

\subsection{Hardware used}\label{ssec:bench_hw}

All the measurements in this report were made on a computer with an Intel Xeon
E3-1505M v6 CPU, with a clock frequency of $3.00$\,GHz and 8 cores. The
computer has 32\,GB of RAM, and care was taken never to fill it and start
swapping.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Measured time performance}

A benchmarking of \ehelfs{} against the vanilla \prog{libunwind} was made using
the exact same methodology as in Section~\ref{ssec:bench_perf}, only linking
\prog{perf} against the vanilla \prog{libunwind}. It yields the results in
Table~\ref{table:bench_time}.

\begin{table}[h]
    \centering
    \begin{tabular}{l r r r r r}
        \toprule
        \thead{Unwinding method} & \thead{Frames \\ unwound}
            & \thead{Total time \\ unwinding ($\mu s$)}
            & \thead{Average time \\ per frame ($ns$)}
            & \thead{Unwinding \\ errors}
            & \thead{Time ratio} \\
        \midrule
        \ehelfs{}
            & 23506 % Frames unwound
            & 14837 % Total time
            & 631 % Avg time
            & 1099 % # Errors
            & 1
            \\
        \prog{libunwind}, cached
            & 27058 % Frames unwound
            & 441601 % Total time
            & 16320 % Avg time
            & 885 % # Errors
            & 25.9
            \\
        \prog{libunwind}, uncached
            & 27058 % Frames unwound
            & 671292 % Total time
            & 24809 % Avg time
            & 885 % # Errors
            & 39.3
            \\
        \bottomrule
    \end{tabular}

    \caption{Time benchmarking on hackbench}\label{table:bench_time}
\end{table}
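
For reference, the two derived columns of Table~\ref{table:bench_time} follow
directly from the measured ones. For instance, using the figures of the table
for \ehelfs{} and for the cached \prog{libunwind}:
\begin{align*}
    \text{average time per frame (\ehelfs)}
        &= \frac{14837\,\mu s}{23506} \approx 0.631\,\mu s = 631\,ns \\
    \text{average time per frame (\prog{libunwind}, cached)}
        &= \frac{441601\,\mu s}{27058} \approx 16.32\,\mu s \\
    \text{time ratio (\prog{libunwind}, cached)}
        &= \frac{16320\,ns}{631\,ns} \approx 25.9 \\
    \text{time ratio (\prog{libunwind}, uncached)}
        &= \frac{24809\,ns}{631\,ns} \approx 39.3
\end{align*}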

The performance of \ehelfs{} is probably overestimated compared to a
production-ready version, since \ehelfs{} do not handle all registers from the
original DWARF file, and thus the \prog{libunwind} version must perform more
computation. However, this overhead, although impossible to measure without
first implementing support for every register, would probably not be that big,
since most of the time is spent finding the relevant row. Support for every
DWARF instruction, however, would not slow down the implementation at all,
since every instruction would simply be compiled to x86\_64 without affecting
the already supported code.

It is also worth noting that the compilation time of \ehelfs{} is
reasonably short. On the machine described in Section~\ref{ssec:bench_hw}, and
without using multiple cores to compile, the various shared objects needed to
run \prog{hackbench} --~that is, \prog{hackbench}, \prog{libc}, \prog{ld} and
\prog{libpthread}~-- are compiled in an overall time of $25.28$ seconds.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Measured compactness}\label{ssec:results_size}

A first measure of compactness was given earlier in this report, for one of the
earliest working versions, in Table~\ref{table:basic_eh_elf_space}.

The same data, generated for the latest version of \ehelfs, can be seen in
Table~\ref{table:bench_space}.

The effect of the outlining mentioned in Section~\ref{ssec:space_optim} is
particularly visible in this table: \prog{hackbench} has a significantly bigger
growth than the other shared objects. This is because \prog{hackbench} has a
much smaller \lstc{.eh_frame}, thus, the outlined data is reused only a few
times, compared to \eg{} \prog{libc}, in which the outlined data is reused a
lot.

Just as with time performance, the measured compactness would be impacted by
supporting every register, but probably not that much either, since most
columns are concerned with the four supported registers (see
Section~\ref{ssec:instr_cov}).

\begin{table}[h]
    \centering
    \begin{tabular}{r r r r r r}
        \toprule
        \thead{Shared object} & \thead{Original \\ program size}
            & \thead{Original \\ \lstc{.eh\_frame}}
            & \thead{Generated \\ \ehelf{} \lstc{.text}}
            & \thead{\% of original \\ program size}
            & \thead{Growth \\ factor} \\
        \midrule
            libc-2.27.so
                & 1.4 MiB & 130.1 KiB & 313.2 KiB & 21.88 & 2.41 \\
            libpthread-2.27.so
                & 58.1 KiB & 11.6 KiB & 25.4 KiB & 43.71 & 2.19 \\
            ld-2.27.so
                & 129.6 KiB & 9.6 KiB & 28.6 KiB & 22.09 & 2.97 \\
            hackbench
                & 2.9 KiB & 568.0 B & 2.8 KiB & 93.87 & 4.99 \\
            Total
                & 1.6 MiB & 151.8 KiB & 370.0 KiB & 22.81 & 2.44 \\
        \bottomrule
    \end{tabular}

    \caption{\ehelfs{} space usage}\label{table:bench_space}
\end{table}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Instructions coverage}\label{ssec:instr_cov}

We also measured what proportion of the DWARF instructions found in real-world
ELF files is covered by our compiler and \ehelfs.

The method chosen was to take a uniform random sample of 4000 ELFs among those
present on a basic Arch Linux system setup, in the directories \texttt{/bin},
\texttt{/lib}, \texttt{/usr/bin}, \texttt{/usr/lib} and their subdirectories,
making sure those files were ELF64 files, then gathering statistics on those
files.
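
One possible way of checking that a file is an ELF64 file is sketched below; it
is not necessarily the check that was used for this sampling. It only relies on
the ELF identification bytes: the magic number \lstc{0x7f 'E' 'L' 'F'}, followed
by the class byte, whose value is 2 for 64-bit objects. The function names are
hypothetical.

\begin{lstlisting}[language=C]
#include <stddef.h>
#include <stdio.h>

/* Returns 1 if buf starts with the identification bytes of an ELF64
 * file, 0 otherwise. */
int is_elf64_header(const unsigned char *buf, size_t len) {
    if (len < 5)
        return 0;
    if (buf[0] != 0x7f || buf[1] != 'E' || buf[2] != 'L' || buf[3] != 'F')
        return 0;
    return buf[4] == 2;   /* EI_CLASS == ELFCLASS64 */
}

/* Same check on a file; returns 0 if the file cannot be opened. */
int is_elf64_file(const char *path) {
    unsigned char header[5];
    size_t nb_read;
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return 0;
    nb_read = fread(header, 1, sizeof(header), f);
    fclose(f);
    return is_elf64_header(header, nb_read);
}
\end{lstlisting}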

\begin{table}[h]
    \centering
    \begin{tabular}{r r r r r r r}
        \toprule
        \thead{} & \thead{Unsupported \\ register rule}
            & \thead{Register \\ rules seen}
            & \thead{\% \\ supp.}
            & \thead{Unsupported \\ expression}
            & \thead{Expressions \\ seen}
            & \thead{\% \\ supp.}
            \\
        \midrule
            \makecell{Only supp. \\ columns} &
                1603 & 42959683 & 99.996\,\% &
                1114 & 5977 & 81.4\,\%
                \\
            All columns &
                1607 & 67587841 & 99.998\,\% &
                1154 & 13869 & 91.7\,\%
                \\
        \bottomrule
    \end{tabular}

    \caption{Instructions coverage statistics}\label{table:instr_cov}
\end{table}

\begin{table}[h]
    \centering
    \begin{tabular}{r r r r r r}
        \toprule
        \thead{}
            & \thead{\texttt{Undefined}}
            & \thead{\texttt{Same\_value}}
            & \thead{\texttt{Offset}}
            & \thead{\texttt{Val\_offset}}
            & \thead{\texttt{Register}}
            \\
        \midrule
            \makecell{Only supp. \\ columns}
                & 1698 (0.006\,\%)
                & 0
                & 30038255 (99.9\,\%)
                & 0
                & 14 (0\,\%)
                \\
            All columns
                & 1698 (0.003\,\%)
                & 0
                & 54666405 (99.9\,\%)
                & 0
                & 22 (0\,\%)
                \\
        \bottomrule
        \toprule
        \thead{}
            & \thead{\texttt{Expression}}
            & \thead{\texttt{Val\_expression}}
            & \thead{\texttt{Architectural}}
            & & \thead{Total}
            \\
        \midrule
            \makecell{Only supp. \\ columns}
                & 4475 (0.015\,\%)
                & 0
                & 0
                & & 30044442
                \\
            All columns
                & 12367 (0.02\,\%)
                & 0
                & 0
                & & 54680492
                \\

        \bottomrule
    \end{tabular}

    \caption{Instruction type statistics}\label{table:instr_types}
\end{table}

Table~\ref{table:instr_cov} gives statistics about the proportion of
instructions encountered that were not supported by \ehelfs. The first row
only considers the columns CFA, \reg{rip}, \reg{rsp}, \reg{rbp} and
\reg{rbx} (the supported registers --~see Section~\ref{ssec:ehelfs}). The
second row analyzes all the columns that were encountered, whether
supported or not.

Table~\ref{table:instr_types} analyzes the proportion of each command
--~the formal way a register is set~-- for non-CFA columns in the sampled data.
Briefly, \texttt{Offset} means stored at an offset from the CFA,
\texttt{Register} means the value from a machine register, \texttt{Expression}
means stored at the address of an expression's result, and the \texttt{Val\_}
prefix means that the value must not be dereferenced. Overall, it can be seen
that supporting \texttt{Offset} already means supporting the vast majority of
registers. The data gathered (not reproduced here) also suggests that
supporting a few common expressions is enough to support most of them.

It is also worth noting that, among the 4000 analyzed files, only 12 account
for all the unsupported expressions seen, and only 24 contain any unsupported
instruction at all.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%% End main text content %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%% Bibliography %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\printbibliography{}

%% License notice %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\vfill
\hfill \begin{minipage}{0.7\textwidth}
    \begin{flushright}
        \itshape{} \small{}
        Unless otherwise explicitly stated, any image, source code snippet or
        table from the present document can be reused freely by anyone.
    \end{flushright}
\end{minipage}

\end{document}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%