\title{Speeding up stack unwinding by compiling DWARF debugging data}
\author{Th\'eophile Bastian\\
Under supervision of Francesco Zappa Nardelli, March -- August 2018\\
{\textsc{parkas}, \textsc{inria}}}
%\date{March -- August 2018\\August 20, 2018}
\date{\vspace{-2em}}
\documentclass[11pt]{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage[left=2cm,right=2cm,top=2cm,bottom=2cm]{geometry}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{stmaryrd}
\usepackage{mathtools}
\usepackage{indentfirst}
\usepackage{makecell}
\usepackage{booktabs}
\usepackage{wrapfig}
\usepackage{pgfplots}
\usepackage{placeins}
\usepackage{enumitem}
%\usepackage[backend=biber,style=alphabetic]{biblatex}
\usepackage[backend=biber]{biblatex}
\usepackage{../shared/my_listings}
\usepackage{../shared/my_hyperref}
\usepackage{../shared/specific}
\usepackage{../shared/common}
\usepackage{../shared/todo}
\addbibresource{../shared/report.bib}
\renewcommand\theadalign{c}
\renewcommand\theadfont{\bfseries}
%\renewcommand\theadgape{\Gape[4pt]}
%\renewcommand\cellgape{\Gape[4pt]}
\allowdisplaybreaks{}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{document}
%% Main title %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\maketitle
%% Fiche de synthèse %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\input{fiche_synthese}
%% Table of contents %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\tableofcontents
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%% Main text content %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection*{Source code}\label{ssec:source_code}
Our implementation is available from \url{https://git.tobast.fr/m2-internship}.
See the \texttt{abstract} repository for an introductory \texttt{README}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Stack unwinding data presentation}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Stack frames and x86\_64 calling conventions}
On every common platform, programs make use of a \emph{call stack} to store
information about the nested function calls at the current execution point, and
to keep track of their nesting. This call stack is conventionally a contiguous
memory space mapped close to the top of the address space. Each function
call has its own \emph{stack frame}, an entry of the call stack, whose precise
contents are partly specified by the Application Binary Interface (ABI) of the
platform, and otherwise left up to the compiler. Those frames are
typically used for storing function arguments, machine registers that must be
restored before returning, the function's return address and local variables.
On the x86\_64 platform, with which this report is mostly concerned, the
calling convention followed on UNIX-like operating systems --~among which Linux
and MacOS~-- is defined by the System V ABI~\cite{systemVabi}. Under this
calling convention, the first six arguments of a function are passed in the
registers \reg{rdi}, \reg{rsi}, \reg{rdx}, \reg{rcx}, \reg{r8}, \reg{r9}, while
additional arguments are pushed onto the stack. It also defines which registers
may be overwritten by the callee, and which registers must be restored by the
callee before returning. This restoration, for most compilers, is done by
pushing the register value onto the stack during the function prelude, and
restoring it just before returning. Those preserved registers are \reg{rbx},
\reg{rsp}, \reg{rbp}, \reg{r12}, \reg{r13}, \reg{r14}, \reg{r15}.
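As a small illustration of this convention (a sketch of ours, not taken from
the programs studied later), consider the following prototype:
\begin{lstlisting}[language=C]
/* Under the System V x86_64 calling convention, the integer arguments
 * a, b, c, d, e and f are passed in rdi, rsi, rdx, rcx, r8 and r9
 * respectively; g and h are pushed onto the stack by the caller, and the
 * return value comes back in rax. */
long func(long a, long b, long c, long d, long e, long f, long g, long h);
\end{lstlisting}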
\begin{wrapfigure}{r}{0.4\textwidth}
\centering
\includegraphics[width=0.9\linewidth]{imgs/call_stack/call_stack.png}
\caption{Program stack with x86\_64 calling
conventions}\label{fig:call_stack}
\end{wrapfigure}
The register \reg{rsp} is supposed to always point to the last used address in
the stack. Thus, when the process enters a new function, \reg{rsp} points to
the location of the return address. Then, the compiler might use \reg{rbp}
(``base pointer'') to save this value of \reg{rsp}, writing the old value of
\reg{rbp} below the return address on the stack and copying \reg{rsp} to
\reg{rbp}. This makes it easy to find the return address from anywhere within
the function, and allows for easy addressing of local variables. To some
extent, it also allows for hot debugging, such as saving a useful core dump
upon a segmentation fault. Yet, dedicating \reg{rbp} to saving this value of
\reg{rsp} wastes a register, and the decision of using it, on x86\_64 System V,
is left to the compiler.
Usually, a function starts by subtracting some value from \reg{rsp}, allocating
space in the stack frame for its local variables. Then, it saves on the
stack the values of the callee-saved registers that are overwritten later.
Before returning, it pops the values of the saved registers back into their
original registers and restores \reg{rsp} to its former value.
\subsection{Stack unwinding}\label{ssec:stack_unwinding}
For various reasons, it is interesting, at some point during the execution of a
program, to glance at its program stack and extract information
from it. For instance, when running a debugger, a frequent operation is to obtain a
\emph{backtrace}, that is, the list of all nested function calls at the current
IP\@. This requires observing the stack to find the various stack frames, and
decoding them to identify the function names, parameter values, etc.
This operation is far from trivial. Often, a stack frame only makes sense
when the machine registers hold the right values; these values,
however, must themselves be restored from the previous stack frame, where they are
stored. This forces one to \emph{walk} the stack, reading the frames one after
the other, instead of peeking at some frame directly. Moreover, it is often not
even easy to determine the boundaries of a single stack frame on its own, making it
impossible to just peek at one frame.
Interpreting a frame in order to get the machine state \emph{before} this
frame, and thus be able to decode the next frame recursively, is called
\emph{unwinding} a frame.
Let us consider a stack with x86\_64 calling conventions, such as shown in
Figure~\ref{fig:call_stack}. Assuming the compiler decided here \emph{not} to
use \reg{rbp}, and assuming the function allocates \eg{} a buffer of 8
integers, the area allocated for local variables is at least $32$ bytes
long (for 4-byte integers), and \reg{rsp} points below this area.
Short of analyzing the generated assembly code, there is no way to find where
the return address is stored, relative to \reg{rsp}, at an arbitrary point
of the function. Even when \reg{rbp} is used, there is no easy way to guess
where each callee-saved register is stored in the stack frame, since the
compiler is free to do as it wishes. Even worse, it is not trivial to know
which callee-saved registers were saved at all, since if the function does not
alter a register, it does not have to save it.
This example makes it clear that some additional data is necessary
to perform stack unwinding reliably, without resorting to guesswork. This
data is stored along with the debugging information of a program, and one
common format of debugging data is DWARF\@.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Unwinding usage and frequency}
Stack unwinding is more frequent than one might think at first. The most
obvious use case is simply getting a stack trace of a program and providing a
debugger with the information it needs. For instance, when inspecting a stack
trace in \prog{gdb}, a common operation is to jump to a previous frame:
\lstinputlisting{src/segfault/gdb_session}
To be able to do this, \texttt{gdb} must be able to restore \lstc{fct_a}'s
context, by unwinding \lstc{fct_b}'s frame.
\medskip
Yet, stack unwinding, and thus, debugging data, \emph{is not limited to
debugging}.
Another common usage is profiling. A profiler, such as \prog{perf} under Linux
--~see Section~\ref{ssec:perf}~--, is used to measure and analyze in which
functions a program spends its time, and find out which parts are critical to
optimize. To do so, modern profilers pause the traced program at regular,
short intervals, inspect its stack, and determine which function is currently
being run. They also perform a stack unwinding to figure out the call path to
this function, in order to determine which functions indirectly take time: for
instance, a program may spend practically no time directly in a function
\lstc{fct_a}, yet spend a lot of time in calls to two time-consuming functions
\lstc{fct_b} and \lstc{fct_c}, both made from \lstc{fct_a}. Knowing that
\lstc{fct_a} is, after all, the culprit can be useful to a programmer.
Exception handling also requires a stack unwinding mechanism in some languages.
Indeed, an exception is completely different from a \lstinline{return}: while
the latter returns to the previous function, at a well-defined IP, the former
can be caught by virtually any function in the call path, at any point of the
function. It is thus necessary to be able to unwind frames, one by one, until a
suitable \lstc{catch} block is found. The C++ language, for one, includes a
stack-unwinding library similar to \prog{libunwind} in its runtime.
Technically, exception handling could be implemented without any stack
unwinding, by using \lstc{setjmp} and \lstc{longjmp}
mechanics~\cite{niditoexn}. However, it is not possible to implement this
straight away in C++ (among others), because the stack needs to be
properly unwound in order to trigger the destructors of stack-allocated
objects. Furthermore, this is often undesirable: \lstc{setjmp} introduces an
overhead, which is hit whenever a \lstc{try} block is encountered. Instead, it
is often preferred to have strictly no overhead when no exception happens, at
the cost of a greater overhead when an exception is actually fired --~after
all, they are supposed to be \emph{exceptional}. For more details on C++
exception handling, see~\cite{koening1990exception} (especially Section~16.5).
Possible implementation mechanisms are also presented
in~\cite{dinechin2000exn}.
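As an illustration of the \lstc{setjmp}/\lstc{longjmp} approach mentioned
above, a minimal, self-contained sketch (unrelated to the rest of this work)
could look like this:
\begin{lstlisting}[language=C]
#include <setjmp.h>
#include <stdio.h>

static jmp_buf catch_point;

static void may_throw(int fail) {
    if (fail)
        longjmp(catch_point, 1);      /* "throw": jump back to the setjmp site */
}

int main(void) {
    if (setjmp(catch_point) == 0) {   /* "try": setjmp returns 0 the first time */
        may_throw(1);
        puts("not reached");
    } else {                          /* "catch": reached through longjmp */
        puts("caught");
    }
    return 0;
}
\end{lstlisting}
Note that no stack unwinding happens here, which is precisely why the
destructors of intermediate C++ frames could not be run with this scheme.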
In both of these cases, performance \emph{can} be a problem. In
the latter, a slow unwinding directly impacts the overall program performance,
particularly if a lot of exceptions are thrown and caught far away in their
call path. As for the former, profiling \emph{is} performance-heavy and slow:
for a session analyzing the \prog{tor-browser} for two and a half minutes,
\prog{perf} spends $100\,\mu \text{s}$ analyzing each of the $325679$ samples,
that is, $300\,\text{ms}$ per second of program run with default settings.
One of the motivations for this internship was also Stephen Kell's
\prog{libcrunch}~\cite{kell2016libcrunch}, which makes heavy use of stack
unwinding through \prog{libunwind} and had to force \prog{gcc} to use a frame
pointer (\reg{rbp}) everywhere, through \lstbash{-fno-omit-frame-pointer}, in
order to mitigate the slowness.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{DWARF format}
The DWARF format was first standardized as the format for the debugging
information of ELF binaries (Executable and Linkable Format), which are
standard on UNIX-like systems, including Linux and MacOS --~but not Windows. It
is now
commonly used across a wide variety of binary formats to store debugging
information. As of now, the latest DWARF standard is DWARF 5~\cite{dwarf5std},
which is openly accessible.
The DWARF data commonly includes type information about the variables in the
original programming language, the correspondence between assembly instructions
and lines in the original source file, \ldots{}
The format also specifies a way to represent unwinding data, as described in
Section~\ref{ssec:stack_unwinding} above, in an ELF section originally called
\lstc{.debug_frame}, but most often found as \ehframe.
For any binary, debugging information can easily take up space and grow bigger
than the program itself if no attention is paid to keeping it as compact as
possible when designing the file format. On this matter, DWARF does an
excellent job and stores everything in a very compact way. This, however,
as we will see, makes it both difficult to parse correctly and relatively slow
to interpret.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{DWARF unwinding data}
The unwinding data, which we will call from now on the \ehframe, contains, for
each possible IP, a set of ``registers'' that can be unwound, and a rule
describing how to do so.
The DWARF language is completely agnostic of the platform and ABI, and in
particular, is completely agnostic of a particular platform's registers. Thus,
as far as DWARF is concerned, a register is merely a numerical identifier that
is often, but not necessarily, mapped to a real machine register by the ABI\@.
In practice, this data takes the form of a collection of tables, one table per
Frame Description Entry (FDE). An FDE, in turn, is a DWARF entry describing such
a table, and has authority over a range of IPs. Most often, but not
necessarily, it corresponds to a single function in the original source code.
Each column of the table is a register (\eg{} \reg{rsp}), along with two
additional special registers, CFA (Canonical Frame Address) and RA (Return
Address), containing respectively the base pointer of the current stack frame
and the return address of the current function. For instance, on an x86\_64
architecture, RA would contain the unwound value of \reg{rip}, the instruction
pointer. Each row has a certain validity interval, on which it describes
accurate unwinding data. This range starts at the instruction pointer it is
associated with, and ends at the start IP of the next table row --~or the end
IP of the current FDE if it was the last row. In particular, there can be no
``IP hole'' within a FDE --~unlike FDEs themselves, which can leave holes
between them.
For instance, the C source code in Listing~\ref{lst:ex1_c}, when compiled
with \lstbash{gcc -O1 -fomit-frame-pointer -fno-stack-protector}, yields the
assembly code in Listing~\ref{lst:ex1_asm}. The memory layout of the stack
frame is presented in Table~\ref{table:ex1_stack_schema}, to help understanding
how the stack frame is constructed. When interpreting the generated \ehframe{}
with \lstbash{readelf -wF}, we obtain the (slightly edited)
Listing~\ref{lst:ex1_dw}. During the function prelude, \ie{} for $\mhex{615}
\leq \reg{rip} < \mhex{619}$, the stack frame only contains the return address,
thus the CFA is 8 bytes above \reg{rsp}, and the return address is precisely at
\reg{rsp} --~that is, stored between \reg{rsp} and $\reg{rsp} + 8$. Then, space
for \lstc{fibo}, 8 integers of 4 bytes each, is allocated on the stack; the
variable \lstc{pos} is optimized out in the generated assembly code, so no
stack space is allocated for it. \prog{gcc} actually decided to allocate a
total of 48 bytes for the stack frame, for memory alignment reasons, which
means subtracting 40 bytes from \reg{rsp} (address $\mhex{615}$ in the
assembly); this puts the CFA 48 bytes above \reg{rsp}, the return address still
being 8 bytes below the CFA\@. Then, by the end of the function, the local
variables are discarded and \reg{rsp} is reset to its value from the
first row.
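As a quick sanity check of these offsets, pick an arbitrary, made-up value of
\reg{rsp} somewhere in the function body, say $\reg{rsp} = \mhex{7fff0000}$;
the second table row then gives
\begin{align*}
    CFA &= \reg{rsp} + 48 = \mhex{7fff0000} + \mhex{30} = \mhex{7fff0030} \\
    RA  &= \text{the 8 bytes stored at } CFA - 8 = \text{those at } \mhex{7fff0028}
\end{align*}
and the unwound \reg{rsp} is the CFA itself, following the usual x86\_64
convention that the CFA is the value of the stack pointer at the call site.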
\begin{figure}[h]
\begin{minipage}{0.45\textwidth}
\lstinputlisting[language=C, firstline=3, lastline=12,
caption={Original C},label={lst:ex1_c}]
{src/fib7/fib7.c}
\end{minipage} \hfill \begin{minipage}{0.45\textwidth}
\lstinputlisting[language={[x86masm]Assembler},
caption={Generated assembly},label={lst:ex1_asm}]
{src/fib7/fib7.s}
\end{minipage}
\end{figure}
\begin{figure}[h]
\begin{minipage}{0.45\textwidth}
\lstinputlisting[language=C,caption={Raw DWARF},label={lst:ex1_dwraw}]
{src/fib7/fib7.raw_fde}
\end{minipage} \hfill \begin{minipage}{0.45\textwidth}
\lstinputlisting[language=C,caption={Processed DWARF},
label={lst:ex1_dw}]
{src/fib7/fib7.fde}
\end{minipage}
\end{figure}
\begin{table}[h!]
\centering
\begin{tabular}{|c|c|c|c|c|c}
\stackfhead{+ \mhex{30}}
& \stackfhead{+ \mhex{28}}
& \stackfhead{+ \mhex{20}}
& \stackfhead{+ \mhex{1c}}
& \stackfhead{+ \mhex{4}}
& \stackfhead{}
\\
\hline{}
Return Address & \textit{Alignment space}
& \spaced{2ex}{\lstc{fibo[7]}}
& \spaced{4ex}{\ldots}
& \spaced{2ex}{\lstc{fibo[0]}}
& \textit{Next frame}
\\
\hline
\end{tabular}
\caption{Stack frame schema for fib7 (horizontal layout)}\label{table:ex1_stack_schema}
\end{table}
However, DWARF data isn't actually stored as a table in the binary files, but
is instead stored as in Listing~\ref{lst:ex1_dwraw}. The first row has the
location of the first IP in the FDE, and must define at least its CFA\@. Then,
when all relevant registers are defined, it is possible to define a new row by
providing a location offset (\eg{} here $4$), and the new row is defined as a
clone of the previous one, which can then be altered (\eg{} here by setting
\lstc{CFA} to $\reg{rsp} + 48$). This means that every row is defined \wrt{}
the previous one, and that the IPs of the successive rows cannot be determined
without first evaluating every row that comes before. Thus,
unwinding a frame from an IP close to the end of the FDE requires evaluating
pretty much every DWARF row in the table before reaching the relevant
information, drastically slowing down the unwinding process.
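To make this linear lookup cost concrete, the following toy model --~a sketch
only, handling just location advances and CFA offset changes, and ignoring the
real \lstc{.eh_frame} encoding~-- shows why finding a row near the end of an
FDE requires interpreting everything before it:
\begin{lstlisting}[language=C]
#include <stddef.h>
#include <stdint.h>

/* Simplified instructions: either start a new row (ADVANCE_LOC) or update
 * the CFA offset of the current row (DEF_CFA_OFFSET). */
typedef enum { ADVANCE_LOC, DEF_CFA_OFFSET } dw_op;
typedef struct { dw_op op; int arg; } dw_instr;
typedef struct { uintptr_t loc; int cfa_offset; } dw_row;

/* Find the row covering target_ip: every instruction located before it in
 * the FDE must be interpreted, so the cost is linear in the row's position. */
dw_row find_row(uintptr_t fde_start, const dw_instr *ins, size_t n,
                uintptr_t target_ip) {
    dw_row cur = { fde_start, 8 };       /* illustrative initial row: CFA = rsp + 8 */
    for (size_t i = 0; i < n; i++) {
        if (ins[i].op == ADVANCE_LOC) {
            if (cur.loc + (uintptr_t)ins[i].arg > target_ip)
                break;                   /* the current row already covers target_ip */
            cur.loc += (uintptr_t)ins[i].arg;
        } else {
            cur.cfa_offset = ins[i].arg; /* alter the clone of the previous row */
        }
    }
    return cur;
}
\end{lstlisting}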
\FloatBarrier{}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{How big are FDEs?}
\begin{figure}[h]
\centering
\begin{tikzpicture}
\begin{axis}[
width=0.9\linewidth, height=4cm,
grid=major,
grid style={dashed,gray!30},
xlabel=FDE row count,
ylabel=Proportion,
%legend style={at={(0.5,-0.2)},anchor=north},
xtick distance=5,
ybar, %added here
]
\addplot[blue,fill] table[x=lines,y=proportion, col sep=comma]
{data/fde_line_count.csv};
\end{axis}
\end{tikzpicture}
\caption{FDE line count density}\label{fig:fde_line_density}
\end{figure}
Since evaluating an \lstc{.eh_frame} FDE entry is, as seen in the previous
section, roughly linear in time in its number of rows, we must wonder what the
distribution of FDE row counts is. The histogram in
Figure~\ref{fig:fde_line_density} was generated from a random sample of around
2000 ELF files present on an ArchLinux system.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Unwinding state-of-the-art}
The most commonly used library to perform stack unwinding, in the Linux
ecosystem, is \prog{libunwind}~\cite{libunwind}. While it is very robust and
decently efficient, most of its optimization comes from fine-tuned code and
good caching mechanisms. When parsing DWARF, \prog{libunwind} is forced to
parse the relevant FDE from its start, until it finds the row it is seeking.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{DWARF semantics}\label{sec:semantics}
The DWARF 5 standard~\cite{dwarf5std} is written in English prose, and our
first task is to formalize it. Thus, in this section, we first recall the
informal behaviour of DWARF instructions as provided by the standard; and then
we formalize their semantics by mapping them to well-defined C code. We omit
the translation of DWARF expressions, because they form a rich language that
would take a lot of time and space to formalize, while being seldom used in
practice --~see Section~\ref{ssec:instr_cov}.
These semantics are defined \wrt{} the well-formalized C language, and pass
through an intermediate language. The DWARF language can read the
whole memory, as well as the registers, and is always evaluated at some
instruction pointer. The C function representing it thus takes as parameters an
array of register values as well as an IP, and returns another array of
register values, which represents the evaluated DWARF row.
\subsection{Original language: DWARF instructions}
These are the DWARF instructions used for CFI description, that is, the
instructions that contain the stack unwinding table information. Below is an
exhaustive list of instructions from the DWARF5 specification~\cite{dwarf5std}
concerning CFI, with reworded descriptions for brevity and clarity. All these
instructions are given up to variants --~most instructions exist in multiple
formats to handle various operand encodings for the sake of space optimisation.
Since we won't be talking about the underlying file format here, those
variations between \eg{} \dwcfa{advance\_loc1} and \dwcfa{advance\_loc2}
--~which differ only in the number of bytes of their operand~-- are irrelevant
and are elided.
As said before, we also elide references to DWARF expressions, as they are
complex and mostly not implemented in the actual compiler anyway --~apart from
some special cases. Those expressions bring in complexity, as they are
Turing-complete stack machine expressions that can access virtually the whole
computer's memory. Formalizing them would require designing semantics for such
a language.
\begin{itemize}[itemsep=3pt, parsep=0pt]
\item{} \dwcfa{set\_loc(loc)}:
start a new table row from address $loc$
\item{} \dwcfa{advance\_loc(delta)}:
start a new table row at address $prev\_loc + delta$
\item{} \dwcfa{def\_cfa(reg, offset)}:
sets this row's CFA at $(\reg{reg} + \textit{offset})$
\item{} \dwcfa{def\_cfa\_register(reg)}:
sets CFA at $(\reg{reg} + \textit{prev\_offset})$
\item{} \dwcfa{def\_cfa\_offset(offset)}:
sets CFA at $(\reg{prev\_reg} + \textit{offset})$
\item{} \dwcfa{def\_cfa\_expression(expr)}:
sets CFA as the result of $expr$
\item{} \dwcfa{undefined(reg)}:
sets the register \reg{reg} as undefined in this row
\item{} \dwcfa{same\_value(reg)}:
declares that the register \reg{reg} hasn't been touched, or was
restored to its previous value, in this row. An unwinding procedure can
leave it as-is.
\item{} \dwcfa{offset(reg, offset)}:
the value of the register \reg{reg} is stored in memory at the address
$CFA + \textit{offset}$.
\item{} \dwcfa{val\_offset(reg, offset)}:
the value of the register \reg{reg} is the value $CFA + \textit{offset}$
\item{} \dwcfa{register(reg, model)}:
the register \reg{reg} has, in this row, the value of $\reg{model}$.
\item{} \dwcfa{expression(reg, expr)}:
the value of \reg{reg} is stored in memory at the address defined by
$expr$
\item{} \dwcfa{val\_expression(reg, expr)}:
\reg{reg} has the value of $expr$
\item{} \dwcfa{restore(reg)}:
\reg{reg} has the same value as in this FDE's preamble (CIE) in this
row. This is \emph{not implemented in this semantics} for simplicity
and brevity, as we would have to introduce CIE (preamble) and FDE
(body) independently. This is also not often used in actual ELF files:
the analysis in Section~\ref{ssec:instr_cov} found no such instruction,
on a random uniform sample of 4000 ELF files.
\item{} \dwcfa{remember\_state()}:
push the state of all the registers of this row on a state-saving stack
\item{} \dwcfa{restore\_state()}:
pop an entry of the state-saving stack, and restore all registers in
this row to the value held in the stack record.
\item{} \dwcfa{nop()}:
do nothing (padding)
\end{itemize}
\subsection{Intermediary language $\intermedlang$}
A first pass translates DWARF instructions into this intermediate language
$\intermedlang$. It is designed to be more mathematical: it represents the same
thing, but abstracts away all the data compression of the DWARF format, so
that we can better reason about it and transform it into C code.
Its grammar is as follows:
\begin{align*}
\FDE &::= {\left(\mathbb{Z} \times \dwrow \right)}^{\ast}
& \text{FDE (set of rows)} \\
\dwrow &::= \values ^ \regs
& \text{A single table row} \\
\regs &::= \left\{0, 1, \ldots, \operatorname{NB\_REGS - 1} \right\}
& \text{Machine registers} \\
\values &::= \bot & \text{Values: undefined,}\\
&\quad\vert~\valaddr{\spexpr} & \text{at address $x$},\\
&\quad\vert~\valval{\spexpr} & \text{of value $x$} \\
&\quad\vert~\valexpr{} & \text{of expression $x$, see in text} \\
\spexpr &::= \regs \times \mathbb{Z}
& \text{A ``simple'' expression $\reg{reg} + \textit{offset}$} \\
\end{align*}
The entry point of the grammar is a $\FDE$, which is a set of rows, each
annotated with the machine address from which it is valid. The addresses are
necessarily increasing within a FDE\@.
Each row is a function mapping registers to values, and represents a row of
the unwinding table.
We implicitly consider that $\reg{reg}$ maps to a number. We use here
\texttt{x86\_64} names for convenience, although in DWARF, registers are merely
identifiers. Thus, we can safely state that $\reg{reg} \in \regs$.
A value can then be undefined, stored at memory address $x$ or be directly a
value $x$, $x$ being here a simple expression consisting of $\reg{reg} +
\textit{offset}$. The CFA is seen just as any other register here, although
DWARF makes a distinction between it and other columns. For instance, to define
$\reg{rax}$ to the value contained in memory 16 bytes below the CFA, we would
have $\reg{rax} \mapsto \valaddr{\reg{CFA}, -16}$, since the stack grows
downwards. We also leave open the possibility to extend the language with DWARF
expressions support as $\valexpr{}$, although we \emph{do not} specify them
here.
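For instance, the first two rows of the fib7 example from
Listing~\ref{lst:ex1_dw}, restricted to the CFA and the return address
(written here as a pseudo-register $RA$), would be written in
$\intermedlang$ as
\[
    \left(\mhex{615},\ \left\{\begin{array}{r l}
        CFA &\mapsto \valval{\reg{rsp} + 8} \\
        RA  &\mapsto \valaddr{\reg{CFA} - 8}
    \end{array}\right.\right)
    \cdot
    \left(\mhex{619},\ \left\{\begin{array}{r l}
        CFA &\mapsto \valval{\reg{rsp} + 48} \\
        RA  &\mapsto \valaddr{\reg{CFA} - 8}
    \end{array}\right.\right)
\]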
\subsection{Target language: a C function body}
The target language of these semantics is a C function, to be interpreted
\wrt{} the C11 standard~\cite{c11std}. The function is supposed to be run
in the context of the program being unwound. In particular, it must be able to
dereference any pointer derived from the DWARF instructions, pointing into the
execution stack or even the heap.
This function takes as arguments an instruction pointer --~supposedly
extracted from $\reg{rip}$~-- and an array of register values; and returns a
fresh array of register values after unwinding this call frame. The function is
compositional: it can be called twice in a row to unwind two stack frames,
unless the IP obtained after the first unwinding comes from another shared
object file, for instance a call to \prog{libc}. In this case, unwinding the
second frame requires loading the corresponding DWARF information.
The function is the following:
\lstinputlisting[language=C]{src/dw_semantics/c_context.c}\label{lst:sem_c_ctx}
The translation of $\intermedlang$, as produced by the function defined later,
is then to be inserted in this context, where the comment states so.
In pseudo-C code (for brevity) and assuming the functions and types used are
duly defined elsewhere, unwinding multiple frames would then look like this:
\lstinputlisting[language=C]{src/dw_semantics/stack_walker.c}
Thus, if we assume that the IP remains in the same memory segment
--~\ie{} the same binary file~-- across two frames, we can safely unwind two
frames this way:
\lstinputlisting[language=C]{src/dw_semantics/unwind2.c}
\subsection{From DWARF to $\intermedlang$}
In DWARF, the meaning of an instruction depends on the instructions
interpreted before it, sequentially. For instance, many registers are defined
at offsets from the current CFA, which in turn was previously defined \wrt{}
the former CFA value, etc. Thus, to give a meaning to a DWARF instruction,
knowledge of the current row's values is needed. Let us consider a given point
of the interpretation of $d = h \cdot t$, where we have already interpreted
$h$, the first instructions, into $H \in \FDE$, while $t$ remains to be
interpreted. We then define the interpretation function $\llbracket t
\rrbracket (H)$, interpreting the remainder $t$ of the DWARF instructions,
with the knowledge of $H$, the rows interpreted so far.
We also need to keep track of the state-saving stack that DWARF uses, which
is kept as a subscript.
Thus, we define $\semI{\bullet}{s}(\bullet): \DWARF \times \FDE \to \FDE$, for
$s$ a stack of $\dwrow$, that is,
\[
s \in \rowstack := \dwrow^\ast
\]
Implicitly, $\semI{\bullet}{} := \semI{\bullet}{\varepsilon}$
\medskip
For convenience, we define $\insarrow{$r \in{} \regs$}$, an operator that, in
the last row of the $\FDE$ given as its left-hand side operand, sets the
register $r$ to the value given as its right-hand side operand.
\[
\left(f \in \FDE\right) \insarrow{$r \in{} \regs$} (v \in \values)
\quad := \quad
\left( f\left[0\ \cdots\ \left(\vert f \vert - 2\right)\right] \right) \cdot \left\{
\begin{array}{r l}
r' \neq r &\mapsto \left(f[-1]\right)(r') \\
r &\mapsto v \\
\end{array} \right.
\]
Note that for convenience, we allow ourselves to index an array negatively to
retrieve its values from the end; thus, $f[-1]$ refers to the last entry of
$f$. If we consider the following fictitious row $R$,
\[
R \in \dwrow := \left\{ \begin{array}{r l}
CFA &\mapsto \valval{\reg{rsp} - 48} \\
\reg{rbx} &\mapsto \valaddr{\reg{rsp} - 16}
\end{array}\right.
\]
\noindent{}then, we would have
\[
R \insarrow{\reg{rbx}} \left(\valaddr{\reg{rsp} - 24}\right)
\quad = \quad
\left\{ \begin{array}{r l}
CFA &\mapsto \valval{\reg{rsp} - 48} \\
\reg{rbx} &\mapsto \valaddr{\reg{rsp} - 24}
\end{array}\right.
\]
In the same way, we define $\extrarrow{reg}$, which \emph{extracts} the rule
currently applied for $\reg{reg}$ in the last row of an FDE, \eg{} $F
\extrarrow{CFA} \valval{\reg{reg} + \text{off}}$. If the rule currently applied
in such a case is \emph{not} of the form $\reg{reg} + \text{off}$, then the
program is considered erroneous. One can see this $\extrarrow{reg}$ as a
\lstc{match} statement in OCaml with only one case, used to retrieve
packed data, all the other unmatched cases corresponding to an error.
\begin{align*}
\semI{\varepsilon}{s}(F) &:= F \\
\semI{\dwcfa{set\_loc(loc)} \cdot d}{s}(F) &:=
\contsem{F \cdot \left(loc, F[-1].row \right)} \\
\semI{\dwcfa{adv\_loc(delta)} \cdot d}{s}(F) &:=
\contsem{F \cdot \left(F[-1].addr + delta, F[-1].row \right)} \\
\semI{\dwcfa{def\_cfa(reg, offset)} \cdot d}{s}(F) &:=
\contsem{F \insarrow{CFA} \valval{\reg{reg} + \textit{offset}}} \\
\semI{\dwcfa{def\_cfa\_register(reg)} \cdot d}{s}(F) &:=
\text{let F }\extrarrow{CFA} \valval{\reg{oldreg} + \textit{oldoffset}}
\text{ in} \\
&\quad \contsem{F \insarrow{CFA} \valval{\reg{reg} + \textit{oldoffset}}} \\
\semI{\dwcfa{def\_cfa\_offset(offset)} \cdot d}{s}(F) &:=
\text{let F }\extrarrow{CFA} \valval{\reg{oldreg} + \textit{oldoffset}}
\text{ in} \\
&\quad \contsem{F \insarrow{CFA} \valval{\reg{oldreg} + \textit{offset}}} \\
\semI{\dwcfa{undefined(reg)} \cdot d}{s}(F) &:=
\contsem{F \insarrow{\reg{reg}} \bot} \\
\semI{\dwcfa{same\_value(reg)} \cdot d}{s}(F) &:=
    \contsem{F \insarrow{\reg{reg}} \valval{\reg{reg}}} \\
\semI{\dwcfa{offset(reg, offset)} \cdot d}{s}(F) &:=
\contsem{F \insarrow{reg} \valaddr{\textit{CFA} + \textit{offset}}} \\
\semI{\dwcfa{val\_offset(reg, offset)} \cdot d}{s}(F) &:=
\contsem{F \insarrow{reg} \valval{\textit{CFA} + \textit{offset}}} \\
\semI{\dwcfa{register(reg, model)} \cdot d}{s}(F) &:=
\text{let } F {\extrarrow{model}} r \text{ in }
\contsem{F \insarrow{reg} r} \\
\semI{\dwcfa{remember\_state()} \cdot d}{s}(F) &:=
\semI{d}{s\ \cdot\ \left(F[-1].row\right)}\left(F\right) \\
\semI{\dwcfa{restore\_state()} \cdot d}{s\ \cdot\ t}(F) &:=
\semI{d}{s}\left(F\left[0 \ldots |F|-2\right] \cdot
\left(F[-1].addr, t\right) \right) \\
\semI{\dwcfa{nop()} \cdot d}{s}(F) &:= \contsem{F}\\
\end{align*}
The state-saving stack is used only for \texttt{remember\_state} and
\texttt{restore\_state}. If we were to omit those two operations, we could
simply remove the stack from our notations.
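As a small worked example of these two instructions, write $d_0 =
\dwcfa{remember\_state()} \cdot \dwcfa{def\_cfa\_offset(64)} \cdot
\dwcfa{advance\_loc(4)} \cdot \dwcfa{restore\_state()}$ (the offset and delta
are arbitrary), and let $(a, R)$ be the last row of some $F \in \FDE$ whose CFA
rule has the form $\valval{\reg{reg} + \textit{offset}}$. Applying the rules
above step by step,
\begin{align*}
    \semI{d_0}{\varepsilon}(F)
    &= \semI{\dwcfa{def\_cfa\_offset(64)} \cdot \dwcfa{advance\_loc(4)} \cdot
             \dwcfa{restore\_state()}}{R}(F) \\
    &= \semI{\dwcfa{advance\_loc(4)} \cdot \dwcfa{restore\_state()}}{R}(F') \\
    &= \semI{\dwcfa{restore\_state()}}{R}\left(F' \cdot (a + 4, R')\right) \\
    &= \semI{\varepsilon}{\varepsilon}\left(F' \cdot (a + 4, R)\right)
     = F' \cdot (a + 4, R)
\end{align*}
where $F'$ is $F$ with the CFA rule of its last row set to offset $64$, and
$R'$ is that modified row: the row starting at $a + 4$ gets back the rules
that were remembered at address $a$.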
\subsection{From $\intermedlang$ to C}
\emph{The C code provided below is a correct but inefficient reference
implementation, whose only purpose is to specify DWARF \wrt{} C. In
particular, the actual compiler is \emph{not} implemented this way.}
\medskip
We now define $\semC{\bullet}: \DWARF \to C$, in the context presented earlier
in Listing~\ref{lst:sem_c_ctx}. The translation from $\intermedlang$ to C is
defined as follows:
\begin{itemize}
\item $\semC{\varepsilon} =$
\begin{lstlisting}[language=C, mathescape=true]
for(int reg=0; reg < NB_REGS; ++reg)
new_ctx[reg] = $\semR{\bot}$; \end{lstlisting}
\item $\semC{(\text{loc}, \text{row}) \cdot t} = C\_code \cdot \semC{t}$,
where $C\_code$ is
\begin{lstlisting}[language=C, mathescape=true]
if(ip >= $loc$) {
for(int reg=0; reg < NB_REGS; ++reg)
new_ctx[reg] = $\semR{row[reg]}$;
goto end_ifs; // Avoid using `else if` (easier for generation)
} \end{lstlisting}
\end{itemize}
\noindent{}while $\semR{\bullet}$ is defined as
\begin{align*}
\semR{\bot} &\eqspace{}
\text{\lstc{ERROR_VALUE}} \\
\semR{\valaddr{\text{reg}, \textit{offset}}} &\eqspace{}
\text{\lstc{*(old_ctx[reg] + offset)}} \\
\semR{\valval{\text{reg}, \textit{offset}}} &\eqspace{}
\text{\lstc{(old_ctx[reg] + offset)}} \\
\end{align*}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Concerning correctness}\label{ssec:sem_correctness}
The semantics described in this section are designed with the
\emph{formalization} of the original standard in mind. This standard, sadly,
only describes each instruction's action and result in plain English. Such a
basis cannot be used to \emph{prove} anything correct without relying on
informal interpretations.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Stack unwinding data compilation}
In this section, we will study all the design options we explored for the
actual C implementation.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Code availability}\label{ssec:code_avail}
All the code produced during the course of this internship is available on the
various repositories from \url{https://git.tobast.fr/m2-internship/}. The
repositories contain \texttt{README} files describing them; a summary and
global description can be found in the \texttt{abstract} repository. This
should be detailed enough to run the project. The source code is entirely under
free software licenses.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Compilation: \ehelfs}\label{ssec:ehelfs}
The rough idea of the compilation is to produce, out of the \ehframe{} section
of a binary, C code close to that of Section~\ref{sec:semantics} above. This C
code is then compiled by GCC in \lstbash{-O2} mode. This saves us the trouble
of optimizing the generated C code, since GCC does that by itself.
The generated code consists of a single function, \lstc{_eh_elf}, taking as
arguments an instruction pointer and a memory context (\ie{} the values of the
various machine registers) as defined in Listing~\ref{lst:unw_ctx}. The
function then returns a fresh memory context, loaded with the values of the
registers after unwinding this frame.
The body of the function itself consists of a single monolithic switch, taking
advantage of the non-standard --~yet overwhelmingly implemented in common C
compilers~-- syntax for range switches, in which each \lstinline{case} can
refer to a range, \eg{} \lstc{case 17 ... 42:}. All the FDEs are merged
together into this switch, each row of an FDE being a switch case. Separating
the various FDEs in the C code --~other than with comments~-- is, unlike what
is done in DWARF, pointless, since accessing a ``row'' has a linear cost, and
the C code is not meant to be read, except maybe for debugging purposes. The
switch case bodies then fill a context with the unwound values before returning it.
A compiler setting also optionally adds another parameter to the
\lstc{_eh_elf} function, \lstc{deref}, which is a function pointer. This
\lstc{deref} function, when present, replaces every use of the dereferencing
\lstc{*} operator, and can be used to generate \ehelfs{} that work on
remote address spaces, that is, when the unwinding is not done by the
process reading the \ehelf{} itself, but concerns some other process, or even a
stack dump of a long-terminated process.
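As an illustration of this mechanism --~with made-up names, not the actual
generated interface~-- the same unwinding code can read either the live
address space or a saved copy of the stack, depending on which \lstc{deref}
implementation it is given:
\begin{lstlisting}[language=C]
#include <stdint.h>
#include <string.h>

typedef uintptr_t (*deref_fn)(uintptr_t addr);

/* Local-process variant: a plain pointer dereference. */
static uintptr_t deref_local(uintptr_t addr) {
    return *(const uintptr_t *)addr;
}

/* Stack-dump variant: read from a saved copy of the stack instead. */
static const uint8_t *dump_base;   /* copy of the dumped stack */
static uintptr_t dump_start;       /* address at which the dump starts */
static uintptr_t deref_dump(uintptr_t addr) {
    uintptr_t value;
    memcpy(&value, dump_base + (addr - dump_start), sizeof value);
    return value;
}

int main(void) {
    uintptr_t x = 42;
    deref_fn deref = deref_local;  /* or deref_dump when unwinding a dump */
    (void)deref_dump;              /* unused in this local-process example */
    return deref((uintptr_t)&x) == 42 ? 0 : 1;
}
\end{lstlisting}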
Unlike in the \ehframe, and unlike what should be done in a release-ready,
real-world version of \ehelfs, the choice was made to keep this
implementation simple and only handle the few registers needed to
unwind the stack. Thus, the only registers handled in \ehelfs{} are
\reg{rip}, \reg{rbp}, \reg{rsp} and \reg{rbx}, the latter being used a few
times in \prog{libc} and other less common libraries to hold the CFA address in
commonly used functions. This is enough to unwind the stack reliably, and thus
enough for profiling, but is not sufficient to analyze every stack frame as
\prog{gdb} would do after a \lstbash{frame n} command. Yet, if one were to
enhance the code to handle every register, it would not be much harder, and
would probably only take a few hours of code refactoring and rewriting.
\begin{figure}[h]
\centering{}
\lstinputlisting[language=C, caption={Unwinding context},
label={lst:unw_ctx}]
{src/dwarf_assembly_context/unwind_context.c}
\end{figure}
In the unwind context from Listing~\ref{lst:unw_ctx}, the values of type
\lstc{uintptr_t} are the values of the corresponding registers, and
\lstc{flags} is an 8-bit value indicating, for each register, whether it is
present or not in this context, plus an error bit indicating whether an error
occurred during unwinding. Such errors can be due \eg{} to an unsupported
operation in the original DWARF\@. This context differs from the one presented
in Listing~\ref{lst:sem_c_ctx}: the previous one was only an array of
values, while the one from the real implementation is more robust, in
particular by including an error flag to make up for the lack of a $\bot$ value.
This generated data is stored in separate shared object files, which we call
\ehelfs. It would have been possible to alter the original ELF file to embed
this data as a new section, but getting it to be executed just as any portion
of the \lstc{.text} section would probably have been painful, and keeping it
separated during the experimental phase is convenient. It is possible to keep
multiple versions of \ehelfs{} files in parallel, with various options turned
on or off, and it does not require altering the base system by editing \eg{}
\texttt{/usr/lib/libc-*.so}. Instead, when the \ehelf{} data is required, those
files can simply be \lstc{dlopen}'d. It is also possible to imagine, in a
future production environment, packaging \ehelfs{} files separately, so that
people interested in better performance can choose to install them.
This, in particular, means that each ELF file has its unwinding data in a
separate \ehelf{} file, implying that the unwinding data for a given program is
scattered among various \ehelf{} files, one for each shared object loaded
--~just like with DWARF, where each ELF retains its own DWARF data. Thus, an
unwinder must first acquire a \emph{memory map}, a table listing the various
ELF files loaded and \emph{mapped} in memory, along with the memory segments
they occupy. This memory map is provided by the operating system --~for
instance, on Linux, it is
available as a file in \texttt{/proc}, a special part of the file system that
the kernel uses to communicate with the userland processes. Once this map is
acquired, when unwinding from a given IP, the unwinder must identify the memory
segment from which it comes, deduce the source ELF file, and deduce the
corresponding \ehelf.
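For illustration, on Linux this memory map can be read from
\texttt{/proc/<pid>/maps}; a minimal sketch dumping the map of the current
process looks like this:
\begin{lstlisting}[language=C]
#include <stdio.h>

int main(void) {
    /* Each line has the form: start-end perms offset dev inode path */
    FILE *maps = fopen("/proc/self/maps", "r");
    if (!maps)
        return 1;
    char line[512];
    while (fgets(line, sizeof line, maps))
        fputs(line, stdout);
    fclose(maps);
    return 0;
}
\end{lstlisting}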
\medskip
\lstinputlisting[language=C, caption={\ehelf{} for the previous example},
label={lst:fib7_eh_elf_basic}]
{src/fib7/fib7.eh_elf_basic.c}
The C code in Listing~\ref{lst:fib7_eh_elf_basic} is the relevant part of what
was generated for the C code in Listing~\ref{lst:ex1_c}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{First results}
Without paying any particular attention to efficiency or compactness, it is
already possible to produce a compiled version very close to the one described
in Section~\ref{sec:semantics}. Although the unwinding speed cannot yet be
properly benchmarked, it is already possible to write, in a few hundred lines
of C code, a simple stack walker printing the functions traversed. It already
works well on the standard cases that are easily tested, and can be used to
unwind the stack of simple programs.
The major drawback of this approach, when no particular care is taken, is the
waste of space. The space taken by those tentative \ehelfs{} is analyzed in
Table~\ref{table:basic_eh_elf_space} for \prog{hackbench}, a small program
introduced later in Section~\ref{ssec:bench_perf}, and the libraries on which
it depends.
\begin{table}[h]
\centering
\begin{tabular}{r r r r r r}
\toprule
\thead{Shared object} & \thead{Original \\ program size}
& \thead{Original \\ \lstc{.eh\_frame}}
& \thead{Generated \\ \ehelf{} \lstc{.text}}
& \thead{\% of original \\ program size}
& \thead{Growth \\ factor} \\
\midrule
libc-2.27.so & 1.4 MiB & 130.1 KiB & 914.9 KiB & 63.92 & 7.03 \\
libpthread-2.27.so & 58.1 KiB & 11.6 KiB & 70.5 KiB & 121.48 & 6.09 \\
ld-2.27.so & 129.6 KiB & 9.6 KiB & 71.7 KiB & 55.34 & 7.44 \\
hackbench & 2.9 KiB & 568.0 B & 2.1 KiB & 74.78 & 3.97 \\
Total & 1.6 MiB & 151.8 KiB & 1.0 MiB & 65.32 & 6.98 \\
\bottomrule
\end{tabular}
\caption{Basic \ehelfs{} space usage}\label{table:basic_eh_elf_space}
\end{table}
The first column only includes the sizes of the ELF sections \lstc{.text} (the
program itself) and \lstc{.rodata}, the read-only data (such as static strings,
etc.). Only the weight of the \lstc{.text} section of the generated \ehelfs{}
is considered, because it is self-contained (little or no data is stored in
\lstc{.rodata}), and the other sections could be removed if the \ehelfs{}
\lstc{.text} was somehow embedded in the original shared object.
This first tentative version of \ehelfs{} is roughly 7 times heavier than the
original \lstc{.eh_frame}, and represents a far too significant proportion of
the original program size ($65\,\%$).
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Space optimization}\label{ssec:space_optim}
A lot of small space optimizations, such as filtering out empty FDEs, merging
together the rows that are equivalent on all the registers kept, etc.\ were
made in order to shrink the size of the \ehelfs.
\medskip
The optimization that most reduced the output size was to use an if/else tree
implementing a binary search on the relevant instruction pointer intervals,
instead of a single monolithic switch. In the process, we also \emph{outline}
code whenever possible, that is, we find identical ``switch case'' bodies
--~which are not switch cases anymore, but \texttt{if} bodies~--, move them
outside of the if/else tree, identify them by a label, and jump to them using a
\lstc{goto}, which de-duplicates a lot of code and contributes greatly to the
shrinking. In the process, we noticed that the vast majority of FDE rows are
actually drawn from very few ``common'' FDE rows. For instance, in the
\prog{libc}, out of a total of $20827$ rows, only $302$ ($1.5\,\%$) unique rows
remain after the outlining.
This makes this optimization very effective, as seen later in
Section~\ref{ssec:results_size}, but also makes it an interesting question
--~not investigated during this internship~-- to find out whether standard
DWARF data could be efficiently compressed in this way.
\begin{minipage}{0.45\textwidth}
\lstinputlisting[language=C, caption={\ehelf{} for the previous example},
label={lst:fib7_eh_elf_outline},
lastline=18]
{src/fib7/fib7.eh_elf_outline.c}
\end{minipage} \hfill \begin{minipage}{0.45\textwidth}
\lstinputlisting[language=C, firstnumber=last, firstline=19]
{src/fib7/fib7.eh_elf_outline.c}
\end{minipage}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Implementation correctness}
The core part of the code, the \texttt{if}/\texttt{else if} bodies, is in
substance strictly equivalent to what was devised in the semantics in
Section~\ref{sec:semantics}. Indeed, the semantics were designed \wrt{} C code
precisely so that they could be exported directly to the C code generated by
the compiler. As a result, proving this part amounts to checking that no errors
were made when copying the code snippets.
The only things that remain to be proved are thus the code optimisations,
namely the binary search transformation of the \texttt{switch} statement and
the outlining. The binary search part is a standard proof, which in substance
requires proving that a binary search is equivalent to an exhaustive, linear
search. The outlining only requires proving that the program flow remains
unchanged, which is easy to establish.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Benchmarking}\label{sec:benchmarking}
Benchmarking turned out to be, quite surprisingly, the hardest part of the
project. It ended up requiring a good deal of investigation to find a working
protocol, and afterwards, a good deal of code reading and coding to get the
solution working.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Requirements}\label{ssec:bench_req}
To provide relevant benchmarks of the \ehelfs{} performance, one must sample at
least a few hundred or thousand stack unwindings, since unwinding a single
frame with regular DWARF takes on the order of $10\,\mu s$, and
\ehelfs{} were expected to perform significantly better.
However, unwinding over and over again from the same program point would be of
no interest, since \prog{libunwind} would simply cache the
relevant DWARF rows. At the same time, making sure that the various unwindings
are made from different locations is somewhat cheating, since it renders
\prog{libunwind}'s caching useless and does not reproduce a ``real-world''
unwinding distribution. All in all, the benchmarking method must exhibit a
``natural'' distribution of unwinding points.
Another requirement is to also distribute evenly enough the unwinding points
across the program to mimic real-world unwinding: we would like to benchmark
stack unwindings crossing some standard library functions, starting from inside
them, etc.
Finally, the unwound program must be interesting enough: it should call
functions often, build stacks of nested function calls (frequently at least 5
frames deep), have FDEs that are not as simple as in Listing~\ref{lst:ex1_dw},
etc.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Presentation of \prog{perf}}\label{ssec:perf}
\prog{Perf} is a \emph{profiler} that comes with the Linux ecosystem, and is
even developed within the Linux kernel source tree. A profiler is an important
tool from the developer's toolbox that analyzes the performance of programs by
recording the time spent in each function, including within nested calls. This
analysis often enables programmers to optimize critical paths and functions in
their programs, while leaving seldom-traversed functions unoptimized.
\prog{Perf} is a \emph{polling} profiler, as opposed to
\emph{instrumenting} profilers. This means that with \prog{perf}, the basic
idea is to stop the traced program at regular intervals, unwind its stack,
write down the current nested function calls, and integrate the sampled data in
the end. Instrumenting profilers, on the other hand, do not interrupt the
program, but instead inject code in it.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Benchmarking with \prog{perf}}\label{ssec:bench_perf}
In the context of this internship, the main advantage of \prog{perf} is that it
unwinds the stack on a regular, controllable basis, easily unwinding thousands
of times in a few seconds. It also meets all the requirements from
Section~\ref{ssec:bench_req} above: since it stops at regular intervals and
unwinds, the unwindings are evenly distributed \wrt{} the frequency of
execution of the code, which is a natural enough setup for the benchmarks to be
meaningful, while still unwinding from diverse locations, preventing
caching from being overwhelming --~as can be observed later in
Section~\ref{ssec:timeperf}. It also has the ability to unwind from
within any function, including functions of linked shared libraries. It can
also be applied to virtually any program, which allows unwinding
``interesting'' code.
The program that was chosen for \prog{perf}-benchmarking is
\prog{hackbench}~\cite{hackbenchsrc}. This small program is designed to
stress-test and benchmark the Linux scheduler by spawning processes or threads
that communicate with each other. It has the interest of generating stack
activity, being linked against \prog{libc} and \prog{pthread}, and being very
light.
\medskip
Interfacing \ehelfs{} with \prog{perf} first required forking
\prog{libunwind} and implementing \ehelfs{} support in it. In the process, it
turned out to be necessary to slightly modify \prog{libunwind}'s interface to
add a parameter to an initialisation function, since \prog{libunwind} is made
to be as agnostic of the system and process as possible, in order to be able to
unwind in any context. The very restricted information it passes around lacked
the memory map (see Section~\ref{ssec:ehelfs}) needed to use \ehelfs{} --~while,
on the other hand, providing information about the original DWARF that is now
useless.
Apart from this, the modified version of \prog{libunwind} produced is entirely
compatible with the vanilla version. This means that the only modifications
required to use \ehelfs{} within any project using \prog{libunwind} should be
changing one line of code to add one parameter to a function call and linking
against the modified version of \prog{libunwind} instead of the system version.
Once this was done, plugging it into \prog{perf} was a matter of only a few
lines of code, apart from the benchmarking code. The major difficulty was to
understand how \prog{perf} works. To avoid perturbing the traced program,
\prog{perf} does not unwind at runtime, but rather records at regular intervals
the program's stack, and all the auxiliary information that is needed to unwind
later. This is done when running \lstbash{perf record}. Then, a subsequent call
to \lstbash{perf report} unwinds the stack to analyze it; but at this point in
time, the traced process is long dead. Thus, any PID-based approach, or any
approach using \texttt{/proc} information fails. However, as this was the
easiest method, the first version of \ehelfs{} used those mechanisms; it took
some code rewriting to move to a PID- and \texttt{/proc}-agnostic
implementation.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Other explored methods}
The first benchmarking approach tried was to craft some specific C code
that would meet the requirements from Section~\ref{ssec:bench_req}, while
calling a benchmarking procedure itself from time to time. This was quickly
abandoned, because generating C code interesting enough to unwind turned
out to be hard, and the generated FDEs invariably ended up uninteresting. It
would also never have met the requirement of unwinding from fairly distributed
locations anyway.
Another attempt was made using CSmith~\cite{csmith}, a random C code generator
designed for random testing of C compilers. The idea was still to craft a
C program that would frequently unwind on its own, but to integrate
randomly generated CSmith code within hand-written C snippets that
would generate large enough FDEs and nested calls. This was abandoned as well,
since the call graph of CSmith-generated code is often far too small, and
CSmith code is notoriously hard to understand and edit.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Results}
\subsection{Hardware used}\label{ssec:bench_hw}
All the measurements in this report were made on a computer with an Intel Xeon
E3-1505M v6 CPU, with a clock frequency of $3.00$\,GHz and 8 cores. The
computer has 32\,GB of RAM, and care was taken never to fill it and start
swapping --~that is, using the hard drive to store data instead of the RAM when
it is full, which harshly degrades performance.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Measured time performance}\label{ssec:timeperf}
A benchmark of \ehelfs{} against the vanilla \prog{libunwind} was made using
the exact same methodology as in Section~\ref{ssec:bench_perf}, simply linking
\prog{perf} against the vanilla \prog{libunwind} instead. It yields the results
in Table~\ref{table:bench_time}.
\begin{table}[h]
\centering
\begin{tabular}{l r r r r r}
\toprule
\thead{Unwinding method} & \thead{Frames \\ unwound}
& \thead{Total time \\ unwinding ($\mu s$)}
& \thead{Average time \\ per frame ($ns$)}
& \thead{Unwinding \\ errors}
& \thead{Time ratio} \\
\midrule
\ehelfs{}
& 23506 % Frames unwound
& 14837 % Total time
& 631 % Avg time
& 1099 % # Errors
& 1
\\
\prog{libunwind}, cached
& 27058 % Frames unwound
& 441601 % Total time
& 16320 % Avg time
& 885 % # Errors
& 25.9
\\
\prog{libunwind}, uncached
& 27058 % Frames unwound
& 671292 % Total time
& 24809 % Avg time
& 885 % # Errors
& 39.3
\\
\bottomrule
\end{tabular}
\caption{Time benchmarking on hackbench}\label{table:bench_time}
\end{table}
The performance of \ehelfs{} is probably overestimated compared to a
production-ready version, since \ehelfs{} do not handle all the registers from
the original DWARF, lightening the computation. However, this overhead,
although impossible to measure without first implementing support for every
register, would probably not be that big, since most of the time is spent
finding the relevant row. Support for every DWARF instruction, however, would
not slow down the implementation at all, since every instruction would simply
be compiled to x86\_64 without affecting the already supported code.
The sharp difference between cached and uncached
\prog{libunwind} confirms that our experimental setup did not unwind at totally
different locations every single time, and thus was not biased in this
direction, since caching remains very effective.
The compilation time of \ehelfs{} is also reasonable. On the machine
described in Section~\ref{ssec:bench_hw}, and without using multiple cores to
compile, the various shared objects needed to run \prog{hackbench} --~that is,
\prog{hackbench}, \prog{libc}, \prog{ld} and \prog{libpthread}~-- are compiled
in an overall time of $25.28$ seconds, which a developer is probably prepared
to wait for.
The unwinding errors observed are hard to investigate, but are most probably
due to truncated stack records. Indeed, since \prog{perf} dumps the last $n$
bytes of the call stack (for a given $n$), and only keeps those for later
unwinding, large stacks lead to lost information when analyzing the results.
The difference in error counts between \ehelfs{} and the vanilla library could
be due either to unsupported DWARF instructions or registers, to
\prog{libdwarfpp} bugs, or to bugs in the custom \prog{libunwind}
implementation that were not spotted.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Measured compactness}\label{ssec:results_size}
A first measurement of compactness was made for one of the earliest working
versions in Table~\ref{table:basic_eh_elf_space}. The same data, generated for
the latest version of \ehelfs, can be seen in Table~\ref{table:bench_space}.
The effect of the outlining mentioned in Section~\ref{ssec:space_optim} is
particularly visible in this table: \prog{hackbench} has a significantly bigger
growth factor than the other shared objects. This is because \prog{hackbench}
has a much smaller \lstc{.eh_frame}; thus, the outlined data is reused only a
few times, compared to \eg{} \prog{libc}, in which the outlined data is reused
a lot.
Just as with time performance, the measured compactness would be impacted by
supporting every register, but probably only slightly, since the four supported
registers account for most of the columns --~see Section~\ref{ssec:instr_cov}.
\begin{table}[h]
\centering
\begin{tabular}{r r r r r r}
\toprule
\thead{Shared object} & \thead{Original \\ program size}
& \thead{Original \\ \lstc{.eh\_frame}}
& \thead{Generated \\ \ehelf{} \lstc{.text}}
& \thead{\% of original \\ program size}
& \thead{Growth \\ factor} \\
\midrule
libc-2.27.so
& 1.4 MiB & 130.1 KiB & 313.2 KiB & 21.88 & 2.41 \\
libpthread-2.27.so
& 58.1 KiB & 11.6 KiB & 25.4 KiB & 43.71 & 2.19 \\
ld-2.27.so
& 129.6 KiB & 9.6 KiB & 28.6 KiB & 22.09 & 2.97 \\
hackbench
& 2.9 KiB & 568.0 B & 2.8 KiB & 93.87 & 4.99 \\
Total
& 1.6 MiB & 151.8 KiB & 370.0 KiB & 22.81 & 2.44 \\
\bottomrule
\end{tabular}
\caption{\ehelfs{} space usage}\label{table:bench_space}
\end{table}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Instructions coverage}\label{ssec:instr_cov}
In order to determine which DWARF instructions should be implemented to
have meaningful results, as well as to assess the instruction coverage of our
compiler and \ehelfs, we must look at real-world ELF files and inspect the
instructions used.
The method chosen was to take a uniform random sample of 4000 ELF files among
those present on a basic ArchLinux system setup, in the directories
\texttt{/bin}, \texttt{/lib}, \texttt{/usr/bin}, \texttt{/usr/lib} and their
subdirectories, making sure those files were ELF64 files, and then to gather
statistics on those files.
\begin{table}[h]
\centering
\begin{tabular}{r r r r r r r}
\toprule
\thead{} & \thead{Unsupported \\ register rule}
& \thead{Register \\ rules seen}
& \thead{\% \\ supp.}
& \thead{Unsupported \\ expression}
& \thead{Expressions \\ seen}
& \thead{\% \\ supp.}
\\
\midrule
\makecell{Only supp. \\ columns} &
1603 & 42959683 & 99.996\,\% &
1114 & 5977 & 81.4\,\%
\\
All columns &
1607 & 67587841 & 99.998\,\% &
1154 & 13869 & 91.7\,\%
\\
\bottomrule
\end{tabular}
\caption{Instructions coverage statistics}\label{table:instr_cov}
\end{table}
\begin{table}[h]
\centering
\begin{tabular}{r r r r r r}
\toprule
\thead{}
& \thead{\texttt{Undefined}}
& \thead{\texttt{Same\_value}}
& \thead{\texttt{Offset}}
& \thead{\texttt{Val\_offset}}
& \thead{\texttt{Register}}
\\
\midrule
\makecell{Only supp. \\ columns}
& 1698 (0.006\,\%)
& 0
& 30038255 (99.9\,\%)
& 0
& 14 (0\,\%)
\\
All columns
& 1698 (0.003\,\%)
& 0
& 54666405 (99.9\,\%)
& 0
& 22 (0\,\%)
\\
\bottomrule
\toprule
\thead{}
& \thead{\texttt{Expression}}
& \thead{\texttt{Val\_expression}}
& \thead{\texttt{Architectural}}
& & \thead{Total}
\\
\midrule
\makecell{Only supp. \\ columns}
& 4475 (0.015\,\%)
& 0
& 0
& & 30044442
\\
All columns
& 12367 (0.02\,\%)
& 0
& 0
& & 54680492
\\
\bottomrule
\end{tabular}
\caption{Instruction type statistics}\label{table:instr_types}
\end{table}
Table~\ref{table:instr_cov} gives statistics about the proportion of
instructions encountered that were not supported by \ehelfs. The first row is
only concerned with the columns CFA, \reg{rip}, \reg{rsp}, \reg{rbp} and
\reg{rbx} (the supported registers --~see Section~\ref{ssec:ehelfs}). The
second row analyzes all the columns that were encountered, whether
supported or not in \ehelfs.
Table~\ref{table:instr_types} analyzes the proportion of each command
--~the formal way a register is set~-- for non-CFA columns in the sampled data. For
a brief explanation, \texttt{Offset} means stored at offset from CFA,
\texttt{Register} means the value from a machine register, \texttt{Expression}
means stored at the address of an expression's result, and the \texttt{Val\_}
prefix means that the value must not be dereferenced. Overall, it can be seen
that supporting \texttt{Offset} already means supporting the vast majority of
registers. The data gathered (not reproduced here) also suggests that
supporting a few common expressions is enough to support most of them. This is
further supported by the fact that we already support more than $80\,\%$ of
expressions only by supporting two basic constructs.
It is also worth noting that among all of the 4000 analyzed files, all the
unsupported expressions are clustered in only 12 of them, and only 24 contained
unsupported instructions at all.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section*{Conclusion}
From this data, we can deduce that
\begin{itemize}[itemsep=3pt, parsep=0pt]
\item compilation of the DWARF unwinding data is effective at drastically
    speeding up unwinding procedures: a speedup of $\times 25.9$;
\item code outlining is effective at reducing the produced binary size: from
    $1\ \text{MiB}$ to $370\ \text{KiB}$, that is, from a growth factor of $7$
    compared to DWARF unwinding data down to a growth factor of $2.45$;
\item unwinding relies on a small subset of DWARF instructions and
    expressions, while most instructions are not used at all in the DWARF code
    produced by compilers.
\end{itemize}
The overall size of the project is
\begin{itemize}[itemsep=3pt, parsep=0pt]
\item compiler: 1628 lines,
\item \prog{libunwind}: 810 lines,
\item \prog{perf}: 222 lines
\end{itemize}
\noindent{} for a total of 2660 lines of code on the main project. The various
statistics, benchmarking, testing and analyzing code modules add up to around
1500 more lines.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%% End main text content %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Bibliography %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\printbibliography{}
%% License notice %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\vfill
\hfill \begin{minipage}{0.7\textwidth}
\begin{flushright}
\itshape{} \small{}
Unless otherwise explicitly stated, any image from the present document
is distributed under Creative Commons BY-SA license, and any code
snippet is distributed under 3-clause BSD license.
\end{flushright}
\end{minipage}
\end{document}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%