
\pagestyle{empty} %
\thispagestyle{empty}
%% Warning: no more than one double-sided page!
% Do not keep the questions
\section*{Internship synthesis}
\subsection*{The general context}
The standard debugging data format for ELF binaries, DWARF, contains tables
that make it possible, for a given instruction pointer (IP), to determine how
an assembly instruction relates to the source code, where variables are
currently allocated --~in memory or in a register~--, what their types are,
and how to unwind the current stack frame. This information is generated when
passing \eg{} the \lstbash{-g} switch to \prog{gcc} or equivalent compilers.
Even in stripped (non-debug) binaries, a small portion of DWARF data remains:
the stack unwinding data. This information is necessary to unwind stack
frames, restoring machine registers to the values they held in the previous
frame, for instance within a debugger or a profiler.
This data is structured into tables: each row corresponds to an IP range for
which it describes valid unwinding data, and each column describes how to
unwind a particular machine register (or a virtual register used for various
purposes). The vast majority of the rules actually used are basic --~see
Section~\ref{ssec:instr_cov}~--, consisting of offsets from memory addresses
stored in registers (such as \reg{rbp} or \reg{rsp}). Yet, the standard also
defines rules taking the form of stack-machine expressions, which can access
virtually all of the process's memory and perform Turing-complete
computation~\cite{oakley2011exploiting}.
\subsection*{The research problem}
Since debugging data can easily occupy an unreasonable amount of space
--~growing larger than the program itself if stored carelessly~--, the DWARF
standard pays great attention to data compactness and compression, and
succeeds particularly well at it. But this comes at the expense of
efficiency: accessing stack unwinding data for a particular program point is
not a lightweight operation --~on the order of $10\,\mu{}\text{s}$ on a
modern computer.
This is usually not a problem, since stack unwinding is mostly thought of as
a debugging procedure: when something behaves unexpectedly, the programmer
might open their debugger and explore the stack. Yet, stack unwinding can be
performance-critical in some cases: profilers, for instance, need to perform
a large number of stack unwinds. Even worse, exception handling relies on
stack unwinding to find a suitable catch-block! For such applications, it
might be desirable to find a different time/space trade-off, storing a bit
more data in exchange for faster unwinding.
This trade-off is the question I explored during this internship: what
alternative trade-offs become reachable when the stack unwinding data is
stored in a completely different way?
The subject does not seem to have been explored yet: as of now, the most
widely used library for stack unwinding,
\prog{libunwind}~\cite{libunwind}, essentially relies on aggressive but
fine-tuned caching and optimized code to mitigate this problem.
\subsection*{Your contribution}
This internship explored the possibility of compiling DWARF stack unwinding
data directly into native assembly on the x86\_64 architecture, in order to
provide fast access to the data at the assembly level. This compilation
process was fully implemented and tested on complex, real-world examples.
Integrating compiled DWARF into existing projects was made easy by
implementing an alternative version of the \textit{de facto} standard library
for this purpose, \prog{libunwind}.
Multiple approaches were tried in order to determine which compilation
process leads to the best time/space trade-off.
Unexpectedly, the hardest part of the project turned out to be finding and
implementing a benchmarking protocol that was both relevant and reliable.
Unwinding a single frame --~around $10\,\mu{}\text{s}$~-- is far too fast to
benchmark reliably from only a few samples. Gathering enough samples --~at
least a few thousand~-- is not easy either, since one must avoid unwinding
the same frame over and over again, which would only benchmark the caching
mechanism. Another difficulty is distributing the unwinding measurements
evenly across the various IPs, including those inside loaded libraries
(\eg{} the \prog{libc}).
The solution eventually chosen was to modify \prog{perf}, the standard
profiling program for Linux, to gather statistics and benchmarks of its
unwindings. Modifying \prog{perf} was an additional challenge that turned out
to be harder than expected, since its source code is fairly opaque to someone
who does not know the project well, and optimisations make some parts
counter-intuitive. In particular, this required producing an alternative
version of \prog{libunwind} interfaced with the compiled debugging data.
\subsection*{Arguments supporting its validity}
The goal was to obtain a compiled version of the unwinding data that is
faster than DWARF, only reasonably heavier, and reliable. The benchmarks
mentioned above yielded convincing results: on the experimental setup created
(detailed in Section~\ref{sec:benchmarking} below), the compiled version is
around 26 times faster than the DWARF version, while remaining only around
2.5 times bigger than the original data.
The implementation is not yet release-ready, as it does not support
$100\,\%$ of the DWARF5 specification~\cite{dwarf5std} --~see
Section~\ref{ssec:ehelfs} below. Yet, it supports the vast majority --~more
than $99.9\,\%$~-- of the cases seen in the wild, and is decently robust
compared to \prog{libunwind}, the reference implementation. Indeed, corner
cases occur often: on a test of 27\,000 samples, 885 failures were observed
for \prog{libunwind}, against 1099 for the compiled DWARF version (see
Section~\ref{ssec:timeperf}).
The implementation, however, has a few other limitations. It only supports
the x86\_64 architecture, and relies to some extent on the Linux operating
system. Neither of these is a real problem in practice: supporting other
processor architectures and ABIs is only a matter of engineering time, and
the operating system dependency is confined to the libraries developed to
interact with the compiled unwinding data, which could be ported to virtually
any operating system.
\subsection*{Summary and future work}
In most everyday situations, slow stack unwinding is not a problem, merely an
annoyance. Yet, a 26-fold speed-up on unwinding-heavy tasks can be genuinely
useful, \eg{} for profiling large programs, particularly when profiling
repeatedly to analyze the impact of successive changes. It can also benefit
exception-heavy programs. It might therefore be interesting to implement a
more stable version, and to interface it cleanly with mainstream tools such
as \prog{perf}.
Another question worth exploring is whether the original DWARF unwinding data
can be shrunk even further --~while keeping a format close to the original
standard~-- by applying techniques similar to those used to shrink the
compiled unwinding data.
\pagestyle{plain}
\newpage