\pagestyle{empty} %
\thispagestyle{empty}
\section*{Internship synthesis}
\subsection*{The general context}
The standard debugging data format, DWARF (Debugging With Attributed Record
Formats), contains tables that make it possible, for a given instruction
pointer (IP), to determine how the assembly instructions relate to the
original source code, where variables are currently allocated (in memory or
in a register), what their types are, and how to unwind the current stack
frame. This information is generated when passing \eg{} the \lstbash{-g}
switch to \prog{gcc} or equivalent compilers.

Even in stripped (non-debug) binaries, a small portion of DWARF data remains:
the stack unwinding data. This information is necessary to unwind stack
frames, restoring machine registers to the values they had in the previous
frame.

This data is structured into tables, each row corresponding to an IP range for
which it describes valid unwinding data, and each column describing how to
unwind a particular machine register (or a virtual register used for various
purposes). The vast majority of the rules actually used are basic --~see
Section~\ref{ssec:instr_cov}~--, consisting of offsets from memory
addresses stored in registers (such as \reg{rbp} or \reg{rsp}). Yet, the
standard defines rules that take the form of stack-machine expressions that
can access virtually all of the process's memory and perform Turing-complete
computations~\cite{oakley2011exploiting}.
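As an illustration, the unwinding table of a small function might look as
follows. This excerpt is hypothetical (the addresses and register columns are
made up), but mimics the interpreted output of \lstbash{readelf -wF}:
\texttt{rsp+8} means that the canonical frame address (CFA) is \reg{rsp} plus
8 bytes, \texttt{c-8} that the register is saved at CFA minus 8, and
\texttt{u} that it is undefined.

\begin{lstlisting}
   LOC      CFA      rbx   rbp   ra
00400a80    rsp+8    u     u     c-8
00400a81    rsp+16   u     c-16  c-8
00400a84    rbp+16   u     c-16  c-8
\end{lstlisting}

Each row takes effect at its \texttt{LOC} address and remains valid until the
next row, tracking the frame as the function's prologue executes.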
\subsection*{The research problem}
As debugging data can easily grow larger than the program itself if stored
carelessly, the DWARF standard pays great attention to data compactness and
compression. It succeeds particularly well at this, but at the expense of
efficiency: accessing stack unwinding data for a particular program point is an
expensive operation --~the order of magnitude is $10\,\mu{}\text{s}$ on a
modern computer.

This is often not a problem, since stack unwinding is usually thought of as a
debugging procedure: when something behaves unexpectedly, the programmer might
open their debugger and explore the stack. Yet, stack unwinding can, in some
cases, be performance-critical: for instance, polling profilers repeatedly
perform stack unwinding to observe which functions are active. Even worse, C++
exception handling relies on stack unwinding in order to find a suitable
catch-block! For such applications, it might be desirable to find a different
time/space trade-off, storing a bit more data for faster unwinding.

This different trade-off is precisely the question that I explored during this
internship: what good alternative trade-off is reachable when the stack
unwinding data is stored completely differently?

It seems that this subject has not been explored yet: as of now, the most
widely used library for stack unwinding, \prog{libunwind}~\cite{libunwind},
essentially relies on aggressive but fine-tuned caching and optimized code
to mitigate this problem.
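For reference, local unwinding with \prog{libunwind} follows a cursor-based
pattern, each step of which pays the DWARF table lookup cost discussed above.
The following is a minimal sketch using the documented \prog{libunwind} API
(error handling omitted):

\begin{lstlisting}[language=C]
#include <libunwind.h>

void walk_current_stack(void)
{
    unw_context_t context;  /* machine state at the call site */
    unw_cursor_t  cursor;   /* iterates over the stack frames */
    unw_word_t    ip;

    unw_getcontext(&context);
    unw_init_local(&cursor, &context);
    while (unw_step(&cursor) > 0) {
        /* each step restores the caller's registers, consulting
           the DWARF unwinding tables under the hood */
        unw_get_reg(&cursor, UNW_REG_IP, &ip);
    }
}
\end{lstlisting}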
\subsection*{Your contribution}
This internship explored the possibility of compiling DWARF's stack unwinding
data directly into native assembly on the x86\_64 architecture, in order to
provide fast access to the data at the assembly level. This compilation process
was fully implemented and tested on complex, real-world examples. The
integration of compiled DWARF into existing projects has been made easy by
implementing an alternative version of the \textit{de facto} standard library
for this purpose, \prog{libunwind}.

We explored and evaluated multiple approaches to determine which compilation
process leads to the best time/space trade-off.

Unexpectedly, the hardest part of the project was finding and implementing a
benchmarking protocol that was both relevant and reliable. Unwinding a single
frame takes around $10\,\mu{}\text{s}$, which is too fast to benchmark reliably
on only a few samples without statistical errors. Gathering enough samples for
this purpose --~at least a few thousand~-- is not easy, since one must avoid
unwinding the same frame over and over again, which would only benchmark the
caching mechanism. The other problem is to distribute the unwinding
measurements evenly across the various IPs, including those located in the
loaded libraries (\eg{} the \prog{libc}).
The solution eventually chosen was to modify \prog{perf}, the standard
profiling program for Linux, in order to gather statistics and benchmarks of
its unwindings. Modifying \prog{perf} was an additional challenge that turned
out to be harder than expected, since its source code is hard to read, and
optimisations make some parts counter-intuitive. To overcome this, we designed
an alternative version of \prog{libunwind} interfaced with the compiled
debugging data.
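To give an intuition of the approach, the compiled unwinding data can be
pictured as a native function that dispatches on the current IP range and
applies the corresponding table row. The C sketch below is purely
illustrative --~the function name, address ranges and offsets are made up,
and the actual output of the compilation process is assembly, not C:

\begin{lstlisting}[language=C]
#include <stdint.h>

/* Illustrative sketch: given the IP and the current register
   values, return the caller's canonical frame address (CFA).
   Each branch corresponds to one row of the unwinding table. */
uintptr_t unwind_cfa(uintptr_t ip, uintptr_t rsp, uintptr_t rbp)
{
    if (ip < 0x400a81) return rsp + 8;   /* before push of rbp  */
    if (ip < 0x400a84) return rsp + 16;  /* rbp pushed, not set */
    return rbp + 16;                     /* frame fully set up  */
}
\end{lstlisting}

Once the table is turned into straight-line native code of this shape, an
unwinding step no longer needs to parse and interpret DWARF data at run time,
which is where the speed-up comes from.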
\subsection*{Arguments supporting its validity}
The goal of this project was to design a compiled version of unwinding data
that is faster than DWARF, while still being reliable and reasonably compact.
Benchmarking has yielded convincing results: on the experimental setup created
--~detailed in Section~\ref{sec:benchmarking} below~--, the compiled
version is around 26 times faster than the DWARF version, while remaining only
around 2.5 times bigger than the original data.

The implementation is not yet release-ready, as it does not support the entire
DWARF5 instruction set~\cite{dwarf5std} --~see Section~\ref{ssec:ehelfs}
below. Yet, it supports the vast majority --~more than $99.9\,\%$~-- of the
instructions actually used in binaries, and it is almost as robust as
\prog{libunwind}, the reference implementation: corner cases do occur, and on
a $27\,000$-sample test, 885 failures were observed for \prog{libunwind},
against 1099 for the compiled DWARF version (failures are due to signal
handlers, unusual instructions, \ldots) --~see Section~\ref{ssec:timeperf}.

The implementation, however, is not yet production-ready: it only supports the
x86\_64 architecture, and relies to some extent on the Linux operating system.
Neither of these poses a fundamental problem. Supporting other processor
architectures and ABIs is only a matter of engineering. The operating system
dependency is only present in the libraries developed to interact with the
compiled unwinding data, which could be developed for virtually any operating
system.
\subsection*{Summary and future work}
In most everyday situations, a slow stack unwinding is not a problem, apart
from being an annoyance. Yet, a 26-fold speed-up on stack unwinding-heavy
tasks can be really useful, \eg{} to profile large programs, particularly if
one wants to profile many times in order to analyze the impact of multiple
changes. It can also be useful for exception-heavy programs. Thus, we plan to
address the remaining limitations and to integrate compiled unwinding data
cleanly with mainstream tools, such as \prog{perf}.

Another research direction is to investigate how to compress the original
DWARF unwinding data further using outlining techniques, as we already
successfully do for the compiled data.
\pagestyle{plain}
\newpage