Live changes during meeting

This commit is contained in:
Théophile Bastian 2018-08-19 13:13:07 +02:00
parent 2f44049506
commit 73f016f44c
2 changed files with 82 additions and 78 deletions

\subsection*{The general context}
The standard debugging data format, DWARF, contains tables that, for a given
instruction pointer (IP), make it possible to understand how the assembly
instruction relates to the source code, where variables are currently
allocated in memory or whether they are stored in a register, what their types
are and how to unwind the current stack frame. This information is generated
when passing \eg{} the \lstbash{-g} switch to \prog{gcc} or equivalents.
Even in stripped (non-debug) binaries, a small portion of DWARF data remains:
the stack unwinding data. This information is necessary to unwind stack
frames, restoring machine registers to the value they had in the previous
frame.
This data is structured into tables, each row corresponding to an IP range for
which it describes valid unwinding data, and each column describing how to
computation~\cite{oakley2011exploiting}.
As debugging data can easily take an unreasonable space and grow larger than
the program itself if stored carelessly, the DWARF standard pays a great
attention to data compactness and compression. It succeeds particularly well
at it, but at the expense of efficiency: accessing stack
unwinding data for a particular program point is an expensive operation --~the
order of magnitude is $10\,\mu{}\text{s}$ on a modern computer.
This is usually not a problem, as stack unwinding is often thought of as a
debugging procedure: when something behaves unexpectedly, the programmer might
be interested in opening their debugger and exploring the stack. Yet, stack
unwinding might, in some cases, be performance-critical: for instance, polling
profilers repeatedly perform stack unwindings to observe which functions are
active. Even worse, C++ exception handling relies on stack unwinding in order
to find a suitable catch-block! For such applications, it might be desirable to
find a different time/space trade-off, storing a bit more for a faster
unwinding.
This is the question I explored during this internship: what alternative
time/space trade-off becomes reachable when the stack unwinding data is stored
in a completely different format?
It seems that the subject has not been explored yet, and as of now, the most
widely used library for stack unwinding, \prog{libunwind}~\cite{libunwind},
essentially makes use of aggressive but fine-tuned caching and optimized code
to mitigate this problem.
% What is the question that you studied?
% Why is it important, what are the applications/consequences?
of compiled DWARF into existing projects have been made easy by implementing an
alternative version of the \textit{de facto} standard library for this purpose,
\prog{libunwind}.
Multiple approaches have been tried and evaluated to determine which
compilation process leads to the best time/space trade-off.
Unexpectedly, the part that proved hardest of the project was finding and
implementing a benchmarking protocol that was both relevant and reliable.
Unwinding a single frame (around $10\,\mu s$) is far too fast to be measured
reliably from a handful of samples. Having enough samples for this purpose
--~at least a few thousand~-- is not easy, since one must avoid unwinding the
same frame over and over again, which would only benchmark the caching
mechanism. The other problem is to distribute the unwinding measurements
evenly across the various IPs, including those inside the loaded libraries
(\eg{} the \prog{libc}).
The solution eventually chosen was to modify \prog{perf}, the standard
profiling program for Linux, in order to gather statistics and benchmarks of
its unwindings. Modifying \prog{perf} was an additional challenge that turned
out to be harder than expected, since the source code is hard to read, and
optimisations make some parts counter-intuitive. To overcome this, we designed
an alternative version of \prog{libunwind} interfaced with the
compiled debugging data.
% What is your solution to the question described in the last paragraph?
%
data.
%
% Comment the robustness of your solution: how does it rely/depend on the working assumptions?
The goal of this project was to design a compiled version of unwinding data
that is faster than DWARF, while still being reliable and reasonably compact.
The benchmarks mentioned have yielded convincing results: on the experimental
setup created --~detailed in Section~\ref{sec:benchmarking} below~--, the
compiled version is around 26 times faster than the DWARF version, while
remaining only around 2.5 times bigger than the original data.
We support the vast majority --~more than $99.9\,\%$~-- of the instructions
actually used in binaries, although we do not support the full DWARF5
instruction set. We are almost as robust as \prog{libunwind}: on a test of
$27000$ samples, $885$ failures were observed for \prog{libunwind}, against
$1099$ for the compiled DWARF version (failures are due to signal handlers,
unusual instructions, \ldots) --~see Section~\ref{ssec:timeperf}.
The implementation, however, is not production-ready: it only supports the
x86\_64 architecture, and relies to some extent on the Linux operating system.
None of those are real problems in practice. Supporting other processor
architectures and ABIs is only a matter of engineering work. The operating
system dependency is only present in the libraries developed to interact with
the compiled unwinding data, which can be ported to virtually any operating
system.
\subsection*{Summary and future work}
In most everyday situations, a slow stack unwinding is not a problem, merely
an annoyance. Yet, a 26-times speed-up on stack unwinding-heavy tasks can be
really useful to \eg{} profile large programs, particularly if one wants to
profile many times in order to analyze the impact of multiple changes.
It can also be useful for exception-heavy programs. Thus, we plan to address
the limitations and integrate it cleanly with mainstream tools, such as
\prog{perf}.
Another research direction is to investigate whether the original DWARF
unwinding data could be compressed even further using outlining techniques, as
we already do successfully for the compiled data.
% What is next? In which respect is your approach general?
% What did your contribution bring to the area?

\title{DWARF debugging data, compilation and optimization}
\author{Théophile Bastian\\
Under supervision of Francesco Zappa Nardelli, March -- August 2018\\
{\textsc{parkas}, École Normale Supérieure de Paris}}
%\date{March -- August 2018\\August 20, 2018}
\date{\vspace{-2em}}
\documentclass[11pt]{article}
\subsection*{Source code}\label{ssec:source_code}
Our implementation is available from \url{https://git.tobast.fr/m2-internship}.
See the \texttt{abstract} repository for an introductory \texttt{README}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
copying \reg{rsp} to \reg{rbp}. This makes it easy to find the return address
from anywhere within the function, and also allows for easy addressing of local
variables. To some extent, it also allows for hot debugging, such as saving a
useful core dump upon segfault. Yet, using \reg{rbp} to save \reg{rip} is not
always done, since it wastes a register. This decision is, on x86\_64 System V,
up to the compiler.
Usually, a function starts by subtracting some value from \reg{rsp}, allocating
some space in the stack frame for its local variables. Then, it pushes on
the stack the values of the callee-saved registers that are overwritten later,
effectively saving them. Before returning, it pops the values of the saved
registers back to their original registers and restores \reg{rsp} to its former
value.
\subsection{Stack unwinding}\label{ssec:stack_unwinding}
For various reasons, it is interesting, at some point of the execution of a
program, to glance at its program stack and extract information from it. For
instance, when running a debugger, a frequent usage is to obtain a
\emph{backtrace}, that is, the list of all nested function calls at the current
IP\@. This actually reads the stack to find the different stack frames, and
decodes them to identify the function names, parameter values, etc.
This operation is far from trivial. Often, a stack frame will only make sense
when the correct values are stored in the machine registers. These values,
no time directly in \lstc{fct_a}, but spend a lot of time in calls to the other
two functions that were made from \lstc{fct_a}. Knowing that after all,
\lstc{fct_a} is the culprit can be useful to a programmer.
Exception handling also requires a stack unwinding mechanism in some languages.
Indeed, an exception is completely different from a \lstinline{return}: while
the latter returns to the previous function, at a well-defined IP, the former
can be caught by virtually any function in the call path, at any point of the
between them.
\\
\hline
\end{tabular}
\caption{Stack frame schema for fib7 (horizontal layout)}\label{table:ex1_stack_schema}
\end{table}
For instance, the C source code in Listing~\ref{lst:ex1_c}, when compiled
Its grammar is as follows:
\end{align*}
The entry point of the grammar is a $\FDE$, which is a set of rows, each
annotated with a machine address, the address from which it is valid.
The addresses are necessarily increasing within a FDE\@.
Each row then represents, as a function mapping registers to values, a row of
the unwinding table.
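As an illustration (an example row constructed here from the standard x86\_64
DWARF conventions, not one extracted from a real binary), such a row might read
\[
\left(\texttt{0x1040};\;
      \mathrm{CFA} = \reg{rsp} + 16;\;
      \reg{rip} = [\mathrm{CFA} - 8];\;
      \reg{rbp} = [\mathrm{CFA} - 16]\right)
\]
meaning that, for every IP from \texttt{0x1040} up to the next row's address,
the call frame address (CFA) is the current \reg{rsp} plus 16 bytes, and the
caller's \reg{rip} and \reg{rbp} are stored just below the CFA.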
and $\semR{\bullet}$ is defined as
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Stack unwinding data compilation}
In this section, we will study all the design options we explored for the
actual C implementation.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Code availability}\label{ssec:code_avail}