Rephrase and correct everything up to end of §1
parent
4016b4f46c
commit
f0809dbf1c
2 changed files with 203 additions and 195 deletions
|
@ -8,45 +8,45 @@
|
|||
|
||||
\subsection*{The general context}
|
||||
|
||||
The standard debugging data format for ELF binary files, DWARF, contains a lot
|
||||
of information, which is generated mostly when passing \eg{} the switch
|
||||
\lstbash{-g} to \prog{gcc}. This information, essentially provided for
|
||||
debuggers, contains all that is needed to connect the generated assembly with
|
||||
the original code, information that can be used by sanitizers (\eg{} the type
|
||||
of each variable in the source language), etc.
|
||||
The standard debugging data format for ELF binary files, DWARF, contains tables
|
||||
that make it possible, for a given instruction pointer (IP), to understand how the
|
||||
assembly instruction relates to the source code, where variables are currently
|
||||
allocated in memory or whether they are stored in a register, what their type is,
|
||||
and how to unwind the current stack frame. This information is generated when
|
||||
passing \eg{} the switch \lstbash{-g} to \prog{gcc} or equivalents.
|
||||
|
||||
Even in stripped (non-debug) binaries, a small portion of DWARF data remains.
|
||||
Among this essential data that is never stripped is the stack unwinding data,
|
||||
which makes it possible to unwind stack frames, restoring machine registers to the value
|
||||
they had in the previous frame, for instance within the context of a debugger
|
||||
or a profiler.
|
||||
Even in stripped (non-debug) binaries, a small portion of DWARF data remains:
|
||||
the stack unwinding data. This information is necessary to unwind stack
|
||||
frames, restoring machine registers to the value they had in the previous
|
||||
frame, for instance within the context of a debugger or a profiler.
|
||||
|
||||
This data is structured into tables, each row corresponding to a program
|
||||
counter (PC) range for which it describes valid unwinding data, and each column
|
||||
describing how to unwind a particular machine register (or virtual register
|
||||
used for various purposes). These rules are mostly basic, consisting of offsets
|
||||
from memory addresses stored in registers (such as \reg{rbp} or \reg{rsp}), but
|
||||
in some cases, they can take the form of a stack-machine expression that can
|
||||
access virtually all the process's memory and perform Turing-complete
|
||||
This data is structured into tables, each row corresponding to an IP range for
|
||||
which it describes valid unwinding data, and each column describing how to
|
||||
unwind a particular machine register (or virtual register used for various
|
||||
purposes). The vast majority of the rules actually used are basic --~see
|
||||
Section~\ref{ssec:instr_cov}~\textendash, consisting of offsets from memory
|
||||
addresses stored in registers (such as \reg{rbp} or \reg{rsp}). Yet, the
|
||||
standard defines rules that take the form of a stack-machine expression that
|
||||
can access virtually all the process's memory and perform Turing-complete
|
||||
computation~\cite{oakley2011exploiting}.
|
||||
|
||||
\subsection*{The research problem}
|
||||
|
||||
As debugging data can easily take up an unreasonable amount of space if stored carelessly,
|
||||
the DWARF standard pays great attention to data compactness and compression,
|
||||
and succeeds particularly well at it. But this, as always, is at the expense
|
||||
of efficiency: accessing stack unwinding data for a particular program point
|
||||
can be quite costly.
|
||||
As debugging data can easily take up an unreasonable amount of space and grow larger than
|
||||
the program itself if stored carelessly, the DWARF standard pays great
|
||||
attention to data compactness and compression, and succeeds particularly well
|
||||
at it. But this, as always, is at the expense of efficiency: accessing stack
|
||||
unwinding data for a particular program point is not a light operation --~on
|
||||
the order of magnitude of $10\,\mu{}\text{s}$ on a modern computer.
|
||||
|
||||
This is often not a huge problem, as stack unwinding is mostly thought of as a
|
||||
This is often not a huge problem, as stack unwinding is often thought of as a
|
||||
debugging procedure: when something behaves unexpectedly, the programmer might
|
||||
be interested in opening their debugger and exploring the stack. Yet, stack
|
||||
unwinding might, in some cases, be performance-critical: for instance, profiler
|
||||
programs need to perform a great many stack unwindings. Even worse,
|
||||
exception handling relies on stack unwinding in order to find a suitable
|
||||
catch-block! For such applications, it might be desirable to find a different
|
||||
time/space trade-off, allowing a slightly space-heavier, but far more
|
||||
time-efficient unwinding procedure.
|
||||
time/space trade-off, storing a bit more for a faster unwinding.
|
||||
|
||||
This different trade-off is the question that I explored during this
|
||||
internship: what good alternative trade-off is reachable when storing the stack
|
||||
|
@ -69,29 +69,31 @@ This internship explored the possibility to compile DWARF's stack unwinding
|
|||
data directly into native assembly on the x86\_64 architecture, in order to
|
||||
provide fast access to the data at assembly level. This compilation process was
|
||||
fully implemented and tested on complex, real-world examples. The integration
|
||||
of compiled DWARF into existing, real-world projects has been made easy by
|
||||
implementing an alternative version of the \textit{de facto} standard library
|
||||
for this purpose, \prog{libunwind}.
|
||||
of compiled DWARF into existing projects has been made easy by implementing an
|
||||
alternative version of the \textit{de facto} standard library for this purpose,
|
||||
\prog{libunwind}.
|
||||
|
||||
Multiple approaches have been tried, in order to determine which compilation
|
||||
process leads to the best time/space trade-off.
|
||||
|
||||
Unexpectedly, the hardest part of the project turned out to be finding a
|
||||
benchmarking protocol that was both relevant and reliable. Unwinding one single
|
||||
frame is far too fast to provide a reliable benchmark on a few samples
|
||||
(around $10\,\mu s$ per frame). Having a lot of samples is not easy, since one
|
||||
must avoid unwinding the same frame over and over again, which would only
|
||||
benchmark the caching mechanism. The other problem is to distribute evenly the
|
||||
unwinding measures across the various program positions, including directly
|
||||
into the loaded libraries (\eg{} the \prog{libc}).
|
||||
Unexpectedly, the hardest part of the project turned out to be finding and
|
||||
implementing a benchmarking protocol that was both relevant and reliable.
|
||||
Unwinding one single frame is far too fast to provide a reliable benchmark
|
||||
on a few samples (around $10\,\mu s$ per frame). Having enough samples for this
|
||||
purpose --~at least a few thousand~-- is not easy, since one must avoid
|
||||
unwinding the same frame over and over again, which would only benchmark the
|
||||
caching mechanism. The other problem is to evenly distribute the unwinding
|
||||
measurements across the various IPs, including directly into the loaded libraries
|
||||
(\eg{} the \prog{libc}).
|
||||
|
||||
The solution eventually chosen was to modify \prog{perf}, the standard
|
||||
profiling program for Linux, in order to gather statistics and benchmarks of
|
||||
its unwindings. Modifying \prog{perf} was an additional challenge that turned
|
||||
out to be harder than expected, since the source code is pretty opaque to
|
||||
someone who doesn't know the project well. This, in particular, required to
|
||||
produce an alternative version of \prog{libunwind} interfaced with the compiled
|
||||
debugging data.
|
||||
someone who doesn't know the project well, and the optimisations make some
|
||||
parts counter-intuitive. This, in particular, required producing an
|
||||
alternative version of \prog{libunwind} interfaced with the compiled debugging
|
||||
data.
|
||||
|
||||
% What is your solution to the question described in the last paragraph?
|
||||
%
|
||||
|
@ -108,15 +110,16 @@ debugging data.
|
|||
|
||||
The goal was to obtain a compiled version of unwinding data that was faster
|
||||
than DWARF, reasonably heavier and reliable. The benchmarks mentioned have
|
||||
yielded convincing results: on the experimental setup created (detailed later
|
||||
in this report), the compiled version is around 26 times faster than the DWARF
|
||||
version, while it remains only around 2.5 times bigger than the original data.
|
||||
yielded convincing results: on the experimental setup created (detailed in
|
||||
Section~\ref{sec:benchmarking} below), the compiled version is around 26 times
|
||||
faster than the DWARF version, while it remains only around 2.5 times bigger
|
||||
than the original data.
|
||||
|
||||
The implementation is not yet release-ready, as it does not support 100\ \% of
|
||||
the DWARF5 specification~\cite{dwarf5std} --~see Section~\ref{ssec:ehelfs}
|
||||
below. Yet, it supports the vast majority --~around $99.9$\ \%~-- of the cases
|
||||
seen in the wild, and is decently robust compared to \prog{libunwind}, the
|
||||
reference implementation. Indeed, corner cases occur often, and on a 27000
|
||||
below. Yet, it supports the vast majority --~more than $99.9$\ \%~-- of the
|
||||
cases seen in the wild, and is decently robust compared to \prog{libunwind},
|
||||
the reference implementation. Indeed, corner cases occur often, and on a 27000
|
||||
samples test, 885 failures were observed for \prog{libunwind}, against 1099 for
|
||||
the compiled DWARF version (see Section~\ref{ssec:timeperf}).
|
||||
|
||||
|
@ -130,13 +133,13 @@ virtually any operating system.
|
|||
|
||||
\subsection*{Summary and future work}
|
||||
|
||||
In most everyday cases, slow stack unwinding is not a problem, or
|
||||
even an annoyance. Yet, having a 26 times speed-up on stack unwinding-heavy
|
||||
tasks, such as profiling, can be really useful to profile large programs,
|
||||
particularly if one wants to profile many times in order to analyze the impact
|
||||
of multiple changes. It can also be useful for exception-heavy programs. Thus,
|
||||
it might be interesting to implement a more stable version, and try to
|
||||
interface it cleanly with mainstream tools, such as \prog{perf}.
|
||||
In most everyday cases, slow stack unwinding is not a problem, nor
|
||||
even an annoyance. Yet, having a 26-fold speed-up on stack unwinding-heavy
|
||||
tasks can be really useful to \eg{} profile large programs, particularly if one
|
||||
wants to profile many times in order to analyze the impact of multiple changes.
|
||||
It can also be useful for exception-heavy programs. Thus, it might be
|
||||
interesting to implement a more stable version, and try to interface it cleanly
|
||||
with mainstream tools, such as \prog{perf}.
|
||||
|
||||
Another question worth exploring might be whether it is possible to further
|
||||
shrink the original DWARF unwinding data, which would be stored in a format not
|
||||
|
|
|
@ -1,7 +1,7 @@
|
|||
\title{DWARF debugging data, compilation and optimization}
|
||||
|
||||
\author{Théophile Bastian\\
|
||||
Under supervision of Francesco Zappa-Nardelli\\
|
||||
Under supervision of Francesco Zappa Nardelli\\
|
||||
{\textsc{parkas}, \'Ecole Normale Supérieure de Paris}}
|
||||
|
||||
\date{March -- August 2018\\August 20, 2018}
|
||||
|
@ -54,21 +54,17 @@ Under supervision of Francesco Zappa-Nardelli\\
|
|||
|
||||
\subsection*{Source code}\label{ssec:source_code}
|
||||
|
||||
The source code of all the implementations made during this internship is
|
||||
available at \url{https://git.tobast.fr/m2-internship/}. See
|
||||
All the source code produced during this internship is available openly. See
|
||||
Section~\ref{ssec:code_avail} for details.
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\section{Stack unwinding data presentation}
|
||||
|
||||
The compilation process presented in this section is implemented in
|
||||
\prog{dwarf-assembly}.
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\subsection{Stack frames and x86\_64 calling conventions}
|
||||
|
||||
On most platforms, programs make use of a \emph{call stack} to store
|
||||
On every common platform, programs make use of a \emph{call stack} to store
|
||||
information about the nested function calls at the current execution point, and
|
||||
keep track of their nesting. This call stack is conventionally a contiguous
|
||||
memory space mapped close to the top of the addressing space. Each function
|
||||
|
@ -80,15 +76,15 @@ restored before returning, the function's return address and local variables.
|
|||
|
||||
On the x86\_64 platform, with which this report is mostly concerned, the
|
||||
calling convention that is followed is defined in the System V
|
||||
ABI~\cite{systemVabi} for the Unix-like operating systems --~among which Linux.
|
||||
Under this calling convention, the first six arguments of a function are passed
|
||||
in the registers \reg{rdi}, \reg{rsi}, \reg{rdx}, \reg{rcx}, \reg{r8},
|
||||
\reg{r9}, while additional arguments are pushed onto the stack. It also defines
|
||||
which registers may be overwritten by the callee, and which parameters must be
|
||||
restored before returning. This restoration, most of the time, is done by
|
||||
pushing the register value onto the stack in the function prologue, and
|
||||
restoring it just before returning. Those preserved registers are \reg{rbx},
|
||||
\reg{rsp}, \reg{rbp}, \reg{r12}, \reg{r13}, \reg{r14}, \reg{r15}.
|
||||
ABI~\cite{systemVabi} for the Unix-like operating systems --~among which Linux
|
||||
and macOS\@. Under this calling convention, the first six arguments of a
|
||||
function are passed in the registers \reg{rdi}, \reg{rsi}, \reg{rdx},
|
||||
\reg{rcx}, \reg{r8}, \reg{r9}, while additional arguments are pushed onto the
|
||||
stack. It also defines which registers may be overwritten by the callee, and
|
||||
which registers must be restored before returning. This restoration, for most
|
||||
compilers, is done by pushing the register value onto the stack in the function
|
||||
prologue, and restoring it just before returning. Those preserved registers are
|
||||
\reg{rbx}, \reg{rsp}, \reg{rbp}, \reg{r12}, \reg{r13}, \reg{r14}, \reg{r15}.
|
||||
|
||||
\begin{wrapfigure}{r}{0.4\textwidth}
|
||||
\centering
|
||||
|
@ -104,29 +100,32 @@ use \reg{rbp} (``base pointer'') to save this value of \reg{rip}, by writing
|
|||
the old value of \reg{rbp} just below the return address on the stack, then
|
||||
copying \reg{rsp} to \reg{rbp}. This makes it easy to find the return address
|
||||
from anywhere within the function, and also allows for easy addressing of local
|
||||
variables. Yet, using \reg{rbp} to save \reg{rip} is not always done, since it
|
||||
somehow ``wastes'' a register. This decision is, on x86\_64 System V, up to the
|
||||
compiler.
|
||||
variables. To some extent, it also allows for hot debugging, such as saving a
|
||||
useful core dump upon segfault. Yet, using \reg{rbp} to save \reg{rip} is not
|
||||
always done, since it somehow ``wastes'' a register. This decision is, on
|
||||
x86\_64 System V, up to the compiler.
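
When \reg{rbp} is used this way, the saved values of \reg{rbp} form a linked list of frames that can be followed without any extra data. The following is only a minimal sketch on a simulated stack: the \lstc{struct frame} layout and the function name \lstc{walk_frame_pointers} are hypothetical and introduced for illustration; they are not taken from the implementation described in this report.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a frame laid out as described above: the saved
 * rbp sits just below the return address. */
struct frame {
    const struct frame *saved_rbp; /* caller's frame, NULL at the bottom */
    uintptr_t return_address;
};

/* Follow the chain of saved rbp values, storing up to `max` return
 * addresses into `out`. Returns the number of frames walked. */
size_t walk_frame_pointers(const struct frame *rbp, uintptr_t *out, size_t max)
{
    size_t depth = 0;
    assert(out != NULL || max == 0);
    while (rbp != NULL && depth < max) {
        out[depth++] = rbp->return_address;
        rbp = rbp->saved_rbp;
    }
    return depth;
}
```

This shortcut only works when every function on the call path keeps \reg{rbp} as a frame pointer, and it recovers return addresses only, not the other callee-saved registers.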
|
||||
|
||||
Often, a function will start by subtracting some value from \reg{rsp}, allocating
|
||||
some space in the stack frame for its local variables. Then, it will push on
|
||||
some space in the stack frame for its local variables. Then, it will push on
|
||||
the stack the values of the callee-saved registers that are overwritten later,
|
||||
effectively saving them. Before returning, it will pop the values of the saved
|
||||
registers back to their original registers and restore \reg{rsp} to its former
|
||||
value.
|
||||
|
||||
\subsection{Stack unwinding}
|
||||
\subsection{Stack unwinding}\label{ssec:stack_unwinding}
|
||||
|
||||
For various reasons, it might be interesting, at some point of the execution of
|
||||
a program, to glance at its program stack and be able to extract information
|
||||
from it. For instance, when running a debugger such as \prog{gdb}, a frequent
|
||||
usage is to obtain a \emph{backtrace}, that is, the list of all nested function
|
||||
calls at this point. This actually reads the stack to find the different stack
|
||||
frames, and decode them to identify the function names, parameter values, etc.
|
||||
calls at the current IP\@. This actually reads the stack to find the different
|
||||
stack frames, and decodes them to identify the function names, parameter values,
|
||||
etc.
|
||||
|
||||
This operation is far from trivial. Often, a stack frame will only make sense
|
||||
with correct machine registers values, which can be restored from the previous
|
||||
stack frame, imposing to \emph{walk} the stack, reading the entries one after
|
||||
when the correct values are stored in the machine registers. These values,
|
||||
however, are to be restored from the previous stack frame, where they are
|
||||
stored. This makes it necessary to \emph{walk} the stack, reading the entries one after
|
||||
the other, instead of peeking at some frame directly. Moreover, the size of one
|
||||
stack frame is often not that easy to determine when looking at some
|
||||
instruction other than \texttt{return}, making it hard to extract single frames
|
||||
|
@ -138,28 +137,29 @@ frame, and thus be able to decode the next frame recursively, is called
|
|||
|
||||
Let us consider a stack with x86\_64 calling conventions, such as shown in
|
||||
Figure~\ref{fig:call_stack}. Assuming the compiler decided here \emph{not} to
|
||||
use \reg{rbp}, and assuming the function \eg{} allocates a buffer of 8
|
||||
use \reg{rbp}, and assuming the function allocates \eg{} a buffer of 8
|
||||
integers, the area allocated for local variables should be at least $32$ bytes
|
||||
long (for 4-byte integers), and \reg{rsp} will be pointing below this area.
|
||||
Apart from analyzing the assembly code produced, there is no way to find where
|
||||
the return address is stored, relative to \reg{rsp}, at some arbitrary point
|
||||
of the function. Even when \reg{rbp} is used, there is no easy way to guess
|
||||
where each callee-saved register is stored in the stack frame, and worse, which
|
||||
callee-saved registers were saved, since it is optional to save a register
|
||||
that the function never touches.
|
||||
where each callee-saved register is stored in the stack frame, since the
|
||||
compiler is free to do as it wishes. Even worse, it is not trivial to know
|
||||
which callee-saved registers were saved at all, since if the function does not alter a
|
||||
register, it does not have to save it.
|
||||
|
||||
With this example, it seems pretty clear that it is often necessary to have
|
||||
additional data to perform stack unwinding. This data is often stored among the
|
||||
debugging information of a program, and one common format of debugging data is
|
||||
DWARF\@.
|
||||
With this example, it seems pretty clear that some additional data is necessary
|
||||
to perform stack unwinding reliably, without resorting to guesswork. This
|
||||
data is stored along with the debugging information of a program, and one
|
||||
common format of debugging data is DWARF\@.
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\subsection{Unwinding usage and frequency}
|
||||
|
||||
Stack unwinding is a more common operation than one might think at first. The
|
||||
most commonly thought use-case is simply to get a stack trace of a program, and
|
||||
provide a debugger with the information it needs: for instance, when inspecting
|
||||
a stack trace in \prog{gdb}, it is quite common to jump to a previous frame:
|
||||
use case that most readily comes to mind is simply to get a stack trace of a program, and
|
||||
provide a debugger with the information it needs. For instance, when inspecting
|
||||
a stack trace in \prog{gdb}, a common operation is to jump to a previous frame:
|
||||
|
||||
\lstinputlisting{src/segfault/gdb_session}
|
||||
|
||||
|
@ -176,39 +176,43 @@ Linux --~see Section~\ref{ssec:perf} --, is used to measure and analyze in
|
|||
which functions a program spends its time, identify bottlenecks and find out
|
||||
which parts are critical to optimize. To do so, modern profilers pause the
|
||||
traced program at regular, short intervals, inspect their stack, and determine
|
||||
which function is currently being run. They also often perform a stack
|
||||
unwinding to determine the call path to this function, to determine which
|
||||
function indirectly takes time: \eg, a function \lstc{fct_a} can call both
|
||||
\lstc{fct_b} and \lstc{fct_c}, which take a lot of time; spend practically no
|
||||
time directly in \lstc{fct_a}, but spend a lot of time in calls to the other
|
||||
two functions that were made from \lstc{fct_a}.
|
||||
which function is currently being run. They also perform a stack unwinding to
|
||||
figure out the call path to this function, in order to determine which function
|
||||
indirectly takes time: for instance, a function \lstc{fct_a} can call both
|
||||
\lstc{fct_b} and \lstc{fct_c}, which both take a lot of time; the program then spends
|
||||
practically no time directly in \lstc{fct_a}, but a lot of time in calls to the other
|
||||
two functions that were made from \lstc{fct_a}. Knowing that after all,
|
||||
\lstc{fct_a} is the culprit can be useful to a programmer.
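
The distinction between time spent directly in a function and time spent in its callees can be sketched as follows. This is an illustrative model only, assuming that each sample has already been unwound into a list of function names; \lstc{struct sample}, \lstc{self_samples} and \lstc{inclusive_samples} are hypothetical names and do not reflect how \prog{perf} stores its data.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_DEPTH 8

/* One profiler sample: the unwound call path, innermost function first. */
struct sample {
    const char *stack[MAX_DEPTH];
    size_t depth;
};

/* Number of samples in which `fn` is the function currently running. */
size_t self_samples(const struct sample *s, size_t n, const char *fn)
{
    size_t count = 0;
    for (size_t i = 0; i < n; ++i)
        if (s[i].depth > 0 && strcmp(s[i].stack[0], fn) == 0)
            ++count;
    return count;
}

/* Number of samples in which `fn` appears anywhere on the call path. */
size_t inclusive_samples(const struct sample *s, size_t n, const char *fn)
{
    size_t count = 0;
    assert(s != NULL || n == 0);
    for (size_t i = 0; i < n; ++i) {
        for (size_t d = 0; d < s[i].depth; ++d) {
            if (strcmp(s[i].stack[d], fn) == 0) {
                ++count;
                break; /* count each sample once, even under recursion */
            }
        }
    }
    return count;
}
```

With samples whose innermost function is always \lstc{fct_b} or \lstc{fct_c}, both called from \lstc{fct_a}, the self count of \lstc{fct_a} is zero while its inclusive count covers every sample. The inclusive count is only available if the stack was unwound for each sample.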
|
||||
|
||||
Exception handling also requires a stack unwinding mechanism in most languages.
|
||||
Indeed, an exception is completely different from a \lstinline{return}: while the
|
||||
latter returns to the previous function, the former can be caught by virtually
|
||||
any function in the call path, at any point of the function. It is thus
|
||||
necessary to be able to unwind frames, one by one, until a suitable
|
||||
\lstc{catch} block is found. The C++ language, for one, includes a
|
||||
Indeed, an exception is completely different from a \lstinline{return}: while
|
||||
the latter returns to the previous function, at a well-defined IP, the former
|
||||
can be caught by virtually any function in the call path, at any point of the
|
||||
function. It is thus necessary to be able to unwind frames, one by one, until a
|
||||
suitable \lstc{catch} block is found. The C++ language, for one, includes a
|
||||
stack-unwinding library similar to \prog{libunwind} in its runtime.
|
||||
|
||||
Technically, exception handling could be implemented without any stack
|
||||
unwinding, by using \lstc{setjmp}/\lstc{longjmp} mechanics~\cite{niditoexn}.
|
||||
However, this is not possible to implement it straight away in C++ (and some
|
||||
other languages), because the stack needs to be properly unwound in order to
|
||||
trigger the destructors of stack-allocated objects. Furthermore, this is often
|
||||
undesirable: \lstc{setjmp} has a quite big overhead, which is introduced
|
||||
whenever a \lstc{try} block is encountered. Instead, it is often preferred to
|
||||
have strictly no overhead when no exception happens, at the cost of a greater
|
||||
overhead when an exception is actually fired --~after all, they are supposed to
|
||||
be \emph{exceptional}. For more details on C++ exception handling,
|
||||
see~\cite{koening1990exception} (especially Section~16.5). Possible
|
||||
implementation mechanisms are also presented in~\cite{dinechin2000exn}.
|
||||
unwinding, by using \lstc{setjmp} and \lstc{longjmp}
|
||||
mechanics~\cite{niditoexn}. However, it is not possible to implement this
|
||||
straight away in C++ (among others), because the stack needs to be
|
||||
properly unwound in order to trigger the destructors of stack-allocated
|
||||
objects. Furthermore, this is often undesirable: \lstc{setjmp} introduces an
|
||||
overhead, which is hit whenever a \lstc{try} block is encountered. Instead, it
|
||||
is often preferred to have strictly no overhead when no exception happens, at
|
||||
the cost of a greater overhead when an exception is actually fired --~after
|
||||
all, they are supposed to be \emph{exceptional}. For more details on C++
|
||||
exception handling, see~\cite{koening1990exception} (especially Section~16.5).
|
||||
Possible implementation mechanisms are also presented
|
||||
in~\cite{dinechin2000exn}.
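
To make the \lstc{setjmp}-based alternative concrete, here is a minimal sketch in plain C. The names \lstc{handler}, \lstc{thrown_code}, \lstc{may_throw} and \lstc{try_call} are hypothetical, and a single global handler is assumed for simplicity; a real implementation would keep a stack of handlers.

```c
#include <assert.h>
#include <setjmp.h>

/* Hypothetical single-handler "exception" state for this sketch. */
static jmp_buf handler;
static volatile int thrown_code;

static void may_throw(int code)
{
    if (code != 0) {
        thrown_code = code;
        longjmp(handler, 1); /* "throw": jump straight back to the handler */
    }
}

/* Returns 0 if no exception was raised, the exception code otherwise. */
int try_call(int code)
{
    if (setjmp(handler) == 0) { /* "try": this cost is paid on every entry */
        may_throw(code);
        return 0;
    }
    assert(thrown_code != 0);
    return thrown_code; /* "catch": reached through longjmp */
}
```

The call to \lstc{setjmp} is executed every time the guarded region is entered, whether or not anything is thrown, and \lstc{longjmp} skips the intermediate frames without running any cleanup code. These are the two drawbacks mentioned above.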
|
||||
|
||||
In both of these cases, performance \emph{can} be a problem. In
|
||||
the latter, a slow unwinding directly impacts the overall program performance,
|
||||
particularly if a lot of exceptions are thrown and caught far away in their
|
||||
call path. In the former, profiling \emph{is} performance-heavy and often quite
|
||||
slow when analyzing large programs anyway.
|
||||
call path. As for the former, profiling \emph{is} performance-heavy and slow:
|
||||
for a session analyzing the \prog{tor-browser} for two and a half minutes,
|
||||
\prog{perf} spends $100\,\mu \text{s}$ analyzing each of the $325679$ samples,
|
||||
that is, $300\,\text{ms}$ per second of program run with default settings.
|
||||
|
||||
One of the motivations for this internship was also Stephen Kell's
|
||||
\prog{libcrunch}~\cite{kell2016libcrunch}, which makes heavy use of stack
|
||||
|
@ -229,43 +233,43 @@ The DWARF data commonly includes type information about the variables in the
|
|||
original programming language, correspondence of assembly instructions with a
|
||||
line in the original source file, \ldots
|
||||
The format also specifies a way to represent unwinding data, as described in
|
||||
the previous paragraph, in an ELF section originally called
|
||||
\lstc{.debug_frame}, most often found as \ehframe.
|
||||
Section~\ref{ssec:stack_unwinding} above, in an ELF section originally called
|
||||
\lstc{.debug_frame}, but most often found as \ehframe.
|
||||
|
||||
For any binary, debugging information can easily get quite large if no
|
||||
attention is paid to keeping it as compact as possible. In this matter, DWARF
|
||||
does an excellent job, and everything is stored in a very compact way. This,
|
||||
however, as we will see, makes it both difficult to parse correctly and quite
|
||||
slow to interpret.
|
||||
For any binary, debugging information can easily take up space and grow bigger
|
||||
than the program itself if no attention is paid to keeping it as compact as
|
||||
possible when designing the file format. On this matter, DWARF does an
|
||||
excellent job, and everything is stored in a very compact way. This, however,
|
||||
as we will see, makes it both difficult to parse correctly and relatively slow
|
||||
to interpret.
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\subsection{DWARF unwinding data}
|
||||
|
||||
The unwinding data, which we will call from now on the \ehframe, contains, for
|
||||
each possible instruction pointer (that is, an instruction address within the
|
||||
program), a set of ``registers'' that can be unwound, and a rule describing how
|
||||
to do so.
|
||||
each possible IP, a set of ``registers'' that can be unwound, and a rule
|
||||
describing how to do so.
|
||||
|
||||
The DWARF language is completely agnostic of the platform and ABI, and in
|
||||
particular, is completely agnostic of a particular platform's registers. Thus,
|
||||
when talking about DWARF, a register is merely a numerical identifier that is
|
||||
often, but not necessarily, mapped to a real machine register by the ABI\@.
|
||||
as far as DWARF is concerned, a register is merely a numerical identifier that
|
||||
is often, but not necessarily, mapped to a real machine register by the ABI\@.
|
||||
|
||||
In practice, this data takes the form of a collection of tables, one table per
|
||||
Frame Description Entry (FDE). An FDE, in turn, is a DWARF entry describing such
|
||||
a table, that has a range of IPs on which it has authority. Most often, but not
|
||||
necessarily, it corresponds to a single function in the original source code.
|
||||
Each column of the table is a register (\eg{} \reg{rsp}), with two additional
|
||||
special registers, CFA (Canonical Frame Address) and RA (Return Address),
|
||||
containing respectively the base pointer of the current stack frame and the
|
||||
return address of the current function. For instance, on an x86\_64
|
||||
Each column of the table is a register (\eg{} \reg{rsp}), along with two
|
||||
additional special registers, CFA (Canonical Frame Address) and RA (Return
|
||||
Address), containing respectively the base pointer of the current stack frame
|
||||
and the return address of the current function. For instance, on an x86\_64
|
||||
architecture, RA would contain the unwound value of \reg{rip}, the instruction
|
||||
pointer. Each row has a certain validity interval, on which it describes
|
||||
accurate unwinding data. This range starts at the instruction pointer it is
|
||||
associated with, and ends at the start IP of the next table row (or the end IP
|
||||
of the current FDE if it was the last row). In particular, there can be no ``IP
|
||||
hole'' within an FDE --~unlike FDEs themselves, which can leave holes between
|
||||
them.
|
||||
associated with, and ends at the start IP of the next table row --~or the end
|
||||
IP of the current FDE if it was the last row. In particular, there can be no
|
||||
``IP hole'' within an FDE --~unlike FDEs themselves, which can leave holes
|
||||
between them.
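
The validity intervals described above can be illustrated by a small lookup routine. This is a sketch under stated assumptions: \lstc{struct fde_row}, \lstc{struct fde} and \lstc{find_row} are hypothetical names, only the CFA column is modelled (as an offset from \reg{rsp}), and the end IP of the FDE is assumed to be exclusive.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory form of one FDE's table; only the CFA column
 * is kept, as the rule "CFA = rsp + offset". */
struct fde_row {
    uintptr_t start_ip; /* first IP for which the row is valid */
    int cfa_rsp_offset;
};

struct fde {
    uintptr_t end_ip;           /* first IP past the FDE (assumed exclusive) */
    const struct fde_row *rows; /* sorted by increasing start_ip */
    size_t nb_rows;
};

/* Returns the row valid at `ip`, or NULL if `ip` is outside the FDE. */
const struct fde_row *find_row(const struct fde *fde, uintptr_t ip)
{
    const struct fde_row *found = NULL;
    assert(fde != NULL);
    if (fde->nb_rows == 0 || ip < fde->rows[0].start_ip || ip >= fde->end_ip)
        return NULL;
    /* A row is valid until the start IP of the next row. */
    for (size_t i = 0; i < fde->nb_rows && fde->rows[i].start_ip <= ip; ++i)
        found = &fde->rows[i];
    return found;
}
```

Since rows are contiguous inside an FDE, the last row whose start IP does not exceed the requested IP is always the valid one; only IPs outside the FDE yield no row.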
|
||||
|
||||
\begin{figure}[h]
|
||||
\begin{minipage}{0.45\textwidth}
|
||||
|
@ -312,7 +316,7 @@ them.
|
|||
\caption{Stack frame schema}\label{table:ex1_stack_schema}
|
||||
\end{table}
|
||||
|
||||
For instance, the C source code in Listing~\ref{lst:ex1_c} above, when compiled
|
||||
For instance, the C source code in Listing~\ref{lst:ex1_c}, when compiled
|
||||
with \lstbash{gcc -O1 -fomit-frame-pointer -fno-stack-protector}, yields the
|
||||
assembly code in Listing~\ref{lst:ex1_asm}. The memory layout of the stack
|
||||
frame is presented in Table~\ref{table:ex1_stack_schema}, to help understanding
|
||||
|
@ -380,9 +384,9 @@ Figure~\ref{fig:fde_line_density} was generated on a random sample of around
|
|||
|
||||
The most commonly used library to perform stack unwinding, in the Linux
|
||||
ecosystem, is \prog{libunwind}~\cite{libunwind}. While it is very robust and
|
||||
quite efficient, most of its optimization comes from fine-tuned code and good
|
||||
caching mechanisms. While parsing DWARF, \prog{libunwind} is forced to parse
|
||||
the relevant FDE from its start, until it finds the row it was seeking.
|
||||
decently efficient, most of its optimization comes from fine-tuned code and
|
||||
good caching mechanisms. When parsing DWARF, \prog{libunwind} is forced to
|
||||
parse the relevant FDE from its start, until it finds the row it was seeking.
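
The cost of this sequential parsing can be pictured with a toy interpreter. The two operations below are simplified stand-ins for the standard's \lstc{DW_CFA_advance_loc} and \lstc{DW_CFA_def_cfa_offset} instructions; the encoding, the names \lstc{struct cfa_insn} and \lstc{cfa_offset_at}, and the restriction to the CFA offset are assumptions made for this sketch, and it does not reproduce \prog{libunwind}'s actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for two DWARF call frame instructions. */
enum cfa_op { OP_ADVANCE_LOC, OP_DEF_CFA_OFFSET };

struct cfa_insn {
    enum cfa_op op;
    unsigned operand; /* IP delta, or new CFA offset from rsp */
};

/* Replays `prog` from the start of the FDE until the row covering `ip`
 * is reached, and returns the CFA offset held in that row. */
int cfa_offset_at(const struct cfa_insn *prog, size_t len,
                  uintptr_t fde_start, int initial_offset, uintptr_t ip)
{
    uintptr_t loc = fde_start;
    int offset = initial_offset;
    assert(ip >= fde_start);
    for (size_t i = 0; i < len; ++i) {
        switch (prog[i].op) {
        case OP_ADVANCE_LOC:
            if (loc + prog[i].operand > ip)
                return offset; /* the next row starts past ip: stop here */
            loc += prog[i].operand;
            break;
        case OP_DEF_CFA_OFFSET:
            offset = (int)prog[i].operand;
            break;
        }
    }
    return offset;
}
```

The work done grows with the position of the requested row inside the FDE: every earlier instruction is interpreted first, which is why caching matters so much for this approach.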
|
||||
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
|
@ -392,9 +396,9 @@ the relevant FDE from its start, until it finds the row it was seeking.
|
|||
We will now define semantics covering most of the operations used for FDEs
|
||||
described in the DWARF standard~\cite{dwarf5std}, such as seen in
|
||||
Listing~\ref{lst:ex1_dwraw}, with the exception of DWARF expressions. These are
|
||||
not exhaustively treated because they are quite rich and would take a lot of
|
||||
time and space to formalize, and in the meantime are only seldom used (see the
|
||||
DWARF statistics regarding this).
|
||||
not exhaustively treated because they form a rich language and would take a lot
|
||||
of time and space to formalize, while being only seldom used in practice (see
|
||||
the DWARF statistics regarding this).
|
||||
|
||||
These semantics are defined with respect to the well-formalized C language, and
|
||||
go through an intermediary language. The DWARF language can read the
|
||||
|
@ -650,7 +654,7 @@ earlier. The translation from $\intermedlang$ to C is defined as follows:
|
|||
if(ip >= $loc$) {
|
||||
for(int reg=0; reg < NB_REGS; ++reg)
|
||||
new_ctx[reg] = $\semR{row[reg]}$;
|
||||
goto end_ifs; // Avoid if/else if problems
|
||||
goto end_ifs; // Avoid using `else if` (easier for generation)
|
||||
}
|
||||
\end{lstlisting}
|
||||
\end{itemize}
|
||||
|
@ -688,9 +692,8 @@ licenses.
|
|||
The rough idea of the compilation is to produce, out of the \ehframe{} section
|
||||
of a binary, C code that resembles the code shown in the DWARF semantics from
|
||||
Section~\ref{sec:semantics} above. This C code is then compiled by GCC in
|
||||
\lstbash{-O2} mode, since it already provides a good level of optimization and
|
||||
compiling in \lstbash{-O3} takes way too much time. This saves us the trouble
|
||||
of optimizing the generated C code whenever GCC does that by itself.
|
||||
\lstbash{-O2} mode. This saves us the trouble of optimizing the generated C
|
||||
code whenever GCC does that by itself.
|
||||
|
||||
The generated code consists of a single monolithic function, \lstc{_eh_elf},
|
||||
taking as arguments an instruction pointer and a memory context (\ie{} the
|
||||
|
@ -698,15 +701,15 @@ value of the various machine registers) as defined in
|
|||
Listing~\ref{lst:unw_ctx}. The function will then return a fresh memory
|
||||
context, containing the values the registers hold after unwinding this frame.
|
||||
|
||||
The body of the function itself is mostly a huge switch, taking advantage of
|
||||
the non-standard --~yet widely implemented in C compilers~-- syntax for range
|
||||
switches, in which each \lstinline{case} can refer to a range. All the FDEs are
|
||||
merged together into this switch, each row of a FDE being a switch case.
|
||||
Separating the various FDEs in the C code --~other than with comments~-- is,
|
||||
unlike what is done in DWARF, pointless, since accessing a ``row'' has a linear
|
||||
cost, and the C code is not meant to be read, except maybe for debugging
|
||||
purposes. The switch cases bodies then fill a context with unwound values, then
|
||||
return it.
|
||||
The body of the function itself consists of a single monolithic switch, taking
|
||||
advantage of the non-standard --~yet widely implemented in C compilers~--
|
||||
syntax for range switches, in which each \lstinline{case} can refer to a range.
|
||||
All the FDEs are merged together into this switch, each row of an FDE being a
|
||||
switch case. Separating the various FDEs in the C code --~other than with
|
||||
comments~-- is, unlike what is done in DWARF, pointless, since accessing a
|
||||
``row'' has a linear cost, and the C code is not meant to be read, except maybe
|
||||
for debugging purposes. The switch case bodies then fill a context with
|
||||
unwound values, then return it.
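
The shape of such a function can be sketched as follows. This is only an illustration under stated assumptions, not the generated code itself: the structure \lstc{unwind_context_t} is a stand-in for the real context of Listing~\ref{lst:unw_ctx}, the address ranges and CFA rules are made up, and \lstc{eh_elf_sketch} and \lstc{read_word} are hypothetical names. The range syntax \lstc{case low ... high} is the GNU C extension, accepted by GCC and Clang.

```c
#include <assert.h>
#include <stdint.h>

static_assert(sizeof(uintptr_t) == 8, "sketch assumes a 64-bit target");

/* Stand-in for the real unwinding context (hypothetical field layout). */
typedef struct {
    uintptr_t rip, rsp, rbp, rbx;
    int error;                      /* non-zero when no row covers the IP */
} unwind_context_t;

/* Read one machine word from the stack of the unwound program. */
static uintptr_t read_word(uintptr_t addr)
{
    return *(const uintptr_t *)addr;
}

/* One switch case per FDE row; all FDEs are merged into the same switch.
 * Each case fills a fresh context with the unwound values and returns it. */
unwind_context_t eh_elf_sketch(unwind_context_t ctx, uintptr_t ip)
{
    unwind_context_t out = ctx;
    out.error = 0;

    switch (ip) {
    case 0x1000 ... 0x1003:         /* made-up row: CFA = rsp + 8 */
        out.rsp = ctx.rsp + 8;
        out.rip = read_word(out.rsp - 8);   /* return address at CFA - 8 */
        return out;
    case 0x1004 ... 0x102f:         /* made-up row: CFA = rsp + 32 */
        out.rsp = ctx.rsp + 32;
        out.rip = read_word(out.rsp - 8);
        return out;
    default:                        /* no unwinding data for this IP */
        out.error = 1;
        return out;
    }
}
```

The near-identical case bodies in this sketch hint at how much duplication a real binary, with thousands of rows, produces.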
|
||||
|
||||
A setting of the compiler also optionally enables another parameter to the
|
||||
\lstc{_eh_elf} function, \lstc{deref}, which is a function pointer. This
|
||||
|
@ -720,13 +723,13 @@ Unlike in the \ehframe, and unlike what should be done in a release,
|
|||
real-world-proof version of the \ehelfs, the choice was made to keep this
|
||||
implementation simple, and only handle the few registers that were needed to
|
||||
simply unwind the stack. Thus, the only registers handled in \ehelfs{} are
|
||||
\reg{rip}, \reg{rbp}, \reg{rsp} and \reg{rbx}, the latter being used quite
|
||||
often in \prog{libc} to hold the CFA address. This is enough to unwind the
|
||||
stack reliably, and thus enough for profiling, but is not sufficient to analyze
|
||||
every stack frame as \prog{gdb} would do after a \lstbash{frame n} command.
|
||||
Yet, if one was to enhance the code to handle every register, it would not be
|
||||
much harder and would probably be only a few hours of code refactoring and
|
||||
rewriting.
|
||||
\reg{rip}, \reg{rbp}, \reg{rsp} and \reg{rbx}, the latter being used a few
|
||||
times in \prog{libc} to hold the CFA address in common functions. This is
|
||||
enough to unwind the stack reliably, and thus enough for profiling, but is not
|
||||
sufficient to analyze every stack frame as \prog{gdb} would do after a
|
||||
\lstbash{frame n} command. Yet, if one were to enhance the code to handle every
|
||||
register, it would not be much harder and would probably be only a few hours of
|
||||
code refactoring and rewriting.
|
||||
|
||||
\lstinputlisting[language=C, caption={Unwinding context}, label={lst:unw_ctx}]
|
||||
{src/dwarf_assembly_context/unwind_context.c}
|
||||
|
@ -743,16 +746,15 @@ by including an error flag by lack of $\bot$ value.
|
|||
|
||||
This generated data is stored in separate shared object files, which we call
|
||||
\ehelfs. It would have been possible to alter the original ELF file to embed
|
||||
this data as a new section, but getting it to be executed just as any
|
||||
portion of the \lstc{.text} section would probably have been painful, and
|
||||
keeping it separated during the experimental phase is quite convenient. It is
|
||||
possible to have multiple versions of \ehelfs{} files in parallel, with various
|
||||
options turned on or off, and it doesn't require to alter the base system by
|
||||
editing \eg{} \texttt{/usr/lib/libc-*.so}. Instead, when the \ehelf{} data is
|
||||
required, those files can simply be \lstc{dlopen}'d. It is also possible to
|
||||
imagine, in a future environment production, packaging \ehelfs{} files
|
||||
separately, so that people interested in heavy computation can have the choice
|
||||
to install them.
|
||||
this data as a new section, but getting it to be executed just as any portion
|
||||
of the \lstc{.text} section would probably have been painful, and keeping it
|
||||
separated during the experimental phase is convenient. It is possible to have
|
||||
multiple versions of \ehelfs{} files in parallel, with various options turned
|
||||
on or off, and it does not require altering the base system by editing \eg{}
|
||||
\texttt{/usr/lib/libc-*.so}. Instead, when the \ehelf{} data is required, those
|
||||
files can simply be \lstc{dlopen}'d. It is also possible to imagine, in a
|
||||
future production environment, packaging \ehelfs{} files separately, so that
|
||||
people interested in heavy computation can have the choice to install them.
|
||||
|
||||
This, in particular, means that each ELF file has its unwinding data in a
|
||||
separate \ehelf{} file --~just like with DWARF, where each ELF retains its own
|
||||
|
@ -781,10 +783,8 @@ possible to produce a compiled version very close to the one described in
|
|||
Section~\ref{sec:semantics}. Although the unwinding speed cannot yet be
|
||||
actually benchmarked, it is already possible to write in a few hundred lines of
|
||||
C code a simple stack walker printing the functions traversed. It already works
|
||||
without any problem on the easily tested cases, since corner cases are mostly
|
||||
found in standard and highly optimized libraries, and it is not that easy to get
|
||||
the program to stop and print a stack trace from within a system library
|
||||
without using a debugger.
|
||||
well on the standard cases that are easily tested, and can be used to unwind
|
||||
the stack of simple programs.
|
||||
|
||||
The major drawback of this approach, without any particular care taken, is the
|
||||
space waste. The space taken by those tentative \ehelfs{} is analyzed in
|
||||
|
@ -835,14 +835,16 @@ made in order to shrink the \ehelfs.
|
|||
\medskip
|
||||
|
||||
The major optimization that most reduced the output size was to use an if/else
|
||||
tree implementing a binary search on the program counter relevant intervals,
|
||||
instead of a huge switch. In the process, we also \emph{outline} a lot of code,
|
||||
that is, find out identical ``switch cases'' bodies --~which are not switch
|
||||
cases anymore, but if bodies~--, move them outside of the if/else tree,
|
||||
identify them by a label, and jump to them using a \lstc{goto}, which
|
||||
de-duplicates a lot of code and contributes greatly to the shrinking. In the
|
||||
process, we noticed that the vast majority of FDE rows are actually taken among
|
||||
very few ``common'' FDE rows.
|
||||
tree implementing a binary search on the relevant instruction pointer
|
||||
intervals, instead of a single monolithic switch. In the process, we also
|
||||
\emph{outline} code whenever possible, that is, find identical ``switch
|
||||
cases'' bodies --~which are not switch cases anymore, but if bodies~--, move
|
||||
them outside of the if/else tree, identify them by a label, and jump to them
|
||||
using a \lstc{goto}, which de-duplicates a lot of code and contributes greatly
|
||||
to the shrinking. In the process, we noticed that the vast majority of FDE rows
|
||||
are actually taken among very few ``common'' FDE rows. For instance, in the
|
||||
\prog{libc}, out of a total of $20827$ rows, only $302$ ($1.5\,\%$) remain
|
||||
after the outlining.
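
A minimal sketch of the resulting code shape is given below, under the same caveats as before: the intervals, the CFA rules, the reduced context \lstc{ctx_t} and the names \lstc{eh_elf_tree_sketch} and \lstc{read_word} are all hypothetical and only serve the illustration. Four intervals share two outlined bodies, each identified by a label and reached with a \lstc{goto}.

```c
#include <assert.h>
#include <stdint.h>

static_assert(sizeof(uintptr_t) == 8, "sketch assumes a 64-bit target");

/* Reduced, hypothetical unwinding context. */
typedef struct {
    uintptr_t rip, rsp;
    int error;                      /* non-zero when no row covers the IP */
} ctx_t;

/* Read one machine word from the stack of the unwound program. */
static uintptr_t read_word(uintptr_t addr)
{
    return *(const uintptr_t *)addr;
}

ctx_t eh_elf_tree_sketch(ctx_t ctx, uintptr_t ip)
{
    ctx_t out = ctx;
    out.error = 0;

    /* if/else tree: binary search over the sorted interval bounds
     * [0x1000,0x1004) [0x1004,0x1020) [0x1020,0x1024) [0x1024,0x1040). */
    if (ip < 0x1020) {
        if (ip < 0x1000) goto no_row;
        if (ip < 0x1004) goto row_cfa_rsp_8;
        goto row_cfa_rsp_16;
    } else {
        if (ip < 0x1024) goto row_cfa_rsp_8;
        if (ip < 0x1040) goto row_cfa_rsp_16;
        goto no_row;
    }

    /* Outlined bodies: each distinct row is emitted once, outside the tree. */
row_cfa_rsp_8:
    out.rsp = ctx.rsp + 8;
    out.rip = read_word(out.rsp - 8);       /* return address at CFA - 8 */
    return out;
row_cfa_rsp_16:
    out.rsp = ctx.rsp + 16;
    out.rip = read_word(out.rsp - 8);
    return out;
no_row:
    out.error = 1;
    return out;
}
```

With this layout, the code size grows with the number of distinct rows plus one comparison per interval bound, rather than with one full body per row.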
|
||||
|
||||
This makes this optimization really efficient, as seen later in
|
||||
Section~\ref{ssec:results_size}, but also makes it an interesting question
|
||||
|
@ -861,10 +863,10 @@ DWARF data could be efficiently compressed in this way.
|
|||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\section{Benchmarking}
|
||||
\section{Benchmarking}\label{sec:benchmarking}
|
||||
|
||||
Benchmarking turned out to be, quite surprisingly, the hardest part of the
|
||||
project. It ended up requiring a lot of investigation to find a working
|
||||
project. It ended up requiring a good deal of investigation to find a working
|
||||
protocol, and afterwards, a good deal of code reading and coding to get the
|
||||
solution working.
|
||||
|
||||
|
@ -884,13 +886,15 @@ are made from different locations is somehow cheating, since it makes useless
|
|||
distribution. All in all, the benchmarking method must have a ``natural''
|
||||
distribution of unwindings.
|
||||
|
||||
Another requirement is to also distribute quite evenly the unwinding points
|
||||
across the program: we would like to benchmark stack unwindings crossing some
|
||||
standard library functions, starting from inside them, etc.
|
||||
Another requirement is to distribute the unwinding points evenly enough
|
||||
across the program to mimic real-world unwinding: we would like to benchmark
|
||||
stack unwindings crossing some standard library functions, starting from inside
|
||||
them, etc.
|
||||
|
||||
Finally, the unwound program must be interesting enough to enter and exit a lot
|
||||
of functions, nest function calls, have FDEs that are not as simple as in
|
||||
Listing~\ref{lst:ex1_dw}, etc.
|
||||
Finally, the unwound program must be interesting enough to enter and exit
|
||||
functions often, frequently building a stack of at least 5 nested function
|
||||
calls, and have FDEs that are not as simple as in Listing~\ref{lst:ex1_dw},
|
||||
etc.
|
||||
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
|
@ -915,7 +919,8 @@ program, but instead inject code in it.
|
|||
\subsection{Benchmarking with \prog{perf}}\label{ssec:bench_perf}
|
||||
|
||||
In the context of this internship, the main advantage of \prog{perf} is that it
|
||||
does a lot of stack unwinding. It also meets all the requirements from
|
||||
unwinds the stack on a regular, controllable basis, easily unwinding thousands
|
||||
of times in a few seconds. It also meets all the requirements from
|
||||
Section~\ref{ssec:bench_req} above: since it stops at regular intervals and
|
||||
unwinds, the unwindings are evenly distributed \wrt{} the frequency of
|
||||
execution of the code, which is a natural enough setup for the benchmarks to be
|
||||
|
@ -966,10 +971,10 @@ the repositories \prog{perf-eh\_elf} and \prog{libunwind-eh\_elf}.
|
|||
|
||||
The first benchmarking approach tried was to create some specific C code
|
||||
that would meet the requirements from Section~\ref{ssec:bench_req}, while
|
||||
calling itself a benchmarking procedure from time to time. This was abandoned
|
||||
quite quickly, because generating C code interesting enough to be unwound
|
||||
turned out hard, and the generated FDEs invariably ended out uninteresting. It
|
||||
would also never have met the requirement of unwinding from fairly distributed
|
||||
calling itself a benchmarking procedure from time to time. This was quickly
|
||||
abandoned, because generating C code interesting enough to be unwound turned
|
||||
out to be hard, and the generated FDEs invariably ended up uninteresting. It would
|
||||
also never have met the requirement of unwinding from fairly distributed
|
||||
locations anyway.
|
||||
|
||||
Another attempt was made using CSmith~\cite{csmith}, a random C code generator
|
||||
|
|