Rephrase and correct everything up to end of §1

Théophile Bastian 2018-08-18 21:12:05 +02:00
parent 4016b4f46c
commit f0809dbf1c
2 changed files with 203 additions and 195 deletions

@ -8,45 +8,45 @@
\subsection*{The general context} \subsection*{The general context}
The standard debugging data format for ELF binary files, DWARF, contains a lot The standard debugging data format for ELF binary files, DWARF, contains tables
of information, which is generated mostly when passing \eg{} the switch that permit, for a given instruction pointer (IP), to understand how the
\lstbash{-g} to \prog{gcc}. This information, essentially provided for assembly instruction relates to the source code, where variables are currently
debuggers, contains all that is needed to connect the generated assembly with allocated in memory or if they are stored in a register, what their type is,
the original code, information that can be used by sanitizers (\eg{} the type and how to unwind the current stack frame. This information is generated when
of each variable in the source language), etc. passing \eg{} the switch \lstbash{-g} to \prog{gcc} or equivalents.
Even in stripped (non-debug) binaries, a small portion of DWARF data remains. Even in stripped (non-debug) binaries, a small portion of DWARF data remains:
Among this essential data that is never stripped is the stack unwinding data, the stack unwinding data. This information is necessary to unwind stack
which allows to unwind stack frames, restoring machine registers to the value frames, restoring machine registers to the value they had in the previous
they had in the previous frame, for instance within the context of a debugger frame, for instance within the context of a debugger or a profiler.
or a profiler.
This data is structured into tables, each row corresponding to an program This data is structured into tables, each row corresponding to an IP range for
counter (PC) range for which it describes valid unwinding data, and each column which it describes valid unwinding data, and each column describing how to
describing how to unwind a particular machine register (or virtual register unwind a particular machine register (or virtual register used for various
used for various purposes). These rules are mostly basic, consisting in offsets purposes). The vast majority of the rules actually used are basic --~see
from memory addresses stored in registers (such as \reg{rbp} or \reg{rsp}), but Section~\ref{ssec:instr_cov}~\textendash, consisting in offsets from memory
in some cases, they can take the form of a stack-machine expression that can addresses stored in registers (such as \reg{rbp} or \reg{rsp}). Yet, the
access virtually all the process's memory and perform Turing-complete standard defines rules that take the form of a stack-machine expression that
can access virtually all the process's memory and perform Turing-complete
computation~\cite{oakley2011exploiting}. computation~\cite{oakley2011exploiting}.
\subsection*{The research problem} \subsection*{The research problem}
As debugging data can easily take an unreasonable space if stored carelessly, As debugging data can easily take up an unreasonable amount of space and grow larger than
the DWARF standard pays a great attention to data compactness and compression, the program itself if stored carelessly, the DWARF standard pays great
and succeeds particularly well at it. But this, as always, is at the expense attention to data compactness and compression, and succeeds particularly well
of efficiency: accessing stack unwinding data for a particular program point at it. But this, as always, is at the expense of efficiency: accessing stack
can be quite costly. unwinding data for a particular program point is not a light operation --~in
the order of magnitude of $10\,\mu{}\text{s}$ on a modern computer.
This is often not a huge problem, as stack unwinding is mostly thought of as a This is often not a huge problem, as stack unwinding is often thought of as a
debugging procedure: when something behaves unexpectedly, the programmer might debugging procedure: when something behaves unexpectedly, the programmer might
be interested in opening their debugger and exploring the stack. Yet, stack be interested in opening their debugger and exploring the stack. Yet, stack
unwinding might, in some cases, be performance-critical: for instance, profiler unwinding might, in some cases, be performance-critical: for instance, profiler
programs need to perform a whole lot of stack unwindings. Even worse, programs need to perform a whole lot of stack unwindings. Even worse,
exception handling relies on stack unwinding in order to find a suitable exception handling relies on stack unwinding in order to find a suitable
catch-block! For such applications, it might be desirable to find a different catch-block! For such applications, it might be desirable to find a different
time/space trade-off, allowing a slightly space-heavier, but far more time/space trade-off, storing a bit more for a faster unwinding.
time-efficient unwinding procedure.
This different trade-off is the question that I explored during this This different trade-off is the question that I explored during this
internship: what good alternative trade-off is reachable when storing the stack internship: what good alternative trade-off is reachable when storing the stack
@ -69,29 +69,31 @@ This internship explored the possibility to compile DWARF's stack unwinding
data directly into native assembly on the x86\_64 architecture, in order to data directly into native assembly on the x86\_64 architecture, in order to
provide fast access to the data at assembly level. This compilation process was provide fast access to the data at assembly level. This compilation process was
fully implemented and tested on complex, real-world examples. The integration fully implemented and tested on complex, real-world examples. The integration
of compiled DWARF into existing, real-world projects have been made easy by of compiled DWARF into existing projects has been made easy by implementing an
implementing an alternative version of the \textit{de facto} standard library alternative version of the \textit{de facto} standard library for this purpose,
for this purpose, \prog{libunwind}. \prog{libunwind}.
Multiple approaches have been tried, in order to determine which compilation Multiple approaches have been tried, in order to determine which compilation
process leads to the best time/space trade-off. process leads to the best time/space trade-off.
Unexpectedly, the part that proved hardest of the project was finding a Unexpectedly, the hardest part of the project proved to be finding and
benchmarking protocol that was both relevant and reliable. Unwinding one single implementing a benchmarking protocol that was both relevant and reliable.
frame is way too fast to provide a reliable benchmarking on a few samples Unwinding one single frame is way too fast to provide a reliable benchmarking
(around $10\,\mu s$ per frame). Having a lot of samples is not easy, since one on a few samples (around $10\,\mu s$ per frame). Having enough samples for this
must avoid unwinding the same frame over and over again, which would only purpose --~at least a few thousand~-- is not easy, since one must avoid
benchmark the caching mechanism. The other problem is to distribute evenly the unwinding the same frame over and over again, which would only benchmark the
unwinding measures across the various program positions, including directly caching mechanism. The other problem is to distribute evenly the unwinding
into the loaded libraries (\eg{} the \prog{libc}). measures across the various IPs, including directly into the loaded libraries
(\eg{} the \prog{libc}).
The solution eventually chosen was to modify \prog{perf}, the standard The solution eventually chosen was to modify \prog{perf}, the standard
profiling program for Linux, in order to gather statistics and benchmarks of profiling program for Linux, in order to gather statistics and benchmarks of
its unwindings. Modifying \prog{perf} was an additional challenge that turned its unwindings. Modifying \prog{perf} was an additional challenge that turned
out to be harder than expected, since the source code is pretty opaque to out to be harder than expected, since the source code is pretty opaque to
someone who doesn't know the project well. This, in particular, required to someone who doesn't know the project well, and the optimisations make some
produce an alternative version of \prog{libunwind} interfaced with the compiled parts counter-intuitive. This, in particular, required producing an
debugging data. alternative version of \prog{libunwind} interfaced with the compiled debugging
data.
% What is your solution to the question described in the last paragraph? % What is your solution to the question described in the last paragraph?
% %
@ -108,15 +110,16 @@ debugging data.
The goal was to obtain a compiled version of unwinding data that was faster The goal was to obtain a compiled version of unwinding data that was faster
than DWARF, reasonably heavier and reliable. The benchmarks mentioned have than DWARF, reasonably heavier and reliable. The benchmarks mentioned have
yielded convincing results: on the experimental setup created (detailed later yielded convincing results: on the experimental setup created (detailed in
in this report), the compiled version is around 26 times faster than the DWARF Section~\ref{sec:benchmarking} below), the compiled version is around 26 times
version, while it remains only around 2.5 times bigger than the original data. faster than the DWARF version, while it remains only around 2.5 times bigger
than the original data.
The implementation is not yet release-ready, as it does not support 100\ \% of The implementation is not yet release-ready, as it does not support 100\ \% of
the DWARF5 specification~\cite{dwarf5std} --~see Section~\ref{ssec:ehelfs} the DWARF5 specification~\cite{dwarf5std} --~see Section~\ref{ssec:ehelfs}
below. Yet, it supports the vast majority --~around $99.9$\ \%~-- of the cases below. Yet, it supports the vast majority --~more than $99.9$\ \%~-- of the
seen in the wild, and is decently robust compared to \prog{libunwind}, the cases seen in the wild, and is decently robust compared to \prog{libunwind},
reference implementation. Indeed, corner cases occur often, and on a 27000 the reference implementation. Indeed, corner cases occur often, and on a 27000
samples test, 885 failures were observed for \prog{libunwind}, against 1099 for samples test, 885 failures were observed for \prog{libunwind}, against 1099 for
the compiled DWARF version (see Section~\ref{ssec:timeperf}). the compiled DWARF version (see Section~\ref{ssec:timeperf}).
@ -130,13 +133,13 @@ virtually any operating system.
\subsection*{Summary and future work} \subsection*{Summary and future work}
In most cases of everyday's life, a slow stack unwinding is not a problem, or In most cases of everyday life, a slow stack unwinding is not a problem, apart
even an annoyance. Yet, having a 26 times speed-up on stack unwinding-heavy from being an annoyance. Yet, having a 26 times speed-up on stack unwinding-heavy
tasks, such as profiling, can be really useful to profile large programs, tasks can be really useful to \eg{} profile large programs, particularly if one
particularly if one wants to profile many times in order to analyze the impact wants to profile many times in order to analyze the impact of multiple changes.
of multiple changes. It can also be useful for exception-heavy programs. Thus, It can also be useful for exception-heavy programs. Thus, it might be
it might be interesting to implement a more stable version, and try to interesting to implement a more stable version, and try to interface it cleanly
interface it cleanly with mainstream tools, such as \prog{perf}. with mainstream tools, such as \prog{perf}.
Another question worth exploring might be whether it is possible to shrink even Another question worth exploring might be whether it is possible to shrink even
more the original DWARF unwinding data, which would be stored in a format not more the original DWARF unwinding data, which would be stored in a format not

@ -1,7 +1,7 @@
\title{DWARF debugging data, compilation and optimization} \title{DWARF debugging data, compilation and optimization}
\author{Théophile Bastian\\ \author{Théophile Bastian\\
Under supervision of Francesco Zappa-Nardelli\\ Under supervision of Francesco Zappa Nardelli\\
{\textsc{parkas}, \'Ecole Normale Supérieure de Paris}} {\textsc{parkas}, \'Ecole Normale Supérieure de Paris}}
\date{March -- August 2018\\August 20, 2018} \date{March -- August 2018\\August 20, 2018}
@ -54,21 +54,17 @@ Under supervision of Francesco Zappa-Nardelli\\
\subsection*{Source code}\label{ssec:source_code} \subsection*{Source code}\label{ssec:source_code}
The source code of all the implementations made during this internship is All the source code produced during this internship is available openly. See
available at \url{https://git.tobast.fr/m2-internship/}. See
Section~\ref{ssec:code_avail} for details. Section~\ref{ssec:code_avail} for details.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Stack unwinding data presentation} \section{Stack unwinding data presentation}
The compilation process presented in this section is implemented in
\prog{dwarf-assembly}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Stack frames and x86\_64 calling conventions} \subsection{Stack frames and x86\_64 calling conventions}
On most platforms, programs make use of a \emph{call stack} to store On every common platform, programs make use of a \emph{call stack} to store
information about the nested function calls at the current execution point, and information about the nested function calls at the current execution point, and
keep track of their nesting. This call stack is conventionally a contiguous keep track of their nesting. This call stack is conventionally a contiguous
memory space mapped close to the top of the addressing space. Each function memory space mapped close to the top of the addressing space. Each function
@ -80,15 +76,15 @@ restored before returning, the function's return address and local variables.
On the x86\_64 platform, with which this report is mostly concerned, the On the x86\_64 platform, with which this report is mostly concerned, the
calling convention that is followed is defined in the System V calling convention that is followed is defined in the System V
ABI~\cite{systemVabi} for the Unix-like operating systems --~among which Linux. ABI~\cite{systemVabi} for the Unix-like operating systems --~among which Linux
Under this calling convention, the first six arguments of a function are passed and macOS\@. Under this calling convention, the first six arguments of a
in the registers \reg{rdi}, \reg{rsi}, \reg{rdx}, \reg{rcx}, \reg{r8}, function are passed in the registers \reg{rdi}, \reg{rsi}, \reg{rdx},
\reg{r9}, while additional arguments are pushed onto the stack. It also defines \reg{rcx}, \reg{r8}, \reg{r9}, while additional arguments are pushed onto the
which registers may be overwritten by the callee, and which parameters must be stack. It also defines which registers may be overwritten by the callee, and
restored before returning. This restoration, most of the time, is done by which registers must be restored before returning. This restoration, for most
pushing the register value onto the stack in the function prelude, and compilers, is done by pushing the register value onto the stack in the function
restoring it just before returning. Those preserved registers are \reg{rbx}, prelude, and restoring it just before returning. Those preserved registers are
\reg{rsp}, \reg{rbp}, \reg{r12}, \reg{r13}, \reg{r14}, \reg{r15}. \reg{rbx}, \reg{rsp}, \reg{rbp}, \reg{r12}, \reg{r13}, \reg{r14}, \reg{r15}.
\begin{wrapfigure}{r}{0.4\textwidth} \begin{wrapfigure}{r}{0.4\textwidth}
\centering \centering
@ -104,29 +100,32 @@ use \reg{rbp} (``base pointer'') to save this value of \reg{rip}, by writing
the old value of \reg{rbp} just below the return address on the stack, then the old value of \reg{rbp} just below the return address on the stack, then
copying \reg{rsp} to \reg{rbp}. This makes it easy to find the return address copying \reg{rsp} to \reg{rbp}. This makes it easy to find the return address
from anywhere within the function, and also allows for easy addressing of local from anywhere within the function, and also allows for easy addressing of local
variables. Yet, using \reg{rbp} to save \reg{rip} is not always done, since it variables. To some extent, it also allows for hot debugging, such as saving a
somehow ``wastes'' a register. This decision is, on x86\_64 System V, up to the useful core dump upon segfault. Yet, using \reg{rbp} to save \reg{rip} is not
compiler. always done, since it somehow ``wastes'' a register. This decision is, on
x86\_64 System V, up to the compiler.
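When every function does maintain \reg{rbp} this way, the saved values form a linked list that can be followed without any extra metadata. The sketch below illustrates this on a \emph{simulated} stack: the stack is modelled as an array of 64-bit words, ``addresses'' are word indices, and a saved \reg{rbp} of $0$ marks the outermost frame. These modelling choices, as well as the name \lstc{walk_frame_pointers}, are assumptions made for the illustration and do not come from the implementation described in this report.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model (assumption): the stack is an array of 64-bit words and an
 * "address" is an index into it. For a frame whose base pointer is rbp,
 *   stack[rbp]     holds the caller's saved rbp,
 *   stack[rbp + 1] holds the return address into the caller.
 * A saved rbp of 0 marks the outermost frame. */
size_t walk_frame_pointers(const uint64_t *stack, size_t stack_len,
                           uint64_t rbp,
                           uint64_t *ret_addrs, size_t max_frames)
{
    size_t count = 0;

    while (rbp != 0 && count < max_frames) {
        /* Both words of the frame record must lie inside the stack. */
        assert(rbp + 1 < stack_len);

        ret_addrs[count++] = stack[rbp + 1]; /* return address */
        rbp = stack[rbp];                    /* move to the caller's frame */
    }
    return count;
}
```

The loop never needs to know the size of a frame or where the callee-saved registers are: this is precisely the information that is lost once the compiler stops dedicating \reg{rbp} to this purpose.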
Often, a function will start by subtracting some value from \reg{rsp}, allocating Often, a function will start by subtracting some value from \reg{rsp}, allocating
some space in the stack frame for its local variables. Then, it will push on some space in the stack frame for its local variables. Then, it will push on
the stack the values of the callee-saved registers that are overwritten later, the stack the values of the callee-saved registers that are overwritten later,
effectively saving them. Before returning, it will pop the values of the saved effectively saving them. Before returning, it will pop the values of the saved
registers back to their original registers and restore \reg{rsp} to its former registers back to their original registers and restore \reg{rsp} to its former
value. value.
\subsection{Stack unwinding} \subsection{Stack unwinding}\label{ssec:stack_unwinding}
For various reasons, it might be interesting, at some point of the execution of For various reasons, it might be interesting, at some point of the execution of
a program, to glance at its program stack and be able to extract information a program, to glance at its program stack and be able to extract information
from it. For instance, when running a debugger such as \prog{gdb}, a frequent from it. For instance, when running a debugger such as \prog{gdb}, a frequent
usage is to obtain a \emph{backtrace}, that is, the list of all nested function usage is to obtain a \emph{backtrace}, that is, the list of all nested function
calls at this point. This actually reads the stack to find the different stack calls at the current IP\@. This actually reads the stack to find the different
frames, and decode them to identify the function names, parameter values, etc. stack frames, and decode them to identify the function names, parameter values,
etc.
This operation is far from trivial. Often, a stack frame will only make sense This operation is far from trivial. Often, a stack frame will only make sense
with correct machine registers values, which can be restored from the previous when the correct values are stored in the machine registers. These values,
stack frame, imposing to \emph{walk} the stack, reading the entries one after however, are to be restored from the previous stack frame, where they are
stored. This requires \emph{walking} the stack, reading the entries one after
the other, instead of peeking at some frame directly. Moreover, the size of one the other, instead of peeking at some frame directly. Moreover, the size of one
stack frame is often not that easy to determine when looking at some stack frame is often not that easy to determine when looking at some
instruction other than \texttt{return}, making it hard to extract single frames instruction other than \texttt{return}, making it hard to extract single frames
@ -138,28 +137,29 @@ frame, and thus be able to decode the next frame recursively, is called
Let us consider a stack with x86\_64 calling conventions, such as shown in Let us consider a stack with x86\_64 calling conventions, such as shown in
Figure~\ref{fig:call_stack}. Assuming the compiler decided here \emph{not} to Figure~\ref{fig:call_stack}. Assuming the compiler decided here \emph{not} to
use \reg{rbp}, and assuming the function \eg{} allocates a buffer of 8 use \reg{rbp}, and assuming the function allocates \eg{} a buffer of 8
integers, the area allocated for local variables should be at least $32$ bytes integers, the area allocated for local variables should be at least $32$ bytes
long (for 4-byte integers), and \reg{rsp} will be pointing below this area. long (for 4-byte integers), and \reg{rsp} will be pointing below this area.
Apart from analyzing the assembly code produced, there is no way to find where Apart from analyzing the assembly code produced, there is no way to find where
the return address is stored, relative to \reg{rsp}, at some arbitrary point the return address is stored, relative to \reg{rsp}, at some arbitrary point
of the function. Even when \reg{rbp} is used, there is no easy way to guess of the function. Even when \reg{rbp} is used, there is no easy way to guess
where each callee-saved register is stored in the stack frame, and worse, which where each callee-saved register is stored in the stack frame, since the
callee-saved registers were saved, since it is optional to save a register compiler is free to do as it wishes. Even worse, it is not trivial to know
that the function never touches. which callee-saved registers were saved at all, since if the function does not alter a
register, it does not have to save it.
With this example, it seems pretty clear that it is often necessary to have With this example, it seems pretty clear that some additional data is necessary
additional data to perform stack unwinding. This data is often stored among the to perform stack unwinding reliably, without resorting to guesswork. This
debugging informations of a program, and one common format of debugging data is data is stored along with the debugging information of a program, and one
DWARF\@. common format of debugging data is DWARF\@.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Unwinding usage and frequency} \subsection{Unwinding usage and frequency}
Stack unwinding is a more common operation than one might think at first. The Stack unwinding is a more common operation than one might think at first. The
most commonly thought use-case is simply to get a stack trace of a program, and use case mostly thought of is simply to get a stack trace of a program, and
provide a debugger with the information it needs: for instance, when inspecting provide a debugger with the information it needs. For instance, when inspecting
a stack trace in \prog{gdb}, it is quite common to jump to a previous frame: a stack trace in \prog{gdb}, a common operation is to jump to a previous frame:
\lstinputlisting{src/segfault/gdb_session} \lstinputlisting{src/segfault/gdb_session}
@ -176,39 +176,43 @@ Linux --~see Section~\ref{ssec:perf} --, is used to measure and analyze in
which functions a program spends its time, identify bottlenecks and find out which functions a program spends its time, identify bottlenecks and find out
which parts are critical to optimize. To do so, modern profilers pause the which parts are critical to optimize. To do so, modern profilers pause the
traced program at regular, short intervals, inspect its stack, and determine traced program at regular, short intervals, inspect its stack, and determine
which function is currently being run. They also often perform a stack which function is currently being run. They also perform a stack unwinding to
unwinding to determine the call path to this function, to determine which figure out the call path to this function, in order to determine which function
function indirectly takes time: \eg, a function \lstc{fct_a} can call both indirectly takes time: for instance, a function \lstc{fct_a} can call both
\lstc{fct_b} and \lstc{fct_c}, which take a lot of time; spend practically no \lstc{fct_b} and \lstc{fct_c}, which both take a lot of time; spend practically
time directly in \lstc{fct_a}, but spend a lot of time in calls to the other no time directly in \lstc{fct_a}, but spend a lot of time in calls to the other
two functions that were made from \lstc{fct_a}. two functions that were made from \lstc{fct_a}. Knowing that after all,
\lstc{fct_a} is the culprit can be useful to a programmer.
Exception handling also requires a stack unwinding mechanism in most languages. Exception handling also requires a stack unwinding mechanism in most languages.
Indeed, an exception is completely different from a \lstinline{return}: while the Indeed, an exception is completely different from a \lstinline{return}: while
latter returns to the previous function, the former can be caught by virtually the latter returns to the previous function, at a well-defined IP, the former
any function in the call path, at any point of the function. It is thus can be caught by virtually any function in the call path, at any point of the
necessary to be able to unwind frames, one by one, until a suitable function. It is thus necessary to be able to unwind frames, one by one, until a
\lstc{catch} block is found. The C++ language, for one, includes a suitable \lstc{catch} block is found. The C++ language, for one, includes a
stack-unwinding library similar to \prog{libunwind} in its runtime. stack-unwinding library similar to \prog{libunwind} in its runtime.
Technically, exception handling could be implemented without any stack Technically, exception handling could be implemented without any stack
unwinding, by using \lstc{setjmp}/\lstc{longjmp} mechanics~\cite{niditoexn}. unwinding, by using \lstc{setjmp} and \lstc{longjmp}
However, this is not possible to implement it straight away in C++ (and some mechanics~\cite{niditoexn}. However, it is not possible to implement this
other languages), because the stack needs to be properly unwound in order to straight away in C++ (among others), because the stack needs to be
trigger the destructors of stack-allocated objects. Furthermore, this is often properly unwound in order to trigger the destructors of stack-allocated
undesirable: \lstc{setjmp} has a quite big overhead, which is introduced objects. Furthermore, this is often undesirable: \lstc{setjmp} introduces an
whenever a \lstc{try} block is encountered. Instead, it is often preferred to overhead, which is hit whenever a \lstc{try} block is encountered. Instead, it
have strictly no overhead when no exception happens, at the cost of a greater is often preferred to have strictly no overhead when no exception happens, at
overhead when an exception is actually fired --~after all, they are supposed to the cost of a greater overhead when an exception is actually fired --~after
be \emph{exceptional}. For more details on C++ exception handling, all, they are supposed to be \emph{exceptional}. For more details on C++
see~\cite{koening1990exception} (especially Section~16.5). Possible exception handling, see~\cite{koening1990exception} (especially Section~16.5).
implementation mechanisms are also presented in~\cite{dinechin2000exn}. Possible implementation mechanisms are also presented
in~\cite{dinechin2000exn}.
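To make the \lstc{setjmp}/\lstc{longjmp} alternative concrete, here is a minimal C sketch of a ``try/catch'' built on these two standard functions. The helper names (\lstc{throw_exception}, \lstc{checked_div}, \lstc{try_div}) and the single global handler pointer are hypothetical and only serve the illustration; a real runtime would at least need per-thread handler stacks.

```c
#include <setjmp.h>
#include <stdlib.h>

/* Innermost active "try" block, or NULL if there is none
 * (hypothetical design: one global chain, not thread-safe). */
static jmp_buf *current_handler = NULL;

/* "throw": jump straight back to the innermost try block.
 * No intermediate frame is visited, so no cleanup code runs. */
static void throw_exception(void)
{
    if (current_handler == NULL)
        abort(); /* uncaught exception */
    longjmp(*current_handler, 1);
}

/* A function that may "throw". */
static int checked_div(int a, int b)
{
    if (b == 0)
        throw_exception();
    return a / b;
}

/* "try { *out = a / b; } catch (...) { return -1; }"
 * Returns 0 on success, -1 if an exception was caught. */
int try_div(int a, int b, int *out)
{
    jmp_buf env;
    jmp_buf *saved = current_handler;

    current_handler = &env;
    if (setjmp(env) == 0) {      /* cost paid on every try entry */
        *out = checked_div(a, b);
        current_handler = saved; /* leave the try block normally */
        return 0;
    }
    current_handler = saved;     /* reached through longjmp */
    return -1;
}
```

Two points of the discussion above are visible here: \lstc{setjmp} is executed each time the ``try'' block is entered, even if nothing is ever thrown, and \lstc{longjmp} skips every intermediate frame, which is why destructors of stack-allocated objects could not be run with this scheme alone.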
In both of these cases, performance \emph{can} be a problem. In In both of these cases, performance \emph{can} be a problem. In
the latter, a slow unwinding directly impacts the overall program performance, the latter, a slow unwinding directly impacts the overall program performance,
particularly if a lot of exceptions are thrown and caught far away in their particularly if a lot of exceptions are thrown and caught far away in their
call path. In the former, profiling \emph{is} performance-heavy and often quite call path. As for the former, profiling \emph{is} performance-heavy and slow:
slow when analyzing large programs anyway. for a session analyzing the \prog{tor-browser} for two and a half minutes,
\prog{perf} spends $100\,\mu \text{s}$ analyzing each of the $325679$ samples,
that is, $300\,\text{ms}$ per second of program run with default settings.
One of the causes that inspired this internship was also Stephen Kell's One of the causes that inspired this internship was also Stephen Kell's
\prog{libcrunch}~\cite{kell2016libcrunch}, which makes heavy use of stack \prog{libcrunch}~\cite{kell2016libcrunch}, which makes heavy use of stack
@ -229,43 +233,43 @@ The DWARF data commonly includes type information about the variables in the
original programming language, correspondence of assembly instructions with a original programming language, correspondence of assembly instructions with a
line in the original source file, \ldots line in the original source file, \ldots
The format also specifies a way to represent unwinding data, as described in The format also specifies a way to represent unwinding data, as described in
the previous paragraph, in an ELF section originally called Section~\ref{ssec:stack_unwinding} above, in an ELF section originally called
\lstc{.debug_frame}, most often found as \ehframe. \lstc{.debug_frame}, but most often found as \ehframe.
For any binary, debugging information can easily get quite large if no For any binary, debugging information can easily take up space and grow bigger
attention is payed to keeping it as compact as possible. In this matter, DWARF than the program itself if no attention is paid to keeping it as compact as
does an excellent job, and everything is stored in a very compact way. This, possible when designing the file format. On this matter, DWARF does an
however, as we will see, makes it both difficult to parse correctly and quite excellent job, and everything is stored in a very compact way. This, however,
slow to interpret. as we will see, makes it both difficult to parse correctly and relatively slow
to interpret.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{DWARF unwinding data} \subsection{DWARF unwinding data}
The unwinding data, which we will call from now on the \ehframe, contains, for The unwinding data, which we will call from now on the \ehframe, contains, for
each possible instruction pointer (that is, an instruction address within the each possible IP, a set of ``registers'' that can be unwound, and a rule
program), a set of ``registers'' that can be unwound, and a rule describing how describing how to do so.
to do so.
The DWARF language is completely agnostic of the platform and ABI, and in The DWARF language is completely agnostic of the platform and ABI, and in
particular, is completely agnostic of a particular platform's registers. Thus, particular, is completely agnostic of a particular platform's registers. Thus,
when talking about DWARF, a register is merely a numerical identifier that is as far as DWARF is concerned, a register is merely a numerical identifier that
often, but not necessarily, mapped to a real machine register by the ABI\@. is often, but not necessarily, mapped to a real machine register by the ABI\@.
In practice, this data takes the form of a collection of tables, one table per In practice, this data takes the form of a collection of tables, one table per
Frame Description Entry (FDE). An FDE, in turn, is a DWARF entry describing such Frame Description Entry (FDE). An FDE, in turn, is a DWARF entry describing such
a table, that has a range of IPs on which it has authority. Most often, but not a table, that has a range of IPs on which it has authority. Most often, but not
necessarily, it corresponds to a single function in the original source code. necessarily, it corresponds to a single function in the original source code.
Each column of the table is a register (\eg{} \reg{rsp}), with two additional Each column of the table is a register (\eg{} \reg{rsp}), along with two
special registers, CFA (Canonical Frame Address) and RA (Return Address), additional special registers, CFA (Canonical Frame Address) and RA (Return
containing respectively the base pointer of the current stack frame and the Address), containing respectively the base pointer of the current stack frame
return address of the current function. For instance, on an x86\_64 and the return address of the current function. For instance, on an x86\_64
architecture, RA would contain the unwound value of \reg{rip}, the instruction architecture, RA would contain the unwound value of \reg{rip}, the instruction
pointer. Each row has a certain validity interval, on which it describes pointer. Each row has a certain validity interval, on which it describes
accurate unwinding data. This range starts at the instruction pointer it is accurate unwinding data. This range starts at the instruction pointer it is
associated with, and ends at the start IP of the next table row (or the end IP associated with, and ends at the start IP of the next table row --~or the end
of the current FDE if it was the last row). In particular, there can be no ``IP IP of the current FDE if it was the last row. In particular, there can be no
hole'' within an FDE --~unlike FDEs themselves, which can leave holes between ``IP hole'' within an FDE --~unlike FDEs themselves, which can leave holes
them. between them.
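The validity intervals described above can be written down explicitly. The
notation below is ours and is introduced for illustration only: an FDE is
assumed to have authority over the half-open range $[a, b)$, and its $n$ rows
are assumed to start at $\ell_1 < \dots < \ell_n$, with $\ell_1 = a$.
\[
    \mathrm{valid}(i) = [\ell_i,\, \ell_{i+1}) \quad (1 \leq i \leq n),
    \qquad \ell_{n+1} = b,
    \qquad \bigcup_{i=1}^{n} \mathrm{valid}(i) = [a,\, b).
\]
The last equality states the ``no IP hole'' property: every IP in the range of
the FDE is covered by exactly one row.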
\begin{figure}[h] \begin{figure}[h]
\begin{minipage}{0.45\textwidth} \begin{minipage}{0.45\textwidth}
@ -312,7 +316,7 @@ them.
\caption{Stack frame schema}\label{table:ex1_stack_schema} \caption{Stack frame schema}\label{table:ex1_stack_schema}
\end{table} \end{table}
For instance, the C source code in Listing~\ref{lst:ex1_c} above, when compiled For instance, the C source code in Listing~\ref{lst:ex1_c}, when compiled
with \lstbash{gcc -O1 -fomit-frame-pointer -fno-stack-protector}, yields the with \lstbash{gcc -O1 -fomit-frame-pointer -fno-stack-protector}, yields the
assembly code in Listing~\ref{lst:ex1_asm}. The memory layout of the stack assembly code in Listing~\ref{lst:ex1_asm}. The memory layout of the stack
frame is presented in Table~\ref{table:ex1_stack_schema}, to help understanding frame is presented in Table~\ref{table:ex1_stack_schema}, to help understanding
@ -380,9 +384,9 @@ Figure~\ref{fig:fde_line_density} was generated on a random sample of around
The most commonly used library to perform stack unwinding, in the Linux The most commonly used library to perform stack unwinding, in the Linux
ecosystem, is \prog{libunwind}~\cite{libunwind}. While it is very robust and ecosystem, is \prog{libunwind}~\cite{libunwind}. While it is very robust and
quite efficient, most of its optimization comes from fine-tuned code and good decently efficient, most of its optimization comes from fine-tuned code and
caching mechanisms. While parsing DWARF, \prog{libunwind} is forced to parse good caching mechanisms. When parsing DWARF, \prog{libunwind} is forced to
the relevant FDE from its start, until it finds the row it was seeking. parse the relevant FDE from its start, until it finds the row it was seeking.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@ -392,9 +396,9 @@ the relevant FDE from its start, until it finds the row it was seeking.
We will now define semantics covering most of the operations used for FDEs We will now define semantics covering most of the operations used for FDEs
described in the DWARF standard~\cite{dwarf5std}, such as seen in described in the DWARF standard~\cite{dwarf5std}, such as seen in
Listing~\ref{lst:ex1_dwraw}, with the exception of DWARF expressions. These are Listing~\ref{lst:ex1_dwraw}, with the exception of DWARF expressions. These are
not exhaustively treated because they are quite rich and would take a lot of not exhaustively treated because they form a rich language and would take a lot
time and space to formalize, and are moreover only seldom used (see the of time and space to formalize, and are moreover only seldom used (see
DWARF statistics regarding this). the DWARF statistics regarding this).
These semantics are defined with respect to the well-formalized C language, and These semantics are defined with respect to the well-formalized C language, and
pass through an intermediary language. The DWARF language can read the pass through an intermediary language. The DWARF language can read the
@ -650,7 +654,7 @@ earlier. The translation from $\intermedlang$ to C is defined as follows:
if(ip >= $loc$) { if(ip >= $loc$) {
for(int reg=0; reg < NB_REGS; ++reg) for(int reg=0; reg < NB_REGS; ++reg)
new_ctx[reg] = $\semR{row[reg]}$; new_ctx[reg] = $\semR{row[reg]}$;
goto end_ifs; // Avoid if/else if problems goto end_ifs; // Avoid using `else if` (easier for generation)
} }
\end{lstlisting} \end{lstlisting}
\end{itemize} \end{itemize}
@ -688,9 +692,8 @@ licenses.
The rough idea of the compilation is to produce, out of the \ehframe{} section The rough idea of the compilation is to produce, out of the \ehframe{} section
of a binary, C code that resembles the code shown in the DWARF semantics from of a binary, C code that resembles the code shown in the DWARF semantics from
Section~\ref{sec:semantics} above. This C code is then compiled by GCC in Section~\ref{sec:semantics} above. This C code is then compiled by GCC in
\lstbash{-O2} mode, since it already provides a good level of optimization and \lstbash{-O2} mode. This saves us the trouble of optimizing the generated C
compiling in \lstbash{-O3} takes way too much time. This saves us the trouble code whenever GCC does that by itself.
of optimizing the generated C code whenever GCC does that by itself.
The generated code consists of a single monolithic function, \lstc{_eh_elf}, The generated code consists of a single monolithic function, \lstc{_eh_elf},
taking as arguments an instruction pointer and a memory context (\ie{} the taking as arguments an instruction pointer and a memory context (\ie{} the
@ -698,15 +701,15 @@ value of the various machine registers) as defined in
Listing~\ref{lst:unw_ctx}. The function will then return a fresh memory Listing~\ref{lst:unw_ctx}. The function will then return a fresh memory
context, containing the values the registers hold after unwinding this frame. context, containing the values the registers hold after unwinding this frame.
The body of the function itself is mostly a huge switch, taking advantage of The body of the function itself consists of a single monolithic switch, taking
the non-standard --~yet widely implemented in C compilers~-- syntax for range advantage of the non-standard --~yet widely implemented in C compilers~--
switches, in which each \lstinline{case} can refer to a range. All the FDEs are syntax for range switches, in which each \lstinline{case} can refer to a range.
merged together into this switch, each row of an FDE being a switch case. All the FDEs are merged together into this switch, each row of an FDE being a
Separating the various FDEs in the C code --~other than with comments~-- is, switch case. Separating the various FDEs in the C code --~other than with
unlike what is done in DWARF, pointless, since accessing a ``row'' has a linear comments~-- is, unlike what is done in DWARF, pointless, since accessing a
cost, and the C code is not meant to be read, except maybe for debugging ``row'' has a linear cost, and the C code is not meant to be read, except maybe
purposes. The switch case bodies then fill a context with unwound values, then for debugging purposes. The switch case bodies then fill a context with
return it. unwound values, then return it.
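As a rough illustration of such a range switch, here is a minimal hand-written
sketch, not the code actually produced by the compiler. The type
\lstc{ctx_sketch_t}, the functions \lstc{load_word} and \lstc{unwind_sketch},
the addresses and the stack offsets are all hypothetical, and the context is
deliberately smaller than the one of Listing~\ref{lst:unw_ctx}. Case ranges are
a GNU C extension, accepted by GCC and Clang.

\begin{lstlisting}[language=C]
#include <stdint.h>

/* Hypothetical, simplified context: only what this sketch needs. */
typedef struct {
    uint64_t rip, rsp;
    int error;                /* set when no row covers the IP */
} ctx_sketch_t;

/* Reads one 64-bit word from memory (hypothetical helper). */
static uint64_t load_word(uint64_t addr) {
    return *(const uint64_t *)(uintptr_t)addr;
}

ctx_sketch_t unwind_sketch(ctx_sketch_t ctx, uint64_t ip) {
    ctx_sketch_t out = {0, 0, 0};
    uint64_t cfa;
    switch (ip) {
        /* FDE of a hypothetical function at [0x1000, 0x1010) */
        case 0x1000 ... 0x1003:       /* row 1: CFA = rsp + 8 */
            cfa = ctx.rsp + 8;
            break;
        case 0x1004 ... 0x100f:       /* row 2: CFA = rsp + 24 */
            cfa = ctx.rsp + 24;
            break;
        default:                      /* IP hole between FDEs */
            out.error = 1;
            return out;
    }
    out.rip = load_word(cfa - 8);     /* RA is stored at CFA - 8 */
    out.rsp = cfa;                    /* caller's rsp is the CFA */
    return out;
}
\end{lstlisting}

In this sketch, an IP that falls in no row reaches the \lstc{default} case and
is reported through the error flag.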
A setting of the compiler also optionally enables another parameter to the A setting of the compiler also optionally enables another parameter to the
\lstc{_eh_elf} function, \lstc{deref}, which is a function pointer. This \lstc{_eh_elf} function, \lstc{deref}, which is a function pointer. This
@ -720,13 +723,13 @@ Unlike in the \ehframe, and unlike what should be done in a release,
real-world-proof version of the \ehelfs, the choice was made to keep this real-world-proof version of the \ehelfs, the choice was made to keep this
implementation simple, and only handle the few registers that were needed to implementation simple, and only handle the few registers that were needed to
simply unwind the stack. Thus, the only registers handled in \ehelfs{} are simply unwind the stack. Thus, the only registers handled in \ehelfs{} are
\reg{rip}, \reg{rbp}, \reg{rsp} and \reg{rbx}, the latter being used quite \reg{rip}, \reg{rbp}, \reg{rsp} and \reg{rbx}, the latter being used a few
often in \prog{libc} to hold the CFA address. This is enough to unwind the times in \prog{libc} to hold the CFA address in common functions. This is
stack reliably, and thus enough for profiling, but is not sufficient to analyze enough to unwind the stack reliably, and thus enough for profiling, but is not
every stack frame as \prog{gdb} would do after a \lstbash{frame n} command. sufficient to analyze every stack frame as \prog{gdb} would do after a
Yet, if one were to enhance the code to handle every register, it would not be \lstbash{frame n} command. Yet, if one were to enhance the code to handle every
much harder and would probably be only a few hours of code refactoring and register, it would not be much harder and would probably be only a few hours of
rewriting. code refactoring and rewriting.
\lstinputlisting[language=C, caption={Unwinding context}, label={lst:unw_ctx}] \lstinputlisting[language=C, caption={Unwinding context}, label={lst:unw_ctx}]
{src/dwarf_assembly_context/unwind_context.c} {src/dwarf_assembly_context/unwind_context.c}
@ -743,16 +746,15 @@ by including an error flag by lack of $\bot$ value.
This generated data is stored in separate shared object files, which we call This generated data is stored in separate shared object files, which we call
\ehelfs. It would have been possible to alter the original ELF file to embed \ehelfs. It would have been possible to alter the original ELF file to embed
this data as a new section, but getting it to be executed just as any this data as a new section, but getting it to be executed just as any portion
portion of the \lstc{.text} section would probably have been painful, and of the \lstc{.text} section would probably have been painful, and keeping it
keeping it separated during the experimental phase is quite convenient. It is separated during the experimental phase is convenient. It is possible to have
possible to have multiple versions of \ehelfs{} files in parallel, with various multiple versions of \ehelfs{} files in parallel, with various options turned
options turned on or off, and it does not require altering the base system by on or off, and it does not require altering the base system by editing \eg{}
editing \eg{} \texttt{/usr/lib/libc-*.so}. Instead, when the \ehelf{} data is \texttt{/usr/lib/libc-*.so}. Instead, when the \ehelf{} data is required, those
required, those files can simply be \lstc{dlopen}'d. It is also possible to files can simply be \lstc{dlopen}'d. It is also possible to imagine, in a
imagine, in a future production environment, packaging \ehelfs{} files future production environment, packaging \ehelfs{} files separately, so that
separately, so that people interested in heavy computation can have the choice people interested in heavy computation can have the choice to install them.
to install them.
This, in particular, means that each ELF file has its unwinding data in a This, in particular, means that each ELF file has its unwinding data in a
separate \ehelf{} file --~just like with DWARF, where each ELF retains its own separate \ehelf{} file --~just like with DWARF, where each ELF retains its own
@ -781,10 +783,8 @@ possible to produce a compiled version very close to the one described in
Section~\ref{sec:semantics}. Although the unwinding speed cannot yet be Section~\ref{sec:semantics}. Although the unwinding speed cannot yet be
actually benchmarked, it is already possible to write in a few hundred lines of actually benchmarked, it is already possible to write in a few hundred lines of
C code a simple stack walker printing the functions traversed. It already works C code a simple stack walker printing the functions traversed. It already works
without any problem on the easily tested cases, since corner cases are mostly well on the standard cases that are easily tested, and can be used to unwind
found in standard and highly optimized libraries, and it is not that easy to get the stack of simple programs.
the program to stop and print a stack trace from within a system library
without using a debugger.
The major drawback of this approach, without any particular care taken, is the The major drawback of this approach, without any particular care taken, is the
space waste. The space taken by those tentative \ehelfs{} is analyzed in space waste. The space taken by those tentative \ehelfs{} is analyzed in
@ -835,14 +835,16 @@ made in order to shrink the \ehelfs.
\medskip \medskip
The major optimization that most reduced the output size was to use an if/else The major optimization that most reduced the output size was to use an if/else
tree implementing a binary search on the relevant program counter intervals, tree implementing a binary search on the relevant instruction pointer
instead of a huge switch. In the process, we also \emph{outline} a lot of code, intervals, instead of a single monolithic switch. In the process, we also
that is, find identical ``switch cases'' bodies --~which are not switch \emph{outline} code whenever possible, that is, find identical ``switch
cases anymore, but if bodies~--, move them outside of the if/else tree, cases'' bodies --~which are not switch cases anymore, but if bodies~--, move
identify them by a label, and jump to them using a \lstc{goto}, which them outside of the if/else tree, identify them by a label, and jump to them
de-duplicates a lot of code and contributes greatly to the shrinking. In the using a \lstc{goto}, which de-duplicates a lot of code and contributes greatly
process, we noticed that the vast majority of FDE rows are actually taken among to the shrinking. In the process, we noticed that the vast majority of FDE rows
very few ``common'' FDE rows. are actually taken among very few ``common'' FDE rows. For instance, in the
\prog{libc}, out of a total of $20827$ rows, only $302$ ($1.5\,\%$) remain
after the outlining.
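The shape of the resulting code can be sketched as follows. This is a
hand-written illustration under assumed values, not actual compiler output: the
four intervals, the offsets, the labels and the names \lstc{row_result_t} and
\lstc{lookup_sketch} are all hypothetical.

\begin{lstlisting}[language=C]
#include <stdint.h>

typedef struct { uint64_t cfa; int error; } row_result_t;

/* Four hypothetical rows covering [0x1000, 0x1040). */
row_result_t lookup_sketch(uint64_t ip, uint64_t rsp) {
    row_result_t out = {0, 0};
    if (ip < 0x1000 || ip >= 0x1040) goto no_row;

    /* Binary search on the interval bounds. */
    if (ip < 0x1020) {
        if (ip < 0x1010) goto cfa_rsp_8;    /* [0x1000, 0x1010) */
        else             goto cfa_rsp_16;   /* [0x1010, 0x1020) */
    } else {
        if (ip < 0x1030) goto cfa_rsp_8;    /* [0x1020, 0x1030) */
        else             goto cfa_rsp_24;   /* [0x1030, 0x1040) */
    }

/* Outlined bodies: each distinct row body appears only once. */
cfa_rsp_8:
    out.cfa = rsp + 8;
    return out;
cfa_rsp_16:
    out.cfa = rsp + 16;
    return out;
cfa_rsp_24:
    out.cfa = rsp + 24;
    return out;
no_row:
    out.error = 1;
    return out;
}
\end{lstlisting}

Here the first and third intervals share a single outlined body, which is the
kind of redundancy among FDE rows reported above.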
This makes this optimization really efficient, as seen later in This makes this optimization really efficient, as seen later in
Section~\ref{ssec:results_size}, but also makes it an interesting question Section~\ref{ssec:results_size}, but also makes it an interesting question
@ -861,10 +863,10 @@ DWARF data could be efficiently compressed in this way.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Benchmarking} \section{Benchmarking}\label{sec:benchmarking}
Benchmarking turned out to be, quite surprisingly, the hardest part of the Benchmarking turned out to be, quite surprisingly, the hardest part of the
project. It ended up requiring a lot of investigation to find a working project. It ended up requiring a good deal of investigation to find a working
protocol, and afterwards, a good deal of code reading and coding to get the protocol, and afterwards, a good deal of code reading and coding to get the
solution working. solution working.
@ -884,13 +886,15 @@ are made from different locations is somehow cheating, since it makes useless
distribution. All in all, the benchmarking method must have a ``natural'' distribution. All in all, the benchmarking method must have a ``natural''
distribution of unwindings. distribution of unwindings.
Another requirement is to also distribute the unwinding points quite evenly Another requirement is to also distribute the unwinding points evenly enough
across the program: we would like to benchmark stack unwindings crossing some across the program to mimic real-world unwinding: we would like to benchmark
standard library functions, starting from inside them, etc. stack unwindings crossing some standard library functions, starting from inside
them, etc.
Finally, the unwound program must be interesting enough to enter and exit a lot Finally, the unwound program must be interesting enough to enter and exit
of functions, nest function calls, have FDEs that are not as simple as in functions often, building a good stack of nested function calls (at least 5
Listing~\ref{lst:ex1_dw}, etc. frequently), have FDEs that are not as simple as in Listing~\ref{lst:ex1_dw},
etc.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@ -915,7 +919,8 @@ program, but instead inject code in it.
\subsection{Benchmarking with \prog{perf}}\label{ssec:bench_perf} \subsection{Benchmarking with \prog{perf}}\label{ssec:bench_perf}
In the context of this internship, the main advantage of \prog{perf} is that it In the context of this internship, the main advantage of \prog{perf} is that it
does a lot of stack unwinding. It also meets all the requirements from unwinds the stack on a regular, controllable basis, easily unwinding thousands
 of times in a few seconds. It also meets all the requirements from
Section~\ref{ssec:bench_req} above: since it stops at regular intervals and Section~\ref{ssec:bench_req} above: since it stops at regular intervals and
unwinds, the unwindings are evenly distributed \wrt{} the frequency of unwinds, the unwindings are evenly distributed \wrt{} the frequency of
execution of the code, which is a natural enough setup for the benchmarks to be execution of the code, which is a natural enough setup for the benchmarks to be
@ -966,10 +971,10 @@ the repositories \prog{perf-eh\_elf} and \prog{libunwind-eh\_elf}.
The first benchmarking approach we tried was to write some specific C code The first benchmarking approach we tried was to write some specific C code
that would meet the requirements from Section~\ref{ssec:bench_req}, while that would meet the requirements from Section~\ref{ssec:bench_req}, while
itself calling a benchmarking procedure from time to time. This was abandoned itself calling a benchmarking procedure from time to time. This was quickly
quite quickly, because generating C code interesting enough to be unwound abandoned, because generating C code interesting enough to be unwound turned
turned out to be hard, and the generated FDEs invariably ended up uninteresting. It out to be hard, and the generated FDEs invariably ended up uninteresting. It would
would also never have met the requirement of unwinding from fairly distributed also never have met the requirement of unwinding from fairly distributed
locations anyway. locations anyway.
Another attempt was made using CSmith~\cite{csmith}, a random C code generator Another attempt was made using CSmith~\cite{csmith}, a random C code generator