Live changes during meeting

Théophile Bastian 2018-08-19 13:13:07 +02:00
parent 2f44049506
commit 73f016f44c
2 changed files with 82 additions and 78 deletions

@@ -8,17 +8,17 @@
\subsection*{The general context}

The standard debugging data format, DWARF, contains tables that make it
possible, for a given instruction pointer (IP), to determine how the assembly
instruction relates to the source code, where variables are currently allocated
in memory or whether they are stored in a register, what their types are, and
how to unwind the current stack frame. This information is generated when
passing \eg{} the switch \lstbash{-g} to \prog{gcc} or equivalents.

Even in stripped (non-debug) binaries, a small portion of DWARF data remains:
the stack unwinding data. This information is necessary to unwind stack
frames, restoring machine registers to the value they had in the previous
frame.

This data is structured into tables, each row corresponding to an IP range for
which it describes valid unwinding data, and each column describing how to
@@ -34,28 +34,29 @@ computation~\cite{oakley2011exploiting}.
As debugging data can easily take up an unreasonable amount of space and grow
larger than the program itself if stored carelessly, the DWARF standard pays
great attention to data compactness and compression. It succeeds particularly
well at this, but at the expense of efficiency: accessing stack unwinding data
for a particular program point is an expensive operation --~the order of
magnitude is $10\,\mu{}\text{s}$ on a modern computer.

This is often not a problem, as stack unwinding is often thought of as a
debugging procedure: when something behaves unexpectedly, the programmer might
be interested in opening their debugger and exploring the stack. Yet, stack
unwinding might, in some cases, be performance-critical: for instance, polling
profilers repeatedly perform stack unwindings to observe which functions are
active. Even worse, C++ exception handling relies on stack unwinding in order
to find a suitable catch-block! For such applications, it might be desirable to
find a different time/space trade-off, storing a bit more data for faster
unwinding.

This different trade-off is the question that I explored during this
internship: what good alternative trade-off is reachable when storing the stack
unwinding data completely differently?

It seems that the subject has not been explored yet, and as of now, the most
widely used library for stack unwinding, \prog{libunwind}~\cite{libunwind},
essentially relies on aggressive but fine-tuned caching and optimized code to
mitigate this problem.

% What is the question that you studied?
% Why is it important, what are the applications/consequences?
@@ -73,27 +74,25 @@ of compiled DWARF into existing projects have been made easy by implementing an
alternative version of the \textit{de facto} standard library for this purpose,
\prog{libunwind}.

Multiple approaches have been tried and evaluated to determine which
compilation process leads to the best time/space trade-off.

Unexpectedly, the hardest part of the project was finding and implementing a
benchmarking protocol that was both relevant and reliable. Unwinding a single
frame (around $10\,\mu s$) is too fast to be benchmarked reliably on only a
few samples. Gathering enough samples for this purpose --~at least a few
thousand~-- is not easy, since one must avoid unwinding the same frame over and
over again, which would only benchmark the caching mechanism. The other problem
is to distribute the unwinding measurements evenly across the various IPs,
including those directly inside the loaded libraries (\eg{} the \prog{libc}).

The solution eventually chosen was to modify \prog{perf}, the standard
profiling program for Linux, in order to gather statistics and benchmarks of
its unwindings. Modifying \prog{perf} was an additional challenge that turned
out to be harder than expected, since its source code is hard to read and its
optimisations make some parts counter-intuitive. To overcome this, we designed
an alternative version of \prog{libunwind} interfaced with the compiled
debugging data.

% What is your solution to the question described in the last paragraph?
%
@@ -108,12 +107,19 @@ data.
%
% Comment the robustness of your solution: how does it rely/depend on the working assumptions?

The goal of this project was to design a compiled version of unwinding data
that is faster than DWARF, while still being reliable and reasonably compact.
The benchmarks mentioned have yielded convincing results: on the experimental
setup created --~detailed in Section~\ref{sec:benchmarking} below~--, the
compiled version is around 26 times faster than the DWARF version, while it
remains only around 2.5 times bigger than the original data.

We support the vast majority --~more than $99.9\,\%$~-- of the instructions
actually used in binaries, although we do not support the full DWARF5
instruction set. We are almost as robust as \prog{libunwind}: on a
$27000$-sample test, 885 failures were observed for \prog{libunwind}, against
$1099$ for the compiled DWARF version (failures are due to signal handlers,
unusual instructions, \ldots) --~see Section~\ref{ssec:timeperf}.

The implementation is not yet release-ready, as it does not support 100\,\% of
the DWARF5 specification~\cite{dwarf5std} --~see Section~\ref{ssec:ehelfs}
@@ -123,13 +129,13 @@ the reference implementation. Indeed, corner cases occur often, and on a 27000
samples test, 885 failures were observed for \prog{libunwind}, against 1099 for
the compiled DWARF version (see Section~\ref{ssec:timeperf}).

The implementation, however, is not production-ready: it only supports the
x86\_64 architecture, and relies to some extent on the Linux operating system.
None of those are real problems in practice. Supporting other processor
architectures and ABIs is only a matter of engineering work. The operating
system dependency is only present in the libraries developed in order to
interact with the compiled unwinding data, which can be developed for virtually
any operating system.

\subsection*{Summary and future work}
@@ -137,14 +143,13 @@ In most everyday situations, a slow stack unwinding is not a problem, merely
an annoyance. Yet, a 26-fold speed-up on stack unwinding-heavy
tasks can be really useful to \eg{} profile large programs, particularly if one
wants to profile many times in order to analyze the impact of multiple changes.
It can also be useful for exception-heavy programs. Thus, we plan to address
the current limitations and integrate the implementation cleanly with
mainstream tools, such as \prog{perf}.

Another research direction is to investigate how to compress the original DWARF
unwinding data even further, using the outlining techniques that we already
apply successfully to the compiled data.

% What is next? In which respect is your approach general?
% What did your contribution bring to the area?

@@ -1,10 +1,11 @@
\title{DWARF debugging data, compilation and optimization}
\author{Théophile Bastian\\
Under supervision of Francesco Zappa Nardelli, March -- August 2018\\
{\textsc{parkas}, \'Ecole Normale Supérieure de Paris}}
%\date{March -- August 2018\\August 20, 2018}
\date{\vspace{-2em}}

\documentclass[11pt]{article}
@@ -54,8 +55,8 @@ Under supervision of Francesco Zappa Nardelli\\
\subsection*{Source code}\label{ssec:source_code}

Our implementation is available from \url{https://git.tobast.fr/m2-internship}.
See the \texttt{abstract} repository for an introductory \texttt{README}.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@@ -102,25 +103,24 @@ copying \reg{rsp} to \reg{rbp}. This makes it easy to find the return address
from anywhere within the function, and also allows for easy addressing of local
variables. To some extent, it also allows for hot debugging, such as saving a
useful core dump upon segfault. Yet, using \reg{rbp} to save \reg{rip} is not
always done, since it wastes a register. This decision is, on x86\_64 System V,
up to the compiler.

Usually, a function starts by subtracting some value from \reg{rsp}, allocating
some space in the stack frame for its local variables. Then, it pushes on the
stack the values of the callee-saved registers that are overwritten later,
effectively saving them. Before returning, it pops the values of the saved
registers back to their original registers and restores \reg{rsp} to its former
value.
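As an illustration of why the frame-pointer convention described above makes
unwinding easy, the following sketch (not part of our implementation) walks the
chain of saved \reg{rbp} values and prints the return address of every caller.
It only works if every function in the call path maintains the frame pointer,
\eg{} when everything is compiled with \lstbash{-fno-omit-frame-pointer}:

\begin{lstlisting}[language=C]
/* Naive frame-pointer-based unwinding: illustrative sketch only.
 * Assumes every frame keeps %rbp as a frame pointer, with the layout
 *   [rbp]   = saved %rbp of the caller
 *   [rbp+8] = return address pushed by `call`
 * The walk stops at a NULL saved %rbp, which the System V ABI asks the
 * outermost frame to provide; any frame compiled without a frame
 * pointer would break the chain. */
#include <stdio.h>

struct frame {
    struct frame *next; /* saved %rbp of the caller */
    void         *ret;  /* return address           */
};

static void naive_backtrace(void)
{
    /* GCC/Clang builtin: address of the current frame (%rbp). */
    struct frame *fp = __builtin_frame_address(0);

    while (fp != NULL) {
        printf("return address: %p\n", fp->ret);
        fp = fp->next; /* hop to the caller's frame */
    }
}
\end{lstlisting}

Precisely because this assumption cannot be made for optimized binaries that
omit the frame pointer, unwinders must fall back to DWARF unwinding data.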
\subsection{Stack unwinding}\label{ssec:stack_unwinding}

For various reasons, it is interesting, at some point of the execution of a
program, to glance at its program stack and be able to extract information
from it. For instance, when running a debugger, a frequent usage is to obtain a
\emph{backtrace}, that is, the list of all nested function calls at the current
IP\@. This actually observes the stack to find the different stack frames, and
decodes them to identify the function names, parameter values, etc.
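Programmatically, such a backtrace can be obtained through \prog{libunwind}'s
local unwinding interface; a minimal sketch using its documented API, with
error handling omitted:

\begin{lstlisting}[language=C]
/* Minimal local backtrace using libunwind; link with -lunwind.
 * Each call to unw_step() performs one stack frame unwinding, which is
 * exactly the operation whose cost this report is concerned with. */
#define UNW_LOCAL_ONLY
#include <libunwind.h>
#include <stdio.h>

static void print_backtrace(void)
{
    unw_context_t context;
    unw_cursor_t  cursor;
    unw_word_t    ip, sp;

    unw_getcontext(&context);          /* snapshot current registers */
    unw_init_local(&cursor, &context); /* start unwinding from here  */

    while (unw_step(&cursor) > 0) {    /* move to the previous frame */
        unw_get_reg(&cursor, UNW_REG_IP, &ip);
        unw_get_reg(&cursor, UNW_REG_SP, &sp);
        printf("ip = %#lx, sp = %#lx\n", (long)ip, (long)sp);
    }
}
\end{lstlisting}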
This operation is far from trivial. Often, a stack frame will only make sense
when the correct values are stored in the machine registers. These values,
@@ -184,7 +184,7 @@ no time directly in \lstc{fct_a}, but spend a lot of time in calls to the other
two functions that were made from \lstc{fct_a}. Knowing that, after all,
\lstc{fct_a} is the culprit can be useful to a programmer.

Exception handling also requires a stack unwinding mechanism in some languages.
Indeed, an exception is completely different from a \lstinline{return}: while
the latter returns to the previous function, at a well-defined IP, the former
can be caught by virtually any function in the call path, at any point of the
@@ -313,7 +313,7 @@ between them.
\\
\hline
\end{tabular}
\caption{Stack frame schema for fib7 (horizontal layout)}\label{table:ex1_stack_schema}
\end{table}

For instance, the C source code in Listing~\ref{lst:ex1_c}, when compiled
@@ -492,8 +492,8 @@ Its grammar is as follows:
\end{align*}

The entry point of the grammar is a $\FDE$, which is a set of rows, each
annotated with a machine address, the address from which it is valid.
The addresses are necessarily increasing within an FDE\@.
Each row then represents, as a function mapping registers to values, a row of
the unwinding table.
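Since the addresses are increasing, the row covering a given IP can be
retrieved by binary search. The sketch below illustrates this lookup on a
hypothetical, simplified in-memory representation of the table; the types and
field names are chosen for exposition and are not the actual DWARF encoding:

\begin{lstlisting}[language=C]
/* Simplified model: rows sorted by start_ip; each row is valid from its
 * start_ip up to the next row's start_ip. Only two columns are kept here
 * (CFA and return address); real tables have one column per register. */
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uintptr_t start_ip;   /* first IP covered by this row              */
    int32_t   cfa_offset; /* CFA = %rsp + cfa_offset (simplified rule) */
    int32_t   ra_offset;  /* return address stored at CFA + ra_offset  */
} unwind_row_t;

/* Return the last row whose start_ip is <= ip, or NULL if ip falls
 * before the first row of the FDE. */
static const unwind_row_t *
find_row(const unwind_row_t *rows, size_t n_rows, uintptr_t ip)
{
    const unwind_row_t *found = NULL;
    size_t lo = 0, hi = n_rows;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (rows[mid].start_ip <= ip) {
            found = &rows[mid]; /* candidate; a later row may still match */
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return found;
}
\end{lstlisting}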
@@ -672,9 +672,8 @@ and $\semR{\bullet}$ is defined as
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Stack unwinding data compilation}

In this section, we study the design options we explored for the actual C
implementation, which compiles the \ehframe{} directly into native code for the
x86\_64 platform.
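To make the idea concrete, the following is a purely illustrative, hand-written
example of the kind of C function a generator could emit for a single FDE; the
address bounds, function name and \lstc{unwind_state_t} type are hypothetical
and do not come from our implementation:

\begin{lstlisting}[language=C]
/* Hypothetical generated unwinder for one FDE covering [0x615, 0x69b):
 * the rows of the unwinding table become branches on the IP, and the
 * DWARF rules become plain arithmetic on the register values. */
#include <stdint.h>

typedef struct {
    uintptr_t cfa;      /* canonical frame address                      */
    uintptr_t ra_addr;  /* address holding the return address           */
    uintptr_t rbp_addr; /* address holding the caller's %rbp, 0 if none */
} unwind_state_t;

static void unwind_0x615_0x69b(uintptr_t ip, uintptr_t rsp, uintptr_t rbp,
                               unwind_state_t *out)
{
    if (ip < 0x616) {            /* before `push %rbp`          */
        out->cfa = rsp + 8;
        out->rbp_addr = 0;       /* caller's %rbp not saved yet */
    } else if (ip < 0x619) {     /* after push, before `mov`    */
        out->cfa = rsp + 16;
        out->rbp_addr = out->cfa - 16;
    } else {                     /* frame pointer fully set up  */
        out->cfa = rbp + 16;
        out->rbp_addr = out->cfa - 16;
    }
    out->ra_addr = out->cfa - 8; /* return address just below the CFA */
}
\end{lstlisting}

Evaluating a row then costs a handful of arithmetic instructions instead of an
interpretation of the DWARF bytecode, which is where the expected speed-up
comes from.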
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Code availability}\label{ssec:code_avail}