From 73f016f44cde819edac0766fe79736c0c8a5522a Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Th=C3=A9ophile=20Bastian?=
Date: Sun, 19 Aug 2018 13:13:07 +0200
Subject: [PATCH] Live changes during meeting

---
 report/fiche_synthese.tex | 115 ++++++++++++++++++++------------------
 report/report.tex         |  45 ++++++++-------
 2 files changed, 82 insertions(+), 78 deletions(-)

diff --git a/report/fiche_synthese.tex b/report/fiche_synthese.tex
index 48095d9..93067d2 100644
--- a/report/fiche_synthese.tex
+++ b/report/fiche_synthese.tex
@@ -8,17 +8,17 @@

 \subsection*{The general context}

-The standard debugging data format for ELF binary files, DWARF, contains tables
-that permit, for a given instruction pointer (IP), to understand how the
-assembly instruction relates to the source code, where variables are currently
-allocated in memory or if they are stored in a register, what are their type
-and how to unwind the current stack frame. This inforation is generated when
-passing \eg{} the switch \lstbash{-g} to \prog{gcc} or equivalents.
+The standard debugging data format, DWARF, contains tables that, for a given
+instruction pointer (IP), permit one to understand how the assembly instruction
+relates to the source code, where variables are currently allocated in memory
+or if they are stored in a register, what their types are and how to unwind the
+current stack frame. This information is generated when passing \eg{} the
+switch \lstbash{-g} to \prog{gcc} or equivalents.

 Even in stripped (non-debug) binaries, a small portion of DWARF data remains:
 the stack unwinding data. This information is necessary to unwind stack
 frames, restoring machine registers to the value they had in the previous
-frame, for instance within the context of a debugger or a profiler.
+frame.

 This data is structured into tables, each row corresponding to an IP range
 for which it describes valid unwinding data, and each column describing how to
@@ -34,28 +34,29 @@ computation~\cite{oakley2011exploiting}.

 As debugging data can easily take an unreasonable space and grow larger than
 the program itself if stored carelessly, the DWARF standard pays a great
-attention to data compactness and compression, and succeeds particularly well
-at it. But this, as always, is at the expense of efficiency: accessing stack
-unwinding data for a particular program point is not a light operation --~in
-the order of magnitude of $10\,\mu{}\text{s}$ on a modern computer.
+attention to data compactness and compression. It succeeds particularly well
+at it, but at the expense of efficiency: accessing stack
+unwinding data for a particular program point is an expensive operation --~the
+order of magnitude is $10\,\mu{}\text{s}$ on a modern computer.

-This is often not a huge problem, as stack unwinding is often thought of as a
+This is usually not a problem, as stack unwinding is often thought of as a
 debugging procedure: when something behaves unexpectedly, the programmer might
 be interested in opening their debugger and exploring the stack. Yet, stack
-unwinding might, in some cases, be performance-critical: for instance, profiler
-programs needs to perform a whole lot of stack unwindings. Even worse,
-exception handling relies on stack unwinding in order to find a suitable
-catch-block! For such applications, it might be desirable to find a different
-time/space trade-off, storing a bit more for a faster unwinding.
+unwinding might, in some cases, be performance-critical: for instance, polling
+profilers repeatedly perform stack unwindings to observe which functions are
+active. Even worse, C++ exception handling relies on stack unwinding in order
+to find a suitable catch-block! For such applications, it might be desirable to
+find a different time/space trade-off, storing a bit more for a faster
+unwinding.

 This different trade-off is the question that I explored during this
 internship: what good alternative trade-off is reachable when storing the stack
 unwinding data completely differently?

-It seems that the subject has not really been explored yet, and as of now, the
-most widely used library for stack unwinding,
-\prog{libunwind}~\cite{libunwind}, essentially makes use of aggressive but
-fine-tuned caching and optimized code to mitigate this problem.
+It seems that the subject has not been explored yet, and as of now, the most
+widely used library for stack unwinding, \prog{libunwind}~\cite{libunwind},
+essentially makes use of aggressive but fine-tuned caching and optimized code
+to mitigate this problem.

 % What is the question that you studied?
 % Why is it important, what are the applications/consequences?
@@ -73,27 +74,25 @@ of compiled DWARF into existing projects have been made easy by implementing an
 alternative version of the \textit{de facto} standard library for this purpose,
 \prog{libunwind}.

-Multiple approaches have been tried, in order to determine which compilation
-process leads to the best time/space trade-off.
+Multiple approaches have been tried and evaluated to determine which
+compilation process leads to the best time/space trade-off.

 Unexpectedly, the part that proved hardest of the project was finding and
 implementing a benchmarking protocol that was both relevant and reliable.
-Unwinding one single frame is way too fast to provide a reliable benchmarking
-on a few samples (around $10\,\mu s$ per frame). Having enough samples for this
-purpose --~at least a few thousands~-- is not easy, since one must avoid
-unwinding the same frame over and over again, which would only benchmark the
-caching mechanism. The other problem is to distribute evenly the unwinding
-measures across the various IPs, including directly into the loaded libraries
-(\eg{} the \prog{libc}).
-
+Unwinding a single frame takes only around $10\,\mu s$, far too fast for a
+handful of samples to yield statistically reliable measurements. Having
+enough samples for this purpose --~at least a few thousand~-- is not easy,
+since one must avoid unwinding the same frame over and over again, which would
+only benchmark the caching mechanism. The other problem is to distribute the
+unwinding measurements evenly across the various IPs, including directly into
+the loaded libraries (\eg{} the \prog{libc}).
 The solution eventually chosen was to modify \prog{perf}, the standard
 profiling program for Linux, in order to gather statistics and benchmarks of
 its unwindings. Modifying \prog{perf} was an additional challenge that turned
-out to be harder than expected, since the source code is pretty opaque to
-someone who doesn't know the project well, and the optimisations make some
-parts counter-intuitive. This, in particular, required to produce an
-alternative version of \prog{libunwind} interfaced with the compiled debugging
-data.
+out to be harder than expected, since the source code is hard to read, and
+optimisations make some parts counter-intuitive. To overcome this, we designed
+an alternative version of \prog{libunwind} interfaced with the
+compiled debugging data.

 % What is your solution to the question described in the last paragraph?
 %
@@ -108,12 +107,19 @@ data.
 %
 % Comment the robustness of your solution: how does it rely/depend on the
 % working assumptions?
-The goal was to obtain a compiled version of unwinding data that was faster
-than DWARF, reasonably heavier and reliable. The benchmarks mentioned have
-yielded convincing results: on the experimental setup created (detailed on
-Section~\ref{sec:benchmarking} below), the compiled version is around 26 times
-faster than the DWARF version, while it remains only around 2.5 times bigger
-than the original data.
+The goal of this project was to design a compiled version of unwinding data
+that is faster than DWARF, while still being reliable and reasonably compact.
+The benchmarks mentioned have yielded convincing results: on the experimental
+setup created --~detailed in Section~\ref{sec:benchmarking} below~--,
+the compiled version is around 26 times faster than the DWARF version, while it
+remains only around 2.5 times bigger than the original data.
+
+We support the vast majority --~more than $99.9\,\%$~-- of the instructions
+actually used in binaries, although we do not support all of the DWARF5
+instruction set. We are almost as robust as \prog{libunwind}: on a
+$27000$-sample test, $885$ failures were observed for \prog{libunwind},
+against $1099$ for the compiled DWARF version (failures are due to signal
+handlers, unusual instructions, \ldots) --~see Section~\ref{ssec:timeperf}.

 The implementation is not yet release-ready, as it does not support 100\ \% of
 the DWARF5 specification~\cite{dwarf5std} --~see Section~\ref{ssec:ehelfs}
@@ -123,13 +129,13 @@ the reference implementation. Indeed, corner cases occur often, and on a
 27000 samples test, 885 failures were observed for \prog{libunwind}, against
 1099 for the compiled DWARF version (see Section~\ref{ssec:timeperf}).

-The implementation, however, as a few other limitations. It only supports the
+The implementation, however, is not production-ready: it only supports the
 x86\_64 architecture, and relies to some extent on the Linux operating system.
-But none of those are real problems in practice. Other processor architectures
-and ABIs are only a matter of time spent and engineering work; and the
-operating system dependency is only present in the libraries developed in order
-to interact with the compiled unwinding data, which can be developed for
-virtually any operating system.
+None of those are real problems in practice. Supporting other processor
+architectures and ABIs is only a matter of engineering. The operating system
+dependency is only present in the libraries developed in order to interact with
+the compiled unwinding data, which can be developed for virtually any operating
+system.

 \subsection*{Summary and future work}

 In most cases of everyday's life, a slow stack unwinding is not a problem, left
 apart an annoyance. Yet, having a 26 times speed-up on stack unwinding-heavy
 tasks can be really useful to \eg{} profile large programs, particularly if one
 wants to profile many times in order to analyze the impact of multiple changes.
-It can also be useful for exception-heavy programs. Thus, it might be
-interesting to implement a more stable version, and try to interface it cleanly
-with mainstream tools, such as \prog{perf}.
+It can also be useful for exception-heavy programs. Thus, we plan to address
+the limitations and integrate it cleanly with mainstream tools, such as
+\prog{perf}.

-Another question worth exploring might be whether it is possible to shrink even
-more the original DWARF unwinding data, which would be stored in a format not
-too far from the original standard, by applying techniques close to those
-used to shrink the compiled unwinding data.
+Another research direction is to investigate how to compress the original
+DWARF unwinding data even further using outlining techniques, as we already
+successfully do for the compiled data.

 % What is next? In which respect is your approach general?
 % What did your contribution bring to the area?
diff --git a/report/report.tex b/report/report.tex
index c011745..8eccec2 100644
--- a/report/report.tex
+++ b/report/report.tex
@@ -1,10 +1,11 @@
 \title{DWARF debugging data, compilation and optimization}

 \author{Théophile Bastian\\
-Under supervision of Francesco Zappa Nardelli\\
+Under supervision of Francesco Zappa Nardelli, March -- August 2018\\
 {\textsc{parkas}, \'Ecole Normale Supérieure de Paris}}

-\date{March -- August 2018\\August 20, 2018}
+%\date{March -- August 2018\\August 20, 2018}
+\date{\vspace{-2em}}

 \documentclass[11pt]{article}

@@ -54,8 +55,8 @@ Under supervision of Francesco Zappa Nardelli\\

 \subsection*{Source code}\label{ssec:source_code}

-All the source code produced during this internship is available openly. See
-Section~\ref{ssec:code_avail} for details.
+Our implementation is available from \url{https://git.tobast.fr/m2-internship}.
+See the \texttt{abstract} repository for an introductory \texttt{README}.

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@@ -102,25 +103,24 @@ copying \reg{rsp} to \reg{rbp}. This makes it easy to find the return address
 from anywhere within the function, and also allows for easy addressing of local
 variables. To some extents, it also allows for hot debugging, such as saving a
 useful core dump upon segfault. Yet, using \reg{rbp} to save \reg{rip} is not
-always done, since it somehow ``wastes'' a register. This decision is, on
-x86\_64 System V, up to the compiler.
+always done, since it wastes a register. This decision is, on x86\_64 System V,
+up to the compiler.

-Often, a function will start by subtracting some value to \reg{rsp}, allocating
-some space in the stack frame for its local variables. Then, it will push on
+Usually, a function starts by subtracting some value from \reg{rsp}, allocating
+some space in the stack frame for its local variables. Then, it pushes on
 the stack the values of the callee-saved registers that are overwritten later,
-effectively saving them. Before returning, it will pop the values of the saved
+effectively saving them. Before returning, it pops the values of the saved
 registers back to their original registers and restore \reg{rsp} to its former
 value.

 \subsection{Stack unwinding}\label{ssec:stack_unwinding}

-For various reasons, it might be interesting, at some point of the execution of
-a program, to glance at its program stack and be able to extract informations
-from it. For instance, when running a debugger such as \prog{gdb}, a frequent
-usage is to obtain a \emph{backtrace}, that is, the list of all nested function
-calls at the current IP\@. This actually reads the stack to find the different
-stack frames, and decode them to identify the function names, parameter values,
-etc.
+For various reasons, it is interesting, at some point of the execution of a
+program, to glance at its program stack and be able to extract information
+from it. For instance, when running a debugger, a frequent usage is to obtain a
+\emph{backtrace}, that is, the list of all nested function calls at the current
+IP\@. This actually observes the stack to find the different stack frames, and
+decodes them to identify the function names, parameter values, etc.

 This operation is far from trivial. Often, a stack frame will only make sense
 when the correct values are stored in the machine registers. These values,
@@ -184,7 +184,7 @@ no time directly in \lstc{fct_a}, but spend a lot of time in calls to the other
 two functions that were made from \lstc{fct_a}. Knowing that after all,
 \lstc{fct_a} is the culprit can be useful to a programmer.

-Exception handling also requires a stack unwinding mechanism in most languages.
+Exception handling also requires a stack unwinding mechanism in some languages.
 Indeed, an exception is completely different from a \lstinline{return}: while
 the latter returns to the previous function, at a well-defined IP, the former
 can be caught by virtually any function in the call path, at any point of the
@@ -313,7 +313,7 @@ between them.
 \\ \hline
 \end{tabular}
-  \caption{Stack frame schema}\label{table:ex1_stack_schema}
+  \caption{Stack frame schema for fib7 (horizontal layout)}\label{table:ex1_stack_schema}
 \end{table}

 For instance, the C source code in Listing~\ref{lst:ex1_c}, when compiled
@@ -492,8 +492,8 @@ Its grammar is as follows:
 \end{align*}

 The entry point of the grammar is a $\FDE$, which is a set of rows, each
-annotated with a machine address, the address from which it is valid. Note that
-the addresses are necessarily increasing within a FDE\@.
+annotated with a machine address, the address from which it is valid.
+The addresses are necessarily increasing within a FDE\@.

 Each row then represents, as a function mapping registers to values, a row of
 the unwinding table.
@@ -672,9 +672,8 @@ and $\semR{\bullet}$ is defined as
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \section{Stack unwinding data compilation}
-The tentative approach that was chosen to try to get better unwinding speeds at
-a reasonable space loss was to compile directly the \ehframe{} into native
-machine code on the x86\_64 platform.
+In this section, we will study all the design options we explored for the
+actual C implementation.

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsection{Code availability}\label{ssec:code_avail}
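
For readers who want to see what the unwinding interface discussed throughout this patch looks like from the client side, the sketch below walks the current call stack with \prog{libunwind}'s documented local-unwinding API (unw_getcontext, unw_init_local, unw_step, unw_get_reg, unw_get_proc_name). It is an illustration only, not a definitive implementation: the compiled-DWARF backend described in the report is presented as an alternative version of \prog{libunwind}, so a client of this kind is what both the vanilla and the modified library ultimately serve. Build flags vary by platform; linking with -lunwind is a common choice on Linux.

/* Illustrative sketch: walk the current call stack with libunwind's
 * local unwinding API.  Each unw_step() unwinds one frame -- the
 * operation whose cost (around 10 microseconds with vanilla DWARF-based
 * unwinding) the report sets out to reduce.
 * Build (platform-dependent), e.g.: gcc walk.c -lunwind */
#define UNW_LOCAL_ONLY
#include <libunwind.h>
#include <stdio.h>

static void print_backtrace(void)
{
    unw_context_t context;   /* snapshot of the machine registers */
    unw_cursor_t  cursor;    /* iterator over the stack frames */
    unw_word_t    ip, sp, off;
    char          name[128];

    unw_getcontext(&context);
    unw_init_local(&cursor, &context);

    /* unw_step() returns a positive value while frames remain. */
    while (unw_step(&cursor) > 0) {
        unw_get_reg(&cursor, UNW_REG_IP, &ip);
        unw_get_reg(&cursor, UNW_REG_SP, &sp);

        if (unw_get_proc_name(&cursor, name, sizeof(name), &off) == 0)
            printf("ip=%#018lx sp=%#018lx %s+%#lx\n",
                   (unsigned long)ip, (unsigned long)sp, name,
                   (unsigned long)off);
        else
            printf("ip=%#018lx sp=%#018lx <unknown>\n",
                   (unsigned long)ip, (unsigned long)sp);
    }
}

int main(void)
{
    print_backtrace();
    return 0;
}

\prog{perf}'s DWARF-based call-chain unwinding performs essentially the same frame-by-frame walk, only on sampled register and stack snapshots rather than on the live process, which is why the report instruments \prog{perf} to benchmark the unwinding routines under realistic conditions.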