Live changes during meeting

Théophile Bastian 2018-08-19 13:13:07 +02:00
parent 2f44049506
commit 73f016f44c
2 changed files with 82 additions and 78 deletions

@@ -8,17 +8,17 @@
\subsection*{The general context}

The standard debugging data format, DWARF, contains tables that make it
possible, for a given instruction pointer (IP), to determine how the assembly
instruction relates to the source code, where variables are currently allocated
in memory or whether they are stored in a register, what their types are, and
how to unwind the current stack frame. This information is generated when
passing \eg{} the switch \lstbash{-g} to \prog{gcc} or equivalents.

Even in stripped (non-debug) binaries, a small portion of DWARF data remains:
the stack unwinding data. This information is necessary to unwind stack
frames, restoring machine registers to the value they had in the previous
frame.

This data is structured into tables, each row corresponding to an IP range for
which it describes valid unwinding data, and each column describing how to
@@ -34,28 +34,29 @@ computation~\cite{oakley2011exploiting}.
As debugging data can easily take up an unreasonable amount of space and grow
larger than the program itself if stored carelessly, the DWARF standard pays
great attention to data compactness and compression. It succeeds particularly
well at this, but at the expense of efficiency: accessing stack unwinding data
for a particular program point is an expensive operation --~the order of
magnitude is $10\,\mu{}\text{s}$ on a modern computer.

This is often not a problem, as stack unwinding is often thought of as a
debugging procedure: when something behaves unexpectedly, the programmer might
be interested in opening their debugger and exploring the stack. Yet, stack
unwinding might, in some cases, be performance-critical: for instance, polling
profilers repeatedly perform stack unwindings to observe which functions are
active. Even worse, C++ exception handling relies on stack unwinding in order
to find a suitable catch-block! For such applications, it might be desirable to
find a different time/space trade-off, storing a bit more data for faster
unwinding.

This different trade-off is the question that I explored during this
internship: what good alternative trade-off is reachable when storing the stack
unwinding data completely differently?

It seems that the subject has not been explored yet, and as of now, the most
widely used library for stack unwinding, \prog{libunwind}~\cite{libunwind},
essentially relies on aggressive but fine-tuned caching and optimized code to
mitigate this problem.

% What is the question that you studied?
% Why is it important, what are the applications/consequences?
@@ -73,27 +74,25 @@ of compiled DWARF into existing projects have been made easy by implementing an
alternative version of the \textit{de facto} standard library for this purpose,
\prog{libunwind}.

Multiple approaches have been tried and evaluated to determine which
compilation process leads to the best time/space trade-off.

Unexpectedly, the hardest part of the project was finding and implementing a
benchmarking protocol that was both relevant and reliable. Unwinding a single
frame (around $10\,\mu s$) is too fast to be benchmarked reliably on only a
few samples. Gathering enough samples for this purpose --~at least a few
thousand~-- is not easy, since one must avoid unwinding the same frame over and
over again, which would only benchmark the caching mechanism. The other problem
is to distribute the unwinding measurements evenly across the various IPs,
including those directly inside the loaded libraries (\eg{} the \prog{libc}).

The solution eventually chosen was to modify \prog{perf}, the standard
profiling program for Linux, in order to gather statistics and benchmarks of
its unwindings. Modifying \prog{perf} was an additional challenge that turned
out to be harder than expected, since its source code is hard to read and its
optimisations make some parts counter-intuitive. To overcome this, we designed
an alternative version of \prog{libunwind} interfaced with the compiled
debugging data.

% What is your solution to the question described in the last paragraph?
%
@@ -108,12 +107,19 @@ data.
%
% Comment the robustness of your solution: how does it rely/depend on the working assumptions?

The goal of this project was to design a compiled version of unwinding data
that is faster than DWARF, while still being reliable and reasonably compact.
The benchmarks mentioned have yielded convincing results: on the experimental
setup created --~detailed in Section~\ref{sec:benchmarking} below~--, the
compiled version is around 26 times faster than the DWARF version, while it
remains only around 2.5 times bigger than the original data.

We support the vast majority --~more than $99.9\,\%$~-- of the instructions
actually used in binaries, although we do not support the full DWARF5
instruction set. We are almost as robust as \prog{libunwind}: on a
$27000$-sample test, 885 failures were observed for \prog{libunwind}, against
$1099$ for the compiled DWARF version (failures are due to signal handlers,
unusual instructions, \ldots) --~see Section~\ref{ssec:timeperf}.

The implementation is not yet release-ready, as it does not support 100\,\% of
the DWARF5 specification~\cite{dwarf5std} --~see Section~\ref{ssec:ehelfs}
@@ -123,13 +129,13 @@ the reference implementation. Indeed, corner cases occur often, and on a 27000
samples test, 885 failures were observed for \prog{libunwind}, against 1099 for
the compiled DWARF version (see Section~\ref{ssec:timeperf}).

The implementation, however, is not production-ready: it only supports the
x86\_64 architecture, and relies to some extent on the Linux operating system.
None of those are real problems in practice. Supporting other processor
architectures and ABIs is only a matter of engineering work. The operating
system dependency is only present in the libraries developed in order to
interact with the compiled unwinding data, which can be developed for virtually
any operating system.

\subsection*{Summary and future work}
@@ -137,14 +143,13 @@ In most everyday situations, a slow stack unwinding is not a problem, merely
an annoyance. Yet, a 26-fold speed-up on stack unwinding-heavy
tasks can be really useful to \eg{} profile large programs, particularly if one
wants to profile many times in order to analyze the impact of multiple changes.
It can also be useful for exception-heavy programs. Thus, we plan to address
the current limitations and integrate the implementation cleanly with
mainstream tools, such as \prog{perf}.

Another research direction is to investigate how to compress the original DWARF
unwinding data even further, using the outlining techniques that we already
apply successfully to the compiled data.

% What is next? In which respect is your approach general?
% What did your contribution bring to the area?

@@ -1,10 +1,11 @@
\title{DWARF debugging data, compilation and optimization}
\author{Théophile Bastian\\
Under supervision of Francesco Zappa Nardelli, March -- August 2018\\
{\textsc{parkas}, \'Ecole Normale Supérieure de Paris}}
%\date{March -- August 2018\\August 20, 2018}
\date{\vspace{-2em}}

\documentclass[11pt]{article}
@@ -54,8 +55,8 @@ Under supervision of Francesco Zappa Nardelli\\
\subsection*{Source code}\label{ssec:source_code}

Our implementation is available from \url{https://git.tobast.fr/m2-internship}.
See the \texttt{abstract} repository for an introductory \texttt{README}.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@@ -102,25 +103,24 @@ copying \reg{rsp} to \reg{rbp}. This makes it easy to find the return address
from anywhere within the function, and also allows for easy addressing of local
variables. To some extent, it also allows for hot debugging, such as saving a
useful core dump upon segfault. Yet, using \reg{rbp} to save \reg{rip} is not
always done, since it wastes a register. This decision is, on x86\_64 System V,
up to the compiler.

Usually, a function starts by subtracting some value from \reg{rsp}, allocating
some space in the stack frame for its local variables. Then, it pushes on the
stack the values of the callee-saved registers that are overwritten later,
effectively saving them. Before returning, it pops the values of the saved
registers back to their original registers and restores \reg{rsp} to its former
value.
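As an illustration of why the frame-pointer convention described above makes
unwinding easy, the following sketch (not part of our implementation) walks the
chain of saved \reg{rbp} values and prints the return address of every caller.
It only works if every function in the call path maintains the frame pointer,
\eg{} when everything is compiled with \lstbash{-fno-omit-frame-pointer}:

\begin{lstlisting}[language=C]
/* Naive frame-pointer-based unwinding: illustrative sketch only.
 * Assumes every frame keeps %rbp as a frame pointer, with the layout
 *   [rbp]   = saved %rbp of the caller
 *   [rbp+8] = return address pushed by `call`
 * The walk stops at a NULL saved %rbp, which the System V ABI asks the
 * outermost frame to provide; any frame compiled without a frame
 * pointer would break the chain. */
#include <stdio.h>

struct frame {
    struct frame *next; /* saved %rbp of the caller */
    void         *ret;  /* return address           */
};

static void naive_backtrace(void)
{
    /* GCC/Clang builtin: address of the current frame (%rbp). */
    struct frame *fp = __builtin_frame_address(0);

    while (fp != NULL) {
        printf("return address: %p\n", fp->ret);
        fp = fp->next; /* hop to the caller's frame */
    }
}
\end{lstlisting}

Precisely because this assumption cannot be made for optimized binaries that
omit the frame pointer, unwinders must fall back to DWARF unwinding data.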
\subsection{Stack unwinding}\label{ssec:stack_unwinding}

For various reasons, it is interesting, at some point of the execution of a
program, to glance at its program stack and be able to extract information
from it. For instance, when running a debugger, a frequent usage is to obtain a
\emph{backtrace}, that is, the list of all nested function calls at the current
IP\@. This actually observes the stack to find the different stack frames, and
decodes them to identify the function names, parameter values, etc.
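Programmatically, such a backtrace can be obtained through \prog{libunwind}'s
local unwinding interface; a minimal sketch using its documented API, with
error handling omitted:

\begin{lstlisting}[language=C]
/* Minimal local backtrace using libunwind; link with -lunwind.
 * Each call to unw_step() performs one stack frame unwinding, which is
 * exactly the operation whose cost this report is concerned with. */
#define UNW_LOCAL_ONLY
#include <libunwind.h>
#include <stdio.h>

static void print_backtrace(void)
{
    unw_context_t context;
    unw_cursor_t  cursor;
    unw_word_t    ip, sp;

    unw_getcontext(&context);          /* snapshot current registers */
    unw_init_local(&cursor, &context); /* start unwinding from here  */

    while (unw_step(&cursor) > 0) {    /* move to the previous frame */
        unw_get_reg(&cursor, UNW_REG_IP, &ip);
        unw_get_reg(&cursor, UNW_REG_SP, &sp);
        printf("ip = %#lx, sp = %#lx\n", (long)ip, (long)sp);
    }
}
\end{lstlisting}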
This operation is far from trivial. Often, a stack frame will only make sense
when the correct values are stored in the machine registers. These values,
@@ -184,7 +184,7 @@ no time directly in \lstc{fct_a}, but spend a lot of time in calls to the other
two functions that were made from \lstc{fct_a}. Knowing that, after all,
\lstc{fct_a} is the culprit can be useful to a programmer.

Exception handling also requires a stack unwinding mechanism in some languages.
Indeed, an exception is completely different from a \lstinline{return}: while
the latter returns to the previous function, at a well-defined IP, the former
can be caught by virtually any function in the call path, at any point of the
@@ -313,7 +313,7 @@ between them.
\\
\hline
\end{tabular}
\caption{Stack frame schema for fib7 (horizontal layout)}\label{table:ex1_stack_schema}
\end{table}

For instance, the C source code in Listing~\ref{lst:ex1_c}, when compiled
@@ -492,8 +492,8 @@ Its grammar is as follows:
\end{align*}

The entry point of the grammar is a $\FDE$, which is a set of rows, each
annotated with a machine address, the address from which it is valid.
The addresses are necessarily increasing within an FDE\@.
Each row then represents, as a function mapping registers to values, a row of
the unwinding table.
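Since the addresses are increasing, the row covering a given IP can be
retrieved by binary search. The sketch below illustrates this lookup on a
hypothetical, simplified in-memory representation of the table; the types and
field names are chosen for exposition and are not the actual DWARF encoding:

\begin{lstlisting}[language=C]
/* Simplified model: rows sorted by start_ip; each row is valid from its
 * start_ip up to the next row's start_ip. Only two columns are kept here
 * (CFA and return address); real tables have one column per register. */
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uintptr_t start_ip;   /* first IP covered by this row              */
    int32_t   cfa_offset; /* CFA = %rsp + cfa_offset (simplified rule) */
    int32_t   ra_offset;  /* return address stored at CFA + ra_offset  */
} unwind_row_t;

/* Return the last row whose start_ip is <= ip, or NULL if ip falls
 * before the first row of the FDE. */
static const unwind_row_t *
find_row(const unwind_row_t *rows, size_t n_rows, uintptr_t ip)
{
    const unwind_row_t *found = NULL;
    size_t lo = 0, hi = n_rows;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (rows[mid].start_ip <= ip) {
            found = &rows[mid]; /* candidate; a later row may still match */
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return found;
}
\end{lstlisting}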
@@ -672,9 +672,8 @@ and $\semR{\bullet}$ is defined as
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Stack unwinding data compilation}

In this section, we study the design options we explored for the actual C
implementation, which compiles the \ehframe{} directly into native code for the
x86\_64 platform.
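To make the idea concrete, the following is a purely illustrative, hand-written
example of the kind of C function a generator could emit for a single FDE; the
address bounds, function name and \lstc{unwind_state_t} type are hypothetical
and do not come from our implementation:

\begin{lstlisting}[language=C]
/* Hypothetical generated unwinder for one FDE covering [0x615, 0x69b):
 * the rows of the unwinding table become branches on the IP, and the
 * DWARF rules become plain arithmetic on the register values. */
#include <stdint.h>

typedef struct {
    uintptr_t cfa;      /* canonical frame address                      */
    uintptr_t ra_addr;  /* address holding the return address           */
    uintptr_t rbp_addr; /* address holding the caller's %rbp, 0 if none */
} unwind_state_t;

static void unwind_0x615_0x69b(uintptr_t ip, uintptr_t rsp, uintptr_t rbp,
                               unwind_state_t *out)
{
    if (ip < 0x616) {            /* before `push %rbp`          */
        out->cfa = rsp + 8;
        out->rbp_addr = 0;       /* caller's %rbp not saved yet */
    } else if (ip < 0x619) {     /* after push, before `mov`    */
        out->cfa = rsp + 16;
        out->rbp_addr = out->cfa - 16;
    } else {                     /* frame pointer fully set up  */
        out->cfa = rbp + 16;
        out->rbp_addr = out->cfa - 16;
    }
    out->ra_addr = out->cfa - 8; /* return address just below the CFA */
}
\end{lstlisting}

Evaluating a row then costs a handful of arithmetic instructions instead of an
interpretation of the DWARF bytecode, which is where the expected speed-up
comes from.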
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Code availability}\label{ssec:code_avail}