\pagestyle{empty} % \thispagestyle{empty}
%% Warning: no more than one double-sided page!
% Do not keep the questions

\section*{Internship synthesis}

\subsection*{The general context}

The standard debugging data format for ELF binary files, DWARF, contains a lot
of information, most of which is generated when passing \eg{} the switch
\lstbash{-g} to \prog{gcc}. This information, essentially provided for
debuggers, contains all that is needed to connect the generated assembly with
the original code, as well as information that can be used by sanitizers
(\eg{} the type of each variable in the source language).

Even in stripped (non-debug) binaries, a small portion of DWARF data remains.
Among this essential data that is never stripped is the stack unwinding data,
which makes it possible to unwind stack frames, restoring machine registers to
the value they had in the previous frame, for instance within the context of a
debugger or a profiler.

This data is structured into tables, each row corresponding to a program
counter (PC) range for which it describes valid unwinding data, and each column
describing how to unwind a particular machine register (or a virtual register
used for various purposes). These rules are mostly basic, consisting of offsets
from memory addresses stored in registers (such as \reg{rbp} or \reg{rsp}), but
in some cases, they can take the form of a stack-machine expression that can
access virtually all of the process's memory and perform Turing-complete
computation~\cite{oakley2011exploiting}.

\subsection*{The research problem}

As debugging data can easily grow unreasonably large if stored carelessly, the
DWARF standard pays great attention to data compactness and compression, and
succeeds particularly well at it. But this comes at the expense of efficiency:
accessing stack unwinding data for a particular program point can be quite
costly.
This is often not a huge problem, as stack unwinding is mostly thought of as a
debugging procedure: when something behaves unexpectedly, the programmer might
be interested in exploring the stack. Yet stack unwinding can, in some cases,
be performance-critical: for instance, profilers need to perform a large number
of stack unwindings. Even worse, exception handling relies on stack unwinding
in order to find a suitable catch block! For such applications, it might be
desirable to find a different time/space trade-off, allowing a slightly
space-heavier but far more time-efficient unwinding procedure.

This different trade-off is the question that I explored during this
internship: what good alternative trade-off is reachable when storing the stack
unwinding data completely differently?

It seems that the subject has not really been explored yet. As of now, the most
widely used library for stack unwinding, \prog{libunwind}~\cite{libunwind},
essentially relies on aggressive but fine-tuned caching and optimized code to
mitigate this problem.

% What is the question that you studied?
% Why is it important, what are the applications/consequences?
% Is it a new problem?
% If so, why are you the first researcher in the universe who consider it?
% If not, why did you think that you could bring an original contribution?

\subsection*{Your contribution}

This internship explored the possibility of compiling DWARF's stack unwinding
data directly into native assembly on the x86\_64 architecture. Instead of
parsing and interpreting the debug data at runtime, the stack unwinding data is
accessed as a function of a dynamically-loaded shared library.

Multiple approaches have been tried, in order to determine which compilation
process leads to the best time/space trade-off.

Quite unexpectedly, the hardest part of the project proved to be finding a
benchmarking protocol that was both relevant and reliable.
Unwinding one single frame is far too fast to be benchmarked on a few samples
(around $10\,\mu$s per frame), and gathering many samples is difficult, since
one must avoid unwinding the same frame over and over again, which would only
benchmark the caching mechanism. The other problem is to distribute the
unwinding measurements evenly across the various program positions, including
positions inside the loaded libraries (\eg{} the \prog{libc}).

The solution eventually chosen was to modify \prog{perf}, the standard
profiling program for Linux, in order to gather statistics and benchmarks of
its unwindings, and to produce an alternative version of \prog{libunwind} that
uses the compiled debugging data and can be interfaced with \prog{perf}. This
makes it possible to benchmark \prog{perf} with both the standard stack
unwinding data and the alternative experimental compiled format. As a welcome
side effect, the experimental unwinding data is fully interfaced with
\prog{libunwind}, and can thus be used at practically no cost by any existing
project relying on the \textit{de facto} standard library \prog{libunwind}.

% What is your solution to the question described in the last paragraph?
%
% Be careful, do \emph{not} give technical details, only rough ideas!
%
% Pay a special attention to the description of the \emph{scientific} approach.

\subsection*{Arguments supporting its validity}

% What is the evidence that your solution is a good solution?
% Experiments? Proofs?
%
% Comment the robustness of your solution: how does it rely/depend on the working assumptions?

The goal was to obtain a compiled version of the unwinding data that was faster
than DWARF, only reasonably larger, and reliable. The benchmarks mentioned have
yielded convincing results: on the experimental setup created (detailed later
in this report), the compiled version is up to 25 times faster than the DWARF
version, while it remains only around 2.5 times bigger than the original data.
Even though the implementation is more a research prototype than a release
version, it is still reasonably robust compared to \prog{libunwind}, which is
built for robustness. Corner cases are frequent when analyzing stack data, and
even more so when analyzing it through a profiler; yet the prototype fails on
only around 200 more cases than \prog{libunwind} on a test of 27000 samples
(1099 failures, against 885 for \prog{libunwind}).

The prototype, unlike \prog{libunwind}, does not support $100\,\%$ of the DWARF
instructions present in the DWARF5 standard~\cite{dwarf5std}. It is also
limited to the x86\_64 architecture, and relies to some extent on the Linux
operating system. But none of those limitations is a real problem in practice.
As argued later on, the vast majority of the DWARF instruction set actually
used in the wild is implemented; supporting other processor architectures and
ABIs is only a matter of time and engineering work; and the operating system
dependency is only present in the libraries developed to interact with the
compiled unwinding data, which can be developed for virtually any operating
system.

\subsection*{Summary and future work}

In most everyday cases, slow stack unwinding is not a problem, or even an
annoyance. Yet a speed-up of up to 25 times on stack unwinding-heavy tasks,
such as profiling, can be really useful when profiling heavy programs,
particularly if one wants to profile many times in order to analyze the impact
of multiple changes. It can also be useful for exception-heavy programs. Thus,
it might be interesting to implement a more stable version, and to try to
interface it cleanly with mainstream tools, such as \prog{perf}.

Another question worth exploring is whether it is possible to shrink the
original DWARF unwinding data even further, while storing it in a format not
too far from the original standard, by applying techniques close to those used
to shrink the compiled unwinding data.

% What is next? In which respect is your approach general?
% What did your contribution bring to the area?
% What should be done now?
% What is the good \emph{next} question?

\pagestyle{plain}
\newpage