diff --git a/report/fiche_synthese.tex b/report/fiche_synthese.tex index 93067d2..67a2898 100644 --- a/report/fiche_synthese.tex +++ b/report/fiche_synthese.tex @@ -32,22 +32,21 @@ computation~\cite{oakley2011exploiting}. \subsection*{The research problem} -As debugging data can easily take an unreasonable space and grow larger than -the program itself if stored carelessly, the DWARF standard pays a great -attention to data compactness and compression. It succeeds particularly well -at it, but at the expense of efficiency: accessing stack -unwinding data for a particular program point is an expensive operation --~the -order of magnitude is $10\,\mu{}\text{s}$ on a modern computer. +As debugging data can easily grow larger than the program itself if stored +carelessly, the DWARF standard pays great attention to data compactness and +compression. It succeeds particularly well at it, but at the expense of +efficiency: accessing stack unwinding data for a particular program point is an +expensive operation --~the order of magnitude is $10\,\mu{}\text{s}$ on a +modern computer. This is often not a problem, as stack unwinding is often thought of as a debugging procedure: when something behaves unexpectedly, the programmer might -be interested in opening their debugger and exploring the stack. Yet, stack -unwinding might, in some cases, be performance-critical: for instance, polling -profilers repeatedly perform stack unwindings to observe which functions are -active. Even worse, C++ exception handling relies on stack unwinding in order -to find a suitable catch-block! For such applications, it might be desirable to -find a different time/space trade-off, storing a bit more for a faster -unwinding. +open their debugger and explore the stack. Yet, stack unwinding might, in some +cases, be performance-critical: for instance, polling profilers repeatedly +perform stack unwindings to observe which functions are active. 
Even worse, C++ +exception handling relies on stack unwinding in order to find a suitable +catch-block! For such applications, it might be desirable to find a different +time/space trade-off, storing a bit more for a faster unwinding. This different trade-off is the question that I explored during this internship: what good alternative trade-off is reachable when storing the stack @@ -109,10 +108,10 @@ compiled debugging data. The goal of this project was to design a compiled version of unwinding data that is faster than DWARF, while still being reliable and reasonably compact. -The benchmarks mentioned have yielded convincing results: on the experimental -setup created --~detailed on Section~\ref{sec:benchmarking} below~\textendash, -the compiled version is around 26 times faster than the DWARF version, while it -remains only around 2.5 times bigger than the original data. +Benchmarking has yielded convincing results: on the experimental setup created +--~detailed in Section~\ref{sec:benchmarking} below~\textendash, the compiled +version is around 26 times faster than the DWARF version, while it remains only +around 2.5 times bigger than the original data. We support the vast majority --~more than $99.9\,\%$~-- of the instructions actually used in binaries, although we do not support all of DWARF5 instruction diff --git a/report/report.tex b/report/report.tex index 52741a3..6459789 100644 --- a/report/report.tex +++ b/report/report.tex @@ -79,16 +79,16 @@ typically used for storing function arguments, machine registers that must be restored before returning, the function's return address and local variables. On the x86\_64 platform, with which this report is mostly concerned, the -calling convention that is followed is defined in the System V -ABI~\cite{systemVabi} for the Unix-like operating systems --~among which Linux -and MacOS\@. 
Under this calling convention, the first six arguments of a -function are passed in the registers \reg{rdi}, \reg{rsi}, \reg{rdx}, -\reg{rcx}, \reg{r8}, \reg{r9}, while additional arguments are pushed onto the -stack. It also defines which registers may be overwritten by the callee, and -which registers must be restored before returning. This restoration, for most -compilers, is done by pushing the register value onto the stack in the function -prelude, and restoring it just before returning. Those preserved registers are -\reg{rbx}, \reg{rsp}, \reg{rbp}, \reg{r12}, \reg{r13}, \reg{r14}, \reg{r15}. +calling convention followed on UNIX-like operating systems --~among which Linux +and MacOS~-- is defined by the System V ABI~\cite{systemVabi}. Under this +calling convention, the first six arguments of a function are passed in the +registers \reg{rdi}, \reg{rsi}, \reg{rdx}, \reg{rcx}, \reg{r8}, \reg{r9}, while +additional arguments are pushed onto the stack. It also defines which registers +may be overwritten by the callee, and which registers must be restored by the +callee before returning. This restoration, for most compilers, is done by +pushing the register value onto the stack during the function prelude, and +restoring it just before returning. Those preserved registers are \reg{rbx}, +\reg{rsp}, \reg{rbp}, \reg{r12}, \reg{r13}, \reg{r14}, \reg{r15}. \begin{wrapfigure}{r}{0.4\textwidth} \centering @@ -97,24 +97,22 @@ prelude, and restoring it just before returning. Those preserved registers are conventions}\label{fig:call_stack} \end{wrapfigure} -The register \reg{rsp} is supposed to always point to the last used memory cell -in the stack. Thus, when the process just enters a new function, \reg{rsp} -points right to the location of the return address. Then, the compiler might -use \reg{rbp} (``base pointer'') to save this value of \reg{rip}, by writing -the old value of \reg{rbp} just below the return address on the stack, then -copying \reg{rsp} to \reg{rbp}. 
This makes it easy to find the return address -from anywhere within the function, and also allows for easy addressing of local -variables. To some extents, it also allows for hot debugging, such as saving a -useful core dump upon segfault. Yet, using \reg{rbp} to save \reg{rip} is not -always done, since it wastes a register. This decision is, on x86\_64 System V, -up to the compiler. +The register \reg{rsp} is supposed to always point to the last used address in +the stack. Thus, when the process enters a new function, \reg{rsp} points to +the location of the return address. Then, the compiler might use \reg{rbp} +(``base pointer'') to save this value of \reg{rsp}, writing the old value of +\reg{rbp} below the return address on the stack and copying \reg{rsp} to +\reg{rbp}. This makes it easy to find the return address from anywhere within +the function, and allows for easy addressing of local variables. To some +extent, it also allows for hot debugging, such as saving a useful core dump +upon segfault. Yet, using \reg{rbp} this way wastes a register, and +the decision of using it is, on x86\_64 System V, up to the compiler. Usually, a function starts by subtracting some value to \reg{rsp}, allocating -some space in the stack frame for its local variables. Then, it pushes on -the stack the values of the callee-saved registers that are overwritten later, -effectively saving them. Before returning, it pops the values of the saved -registers back to their original registers and restore \reg{rsp} to its former -value. +some space in the stack frame for its local variables. Then, it saves on the +stack the values of the callee-saved registers that are overwritten later. +Before returning, it pops the values of the saved registers back to their +original registers and restores \reg{rsp} to its former value. \subsection{Stack unwinding}\label{ssec:stack_unwinding} @@ -126,13 +124,12 @@ IP\@. 
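The prelude and epilogue described in the text can be written out explicitly. The following is an illustrative sketch, not code from the report: it assumes an x86\_64 System V target (Linux, GCC or Clang top-level assembly), and the function name `double_it` is ours.

```c
#include <stdint.h>

/* Illustrative only: a function written in assembly with the classic
   rbp-based prelude and epilogue. Assumes x86_64 System V (Linux):
   the first argument arrives in %rdi, the result is returned in %rax. */
__asm__(
    ".globl double_it\n"
    "double_it:\n"
    "    pushq %rbp\n"            /* save caller's base pointer          */
    "    movq  %rsp, %rbp\n"      /* rbp now marks this frame; the       */
                                  /* return address sits at 8(%rbp)      */
    "    subq  $16, %rsp\n"       /* reserve stack space for locals      */
    "    movq  %rdi, -8(%rbp)\n"  /* spill the first argument (SysV:     */
                                  /* args 1-6 in rdi,rsi,rdx,rcx,r8,r9)  */
    "    movq  -8(%rbp), %rax\n"
    "    addq  %rax, %rax\n"      /* compute 2*x, result in %rax         */
    "    leave\n"                 /* movq %rbp,%rsp ; popq %rbp          */
    "    ret\n"
);
int64_t double_it(int64_t x);
```

The `subq`/`pushq` pattern here is exactly the frame layout the unwinder later has to undo: to unwind this frame, one only needs to know where the saved `%rbp` and the return address live relative to `%rbp`.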
This actually observes the stack to find the different stack frames, and decode them to identify the function names, parameter values, etc. This operation is far from trivial. Often, a stack frame will only make sense -when the correct values are stored in the machine registers. These values, +when the machine registers hold the right values. These values, however, are to be restored from the previous stack frame, where they are -stored. This imposes to \emph{walk} the stack, reading the entries one after -the other, instead of peeking at some frame directly. Moreover, the size of one -stack frame is often not that easy to determine when looking at some -instruction other than \texttt{return}, making it hard to extract single frames -from the whole stack. +stored. This forces us to \emph{walk} the stack, reading the frames one after +the other, instead of peeking at some frame directly. Moreover, it is often not +even easy to determine the boundaries of each stack frame alone, making it +impossible to just peek at a single frame. Interpreting a frame in order to get the machine state \emph{before} this frame, and thus be able to decode the next frame recursively, is called @@ -159,10 +156,10 @@ common format of debugging data is DWARF\@. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Unwinding usage and frequency} -Stack unwinding is a more common operation that one might think at first. The -use case mostly thought of is simply to get a stack trace of a program, and -provide a debugger with the information it needs. For instance, when inspecting -a stack trace in \prog{gdb}, a common operation is to jump to a previous frame: +Stack unwinding is more frequent than one might think at first. The most +obvious use case is simply to get a stack trace of a program, and provide a +debugger with the information it needs. 
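The stack walk described above can be sketched in its most naive form, the frame-pointer chain. This is a hedged illustration under strong assumptions — the whole call chain must keep a frame pointer (e.g. `-fno-omit-frame-pointer` on x86\_64), and the function name is ours; DWARF-based unwinding exists precisely because this assumption often does not hold:

```c
#include <stdint.h>
#include <stdio.h>

/* Naive frame-pointer walk: if every frame starts with the saved caller
   %rbp, with the return address right above it, frames can be read one
   after the other, exactly the "walk" described in the text. */
__attribute__((noinline)) int walk_frames(int max_frames) {
    void **frame = __builtin_frame_address(0); /* this function's %rbp */
    int depth = 0;
    while (frame && depth < max_frames) {
        void *ret_addr = frame[1];   /* return address lives at rbp + 8 */
        printf("frame %d: return address %p\n", depth, ret_addr);
        depth++;
        void **caller = frame[0];    /* saved caller %rbp at rbp + 0 */
        if (caller <= frame)         /* sanity check: stacks grow down, so
                                        caller frames sit at higher addresses */
            break;
        frame = caller;
    }
    return depth;
}
```

Note how each step needs the previous frame to be decoded first: there is no way to jump straight to the $n$-th frame.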
For instance, when inspecting a stack +trace in \prog{gdb}, a common operation is to jump to a previous frame: \lstinputlisting{src/segfault/gdb_session} @@ -174,18 +171,18 @@ context, by unwinding \lstc{fct_b}'s frame. Yet, stack unwinding, and thus, debugging data, \emph{is not limited to debugging}. -Another common usage is profiling. A profiling tool, such as \prog{perf} under -Linux --~see Section~\ref{ssec:perf} --, is used to measure and analyze in -which functions a program spends its time, identify bottlenecks and find out -which parts are critical to optimize. To do so, modern profilers pause the -traced program at regular, short intervals, inspect their stack, and determine -which function is currently being run. They also perform a stack unwinding to -figure out the call path to this function, in order to determine which function -indirectly takes time: for instance, a function \lstc{fct_a} can call both -\lstc{fct_b} and \lstc{fct_c}, which both take a lot of time; spend practically -no time directly in \lstc{fct_a}, but spend a lot of time in calls to the other -two functions that were made from \lstc{fct_a}. Knowing that after all, -\lstc{fct_a} is the culprit can be useful to a programmer. +Another common usage is profiling. A profiler, such as \prog{perf} under Linux +--~see Section~\ref{ssec:perf}~--, is used to measure and analyze in which +functions a program spends its time, and find out which parts are critical to +optimize. To do so, modern profilers pause the traced program at regular, +short intervals, inspect their stack, and determine which function is currently +being run. 
They also perform a stack unwinding to figure out the call path to +this function, in order to determine which function indirectly takes time: for +instance, a program may spend practically no time directly in a function +\lstc{fct_a}, yet spend a lot of time in the calls it makes to \lstc{fct_b} +and \lstc{fct_c}, which both take a lot of time. Knowing that \lstc{fct_a} is, +after all, the culprit can be useful to a programmer. Exception handling also requires a stack unwinding mechanism in some languages. Indeed, an exception is completely different from a \lstinline{return}: while @@ -413,10 +410,10 @@ registers values, which will represent the evaluated DWARF row. \subsection{Concerning correctness}\label{ssec:sem_correctness} The semantics described in this section are designed in a concern of -\emph{formalization} of the original DWARF standard. This standard, sadly, only -devises a plain English description of each instruction's action and result, -which cannot be used as a basis to \emph{prove} anything correct without -relying on informal interpretations. +\emph{formalization} of the original standard. This standard, sadly, only +describes in plain English each instruction's action and result. This basis +cannot be used to \emph{prove} anything correct without relying on informal +interpretations. \subsection{Original language: DWARF instructions} @@ -732,16 +729,15 @@ licenses. \subsection{Compilation: \ehelfs}\label{ssec:ehelfs} The rough idea of the compilation is to produce, out of the \ehframe{} section -of a binary, C code that resembles the code shown in the DWARF semantics from -Section~\ref{sec:semantics} above. This C code is then compiled by GCC in -\lstbash{-O2} mode. This saves us the trouble of optimizing the generated C -code whenever GCC does that by itself. +of a binary, C code close to that of Section~\ref{sec:semantics} above. 
This C +code is then compiled by GCC in \lstbash{-O2} mode. This saves us the trouble +of optimizing the generated C code whenever GCC does that by itself. -The generated code consists in a single monolithic function, \lstc{_eh_elf}, -taking as arguments an instruction pointer and a memory context (\ie{} the -value of the various machine registers) as defined in -Listing~\ref{lst:unw_ctx}. The function will then return a fresh memory -context, containing the values the registers hold after unwinding this frame. +The generated code consists of a single function, \lstc{_eh_elf}, taking as +arguments an instruction pointer and a memory context (\ie{} the value of the +various machine registers) as defined in Listing~\ref{lst:unw_ctx}. The +function then returns a fresh memory context holding the values of the +registers after unwinding this frame. The body of the function itself consists in a single monolithic switch, taking advantage of the non-standard --~yet overwhelmingly implemented in common C @@ -953,10 +949,9 @@ across the program to mimic real-world unwinding: we would like to benchmark stack unwindings crossing some standard library functions, starting from inside them, etc. -Finally, the unwound program must be interesting enough to enter and exit -functions often, building a good stack of nested function calls (at least -frequently 5), have FDEs that are not as simple as in Listing~\ref{lst:ex1_dw}, -etc. +Finally, the unwound program must be interesting enough to call functions +often, building a stack of nested function calls (frequently at least 5 deep), +and have FDEs that are not as simple as in Listing~\ref{lst:ex1_dw}, etc. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% @@ -1016,17 +1011,17 @@ changing one line of code to add one parameter to a function call and linking against the modified version of \prog{libunwind} instead of the system version. 
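As a hedged sketch of what such generated code might look like — the function name, the context layout, and the instruction-pointer range `0x1000`–`0x1015` are all illustrative assumptions, not the project's actual output — one case of the switch can compile a single unwinding rule into direct memory reads:

```c
#include <stdint.h>

/* Hypothetical sketch, following the description in the text: a single
   function switching on the instruction pointer, each case turning one
   DWARF row into plain x86_64 memory accesses. Illustrative names only. */
typedef struct {
    uint64_t rip, rsp, rbp;
} unwind_ctx_t;

unwind_ctx_t _eh_elf_sketch(uint64_t pc, unwind_ctx_t ctx) {
    unwind_ctx_t out = ctx;
    switch (pc) {
    case 0x1000 ... 0x1015: {           /* GCC/Clang case-range extension  */
        uint64_t cfa = ctx.rbp + 16;    /* row says: CFA = %rbp + 16       */
        out.rip = *(uint64_t *)(cfa - 8);   /* return address at CFA - 8   */
        out.rbp = *(uint64_t *)(cfa - 16);  /* saved %rbp at CFA - 16      */
        out.rsp = cfa;                  /* caller's %rsp is the CFA        */
        break;
    }
    default:                            /* IP not covered: unwinding fails */
        out.rip = 0;
        break;
    }
    return out;
}
```

A lookup then costs one range test and a couple of loads, instead of locating and interpreting the FDE covering `pc` at each unwinding step.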
Once this was done, plugging it in \prog{perf} was the matter of a few lines of -code only, left apart the benchmarking code. The major problem encountered was -to understand how \prog{perf} works. In order to avoid perturbing the traced -program, \prog{perf} does not unwind at runtime, but rather records at regular -intervals the program's stack, and all the auxiliary information that is needed -to unwind later. This is done when running \lstbash{perf record}. Then, a -subsequent call to \lstbash{perf report} unwinds the stack to analyze it; but -at this point of time, the traced process is long dead. Thus, any PID-based -approach, or any approach using \texttt{/proc} information will fail. However, -as this was the easiest method, the first version of \ehelfs{} used those -mechanisms; it took some code rewriting to move to a PID- and -\texttt{/proc}-agnostic implementation. +code only, leaving aside the benchmarking code. The major difficulty was to +understand how \prog{perf} works. To avoid perturbing the traced program, +\prog{perf} does not unwind at runtime, but rather records at regular intervals +the program's stack, and all the auxiliary information that is needed to unwind +later. This is done when running \lstbash{perf record}. Then, a subsequent call +to \lstbash{perf report} unwinds the stack to analyze it; but by that time, +the traced process is long dead. Thus, any PID-based approach, or any +approach using \texttt{/proc} information will fail. However, as this was the +easiest method, the first version of \ehelfs{} used those mechanisms; it took +some code rewriting to move to a PID- and \texttt{/proc}-agnostic +implementation. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Other explored methods} @@ -1040,8 +1035,8 @@ also never have met the requirement of unwinding from fairly distributed locations anyway. 
Another attempt was made using CSmith~\cite{csmith}, a random C code generator -initially made for C compilers random testing. The idea was still to craft an -interesting C program that would unwind on its own frequently, but to integrate +designed for random testing of C compilers. The idea was still to craft a +C program that would unwind on its own frequently, but to integrate CSmith-randomly generated C code within hand-written C snippets that would generate large enough FDEs and nested calls. This was abandoned as well as the call graph of a CSmith-generated code is often far too small, and the @@ -1105,25 +1100,24 @@ Table~\ref{table:bench_time}. \end{table} The performance of \ehelfs{} is probably overestimated for a production-ready -version, since \ehelfs{} do not handle all registers from the original DWARF -file, and thus the \prog{libunwind} version must perform more computation. -However, this overhead, although impossible to measure without first -implementing supports for every register, would probably not be that big, since -most of the time is spent finding the relevant row. Support for every DWARF -instruction, however, would not slow down at all the implementation, since -every instruction would simply be compiled to x86\_64 without affecting the -already supported code. +version, since \ehelfs{} do not handle all the registers from the original +DWARF, lightening the computation. However, this overhead, although impossible +to measure without first implementing support for every register, would +probably not be that big, since most of the time is spent finding the relevant +row. Support for every DWARF instruction, however, would not slow down the +implementation at all, since every instruction would simply be compiled to +x86\_64 without affecting the already supported code. 
The fact that there is a sharp difference between cached and uncached \prog{libunwind} confirm that our experimental setup did not unwind at totally different locations every single time, and thus was not biased in this direction, since caching is still very efficient. -It is also worth noting that the compilation time of \ehelfs{} is also -reasonably short. On the machine described in Section~\ref{ssec:bench_hw}, and -without using multiple cores to compile, the various shared objects needed to -run \prog{hackbench} --~that is, \prog{hackbench}, \prog{libc}, \prog{ld} and -\prog{libpthread}~-- are compiled in an overall time of $25.28$ seconds. +The compilation time of \ehelfs{} is also reasonable. On the machine +described in Section~\ref{ssec:bench_hw}, and without using multiple cores to +compile, the various shared objects needed to run \prog{hackbench} --~that is, +\prog{hackbench}, \prog{libc}, \prog{ld} and \prog{libpthread}~-- are compiled +in an overall time of $25.28$ seconds. The unwinding errors observed are hard to investigate, but are most probably due to truncated stack records. Indeed, since \prog{perf} dumps the last $n$ @@ -1136,11 +1130,9 @@ the custom \prog{libunwind} implementation that were not spotted. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Measured compactness}\label{ssec:results_size} -A first measure of compactness was made in this report for one of the earliest -working versions in Table~\ref{table:basic_eh_elf_space}. - -The same data, generated for the latest version of \ehelfs, can be seen in -Table~\ref{table:bench_space}. +A first measure of compactness was made for one of the earliest working +versions in Table~\ref{table:basic_eh_elf_space}. The same data, generated for +the latest version of \ehelfs, can be seen in Table~\ref{table:bench_space}. 
The effect of the outlining mentioned in Section~\ref{ssec:space_optim} is particularly visible in this table: \prog{hackbench} has a significantly bigger @@ -1150,9 +1142,8 @@ times, compared to \eg{} \prog{libc}, in which the outlined data is reused a lot. Just as with time performance, the measured compactness would be impacted by -supporting every register, but probably not that much either, since most -columns are concerned with the four supported registers (see -Section~\ref{ssec:instr_cov}). +supporting every register, but probably only slightly, since the four supported +registers account for most columns --~see Section~\ref{ssec:instr_cov}. \begin{table}[h] \centering