Eliminate widow lines

Théophile Bastian 2018-08-19 18:32:03 +02:00
parent c847d71d28
commit d4f417017e
2 changed files with 106 additions and 116 deletions


@@ -32,22 +32,21 @@ computation~\cite{oakley2011exploiting}.
\subsection*{The research problem}

As debugging data can easily grow larger than the program itself if stored
carelessly, the DWARF standard pays great attention to data compactness and
compression. It succeeds particularly well at this, but at the expense of
efficiency: accessing stack unwinding data for a particular program point is an
expensive operation --~the order of magnitude is $10\,\mu{}\text{s}$ on a
modern computer.

This is usually not a problem, as stack unwinding is often thought of as a
debugging procedure: when something behaves unexpectedly, the programmer might
open their debugger and explore the stack. Yet, stack unwinding might, in some
cases, be performance-critical: for instance, polling profilers repeatedly
perform stack unwindings to observe which functions are active. Even worse, C++
exception handling relies on stack unwinding in order to find a suitable
catch-block! For such applications, it might be desirable to find a different
time/space trade-off, storing a bit more data for faster unwinding.

This different trade-off is the question that I explored during this
internship: what good alternative trade-off is reachable when storing the stack
@@ -109,10 +108,10 @@ compiled debugging data.
The goal of this project was to design a compiled version of unwinding data
that is faster than DWARF, while still being reliable and reasonably compact.
Benchmarking has yielded convincing results: on the experimental setup created
--~detailed in Section~\ref{sec:benchmarking} below~--, the compiled version is
around 26 times faster than the DWARF version, while it remains only around 2.5
times bigger than the original data.

We support the vast majority --~more than $99.9\,\%$~-- of the instructions
actually used in binaries, although we do not support all of DWARF5 instruction


@@ -79,16 +79,16 @@ typically used for storing function arguments, machine registers that must be
restored before returning, the function's return address and local variables.

On the x86\_64 platform, with which this report is mostly concerned, the
calling convention followed on UNIX-like operating systems --~among which Linux
and MacOS~-- is defined by the System V ABI~\cite{systemVabi}. Under this
calling convention, the first six arguments of a function are passed in the
registers \reg{rdi}, \reg{rsi}, \reg{rdx}, \reg{rcx}, \reg{r8} and \reg{r9},
while additional arguments are pushed onto the stack. It also defines which
registers may be overwritten by the callee, and which registers must be
restored by the callee before returning. This restoration, for most compilers,
is done by pushing the register value onto the stack during the function
prelude, and restoring it just before returning. Those preserved registers are
\reg{rbx}, \reg{rsp}, \reg{rbp}, \reg{r12}, \reg{r13}, \reg{r14} and \reg{r15}.
\begin{wrapfigure}{r}{0.4\textwidth}
\centering
@@ -97,24 +97,22 @@ prelude, and restoring it just before returning. Those preserved registers are
conventions}\label{fig:call_stack}
\end{wrapfigure}

The register \reg{rsp} is supposed to always point to the last used address in
the stack. Thus, when the process enters a new function, \reg{rsp} points to
the location of the return address. Then, the compiler might use \reg{rbp}
(``base pointer'') to save this value of \reg{rsp}, writing the old value of
\reg{rbp} below the return address on the stack and copying \reg{rsp} to
\reg{rbp}. This makes it easy to find the return address from anywhere within
the function, and allows for easy addressing of local variables. To some
extent, it also allows for hot debugging, such as saving a useful core dump
upon segfault. Yet, using \reg{rbp} this way wastes a register, and the
decision to use it is, on x86\_64 System V, left to the compiler.

Usually, a function starts by subtracting some value from \reg{rsp}, allocating
space in the stack frame for its local variables. Then, it saves on the stack
the values of the callee-saved registers that are overwritten later. Before
returning, it pops the values of the saved registers back to their original
registers and restores \reg{rsp} to its former value.
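As a rough, non-authoritative illustration --~a sketch, not code taken from the
report's sources~-- the commented example below shows the typical shape of such
a prelude and epilogue for a hypothetical function that needs one callee-saved
register; the exact instructions emitted depend on the compiler and
optimization level.

\begin{lstlisting}[language=C]
/* Illustrative sketch only: acc must survive the calls to read(), so a
 * compiler will typically keep it in a callee-saved register such as rbx,
 * which therefore has to be saved in the prelude and restored on exit. */
long sum_read(long (*read)(long), long n)
{
    long acc = 0;
    for (long i = 0; i < n; i++)
        acc += read(i);
    return acc;
}

/* Plausible (compiler-dependent) prelude and epilogue:
 *   push %rbp          ; save the caller's base pointer
 *   mov  %rsp, %rbp    ; establish this frame's base
 *   push %rbx          ; save a callee-saved register before clobbering it
 *   sub  $24, %rsp     ; room for locals, keeps %rsp 16-byte aligned
 *   ...                ; function body, acc lives in %rbx
 *   add  $24, %rsp     ; give the local space back
 *   pop  %rbx          ; restore the callee-saved register
 *   pop  %rbp          ; restore the caller's base pointer
 *   ret                ; pop the return address into %rip
 */
\end{lstlisting}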
\subsection{Stack unwinding}\label{ssec:stack_unwinding}
@@ -126,13 +124,12 @@ IP\@. This actually observes the stack to find the different stack frames, and
decode them to identify the function names, parameter values, etc.

This operation is far from trivial. Often, a stack frame will only make sense
when the machine registers hold the right values. These values, however, are to
be restored from the previous stack frame, where they are stored. This forces
one to \emph{walk} the stack, reading the frames one after the other, instead
of peeking at some frame directly. Moreover, it is often not even easy to
determine the boundaries of a single stack frame on its own, making it
impossible to just peek at one frame.
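For comparison, when every frame does keep an intact \reg{rbp} chain, walking
the stack can be done naively, as in the sketch below. This is illustrative
only: \lstc{naive_backtrace} is a made-up helper, and real unwinders cannot
rely on this assumption --~which is precisely why dedicated unwinding data
exists.

\begin{lstlisting}[language=C]
#include <stdio.h>

/* Illustrative sketch only: follow the chain of saved %rbp values.
 * This works solely when every frame maintains the frame pointer, an
 * assumption that optimised code frequently breaks. */
void naive_backtrace(void)
{
    void **frame = __builtin_frame_address(0);   /* current %rbp */
    while (frame != NULL) {
        void *ret_addr = frame[1];   /* return address sits just above saved %rbp */
        if (ret_addr == NULL)
            break;
        printf("return address: %p\n", ret_addr);
        frame = (void **)frame[0];   /* saved %rbp of the caller's frame */
    }
}
\end{lstlisting}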
Interpreting a frame in order to get the machine state \emph{before} this
frame, and thus be able to decode the next frame recursively, is called
@@ -159,10 +156,10 @@ common format of debugging data is DWARF\@.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Unwinding usage and frequency}

Stack unwinding is more frequent than one might think at first. The most
obvious use case is simply to get a stack trace of a program and provide a
debugger with the information it needs. For instance, when inspecting a stack
trace in \prog{gdb}, a common operation is to jump to a previous frame:
\lstinputlisting{src/segfault/gdb_session} \lstinputlisting{src/segfault/gdb_session}
@@ -174,18 +171,18 @@ context, by unwinding \lstc{fct_b}'s frame.
Yet, stack unwinding, and thus, debugging data, \emph{is not limited to
debugging}.

Another common usage is profiling. A profiler, such as \prog{perf} under Linux
--~see Section~\ref{ssec:perf}~--, is used to measure and analyze in which
functions a program spends its time, and to find out which parts are critical
to optimize. To do so, modern profilers pause the traced program at regular,
short intervals, inspect its stack, and determine which function is currently
being run. They also perform a stack unwinding to figure out the call path to
this function, in order to determine which function indirectly takes time: for
instance, a function \lstc{fct_a} may call both \lstc{fct_b} and \lstc{fct_c},
which both take a lot of time; the program then spends practically no time
directly in \lstc{fct_a}, but a lot of time in calls made from \lstc{fct_a} to
the other two functions. Knowing that \lstc{fct_a} is, after all, the culprit
can be useful to a programmer.
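A toy program exhibiting this situation could look like the following sketch
(illustrative only, mirroring the \lstc{fct_a}/\lstc{fct_b}/\lstc{fct_c}
scenario above): almost every sample lands in \lstc{fct_b} or \lstc{fct_c}, and
only unwinding reveals that all of those calls originate from \lstc{fct_a}.

\begin{lstlisting}[language=C]
/* Illustrative sketch only: nearly all samples fall inside fct_b and
 * fct_c, yet the useful insight is that every such call comes from
 * fct_a -- which a profiler can only learn by unwinding the stack. */
static void fct_b(void) { for (volatile long i = 0; i < 100000000L; i++); }
static void fct_c(void) { for (volatile long i = 0; i < 100000000L; i++); }
static void fct_a(void) { fct_b(); fct_c(); }

int main(void)
{
    fct_a();
    return 0;
}
\end{lstlisting}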
Exception handling also requires a stack unwinding mechanism in some languages.
Indeed, an exception is completely different from a \lstinline{return}: while
@@ -413,10 +410,10 @@ registers values, which will represent the evaluated DWARF row.
\subsection{Concerning correctness}\label{ssec:sem_correctness}

The semantics described in this section are intended as a \emph{formalization}
of the original standard. This standard, sadly, only describes each
instruction's action and result in plain English; such a basis cannot be used
to \emph{prove} anything correct without relying on informal interpretations.

\subsection{Original language: DWARF instructions}
@@ -732,16 +729,15 @@ licenses.
\subsection{Compilation: \ehelfs}\label{ssec:ehelfs}

The rough idea of the compilation is to produce, out of the \ehframe{} section
of a binary, C code close to that of Section~\ref{sec:semantics} above. This C
code is then compiled by GCC in \lstbash{-O2} mode. This saves us the trouble
of optimizing the generated C code, since GCC already does that by itself.

The generated code consists of a single function, \lstc{_eh_elf}, taking as
arguments an instruction pointer and a memory context (\ie{} the value of the
various machine registers) as defined in Listing~\ref{lst:unw_ctx}. The
function then returns a fresh memory context loaded with the values the
registers hold after unwinding this frame.

The body of the function itself consists of a single monolithic switch, taking
advantage of the non-standard --~yet overwhelmingly implemented in common C
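A heavily simplified sketch of what such generated code might look like is
given below. The \lstc{unwind_context} layout, the \lstc{deref} helper and the
address ranges are hypothetical stand-ins (the real context is the one of
Listing~\ref{lst:unw_ctx}); the point is only the overall shape: one case per
range of program counters, each case rebuilding the caller's registers from the
current context.

\begin{lstlisting}[language=C]
#include <stdint.h>

/* Hypothetical context layout, standing in for Listing lst:unw_ctx. */
typedef struct {
    uint64_t rip, rsp, rbp, rbx;
} unwind_context;

/* Stand-in for reading one machine word from the unwound program's memory. */
extern uint64_t deref(uint64_t address);

unwind_context _eh_elf(uint64_t pc, unwind_context ctx)
{
    unwind_context out = ctx;
    switch (pc) {
    /* case ranges: a non-standard yet widely implemented C extension */
    case 0x400a00 ... 0x400a17:          /* hypothetical range: early prologue */
        out.rsp = ctx.rsp + 8;           /* CFA = rsp + 8 */
        out.rip = deref(ctx.rsp);        /* return address at CFA - 8 */
        break;
    case 0x400a18 ... 0x400a52:          /* hypothetical range: rbp-based frame */
        out.rsp = ctx.rbp + 16;          /* CFA tracked through rbp */
        out.rip = deref(ctx.rbp + 8);    /* return address at CFA - 8 */
        out.rbp = deref(ctx.rbp);        /* saved rbp at CFA - 16 */
        break;
    default:                             /* no unwinding data for this pc */
        out.rip = 0;                     /* hypothetical error marker */
        break;
    }
    return out;
}
\end{lstlisting}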
@@ -953,10 +949,9 @@ across the program to mimic real-world unwinding: we would like to benchmark
stack unwindings crossing some standard library functions, starting from inside
them, etc.

Finally, the unwound program must be interesting enough to call functions
often, building a stack of nested function calls (frequently at least 5 frames
deep), to have FDEs that are not as simple as in Listing~\ref{lst:ex1_dw}, etc.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@@ -1016,17 +1011,17 @@ changing one line of code to add one parameter to a function call and linking
against the modified version of \prog{libunwind} instead of the system version.

Once this was done, plugging it into \prog{perf} was only a matter of a few
lines of code, leaving the benchmarking code aside. The major difficulty was to
understand how \prog{perf} works. To avoid perturbing the traced program,
\prog{perf} does not unwind at runtime, but rather records at regular intervals
the program's stack, along with all the auxiliary information needed to unwind
later. This is done when running \lstbash{perf record}. Then, a subsequent call
to \lstbash{perf report} unwinds the stack to analyze it; but by that time, the
traced process is long dead. Thus, any PID-based approach, or any approach
using \texttt{/proc} information, would fail. However, as this was the easiest
method, the first version of \ehelfs{} used those mechanisms; it took some code
rewriting to move to a PID- and \texttt{/proc}-agnostic implementation.
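For context, the sketch below shows what unwinding a \emph{live, local} process
with stock \prog{libunwind} roughly looks like --~this is the library's
documented API, not the modified code used here. \prog{perf report}, by
contrast, has to go through the remote-unwinding interface with custom memory
accessors, since it only has the recorded stack dump to work with.

\begin{lstlisting}[language=C]
#define UNW_LOCAL_ONLY
#include <libunwind.h>
#include <stdio.h>

/* Sketch of a plain local unwind using libunwind's documented API. */
void show_backtrace(void)
{
    unw_context_t context;
    unw_cursor_t cursor;

    unw_getcontext(&context);            /* capture the current registers */
    unw_init_local(&cursor, &context);   /* local (same-process) unwinding */

    while (unw_step(&cursor) > 0) {      /* one step per stack frame */
        unw_word_t ip, sp;
        unw_get_reg(&cursor, UNW_REG_IP, &ip);
        unw_get_reg(&cursor, UNW_REG_SP, &sp);
        printf("ip = %#lx, sp = %#lx\n", (unsigned long)ip, (unsigned long)sp);
    }
}
\end{lstlisting}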
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Other explored methods}
@@ -1040,8 +1035,8 @@ also never have met the requirement of unwinding from fairly distributed
locations anyway.

Another attempt was made using CSmith~\cite{csmith}, a random C code generator
designed for random testing of C compilers. The idea was still to craft a
C program that would unwind on its own frequently, but to integrate randomly
generated CSmith code within hand-written C snippets that would generate large
enough FDEs and nested calls. This was abandoned as well, since the call graph
of CSmith-generated code is often far too small, and the
@@ -1105,25 +1100,24 @@ Table~\ref{table:bench_time}.
\end{table}

The performance of \ehelfs{} is probably overestimated for a production-ready
version, since \ehelfs{} do not handle all the registers from the original
DWARF, lightening the computation. However, this overhead, although impossible
to measure without first implementing support for every register, would
probably not be that big, since most of the time is spent finding the relevant
row. Support for every DWARF instruction, however, would not slow the
implementation down at all, since every instruction would simply be compiled to
x86\_64 without affecting the already supported code.

The fact that there is a sharp difference between cached and uncached
\prog{libunwind} confirms that our experimental setup did not unwind at totally
different locations every single time --~since caching is still very
efficient~-- and thus was not biased in this direction.

The compilation time of \ehelfs{} is also reasonable. On the machine described
in Section~\ref{ssec:bench_hw}, and without using multiple cores to compile,
the various shared objects needed to run \prog{hackbench} --~that is,
\prog{hackbench}, \prog{libc}, \prog{ld} and \prog{libpthread}~-- are compiled
in an overall time of $25.28$ seconds.

The unwinding errors observed are hard to investigate, but are most probably
due to truncated stack records. Indeed, since \prog{perf} dumps the last $n$
@@ -1136,11 +1130,9 @@ the custom \prog{libunwind} implementation that were not spotted.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Measured compactness}\label{ssec:results_size}

A first measure of compactness was made for one of the earliest working
versions in Table~\ref{table:basic_eh_elf_space}. The same data, generated for
the latest version of \ehelfs, can be seen in Table~\ref{table:bench_space}.
The effect of the outlining mentioned in Section~\ref{ssec:space_optim} is
particularly visible in this table: \prog{hackbench} has a significantly bigger
@@ -1150,9 +1142,8 @@ times, compared to \eg{} \prog{libc}, in which the outlined data is reused a
lot.

Just as with time performance, the measured compactness would be impacted by
supporting every register, but probably only lightly, since the four supported
registers already account for most columns --~see Section~\ref{ssec:instr_cov}.
\begin{table}[h]
\centering