Eliminate widow lines

2018-08-19 18:32:03 +02:00
commit d4f417017e
parent c847d71d28
2 changed files with 106 additions and 116 deletions
--- a/report/fiche_synthese.tex
+++ b/report/fiche_synthese.tex
@@ -32,22 +32,21 @@ computation~\cite{oakley2011exploiting}.

 \subsection*{The research problem}

-As debugging data can easily take an unreasonable space and grow larger than
-the program itself if stored carelessly, the DWARF standard pays a great
-attention to data compactness and compression. It succeeds particularly well
-at it, but at the expense of efficiency: accessing stack
-unwinding data for a particular program point is an expensive operation --~the
-order of magnitude is $10\,\mu{}\text{s}$ on a modern computer.
+As debugging data can easily grow larger than the program itself if stored
+carelessly, the DWARF standard pays great attention to data compactness and
+compression. It succeeds particularly well at it, but at the expense of
+efficiency: accessing stack unwinding data for a particular program point is an
+expensive operation --~the order of magnitude is $10\,\mu{}\text{s}$ on a
+modern computer.

 This is often not a problem, as stack unwinding is often thought of as a
 debugging procedure: when something behaves unexpectedly, the programmer might
-be interested in opening their debugger and exploring the stack.  Yet, stack
-unwinding might, in some cases, be performance-critical: for instance, polling
-profilers repeatedly perform stack unwindings to observe which functions are
-active. Even worse, C++ exception handling relies on stack unwinding in order
-to find a suitable catch-block! For such applications, it might be desirable to
-find a different time/space trade-off, storing a bit more for a faster
-unwinding.
+open their debugger and explore the stack.  Yet, stack unwinding might, in some
+cases, be performance-critical: for instance, polling profilers repeatedly
+perform stack unwindings to observe which functions are active. Even worse, C++
+exception handling relies on stack unwinding in order to find a suitable
+catch-block! For such applications, it might be desirable to find a different
+time/space trade-off, storing a bit more for a faster unwinding.

 This different trade-off is the question that I explored during this
 internship: what good alternative trade-off is reachable when storing the stack
@@ -109,10 +108,10 @@ compiled debugging data.

 The goal of this project was to design a compiled version of unwinding data
 that is faster than DWARF, while still being reliable and reasonably compact.
-The benchmarks mentioned have yielded convincing results: on the experimental
-setup created --~detailed on Section~\ref{sec:benchmarking} below~\textendash,
-the compiled version is around 26 times faster than the DWARF version, while it
-remains only around 2.5 times bigger than the original data.
+Benchmarking has yielded convincing results: on the experimental setup created
+--~detailed in Section~\ref{sec:benchmarking} below~--, the compiled
+version is around 26 times faster than the DWARF version, while it remains only
+around 2.5 times bigger than the original data.

 We support the vast majority --~more than $99.9\,\%$~-- of the instructions
 actually used in binaries, although we do not support all of DWARF5 instruction
--- a/report/report.tex
+++ b/report/report.tex
@@ -79,16 +79,16 @@ typically used for storing function arguments, machine registers that must be
 restored before returning, the function's return address and local variables.

 On the x86\_64 platform, with which this report is mostly concerned, the
-calling convention that is followed is defined in the System V
-ABI~\cite{systemVabi} for the Unix-like operating systems --~among which Linux
-and MacOS\@. Under this calling convention, the first six arguments of a
-function are passed in the registers \reg{rdi}, \reg{rsi}, \reg{rdx},
-\reg{rcx}, \reg{r8}, \reg{r9}, while additional arguments are pushed onto the
-stack. It also defines which registers may be overwritten by the callee, and
-which registers must be restored before returning. This restoration, for most
-compilers, is done by pushing the register value onto the stack in the function
-prelude, and restoring it just before returning. Those preserved registers are
-\reg{rbx}, \reg{rsp}, \reg{rbp}, \reg{r12}, \reg{r13}, \reg{r14}, \reg{r15}.
+calling convention followed on UNIX-like operating systems --~among which Linux
+and MacOS~-- is defined by the System V ABI~\cite{systemVabi}.  Under this
+calling convention, the first six arguments of a function are passed in the
+registers \reg{rdi}, \reg{rsi}, \reg{rdx}, \reg{rcx}, \reg{r8}, \reg{r9}, while
+additional arguments are pushed onto the stack. It also defines which registers
+may be overwritten by the callee, and which registers must be restored by the
+callee before returning. This restoration, for most compilers, is done by
+pushing the register value onto the stack during the function prelude, and
+restoring it just before returning. Those preserved registers are \reg{rbx},
+\reg{rsp}, \reg{rbp}, \reg{r12}, \reg{r13}, \reg{r14}, \reg{r15}.

 \begin{wrapfigure}{r}{0.4\textwidth}
    \centering
@@ -97,24 +97,22 @@ prelude, and restoring it just before returning. Those preserved registers are
    conventions}\label{fig:call_stack}
 \end{wrapfigure}

-The register \reg{rsp} is supposed to always point to the last used memory cell
-in the stack. Thus, when the process just enters a new function, \reg{rsp}
-points right to the location of the return address. Then, the compiler might
-use \reg{rbp} (``base pointer'') to save this value of \reg{rip}, by writing
-the old value of \reg{rbp} just below the return address on the stack, then
-copying \reg{rsp} to \reg{rbp}. This makes it easy to find the return address
-from anywhere within the function, and also allows for easy addressing of local
-variables. To some extents, it also allows for hot debugging, such as saving a
-useful core dump upon segfault. Yet, using \reg{rbp} to save \reg{rip} is not
-always done, since it wastes a register. This decision is, on x86\_64 System V,
-up to the compiler.
+The register \reg{rsp} is supposed to always point to the last used address in
+the stack. Thus, when the process enters a new function, \reg{rsp} points to
+the location of the return address. Then, the compiler might use \reg{rbp}
+(``base pointer'') to save this value of \reg{rsp}, writing the old value of
+\reg{rbp} below the return address on the stack and copying \reg{rsp} to
+\reg{rbp}. This makes it easy to find the return address from anywhere within
+the function, and allows for easy addressing of local variables. To some
+extent, it also allows for hot debugging, such as saving a useful core dump
+upon segfault. Yet, using \reg{rbp} to save \reg{rsp} wastes a register, and
+the decision to use it is, on x86\_64 System V, up to the compiler.

 Usually, a function starts by subtracting some value from \reg{rsp}, allocating
-some space in the stack frame for its local variables. Then, it pushes on
-the stack the values of the callee-saved registers that are overwritten later,
-effectively saving them. Before returning, it pops the values of the saved
-registers back to their original registers and restore \reg{rsp} to its former
-value.
+some space in the stack frame for its local variables. Then, it saves on the
+stack the values of the callee-saved registers that are overwritten later.
+Before returning, it pops the values of the saved registers back to their
+original registers and restores \reg{rsp} to its former value.

 \subsection{Stack unwinding}\label{ssec:stack_unwinding}

@@ -126,13 +124,12 @@ IP\@. This actually observes the stack to find the different stack frames, and
 decode them to identify the function names, parameter values, etc.

 This operation is far from trivial. Often, a stack frame will only make sense
-when the correct values are stored in the machine registers. These values,
+when the machine registers hold the right values. These values,
 however, are to be restored from the previous stack frame, where they are
-stored. This imposes to \emph{walk} the stack, reading the entries one after
-the other, instead of peeking at some frame directly. Moreover, the size of one
-stack frame is often not that easy to determine when looking at some
-instruction other than \texttt{return}, making it hard to extract single frames
-from the whole stack.
+stored. This requires \emph{walking} the stack, reading the frames one after
+the other, instead of peeking at some frame directly. Moreover, it is often not
+even easy to determine the boundaries of each stack frame alone, making it
+impossible to just peek at a single frame.

 Interpreting a frame in order to get the machine state \emph{before} this
 frame, and thus be able to decode the next frame recursively, is called
@@ -159,10 +156,10 @@ common format of debugging data is DWARF\@.
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsection{Unwinding usage and frequency}

-Stack unwinding is a more common operation that one might think at first. The
-use case mostly thought of is simply to get a stack trace of a program, and
-provide a debugger with the information it needs. For instance, when inspecting
-a stack trace in \prog{gdb}, a common operation is to jump to a previous frame:
+Stack unwinding is more frequent than one might think at first. The use case
+mostly thought of is simply to get a stack trace of a program, and provide a
+debugger with the information it needs. For instance, when inspecting a stack
+trace in \prog{gdb}, a common operation is to jump to a previous frame:

 \lstinputlisting{src/segfault/gdb_session}

@@ -174,18 +171,18 @@ context, by unwinding \lstc{fct_b}'s frame.
 Yet, stack unwinding, and thus, debugging data, \emph{is not limited to
 debugging}.

-Another common usage is profiling. A profiling tool, such as \prog{perf} under
-Linux --~see Section~\ref{ssec:perf} --, is used to measure and analyze in
-which functions a program spends its time, identify bottlenecks and find out
-which parts are critical to optimize.  To do so, modern profilers pause the
-traced program at regular, short intervals, inspect their stack, and determine
-which function is currently being run. They also perform a stack unwinding to
-figure out the call path to this function, in order to determine which function
-indirectly takes time: for instance, a function \lstc{fct_a} can call both
-\lstc{fct_b} and \lstc{fct_c}, which both take a lot of time; spend practically
-no time directly in \lstc{fct_a}, but spend a lot of time in calls to the other
-two functions that were made from \lstc{fct_a}. Knowing that after all,
-\lstc{fct_a} is the culprit can be useful to a programmer.
+Another common usage is profiling. A profiler, such as \prog{perf} under Linux
+--~see Section~\ref{ssec:perf}~--, is used to measure and analyze in which
+functions a program spends its time, and find out which parts are critical to
+optimize.  To do so, modern profilers pause the traced program at regular,
+short intervals, inspect its stack, and determine which function is currently
+being run. They also perform a stack unwinding to figure out the call path to
+this function, in order to determine which function indirectly takes time: for
+instance, a function \lstc{fct_a} can call both \lstc{fct_b} and \lstc{fct_c},
+which both take a lot of time: the program then spends practically no time
+directly in \lstc{fct_a}, but a lot of time in calls to the other two functions
+made from \lstc{fct_a}. Knowing that \lstc{fct_a} is, after all, the culprit
+can be useful to a programmer.

 Exception handling also requires a stack unwinding mechanism in some languages.
 Indeed, an exception is completely different from a \lstinline{return}: while
@@ -413,10 +410,10 @@ registers values, which will represent the evaluated DWARF row.
 \subsection{Concerning correctness}\label{ssec:sem_correctness}

 The semantics described in this section are designed in a concern of
-\emph{formalization} of the original DWARF standard. This standard, sadly, only
-devises a plain English description of each instruction's action and result,
-which cannot be used as a basis to \emph{prove} anything correct without
-relying on informal interpretations.
+\emph{formalization} of the original standard. This standard, sadly, only
+describes in plain English each instruction's action and result. This basis
+cannot be used to \emph{prove} anything correct without relying on informal
+interpretations.

 \subsection{Original language: DWARF instructions}

@@ -732,16 +729,15 @@ licenses.
 \subsection{Compilation: \ehelfs}\label{ssec:ehelfs}

 The rough idea of the compilation is to produce, out of the \ehframe{} section
-of a binary, C code that resembles the code shown in the DWARF semantics from
-Section~\ref{sec:semantics} above. This C code is then compiled by GCC in
-\lstbash{-O2} mode. This saves us the trouble of optimizing the generated C
-code whenever GCC does that by itself.
+of a binary, C code close to that of Section~\ref{sec:semantics} above. This C
+code is then compiled by GCC in \lstbash{-O2} mode. This saves us the trouble
+of optimizing the generated C code whenever GCC does that by itself.

-The generated code consists in a single monolithic function, \lstc{_eh_elf},
-taking as arguments an instruction pointer and a memory context (\ie{} the
-value of the various machine registers) as defined in
-Listing~\ref{lst:unw_ctx}. The function will then return a fresh memory
-context, containing the values the registers hold after unwinding this frame.
+The generated code consists of a single function, \lstc{_eh_elf}, taking as
+arguments an instruction pointer and a memory context (\ie{} the value of the
+various machine registers) as defined in Listing~\ref{lst:unw_ctx}. The
+function then returns a fresh memory context loaded with the values the
+registers hold after unwinding this frame.

 The body of the function itself consists in a single monolithic switch, taking
 advantage of the non-standard --~yet overwhelmingly implemented in common C
@@ -953,10 +949,9 @@ across the program to mimic real-world unwinding: we would like to benchmark
 stack unwindings crossing some standard library functions, starting from inside
 them, etc.

-Finally, the unwound program must be interesting enough to enter and exit
-functions often, building a good stack of nested function calls (at least
-frequently 5), have FDEs that are not as simple as in Listing~\ref{lst:ex1_dw},
-etc.
+Finally, the unwound program must be interesting enough to call functions
+often, building a stack of nested function calls (frequently at least 5 deep),
+have FDEs that are not as simple as in Listing~\ref{lst:ex1_dw}, etc.


 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@@ -1016,17 +1011,17 @@ changing one line of code to add one parameter to a function call and linking
 against the modified version of \prog{libunwind} instead of the system version.

 Once this was done, plugging it in \prog{perf} was the matter of a few lines of
-code only, left apart the benchmarking code. The major problem encountered was
-to understand how \prog{perf} works. In order to avoid perturbing the traced
-program, \prog{perf} does not unwind at runtime, but rather records at regular
-intervals the program's stack, and all the auxiliary information that is needed
-to unwind later. This is done when running \lstbash{perf record}. Then, a
-subsequent call to \lstbash{perf report} unwinds the stack to analyze it; but
-at this point of time, the traced process is long dead. Thus, any PID-based
-approach, or any approach using \texttt{/proc} information will fail. However,
-as this was the easiest method, the first version of \ehelfs{} used those
-mechanisms; it took some code rewriting to move to a PID- and
-\texttt{/proc}-agnostic implementation.
+code only, apart from the benchmarking code. The major difficulty was to
+understand how \prog{perf} works. To avoid perturbing the traced program,
+\prog{perf} does not unwind at runtime, but rather records at regular intervals
+the program's stack, and all the auxiliary information that is needed to unwind
+later. This is done when running \lstbash{perf record}. Then, a subsequent call
+to \lstbash{perf report} unwinds the stack to analyze it; but at this point in
+time, the traced process is long dead. Thus, any PID-based approach, or any
+approach using \texttt{/proc} information will fail. However, as this was the
+easiest method, the first version of \ehelfs{} used those mechanisms; it took
+some code rewriting to move to a PID- and \texttt{/proc}-agnostic
+implementation.

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsection{Other explored methods}
@@ -1040,8 +1035,8 @@ also never have met the requirement of unwinding from fairly distributed
 locations anyway.

 Another attempt was made using CSmith~\cite{csmith}, a random C code generator
-initially made for C compilers random testing. The idea was still to craft an
-interesting C program that would unwind on its own frequently, but to integrate
+designed for random testing of C compilers. The idea was still to craft a
+C program that would unwind on its own frequently, but to integrate
 CSmith-randomly generated C code within hand-written C snippets that
 would generate large enough FDEs and nested calls. This was abandoned as well
 as the call graph of a CSmith-generated code is often far too small, and the
@@ -1105,25 +1100,24 @@ Table~\ref{table:bench_time}.
 \end{table}

 The performance of \ehelfs{} is probably overestimated for a production-ready
-version, since \ehelfs{} do not handle all registers from the original DWARF
-file, and thus the \prog{libunwind} version must perform more computation.
-However, this overhead, although impossible to measure without first
-implementing supports for every register, would probably not be that big, since
-most of the time is spent finding the relevant row. Support for every DWARF
-instruction, however, would not slow down at all the implementation, since
-every instruction would simply be compiled to x86\_64 without affecting the
-already supported code.
+version, since \ehelfs{} do not handle all the registers from the original
+DWARF, lightening the computation.  However, this overhead, although impossible
+to measure without first implementing support for every register, would
+probably not be that big, since most of the time is spent finding the relevant
+row. Support for every DWARF instruction, however, would not slow down the
+implementation at all, since every instruction would simply be compiled to
+x86\_64 without affecting the already supported code.

 The fact that there is a sharp difference between cached and uncached
 \prog{libunwind} confirms that our experimental setup did not unwind at totally
 different locations every single time, and thus was not biased in this
 direction, since caching is still very efficient.

-It is also worth noting that the compilation time of \ehelfs{} is also
-reasonably short. On the machine described in Section~\ref{ssec:bench_hw}, and
-without using multiple cores to compile, the various shared objects needed to
-run \prog{hackbench} --~that is, \prog{hackbench}, \prog{libc}, \prog{ld} and
-\prog{libpthread}~-- are compiled in an overall time of $25.28$ seconds.
+The compilation time of \ehelfs{} is also reasonable. On the machine
+described in Section~\ref{ssec:bench_hw}, and without using multiple cores to
+compile, the various shared objects needed to run \prog{hackbench} --~that is,
+\prog{hackbench}, \prog{libc}, \prog{ld} and \prog{libpthread}~-- are compiled
+in an overall time of $25.28$ seconds.

 The unwinding errors observed are hard to investigate, but are most probably
 due to truncated stack records. Indeed, since \prog{perf} dumps the last $n$
@@ -1136,11 +1130,9 @@ the custom \prog{libunwind} implementation that were not spotted.
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsection{Measured compactness}\label{ssec:results_size}

-A first measure of compactness was made in this report for one of the earliest
-working versions in Table~\ref{table:basic_eh_elf_space}.
-
-The same data, generated for the latest version of \ehelfs, can be seen in
-Table~\ref{table:bench_space}.
+A first measure of compactness was made for one of the earliest working
+versions in Table~\ref{table:basic_eh_elf_space}. The same data, generated for
+the latest version of \ehelfs, can be seen in Table~\ref{table:bench_space}.

 The effect of the outlining mentioned in Section~\ref{ssec:space_optim} is
 particularly visible in this table: \prog{hackbench} has a significantly bigger
@@ -1150,9 +1142,8 @@ times, compared to \eg{} \prog{libc}, in which the outlined data is reused a
 lot.

 Just as with time performance, the measured compactness would be impacted by
-supporting every register, but probably not that much either, since most
-columns are concerned with the four supported registers (see
-Section~\ref{ssec:instr_cov}).
+supporting every register, but probably only slightly, since the four supported
+registers represent most columns --~see Section~\ref{ssec:instr_cov}.

 \begin{table}[h]
    \centering