Eliminate widow lines

Théophile Bastian 2018-08-19 18:32:03 +02:00
parent c847d71d28
commit d4f417017e
2 changed files with 106 additions and 116 deletions


@@ -32,22 +32,21 @@ computation~\cite{oakley2011exploiting}.
\subsection*{The research problem}

As debugging data can easily grow larger than the program itself if stored
carelessly, the DWARF standard pays great attention to data compactness and
compression. It succeeds particularly well at this, but at the expense of
efficiency: accessing stack unwinding data for a particular program point is an
expensive operation --~the order of magnitude is $10\,\mu{}\text{s}$ on a
modern computer.

This is usually not a problem, as stack unwinding is often thought of as a
debugging procedure: when something behaves unexpectedly, the programmer might
open their debugger and explore the stack. Yet, stack unwinding might, in some
cases, be performance-critical: for instance, polling profilers repeatedly
perform stack unwindings to observe which functions are active. Even worse, C++
exception handling relies on stack unwinding in order to find a suitable
catch-block! For such applications, it might be desirable to find a different
time/space trade-off, storing a bit more data for faster unwinding.

This different trade-off is the question that I explored during this
internship: what good alternative trade-off is reachable when storing the stack
@@ -109,10 +108,10 @@ compiled debugging data.
The goal of this project was to design a compiled version of unwinding data
that is faster than DWARF, while still being reliable and reasonably compact.
Benchmarking has yielded convincing results: on the experimental setup created
--~detailed in Section~\ref{sec:benchmarking} below~--, the compiled version is
around 26 times faster than the DWARF version, while it remains only around 2.5
times bigger than the original data.

We support the vast majority --~more than $99.9\,\%$~-- of the instructions
actually used in binaries, although we do not support all of DWARF5 instruction


@@ -79,16 +79,16 @@ typically used for storing function arguments, machine registers that must be
restored before returning, the function's return address and local variables.

On the x86\_64 platform, with which this report is mostly concerned, the
calling convention followed on UNIX-like operating systems --~among which Linux
and MacOS~-- is defined by the System V ABI~\cite{systemVabi}. Under this
calling convention, the first six arguments of a function are passed in the
registers \reg{rdi}, \reg{rsi}, \reg{rdx}, \reg{rcx}, \reg{r8} and \reg{r9},
while additional arguments are pushed onto the stack. It also defines which
registers may be overwritten by the callee, and which registers must be
restored by the callee before returning. This restoration, for most compilers,
is done by pushing the register value onto the stack during the function
prelude, and restoring it just before returning. Those preserved registers are
\reg{rbx}, \reg{rsp}, \reg{rbp}, \reg{r12}, \reg{r13}, \reg{r14} and \reg{r15}.
\begin{wrapfigure}{r}{0.4\textwidth}
\centering
@@ -97,24 +97,22 @@ prelude, and restoring it just before returning. Those preserved registers are
conventions}\label{fig:call_stack}
\end{wrapfigure}

The register \reg{rsp} is supposed to always point to the last used address in
the stack. Thus, when the process enters a new function, \reg{rsp} points to
the location of the return address. Then, the compiler might use \reg{rbp}
(``base pointer'') to save this value of \reg{rsp}, writing the old value of
\reg{rbp} below the return address on the stack and copying \reg{rsp} to
\reg{rbp}. This makes it easy to find the return address from anywhere within
the function, and allows for easy addressing of local variables. To some
extent, it also allows for hot debugging, such as saving a useful core dump
upon segfault. Yet, using \reg{rbp} this way wastes a register, and the
decision to use it is, on x86\_64 System V, left to the compiler.

Usually, a function starts by subtracting some value from \reg{rsp}, allocating
space in the stack frame for its local variables. Then, it saves on the stack
the values of the callee-saved registers that are overwritten later. Before
returning, it pops the values of the saved registers back to their original
registers and restores \reg{rsp} to its former value.
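As a rough, non-authoritative illustration --~a sketch, not code taken from the
report's sources~-- the commented example below shows the typical shape of such
a prelude and epilogue for a hypothetical function that needs one callee-saved
register; the exact instructions emitted depend on the compiler and
optimization level.

\begin{lstlisting}[language=C]
/* Illustrative sketch only: acc must survive the calls to read(), so a
 * compiler will typically keep it in a callee-saved register such as rbx,
 * which therefore has to be saved in the prelude and restored on exit. */
long sum_read(long (*read)(long), long n)
{
    long acc = 0;
    for (long i = 0; i < n; i++)
        acc += read(i);
    return acc;
}

/* Plausible (compiler-dependent) prelude and epilogue:
 *   push %rbp          ; save the caller's base pointer
 *   mov  %rsp, %rbp    ; establish this frame's base
 *   push %rbx          ; save a callee-saved register before clobbering it
 *   sub  $24, %rsp     ; room for locals, keeps %rsp 16-byte aligned
 *   ...                ; function body, acc lives in %rbx
 *   add  $24, %rsp     ; give the local space back
 *   pop  %rbx          ; restore the callee-saved register
 *   pop  %rbp          ; restore the caller's base pointer
 *   ret                ; pop the return address into %rip
 */
\end{lstlisting}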
\subsection{Stack unwinding}\label{ssec:stack_unwinding}
@@ -126,13 +124,12 @@ IP\@. This actually observes the stack to find the different stack frames, and
decode them to identify the function names, parameter values, etc.

This operation is far from trivial. Often, a stack frame will only make sense
when the machine registers hold the right values. These values, however, are to
be restored from the previous stack frame, where they are stored. This forces
one to \emph{walk} the stack, reading the frames one after the other, instead
of peeking at some frame directly. Moreover, it is often not even easy to
determine the boundaries of a single stack frame on its own, making it
impossible to just peek at one frame.
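For comparison, when every frame does keep an intact \reg{rbp} chain, walking
the stack can be done naively, as in the sketch below. This is illustrative
only: \lstc{naive_backtrace} is a made-up helper, and real unwinders cannot
rely on this assumption --~which is precisely why dedicated unwinding data
exists.

\begin{lstlisting}[language=C]
#include <stdio.h>

/* Illustrative sketch only: follow the chain of saved %rbp values.
 * This works solely when every frame maintains the frame pointer, an
 * assumption that optimised code frequently breaks. */
void naive_backtrace(void)
{
    void **frame = __builtin_frame_address(0);   /* current %rbp */
    while (frame != NULL) {
        void *ret_addr = frame[1];   /* return address sits just above saved %rbp */
        if (ret_addr == NULL)
            break;
        printf("return address: %p\n", ret_addr);
        frame = (void **)frame[0];   /* saved %rbp of the caller's frame */
    }
}
\end{lstlisting}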
Interpreting a frame in order to get the machine state \emph{before} this
frame, and thus be able to decode the next frame recursively, is called
@@ -159,10 +156,10 @@ common format of debugging data is DWARF\@.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Unwinding usage and frequency}

Stack unwinding is more frequent than one might think at first. The most
obvious use case is simply to get a stack trace of a program and provide a
debugger with the information it needs. For instance, when inspecting a stack
trace in \prog{gdb}, a common operation is to jump to a previous frame:
\lstinputlisting{src/segfault/gdb_session} \lstinputlisting{src/segfault/gdb_session}
@@ -174,18 +171,18 @@ context, by unwinding \lstc{fct_b}'s frame.
Yet, stack unwinding, and thus, debugging data, \emph{is not limited to
debugging}.

Another common usage is profiling. A profiler, such as \prog{perf} under Linux
--~see Section~\ref{ssec:perf}~--, is used to measure and analyze in which
functions a program spends its time, and to find out which parts are critical
to optimize. To do so, modern profilers pause the traced program at regular,
short intervals, inspect its stack, and determine which function is currently
being run. They also perform a stack unwinding to figure out the call path to
this function, in order to determine which function indirectly takes time: for
instance, a function \lstc{fct_a} may call both \lstc{fct_b} and \lstc{fct_c},
which both take a lot of time; the program then spends practically no time
directly in \lstc{fct_a}, but a lot of time in calls made from \lstc{fct_a} to
the other two functions. Knowing that \lstc{fct_a} is, after all, the culprit
can be useful to a programmer.
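A toy program exhibiting this situation could look like the following sketch
(illustrative only, mirroring the \lstc{fct_a}/\lstc{fct_b}/\lstc{fct_c}
scenario above): almost every sample lands in \lstc{fct_b} or \lstc{fct_c}, and
only unwinding reveals that all of those calls originate from \lstc{fct_a}.

\begin{lstlisting}[language=C]
/* Illustrative sketch only: nearly all samples fall inside fct_b and
 * fct_c, yet the useful insight is that every such call comes from
 * fct_a -- which a profiler can only learn by unwinding the stack. */
static void fct_b(void) { for (volatile long i = 0; i < 100000000L; i++); }
static void fct_c(void) { for (volatile long i = 0; i < 100000000L; i++); }
static void fct_a(void) { fct_b(); fct_c(); }

int main(void)
{
    fct_a();
    return 0;
}
\end{lstlisting}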
Exception handling also requires a stack unwinding mechanism in some languages.
Indeed, an exception is completely different from a \lstinline{return}: while
@@ -413,10 +410,10 @@ registers values, which will represent the evaluated DWARF row.
\subsection{Concerning correctness}\label{ssec:sem_correctness}

The semantics described in this section are intended as a \emph{formalization}
of the original standard. This standard, sadly, only describes each
instruction's action and result in plain English; such a basis cannot be used
to \emph{prove} anything correct without relying on informal interpretations.

\subsection{Original language: DWARF instructions}
@@ -732,16 +729,15 @@ licenses.
\subsection{Compilation: \ehelfs}\label{ssec:ehelfs}

The rough idea of the compilation is to produce, out of the \ehframe{} section
of a binary, C code close to that of Section~\ref{sec:semantics} above. This C
code is then compiled by GCC in \lstbash{-O2} mode. This saves us the trouble
of optimizing the generated C code, since GCC already does that by itself.

The generated code consists of a single function, \lstc{_eh_elf}, taking as
arguments an instruction pointer and a memory context (\ie{} the value of the
various machine registers) as defined in Listing~\ref{lst:unw_ctx}. The
function then returns a fresh memory context loaded with the values the
registers hold after unwinding this frame.

The body of the function itself consists of a single monolithic switch, taking
advantage of the non-standard --~yet overwhelmingly implemented in common C
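A heavily simplified sketch of what such generated code might look like is
given below. The \lstc{unwind_context} layout, the \lstc{deref} helper and the
address ranges are hypothetical stand-ins (the real context is the one of
Listing~\ref{lst:unw_ctx}); the point is only the overall shape: one case per
range of program counters, each case rebuilding the caller's registers from the
current context.

\begin{lstlisting}[language=C]
#include <stdint.h>

/* Hypothetical context layout, standing in for Listing lst:unw_ctx. */
typedef struct {
    uint64_t rip, rsp, rbp, rbx;
} unwind_context;

/* Stand-in for reading one machine word from the unwound program's memory. */
extern uint64_t deref(uint64_t address);

unwind_context _eh_elf(uint64_t pc, unwind_context ctx)
{
    unwind_context out = ctx;
    switch (pc) {
    /* case ranges: a non-standard yet widely implemented C extension */
    case 0x400a00 ... 0x400a17:          /* hypothetical range: early prologue */
        out.rsp = ctx.rsp + 8;           /* CFA = rsp + 8 */
        out.rip = deref(ctx.rsp);        /* return address at CFA - 8 */
        break;
    case 0x400a18 ... 0x400a52:          /* hypothetical range: rbp-based frame */
        out.rsp = ctx.rbp + 16;          /* CFA tracked through rbp */
        out.rip = deref(ctx.rbp + 8);    /* return address at CFA - 8 */
        out.rbp = deref(ctx.rbp);        /* saved rbp at CFA - 16 */
        break;
    default:                             /* no unwinding data for this pc */
        out.rip = 0;                     /* hypothetical error marker */
        break;
    }
    return out;
}
\end{lstlisting}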
@@ -953,10 +949,9 @@ across the program to mimic real-world unwinding: we would like to benchmark
stack unwindings crossing some standard library functions, starting from inside
them, etc.

Finally, the unwound program must be interesting enough to call functions
often, building a stack of nested function calls (frequently at least 5 frames
deep), to have FDEs that are not as simple as in Listing~\ref{lst:ex1_dw}, etc.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@@ -1016,17 +1011,17 @@ changing one line of code to add one parameter to a function call and linking
against the modified version of \prog{libunwind} instead of the system version.

Once this was done, plugging it into \prog{perf} was only a matter of a few
lines of code, leaving the benchmarking code aside. The major difficulty was to
understand how \prog{perf} works. To avoid perturbing the traced program,
\prog{perf} does not unwind at runtime, but rather records at regular intervals
the program's stack, along with all the auxiliary information needed to unwind
later. This is done when running \lstbash{perf record}. Then, a subsequent call
to \lstbash{perf report} unwinds the stack to analyze it; but by that time, the
traced process is long dead. Thus, any PID-based approach, or any approach
using \texttt{/proc} information, would fail. However, as this was the easiest
method, the first version of \ehelfs{} used those mechanisms; it took some code
rewriting to move to a PID- and \texttt{/proc}-agnostic implementation.
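For context, the sketch below shows what unwinding a \emph{live, local} process
with stock \prog{libunwind} roughly looks like --~this is the library's
documented API, not the modified code used here. \prog{perf report}, by
contrast, has to go through the remote-unwinding interface with custom memory
accessors, since it only has the recorded stack dump to work with.

\begin{lstlisting}[language=C]
#define UNW_LOCAL_ONLY
#include <libunwind.h>
#include <stdio.h>

/* Sketch of a plain local unwind using libunwind's documented API. */
void show_backtrace(void)
{
    unw_context_t context;
    unw_cursor_t cursor;

    unw_getcontext(&context);            /* capture the current registers */
    unw_init_local(&cursor, &context);   /* local (same-process) unwinding */

    while (unw_step(&cursor) > 0) {      /* one step per stack frame */
        unw_word_t ip, sp;
        unw_get_reg(&cursor, UNW_REG_IP, &ip);
        unw_get_reg(&cursor, UNW_REG_SP, &sp);
        printf("ip = %#lx, sp = %#lx\n", (unsigned long)ip, (unsigned long)sp);
    }
}
\end{lstlisting}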
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Other explored methods}
@@ -1040,8 +1035,8 @@ also never have met the requirement of unwinding from fairly distributed
locations anyway.

Another attempt was made using CSmith~\cite{csmith}, a random C code generator
designed for random testing of C compilers. The idea was still to craft a
C program that would unwind on its own frequently, but to integrate randomly
generated CSmith code within hand-written C snippets that would generate large
enough FDEs and nested calls. This was abandoned as well, since the call graph
of CSmith-generated code is often far too small, and the
@@ -1105,25 +1100,24 @@ Table~\ref{table:bench_time}.
\end{table}

The performance of \ehelfs{} is probably overestimated for a production-ready
version, since \ehelfs{} do not handle all the registers from the original
DWARF, lightening the computation. However, this overhead, although impossible
to measure without first implementing support for every register, would
probably not be that big, since most of the time is spent finding the relevant
row. Support for every DWARF instruction, however, would not slow the
implementation down at all, since every instruction would simply be compiled to
x86\_64 without affecting the already supported code.

The fact that there is a sharp difference between cached and uncached
\prog{libunwind} confirms that our experimental setup did not unwind at totally
different locations every single time --~since caching is still very
efficient~-- and thus was not biased in this direction.

The compilation time of \ehelfs{} is also reasonable. On the machine described
in Section~\ref{ssec:bench_hw}, and without using multiple cores to compile,
the various shared objects needed to run \prog{hackbench} --~that is,
\prog{hackbench}, \prog{libc}, \prog{ld} and \prog{libpthread}~-- are compiled
in an overall time of $25.28$ seconds.

The unwinding errors observed are hard to investigate, but are most probably
due to truncated stack records. Indeed, since \prog{perf} dumps the last $n$
@@ -1136,11 +1130,9 @@ the custom \prog{libunwind} implementation that were not spotted.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Measured compactness}\label{ssec:results_size}

A first measure of compactness was made for one of the earliest working
versions in Table~\ref{table:basic_eh_elf_space}. The same data, generated for
the latest version of \ehelfs, can be seen in Table~\ref{table:bench_space}.
The effect of the outlining mentioned in Section~\ref{ssec:space_optim} is
particularly visible in this table: \prog{hackbench} has a significantly bigger
@@ -1150,9 +1142,8 @@ times, compared to \eg{} \prog{libc}, in which the outlined data is reused a
lot.

Just as with time performance, the measured compactness would be impacted by
supporting every register, but probably only lightly, since the four supported
registers already account for most columns --~see Section~\ref{ssec:instr_cov}.
\begin{table}[h]
\centering