Eliminate widow lines

This commit is contained in:
Théophile Bastian 2018-08-19 18:32:03 +02:00
parent c847d71d28
commit d4f417017e
2 changed files with 106 additions and 116 deletions


@ -32,22 +32,21 @@ computation~\cite{oakley2011exploiting}.
\subsection*{The research problem}
As debugging data can easily grow larger than the program itself if stored
carelessly, the DWARF standard pays great attention to data compactness and
compression. It succeeds particularly well at it, but at the expense of
efficiency: accessing stack unwinding data for a particular program point is an
expensive operation --~the order of magnitude is $10\,\mu{}\text{s}$ on a
modern computer.
This is often not a problem, as stack unwinding is often thought of as a
debugging procedure: when something behaves unexpectedly, the programmer might
open their debugger and explore the stack. Yet, stack unwinding might, in some
cases, be performance-critical: for instance, polling profilers repeatedly
perform stack unwindings to observe which functions are active. Even worse, C++
exception handling relies on stack unwinding to find a suitable catch-block!
For such applications, it might be desirable to find a different time/space
trade-off, storing a bit more for a faster unwinding.
This different trade-off is the question that I explored during this
internship: what good alternative trade-off is reachable when storing the stack
@ -109,10 +108,10 @@ compiled debugging data.
The goal of this project was to design a compiled version of unwinding data
that is faster than DWARF, while still being reliable and reasonably compact.
Benchmarking has yielded convincing results: on the experimental setup created
--~detailed in Section~\ref{sec:benchmarking} below~--, the compiled version is
around 26 times faster than the DWARF version, while it remains only around 2.5
times bigger than the original data.
We support the vast majority --~more than $99.9\,\%$~-- of the instructions
actually used in binaries, although we do not support all of DWARF5 instruction


@ -79,16 +79,16 @@ typically used for storing function arguments, machine registers that must be
restored before returning, the function's return address and local variables.
On the x86\_64 platform, with which this report is mostly concerned, the
calling convention followed on UNIX-like operating systems --~among which Linux
and MacOS~-- is defined by the System V ABI~\cite{systemVabi}. Under this
calling convention, the first six arguments of a function are passed in the
registers \reg{rdi}, \reg{rsi}, \reg{rdx}, \reg{rcx}, \reg{r8}, \reg{r9}, while
additional arguments are pushed onto the stack. It also defines which registers
may be overwritten by the callee, and which registers must be restored by the
callee before returning. This restoration, for most compilers, is done by
pushing the register value onto the stack during the function prelude, and
restoring it just before returning. Those preserved registers are \reg{rbx},
\reg{rsp}, \reg{rbp}, \reg{r12}, \reg{r13}, \reg{r14}, \reg{r15}.
\begin{wrapfigure}{r}{0.4\textwidth}
\centering
@ -97,24 +97,22 @@ prelude, and restoring it just before returning. Those preserved registers are
conventions}\label{fig:call_stack}
\end{wrapfigure}
The register \reg{rsp} is supposed to always point to the last used address in
the stack. Thus, when the process enters a new function, \reg{rsp} points to
the location of the return address. Then, the compiler might use \reg{rbp}
(``base pointer'') to save this value of \reg{rsp}, writing the old value of
\reg{rbp} below the return address on the stack and copying \reg{rsp} to
\reg{rbp}. This makes it easy to find the return address from anywhere within
the function, and allows for easy addressing of local variables. To some
extent, it also allows for hot debugging, such as saving a useful core dump
upon segfault. Yet, using \reg{rbp} this way wastes a register, and the
decision of using it is, on x86\_64 System V, up to the compiler.
Usually, a function starts by subtracting some value from \reg{rsp}, allocating
some space in the stack frame for its local variables. Then, it saves on the
stack the values of the callee-saved registers that are overwritten later.
Before returning, it pops the values of the saved registers back to their
original registers and restores \reg{rsp} to its former value.
\subsection{Stack unwinding}\label{ssec:stack_unwinding}
@ -126,13 +124,12 @@ IP\@. This actually observes the stack to find the different stack frames, and
decode them to identify the function names, parameter values, etc.
This operation is far from trivial. Often, a stack frame will only make sense
when the machine registers hold the right values. These values, however, are to
be restored from the previous stack frame, where they are stored. This forces
us to \emph{walk} the stack, reading the frames one after the other, instead of
peeking at some frame directly. Moreover, it is often not even easy to
determine the boundaries of each stack frame alone, making it impossible to
just peek at a single frame.
Interpreting a frame in order to get the machine state \emph{before} this
frame, and thus be able to decode the next frame recursively, is called
@ -159,10 +156,10 @@ common format of debugging data is DWARF\@.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Unwinding usage and frequency}
Stack unwinding is more frequent than one might think at first. The most
obvious use case is simply to get a stack trace of a program and provide a
debugger with the information it needs. For instance, when inspecting a stack
trace in \prog{gdb}, a common operation is to jump to a previous frame:
\lstinputlisting{src/segfault/gdb_session}
@ -174,18 +171,18 @@ context, by unwinding \lstc{fct_b}'s frame.
Yet, stack unwinding, and thus, debugging data, \emph{is not limited to
debugging}.
Another common usage is profiling. A profiler, such as \prog{perf} under Linux
--~see Section~\ref{ssec:perf}~--, is used to measure and analyze in which
functions a program spends its time, and to find out which parts are critical
to optimize. To do so, modern profilers pause the traced program at regular,
short intervals, inspect its stack, and determine which function is currently
being run. They also perform a stack unwinding to figure out the call path to
this function, in order to determine which function indirectly takes time: for
instance, a program may spend practically no time directly in a function
\lstc{fct_a}, yet spend a lot of time in the calls to \lstc{fct_b} and
\lstc{fct_c} made from \lstc{fct_a}. Knowing that \lstc{fct_a} is, after all,
the culprit can be useful to a programmer.
Exception handling also requires a stack unwinding mechanism in some languages.
Indeed, an exception is completely different from a \lstinline{return}: while
@ -413,10 +410,10 @@ registers values, which will represent the evaluated DWARF row.
\subsection{Concerning correctness}\label{ssec:sem_correctness}
The semantics described in this section are designed as a \emph{formalization}
of the original standard. This standard, sadly, only describes each
instruction's action and result in plain English, which cannot be used as a
basis to \emph{prove} anything correct without relying on informal
interpretations.
\subsection{Original language: DWARF instructions}
@ -732,16 +729,15 @@ licenses.
\subsection{Compilation: \ehelfs}\label{ssec:ehelfs}
The rough idea of the compilation is to produce, out of the \ehframe{} section
of a binary, C code close to that of Section~\ref{sec:semantics} above. This C
code is then compiled by GCC in \lstbash{-O2} mode, which saves us the trouble
of optimizing the generated C code whenever GCC does that by itself.
The generated code consists of a single function, \lstc{_eh_elf}, taking as
arguments an instruction pointer and a memory context (\ie{} the value of the
various machine registers) as defined in Listing~\ref{lst:unw_ctx}. The
function then returns a fresh memory context holding the values of the
registers after unwinding this frame.
The body of the function itself consists of a single monolithic switch, taking
advantage of the non-standard --~yet overwhelmingly implemented in common C
@ -953,10 +949,9 @@ across the program to mimic real-world unwinding: we would like to benchmark
stack unwindings crossing some standard library functions, starting from inside
them, etc.
Finally, the unwound program must be interesting enough to call functions
often, building a stack of nested function calls (frequently at least 5 frames
deep), have FDEs that are not as simple as in Listing~\ref{lst:ex1_dw}, etc.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@ -1016,17 +1011,17 @@ changing one line of code to add one parameter to a function call and linking
against the modified version of \prog{libunwind} instead of the system version.
Once this was done, plugging it into \prog{perf} was a matter of only a few
lines of code, apart from the benchmarking code. The major difficulty was to
understand how \prog{perf} works. To avoid perturbing the traced program,
\prog{perf} does not unwind at runtime, but rather records at regular intervals
the program's stack, along with all the auxiliary information needed to unwind
later. This is done when running \lstbash{perf record}. Then, a subsequent call
to \lstbash{perf report} unwinds the stack to analyze it; but at this point in
time, the traced process is long dead. Thus, any PID-based approach, or any
approach using \texttt{/proc} information, will fail. However, as this was the
easiest method, the first version of \ehelfs{} used those mechanisms; it took
some code rewriting to move to a PID- and \texttt{/proc}-agnostic
implementation.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Other explored methods}
@ -1040,8 +1035,8 @@ also never have met the requirement of unwinding from fairly distributed
locations anyway.
Another attempt was made using CSmith~\cite{csmith}, a random C code generator
designed for the random testing of C compilers. The idea was still to craft a
C program that would unwind on its own frequently, but to integrate randomly
generated CSmith code within hand-written C snippets that would generate large
enough FDEs and nested calls. This was abandoned as well, as the call graph of
CSmith-generated code is often far too small, and the
@ -1105,25 +1100,24 @@ Table~\ref{table:bench_time}.
\end{table}
The performance of \ehelfs{} is probably overestimated for a production-ready
version, since \ehelfs{} do not handle all the registers from the original
DWARF, lightening the computation. However, this overhead, although impossible
to measure without first implementing support for every register, would
probably not be that big, since most of the time is spent finding the relevant
row. Support for every DWARF instruction, however, would not slow down the
implementation at all, since every instruction would simply be compiled to
x86\_64 without affecting the already supported code.
The fact that there is a sharp difference between cached and uncached
\prog{libunwind} confirms that our experimental setup did not unwind at totally
different locations every single time, and thus was not biased in this
direction, since caching is still very efficient.
The compilation time of \ehelfs{} is also reasonable. On the machine described
in Section~\ref{ssec:bench_hw}, and without using multiple cores to compile,
the various shared objects needed to run \prog{hackbench} --~that is,
\prog{hackbench}, \prog{libc}, \prog{ld} and \prog{libpthread}~-- are compiled
in an overall time of $25.28$ seconds.
The unwinding errors observed are hard to investigate, but are most probably
due to truncated stack records. Indeed, since \prog{perf} dumps the last $n$
@ -1136,11 +1130,9 @@ the custom \prog{libunwind} implementation that were not spotted.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Measured compactness}\label{ssec:results_size}
A first measure of compactness was made for one of the earliest working
versions in Table~\ref{table:basic_eh_elf_space}. The same data, generated for
the latest version of \ehelfs, can be seen in Table~\ref{table:bench_space}.
The effect of the outlining mentioned in Section~\ref{ssec:space_optim} is
particularly visible in this table: \prog{hackbench} has a significantly bigger
@ -1150,9 +1142,8 @@ times, compared to \eg{} \prog{libc}, in which the outlined data is reused a
lot.
Just as with time performance, the measured compactness would be impacted by
supporting every register, but probably only lightly, since the four supported
registers represent most of the columns --~see Section~\ref{ssec:instr_cov}.
\begin{table}[h]
\centering