Eliminate widow lines
parent
c847d71d28
commit
d4f417017e
2 changed files with 106 additions and 116 deletions
|
@ -32,22 +32,21 @@ computation~\cite{oakley2011exploiting}.
|
|||
|
||||
\subsection*{The research problem}
|
||||
|
||||
As debugging data can easily take an unreasonable space and grow larger than
|
||||
the program itself if stored carelessly, the DWARF standard pays great
|
||||
attention to data compactness and compression. It succeeds particularly well
|
||||
at it, but at the expense of efficiency: accessing stack
|
||||
unwinding data for a particular program point is an expensive operation --~the
|
||||
order of magnitude is $10\,\mu{}\text{s}$ on a modern computer.
|
||||
As debugging data can easily grow larger than the program itself if stored
|
||||
carelessly, the DWARF standard pays great attention to data compactness and
|
||||
compression. It succeeds particularly well at it, but at the expense of
|
||||
efficiency: accessing stack unwinding data for a particular program point is an
|
||||
expensive operation --~the order of magnitude is $10\,\mu{}\text{s}$ on a
|
||||
modern computer.
|
||||
|
||||
This is often not a problem, as stack unwinding is often thought of as a
|
||||
debugging procedure: when something behaves unexpectedly, the programmer might
|
||||
be interested in opening their debugger and exploring the stack. Yet, stack
|
||||
unwinding might, in some cases, be performance-critical: for instance, polling
|
||||
profilers repeatedly perform stack unwindings to observe which functions are
|
||||
active. Even worse, C++ exception handling relies on stack unwinding in order
|
||||
to find a suitable catch-block! For such applications, it might be desirable to
|
||||
find a different time/space trade-off, storing a bit more for a faster
|
||||
unwinding.
|
||||
open their debugger and explore the stack. Yet, stack unwinding might, in some
|
||||
cases, be performance-critical: for instance, polling profilers repeatedly
|
||||
perform stack unwindings to observe which functions are active. Even worse, C++
|
||||
exception handling relies on stack unwinding in order to find a suitable
|
||||
catch-block! For such applications, it might be desirable to find a different
|
||||
time/space trade-off, storing a bit more for a faster unwinding.
|
||||
|
||||
This different trade-off is the question that I explored during this
|
||||
internship: what good alternative trade-off is reachable when storing the stack
|
||||
|
@ -109,10 +108,10 @@ compiled debugging data.
|
|||
|
||||
The goal of this project was to design a compiled version of unwinding data
|
||||
that is faster than DWARF, while still being reliable and reasonably compact.
|
||||
The benchmarks mentioned have yielded convincing results: on the experimental
|
||||
setup created --~detailed in Section~\ref{sec:benchmarking} below~\textendash,
|
||||
the compiled version is around 26 times faster than the DWARF version, while it
|
||||
remains only around 2.5 times bigger than the original data.
|
||||
Benchmarking has yielded convincing results: on the experimental setup created
|
||||
--~detailed in Section~\ref{sec:benchmarking} below~\textendash, the compiled
|
||||
version is around 26 times faster than the DWARF version, while it remains only
|
||||
around 2.5 times bigger than the original data.
|
||||
|
||||
We support the vast majority --~more than $99.9\,\%$~-- of the instructions
|
||||
actually used in binaries, although we do not support all of DWARF5 instruction
|
||||
|
|
|
@ -79,16 +79,16 @@ typically used for storing function arguments, machine registers that must be
|
|||
restored before returning, the function's return address and local variables.
|
||||
|
||||
On the x86\_64 platform, with which this report is mostly concerned, the
|
||||
calling convention that is followed is defined in the System V
|
||||
ABI~\cite{systemVabi} for the Unix-like operating systems --~among which Linux
|
||||
and macOS\@. Under this calling convention, the first six arguments of a
|
||||
function are passed in the registers \reg{rdi}, \reg{rsi}, \reg{rdx},
|
||||
\reg{rcx}, \reg{r8}, \reg{r9}, while additional arguments are pushed onto the
|
||||
stack. It also defines which registers may be overwritten by the callee, and
|
||||
which registers must be restored before returning. This restoration, for most
|
||||
compilers, is done by pushing the register value onto the stack in the function
|
||||
prelude, and restoring it just before returning. Those preserved registers are
|
||||
\reg{rbx}, \reg{rsp}, \reg{rbp}, \reg{r12}, \reg{r13}, \reg{r14}, \reg{r15}.
|
||||
calling convention followed on UNIX-like operating systems --~among which Linux
|
||||
and macOS~-- is defined by the System V ABI~\cite{systemVabi}. Under this
|
||||
calling convention, the first six arguments of a function are passed in the
|
||||
registers \reg{rdi}, \reg{rsi}, \reg{rdx}, \reg{rcx}, \reg{r8}, \reg{r9}, while
|
||||
additional arguments are pushed onto the stack. It also defines which registers
|
||||
may be overwritten by the callee, and which registers must be restored by the
|
||||
callee before returning. This restoration, for most compilers, is done by
|
||||
pushing the register value onto the stack during the function prelude, and
|
||||
restoring it just before returning. Those preserved registers are \reg{rbx},
|
||||
\reg{rsp}, \reg{rbp}, \reg{r12}, \reg{r13}, \reg{r14}, \reg{r15}.
|
||||
|
||||
\begin{wrapfigure}{r}{0.4\textwidth}
|
||||
\centering
|
||||
|
@ -97,24 +97,22 @@ prelude, and restoring it just before returning. Those preserved registers are
|
|||
conventions}\label{fig:call_stack}
|
||||
\end{wrapfigure}
|
||||
|
||||
The register \reg{rsp} is supposed to always point to the last used memory cell
|
||||
in the stack. Thus, when the process just enters a new function, \reg{rsp}
|
||||
points right to the location of the return address. Then, the compiler might
|
||||
use \reg{rbp} (``base pointer'') to save this value of \reg{rip}, by writing
|
||||
the old value of \reg{rbp} just below the return address on the stack, then
|
||||
copying \reg{rsp} to \reg{rbp}. This makes it easy to find the return address
|
||||
from anywhere within the function, and also allows for easy addressing of local
|
||||
variables. To some extent, it also allows for hot debugging, such as saving a
|
||||
useful core dump upon segfault. Yet, using \reg{rbp} to save \reg{rip} is not
|
||||
always done, since it wastes a register. This decision is, on x86\_64 System V,
|
||||
up to the compiler.
|
||||
The register \reg{rsp} is supposed to always point to the last used address in
|
||||
the stack. Thus, when the process enters a new function, \reg{rsp} points to
|
||||
the location of the return address. Then, the compiler might use \reg{rbp}
|
||||
(``base pointer'') to save this value of \reg{rsp}, writing the old value of
|
||||
\reg{rbp} below the return address on the stack and copying \reg{rsp} to
|
||||
\reg{rbp}. This makes it easy to find the return address from anywhere within
|
||||
the function, and allows for easy addressing of local variables. To some
|
||||
extent, it also allows for hot debugging, such as saving a useful core dump
|
||||
upon segfault. Yet, using \reg{rbp} to save \reg{rsp} wastes a register, and
|
||||
the decision to use it is, on x86\_64 System V, up to the compiler.
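
As an illustration of why a maintained \reg{rbp} makes the return addresses
easy to find, the following is a minimal sketch, not taken from the report,
that follows the chain of saved \reg{rbp} values over a \emph{simulated} stack.
The names \lstc{load} and \lstc{walk_frames}, the array-based memory and the
convention that a saved \reg{rbp} of $0$ ends the chain are all assumptions
made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical simulated stack: word i lives at "address" 8 * i. */
#define STACK_WORDS 16

/* Read one 8-byte word of the simulated stack. */
static uint64_t load(const uint64_t *stack, uint64_t addr) {
    assert(addr % 8 == 0 && addr / 8 < STACK_WORDS);
    return stack[addr / 8];
}

/* Follow the chain of saved rbp values. Assumed frame layout:
 *   [rbp]     holds the caller's rbp,
 *   [rbp + 8] holds the return address.
 * The walk stops when rbp is 0 (assumed end-of-chain marker) or when
 * `max` return addresses have been collected. Returns how many were found. */
size_t walk_frames(const uint64_t *stack, uint64_t rbp,
                   uint64_t *ret_addrs, size_t max) {
    size_t n = 0;
    while (rbp != 0 && n < max) {
        ret_addrs[n++] = load(stack, rbp + 8); /* return address of this frame */
        rbp = load(stack, rbp);                /* move on to the caller's frame */
    }
    return n;
}
```

This only works when every function on the stack keeps \reg{rbp} as a frame
pointer; when the compiler omits it, the same information has to come from
unwinding data.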
|
||||
|
||||
Usually, a function starts by subtracting some value from \reg{rsp}, allocating
|
||||
some space in the stack frame for its local variables. Then, it pushes on
|
||||
the stack the values of the callee-saved registers that are overwritten later,
|
||||
effectively saving them. Before returning, it pops the values of the saved
|
||||
registers back to their original registers and restores \reg{rsp} to its former
|
||||
value.
|
||||
some space in the stack frame for its local variables. Then, it saves on the
|
||||
stack the values of the callee-saved registers that are overwritten later.
|
||||
Before returning, it pops the values of the saved registers back to their
|
||||
original registers and restores \reg{rsp} to its former value.
|
||||
|
||||
\subsection{Stack unwinding}\label{ssec:stack_unwinding}
|
||||
|
||||
|
@ -126,13 +124,12 @@ IP\@. This actually observes the stack to find the different stack frames, and
|
|||
decode them to identify the function names, parameter values, etc.
|
||||
|
||||
This operation is far from trivial. Often, a stack frame will only make sense
|
||||
when the correct values are stored in the machine registers. These values,
|
||||
when the machine registers hold the right values. These values,
|
||||
however, are to be restored from the previous stack frame, where they are
|
||||
stored. This makes it necessary to \emph{walk} the stack, reading the entries one after
|
||||
the other, instead of peeking at some frame directly. Moreover, the size of one
|
||||
stack frame is often not that easy to determine when looking at some
|
||||
instruction other than \texttt{return}, making it hard to extract single frames
|
||||
from the whole stack.
|
||||
stored. This makes it necessary to \emph{walk} the stack, reading the frames one after
|
||||
the other, instead of peeking at some frame directly. Moreover, it is often not
|
||||
even easy to determine the boundaries of each stack frame alone, making it
|
||||
impossible to just peek at a single frame.
|
||||
|
||||
Interpreting a frame in order to get the machine state \emph{before} this
|
||||
frame, and thus be able to decode the next frame recursively, is called
|
||||
|
@ -159,10 +156,10 @@ common format of debugging data is DWARF\@.
|
|||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\subsection{Unwinding usage and frequency}
|
||||
|
||||
Stack unwinding is a more common operation than one might think at first. The
|
||||
use case mostly thought of is simply to get a stack trace of a program, and
|
||||
provide a debugger with the information it needs. For instance, when inspecting
|
||||
a stack trace in \prog{gdb}, a common operation is to jump to a previous frame:
|
||||
Stack unwinding is more frequent than one might think at first. The use case
|
||||
mostly thought of is simply to get a stack trace of a program, and provide a
|
||||
debugger with the information it needs. For instance, when inspecting a stack
|
||||
trace in \prog{gdb}, a common operation is to jump to a previous frame:
|
||||
|
||||
\lstinputlisting{src/segfault/gdb_session}
|
||||
|
||||
|
@ -174,18 +171,18 @@ context, by unwinding \lstc{fct_b}'s frame.
|
|||
Yet, stack unwinding, and thus, debugging data, \emph{is not limited to
|
||||
debugging}.
|
||||
|
||||
Another common usage is profiling. A profiling tool, such as \prog{perf} under
|
||||
Linux --~see Section~\ref{ssec:perf} --, is used to measure and analyze in
|
||||
which functions a program spends its time, identify bottlenecks and find out
|
||||
which parts are critical to optimize. To do so, modern profilers pause the
|
||||
traced program at regular, short intervals, inspect its stack, and determine
|
||||
which function is currently being run. They also perform a stack unwinding to
|
||||
figure out the call path to this function, in order to determine which function
|
||||
indirectly takes time: for instance, a function \lstc{fct_a} can call both
|
||||
\lstc{fct_b} and \lstc{fct_c}, which both take a lot of time; the program then
|
||||
spends practically no time directly in \lstc{fct_a}, but spends a lot of time in calls to the other
|
||||
two functions that were made from \lstc{fct_a}. Knowing that after all,
|
||||
\lstc{fct_a} is the culprit can be useful to a programmer.
|
||||
Another common usage is profiling. A profiler, such as \prog{perf} under Linux
|
||||
--~see Section~\ref{ssec:perf}~--, is used to measure and analyze in which
|
||||
functions a program spends its time, and find out which parts are critical to
|
||||
optimize. To do so, modern profilers pause the traced program at regular,
|
||||
short intervals, inspect its stack, and determine which function is currently
|
||||
being run. They also perform a stack unwinding to figure out the call path to
|
||||
this function, in order to determine which function indirectly takes time: for
|
||||
instance, a function \lstc{fct_a} can call both \lstc{fct_b} and \lstc{fct_c},
|
||||
which both take a lot of time; the program then spends practically no time directly in
|
||||
\lstc{fct_a}, but spends a lot of time in calls to the other two functions that
|
||||
were made from \lstc{fct_a}. Knowing that after all, \lstc{fct_a} is the
|
||||
culprit can be useful to a programmer.
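
The distinction between time spent directly in a function and time spent in
its callees can be made concrete with a minimal sketch, not taken from the
report. It assumes that each profiler sample is an already unwound call stack,
stored outermost caller first; the type \lstc{sample_t} and the two counting
functions are hypothetical names introduced for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_DEPTH 8

/* One sampled call stack, outermost caller first (hypothetical format). */
typedef struct {
    const char *frames[MAX_DEPTH];
    size_t depth;
} sample_t;

/* Samples in which `fn` is the innermost frame, i.e. was actually running. */
size_t self_samples(const sample_t *s, size_t n, const char *fn) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        assert(s[i].depth <= MAX_DEPTH);
        if (s[i].depth > 0 && strcmp(s[i].frames[s[i].depth - 1], fn) == 0)
            count++;
    }
    return count;
}

/* Samples in which `fn` appears anywhere on the stack: it was either
 * running or waiting for one of its callees to return. */
size_t inclusive_samples(const sample_t *s, size_t n, const char *fn) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        assert(s[i].depth <= MAX_DEPTH);
        for (size_t j = 0; j < s[i].depth; j++) {
            if (strcmp(s[i].frames[j], fn) == 0) {
                count++;
                break; /* count each sample at most once */
            }
        }
    }
    return count;
}
```

Without unwinding, only the innermost frame of each sample is known, so only
the first count can be computed; the second one requires the whole call path.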
|
||||
|
||||
Exception handling also requires a stack unwinding mechanism in some languages.
|
||||
Indeed, an exception is completely different from a \lstinline{return}: while
|
||||
|
@ -413,10 +410,10 @@ registers values, which will represent the evaluated DWARF row.
|
|||
\subsection{Concerning correctness}\label{ssec:sem_correctness}
|
||||
|
||||
The semantics described in this section are designed as a
|
||||
\emph{formalization} of the original DWARF standard. This standard, sadly, only
|
||||
devises a plain English description of each instruction's action and result,
|
||||
which cannot be used as a basis to \emph{prove} anything correct without
|
||||
relying on informal interpretations.
|
||||
\emph{formalization} of the original standard. This standard, sadly, only
|
||||
describes in plain English each instruction's action and result. This basis
|
||||
cannot be used to \emph{prove} anything correct without relying on informal
|
||||
interpretations.
|
||||
|
||||
\subsection{Original language: DWARF instructions}
|
||||
|
||||
|
@ -732,16 +729,15 @@ licenses.
|
|||
\subsection{Compilation: \ehelfs}\label{ssec:ehelfs}
|
||||
|
||||
The rough idea of the compilation is to produce, out of the \ehframe{} section
|
||||
of a binary, C code that resembles the code shown in the DWARF semantics from
|
||||
Section~\ref{sec:semantics} above. This C code is then compiled by GCC in
|
||||
\lstbash{-O2} mode. This saves us the trouble of optimizing the generated C
|
||||
code whenever GCC does that by itself.
|
||||
of a binary, C code close to that of Section~\ref{sec:semantics} above. This C
|
||||
code is then compiled by GCC in \lstbash{-O2} mode. This saves us the trouble
|
||||
of optimizing the generated C code whenever GCC does that by itself.
|
||||
|
||||
The generated code consists of a single monolithic function, \lstc{_eh_elf},
|
||||
taking as arguments an instruction pointer and a memory context (\ie{} the
|
||||
value of the various machine registers) as defined in
|
||||
Listing~\ref{lst:unw_ctx}. The function will then return a fresh memory
|
||||
context, containing the values the registers hold after unwinding this frame.
|
||||
The generated code consists of a single function, \lstc{_eh_elf}, taking as
|
||||
arguments an instruction pointer and a memory context (\ie{} the value of the
|
||||
various machine registers) as defined in Listing~\ref{lst:unw_ctx}. The
|
||||
function then returns a fresh memory context loaded with the values the
|
||||
registers hold after unwinding this frame.
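
To make this interface more concrete, here is a minimal hand-written sketch
of what such a function could look like for a single imaginary function
located at addresses \lstc{0x1000} to \lstc{0x101f}. It is not the generated
code: the type \lstc{unw_ctx_t}, its fields, the name \lstc{eh_elf_sketch},
the address ranges and the simulated memory behind \lstc{fake_deref} are all
assumptions made for this example, and plain range comparisons are used to
select the row.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical unwinding context; the real one is that of the report's listing. */
typedef struct {
    uint64_t rip, rsp, rbp;
} unw_ctx_t;

/* Reads one 8-byte word of the unwound program's stack. */
typedef uint64_t (*deref_fn)(uint64_t addr);

/* Simulated stack memory for the example, mapped at address 0x7000. */
#define FAKE_BASE 0x7000u
uint64_t fake_stack[8];
uint64_t fake_deref(uint64_t addr) {
    assert(addr >= FAKE_BASE && (addr - FAKE_BASE) / 8 < 8);
    return fake_stack[(addr - FAKE_BASE) / 8];
}

/* Rows assumed for the imaginary function:
 *   0x1000          before `push %rbp`:     CFA = rsp + 8
 *   0x1001..0x1003  after  `push %rbp`:     CFA = rsp + 16, rbp at CFA - 16
 *   0x1004..0x101f  after  `mov %rsp,%rbp`: CFA = rbp + 16, rbp at CFA - 16
 * The return address is always stored at CFA - 8. */
unw_ctx_t eh_elf_sketch(uint64_t pc, unw_ctx_t ctx, deref_fn deref) {
    unw_ctx_t out = ctx;
    uint64_t cfa;
    if (pc == 0x1000) {
        cfa = ctx.rsp + 8;
    } else if (pc >= 0x1001 && pc <= 0x1003) {
        cfa = ctx.rsp + 16;
        out.rbp = deref(cfa - 16);
    } else if (pc >= 0x1004 && pc <= 0x101f) {
        cfa = ctx.rbp + 16;
        out.rbp = deref(cfa - 16);
    } else {
        out.rip = 0; /* no row for this pc: signal failure in this sketch */
        return out;
    }
    out.rip = deref(cfa - 8); /* return address becomes the caller's rip */
    out.rsp = cfa;            /* the caller's rsp is the CFA */
    return out;
}
```

Since each row is a handful of additions and memory reads, almost all of the
cost lies in selecting the row that matches the instruction pointer.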
|
||||
|
||||
The body of the function itself consists of a single monolithic switch, taking
|
||||
advantage of the non-standard --~yet overwhelmingly implemented in common C
|
||||
|
@ -953,10 +949,9 @@ across the program to mimic real-world unwinding: we would like to benchmark
|
|||
stack unwindings crossing some standard library functions, starting from inside
|
||||
them, etc.
|
||||
|
||||
Finally, the unwound program must be interesting enough to enter and exit
|
||||
functions often, building a good stack of nested function calls (at least
|
||||
frequently 5), have FDEs that are not as simple as in Listing~\ref{lst:ex1_dw},
|
||||
etc.
|
||||
Finally, the unwound program must be interesting enough to call functions
|
||||
often, building a stack of nested function calls (frequently at least 5 deep), have
|
||||
FDEs that are not as simple as in Listing~\ref{lst:ex1_dw}, etc.
|
||||
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
|
@ -1016,17 +1011,17 @@ changing one line of code to add one parameter to a function call and linking
|
|||
against the modified version of \prog{libunwind} instead of the system version.
|
||||
|
||||
Once this was done, plugging it into \prog{perf} was a matter of a few lines of
|
||||
code only, apart from the benchmarking code. The major problem encountered was
|
||||
to understand how \prog{perf} works. In order to avoid perturbing the traced
|
||||
program, \prog{perf} does not unwind at runtime, but rather records at regular
|
||||
intervals the program's stack, and all the auxiliary information that is needed
|
||||
to unwind later. This is done when running \lstbash{perf record}. Then, a
|
||||
subsequent call to \lstbash{perf report} unwinds the stack to analyze it; but
|
||||
at this point in time, the traced process is long dead. Thus, any PID-based
|
||||
approach, or any approach using \texttt{/proc} information will fail. However,
|
||||
as this was the easiest method, the first version of \ehelfs{} used those
|
||||
mechanisms; it took some code rewriting to move to a PID- and
|
||||
\texttt{/proc}-agnostic implementation.
|
||||
code only, apart from the benchmarking code. The major difficulty was to
|
||||
understand how \prog{perf} works. To avoid perturbing the traced program,
|
||||
\prog{perf} does not unwind at runtime, but rather records at regular intervals
|
||||
the program's stack, and all the auxiliary information that is needed to unwind
|
||||
later. This is done when running \lstbash{perf record}. Then, a subsequent call
|
||||
to \lstbash{perf report} unwinds the stack to analyze it; but at this point in
|
||||
time, the traced process is long dead. Thus, any PID-based approach, or any
|
||||
approach using \texttt{/proc} information will fail. However, as this was the
|
||||
easiest method, the first version of \ehelfs{} used those mechanisms; it took
|
||||
some code rewriting to move to a PID- and \texttt{/proc}-agnostic
|
||||
implementation.
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\subsection{Other explored methods}
|
||||
|
@ -1040,8 +1035,8 @@ also never have met the requirement of unwinding from fairly distributed
|
|||
locations anyway.
|
||||
|
||||
Another attempt was made using CSmith~\cite{csmith}, a random C code generator
|
||||
initially made for C compilers random testing. The idea was still to craft an
|
||||
interesting C program that would unwind on its own frequently, but to integrate
|
||||
designed for random testing on C compilers. The idea was still to craft a
|
||||
C program that would unwind on its own frequently, but to integrate
|
||||
CSmith-randomly generated C code within hand-written C snippets that
|
||||
would generate large enough FDEs and nested calls. This was abandoned as well
|
||||
as the call graph of CSmith-generated code is often far too small, and the
|
||||
|
@ -1105,25 +1100,24 @@ Table~\ref{table:bench_time}.
|
|||
\end{table}
|
||||
|
||||
The performance of \ehelfs{} is probably overestimated for a production-ready
|
||||
version, since \ehelfs{} do not handle all registers from the original DWARF
|
||||
file, and thus the \prog{libunwind} version must perform more computation.
|
||||
However, this overhead, although impossible to measure without first
|
||||
implementing support for every register, would probably not be that big, since
|
||||
most of the time is spent finding the relevant row. Support for every DWARF
|
||||
instruction, however, would not slow down the implementation at all, since
|
||||
every instruction would simply be compiled to x86\_64 without affecting the
|
||||
already supported code.
|
||||
version, since \ehelfs{} do not handle all the registers from the original
|
||||
DWARF, lightening the computation. However, this overhead, although impossible
|
||||
to measure without first implementing support for every register, would
|
||||
probably not be that big, since most of the time is spent finding the relevant
|
||||
row. Support for every DWARF instruction, however, would not slow down the
|
||||
implementation at all, since every instruction would simply be compiled to x86\_64
|
||||
without affecting the already supported code.
|
||||
|
||||
The fact that there is a sharp difference between cached and uncached
|
||||
\prog{libunwind} confirms that our experimental setup did not unwind at totally
|
||||
different locations every single time, and thus was not biased in this
|
||||
direction, since caching is still very efficient.
|
||||
|
||||
It is also worth noting that the compilation time of \ehelfs{} is also
|
||||
reasonably short. On the machine described in Section~\ref{ssec:bench_hw}, and
|
||||
without using multiple cores to compile, the various shared objects needed to
|
||||
run \prog{hackbench} --~that is, \prog{hackbench}, \prog{libc}, \prog{ld} and
|
||||
\prog{libpthread}~-- are compiled in an overall time of $25.28$ seconds.
|
||||
The compilation time of \ehelfs{} is also reasonable. On the machine
|
||||
described in Section~\ref{ssec:bench_hw}, and without using multiple cores to
|
||||
compile, the various shared objects needed to run \prog{hackbench} --~that is,
|
||||
\prog{hackbench}, \prog{libc}, \prog{ld} and \prog{libpthread}~-- are compiled
|
||||
in an overall time of $25.28$ seconds.
|
||||
|
||||
The unwinding errors observed are hard to investigate, but are most probably
|
||||
due to truncated stack records. Indeed, since \prog{perf} dumps the last $n$
|
||||
|
@ -1136,11 +1130,9 @@ the custom \prog{libunwind} implementation that were not spotted.
|
|||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\subsection{Measured compactness}\label{ssec:results_size}
|
||||
|
||||
A first measure of compactness was made in this report for one of the earliest
|
||||
working versions in Table~\ref{table:basic_eh_elf_space}.
|
||||
|
||||
The same data, generated for the latest version of \ehelfs, can be seen in
|
||||
Table~\ref{table:bench_space}.
|
||||
A first measure of compactness was made for one of the earliest working
|
||||
versions in Table~\ref{table:basic_eh_elf_space}. The same data, generated for
|
||||
the latest version of \ehelfs, can be seen in Table~\ref{table:bench_space}.
|
||||
|
||||
The effect of the outlining mentioned in Section~\ref{ssec:space_optim} is
|
||||
particularly visible in this table: \prog{hackbench} has a significantly bigger
|
||||
|
@ -1150,9 +1142,8 @@ times, compared to \eg{} \prog{libc}, in which the outlined data is reused a
|
|||
lot.
|
||||
|
||||
Just as with time performance, the measured compactness would be impacted by
|
||||
supporting every register, but probably not that much either, since most
|
||||
columns are concerned with the four supported registers (see
|
||||
Section~\ref{ssec:instr_cov}).
|
||||
supporting every register, but probably only slightly, since the four supported
|
||||
registers represent most columns --~see Section~\ref{ssec:instr_cov}.
|
||||
|
||||
\begin{table}[h]
|
||||
\centering
|
||||
|
|