Few fixes

Théophile Bastian 2018-08-07 20:44:12 +02:00
parent 57a6cb9e8b
commit b761f360cc
2 changed files with 44 additions and 40 deletions

View file

@@ -40,13 +40,13 @@ can be quite costly.
This is often not a huge problem, as stack unwinding is mostly thought of as a
debugging procedure: when something behaves unexpectedly, the programmer might
-be interested in exploring the stack. Yet, stack unwinding might, in some
-cases, be performance-critical: for instance, profiler programs needs to
-perform a whole lot of stack unwindings. Even worse, exception handling relies
-on stack unwinding in order to find a suitable catch-block! For such
-applications, it might be desirable to find a different time/space trade-off,
-allowing a slightly space-heavier, but far more time-efficient unwinding
-procedure.
+be interested in opening their debugger and exploring the stack. Yet, stack
+unwinding might, in some cases, be performance-critical: for instance, profiler
+programs need to perform a whole lot of stack unwindings. Even worse,
+exception handling relies on stack unwinding in order to find a suitable
+catch-block! For such applications, it might be desirable to find a different
+time/space trade-off, allowing a slightly space-heavier, but far more
+time-efficient unwinding procedure.
This different trade-off is the question that I explored during this
internship: what good alternative trade-off is reachable when storing the stack
@@ -108,7 +108,7 @@ existing project using the \textit{de facto} standard library \prog{libunwind}.
The goal was to obtain a compiled version of unwinding data that was faster
than DWARF, reasonably heavier and reliable. The benchmarks mentioned have
yielded convincing results: on the experimental setup created (detailed later
-in this report), the compiled version is up to 25 times faster than the DWARF
+in this report), the compiled version is around 26 times faster than the DWARF
version, while it remains only around 2.5 times bigger than the original data.
Even though the implementation is more a research prototype than a release
@@ -132,7 +132,7 @@ system.
\subsection*{Summary and future work}
In most cases of everyday life, a slow stack unwinding is not a problem, nor
-even an annoyance. Yet, having a 25 times speed-up on stack unwinding-heavy
+even an annoyance. Yet, having a 26 times speed-up on stack unwinding-heavy
tasks, such as profiling, can be really useful to profile heavy programs,
particularly if one wants to profile many times in order to analyze the impact
of multiple changes. It can also be useful for exception-heavy programs. Thus,

View file

@@ -61,12 +61,13 @@ Under supervision of Francesco Zappa-Nardelli\\
On most platforms, programs make use of a \emph{call stack} to store
information about the nested function calls at the current execution point, and
-keep track of their nesting. Each function call has its own \emph{stack frame},
-an entry of the call stack, whose precise contents are often specified in the
-Application Binary Interface (ABI) of the platform, and left to various extents
-up to the compiler. Those frames are typically used for storing function
-arguments, machine registers that must be restored before returning, the
-function's return address and local variables.
+keep track of their nesting. This call stack is conventionally a contiguous
+memory space mapped close to the top of the address space. Each function
+call has its own \emph{stack frame}, an entry of the call stack, whose precise
+contents are often specified in the Application Binary Interface (ABI) of the
+platform, and left to various extents up to the compiler. Those frames are
+typically used for storing function arguments, machine registers that must be
+restored before returning, the function's return address and local variables.
On the x86\_64 platform, with which this report is mostly concerned, the
calling convention that is followed is defined in the System V
@@ -94,14 +95,16 @@ compiler might use \reg{rbp} (``base pointer'') to save this value of
\reg{rip}, by writing the old value of \reg{rbp} just below the return address
on the stack, then copying \reg{rsp} to \reg{rbp}. This makes it easy to find
the return address from anywhere within the function, and also allows for easy
-addressing of local variables.
+addressing of local variables. Yet, using \reg{rbp} to save \reg{rip} is not
+always done, since it somehow ``wastes'' a register. This decision is, on
+x86\_64 System V, up to the compiler.
Often, a function will start by subtracting some value from \reg{rsp}, allocating
some space in the stack frame for its local variables. Then, it will push on
the stack the values of the callee-saved registers that are overwritten later,
effectively saving them. Before returning, it will pop the values of the saved
-registers back to their original registers, then restoring \reg{rsp} to its
-former value.
+registers back to their original registers and restore \reg{rsp} to its former
+value.
\subsection{Stack unwinding}
@@ -128,7 +131,7 @@ Let us consider a stack with x86\_64 calling conventions, such as shown in
Figure~\ref{fig:call_stack}. Assuming the compiler decided here \emph{not} to
use \reg{rbp}, and assuming the function \eg{} allocates a buffer of 8
integers, the area allocated for local variables should be at least $32$ bytes
-long (for 4-bytes integers), and \reg{rip} will be pointing below this area.
+long (for 4-byte integers), and \reg{rsp} will be pointing below this area.
Short of analyzing the assembly code produced, there is no way to find where
the return address is stored, relative to \reg{rsp}, at some arbitrary point
of the function. Even when \reg{rbp} is used, there is no easy way to guess
@@ -160,18 +163,19 @@ Yet, stack unwinding (and thus debugging data) \emph{is not limited to
debugging}.
Another common usage is profiling. A profiling tool, such as \prog{perf} under
-Linux, is used to measure and analyze in which functions a program spends its
-time, identify bottlenecks and find out which parts are critical to optimize.
-To do so, modern profilers pause the traced program at regular, short
-intervals, inspect their stack, and determine which function is currently being
-run. They also often perform a stack unwinding to determine the call path to
-this function, to determine which function indirectly takes time: \eg, a
-function \lstc{fct_a} can call both \lstc{fct_b} and \lstc{fct_c}, which are
-quite heavy; spend practically no time directly in \lstc{fct_a}, but spend a
-lot of time in calls to the other two functions that were made by \lstc{fct_a}.
+Linux -- see Section~\ref{ssec:perf} --, is used to measure and analyze in
+which functions a program spends its time, identify bottlenecks and find out
+which parts are critical to optimize. To do so, modern profilers pause the
+traced program at regular, short intervals, inspect its stack, and determine
+which function is currently being run. They also often perform a stack
+unwinding to determine the call path to this function, and thus find out which
+function indirectly takes time: \eg, a function \lstc{fct_a} can call both
+\lstc{fct_b} and \lstc{fct_c}, which are quite heavy; the program then spends
+practically no time directly in \lstc{fct_a}, but a lot of time in calls to
+the other two functions that were made from \lstc{fct_a}.
Exception handling also requires a stack unwinding mechanism in most languages.
-Indeed, an exception is completely different from a \lstc{return}: while the
+Indeed, an exception is completely different from a \lstinline{return}: while the
latter returns to the previous function, the former can be caught by virtually
any function in the call path, at any point of the function. It is thus
necessary to be able to unwind frames, one by one, until a suitable
@@ -180,16 +184,16 @@ stack-unwinding library similar to \prog{libunwind} in its runtime.
Technically, exception handling could be implemented without any stack
unwinding, by using \lstc{setjmp}/\lstc{longjmp} mechanics~\cite{niditoexn}.
-However, this is not possible in C++ (and some other languages), because the
-stack needs to be properly unwound in order to trigger the destructors of
-stack-allocated objects. Furthermore, this is often undesirable: \lstc{setjmp}
-has a quite big overhead, which is introduced whenever a \lstc{try} block is
-encountered. Instead, it is often preferred to have strictly no overhead when
-no exception happens, at the cost of a greater overhead when an exception is
-actually fired (after all, they are supposed to be \emph{exceptional}). For
-more details on C++ exception handling, see~\cite{koening1990exception}
-(especially Section~16.5). Possible implementation mechanisms are also
-presented in~\cite{dinechin2000exn}.
+However, it is not possible to implement this straight away in C++ (and some
+other languages), because the stack needs to be properly unwound in order to
+trigger the destructors of stack-allocated objects. Furthermore, this is often
+undesirable: \lstc{setjmp} has quite a big overhead, which is introduced
+whenever a \lstc{try} block is encountered. Instead, it is often preferred to
+have strictly no overhead when no exception happens, at the cost of a greater
+overhead when an exception is actually fired (after all, they are supposed to
+be \emph{exceptional}). For more details on C++ exception handling,
+see~\cite{koening1990exception} (especially Section~16.5). Possible
+implementation mechanisms are also presented in~\cite{dinechin2000exn}.
In both of the previous cases, performance \emph{can} be a problem. In
the latter, a slow unwinding directly impacts the overall program performance,
@@ -815,7 +819,7 @@ Listing~\ref{lst:ex1_dw}, etc.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\subsection{Presentation of \prog{perf}}
+\subsection{Presentation of \prog{perf}}\label{ssec:perf}
\prog{Perf} is a \emph{profiler} that comes with the Linux ecosystem (actually,
\prog{perf} is developed within the Linux kernel source tree). A profiler is an