Factor out irrelevant footnotes and parentheses

Théophile Bastian 2018-08-16 00:26:59 +02:00
parent c5f1f8615b
commit 67b25ca038


@ -80,15 +80,15 @@ restored before returning, the function's return address and local variables.
On the x86\_64 platform, with which this report is mostly concerned, the
calling convention that is followed is defined in the System V
ABI~\cite{systemVabi} for the Unix-like operating systems (among which Linux).
ABI~\cite{systemVabi} for Unix-like operating systems, including Linux.
Under this calling convention, the first six arguments of a function are passed
in the registers \reg{rdi}, \reg{rsi}, \reg{rdx}, \reg{rcx}, \reg{r8},
\reg{r9}, while additional arguments are pushed onto the stack. It also defines
which registers may be overwritten by the callee, and which registers must be
restored before returning (which most of the time is done by pushing the
register value onto the stack in the function prelude, and restoring it just
before returning). Those preserved registers are \reg{rbx}, \reg{rsp},
\reg{rbp}, \reg{r12}, \reg{r13}, \reg{r14}, \reg{r15}.
restored before returning. This restoration, most of the time, is done by
pushing the register value onto the stack in the function prelude, and
restoring it just before returning. Those preserved registers are \reg{rbx},
\reg{rsp}, \reg{rbp}, \reg{r12}, \reg{r13}, \reg{r14}, \reg{r15}.
\begin{wrapfigure}{r}{0.4\textwidth}
\centering
@ -98,11 +98,8 @@ before returning). Those preserved registers are \reg{rbx}, \reg{rsp},
\end{wrapfigure}
The register \reg{rsp} is supposed to always point to the last used memory cell
in the stack, thus, when the process just enters a new function, \reg{rsp}
points right to the location of the return address\footnote{Remember that since
the stack grows \emph{downwards} in memory, the arrow of \reg{rsp} points
\emph{below} the RA cell in the figure, and yet the memory cell indexed is the
one \emph{above} in the drawing, that is, the RA.}. Then, the compiler might
in the stack. Thus, when the process just enters a new function, \reg{rsp}
points right to the location of the return address. Then, the compiler might
use \reg{rbp} (``base pointer'') to save this value of \reg{rip}, by writing
the old value of \reg{rbp} just below the return address on the stack, then
copying \reg{rsp} to \reg{rbp}. This makes it easy to find the return address
@ -148,8 +145,8 @@ Left apart analyzing the assembly code produced, there is no way to find where
the return address is stored, relatively to \reg{rsp}, at some arbitrary point
of the function. Even when \reg{rbp} is used, there is no easy way to guess
where each callee-saved register is stored in the stack frame, and worse, which
callee-saved registers were saved (since it is not necessary to save a register
that the function never touches).
callee-saved registers were saved, since it is optional to save a register
that the function never touches.
With this example, it seems pretty clear that it is often necessary to have
additional data to perform stack unwinding. This data is often stored among the
@ -171,11 +168,11 @@ context, by unwinding \lstc{fct_b}'s frame.
\medskip
Yet, stack unwinding (and thus debugging data) \emph{is not limited to
Yet, stack unwinding, and thus debugging data, \emph{is not limited to
debugging}.
Another common usage is profiling. A profiling tool, such as \prog{perf} under
Linux -- see Section~\ref{ssec:perf} --, is used to measure and analyze in
Linux --~see Section~\ref{ssec:perf}~--, is used to measure and analyze in
which functions a program spends its time, identify bottlenecks and find out
which parts are critical to optimize. To do so, modern profilers pause the
traced program at regular, short intervals, inspect its stack, and determine
@ -202,8 +199,8 @@ trigger the destructors of stack-allocated objects. Furthermore, this is often
undesirable: \lstc{setjmp} has quite a large overhead, which is incurred
whenever a \lstc{try} block is encountered. Instead, it is often preferred to
have strictly no overhead when no exception happens, at the cost of a greater
overhead when an exception is actually fired (after all, they are supposed to
be \emph{exceptional}). For more details on C++ exception handling,
overhead when an exception is actually fired --~after all, they are supposed to
be \emph{exceptional}. For more details on C++ exception handling,
see~\cite{koening1990exception} (especially Section~16.5). Possible
implementation mechanisms are also presented in~\cite{dinechin2000exn}.
@ -237,8 +234,8 @@ the previous paragraph, in an ELF section originally called
For any binary, debugging information can easily get quite large if no
attention is paid to keeping it as compact as possible. In this respect, DWARF
does an excellent job, and everything is stored in a very compact way. This,
however, as we will see, makes it both difficult to parse correctly (with \eg{}
variable-length integers) and quite slow to interpret.
however, as we will see, makes it both difficult to parse correctly and quite
slow to interpret.
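
To give a concrete idea of the parsing involved, DWARF encodes many operands as
LEB128 variable-length integers: each byte carries 7 bits of payload, least
significant group first, and its high bit tells whether another byte follows.
The following is a minimal sketch of an unsigned LEB128 decoder; the name
\lstc{decode_uleb128} and its error convention are assumptions made for this
illustration, not code taken from an existing library.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one unsigned LEB128 value from buf (at most len bytes).
 * Returns the number of bytes consumed, or 0 if the input is truncated
 * or does not fit in 64 bits. (Hypothetical helper, for illustration.) */
size_t decode_uleb128(const uint8_t *buf, size_t len, uint64_t *out)
{
    uint64_t result = 0;
    unsigned shift = 0;

    for (size_t i = 0; i < len; i++) {
        if (shift >= 64)
            return 0; /* more groups than a 64-bit value can hold */
        /* The low 7 bits are payload, least significant group first. */
        result |= (uint64_t)(buf[i] & 0x7f) << shift;
        /* A clear high bit marks the last byte of the value. */
        if ((buf[i] & 0x80) == 0) {
            *out = result;
            return i + 1;
        }
        shift += 7;
    }
    return 0; /* ran out of input before the terminating byte */
}
```

Since the length of each operand is only known once it has been decoded, a
parser cannot skip ahead to a given entry without reading everything before it.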
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{DWARF unwinding data}
@ -259,19 +256,15 @@ a table, that has a range of IPs on which it has authority. Most often, but not
necessarily, it corresponds to a single function in the original source code.
Each column of the table is a register (\eg{} \reg{rsp}), with two additional
special registers, CFA (Canonical Frame Address) and RA (Return Address),
containing respectively the base pointer of the current stack
frame\footnote{The CFA is most commonly thought of as the base pointer of the
frame, yet this is not enforced by DWARF\@. The CFA is used as an address from
which other registers will be deduced as offsets, and although it is supposed
to be the actual base pointer, it can be anything as long as it is close enough
to the addresses that will be deduced from it.} and the return address of the
current function (\ie{} for x86\_64, the unwound value of \reg{rip}, the
instruction pointer). Each row has a certain validity interval, on which it
describes accurate unwinding data. This range starts at the instruction pointer
it is associated with, and ends at the start IP of the next table row (or the
end IP of the current FDE if it was the last row). In particular, there can be
no ``IP hole'' within an FDE --~unlike FDEs themselves, which can leave holes
between them.
containing respectively the base pointer of the current stack frame and the
return address of the current function. For instance, on an x86\_64
architecture, RA would contain the unwound value of \reg{rip}, the instruction
pointer. Each row has a certain validity interval, on which it describes
accurate unwinding data. This range starts at the instruction pointer it is
associated with, and ends at the start IP of the next table row (or the end IP
of the current FDE if it was the last row). In particular, there can be no ``IP
hole'' within an FDE --~unlike FDEs themselves, which can leave holes between
them.
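
The validity intervals described above can be sketched in C as follows. This
is only an illustration under simplifying assumptions: the types
\lstc{struct fde} and \lstc{struct fde_row} and the function \lstc{find_row}
are hypothetical, each row is reduced to its start IP and a CFA offset, and
rows are assumed to be sorted by increasing start IP.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of one unwinding table (one FDE). */
struct fde_row {
    uintptr_t start_ip;   /* first IP at which this row is valid */
    int cfa_offset;       /* e.g. CFA = rsp + cfa_offset */
};

struct fde {
    uintptr_t end_ip;            /* first IP past the FDE's range */
    size_t n_rows;
    const struct fde_row *rows;  /* sorted by increasing start_ip */
};

/* Return the row valid at ip, or NULL if ip falls outside the FDE.
 * A row is valid from its start_ip up to the next row's start_ip
 * (or the FDE's end_ip for the last row). */
const struct fde_row *find_row(const struct fde *f, uintptr_t ip)
{
    if (f->n_rows == 0 || ip < f->rows[0].start_ip || ip >= f->end_ip)
        return NULL;

    /* Invariant: rows[lo].start_ip <= ip, and every row at index >= hi
     * starts after ip. */
    size_t lo = 0, hi = f->n_rows;
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (f->rows[mid].start_ip <= ip)
            lo = mid;
        else
            hi = mid;
    }
    return &f->rows[lo];
}
```

Because a row's interval ends exactly where the next one starts, the lookup
only needs the start IPs and the end of the FDE.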
\begin{figure}[h]
\begin{minipage}{0.45\textwidth}
@ -329,17 +322,17 @@ how the stack frame is constructed. When interpreting the generated \ehframe{}
with \lstbash{readelf -wF}, we obtain the (slightly edited)
Listing~\ref{lst:ex1_dw}. During the function prelude, \ie{} for $\mhex{615}
\leq \reg{rip} < \mhex{619}$, the stack frame only contains the return address,
thus the CFA is 8 bytes above \reg{rsp} (which was the value of \reg{rsp}
before the call, and is the topmost value of used space for this stack frame),
and the return address is precisely at \reg{rsp} --~that is, stored between
\reg{rsp} and $\reg{rsp} + 8$. Then, 8 integers of 4 bytes each (for
\lstc{fibo}, \lstc{pos} being optimized out) are allocated on the stack, which
puts the CFA 32 bytes above \reg{rsp}, and the return address still 8 bytes
below the CFA\@. Yet, \prog{gcc} decided to allocate a total space of 48 bytes
for the stack frame for memory alignment reasons, which means subtracting 40
bytes from \reg{rsp} (address $\mhex{615}$ in the assembly). Then, by the end of
the function, the local variables are discarded and \reg{rsp} is reset to its
value from the first row.
thus the CFA is 8 bytes above \reg{rsp}, and the return address is precisely at
\reg{rsp} --~that is, stored between \reg{rsp} and $\reg{rsp} + 8$. Then, the
contents of \lstc{fibo}, 8 integers of 4 bytes each, are allocated on the
stack, which puts the CFA 32 bytes above \reg{rsp}; the return address still
being 8 bytes below the CFA\@. The variable \lstc{pos} is optimized out in the
generated assembly code, thus no stack space is allocated for it. Yet,
\prog{gcc} decided to allocate a total space of 48 bytes for the stack frame
for memory alignment reasons, which means subtracting 40 bytes from \reg{rsp}
(address $\mhex{615}$ in the assembly). Then, by the end of the function, the
local variables are discarded and \reg{rsp} is reset to its value from the
first row.
However, DWARF data isn't actually stored as a table in the binary files, but
is instead stored as in Listing~\ref{lst:ex1_dwraw}. The first row has the
@ -425,9 +418,9 @@ These are the DWARF instructions used for CFI description, that is, the
instructions that contain the stack unwinding table information. The following
list is an exhaustive list of instructions from the DWARF5
specification~\cite{dwarf5std} concerning CFI, with reworded descriptions for
brevity and clarity. All these instructions are up to variants (most
brevity and clarity. All these instructions are up to variants --~most
instructions exist in multiple formats to handle various operands formatting,
to optimize space). Since we won't be talking about the underlying file format
to optimize space. Since we won't be talking about the underlying file format
here, those variations between \eg{} \dwcfa{advance\_loc1} and
\dwcfa{advance\_loc2} --~which differ only in the number of bytes of their
operand~-- are irrelevant and will be elided.
@ -517,10 +510,10 @@ only handled as register identifiers, so we can safely state that $\reg{reg}
A value can then be undefined, stored at memory address $x$ or be directly a
value $x$, $x$ being here a simple expression consisting of $\reg{reg} +
\textit{offset}$. The CFA is considered a simple register here. For instance, to
define $\reg{rax}$ to the value contained in memory 16 bytes below the CFA, we
would have $\reg{rax} \mapsto \valaddr{\reg{CFA}, -16}$ (for the stack grows
downwards).
\textit{offset}$. The CFA is considered a simple register here. For instance,
to define $\reg{rax}$ to the value contained in memory 16 bytes below the CFA,
we would have $\reg{rax} \mapsto \valaddr{\reg{CFA}, -16}$, since the stack
grows downwards.
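
The three kinds of values above can be illustrated by a small C evaluator.
This is a sketch under assumptions: the names \lstc{rule_kind} and
\lstc{eval_rule} are hypothetical, the base register (or the CFA) is assumed
to be already resolved to a number, and memory is read from the current
address space.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical encoding of the three kinds of values. */
enum rule_kind {
    RULE_UNDEFINED, /* the register cannot be recovered */
    RULE_VALADDR,   /* the value is stored in memory at base + offset */
    RULE_VALUE      /* the value is base + offset itself */
};

/* Evaluate one rule. base is the already-resolved value of the register
 * (or CFA) the rule refers to. Returns false when the rule is undefined. */
bool eval_rule(enum rule_kind kind, uintptr_t base, intptr_t offset,
               uintptr_t *out)
{
    uintptr_t x = base + (uintptr_t)offset; /* the expression reg + offset */

    switch (kind) {
    case RULE_VALUE:
        *out = x;
        return true;
    case RULE_VALADDR:
        *out = *(const uintptr_t *)x; /* read the word stored at x */
        return true;
    case RULE_UNDEFINED:
    default:
        return false;
    }
}
```

With this sketch, the example rule for $\reg{rax}$ would correspond to
\lstc{eval_rule(RULE_VALADDR, cfa, -16, &rax)}.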
\subsection{Target language: a C function body}
@ -533,10 +526,10 @@ execution stack, or even the heap.
This function takes as arguments an instruction pointer --~supposedly
extracted from $\reg{rip}$~-- and an array of register values; and returns a
fresh array of register values after unwinding this call frame. The function is
compositional\footnote{up to technicalities: the IP obtained after unwinding the
first frame might be handled in a different dynamically loaded object, and this
would require inspecting the DWARF located in another file}: it can be called
twice in a row to unwind two stack frames.
compositional: it can be called twice in a row to unwind two stack frames,
unless the IP obtained after the first unwinding comes from another shared
object file, for instance a call to \prog{libc}. In this case, unwinding the
second frame will require loading the corresponding DWARF information.
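
As an aside, this compositionality can be sketched with a stand-in unwinder.
In the sketch below, \lstc{fp_unwind} is \emph{not} the function generated
from DWARF: it is a hypothetical frame-pointer unwinder, valid only when every
frame saves \reg{rbp} just below its return address. The register layout
\lstc{struct regs} and the driver \lstc{unwind_n} are likewise assumptions
made for the illustration.

```c
#include <stdint.h>

/* Hypothetical register array layout, for illustration only. */
enum { REG_RIP, REG_RSP, REG_RBP, REG_COUNT };
struct regs { uintptr_t r[REG_COUNT]; };

/* An unwinder maps (ip, registers) to the caller's registers. */
typedef struct regs (*unwind_fn)(uintptr_t ip, struct regs old);

/* Stand-in unwinder assuming the classic rbp-based frame layout:
 * [rbp] holds the caller's rbp, [rbp + one word] the return address. */
struct regs fp_unwind(uintptr_t ip, struct regs old)
{
    (void)ip; /* this layout does not depend on the IP once rbp is set */
    const uintptr_t *frame = (const uintptr_t *)old.r[REG_RBP];
    struct regs out;

    out.r[REG_RBP] = frame[0];
    out.r[REG_RIP] = frame[1];
    /* The caller's rsp sits just above the return address. */
    out.r[REG_RSP] = old.r[REG_RBP] + 2 * sizeof(uintptr_t);
    return out;
}

/* Unwind n frames: the rip recovered for one frame is the IP used to
 * unwind the next one. */
struct regs unwind_n(unwind_fn unwind, struct regs ctx, unsigned n)
{
    for (unsigned i = 0; i < n; i++)
        ctx = unwind(ctx.r[REG_RIP], ctx);
    return ctx;
}
```

A real driver would additionally check, at each step, which object file the
new IP belongs to, and switch to that file's unwinding data when it changes.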
The function is the following:
@ -636,8 +629,8 @@ $F\left[0 \ldots |F|-2\right] \extrarrow{reg} \bullet$.
\semI{\dwcfa{nop()} \cdot d}{s}(F) &:= \contsem{F}\\
\end{align*}
(The stack is used for \texttt{remember\_state} and \texttt{restore\_state}. If
we omit those two operations, we can plainly remove the stack).
The stack is used for \texttt{remember\_state} and \texttt{restore\_state}. If
we omit those two operations, we can plainly remove the stack.
\subsection{From $\intermedlang$ to C}
@ -694,8 +687,9 @@ machine code on the x86\_64 platform.
The rough idea of the compilation is to produce, out of the \ehframe{} section
of a binary, C code that resembles the code shown in the DWARF semantics from
Section~\ref{sec:semantics} above. This C code is then compiled by GCC in
\lstbash{-O2} mode\footnote{Compiling in \lstbash{-O3} takes way too much
time.}, providing for free all the optimization passes of a modern compiler.
\lstbash{-O2} mode, since it already provides a good level of optimization and
compiling in \lstbash{-O3} takes way too much time. This saves us the trouble
of optimizing the generated C code whenever GCC does that by itself.
The generated code consists of a single monolithic function, \lstc{_eh_elf},
taking as arguments an instruction pointer and a memory context (\ie{} the
@ -715,18 +709,18 @@ return it.
A setting of the compiler also optionally enables another parameter to the
\lstc{_eh_elf} function, \lstc{deref}, which is a function pointer. This
\lstc{deref} function, when enabled, replaces everywhere the dereferencing
\lstc{deref} function, when present, replaces every use of the dereferencing
\lstc{*} operator, and can be used to generate \ehelfs{} that will work on
remote address spaces (\ie{} whenever the unwinding is not done on the process
reading the \ehelf{} itself, but some other process, or even on a stack dump of
a long-terminated process).
remote address spaces, that is, whenever the unwinding is not done on the
process reading the \ehelf{} itself, but some other process, or even on a stack
dump of a long-terminated process.
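
To illustrate the idea, the sketch below shows how a dereferencing callback
lets the same unwinding code read either the current address space or a saved
stack dump. The signature \lstc{deref_fn} and all the names used here are
assumptions made for this example; the actual prototype expected by
\lstc{_eh_elf} may differ.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed callback shape: read one machine word at a target address. */
typedef uintptr_t (*deref_fn)(uintptr_t addr);

/* Local unwinding: the address is valid in our own address space. */
uintptr_t deref_local(uintptr_t addr)
{
    return *(const uintptr_t *)addr;
}

/* Offline unwinding: addresses refer to a dumped stack region. */
static const unsigned char *dump_bytes; /* copy of the remote memory */
static uintptr_t dump_base;             /* remote address of dump_bytes[0] */
static size_t dump_len;

void set_dump(const unsigned char *bytes, uintptr_t base, size_t len)
{
    dump_bytes = bytes;
    dump_base = base;
    dump_len = len;
}

uintptr_t deref_dump(uintptr_t addr)
{
    uintptr_t value = 0;

    if (addr < dump_base)
        return 0; /* outside the dump; this sketch cannot signal errors */
    size_t off = addr - dump_base;
    if (off > dump_len || dump_len - off < sizeof value)
        return 0;
    memcpy(&value, dump_bytes + off, sizeof value);
    return value;
}

/* Example consumer: the return address is stored one word below the CFA. */
uintptr_t read_return_address(uintptr_t cfa, deref_fn deref)
{
    return deref(cfa - sizeof(uintptr_t));
}
```

The unwinding logic in \lstc{read_return_address} is unchanged between both
settings; only the callback passed to it differs.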
Unlike in the \ehframe, and unlike what should be done in a release,
real-world-proof version of the \ehelfs, the choice was made to keep this
prototype simple, and only handle the few registers that were needed to simply
unwind the stack. Thus, the only registers handled in \ehelfs{} are \reg{rip},
\reg{rbp}, \reg{rsp} and \reg{rbx} (the latter being used quite often in
\prog{libc} to hold the CFA address). This is enough to unwind the stack, but
\reg{rbp}, \reg{rsp} and \reg{rbx}, the latter being used quite often in
\prog{libc} to hold the CFA address. This is enough to unwind the stack, but
is not sufficient to analyze every stack frame as \prog{gdb} would do after a
\lstbash{frame n} command.
@ -736,10 +730,9 @@ is not sufficient to analyze every stack frame as \prog{gdb} would do after a
In the unwind context from Listing~\ref{lst:unw_ctx}, the values of type
\lstc{uintptr_t} are the values of the corresponding registers, and
\lstc{flags} is an 8-bit value, indicating for each register whether it is
present or not in this context (\ie{} if the \lstc{rbx} bit is not set, the
value of \lstc{rbx} in the structure isn't meaningful), plus an error bit,
indicating whether an error occurred during unwinding (which can be due \eg{}
to an unsupported operation in the original DWARF, thus compiled to an error).
present or not in this context, plus an error bit, indicating whether an error
occurred during unwinding. Such errors can be due \eg{} to an unsupported
operation in the original DWARF\@.
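
As an illustration of how such a flags field can be consumed, here is a
minimal sketch. The bit assignments, the field order and the helper
\lstc{ctx_get_rbx} are hypothetical, and need not match the actual unwind
context structure.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit assignments for the 8-bit flags field. */
enum {
    CTX_HAS_RIP = 1u << 0,
    CTX_HAS_RSP = 1u << 1,
    CTX_HAS_RBP = 1u << 2,
    CTX_HAS_RBX = 1u << 3,
    CTX_ERROR   = 1u << 7  /* set when unwinding failed */
};

/* Hypothetical context layout, for illustration only. */
struct unwind_ctx {
    uint8_t flags;
    uintptr_t rip, rsp, rbp, rbx;
};

/* Read rbx only if unwinding succeeded and the register is present;
 * otherwise the stored value is not meaningful. */
bool ctx_get_rbx(const struct unwind_ctx *ctx, uintptr_t *out)
{
    if ((ctx->flags & CTX_ERROR) || !(ctx->flags & CTX_HAS_RBX))
        return false;
    *out = ctx->rbx;
    return true;
}
```

A caller would typically test the error bit first, then the presence bit of
each register it intends to read.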
This generated data is stored in separate shared object files, which we call
\ehelfs. It would have been possible to alter the original ELF file to embed
@ -827,12 +820,12 @@ made in order to shrink the \ehelfs.
The major optimization that most reduced the output size was to use an if/else
tree implementing a binary search on the program counter relevant intervals,
instead of a huge switch. In the process, we also \emph{outline} a lot of code,
that is, find out identical ``switch cases'' bodies (which are not switch cases
anymore, but if bodies), move them outside of the if/else tree, identify them
by a label, and jump to them using a \lstc{goto}, which de-duplicates a lot of
code and contributes greatly to the shrinking. In the process, we noticed that
the vast majority of FDE rows are actually taken among very few ``common'' FDE
rows.
that is, find identical ``switch case'' bodies --~which are no longer switch
cases, but \lstc{if} bodies~--, move them outside of the if/else tree,
identify them by a label, and jump to them using a \lstc{goto}, which
de-duplicates a lot of code and contributes greatly to the shrinking. While
doing so, we noticed that the vast majority of FDE rows are actually drawn from
a very small set of ``common'' FDE rows.
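
The shape of the resulting code can be sketched as follows. This is a
hand-written illustration, not actual generated output: the function
\lstc{cfa_offset_for} and its labels are hypothetical, each row body is
reduced to returning a CFA offset from \reg{rsp}, and the addresses
\lstc{0x700} and \lstc{0x701} are made up (only \lstc{0x615} and \lstc{0x619}
come from the earlier example).

```c
#include <stdint.h>

/* If/else tree performing a binary search on the IP; identical row
 * bodies are outlined behind labels and reached with goto. */
int cfa_offset_for(uintptr_t ip)
{
    if (ip < 0x619) {
        if (ip < 0x615)
            goto no_row;
        goto row_cfa_rsp_8;      /* 0x615 <= ip < 0x619 */
    } else {
        if (ip < 0x700)          /* 0x700 is a made-up address */
            goto row_cfa_rsp_48; /* 0x619 <= ip < 0x700 */
        if (ip < 0x701)
            goto row_cfa_rsp_8;  /* same body as the first row: shared */
        goto no_row;
    }

row_cfa_rsp_8:
    return 8;   /* CFA = rsp + 8 */
row_cfa_rsp_48:
    return 48;  /* CFA = rsp + 48 */
no_row:
    return -1;  /* no unwinding data for this IP */
}
```

Here the first and last rows share a single body, so it is emitted only once;
the more rows fall back on a few common bodies, the larger the saving.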
The small number of distinct rows makes this optimization really efficient, as seen later in
Section~\ref{ssec:results_size}, but also makes it an interesting question
@ -886,13 +879,12 @@ Listing~\ref{lst:ex1_dw}, etc.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Presentation of \prog{perf}}\label{ssec:perf}
\prog{Perf} is a \emph{profiler} that comes with the Linux ecosystem (actually,
\prog{perf} is developed within the Linux kernel source tree). A profiler is an
important tool from the developer's toolbox that analyzes the performance of
programs by recording the time spent in each function, including within nested
calls. This analysis often enables programmers to optimize critical paths and
functions in their programs, while leaving unoptimized functions that are
seldom traversed.
\prog{Perf} is a \emph{profiler} that comes with the Linux ecosystem, and is
even developed within the Linux kernel source tree. A profiler is an important
tool from the developer's toolbox that analyzes the performance of programs by
recording the time spent in each function, including within nested calls. This
analysis often enables programmers to optimize critical paths and functions in
their programs, while leaving unoptimized functions that are seldom traversed.
For this purpose, the basic idea is to stop the traced program at regular
intervals, unwind its stack, write down the current nested function calls, and
@ -924,16 +916,16 @@ activity, be linked against \prog{libc} and \prog{pthread}, and be very light.
Interfacing \ehelfs{} with \prog{perf} first required forking
\prog{libunwind} and implementing \ehelfs{} support in it. In the process, it
turned out to be necessary to slightly modify \prog{libunwind}'s interface to add a
parameter to a function, since \prog{libunwind} is made to be agnostic of the
system and process as much as possible, to be able to unwind in any context.
This very restricted information lacked a memory map (a table indicating which
shared object is mapped at which address in memory) in order to use \ehelfs.
Apart from this, the modified version of \prog{libunwind} produced is entirely
compatible with the vanilla version, meaning that the only modifications
required to use \ehelfs{} within any project using \prog{libunwind} should be
modifying one line of code (this function call, which is a setup function) and
linking against the modified version of \prog{libunwind} instead of the system
version.
parameter to an initialisation function, since \prog{libunwind} is made to be
agnostic of the system and process as much as possible, to be able to unwind in
any context. This very restricted information lacked a \emph{memory map}, a
table indicating which shared object is mapped at which address in memory, in
order to use \ehelfs. Apart from this, the modified version of \prog{libunwind}
produced is entirely compatible with the vanilla version. This means that the
only modifications required to use \ehelfs{} within any project using
\prog{libunwind} should be changing one line of code to add one parameter to a
function call and linking against the modified version of \prog{libunwind}
instead of the system version.
Once this was done, plugging it into \prog{perf} was a matter of only a few
lines of code, apart from the benchmarking code. The major problem encountered was
@ -984,9 +976,9 @@ swapping.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Measured time performance}
The benchmarking, as described in Section~\ref{ssec:bench_perf}, of \ehelfs{}
against the vanilla \prog{libunwind} (using the same methodology, only linking
\prog{perf} against the vanilla \prog{libunwind}), gives the results in
A benchmark of \ehelfs{} against the vanilla \prog{libunwind} was run using
the exact same methodology as in Section~\ref{ssec:bench_perf}, only linking
\prog{perf} against the vanilla \prog{libunwind}. It yields the results in
Table~\ref{table:bench_time}.
\begin{table}[h]
@ -1036,11 +1028,11 @@ instruction, however, would not slow down at all the implementation, since
every instruction would simply be compiled to x86\_64 without affecting the
already supported code.
It is also worth noting that on the machine described in
Section~\ref{ssec:bench_hw}, the compilation of the \ehelfs{} at a level of
\lstc{-O2} needed to run \prog{hackbench}, that is, \prog{hackbench},
\prog{libc}, \prog{ld}, and \prog{libpthread} takes an overall time of $25.28$
seconds (using only a single core).
It is also worth noting that the compilation time of \ehelfs{} is
reasonably short. On the machine described in Section~\ref{ssec:bench_hw}, and
without using multiple cores to compile, the various shared objects needed to
run \prog{hackbench} --~that is, \prog{hackbench}, \prog{libc}, \prog{ld} and
\prog{libpthread}~-- are compiled in an overall time of $25.28$ seconds.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Measured compactness}\label{ssec:results_size}
@ -1189,8 +1181,8 @@ only concerned about the columns CFA, \reg{rip}, \reg{rsp}, \reg{rbp} and
second row analyzes all the columns that were encountered, no matter whether
supported or not.
The Table~\ref{table:instr_types} analyzes the proportion of each command (\ie\
the formal way a register is set) for non-CFA columns in the sampled data. For
Table~\ref{table:instr_types} analyzes the proportion of each command
--~the formal way a register is set~-- for non-CFA columns in the sampled data. For
a brief explanation, \texttt{Offset} means stored at offset from CFA,
\texttt{Register} means the value from a machine register, \texttt{Expression}
means stored at the address of an expression's result, and the \texttt{Val\_}