Review and reword end of §1, §3 and §4

Théophile Bastian 2018-08-08 14:01:55 +02:00
parent b761f360cc
commit b128ddd571
3 changed files with 176 additions and 110 deletions


@ -88,16 +88,19 @@ before returning). Those preserved registers are \reg{rbx}, \reg{rsp},
conventions}\label{fig:call_stack}
\end{wrapfigure}
The register \reg{rsp} is supposed to always point just past the last used
memory cell in the stack, thus, when the process just enters a new function,
\reg{rsp} points 8 bytes after the location of the return address. Then, the
compiler might use \reg{rbp} (``base pointer'') to save this value of
\reg{rip}, by writing the old value of \reg{rbp} just below the return address
on the stack, then copying \reg{rsp} to \reg{rbp}. This makes it easy to find
the return address from anywhere within the function, and also allows for easy
addressing of local variables. Yet, using \reg{rbp} to save \reg{rip} is not
always done, since it somehow ``wastes'' a register. This decision is, on
x86\_64 System V, up to the compiler.
The register \reg{rsp} is supposed to always point to the last used memory cell
in the stack. Thus, when the process just enters a new function, \reg{rsp}
points right to the location of the return address\footnote{Remember that since
the stack grows \emph{downwards} in memory, the arrow of \reg{rsp} points
\emph{below} the RA cell in the figure, and yet the memory cell indexed is the
one \emph{above} in the drawing, that is, the RA.}. Then, the compiler might
use \reg{rbp} (``base pointer'') to keep track of this location, by writing
the old value of \reg{rbp} just below the return address on the stack, then
copying \reg{rsp} to \reg{rbp}. This makes it easy to find the return address
from anywhere within the function, and also allows for easy addressing of local
variables. Yet, using \reg{rbp} this way is not always done, since it
somewhat ``wastes'' a register. This decision is, on x86\_64 System V, up to the
compiler.
Often, a function will start by subtracting some value from \reg{rsp}, allocating
some space in the stack frame for its local variables. Then, it will push on
@ -242,52 +245,92 @@ when talking about DWARF, a register is merely a numerical identifier that is
often, but not necessarily, mapped to a real machine register by the ABI\@.
In practice, this data takes the form of a collection of tables, one table per
Frame Description Entry (FDE), which most often corresponds to a function. Each
column of the table is a register (\eg{} \reg{rsp}), with two additional
Frame Description Entry (FDE). An FDE, in turn, is a DWARF entry that describes
such a table and has authority over a given range of IPs. Most often, but not
necessarily, it corresponds to a single function in the original source code.
Each column of the table is a register (\eg{} \reg{rsp}), with two additional
special registers, CFA (Canonical Frame Address) and RA (Return Address),
containing respectively the base pointer of the current stack frame and the
return address of the current function (\ie{} for x86\_64, the unwound value of
\reg{rip}, the instruction pointer). Each row of the table is a particular
instruction pointer, within the instruction pointer range of the tabulated FDE
(assuming a FDE maps directly to a function, this range is simply the IP range
of the given function in the \lstc{.text} section of the binary), a row being
valid from its start IP to the start IP of the next row, or the end IP of the
FDE if it is the last row.
containing respectively the base pointer of the current stack
frame\footnote{The CFA is most commonly thought of as the base pointer of the
frame, yet this is not enforced by DWARF\@. The CFA is used as an address from
which other registers will be deduced as offsets, and although it is supposed
to be the actual base pointer, it can be anything as long as it is close enough
to the addresses that will be deduced from it.} and the return address of the
current function (\ie{} for x86\_64, the unwound value of \reg{rip}, the
instruction pointer). Each row has a certain validity interval, on which it
describes accurate unwinding data. This range starts at the instruction pointer
it is associated with, and ends at the start IP of the next table row (or the
end IP of the current FDE if it was the last row). In particular, there can be
no ``IP hole'' within an FDE --~unlike FDEs themselves, which can leave holes
between them.
\begin{minipage}{0.45\textwidth}
\lstinputlisting[language=C, firstline=3, lastline=12,
caption={Original C},label={lst:ex1_c}]
{src/fib7/fib7.c}
\end{minipage} \hfill \begin{minipage}{0.45\textwidth}
\lstinputlisting[language=C,caption={Processed DWARF},label={lst:ex1_dw}]
{src/fib7/fib7.fde}
\lstinputlisting[language=C,caption={Raw DWARF},label={lst:ex1_dwraw}]
{src/fib7/fib7.raw_fde}
\end{minipage}
\begin{figure}[h]
\begin{minipage}{0.45\textwidth}
\lstinputlisting[language=C, firstline=3, lastline=12,
caption={Original C},label={lst:ex1_c}]
{src/fib7/fib7.c}
\end{minipage} \hfill \begin{minipage}{0.45\textwidth}
\lstinputlisting[language=C,caption={Processed DWARF},
label={lst:ex1_dw}]
{src/fib7/fib7.fde}
\lstinputlisting[language=C,caption={Raw DWARF},label={lst:ex1_dwraw}]
{src/fib7/fib7.raw_fde}
\end{minipage}
\end{figure}
\begin{minipage}{0.45\textwidth}
\lstinputlisting[language={[x86masm]Assembler},lastline=11,
caption={Generated assembly},label={lst:ex1_asm}]
{src/fib7/fib7.s}
\end{minipage} \hfill \begin{minipage}{0.45\textwidth}
\lstinputlisting[language={[x86masm]Assembler},firstline=12,
firstnumber=last]
{src/fib7/fib7.s}
\end{minipage}
\begin{figure}[h]
\begin{minipage}{0.45\textwidth}
\lstinputlisting[language={[x86masm]Assembler},lastline=11,
caption={Generated assembly},label={lst:ex1_asm}]
{src/fib7/fib7.s}
\end{minipage} \hfill \begin{minipage}{0.45\textwidth}
\lstinputlisting[language={[x86masm]Assembler},firstline=12,
firstnumber=last]
{src/fib7/fib7.s}
\end{minipage}
\end{figure}
\begin{table}[h]
\centering
\begin{tabular}{|c|c|c|c|c|c}
\stackfhead{+ \mhex{30}}
& \stackfhead{+ \mhex{28}}
& \stackfhead{+ \mhex{20}}
& \stackfhead{+ \mhex{1c}}
& \stackfhead{+ \mhex{4}}
& \stackfhead{}
\\
\hline{}
Return Address & \textit{Alignment space}
& \spaced{2ex}{\lstc{fibo[7]}}
& \spaced{4ex}{\ldots}
& \spaced{2ex}{\lstc{fibo[0]}}
& \textit{Next frame}
\\
\hline
\end{tabular}
\caption{Stack frame schema}\label{table:ex1_stack_schema}
\end{table}
For instance, the C source code in Listing~\ref{lst:ex1_c} above, when compiled
with \lstbash{gcc -O1 -fomit-frame-pointer -fno-stack-protector}, yields the
assembly code in Listing~\ref{lst:ex1_asm}. When interpreting the generated
\ehframe{} with \lstbash{readelf -wF}, we obtain the (slightly edited)
assembly code in Listing~\ref{lst:ex1_asm}. The memory layout of the stack
frame is presented in Table~\ref{table:ex1_stack_schema}, to help understand
how the stack frame is constructed. When interpreting the generated \ehframe{}
with \lstbash{readelf -wF}, we obtain the (slightly edited)
Listing~\ref{lst:ex1_dw}. During the function prelude, \ie{} for $\mhex{615}
\leq \reg{rip} < \mhex{619}$, the stack frame only contains the return address,
thus the CFA is 8 bytes above \reg{rsp} (which was the value of \reg{rsp}
before the call), and the return address is precisely at \reg{rsp}. Then, 9
integers of 8 bytes each (8 for \lstc{fibo}, one for \lstc{pos}) are allocated
on the stack, which puts the CFA 80 bytes above \reg{rsp}, and the return
address still 8 bytes below the CFA\@. Then, by the end of the function, the
local variables are discarded and \reg{rsp} is reset to its value from the
first row.
before the call, and is the topmost address of the space used by this stack
frame), and the return address is precisely at \reg{rsp} --~that is, stored
between \reg{rsp} and $\reg{rsp} + 8$. Then, 8 integers of 4 bytes each (for
\lstc{fibo}, \lstc{pos} being optimized out) are allocated on the stack, that
is, 32 bytes of local variables. Yet, \prog{gcc} decided to allocate a total
space of 48 bytes for the stack frame for memory alignment reasons, which means
subtracting 40 bytes from \reg{rsp} (address $\mhex{615}$ in the assembly).
This puts the CFA 48 bytes above \reg{rsp}, and the return address still 8
bytes below the CFA\@. Then, by the end of the function, the local variables
are discarded and \reg{rsp} is reset to its value from the first row.
However, DWARF data isn't actually stored as a table in the binary files, but
is instead stored as in Listing~\ref{lst:ex1_dwraw}. The first row has the
@ -295,12 +338,12 @@ location of the first IP in the FDE, and must define at least its CFA\@. Then,
when all relevant registers are defined, it is possible to define a new row by
providing a location offset (\eg{} here $4$), and the new row is defined as a
clone of the previous one, which can then be altered (\eg{} here by setting
\lstc{CFA} to $\reg{rsp} + 80$). This means that every line is defined \wrt{}
\lstc{CFA} to $\reg{rsp} + 48$). This means that every line is defined \wrt{}
the previous one, and that the IPs of the successive rows cannot be determined
before evaluating every row before. Thus, unwinding a frame from an IP close to
the end of the frame will require evaluating pretty much every DWARF row in the
table before reaching the relevant information, slowing down drastically the
unwinding process.
without evaluating every row that comes before it in the first place. Thus,
unwinding a frame from an IP close to the end of the frame will require
evaluating pretty much every DWARF row in the table before reaching the
relevant information, drastically slowing down the unwinding process.
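The linear cost described above can be illustrated with a minimal sketch. It
assumes a hypothetical, heavily simplified instruction set in which only
\dwcfa{advance\_loc} and \dwcfa{def\_cfa\_offset} exist, with already-decoded
operands, and in which the CFA is always expressed as an offset from
\reg{rsp}. The names \lstc{cfa_insn}, \lstc{cfa_row} and \lstc{find_row} are
made up for this illustration and do not come from any real unwinder.
\begin{lstlisting}[language=C]
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified instruction set: only two DWARF CFA
 * operations are modelled, with already-decoded operands. */
enum cfa_op { OP_ADVANCE_LOC, OP_DEF_CFA_OFFSET };

struct cfa_insn {
    enum cfa_op op;
    uint64_t operand;
};

/* One table row, restricted to the CFA column, assumed to be
 * expressed as rsp + cfa_offset. */
struct cfa_row {
    uintptr_t start_ip;
    uint64_t cfa_offset;
};

/* Replays the instructions of one FDE from its first row until the
 * row covering `ip` is found. The cost is linear in the number of
 * rows that precede the target row. */
struct cfa_row find_row(uintptr_t fde_start, uint64_t initial_offset,
                        const struct cfa_insn *insns, size_t count,
                        uintptr_t ip)
{
    struct cfa_row row = { fde_start, initial_offset };
    assert(ip >= fde_start);
    for (size_t i = 0; i < count; i++) {
        switch (insns[i].op) {
        case OP_ADVANCE_LOC:
            /* The next row starts `operand` bytes further. If that
             * is past `ip`, the current row is the relevant one. */
            if (row.start_ip + insns[i].operand > ip)
                return row;
            row.start_ip += insns[i].operand;
            break;
        case OP_DEF_CFA_OFFSET:
            /* Alters the current row, a clone of the previous one. */
            row.cfa_offset = insns[i].operand;
            break;
        }
    }
    return row; /* `ip` lies in the last row */
}
\end{lstlisting}
With a first row starting at $\mhex{615}$ and defining the CFA as
$\reg{rsp} + 8$, followed by an advance of $4$ and a new offset of $48$, a
lookup for $\mhex{619}$ has to replay both instructions before it can answer.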
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{How big are FDEs?}
@ -377,8 +420,8 @@ brevity and clarity. All these instructions are up to variants (most
instructions exist in multiple formats to handle various operands formatting,
to optimize space). Since we won't be talking about the underlying file format
here, those variations between \eg{} \dwcfa{advance\_loc1} and
\dwcfa{advance\_loc2} ---~which differ only on the number of bytes of their
operand~--- are irrelevant and will be eluded.
\dwcfa{advance\_loc2} --~which differ only on the number of bytes of their
operand~-- are irrelevant and will be elided.
\begin{itemize}
\item{} \dwcfa{set\_loc(loc)}~:
@ -478,8 +521,8 @@ in the context of the program being unwound. In particular, it must be able to
dereference some pointer derived from DWARF instructions that will point to the
execution stack, or even the heap.
This function takes as arguments an instruction pointer ---~supposedly
extracted from $\reg{rip}$~--- and an array of register values; and returns a
This function takes as arguments an instruction pointer --~supposedly
extracted from $\reg{rip}$~-- and an array of register values; and returns a
fresh array of register values after unwinding this call frame. The function is
compositional\footnote{up to technicalities: the IP obtained after unwinding the
first frame might be handled in a different dynamically loaded object, and this
@ -641,25 +684,33 @@ machine code on the x86\_64 platform.
The rough idea of the compilation is to produce, out of the \ehframe{} section
of a binary, C code that resembles the code shown in the DWARF semantics from
Section~\ref{sec:semantics} above. This C code is then compiled by GCC,
providing for free all the optimization passes of a modern compiler.
Section~\ref{sec:semantics} above. This C code is then compiled by GCC in
\lstbash{-O2} mode\footnote{Compiling in \lstbash{-O3} takes way too much
time.}, providing for free all the optimization passes of a modern compiler.
The generated code consists in a single monolithic function, taking as
arguments an instruction pointer and a memory context (\ie{} the value of the
various machine registers) as defined in Listing~\ref{lst:unw_ctx}. The
function will then return a fresh memory context, containing the values the
registers hold after unwinding this frame.
The generated code consists of a single monolithic function, \lstc{_eh_elf},
taking as arguments an instruction pointer and a memory context (\ie{} the
value of the various machine registers) as defined in
Listing~\ref{lst:unw_ctx}. The function will then return a fresh memory
context, containing the values the registers hold after unwinding this frame.
The body of the function itself is mostly a huge switch, taking advantage of
the non-standard ---~yet widely implemented in C compilers~--- syntax for range
switches, in which each \lstc{case} can refer to a range. All the FDEs are
merged together into this switch, each row of a FDE being a switch case. The
cases then fill a context with unwound values, then return it.
the non-standard --~yet widely implemented in C compilers~-- syntax for range
switches, in which each \lstinline{case} can refer to a range. All the FDEs are
merged together into this switch, each row of an FDE being a switch case.
Separating the various FDEs in the C code --~other than with comments~-- is,
unlike what is done in DWARF, pointless, since accessing a ``row'' has a linear
cost, and the C code is not meant to be read, except maybe for debugging
purposes. The switch case bodies then fill a context with unwound values and
return it.
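As an illustration only, the shape of such a function could look like the
following sketch. The context type \lstc{sketch_ctx_t}, the position of the
error bit and the end addresses of the ranges are assumptions made for this
example; the actual context is the one from Listing~\ref{lst:unw_ctx}. The
\lstc{case low ... high} syntax is the GNU range extension mentioned above.
\begin{lstlisting}[language=C]
#include <assert.h>
#include <stdint.h>

/* Hypothetical, reduced unwinding context (assumed layout). */
typedef struct {
    uint8_t flags;       /* bit 7 is assumed to be the error bit */
    uintptr_t rip, rsp;
} sketch_ctx_t;

#define SKETCH_ERROR (1u << 7)

sketch_ctx_t sketch_eh_elf(uintptr_t pc, sketch_ctx_t ctx)
{
    sketch_ctx_t out = { 0, 0, 0 };
    assert(sizeof(uintptr_t) == 8); /* x86_64 is assumed */
    switch (pc) {
    /* One case per row; the end addresses are made up. */
    case 0x615 ... 0x618:           /* CFA = rsp + 8 */
        out.rsp = ctx.rsp + 8;      /* unwound rsp is the CFA */
        out.rip = *(const uintptr_t *)(out.rsp - 8); /* RA */
        return out;
    case 0x619 ... 0x64a:           /* CFA = rsp + 48 */
        out.rsp = ctx.rsp + 48;
        out.rip = *(const uintptr_t *)(out.rsp - 8);
        return out;
    case 0x64b:                     /* CFA = rsp + 8 again */
        out.rsp = ctx.rsp + 8;
        out.rip = *(const uintptr_t *)(out.rsp - 8);
        return out;
    default:                        /* no row covers this IP */
        out.flags = SKETCH_ERROR;
        return out;
    }
}
\end{lstlisting}
Each case body is self-contained, which keeps the generation trivial at the
price of a lot of duplicated code.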
An optionally enabled parameter can be used to pass a function pointer to a
dereferencing function, that conceptually does what the dereferencing \lstc{*}
operator does on a pointer, and is used to unwind a process that is not the
currently running process, and thus not sharing the same address space. A call
A setting of the compiler also optionally enables another parameter to the
\lstc{_eh_elf} function, \lstc{deref}, which is a function pointer. This
\lstc{deref} function, when enabled, replaces the dereferencing \lstc{*}
operator everywhere, and can be used to generate \ehelfs{} that will work on
remote address spaces (\ie{} whenever the unwinding is not done on the process
reading the \ehelf{} itself, but on some other process, or even on a stack dump
of a long-terminated process).
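To make the idea concrete, here is a minimal sketch of what a dereferencing
function reading from a recorded stack dump might look like. The signature
\lstc{sketch_deref_t}, the dump format and all helper names are assumptions
made for this example; the actual \lstc{deref} parameter may well be declared
differently.
\begin{lstlisting}[language=C]
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed signature: maps a remote address to the word stored there. */
typedef uintptr_t (*sketch_deref_t)(uintptr_t addr);

/* Recorded copy of a remote stack (hypothetical format). */
static const unsigned char *dump_bytes;
static uintptr_t dump_base; /* remote address of dump_bytes[0] */
static size_t dump_len;

void sketch_set_dump(const unsigned char *bytes, uintptr_t base,
                     size_t len)
{
    assert(bytes != NULL);
    dump_bytes = bytes;
    dump_base = base;
    dump_len = len;
}

/* Plays the role of `*` on the remote address space. Returns 0 when
 * the address falls outside the dump; a real implementation would
 * rather report an error. */
uintptr_t sketch_dump_deref(uintptr_t addr)
{
    uintptr_t value = 0;
    if (dump_len < sizeof value || addr < dump_base
            || addr - dump_base > dump_len - sizeof value)
        return 0;
    memcpy(&value, dump_bytes + (addr - dump_base), sizeof value);
    return value;
}

/* Reads the return address, stored at CFA - 8, through `deref`. */
uintptr_t sketch_read_ra(uintptr_t cfa, sketch_deref_t deref)
{
    return deref(cfa - 8);
}
\end{lstlisting}
The unwinding code never touches memory directly here, so the same generated
code can serve both local and remote unwinding, depending on the function
passed.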
Unlike in the \ehframe, and unlike what should be done in a release,
real-world-proof version of the \ehelfs, the choice was made to keep this
@ -675,20 +726,24 @@ is not sufficient to analyze every stack frame as \prog{gdb} would do after a
In the unwind context from Listing~\ref{lst:unw_ctx}, the values of type
\lstc{uintptr_t} are the values of the corresponding registers, and
\lstc{flags} is a 8-bytes value, indicating for each register whether it is
\lstc{flags} is an 8-bit value, indicating for each register whether it is
present or not in this context (\ie{} if the \lstc{rbx} bit is not set, the
value of \lstc{rbx} in the structure isn't meaningful), plus an error bit,
indicating whether an error occurred during unwinding.
indicating whether an error occurred during unwinding (which can be due \eg{}
to an unsupported operation in the original DWARF, thus compiled to an error).
This generated data is stored in separate shared object files, which we call
\ehelfs. It would have been possible to alter the original ELF file to embed
this data as a new section, but it getting it to be executed just as any
this data as a new section, but getting it to be executed just as any
portion of the \lstc{.text} section would probably have been painful, and
keeping it separated during the experimental phase is quite convenient. It is
possible to have multiple versions of \ehelfs{} files in parallel, with various
options turned on or off, and it doesn't require altering the base system by
editing \eg{} \texttt{/usr/lib/libc-*.so}. Instead, when the \ehelf{} data is
required, those files can simply be \lstc{dlopen}'d.
required, those files can simply be \lstc{dlopen}'d. It is also possible to
imagine, in a future production environment, packaging \ehelfs{} files
separately, so that people interested in heavy computation can have the choice
to install them.
\medskip
@ -705,15 +760,19 @@ generated for the C code in Listing~\ref{lst:ex1_c}.
Without any particular care to efficiency or compactness, it is already
possible to produce a compiled version very close to the one described in
Section~\ref{sec:semantics}. Although the unwinding speed cannot yet be
actually benchmarked, it is already possible to write in a few hundreds of line
of C a simple stack walker printing the functions traversed. It already works
actually benchmarked, it is already possible to write in a few hundred lines of
C code a simple stack walker printing the functions traversed. It already works
without any problem on the easily tested cases, since corner cases are mostly
found in standard and highly optimal libraries, and it is not that easy to get
found in standard and highly optimized libraries, and it is not that easy to get
the program to stop and print a stack trace from within a system library
without using a debugger.
The major drawback of this approach, without any particular care taken, is the
space waste.
space waste. The space taken by those tentative \ehelfs{} is analyzed in
Table~\ref{table:basic_eh_elf_space} for \prog{hackbench}, a small program
introduced later in Section~\ref{ssec:bench_perf}, and the libraries on which
it depends.
\begin{table}[h]
\centering
@ -736,11 +795,6 @@ space waste.
\caption{Basic \ehelfs{} space usage}\label{table:basic_eh_elf_space}
\end{table}
The space taken by those tentative \ehelfs{} is analyzed in
Table~\ref{table:basic_eh_elf_space} for \prog{hackbench}, a small program
introduced later in Section~\ref{ssec:bench_perf}, and the libraries on which
it depends.
The first column only includes the sizes of the ELF sections \lstc{.text} (the
program itself) and \lstc{.rodata}, the read-only data (such as static strings,
etc.). Only the weight of the \lstc{.text} section of the generated \ehelfs{}
@ -764,16 +818,17 @@ made in order to shrink the \ehelfs.
The major optimization that most reduced the output size was to use an if/else
tree implementing a binary search on the program counter relevant intervals,
instead of a huge switch. In the process, we also \emph{outline} a lot of code,
that is, find out identical code blocks, move them outside of the if/else tree,
identify them by a label, and jump to them using a \lstc{goto}, which
de-duplicates a lot of code and contributes greatly to the shrinking. In the
process, we noticed that the vast majority of FDE rows are actually taken among
very few ``common'' FDE rows.
that is, find identical ``switch case'' bodies (which are not switch cases
anymore, but \lstc{if} bodies), move them outside of the if/else tree, identify them
by a label, and jump to them using a \lstc{goto}, which de-duplicates a lot of
code and contributes greatly to the shrinking. In the process, we noticed that
the vast majority of FDE rows are actually taken among very few ``common'' FDE
rows.
This makes this optimization really efficient, as seen later in
Section~\ref{ssec:results_size}, but also makes it an interesting question ---
not investigated during this internship --- to find out whether standard DWARF
data could be efficiently compressed in this way.
Section~\ref{ssec:results_size}, but also makes it an interesting question
--~not investigated during this internship~-- to find out whether standard
DWARF data could be efficiently compressed in this way.
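The combination of a binary search tree and outlined bodies can be sketched as
follows. This is an illustration under assumptions, not the code actually
generated: the context type, the error bit, the labels and all addresses
(two made-up FDEs, $[\mhex{615}, \mhex{64c})$ and $[\mhex{650}, \mhex{690})$)
are hypothetical.
\begin{lstlisting}[language=C]
#include <assert.h>
#include <stdint.h>

/* Hypothetical, reduced unwinding context (assumed layout). */
typedef struct {
    uint8_t flags;       /* bit 7 is assumed to be the error bit */
    uintptr_t rip, rsp;
} sketch_ctx_t;

#define SKETCH_ERROR (1u << 7)

sketch_ctx_t sketch_eh_elf_tree(uintptr_t pc, sketch_ctx_t ctx)
{
    sketch_ctx_t out = { 0, 0, 0 };
    assert(sizeof(uintptr_t) == 8); /* x86_64 is assumed */

    /* Binary search on the row boundaries of both FDEs. */
    if (pc < 0x64b) {
        if (pc < 0x619) {
            if (pc < 0x615)
                goto err;           /* before the first FDE */
            goto cfa_rsp_8;
        }
        goto cfa_rsp_48;
    } else {
        if (pc < 0x654) {
            if (pc < 0x64c)
                goto cfa_rsp_8;
            if (pc < 0x650)
                goto err;           /* hole between the two FDEs */
            goto cfa_rsp_8;
        }
        if (pc < 0x690)
            goto cfa_rsp_48;
        goto err;                   /* past the last FDE */
    }

    /* Outlined bodies: each distinct row appears only once. */
cfa_rsp_8:
    out.rsp = ctx.rsp + 8;
    goto read_ra;
cfa_rsp_48:
    out.rsp = ctx.rsp + 48;
read_ra:
    out.rip = *(const uintptr_t *)(out.rsp - 8); /* RA at CFA - 8 */
    return out;
err:
    out.flags = SKETCH_ERROR;
    return out;
}
\end{lstlisting}
Five row intervals share only two bodies in this toy example, which hints at
why the outlining pays off when most rows are taken among a few common ones.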
\begin{minipage}{0.45\textwidth}
\lstinputlisting[language=C, caption={\ehelf{} for the previous example},
@ -806,15 +861,16 @@ However, unwinding over and over again from the same program point would have
had no interest at all, since \prog{libunwind} would have simply cached the
relevant DWARF row. At the same time, making sure that the various unwindings
are made from different locations is somehow cheating, since it defeats
\prog{libunwind}'s caching. All in all, the benchmarking method must have a
``natural'' distribution of unwindings.
\prog{libunwind}'s caching and does not reproduce ``real-world'' unwinding
distribution. All in all, the benchmarking method must have a ``natural''
distribution of unwindings.
Another requirement is to also distribute quite evenly the unwinding points
across the program: we would like to benchmark stack unwindings crossing some
standard library functions, starting from inside them, etc.
Finally, the unwound program must be interesting enough to enter and exit a lot
of function, nest function calls, have FDEs that are not as simple as in
of functions, nest function calls, have FDEs that are not as simple as in
Listing~\ref{lst:ex1_dw}, etc.
@ -864,19 +920,23 @@ system and process as much as possible, to be able to unwind in any context.
This very restricted information lacked a memory map (a table indicating which
shared object is mapped at which address in memory) in order to use \ehelfs.
Apart from this, the modified version of \prog{libunwind} produced is entirely
compatible with the vanilla version.
compatible with the vanilla version, meaning that the only modifications
required to use \ehelfs{} within any project using \prog{libunwind} should be
modifying one line of code (this function call, which is a setup function) and
linking against the modified version of \prog{libunwind} instead of the system
version.
Once this was done, plugging it in \prog{perf} was the matter of a few lines of
code only. The major problem encountered was to understand how \prog{perf}
works. In order to avoid perturbing the traced program, \prog{perf} does not
unwind at runtime, but rather records at regular interval the program's stack,
and all the auxiliary information that is needed to unwind later. This is done
when running \lstbash{perf record}. Then, \lstbash{perf report} unwinds the
stack to analyze it; but at this point of time, the traced process is long
dead, thus any PID-based approach, or any approach using \texttt{/proc}
information will fail. However, as this was the easiest method, this approach
was chosen when implementing the first version of \ehelfs; thus requiring some
code rewriting.
code only, apart from the benchmarking code. The major problem encountered was
to understand how \prog{perf} works. In order to avoid perturbing the traced
program, \prog{perf} does not unwind at runtime, but rather records at regular
intervals the program's stack, and all the auxiliary information that is needed
to unwind later. This is done when running \lstbash{perf record}. Then,
\lstbash{perf report} unwinds the stack to analyze it; but at this point in
time, the traced process is long dead, thus any PID-based approach, or any
approach using \texttt{/proc} information, will fail. However, as this was the
easiest method, the first version of \ehelfs{} used those mechanisms, which
later required some code rewriting.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Other explored methods}
@ -884,15 +944,15 @@ code rewriting.
The first benchmarking approach tried was to create some specific C code
that would meet the requirements from Section~\ref{ssec:bench_req}, while
itself calling a benchmarking procedure from time to time. This was abandoned
quite fast, because generating C code interesting enough to be unwound turned
out hard, and the generated FDEs invariably ended out uninteresting. It would
also never have met the requirement of unwinding from fairly distributed
quite quickly, because generating C code interesting enough to be unwound
turned out to be hard, and the generated FDEs invariably ended up uninteresting.
It would also never have met the requirement of unwinding from fairly distributed
locations anyway.
Another attempt was made using CSmith~\cite{csmith}, a random C code generator
initially made for random testing of C compilers. The idea was still to craft an
interesting C program that would unwind on its own frequently, but to integrate
randomly generated C code with CSmith to integrate interesting C snippets that
randomly generated CSmith code within hand-written C snippets, so that it
would generate large enough FDEs and nested calls. This was abandoned as well
as the call graph of CSmith-generated code is often far too small, and the
CSmith code is notoriously hard to understand and edit.


@ -7,3 +7,6 @@
\newcommand{\set}[1]{\left\{ #1 \right\}}
\newcommand{\card}[1]{\left\vert{} #1 \right\vert}
\newcommand{\abs}[1]{\left\vert{} #1 \right\vert}
\newcommand{\tnhead}[2]{\multicolumn{1}{#1}{#2}} % Table neutral head
\newcommand{\spaced}[2]{\hspace{#1} #2 \hspace{#1}}


@ -1,5 +1,8 @@
%% Specific commands for this project
\newcommand{\stackfhead}[1]
{\tnhead{l}{\hspace{-5ex}$\reg{rsp} #1$ \hspace{2em}}}
\newcommand{\prog}[1]{\texttt{#1}}
\newcommand{\ehelf}{\texttt{eh\_elf}}
\newcommand{\ehelfs}{\texttt{eh\_elfs}}