|
|
|
@ -108,7 +108,7 @@ the location of the return address. Then, the compiler might use \reg{rbp}
|
|
|
|
|
the function, and allows for easy addressing of local variables. To some
|
|
|
|
|
extent, it also allows for hot debugging, such as saving a useful core dump
|
|
|
|
|
upon segfault. Yet, using \reg{rbp} to save \reg{rip} wastes a register, and
|
|
|
|
|
the decision of using it is, on x86\_64 System V, up to the compiler.
|
|
|
|
|
the decision of using it, on x86\_64 System V, is up to the compiler.
|
|
|
|
|
|
|
|
|
|
Usually, a function starts by subtracting some value from \reg{rsp}, allocating
|
|
|
|
|
some space in the stack frame for its local variables. Then, it saves on the
|
|
|
|
@ -150,7 +150,7 @@ compiler is free to do as it wishes. Even worse, it is not trivial to know
|
|
|
|
|
callee-saved registers were at all, since if the function does not alter a
|
|
|
|
|
register, it does not have to save it.
|
|
|
|
|
|
|
|
|
|
With this example, it seems pretty clear tha some additional data is necessary
|
|
|
|
|
With this example, it seems pretty clear that some additional data is necessary
|
|
|
|
|
to perform stack unwinding reliably, rather than relying on guesswork. This
|
|
|
|
|
data is stored along with the debugging information of a program, and one
|
|
|
|
|
common format of debugging data is DWARF\@.
|
|
|
|
@ -218,22 +218,23 @@ that is, $300\,\text{ms}$ per second of program run with default settings.
|
|
|
|
|
|
|
|
|
|
Another motivation for this internship was Stephen Kell's
|
|
|
|
|
\prog{libcrunch}~\cite{kell2016libcrunch}, which makes heavy use of stack
|
|
|
|
|
unwinding through \prog{libunwind} and was forced to force \prog{gcc} to use a
|
|
|
|
|
frame pointer (\reg{rbp}) everywhere through \lstbash{-fno-omit-frame-pointer}
|
|
|
|
|
in order to mitigate the slowness.
|
|
|
|
|
unwinding through \prog{libunwind} and had to force \prog{gcc} to use a frame
|
|
|
|
|
pointer (\reg{rbp}) everywhere through \lstbash{-fno-omit-frame-pointer} in
|
|
|
|
|
order to mitigate the slowness.
|
|
|
|
|
|
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
|
|
|
\subsection{DWARF format}
|
|
|
|
|
|
|
|
|
|
The DWARF format was first standardized as the format for debugging information
|
|
|
|
|
of the ELF executable binaries, which are standard on UNIX-like systems,
|
|
|
|
|
including Linux and MacOS --~but not Windows. It is now commonly used across a
|
|
|
|
|
wide variety of binary formats to store debugging information. As of now, the
|
|
|
|
|
latest DWARF standard is DWARF 5~\cite{dwarf5std}, which is openly accessible.
|
|
|
|
|
of ELF (Executable and Linkable Format) executable binaries, which are standard
|
|
|
|
|
on UNIX-like systems, including Linux and MacOS --~but not Windows. It is now
|
|
|
|
|
commonly used across a wide variety of binary formats to store debugging
|
|
|
|
|
information. As of now, the latest DWARF standard is DWARF 5~\cite{dwarf5std},
|
|
|
|
|
which is openly accessible.
|
|
|
|
|
|
|
|
|
|
The DWARF data commonly includes type information about the variables in the
|
|
|
|
|
original programming language, correspondence of assembly instructions with a
|
|
|
|
|
line in the original source file, \ldots
|
|
|
|
|
line in the original source file, \ldots{}
|
|
|
|
|
The format also specifies a way to represent unwinding data, as described in
|
|
|
|
|
Section~\ref{ssec:stack_unwinding} above, in an ELF section originally called
|
|
|
|
|
\lstc{.debug_frame}, but most often found as \ehframe.
|
|
|
|
@ -776,8 +777,12 @@ would do after a \lstbash{frame n} command. Yet, if one was to enhance the
|
|
|
|
|
code to handle every register, it would not be much harder and would probably
|
|
|
|
|
be only a few hours' worth of code refactoring and rewriting.
|
|
|
|
|
|
|
|
|
|
\lstinputlisting[language=C, caption={Unwinding context}, label={lst:unw_ctx}]
|
|
|
|
|
{src/dwarf_assembly_context/unwind_context.c}
|
|
|
|
|
\begin{figure}[h]
|
|
|
|
|
\centering{}
|
|
|
|
|
\lstinputlisting[language=C, caption={Unwinding context},
|
|
|
|
|
label={lst:unw_ctx}]
|
|
|
|
|
{src/dwarf_assembly_context/unwind_context.c}
|
|
|
|
|
\end{figure}
|
|
|
|
|
|
|
|
|
|
In the unwind context from Listing~\ref{lst:unw_ctx}, the values of type
|
|
|
|
|
\lstc{uintptr_t} are the values of the corresponding registers, and
|
|
|
|
@ -808,10 +813,11 @@ scattered among various \ehelf{} files, one for each shared object loaded
|
|
|
|
|
unwinder must first acquire a \emph{memory map}, a table listing the various
|
|
|
|
|
ELF files loaded and \emph{mapped} in memory, and on which memory segment. This
|
|
|
|
|
memory map is provided by the operating system --~for instance, on Linux, it is
|
|
|
|
|
available as a file in \texttt{/proc}. Once this map is acquired, when
|
|
|
|
|
unwinding from a given IP, the unwinder must identify the memory segment from
|
|
|
|
|
which it comes, deduce the source ELF file, and deduce the corresponding
|
|
|
|
|
\ehelf.
|
|
|
|
|
available as a file in \texttt{/proc}, a special part of the file system that
|
|
|
|
|
the kernel uses to communicate with the userland processes. Once this map is
|
|
|
|
|
acquired, when unwinding from a given IP, the unwinder must identify the memory
|
|
|
|
|
segment from which it comes, deduce the source ELF file, and deduce the
|
|
|
|
|
corresponding \ehelf.
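
The lookup described above can be sketched as follows. This is only a minimal
illustration, assuming the Linux memory map format
(\texttt{start-end perms offset dev inode path}); the names
\texttt{map\_entry}, \texttt{parse\_map\_line} and \texttt{find\_segment} are
hypothetical and are not taken from the project's code. Paths containing spaces
are not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical record for one line of the memory map. */
struct map_entry {
    uintptr_t start;   /* first address of the segment */
    uintptr_t end;     /* one past the last address of the segment */
    uintptr_t offset;  /* offset of the segment within the mapped file */
    char path[256];    /* backing file; empty for anonymous mappings */
};

/* Parse one line formatted as "start-end perms offset dev inode path".
 * Returns 1 on success, 0 if the line is malformed. */
int parse_map_line(const char *line, struct map_entry *out)
{
    unsigned long start, end, offset;

    assert(line != NULL && out != NULL);
    out->path[0] = '\0';
    /* The permissions, device and inode fields are skipped with %*s. */
    int fields = sscanf(line, "%lx-%lx %*s %lx %*s %*s %255s",
                        &start, &end, &offset, out->path);
    if (fields < 3)
        return 0;
    out->start = (uintptr_t)start;
    out->end = (uintptr_t)end;
    out->offset = (uintptr_t)offset;
    return 1;
}

/* Return the segment containing ip, or NULL if no segment contains it.
 * A linear scan is used here for simplicity. */
const struct map_entry *find_segment(const struct map_entry *entries,
                                     size_t count, uintptr_t ip)
{
    assert(entries != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (entries[i].start <= ip && ip < entries[i].end)
            return &entries[i];
    }
    return NULL;
}
```

Once the segment is found, its \texttt{path} field names the source ELF file,
and \texttt{ip - start + offset} gives the position of the IP relative to that
file.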
|
|
|
|
|
|
|
|
|
|
\medskip
|
|
|
|
|
|
|
|
|
@ -834,7 +840,7 @@ well on the standard cases that are easily tested, and can be used to unwind
|
|
|
|
|
the stack of simple programs.
|
|
|
|
|
|
|
|
|
|
The major drawback of this approach, without any particular care taken, is the
|
|
|
|
|
space waste. The space taken by those tentative \ehelfs{} is analyzed in
|
|
|
|
|
waste of space. The space taken by those tentative \ehelfs{} is analyzed in
|
|
|
|
|
Table~\ref{table:basic_eh_elf_space} for \prog{hackbench}, a small program
|
|
|
|
|
introduced later in Section~\ref{ssec:bench_perf}, and the libraries on which
|
|
|
|
|
it depends.
|
|
|
|
@ -877,21 +883,21 @@ the original program size ($65\,\%$).
|
|
|
|
|
|
|
|
|
|
A lot of small space optimizations, such as filtering out empty FDEs, merging
|
|
|
|
|
together the rows that are equivalent on all the registers kept, etc.\ were
|
|
|
|
|
made in order to shrink the \ehelfs.
|
|
|
|
|
made in order to shrink the size of the \ehelfs.
|
|
|
|
|
|
|
|
|
|
\medskip
|
|
|
|
|
|
|
|
|
|
The major optimization that most reduced the output size was to use an if/else
|
|
|
|
|
tree implementing a binary search on the instruction pointer relevant
|
|
|
|
|
intervals, instead of a single monolithic switch. In the process, we also
|
|
|
|
|
\emph{outline} code whenever possible, that is, find out identical ``switch
|
|
|
|
|
cases'' bodies --~which are not switch cases anymore, but \texttt{if}
|
|
|
|
|
bodies~--, move them outside of the if/else tree, identify them by a label, and
|
|
|
|
|
jump to them using a \lstc{goto}, which de-duplicates a lot of code and
|
|
|
|
|
contributes greatly to the shrinking. In the process, we noticed that the vast
|
|
|
|
|
majority of FDE rows are actually taken among very few ``common'' FDE rows. For
|
|
|
|
|
instance, in the \prog{libc}, out of a total of $20827$ rows, only $302$
|
|
|
|
|
($1.5\,\%$) unique rows remain after the outlining.
|
|
|
|
|
The optimization that most reduced the output size was to use an if/else tree
|
|
|
|
|
implementing a binary search over the relevant instruction pointer intervals,
|
|
|
|
|
instead of a single monolithic switch. In the process, we also \emph{outline}
|
|
|
|
|
code whenever possible, that is, find identical ``switch cases'' bodies
|
|
|
|
|
--~which are not switch cases anymore, but \texttt{if} bodies~--, move them
|
|
|
|
|
outside of the if/else tree, identify them by a label, and jump to them using a
|
|
|
|
|
\lstc{goto}, which de-duplicates a lot of code and contributes greatly to the
|
|
|
|
|
shrinking. In the process, we noticed that the vast majority of FDE rows are
|
|
|
|
|
actually taken among very few ``common'' FDE rows. For instance, in the
|
|
|
|
|
\prog{libc}, out of a total of $20827$ rows, only $302$ ($1.5\,\%$) unique rows
|
|
|
|
|
remain after the outlining.
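
To illustrate the shape of such code, here is a hand-written sketch of an
if/else tree with outlined row bodies. It is an assumption-laden toy, not actual
compiler output: the three-register \texttt{unwind\_ctx}, the function name
\texttt{unwind\_step}, the four IP intervals and the two rows are all invented
for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified unwinding context: three registers only. */
struct unwind_ctx {
    uintptr_t rip, rsp, rbp;
};

/* Toy lookup covering the made-up IP range [0x1000, 0x1020), split into four
 * intervals that share only two distinct rows.
 * Returns 0 on success, -1 if the IP is not covered. */
int unwind_step(const struct unwind_ctx *in, struct unwind_ctx *out)
{
    const size_t word = sizeof(uintptr_t);
    uintptr_t pc;

    assert(in != NULL && out != NULL);
    pc = in->rip;
    *out = *in;

    if (pc < 0x1000 || pc >= 0x1020)
        return -1;

    /* Binary search over the interval bounds, written as nested tests. */
    if (pc < 0x1010) {
        if (pc < 0x1001)
            goto row_entry;    /* [0x1000, 0x1001) */
        else
            goto row_pushed;   /* [0x1001, 0x1010) */
    } else {
        if (pc < 0x1011)
            goto row_entry;    /* [0x1010, 0x1011): same row, body shared */
        else
            goto row_pushed;   /* [0x1011, 0x1020) */
    }

row_entry:
    /* Return address is on top of the stack; rbp is unchanged. */
    out->rip = *(const uintptr_t *)in->rsp;
    out->rsp = in->rsp + word;
    return 0;

row_pushed:
    /* The saved rbp is on top of the stack, the return address just above. */
    out->rbp = *(const uintptr_t *)in->rsp;
    out->rip = *(const uintptr_t *)(in->rsp + word);
    out->rsp = in->rsp + 2 * word;
    return 0;
}
```

Each row body appears once, however many intervals use it, which is where the
size reduction comes from when most rows are duplicates.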
|
|
|
|
|
|
|
|
|
|
This makes this optimization really efficient, as seen later in
|
|
|
|
|
Section~\ref{ssec:results_size}, but also makes it an interesting question
|
|
|
|
@ -999,7 +1005,8 @@ The program that was chosen for \prog{perf}-benchmarking is
|
|
|
|
|
\prog{hackbench}~\cite{hackbenchsrc}. This small program is designed to
|
|
|
|
|
stress-test and benchmark the Linux scheduler by spawning processes or threads
|
|
|
|
|
that communicate with each other. It has the advantage of generating stack
|
|
|
|
|
activity, be linked against \prog{libc} and \prog{pthread}, and be very light.
|
|
|
|
|
activity, being linked against \prog{libc} and \prog{pthread}, and being very
|
|
|
|
|
light.
|
|
|
|
|
|
|
|
|
|
\medskip
|
|
|
|
|
|
|
|
|
@ -1059,7 +1066,8 @@ CSmith code is notoriously hard to understand and edit.
|
|
|
|
|
All the measurements in this report were made on a computer with an Intel Xeon
|
|
|
|
|
E3-1505M v6 CPU, with a clock frequency of $3.00$\,GHz and 8 cores. The
|
|
|
|
|
computer has 32\,GB of RAM, and care was taken never to fill it and start
|
|
|
|
|
swapping.
|
|
|
|
|
swapping --~using the hard drive to store data instead of the RAM when it is
|
|
|
|
|
full, which severely degrades performance.
|
|
|
|
|
|
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
|
|
|
\subsection{Measured time performance}\label{ssec:timeperf}
|
|
|
|
@ -1124,7 +1132,8 @@ The compilation time of \ehelfs{} is also reasonable. On the machine
|
|
|
|
|
described in Section~\ref{ssec:bench_hw}, and without using multiple cores to
|
|
|
|
|
compile, the various shared objects needed to run \prog{hackbench} --~that is,
|
|
|
|
|
\prog{hackbench}, \prog{libc}, \prog{ld} and \prog{libpthread}~-- are compiled
|
|
|
|
|
in an overall time of $25.28$ seconds.
|
|
|
|
|
in an overall time of $25.28$ seconds, which a developer is probably prepared
|
|
|
|
|
to wait for.
|
|
|
|
|
|
|
|
|
|
The unwinding errors observed are hard to investigate, but are most probably
|
|
|
|
|
due to truncated stack records. Indeed, since \prog{perf} dumps the last $n$
|
|
|
|
@ -1182,7 +1191,7 @@ registers represent most columns --~see Section~\ref{ssec:instr_cov}.
|
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
|
|
|
\subsection{Instruction coverage}\label{ssec:instr_cov}
|
|
|
|
|
|
|
|
|
|
In order to determine which DWARF instructions are necessary to implement to
|
|
|
|
|
In order to determine which DWARF instructions should be implemented to
|
|
|
|
|
have meaningful results, as well as to assess the instruction coverage of our
|
|
|
|
|
compiler and \ehelfs, we must look at real-world ELF files and inspect the
|
|
|
|
|
instructions used.
|
|
|
|
@ -1329,8 +1338,6 @@ The overall size of the project is
|
|
|
|
|
statistics, benchmarking, testing and analyzing code modules add up to around
|
|
|
|
|
1500 more lines.
|
|
|
|
|
|
|
|
|
|
\pagebreak{}
|
|
|
|
|
|
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
|
|
|
%%%% End main text content %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
|
|
|