Fix Guinness' reviews

Théophile Bastian 2018-08-20 15:38:10 +02:00
parent df7252238e
commit d4087865e6
2 changed files with 53 additions and 45 deletions

View file

@@ -8,12 +8,13 @@
 \subsection*{The general context}
-The standard debugging data format, DWARF, contains tables that, for a given
-instruction pointer (IP), permit to understand how the assembly instruction
-relates to the source code, where variables are currently allocated in memory
-or if they are stored in a register, what are their type and how to unwind the
-current stack frame. This information is generated when passing \eg{} the
-switch \lstbash{-g} to \prog{gcc} or equivalents.
+The standard debugging data format, DWARF (Debugging With Attributed Record
+Formats), contains tables permitting, for a given instruction pointer (IP), to
+understand how instructions from the assembly code relates to the original
+source code, where are variables currently allocated in memory or if they are
+stored in a register, what are their type and how to unwind the current stack
+frame. This information is generated when passing \eg{} the switch \lstbash{-g}
+to \prog{gcc} or equivalents.
 Even in stripped (non-debug) binaries, a small portion of DWARF data remains:
 the stack unwinding data. This information is necessary to unwind stack
@@ -28,7 +29,7 @@ Section~\ref{ssec:instr_cov}~\textendash, consisting in offsets from memory
 addresses stored in registers (such as \reg{rbp} or \reg{rsp}). Yet, the
 standard defines rules that take the form of a stack-machine expression that
 can access virtually all the process's memory and perform Turing-complete
-computation~\cite{oakley2011exploiting}.
+computations~\cite{oakley2011exploiting}.
 \subsection*{The research problem}
@@ -83,8 +84,8 @@ few samples (around $10\,\mu s$ per frame) to avoid statistical errors. Having
 enough samples for this purpose --~at least a few thousands~-- is not easy,
 since one must avoid unwinding the same frame over and over again, which would
 only benchmark the caching mechanism. The other problem is to distribute
-evenly the unwinding measures across the various IPs, including directly into
-the loaded libraries (\eg{} the \prog{libc}).
+evenly the unwinding measures across the various IPs, among which those
+directly located into the loaded libraries (\eg{} the \prog{libc}).
 The solution eventually chosen was to modify \prog{perf}, the standard
 profiling program for Linux, in order to gather statistics and benchmarks of
 its unwindings. Modifying \prog{perf} was an additional challenge that turned
@@ -131,7 +132,7 @@ the compiled DWARF version (see Section~\ref{ssec:timeperf}).
 The implementation, however, is not yet production-ready: it only supports the
 x86\_64 architecture, and relies to some extent on the Linux operating system.
 None of these pose a fundamental problem. Supporting other processor
-architectures and ABIs are only a matter of engineering,. The operating system
+architectures and ABIs are only a matter of engineering. The operating system
 dependency is only present in the libraries developed in order to interact with
 the compiled unwinding data, which can be developed for virtually any operating
 system.

View file

@@ -108,7 +108,7 @@ the location of the return address. Then, the compiler might use \reg{rbp}
 the function, and allows for easy addressing of local variables. To some
 extents, it also allows for hot debugging, such as saving a useful core dump
 upon segfault. Yet, using \reg{rbp} to save \reg{rip} wastes a register, and
-the decision of using it is, on x86\_64 System V, up to the compiler.
+the decision of using it, on x86\_64 System V, is up to the compiler.
 Usually, a function starts by subtracting some value to \reg{rsp}, allocating
 some space in the stack frame for its local variables. Then, it saves on the
@@ -150,7 +150,7 @@ compiler is free to do as it wishes. Even worse, it is not trivial to know
 callee-saved registers were at all, since if the function does not alter a
 register, it does not have to save it.
-With this example, it seems pretty clear tha some additional data is necessary
+With this example, it seems pretty clear that some additional data is necessary
 to perform stack unwinding reliably, without only performing a guesswork. This
 data is stored along with the debugging information of a program, and one
 common format of debugging data is DWARF\@.
@@ -218,22 +218,23 @@ that is, $300\,\text{ms}$ per second of program run with default settings.
 One of the causes that inspired this internship were also Stephen Kell's
 \prog{libcrunch}~\cite{kell2016libcrunch}, which makes a heavy use of stack
-unwinding through \prog{libunwind} and was forced to force \prog{gcc} to use a
-frame pointer (\reg{rbp}) everywhere through \lstbash{-fno-omit-frame-pointer}
-in order to mitigate the slowness.
+unwinding through \prog{libunwind} and had to force \prog{gcc} to use a frame
+pointer (\reg{rbp}) everywhere through \lstbash{-fno-omit-frame-pointer} in
+order to mitigate the slowness.
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsection{DWARF format}
 The DWARF format was first standardized as the format for debugging information
-of the ELF executable binaries, which are standard on UNIX-like systems,
-including Linux and MacOS --~but not Windows. It is now commonly used across a
-wide variety of binary formats to store debugging information. As of now, the
-latest DWARF standard is DWARF 5~\cite{dwarf5std}, which is openly accessible.
+of the ELF executable binaries (Extensible Linking Format), which are standard
+on UNIX-like systems, including Linux and MacOS --~but not Windows. It is now
+commonly used across a wide variety of binary formats to store debugging
+information. As of now, the latest DWARF standard is DWARF 5~\cite{dwarf5std},
+which is openly accessible.
 The DWARF data commonly includes type information about the variables in the
 original programming language, correspondence of assembly instructions with a
-line in the original source file, \ldots
+line in the original source file, \ldots{}
 The format also specifies a way to represent unwinding data, as described in
 Section~\ref{ssec:stack_unwinding} above, in an ELF section originally called
 \lstc{.debug_frame}, but most often found as \ehframe.
@@ -776,8 +777,12 @@ would do after a \lstbash{frame n} command. Yet, if one was to enhance the
 code to handle every register, it would not be much harder and would probably
 be only a few hours worth of code refactoring and rewriting.
-\lstinputlisting[language=C, caption={Unwinding context}, label={lst:unw_ctx}]
+\begin{figure}[h]
+\centering{}
+\lstinputlisting[language=C, caption={Unwinding context},
+label={lst:unw_ctx}]
 {src/dwarf_assembly_context/unwind_context.c}
+\end{figure}
 In the unwind context from Listing~\ref{lst:unw_ctx}, the values of type
 \lstc{uintptr_t} are the values of the corresponding registers, and
@@ -808,10 +813,11 @@ scattered among various \ehelf{} files, one for each shared object loaded
 unwinder must first acquire a \emph{memory map}, a table listing the various
 ELF files loaded and \emph{mapped} in memory, and on which memory segment. This
 memory map is provided by the operating system --~for instance, on Linux, it is
-available as a file in \texttt{/proc}. Once this map is acquired, when
-unwinding from a given IP, the unwinder must identify the memory segment from
-which it comes, deduce the source ELF file, and deduce the corresponding
-\ehelf.
+available as a file in \texttt{/proc}, a special part of the file system that
+the kernel uses to communicate with the userland processes. Once this map is
+acquired, when unwinding from a given IP, the unwinder must identify the memory
+segment from which it comes, deduce the source ELF file, and deduce the
+corresponding \ehelf.
 \medskip
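The memory-map lookup described in the hunk above can be illustrated with a minimal sketch. It is not taken from the project's sources: the type `map_entry_t` and the functions `parse_maps_line`, `load_self_maps` and `find_mapping` are hypothetical names, and the sketch assumes the Linux `/proc/<pid>/maps` line layout (`start-end perms offset dev inode path`) on a 64-bit system.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* One mapped segment, as listed on one line of /proc/<pid>/maps.
 * Hypothetical type, for illustration only. */
typedef struct {
    unsigned long start, end, offset;
    char perms[5];
    char path[256]; /* empty for anonymous mappings */
} map_entry_t;

/* Parse one line of the form "start-end perms offset dev inode [path]".
 * Returns false if the mandatory fields are missing. */
bool parse_maps_line(const char *line, map_entry_t *e)
{
    e->path[0] = '\0';
    int n = sscanf(line, "%lx-%lx %4s %lx %*s %*s %255[^\n]",
                   &e->start, &e->end, e->perms, &e->offset, e->path);
    return n >= 4;
}

/* Read the memory map of the current process (Linux only).
 * Returns the number of entries stored in `maps`. */
size_t load_self_maps(map_entry_t *maps, size_t capacity)
{
    FILE *f = fopen("/proc/self/maps", "r");
    if (!f)
        return 0;
    char line[512];
    size_t n = 0;
    while (n < capacity && fgets(line, sizeof line, f))
        if (parse_maps_line(line, &maps[n]))
            n++;
    fclose(f);
    return n;
}

/* Find the segment containing `ip`; its path names the source ELF file,
 * from which the matching unwinding data can then be located. */
const map_entry_t *find_mapping(const map_entry_t *maps, size_t count,
                                unsigned long ip)
{
    for (size_t i = 0; i < count; i++)
        if (ip >= maps[i].start && ip < maps[i].end)
            return &maps[i];
    return NULL;
}
```

A linear scan is used here for brevity; since the map is sorted by address, a binary search would also work.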
@@ -834,7 +840,7 @@ well on the standard cases that are easily tested, and can be used to unwind
 the stack of simple programs.
 The major drawback of this approach, without any particular care taken, is the
-space waste. The space taken by those tentative \ehelfs{} is analyzed in
+waste of space. The space taken by those tentative \ehelfs{} is analyzed in
 Table~\ref{table:basic_eh_elf_space} for \prog{hackbench}, a small program
 introduced later in Section~\ref{ssec:bench_perf}, and the libraries on which
 it depends.
@@ -877,21 +883,21 @@ the original program size ($65\,\%$).
 A lot of small space optimizations, such as filtering out empty FDEs, merging
 together the rows that are equivalent on all the registers kept, etc.\ were
-made in order to shrink the \ehelfs.
+made in order to shrink the size of the \ehelfs.
 \medskip
-The major optimization that most reduced the output size was to use an if/else
-tree implementing a binary search on the instruction pointer relevant
-intervals, instead of a single monolithic switch. In the process, we also
-\emph{outline} code whenever possible, that is, find out identical ``switch
-cases'' bodies --~which are not switch cases anymore, but \texttt{if}
-bodies~--, move them outside of the if/else tree, identify them by a label, and
-jump to them using a \lstc{goto}, which de-duplicates a lot of code and
-contributes greatly to the shrinking. In the process, we noticed that the vast
-majority of FDE rows are actually taken among very few ``common'' FDE rows. For
-instance, in the \prog{libc}, out of a total of $20827$ rows, only $302$
-($1.5\,\%$) unique rows remain after the outlining.
+The optimization that most reduced the output size was to use an if/else tree
+implementing a binary search on the instruction pointer relevant intervals,
+instead of a single monolithic switch. In the process, we also \emph{outline}
+code whenever possible, that is, find out identical ``switch cases'' bodies
+--~which are not switch cases anymore, but \texttt{if} bodies~--, move them
+outside of the if/else tree, identify them by a label, and jump to them using a
+\lstc{goto}, which de-duplicates a lot of code and contributes greatly to the
+shrinking. In the process, we noticed that the vast majority of FDE rows are
+actually taken among very few ``common'' FDE rows. For instance, in the
+\prog{libc}, out of a total of $20827$ rows, only $302$ ($1.5\,\%$) unique rows
+remain after the outlining.
 This makes this optimization really efficient, as seen later in
 Section~\ref{ssec:results_size}, but also makes it an interesting question
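The if/else tree with outlined bodies described in the hunk above could look roughly like the following hand-written sketch. It is not the compiler's actual output: the address intervals, the three-register `unwind_context_t`, the helper `load_word` and the function name `unwind_sketch` are all assumptions made for illustration, and 8-byte stack slots (x86\_64) are assumed throughout.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified, hypothetical unwinding context: three registers only. */
typedef struct {
    uintptr_t rip, rsp, rbp;
} unwind_context_t;

/* Read one machine word from the unwound process's memory. */
static uintptr_t load_word(uintptr_t addr)
{
    return *(const uintptr_t *)addr;
}

/* Sketch of a generated lookup function. The if/else tree performs a
 * binary search on made-up IP intervals; identical row bodies are
 * outlined below the tree and reached through goto. */
bool unwind_sketch(const unwind_context_t *ctx, uintptr_t pc,
                   unwind_context_t *out)
{
    if (pc < 0x1010) {
        if (pc < 0x1000)
            return false;          /* no unwinding data for this IP */
        if (pc < 0x1004)
            goto row_ra_only;      /* [0x1000, 0x1004) */
        goto row_rbp_pushed;       /* [0x1004, 0x1010) */
    } else {
        if (pc < 0x1020)
            goto row_rbp_frame;    /* [0x1010, 0x1020) */
        if (pc < 0x1024)
            goto row_ra_only;      /* [0x1020, 0x1024): shares a body */
        return false;
    }

row_ra_only:    /* CFA = rsp + 8; return address at CFA - 8 */
    out->rsp = ctx->rsp + 8;
    out->rbp = ctx->rbp;
    out->rip = load_word(out->rsp - 8);
    return true;

row_rbp_pushed: /* CFA = rsp + 16; rbp at CFA - 16; RA at CFA - 8 */
    out->rsp = ctx->rsp + 16;
    out->rbp = load_word(out->rsp - 16);
    out->rip = load_word(out->rsp - 8);
    return true;

row_rbp_frame:  /* CFA = rbp + 16; rbp at CFA - 16; RA at CFA - 8 */
    out->rsp = ctx->rbp + 16;
    out->rbp = load_word(out->rsp - 16);
    out->rip = load_word(out->rsp - 8);
    return true;
}
```

In this sketch, two distinct intervals jump to `row_ra_only`, so its body is emitted once instead of twice. The same sharing, applied to thousands of rows, is what the outlining relies on.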
@@ -999,7 +1005,8 @@ The program that was chosen for \prog{perf}-benchmarking is
 \prog{hackbench}~\cite{hackbenchsrc}. This small program is designed to
 stress-test and benchmark the Linux scheduler by spawning processes or threads
 that communicate with each other. It has the interest of generating stack
-activity, be linked against \prog{libc} and \prog{pthread}, and be very light.
+activity, being linked against \prog{libc} and \prog{pthread}, and being very
+light.
 \medskip
@@ -1059,7 +1066,8 @@ CSmith code is notoriously hard to understand and edit.
 All the measures in this report were made on a computer with an Intel Xeon
 E3-1505M v6 CPU, with a clock frequency of $3.00$\,GHz and 8 cores. The
 computer has 32\,GB of RAM, and care was taken never to fill it and start
-swapping.
+swapping --~using the hard drive to store data instead of the RAM when it is
+full, degrading harshly the performance.
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsection{Measured time performance}\label{ssec:timeperf}
@@ -1124,7 +1132,8 @@ The compilation time of \ehelfs{} is also reasonable. On the machine
 described in Section~\ref{ssec:bench_hw}, and without using multiple cores to
 compile, the various shared objects needed to run \prog{hackbench} --~that is,
 \prog{hackbench}, \prog{libc}, \prog{ld} and \prog{libpthread}~-- are compiled
-in an overall time of $25.28$ seconds.
+in an overall time of $25.28$ seconds, which a developer is probably prepared
+to wait for.
 The unwinding errors observed are hard to investigate, but are most probably
 due to truncated stack records. Indeed, since \prog{perf} dumps the last $n$
@@ -1182,7 +1191,7 @@ registers represent most columns --~see Section~\ref{ssec:instr_cov}.
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsection{Instructions coverage}\label{ssec:instr_cov}
-In order to determine which DWARF instructions are necessary to implement to
+In order to determine which DWARF instructions should be implemented to
 have meaningful results, as well as to assess the instruction coverage of our
 compiler and \ehelfs, we must look at real-world ELF files and inspect the
 instructions used.
@@ -1329,8 +1338,6 @@ The overall size of the project is
 statistics, benchmarking, testing and analyzing code modules add up to around
 1500 more lines.
-\pagebreak{}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 %%%% End main text content %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%