Fix Guinness' reviews
parent
df7252238e
commit
d4087865e6
2 changed files with 53 additions and 45 deletions
@@ -8,12 +8,13 @@
 
 \subsection*{The general context}
 
-The standard debugging data format, DWARF, contains tables that, for a given
-instruction pointer (IP), permit to understand how the assembly instruction
-relates to the source code, where variables are currently allocated in memory
-or if they are stored in a register, what are their type and how to unwind the
-current stack frame. This information is generated when passing \eg{} the
-switch \lstbash{-g} to \prog{gcc} or equivalents.
+The standard debugging data format, DWARF (Debugging With Attributed Record
+Formats), contains tables permitting, for a given instruction pointer (IP), to
+understand how instructions from the assembly code relate to the original
+source code, where variables are currently allocated in memory or if they are
+stored in a register, what their types are and how to unwind the current stack
+frame. This information is generated when passing \eg{} the switch \lstbash{-g}
+to \prog{gcc} or equivalents.
 
 Even in stripped (non-debug) binaries, a small portion of DWARF data remains:
 the stack unwinding data. This information is necessary to unwind stack
@@ -28,7 +29,7 @@ Section~\ref{ssec:instr_cov}~\textendash, consisting in offsets from memory
 addresses stored in registers (such as \reg{rbp} or \reg{rsp}). Yet, the
 standard defines rules that take the form of a stack-machine expression that
 can access virtually all the process's memory and perform Turing-complete
-computation~\cite{oakley2011exploiting}.
+computations~\cite{oakley2011exploiting}.
 
 \subsection*{The research problem}
 
@@ -83,8 +84,8 @@ few samples (around $10\,\mu s$ per frame) to avoid statistical errors. Having
 enough samples for this purpose --~at least a few thousands~-- is not easy,
 since one must avoid unwinding the same frame over and over again, which would
 only benchmark the caching mechanism. The other problem is to distribute
-evenly the unwinding measures across the various IPs, including directly into
-the loaded libraries (\eg{} the \prog{libc}).
+evenly the unwinding measures across the various IPs, including those located
+directly in the loaded libraries (\eg{} the \prog{libc}).
 The solution eventually chosen was to modify \prog{perf}, the standard
 profiling program for Linux, in order to gather statistics and benchmarks of
 its unwindings. Modifying \prog{perf} was an additional challenge that turned
@@ -131,7 +132,7 @@ the compiled DWARF version (see Section~\ref{ssec:timeperf}).
 The implementation, however, is not yet production-ready: it only supports the
 x86\_64 architecture, and relies to some extent on the Linux operating system.
 None of these pose a fundamental problem. Supporting other processor
-architectures and ABIs are only a matter of engineering,. The operating system
+architectures and ABIs are only a matter of engineering. The operating system
 dependency is only present in the libraries developed in order to interact with
 the compiled unwinding data, which can be developed for virtually any operating
 system.
@@ -108,7 +108,7 @@ the location of the return address. Then, the compiler might use \reg{rbp}
 the function, and allows for easy addressing of local variables. To some
 extents, it also allows for hot debugging, such as saving a useful core dump
 upon segfault. Yet, using \reg{rbp} to save \reg{rip} wastes a register, and
-the decision of using it is, on x86\_64 System V, up to the compiler.
+the decision of using it, on x86\_64 System V, is up to the compiler.
 
 Usually, a function starts by subtracting some value to \reg{rsp}, allocating
 some space in the stack frame for its local variables. Then, it saves on the
@@ -150,7 +150,7 @@ compiler is free to do as it wishes. Even worse, it is not trivial to know
 callee-saved registers were at all, since if the function does not alter a
 register, it does not have to save it.
 
-With this example, it seems pretty clear tha some additional data is necessary
+With this example, it seems pretty clear that some additional data is necessary
 to perform stack unwinding reliably, without only performing a guesswork. This
 data is stored along with the debugging information of a program, and one
 common format of debugging data is DWARF\@.
@@ -218,22 +218,23 @@ that is, $300\,\text{ms}$ per second of program run with default settings.
 
 One of the causes that inspired this internship were also Stephen Kell's
 \prog{libcrunch}~\cite{kell2016libcrunch}, which makes a heavy use of stack
-unwinding through \prog{libunwind} and was forced to force \prog{gcc} to use a
-frame pointer (\reg{rbp}) everywhere through \lstbash{-fno-omit-frame-pointer}
-in order to mitigate the slowness.
+unwinding through \prog{libunwind} and had to force \prog{gcc} to use a frame
+pointer (\reg{rbp}) everywhere through \lstbash{-fno-omit-frame-pointer} in
+order to mitigate the slowness.
 
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsection{DWARF format}
 
 The DWARF format was first standardized as the format for debugging information
-of the ELF executable binaries, which are standard on UNIX-like systems,
-including Linux and MacOS --~but not Windows. It is now commonly used across a
-wide variety of binary formats to store debugging information. As of now, the
-latest DWARF standard is DWARF 5~\cite{dwarf5std}, which is openly accessible.
+of the ELF executable binaries (Executable and Linkable Format), which are
+standard on UNIX-like systems, including Linux and MacOS --~but not Windows. It
+is now commonly used across a wide variety of binary formats to store debugging
+information. As of now, the latest DWARF standard is DWARF 5~\cite{dwarf5std},
+which is openly accessible.
 
 The DWARF data commonly includes type information about the variables in the
 original programming language, correspondence of assembly instructions with a
-line in the original source file, \ldots
+line in the original source file, \ldots{}
 The format also specifies a way to represent unwinding data, as described in
 Section~\ref{ssec:stack_unwinding} above, in an ELF section originally called
 \lstc{.debug_frame}, but most often found as \ehframe.
@@ -776,8 +777,12 @@ would do after a \lstbash{frame n} command. Yet, if one was to enhance the
 code to handle every register, it would not be much harder and would probably
 be only a few hours worth of code refactoring and rewriting.
 
-\lstinputlisting[language=C, caption={Unwinding context}, label={lst:unw_ctx}]
+\begin{figure}[h]
+\centering{}
+\lstinputlisting[language=C, caption={Unwinding context},
+label={lst:unw_ctx}]
 {src/dwarf_assembly_context/unwind_context.c}
+\end{figure}
 
 In the unwind context from Listing~\ref{lst:unw_ctx}, the values of type
 \lstc{uintptr_t} are the values of the corresponding registers, and
@@ -808,10 +813,11 @@ scattered among various \ehelf{} files, one for each shared object loaded
 unwinder must first acquire a \emph{memory map}, a table listing the various
 ELF files loaded and \emph{mapped} in memory, and on which memory segment. This
 memory map is provided by the operating system --~for instance, on Linux, it is
-available as a file in \texttt{/proc}. Once this map is acquired, when
-unwinding from a given IP, the unwinder must identify the memory segment from
-which it comes, deduce the source ELF file, and deduce the corresponding
-\ehelf.
+available as a file in \texttt{/proc}, a special part of the file system that
+the kernel uses to communicate with the userland processes. Once this map is
+acquired, when unwinding from a given IP, the unwinder must identify the memory
+segment from which it comes, deduce the source ELF file, and deduce the
+corresponding \ehelf.
 
 \medskip
 
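The memory map lookup described in this hunk can be sketched in C as follows. This is a minimal illustration, not the project's code: the names `map_entry`, `parse_maps_line`, `load_memory_map` and `find_segment` are hypothetical, and the parser assumes the usual Linux `/proc/<pid>/maps` line layout (address range, permissions, offset, device, inode, optional path). How the matching eh_elf is then derived from the ELF path is left out, since the text does not specify it.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* One segment of the memory map (hypothetical layout, for illustration). */
struct map_entry {
    uintptr_t start;  /* first address of the segment */
    uintptr_t end;    /* one past the last address of the segment */
    char path[256];   /* backing file; empty for anonymous mappings */
};

/* Parse one line in the /proc/<pid>/maps format, for instance
 *   "00400000-00452000 r-xp 00000000 08:02 173521 /usr/bin/foo"
 * Returns 1 on success, 0 if the address range cannot be read.
 * Paths containing spaces are cut at the first space in this sketch. */
int parse_maps_line(const char *line, struct map_entry *out)
{
    uint64_t start, end;
    out->path[0] = '\0';
    /* Permissions, offset, device and inode are skipped with %*s. */
    int fields = sscanf(line, "%" SCNx64 "-%" SCNx64 " %*s %*s %*s %*s %255s",
                        &start, &end, out->path);
    if (fields < 2)
        return 0;
    out->start = (uintptr_t)start;
    out->end = (uintptr_t)end;
    return 1;
}

/* Read a maps file (e.g. "/proc/self/maps") into a caller-provided array.
 * Returns the number of entries stored; 0 if the file cannot be opened. */
size_t load_memory_map(const char *maps_path, struct map_entry *map,
                       size_t capacity)
{
    FILE *f = fopen(maps_path, "r");
    if (f == NULL)
        return 0;
    char line[512];
    size_t count = 0;
    while (count < capacity && fgets(line, sizeof line, f) != NULL) {
        if (parse_maps_line(line, &map[count]))
            count++;
    }
    fclose(f);
    return count;
}

/* Find the segment containing the instruction pointer, or NULL if none. */
const struct map_entry *find_segment(const struct map_entry *map,
                                     size_t count, uintptr_t ip)
{
    for (size_t i = 0; i < count; i++) {
        if (map[i].start <= ip && ip < map[i].end)
            return &map[i];
    }
    return NULL;
}
```

A linear scan is enough for a sketch. Since the map changes rarely, a real unwinder would probably sort the entries once and search them by bisection.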
@@ -834,7 +840,7 @@ well on the standard cases that are easily tested, and can be used to unwind
 the stack of simple programs.
 
 The major drawback of this approach, without any particular care taken, is the
-space waste. The space taken by those tentative \ehelfs{} is analyzed in
+waste of space. The space taken by those tentative \ehelfs{} is analyzed in
 Table~\ref{table:basic_eh_elf_space} for \prog{hackbench}, a small program
 introduced later in Section~\ref{ssec:bench_perf}, and the libraries on which
 it depends.
@@ -877,21 +883,21 @@ the original program size ($65\,\%$).
 
 A lot of small space optimizations, such as filtering out empty FDEs, merging
 together the rows that are equivalent on all the registers kept, etc.\ were
-made in order to shrink the \ehelfs.
+made in order to shrink the size of the \ehelfs.
 
 \medskip
 
-The major optimization that most reduced the output size was to use an if/else
-tree implementing a binary search on the instruction pointer relevant
-intervals, instead of a single monolithic switch. In the process, we also
-\emph{outline} code whenever possible, that is, find out identical ``switch
-cases'' bodies --~which are not switch cases anymore, but \texttt{if}
-bodies~--, move them outside of the if/else tree, identify them by a label, and
-jump to them using a \lstc{goto}, which de-duplicates a lot of code and
-contributes greatly to the shrinking. In the process, we noticed that the vast
-majority of FDE rows are actually taken among very few ``common'' FDE rows. For
-instance, in the \prog{libc}, out of a total of $20827$ rows, only $302$
-($1.5\,\%$) unique rows remain after the outlining.
+The optimization that most reduced the output size was to use an if/else tree
+implementing a binary search on the instruction pointer relevant intervals,
+instead of a single monolithic switch. In the process, we also \emph{outline}
+code whenever possible, that is, find out identical ``switch cases'' bodies
+--~which are not switch cases anymore, but \texttt{if} bodies~--, move them
+outside of the if/else tree, identify them by a label, and jump to them using a
+\lstc{goto}, which de-duplicates a lot of code and contributes greatly to the
+shrinking. In the process, we noticed that the vast majority of FDE rows are
+actually taken among very few ``common'' FDE rows. For instance, in the
+\prog{libc}, out of a total of $20827$ rows, only $302$ ($1.5\,\%$) unique rows
+remain after the outlining.
 
 This makes this optimization really efficient, as seen later in
 Section~\ref{ssec:results_size}, but also makes it an interesting question
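The shape of code produced by the if/else tree with outlined bodies might look like the hand-written C sketch below. Everything in it is an assumption made for illustration: the `ctx` structure is far simpler than the real unwinding context, and the four IP intervals, their offsets and the name `lookup_row` are invented. It only shows the technique: a binary search over interval bounds, where identical row bodies are emitted once behind a label and reached with `goto`.

```c
#include <stdint.h>

/* Hypothetical, much-simplified unwinding context: only the canonical
 * frame address (CFA) is computed from rsp or rbp. */
struct ctx {
    uintptr_t rsp;
    uintptr_t rbp;
    uintptr_t cfa;
};

/* Invented table of four IP intervals:
 *   [0x1000, 0x1001)  cfa = rsp + 8
 *   [0x1001, 0x1004)  cfa = rsp + 16
 *   [0x1004, 0x1020)  cfa = rbp + 16
 *   [0x1020, 0x1021)  cfa = rsp + 8   (same body as the first row)
 * Returns 1 if a row covers ip, 0 otherwise. */
int lookup_row(uintptr_t ip, struct ctx *c)
{
    if (ip < 0x1000 || ip >= 0x1021)
        return 0; /* no unwinding data for this IP */

    /* Binary search on the interval bounds, written as an if/else tree. */
    if (ip < 0x1004) {
        if (ip < 0x1001)
            goto row_rsp_8;
        goto row_rsp_16;
    } else {
        if (ip < 0x1020)
            goto row_rbp_16;
        goto row_rsp_8; /* shares the outlined body of the first row */
    }

    /* Outlined row bodies: each distinct row is emitted exactly once. */
row_rsp_8:
    c->cfa = c->rsp + 8;
    return 1;
row_rsp_16:
    c->cfa = c->rsp + 16;
    return 1;
row_rbp_16:
    c->cfa = c->rbp + 16;
    return 1;
}
```

Here four intervals need only three bodies. The figures quoted for the libc ($302$ unique rows out of $20827$) suggest the same sharing pays off far more at scale.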
@@ -999,7 +1005,8 @@ The program that was chosen for \prog{perf}-benchmarking is
 \prog{hackbench}~\cite{hackbenchsrc}. This small program is designed to
 stress-test and benchmark the Linux scheduler by spawning processes or threads
 that communicate with each other. It has the interest of generating stack
-activity, be linked against \prog{libc} and \prog{pthread}, and be very light.
+activity, being linked against \prog{libc} and \prog{pthread}, and being very
+light.
 
 \medskip
 
@@ -1059,7 +1066,8 @@ CSmith code is notoriously hard to understand and edit.
 All the measures in this report were made on a computer with an Intel Xeon
 E3-1505M v6 CPU, with a clock frequency of $3.00$\,GHz and 8 cores. The
 computer has 32\,GB of RAM, and care was taken never to fill it and start
-swapping.
+swapping --~using the hard drive to store data instead of the RAM when it is
+full, which severely degrades performance.
 
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsection{Measured time performance}\label{ssec:timeperf}
@@ -1124,7 +1132,8 @@ The compilation time of \ehelfs{} is also reasonable. On the machine
 described in Section~\ref{ssec:bench_hw}, and without using multiple cores to
 compile, the various shared objects needed to run \prog{hackbench} --~that is,
 \prog{hackbench}, \prog{libc}, \prog{ld} and \prog{libpthread}~-- are compiled
-in an overall time of $25.28$ seconds.
+in an overall time of $25.28$ seconds, which a developer is probably prepared
+to wait for.
 
 The unwinding errors observed are hard to investigate, but are most probably
 due to truncated stack records. Indeed, since \prog{perf} dumps the last $n$
@@ -1182,7 +1191,7 @@ registers represent most columns --~see Section~\ref{ssec:instr_cov}.
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsection{Instructions coverage}\label{ssec:instr_cov}
 
-In order to determine which DWARF instructions are necessary to implement to
+In order to determine which DWARF instructions should be implemented to
 have meaningful results, as well as to assess the instruction coverage of our
 compiler and \ehelfs, we must look at real-world ELF files and inspect the
 instructions used.
@@ -1329,8 +1338,6 @@ The overall size of the project is
 statistics, benchmarking, testing and analyzing code modules add up to around
 1500 more lines.
 
-\pagebreak{}
-
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 %%%% End main text content %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%