Fix Guinness' reviews
parent
df7252238e
commit
d4087865e6
2 changed files with 53 additions and 45 deletions
|
@ -8,12 +8,13 @@
|
||||||
|
|
||||||
\subsection*{The general context}
|
\subsection*{The general context}
|
||||||
|
|
||||||
The standard debugging data format, DWARF, contains tables that, for a given
|
The standard debugging data format, DWARF (Debugging With Attributed Record
|
||||||
instruction pointer (IP), permit to understand how the assembly instruction
|
Formats), contains tables permitting, for a given instruction pointer (IP), to
|
||||||
relates to the source code, where variables are currently allocated in memory
|
understand how instructions from the assembly code relate to the original
|
||||||
or if they are stored in a register, what are their type and how to unwind the
|
source code, where variables are currently allocated in memory or whether they are
|
||||||
current stack frame. This information is generated when passing \eg{} the
|
stored in a register, what their types are and how to unwind the current stack
|
||||||
switch \lstbash{-g} to \prog{gcc} or equivalents.
|
frame. This information is generated when passing \eg{} the switch \lstbash{-g}
|
||||||
|
to \prog{gcc} or equivalents.
|
||||||
|
|
||||||
Even in stripped (non-debug) binaries, a small portion of DWARF data remains:
|
Even in stripped (non-debug) binaries, a small portion of DWARF data remains:
|
||||||
the stack unwinding data. This information is necessary to unwind stack
|
the stack unwinding data. This information is necessary to unwind stack
|
||||||
|
@ -28,7 +29,7 @@ Section~\ref{ssec:instr_cov}~\textendash, consisting in offsets from memory
|
||||||
addresses stored in registers (such as \reg{rbp} or \reg{rsp}). Yet, the
|
addresses stored in registers (such as \reg{rbp} or \reg{rsp}). Yet, the
|
||||||
standard defines rules that take the form of a stack-machine expression that
|
standard defines rules that take the form of a stack-machine expression that
|
||||||
can access virtually all the process's memory and perform Turing-complete
|
can access virtually all the process's memory and perform Turing-complete
|
||||||
computation~\cite{oakley2011exploiting}.
|
computations~\cite{oakley2011exploiting}.
|
||||||
|
|
||||||
\subsection*{The research problem}
|
\subsection*{The research problem}
|
||||||
|
|
||||||
|
@ -83,8 +84,8 @@ few samples (around $10\,\mu s$ per frame) to avoid statistical errors. Having
|
||||||
enough samples for this purpose --~at least a few thousands~-- is not easy,
|
enough samples for this purpose --~at least a few thousands~-- is not easy,
|
||||||
since one must avoid unwinding the same frame over and over again, which would
|
since one must avoid unwinding the same frame over and over again, which would
|
||||||
only benchmark the caching mechanism. The other problem is to distribute
|
only benchmark the caching mechanism. The other problem is to distribute
|
||||||
evenly the unwinding measures across the various IPs, including directly into
|
evenly the unwinding measures across the various IPs, among which those
|
||||||
the loaded libraries (\eg{} the \prog{libc}).
|
directly located into the loaded libraries (\eg{} the \prog{libc}).
|
||||||
The solution eventually chosen was to modify \prog{perf}, the standard
|
The solution eventually chosen was to modify \prog{perf}, the standard
|
||||||
profiling program for Linux, in order to gather statistics and benchmarks of
|
profiling program for Linux, in order to gather statistics and benchmarks of
|
||||||
its unwindings. Modifying \prog{perf} was an additional challenge that turned
|
its unwindings. Modifying \prog{perf} was an additional challenge that turned
|
||||||
|
@ -131,7 +132,7 @@ the compiled DWARF version (see Section~\ref{ssec:timeperf}).
|
||||||
The implementation, however, is not yet production-ready: it only supports the
|
The implementation, however, is not yet production-ready: it only supports the
|
||||||
x86\_64 architecture, and relies to some extent on the Linux operating system.
|
x86\_64 architecture, and relies to some extent on the Linux operating system.
|
||||||
None of these pose a fundamental problem. Supporting other processor
|
None of these pose a fundamental problem. Supporting other processor
|
||||||
architectures and ABIs are only a matter of engineering,. The operating system
|
architectures and ABIs are only a matter of engineering. The operating system
|
||||||
dependency is only present in the libraries developed in order to interact with
|
dependency is only present in the libraries developed in order to interact with
|
||||||
the compiled unwinding data, which can be developed for virtually any operating
|
the compiled unwinding data, which can be developed for virtually any operating
|
||||||
system.
|
system.
|
||||||
|
|
|
@ -108,7 +108,7 @@ the location of the return address. Then, the compiler might use \reg{rbp}
|
||||||
the function, and allows for easy addressing of local variables. To some
|
the function, and allows for easy addressing of local variables. To some
|
||||||
extent, it also allows for hot debugging, such as saving a useful core dump
|
extent, it also allows for hot debugging, such as saving a useful core dump
|
||||||
upon segfault. Yet, using \reg{rbp} to save \reg{rip} wastes a register, and
|
upon segfault. Yet, using \reg{rbp} to save \reg{rip} wastes a register, and
|
||||||
the decision of using it is, on x86\_64 System V, up to the compiler.
|
the decision of using it, on x86\_64 System V, is up to the compiler.
|
||||||
|
|
||||||
Usually, a function starts by subtracting some value from \reg{rsp}, allocating
|
Usually, a function starts by subtracting some value from \reg{rsp}, allocating
|
||||||
some space in the stack frame for its local variables. Then, it saves on the
|
some space in the stack frame for its local variables. Then, it saves on the
|
||||||
|
@ -150,7 +150,7 @@ compiler is free to do as it wishes. Even worse, it is not trivial to know
|
||||||
callee-saved registers were at all, since if the function does not alter a
|
callee-saved registers were at all, since if the function does not alter a
|
||||||
register, it does not have to save it.
|
register, it does not have to save it.
|
||||||
|
|
||||||
With this example, it seems pretty clear tha some additional data is necessary
|
With this example, it seems pretty clear that some additional data is necessary
|
||||||
to perform stack unwinding reliably, rather than relying on guesswork. This
|
to perform stack unwinding reliably, rather than relying on guesswork. This
|
||||||
data is stored along with the debugging information of a program, and one
|
data is stored along with the debugging information of a program, and one
|
||||||
common format of debugging data is DWARF\@.
|
common format of debugging data is DWARF\@.
|
||||||
|
@ -218,22 +218,23 @@ that is, $300\,\text{ms}$ per second of program run with default settings.
|
||||||
|
|
||||||
One of the motivations for this internship was also Stephen Kell's
|
One of the motivations for this internship was also Stephen Kell's
|
||||||
\prog{libcrunch}~\cite{kell2016libcrunch}, which makes a heavy use of stack
|
\prog{libcrunch}~\cite{kell2016libcrunch}, which makes a heavy use of stack
|
||||||
unwinding through \prog{libunwind} and was forced to force \prog{gcc} to use a
|
unwinding through \prog{libunwind} and had to force \prog{gcc} to use a frame
|
||||||
frame pointer (\reg{rbp}) everywhere through \lstbash{-fno-omit-frame-pointer}
|
pointer (\reg{rbp}) everywhere through \lstbash{-fno-omit-frame-pointer} in
|
||||||
in order to mitigate the slowness.
|
order to mitigate the slowness.
|
||||||
|
|
||||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||||
\subsection{DWARF format}
|
\subsection{DWARF format}
|
||||||
|
|
||||||
The DWARF format was first standardized as the format for debugging information
|
The DWARF format was first standardized as the format for debugging information
|
||||||
of the ELF executable binaries, which are standard on UNIX-like systems,
|
of the ELF executable binaries (Extensible Linking Format), which are standard
|
||||||
including Linux and MacOS --~but not Windows. It is now commonly used across a
|
on UNIX-like systems, including Linux and MacOS --~but not Windows. It is now
|
||||||
wide variety of binary formats to store debugging information. As of now, the
|
commonly used across a wide variety of binary formats to store debugging
|
||||||
latest DWARF standard is DWARF 5~\cite{dwarf5std}, which is openly accessible.
|
information. As of now, the latest DWARF standard is DWARF 5~\cite{dwarf5std},
|
||||||
|
which is openly accessible.
|
||||||
|
|
||||||
The DWARF data commonly includes type information about the variables in the
|
The DWARF data commonly includes type information about the variables in the
|
||||||
original programming language, correspondence of assembly instructions with a
|
original programming language, correspondence of assembly instructions with a
|
||||||
line in the original source file, \ldots
|
line in the original source file, \ldots{}
|
||||||
The format also specifies a way to represent unwinding data, as described in
|
The format also specifies a way to represent unwinding data, as described in
|
||||||
Section~\ref{ssec:stack_unwinding} above, in an ELF section originally called
|
Section~\ref{ssec:stack_unwinding} above, in an ELF section originally called
|
||||||
\lstc{.debug_frame}, but most often found as \ehframe.
|
\lstc{.debug_frame}, but most often found as \ehframe.
|
||||||
|
@ -776,8 +777,12 @@ would do after a \lstbash{frame n} command. Yet, if one was to enhance the
|
||||||
code to handle every register, it would not be much harder and would probably
|
code to handle every register, it would not be much harder and would probably
|
||||||
be only a few hours worth of code refactoring and rewriting.
|
be only a few hours worth of code refactoring and rewriting.
|
||||||
|
|
||||||
\lstinputlisting[language=C, caption={Unwinding context}, label={lst:unw_ctx}]
|
\begin{figure}[h]
|
||||||
{src/dwarf_assembly_context/unwind_context.c}
|
\centering{}
|
||||||
|
\lstinputlisting[language=C, caption={Unwinding context},
|
||||||
|
label={lst:unw_ctx}]
|
||||||
|
{src/dwarf_assembly_context/unwind_context.c}
|
||||||
|
\end{figure}
|
||||||
|
|
||||||
In the unwind context from Listing~\ref{lst:unw_ctx}, the values of type
|
In the unwind context from Listing~\ref{lst:unw_ctx}, the values of type
|
||||||
\lstc{uintptr_t} are the values of the corresponding registers, and
|
\lstc{uintptr_t} are the values of the corresponding registers, and
|
||||||
|
@ -808,10 +813,11 @@ scattered among various \ehelf{} files, one for each shared object loaded
|
||||||
unwinder must first acquire a \emph{memory map}, a table listing the various
|
unwinder must first acquire a \emph{memory map}, a table listing the various
|
||||||
ELF files loaded and \emph{mapped} in memory, and on which memory segment. This
|
ELF files loaded and \emph{mapped} in memory, and on which memory segment. This
|
||||||
memory map is provided by the operating system --~for instance, on Linux, it is
|
memory map is provided by the operating system --~for instance, on Linux, it is
|
||||||
available as a file in \texttt{/proc}. Once this map is acquired, when
|
available as a file in \texttt{/proc}, a special part of the file system that
|
||||||
unwinding from a given IP, the unwinder must identify the memory segment from
|
the kernel uses to communicate with the userland processes. Once this map is
|
||||||
which it comes, deduce the source ELF file, and deduce the corresponding
|
acquired, when unwinding from a given IP, the unwinder must identify the memory
|
||||||
\ehelf.
|
segment from which it comes, deduce the source ELF file, and deduce the
|
||||||
|
corresponding \ehelf.
|
||||||
|
|
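The lookup described above can be sketched in C as follows. This is only a minimal illustration, assuming the usual Linux layout of a line in \texttt{/proc/self/maps} (address range, permissions, offset, device, inode, optional path); the type \lstc{map_entry_t} and the functions \lstc{parse_maps_line}, \lstc{load_memory_map} and \lstc{find_mapping} are hypothetical names chosen for this example, not those of the actual implementation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical record for one mapped segment (names are illustrative). */
typedef struct {
    uintptr_t start;   /* first address of the segment */
    uintptr_t end;     /* one past the last address of the segment */
    bool executable;   /* true if the segment is mapped with the 'x' flag */
    char path[256];    /* backing file, empty for anonymous mappings */
} map_entry_t;

/* Parse one line of the memory map, assumed to look like
 *   00400000-00452000 r-xp 00000000 08:02 173521 /usr/bin/prog
 * Returns false if the line does not start with "start-end perms". */
bool parse_maps_line(const char *line, map_entry_t *out) {
    unsigned long start, end;
    char perms[5] = "";
    out->path[0] = '\0';
    /* Offset, device and inode are skipped; the path is optional. */
    int n = sscanf(line, "%lx-%lx %4s %*s %*s %*s %255[^\n]",
                   &start, &end, perms, out->path);
    if (n < 3)
        return false;
    out->start = (uintptr_t)start;
    out->end = (uintptr_t)end;
    out->executable = (perms[2] == 'x');
    return true;
}

/* Read the memory map of the current process; returns the entry count.
 * Lines longer than the buffer are not handled in this sketch. */
size_t load_memory_map(map_entry_t *entries, size_t max_entries) {
    FILE *maps = fopen("/proc/self/maps", "r");
    if (maps == NULL)
        return 0;
    char line[512];
    size_t count = 0;
    while (count < max_entries && fgets(line, sizeof line, maps) != NULL) {
        if (parse_maps_line(line, &entries[count]))
            count++;
    }
    fclose(maps);
    return count;
}

/* Find the segment containing ip, or NULL if no segment contains it. */
const map_entry_t *find_mapping(const map_entry_t *entries, size_t count,
                                uintptr_t ip) {
    for (size_t i = 0; i < count; i++) {
        if (entries[i].start <= ip && ip < entries[i].end)
            return &entries[i];
    }
    return NULL;
}
```

Given the entry returned by \lstc{find_mapping}, an unwinder would then derive the name of the corresponding \ehelf{} from the \lstc{path} field; how that name is derived is not shown here.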
||||||
\medskip
|
\medskip
|
||||||
|
|
||||||
|
@ -834,7 +840,7 @@ well on the standard cases that are easily tested, and can be used to unwind
|
||||||
the stack of simple programs.
|
the stack of simple programs.
|
||||||
|
|
||||||
The major drawback of this approach, without any particular care taken, is the
|
The major drawback of this approach, without any particular care taken, is the
|
||||||
space waste. The space taken by those tentative \ehelfs{} is analyzed in
|
waste of space. The space taken by those tentative \ehelfs{} is analyzed in
|
||||||
Table~\ref{table:basic_eh_elf_space} for \prog{hackbench}, a small program
|
Table~\ref{table:basic_eh_elf_space} for \prog{hackbench}, a small program
|
||||||
introduced later in Section~\ref{ssec:bench_perf}, and the libraries on which
|
introduced later in Section~\ref{ssec:bench_perf}, and the libraries on which
|
||||||
it depends.
|
it depends.
|
||||||
|
@ -877,21 +883,21 @@ the original program size ($65\,\%$).
|
||||||
|
|
||||||
A lot of small space optimizations, such as filtering out empty FDEs, merging
|
A lot of small space optimizations, such as filtering out empty FDEs, merging
|
||||||
together the rows that are equivalent on all the registers kept, etc.\ were
|
together the rows that are equivalent on all the registers kept, etc.\ were
|
||||||
made in order to shrink the \ehelfs.
|
made in order to shrink the size of the \ehelfs.
|
||||||
|
|
||||||
\medskip
|
\medskip
|
||||||
|
|
||||||
The major optimization that most reduced the output size was to use an if/else
|
The optimization that most reduced the output size was to use an if/else tree
|
||||||
tree implementing a binary search on the instruction pointer relevant
|
implementing a binary search on the instruction pointer relevant intervals,
|
||||||
intervals, instead of a single monolithic switch. In the process, we also
|
instead of a single monolithic switch. In the process, we also \emph{outline}
|
||||||
\emph{outline} code whenever possible, that is, find out identical ``switch
|
code whenever possible, that is, find out identical ``switch cases'' bodies
|
||||||
cases'' bodies --~which are not switch cases anymore, but \texttt{if}
|
--~which are not switch cases anymore, but \texttt{if} bodies~--, move them
|
||||||
bodies~--, move them outside of the if/else tree, identify them by a label, and
|
outside of the if/else tree, identify them by a label, and jump to them using a
|
||||||
jump to them using a \lstc{goto}, which de-duplicates a lot of code and
|
\lstc{goto}, which de-duplicates a lot of code and contributes greatly to the
|
||||||
contributes greatly to the shrinking. In the process, we noticed that the vast
|
shrinking. In the process, we noticed that the vast majority of FDE rows are
|
||||||
majority of FDE rows are actually taken among very few ``common'' FDE rows. For
|
actually taken among very few ``common'' FDE rows. For instance, in the
|
||||||
instance, in the \prog{libc}, out of a total of $20827$ rows, only $302$
|
\prog{libc}, out of a total of $20827$ rows, only $302$ ($1.5\,\%$) unique rows
|
||||||
($1.5\,\%$) unique rows remain after the outlining.
|
remain after the outlining.
|
||||||
|
|
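To make the shape of the generated code more concrete, here is a hand-written sketch of an if/else tree with outlined bodies. It is an assumption-laden toy example: the address intervals, the two ``common'' rows, the reduced \lstc{unwind_context_t} structure and the function name \lstc{lookup_cfa} are all invented for illustration, and the real generated code handles more registers than the CFA alone.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced unwinding context (illustrative only). */
typedef struct {
    uintptr_t rip, rsp, rbp;
} unwind_context_t;

/* Toy lookup for an object with four IP intervals sharing two rows:
 *   [0x1000, 0x1004): CFA = rsp + 8
 *   [0x1004, 0x1020): CFA = rsp + 16
 *   [0x1020, 0x1024): CFA = rsp + 8    (same row as the first interval)
 *   [0x1024, 0x1040): CFA = rsp + 16   (same row as the second interval)
 * The if/else tree performs a binary search on the interval bounds, and
 * each distinct row body is emitted once, behind a label. */
bool lookup_cfa(const unwind_context_t *ctx, uintptr_t *cfa) {
    uintptr_t ip = ctx->rip;
    if (ip < 0x1020) {
        if (ip < 0x1004) {
            if (ip < 0x1000)
                goto not_found;
            goto row_0;
        }
        goto row_1;
    } else {
        if (ip < 0x1024)
            goto row_0;
        if (ip < 0x1040)
            goto row_1;
        goto not_found;
    }

row_0: /* outlined body shared by two intervals */
    *cfa = ctx->rsp + 8;
    return true;
row_1: /* outlined body shared by two intervals */
    *cfa = ctx->rsp + 16;
    return true;
not_found: /* IP outside every known interval */
    return false;
}
```

With a monolithic switch, each of the four intervals would carry its own copy of the row body; here only two bodies exist, whatever the number of intervals that jump to them.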
||||||
This makes this optimization really efficient, as seen later in
|
This makes this optimization really efficient, as seen later in
|
||||||
Section~\ref{ssec:results_size}, but also makes it an interesting question
|
Section~\ref{ssec:results_size}, but also makes it an interesting question
|
||||||
|
@ -999,7 +1005,8 @@ The program that was chosen for \prog{perf}-benchmarking is
|
||||||
\prog{hackbench}~\cite{hackbenchsrc}. This small program is designed to
|
\prog{hackbench}~\cite{hackbenchsrc}. This small program is designed to
|
||||||
stress-test and benchmark the Linux scheduler by spawning processes or threads
|
stress-test and benchmark the Linux scheduler by spawning processes or threads
|
||||||
that communicate with each other. It has the interest of generating stack
|
that communicate with each other. It has the interest of generating stack
|
||||||
activity, be linked against \prog{libc} and \prog{pthread}, and be very light.
|
activity, being linked against \prog{libc} and \prog{pthread}, and being very
|
||||||
|
light.
|
||||||
|
|
||||||
\medskip
|
\medskip
|
||||||
|
|
||||||
|
@ -1059,7 +1066,8 @@ CSmith code is notoriously hard to understand and edit.
|
||||||
All the measures in this report were made on a computer with an Intel Xeon
|
All the measures in this report were made on a computer with an Intel Xeon
|
||||||
E3-1505M v6 CPU, with a clock frequency of $3.00$\,GHz and 8 cores. The
|
E3-1505M v6 CPU, with a clock frequency of $3.00$\,GHz and 8 cores. The
|
||||||
computer has 32\,GB of RAM, and care was taken never to fill it and start
|
computer has 32\,GB of RAM, and care was taken never to fill it and start
|
||||||
swapping.
|
swapping --~using the hard drive to store data instead of the RAM when it is
|
||||||
|
full, which severely degrades performance.
|
||||||
|
|
||||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||||
\subsection{Measured time performance}\label{ssec:timeperf}
|
\subsection{Measured time performance}\label{ssec:timeperf}
|
||||||
|
@ -1124,7 +1132,8 @@ The compilation time of \ehelfs{} is also reasonable. On the machine
|
||||||
described in Section~\ref{ssec:bench_hw}, and without using multiple cores to
|
described in Section~\ref{ssec:bench_hw}, and without using multiple cores to
|
||||||
compile, the various shared objects needed to run \prog{hackbench} --~that is,
|
compile, the various shared objects needed to run \prog{hackbench} --~that is,
|
||||||
\prog{hackbench}, \prog{libc}, \prog{ld} and \prog{libpthread}~-- are compiled
|
\prog{hackbench}, \prog{libc}, \prog{ld} and \prog{libpthread}~-- are compiled
|
||||||
in an overall time of $25.28$ seconds.
|
in an overall time of $25.28$ seconds, which a developer is probably prepared
|
||||||
|
to wait for.
|
||||||
|
|
||||||
The unwinding errors observed are hard to investigate, but are most probably
|
The unwinding errors observed are hard to investigate, but are most probably
|
||||||
due to truncated stack records. Indeed, since \prog{perf} dumps the last $n$
|
due to truncated stack records. Indeed, since \prog{perf} dumps the last $n$
|
||||||
|
@ -1182,7 +1191,7 @@ registers represent most columns --~see Section~\ref{ssec:instr_cov}.
|
||||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||||
\subsection{Instructions coverage}\label{ssec:instr_cov}
|
\subsection{Instructions coverage}\label{ssec:instr_cov}
|
||||||
|
|
||||||
In order to determine which DWARF instructions are necessary to implement to
|
In order to determine which DWARF instructions should be implemented to
|
||||||
have meaningful results, as well as to assess the instruction coverage of our
|
have meaningful results, as well as to assess the instruction coverage of our
|
||||||
compiler and \ehelfs, we must look at real-world ELF files and inspect the
|
compiler and \ehelfs, we must look at real-world ELF files and inspect the
|
||||||
instructions used.
|
instructions used.
|
||||||
|
@ -1329,8 +1338,6 @@ The overall size of the project is
|
||||||
statistics, benchmarking, testing and analyzing code modules add up to around
|
statistics, benchmarking, testing and analyzing code modules add up to around
|
||||||
1500 more lines.
|
1500 more lines.
|
||||||
|
|
||||||
\pagebreak{}
|
|
||||||
|
|
||||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||||
%%%% End main text content %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
%%%% End main text content %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||||
|
|