Rephrase everything but section 2

Théophile Bastian 2018-08-18 22:06:55 +02:00
parent f0809dbf1c
commit 2f44049506

@ -702,14 +702,14 @@ Listing~\ref{lst:unw_ctx}. The function will then return a fresh memory
context, containing the values the registers hold after unwinding this frame.
The body of the function itself consists of a single monolithic switch, taking
advantage of the non-standard --~yet widely implemented in C compilers~--
syntax for range switches, in which each \lstinline{case} can refer to a range.
All the FDEs are merged together into this switch, each row of a FDE being a
switch case. Separating the various FDEs in the C code --~other than with
comments~-- is, unlike what is done in DWARF, pointless, since accessing a
``row'' has a linear cost, and the C code is not meant to be read, except maybe
for debugging purposes. The switch cases bodies then fill a context with
unwound values, then return it.
advantage of the non-standard --~yet overwhelmingly implemented in common C
compilers~-- syntax for range switches, in which each \lstinline{case} can
refer to a range, \eg{} \lstc{case 17 ... 42:}. All the FDEs are merged
together into this switch, each row of an FDE being a switch case. Separating
the various FDEs in the C code --~other than with comments~-- is, unlike what
is done in DWARF, pointless, since accessing a ``row'' has a linear cost, and
the C code is not meant to be read, except maybe for debugging purposes. The
switch case bodies then fill a context with unwound values before returning it.
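As a rough illustration of the range-switch layout described above, the following is a minimal sketch, not the code actually generated. The names \lstc{sketch_ctx_t} and \lstc{sketch_eh_elf}, the addresses and the offsets are all hypothetical, and only \reg{rip}, \reg{rsp} and \reg{rbp} are handled. It relies on the case-range extension supported by GCC and Clang, and assumes a 64-bit target.

```c
#include <stdint.h>

/* Hypothetical, simplified context; the real structure is the one shown in
 * the "Unwinding context" listing. */
typedef struct {
    uintptr_t rip, rsp, rbp;
    int valid;                  /* 0 when no row covers the requested pc */
} sketch_ctx_t;

/* Illustrative stand-in for a generated unwinding function. Each case range
 * plays the role of one FDE row; addresses and offsets are made up. */
sketch_ctx_t sketch_eh_elf(sketch_ctx_t ctx, uintptr_t pc)
{
    sketch_ctx_t out = { 0, 0, 0, 0 };
    uintptr_t cfa;

    switch (pc) {
    case 0x615 ... 0x618:       /* row: CFA = rsp + 8, rbp unchanged */
        cfa = ctx.rsp + 8;
        out.rbp = ctx.rbp;
        break;
    case 0x619 ... 0x658:       /* row: CFA = rsp + 16, rbp saved at CFA - 16 */
        cfa = ctx.rsp + 16;
        out.rbp = *(uintptr_t *)(cfa - 16);
        break;
    default:                    /* pc not covered by any row */
        return out;
    }

    out.rip = *(uintptr_t *)(cfa - 8);  /* return address sits just below the CFA */
    out.rsp = cfa;
    out.valid = 1;
    return out;
}
```

In this sketch the return address is read directly from memory; the optional \lstc{deref} parameter mentioned in the text would replace those direct reads.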
A setting of the compiler also optionally enables another parameter to the
\lstc{_eh_elf} function, \lstc{deref}, which is a function pointer. This
@ -724,12 +724,12 @@ real-world-proof version of the \ehelfs, the choice was made to keep this
implementation simple, and only handle the few registers that were needed to
simply unwind the stack. Thus, the only registers handled in \ehelfs{} are
\reg{rip}, \reg{rbp}, \reg{rsp} and \reg{rbx}, the latter being used a few
times in \prog{libc} to hold the CFA address in common functions. This is
enough to unwind the stack reliably, and thus enough for profiling, but is not
sufficient to analyze every stack frame as \prog{gdb} would do after a
\lstbash{frame n} command. Yet, if one was to enhance the code to handle every
register, it would not be much harder and would probably be only a few hours of
code refactoring and rewriting.
times in \prog{libc} and other less common libraries to hold the CFA address in
common functions. This is enough to unwind the stack reliably, and thus enough
for profiling, but is not sufficient to analyze every stack frame as \prog{gdb}
would do after a \lstbash{frame n} command. Yet, if one were to enhance the
code to handle every register, it would not be much harder and would probably
be only a few hours' worth of code refactoring and rewriting.
\lstinputlisting[language=C, caption={Unwinding context}, label={lst:unw_ctx}]
{src/dwarf_assembly_context/unwind_context.c}
@ -754,17 +754,19 @@ on or off, and it doesn't require to alter the base system by editing \eg{}
\texttt{/usr/lib/libc-*.so}. Instead, when the \ehelf{} data is required, those
files can simply be \lstc{dlopen}'d. It is also possible to imagine, in a
future production environment, packaging \ehelfs{} files separately, so that
people interested in heavy computation can have the choice to install them.
people interested in better performance can have the choice to install them.
This, in particular, means that each ELF file has its unwinding data in a
separate \ehelf{} file --~just like with DWARF, where each ELF retains its own
DWARF data. Thus, an unwinder must first acquire a \emph{memory map}, a table
listing the various ELF files loaded and \emph{mapped} in memory, and on which
memory segment. This memory map is provided by the operating system --~for
instance, on Linux, it is available as a file in \texttt{/proc}. Once this map
is acquired, when unwinding from a given IP, the unwinder must identify the
memory segment from which it comes, deduce the source ELF file, and deduce the
corresponding \ehelf.
separate \ehelf{} file, implying that the unwinding data for a given program is
scattered among various \ehelf{} files, one for each shared object loaded
--~just like with DWARF, where each ELF retains its own DWARF data. Thus, an
unwinder must first acquire a \emph{memory map}, a table listing the various
ELF files loaded and \emph{mapped} in memory, and on which memory segment. This
memory map is provided by the operating system --~for instance, on Linux, it is
available as a file in \texttt{/proc}. Once this map is acquired, when
unwinding from a given IP, the unwinder must identify the memory segment from
which it comes, deduce the source ELF file, and deduce the corresponding
\ehelf.
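The lookup step described above can be sketched as follows, assuming the memory map has already been parsed into a table. The type \lstc{sketch_map_entry_t}, the function \lstc{sketch_find_segment} and the field names are hypothetical, introduced only for illustration; the actual unwinder may organise this data differently.

```c
#include <stddef.h>
#include <stdint.h>

/* One entry of a hypothetical, already-parsed memory map. */
typedef struct {
    uintptr_t start;          /* first address of the mapped segment */
    uintptr_t end;            /* one past the last address of the segment */
    const char *elf_path;     /* ELF file mapped in this segment */
    const char *eh_elf_path;  /* corresponding eh_elf file (assumed naming) */
} sketch_map_entry_t;

/* Return the entry whose segment contains ip, or NULL if none does.
 * A linear scan is enough for a sketch; a sorted table would allow a
 * binary search instead. */
const sketch_map_entry_t *sketch_find_segment(const sketch_map_entry_t *map,
                                              size_t len, uintptr_t ip)
{
    for (size_t i = 0; i < len; ++i) {
        if (map[i].start <= ip && ip < map[i].end)
            return &map[i];
    }
    return NULL;
}
```

Once the entry is found, its \lstc{eh_elf_path} field would name the file to \lstc{dlopen} in order to unwind this frame.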
\medskip
@ -772,8 +774,8 @@ corresponding \ehelf.
label={lst:fib7_eh_elf_basic}]
{src/fib7/fib7.eh_elf_basic.c}
The C code in Listing~\ref{lst:fib7_eh_elf_basic} is a part of what was
generated for the C code in Listing~\ref{lst:ex1_c}.
The C code in Listing~\ref{lst:fib7_eh_elf_basic} is the relevant part of what
was generated for the C code in Listing~\ref{lst:ex1_c}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{First results}
@ -817,13 +819,13 @@ it depends.
The first column only includes the sizes of the ELF sections \lstc{.text} (the
program itself) and \lstc{.rodata}, the read-only data (such as static strings,
etc.). Only the weight of the \lstc{.text} section of the generated \ehelfs{}
is considered, because it is self-consistent (few data or none is stored in
is considered, because it is self-contained (little or no data is stored in
\lstc{.rodata}), and the other sections could be removed if the \ehelfs{}
\lstc{.text} was somehow embedded in the original shared object.
This first tentative version of \ehelfs{} is roughly 7 times heavier than the
original \lstc{.eh_frame}, and represents a far too significant proportion of
the original program size.
the original program size ($65\,\%$).
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Space optimization}\label{ssec:space_optim}
@ -838,13 +840,13 @@ The major optimization that most reduced the output size was to use an if/else
tree implementing a binary search over the relevant instruction pointer
intervals, instead of a single monolithic switch. In the process, we also
\emph{outline} code whenever possible, that is, find out identical ``switch
cases'' bodies --~which are not switch cases anymore, but if bodies~--, move
them outside of the if/else tree, identify them by a label, and jump to them
using a \lstc{goto}, which de-duplicates a lot of code and contributes greatly
to the shrinking. In the process, we noticed that the vast majority of FDE rows
are actually taken among very few ``common'' FDE rows. For instance, in the
\prog{libc}, out of a total of $20827$ rows, only $302$ ($1.5\,\%$) remain
after the outlining.
cases'' bodies --~which are not switch cases anymore, but \texttt{if}
bodies~--, move them outside of the if/else tree, identify them by a label, and
jump to them using a \lstc{goto}, which de-duplicates a lot of code and
contributes greatly to the shrinking. In the process, we noticed that the vast
majority of FDE rows are actually drawn from very few ``common'' FDE rows. For
instance, in \prog{libc}, out of a total of $20827$ rows, only $302$
($1.5\,\%$) unique rows remain after the outlining.
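The shape of such an if/else tree with outlined bodies can be sketched as follows. This is a simplified illustration, not the generated code: the function \lstc{sketch_cfa_offset}, the interval bounds and the offsets are hypothetical, and each body only yields a CFA offset rather than filling a whole context.

```c
#include <stdint.h>

/* Returns the offset to add to rsp to obtain the CFA at pc, or -1 when no
 * row covers pc. Intervals and offsets are made up for illustration. */
long sketch_cfa_offset(uintptr_t pc)
{
    long offset;

    /* Binary search over the interval bounds, unrolled as an if/else tree. */
    if (pc < 0x630) {
        if (pc < 0x615)
            goto not_found;
        if (pc < 0x619)
            goto row_a;         /* 0x615 <= pc < 0x619 */
        goto row_b;             /* 0x619 <= pc < 0x630 */
    } else {
        if (pc < 0x640)
            goto row_a;         /* same body as the first interval, shared */
        if (pc < 0x660)
            goto row_b;         /* 0x640 <= pc < 0x660 */
        goto not_found;
    }

    /* Outlined bodies: each distinct row is emitted once and reached by goto. */
row_a:
    offset = 8;
    goto done;
row_b:
    offset = 16;
    goto done;
not_found:
    offset = -1;
done:
    return offset;
}
```

Four intervals share only two bodies here, which mirrors on a small scale the de-duplication observed on \prog{libc}.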
This makes this optimization really efficient, as seen later in
Section~\ref{ssec:results_size}, but also makes it an interesting question
@ -874,13 +876,13 @@ solution working.
\subsection{Requirements}\label{ssec:bench_req}
To provide relevant benchmarks of the \ehelfs{} performance, one must sample at
least a few hundreds or thousands of stack unwinding, since a single frame
least a few hundred or thousand stack unwindings, since a single frame
unwinding with regular DWARF takes on the order of $10\,\mu s$, and
\ehelfs{} were expected to have significantly better performance.
However, unwinding over and over again from the same program point would have
been of no interest at all, since \prog{libunwind} would have simply cached the
relevant DWARF row. In the mean time, making sure that the various unwinding
relevant DWARF rows. At the same time, making sure that the various unwindings
are made from different locations is somewhat cheating, since it renders
\prog{libunwind}'s caching useless and does not reproduce a ``real-world''
unwinding distribution. All in all, the benchmarking method must have a ``natural''
@ -892,8 +894,8 @@ stack unwindings crossing some standard library functions, starting from inside
them, etc.
Finally, the unwound program must be interesting enough to enter and exit
functions often, building a good stack of nested function calls (at least 5
frequently), have FDEs that are not as simple as in Listing~\ref{lst:ex1_dw},
functions often, building a good stack of nested function calls (frequently at
least 5 deep), have FDEs that are not as simple as in Listing~\ref{lst:ex1_dw},
etc.
@ -925,7 +927,8 @@ Section~\ref{ssec:bench_req} above: since it stops at regular intervals and
unwinds, the unwindings are evenly distributed \wrt{} the frequency of
execution of the code, which is a natural enough setup for the benchmarks to be
meaningful, while still unwinding from diversified locations, preventing
caching from being be overwhelming. It also has the ability to unwind from
caching from being overwhelming --~as can be observed later in
Section~\ref{ssec:timeperf}. It also has the ability to unwind from
within any function, including functions of linked shared libraries. It can also
be applied to virtually any program, which allows unwinding ``interesting''
code.
@ -944,27 +947,26 @@ turned out necessary to slightly modify \prog{libunwind}'s interface to add a
parameter to an initialisation function, since \prog{libunwind} is made to be
agnostic of the system and process as much as possible, to be able to unwind in
any context. This very restricted information lacked a memory map (see
Section~\ref{ssec:ehelfs}) in order to use \ehelfs. Apart from this, the
modified version of \prog{libunwind} produced is entirely compatible with the
vanilla version. This means that the only modifications required to use
\ehelfs{} within any project using \prog{libunwind} should be changing one line
of code to add one parameter to a function call and linking against the
modified version of \prog{libunwind} instead of the system version.
Section~\ref{ssec:ehelfs}) in order to use \ehelfs{} --~while, on the other
hand, providing information about the original DWARF that is now useless.
Apart from this, the modified version of \prog{libunwind} produced is entirely
compatible with the vanilla version. This means that the only modifications
required to use \ehelfs{} within any project using \prog{libunwind} should be
changing one line of code to add one parameter to a function call and linking
against the modified version of \prog{libunwind} instead of the system version.
Once this was done, plugging it into \prog{perf} was a matter of a few lines of
code only, apart from the benchmarking code. The major problem encountered was
to understand how \prog{perf} works. In order to avoid perturbing the traced
program, \prog{perf} does not unwind at runtime, but rather records at regular
intervals the program's stack, and all the auxiliary information that is needed
to unwind later. This is done when running \lstbash{perf record}. Then,
\lstbash{perf report} unwinds the stack to analyze it; but at this point of
time, the traced process is long dead, thus any PID-based approach, or any
approach using \texttt{/proc} information will fail. However, as this was the
easiest method, the first version of \ehelfs{} used those mechanisms; thus
requiring some code rewriting.
The modified versions of both \prog{perf} and \prog{libunwind} are present in
the repositories \prog{perf-eh\_elf} and \prog{libunwind-eh\_elf}.
to unwind later. This is done when running \lstbash{perf record}. Then, a
subsequent call to \lstbash{perf report} unwinds the stack to analyze it; but
at this point in time, the traced process is long dead. Thus, any PID-based
approach, or any approach using \texttt{/proc} information, will fail. However,
as this was the easiest method, the first version of \ehelfs{} used those
mechanisms; it took some code rewriting to move to a PID- and
\texttt{/proc}-agnostic implementation.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Other explored methods}
@ -1052,6 +1054,11 @@ instruction, however, would not slow down at all the implementation, since
every instruction would simply be compiled to x86\_64 without affecting the
already supported code.
The sharp difference between cached and uncached \prog{libunwind} confirms
that our experimental setup did not unwind from totally different locations
every single time, and thus was not biased in this direction, since caching
remains very efficient.
It is also worth noting that the compilation time of \ehelfs{} is
reasonably short. On the machine described in Section~\ref{ssec:bench_hw}, and
without using multiple cores to compile, the various shared objects needed to
@ -1117,8 +1124,10 @@ Section~\ref{ssec:instr_cov}).
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Instructions coverage}\label{ssec:instr_cov}
In order to determine which proportion of real-world ELF instructions are
covered by our compiler and \ehelfs.
In order to determine which DWARF instructions must be implemented to obtain
meaningful results, as well as to assess the instruction coverage of our
compiler and \ehelfs, we must look at real-world ELF files and inspect the
instructions used.
The method chosen was to take a random uniform sample of 4000 ELFs among those
present on a basic ArchLinux system setup, in the directories \texttt{/bin},
@ -1211,7 +1220,7 @@ instructions encountered that were not supported by \ehelfs. The first row is
only concerned with the columns CFA, \reg{rip}, \reg{rsp}, \reg{rbp} and
\reg{rbx} (the supported registers --~see Section~\ref{ssec:ehelfs}). The
second row analyzes all the columns that were encountered, no matter whether
supported or not.
supported or not in \ehelfs.
Table~\ref{table:instr_types} analyzes the proportion of each command
--~the formal way a register is set~-- for non-CFA columns in the sampled data. For
@ -1221,11 +1230,13 @@ means stored at the address of an expression's result, and the \texttt{Val\_}
prefix means that the value must not be dereferenced. Overall, it can be seen
that supporting \texttt{Offset} already means supporting the vast majority of
registers. The data gathered (not reproduced here) also suggests that
supporting a few common expressions is enough to support most of them.
supporting a few common expressions is enough to support most of them. This is
further supported by the fact that we already cover more than $80\,\%$ of
expressions by supporting only two basic constructs.
It is also worth noting that of all the 4000 analyzed files, there are only 12
that contained all the unsupported expressions seen, and only 24 that contained
some unsupported instruction at all.
It is also worth noting that, among the 4000 analyzed files, all the
unsupported expressions are clustered in only 12 of them, and only 24 contain
any unsupported instruction at all.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%% End main text content %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%