Rephrase everything but section 2
parent
f0809dbf1c
commit
2f44049506
1 changed files with 72 additions and 61 deletions
|
@ -702,14 +702,14 @@ Listing~\ref{lst:unw_ctx}. The function will then return a fresh memory
|
|||
context, containing the values the registers hold after unwinding this frame.
|
||||
|
||||
The body of the function itself consists of a single monolithic switch, taking
|
||||
advantage of the non-standard --~yet widely implemented in C compilers~--
|
||||
syntax for range switches, in which each \lstinline{case} can refer to a range.
|
||||
All the FDEs are merged together into this switch, each row of a FDE being a
|
||||
switch case. Separating the various FDEs in the C code --~other than with
|
||||
comments~-- is, unlike what is done in DWARF, pointless, since accessing a
|
||||
``row'' has a linear cost, and the C code is not meant to be read, except maybe
|
||||
for debugging purposes. The switch cases bodies then fill a context with
|
||||
unwound values, then return it.
|
||||
advantage of the non-standard --~yet overwhelmingly implemented in common C
|
||||
compilers~-- syntax for range switches, in which each \lstinline{case} can
|
||||
refer to a range, \eg{} \lstc{case 17 ... 42:}. All the FDEs are merged
|
||||
together into this switch, each row of an FDE being a switch case. Separating
|
||||
the various FDEs in the C code --~other than with comments~-- is, unlike what
|
||||
is done in DWARF, pointless, since accessing a ``row'' has a linear cost, and
|
||||
the C code is not meant to be read, except maybe for debugging purposes. The
|
||||
switch case bodies then fill a context with unwound values before returning it.
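
As a minimal sketch of the range-switch layout described above, the following
C function maps a program counter to a row using the GNU case-range extension.
The type \lstc{range_ctx_t}, the function \lstc{range_switch_sketch} and all
addresses and offsets are hypothetical and only illustrate the shape of the
code; they are not the generated output.

```c
#include <stdint.h>

/* Hypothetical, reduced context: the real one tracks more registers. */
typedef struct {
    uintptr_t rsp;
    uintptr_t cfa;
} range_ctx_t;

/* One switch case per (made-up) FDE row; `case A ... B:` is a GNU
 * extension accepted by GCC and Clang, not standard C. */
range_ctx_t range_switch_sketch(range_ctx_t ctx, uintptr_t pc) {
    range_ctx_t out = ctx;
    switch (pc) {
        case 0x615:             /* row 1: CFA = rsp + 8 */
            out.cfa = ctx.rsp + 8;
            return out;
        case 0x616 ... 0x65f:   /* row 2: CFA = rsp + 16 */
            out.cfa = ctx.rsp + 16;
            return out;
        case 0x660:             /* row 3: CFA = rsp + 8 */
            out.cfa = ctx.rsp + 8;
            return out;
        default:                /* pc not covered: 0 is an arbitrary marker */
            out.cfa = 0;
            return out;
    }
}
```

With \lstc{rsp = 0x1000}, a lookup at \lstc{pc = 0x620} falls into the second
row and yields a CFA of \lstc{0x1010}.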
|
||||
|
||||
A setting of the compiler also optionally enables another parameter to the
|
||||
\lstc{_eh_elf} function, \lstc{deref}, which is a function pointer. This
|
||||
|
@ -724,12 +724,12 @@ real-world-proof version of the \ehelfs, the choice was made to keep this
|
|||
implementation simple, and only handle the few registers that were needed to
|
||||
simply unwind the stack. Thus, the only registers handled in \ehelfs{} are
|
||||
\reg{rip}, \reg{rbp}, \reg{rsp} and \reg{rbx}, the latter being used a few
|
||||
times in \prog{libc} to hold the CFA address in common functions. This is
|
||||
enough to unwind the stack reliably, and thus enough for profiling, but is not
|
||||
sufficient to analyze every stack frame as \prog{gdb} would do after a
|
||||
\lstbash{frame n} command. Yet, if one was to enhance the code to handle every
|
||||
register, it would not be much harder and would probably be only a few hours of
|
||||
code refactoring and rewriting.
|
||||
times in \prog{libc} and other less common libraries to hold the CFA address in
|
||||
common functions. This is enough to unwind the stack reliably, and thus enough
|
||||
for profiling, but is not sufficient to analyze every stack frame as \prog{gdb}
|
||||
would do after a \lstbash{frame n} command. Yet, if one were to enhance the
|
||||
code to handle every register, it would not be much harder and would probably
|
||||
be only a few hours' worth of code refactoring and rewriting.
|
||||
|
||||
\lstinputlisting[language=C, caption={Unwinding context}, label={lst:unw_ctx}]
|
||||
{src/dwarf_assembly_context/unwind_context.c}
|
||||
|
@ -754,17 +754,19 @@ on or off, and it doesn't require to alter the base system by editing \eg{}
|
|||
\texttt{/usr/lib/libc-*.so}. Instead, when the \ehelf{} data is required, those
|
||||
files can simply be \lstc{dlopen}'d. It is also possible to imagine, in a
|
||||
future production environment, packaging \ehelfs{} files separately, so that
|
||||
people interested in heavy computation can have the choice to install them.
|
||||
people interested in better performance can have the choice to install them.
|
||||
|
||||
This, in particular, means that each ELF file has its unwinding data in a
|
||||
separate \ehelf{} file --~just like with DWARF, where each ELF retains its own
|
||||
DWARF data. Thus, an unwinder must first acquire a \emph{memory map}, a table
|
||||
listing the various ELF files loaded and \emph{mapped} in memory, and on which
|
||||
memory segment. This memory map is provided by the operating system --~for
|
||||
instance, on Linux, it is available as a file in \texttt{/proc}. Once this map
|
||||
is acquired, when unwinding from a given IP, the unwinder must identify the
|
||||
memory segment from which it comes, deduce the source ELF file, and deduce the
|
||||
corresponding \ehelf.
|
||||
separate \ehelf{} file, implying that the unwinding data for a given program is
|
||||
scattered among various \ehelf{} files, one for each shared object loaded
|
||||
--~just like with DWARF, where each ELF retains its own DWARF data. Thus, an
|
||||
unwinder must first acquire a \emph{memory map}, a table listing the various
|
||||
ELF files loaded and \emph{mapped} in memory, and on which memory segment. This
|
||||
memory map is provided by the operating system --~for instance, on Linux, it is
|
||||
available as a file in \texttt{/proc}. Once this map is acquired, when
|
||||
unwinding from a given IP, the unwinder must identify the memory segment from
|
||||
which it comes, deduce the source ELF file, and deduce the corresponding
|
||||
\ehelf.
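
The lookup described above can be sketched as follows, assuming the textual
format of the Linux \texttt{/proc/<pid>/maps} file. The names
\lstc{map_entry_t}, \lstc{parse_maps_line} and \lstc{find_segment} are
hypothetical helpers introduced for illustration, not part of the actual
implementation; deducing the \ehelf{} file name from the ELF path is left
out, since the naming scheme is not specified here.

```c
#include <stdint.h>
#include <stdio.h>

/* One mapped segment: address range and backing file (if any). */
typedef struct {
    uintptr_t start;
    uintptr_t end;      /* exclusive */
    char path[256];     /* empty for anonymous mappings */
} map_entry_t;

/* Parses one line of the form
 *   start-end perms offset dev inode [path]
 * Returns 1 on success, 0 if the line is malformed.
 * Simplification: paths containing spaces are truncated. */
int parse_maps_line(const char *line, map_entry_t *entry) {
    unsigned long start, end;
    entry->path[0] = '\0';
    int n = sscanf(line, "%lx-%lx %*s %*s %*s %*s %255s",
                   &start, &end, entry->path);
    if (n < 2)
        return 0;
    entry->start = (uintptr_t)start;
    entry->end = (uintptr_t)end;
    return 1;
}

/* Returns the segment containing ip, or NULL if none does. */
const map_entry_t *find_segment(const map_entry_t *map, size_t len,
                                uintptr_t ip) {
    for (size_t i = 0; i < len; i++) {
        if (map[i].start <= ip && ip < map[i].end)
            return &map[i];
    }
    return NULL;
}
```

The \lstc{path} field of the returned entry identifies the source ELF file,
from which the corresponding \ehelf{} can then be located.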
|
||||
|
||||
\medskip
|
||||
|
||||
|
@ -772,8 +774,8 @@ corresponding \ehelf.
|
|||
label={lst:fib7_eh_elf_basic}]
|
||||
{src/fib7/fib7.eh_elf_basic.c}
|
||||
|
||||
The C code in Listing~\ref{lst:fib7_eh_elf_basic} is a part of what was
|
||||
generated for the C code in Listing~\ref{lst:ex1_c}.
|
||||
The C code in Listing~\ref{lst:fib7_eh_elf_basic} is the relevant part of what
|
||||
was generated for the C code in Listing~\ref{lst:ex1_c}.
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\subsection{First results}
|
||||
|
@ -817,13 +819,13 @@ it depends.
|
|||
The first column only includes the sizes of the ELF sections \lstc{.text} (the
|
||||
program itself) and \lstc{.rodata}, the read-only data (such as static strings,
|
||||
etc.). Only the weight of the \lstc{.text} section of the generated \ehelfs{}
|
||||
is considered, because it is self-consistent (few data or none is stored in
|
||||
is considered, because it is self-contained (little or no data is stored in
|
||||
\lstc{.rodata}), and the other sections could be removed if the \ehelfs{}
|
||||
\lstc{.text} was somehow embedded in the original shared object.
|
||||
|
||||
This first tentative version of \ehelfs{} is roughly 7 times heavier than the
|
||||
original \lstc{.eh_frame}, and represents a far too significant proportion of
|
||||
the original program size.
|
||||
the original program size ($65\,\%$).
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\subsection{Space optimization}\label{ssec:space_optim}
|
||||
|
@ -838,13 +840,13 @@ The major optimization that most reduced the output size was to use an if/else
|
|||
tree implementing a binary search on the relevant instruction pointer
|
||||
intervals, instead of a single monolithic switch. In the process, we also
|
||||
\emph{outline} code whenever possible, that is, find out identical ``switch
|
||||
cases'' bodies --~which are not switch cases anymore, but if bodies~--, move
|
||||
them outside of the if/else tree, identify them by a label, and jump to them
|
||||
using a \lstc{goto}, which de-duplicates a lot of code and contributes greatly
|
||||
to the shrinking. In the process, we noticed that the vast majority of FDE rows
|
||||
are actually taken among very few ``common'' FDE rows. For instance, in the
|
||||
\prog{libc}, out of a total of $20827$ rows, only $302$ ($1.5\,\%$) remain
|
||||
after the outlining.
|
||||
cases'' bodies --~which are not switch cases anymore, but \texttt{if}
|
||||
bodies~--, move them outside of the if/else tree, identify them by a label, and
|
||||
jump to them using a \lstc{goto}, which de-duplicates a lot of code and
|
||||
contributes greatly to the shrinking. In the process, we noticed that the vast
|
||||
majority of FDE rows are actually taken among very few ``common'' FDE rows. For
|
||||
instance, in the \prog{libc}, out of a total of $20827$ rows, only $302$
|
||||
($1.5\,\%$) unique rows remain after the outlining.
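
A minimal sketch of this layout, under the same made-up addresses as a
hand-written example rather than real generated code: the if/else tree only
locates the interval, and each distinct row body appears once, behind a
label. The names \lstc{tree_ctx_t} and \lstc{tree_sketch} are hypothetical.

```c
#include <stdint.h>

/* Hypothetical, reduced context: the real one tracks more registers. */
typedef struct {
    uintptr_t rsp;
    uintptr_t cfa;
} tree_ctx_t;

/* Five intervals, but only two distinct row bodies after outlining:
 *   [0x615,0x616) rsp+8    [0x616,0x660) rsp+16   [0x660,0x661) rsp+8
 *   [0x700,0x701) rsp+8    [0x701,0x740) rsp+16 */
tree_ctx_t tree_sketch(tree_ctx_t ctx, uintptr_t pc) {
    tree_ctx_t out = ctx;

    /* Binary search on the interval bounds. */
    if (pc < 0x660) {
        if (pc < 0x616) {
            if (pc < 0x615)
                goto not_found;
            goto cfa_rsp_8;
        }
        goto cfa_rsp_16;
    } else {
        if (pc < 0x701) {
            if (pc < 0x661)
                goto cfa_rsp_8;
            if (pc < 0x700)
                goto not_found;   /* gap between the two functions */
            goto cfa_rsp_8;
        }
        if (pc < 0x740)
            goto cfa_rsp_16;
        goto not_found;
    }

    /* Outlined bodies, shared by every interval that uses them. */
cfa_rsp_8:
    out.cfa = ctx.rsp + 8;
    return out;
cfa_rsp_16:
    out.cfa = ctx.rsp + 16;
    return out;
not_found:
    out.cfa = 0;                  /* arbitrary "no row" marker */
    return out;
}
```

Here three of the five intervals share the \lstc{cfa_rsp_8} body, which is
the de-duplication effect described above on a very small scale.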
|
||||
|
||||
This makes this optimization really efficient, as seen later in
|
||||
Section~\ref{ssec:results_size}, but also makes it an interesting question
|
||||
|
@ -874,13 +876,13 @@ solution working.
|
|||
\subsection{Requirements}\label{ssec:bench_req}
|
||||
|
||||
To provide relevant benchmarks of the \ehelfs{} performance, one must sample at
|
||||
least a few hundreds or thousands of stack unwinding, since a single frame
|
||||
least a few hundred or thousand stack unwindings, since a single frame
|
||||
unwinding with regular DWARF takes on the order of $10\,\mu s$, and
|
||||
\ehelfs{} were expected to have significantly better performance.
|
||||
|
||||
However, unwinding over and over again from the same program point would have
|
||||
been of no interest at all, since \prog{libunwind} would have simply cached the
|
||||
relevant DWARF row. In the mean time, making sure that the various unwinding
|
||||
relevant DWARF rows. At the same time, making sure that the various unwindings
|
||||
are made from different locations is somewhat cheating, since it defeats
|
||||
\prog{libunwind}'s caching and does not reproduce ``real-world'' unwinding
|
||||
distribution. All in all, the benchmarking method must have a ``natural''
|
||||
|
@ -892,8 +894,8 @@ stack unwindings crossing some standard library functions, starting from inside
|
|||
them, etc.
|
||||
|
||||
Finally, the unwound program must be interesting enough to enter and exit
|
||||
functions often, building a good stack of nested function calls (at least 5
|
||||
frequently), have FDEs that are not as simple as in Listing~\ref{lst:ex1_dw},
|
||||
functions often, building a good stack of nested function calls (frequently at
|
||||
least 5 deep), have FDEs that are not as simple as in Listing~\ref{lst:ex1_dw},
|
||||
etc.
|
||||
|
||||
|
||||
|
@ -925,7 +927,8 @@ Section~\ref{ssec:bench_req} above: since it stops at regular intervals and
|
|||
unwinds, the unwindings are evenly distributed \wrt{} the frequency of
|
||||
execution of the code, which is a natural enough setup for the benchmarks to be
|
||||
meaningful, while still unwinding from diversified locations, preventing
|
||||
caching from being be overwhelming. It also has the ability to unwind from
|
||||
caching from being overwhelming --~as can be observed later in
|
||||
Section~\ref{ssec:timeperf}. It also has the ability to unwind from
|
||||
within any function, including functions of linked shared libraries. It can also
|
||||
be applied to virtually any program, which allows unwinding ``interesting''
|
||||
code.
|
||||
|
@ -944,27 +947,26 @@ turned out necessary to slightly modify \prog{libunwind}'s interface to add a
|
|||
parameter to an initialisation function, since \prog{libunwind} is made to be
|
||||
agnostic of the system and process as much as possible, to be able to unwind in
|
||||
any context. This very restricted information lacked a memory map (see
|
||||
Section~\ref{ssec:ehelfs}) in order to use \ehelfs. Apart from this, the
|
||||
modified version of \prog{libunwind} produced is entirely compatible with the
|
||||
vanilla version. This means that the only modifications required to use
|
||||
\ehelfs{} within any project using \prog{libunwind} should be changing one line
|
||||
of code to add one parameter to a function call and linking against the
|
||||
modified version of \prog{libunwind} instead of the system version.
|
||||
Section~\ref{ssec:ehelfs}) in order to use \ehelfs{} --~while, on the other
|
||||
hand, providing information about the original DWARF that is now useless.
|
||||
Apart from this, the modified version of \prog{libunwind} produced is entirely
|
||||
compatible with the vanilla version. This means that the only modifications
|
||||
required to use \ehelfs{} within any project using \prog{libunwind} should be
|
||||
changing one line of code to add one parameter to a function call and linking
|
||||
against the modified version of \prog{libunwind} instead of the system version.
|
||||
|
||||
Once this was done, plugging it into \prog{perf} was a matter of only a few
|
||||
lines of code, apart from the benchmarking code. The major problem encountered was
|
||||
to understand how \prog{perf} works. In order to avoid perturbing the traced
|
||||
program, \prog{perf} does not unwind at runtime, but rather records at regular
|
||||
intervals the program's stack, and all the auxiliary information that is needed
|
||||
to unwind later. This is done when running \lstbash{perf record}. Then,
|
||||
\lstbash{perf report} unwinds the stack to analyze it; but at this point of
|
||||
time, the traced process is long dead, thus any PID-based approach, or any
|
||||
approach using \texttt{/proc} information will fail. However, as this was the
|
||||
easiest method, the first version of \ehelfs{} used those mechanisms; thus
|
||||
requiring some code rewriting.
|
||||
|
||||
The modified versions of both \prog{perf} and \prog{libunwind} are present in
|
||||
the repositories \prog{perf-eh\_elf} and \prog{libunwind-eh\_elf}.
|
||||
to unwind later. This is done when running \lstbash{perf record}. Then, a
|
||||
subsequent call to \lstbash{perf report} unwinds the stack to analyze it; but
|
||||
at this point in time, the traced process is long dead. Thus, any PID-based
|
||||
approach, or any approach using \texttt{/proc} information will fail. However,
|
||||
as this was the easiest method, the first version of \ehelfs{} used those
|
||||
mechanisms; it took some code rewriting to move to a PID- and
|
||||
\texttt{/proc}-agnostic implementation.
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\subsection{Other explored methods}
|
||||
|
@ -1052,6 +1054,11 @@ instruction, however, would not slow down at all the implementation, since
|
|||
every instruction would simply be compiled to x86\_64 without affecting the
|
||||
already supported code.
|
||||
|
||||
The fact that there is a sharp difference between cached and uncached
|
||||
\prog{libunwind} confirms that our experimental setup did not unwind at totally
|
||||
different locations every single time, and thus was not biased in this
|
||||
direction, since caching is still very efficient.
|
||||
|
||||
It is also worth noting that the compilation time of \ehelfs{} is
|
||||
reasonably short. On the machine described in Section~\ref{ssec:bench_hw}, and
|
||||
without using multiple cores to compile, the various shared objects needed to
|
||||
|
@ -1117,8 +1124,10 @@ Section~\ref{ssec:instr_cov}).
|
|||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\subsection{Instructions coverage}\label{ssec:instr_cov}
|
||||
|
||||
In order to determine which proportion of real-world ELF instructions are
|
||||
covered by our compiler and \ehelfs.
|
||||
In order to determine which DWARF instructions are necessary to implement to
|
||||
have meaningful results, as well as to assess the instruction coverage of our
|
||||
compiler and \ehelfs, we must look at real-world ELF files and inspect the
|
||||
instructions used.
|
||||
|
||||
The method chosen was to take a random uniform sample of 4000 ELFs among those
|
||||
present on a basic ArchLinux system setup, in the directories \texttt{/bin},
|
||||
|
@ -1211,7 +1220,7 @@ instructions encountered that were not supported by \ehelfs. The first row is
|
|||
only concerned with the columns CFA, \reg{rip}, \reg{rsp}, \reg{rbp} and
|
||||
\reg{rbx} (the supported registers --~see Section~\ref{ssec:ehelfs}). The
|
||||
second row analyzes all the columns that were encountered, no matter whether
|
||||
supported or not.
|
||||
supported or not in \ehelfs.
|
||||
|
||||
Table~\ref{table:instr_types} analyzes the proportion of each command
|
||||
--~the formal way a register is set~-- for non-CFA columns in the sampled data. For
|
||||
|
@ -1221,11 +1230,13 @@ means stored at the address of an expression's result, and the \texttt{Val\_}
|
|||
prefix means that the value must not be dereferenced. Overall, it can be seen
|
||||
that supporting \texttt{Offset} already means supporting the vast majority of
|
||||
registers. The data gathered (not reproduced here) also suggests that
|
||||
supporting a few common expressions is enough to support most of them.
|
||||
supporting a few common expressions is enough to support most of them. This is
|
||||
further supported by the fact that we already support more than $80\,\%$ of
|
||||
expressions only by supporting two basic constructs.
|
||||
|
||||
It is also worth noting that of all the 4000 analyzed files, there are only 12
|
||||
that contained all the unsupported expressions seen, and only 24 that contained
|
||||
some unsupported instruction at all.
|
||||
It is also worth noting that among all of the 4000 analyzed files, all the
|
||||
unsupported expressions are clustered in only 12 of them, and only 24 contained
|
||||
unsupported instructions at all.
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
%%%% End main text content %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
|
|