From 2f440495069c389589865b36af7d1dea0965bab5 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Th=C3=A9ophile=20Bastian?= Date: Sat, 18 Aug 2018 22:06:55 +0200 Subject: [PATCH] Rephrase everything but section 2 --- report/report.tex | 133 +++++++++++++++++++++++++--------------------- 1 file changed, 72 insertions(+), 61 deletions(-) diff --git a/report/report.tex b/report/report.tex index 87c0aaf..c011745 100644 --- a/report/report.tex +++ b/report/report.tex @@ -702,14 +702,14 @@ Listing~\ref{lst:unw_ctx}. The function will then return a fresh memory context, containing the values the registers hold after unwinding this frame. The body of the function itself consists in a single monolithic switch, taking -advantage of the non-standard --~yet widely implemented in C compilers~-- -syntax for range switches, in which each \lstinline{case} can refer to a range. -All the FDEs are merged together into this switch, each row of a FDE being a -switch case. Separating the various FDEs in the C code --~other than with -comments~-- is, unlike what is done in DWARF, pointless, since accessing a -``row'' has a linear cost, and the C code is not meant to be read, except maybe -for debugging purposes. The switch cases bodies then fill a context with -unwound values, then return it. +advantage of the non-standard --~yet overwhelmingly implemented in common C +compilers~-- syntax for range switches, in which each \lstinline{case} can +refer to a range, \eg{} \lstc{case 17 ... 42:}. All the FDEs are merged +together into this switch, each row of a FDE being a switch case. Separating +the various FDEs in the C code --~other than with comments~-- is, unlike what +is done in DWARF, pointless, since accessing a ``row'' has a linear cost, and +the C code is not meant to be read, except maybe for debugging purposes. The +switch cases bodies then fill a context with unwound values before returning it.
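The range-switch layout described in the paragraph above can be illustrated with a minimal sketch. It is not the generated code of the report: the type `sketch_context_t`, the function `sketch_eh_elf`, the `valid` flag, the address ranges and the two rows are all hypothetical, chosen only to resemble a function prologue that pushes `rbp`. Only the `case A ... B:` syntax is taken from the text; it is a GNU extension accepted by GCC and Clang.

```c
#include <stdint.h>

/* Hypothetical, simplified context; the report's real structure is the one
 * in its "Unwinding context" listing. `valid` is an assumption of this
 * sketch, used to signal an unknown program counter. */
typedef struct {
    uintptr_t rip, rsp, rbp;
    int valid;
} sketch_context_t;

/* One switch case per (made-up) unwinding row, selected by the range of
 * program counters for which the row applies. Each case fills a fresh
 * context with the unwound values, then returns it. */
sketch_context_t sketch_eh_elf(sketch_context_t ctx, uintptr_t pc) {
    const uintptr_t word = sizeof(uintptr_t);
    sketch_context_t out = {0, 0, 0, 1};
    uintptr_t cfa;

    switch (pc) {
        case 0x615 ... 0x618: /* before `push rbp`: CFA = rsp + 1 word */
            cfa = ctx.rsp + word;
            out.rip = *(const uintptr_t *)(cfa - word); /* return address */
            out.rsp = cfa;
            out.rbp = ctx.rbp; /* rbp not saved yet, unchanged */
            return out;
        case 0x619 ... 0x61b: /* after `push rbp`: CFA = rsp + 2 words */
            cfa = ctx.rsp + 2 * word;
            out.rip = *(const uintptr_t *)(cfa - word);
            out.rbp = *(const uintptr_t *)(cfa - 2 * word); /* saved rbp */
            out.rsp = cfa;
            return out;
        default: /* no row covers this program counter */
            out.valid = 0;
            return out;
    }
}
```

Calling `sketch_eh_elf` with a context whose `rsp` points into a small array standing in for the stack is enough to exercise both rows without a real process.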
A setting of the compiler also optionally enables another parameter to the \lstc{_eh_elf} function, \lstc{deref}, which is a function pointer. This @@ -724,12 +724,12 @@ real-world-proof version of the \ehelfs, the choice was made to keep this implementation simple, and only handle the few registers that were needed to simply unwind the stack. Thus, the only registers handled in \ehelfs{} are \reg{rip}, \reg{rbp}, \reg{rsp} and \reg{rbx}, the latter being used a few -times in \prog{libc} to hold the CFA address in common functions. This is -enough to unwind the stack reliably, and thus enough for profiling, but is not -sufficient to analyze every stack frame as \prog{gdb} would do after a -\lstbash{frame n} command. Yet, if one was to enhance the code to handle every -register, it would not be much harder and would probably be only a few hours of -code refactoring and rewriting. +times in \prog{libc} and other less common libraries to hold the CFA address in +common functions. This is enough to unwind the stack reliably, and thus enough +for profiling, but is not sufficient to analyze every stack frame as \prog{gdb} +would do after a \lstbash{frame n} command. Yet, if one was to enhance the +code to handle every register, it would not be much harder and would probably +be only a few hours worth of code refactoring and rewriting. \lstinputlisting[language=C, caption={Unwinding context}, label={lst:unw_ctx}] {src/dwarf_assembly_context/unwind_context.c} @@ -754,17 +754,19 @@ on or off, and it doesn't require to alter the base system by editing \eg{} \texttt{/usr/lib/libc-*.so}. Instead, when the \ehelf{} data is required, those files can simply be \lstc{dlopen}'d. It is also possible to imagine, in a future environment production, packaging \ehelfs{} files separately, so that -people interested in heavy computation can have the choice to install them. +people interested in better performance can have the choice to install them. 
This, in particular, means that each ELF file has its unwinding data in a -separate \ehelf{} file --~just like with DWARF, where each ELF retains its own -DWARF data. Thus, an unwinder must first acquire a \emph{memory map}, a table -listing the various ELF files loaded and \emph{mapped} in memory, and on which -memory segment. This memory map is provided by the operating system --~for -instance, on Linux, it is available as a file in \texttt{/proc}. Once this map -is acquired, when unwinding from a given IP, the unwinder must identify the -memory segment from which it comes, deduce the source ELF file, and deduce the -corresponding \ehelf. +separate \ehelf{} file, implying that the unwinding data for a given program is +scattered among various \ehelf{} files, one for each shared object loaded +--~just like with DWARF, where each ELF retains its own DWARF data. Thus, an +unwinder must first acquire a \emph{memory map}, a table listing the various +ELF files loaded and \emph{mapped} in memory, and on which memory segment. This +memory map is provided by the operating system --~for instance, on Linux, it is +available as a file in \texttt{/proc}. Once this map is acquired, when +unwinding from a given IP, the unwinder must identify the memory segment from +which it comes, deduce the source ELF file, and deduce the corresponding +\ehelf. \medskip @@ -772,8 +774,8 @@ corresponding \ehelf. label={lst:fib7_eh_elf_basic}] {src/fib7/fib7.eh_elf_basic.c} -The C code in Listing~\ref{lst:fib7_eh_elf_basic} is a part of what was -generated for the C code in Listing~\ref{lst:ex1_c}. +The C code in Listing~\ref{lst:fib7_eh_elf_basic} is the relevant part of what +was generated for the C code in Listing~\ref{lst:ex1_c}. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{First results} @@ -817,13 +819,13 @@ it depends. 
The first column only includes the sizes of the ELF sections \lstc{.text} (the program itself) and \lstc{.rodata}, the read-only data (such as static strings, etc.). Only the weight of the \lstc{.text} section of the generated \ehelfs{} -is considered, because it is self-consistent (few data or none is stored in +is considered, because it is self-contained (few data or none is stored in \lstc{.rodata}), and the other sections could be removed if the \ehelfs{} \lstc{.text} was somehow embedded in the original shared object. This first tentative version of \ehelfs{} is roughly 7 times heavier than the original \lstc{.eh_frame}, and represents a far too significant proportion of -the original program size. +the original program size ($65\,\%$). %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Space optimization}\label{ssec:space_optim} @@ -838,13 +840,13 @@ The major optimization that most reduced the output size was to use an if/else tree implementing a binary search on the instruction pointer relevant intervals, instead of a single monolithic switch. In the process, we also \emph{outline} code whenever possible, that is, find out identical ``switch -cases'' bodies --~which are not switch cases anymore, but if bodies~--, move -them outside of the if/else tree, identify them by a label, and jump to them -using a \lstc{goto}, which de-duplicates a lot of code and contributes greatly -to the shrinking. In the process, we noticed that the vast majority of FDE rows -are actually taken among very few ``common'' FDE rows. For instance, in the -\prog{libc}, out of a total of $20827$ rows, only $302$ ($1.5\,\%$) remain -after the outlining. +cases'' bodies --~which are not switch cases anymore, but \texttt{if} +bodies~--, move them outside of the if/else tree, identify them by a label, and +jump to them using a \lstc{goto}, which de-duplicates a lot of code and +contributes greatly to the shrinking. 
In the process, we noticed that the vast +majority of FDE rows are actually taken among very few ``common'' FDE rows. For +instance, in the \prog{libc}, out of a total of $20827$ rows, only $302$ +($1.5\,\%$) unique rows remain after the outlining. This makes this optimization really efficient, as seen later in Section~\ref{ssec:results_size}, but also makes it an interesting question @@ -874,13 +876,13 @@ solution working. \subsection{Requirements}\label{ssec:bench_req} To provide relevant benchmarks of the \ehelfs{} performance, one must sample at -least a few hundreds or thousands of stack unwinding, since a single frame +least a few hundreds or thousands of stack unwindings, since a single frame unwinding with regular DWARF takes the order of magnitude of $10\,\mu s$, and \ehelfs{} were expected to have significantly better performance. However, unwinding over and over again from the same program point would have had no interest at all, since \prog{libunwind} would have simply cached the -relevant DWARF row. In the mean time, making sure that the various unwinding +relevant DWARF rows. In the mean time, making sure that the various unwindings are made from different locations is somehow cheating, since it makes useless \prog{libunwind}'s caching and does not reproduce ``real-world'' unwinding distribution. All in all, the benchmarking method must have a ``natural'' @@ -892,8 +894,8 @@ stack unwindings crossing some standard library functions, starting from inside them, etc. Finally, the unwound program must be interesting enough to enter and exit -functions often, building a good stack of nested function calls (at least 5 -frequently), have FDEs that are not as simple as in Listing~\ref{lst:ex1_dw}, +functions often, building a good stack of nested function calls (at least +frequently 5), have FDEs that are not as simple as in Listing~\ref{lst:ex1_dw}, etc. 
@@ -925,7 +927,8 @@ Section~\ref{ssec:bench_req} above: since it stops at regular intervals and unwinds, the unwindings are evenly distributed \wrt{} the frequency of execution of the code, which is a natural enough setup for the benchmarks to be meaningful, while still unwinding from diversified locations, preventing -caching from being be overwhelming. It also has the ability to unwind from +caching from being overwhelming --~as can be observed later in +Section~\ref{ssec:timeperf}. It also has the ability to unwind from within any function, included functions of linked shared libraries. It can also be applied to virtually any program, which allows unwinding ``interesting'' code. @@ -944,27 +947,26 @@ turned out necessary to slightly modify \prog{libunwind}'s interface to add a parameter to an initialisation function, since \prog{libunwind} is made to be agnostic of the system and process as much as possible, to be able to unwind in any context. This very restricted information lacked a memory map (see -Section~\ref{ssec:ehelfs}) in order to use \ehelfs. Apart from this, the -modified version of \prog{libunwind} produced is entirely compatible with the -vanilla version. This means that the only modifications required to use -\ehelfs{} within any project using \prog{libunwind} should be changing one line -of code to add one parameter to a function call and linking against the -modified version of \prog{libunwind} instead of the system version. +Section~\ref{ssec:ehelfs}) in order to use \ehelfs{} --~while, on the other +hand, providing information about the original DWARF that is now useless. +Apart from this, the modified version of \prog{libunwind} produced is entirely +compatible with the vanilla version.
This means that the only modifications +required to use \ehelfs{} within any project using \prog{libunwind} should be +changing one line of code to add one parameter to a function call and linking +against the modified version of \prog{libunwind} instead of the system version. Once this was done, plugging it in \prog{perf} was the matter of a few lines of code only, left apart the benchmarking code. The major problem encountered was to understand how \prog{perf} works. In order to avoid perturbing the traced program, \prog{perf} does not unwind at runtime, but rather records at regular intervals the program's stack, and all the auxiliary information that is needed -to unwind later. This is done when running \lstbash{perf record}. Then, -\lstbash{perf report} unwinds the stack to analyze it; but at this point of -time, the traced process is long dead, thus any PID-based approach, or any -approach using \texttt{/proc} information will fail. However, as this was the -easiest method, the first version of \ehelfs{} used those mechanisms; thus -requiring some code rewriting. - -The modified versions of both \prog{perf} and \prog{libunwind} are present in -the repositories \prog{perf-eh\_elf} and \prog{libunwind-eh\_elf}. +to unwind later. This is done when running \lstbash{perf record}. Then, a +subsequent call to \lstbash{perf report} unwinds the stack to analyze it; but +at this point of time, the traced process is long dead. Thus, any PID-based +approach, or any approach using \texttt{/proc} information will fail. However, +as this was the easiest method, the first version of \ehelfs{} used those +mechanisms; it took some code rewriting to move to a PID- and +\texttt{/proc}-agnostic implementation. 
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Other explored methods} @@ -1052,6 +1054,11 @@ instruction, however, would not slow down at all the implementation, since every instruction would simply be compiled to x86\_64 without affecting the already supported code. +The fact that there is a sharp difference between cached and uncached +\prog{libunwind} confirms that our experimental setup did not unwind at totally +different locations every single time, and thus was not biased in this +direction, since caching is still very efficient. + It is also worth noting that the compilation time of \ehelfs{} is also reasonably short. On the machine described in Section~\ref{ssec:bench_hw}, and without using multiple cores to compile, the various shared objects needed to @@ -1117,8 +1124,10 @@ Section~\ref{ssec:instr_cov}). %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Instructions coverage}\label{ssec:instr_cov} -In order to determine which proportion of real-world ELF instructions are -covered by our compiler and \ehelfs. +In order to determine which DWARF instructions are necessary to implement to +have meaningful results, as well as to assess the instruction coverage of our +compiler and \ehelfs, we must look at real-world ELF files and inspect the +instructions used. The method chosen was to take a random uniform sample of 4000 ELFs among those present on a basic ArchLinux system setup, in the directories \texttt{/bin}, @@ -1211,7 +1220,7 @@ instructions encountered that were not supported by \ehelfs. The first row is only concerned about the columns CFA, \reg{rip}, \reg{rsp}, \reg{rbp} and \reg{rbx} (the supported registers --~see Section~\ref{ssec:ehelfs}). The second row analyzes all the columns that were encountered, no matter whether -supported or not. +supported or not in \ehelfs.
The Table~\ref{table:instr_types} analyzes the proportion of each command --~the formal way a register is set~-- for non-CFA columns in the sampled data. For @@ -1221,11 +1230,13 @@ means stored at the address of an expression's result, and the \texttt{Val\_} prefix means that the value must not be dereferenced. Overall, it can be seen that supporting \texttt{Offset} already means supporting the vast majority of registers. The data gathered (not reproduced here) also suggests that -supporting a few common expressions is enough to support most of them. +supporting a few common expressions is enough to support most of them. This is +further supported by the fact that we already support more than $80\,\%$ of +expressions only by supporting two basic constructs. -It is also worth noting that of all the 4000 analyzed files, there are only 12 -that contained all the unsupported expressions seen, and only 24 that contained -some unsupported instruction at all. +It is also worth noting that among all of the 4000 analyzed files, all the +unsupported expressions are clustered in only 12 of them, and only 24 contained +unsupported instructions at all. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%% End main text content %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%