diff --git a/report/report.tex b/report/report.tex index 12d1fd1..b7e6d15 100644 --- a/report/report.tex +++ b/report/report.tex @@ -80,15 +80,15 @@ restored before returning, the function's return address and local variables. On the x86\_64 platform, with which this report is mostly concerned, the calling convention that is followed is defined in the System V -ABI~\cite{systemVabi} for the Unix-like operating systems (among which Linux). +ABI~\cite{systemVabi} for the Unix-like operating systems --~among which Linux. Under this calling convention, the first six arguments of a function are passed in the registers \reg{rdi}, \reg{rsi}, \reg{rdx}, \reg{rcx}, \reg{r8}, \reg{r9}, while additional arguments are pushed onto the stack. It also defines which registers may be overwritten by the callee, and which parameters must be -restored before returning (which most of the time is done by pushing the -register value onto the stack in the function prelude, and restoring it just -before returning). Those preserved registers are \reg{rbx}, \reg{rsp}, -\reg{rbp}, \reg{r12}, \reg{r13}, \reg{r14}, \reg{r15}. +restored before returning. This restoration, most of the time, is done by +pushing the register value onto the stack in the function prelude, and +restoring it just before returning. Those preserved registers are \reg{rbx}, +\reg{rsp}, \reg{rbp}, \reg{r12}, \reg{r13}, \reg{r14}, \reg{r15}. \begin{wrapfigure}{r}{0.4\textwidth} \centering @@ -98,11 +98,8 @@ before returning). 
Those preserved registers are \reg{rbx}, \reg{rsp}, \end{wrapfigure} The register \reg{rsp} is supposed to always point to the last used memory cell -in the stack, thus, when the process just enters a new function, \reg{rsp} -points right to the location of the return address\footnote{Remember that since -the stack grows \emph{downwards} in memory, the arrow of \reg{rsp} points -\emph{below} the RA cell in the figure, and yet the memory cell indexed is the -one \emph{above} in the drawing, that is, the RA.}. Then, the compiler might +in the stack. Thus, when the process just enters a new function, \reg{rsp} +points right to the location of the return address. Then, the compiler might use \reg{rbp} (``base pointer'') to save this value of \reg{rip}, by writing the old value of \reg{rbp} just below the return address on the stack, then copying \reg{rsp} to \reg{rbp}. This makes it easy to find the return address @@ -148,8 +145,8 @@ Left apart analyzing the assembly code produced, there is no way to find where the return address is stored, relatively to \reg{rsp}, at some arbitrary point of the function. Even when \reg{rbp} is used, there is no easy way to guess where each callee-saved register is stored in the stack frame, and worse, which -callee-saved registers were saved (since it is not necessary to save a register -that the function never touches). +callee-saved registers were saved, since it is optional to save a register +that the function never touches. With this example, it seems pretty clear that it is often necessary to have additional data to perform stack unwinding. This data is often stored among the @@ -171,11 +168,11 @@ context, by unwinding \lstc{fct_b}'s frame. \medskip -Yet, stack unwinding (and thus debugging data) \emph{is not limited to +Yet, stack unwinding, and thus, debugging data, \emph{is not limited to debugging}. Another common usage is profiling. 
A profiling tool, such as \prog{perf} under -Linux -- see Section~\ref{ssec:perf} --, is used to measure and analyze in +Linux --~see Section~\ref{ssec:perf} --, is used to measure and analyze in which functions a program spends its time, identify bottlenecks and find out which parts are critical to optimize. To do so, modern profilers pause the traced program at regular, short intervals, inspect their stack, and determine @@ -202,8 +199,8 @@ trigger the destructors of stack-allocated objects. Furthermore, this is often undesirable: \lstc{setjmp} has a quite big overhead, which is introduced whenever a \lstc{try} block is encountered. Instead, it is often preferred to have strictly no overhead when no exception happens, at the cost of a greater -overhead when an exception is actually fired (after all, they are supposed to -be \emph{exceptional}). For more details on C++ exception handling, +overhead when an exception is actually fired --~after all, they are supposed to +be \emph{exceptional}. For more details on C++ exception handling, see~\cite{koening1990exception} (especially Section~16.5). Possible implementation mechanisms are also presented in~\cite{dinechin2000exn}. @@ -237,8 +234,8 @@ the previous paragraph, in an ELF section originally called For any binary, debugging information can easily get quite large if no attention is payed to keeping it as compact as possible. In this matter, DWARF does an excellent job, and everything is stored in a very compact way. This, -however, as we will see, makes it both difficult to parse correctly (with \eg{} -variable-length integers) and quite slow to interpret. +however, as we will see, makes it both difficult to parse correctly and quite +slow to interpret. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{DWARF unwinding data} @@ -259,19 +256,15 @@ a table, that has a range of IPs on which it has authority. 
Most often, but not necessarily, it corresponds to a single function in the original source code. Each column of the table is a register (\eg{} \reg{rsp}), with two additional special registers, CFA (Canonical Frame Address) and RA (Return Address), -containing respectively the base pointer of the current stack -frame\footnote{The CFA is most commonly thought of as the base pointer of the -frame, yet this is not enforced by DWARF\@. The CFA is used as an address from -which other registers will be deduced as offsets, and although it is supposed -to be the actual base pointer, it can be anything as long as it is close enough -to the addresses that will be deduced from it.} and the return address of the -current function (\ie{} for x86\_64, the unwound value of \reg{rip}, the -instruction pointer). Each row has a certain validity interval, on which it -describes accurate unwinding data. This range starts at the instruction pointer -it is associated with, and ends at the start IP of the next table row (or the -end IP of the current FDE if it was the last row). In particular, there can be -no ``IP hole'' within a FDE --~unlike FDEs themselves, which can leave holes -between them. +containing respectively the base pointer of the current stack frame and the +return address of the current function. For instance, on an x86\_64 +architecture, RA would contain the unwound value of \reg{rip}, the instruction +pointer. Each row has a certain validity interval, on which it describes +accurate unwinding data. This range starts at the instruction pointer it is +associated with, and ends at the start IP of the next table row (or the end IP +of the current FDE if it was the last row). In particular, there can be no ``IP +hole'' within a FDE --~unlike FDEs themselves, which can leave holes between +them. \begin{figure}[h] \begin{minipage}{0.45\textwidth} @@ -329,17 +322,17 @@ how the stack frame is constructed. 
When interpreting the generated \ehframe{} with \lstbash{readelf -wF}, we obtain the (slightly edited) Listing~\ref{lst:ex1_dw}. During the function prelude, \ie{} for $\mhex{615} \leq \reg{rip} < \mhex{619}$, the stack frame only contains the return address, -thus the CFA is 8 bytes above \reg{rsp} (which was the value of \reg{rsp} -before the call, and is the topmost value of used space for this stack frame), -and the return address is precisely at \reg{rsp} --~that is, stored between -\reg{rsp} and $\reg{rsp} + 8$. Then, 8 integers of 4 bytes each (for -\lstc{fibo}, \lstc{pos} being optimized out) are allocated on the stack, which -puts the CFA 32 bytes above \reg{rsp}, and the return address still 8 bytes -below the CFA\@. Yet, \prog{gcc} decided to allocate a total space of 48 bytes -for the stack frame for memory alignment reasons, which means subtracting 40 -bytes to \reg{rsp} (address $\mhex{615}$ in the assembly). Then, by the end of -the function, the local variables are discarded and \reg{rsp} is reset to its -value from the first row. +thus the CFA is 8 bytes above \reg{rsp}, and the return address is precisely at +\reg{rsp} --~that is, stored between \reg{rsp} and $\reg{rsp} + 8$. Then, the +contents of \lstc{fibo}, 8 integers of 4 bytes each, are allocated on the +stack, which puts the CFA 32 bytes above \reg{rsp}; the return address still +being 8 bytes below the CFA\@. The variable \lstc{pos} is optimized out in the +generated assembly code, thus no stack space is allocated for it. Yet, +\prog{gcc} decided to allocate a total space of 48 bytes for the stack frame +for memory alignment reasons, which means subtracting 40 bytes from \reg{rsp} +(address $\mhex{615}$ in the assembly). Then, by the end of the function, the +local variables are discarded and \reg{rsp} is reset to its value from the +first row. However, DWARF data isn't actually stored as a table in the binary files, but is instead stored as in Listing~\ref{lst:ex1_dwraw}. 
The first row has the @@ -425,9 +418,9 @@ These are the DWARF instructions used for CFI description, that is, the instructions that contain the stack unwinding table informations. The following list is an exhaustive list of instructions from the DWARF5 specification~\cite{dwarf5std} concerning CFI, with reworded descriptions for -brevity and clarity. All these instructions are up to variants (most +brevity and clarity. All these instructions are up to variants --~most instructions exist in multiple formats to handle various operands formatting, -to optimize space). Since we won't be talking about the underlying file format +to optimize space. Since we won't be talking about the underlying file format here, those variations between eg. \dwcfa{advance\_loc1} and \dwcfa{advance\_loc2} --~which differ only on the number of bytes of their operand~-- are irrelevant and will be eluded. @@ -517,10 +510,10 @@ only handled as register identifiers, so we can safely state that $\reg{reg} A value can then be undefined, stored at memory address $x$ or be directly a value $x$, $x$ being here a simple expression consisting of $\reg{reg} + -\textit{offset}$. The CFA is considered a simple register here. For instance, to -define $\reg{rax}$ to the value contained in memory 16 bytes below the CFA, we -would have $\reg{rax} \mapsto \valaddr{\reg{CFA}, -16}$ (for the stack grows -downwards). +\textit{offset}$. The CFA is considered a simple register here. For instance, +to define $\reg{rax}$ to the value contained in memory 16 bytes below the CFA, +we would have $\reg{rax} \mapsto \valaddr{\reg{CFA}, -16}$, since the stack +grows downwards. \subsection{Target language~: a C function body} @@ -533,10 +526,10 @@ execution stack, or even the heap. This function takes as arguments an instruction pointer --~supposedly extracted from $\reg{rip}$~-- and an array of register values; and returns a fresh array of register values after unwinding this call frame. 
The function is -compositional\footnote{up to technicities: the IP obtained after unwinding the -first frame might be handled in a different dynamically loaded object, and this -would require inspecting the DWARF located in another file}: it can be called -twice in a row to unwind two stack frames. +compositional: it can be called twice in a row to unwind two stack frames, +unless the IP obtained after the first unwinding comes from another shared +object file, for instance a call to \prog{libc}. In this case, unwinding the +second frame will require loading the corresponding DWARF information. The function is the following~: @@ -636,8 +629,8 @@ $F\left[0 \ldots |F|-2\right] \extrarrow{reg} \bullet$. \semI{\dwcfa{nop()} \cdot d}{s}(F) &:= \contsem{F}\\ \end{align*} -(The stack is used for \texttt{remember\_state} and \texttt{restore\_state}. If -we omit those two operations, we can plainly remove the stack). +The stack is used for \texttt{remember\_state} and \texttt{restore\_state}. If +we omit those two operations, we can plainly remove the stack. \subsection{From $\intermedlang$ to C} @@ -694,8 +687,9 @@ machine code on the x86\_64 platform. The rough idea of the compilation is to produce, out of the \ehframe{} section of a binary, C code that resembles the code shown in the DWARF semantics from Section~\ref{sec:semantics} above. This C code is then compiled by GCC in -\lstbash{-O2} mode\footnote{Compiling in \lstbash{-O3} takes way too much -time.}, providing for free all the optimization passes of a modern compiler. +\lstbash{-O2} mode, since it already provides a good level of optimization and +compiling in \lstbash{-O3} takes way too much time. This saves us the trouble +of optimizing the generated C code whenever GCC does that by itself. The generated code consists in a single monolithic function, \lstc{_eh_elf}, taking as arguments an instruction pointer and a memory context (\ie{} the @@ -715,18 +709,18 @@ return it. 
A setting of the compiler also optionally enables another parameter to the \lstc{_eh_elf} function, \lstc{deref}, which is a function pointer. This -\lstc{deref} function, when enabled, replaces everywhere the dereferencing +\lstc{deref} function, when present, replaces everywhere the dereferencing \lstc{*} operator, and can be used to generate \ehelfs{} that will work on -remote address spaces (\ie{} whenever the unwinding is not done on the process -reading the \ehelf{} itself, but some other process, or even on a stack dump of -a long-terminated process). +remote address spaces, that is, whenever the unwinding is not done on the +process reading the \ehelf{} itself, but some other process, or even on a stack +dump of a long-terminated process. Unlike in the \ehframe, and unlike what should be done in a release, real-world-proof version of the \ehelfs, the choice was made to keep this prototype simple, and only handle the few registers that were needed to simply unwind the stack. Thus, the only registers handled in \ehelfs{} are \reg{rip}, -\reg{rbp}, \reg{rsp} and \reg{rbx} (the latter being used quite often in -\prog{libc} to hold the CFA address). This is enough to unwind the stack, but +\reg{rbp}, \reg{rsp} and \reg{rbx}, the latter being used quite often in +\prog{libc} to hold the CFA address. This is enough to unwind the stack, but is not sufficient to analyze every stack frame as \prog{gdb} would do after a \lstbash{frame n} command. 
@@ -736,10 +730,9 @@ is not sufficient to analyze every stack frame as \prog{gdb} would do after a In the unwind context from Listing~\ref{lst:unw_ctx}, the values of type \lstc{uintptr_t} are the values of the corresponding registers, and \lstc{flags} is a 8-bits value, indicating for each register whether it is -present or not in this context (\ie{} if the \lstc{rbx} bit is not set, the -value of \lstc{rbx} in the structure isn't meaningful), plus an error bit, -indicating whether an error occurred during unwinding (which can be due \eg{} -to an unsupported operation in the original DWARF, thus compiled to an error). +present or not in this context, plus an error bit, indicating whether an error +occurred during unwinding. Such errors can be due \eg{} to an unsupported +operation in the original DWARF\@. This generated data is stored in separate shared object files, which we call \ehelfs. It would have been possible to alter the original ELF file to embed @@ -827,12 +820,12 @@ made in order to shrink the \ehelfs. The major optimization that most reduced the output size was to use an if/else tree implementing a binary search on the program counter relevant intervals, instead of a huge switch. In the process, we also \emph{outline} a lot of code, -that is, find out identical ``switch cases'' bodies (which are not switch cases -anymore, but if bodies), move them outside of the if/else tree, identify them -by a label, and jump to them using a \lstc{goto}, which de-duplicates a lot of -code and contributes greatly to the shrinking. In the process, we noticed that -the vast majority of FDE rows are actually taken among very few ``common'' FDE -rows. +that is, find out identical ``switch cases'' bodies --~which are not switch +cases anymore, but if bodies~--, move them outside of the if/else tree, +identify them by a label, and jump to them using a \lstc{goto}, which +de-duplicates a lot of code and contributes greatly to the shrinking. 
In the +process, we noticed that the vast majority of FDE rows are actually taken among +very few ``common'' FDE rows. This makes this optimization really efficient, as seen later in Section~\ref{ssec:results_size}, but also makes it an interesting question @@ -886,13 +879,12 @@ Listing~\ref{lst:ex1_dw}, etc. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Presentation of \prog{perf}}\label{ssec:perf} -\prog{Perf} is a \emph{profiler} that comes with the Linux ecosystem (actually, -\prog{perf} is developed within the Linux kernel source tree). A profiler is an -important tool from the developer's toolbox that analyzes the performance of -programs by recording the time spent in each function, including within nested -calls. This analysis often enables programmers to optimize critical paths and -functions in their programs, while leaving unoptimized functions that are -seldom traversed. +\prog{Perf} is a \emph{profiler} that comes with the Linux ecosystem, and is +even developed within the Linux kernel source tree. A profiler is an important +tool from the developer's toolbox that analyzes the performance of programs by +recording the time spent in each function, including within nested calls. This +analysis often enables programmers to optimize critical paths and functions in +their programs, while leaving unoptimized functions that are seldom traversed. For this purpose, the basic idea is to stop the traced program at regular intervals, unwind its stack, write down the current nested function calls, and @@ -924,16 +916,16 @@ activity, be linked against \prog{libc} and \prog{pthread}, and be very light. Interfacing \ehelfs{} with \prog{perf} required, in a first place, to fork \prog{libunwind} and implement \ehelfs{} support for it. 
In the process, it turned out necessary to slightly modify \prog{libunwind}'s interface to add a -parameter to a function, since \prog{libunwind} is made to be agnostic of the -system and process as much as possible, to be able to unwind in any context. -This very restricted information lacked a memory map (a table indicating which -shared object is mapped at which address in memory) in order to use \ehelfs. -Apart from this, the modified version of \prog{libunwind} produced is entirely -compatible with the vanilla version, meaning that the only modifications -required to use \ehelfs{} within any project using \prog{libunwind} should be -modifying one line of code (this function call, which is a setup function) and -linking against the modified version of \prog{libunwind} instead of the system -version. +parameter to an initialisation function, since \prog{libunwind} is made to be +agnostic of the system and process as much as possible, to be able to unwind in +any context. This very restricted information lacked a \emph{memory map}, a +table indicating which shared object is mapped at which address in memory, in +order to use \ehelfs. Apart from this, the modified version of \prog{libunwind} +produced is entirely compatible with the vanilla version. This means that the +only modifications required to use \ehelfs{} within any project using +\prog{libunwind} should be changing one line of code to add one parameter to a +function call and linking against the modified version of \prog{libunwind} +instead of the system version. Once this was done, plugging it in \prog{perf} was the matter of a few lines of code only, left apart the benchmarking code. The major problem encountered was @@ -984,9 +976,9 @@ swapping. 
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Measured time performance} -The benchmarking, as described in Section~\ref{ssec:bench_perf}, of \ehelfs{} -against the vanilla \prog{libunwind} (using the same methodology, only linking -\prog{perf} against the vanilla \prog{libunwind}), gives the results in +A benchmark of \ehelfs{} against the vanilla \prog{libunwind} was run using +the exact same methodology as in Section~\ref{ssec:bench_perf}, only linking +\prog{perf} against the vanilla \prog{libunwind}. It yields the results in Table~\ref{table:bench_time}. \begin{table}[h] @@ -1036,11 +1028,11 @@ instruction, however, would not slow down at all the implementation, since every instruction would simply be compiled to x86\_64 without affecting the already supported code. -It is also worth noting that on the machine described in -Section~\ref{ssec:bench_hw}, the compilation of the \ehelfs{} at a level of -\lstc{-O2} needed to run \prog{hackbench}, that is, \prog{hackbench}, -\prog{libc}, \prog{ld}, and \prog{libpthread} takes an overall time of $25.28$ -seconds (using only a single core). +It is worth noting that the compilation time of \ehelfs{} is also +reasonably short. On the machine described in Section~\ref{ssec:bench_hw}, and +without using multiple cores to compile, the various shared objects needed to +run \prog{hackbench} --~that is, \prog{hackbench}, \prog{libc}, \prog{ld} and +\prog{libpthread}~-- are compiled in an overall time of $25.28$ seconds. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Measured compactness}\label{ssec:results_size} @@ -1189,8 +1181,8 @@ only concerned about the columns CFA, \reg{rip}, \reg{rsp}, \reg{rbp} and second row analyzes all the columns that were encountered, no matter whether supported or not. 
-The Table~\ref{table:instr_types} analyzes the proportion of each command (\ie\ -the formal way a register is set) for non-CFA columns in the sampled data. For +Table~\ref{table:instr_types} analyzes the proportion of each command +--~the formal way a register is set~-- for non-CFA columns in the sampled data. For a brief explanation, \texttt{Offset} means stored at offset from CFA, \texttt{Register} means the value from a machine register, \texttt{Expression} means stored at the address of an expression's result, and the \texttt{Val\_}