diff --git a/report/report.tex b/report/report.tex index a064d67..035c264 100644 --- a/report/report.tex +++ b/report/report.tex @@ -88,16 +88,19 @@ before returning). Those preserved registers are \reg{rbx}, \reg{rsp}, conventions}\label{fig:call_stack} \end{wrapfigure} -The register \reg{rsp} is supposed to always point just past the last used -memory cell in the stack, thus, when the process just enters a new function, -\reg{rsp} points 8 bytes after the location of the return address. Then, the -compiler might use \reg{rbp} (``base pointer'') to save this value of -\reg{rip}, by writing the old value of \reg{rbp} just below the return address -on the stack, then copying \reg{rsp} to \reg{rbp}. This makes it easy to find -the return address from anywhere within the function, and also allows for easy -addressing of local variables. Yet, using \reg{rbp} to save \reg{rip} is not -always done, since it somehow ``wastes'' a register. This decision is, on -x86\_64 System V, up to the compiler. +The register \reg{rsp} is supposed to always point to the last used memory cell +in the stack, thus, when the process just enters a new function, \reg{rsp} +points right to the location of the return address\footnote{Remember that since +the stack grows \emph{downwards} in memory, the arrow of \reg{rsp} points +\emph{below} the RA cell in the figure, and yet the memory cell indexed is the +one \emph{above} in the drawing, that is, the RA.}. Then, the compiler might +use \reg{rbp} (``base pointer'') to save this value of \reg{rip}, by writing +the old value of \reg{rbp} just below the return address on the stack, then +copying \reg{rsp} to \reg{rbp}. This makes it easy to find the return address +from anywhere within the function, and also allows for easy addressing of local +variables. Yet, using \reg{rbp} to save \reg{rip} is not always done, since it +somehow ``wastes'' a register. This decision is, on x86\_64 System V, up to the +compiler. 
Often, a function will start by subtracting some value to \reg{rsp}, allocating some space in the stack frame for its local variables. Then, it will push on @@ -242,52 +245,92 @@ when talking about DWARF, a register is merely a numerical identifier that is often, but not necessarily, mapped to a real machine register by the ABI\@. In practice, this data takes the form of a collection of tables, one table per -Frame Description Entry (FDE), which most often corresponds to a function. Each -column of the table is a register (\eg{} \reg{rsp}), with two additional +Frame Description Entry (FDE). A FDE, in turn, is a DWARF entry describing such +a table, that has a range of IPs on which it has authority. Most often, but not +necessarily, it corresponds to a single function in the original source code. +Each column of the table is a register (\eg{} \reg{rsp}), with two additional special registers, CFA (Canonical Frame Address) and RA (Return Address), -containing respectively the base pointer of the current stack frame and the -return address of the current function (\ie{} for x86\_64, the unwound value of -\reg{rip}, the instruction pointer). Each row of the table is a particular -instruction pointer, within the instruction pointer range of the tabulated FDE -(assuming a FDE maps directly to a function, this range is simply the IP range -of the given function in the \lstc{.text} section of the binary), a row being -valid from its start IP to the start IP of the next row, or the end IP of the -FDE if it is the last row. +containing respectively the base pointer of the current stack +frame\footnote{The CFA is most commonly thought of as the base pointer of the +frame, yet this is not enforced by DWARF\@. 
The CFA is used as an address from +which other registers will be deduced as offsets, and although it is supposed +to be the actual base pointer, it can be anything as long as it is close enough +to the addresses that will be deduced from it.} and the return address of the +current function (\ie{} for x86\_64, the unwound value of \reg{rip}, the +instruction pointer). Each row has a certain validity interval, on which it +describes accurate unwinding data. This range starts at the instruction pointer +it is associated with, and ends at the start IP of the next table row (or the +end IP of the current FDE if it was the last row). In particular, there can be +no ``IP hole'' within a FDE --~unlike FDEs themselves, which can leave holes +between them. -\begin{minipage}{0.45\textwidth} - \lstinputlisting[language=C, firstline=3, lastline=12, - caption={Original C},label={lst:ex1_c}] - {src/fib7/fib7.c} -\end{minipage} \hfill \begin{minipage}{0.45\textwidth} - \lstinputlisting[language=C,caption={Processed DWARF},label={lst:ex1_dw}] - {src/fib7/fib7.fde} - \lstinputlisting[language=C,caption={Raw DWARF},label={lst:ex1_dwraw}] - {src/fib7/fib7.raw_fde} -\end{minipage} +\begin{figure}[h] + \begin{minipage}{0.45\textwidth} + \lstinputlisting[language=C, firstline=3, lastline=12, + caption={Original C},label={lst:ex1_c}] + {src/fib7/fib7.c} + \end{minipage} \hfill \begin{minipage}{0.45\textwidth} + \lstinputlisting[language=C,caption={Processed DWARF}, + label={lst:ex1_dw}] + {src/fib7/fib7.fde} + \lstinputlisting[language=C,caption={Raw DWARF},label={lst:ex1_dwraw}] + {src/fib7/fib7.raw_fde} + \end{minipage} +\end{figure} -\begin{minipage}{0.45\textwidth} - \lstinputlisting[language={[x86masm]Assembler},lastline=11, - caption={Generated assembly},label={lst:ex1_asm}] - {src/fib7/fib7.s} -\end{minipage} \hfill \begin{minipage}{0.45\textwidth} - \lstinputlisting[language={[x86masm]Assembler},firstline=12, - firstnumber=last] - {src/fib7/fib7.s} -\end{minipage} 
+\begin{figure}[h] + \begin{minipage}{0.45\textwidth} + \lstinputlisting[language={[x86masm]Assembler},lastline=11, + caption={Generated assembly},label={lst:ex1_asm}] + {src/fib7/fib7.s} + \end{minipage} \hfill \begin{minipage}{0.45\textwidth} + \lstinputlisting[language={[x86masm]Assembler},firstline=12, + firstnumber=last] + {src/fib7/fib7.s} + \end{minipage} +\end{figure} + +\begin{table}[h] + \centering + \begin{tabular}{|c|c|c|c|c|c} + \stackfhead{+ \mhex{30}} + & \stackfhead{+ \mhex{28}} + & \stackfhead{+ \mhex{20}} + & \stackfhead{+ \mhex{1c}} + & \stackfhead{+ \mhex{4}} + & \stackfhead{} + \\ + \hline{} + Return Address & \textit{Alignment space} + & \spaced{2ex}{\lstc{fibo[7]}} + & \spaced{4ex}{\ldots} + & \spaced{2ex}{\lstc{fibo[0]}} + & \textit{Next frame} + \\ + \hline + \end{tabular} + \caption{Stack frame schema}\label{table:ex1_stack_schema} +\end{table} For instance, the C source code in Listing~\ref{lst:ex1_c} above, when compiled with \lstbash{gcc -O1 -fomit-frame-pointer -fno-stack-protector}, yields the -assembly code in Listing~\ref{lst:ex1_asm}. When interpreting the generated -\ehframe{} with \lstbash{readelf -wF}, we obtain the (slightly edited) +assembly code in Listing~\ref{lst:ex1_asm}. The memory layout of the stack +frame is presented in Table~\ref{table:ex1_stack_schema}, to help understanding +how the stack frame is constructed. When interpreting the generated \ehframe{} +with \lstbash{readelf -wF}, we obtain the (slightly edited) Listing~\ref{lst:ex1_dw}. During the function prelude, \ie{} for $\mhex{615} \leq \reg{rip} < \mhex{619}$, the stack frame only contains the return address, thus the CFA is 8 bytes above \reg{rsp} (which was the value of \reg{rsp} -before the call), and the return address is precisely at \reg{rsp}. Then, 9 -integers of 8 bytes each (8 for \lstc{fibo}, one for \lstc{pos}) are allocated -on the stack, which puts the CFA 80 bytes above \reg{rsp}, and the return -address still 8 bytes below the CFA\@. 
Then, by the end of the function, the -local variables are discarded and \reg{rsp} is reset to its value from the -first row. +before the call, and is the topmost value of used space for this stack frame), +and the return address is precisely at \reg{rsp} --~that is, stored between +\reg{rsp} and $\reg{rsp} + 8$. Then, 8 integers of 4 bytes each (for +\lstc{fibo}, \lstc{pos} being optimized out) are allocated on the stack, which +puts the CFA 32 bytes above \reg{rsp}, and the return address still 8 bytes +below the CFA\@. Yet, \prog{gcc} decided to allocate a total space of 48 bytes +for the stack frame for memory alignment reasons, which means subtracting 40 +bytes to \reg{rsp} (address $\mhex{615}$ in the assembly). Then, by the end of +the function, the local variables are discarded and \reg{rsp} is reset to its +value from the first row. However, DWARF data isn't actually stored as a table in the binary files, but is instead stored as in Listing~\ref{lst:ex1_dwraw}. The first row has the @@ -295,12 +338,12 @@ location of the first IP in the FDE, and must define at least its CFA\@. Then, when all relevant registers are defined, it is possible to define a new row by providing a location offset (\eg{} here $4$), and the new row is defined as a clone of the previous one, which can then be altered (\eg{} here by setting -\lstc{CFA} to $\reg{rsp} + 80$). This means that every line is defined \wrt{} +\lstc{CFA} to $\reg{rsp} + 48$). This means that every line is defined \wrt{} the previous one, and that the IPs of the successive rows cannot be determined -before evaluating every row before. Thus, unwinding a frame from an IP close to -the end of the frame will require evaluating pretty much every DWARF row in the -table before reaching the relevant information, slowing down drastically the -unwinding process. +without evaluating every row that comes before in the first place. 
Thus, +unwinding a frame from an IP close to the end of the frame will require +evaluating pretty much every DWARF row in the table before reaching the +relevant information, drastically slowing down the unwinding process. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{How big are FDEs?} @@ -377,8 +420,8 @@ brevity and clarity. All these instructions are up to variants (most instructions exist in multiple formats to handle various operands formatting, to optimize space). Since we won't be talking about the underlying file format here, those variations between eg. \dwcfa{advance\_loc1} and -\dwcfa{advance\_loc2} ---~which differ only on the number of bytes of their -operand~--- are irrelevant and will be eluded. +\dwcfa{advance\_loc2} --~which differ only in the number of bytes of their +operand~-- are irrelevant and will be elided. \begin{itemize} \item{} \dwcfa{set\_loc(loc)}~: @@ -478,8 +521,8 @@ in the context of the program being unwound. In particular, it must be able to dereference some pointer derived from DWARF instructions that will point to the execution stack, or even the heap. -This function takes as arguments an instruction pointer ---~supposedly -extracted from $\reg{rip}$~--- and an array of register values; and returns a +This function takes as arguments an instruction pointer --~supposedly +extracted from $\reg{rip}$~-- and an array of register values; and returns a fresh array of register values after unwinding this call frame. The function is compositional\footnote{up to technicities: the IP obtained after unwinding the first frame might be handled in a different dynamically loaded object, and this @@ -641,25 +684,33 @@ machine code on the x86\_64 platform. The rough idea of the compilation is to produce, out of the \ehframe{} section of a binary, C code that resembles the code shown in the DWARF semantics from -Section~\ref{sec:semantics} above.
This C code is then compiled by GCC, -providing for free all the optimization passes of a modern compiler. +Section~\ref{sec:semantics} above. This C code is then compiled by GCC in +\lstbash{-O2} mode\footnote{Compiling in \lstbash{-O3} takes way too much +time.}, providing for free all the optimization passes of a modern compiler. -The generated code consists in a single monolithic function, taking as -arguments an instruction pointer and a memory context (\ie{} the value of the -various machine registers) as defined in Listing~\ref{lst:unw_ctx}. The -function will then return a fresh memory context, containing the values the -registers hold after unwinding this frame. +The generated code consists in a single monolithic function, \lstc{_eh_elf}, +taking as arguments an instruction pointer and a memory context (\ie{} the +value of the various machine registers) as defined in +Listing~\ref{lst:unw_ctx}. The function will then return a fresh memory +context, containing the values the registers hold after unwinding this frame. The body of the function itself is mostly a huge switch, taking advantage of -the non-standard ---~yet widely implemented in C compilers~--- syntax for range -switches, in which each \lstc{case} can refer to a range. All the FDEs are -merged together into this switch, each row of a FDE being a switch case. The -cases then fill a context with unwound values, then return it. +the non-standard --~yet widely implemented in C compilers~-- syntax for range +switches, in which each \lstinline{case} can refer to a range. All the FDEs are +merged together into this switch, each row of a FDE being a switch case. +Separating the various FDEs in the C code --~other than with comments~-- is, +unlike what is done in DWARF, pointless, since accessing a ``row'' has a linear +cost, and the C code is not meant to be read, except maybe for debugging +purposes. The switch cases bodies then fill a context with unwound values, then +return it. 
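As a rough illustration of the range-switch layout described above, here is a minimal sketch, not the generated code itself. The context structure, the flag names and the function name (\lstc{unw_ctx_sketch_t}, \lstc{UNW_FLAG_*}, \lstc{eh_elf_sketch}) are hypothetical stand-ins for the ones of Listing~\ref{lst:unw_ctx}, and the upper bound \lstc{0x65f} of the second row is made up. Only the two CFA rules ($\reg{rsp} + 8$ in the prelude, $\reg{rsp} + 48$ afterwards) are taken from the \lstc{fib7} example. The sketch assumes a 64-bit target and a compiler accepting GNU case ranges.

```c
#include <stdint.h>

/* Hypothetical flag bits and context layout (assumptions, not the report's). */
enum {
    UNW_FLAG_RIP   = 1u << 0,
    UNW_FLAG_RSP   = 1u << 1,
    UNW_FLAG_ERROR = 1u << 7
};

typedef struct {
    uint8_t flags;        /* which fields below are meaningful, plus error bit */
    uintptr_t rip, rsp;
} unw_ctx_sketch_t;

/* One switch case per FDE row, each case covering the row's IP range. */
unw_ctx_sketch_t eh_elf_sketch(unw_ctx_sketch_t ctx, uintptr_t pc)
{
    unw_ctx_sketch_t out = { 0, 0, 0 };
    uintptr_t cfa;

    switch (pc) {
    case 0x615 ... 0x618:            /* prelude: only the RA is on the stack */
        cfa = ctx.rsp + 8;
        break;
    case 0x619 ... 0x65f:            /* body: 48 bytes of frame (bound made up) */
        cfa = ctx.rsp + 48;
        break;
    default:                         /* no row covers this IP */
        out.flags = UNW_FLAG_ERROR;
        return out;
    }

    out.rsp = cfa;                              /* caller's rsp is the CFA */
    out.rip = *(const uintptr_t *)(cfa - 8);    /* RA sits 8 bytes below the CFA */
    out.flags = UNW_FLAG_RIP | UNW_FLAG_RSP;
    return out;
}
```

For brevity the two rows share a common tail here, whereas the text above describes each case body filling and returning the context itself.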
-An optionally enabled parameter can be used to pass a function pointer to a -dereferencing function, that conceptually does what the dereferencing \lstc{*} -operator does on a pointer, and is used to unwind a process that is not the -currently running process, and thus not sharing the same address space. A call +A setting of the compiler also optionally enables another parameter to the +\lstc{_eh_elf} function, \lstc{deref}, which is a function pointer. This +\lstc{deref} function, when enabled, replaces the dereferencing \lstc{*} +operator everywhere, and can be used to generate \ehelfs{} that will work on +remote address spaces (\ie{} whenever the unwinding is not done on the process +reading the \ehelf{} itself, but some other process, or even on a stack dump of +a long-terminated process). Unlike in the \ehframe, and unlike what should be done in a release, real-world-proof version of the \ehelfs, the choice was made to keep this @@ -675,20 +726,24 @@ is not sufficient to analyze every stack frame as \prog{gdb} would do after a In the unwind context from Listing~\ref{lst:unw_ctx}, the values of type \lstc{uintptr_t} are the values of the corresponding registers, and -\lstc{flags} is a 8-bytes value, indicating for each register whether it is +\lstc{flags} is an 8-bit value, indicating for each register whether it is present or not in this context (\ie{} if the \lstc{rbx} bit is not set, the value of \lstc{rbx} in the structure isn't meaningful), plus an error bit, -indicating whether an error occurred during unwinding. +indicating whether an error occurred during unwinding (which can be due \eg{} +to an unsupported operation in the original DWARF, thus compiled to an error). This generated data is stored in separate shared object files, which we call \ehelfs.
It would have been possible to alter the original ELF file to embed -this data as a new section, but it getting it to be executed just as any +this data as a new section, but getting it to be executed just as any portion of the \lstc{.text} section would probably have been painful, and keeping it separated during the experimental phase is quite convenient. It is possible to have multiple versions of \ehelfs{} files in parallel, with various options turned on or off, and it doesn't require to alter the base system by editing \eg{} \texttt{/usr/lib/libc-*.so}. Instead, when the \ehelf{} data is -required, those files can simply be \lstc{dlopen}'d. +required, those files can simply be \lstc{dlopen}'d. It is also possible to +imagine, in a future production environment, packaging \ehelfs{} files +separately, so that people interested in heavy computation can choose to +install them. \medskip @@ -705,15 +760,19 @@ generated for the C code in Listing~\ref{lst:ex1_c}. Without any particular care to efficiency or compactness, it is already possible to produce a compiled version very close to the one described in Section~\ref{sec:semantics}. Although the unwinding speed cannot yet be -actually benchmarked, it is already possible to write in a few hundreds of line -of C a simple stack walker printing the functions traversed. It already works +actually benchmarked, it is already possible to write in a few hundred lines of +C code a simple stack walker printing the functions traversed. It already works without any problem on the easily tested cases, since corner cases are mostly -found in standard and highly optimal libraries, and it is not that easy to get +found in standard and highly optimized libraries, and it is not that easy to get the program to stop and print a stack trace from within a system library without using a debugger. The major drawback of this approach, without any particular care taken, is the -space waste. +space waste.
The space taken by those tentative \ehelfs{} is analyzed in +Table~\ref{table:basic_eh_elf_space} for \prog{hackbench}, a small program +introduced later in Section~\ref{ssec:bench_perf}, and the libraries on which +it depends. + \begin{table}[h] \centering @@ -736,11 +795,6 @@ space waste. \caption{Basic \ehelfs{} space usage}\label{table:basic_eh_elf_space} \end{table} -The space taken by those tentative \ehelfs{} is analyzed in -Table~\ref{table:basic_eh_elf_space} for \prog{hackbench}, a small program -introduced later in Section~\ref{ssec:bench_perf}, and the libraries on which -it depends. - The first column only includes the sizes of the ELF sections \lstc{.text} (the program itself) and \lstc{.rodata}, the read-only data (such as static strings, etc.). Only the weight of the \lstc{.text} section of the generated \ehelfs{} @@ -764,16 +818,17 @@ made in order to shrink the \ehelfs. The major optimization that most reduced the output size was to use an if/else tree implementing a binary search on the program counter relevant intervals, instead of a huge switch. In the process, we also \emph{outline} a lot of code, -that is, find out identical code blocks, move them outside of the if/else tree, -identify them by a label, and jump to them using a \lstc{goto}, which -de-duplicates a lot of code and contributes greatly to the shrinking. In the -process, we noticed that the vast majority of FDE rows are actually taken among -very few ``common'' FDE rows. +that is, find out identical ``switch cases'' bodies (which are not switch cases +anymore, but if bodies), move them outside of the if/else tree, identify them +by a label, and jump to them using a \lstc{goto}, which de-duplicates a lot of +code and contributes greatly to the shrinking. In the process, we noticed that +the vast majority of FDE rows are actually taken among very few ``common'' FDE +rows. 
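The if/else tree and the outlining described above can be sketched as follows. This is only an illustration under stated assumptions: the context type, the label names and the function name are hypothetical, and the row boundaries \lstc{0x660} and \lstc{0x661} are made up. The CFA rules ($\reg{rsp} + 8$ for the first and last rows, $\reg{rsp} + 48$ in between) follow the \lstc{fib7} example, which is why the first and last rows can share a single outlined body.

```c
#include <stdint.h>

/* Hypothetical context: bit 0 = rip valid, bit 1 = rsp valid, bit 7 = error. */
typedef struct {
    uint8_t flags;
    uintptr_t rip, rsp;
} unw_ctx_sketch_t;

unw_ctx_sketch_t eh_elf_tree_sketch(unw_ctx_sketch_t ctx, uintptr_t pc)
{
    unw_ctx_sketch_t out = { 0, 0, 0 };
    uintptr_t cfa;

    /* Binary search over the row boundaries 0x615, 0x619, 0x660, 0x661. */
    if (pc < 0x619) {
        if (pc < 0x615)
            goto row_error;
        goto row_cfa_rsp_8;          /* prelude row */
    } else {
        if (pc < 0x660)
            goto row_cfa_rsp_48;     /* body row */
        if (pc < 0x661)
            goto row_cfa_rsp_8;      /* last row: same body as the prelude */
        goto row_error;
    }

    /* Outlined row bodies: each distinct body is emitted only once. */
row_cfa_rsp_8:
    cfa = ctx.rsp + 8;
    goto finish;
row_cfa_rsp_48:
    cfa = ctx.rsp + 48;
    goto finish;
row_error:
    out.flags = 0x80;
    return out;
finish:
    out.rsp = cfa;
    out.rip = *(const uintptr_t *)(cfa - 8);   /* RA is 8 bytes below the CFA */
    out.flags = 0x03;
    return out;
}
```

With many FDE rows mapping onto few distinct bodies, the tree only grows by one comparison and one \lstc{goto} per additional row.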
This makes this optimization really efficient, as seen later in -Section~\ref{ssec:results_size}, but also makes it an interesting question --- -not investigated during this internship --- to find out whether standard DWARF -data could be efficiently compressed in this way. +Section~\ref{ssec:results_size}, but also makes it an interesting question +--~not investigated during this internship~-- to find out whether standard +DWARF data could be efficiently compressed in this way. \begin{minipage}{0.45\textwidth} \lstinputlisting[language=C, caption={\ehelf{} for the previous example}, @@ -806,15 +861,16 @@ However, unwinding over and over again from the same program point would have had no interest at all, since \prog{libunwind} would have simply cached the relevant DWARF row. In the mean time, making sure that the various unwinding are made from different locations is somehow cheating, since it makes useless -\prog{libunwind}'s caching. All in all, the benchmarking method must have a -``natural'' distribution of unwindings. +\prog{libunwind}'s caching and does not reproduce ``real-world'' unwinding +distribution. All in all, the benchmarking method must have a ``natural'' +distribution of unwindings. Another requirement is to also distribute quite evenly the unwinding points across the program: we would like to benchmark stack unwindings crossing some standard library functions, starting from inside them, etc. Finally, the unwound program must be interesting enough to enter and exit a lot -of function, nest function calls, have FDEs that are not as simple as in +of functions, nest function calls, have FDEs that are not as simple as in Listing~\ref{lst:ex1_dw}, etc. @@ -864,19 +920,23 @@ system and process as much as possible, to be able to unwind in any context. This very restricted information lacked a memory map (a table indicating which shared object is mapped at which address in memory) in order to use \ehelfs. 
Apart from this, the modified version of \prog{libunwind} produced is entirely -compatible with the vanilla version. +compatible with the vanilla version, meaning that the only modifications +required to use \ehelfs{} within any project using \prog{libunwind} should be +modifying one line of code (this function call, which is a setup function) and +linking against the modified version of \prog{libunwind} instead of the system +version. Once this was done, plugging it in \prog{perf} was the matter of a few lines of -code only. The major problem encountered was to understand how \prog{perf} -works. In order to avoid perturbing the traced program, \prog{perf} does not -unwind at runtime, but rather records at regular interval the program's stack, -and all the auxiliary information that is needed to unwind later. This is done -when running \lstbash{perf record}. Then, \lstbash{perf report} unwinds the -stack to analyze it; but at this point of time, the traced process is long -dead, thus any PID-based approach, or any approach using \texttt{/proc} -information will fail. However, as this was the easiest method, this approach -was chosen when implementing the first version of \ehelfs; thus requiring some -code rewriting. +code only, apart from the benchmarking code. The major problem encountered was +to understand how \prog{perf} works. In order to avoid perturbing the traced +program, \prog{perf} does not unwind at runtime, but rather records the +program's stack at regular intervals, with all the auxiliary information needed +to unwind later. This is done when running \lstbash{perf record}. Then, +\lstbash{perf report} unwinds the stack to analyze it; but at this point in +time, the traced process is long dead, thus any PID-based approach, or any +approach using \texttt{/proc} information will fail. However, as this was the +easiest method, the first version of \ehelfs{} used those mechanisms, thus +requiring some code rewriting.
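To make the post-mortem constraint above more concrete, here is a minimal sketch of what reading a word out of a recorded stack could look like once the traced process is dead. Everything in it is an assumption made for illustration: the \lstc{stack_dump_t} type, the \lstc{dump_deref} name and its signature are hypothetical, and neither the format actually used by \prog{perf} nor the real signature of the \lstc{deref} parameter is described here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical snapshot of a stack region saved while the process was alive. */
typedef struct {
    uintptr_t base;              /* address bytes[0] had in the traced process */
    size_t size;                 /* number of recorded bytes */
    const unsigned char *bytes;  /* the recorded stack contents */
} stack_dump_t;

/* Reads one machine word at traced-process address `addr` from the dump.
 * Returns 1 and stores the word in *out on success, 0 if out of range. */
int dump_deref(const stack_dump_t *dump, uintptr_t addr, uintptr_t *out)
{
    if (dump->size < sizeof *out || addr < dump->base)
        return 0;
    if (addr - dump->base > dump->size - sizeof *out)
        return 0;
    memcpy(out, dump->bytes + (addr - dump->base), sizeof *out);
    return 1;
}
```

Such a helper depends neither on a PID nor on \texttt{/proc}, which is the property required at \lstbash{perf report} time.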
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Other explored methods} @@ -884,15 +944,15 @@ code rewriting. The first approach tried to benchmark was trying to create some specific C code that would meet the requirements from Section~\ref{ssec:bench_req}, while calling itself a benchmarking procedure from time to time. This was abandoned -quite fast, because generating C code interesting enough to be unwound turned -out hard, and the generated FDEs invariably ended out uninteresting. It would -also never have met the requirement of unwinding from fairly distributed +quite quickly, because generating C code interesting enough to be unwound +turned out to be hard, and the generated FDEs invariably ended up uninteresting. It +would also never have met the requirement of unwinding from fairly distributed locations anyway. Another attempt was made using CSmith~\cite{csmith}, a random C code generator initially made for C compilers random testing. The idea was still to craft an interesting C program that would unwind on its own frequently, but to integrate -randomly generated C code with CSmith to integrate interesting C snippets that +C code randomly generated by CSmith into hand-written C snippets that would generate large enough FDEs and nested calls. This was abandoned as well as the call graph of a CSmith-generated code is often far too small, and the CSmith code is notoriously hard to understand and edit.
diff --git a/shared/common.sty b/shared/common.sty index 0e5bd6d..90aaae4 100644 --- a/shared/common.sty +++ b/shared/common.sty @@ -7,3 +7,6 @@ \newcommand{\set}[1]{\left\{ #1 \right\}} \newcommand{\card}[1]{\left\vert{} #1 \right\vert} \newcommand{\abs}[1]{\left\vert{} #1 \right\vert} + +\newcommand{\tnhead}[2]{\multicolumn{1}{#1}{#2}} % Table neutral head +\newcommand{\spaced}[2]{\hspace{#1} #2 \hspace{#1}} diff --git a/shared/specific.sty b/shared/specific.sty index 380c85c..3150265 100644 --- a/shared/specific.sty +++ b/shared/specific.sty @@ -1,5 +1,8 @@ %% Specific commands for this project +\newcommand{\stackfhead}[1] + {\tnhead{l}{\hspace{-5ex}$\reg{rsp} #1$ \hspace{2em}}} + \newcommand{\prog}[1]{\texttt{#1}} \newcommand{\ehelf}{\texttt{eh\_elf}} \newcommand{\ehelfs}{\texttt{eh\_elfs}}