Factor out irrelevant footnotes and parentheses

2018-08-16 00:26:59 +02:00
commit 67b25ca038
parent c5f1f8615b
1 changed file with 91 additions and 99 deletions
--- a/report/report.tex
+++ b/report/report.tex
@ -80,15 +80,15 @@ restored before returning, the function's return address and local variables.
 On the x86\_64 platform, with which this report is mostly concerned, the
 calling convention that is followed is defined in the System V
-ABI~\cite{systemVabi} for the Unix-like operating systems (among which Linux).
+ABI~\cite{systemVabi} for the Unix-like operating systems --~among which Linux.
 Under this calling convention, the first six arguments of a function are passed
 in the registers \reg{rdi}, \reg{rsi}, \reg{rdx}, \reg{rcx}, \reg{r8},
 \reg{r9}, while additional arguments are pushed onto the stack. It also defines
 which registers may be overwritten by the callee, and which parameters must be
-restored before returning (which most of the time is done by pushing the
+restored before returning. This restoration, most of the time, is done by
-register value onto the stack in the function prelude, and restoring it just
+pushing the register value onto the stack in the function prelude, and
-before returning). Those preserved registers are \reg{rbx}, \reg{rsp},
+restoring it just before returning. Those preserved registers are \reg{rbx},
-\reg{rbp}, \reg{r12}, \reg{r13}, \reg{r14}, \reg{r15}.
+\reg{rsp}, \reg{rbp}, \reg{r12}, \reg{r13}, \reg{r14}, \reg{r15}.
 \begin{wrapfigure}{r}{0.4\textwidth}
    \centering
@ -98,11 +98,8 @@ before returning). Those preserved registers are \reg{rbx}, \reg{rsp},
 \end{wrapfigure}
 The register \reg{rsp} is supposed to always point to the last used memory cell
-in the stack, thus, when the process just enters a new function, \reg{rsp}
+in the stack. Thus, when the process just enters a new function, \reg{rsp}
-points right to the location of the return address\footnote{Remember that since
+points right to the location of the return address. Then, the compiler might
-the stack grows \emph{downwards} in memory, the arrow of \reg{rsp} points
-\emph{below} the RA cell in the figure, and yet the memory cell indexed is the
-one \emph{above} in the drawing, that is, the RA.}. Then, the compiler might
 use \reg{rbp} (``base pointer'') to save this value of \reg{rip}, by writing
 the old value of \reg{rbp} just below the return address on the stack, then
 copying \reg{rsp} to \reg{rbp}. This makes it easy to find the return address
@ -148,8 +145,8 @@ Left apart analyzing the assembly code produced, there is no way to find where
 the return address is stored, relatively to \reg{rsp}, at some arbitrary point
 of the function. Even when \reg{rbp} is used, there is no easy way to guess
 where each callee-saved register is stored in the stack frame, and worse, which
-callee-saved registers were saved (since it is not necessary to save a register
+callee-saved registers were saved, since it is optional to save a register
-that the function never touches).
+that the function never touches.
 With this example, it seems pretty clear that it is often necessary to have
 additional data to perform stack unwinding. This data is often stored among the
@ -171,11 +168,11 @@ context, by unwinding \lstc{fct_b}'s frame.
 \medskip
-Yet, stack unwinding (and thus debugging data) \emph{is not limited to
+Yet, stack unwinding, and thus, debugging data, \emph{is not limited to
 debugging}.
 Another common usage is profiling. A profiling tool, such as \prog{perf} under
-Linux -- see Section~\ref{ssec:perf} --, is used to measure and analyze in
+Linux --~see Section~\ref{ssec:perf} --, is used to measure and analyze in
 which functions a program spends its time, identify bottlenecks and find out
 which parts are critical to optimize.  To do so, modern profilers pause the
 traced program at regular, short intervals, inspect their stack, and determine
@ -202,8 +199,8 @@ trigger the destructors of stack-allocated objects. Furthermore, this is often
 undesirable: \lstc{setjmp} has a quite big overhead, which is introduced
 whenever a \lstc{try} block is encountered. Instead, it is often preferred to
 have strictly no overhead when no exception happens, at the cost of a greater
-overhead when an exception is actually fired (after all, they are supposed to
+overhead when an exception is actually fired --~after all, they are supposed to
-be \emph{exceptional}). For more details on C++ exception handling,
+be \emph{exceptional}. For more details on C++ exception handling,
 see~\cite{koening1990exception} (especially Section~16.5). Possible
 implementation mechanisms are also presented in~\cite{dinechin2000exn}.
@ -237,8 +234,8 @@ the previous paragraph, in an ELF section originally called
 For any binary, debugging information can easily get quite large if no
 attention is paid to keeping it as compact as possible. In this matter, DWARF
 does an excellent job, and everything is stored in a very compact way. This,
-however, as we will see, makes it both difficult to parse correctly (with \eg{}
+however, as we will see, makes it both difficult to parse correctly and quite
-variable-length integers) and quite slow to interpret.
+slow to interpret.
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsection{DWARF unwinding data}
@ -259,19 +256,15 @@ a table, that has a range of IPs on which it has authority. Most often, but not
 necessarily, it corresponds to a single function in the original source code.
 Each column of the table is a register (\eg{} \reg{rsp}), with two additional
 special registers, CFA (Canonical Frame Address) and RA (Return Address),
-containing respectively the base pointer of the current stack
+containing respectively the base pointer of the current stack frame and the
-frame\footnote{The CFA is most commonly thought of as the base pointer of the
+return address of the current function. For instance, on a x86\_64
-frame, yet this is not enforced by DWARF\@. The CFA is used as an address from
+architecture, RA would contain the unwound value of \reg{rip}, the instruction
-which other registers will be deduced as offsets, and although it is supposed
+pointer. Each row has a certain validity interval, on which it describes
-to be the actual base pointer, it can be anything as long as it is close enough
+accurate unwinding data. This range starts at the instruction pointer it is
-to the addresses that will be deduced from it.} and the return address of the
+associated with, and ends at the start IP of the next table row (or the end IP
-current function (\ie{} for x86\_64, the unwound value of \reg{rip}, the
+of the current FDE if it was the last row). In particular, there can be no ``IP
-instruction pointer). Each row has a certain validity interval, on which it
+hole'' within a FDE --~unlike FDEs themselves, which can leave holes between
-describes accurate unwinding data. This range starts at the instruction pointer
+them.
-it is associated with, and ends at the start IP of the next table row (or the
-end IP of the current FDE if it was the last row). In particular, there can be
-no ``IP hole'' within a FDE --~unlike FDEs themselves, which can leave holes
-between them.
 \begin{figure}[h]
    \begin{minipage}{0.45\textwidth}
@ -329,17 +322,17 @@ how the stack frame is constructed. When interpreting the generated \ehframe{}
 with \lstbash{readelf -wF}, we obtain the (slightly edited)
 Listing~\ref{lst:ex1_dw}. During the function prelude, \ie{} for $\mhex{615}
 \leq \reg{rip} < \mhex{619}$, the stack frame only contains the return address,
-thus the CFA is 8 bytes above \reg{rsp} (which was the value of \reg{rsp}
+thus the CFA is 8 bytes above \reg{rsp}, and the return address is precisely at
-before the call, and is the topmost value of used space for this stack frame),
+\reg{rsp} --~that is, stored between \reg{rsp} and $\reg{rsp} + 8$. Then, the
-and the return address is precisely at \reg{rsp} --~that is, stored between
+contents of \lstc{fibo}, 8 integers of 4 bytes each, are allocated on the
-\reg{rsp} and $\reg{rsp} + 8$. Then, 8 integers of 4 bytes each (for
+stack, which puts the CFA 32 bytes above \reg{rsp}; the return address still
-\lstc{fibo}, \lstc{pos} being optimized out) are allocated on the stack, which
+being 8 bytes below the CFA\@. The variable \lstc{pos} is optimized out in the
-puts the CFA 32 bytes above \reg{rsp}, and the return address still 8 bytes
+generated assembly code, thus no stack space is allocated for it. Yet,
-below the CFA\@.  Yet, \prog{gcc} decided to allocate a total space of 48 bytes
+\prog{gcc} decided to allocate a total space of 48 bytes for the stack frame
-for the stack frame for memory alignment reasons, which means subtracting 40
+for memory alignment reasons, which means subtracting 40 bytes to \reg{rsp}
-bytes to \reg{rsp} (address $\mhex{615}$ in the assembly). Then, by the end of
+(address $\mhex{615}$ in the assembly). Then, by the end of the function, the
-the function, the local variables are discarded and \reg{rsp} is reset to its
+local variables are discarded and \reg{rsp} is reset to its value from the
-value from the first row.
+first row.
 However, DWARF data isn't actually stored as a table in the binary files, but
 is instead stored as in Listing~\ref{lst:ex1_dwraw}. The first row has the
@ -425,9 +418,9 @@ These are the DWARF instructions used for CFI description, that is, the
 instructions that contain the stack unwinding table information. The following
 list is an exhaustive list of instructions from the DWARF5
 specification~\cite{dwarf5std} concerning CFI, with reworded descriptions for
-brevity and clarity. All these instructions are up to variants (most
+brevity and clarity. All these instructions are up to variants --~most
 instructions exist in multiple formats to handle various operands formatting,
-to optimize space). Since we won't be talking about the underlying file format
+to optimize space. Since we won't be talking about the underlying file format
 here, those variations between \eg{} \dwcfa{advance\_loc1} and
 \dwcfa{advance\_loc2} --~which differ only in the number of bytes of their
 operand~-- are irrelevant and will be elided.
@ -517,10 +510,10 @@ only handled as register identifiers, so we can safely state that $\reg{reg}
 A value can then be undefined, stored at memory address $x$ or be directly a
 value $x$, $x$ being here a simple expression consisting of $\reg{reg} +
-\textit{offset}$. The CFA is considered a simple register here. For instance, to
+\textit{offset}$. The CFA is considered a simple register here. For instance,
-define $\reg{rax}$ to the value contained in memory 16 bytes below the CFA, we
+to define $\reg{rax}$ to the value contained in memory 16 bytes below the CFA,
-would have $\reg{rax} \mapsto \valaddr{\reg{CFA}, -16}$ (for the stack grows
+we would have $\reg{rax} \mapsto \valaddr{\reg{CFA}, -16}$, since the stack
-downwards).
+grows downwards.
 \subsection{Target language~: a C function body}
@ -533,10 +526,10 @@ execution stack, or even the heap.
 This function takes as arguments an instruction pointer --~supposedly
 extracted from $\reg{rip}$~-- and an array of register values; and returns a
 fresh array of register values after unwinding this call frame. The function is
-compositional\footnote{up to technicities: the IP obtained after unwinding the
+compositional: it can be called twice in a row to unwind two stack frames,
-first frame might be handled in a different dynamically loaded object, and this
+unless the IP obtained after the first unwinding comes from another shared
-would require inspecting the DWARF located in another file}: it can be called
+object file, for instance a call to \prog{libc}. In this case, unwinding the
-twice in a row to unwind two stack frames.
+second frame will require loading the corresponding DWARF information.
 The function is the following~:
@ -636,8 +629,8 @@ $F\left[0 \ldots |F|-2\right] \extrarrow{reg} \bullet$.
    \semI{\dwcfa{nop()} \cdot d}{s}(F) &:= \contsem{F}\\
 \end{align*}
-(The stack is used for \texttt{remember\_state} and \texttt{restore\_state}. If
+The stack is used for \texttt{remember\_state} and \texttt{restore\_state}. If
-we omit those two operations, we can plainly remove the stack).
+we omit those two operations, we can plainly remove the stack.
 \subsection{From $\intermedlang$ to C}
@ -694,8 +687,9 @@ machine code on the x86\_64 platform.
 The rough idea of the compilation is to produce, out of the \ehframe{} section
 of a binary, C code that resembles the code shown in the DWARF semantics from
 Section~\ref{sec:semantics} above. This C code is then compiled by GCC in
-\lstbash{-O2} mode\footnote{Compiling in \lstbash{-O3} takes way too much
+\lstbash{-O2} mode, since it already provides a good level of optimization and
-time.}, providing for free all the optimization passes of a modern compiler.
+compiling in \lstbash{-O3} takes way too much time. This saves us the trouble
+of optimizing the generated C code whenever GCC does that by itself.
 The generated code consists in a single monolithic function, \lstc{_eh_elf},
 taking as arguments an instruction pointer and a memory context (\ie{} the
@ -715,18 +709,18 @@ return it.
 A setting of the compiler also optionally enables another parameter to the
 \lstc{_eh_elf} function, \lstc{deref}, which is a function pointer. This
-\lstc{deref} function, when enabled, replaces everywhere the dereferencing
+\lstc{deref} function, when present, replaces everywhere the dereferencing
 \lstc{*} operator, and can be used to generate \ehelfs{} that will work on
-remote address spaces (\ie{} whenever the unwinding is not done on the process
+remote address spaces, that is, whenever the unwinding is not done on the
-reading the \ehelf{} itself, but some other process, or even on a stack dump of
+process reading the \ehelf{} itself, but some other process, or even on a stack
-a long-terminated process).
+dump of a long-terminated process.
 Unlike in the \ehframe, and unlike what should be done in a release,
 real-world-proof version of the \ehelfs, the choice was made to keep this
 prototype simple, and only handle the few registers that were needed to simply
 unwind the stack. Thus, the only registers handled in \ehelfs{} are \reg{rip},
-\reg{rbp}, \reg{rsp} and \reg{rbx} (the latter being used quite often in
+\reg{rbp}, \reg{rsp} and \reg{rbx}, the latter being used quite often in
-\prog{libc} to hold the CFA address). This is enough to unwind the stack, but
+\prog{libc} to hold the CFA address. This is enough to unwind the stack, but
 is not sufficient to analyze every stack frame as \prog{gdb} would do after a
 \lstbash{frame n} command.
@ -736,10 +730,9 @@ is not sufficient to analyze every stack frame as \prog{gdb} would do after a
 In the unwind context from Listing~\ref{lst:unw_ctx}, the values of type
 \lstc{uintptr_t} are the values of the corresponding registers, and
 \lstc{flags} is an 8-bit value, indicating for each register whether it is
-present or not in this context (\ie{} if the \lstc{rbx} bit is not set, the
+present or not in this context, plus an error bit, indicating whether an error
-value of \lstc{rbx} in the structure isn't meaningful), plus an error bit,
+occurred during unwinding. Such errors can be due \eg{} to an unsupported
-indicating whether an error occurred during unwinding (which can be due \eg{}
+operation in the original DWARF\@.
-to an unsupported operation in the original DWARF, thus compiled to an error).
 This generated data is stored in separate shared object files, which we call
 \ehelfs. It would have been possible to alter the original ELF file to embed
@ -827,12 +820,12 @@ made in order to shrink the \ehelfs.
 The major optimization that most reduced the output size was to use an if/else
 tree implementing a binary search on the program counter relevant intervals,
 instead of a huge switch. In the process, we also \emph{outline} a lot of code,
-that is, find out identical ``switch cases'' bodies (which are not switch cases
+that is, find out identical ``switch cases'' bodies --~which are not switch
-anymore, but if bodies), move them outside of the if/else tree, identify them
+cases anymore, but if bodies~--, move them outside of the if/else tree,
-by a label, and jump to them using a \lstc{goto}, which de-duplicates a lot of
+identify them by a label, and jump to them using a \lstc{goto}, which
-code and contributes greatly to the shrinking. In the process, we noticed that
+de-duplicates a lot of code and contributes greatly to the shrinking. In the
-the vast majority of FDE rows are actually taken among very few ``common'' FDE
+process, we noticed that the vast majority of FDE rows are actually taken among
-rows.
+very few ``common'' FDE rows.
 This makes this optimization really efficient, as seen later in
 Section~\ref{ssec:results_size}, but also makes it an interesting question
@ -886,13 +879,12 @@ Listing~\ref{lst:ex1_dw}, etc.
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsection{Presentation of \prog{perf}}\label{ssec:perf}
-\prog{Perf} is a \emph{profiler} that comes with the Linux ecosystem (actually,
+\prog{Perf} is a \emph{profiler} that comes with the Linux ecosystem, and is
-\prog{perf} is developed within the Linux kernel source tree). A profiler is an
+even developed within the Linux kernel source tree. A profiler is an important
-important tool from the developer's toolbox that analyzes the performance of
+tool from the developer's toolbox that analyzes the performance of programs by
-programs by recording the time spent in each function, including within nested
+recording the time spent in each function, including within nested calls. This
-calls. This analysis often enables programmers to optimize critical paths and
+analysis often enables programmers to optimize critical paths and functions in
-functions in their programs, while leaving unoptimized functions that are
+their programs, while leaving unoptimized functions that are seldom traversed.
-seldom traversed.
 For this purpose, the basic idea is to stop the traced program at regular
 intervals, unwind its stack, write down the current nested function calls, and
@ -924,16 +916,16 @@ activity, be linked against \prog{libc} and \prog{pthread}, and be very light.
 Interfacing \ehelfs{} with \prog{perf} first required forking
 \prog{libunwind} and implementing \ehelfs{} support for it. In the process, it
 turned out necessary to slightly modify \prog{libunwind}'s interface to add a
-parameter to a function, since \prog{libunwind} is made to be agnostic of the
+parameter to an initialisation function, since \prog{libunwind} is made to be
-system and process as much as possible, to be able to unwind in any context.
+agnostic of the system and process as much as possible, to be able to unwind in
-This very restricted information lacked a memory map (a table indicating which
+any context.  This very restricted information lacked a \emph{memory map}, a
-shared object is mapped at which address in memory) in order to use \ehelfs.
+table indicating which shared object is mapped at which address in memory, in
-Apart from this, the modified version of \prog{libunwind} produced is entirely
+order to use \ehelfs. Apart from this, the modified version of \prog{libunwind}
-compatible with the vanilla version, meaning that the only modifications
+produced is entirely compatible with the vanilla version. This means that the
-required to use \ehelfs{} within any project using \prog{libunwind} should be
+only modifications required to use \ehelfs{} within any project using
-modifying one line of code (this function call, which is a setup function) and
+\prog{libunwind} should be changing one line of code to add one parameter to a
-linking against the modified version of \prog{libunwind} instead of the system
+function call and linking against the modified version of \prog{libunwind}
-version.
+instead of the system version.
 Once this was done, plugging it into \prog{perf} was a matter of only a few
 lines of code, apart from the benchmarking code. The major problem encountered was
@ -984,9 +976,9 @@ swapping.
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsection{Measured time performance}
-The benchmarking, as described in Section~\ref{ssec:bench_perf}, of \ehelfs{}
+A benchmarking of \ehelfs{} against the vanilla \prog{libunwind} was made using
-against the vanilla \prog{libunwind} (using the same methodology, only linking
+the exact same methodology as in Section~\ref{ssec:bench_perf}, only linking
-\prog{perf} against the vanilla \prog{libunwind}), gives the results in
+\prog{perf} against the vanilla \prog{libunwind}. It yields the results in
 Table~\ref{table:bench_time}.
 \begin{table}[h]
@ -1036,11 +1028,11 @@ instruction, however, would not slow down at all the implementation, since
 every instruction would simply be compiled to x86\_64 without affecting the
 already supported code.
-It is also worth noting that on the machine described in
+It is also worth noting that the compilation time of \ehelfs{} is also
-Section~\ref{ssec:bench_hw}, the compilation of the \ehelfs{} at a level of
+reasonably short. On the machine described in Section~\ref{ssec:bench_hw}, and
-\lstc{-O2} needed to run \prog{hackbench}, that is, \prog{hackbench},
+without using multiple cores to compile, the various shared objects needed to
-\prog{libc}, \prog{ld}, and \prog{libpthread} takes an overall time of $25.28$
+run \prog{hackbench} --~that is, \prog{hackbench}, \prog{libc}, \prog{ld} and
-seconds (using only a single core).
+\prog{libpthread}~-- are compiled in an overall time of $25.28$ seconds.
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsection{Measured compactness}\label{ssec:results_size}
@ -1189,8 +1181,8 @@ only concerned about the columns CFA, \reg{rip}, \reg{rsp}, \reg{rbp} and
 second row analyzes all the columns that were encountered, no matter whether
 supported or not.
-The Table~\ref{table:instr_types} analyzes the proportion of each command (\ie\
+The Table~\ref{table:instr_types} analyzes the proportion of each command
-the formal way a register is set) for non-CFA columns in the sampled data. For
+--~the formal way a register is set~-- for non-CFA columns in the sampled data. For
 a brief explanation, \texttt{Offset} means stored at offset from CFA,
 \texttt{Register} means the value from a machine register, \texttt{Expression}
 means stored at the address of an expression's result, and the \texttt{Val\_}