Review and reword end of §1, §3 and §4

2018-08-08 14:01:55 +02:00 · b128ddd571
commit b128ddd571
parent b761f360cc
3 changed files with 176 additions and 110 deletions
--- a/report/report.tex
+++ b/report/report.tex
@@ -88,16 +88,19 @@ before returning). Those preserved registers are \reg{rbx}, \reg{rsp},
    conventions}\label{fig:call_stack}
 \end{wrapfigure}
-The register \reg{rsp} is supposed to always point just past the last used
+The register \reg{rsp} is supposed to always point to the last used memory cell
-memory cell in the stack, thus, when the process just enters a new function,
+in the stack, thus, when the process just enters a new function, \reg{rsp}
-\reg{rsp} points 8 bytes after the location of the return address. Then, the
+points right to the location of the return address\footnote{Remember that since
-compiler might use \reg{rbp} (``base pointer'') to save this value of
+the stack grows \emph{downwards} in memory, the arrow of \reg{rsp} points
-\reg{rip}, by writing the old value of \reg{rbp} just below the return address
+\emph{below} the RA cell in the figure, and yet the memory cell indexed is the
-on the stack, then copying \reg{rsp} to \reg{rbp}. This makes it easy to find
+one \emph{above} in the drawing, that is, the RA.}. Then, the compiler might
-the return address from anywhere within the function, and also allows for easy
+use \reg{rbp} (``base pointer'') to save this value of \reg{rip}, by writing
-addressing of local variables. Yet, using \reg{rbp} to save \reg{rip} is not
+the old value of \reg{rbp} just below the return address on the stack, then
-always done, since it somehow ``wastes'' a register. This decision is, on
+copying \reg{rsp} to \reg{rbp}. This makes it easy to find the return address
-x86\_64 System V, up to the compiler.
+from anywhere within the function, and also allows for easy addressing of local
 variables. Yet, using \reg{rbp} to save \reg{rip} is not always done, since it
 somehow ``wastes'' a register. This decision is, on x86\_64 System V, up to the
 compiler.
 Often, a function will start by subtracting some value from \reg{rsp}, allocating
 some space in the stack frame for its local variables. Then, it will push on
@@ -242,29 +245,40 @@ when talking about DWARF, a register is merely a numerical identifier that is
 often, but not necessarily, mapped to a real machine register by the ABI\@.
 In practice, this data takes the form of a collection of tables, one table per
-Frame Description Entry (FDE), which most often corresponds to a function. Each
+Frame Description Entry (FDE). A FDE, in turn, is a DWARF entry describing such
-column of the table is a register (\eg{} \reg{rsp}), with two additional
+a table, that has a range of IPs on which it has authority. Most often, but not
 necessarily, it corresponds to a single function in the original source code.
 Each column of the table is a register (\eg{} \reg{rsp}), with two additional
 special registers, CFA (Canonical Frame Address) and RA (Return Address),
-containing respectively the base pointer of the current stack frame and the
+containing respectively the base pointer of the current stack
-return address of the current function (\ie{} for x86\_64, the unwound value of
+frame\footnote{The CFA is most commonly thought of as the base pointer of the
-\reg{rip}, the instruction pointer). Each row of the table is a particular
+frame, yet this is not enforced by DWARF\@. The CFA is used as an address from
-instruction pointer, within the instruction pointer range of the tabulated FDE
+which other registers will be deduced as offsets, and although it is supposed
-(assuming a FDE maps directly to a function, this range is simply the IP range
+to be the actual base pointer, it can be anything as long as it is close enough
-of the given function in the \lstc{.text} section of the binary), a row being
+to the addresses that will be deduced from it.} and the return address of the
-valid from its start IP to the start IP of the next row, or the end IP of the
+current function (\ie{} for x86\_64, the unwound value of \reg{rip}, the
-FDE if it is the last row.
+instruction pointer). Each row has a certain validity interval, on which it
 describes accurate unwinding data. This range starts at the instruction pointer
 it is associated with, and ends at the start IP of the next table row (or the
 end IP of the current FDE if it was the last row). In particular, there can be
 no ``IP hole'' within a FDE --~unlike FDEs themselves, which can leave holes
 between them.
 \begin{figure}[h]
    \begin{minipage}{0.45\textwidth}
        \lstinputlisting[language=C, firstline=3, lastline=12,
                         caption={Original C},label={lst:ex1_c}]
            {src/fib7/fib7.c}
    \end{minipage} \hfill \begin{minipage}{0.45\textwidth}
-    \lstinputlisting[language=C,caption={Processed DWARF},label={lst:ex1_dw}]
+        \lstinputlisting[language=C,caption={Processed DWARF},
                         label={lst:ex1_dw}]
            {src/fib7/fib7.fde}
        \lstinputlisting[language=C,caption={Raw DWARF},label={lst:ex1_dwraw}]
            {src/fib7/fib7.raw_fde}
    \end{minipage}
 \end{figure}
 \begin{figure}[h]
    \begin{minipage}{0.45\textwidth}
        \lstinputlisting[language={[x86masm]Assembler},lastline=11,
                         caption={Generated assembly},label={lst:ex1_asm}]
@@ -274,20 +288,49 @@ FDE if it is the last row.
                         firstnumber=last]
            {src/fib7/fib7.s}
    \end{minipage}
 \end{figure}
 \begin{table}[h]
    \centering
    \begin{tabular}{|c|c|c|c|c|c}
        \stackfhead{+ \mhex{30}}
            & \stackfhead{+ \mhex{28}}
            & \stackfhead{+ \mhex{20}}
            & \stackfhead{+ \mhex{1c}}
            & \stackfhead{+ \mhex{4}}
            & \stackfhead{}
            \\
        \hline{}
            Return Address & \textit{Alignment space}
                & \spaced{2ex}{\lstc{fibo[7]}}
                & \spaced{4ex}{\ldots}
                & \spaced{2ex}{\lstc{fibo[0]}}
                & \textit{Next frame}
                \\
        \hline
    \end{tabular}
    \caption{Stack frame schema}\label{table:ex1_stack_schema}
 \end{table}
 For instance, the C source code in Listing~\ref{lst:ex1_c} above, when compiled
 with \lstbash{gcc -O1 -fomit-frame-pointer -fno-stack-protector}, yields the
-assembly code in Listing~\ref{lst:ex1_asm}. When interpreting the generated
+assembly code in Listing~\ref{lst:ex1_asm}. The memory layout of the stack
-\ehframe{} with \lstbash{readelf -wF}, we obtain the (slightly edited)
+frame is presented in Table~\ref{table:ex1_stack_schema}, to help understand
 how the stack frame is constructed. When interpreting the generated \ehframe{}
 with \lstbash{readelf -wF}, we obtain the (slightly edited)
 Listing~\ref{lst:ex1_dw}. During the function prelude, \ie{} for $\mhex{615}
 \leq \reg{rip} < \mhex{619}$, the stack frame only contains the return address,
 thus the CFA is 8 bytes above \reg{rsp} (which was the value of \reg{rsp}
-before the call), and the return address is precisely at \reg{rsp}. Then, 9
+before the call, and is the topmost value of used space for this stack frame),
-integers of 8 bytes each (8 for \lstc{fibo}, one for \lstc{pos}) are allocated
+and the return address is precisely at \reg{rsp} --~that is, stored between
-on the stack, which puts the CFA 80 bytes above \reg{rsp}, and the return
+\reg{rsp} and $\reg{rsp} + 8$. Then, 8 integers of 4 bytes each (for
-address still 8 bytes below the CFA\@. Then, by the end of the function, the
+\lstc{fibo}, \lstc{pos} being optimized out) are allocated on the stack, which
-local variables are discarded and \reg{rsp} is reset to its value from the
+puts the CFA 32 bytes above \reg{rsp}, and the return address still 8 bytes
-first row.
+below the CFA\@. Yet, \prog{gcc} decided to allocate a total space of 48 bytes
 for the stack frame for memory alignment reasons, which means subtracting 40
 bytes from \reg{rsp} (address $\mhex{615}$ in the assembly). Then, by the end of
 the function, the local variables are discarded and \reg{rsp} is reset to its
 value from the first row.
 However, DWARF data isn't actually stored as a table in the binary files, but
 is instead stored as in Listing~\ref{lst:ex1_dwraw}. The first row has the
@@ -295,12 +338,12 @@ location of the first IP in the FDE, and must define at least its CFA\@. Then,
 when all relevant registers are defined, it is possible to define a new row by
 providing a location offset (\eg{} here $4$), and the new row is defined as a
 clone of the previous one, which can then be altered (\eg{} here by setting
-\lstc{CFA} to $\reg{rsp} + 80$). This means that every line is defined \wrt{}
+\lstc{CFA} to $\reg{rsp} + 48$). This means that every line is defined \wrt{}
 the previous one, and that the IPs of the successive rows cannot be determined
-before evaluating every row before. Thus, unwinding a frame from an IP close to
+without evaluating every row that comes before in the first place. Thus,
-the end of the frame will require evaluating pretty much every DWARF row in the
+unwinding a frame from an IP close to the end of the frame will require
-table before reaching the relevant information, slowing down drastically the
+evaluating pretty much every DWARF row in the table before reaching the
-unwinding process.
+relevant information, drastically slowing down the unwinding process.
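The linear cost described above can be illustrated with a minimal C sketch. It assumes a deliberately simplified, hypothetical row encoding (the type raw_row_t and the function cfa_offset_at are not part of DWARF or of any library): each raw row only carries an IP advance and a new CFA offset relative to rsp, whereas real DWARF has many more instructions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified "raw" row: an IP advance followed by a new CFA
 * offset relative to rsp. Real DWARF rows are built from many more
 * instructions. */
typedef struct {
    uint32_t advance;    /* bytes to add to the previous row's start IP */
    int32_t cfa_offset;  /* CFA = rsp + cfa_offset from this row onwards */
} raw_row_t;

/* Returns the CFA offset valid at `ip`, walking the rows from the FDE start.
 * The start IP of row i is only known after summing every previous advance,
 * hence the linear scan. */
int32_t cfa_offset_at(uintptr_t fde_start, int32_t initial_offset,
                      const raw_row_t *rows, size_t nrows, uintptr_t ip)
{
    assert(ip >= fde_start);
    uintptr_t row_ip = fde_start;
    int32_t offset = initial_offset;
    for (size_t i = 0; i < nrows; ++i) {
        row_ip += rows[i].advance;
        if (row_ip > ip)
            break;  /* `ip` belongs to the row found so far */
        offset = rows[i].cfa_offset;
    }
    return offset;
}
```

With the first two rows of the example (CFA at rsp + 8 from 0x615, then an advance of 4 to reach rsp + 48), looking up an IP near the end of the FDE has to replay every advance that precedes it.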
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsection{How big are FDEs?}
@@ -377,8 +420,8 @@ brevity and clarity. All these instructions are up to variants (most
 instructions exist in multiple formats to handle various operand formats,
 to optimize space). Since we won't be talking about the underlying file format
 here, those variations between \eg{} \dwcfa{advance\_loc1} and
-\dwcfa{advance\_loc2} ---~which differ only on the number of bytes of their
+\dwcfa{advance\_loc2} --~which differ only on the number of bytes of their
-operand~--- are irrelevant and will be eluded.
+operand~-- are irrelevant and will be elided.
 \begin{itemize}
    \item{} \dwcfa{set\_loc(loc)}~:
@@ -478,8 +521,8 @@ in the context of the program being unwound. In particular, it must be able to
 dereference some pointer derived from DWARF instructions that will point to the
 execution stack, or even the heap.
-This function takes as arguments an instruction pointer ---~supposedly
+This function takes as arguments an instruction pointer --~supposedly
-extracted from $\reg{rip}$~--- and an array of register values; and returns a
+extracted from $\reg{rip}$~-- and an array of register values; and returns a
 fresh array of register values after unwinding this call frame. The function is
 compositional\footnote{up to technicalities: the IP obtained after unwinding the
 first frame might be handled in a different dynamically loaded object, and this
@@ -641,25 +684,33 @@ machine code on the x86\_64 platform.
 The rough idea of the compilation is to produce, out of the \ehframe{} section
 of a binary, C code that resembles the code shown in the DWARF semantics from
-Section~\ref{sec:semantics} above. This C code is then compiled by GCC,
+Section~\ref{sec:semantics} above. This C code is then compiled by GCC in
-providing for free all the optimization passes of a modern compiler.
+\lstbash{-O2} mode\footnote{Compiling in \lstbash{-O3} takes way too much
 time.}, providing for free all the optimization passes of a modern compiler.
-The generated code consists in a single monolithic function, taking as
+The generated code consists of a single monolithic function, \lstc{_eh_elf},
-arguments an instruction pointer and a memory context (\ie{} the value of the
+taking as arguments an instruction pointer and a memory context (\ie{} the
-various machine registers) as defined in Listing~\ref{lst:unw_ctx}. The
+value of the various machine registers) as defined in
-function will then return a fresh memory context, containing the values the
+Listing~\ref{lst:unw_ctx}. The function will then return a fresh memory
-registers hold after unwinding this frame.
+context, containing the values the registers hold after unwinding this frame.
 The body of the function itself is mostly a huge switch, taking advantage of
-the non-standard ---~yet widely implemented in C compilers~--- syntax for range
+the non-standard --~yet widely implemented in C compilers~-- syntax for range
-switches, in which each \lstc{case} can refer to a range. All the FDEs are
+switches, in which each \lstinline{case} can refer to a range. All the FDEs are
-merged together into this switch, each row of a FDE being a switch case. The
+merged together into this switch, each row of a FDE being a switch case.
-cases then fill a context with unwound values, then return it.
+Separating the various FDEs in the C code --~other than with comments~-- is,
 unlike what is done in DWARF, pointless, since accessing a ``row'' has a linear
 cost, and the C code is not meant to be read, except maybe for debugging
 purposes. The switch case bodies then fill a context with unwound values, then
 return it.
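As an illustration only, the shape of such a range switch can be sketched as follows. The context structure here is a reduced, hypothetical stand-in for the one from the report's listing (only rsp and rip, with made-up flag bit positions), the function name eh_elf_sketch is invented, and the end address of the second range is an assumption. The case ranges rely on the GNU extension supported by GCC and Clang, and the offsets assume 8-byte pointers as on x86_64.

```c
#include <stdint.h>

/* Hypothetical flag bits: one per register of this reduced context, plus an
 * error bit. The actual bit layout is an assumption. */
#define FLAG_RSP   0x01u
#define FLAG_RIP   0x02u
#define FLAG_ERROR 0x80u

/* Reduced, hypothetical unwinding context. */
typedef struct {
    uint8_t flags;
    uintptr_t rip, rsp;
} unwind_context_t;

unwind_context_t eh_elf_sketch(unwind_context_t ctx, uintptr_t pc)
{
    unwind_context_t out = {0};
    switch (pc) {
        case 0x615 ... 0x618:  /* row 1: CFA = rsp + 8, RA at CFA - 8 */
            out.rsp = ctx.rsp + 8;
            out.rip = *(uintptr_t *)ctx.rsp;
            out.flags = FLAG_RSP | FLAG_RIP;
            return out;
        case 0x619 ... 0x65f:  /* row 2: CFA = rsp + 48 (range end made up) */
            out.rsp = ctx.rsp + 48;
            out.rip = *(uintptr_t *)(ctx.rsp + 40);
            out.flags = FLAG_RSP | FLAG_RIP;
            return out;
        default:               /* no row covers this pc */
            out.flags = FLAG_ERROR;
            return out;
    }
}
```

Each case body fills a fresh context and returns it immediately, so rows belonging to different FDEs can sit side by side in the same switch.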
-An optionally enabled parameter can be used to pass a function pointer to a
+A setting of the compiler also optionally enables another parameter to the
-dereferencing function, that conceptually does what the dereferencing \lstc{*}
+\lstc{_eh_elf} function, \lstc{deref}, which is a function pointer. This
-operator does on a pointer, and is used to unwind a process that is not the
+\lstc{deref} function, when enabled, replaces every use of the dereferencing
-currently running process, and thus not sharing the same address space. A call
+\lstc{*} operator, and can be used to generate \ehelfs{} that will work on
 remote address spaces (\ie{} whenever the unwinding is not done on the process
 reading the \ehelf{} itself, but some other process, or even on a stack dump of
 a long-terminated process).
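To make the role of such a dereferencing function more concrete, here is a minimal sketch under stated assumptions: the callback type deref_func_t, the helpers set_stack_dump and deref_dump, and the single-row unwound_rip function are all hypothetical names, and the "remote" memory is modelled as a plain byte buffer holding a stack dump.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical callback type: reads one machine word at `addr` in the
 * address space of the process being unwound. */
typedef uintptr_t (*deref_func_t)(uintptr_t addr);

/* Hypothetical stack dump of a terminated process: `base` is the address the
 * first byte of `data` had in the original address space. */
static struct {
    uintptr_t base;
    const unsigned char *data;
    size_t size;
} g_dump;

void set_stack_dump(uintptr_t base, const void *data, size_t size)
{
    g_dump.base = base;
    g_dump.data = data;
    g_dump.size = size;
}

/* Reads a word from the dump. Out-of-range reads yield 0 here; a real
 * implementation would rather report an error. */
uintptr_t deref_dump(uintptr_t addr)
{
    uintptr_t value = 0;
    if (g_dump.size < sizeof value || addr < g_dump.base
        || addr - g_dump.base > g_dump.size - sizeof value)
        return 0;
    memcpy(&value, g_dump.data + (addr - g_dump.base), sizeof value);
    return value;
}

/* The row "CFA = rsp + 8, RA stored at CFA - 8", written with `deref`
 * instead of the `*` operator. */
uintptr_t unwound_rip(uintptr_t rsp, deref_func_t deref)
{
    uintptr_t cfa = rsp + 8;
    return deref(cfa - 8);
}
```

The generated code stays the same whether it unwinds the current process or a dump: only the function passed as deref changes.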
 Unlike in the \ehframe, and unlike what should be done in a release,
 real-world-proof version of the \ehelfs, the choice was made to keep this
@@ -675,20 +726,24 @@ is not sufficient to analyze every stack frame as \prog{gdb} would do after a
 In the unwind context from Listing~\ref{lst:unw_ctx}, the values of type
 \lstc{uintptr_t} are the values of the corresponding registers, and
-\lstc{flags} is a 8-bytes value, indicating for each register whether it is
+\lstc{flags} is an 8-bit value, indicating for each register whether it is
 present or not in this context (\ie{} if the \lstc{rbx} bit is not set, the
 value of \lstc{rbx} in the structure isn't meaningful), plus an error bit,
-indicating whether an error occurred during unwinding.
+indicating whether an error occurred during unwinding (which can be due \eg{}
 to an unsupported operation in the original DWARF, thus compiled to an error).
 This generated data is stored in separate shared object files, which we call
 \ehelfs. It would have been possible to alter the original ELF file to embed
-this data as a new section, but it getting it to be executed just as any
+this data as a new section, but getting it to be executed just as any
 portion of the \lstc{.text} section would probably have been painful, and
 keeping it separated during the experimental phase is quite convenient. It is
 possible to have multiple versions of \ehelfs{} files in parallel, with various
 options turned on or off, and it doesn't require altering the base system by
 editing \eg{} \texttt{/usr/lib/libc-*.so}. Instead, when the \ehelf{} data is
-required, those files can simply be \lstc{dlopen}'d.
+required, those files can simply be \lstc{dlopen}'d. It is also possible to
 imagine, in a future production environment, packaging \ehelfs{} files
 separately, so that people interested in heavy computation can have the choice
 to install them.
 \medskip
@@ -705,15 +760,19 @@ generated for the C code in Listing~\ref{lst:ex1_c}.
 Without any particular care to efficiency or compactness, it is already
 possible to produce a compiled version very close to the one described in
 Section~\ref{sec:semantics}. Although the unwinding speed cannot yet be
-actually benchmarked, it is already possible to write in a few hundreds of line
+actually benchmarked, it is already possible to write in a few hundred lines of
-of C a simple stack walker printing the functions traversed. It already works
+C code a simple stack walker printing the functions traversed. It already works
 without any problem on the easily tested cases, since corner cases are mostly
-found in standard and highly optimal libraries, and it is not that easy to get
+found in standard and highly optimized libraries, and it is not that easy to get
 the program to stop and print a stack trace from within a system library
 without using a debugger.
 The major drawback of this approach, without any particular care taken, is the
-space waste.
+space waste. The space taken by those tentative \ehelfs{} is analyzed in
 Table~\ref{table:basic_eh_elf_space} for \prog{hackbench}, a small program
 introduced later in Section~\ref{ssec:bench_perf}, and the libraries on which
 it depends.
 \begin{table}[h]
    \centering
@@ -736,11 +795,6 @@ space waste.
    \caption{Basic \ehelfs{} space usage}\label{table:basic_eh_elf_space}
 \end{table}
-The space taken by those tentative \ehelfs{} is analyzed in
-Table~\ref{table:basic_eh_elf_space} for \prog{hackbench}, a small program
-introduced later in Section~\ref{ssec:bench_perf}, and the libraries on which
-it depends.
 The first column only includes the sizes of the ELF sections \lstc{.text} (the
 program itself) and \lstc{.rodata}, the read-only data (such as static strings,
 etc.). Only the weight of the \lstc{.text} section of the generated \ehelfs{}
@@ -764,16 +818,17 @@ made in order to shrink the \ehelfs.
 The major optimization that most reduced the output size was to use an if/else
 tree implementing a binary search on the program counter relevant intervals,
 instead of a huge switch. In the process, we also \emph{outline} a lot of code,
-that is, find out identical code blocks, move them outside of the if/else tree,
+that is, find identical ``switch case'' bodies (which are not switch cases
-identify them by a label, and jump to them using a \lstc{goto}, which
+anymore, but if bodies), move them outside of the if/else tree, identify them
-de-duplicates a lot of code and contributes greatly to the shrinking. In the
+by a label, and jump to them using a \lstc{goto}, which de-duplicates a lot of
-process, we noticed that the vast majority of FDE rows are actually taken among
+code and contributes greatly to the shrinking. In the process, we noticed that
-very few ``common'' FDE rows.
+the vast majority of FDE rows are actually taken among very few ``common'' FDE
 rows.
 This makes this optimization really efficient, as seen later in
-Section~\ref{ssec:results_size}, but also makes it an interesting question ---
+Section~\ref{ssec:results_size}, but also makes it an interesting question
-not investigated during this internship --- to find out whether standard DWARF
+--~not investigated during this internship~-- to find out whether standard
-data could be efficiently compressed in this way.
+DWARF data could be efficiently compressed in this way.
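The if/else tree with outlined bodies can be sketched as follows. This is a hand-written illustration, not actual generated output: the function name cfa_offset_for is hypothetical, the address intervals are invented, and the row bodies are reduced to returning a CFA offset relative to rsp.

```c
#include <stdint.h>

/* Returns the CFA offset (relative to rsp) of the row covering `pc`, or -1
 * when no row covers it. Two invented FDEs are covered: [0x615, 0x660) and
 * [0x700, 0x740), with a hole between them. */
int cfa_offset_for(uintptr_t pc)
{
    /* Binary search on the interval bounds, written as nested if/else. */
    if (pc < 0x660) {
        if (pc < 0x619) {
            if (pc < 0x615) goto no_row;
            goto row_cfa8;
        } else {
            if (pc < 0x65c) goto row_cfa48;
            goto row_cfa8;
        }
    } else {
        if (pc < 0x704) {
            if (pc < 0x700) goto no_row;
            goto row_cfa8;
        } else {
            if (pc < 0x740) goto row_cfa48;
            goto no_row;
        }
    }

    /* Outlined bodies: each distinct row appears once and is shared by
     * every interval that uses it. */
row_cfa8:
    return 8;
row_cfa48:
    return 48;
no_row:
    return -1;
}
```

Five intervals share only two distinct row bodies here, which is the effect that makes outlining pay off when most rows are drawn from a few common ones.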
 \begin{minipage}{0.45\textwidth}
    \lstinputlisting[language=C, caption={\ehelf{} for the previous example},
@@ -806,15 +861,16 @@ However, unwinding over and over again from the same program point would have
 had no interest at all, since \prog{libunwind} would have simply cached the
 relevant DWARF row. Meanwhile, making sure that the various unwindings
 are made from different locations is somewhat cheating, since it defeats
-\prog{libunwind}'s caching. All in all, the benchmarking method must have a
+\prog{libunwind}'s caching and does not reproduce ``real-world'' unwinding
-``natural'' distribution of unwindings.
+distribution. All in all, the benchmarking method must have a ``natural''
 distribution of unwindings.
 Another requirement is to also distribute quite evenly the unwinding points
 across the program: we would like to benchmark stack unwindings crossing some
 standard library functions, starting from inside them, etc.
 Finally, the unwound program must be interesting enough to enter and exit a lot
-of function, nest function calls, have FDEs that are not as simple as in
+of functions, nest function calls, have FDEs that are not as simple as in
 Listing~\ref{lst:ex1_dw}, etc.
@@ -864,19 +920,23 @@ system and process as much as possible, to be able to unwind in any context.
 This very restricted information lacked a memory map (a table indicating which
 shared object is mapped at which address in memory) in order to use \ehelfs.
 Apart from this, the modified version of \prog{libunwind} produced is entirely
-compatible with the vanilla version.
+compatible with the vanilla version, meaning that the only modifications
 required to use \ehelfs{} within any project using \prog{libunwind} should be
 modifying one line of code (this function call, which is a setup function) and
 linking against the modified version of \prog{libunwind} instead of the system
 version.
 Once this was done, plugging it in \prog{perf} was a matter of a few lines of
-code only. The major problem encountered was to understand how \prog{perf}
+code only, apart from the benchmarking code. The major problem encountered was
-works. In order to avoid perturbing the traced program, \prog{perf} does not
+to understand how \prog{perf} works. In order to avoid perturbing the traced
-unwind at runtime, but rather records at regular interval the program's stack,
+program, \prog{perf} does not unwind at runtime, but rather records at regular
-and all the auxiliary information that is needed to unwind later. This is done
+intervals the program's stack, and all the auxiliary information that is needed
-when running \lstbash{perf record}. Then, \lstbash{perf report} unwinds the
+to unwind later. This is done when running \lstbash{perf record}. Then,
-stack to analyze it; but at this point of time, the traced process is long
+\lstbash{perf report} unwinds the stack to analyze it; but at this point in
-dead, thus any PID-based approach, or any approach using \texttt{/proc}
+time, the traced process is long dead, thus any PID-based approach, or any
-information will fail. However, as this was the easiest method, this approach
+approach using \texttt{/proc} information will fail. However, as this was the
-was chosen when implementing the first version of \ehelfs; thus requiring some
+easiest method, the first version of \ehelfs{} used those mechanisms; thus
-code rewriting.
+requiring some code rewriting.
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsection{Other explored methods}
@@ -884,15 +944,15 @@ code rewriting.
 The first benchmarking approach tried was to write some specific C code
 that would meet the requirements from Section~\ref{ssec:bench_req}, while
 itself calling a benchmarking procedure from time to time. This was abandoned
-quite fast, because generating C code interesting enough to be unwound turned
+quite quickly, because generating C code interesting enough to be unwound
-out hard, and the generated FDEs invariably ended out uninteresting. It would
+turned out to be hard, and the generated FDEs invariably ended up uninteresting. It
-also never have met the requirement of unwinding from fairly distributed
+would also never have met the requirement of unwinding from fairly distributed
 locations anyway.
 Another attempt was made using CSmith~\cite{csmith}, a random C code generator
 initially made for random testing of C compilers. The idea was still to craft an
 interesting C program that would unwind on its own frequently, but to integrate
-randomly generated C code with CSmith to integrate interesting C snippets that
+C code randomly generated by CSmith within hand-written C snippets that
 would generate large enough FDEs and nested calls. This was abandoned as well,
 as the call graph of CSmith-generated code is often far too small, and the
 CSmith code is notoriously hard to understand and edit.
--- a/shared/common.sty
+++ b/shared/common.sty
@@ -7,3 +7,6 @@
 \newcommand{\set}[1]{\left\{ #1 \right\}}
 \newcommand{\card}[1]{\left\vert{} #1 \right\vert}
 \newcommand{\abs}[1]{\left\vert{} #1 \right\vert}
 \newcommand{\tnhead}[2]{\multicolumn{1}{#1}{#2}}  % Table neutral head
 \newcommand{\spaced}[2]{\hspace{#1} #2 \hspace{#1}}
--- a/shared/specific.sty
+++ b/shared/specific.sty
@@ -1,5 +1,8 @@
 %% Specific commands for this project
 \newcommand{\stackfhead}[1]
    {\tnhead{l}{\hspace{-5ex}$\reg{rsp} #1$ \hspace{2em}}}
 \newcommand{\prog}[1]{\texttt{#1}}
 \newcommand{\ehelf}{\texttt{eh\_elf}}
 \newcommand{\ehelfs}{\texttt{eh\_elfs}}