Reword semantics section

2018-08-19 16:40:58 +02:00 · 2018-08-19 16:40:58 +02:00 · 21b39e2115
commit 21b39e2115
parent ef6d68b36e
4 changed files with 145 additions and 107 deletions
--- a/report/report.tex
+++ b/report/report.tex
@ -22,6 +22,7 @@ Under supervision of Francesco Zappa Nardelli, March -- August 2018\\
 \usepackage{wrapfig}
 \usepackage{pgfplots}
 \usepackage{placeins}
+\usepackage{enumitem}
 %\usepackage[backend=biber,style=alphabetic]{biblatex}
 \usepackage[backend=biber]{biblatex}

@ -118,7 +119,7 @@ value.
 \subsection{Stack unwinding}\label{ssec:stack_unwinding}

 For various reasons, it is interesting, at some point of the execution of a
-program, to glance at its program stack and be able to extract informations
+program, to glance at its program stack and be able to extract information
 from it. For instance, when running a debugger, a frequent usage is to obtain a
 \emph{backtrace}, that is, the list of all nested function calls at the current
 IP\@. This actually observes the stack to find the different stack frames, and
@ -152,7 +153,7 @@ register, it does not have to save it.

 With this example, it seems pretty clear tha some additional data is necessary
 to perform stack unwinding reliably, without only performing a guesswork. This
-data is stored along with the debugging informations of a program, and one
+data is stored along with the debugging information of a program, and one
 common format of debugging data is DWARF\@.

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@ -395,14 +396,14 @@ parse the relevant FDE from its start, until it finds the row it was seeking.
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \section{DWARF semantics}\label{sec:semantics}

-We will now define semantics covering most of the operations used for FDEs
-described in the DWARF standard~\cite{dwarf5std}, such as seen in
+We will now define semantics covering the operations used for FDEs described in
+the DWARF standard~\cite{dwarf5std}, such as seen in
 Listing~\ref{lst:ex1_dwraw}, with the exception of DWARF expressions. These are
-not treated here, because they form a rich language and would take a lot
-of time and space to formalize, while in the meantime being seldom used --~see
+not treated here, because they form a rich language and would take a lot of
+time and space to formalize, while in the mean time being seldom used --~see
 Section~\ref{ssec:instr_cov}.

-These semantics are defined with respect to the well-formalized C language, and
+These semantics are defined \wrt{} the well-formalized C language, and
 are passing through an intermediary language. The DWARF language can read the
 whole memory, as well as registers, and is always executed for some instruction
 pointer. The C function representing it will thus take as parameters an array
@ -412,17 +413,23 @@ registers values, which will represent the evaluated DWARF row.
 \subsection{Original language: DWARF instructions}

 These are the DWARF instructions used for CFI description, that is, the
-instructions that contain the stack unwinding table informations. The following
-list is an exhaustive list of instructions from the DWARF5
-specification~\cite{dwarf5std} concerning CFI, with reworded descriptions for
-brevity and clarity. All these instructions are up to variants --~most
-instructions exist in multiple formats to handle various operands formatting,
-to optimize space. Since we won't be talking about the underlying file format
-here, those variations between eg. \dwcfa{advance\_loc1} and
-\dwcfa{advance\_loc2} --~which differ only on the number of bytes of their
-operand~-- are irrelevant and will be eluded.
+instructions that contain the stack unwinding table information. Below is an
+exhaustive list of instructions from the DWARF5 specification~\cite{dwarf5std}
+concerning CFI, with reworded descriptions for brevity and clarity. All these
+instructions are up to variants --~most instructions exist in multiple formats
+to handle various operands formatting for space optimisation. Since we won't be
+talking about the underlying file format here, those variations between \eg{}
+\dwcfa{advance\_loc1} and \dwcfa{advance\_loc2} --~which differ only on the
+number of bytes of their operand~-- are irrelevant and will be eluded.

-\begin{itemize}
+As said before, we also elude here references to DWARF expressions, as they are
+complex and are mostly not implemented in the actual compiler anyway --~left
+apart some special cases. Those expressions bring in complexity, as they are
+turing-complete stack machine expressions that can access virtually the whole
+computer's memory. Formalizing them would require designing semantics for such
+a language.
+
+\begin{itemize}[itemsep=3pt, parsep=0pt]
    \item{} \dwcfa{set\_loc(loc)}:
        start a new table row from address $loc$
    \item{} \dwcfa{advance\_loc(delta)}:
@ -447,8 +454,7 @@ operand~-- are irrelevant and will be eluded.
    \item{} \dwcfa{val\_offset(reg, offset)}:
        the value of the register \reg{reg} is the value $CFA + \textit{offset}$
    \item{} \dwcfa{register(reg, model)}:
-        the register \reg{reg} has, in this row, the value that $\reg{model}$
-        had in the previous row
+        the register \reg{reg} has, in this row, the value of $\reg{model}$.
    \item{} \dwcfa{expression(reg, expr)}:
        the value of \reg{reg} is stored in memory at the address defined by
        $expr$
@ -457,15 +463,15 @@ operand~-- are irrelevant and will be eluded.
    \item{} \dwcfa{restore(reg)}:
        \reg{reg} has the same value as in this FDE's preamble (CIE) in this
        row. This is \emph{not implemented in this semantics} for simplicity
-        and brevity (we would have to introduce CIE (preamble) and FDE (body)
-        independently). This is also not much used in actual ELF
-        files: the analysis in Section~\ref{ssec:instr_cov} found no such
-        instruction, on a random uniform sample of 4000 ELF files.
+        and brevity, as we would have to introduce CIE (preamble) and FDE
+        (body) independently. This is also not often used in actual ELF files:
+        the analysis in Section~\ref{ssec:instr_cov} found no such instruction,
+        on a random uniform sample of 4000 ELF files.
    \item{} \dwcfa{remember\_state()}:
-        push the state of all the registers of this row on an implicit stack
+        push the state of all the registers of this row on a state-saving stack
    \item{} \dwcfa{restore\_state()}:
-        pop an entry of the implicit stack, and restore all registers in this
-        row to the value held in the stack record.
+        pop an entry of the state-saving stack, and restore all registers in
+        this row to the value held in the stack record.
    \item{} \dwcfa{nop()}:
        do nothing (padding)
 \end{itemize}
@ -495,30 +501,30 @@ Its grammar is as follows:
 \end{align*}

 The entry point of the grammar is a $\FDE$, which is a set of rows, each
-annotated with a machine address, the address from which it is valid.
-The addresses are necessarily increasing within a FDE\@.
+annotated with the machine address from which it is valid.  The addresses are
+necessarily increasing within a FDE\@.

-Each row then represents, as a function mapping registers to values, a row of
+Each row is as a function mapping registers to values, and represents a row of
 the unwinding table.

-We implicitly consider that $\reg{reg}$ maps to a number, and we use here
-\texttt{x86\_64} names for convenience, but actually in DWARF registers are
-only handled as register identifiers, so we can safely state that $\reg{reg}
-\in \regs$.
+We implicitly consider that $\reg{reg}$ maps to a number. We use here
+\texttt{x86\_64} names for convenience, although in DWARF, registers are merely
+identifiers.  Thus, we can safely state that $\reg{reg} \in \regs$.

 A value can then be undefined, stored at memory address $x$ or be directly a
 value $x$, $x$ being here a simple expression consisting of $\reg{reg} +
-\textit{offset}$. The CFA is considered a simple register here. For instance,
-to define $\reg{rax}$ to the value contained in memory 16 bytes below the CFA,
-we would have $\reg{rax} \mapsto \valaddr{\reg{CFA}, -16}$, since the stack
-grows downwards. We also leave open the possibility to extend the language with
-DWARF expressions support as $\valexpr{}$, although we \emph{do not} specify
-them here.
+\textit{offset}$. The CFA is seen just as any other register here, although
+DWARF makes a distinction between it and other columns. For instance, to define
+$\reg{rax}$ to the value contained in memory 16 bytes below the CFA, we would
+have $\reg{rax} \mapsto \valaddr{\reg{CFA}, -16}$, since the stack grows
+downwards. We also leave open the possibility to extend the language with DWARF
+expressions support as $\valexpr{}$, although we \emph{do not} specify them
+here.

 \subsection{Target language: a C function body}

-The target language of these semantics is a C function, to be interpreted with
-respect to the C11 standard~\cite{c11std}. The function is supposed to be run
+The target language of these semantics is a C function, to be interpreted
+\wrt{} the C11 standard~\cite{c11std}. The function is supposed to be run
 in the context of the program being unwound. In particular, it must be able to
 dereference some pointer derived from DWARF instructions that will point to the
 execution stack, or even the heap.
@ -538,22 +544,35 @@ The function is the following:
 The translation of $\intermedlang$ as produced by the later-defined function
 are then to be inserted in this context, where the comment states so.

+In pseudo-C code (for brevity) and assuming the functions and types used are
+duly defined elsewhere, unwinding multiple frames would then look like this:
+
+\lstinputlisting[language=C]{src/dw_semantics/stack_walker.c}
+
+Thus, if we hold for true that the IP will remain in the same memory segment
+--~\ie{} binary file~-- for two frames, we can safely unwind two frames this
+way:
+
+\lstinputlisting[language=C]{src/dw_semantics/unwind2.c}
+
 \subsection{From DWARF to $\intermedlang$}

-To define the interpretation of $\DWARF$ to $\intermedlang$, we will need to
-proceed forward, but, as the language inherently depends on the previous
-instructions to give a meaning to the following ones, we will depend on what
-was computed before. At a point of the interpretation $h \vert t$, where $t$ is
-what remains to be interpreted, $h$ what has been, and $H$ the result of the
-interpretation, it would thus look like $\llbracket t \rrbracket (H)$.
-
-But we also need to keep track of this implicit stack DWARF uses, which will be
-kept in subscript.
-
-\medskip
+In DWARF, the instructions have a meaning that refer to previously interpreted
+instructions, sequentially. For instance, many registers are defined at offsets
+from the current CFA, which in turn was previously defined \wrt{} the former
+CFA value, etc. Thus, to give a meaning to a DWARF instruction, knowledge of
+the current row's values is needed. Let us consider a given point of the
+interpretation of $d = h \cdot t$, where we already have interpreted $h$, the
+first instructions, and interpreted it as $H \in \FDE$, while $t$ remains to be
+interpreted. We then define the interpretation function $\llbracket t
+\rrbracket (H)$, interpreting the remainder $t$ of the DWARF instructions,
+having the knowledge of $H$, the current interpreted row.

+But we also need to keep track of this state-saving stack DWARF uses, which
+will be kept in subscript.
 Thus, we define $\semI{\bullet}{s}(\bullet): \DWARF \times \FDE \to \FDE$, for
 $s$ a stack of $\dwrow$, that is,
+
 \[
    s \in \rowstack := \dwrow^\ast
 \]
@ -562,30 +581,49 @@ Implicitly, $\semI{\bullet}{} := \semI{\bullet}{\varepsilon}$

 \medskip

-For convenience, we define $\insarrow{reg}$, the operator changing the value of
-a register for a given value in the last row, as
+For convenience, we define $\insarrow{$r \in{} \regs$}$, an operator changing the
+value assigned to a register, its right-hand side operand, in the last row
+of a given $\FDE$, its left-hand side operand.

 \[
-    \left(f \in \FDE\right) \insarrow{$r \in \regs$} (v \in values)
+    \left(f \in \FDE\right) \insarrow{$r \in{} \regs$} (v \in values)
    \quad := \quad
-    \left( f\left[0 \ldots |f| - 2\right] \right) \cdot \left\{
+    \left( f\left[0\ \cdots\  \left(\vert f \vert - 2\right)\right] \right) \cdot \left\{
        \begin{array}{r l}
            r' \neq r &\mapsto \left(f[-1]\right)(r') \\
            r &\mapsto v \\
        \end{array} \right.
 \]

-The same way, we define $\extrarrow{reg}$ that \emph{extracts} the rule
-currently applied for $\reg{reg}$, eg. $F \extrarrow{CFA} \valval{\reg{reg} +
-\text{off}}$. If the rule currently applied in such a case is \emph{not} of the
-form $\reg{reg} + \text{off}$, then the program is considered erroneous. One
-can see this $\extrarrow{reg}$ somehow as a \lstc{match} statement in OCaml,
-but with only one case, allowing to retrieve packed data.
+Note that for convenience, we allow ourselves to index negatively an array to
+retrieve its values from the end; thus, $f[-1]$ refers to the last entry of
+$f$. If we consider the fictive following fictive row $R_0$,

-More generally, we define ${\extrarrow{reg}}^{-k}$ as the same operation, but
-extracting in the $k$-older row, ie. ${\extrarrow{reg}}^{0}$ is the same as
-$\extrarrow{reg}$, and $F {\extrarrow{reg}}^{-1} \bullet$ is the same as
-$F\left[0 \ldots |F|-2\right] \extrarrow{reg} \bullet$.
+\[
+    R \in \dwrow := \left\{ \begin{array}{r l}
+        CFA &\mapsto \valval{\reg{rsp} - 48} \\
+        \reg{rbx} &\mapsto \valaddr{\reg{rsp} - 16}
+    \end{array}\right.
+\]
+
+then, we would have
+
+\[
+    R \insarrow{\reg{rbx}} \left(\valaddr{\reg{rip - 24}}\right)
+    \quad = \quad
+    \left\{ \begin{array}{r l}
+        CFA &\mapsto \valval{\reg{rsp} - 48} \\
+        \reg{rbx} &\mapsto \valaddr{\reg{rsp} - 24}
+    \end{array}\right.
+\]
+
+The same way, we define $\extrarrow{reg}$ that \emph{extracts} the rule
+currently applied for $\reg{reg}$ in the last row of a FDE, \eg{} $F
+\extrarrow{CFA} \valval{\reg{reg} + \text{off}}$. If the rule currently applied
+in such a case is \emph{not} of the form $\reg{reg} + \text{off}$, then the
+program is considered erroneous. One can see this $\extrarrow{reg}$ as a
+\lstc{match} statement in OCaml, but with only one case, allowing to retrieve
+packed data, all the other unmatched cases corresponding to an error.

 \begin{align*}
    \semI{\varepsilon}{s}(F) &:= F \\
@ -594,83 +632,74 @@ $F\left[0 \ldots |F|-2\right] \extrarrow{reg} \bullet$.
    \semI{\dwcfa{adv\_loc(delta)} \cdot d}{s}(F) &:=
        \contsem{F \cdot \left(F[-1].addr + delta, F[-1].row \right)} \\
    \semI{\dwcfa{def\_cfa(reg, offset)} \cdot d}{s}(F) &:=
-        \contsem{F \insarrow{CFA} \valval{\reg{reg} + offset}} \\
+        \contsem{F \insarrow{CFA} \valval{\reg{reg} + \textit{offset}}} \\
    \semI{\dwcfa{def\_cfa\_register(reg)} \cdot d}{s}(F) &:=
-        \text{let F }\extrarrow{CFA} \valval{\reg{oldreg} + \text{oldoffset}}
+        \text{let F }\extrarrow{CFA} \valval{\reg{oldreg} + \textit{oldoffset}}
        \text{ in} \\
-        &\quad \contsem{F \insarrow{CFA} \valval{\reg{reg} + oldoffset}} \\
+        &\quad \contsem{F \insarrow{CFA} \valval{\reg{reg} + \textit{oldoffset}}} \\
    \semI{\dwcfa{def\_cfa\_offset(offset)} \cdot d}{s}(F) &:=
-        \text{let F }\extrarrow{CFA} \valval{\reg{oldreg} + \text{oldoffset}}
+        \text{let F }\extrarrow{CFA} \valval{\reg{oldreg} + \textit{oldoffset}}
        \text{ in} \\
-        &\quad \contsem{F \insarrow{CFA} \valval{\reg{oldreg} + offset}} \\
-    \semI{\dwcfa{def\_cfa\_expression(expr)} \cdot d}{s}(F) &:=
-        \text{TO BE DEFINED} &\qtodo{CHECK ME?} \\
+        &\quad \contsem{F \insarrow{CFA} \valval{\reg{oldreg} + \textit{offset}}} \\
    \semI{\dwcfa{undefined(reg)} \cdot d}{s}(F) &:=
-        \contsem{F \insarrow{reg} \bot} \\
+        \contsem{F \insarrow{\reg{reg}} \bot} \\
    \semI{\dwcfa{same\_value(reg)} \cdot d}{s}(F) &:=
        \valval{\reg{reg}} \\
    \semI{\dwcfa{offset(reg, offset)} \cdot d}{s}(F) &:=
-        \contsem{F \insarrow{reg} \valaddr{\reg{CFA} + \textit{offset}}} \\
+        \contsem{F \insarrow{reg} \valaddr{\textit{CFA} + \textit{offset}}} \\
    \semI{\dwcfa{val\_offset(reg, offset)} \cdot d}{s}(F) &:=
-        \contsem{F \insarrow{reg} \valval{\reg{CFA} + \textit{offset}}} \\
+        \contsem{F \insarrow{reg} \valval{\textit{CFA} + \textit{offset}}} \\
    \semI{\dwcfa{register(reg, model)} \cdot d}{s}(F) &:=
-        \text{let } F {\extrarrow{model}}^{-1} r \text{ in }
+        \text{let } F {\extrarrow{model}} r \text{ in }
        \contsem{F \insarrow{reg} r} \\
-    \semI{\dwcfa{expression(reg, expr)} \cdot d}{s}(F) &:=
-        \text{TO BE DEFINED} &\qtodo{CHECK ME?}\\
-    \semI{\dwcfa{val\_expression(reg, expr)} \cdot d}{s}(F) &:=
-        \text{TO BE DEFINED} &\qtodo{CHECK ME?}\\
-%    \semI{\dwcfa{restore(reg)} \cdot d}{s}(F) &:= \\  %% NOT IMPLEMENTED
    \semI{\dwcfa{remember\_state()} \cdot d}{s}(F) &:=
-        \semI{d}{s \cdot F[-1].row}\left(F\right) \\
-    \semI{\dwcfa{restore\_state()} \cdot d}{s \cdot t}(F) &:=
+        \semI{d}{s\ \cdot\ \left(F[-1].row\right)}\left(F\right) \\
+    \semI{\dwcfa{restore\_state()} \cdot d}{s\ \cdot\ t}(F) &:=
        \semI{d}{s}\left(F\left[0 \ldots |F|-2\right] \cdot
        \left(F[-1].addr, t\right) \right) \\
    \semI{\dwcfa{nop()} \cdot d}{s}(F) &:= \contsem{F}\\
 \end{align*}

-The stack is used for \texttt{remember\_state} and \texttt{restore\_state}. If
-we omit those two operations, we can plainly remove the stack.
+The state-saving stack is used only for \texttt{remember\_state} and
+\texttt{restore\_state}. If we were to omit those two operations, we could
+plainly remove the stack from our notations.


 \subsection{From $\intermedlang$ to C}

 \emph{The C code provided thereafter is a correct but inefficient reference
-implementation, which is only provided for formalization, to specify DWARF
-\wrt{} C. In particular, the actual compiler is \emph{not} implemented this
-way.}
+implementation, which is only provided to specify DWARF \wrt{} C. In
+particular, the actual compiler is \emph{not} implemented this way.}

 \medskip

-We now define $\semC{\bullet}: \DWARF \to C$, in the context presented
-earlier. The translation from $\intermedlang$ to C is defined as follows:
+We now define $\semC{\bullet}: \DWARF \to C$, in the context presented earlier
+in Section~\ref{lst:sem_c_ctx}. The translation from $\intermedlang$ to C is
+defined as follows:

 \begin{itemize}
-    \item $\semC{\varepsilon} =$ \\
+    \item $\semC{\varepsilon} =$
        \begin{lstlisting}[language=C, mathescape=true]
-            else {
-                for(int reg=0; reg < NB_REGS; ++reg)
-                    new_ctx[reg] = $\semR{\bot}$;
-            }
-        \end{lstlisting}
+for(int reg=0; reg < NB_REGS; ++reg)
+    new_ctx[reg] = $\semR{\bot}$; \end{lstlisting}

    \item $\semC{(\text{loc}, \text{row}) \cdot t} = C\_code \cdot \semC{t}$,
        where $C\_code$ is
        \begin{lstlisting}[language=C, mathescape=true]
-            if(ip >= $loc$) {
-                for(int reg=0; reg < NB_REGS; ++reg)
-                    new_ctx[reg] = $\semR{row[reg]}$;
-                goto end_ifs; // Avoid using `else if` (easier for generation)
-            }
-        \end{lstlisting}
+if(ip >= $loc$) {
+    for(int reg=0; reg < NB_REGS; ++reg)
+        new_ctx[reg] = $\semR{row[reg]}$;
+    goto end_ifs; // Avoid using `else if` (easier for generation)
+} \end{lstlisting}
 \end{itemize}

-and $\semR{\bullet}$ is defined as
+while $\semR{\bullet}$ is defined as
 \begin{align*}
-    \semR{\bot} &= \text{\lstc{ERROR_VALUE}} \\
-    \semR{\valaddr{\text{reg}, \textit{offset}}} &=
+    \semR{\bot} &\eqspace{}
+        \text{\lstc{ERROR_VALUE}} \\
+    \semR{\valaddr{\text{reg}, \textit{offset}}} &\eqspace{}
        \text{\lstc{*(old_ctx[reg] + offset)}} \\
-    \semR{\valval{\text{reg}, \textit{offset}}} &=
+    \semR{\valval{\text{reg}, \textit{offset}}} &\eqspace{}
        \text{\lstc{(old_ctx[reg] + offset)}} \\
 \end{align*}

--- a/report/src/dw_semantics/stack_walker.c
+++ b/report/src/dw_semantics/stack_walker.c
@ -0,0 +1,5 @@
+while(!unwinding_done()) {
+    unwind_fct_t unwind_fct = get_unwinder_for_IP(current_context[RA]);
+    current_context = unwind_fct(current_context[RA], current_context);
+    do_something_with_context(current_context);
+}
--- a/report/src/dw_semantics/unwind2.c
+++ b/report/src/dw_semantics/unwind2.c
@ -0,0 +1,2 @@
+for(int i = 0; i < 2; ++i)
+    current_context = unwind_frame(current_context[RA], current_context);
--- a/shared/specific.sty
+++ b/shared/specific.sty
@ -38,3 +38,5 @@
 \newcommand{\extrarrow}[1]{\xrightarrow{\text{#1}}}

 \newcommand{\contsem}[1]{\semI{d}{s}\left(#1\right)}
+
+\newcommand{\eqspace}{\quad = \quad}