Reword semantics section

This commit is contained in:
Théophile Bastian 2018-08-19 16:40:58 +02:00
parent ef6d68b36e
commit 21b39e2115
4 changed files with 145 additions and 107 deletions

View file

@ -22,6 +22,7 @@ Under supervision of Francesco Zappa Nardelli, March -- August 2018\\
\usepackage{wrapfig}
\usepackage{pgfplots}
\usepackage{placeins}
\usepackage{enumitem}
%\usepackage[backend=biber,style=alphabetic]{biblatex}
\usepackage[backend=biber]{biblatex}
@ -118,7 +119,7 @@ value.
\subsection{Stack unwinding}\label{ssec:stack_unwinding}
For various reasons, it is interesting, at some point of the execution of a
program, to glance at its program stack and be able to extract informations
program, to glance at its program stack and be able to extract information
from it. For instance, when running a debugger, a frequent usage is to obtain a
\emph{backtrace}, that is, the list of all nested function calls at the current
IP\@. This actually observes the stack to find the different stack frames, and
@ -152,7 +153,7 @@ register, it does not have to save it.
With this example, it seems pretty clear tha some additional data is necessary
to perform stack unwinding reliably, without only performing a guesswork. This
data is stored along with the debugging informations of a program, and one
data is stored along with the debugging information of a program, and one
common format of debugging data is DWARF\@.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@ -395,14 +396,14 @@ parse the relevant FDE from its start, until it finds the row it was seeking.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{DWARF semantics}\label{sec:semantics}
We will now define semantics covering most of the operations used for FDEs
described in the DWARF standard~\cite{dwarf5std}, such as seen in
We will now define semantics covering the operations used for FDEs described in
the DWARF standard~\cite{dwarf5std}, such as seen in
Listing~\ref{lst:ex1_dwraw}, with the exception of DWARF expressions. These are
not treated here, because they form a rich language and would take a lot
of time and space to formalize, while in the meantime being seldom used --~see
not treated here, because they form a rich language and would take a lot of
time and space to formalize, while in the mean time being seldom used --~see
Section~\ref{ssec:instr_cov}.
These semantics are defined with respect to the well-formalized C language, and
These semantics are defined \wrt{} the well-formalized C language, and
are passing through an intermediary language. The DWARF language can read the
whole memory, as well as registers, and is always executed for some instruction
pointer. The C function representing it will thus take as parameters an array
@ -412,17 +413,23 @@ registers values, which will represent the evaluated DWARF row.
\subsection{Original language: DWARF instructions}
These are the DWARF instructions used for CFI description, that is, the
instructions that contain the stack unwinding table informations. The following
list is an exhaustive list of instructions from the DWARF5
specification~\cite{dwarf5std} concerning CFI, with reworded descriptions for
brevity and clarity. All these instructions are up to variants --~most
instructions exist in multiple formats to handle various operands formatting,
to optimize space. Since we won't be talking about the underlying file format
here, those variations between eg. \dwcfa{advance\_loc1} and
\dwcfa{advance\_loc2} --~which differ only on the number of bytes of their
operand~-- are irrelevant and will be eluded.
instructions that contain the stack unwinding table information. Below is an
exhaustive list of instructions from the DWARF5 specification~\cite{dwarf5std}
concerning CFI, with reworded descriptions for brevity and clarity. All these
instructions are up to variants --~most instructions exist in multiple formats
to handle various operands formatting for space optimisation. Since we won't be
talking about the underlying file format here, those variations between \eg{}
\dwcfa{advance\_loc1} and \dwcfa{advance\_loc2} --~which differ only on the
number of bytes of their operand~-- are irrelevant and will be eluded.
\begin{itemize}
As said before, we also elude here references to DWARF expressions, as they are
complex and are mostly not implemented in the actual compiler anyway --~left
apart some special cases. Those expressions bring in complexity, as they are
turing-complete stack machine expressions that can access virtually the whole
computer's memory. Formalizing them would require designing semantics for such
a language.
\begin{itemize}[itemsep=3pt, parsep=0pt]
\item{} \dwcfa{set\_loc(loc)}:
start a new table row from address $loc$
\item{} \dwcfa{advance\_loc(delta)}:
@ -447,8 +454,7 @@ operand~-- are irrelevant and will be eluded.
\item{} \dwcfa{val\_offset(reg, offset)}:
the value of the register \reg{reg} is the value $CFA + \textit{offset}$
\item{} \dwcfa{register(reg, model)}:
the register \reg{reg} has, in this row, the value that $\reg{model}$
had in the previous row
the register \reg{reg} has, in this row, the value of $\reg{model}$.
\item{} \dwcfa{expression(reg, expr)}:
the value of \reg{reg} is stored in memory at the address defined by
$expr$
@ -457,15 +463,15 @@ operand~-- are irrelevant and will be eluded.
\item{} \dwcfa{restore(reg)}:
\reg{reg} has the same value as in this FDE's preamble (CIE) in this
row. This is \emph{not implemented in this semantics} for simplicity
and brevity (we would have to introduce CIE (preamble) and FDE (body)
independently). This is also not much used in actual ELF
files: the analysis in Section~\ref{ssec:instr_cov} found no such
instruction, on a random uniform sample of 4000 ELF files.
and brevity, as we would have to introduce CIE (preamble) and FDE
(body) independently. This is also not often used in actual ELF files:
the analysis in Section~\ref{ssec:instr_cov} found no such instruction,
on a random uniform sample of 4000 ELF files.
\item{} \dwcfa{remember\_state()}:
push the state of all the registers of this row on an implicit stack
push the state of all the registers of this row on a state-saving stack
\item{} \dwcfa{restore\_state()}:
pop an entry of the implicit stack, and restore all registers in this
row to the value held in the stack record.
pop an entry of the state-saving stack, and restore all registers in
this row to the value held in the stack record.
\item{} \dwcfa{nop()}:
do nothing (padding)
\end{itemize}
@ -495,30 +501,30 @@ Its grammar is as follows:
\end{align*}
The entry point of the grammar is a $\FDE$, which is a set of rows, each
annotated with a machine address, the address from which it is valid.
The addresses are necessarily increasing within a FDE\@.
annotated with the machine address from which it is valid. The addresses are
necessarily increasing within a FDE\@.
Each row then represents, as a function mapping registers to values, a row of
Each row is as a function mapping registers to values, and represents a row of
the unwinding table.
We implicitly consider that $\reg{reg}$ maps to a number, and we use here
\texttt{x86\_64} names for convenience, but actually in DWARF registers are
only handled as register identifiers, so we can safely state that $\reg{reg}
\in \regs$.
We implicitly consider that $\reg{reg}$ maps to a number. We use here
\texttt{x86\_64} names for convenience, although in DWARF, registers are merely
identifiers. Thus, we can safely state that $\reg{reg} \in \regs$.
A value can then be undefined, stored at memory address $x$ or be directly a
value $x$, $x$ being here a simple expression consisting of $\reg{reg} +
\textit{offset}$. The CFA is considered a simple register here. For instance,
to define $\reg{rax}$ to the value contained in memory 16 bytes below the CFA,
we would have $\reg{rax} \mapsto \valaddr{\reg{CFA}, -16}$, since the stack
grows downwards. We also leave open the possibility to extend the language with
DWARF expressions support as $\valexpr{}$, although we \emph{do not} specify
them here.
\textit{offset}$. The CFA is seen just as any other register here, although
DWARF makes a distinction between it and other columns. For instance, to define
$\reg{rax}$ to the value contained in memory 16 bytes below the CFA, we would
have $\reg{rax} \mapsto \valaddr{\reg{CFA}, -16}$, since the stack grows
downwards. We also leave open the possibility to extend the language with DWARF
expressions support as $\valexpr{}$, although we \emph{do not} specify them
here.
\subsection{Target language: a C function body}
The target language of these semantics is a C function, to be interpreted with
respect to the C11 standard~\cite{c11std}. The function is supposed to be run
The target language of these semantics is a C function, to be interpreted
\wrt{} the C11 standard~\cite{c11std}. The function is supposed to be run
in the context of the program being unwound. In particular, it must be able to
dereference some pointer derived from DWARF instructions that will point to the
execution stack, or even the heap.
@ -538,22 +544,35 @@ The function is the following:
The translation of $\intermedlang$ as produced by the later-defined function
are then to be inserted in this context, where the comment states so.
In pseudo-C code (for brevity) and assuming the functions and types used are
duly defined elsewhere, unwinding multiple frames would then look like this:
\lstinputlisting[language=C]{src/dw_semantics/stack_walker.c}
Thus, if we hold for true that the IP will remain in the same memory segment
--~\ie{} binary file~-- for two frames, we can safely unwind two frames this
way:
\lstinputlisting[language=C]{src/dw_semantics/unwind2.c}
\subsection{From DWARF to $\intermedlang$}
To define the interpretation of $\DWARF$ to $\intermedlang$, we will need to
proceed forward, but, as the language inherently depends on the previous
instructions to give a meaning to the following ones, we will depend on what
was computed before. At a point of the interpretation $h \vert t$, where $t$ is
what remains to be interpreted, $h$ what has been, and $H$ the result of the
interpretation, it would thus look like $\llbracket t \rrbracket (H)$.
But we also need to keep track of this implicit stack DWARF uses, which will be
kept in subscript.
\medskip
In DWARF, the instructions have a meaning that refer to previously interpreted
instructions, sequentially. For instance, many registers are defined at offsets
from the current CFA, which in turn was previously defined \wrt{} the former
CFA value, etc. Thus, to give a meaning to a DWARF instruction, knowledge of
the current row's values is needed. Let us consider a given point of the
interpretation of $d = h \cdot t$, where we already have interpreted $h$, the
first instructions, and interpreted it as $H \in \FDE$, while $t$ remains to be
interpreted. We then define the interpretation function $\llbracket t
\rrbracket (H)$, interpreting the remainder $t$ of the DWARF instructions,
having the knowledge of $H$, the current interpreted row.
But we also need to keep track of this state-saving stack DWARF uses, which
will be kept in subscript.
Thus, we define $\semI{\bullet}{s}(\bullet): \DWARF \times \FDE \to \FDE$, for
$s$ a stack of $\dwrow$, that is,
\[
s \in \rowstack := \dwrow^\ast
\]
@ -562,30 +581,49 @@ Implicitly, $\semI{\bullet}{} := \semI{\bullet}{\varepsilon}$
\medskip
For convenience, we define $\insarrow{reg}$, the operator changing the value of
a register for a given value in the last row, as
For convenience, we define $\insarrow{$r \in{} \regs$}$, an operator changing the
value assigned to a register, its right-hand side operand, in the last row
of a given $\FDE$, its left-hand side operand.
\[
\left(f \in \FDE\right) \insarrow{$r \in \regs$} (v \in values)
\left(f \in \FDE\right) \insarrow{$r \in{} \regs$} (v \in values)
\quad := \quad
\left( f\left[0 \ldots |f| - 2\right] \right) \cdot \left\{
\left( f\left[0\ \cdots\ \left(\vert f \vert - 2\right)\right] \right) \cdot \left\{
\begin{array}{r l}
r' \neq r &\mapsto \left(f[-1]\right)(r') \\
r &\mapsto v \\
\end{array} \right.
\]
The same way, we define $\extrarrow{reg}$ that \emph{extracts} the rule
currently applied for $\reg{reg}$, eg. $F \extrarrow{CFA} \valval{\reg{reg} +
\text{off}}$. If the rule currently applied in such a case is \emph{not} of the
form $\reg{reg} + \text{off}$, then the program is considered erroneous. One
can see this $\extrarrow{reg}$ somehow as a \lstc{match} statement in OCaml,
but with only one case, allowing to retrieve packed data.
Note that for convenience, we allow ourselves to index negatively an array to
retrieve its values from the end; thus, $f[-1]$ refers to the last entry of
$f$. If we consider the fictive following fictive row $R_0$,
More generally, we define ${\extrarrow{reg}}^{-k}$ as the same operation, but
extracting in the $k$-older row, ie. ${\extrarrow{reg}}^{0}$ is the same as
$\extrarrow{reg}$, and $F {\extrarrow{reg}}^{-1} \bullet$ is the same as
$F\left[0 \ldots |F|-2\right] \extrarrow{reg} \bullet$.
\[
R \in \dwrow := \left\{ \begin{array}{r l}
CFA &\mapsto \valval{\reg{rsp} - 48} \\
\reg{rbx} &\mapsto \valaddr{\reg{rsp} - 16}
\end{array}\right.
\]
then, we would have
\[
R \insarrow{\reg{rbx}} \left(\valaddr{\reg{rip - 24}}\right)
\quad = \quad
\left\{ \begin{array}{r l}
CFA &\mapsto \valval{\reg{rsp} - 48} \\
\reg{rbx} &\mapsto \valaddr{\reg{rsp} - 24}
\end{array}\right.
\]
The same way, we define $\extrarrow{reg}$ that \emph{extracts} the rule
currently applied for $\reg{reg}$ in the last row of a FDE, \eg{} $F
\extrarrow{CFA} \valval{\reg{reg} + \text{off}}$. If the rule currently applied
in such a case is \emph{not} of the form $\reg{reg} + \text{off}$, then the
program is considered erroneous. One can see this $\extrarrow{reg}$ as a
\lstc{match} statement in OCaml, but with only one case, allowing to retrieve
packed data, all the other unmatched cases corresponding to an error.
\begin{align*}
\semI{\varepsilon}{s}(F) &:= F \\
@ -594,83 +632,74 @@ $F\left[0 \ldots |F|-2\right] \extrarrow{reg} \bullet$.
\semI{\dwcfa{adv\_loc(delta)} \cdot d}{s}(F) &:=
\contsem{F \cdot \left(F[-1].addr + delta, F[-1].row \right)} \\
\semI{\dwcfa{def\_cfa(reg, offset)} \cdot d}{s}(F) &:=
\contsem{F \insarrow{CFA} \valval{\reg{reg} + offset}} \\
\contsem{F \insarrow{CFA} \valval{\reg{reg} + \textit{offset}}} \\
\semI{\dwcfa{def\_cfa\_register(reg)} \cdot d}{s}(F) &:=
\text{let F }\extrarrow{CFA} \valval{\reg{oldreg} + \text{oldoffset}}
\text{let F }\extrarrow{CFA} \valval{\reg{oldreg} + \textit{oldoffset}}
\text{ in} \\
&\quad \contsem{F \insarrow{CFA} \valval{\reg{reg} + oldoffset}} \\
&\quad \contsem{F \insarrow{CFA} \valval{\reg{reg} + \textit{oldoffset}}} \\
\semI{\dwcfa{def\_cfa\_offset(offset)} \cdot d}{s}(F) &:=
\text{let F }\extrarrow{CFA} \valval{\reg{oldreg} + \text{oldoffset}}
\text{let F }\extrarrow{CFA} \valval{\reg{oldreg} + \textit{oldoffset}}
\text{ in} \\
&\quad \contsem{F \insarrow{CFA} \valval{\reg{oldreg} + offset}} \\
\semI{\dwcfa{def\_cfa\_expression(expr)} \cdot d}{s}(F) &:=
\text{TO BE DEFINED} &\qtodo{CHECK ME?} \\
&\quad \contsem{F \insarrow{CFA} \valval{\reg{oldreg} + \textit{offset}}} \\
\semI{\dwcfa{undefined(reg)} \cdot d}{s}(F) &:=
\contsem{F \insarrow{reg} \bot} \\
\contsem{F \insarrow{\reg{reg}} \bot} \\
\semI{\dwcfa{same\_value(reg)} \cdot d}{s}(F) &:=
\valval{\reg{reg}} \\
\semI{\dwcfa{offset(reg, offset)} \cdot d}{s}(F) &:=
\contsem{F \insarrow{reg} \valaddr{\reg{CFA} + \textit{offset}}} \\
\contsem{F \insarrow{reg} \valaddr{\textit{CFA} + \textit{offset}}} \\
\semI{\dwcfa{val\_offset(reg, offset)} \cdot d}{s}(F) &:=
\contsem{F \insarrow{reg} \valval{\reg{CFA} + \textit{offset}}} \\
\contsem{F \insarrow{reg} \valval{\textit{CFA} + \textit{offset}}} \\
\semI{\dwcfa{register(reg, model)} \cdot d}{s}(F) &:=
\text{let } F {\extrarrow{model}}^{-1} r \text{ in }
\text{let } F {\extrarrow{model}} r \text{ in }
\contsem{F \insarrow{reg} r} \\
\semI{\dwcfa{expression(reg, expr)} \cdot d}{s}(F) &:=
\text{TO BE DEFINED} &\qtodo{CHECK ME?}\\
\semI{\dwcfa{val\_expression(reg, expr)} \cdot d}{s}(F) &:=
\text{TO BE DEFINED} &\qtodo{CHECK ME?}\\
% \semI{\dwcfa{restore(reg)} \cdot d}{s}(F) &:= \\ %% NOT IMPLEMENTED
\semI{\dwcfa{remember\_state()} \cdot d}{s}(F) &:=
\semI{d}{s \cdot F[-1].row}\left(F\right) \\
\semI{\dwcfa{restore\_state()} \cdot d}{s \cdot t}(F) &:=
\semI{d}{s\ \cdot\ \left(F[-1].row\right)}\left(F\right) \\
\semI{\dwcfa{restore\_state()} \cdot d}{s\ \cdot\ t}(F) &:=
\semI{d}{s}\left(F\left[0 \ldots |F|-2\right] \cdot
\left(F[-1].addr, t\right) \right) \\
\semI{\dwcfa{nop()} \cdot d}{s}(F) &:= \contsem{F}\\
\end{align*}
The stack is used for \texttt{remember\_state} and \texttt{restore\_state}. If
we omit those two operations, we can plainly remove the stack.
The state-saving stack is used only for \texttt{remember\_state} and
\texttt{restore\_state}. If we were to omit those two operations, we could
plainly remove the stack from our notations.
\subsection{From $\intermedlang$ to C}
\emph{The C code provided thereafter is a correct but inefficient reference
implementation, which is only provided for formalization, to specify DWARF
\wrt{} C. In particular, the actual compiler is \emph{not} implemented this
way.}
implementation, which is only provided to specify DWARF \wrt{} C. In
particular, the actual compiler is \emph{not} implemented this way.}
\medskip
We now define $\semC{\bullet}: \DWARF \to C$, in the context presented
earlier. The translation from $\intermedlang$ to C is defined as follows:
We now define $\semC{\bullet}: \DWARF \to C$, in the context presented earlier
in Section~\ref{lst:sem_c_ctx}. The translation from $\intermedlang$ to C is
defined as follows:
\begin{itemize}
\item $\semC{\varepsilon} =$ \\
\item $\semC{\varepsilon} =$
\begin{lstlisting}[language=C, mathescape=true]
else {
for(int reg=0; reg < NB_REGS; ++reg)
new_ctx[reg] = $\semR{\bot}$;
}
\end{lstlisting}
for(int reg=0; reg < NB_REGS; ++reg)
new_ctx[reg] = $\semR{\bot}$; \end{lstlisting}
\item $\semC{(\text{loc}, \text{row}) \cdot t} = C\_code \cdot \semC{t}$,
where $C\_code$ is
\begin{lstlisting}[language=C, mathescape=true]
if(ip >= $loc$) {
for(int reg=0; reg < NB_REGS; ++reg)
new_ctx[reg] = $\semR{row[reg]}$;
goto end_ifs; // Avoid using `else if` (easier for generation)
}
\end{lstlisting}
if(ip >= $loc$) {
for(int reg=0; reg < NB_REGS; ++reg)
new_ctx[reg] = $\semR{row[reg]}$;
goto end_ifs; // Avoid using `else if` (easier for generation)
} \end{lstlisting}
\end{itemize}
and $\semR{\bullet}$ is defined as
while $\semR{\bullet}$ is defined as
\begin{align*}
\semR{\bot} &= \text{\lstc{ERROR_VALUE}} \\
\semR{\valaddr{\text{reg}, \textit{offset}}} &=
\semR{\bot} &\eqspace{}
\text{\lstc{ERROR_VALUE}} \\
\semR{\valaddr{\text{reg}, \textit{offset}}} &\eqspace{}
\text{\lstc{*(old_ctx[reg] + offset)}} \\
\semR{\valval{\text{reg}, \textit{offset}}} &=
\semR{\valval{\text{reg}, \textit{offset}}} &\eqspace{}
\text{\lstc{(old_ctx[reg] + offset)}} \\
\end{align*}

View file

@ -0,0 +1,5 @@
while(!unwinding_done()) {
unwind_fct_t unwind_fct = get_unwinder_for_IP(current_context[RA]);
current_context = unwind_fct(current_context[RA], current_context);
do_something_with_context(current_context);
}

View file

@ -0,0 +1,2 @@
for(int i = 0; i < 2; ++i)
current_context = unwind_frame(current_context[RA], current_context);

View file

@ -38,3 +38,5 @@
\newcommand{\extrarrow}[1]{\xrightarrow{\text{#1}}}
\newcommand{\contsem}[1]{\semI{d}{s}\left(#1\right)}
\newcommand{\eqspace}{\quad = \quad}