\title{DWARF Semantics} \author{Théophile Bastian} \date{\today} \documentclass[11pt]{article} \usepackage[left=2cm,right=2cm,top=2cm,bottom=2cm]{geometry} \usepackage{amsmath} \usepackage{amssymb} \usepackage{stmaryrd} \usepackage{mathtools} \usepackage[utf8]{inputenc} \usepackage[backend=biber,style=alphabetic]{biblatex} \usepackage{my_listings} \usepackage{my_hyperref} \addbibresource{semantics.bib} \newcommand{\dwcfa}[1]{\texttt{DW\_CFA\_#1}} \newcommand{\reg}[1]{\%#1} \newcommand{\DWARF}{\operatorname{DWARF}} \newcommand{\FDE}{\operatorname{FDE}} \newcommand{\dwrow}{\operatorname{Row}} \newcommand{\spexpr}{\mathbb{E}} \newcommand{\regs}{\mathbb{R}} \newcommand{\values}{\mathbb{V}} \newcommand{\valaddr}[1]{\operatorname{Addr}\left(#1\right)} \newcommand{\valval}[1]{\operatorname{Val}\left(#1\right)} \newcommand{\intermedlang}{\mathcal{I}} \newcommand{\semI}[2]{{\left\llbracket{} #1 \right\rrbracket}^\intermedlang_{#2}} \newcommand{\semC}[1]{{\left\llbracket{} #1 \right\rrbracket}^{C}} \newcommand{\semR}[1]{{\left\llbracket{} #1 \right\rrbracket}^{R}} \newcommand{\rowstack}{\mathbb{S}} \newcommand{\insarrow}[1]{\xleftarrow{\text{#1}}} \newcommand{\extrarrow}[1]{\xrightarrow{\text{#1}}} \newcommand{\contsem}[1]{\semI{d}{s}\left(#1\right)} \newcommand{\ehelf}{\texttt{EH\_ELF}} \begin{document} \maketitle We will now define semantics covering most of the operations used for CFI\footnote{To be defined elsewhere in the report} described in the DWARF standard~\cite{dwarf5std}, with the exception of DWARF expressions. These are not exhaustively treated because they are quite rich and would take a lot of time and space to formalize, and in the meantime are only seldom used (see the DWARF statistics regarding this). These semantics are defined with respect to the well-formalized C language, and are passing through an intermediary language. The DWARF language can read the whole memory, as well as registers, and is always executed for some instruction pointer. The C function representing it will thus take as parameters an array of the registers' values as well as an IP, and will return another array of registers values, which will represent the evaluated DWARF row. \section{Original language~: DWARF instructions} These are the DWARF instructions used for CFI description, that is, the instructions that contain the stack unwinding table informations. The following list is an exhaustive list of instructions from the DWARF5 specification~\cite{dwarf5std} concerning CFI, with reworded descriptions for brevity and clarity. All these instructions are up to variants (most instructions exist in multiple formats to handle various operands formatting, to optimize space). Since we won't be talking about the underlying file format here, those variations between eg. \dwcfa{advance\_loc1} and \dwcfa{advance\_loc2} ---~which differ only on the number of bytes of their operand~--- are irrelevant and will be eluded. \begin{itemize} \item{} \dwcfa{set\_loc(loc)}~: start a new table row from address $loc$ \item{} \dwcfa{advance\_loc(delta)}~: start a new table row at address $prev\_loc + delta$ \item{} \dwcfa{def\_cfa(reg, offset)}~: sets this row's CFA at $(\reg{reg} + \textit{offset})$ \item{} \dwcfa{def\_cfa\_register(reg)}~: sets CFA at $(\reg{reg} + \textit{prev\_offset})$ \item{} \dwcfa{def\_cfa\_offset(offset)}~: sets CFA at $(\reg{prev\_reg} + \textit{offset})$ \item{} \dwcfa{def\_cfa\_expression(expr)}~: sets CFA as the result of $expr$ \item{} \dwcfa{undefined(reg)}~: sets the register \reg{reg} as undefined in this row \item{} \dwcfa{same\_value(reg)}~: declares that the register \reg{reg} hasn't been touched, or was restored to its previous value, in this row. An unwinding procedure can leave it as-is. \item{} \dwcfa{offset(reg, offset)}~: the value of the register \reg{reg} is stored in memory at the address $CFA + \textit{offset}$. \item{} \dwcfa{val\_offset(reg, offset)}~: the value of the register \reg{reg} is the value $CFA + \textit{offset}$ \item{} \dwcfa{register(reg, model)}~: the register \reg{reg} has, in this row, the value that $\reg{model}$ had in the previous row \item{} \dwcfa{expression(reg, expr)}~: the value of \reg{reg} is stored in memory at the address defined by $expr$ \item{} \dwcfa{val\_expression(reg, expr)}~: \reg{reg} has the value of $expr$ \item{} \dwcfa{restore(reg)}~: \reg{reg} has the same value as in this FDE's preamble (CIE) in this row. This is \emph{not implemented in this semantics} for simplicity and brevity (we would have to introduce CIE (preamble) and FDE (body) independently). This is also not much used in actual ELF files\footnote{TODO: refer to stats}. \item{} \dwcfa{remember\_state()}~: push the state of all the registers of this row on an implicit stack \item{} \dwcfa{restore\_state()}~: pop an entry of the implicit stack, and restore all registers in this row to the value held in the stack record. \item{} \dwcfa{nop()}~: do nothing (padding) \end{itemize} \section{Intermediary language $\intermedlang$} A first pass will translate DWARF instructions into this intermediary language $\intermedlang$. It is designed to be more mathematical, representing the same thing, but abstracting all the data compression of the DWARF format away, so that we can better reason on it and transform it into C code. Its grammar is as follows: \begin{align*} \FDE &::= {\left(\mathbb{Z} \times \dwrow \right)}^{\ast} & \text{FDE (set of rows)} \\ \dwrow &::= \values ^ \regs & \text{A single table row} \\ \regs &::= \left\{0, 1, \ldots, \operatorname{NB\_REGS - 1} \right\} & \text{Machine registers} \\ \values &::= \bot & \text{Values: undefined,}\\ &\quad\vert~\valaddr{\spexpr} & \text{at address $x$},\\ &\quad\vert~\valval{\spexpr} & \text{of value $x$} \\ \spexpr &::= \regs \times \mathbb{Z} & \text{A ``simple'' expression $\reg{reg} + \textit{offset}$} \\ \end{align*} The entry point of the grammar is a $\FDE$, which is a set of rows, each annotated with a machine address, the address from which it is valid. Note that the addresses are necessarily increasing within a FDE\@. Each row then represents, as a function mapping registers to values, a row of the unwinding table. We implicitly consider that $\reg{reg}$ maps to a number, and we use here \texttt{x86\_64} names for convenience, but actually in DWARF registers are only handled as register identifiers, so we can safely state that $\reg{reg} \in \regs$. A value can then be undefined, stored at memory address $x$ or be directly a value $x$, $x$ being here a simple expression consisting of $\reg{reg} + \textit{offset}$. The CFA is considered a simple register here. For instance, to define $\reg{rax}$ to the value contained in memory 16 bytes below the CFA, we would have $\reg{rax} \mapsto \valaddr{\reg{CFA}, -16}$ (for the stack grows downwards). \section{Target language~: a C function body} The target language of these semantics is a C function, to be interpreted with respect to the C11 standard~\cite{c11std}. The function is supposed to be run in the context of the program being unwound. In particular, it must be able to dereference some pointer derived from DWARF instructions that will point to the execution stack, or even the heap. This function takes as arguments an instruction pointer ---~supposedly extracted from $\reg{rip}$~--- and an array of register values; and returns a fresh array of register values after unwinding this call frame. The function is compositional\footnote{up to technicities: the IP obtained after unwinding the first frame might be handled in a different dynamically loaded object, and this would require inspecting the DWARF located in another file}: it can be called twice in a row to unwind two stack frames. The function is the following~: \lstinputlisting[language=C]{src/c_context.c} The translation of $\intermedlang$ as produced by the later-defined function are then to be inserted in this context, where the comment states so. \section{From DWARF to $\intermedlang$} To define the interpretation of $\DWARF$ to $\intermedlang$, we will need to proceed forward, but, as the language inherently depends on the previous instructions to give a meaning to the following ones, we will depend on what was computed before. At a point of the interpretation $h \vert t$, where $t$ is what remains to be interpreted, $h$ what has been, and $H$ the result of the interpretation, it would thus look like $\llbracket t \rrbracket (H)$. But we also need to keep track of this implicit stack DWARF uses, which will be kept in subscript. \vspace{1em} Thus, we define $\semI{\bullet}{s}(\bullet) : \DWARF \times \FDE \to \FDE$, for $s$ a stack of $\dwrow$, that is, \[ s \in \rowstack := \dwrow^\ast \] Implicitly, $\semI{\bullet}{} := \semI{\bullet}{\varepsilon}$ \vspace{1em} For convenience, we define $\insarrow{reg}$, the operator changing the value of a register for a given value in the last row, as \[ \left(f \in \FDE\right) \insarrow{$r \in \regs$} (v \in values) \quad := \quad \left( f\left[0 \ldots |f| - 2\right] \right) \cdot \left\{ \begin{array}{r l} r' \neq r &\mapsto \left(f[-1]\right)(r') \\ r &\mapsto v \\ \end{array} \right. \] The same way, we define $\extrarrow{reg}$ that \emph{extracts} the rule currently applied for $\reg{reg}$, eg. $F \extrarrow{CFA} \valval{\reg{reg} + \text{off}}$. If the rule currently applied in such a case is \emph{not} of the form $\reg{reg} + \text{off}$, then the program is considered erroneous. One can see this $\extrarrow{reg}$ somehow as a \lstc{match} statement in OCaml, but with only one case, allowing to retrieve packed data. More generally, we define ${\extrarrow{reg}}^{-k}$ as the same operation, but extracting in the $k$-older row, ie. ${\extrarrow{reg}}^{0}$ is the same as $\extrarrow{reg}$, and $F {\extrarrow{reg}}^{-1} \bullet$ is the same as $F\left[0 \ldots |F|-2\right] \extrarrow{reg} \bullet$. \begin{align*} \semI{\varepsilon}{s}(F) &:= F \\ \semI{\dwcfa{set\_loc(loc)} \cdot d}{s}(F) &:= \contsem{F \cdot \left(loc, F[-1].row \right)} \\ \semI{\dwcfa{adv\_loc(delta)} \cdot d}{s}(F) &:= \contsem{F \cdot \left(F[-1].addr + delta, F[-1].row \right)} \\ \semI{\dwcfa{def\_cfa(reg, offset)} \cdot d}{s}(F) &:= \contsem{F \insarrow{CFA} \valval{\reg{reg} + offset}} \\ \semI{\dwcfa{def\_cfa\_register(reg)} \cdot d}{s}(F) &:= \text{let F }\extrarrow{CFA} \valval{\reg{oldreg} + \text{oldoffset}} \text{ in} \\ &\quad \contsem{F \insarrow{CFA} \valval{\reg{reg} + oldoffset}} \\ \semI{\dwcfa{def\_cfa\_offset(offset)} \cdot d}{s}(F) &:= \text{let F }\extrarrow{CFA} \valval{\reg{oldreg} + \text{oldoffset}} \text{ in} \\ &\quad \contsem{F \insarrow{CFA} \valval{\reg{oldreg} + offset}} \\ \semI{\dwcfa{def\_cfa\_expression(expr)} \cdot d}{s}(F) &:= \text{TO BE DEFINED} &\textbf{CHECK ME?}\\ \semI{\dwcfa{undefined(reg)} \cdot d}{s}(F) &:= \contsem{F \insarrow{reg} \bot} \\ \semI{\dwcfa{same\_value(reg)} \cdot d}{s}(F) &:= \valval{\reg{reg}} \\ \semI{\dwcfa{offset(reg, offset)} \cdot d}{s}(F) &:= \contsem{F \insarrow{reg} \valaddr{\reg{CFA} + \textit{offset}}} \\ \semI{\dwcfa{val\_offset(reg, offset)} \cdot d}{s}(F) &:= \contsem{F \insarrow{reg} \valval{\reg{CFA} + \textit{offset}}} \\ \semI{\dwcfa{register(reg, model)} \cdot d}{s}(F) &:= \text{let } F {\extrarrow{model}}^{-1} r \text{ in } \contsem{F \insarrow{reg} r} \\ \semI{\dwcfa{expression(reg, expr)} \cdot d}{s}(F) &:= \text{TO BE DEFINED} &\textbf{CHECK ME?}\\ \semI{\dwcfa{val\_expression(reg, expr)} \cdot d}{s}(F) &:= \text{TO BE DEFINED} &\textbf{CHECK ME?}\\ % \semI{\dwcfa{restore(reg)} \cdot d}{s}(F) &:= \\ %% NOT IMPLEMENTED \semI{\dwcfa{remember\_state()} \cdot d}{s}(F) &:= \semI{d}{s \cdot F[-1].row}\left(F\right) \\ \semI{\dwcfa{restore\_state()} \cdot d}{s \cdot t}(F) &:= \semI{d}{s}\left(F\left[0 \ldots |F|-2\right] \cdot \left(F[-1].addr, t\right) \right) \\ \semI{\dwcfa{nop()} \cdot d}{s}(F) &:= \contsem{F}\\ \end{align*} (The stack is used for \texttt{remember\_state} and \texttt{restore\_state}. If we omit those two operations, we can plainly remove the stack). \section{From $\intermedlang$ to C} \textit{This only defines the semantics, with respect to standard C, of DWARF as interpreted by \ehelf\@. The actual DWARF to C compiler is not implemented this way.} \vspace{1em} We now define $\semC{\bullet} : \DWARF \to C$, in the context presented earlier. The translation from $\intermedlang$ to C is defined as follows: \begin{itemize} \item $\semC{\varepsilon} =$ \\ \begin{lstlisting}[language=C, mathescape=true] else { for(int reg=0; reg < NB_REGS; ++reg) new_ctx[reg] = $\semR{\bot}$; } \end{lstlisting} \item $\semC{(\text{loc}, \text{row}) \cdot t} = C\_code \cdot \semC{t}$, where $C\_code$ is \begin{lstlisting}[language=C, mathescape=true] if(ip >= $loc$) { for(int reg=0; reg < NB_REGS; ++reg) new_ctx[reg] = $\semR{row[reg]}$; goto end_ifs; // Avoid if/else if problems } \end{lstlisting} \end{itemize} and $\semR{\bullet}$ is defined as \begin{align*} \semR{\bot} &= \text{\lstc{ERROR_VALUE}} \\ \semR{\valaddr{\text{reg}, \textit{offset}}} &= \text{\lstc{*(old_ctx[reg] + offset)}} \\ \semR{\valval{\text{reg}, \textit{offset}}} &= \text{\lstc{(old_ctx[reg] + offset)}} \\ \end{align*} \printbibliography{} \end{document}