Add heuristics analysis

2019-07-15 13:56:02 +02:00 · 2019-07-15 13:56:02 +02:00 · b2cf0a77df
commit b2cf0a77df
parent c74ec873eb
1 changed files with 99 additions and 0 deletions
--- a/HEURISTICS.md
+++ b/HEURISTICS.md
@ -0,0 +1,99 @@
+# Heuristics used for synthesis
+
+This file lists the major heuristics used for synthesis.
+
+## Initial row
+
+Initial row is always assumed as
+    CFA     rbp   ra
+    rsp+8   u     c-8
+
+## With or without %rbp?
+
+When synthesizing a FDE, there is sometimes a choice between using %rbp or
+not. For instance, it is possible that the original program uses %rbp for
+something entirely different than keeping a base pointer, without it being
+obvious: the synthesis must then avoid using %rbp.
+
+When synthesizing a FDE, two passes are applied on the function: a first pass
+that tracks %rbp to generate a correct table, but is denied using %rbp as an
+indexing mean for CFA. If this first pass fails by losing track of its CFA at
+some point, we fall back to a second phase that does the same, but switches its
+CFA indexing to %rbp if possible.
+
+This method works in practice because
+ * if the first pass succeeded, then a correct CFA indexing was found,
+ * if not, the original compiler could not generate a correct CFA indexing
+   either and was forced to use %rbp as a base pointer (except corner cases,
+   eg. clang sometimes generate code without possible correct unwinding data in
+   pre-abort error handling paths)
+
+## Lossy merge
+
+When two or more code branches merge at some point, we require that the
+unwinding data propagated by all of the branches can be merged into
+consistent data.
+
+Most of the time, *consistent* means strictly equivalent, but it can be
+weakened by allowing rows with %rbp undefined on one side and defined on the
+other to be merged — thus assuming the merged data is %rbp undefined, allowing
+a information loss.
+
+We actually process the control flow graph of a subroutine by walking it
+depth-first. When first encountering a new block, the propagated row is saved
+as the initial data for this block. When we encounter it again from another
+predecessor, the propagated row is merged if possible, or aborts with
+inconsistency. This merge operation is thus algorithmically free if the data
+first stored in the block is %rbp undefined — it is possible to just erase the
+data on the newly merged unwinding data. The other way around, changing the
+data already present, with which subsequent computations have already been
+made, would require recomputing a lot of data. We thus *only allow it* if the
+block is a leaf block in the control flow graph of the subroutine.
+
+This restriction in the application conditions works well in practice because
+gcc does not generate such lossy merges, and clang generates those only for the
+exit block of a function — just before `retq`.
+
+## CFA state tracking
+
+### When CFA is an offset of %rsp
+
+If the CFA is an offset of %rsp, it must be kept up to date when %rsp changes.
+In the BAP IR, every such change will generate some instruction `%rsp <- EXPR`.
+
+ * If the expression is just `%rsp <- %rsp + offset`, the CFA is updated with
+   this offset (most cases).
+ * If not, the analysis loses track and aborts. This case did not occur during
+   our testing while the CFA was indexed by %rsp.
+
+### When CFA is offset of %rbp
+
+If the CFA is an offset of %rbp, nothing special is required to track the CFA.
+
+### Switching between the modes: %rsp to %rbp indexing
+
+If the CFA is currently an offset of %rsp, an indexing mode change is detected
+when %rip is saved to %rbp. If the synthesis is currently allowed to use %rbp
+indexing (see *With or without %rbp?*), the indexing mode is then switched. If
+not, the current CFA indexing is kept.
+
+### Switching between the modes: %rbp to %rsp indexing
+
+The only event that triggers a revert to %rsp-based indexing is when %rbp gets
+overwritten with something while %rbp indexing.
+
+It is non-trivial to decide which %rsp offset should be used when switching
+back. So far, we have only encountered switches back to %rsp at the very end of
+functions — when %rbp was popped from the stack. Thus, we thus assume that upon
+restore, CFA=%rsp+8. This only works in practice since in the observed cases,
+compilers tend to stick to %rbp indexing when they decide to use it in a
+function.
+
+## %rbp state tracking
+
+Tracking the state of %rbp (or any other callee-saved register) can be done by
+tracking the program points at which
+
+ * %rbp is undefined and an instruction saves %rbp to the stack,
+ * %rbp is defined and an instruction overwrites %rbp with the data initially
+   saved on the stack