Add heuristics analysis

2019-07-15 13:56:02 +02:00 · 2019-07-15 13:56:02 +02:00 · b2cf0a77df
commit b2cf0a77df
parent c74ec873eb
1 changed files with 99 additions and 0 deletions
--- a/HEURISTICS.md
+++ b/HEURISTICS.md
@@ -0,0 +1,99 @@
 # Heuristics used for synthesis
 This file lists the major heuristics used for synthesis.
 ## Initial row
 The initial row is always assumed to be
    CFA     rbp   ra
    rsp+8   u     c-8
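As a minimal sketch, the initial row could be represented as follows. The `Row` type, its field names and the `render` helper are hypothetical and only illustrate the table above; they are not taken from the synthesis tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Row:
    # Hypothetical representation of one unwinding table row.
    cfa_reg: str                # register the CFA is computed from
    cfa_offset: int             # CFA = cfa_reg + cfa_offset
    rbp_offset: Optional[int]   # None means undefined ("u"), else saved at CFA + offset
    ra_offset: int              # return address is saved at CFA + offset


def initial_row() -> Row:
    # CFA = rsp+8, rbp undefined, ra at c-8, as in the table above.
    return Row(cfa_reg="rsp", cfa_offset=8, rbp_offset=None, ra_offset=-8)


def render(row: Row) -> tuple:
    # Format a row using the same notation as the table (CFA, rbp, ra).
    rbp = "u" if row.rbp_offset is None else f"c{row.rbp_offset:+d}"
    return (f"{row.cfa_reg}{row.cfa_offset:+d}", rbp, f"c{row.ra_offset:+d}")
```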
 ## With or without %rbp?
 When synthesizing an FDE, there is sometimes a choice between using %rbp or
 not. For instance, the original program may use %rbp for something entirely
 different from holding a base pointer, without this being obvious: the
 synthesis must then avoid using %rbp.
 When synthesizing an FDE, two passes are applied to the function. The first
 pass tracks %rbp to generate a correct table, but is not allowed to use %rbp
 as the base register for the CFA. If this first pass fails by losing track of
 its CFA at some point, we fall back to a second pass that does the same, but
 switches its CFA indexing to %rbp when possible.
 This method works in practice because
 * if the first pass succeeded, then a correct CFA indexing was found,
 * if not, the original compiler could not generate a correct CFA indexing
   either and was forced to use %rbp as a base pointer (except in corner cases,
   e.g. clang sometimes generates code for which no correct unwinding data
   exists in pre-abort error handling paths).
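The two-pass fallback described in this section can be sketched as below. This is only an illustration under assumptions: `run_pass` stands for a hypothetical per-function analysis that takes a flag saying whether %rbp may be used as CFA base, and `LostTrack` is a hypothetical exception raised when the CFA can no longer be followed.

```python
from typing import Callable, List


class LostTrack(Exception):
    """Hypothetical error: the pass lost track of the CFA."""


def synthesize_fde(run_pass: Callable[[bool], List[str]]) -> List[str]:
    # First pass: %rbp is tracked, but never used as the CFA base.
    try:
        return run_pass(False)
    except LostTrack:
        # Second pass: same analysis, but allowed to switch the CFA to %rbp.
        return run_pass(True)
```

If the second pass loses track as well, the exception simply propagates to the caller in this sketch.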
 ## Lossy merge
 When two or more code branches merge at some point, we require that the
 unwinding data propagated by all of the branches can be merged into
 consistent data.
 Most of the time, *consistent* means strictly equivalent, but this can be
 weakened by allowing rows with %rbp undefined on one side and defined on the
 other to be merged. The merged data is then assumed to have %rbp undefined,
 which accepts an information loss.
 We actually process the control flow graph of a subroutine by walking it
 depth-first. When first encountering a new block, the propagated row is saved
 as the initial data for this block. When we encounter it again from another
 predecessor, the propagated row is merged with the stored one if possible;
 otherwise the analysis aborts with an inconsistency error. This merge
 operation is thus algorithmically free if the data first stored in the block
 has %rbp undefined: it suffices to discard the %rbp information from the
 newly propagated row. The other way around, changing the data already
 present, with which subsequent computations have already been made, would
 require recomputing a lot of data. We thus *only allow it* if the block is a
 leaf block in the control flow graph of the subroutine.
 This restriction works well in practice because gcc does not generate such
 lossy merges, and clang generates them only for the exit block of a function,
 just before `retq`.
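The merge rule of this section can be sketched as follows. The `Row` type, the `Inconsistent` exception and the `is_leaf` flag are hypothetical names chosen for the example; a row is reduced here to its CFA expression and the %rbp state (`None` standing for undefined).

```python
from typing import NamedTuple, Optional


class Row(NamedTuple):
    cfa: str              # e.g. "rsp+16"
    rbp: Optional[int]    # None = undefined, else offset from the CFA


class Inconsistent(Exception):
    """Hypothetical error: the two rows cannot be merged."""


def merge(stored: Row, incoming: Row, is_leaf: bool) -> Row:
    if stored.cfa != incoming.cfa:
        raise Inconsistent("CFA differs between branches")
    if stored.rbp == incoming.rbp:
        return stored                      # strictly equivalent rows
    if stored.rbp is None:
        return stored                      # free: drop the incoming %rbp info
    if incoming.rbp is None and is_leaf:
        return stored._replace(rbp=None)   # lossy, only allowed on leaf blocks
    raise Inconsistent("%rbp state differs between branches")
```

The third case is the free one: nothing already computed from `stored` has to change. The fourth case rewrites stored data, which is why the sketch restricts it to leaf blocks.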
 ## CFA state tracking
 ### When CFA is an offset of %rsp
 If the CFA is an offset of %rsp, it must be kept up to date when %rsp changes.
 In the BAP IR, every such change will generate some instruction `%rsp <- EXPR`.
 * If the expression is just `%rsp <- %rsp + offset`, the CFA is updated with
   this offset (most cases).
 * If not, the analysis loses track and aborts. This case did not occur during
   our testing while the CFA was indexed by %rsp.
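A minimal sketch of this update, assuming a hypothetical encoding of the right-hand side of `%rsp <- EXPR` as a tuple (this is not the BAP IR representation). Because the CFA itself does not move, adding `k` to %rsp means subtracting `k` from the stored offset.

```python
from typing import Optional


def update_cfa_offset(cfa_offset: int, expr: tuple) -> Optional[int]:
    """Return the new offset such that CFA = %rsp + offset, or None if track is lost.

    `expr` is a hypothetical encoding of EXPR: ("add", "rsp", k) stands for
    `%rsp + k`; anything else is treated as an untrackable expression.
    """
    is_rsp_plus_const = (
        len(expr) == 3
        and expr[0] == "add"
        and expr[1] == "rsp"
        and isinstance(expr[2], int)
    )
    if is_rsp_plus_const:
        return cfa_offset - expr[2]
    return None  # the analysis loses track and should abort
```

For example, a `push` lowers %rsp by 8, so starting from CFA=%rsp+8 the CFA becomes %rsp+16.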
 ### When CFA is an offset of %rbp
 If the CFA is an offset of %rbp, nothing special is required to track the CFA.
 ### Switching between the modes: %rsp to %rbp indexing
 If the CFA is currently an offset of %rsp, an indexing mode change is detected
 when %rsp is saved to %rbp. If the synthesis is currently allowed to use %rbp
 indexing (see *With or without %rbp?*), the indexing mode is then switched. If
 not, the current CFA indexing is kept.
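The switch can be sketched as below, assuming the triggering event is a copy of the stack pointer into %rbp (the usual `mov %rsp, %rbp` prologue). The `Cfa` type and the function name are hypothetical. Keeping the same offset is an assumption of the sketch, justified by both registers holding the same value right after the copy.

```python
from typing import NamedTuple


class Cfa(NamedTuple):
    reg: str      # "rsp" or "rbp"
    offset: int   # CFA = reg + offset


def on_rsp_copied_to_rbp(cfa: Cfa, allow_rbp: bool) -> Cfa:
    # Switch to %rbp indexing only if the current pass allows it.
    if cfa.reg == "rsp" and allow_rbp:
        return Cfa("rbp", cfa.offset)
    return cfa  # otherwise keep the current indexing
```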
 ### Switching between the modes: %rbp to %rsp indexing
 The only event that triggers a revert to %rsp-based indexing is when %rbp gets
 overwritten with something while %rbp indexing is in use.
 It is non-trivial to decide which %rsp offset should be used when switching
 back. So far, we have only encountered switches back to %rsp at the very end of
 functions, when %rbp was popped from the stack. We thus assume that upon
 restore, CFA=%rsp+8. This only works in practice because, in the observed
 cases, compilers tend to stick to %rbp indexing once they decide to use it in
 a function.
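A minimal sketch of this revert, with hypothetical names (`Cfa`, `on_rbp_overwritten`). It hard-codes the CFA=%rsp+8 assumption stated above and is therefore only valid when %rbp is restored at the very end of a function.

```python
from typing import NamedTuple


class Cfa(NamedTuple):
    reg: str      # "rsp" or "rbp"
    offset: int   # CFA = reg + offset


def on_rbp_overwritten(cfa: Cfa) -> Cfa:
    # Only relevant while the CFA is indexed by %rbp.
    if cfa.reg == "rbp":
        return Cfa("rsp", 8)  # heuristic: assume %rbp was just popped
    return cfa
```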
 ## %rbp state tracking
 Tracking the state of %rbp (or any other callee-saved register) can be done by
 tracking the program points at which
 * %rbp is undefined and an instruction saves %rbp to the stack,
 * %rbp is defined and an instruction overwrites %rbp with the data initially
   saved on the stack
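As an illustration only, the two kinds of program points listed above can be collected with a small state machine. The event encoding (`"save"`, `"restore"`, `"other"`) and the function name are assumptions made for this sketch; classifying instructions into these kinds is not shown.

```python
from typing import List, Tuple


def rbp_change_points(events: List[Tuple[int, str]]) -> List[Tuple[int, str]]:
    """Return the (address, new_state) pairs at which the %rbp state changes.

    `events` is a hypothetical pre-classified instruction stream: each item is
    (address, kind) with kind in {"save", "restore", "other"}.
    """
    defined = False  # %rbp starts undefined, as in the initial row
    changes = []
    for addr, kind in events:
        if kind == "save" and not defined:
            defined = True
            changes.append((addr, "defined"))
        elif kind == "restore" and defined:
            defined = False
            changes.append((addr, "undefined"))
    return changes
```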