\documentclass[11pt,a4paper]{article}

\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{graphicx}
\usepackage{indentfirst}
\usepackage{enumerate}
\usepackage{caption}
\usepackage{algorithmicx}
%\usepgfplotslibrary{external}
%\tikzexternalize

\usepackage[backend=biber,style=trad-alpha]{biblatex}
\usepackage[left=2cm,right=2cm,top=2cm,bottom=2cm]{geometry}

% Custom packages
\usepackage{todo}
\usepackage{leftrule_theorems}
\usepackage{my_listings}
\usepackage{my_hyperref}
\usepackage{../common/internship}
\usepackage{../common/math}

\usepackage{pgfplots}

\bibliography{../common/refs}

\title{Pattern-matching and substitution in electronic circuits}
\author{Théophile Bastian, under supervision of Carl-Johan Seger
and Mary Sheeran\\
\small{Chalmers University, Göteborg, Sweden}}
\date{February~--~June 2017}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\begin{document}
\maketitle

\begin{abstract}
\todo{enhance abstract?}

    The present report describes and summarizes my first-year Master's
    internship at Chalmers University, Göteborg, Sweden. I worked under the
    supervision of Carl-Johan Seger and Mary Sheeran.

    The project's goal was to contribute to VossII, a hardware proving tool
    cloning fl, a tool developed at Intel by Carl Seger for internal use
    only. It led me to work on \emph{pattern-matching} on electronic circuits
    for a \emph{search-and-replace} method, allowing one to apply previously
    proved transformations to a circuit.

    This problem is essentially the \emph{subgraph isomorphism problem},
    which is NP-complete, yet must be solved fast on processor-sized circuits
    in this particular case.

    During my internship, I developed a C++ library performing this task,
    which will be integrated into VossII. It is based on a few well-known
    algorithms, as well as some ad-hoc heuristics and algorithmic tweaks to
    better match the context of circuits. One of my objectives throughout was
    also to keep a clean and well-documented codebase, as the tool will have
    to be maintained by others later on.
\end{abstract}

\tableofcontents

\todo{Talk of the repo, somewhere}

\pagebreak

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}

In recent years, verification and proved software have gathered increasing
interest in the computer science community, as people realised how hard bugs
are to track down. Hardware bugs, however, are even more tedious to find and
fix, and can easily lead to disastrous consequences, as they cannot be patched
on existing hardware. For instance, the well-known Pentium
``\textsc{fdiv}'' bug~\cite{pratt1995fdiv}, which affected a large number of
Pentium processors, led to wrong results for some floating-point divisions.
Intel had to offer to replace all the defective CPUs, leading to an announced
loss of 475 million dollars~\cite{nicely_fdiv}. More recently, the Skylake and
Kaby Lake hyperthreading bug had to be patched using microcode, losing
performance and reliability.

To avoid such disasters, the industry nowadays uses a wide range of techniques
to catch bugs as early as possible --- which, hopefully, is before the
product's release date. Among those are \todo{list + cite}, but also proved
hardware. On circuits as complex as processors, usually only sub-components
are proved correct, in a specified context --- one that should, but is not
proved to, be respected by the other parts of the circuit. Yet, this trade-off
between proved correctness and engineering time already gives good confidence
in the circuit.

In this context, Carl Seger was one of the main developers of fl at Intel, a
functional ML-inspired programming language integrating many features useful
for gaining insight into a circuit, testing it and proving it. It mostly
features model checking based on symbolic trajectory evaluation, as well as
theorem proving, and is intended to be an all-purpose toolbox for the hardware
prover.

Among other features, it includes a ``search and replace'' mechanism, which
can find every occurrence of a given gate pattern in a circuit and replace it
with some other gate pattern, proved observationally equivalent beforehand.
Time has proved this method very efficient for designing circuits: one can
start from an inefficient yet simple circuit, prove it, and then refine it
into an equivalent, efficient one through proved transformations. It is also
possible to go the other way, and start with an optimized circuit that is hard
to understand, and make it easier to understand in order to work on it more
efficiently.

\paragraph{My internship} lies amid a project shared between Carl-Johan Seger
and Mary Sheeran, aiming to develop tools for the proved design of FPGA
circuits. One of the keystones of this project is an open-source, publicly
available version of fl, used for the proving part, which is still under
development at the moment.

My part of the work concerned this ``search and replace'' tool. More
specifically, I focused on writing a C++ library, \emph{isomatch}, which is
interfaced with the C core of fl and provides it with low-level, very fast
functions for this task.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Problem setting}

\subsection{Circuit description}

\begin{figure}[!h]
    \begin{align*}
        \textbf{Integer constant } n, m, p, q, \ldots \qquad& \\
        \\
        \textbf{Wire } in0, out0, ctl0, \ldots \qquad& \\
        \\
        \textbf{Vector } \evec{v}{n} & & \textit{($n$ elements of type $v$)} \\
        \\
        \textbf{Circuit } c, d, \ldots ::=~
            &\text{delay} (in0, out0) &\textit{(delay 1 clock tick)} \\
        \vert~&\text{tristate} (in0, out0, ctl0)
            &\textit{(three-state gate)} \\
        \vert~&\text{comb} (\evec{in0}{n}, \evec{out0}{m}, \evec{e}{m})
            &\textit{(combinatorial gate)} \\
        \vert~&\text{assert} (\evec{in0}{n}, \evec{e}{m})
            &\textit{(assertion gate)} \\
        \vert~&\text{group} (\evec{in0}{n}, \evec{out0}{m}, \evec{c}{p})
            &\textit{(circuit hierarchical group)} \\
        \\
        \textbf{Binary operator } \otimes ::=~
            &\wedge & \textit{(and)} \\
        \vert~&\vee & \textit{(or)} \\
        \vert~&\oplus & \textit{(xor)} \\
        \vert~&+ & \textit{(add)} \\
        \vert~&- & \textit{(sub)} \\
        \vert~&\times & \textit{(times)} \\
        \vert~&\div & \textit{(div)} \\
        \vert~&\% & \textit{(mod)} \\
        \vert~&\lsl & \textit{(logical shift left)} \\
        \vert~&\lsr & \textit{(logical shift right)} \\
        \vert~&\asr & \textit{(arithmetic shift right)} \\
        \\
        \textbf{Unary and constant operator } \otimes_0 ::=~
            &\clsl & \textit{(logical shift left of constant)} \\
        \vert~&\clsr & \textit{(logical shift right of constant)} \\
        \vert~&\casr & \textit{(arithmetic shift right of constant)} \\
        \\
        \textbf{Unary operator } \ominus ::=~
            &\lnot & \textit{(logical negation)} \\
        \\
        \textbf{Expression } e, f, \ldots ::=~
            & x & \textit{(variable)} \\
        \vert~& n & \textit{(integer constant)} \\
        \vert~& e \otimes f & \textit{(binary operator)} \\
        \vert~& e \otimes_0 n & \textit{(unary operator with constant)} \\
        \vert~& \ominus e & \textit{(unary operator)} \\
        \vert~& e_{\vert~n \ldots m} & \textit{(slicing: take a subword)}\\
        \vert~& e~\vert~f & \textit{(merging: concatenate two words)} \\
    \end{align*}
    \caption{AST of circuits used}\label{fig:ast}
\end{figure}

The circuits on which \emph{isomatch} operates are described, and internally
represented, by the AST in Figure~\ref{fig:ast}.

The most important point in this description is that circuits are organized as
a hierarchy of \emph{circuit groups}. This hierarchy can be seen as the
construction of a circuit by assembling smaller integrated circuits (ICs),
themselves built the same way, and so on. A group is composed of sub-circuits,
input pins and output pins. Each level can of course contain ``leaf'' gates,
like \textit{and} or \textit{delay} gates. This hierarchy matters, because it
allows the program to work on smaller areas of the circuit (\eg{} loading in
memory only a part of the circuit).

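As an illustration of this hierarchy, the internal representation can be
thought of as a small class hierarchy along the lines of the following C++
sketch (class and member names are illustrative only, not the actual
\emph{isomatch} API):

\begin{lstlisting}[language=C++]
#include <memory>
#include <string>
#include <vector>

struct WireId { std::string name; };            // a named wire

struct Circuit {                                 // any gate, leaf or group
    virtual ~Circuit() = default;
    std::vector<WireId> inputs, outputs;
};

struct Delay    : Circuit {};                    // leaf gates
struct Tristate : Circuit { WireId control; };
struct Comb     : Circuit { std::vector<std::string> expressions; };

struct Group : Circuit {                         // hierarchical circuit group
    std::vector<std::unique_ptr<Circuit>> children;
};
\end{lstlisting}
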
\subsection{Objective}

More precisely, the problems that \emph{isomatch} must solve are the
following.

\begin{enumerate}
    \item\label{prob:equal} Given two circuit groups, are they structurally
        equivalent? That is, are they the same circuit, arranged in a
        different way, with possibly different names, etc.?

    \item\label{prob:match} Given two circuits, \emph{needle} and
        \emph{haystack}, find every (non-overlapping) occurrence of
        \emph{needle} in \emph{haystack}. An occurrence is a set $S$ of
        sub-circuits of \emph{haystack} such that there is a one-to-one
        mapping of structurally equivalent circuits of $S$ with circuits of
        \emph{needle}, and those circuits are connected the same way in both
        circuits.
\end{enumerate}

Both problems are hard. The first one is an instance of graph isomorphism, as
the actual question is whether there exists a one-to-one mapping between
sub-circuits of the two groups, such that every mapped circuit is equal to the
other (either directly if it is a leaf gate, or recursively through the same
procedure), and whether this mapping respects the connections (edges) between
those circuits. Graph isomorphism is known to be in NP (given a permutation of
the first graph, it is polynomial to check whether the first graph is equal to
the second \wrt{} the permutation), but is not known to be in P or to be
NP-complete. Thus, since Babai's work on graph
isomorphism~\cite{babai2016graph} is only of theoretical interest, the known
algorithms remain worst-case exponential in time, and require ad-hoc
heuristics for specific kinds of graphs to reach maximum efficiency.

The second one is an instance of the subgraph isomorphism problem, which is
known to be NP-complete~\cite{cook1971complexity}. Even though a few
algorithms (discussed later) are known to be efficient in most cases for this
problem, it is nevertheless necessary to implement them the right way, and
with the right heuristics, to get the desired efficiency on the given problem.

\subsection{Sought efficiency}

The goal of \textit{isomatch} is to be applied to large circuits on-the-fly,
during their conception. Those circuits can (and probably will) be as large as
a full processor, and the software will be operated by a human working on
their circuit. Thus, \textit{isomatch} must be as fast as possible: matching
operations will be executed often, and often multiple times in a row, so the
tool must remain fast enough for the human operator not to lose too much
time --- and, eventually, patience.

\todo{Mention clean codebase somewhere}

\todo{Mention VossII somewhere}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{General approach}

The global strategy used to solve the problem efficiently can be broken down
into three main parts.

\paragraph{Signatures.} The initial idea to make the computation fast is to
aggregate the inner data of a gate --- be it a leaf gate or a group --- in a
kind of hash, a 64-bit unsigned integer. This approach is directly inspired
from what was done in fl, back at Intel. This hash must be easy to compute,
and must be based only on the structure of the graph --- that is, it must be
entirely oblivious to the labels given, the order in which the circuit is
described, the order in which different circuits are plugged on a wire,
\ldots{} Signature equality, moreover, must be sound; that is, two signatures
must necessarily be equal if the circuits are indeed equal.

This makes it possible to quickly rule out whether two circuits are candidates
for a match, and to run the costly actual equality algorithm on fewer gates.

\paragraph{Group equality.} The group equality algorithm is a standard
backtracking algorithm. It tries to build a match between the graphs by trying
the various permutations of elements with the same signature. It can also
communicate with the signing part, to request a more precise (but slightly
slower to compute) signature when it has too many permutations to try.

This part could be enhanced, but it does not slow down the algorithm on the
tested examples, so I focused on other parts.

\paragraph{Pattern matching.} This part is responsible for answering queries
for occurrences of a sub-circuit in a circuit. It makes extensive use of the
signatures to determine whether two circuits could possibly match before
spending too much time actually finding matches, but it cannot rely on them as
heavily as group equality does, since only the first level of precision is
applicable here (detailed later).

This part mostly consists of an implementation of Ullmann's algorithm for
subgraph isomorphism~\cite{ullmann1976algorithm}, a well-known algorithm for
this problem, which uses the specificities of the graph to be a little faster.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Signatures}

The signature is computed as a simple hash of the element, and is defined for
every type of expression and circuit. It could probably be enhanced with a bit
more work to cover the hash space more uniformly, but no collision was
observed on the examples tested.

\paragraph{Signature constants.} Signature constants are used throughout the
signing process. Each one is a 5-tuple $\sigconst{} = (a, x_l, x_h, d_l,
d_h)$ of 32-bit unsigned integers. All of $x_l$, $x_h$, $d_l$ and $d_h$ are
picked as prime numbers between $10^8$ and $10^9$ (which just fits in a
32-bit unsigned integer), while $a$ is a random integer picked uniformly
between $2^{16}$ and $2^{32}$. These constants are generated by a small
Python script, \path{util/primegen/pickPrimes.py}.

Those constants are used to produce a 64-bit unsigned value out of another
64-bit unsigned value, called $v$ thereafter, through an operator $\sigop$,
computed as follows (with all computations done on 64-bit unsigned integers).

\vspace{1em}
\begin{algorithmic}
    \Function{$\sigop$}{$\sigconst{}, v$}
        \State{} $out1 \gets (v + a) \cdot x_l$
        \State{} $v_h \gets (v \lsr 32) \xor (out1 \lsr 32)$
        \State{} $low \gets out1 \,\%\, d_l$
        \State{} $high \gets \left((v_h + a) \cdot x_h \right) \%\, d_h$
        \State{} \Return{} $low + 2^{32} \cdot high$
    \EndFunction{}
\end{algorithmic}

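As an illustration, a direct C++ transcription of $\sigop$ could look like the
following sketch (the \texttt{SigConstant} type and its field names are mine,
not necessarily those of \emph{isomatch}):

\begin{lstlisting}[language=C++]
#include <cstdint>

struct SigConstant {
    uint32_t a, xl, xh, dl, dh;   // a: random; xl, xh, dl, dh: large primes
};

// Mix a 64-bit value v with a signature constant, as in the pseudocode above:
// the low half is (v + a) * xl mod dl, the high half mixes the upper 32 bits
// of v and of that first product.
uint64_t sigop(const SigConstant& c, uint64_t v) {
    uint64_t out1 = (v + c.a) * c.xl;
    uint64_t vh   = (v >> 32) ^ (out1 >> 32);
    uint64_t low  = out1 % c.dl;
    uint64_t high = ((vh + c.a) * c.xh) % c.dh;
    return low + (high << 32);
}
\end{lstlisting}
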
\paragraph{Expressions.} Each type of expression (or, in the case of
expressions with an operator, each type of operator) has its own signature
constant, $\sigconst{\text{exprType}}$. The signature of a commutative
expression must be commutative in its operands, while the signature of a
non-commutative expression should not be (and is not, except for collisions).
The value $v$ used to sign the expression (in
$\sigop(\sigconst{\text{exprType}}, v)$) is thus the sum (respectively the
difference) of the signatures of its parameters for commutative (respectively
non-commutative) expressions.

\paragraph{Circuits' inner signature.} Every circuit is associated with a
value describing its \emph{type} (rather than its contents): 8 bits of circuit
type ID (delay, tristate, \ldots), the number of inputs on the next 8 bits,
and the number of outputs on 8 more bits. This value is then xored with the
inner value of the circuit: for a combinatorial gate, the xor of its
expressions' signatures; for a group, the sum of its children's
signatures\footnote{As a group is likely to contain multiple occurrences of a
single identical circuit, it would be unwise to xor its children's signatures,
even though the usual advice is to combine hashes by xoring them.}; \ldots{}
This value constitutes the circuit's \emph{inner signature}.

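A minimal sketch of how this inner signature could be assembled for a group
gate, assuming an illustrative bit layout (type ID in the low byte, then the
input and output counts) and hypothetical names:

\begin{lstlisting}[language=C++]
#include <cstdint>
#include <vector>

enum class CircType : uint8_t { Delay = 1, Tristate, Comb, Assert, Group };

// Pack the "type" value: 8 bits of type ID, 8 bits of input count,
// 8 bits of output count (the exact layout is an assumption).
uint64_t typeValue(CircType type, uint8_t nbIn, uint8_t nbOut) {
    return uint64_t(type) | (uint64_t(nbIn) << 8) | (uint64_t(nbOut) << 16);
}

uint64_t groupInnerSignature(uint8_t nbIn, uint8_t nbOut,
                             const std::vector<uint64_t>& childSigs) {
    uint64_t inner = 0;
    for (uint64_t s : childSigs)
        inner += s;            // sum, not xor: duplicated children must count
    return typeValue(CircType::Group, nbIn, nbOut) ^ inner;
}
\end{lstlisting}
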
\paragraph{Circuits' signature of order $n$.} The inner signature does not
capture the \emph{structure} of the graph at all. One piece of information we
can capture without breaking the signature's independence from the order of
description of the graph is the set of a gate's neighbours. Yet, we cannot
``label'' the gates without breaking this rule; thus, we represent the set of
neighbours by the set of our \emph{neighbours' signatures}.

At this point, we can define the \emph{signature of order $n$} ($n \in
\natset$) of a circuit $C$ as follows:

\begin{align*}
    \sig_0(C) :&= \text{inner signature of } C\\
    \sig_{n+1}(C) :&= \text{inner signature of }C
        + \text{IO adjacency}
        + \hspace{-2em}\sum\limits_{C_i \in \,\text{neighbours of inputs}}
            \hspace{-2em}\sig_n(C_i) \hspace{1em}
        - \hspace{-2em}\sum\limits_{C_o \in \,\text{neighbours of outputs}}
            \hspace{-2em}\sig_n(C_o)
\end{align*}

The ``IO adjacency'' term is an additional term in the signatures of order
above $0$, indicating which input and output pins of the circuit group
containing the current gate are adjacent to it.

The default signature order used in all computations, unless a higher order
proves useful, is 2, chosen after a few benchmarks.

\todo{explain range of $n$}

\paragraph{Efficiency.} Every circuit memoizes everything it can concerning
its signature: the inner signature, the IO adjacency, the signatures of order
$n$ already computed, etc.

This memoization, together with the exclusive use of elementary operations,
makes the computation of a signature very fast. The computation is linear in
the number of gates in the circuit, times the order computed, and it is lazy.

To keep those memoized values up to date whenever the structure of the circuit
is changed (since this is meant to be integrated into a programming language,
fl, the structure of the circuit will possibly be created, checked for a
signature, altered, then checked again), each circuit keeps track of a
``timestamp'' of last modification, which is incremented whenever the circuit
or its children are modified. A memoized value is always stored alongside the
timestamp at which it was computed, which invalidates a previous result when
needed.

One possible path of investigation for future work, if the computation turns
out to still be too slow in real-world cases --- which looks unlikely ---
would be to multithread this computation.

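A minimal sketch of this timestamp-based invalidation (all names are
illustrative, not \emph{isomatch}'s actual interface):

\begin{lstlisting}[language=C++]
#include <cstdint>
#include <unordered_map>

class CircuitTree {
public:
    // Called whenever this circuit (or one of its children) is modified;
    // the real implementation also propagates the update to the ancestors.
    void touch() { ++lastModified_; }

    uint64_t signatureOfOrder(unsigned order) {
        auto it = memo_.find(order);
        if (it != memo_.end() && it->second.computedAt == lastModified_)
            return it->second.value;              // memoized and still valid
        uint64_t sig = computeSignature(order);   // otherwise compute lazily
        memo_[order] = {sig, lastModified_};
        return sig;
    }

private:
    struct Memoized { uint64_t value; uint64_t computedAt; };
    uint64_t computeSignature(unsigned order);    // sigop-based, elided here
    uint64_t lastModified_ = 0;
    std::unordered_map<unsigned, Memoized> memo_;
};
\end{lstlisting}
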
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Group equality}\label{sec:group_equality}

Given two circuit group gates, the task of group equality is to determine
whether the two groups are structurally equivalent, as discussed above.

Group equality itself is handled as a simple backtracking algorithm, trying to
establish a match (an isomorphism, that is, a permutation of the gates of one
of the groups) between the two groups given.

The gates of the two groups are matched by equal signatures and equal numbers
of inputs and outputs, based on the signature of default order (that is, 2). A
few checks are made, \eg{} every matching group must have the same size on
both sides (if not, then necessarily the two groups won't match). Then, the
worst-case number of permutations to check is evaluated.

If this number is too high, the signature order is incremented and the
matching groups re-created accordingly, until a satisfyingly low number of
permutations is reached (or the diameter of the circuit is reached, meaning
that increasing the signature order won't have any additional impact). This
``on-demand'' order increase proved very efficient, effectively lowering the
number of permutations examined to no more than $4$ in the studied cases.

Once a permutation is judged worth examining, group equality is run
recursively on all its matched gates. If this step succeeds, the graph
structure is then checked. If both steps succeed, the permutation is correct
and an isomorphism has been found; if not, we move on to the next permutation.

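For illustration, the permutation enumeration for a single signature bucket
could be sketched as follows (\texttt{recursiveEquals} and \texttt{edgesMatch}
stand for the recursive gate equality and the graph-structure check, and are
assumed helpers rather than actual \emph{isomatch} functions):

\begin{lstlisting}[language=C++]
#include <algorithm>
#include <cstddef>
#include <vector>

struct Gate;                                      // opaque circuit gate
bool recursiveEquals(const Gate*, const Gate*);   // assumed helper
bool edgesMatch(const std::vector<const Gate*>&,  // assumed helper: are the
                const std::vector<const Gate*>&); // gates wired the same way?

// Match one signature bucket of the left group against the corresponding
// bucket of the right group, one permutation at a time.
bool matchBucket(std::vector<const Gate*> left,
                 std::vector<const Gate*> right) {
    if (left.size() != right.size()) return false;     // quick size check
    std::sort(right.begin(), right.end());
    do {
        bool ok = true;
        for (std::size_t i = 0; ok && i < left.size(); ++i)
            ok = recursiveEquals(left[i], right[i]);
        if (ok && edgesMatch(left, right)) return true;
    } while (std::next_permutation(right.begin(), right.end()));
    return false;
}
\end{lstlisting}
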
\todo{Anything more to tell here?}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Pattern-match}

We finally need to be able to find every occurrence of a given \emph{needle}
circuit in a bigger \emph{haystack} circuit --- at any level of its group
hierarchy. This problem is basically subgraph isomorphism with some
specificities, and for this purpose I used a classical algorithm for the
subgraph isomorphism problem, \emph{Ullmann's}.

\subsection{Ullmann's algorithm}

One of the classical algorithms dealing with the subgraph isomorphism problem
was first described by Julian R.\ Ullmann in 1976~\cite{ullmann1976algorithm}.
Another, more recent algorithm for this problem is Luigi P.\ Cordella's VF2
algorithm~\cite{cordella2004sub}, published in 2004. This algorithm is mostly
Ullmann's algorithm, transcribed in a recursive fashion, with the addition of
five heuristics. \qtodo{Why not use it then?}

Ullmann's is a widely used and fast algorithm for this problem. It makes
extensive use of the adjacency-matrix description of the graphs, and the
initial article takes advantage of the representation of those matrices as
bitsets to make extensive use of bitwise operations.

The permutation matrix to be built is a $\card{needle} \times \card{haystack}$
matrix. Each $1$ in a cell $(i, j)$ indicates that the $i$-th needle part is a
possible match for the $j$-th haystack part. This matrix is called $perm$
thereafter.

The algorithm, leaving aside the \textsc{refine} function (detailed just
after), is described in Figure~\ref{alg:ullmann}.

\begin{figure}[h]
    \begin{algorithmic}
        \Function{find\_at\_depth}{depth, perm, freeVert}
            \If{no 1s on \lstc{perm[depth]}}
                \State{} \Return{}
            \EndIf{}

            \State{} Save perm

            \For{$0 \leq$ chosen $< \card{\text{haystack}}$ such that
                    \lstc{perm[depth][chosen]} $ = 1$ and \lstc{freeVert[chosen]}}
                \State{} Put $0$s everywhere on \lstc{perm[depth]}, but on
                    \lstc{chosen}

                \State{} Refine perm
                \If{a row of perm has only $0$s}
                    \State{} \Return{}
                \EndIf{}

                \If{depth $=$ $\card{\text{needle}} - 1$}
                    \State{} Store perm as a result
                \Else{}
                    \State{} \Call{find\_at\_depth}{depth$+1$, perm, freeVert with
                        freeVert[chosen] $= 0$}
                \EndIf{}

                \State{} Restore perm
            \EndFor{}
        \EndFunction{}

        \vspace{1em}

        \Function{find}{perm}
            \State{} \Return{} \Call{find\_at\_depth}{0, perm, [$1, \ldots, 1$]}
        \EndFunction{}
    \end{algorithmic}
    \caption{Ullmann's algorithm (without refining)}\label{alg:ullmann}
\end{figure}

The refining process is the actual keystone of the algorithm. It is the
mechanism that allows the algorithm to cut down many exploration branches, by
removing ones from the matrix.

The idea is that a match between a needle vertex $i$ and a haystack vertex $j$
is only possible if, for each neighbour $k$ of $i$, $j$ has a neighbour $k'$
such that the permutation matrix has a one in position $(k, k')$. In other
words, a match between $i$ and $j$ is only possible if every neighbour $k$ of
$i$ (in the needle) has a possibly matching (\wrt{} $perm$) vertex $k'$ (in
the haystack) which is a neighbour of $j$.

This condition is checked on every $1$ in the permutation matrix. If it is not
met, the cell is nulled. This, though, potentially creates new ones that do
not match the condition: the process must be run again, until no new zeroes
appear.

In the initial article~\cite{ullmann1976algorithm}, Ullmann advocates bitwise
tricks to speed up this expensive step: indeed, checking the existence of such
a $k'$ amounts to checking that the bitwise \textsc{and} of the adjacency row
of $j$ and the permutation matrix row of $k$ is non-zero.

The refining function is detailed in Figure~\ref{alg:ullmann_refine}.

\todo{Insert explaining figure}

\begin{figure}[h]
    \begin{algorithmic}
        \Function{refine}{perm}
            \While{changes during last run}
                \For{each needle vertex $i$}
                    \For{each haystack vertex $j$}
                        \For{each neighbour $k$ of $i$ in needle}
                            \If{\lstc{perm[k] & haystack\_adjacency[j]} $= 0$}
                                \State{} \lstc{perm[i][j]} $\gets 0$
                            \EndIf{}
                        \EndFor{}
                    \EndFor{}
                \EndFor{}
            \EndWhile{}
        \EndFunction{}
    \end{algorithmic}
    \caption{Ullmann's refining function}\label{alg:ullmann_refine}
\end{figure}

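A C++ sketch of this refinement loop, with matrix rows stored as packed 64-bit
words so that the ``exists a compatible neighbour'' test is a plain bitwise
\textsc{and} (layout and names are illustrative, not \emph{isomatch}'s actual
code):

\begin{lstlisting}[language=C++]
#include <cstddef>
#include <cstdint>
#include <vector>

using Row    = std::vector<uint64_t>;   // one bit per haystack vertex
using Matrix = std::vector<Row>;        // indexed by needle vertex

static bool intersects(const Row& a, const Row& b) {
    for (std::size_t w = 0; w < a.size(); ++w)
        if (a[w] & b[w]) return true;   // some common bit is set
    return false;
}

// perm[i]        : candidate haystack vertices for needle vertex i
// needleAdj[i]   : neighbours of needle vertex i (as indices)
// haystackAdj[j] : bitset row of the neighbours of haystack vertex j
void refine(Matrix& perm,
            const std::vector<std::vector<std::size_t>>& needleAdj,
            const Matrix& haystackAdj) {
    bool changed = true;
    while (changed) {
        changed = false;
        for (std::size_t i = 0; i < perm.size(); ++i)
            for (std::size_t j = 0; j < haystackAdj.size(); ++j) {
                if (!((perm[i][j / 64] >> (j % 64)) & 1)) continue; // already 0
                for (std::size_t k : needleAdj[i])
                    if (!intersects(perm[k], haystackAdj[j])) {
                        perm[i][j / 64] &= ~(uint64_t(1) << (j % 64));
                        changed = true;
                        break;
                    }
            }
    }
}
\end{lstlisting}
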
\subsection{Ullmann for \emph{isomatch}}

\paragraph{Graph used.} Our circuit is not actually a graph just as-is:
indeed, a wire can be connected to multiple circuits (multiple gates' inputs,
or even multiple gates' outputs when using tristate circuits). Such a wire
could be transposed into a graph with $\frac{n(n-1)}{2}$ edges (the complete
subgraph on its $n$ endpoints). Internally, though, a wire is better
represented as a vertex itself, with $n$ edges linking it to the connected
gates. This representation is also used in the Ullmann implementation, leading
to a permutation matrix of size
$\left(\card{\text{needle gates}} + \card{\text{needle wires}}\right) \times
\left(\card{\text{haystack gates}} + \card{\text{haystack wires}}\right)$.

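A small sketch of this wire-as-vertex construction (names and layout are
illustrative):

\begin{lstlisting}[language=C++]
#include <cstddef>
#include <utility>
#include <vector>

struct CircuitGraph {
    std::size_t nbGates, nbWires;
    // adjacency[v] lists the vertices adjacent to v; vertices 0..nbGates-1
    // are gates, vertices nbGates..nbGates+nbWires-1 are wires.
    std::vector<std::vector<std::size_t>> adjacency;
};

// `pins` holds one (gate, wire) pair per connection of a gate pin to a wire.
CircuitGraph buildGraph(std::size_t nbGates, std::size_t nbWires,
        const std::vector<std::pair<std::size_t, std::size_t>>& pins) {
    CircuitGraph g{nbGates, nbWires,
                   std::vector<std::vector<std::size_t>>(nbGates + nbWires)};
    for (const auto& [gate, wire] : pins) {
        g.adjacency[gate].push_back(nbGates + wire);
        g.adjacency[nbGates + wire].push_back(gate);
    }
    return g;
}
\end{lstlisting}
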
\paragraph{Final result.} Once a result (\ie{} a correct permutation) is
obtained, we further need to check that it is actually a solution to our
problem. Indeed, while the structure is guaranteed by the algorithm to be the
same, we still need to check that every circuit is equal to its matched one,
through the procedure described in Section~\ref{sec:group_equality}. So far,
only the equality of signatures has been checked. We only need to check the
circuits, as the wires necessarily match.

\paragraph{Non-overlapping results.} We want our results to be non-overlapping
(otherwise, we would not be able to perform a search-and-replace). Whenever
two potential results conflict, an arbitrary one of the two can be returned (a
human user is operating the software and can make a narrower search if
needed).

To meet this specification, we must keep track of the circuits that are
already included in a match. We also cannot include, in a new match, an
ancestor of a circuit that was already included in another match (this should
not happen anyway, because the needle cannot be included in itself, but
signature collisions could make it occur).

\subsection{Implementation optimisations}

\paragraph{Initial permutation matrix.} The matrix is first filled according
to the signature matches. It is then refined a bit further, by making sure
that every potentially matching gate has the same ``wire kinds''. Indeed, a
needle gate's wire must have at least the same inbound adjacent signatures as
its matching haystack wire, and the same goes for outbound adjacent
signatures. Thus, two circuits cannot be matched if this condition is not
respected for each pair of corresponding wires of those circuits, and their
corresponding cell in the permutation matrix can be nulled.

\paragraph{Pre-check.} The needle will, in most cases, not be found at all in
a given hierarchy group of the haystack. To avoid wasting computation time, we
first check that every signature present in the needle is present at least as
many times in the haystack. This simple check saved a lot of time.

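This pre-check amounts to comparing signature multisets; a minimal sketch:

\begin{lstlisting}[language=C++]
#include <cstdint>
#include <unordered_map>
#include <vector>

// Every signature occurring in the needle must occur at least as many times
// in the haystack group, otherwise no occurrence can possibly exist there.
bool precheck(const std::vector<uint64_t>& needleSigs,
              const std::vector<uint64_t>& haystackSigs) {
    std::unordered_map<uint64_t, long> counts;
    for (uint64_t s : haystackSigs) ++counts[s];
    for (uint64_t s : needleSigs)
        if (--counts[s] < 0) return false;  // needle needs more than available
    return true;
}
\end{lstlisting}
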
\todo{More stuff?}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Performance}

In this section, all measures were made on a computer with an Intel
i7-3770 CPU (3.40GHz) and 8\,GB of RAM\@. % chktex 8

\subsection{Small processor}

The example I used widely during my project, to test my program and check its
efficiency periodically, was a small processor designed a year earlier as a
school project~\cite{sysdig_cpu}. The processor implements a large subset of
ARM and was conceived as a few hierarchized modules (ALU, registers, opcode
decoder, \ldots), but it was flattened into a plain netlist when generated
from its Python code, so I first had to patch its generator code to make its
hierarchy apparent.

The processor, in the end, has around 2000 leaf gates (but works at word
level) and 240 hierarchy groups. \qtodo{Tell us more!}

\paragraph{Signature.} First, we measure the time required to sign the whole
circuit with different levels of signature (\ie{} the order of signature
computed for every part of the circuit). In practice, we never compute
high-order signatures for a whole circuit, as signatures of subgroups are
always computed at the default order $2$, unless a particular group needs a
more accurate signature.

The measures were made over 100 consecutive runs of the program (then averaged
for a single run), timed with the command \texttt{time}.

\begin{center}
    \begin{tikzpicture}
        \begin{axis}[
                title={Signature time of the processor for different levels
                    of signature},
                xlabel={Level of signature},
                ylabel={Time (ms)},
                xmin=0, xmax=16,
                ymin=0, ymax=300,
                legend pos=north west,
                ymajorgrids=true,
                grid style=dashed,
            ]

            \addplot[
                color=blue,
                mark=square,
                ]
                coordinates {
                    (2,105.4)
                    (3,122.6)
                    (4,140.1)
                    (5,155.4)
                    (6,171.2)
                    (7,183.9)
                    (8,198.3)
                    (9,211.2)
                    (10,224.3)
                    (11,236.7)
                    (12,248.5)
                    (13,259.3)
                    (14,271.7)
                    (15,281.4)
                };
            \legend{-O3}
        \end{axis}
    \end{tikzpicture}
\end{center}

\paragraph{Equality.} To test circuit group equality, a small piece of code
takes a circuit and scrambles it as much as possible --- without altering its
structure --- \eg{} by randomly renaming its parts, randomly changing the
order of the circuits and groups, \ldots{} The circuit is then matched against
its unaltered counterpart.

For the processor described above, it takes about \textbf{477\,ms} to prove it
equal to its scrambled version, and then the other way around. Yet, the
memoized results (essentially the signatures) are kept for the second run,
considerably speeding it up: the same program proving the equality only one
way takes about \textbf{475\,ms}. \todo{explain why}

\subsection{Corner cases}

A few configurations were observed on which the algorithm tends to be slower.

\todo{More corner cases}

\paragraph{Split/merge trees.} A common pattern that tends to slow down the
algorithm is split/merge trees. Those patterns occur when one wants to merge
$n$ one-bit wires into a single $n$-bit wire, or the other way around.

These patterns are pretty common, for instance when an opcode is run through a
MUX tree to perform the requested operation.

This pattern, though, generates a lot of collisions in signatures. Indeed, for
a tree of depth \eg{} 8, a node just below the root needs a signature of order
7 to get a signature different from another node at the same depth. With a
signature of order up to 6, only other gates from the tree will be included in
the signature when going down the tree, and the exact same gates will be
included above the tree's root. Thus, nothing differentiates one gate from
another until the boundary of the tree is reached (assuming the gates below
the tree's leaves are not all the same; if they are, more levels will be
needed).

As the notions of ``left child'' and ``right child'' cannot be used (since
they would rely on the order of description of the graph), there seems to be
no good way to discriminate those two nodes. Furthermore, the nodes are not
totally interchangeable: when checking for equality between two such trees, it
does not matter which node is the left one; but once this is fixed, the nodes
on the layer below cannot be freely exchanged.

\todo{Figure describing the problem}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\printbibliography{}

\end{document}