publisher={IEEE}
}
@misc{sysdig_cpu,
author={Théophile Bastian and Noémie Cartier and Nathanaël Courant},
title={``Système digital'' project},
howpublished={\url{https://github.com/tobast/sysdig-full/}},
year=2016
}

\usepackage{enumerate}
\usepackage{caption}
\usepackage{algorithmicx}
%\usepgfplotslibrary{external}
%\tikzexternalize
\usepackage[backend=biber,style=trad-alpha]{biblatex}
\usepackage[left=2cm,right=2cm,top=2cm,bottom=2cm]{geometry}
\usepackage{../common/internship}
\usepackage{../common/math}
\usepackage{pgfplots}
\bibliography{../common/refs}
\title{Pattern-matching and substitution in electronic circuits}
transformations to a circuit.
This problem turns out to be more or less the \emph{subgraph isomorphism
problem}, which is NP-complete, and must nevertheless be solved efficiently
on processor-sized circuits in this particular case.
During my internship, I developed a C++ library to perform this task, which
will be integrated into VossII, based on a few well-known algorithms as well
\tableofcontents
\todo{Talk of the repo, somewhere}
\pagebreak
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
In the previous years, verification and proved software have gathered an
increasing interest in the computer science community, as people realised how
hard bugs are to track down, and how little confidence they had in their own
code. But hardware bugs are even more tedious to find and fix, and can easily
lead to disastrous consequences, as they cannot be patched on existing
hardware. For instance, the well-known Pentium ``\textsc{fdiv}''
bug~\cite{pratt1995fdiv}, which affected a large number of Pentium processors,
led to wrong results for some floating-point divisions. Intel had to offer to
replace all the defective CPUs, leading to an announced loss of 475 million
dollars~\cite{nicely_fdiv}. Even recently, the Skylake and Kaby Lake
hyperthreading bug had to be patched using microcode, losing performance and
reliability.
To avoid such disasters, the industry nowadays uses a wide range of techniques
to catch bugs as early as possible --- which, hopefully, is before the
product's release date. Among those are \todo{list + cite}, but also proved
hardware. On circuits as complex as processors, usually, only sub-components
are proved correct with respect to a given specification of their behaviour,
while the circuit is kept in a specified context, \ie{} a set of valid inputs
and outputs, etc. --- which should, but is not proved to, be respected by the
other parts of the circuit. Yet, this trade-off between proved correctness and
engineering time already gives good confidence in the circuit.
In this context, Carl Seger was one of the main developers of fl at Intel, a
functional ML-inspired programming language integrating many features useful to
and Mary Sheeran, aiming to develop tools for proved design of FPGA circuits.
One of the keystones of this project is an open-sourced and publicly available
version of fl, used for the proving part, which is still under heavy
development at the moment.
My part of the work concerned this ``search and replace'' tool. More
specifically, I focused on writing a C++ library, \emph{isomatch}, which is
functions for this task.
\vert~&- & \textit{(sub)} \\
\vert~&\times & \textit{(times)} \\
\vert~&\div & \textit{(div)} \\
\vert~&\% & \textit{(mod)} \\ % chktex 35
\vert~&\lsl & \textit{(logical shift left)} \\
\vert~&\lsr & \textit{(logical shift right)} \\
\vert~&\asr & \textit{(arithmetic shift right)} \\
can be seen as the construction of a circuit by assembling smaller integrated
circuits (ICs), themselves built the same way, etc. A group is composed of
sub-circuits, input pins and output pins. Each level can of course contain
``leaf'' gates, like \textit{and} or \textit{delay} gates. This is important,
because it allows the program to work on smaller areas of the circuit (\eg{}
loading in memory only a part of the circuit, etc.).
\subsection{Objective}
\emph{Isomatch} comes along with a small parser for a toy ad-hoc language,
designed to allow one to quickly write a test circuit and run the
\emph{isomatch} functions on it. There was no real need for a richer
language, or a standard circuit description language (such as VHDL), since the
user will mostly use \emph{isomatch} through fl, and feed it directly with data
read from fl --- which is able to handle \eg{} VHDL\@.
\subsection{Codebases}
Carl Seger's new version of fl is currently being developed as \textbf{VossII},
and is yet to be open-sourced.
My contribution to the project, \textbf{isomatch}, is free software, and is
available on GitHub:
\begin{center}
\raisebox{-0.4\height}{
\includegraphics[height=2em]{../common/github32.png}}
\hspace{1em}
\url{https://github.com/tobast/circuit-isomatch/}
\end{center}
\subsection{Problems}
More precisely, the problems that \emph{isomatch} must solve are the following.
way, with possibly different names, etc.?
\item\label{prob:match} Given two circuits, \emph{needle} and
    \emph{haystack}, find every (non-overlapping) occurrence of
    \emph{needle} in \emph{haystack}. An occurrence is a set $S$ of
    sub-circuits of \emph{haystack} such that there is a one-to-one mapping
    of structurally equivalent circuits of $S$ with circuits of
    \emph{needle}, and those circuits are connected the same way in both
    circuits.
\end{enumerate}
Both problems are hard. The first one is an instance of the graph isomorphism
problem, as the actual question is whether there exists a one-to-one mapping
between sub-circuits of the two groups, such that every mapped circuit is equal
to the other (either directly if it is a leaf gate, or recursively with the
same procedure); and whether this mapping respects connections (edges) between
those circuits. Graph isomorphism is known to be in NP (given a permutation of
the first graph, it is polynomial to check whether the first is equal to the
second \wrt{} the permutation), but not known to be in either P or NP-complete.
Thus, since Babai's work on graph isomorphism~\cite{babai2016graph} is only of
theoretical interest (at the moment), the known algorithms remain in worst-case
exponential time, and require ad-hoc heuristics for specific kinds of graphs to
reach maximum efficiency.
The second one is an instance of the subgraph isomorphism problem, which is
known to be NP-complete~\cite{cook1971complexity}. Even though a few algorithms
is nevertheless necessary to implement them the right way, and with the right
heuristics, to get the desired efficiency for the given problem.
\subsection{Code quality}
Another prominent objective was to keep the codebase as clean as possible.
Indeed, this code will probably have to be maintained for quite some time, and
most probably by other people than me. This means that the code and all its
surroundings must be really clean, readable and reusable. I tried to put a lot
of effort into making the code idiomatic and easy to use, through \eg{} the
implementation of iterators over my data structures when needed, idiomatic
C++14, etc.
This also means that the code has to be well-documented: the git history had to
be kept clean and understandable, and clean documentation can be generated
from the code using \texttt{doxygen}. The latest documentation is also
compiled as HTML pages here:
\begin{center}
\raisebox{-0.4\height}{
\includegraphics[height=2.3em]{../common/docs.png}}
\hspace{1em}
\url{https://tobast.fr/m1/isomatch}
\end{center}
Since the code is C++, it is also very prone to diverse bugs. While I did not
take the time to integrate unit tests --- which would have been a great
addition ---, I used a sequence of tests that can be run using \lstc{make
test}, and that exercises a lot of features of isomatch.
The code is also tested regularly, and on a wide variety of cases, with
\lstbash{valgrind} to ensure that there are no memory errors ---
use-after-free, unallocated memory, memory leaks, bad pointer
arithmetic,~\ldots In every tested case, strictly no memory is lost, and no
invalid read was reported.
\subsection{Sought efficiency}
The goal of \textit{isomatch} is to be applied to large circuits on-the-fly,
during their conception. Those circuits can (and will probably) be as large as
a full processor, and the software will be operated by a human, working on
their circuit. Thus, \textit{isomatch} must be as fast as possible, since
matching operations will be executed quite often, and often multiple times in a
row. It must then remain fast enough for the human not to lose too much time
--- and, eventually, patience.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{General approach}
The global strategy used to solve the problem efficiently can be broken down
into three main parts.
\paragraph{Signatures.} The initial idea to make the computation fast is to
aggregate the inner data of a gate~---~be it a leaf gate or a group~---~in a
kind of hash, a 64 bits unsigned integer. This approach is directly inspired
from what was done in fl, back at Intel. This hash must be easy to compute,
and must be based only on the structure of the graph --- that is, must be
this problem, that uses the specificities of the graph to be a little faster.
The signature is computed as a simple hash of the element, and is defined for
every type of expression and circuit. It could probably be enhanced with a bit
more work to cover the hash space more uniformly, but no illegitimate collision
(that is, a collision that could be avoided with a better hash function, as
opposed to collisions due to an identical local graph structure) was observed
on the examples tested.
\paragraph{Signature constants.} Signature constants are used throughout the
signing process; such a constant is a 5-tuple $\sigconst{} = (a, x_l, x_h, d_l, d_h)$ of 32
bits unsigned numbers. All of $x_l$, $x_h$, $d_l$ and $d_h$ are picked as prime
numbers between $10^8$ and $10^9$ (which just fits in a 32 bits unsigned
integer); while $a$ is a random integer uniformly picked between $2^{16}$ and
$2^{32}$. These constants are generated by a small python script,
\path{util/primegen/pickPrimes.py} in the repository.
Those constants are used to produce a 64 bits unsigned value out of another 64
bits unsigned value, called $v$ thereafter, through an operator $\sigop$,
computed as follows (with all computations done on 64 bits unsigned integers).
\vspace{1em}
\begin{center}
    \begin{algorithmic}
        \Function{$\sigop$}{$\sigconst{}, v$}
        \State{} $out1 \gets (v + a) \cdot x_l$
        \State{} $v_h \gets (v \lsr 32) \xor (out1 \lsr 32)$
        \State{} $low \gets out1 \,\%\, d_l$
        \State{} $high \gets \left((v_h + a) \cdot x_h \right) \%\, d_h$
        \State{} \Return{} $low + 2^{32} \cdot high$
        \EndFunction{}
    \end{algorithmic}
\end{center}
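The operator above transcribes almost line for line into C++, since all the
arithmetic is plain 64-bit unsigned arithmetic (wrapping on overflow). A
minimal sketch --- the constant values below are illustrative stand-ins, not
the ones actually generated by \path{util/primegen/pickPrimes.py}:

```cpp
#include <cassert>
#include <cstdint>

// A signature constant: a 5-tuple of 32-bit values, stored as uint64_t so
// that every computation below stays in 64-bit unsigned arithmetic.
struct SigConst {
    uint64_t a, xl, xh, dl, dh;
};

// The sigop operator: mixes a 64-bit value v into a new 64-bit value.
uint64_t sigop(const SigConst& c, uint64_t v) {
    uint64_t out1 = (v + c.a) * c.xl;      // wraps mod 2^64, as intended
    uint64_t vh   = (v >> 32) ^ (out1 >> 32);
    uint64_t low  = out1 % c.dl;
    uint64_t high = ((vh + c.a) * c.xh) % c.dh;
    return low + (high << 32);             // low + 2^32 * high
}
```

Note that since $low < d_l$ and $high < d_h$, with both moduli below $2^{32}$,
the two halves occupy disjoint bit ranges of the result.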
\paragraph{Expressions.} Each type of expression (or, in the case of
expression with operator, each type of operator) has its signature constant,
capture at all the \emph{structure} of the graph. One piece of information we can capture
without breaking the signature's independence towards the order of description
of the graph, is the set of its neighbours. Yet, we cannot ``label'' the gates
without breaking this rule; thus, we represent the set of neighbours by the set
of the \emph{neighbours' signatures}.
At this point, we can define the \emph{signature of order $n$} ($n \in
\natset$) of a circuit $C$ as follows:
The ``IO adjacency'' term is an additional term in the signatures of order
above $0$, indicating what input and output pins of the circuit group
containing the current gate are adjacent to it. Adding this information to the
signature was necessary, since many gates can only be distinguished by their
signatures thanks to it (see corner cases in
Section~\ref{ssec:corner_cases}).
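To illustrate the recursion, here is a hedged sketch --- \emph{not} isomatch's
actual code: the real combination operator is $\sigop$ with dedicated
constants, and the real signature also includes the IO-adjacency term. The key
point it demonstrates is that the neighbours' order-$(n-1)$ signatures must be
combined with a commutative operation, so that the result is independent of
the order in which the circuit was described:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified gate: innerSig stands for the order-0 signature
// (inner data only); the IO-adjacency term is omitted here.
struct Gate {
    uint64_t innerSig;
    std::vector<const Gate*> neighbours;
};

uint64_t signatureOfOrder(const Gate& g, unsigned n) {
    if (n == 0) return g.innerSig;
    uint64_t neigh = 0;
    for (const Gate* nb : g.neighbours)
        neigh += signatureOfOrder(*nb, n - 1);  // addition is commutative:
                                                // description order is irrelevant
    return g.innerSig ^ (neigh * 0x9e3779b97f4a7c15ull);  // illustrative mixing
}
```

Two descriptions of the same circuit that merely list the neighbours in a
different order thus yield the same signature.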
After a few benchmarks, the default order of signature used in all
computations, unless a higher order proves useful, was set to 2.
\todo{explain range of $n$}
\paragraph{Efficiency.} Every circuit memoizes all it can concerning its
signature: the inner signature, the IO adjacency, the signatures of order $n$
already computed, etc.
or its children are modified. A memoized datum is always stored alongside a
timestamp of computation, which invalidates a previous result when needed.
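A hedged sketch of this timestamp scheme (the names are made up for
illustration; isomatch's actual classes differ): each cached value carries the
timestamp at which it was computed, and is recomputed only if the circuit was
modified since.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative only: a circuit memoizing one expensive value (its signature)
// together with the timestamp of its computation.
struct Circuit {
    uint64_t lastModified = 1;  // bumped on every structural modification
    int recomputations = 0;     // instrumentation for this sketch

    void modify() { ++lastModified; }

    uint64_t signature() {
        if (sigTimestamp < lastModified) {  // cache stale: recompute
            cachedSig = computeSignature();
            sigTimestamp = lastModified;
            ++recomputations;
        }
        return cachedSig;                   // cache hit otherwise
    }

private:
    uint64_t cachedSig = 0;
    uint64_t sigTimestamp = 0;              // 0 = never computed

    uint64_t computeSignature() const {     // stand-in for the real hash
        return lastModified * 2654435761ull;
    }
};
```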
One possible path of investigation for future work, if the computation turns
out to be still too slow in real-world cases --- which looks unlikely, unless
fl's substitution is run on a regular basis for a huge number of cases, \eg{}
by a crontab for automated testing ---, would be to multithread this
computation.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Group equality}\label{sec:group_equality}
the number of permutations examined to no more than $4$ in studied cases.
Once a permutation is judged worth to be examined, the group equality is run
recursively on all its matched gates. If this step succeeds, the graph
structure is then checked. If both steps succeed, the permutation is correct
and an isomorphism has been found; if not, we move on to the next permutation.
\todo{Anything more to tell here?}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Pattern-match}
We finally need to be able to find every occurrence of a given \emph{needle}
circuit in a bigger \emph{haystack} circuit --- at any level of its group
hierarchy. This problem is basically subgraph isomorphism with some
specificities, and for this purpose I used a classical algorithm for the
subgraph isomorphism problem, \emph{Ullmann's}.
\subsection{Ullmann's algorithm}
This algorithm was first described by Julian R.\ Ullmann in
1976~\cite{ullmann1976algorithm}.
Another, more recent algorithm to deal with this problem is Luigi P Cordella's
VF2 algorithm~\cite{cordella2004sub}, published in 2004. This algorithm is
mostly Ullmann's algorithm, transcribed in a recursive style, with the
addition of five heuristics. I originally planned to implement both algorithms
and benchmark them, but had no time to do so in the end; still, Ullmann's
algorithm with the few additional heuristics applicable in our very specific
case turned out to be fast enough.
Ullmann's is a widely used and fast algorithm for this problem. It makes
extensive use of the adjacency matrix description of the graph, and the initial
matrix. Each $1$ in a cell $(i, j)$ indicates that the $i$-th needle part is a
possible match with the $j$-th haystack part. This matrix is called $perm$
thereafter.
The algorithm, leaving aside the \textsc{refine} function --- which is
detailed just after, and can be omitted for a (much) slower version of the
algorithm --- is described in Figure~\ref{alg:ullmann}.
\begin{figure}[h]
\begin{algorithmic}
The refining process is the actual keystone of the algorithm. It is the
mechanism allowing the algorithm to cut down many exploration branches, by
changing ones to zeroes in the matrix being built.
The idea is that a match between a needle's vertex $i$ and a haystack's vertex
$j$ is only possible if, for each neighbour $k$ of $i$, $j$ has a neighbour
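The recursion and the refinement step can be sketched as follows. This is a
textbook-style rendition on plain adjacency matrices, assuming undirected
graphs and an all-ones initial candidate matrix; isomatch's actual version
additionally filters the initial candidates by signatures and wire kinds:

```cpp
#include <cassert>
#include <vector>

using Mat = std::vector<std::vector<int>>;

// Refine: drop a candidate pair (i, j) whenever some neighbour k of needle
// vertex i has no remaining candidate among the neighbours of haystack
// vertex j. Returns false if a needle vertex runs out of candidates.
bool refine(const Mat& A, const Mat& B, Mat& M) {
    const int n = (int)A.size(), m = (int)B.size();
    for (bool changed = true; changed; ) {
        changed = false;
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < m; ++j) {
                if (!M[i][j]) continue;
                for (int k = 0; k < n; ++k) {
                    if (!A[i][k]) continue;
                    bool ok = false;
                    for (int l = 0; l < m && !ok; ++l)
                        ok = B[j][l] && M[k][l];
                    if (!ok) { M[i][j] = 0; changed = true; break; }
                }
            }
    }
    for (int i = 0; i < n; ++i) {
        bool any = false;
        for (int j = 0; j < m; ++j) any = any || M[i][j];
        if (!any) return false;
    }
    return true;
}

// Try to map each needle vertex (row) to a distinct haystack vertex.
bool ullmann(const Mat& A, const Mat& B, Mat M,
             std::vector<int>& used, int row, std::vector<int>& match) {
    if (row == (int)A.size()) return true;   // all rows assigned
    for (int j = 0; j < (int)B.size(); ++j) {
        if (!M[row][j] || used[j]) continue;
        Mat M2 = M;                          // commit this row to column j
        for (int l = 0; l < (int)B.size(); ++l) M2[row][l] = (l == j);
        if (refine(A, B, M2)) {              // prune before recursing
            used[j] = 1; match[row] = j;
            if (ullmann(A, B, M2, used, row + 1, match)) return true;
            used[j] = 0;
        }
    }
    return false;
}
```

On a tiny example --- a single-edge needle against a triangle haystack ---
the refinement already removes inconsistent candidates before the second row
is even tried.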
\subsection{Implementation optimisations}
\paragraph{Initial permutation matrix.} The matrix is first filled according to
the signature matches. Note that only signatures of order 0 --- \ie{} the
inner data of a vertex --- can be used here: indeed, we cannot rely on the
context, since there can be some context in the haystack that is absent
from the needle, and we cannot check for ``context inclusion'' with our
definition of signatures: \emph{all} the context must be exactly the same for
two signatures to match. The matrix is then refined a bit more, by making sure
that for
every match, every potentially matching gate has the same ``wire kinds''.
Indeed, a needle gate's wire must have at least the same inbound adjacent
signatures as its matching haystack wire, and the same goes for outbound adjacent
many times in the haystack. This simple check saved a lot of time.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Performance}
In this section, all measurements were made on a computer with an Intel
i7-3770 CPU (3.40\,GHz) and 8\,GB of RAM\@. % chktex 8
\subsection{Small processor}
The example I used widely during my project to test my program and check its
efficiency periodically was a small processor designed one year earlier as a
school project~\cite{sysdig_cpu}. The processor implements a large subset of
the ARM instruction set and was conceived as a few hierarchized modules (ALU,
registers, opcode decoder, \ldots), but was flattened into a plain netlist
when generated from its Python code, so I first had to patch the generator
code to make its hierarchy apparent.
The processor, in the end, has around 2000 leaf gates (but works at word level)
and 240 hierarchy groups. \qtodo{Tell us more!}
\paragraph{Signature.} First, I measured the time required to sign the whole
circuit with different levels of signature (\ie{} the order of signature
computed for every part of the circuit). In practice, we never compute
high-order signatures for a whole circuit, as signatures of subgroups are
always computed at order $2$ by default, unless a particular group needs a
more accurate signature.
The measurements were made over 100 consecutive runs of the program (then
averaged to obtain a single-run figure) with the command \texttt{time}.
\begin{center}
\begin{tikzpicture}
\begin{axis}[
title={Signature time of the processor for different levels
of signature},
xlabel={Level of signature},
ylabel={Time (ms)},
xmin=0, xmax=16,
ymin=0, ymax=300,
legend pos=north west,
ymajorgrids=true,
grid style=dashed,
]
\addplot[
color=blue,
mark=square,
]
coordinates {
(2,105.4)
(3,122.6)
(4,140.1)
(5,155.4)
(6,171.2)
(7,183.9)
(8,198.3)
(9,211.2)
(10,224.3)
(11,236.7)
(12,248.5)
(13,259.3)
(14,271.7)
(15,281.4)
};
\legend{-O3}
\end{axis}
\end{tikzpicture}
\end{center}
The computation time is more or less linear in the level of signature
required, which is coherent with the implementation. In practice, only small
portions of a circuit will be signed with a high order, which means that we
can afford really high-order signatures (\eg{} 40 or 50, which already means
that the diameter of the group is 40 or 50) without a real impact on the
computation time.
\paragraph{Equality.} To test circuit group equality, a small piece of
code takes a circuit and scrambles it as much as possible
--- without altering its structure ---, \eg{} by randomly renaming its parts,
or randomly changing the order of the circuits and groups, \ldots The circuit
is then matched with its unaltered counterpart.
For the processor described above, it takes about \textbf{477\,ms} to
prove it equal to its scrambled version, and then the other way around. Yet,
the memoized results (essentially the signatures) are kept for the second one,
considerably speeding it up: the same program proving only one way takes about
\textbf{475\,ms}. \todo{explain why}
\subsection{Corner cases}\label{ssec:corner_cases}
There were a few observed cases where the algorithm tends to be slower on
certain configurations, and a few other such cases that could be fixed.
\todo{More corner cases}
\todo{Corner case: io pins, io adjacency}
\paragraph{Split/merge trees.} A common pattern that tends to slow down the
algorithm is split/merge trees. Those patterns occur when one wants to merge
$n$ one-bit wires into a single $n$-bit wire, or the other way around.
nodes on the layer below cannot be freely exchanged.
\todo{Figure describing the problem}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section*{Conclusion}
\todo{}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\printbibliography{}