\section*{Conclusion and future works}
In this chapter, we have presented a fully-tooled approach that enables:
\begin{itemize}
\item the generation of a wide variety of microbenchmarks, reflecting both the
expertise contained in an initial benchmark suite and the diversity of code
transformations that make it possible to stress different aspects of a
performance model ---~or even of a measurement environment, \eg{} \bhive{}; and
\item the comparability of various measurements and
analyses applied to each of these microbenchmarks.
\end{itemize}
Thanks to this tooling, we were able to show the limits and strengths of
various performance models in relation to the expertise contained in the
Polybench suite. We discuss throughput results in
Section~\ref{ssec:overall_results} and bottleneck prediction in
Section~\ref{ssec:bottleneck_pred_analysis}.

We were also able to demonstrate the difficulties of reasoning at the level of
a basic block isolated from its context. We specifically study those
difficulties in the case of \bhive{} in Section~\ref{ssec:bhive_errors}.
Indeed, the actual values ---~both in registers and in memory~--- involved in a
basic block's computation determine not only its functional properties (\ie{}
the result of the computation), but also some of its non-functional properties
(\eg{} latency, throughput).

We were also able to show in Section~\ref{ssec:memlatbound} that
state-of-the-art static analyzers struggle to account for memory-carried
dependencies, a weakness that significantly impacts their overall results on
our benchmarks. We believe that detecting and accounting for these dependencies
is an important direction for future work.
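As a first step in this direction, the sketch below (a hypothetical, purely
syntactic detector written in Python; the input format, regular expression and
heuristics are illustrative assumptions, not part of our tooling) flags loads
whose memory operand textually matches that of an earlier store in the same
basic block. A real analysis would of course need alias information and
loop-carried reasoning.
\begin{lstlisting}[language=Python]
import re

# Hypothetical, purely syntactic detector: within one basic block
# (AT&T syntax, one instruction per line), flag loads whose memory
# operand is textually identical to that of an earlier store.  Real
# memory-carried dependencies also require alias and loop analysis.
MEM_OPERAND = re.compile(
    r'-?(?:0x[0-9a-f]+|\d+)?\(%[a-z0-9]+(?:,%[a-z0-9]+(?:,\d)?)?\)')

def naive_memory_dependencies(basic_block: str):
    written = set()   # memory operands already stored to
    deps = []         # (line number, operand) of flagged loads
    for no, raw in enumerate(basic_block.strip().splitlines(), 1):
        line = raw.split('#', 1)[0].rstrip()  # drop comments
        operands = MEM_OPERAND.findall(line)
        if not operands:
            continue
        # AT&T syntax: the destination is the last operand, so a
        # trailing memory operand means the instruction stores to it.
        dst_is_mem = line.endswith(operands[-1])
        sources = operands[:-1] if dst_is_mem else operands
        deps += [(no, op) for op in sources if op in written]
        if dst_is_mem:
            written.add(operands[-1])
    return deps

example = """
movq  %rax, 8(%rsp)
addq  $1, %rbx
movq  8(%rsp), %rcx
"""
print(naive_memory_dependencies(example))  # [(3, '8(%rsp)')]
\end{lstlisting}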

Moreover, we present this work in the form of a modular software package, each
component of which exposes numerous adjustable parameters. These components can
also be replaced by others fulfilling the same abstract function: another
initial benchmark suite in place of Polybench, other loop nest optimizers in
place of PLUTO and PoCC, other code analyzers, and so on. This software
modularity reflects the fact that our contribution lies in the interfacing and
communication between otherwise distinct concerns.

\medskip

Furthermore, we believe that the contributions we made in the course of this
work may eventually be used to address different, yet related, problems.
These perspectives can also be seen as directions for future work:
\smallskip
\paragraph{Program optimization.} The whole program-processing pipeline we have
designed can be used not only to evaluate the performance model underlying a
static analyzer, but also to guide program optimization itself. From this
perspective, we would generate different versions of the same program using the
transformations discussed in Section~\ref{sec:bench_gen} and colored blue in
Figure~\ref{fig:contrib}. These different versions would then feed the
execution and measurement environment outlined in
Section~\ref{sec:bench_harness} and colored orange in Figure~\ref{fig:contrib}.
Indeed, the work presented above shows that the results of these comparable
analyses and measurements make it possible to identify which version is the
most efficient, and even to reconstruct information indicating why (which
bottlenecks are hit, etc.).
However, this approach would require these different versions of the same
program to be functionally equivalent, \ie{} to compute the same result from
the same inputs; yet we saw in Section~\ref{sec:bench_harness} that, as it
stands, the transformations we apply are not concerned with preserving the
semantics of the input codes. Recovering this semantic-preservation property
only requires abandoning the kernelification pass we have presented; however,
L1-residence would then have to be ensured by other means.
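
As an illustrative sketch of this perspective, the driver below (in Python)
assumes a hypothetical command-line wrapper, \texttt{harness.sh}, around our
measurement environment, which prints a throughput estimate and the detected
bottleneck for a given variant directory; both the wrapper and its output
format are assumptions made for the sake of the example.
\begin{lstlisting}[language=Python]
import glob
import subprocess

# Hypothetical driver: each transformed variant of the kernel lives in
# its own directory, and `harness.sh` (assumed entry point) prints two
# whitespace-separated fields: a throughput estimate (higher is better)
# and the detected bottleneck.
def run_harness(variant_dir: str) -> tuple[float, str]:
    out = subprocess.run(["./harness.sh", variant_dir],
                         capture_output=True, text=True,
                         check=True).stdout.split()
    return float(out[0]), out[1]

def best_variant(pattern: str = "variants/*/") -> str:
    results = {d: run_harness(d) for d in sorted(glob.glob(pattern))}
    for d, (throughput, bottleneck) in results.items():
        print(f"{d}: {throughput:.2f} (bottleneck: {bottleneck})")
    # Select the variant with the highest measured throughput.
    return max(results, key=lambda d: results[d][0])

if __name__ == "__main__":
    print("selected:", best_variant())
\end{lstlisting}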
\smallskip
\paragraph{Dataset building.} Our microbenchmark generation phase outputs a
large, diverse and representative dataset of microkernels. Beyond our own
harness, we believe that such a dataset could be used to improve existing
data-dependent approaches.
%the measurement and execution environment we
%propose is not the only type of tool whose function is to process a large
%dataset (\ie{} the microbenchmarks generated earlier) to automatically
%abstract its characteristics. We can also think of:
Inductive methods, as in \anica{}, strive to preserve the properties of a basic
block through successive abstractions of the instructions it contains, so as to
draw the most general conclusions possible from a particular experiment.
Currently, \anica{} starts from randomly generated basic blocks. This approach
guarantees a certain variety and avoids over-specialization, which would
prevent it from finding interesting cases too far from an initial dataset.
However, it may well lead to the samples under consideration being
systematically outside the relevant area of the search space ---~\ie{} bearing
no relation to real-life programs, or to programs from the user's application
domain.
On the other hand, machine-learning methods based on neural networks, as in
\ithemal{}, seek to correlate the output of a function with the characteristics
of its input ---~in this case, a throughput prediction with the instructions
making up a basic block~--- by backpropagating the gradient of a cost function.
\ithemal{}, for instance, is trained on benchmarks extracted from an existing
benchmark suite. As opposed to random generation, this approach offers
representative samples, but comes with a risk of a lack of variety and of
over-specialization.
By contrast, our microbenchmark generation method is natively meant to produce
a large, varied and representative dataset. We believe that enriching the
datasets used by the above-mentioned methods with our benchmarks might extend
their results and reach.
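
As an illustration, assuming that each generated microbenchmark directory
contains the extracted basic block as textual assembly and a JSON file with the
throughput measured by our harness (the file names and layout below are
hypothetical), such an enriched training set could be assembled as follows:
\begin{lstlisting}[language=Python]
import csv
import json
import pathlib

# Hypothetical layout: each microbenchmark directory holds `block.s`
# (the extracted basic block) and `measures.json` containing the
# throughput measured by our harness.  The resulting CSV pairs each
# block with its measurement, ready to enrich the training set of a
# data-dependent predictor.
def build_dataset(root: str, out_csv: str) -> int:
    rows = 0
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["basic_block", "measured_throughput"])
        for bench in sorted(pathlib.Path(root).iterdir()):
            block = bench / "block.s"
            measures = bench / "measures.json"
            if not (block.is_file() and measures.is_file()):
                continue
            throughput = json.loads(measures.read_text())["throughput"]
            writer.writerow([block.read_text().strip(), throughput])
            rows += 1
    return rows

if __name__ == "__main__":
    print(build_dataset("microbenchmarks/", "dataset.csv"), "samples")
\end{lstlisting}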