\chapter*{Conclusion}
\addcontentsline{toc}{chapter}{Conclusion}

Throughout this manuscript, we explored the main bottlenecks that arise while
analyzing the low-level performance of a microkernel:
\begin{itemize}
    \item frontend bottlenecks: the processor's frontend is unable to
        saturate the backend with instructions (\autoref{chap:frontend});
    \item backend bottlenecks: the backend is saturated with instructions
        and processes them as fast as possible (\autoref{chap:palmed});
    \item dependency bottlenecks: data dependencies between instructions
        prevent the backend from being saturated, as it stalls while
        awaiting previous results (\autoref{chap:staticdeps}).
\end{itemize}
In \autoref{chap:CesASMe}, we also conducted a systematic comparative study of
a variety of state-of-the-art code analyzers.

\bigskip{}

State-of-the-art code analyzers such as \llvmmca{} or \uica{} already
boast good accuracy. Both of these tools, and most of the others,
are however based on models obtained by various degrees of manual
investigation, and cannot be adapted without further manual effort to future
or uncharted microprocessors.

The field of microarchitectural models for code
analysis emerged with fundamentally manual methods, such as Agner Fog's tables.
Such tables, however, may now be produced in a more automated way using
\uopsinfo{}, at least for certain microarchitectures. \pmevo{} pushes
further in this direction by automatically computing a backend model from
benchmarks, but still has trouble scaling to a full instruction set. In its
own way, \ithemal{}, a machine-learning-based approach, could also be
considered automated; yet it still requires a large training set for the
intended processor, which must be at least partially crafted manually.
This trend towards model automation seems only natural as new
microarchitectures keep appearing, while new ISAs such as ARM reach the
supercomputing field.

\medskip{}

We investigated this direction by exploring the three major bottlenecks
mentioned earlier, with a view to providing fully automated,
benchmark-based models for each of them. Ideally, these models should be
generated by simply executing a program on a machine based on the
targeted microarchitecture.

\begin{itemize}
    \item We contributed to \palmed{}, a framework able to extract a
        port-mapping of a processor, serving as a backend model.
    \item We manually extracted a frontend model for the Cortex A72 processor.
        We believe that the foundation of our methodology works on most
        processors. The main characteristics of a frontend, apart from the
        decomposition of instructions into \uops{} and the issue width, must
        however still be investigated, and their relative importance evaluated.
    \item We provided with \staticdeps{} a method to extract data
        dependencies between instructions. It is able to detect
        \textit{loop-carried} dependencies (dependencies that span across
        multiple loop iterations), as well as \textit{memory-carried}
        dependencies (dependencies based on reading at a memory address written
        by another instruction). While detection of the former is widely
        implemented, detection of the latter is, to the best of our knowledge,
        an original contribution. We bundled this method in a
        processor-independent tool, based on the ISA semantics provided by
        \valgrind{}, which supports a variety of ISAs.
\end{itemize}
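As an illustration only, the two kinds of dependencies above can be stated as
follows; the notation is introduced here for this sketch and is not
necessarily that of \autoref{chap:staticdeps}. Assume that the executed
instructions are numbered $I_1, I_2, \ldots$ in program order, and write
$W(I)$ and $R(I)$ for the sets of memory addresses respectively written and
read by an instruction $I$. Under these assumptions, $I_j$ has a
memory-carried dependency on $I_i$, with $i < j$, when
\[
    \exists\, a \in W(I_i) \cap R(I_j), \quad
    \forall k \in \{i+1, \ldots, j-1\}, \quad
    a \notin W(I_k),
\]
that is, when $I_j$ reads an address whose last write was performed by $I_i$.
Such a dependency is moreover loop-carried when $I_i$ and $I_j$ belong to two
different iterations of the loop.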

\bigskip{}

We evaluated these three models independently, each of them providing
satisfactory results: \palmed{} is competitive with the state of the art, with
the advantage of being automatic; our frontend model significantly improves a
backend model's accuracy; and our dependency model significantly improves
\uica{}'s results, while being consistent with a dynamic dependency analysis.

Evaluating the three models combined as a complete analyzer would have been
most meaningful. However, as we argue in \autoref{chap:wrapping_up} above, this
is sadly not practical, as tools do not easily combine without a large amount of
engineering.

\bigskip{}

Our comparative experiments with \cesasme{} also revealed multiple weaknesses
in the current state of the art.

\smallskip{}

First, none of the state-of-the-art tools has good support for dependencies
across memory. Such dependencies were present in about a third of \cesasme{}'s
benchmark set. While we built this benchmark set aiming for representative
data, there is no clear evidence that these dependencies are as prevalent
in the code analyzed in real use cases. We however believe that such
cases regularly occur, and we also saw that the performance of code analyzers
drops sharply in their presence.

\smallskip{}

We also found the bottleneck prediction offered by some code analyzers still
uncertain. In our experiments, the tools disagreed more often than not on the
presence or absence of a bottleneck, with no tool standing out; we are thus
unable to draw conclusions on the relative performance of tools in this regard.
On the other hand, sensitivity analysis, as implemented \eg{} by \gus{}, seems
a theoretically sound way to evaluate the presence or absence of a bottleneck
in a microkernel; it is, however, prohibitively slow for many use cases. In
this respect, a study of code analyzers' predictions against results from
sensitivity analysis would certainly bring more conclusive results.

\smallskip{}

Finally, we observed in \bhive{}'s results the effects of a \emph{lack of
context} for an analysis. \bhive{} measures a real execution, on real hardware,
of a kernel; as such, it yields excellent accuracy in many cases, with a median
error of about 8\%. Yet, it still lacks accuracy in many other cases, with
its third quartile (23\%) above \uica{}'s or \iaca{}'s median result (about
18\%), and far-reaching outliers bringing its mean error on par with \uica{}'s.
Indeed, what precedes a loop nest and the actual values present in registers
impact the performance of the loop nest. The effects can be fairly
high-level, such as pointer aliasing, leading to false positives or negatives
in dependency detection. They can also be microarchitectural, such as
the observable performance loss of memory accesses (even with cache hits)
when memory reads cross a cache line boundary.
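As a minimal worked illustration of this last effect, assume cache lines of
$L = 64$ bytes (a common size, taken here as an assumption rather than a
figure from our experiments). A read of $s \leq L$ bytes at address $a$ then
crosses a line boundary exactly when
\[
    (a \bmod L) + s > L.
\]
For instance, an 8-byte read with $a \bmod 64 = 60$ touches bytes 60 to 67
and thus spans two lines, whereas the same read with $a \bmod 64 = 56$ touches
bytes 56 to 63 and stays within a single line.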

This lack of context causes a significant loss of accuracy for
static analyzers, as we saw in \autoref{ssec:bhive_errors} that the same
instruction, depending on its registers' values, can be twice as slow even
without aliasing, or 19 times slower upon aliasing. With \cesasme{}, we sketch
the embryo of a solution, with a simple and fast pass of dynamic analysis
through instrumentation, gathering data for a subsequent pass of static
analysis. Such a method might help recreate the context needed for an
accurate analysis.