\chapter*{Conclusion}
\addcontentsline{toc}{chapter}{Conclusion}

Throughout this manuscript, we explored the main bottlenecks that arise when analyzing the low-level performance of a microkernel:
\begin{itemize}
    \item frontend bottlenecks ---~the processor's frontend is unable to saturate the backend with instructions (\autoref{chap:frontend});
    \item backend bottlenecks ---~the backend is saturated with instructions from the frontend and is unable to process them fast enough (\autoref{chap:palmed});
    \item dependency bottlenecks ---~data dependencies between instructions prevent the backend from being saturated; the latter stalls awaiting previous results (\autoref{chap:staticdeps}).
\end{itemize}
We also conducted, in \autoref{chap:CesASMe}, a systematic comparative study of a variety of state-of-the-art code analyzers.

\bigskip{}

State-of-the-art code analyzers such as \llvmmca{} or \uica{} already boast good accuracy. Both of these tools ---~and most of the others as well~--- are, however, based on models obtained through varying degrees of manual investigation, and cannot be adapted to future or uncharted microprocessors without further manual effort. The field of microarchitectural models for code analysis emerged with fundamentally manual methods, such as Agner Fog's tables. Such tables, however, may now be produced in a more automated way with \uopsinfo{} ---~at least for certain microarchitectures. \pmevo{} pushes further in this direction by automatically computing a port mapping from benchmarks, but still has trouble scaling to a full instruction set. In its own way, \ithemal{}, a machine-learning-based approach, could also be considered automated ---~yet it still requires a large training set for the intended processor, which must be at least partially crafted manually. This trend towards model automation seems only natural as new microarchitectures keep appearing, while newer ISAs such as ARM enter the supercomputing arena.

\medskip{}

We investigated this direction by exploring the three major bottlenecks mentioned earlier, with the aim of providing fully automated, benchmark-based models for each of them. Ideally, these models should be generated by simply executing a benchmarking program on a machine implementing the targeted microarchitecture.
\begin{itemize}
    \item We contributed to \palmed{}, a framework able to extract a port mapping of a processor, serving as a backend model.
    \item We manually extracted a frontend model for the Cortex A72 processor. We believe that the foundation of our methodology applies to most processors. To this end, we provide a parametric model that may serve as a scaffold for future work aiming to build an automatic frontend model. Some parameters of this model, however, must still be investigated, and their relative importance evaluated.
    \item We provided, with \staticdeps{}, a method to extract data dependencies between instructions. It is able to detect \textit{loop-carried} dependencies (dependencies that span multiple loop iterations), as well as \textit{memory-carried} dependencies (dependencies based on reading from a memory address written by another instruction); a minimal example of the latter is sketched after this list. While the former is widely implemented, the latter is, to the best of our knowledge, an original contribution. We bundled this method in a processor-independent tool, based on the ISA semantics provided by \valgrind{}, which supports a variety of ISAs.
\end{itemize}
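As a minimal illustration of a memory-carried dependency, consider the contrived kernel below (a sketch of ours, not taken from this manuscript's benchmarks): each iteration reads the value that the previous iteration stored to memory, so, at the assembly level, a store-to-load chain serializes the loop.

\begin{lstlisting}[language=C]
#include <stddef.h>

/* Contrived example: the load of a[i-1] reads the address written
 * by the store to a[i] of the previous iteration. The dependency is
 * both loop-carried (it spans iterations) and memory-carried (it
 * flows through memory rather than through a register). */
void running_sum(float *a, const float *b, size_t n) {
    for (size_t i = 1; i < n; i++)
        a[i] = a[i - 1] + b[i];
}
\end{lstlisting}

Note that an optimizing compiler may keep \texttt{a[i-1]} in a register across iterations; the pattern of interest here is a store followed by a load at the same address, as it appears in the final assembly.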
\bigskip{}

We evaluated these three models independently, each of them providing satisfactory results: \palmed{} is competitive with the state of the art, with the advantage of being automatic; our frontend model significantly improves a backend model's accuracy; and our dependency model significantly improves \uica{}'s results, while being consistent with a dynamic dependency analysis. Evaluating the three models combined into a complete analyzer would have been most meaningful. However, as we argue in the pre-conclusive chapter \nameref{chap:wrapping_up}, this is sadly not practical, as tools do not combine easily without a large amount of engineering.

\bigskip{}

We also identified multiple weaknesses in the current state of the art through our comparative experiments with \cesasme{}.

\smallskip{}

First, none of the state-of-the-art tools has good support for dependencies carried through memory. Such dependencies were present in about a third of \cesasme{}'s benchmark set. While we built this benchmark set aiming for representative data, there is no clear evidence that these dependencies are as prevalent in the code analyzed in real use cases. We believe, however, that such cases occur regularly, and we also saw that the accuracy of code analyzers drops sharply in their presence.

\smallskip{}

We also found the bottleneck predictions offered by some code analyzers to be still very uncertain. In our experiments, the tools disagreed more often than not on the presence or absence of a bottleneck, with no tool standing out; we are thus unable to conclude on the tools' relative performance in this respect. On the other hand, sensitivity analysis, as implemented \eg{} by \gus{}, seems a theoretically sound way to assess the presence or absence of a bottleneck in a microkernel; it is, however, prohibitively slow for many use cases. In this respect, a study of code analyzers' predictions against results from sensitivity analysis would certainly bring more conclusive results.

\smallskip{}

Finally, we observed in \bhive{}'s results the effects of a \emph{lack of context} on an analysis. \bhive{} measures a real execution, on real hardware, of a kernel; as such, it yields excellent accuracy in many cases, with a median error of about 8\%. Yet, it still lacks accuracy in many other cases: its third quartile (23\%) lies above \uica{}'s or \iaca{}'s median result (about 18\%), and far-reaching outliers bring its mean error on par with \uica{}'s. Indeed, both the code that precedes a loop nest and the actual values present in registers impact the loop nest's performance. These effects can be fairly high-level, such as pointer aliasing, which leads to false positives or negatives in dependency detection. They can also be microarchitectural, such as the observable performance loss of memory accesses ---~even on cache hits~--- when a read crosses a cache line boundary. This lack of context incurs a significant loss of accuracy for static analyzers: we saw in \autoref{ssec:bhive_errors} that the same instruction, depending on its register values, can be twice as slow even without aliasing, or 19 times slower when aliasing occurs. With \cesasme{}, we sketched the embryo of a solution: a simple and fast dynamic analysis pass, based on instrumentation, gathering data for a subsequent static analysis pass. Such a method might help recreate the context needed for an accurate analysis; the example below illustrates the kind of ambiguity it could resolve.
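To make this missing context concrete, consider the following kernel (a hypothetical illustration of ours, not one of our benchmarks; the function name and types are arbitrary). Statically, nothing tells an analyzer whether \texttt{dst} and \texttt{src} alias; the answer, fixed only by the pointer values at run time, determines whether the iterations are independent or chained by a memory-carried dependency.

\begin{lstlisting}[language=C]
#include <stddef.h>

/* Hypothetical illustration: whether dst and src alias cannot be
 * decided statically; it depends on the pointer values set up by
 * the (unseen) surrounding code. If dst == src + 1, the store to
 * dst[i] writes the address read as src[i+1] on the next iteration,
 * so a memory-carried dependency chain serializes the loop; if the
 * two arrays are disjoint, all iterations are independent. */
void scale(float *dst, const float *src, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = 2.0f * src[i];
}
\end{lstlisting}

A lightweight instrumentation pass in the spirit of \cesasme{} could record the actual values of \texttt{dst} and \texttt{src} and hand them to the static analyzer, turning this ambiguity into a definite answer.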