\chapter{Wrapping it all up}\label{chap:wrapping_up}

In \autoref{chap:palmed}, we introduced \palmed{}, a framework to build backend
models. Following up in \autoref{chap:frontend}, we introduced a frontend model
for the ARM-based Cortex A72 processor. Then, in \autoref{chap:staticdeps}, we
further introduced a dependency detection model. Put together, these three
parts cover the major bottlenecks that a code analyzer must take into account.
The first two models ---~frontend and backend~--- natively output a
cycles-per-iteration metric; we reduce our dependency model to a
cycles-per-iteration metric by computing the \emph{critical path}, described
below.

\medskip{}

To conclude this manuscript, we take a minimalistic first approach to combining
these three models into a predictor, which we call \acombined{}, by taking the
maximum of the three models' predictions. This method is clearly less precise
than \eg{} \uica{}'s or \llvmmca{}'s, which simulate iterations of the kernel
while accounting for each of these aspects at once. It however allows us to
quickly and easily evaluate a \emph{lower bound} on the quality of our models:
a more refined tool using our models should obtain results at least as good as
this method ---~and we could expect it to perform significantly better.

\section{Critical path model}

To account for dependency-induced bottlenecks, we compute the \emph{critical
path} through the data dependency graph of the microkernel; that is, the
longest path in this graph, weighted by the source instructions' latencies. The
length of this path sets a lower bound on the execution time, as each source
instruction must be issued and yield a result before the destination
instruction can be issued. This approach is also taken by
\osaca{}~\cite{osaca2}. In our case, we use the instruction latencies inferred
by \palmed{} and its backend \pipedream{} on the A72.

\medskip{}

So far, however, this method would fail to account for out-of-order execution:
the latency of an instruction can be hidden by other computations that do not
depend on its result. This instruction-level parallelism is limited by the
reorder buffer's size. We thus unroll the kernel as many times as fits in the
reorder buffer ---~accounting for each instruction's \uop{} count, as we have a
frontend model readily available~---, and compute the critical path on this
unrolled version. Finally, the metric in cycles per iteration is obtained by
dividing this critical path's length by the number of times we unrolled the
kernel.
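The computation just described can be summarised by the short Python sketch
below. It is only an illustration of the method, not our actual implementation;
in particular, the graph encoding, the identifier names and the default
reorder-buffer size are assumptions made for the sake of the example.

\begin{verbatim}
# Illustrative sketch of the dependency model: critical path of the kernel,
# unrolled to fill the reorder buffer, expressed in cycles per iteration.
# Names and the default ROB size are assumptions, not our implementation.
def critical_path_cpi(n_insns, dep_edges, latency, uops, rob_size=128):
    # n_insns:   number of instructions in one kernel iteration
    # dep_edges: triples (src, dst, carried); `carried` marks a dependency
    #            from one iteration to the next (loop-carried)
    # latency:   latency[i] = latency of instruction i (backend model)
    # uops:      uops[i] = micro-op count of instruction i (frontend model)

    # Unroll the kernel as many times as fits in the reorder buffer,
    # counting each instruction for its number of micro-ops.
    unroll = max(1, rob_size // sum(uops))

    def node(k, i):  # instruction i of unrolled iteration k
        return k * n_insns + i

    n_nodes = unroll * n_insns
    succs = [[] for _ in range(n_nodes)]
    for src, dst, carried in dep_edges:
        for k in range(unroll):
            k_dst = k + 1 if carried else k
            if k_dst < unroll:
                succs[node(k, src)].append(node(k_dst, dst))

    # Longest path by dynamic programming; node indices already follow a
    # topological order, since dependencies only flow forward in the
    # unrolled code. ready[v] = earliest cycle at which v may issue.
    ready = [0.0] * n_nodes
    for v in range(n_nodes):
        for w in succs[v]:
            ready[w] = max(ready[w], ready[v] + latency[v % n_insns])

    crit_path = max(ready[v] + latency[v % n_insns] for v in range(n_nodes))
    return crit_path / unroll
\end{verbatim}

Note that this sketch accounts for dependencies only; resource conflicts are
precisely what the backend and frontend models capture separately.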
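With a cycles-per-iteration value now available for each of the three aspects,
the combination performed by \acombined{} is a simple maximum, as sketched
below with illustrative names:

\begin{verbatim}
# Sketch of the max-based combination used by A72 combined: the model
# predicting the slowest throughput is assumed to be the actual bottleneck.
def combined_cpi(frontend_cpi, backend_cpi, deps_cpi):
    return max(frontend_cpi, backend_cpi, deps_cpi)
\end{verbatim}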
\section{Evaluation}

\begin{table}
    \centering
    \footnotesize
    \begin{tabular}{l r r r r r r r r}
        \toprule
        \textbf{Bencher} & \textbf{Datapoints} & \multicolumn{2}{c}{\textbf{Failures}}
            & \textbf{MAPE} & \textbf{Median} & \textbf{Q1} & \textbf{Q3} & \textbf{$K_\tau$} \\
        & & (Count) & (\%) & (\%) & (\%) & (\%) & (\%) & \\
        \midrule
        A72 combined & 1767 & 9 & (0.51\,\%) & 19.26\,\% & 12.98\,\% & 5.57\,\% & 25.38\,\% & 0.75\\
        llvm-mca & 1775 & 1 & (0.06\,\%) & 32.60\,\% & 25.17\,\% & 8.84\,\% & 59.16\,\% & 0.69\\
        Osaca (backend) & 1773 & 3 & (0.17\,\%) & 49.33\,\% & 50.19\,\% & 33.53\,\% & 64.94\,\% & 0.67\\
        Osaca (crit. path) & 1773 & 3 & (0.17\,\%) & 84.02\,\% & 70.39\,\% & 40.37\,\% & 91.47\,\% & 0.24\\
        \bottomrule
    \end{tabular}
    \caption{Evaluation through \cesasme{} of the \acombined{} model}\label{table:a72_combined_stats}
\end{table}

\begin{figure}
    \centering
    \includegraphics[width=0.5\linewidth]{cesasme_a72combined_boxplot.svg}
    \caption{Evaluation through \cesasme{} of the \acombined{} model}\label{fig:a72_combined_stats_boxplot}
\end{figure}

We evaluate \acombined{} with \cesasme{} on the Raspberry Pi's Cortex A72, using
the same set of benchmarks as in \autoref{chap:CesASMe}, recompiled for
AArch64. As most of the code analyzers we studied are unable to run on the A72,
we can only compare \acombined{} to the baseline \perf{} measure, \llvmmca{}
and \osaca{}. We use \llvmmca{} at version 18.1.8 and \osaca{} at version
0.5.0. We present the results in \autoref{table:a72_combined_stats} and in
\autoref{fig:a72_combined_stats_boxplot}.

Our \acombined{} model significantly outperforms \llvmmca{}, with a median
error roughly half of \llvmmca{}'s and a third quartile on par with its median.
We expect that an iterative model in the style of \llvmmca{} or \uica{}, built
on top of our models' data, would outperform \acombined{} by a still larger
margin.

\section{Towards a modular approach?}

These models, however ---~frontend, backend and dependencies~---, are only
loosely dependent on each other. The critical path model, for instance,
requires only the number of \uops{} of each instruction, while the frontend
model is purely standalone. Should a standardized format or API for these
models emerge, swapping \eg{} our backend model for \uopsinfo{} and running our
tool on Intel CPUs would be trivial. Better yet, one could build a
``meta-model'' relying on these model components and implementing a combination
logic far more accurate than our simple \texttt{max}-based model, onto which
anyone could hot-plug \eg{} a custom frontend model. Instead, the usual
approach in the domain for trying out a new idea is to build a full analyzer
implementing it, as we did with \palmed{} for backend models, or as \uica{} did
with its focus on frontend analysis.

In hindsight, we advocate for the emergence of such a modular code analyzer. It
would maybe not be as convenient or well-integrated as ``production-ready''
code analyzers such as \llvmmca{} ---~which is officially packaged for Debian.
It could, however, greatly simplify the academic process of trying a new idea
on any of the three main models, by decoupling them. It would also ease the
comparative evaluation of those ideas, while eliminating many of the
discrepancies between experimental setups that make an actual comparison
difficult ---~the very issue that prompted us to build \cesasme{} in
\autoref{chap:CesASMe}. Indeed, with such a modular tool, it would be easy to
run the same experiment, under the same conditions, changing only \eg{} the
frontend model while keeping a well-tried backend model.
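To make the idea more concrete, the following Python sketch outlines what such
hot-pluggable model components could look like. The interfaces, names and
signatures below are purely hypothetical; they do not correspond to any
existing standard or tool.

\begin{verbatim}
# Hypothetical interfaces for swappable model components; these names and
# signatures are illustrative only and do not correspond to an existing API.
from typing import Protocol, Sequence


class FrontendModel(Protocol):
    def uop_count(self, insn: str) -> int: ...
    def cpi(self, kernel: Sequence[str]) -> float: ...


class BackendModel(Protocol):
    def latency(self, insn: str) -> float: ...
    def cpi(self, kernel: Sequence[str]) -> float: ...


class DependencyModel(Protocol):
    def cpi(self, kernel: Sequence[str],
            frontend: FrontendModel, backend: BackendModel) -> float: ...


def meta_model(kernel: Sequence[str], frontend: FrontendModel,
               backend: BackendModel, deps: DependencyModel) -> float:
    # Here, the simple max-based combination; a more refined meta-model
    # could instead simulate the kernel using the same three components.
    return max(frontend.cpi(kernel),
               backend.cpi(kernel),
               deps.cpi(kernel, frontend, backend))
\end{verbatim}

Any analyzer exposing its components through such interfaces could then be
compared against others under strictly identical experimental conditions.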