\chapter{Wrapping it all up}\label{chap:wrapping_up}
In \autoref{chap:palmed}, we introduced \palmed{}, a framework to build a
backend model. Following up in \autoref{chap:frontend}, we introduced a
frontend model for the ARM-based Cortex A72 processor. Then, in
\autoref{chap:staticdeps}, we further introduced a dependency detection model.
Put together, these three parts cover the major bottlenecks that a code
analyzer must take into account.

The first two models ---~frontend and backend~--- already natively output a
cycles-per-iteration metric; we reduce our dependencies model to a
cycles-per-iteration metric as well by computing the \emph{critical path},
described below.

\medskip{}

To conclude this manuscript, we take a minimalist first approach to combining
these three models into a single predictor, which we call \acombined{}, by
taking the maximal prediction among the three models.
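As a minimal sketch ---~with made-up numbers, since this only illustrates the
combination itself~---, this amounts to:
\begin{lstlisting}[language=Python]
def a_combined(frontend_cycles, backend_cycles, dependencies_cycles):
    """Most pessimistic (largest) of the three per-iteration estimates
    produced by the frontend, backend and dependency models."""
    return max(frontend_cycles, backend_cycles, dependencies_cycles)

# e.g., a kernel bound by its dependency chains (values made up):
assert a_combined(1.5, 2.0, 3.25) == 3.25
\end{lstlisting}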
This method is clearly less precise than \eg{} \uica{}'s or \llvmmca{}'s
methods, which simulate iterations of the kernel while accounting for each
model. However, it allows us to quickly and easily evaluate an \emph{upper
bound} on the prediction error attainable with our models: a more refined tool
built on our models should obtain results at least as good as this method
---~and we would expect it to perform significantly better.
\section{Critical path model}
To account for dependency-induced bottlenecks, we compute the \emph{critical
path} through the data dependency graph of the microkernel; that is, the
longest path in this graph, with each edge weighted by the latency of its
source instruction. The length of this path is a lower bound on the execution
time, as each source instruction must be issued and yield its result before
the dependent instruction can in turn be issued. This approach is also taken
by \osaca{}~\cite{osaca2}.

In our case, we use the instruction latencies inferred for the A72 by
\palmed{} and its backend \pipedream{}.
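The computation itself is a simple longest-path traversal in topological
order. A minimal sketch follows, assuming ---~purely for illustration~--- that
the dependency graph is given as a list of producer/consumer pairs over
instruction indices, together with per-instruction latencies:
\begin{lstlisting}[language=Python]
def critical_path_length(n_insns, deps, latency):
    """Longest path in the dependency DAG, each edge being weighted by
    the latency of its source (producing) instruction.

    n_insns: number of instructions in the (unrolled) kernel;
    deps: list of (src, dst) pairs, dst consuming a result of src;
    latency: list of per-instruction latencies."""
    succs = [[] for _ in range(n_insns)]
    preds_left = [0] * n_insns
    for src, dst in deps:
        succs[src].append(dst)
        preds_left[dst] += 1

    # dist[i]: weight of the longest dependency chain ending at i,
    # computed in topological order (Kahn's algorithm).
    dist = [0] * n_insns
    ready = [i for i in range(n_insns) if preds_left[i] == 0]
    while ready:
        i = ready.pop()
        for j in succs[i]:
            # j cannot be issued before i has produced its result
            dist[j] = max(dist[j], dist[i] + latency[i])
            preds_left[j] -= 1
            if preds_left[j] == 0:
                ready.append(j)
    return max(dist, default=0)
\end{lstlisting}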
\medskip{}
So far, however, this method fails to account for out-of-order execution: the
latency of an instruction may be hidden by other computations that do not
depend on its result. This instruction-level parallelism is limited by the
size of the reorder buffer.

We thus unroll the kernel as many times as it fits in the reorder buffer
---~accounting for each instruction's \uop{} count, as we have a frontend
model readily available~--- and compute the critical path on this unrolled
version. Finally, we obtain the metric in cycles per iteration by dividing the
length of this critical path by the number of times the kernel was unrolled.
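Reusing the \texttt{critical\_path\_length} sketch above, and assuming
---~again purely for illustration~--- that each dependency carries an
iteration distance, that per-instruction \uop{} counts come from the frontend
model and that the reorder buffer size is a known parameter, the whole
procedure boils down to:
\begin{lstlisting}[language=Python]
def dependencies_cycles_per_iteration(kernel, deps, latency,
                                      uop_count, rob_size):
    """Dependency-induced cycles per iteration of the kernel.

    deps: (src, dst, iteration_distance) triples, distance 0 meaning an
    intra-iteration dependency; uop_count[i] and rob_size come from the
    frontend model and the CPU's parameters respectively."""
    n = len(kernel)
    # Unroll as many kernel copies as fit in the reorder buffer,
    # counting uops rather than instructions.
    uops_per_iter = sum(uop_count[i] for i in range(n))
    unroll = max(1, rob_size // uops_per_iter)

    # Instantiate the dependency graph over the unrolled kernel:
    # instruction i of copy k gets index k * n + i.
    unrolled_deps = []
    for k in range(unroll):
        for src, dst, distance in deps:
            if k + distance < unroll:
                unrolled_deps.append((k * n + src, (k + distance) * n + dst))

    length = critical_path_length(unroll * n, unrolled_deps, latency * unroll)
    return length / unroll
\end{lstlisting}
Without any loop-carried dependency, the critical path does not grow with the
unrolling factor, so the per-iteration cost shrinks as the reorder buffer gets
larger; a loop-carried chain, conversely, contributes its full latency to
every iteration.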
\section{Evaluation}
\begin{table}
    \centering
    \footnotesize
    \begin{tabular}{l r r r r r r r r}
        \toprule
        \textbf{Bencher} & \textbf{Datapoints} &
        \multicolumn{2}{c}{\textbf{Failures}} &
        \textbf{MAPE} & \textbf{Median} & \textbf{Q1} & \textbf{Q3} & \textbf{$K_\tau$} \\
        & & (Count) & (\%) & (\%) & (\%) & (\%) & (\%) \\
        \midrule
        A72 combined & 1767 & 9 & (0.51\,\%) & 19.26\,\% & 12.98\,\% & 5.57\,\% & 25.38\,\% & 0.75\\
        llvm-mca & 1775 & 1 & (0.06\,\%) & 32.60\,\% & 25.17\,\% & 8.84\,\% & 59.16\,\% & 0.69\\
        Osaca (backend) & 1773 & 3 & (0.17\,\%) & 49.33\,\% & 50.19\,\% & 33.53\,\% & 64.94\,\% & 0.67\\
        Osaca (crit. path) & 1773 & 3 & (0.17\,\%) & 84.02\,\% & 70.39\,\% & 40.37\,\% & 91.47\,\% & 0.24\\
        \bottomrule
    \end{tabular}
    \caption{Evaluation through \cesasme{} of the \acombined{} model}\label{table:a72_combined_stats}
\end{table}

\begin{figure}
    \centering
    \includegraphics[width=0.5\linewidth]{cesasme_a72combined_boxplot.svg}
    \caption{Evaluation through \cesasme{} of the \acombined{}
    model}\label{fig:a72_combined_stats_boxplot}
\end{figure}
We evaluate \acombined{} with \cesasme{} on the Raspberry Pi's Cortex A72,
using the same set of benchmarks as in \autoref{chap:CesASMe}, recompiled for
AArch64. As most of the code analyzers we studied do not run on the A72, we
can only compare \acombined{} to the baseline \perf{} measurement, to
\llvmmca{} and to \osaca{}. We use \llvmmca{} at version 18.1.8 and \osaca{}
at version 0.5.0. We present the results in
\autoref{table:a72_combined_stats} and in
\autoref{fig:a72_combined_stats_boxplot}.

Our \acombined{} model significantly outperforms \llvmmca{}: its median error
is roughly half of \llvmmca{}'s, and its third quartile is on par with
\llvmmca{}'s median. We expect that an iterative model in the style of
\llvmmca{} or \uica{}, built on our models' data, would outperform
\acombined{} yet more significantly.
\section{Towards a modular approach?}
These three models ---~frontend, backend and dependencies~--- are, however,
only very loosely dependent on each other. The critical path model, for
instance, requires each instruction's \uop{} count, which it obtains from the
frontend model, while the frontend model itself is purely standalone. Should a
standardized format or API for such models emerge, swapping \eg{} our backend
model for \uopsinfo{} and running our tool on Intel CPUs would be trivial.
Better yet, one could build a ``meta-model'' on top of these components,
implementing a combination logic far more refined than our simple
\texttt{max}-based model, onto which anyone could hot-plug \eg{} a custom
frontend model.
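No such standardized API exists today; purely as an illustration of how little
would be needed, the hypothetical interfaces sketched below would already
suffice to express our \texttt{max}-based combination while letting each
component be swapped independently:
\begin{lstlisting}[language=Python]
from abc import ABC, abstractmethod

class FrontendModel(ABC):
    @abstractmethod
    def uop_count(self, instruction):
        """Number of uops the instruction is decoded into."""
    @abstractmethod
    def cycles_per_iteration(self, kernel):
        """Frontend-induced throughput of the kernel."""

class BackendModel(ABC):
    @abstractmethod
    def cycles_per_iteration(self, kernel):
        """Backend-induced (port pressure) throughput of the kernel."""

class DependenciesModel(ABC):
    @abstractmethod
    def cycles_per_iteration(self, kernel, frontend):
        """Dependency-induced throughput; may query the frontend model,
        e.g. for uop counts when sizing the reorder buffer window."""

def max_combined(kernel, frontend, backend, dependencies):
    """The simple combination used in this chapter: keep the slowest
    of the three estimates."""
    return max(frontend.cycles_per_iteration(kernel),
               backend.cycles_per_iteration(kernel),
               dependencies.cycles_per_iteration(kernel, frontend))
\end{lstlisting}
A backend model derived from \uopsinfo{}, or a custom frontend model, would
then only need to implement the corresponding interface to be usable by any
analyzer built on such an API.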
Instead, the usual approach in this domain when trying out a new idea is to
build a full analyzer implementing it, as we did with \palmed{} for backend
models, or as \uica{} did with its focus on frontend analysis.

In hindsight, we advocate for the emergence of such a modular code analyzer.
It might not be as convenient or well-integrated as ``production-ready'' code
analyzers such as \llvmmca{} ---~which is officially packaged for Debian. It
could, however, greatly simplify the academic process of trying out a new idea
on any of the three main models, by decoupling them. It would also ease the
comparative evaluation of such ideas, eliminating many of the discrepancies
between experimental setups that make an actual comparison difficult ---~the
very issue that prompted us to build \cesasme{} in \autoref{chap:CesASMe}.
Indeed, with such a modular tool, it would be easy to run the same experiment,
under the same conditions, while changing only \eg{} the frontend model and
keeping a well-tried backend model.