\chapter{Wrapping it all up}\label{chap:wrapping_up}
In \autoref{chap:palmed}, we introduced \palmed{}, a framework to build a
backend model. Following up in \autoref{chap:frontend}, we introduced a
frontend model for the ARM-based Cortex A72 processor. Then, in
\autoref{chap:staticdeps}, we further introduced a dependency detection model.
Put together, these three parts cover the major bottlenecks that a code
analyzer must take into account.

The first two models ---~frontend and backend~--- already natively output a
cycles-per-iteration metric; we reduce our dependencies model to a
cycles-per-iteration metric as well by computing the \emph{critical path},
described below.

\medskip{}

To conclude this manuscript, we take a minimalist first approach to combining
these three models into a single predictor, which we call \acombined{}, by
taking the maximal prediction among the three models.
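As a minimal sketch ---~with made-up numbers, since this only illustrates the
combination itself~---, this amounts to:
\begin{lstlisting}[language=Python]
def a_combined(frontend_cycles, backend_cycles, dependencies_cycles):
    """Most pessimistic (largest) of the three per-iteration estimates
    produced by the frontend, backend and dependency models."""
    return max(frontend_cycles, backend_cycles, dependencies_cycles)

# e.g., a kernel bound by its dependency chains (values made up):
assert a_combined(1.5, 2.0, 3.25) == 3.25
\end{lstlisting}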
This method is clearly less precise than \eg{} \uica{}'s or \llvmmca{}'s
methods, which simulate iterations of the kernel while accounting for each
model. However, it allows us to quickly and easily evaluate an \emph{upper
bound} on the prediction error attainable with our models: a more refined tool
built on our models should obtain results at least as good as this method
---~and we would expect it to perform significantly better.
\section{Critical path model}
To account for dependency-induced bottlenecks, we compute the \emph{critical
path} through the data dependency graph of the microkernel; that is, the
longest path in this graph, with each edge weighted by the latency of its
source instruction. The length of this path is a lower bound on the execution
time, as each source instruction must be issued and yield its result before
the dependent instruction can in turn be issued. This approach is also taken
by \osaca{}~\cite{osaca2}.

In our case, we use the instruction latencies inferred for the A72 by
\palmed{} and its backend \pipedream{}.
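The computation itself is a simple longest-path traversal in topological
order. A minimal sketch follows, assuming ---~purely for illustration~--- that
the dependency graph is given as a list of producer/consumer pairs over
instruction indices, together with per-instruction latencies:
\begin{lstlisting}[language=Python]
def critical_path_length(n_insns, deps, latency):
    """Longest path in the dependency DAG, each edge being weighted by
    the latency of its source (producing) instruction.

    n_insns: number of instructions in the (unrolled) kernel;
    deps: list of (src, dst) pairs, dst consuming a result of src;
    latency: list of per-instruction latencies."""
    succs = [[] for _ in range(n_insns)]
    preds_left = [0] * n_insns
    for src, dst in deps:
        succs[src].append(dst)
        preds_left[dst] += 1

    # dist[i]: weight of the longest dependency chain ending at i,
    # computed in topological order (Kahn's algorithm).
    dist = [0] * n_insns
    ready = [i for i in range(n_insns) if preds_left[i] == 0]
    while ready:
        i = ready.pop()
        for j in succs[i]:
            # j cannot be issued before i has produced its result
            dist[j] = max(dist[j], dist[i] + latency[i])
            preds_left[j] -= 1
            if preds_left[j] == 0:
                ready.append(j)
    return max(dist, default=0)
\end{lstlisting}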
\medskip{}
So far, however, this method fails to account for out-of-order execution: the
latency of an instruction may be hidden by other computations that do not
depend on its result. This instruction-level parallelism is limited by the
size of the reorder buffer.

We thus unroll the kernel as many times as it fits in the reorder buffer
---~accounting for each instruction's \uop{} count, as we have a frontend
model readily available~--- and compute the critical path on this unrolled
version. Finally, we obtain the metric in cycles per iteration by dividing the
length of this critical path by the number of times the kernel was unrolled.
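Reusing the \texttt{critical\_path\_length} sketch above, and assuming
---~again purely for illustration~--- that each dependency carries an
iteration distance, that per-instruction \uop{} counts come from the frontend
model and that the reorder buffer size is a known parameter, the whole
procedure boils down to:
\begin{lstlisting}[language=Python]
def dependencies_cycles_per_iteration(kernel, deps, latency,
                                      uop_count, rob_size):
    """Dependency-induced cycles per iteration of the kernel.

    deps: (src, dst, iteration_distance) triples, distance 0 meaning an
    intra-iteration dependency; uop_count[i] and rob_size come from the
    frontend model and the CPU's parameters respectively."""
    n = len(kernel)
    # Unroll as many kernel copies as fit in the reorder buffer,
    # counting uops rather than instructions.
    uops_per_iter = sum(uop_count[i] for i in range(n))
    unroll = max(1, rob_size // uops_per_iter)

    # Instantiate the dependency graph over the unrolled kernel:
    # instruction i of copy k gets index k * n + i.
    unrolled_deps = []
    for k in range(unroll):
        for src, dst, distance in deps:
            if k + distance < unroll:
                unrolled_deps.append((k * n + src, (k + distance) * n + dst))

    length = critical_path_length(unroll * n, unrolled_deps, latency * unroll)
    return length / unroll
\end{lstlisting}
Without any loop-carried dependency, the critical path does not grow with the
unrolling factor, so the per-iteration cost shrinks as the reorder buffer gets
larger; a loop-carried chain, conversely, contributes its full latency to
every iteration.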
\section{Evaluation}
\begin{table}
    \centering
    \footnotesize
    \begin{tabular}{l r r r r r r r r}
        \toprule
        \textbf{Bencher} & \textbf{Datapoints} &
        \multicolumn{2}{c}{\textbf{Failures}} &
        \textbf{MAPE} & \textbf{Median} & \textbf{Q1} & \textbf{Q3} & \textbf{$K_\tau$} \\
        & & (Count) & (\%) & (\%) & (\%) & (\%) & (\%) \\
        \midrule
        A72 combined & 1767 & 9 & (0.51\,\%) & 19.26\,\% & 12.98\,\% & 5.57\,\% & 25.38\,\% & 0.75\\
        llvm-mca & 1775 & 1 & (0.06\,\%) & 32.60\,\% & 25.17\,\% & 8.84\,\% & 59.16\,\% & 0.69\\
        Osaca (backend) & 1773 & 3 & (0.17\,\%) & 49.33\,\% & 50.19\,\% & 33.53\,\% & 64.94\,\% & 0.67\\
        Osaca (crit. path) & 1773 & 3 & (0.17\,\%) & 84.02\,\% & 70.39\,\% & 40.37\,\% & 91.47\,\% & 0.24\\
        \bottomrule
    \end{tabular}
    \caption{Evaluation through \cesasme{} of the \acombined{} model}\label{table:a72_combined_stats}
\end{table}

\begin{figure}
    \centering
    \includegraphics[width=0.5\linewidth]{cesasme_a72combined_boxplot.svg}
    \caption{Evaluation through \cesasme{} of the \acombined{}
    model}\label{fig:a72_combined_stats_boxplot}
\end{figure}
We evaluate \acombined{} with \cesasme{} on the Raspberry Pi's Cortex A72,
using the same set of benchmarks as in \autoref{chap:CesASMe}, recompiled for
AArch64. As most of the code analyzers we studied do not run on the A72, we
can only compare \acombined{} to the baseline \perf{} measurement, to
\llvmmca{} and to \osaca{}. We use \llvmmca{} at version 18.1.8 and \osaca{}
at version 0.5.0. We present the results in
\autoref{table:a72_combined_stats} and in
\autoref{fig:a72_combined_stats_boxplot}.

Our \acombined{} model significantly outperforms \llvmmca{}: its median error
is roughly half of \llvmmca{}'s, and its third quartile is on par with
\llvmmca{}'s median. We expect that an iterative model in the style of
\llvmmca{} or \uica{}, built on our models' data, would outperform
\acombined{} yet more significantly.
\section{Towards a modular approach?}
These three models ---~frontend, backend and dependencies~--- are, however,
only very loosely dependent on each other. The critical path model, for
instance, requires each instruction's \uop{} count, which it obtains from the
frontend model, while the frontend model itself is purely standalone. Should a
standardized format or API for such models emerge, swapping \eg{} our backend
model for \uopsinfo{} and running our tool on Intel CPUs would be trivial.
Better yet, one could build a ``meta-model'' on top of these components,
implementing a combination logic far more refined than our simple
\texttt{max}-based model, onto which anyone could hot-plug \eg{} a custom
frontend model.
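No such standardized API exists today; purely as an illustration of how little
would be needed, the hypothetical interfaces sketched below would already
suffice to express our \texttt{max}-based combination while letting each
component be swapped independently:
\begin{lstlisting}[language=Python]
from abc import ABC, abstractmethod

class FrontendModel(ABC):
    @abstractmethod
    def uop_count(self, instruction):
        """Number of uops the instruction is decoded into."""
    @abstractmethod
    def cycles_per_iteration(self, kernel):
        """Frontend-induced throughput of the kernel."""

class BackendModel(ABC):
    @abstractmethod
    def cycles_per_iteration(self, kernel):
        """Backend-induced (port pressure) throughput of the kernel."""

class DependenciesModel(ABC):
    @abstractmethod
    def cycles_per_iteration(self, kernel, frontend):
        """Dependency-induced throughput; may query the frontend model,
        e.g. for uop counts when sizing the reorder buffer window."""

def max_combined(kernel, frontend, backend, dependencies):
    """The simple combination used in this chapter: keep the slowest
    of the three estimates."""
    return max(frontend.cycles_per_iteration(kernel),
               backend.cycles_per_iteration(kernel),
               dependencies.cycles_per_iteration(kernel, frontend))
\end{lstlisting}
A backend model derived from \uopsinfo{}, or a custom frontend model, would
then only need to implement the corresponding interface to be usable by any
analyzer built on such an API.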
Instead, the usual approach in this domain when trying out a new idea is to
build a full analyzer implementing it, as we did with \palmed{} for backend
models, or as \uica{} did with its focus on frontend analysis.

In hindsight, we advocate for the emergence of such a modular code analyzer.
It might not be as convenient or well-integrated as ``production-ready'' code
analyzers such as \llvmmca{} ---~which is officially packaged for Debian. It
could, however, greatly simplify the academic process of trying out a new idea
on any of the three main models, by decoupling them. It would also ease the
comparative evaluation of such ideas, eliminating many of the discrepancies
between experimental setups that make an actual comparison difficult ---~the
very issue that prompted us to build \cesasme{} in \autoref{chap:CesASMe}.
Indeed, with such a modular tool, it would be easy to run the same experiment,
under the same conditions, while changing only \eg{} the frontend model and
keeping a well-tried backend model.