\section{State of the art}\label{sec:sota}
Performance models for CPUs have previously been studied and applied to
static code performance analysis.
\subsection{Manufacturer-sourced data}
CPU manufacturers are expected to provide optimisation data for software
compiled for their processors. This data may be used by compiler authors,
within highly-optimised libraries, or when hand-tuning critical
sections of programs that require very high performance.
\medskip
Intel provides its \textit{Intel® 64 and IA-32 Architectures Optimization
Reference Manual}~\cite{ref:intel64_architectures_optim_reference_vol1},
regularly updated, whose nearly 1,000 pages give relevant details about
Intel's microarchitectures, such as block diagrams, pipelines, available
ports, etc. It further provides data tables with throughputs and latencies
for some instructions.
While the manual offers a wealth of important insights ---~from the
optimisation perspective~--- into their microarchitectures, it lacks
exhaustive and (conveniently) machine-parsable data tables, and does not
detail the port usage of each instruction.
ARM typically releases significantly more complete optimisation manuals for
its microarchitectures, such as the Cortex A72 optimisation
manual~\cite{ref:a72_optim}.
Since 2020, AMD has released lengthy and complete optimisation manuals for
its microarchitectures. For instance, the Zen4 optimisation
manual~\cite{ref:amd_zen4_optim_manual} contains both detailed insights into
the processor's workflow and ports, and a spreadsheet of about 3,400 x86
instructions ---~with operand variants broken down~--- listing their port
usage, throughput and latency. Such an effort, which certainly translates
into a non-negligible financial cost for the company, showcases the
importance of, and growing expectations placed on, such documents.
\medskip{}
As part of its EXEgesis project~\cite{tool:google_exegesis}, Google made an
effort to parse Intel's microarchitecture manuals, resulting in a
machine-usable data source of instruction details. The extracted data has
since been contributed to the \llvm{} compiler's data model. The project,
however, is no longer developed.
\subsection{Third-party instruction data}
The lack, for many microarchitectures, of reliable, exhaustive and
machine-usable data on individual instructions has driven academics to
obtain this data independently, through experimental approaches.
\medskip
Since 1996, Agner Fog has been maintaining tables of values useful for the
optimisation of x86 code~\cite{AgnerFog}. These tables, still maintained and
updated today, are often considered very accurate. They are the result of
benchmarking scripts developed by the author, combined with manual ---~and
thus tedious, given the size of microarchitectures~--- analysis, with
measurements mainly conducted through hardware counters. The main issue,
however, is that these tables are generated using hand-picked instructions
and benchmarks, which depend on hardware counters and features specific to
some CPU manufacturers. As such, while the tables are very helpful for the
supported x86 CPUs, the method does not scale to the abundance of CPUs on
which such tables may be useful ---~for instance, ARM processors, embedded
platforms, etc.
\medskip
Following the work of Agner Fog, Andreas Abel and Jan Reineke have designed
the \uopsinfo{} framework~\cite{uopsinfo}, striving to automate the previous
methodology. Their work, providing data tables for the vast majority of
instructions on many recent Intel microarchitectures, has recently been
enhanced to also support AMD architectures.
The \uopsinfo{} approach, detailed in their article, consists in finding
so-called \textit{blocking instructions} for each port which, used in
combination with the instruction to be benchmarked and with port-specific
hardware counters, yield a detailed analysis of the port usage of each
instruction ---~and even its breakdown into \uops{}. This makes for an
accurate and robust approach, but also limits it to microarchitectures
offering such counters, and requires a manual analysis of each
microarchitecture to be supported in order to find a fitting set of
blocking instructions. Although there is no theoretical guarantee that such
instructions exist, this should never be a problem in practice, as any
pragmatic microarchitecture design will lead to their existence.
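The blocking-instruction idea can be sketched as follows. This is a toy
illustration only, not \uopsinfo{}'s actual algorithm or data: the
three-port machine, the port mapping, the instruction names and the greedy
least-loaded scheduler standing in for the hardware are all assumptions made
for the sake of the example.

```python
# Toy sketch of the blocking-instruction idea: to test whether an
# instruction may execute on port p, saturate every *other* port with
# single-port "blocking" instructions; if the per-port counter of p then
# shows activity, the instruction under test can use port p.

# Hypothetical port mapping of a 3-port machine (assumed, not real data):
# each instruction lists the set of ports its uop may execute on.
PORT_MAP = {
    "blk0": {0}, "blk1": {1}, "blk2": {2},  # blocking instructions
    "add":  {0, 1},                         # instruction under test
}

def port_counters(kernel):
    """Stand-in for per-port hardware counters: dispatch each uop
    greedily to its least-loaded admissible port, return port loads."""
    load = [0, 0, 0]
    for insn in kernel:
        p = min(PORT_MAP[insn], key=lambda q: load[q])
        load[p] += 1
    return load

def admissible_ports(insn, reps=100):
    """Infer the set of ports `insn` may use via blocking instructions."""
    blockers = ["blk0", "blk1", "blk2"]
    found = set()
    for p in range(3):
        # Saturate every port except p with its blocking instruction,
        # then run the instruction under test many times.
        kernel = [b for i, b in enumerate(blockers) if i != p] * reps
        kernel += [insn] * reps
        if port_counters(kernel)[p] > 0:  # activity on p => admissible
            found.add(p)
    return found
```

On this toy model, `admissible_ports("add")` recovers `{0, 1}`: with all
other ports saturated, the scheduler is forced to reveal every port the
instruction can fall back on, which is exactly the information that a
single unconstrained run may hide.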
\subsection{Code analyzers and their models}
Going further than data extraction at the individual instruction level,
academics and industry actors interested in this domain now mostly work on
code analyzers, as described in \autoref{ssec:code_analyzers} above. Each
such tool embeds a model ---~or collection of models~--- on which its
inference is based, and whose definition, embedded data and method of
derivation vary from tool to tool. These tools often use, to some extent,
the data on individual instructions obtained either from the manufacturer
or from the third-party efforts mentioned above.
\medskip{}
The Intel Architecture Code Analyzer (\iaca)~\cite{iaca}, released by
Intel, is a fully closed-source analyzer, able to analyze assembly code for
Intel microarchitectures only. It draws on Intel's own knowledge of their
microarchitectures to make accurate predictions. This accuracy made it very
helpful to experts performing performance debugging on supported
microarchitectures. Yet, being closed-source and relying on data partially
unavailable to the public, the model is not entirely satisfactory to
academics or engineers trying to understand specific performance results.
This closedness also makes the tool vulnerable to deprecation, as the
community is unable to \textit{fork} the project ---~and indeed, \iaca{}
was discontinued by Intel in 2019. Thus, \iaca{} does not support recent
microarchitectures, and its binary was recently removed from official
download pages.
\medskip{}
In the meantime, the LLVM Machine Code Analyzer ---~or \llvmmca{}~--- was
developed as an internal tool at Sony, and was proposed for inclusion in
\llvm{} in 2018~\cite{llvm_mca_rfc}. This code analyzer is based on the
data tables that \llvm{} ---~a compiler~--- must maintain for each
microarchitecture in order to produce optimized code. The project has since
evolved to be fairly accurate, as seen in the experiments presented later
in this manuscript. It is the alternative Intel offers to \iaca{} following
the latter's deprecation.
\medskip{}
Another model, \osaca{}, was developed by Jan Laukemann \textit{et al.}
starting in 2017~\cite{osaca1,osaca2}. Its development stemmed from the
lack (at the time) of an open-source ---~and thus, open-model~---
alternative to \iaca{}. As a data source, \osaca{} makes use of Agner Fog's
data tables or \uopsinfo{}. It still lacks, however, a good model of the
frontend and of data dependencies, making it less accurate than other code
analyzers in our experiments later in this manuscript.
\medskip{}
Taking another approach entirely, \ithemal{} is a machine-learning-based
code analyzer striving to predict the reciprocal throughput of a given
kernel. The necessity of its training resulted in the development of
\bhive{}, a benchmark suite of kernels extracted from real-life programs
and libraries, along with a profiler measuring the runtime, in CPU cycles,
of a basic block isolated from its context. In our experiments, this
approach was significantly less accurate than those not based on machine
learning. In our opinion, its main issue, however, is that it is a
\textit{black-box model}: given a kernel, it can only predict its
reciprocal throughput. Doing so, even with perfect accuracy, does not
explain the source of a performance problem: the model is unable to help
detect which resource is the performance bottleneck of a kernel; in other
words, it quantifies a potential issue, but does not help in
\emph{explaining} it ---~or debugging it.
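The reciprocal throughput predicted by such tools is commonly formalised as
the asymptotic number of cycles per kernel iteration ---~the notation below
is introduced here only for illustration:
\[
    \operatorname{rthroughput}(\mathcal{K})
    \;=\;
    \lim_{n \to \infty} \frac{\operatorname{cycles}(\mathcal{K}^n)}{n},
\]
where $\mathcal{K}^n$ denotes $n$ back-to-back repetitions of the kernel
$\mathcal{K}$, executed in steady state.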
\medskip{}
In yet another approach, \pmevo{}~\cite{PMEvo} uses genetic algorithms to
infer, from scratch and through a benchmark-based approach, a port-mapping
of the processor it is running on. It is, to the best of our knowledge, the
first tool striving to compute a port-mapping model in a fully automated
way, as \palmed{} does (see \autoref{chap:palmed} later), albeit through a
completely different methodology. As detailed in \palmed{}'s
article~\cite{palmed}, it however suffers from a lack of scalability: as
generating a port-mapping for the few thousand x86-64 instructions would be
extremely time-consuming with this approach, the authors limit the
evaluation of their tool to the roughly 300 most common instructions.
\medskip{}
Abel and Reineke, the authors of \uopsinfo{}, recently released
\uica{}~\cite{uica}, a code analyzer for Intel microarchitectures based, on
the one hand, on the \uopsinfo{} tables as a port model and, on the other
hand, on manual reverse-engineering through hardware counters to model the
frontend and pipelines. We found this tool to be very accurate (see
experiments later in this manuscript), with results comparable to those of
\llvmmca{}. Its source code ---~under a free software license~--- is
self-contained and reasonably concise (about 2,000 lines of Python for the
main part), making it a good basis and baseline for experiments. It is,
however, closely tied by design to Intel microarchitectures, or to
microarchitectures very close to Intel's.