\section{State of the art}\label{sec:sota}
Performance models for CPUs have previously been studied and applied to
static code performance analysis.
\subsection{Manufacturer-sourced data}
CPU manufacturers are expected to provide optimisation data for software
compiled for their processors. This data may be used by compiler authors,
within highly-optimised libraries, or when hand-tuning critical
sections of programs that require very high performance.
\medskip
Intel provides its \textit{Intel® 64 and IA-32 Architectures Optimization
Reference Manual}~\cite{ref:intel64_architectures_optim_reference_vol1},
regularly updated, whose nearly 1,000 pages give relevant details about
Intel's microarchitectures, such as block diagrams, pipelines, available
ports, etc. It further provides data tables with throughputs and latencies
for some instructions.
While the manual offers a wealth of important insights ---~from the
optimisation perspective~--- into their microarchitectures, it lacks
exhaustive and (conveniently) machine-parsable data tables, and does not
detail the port usage of each instruction.
ARM typically releases significantly more complete optimisation manuals for
its microarchitectures, such as the Cortex A72 optimisation
manual~\cite{ref:a72_optim}.
Since 2020, AMD has released lengthy and complete optimisation manuals for
its microarchitectures. For instance, the Zen4 optimisation
manual~\cite{ref:amd_zen4_optim_manual} contains both detailed insights into
the processor's workflow and ports, and a spreadsheet of about 3,400 x86
instructions ---~with operand variants broken down~--- listing their port
usage, throughput and latency. Such an effort, which certainly translates
into a non-negligible financial cost for the company, showcases the
importance of, and growing expectations placed on, such documents.
\medskip{}
As part of its EXEgesis project~\cite{tool:google_exegesis}, Google made an
effort to parse Intel's microarchitecture manuals, resulting in a
machine-usable data source of instruction details. The extracted data has
since been contributed to the \llvm{} compiler's data model. The project,
however, is no longer developed.
\subsection{Third-party instruction data}
The lack, for many microarchitectures, of reliable, exhaustive and
machine-usable data on individual instructions has driven academics to
obtain this data independently, through experimental approaches.
\medskip
Since 1996, Agner Fog has been maintaining tables of values useful for the
optimisation of x86 code~\cite{AgnerFog}. These tables, still maintained and
updated today, are often considered very accurate. They are the result of
benchmarking scripts developed by the author, combined with manual ---~and
thus tedious, given the size of microarchitectures~--- analysis, with
measurements mainly conducted through hardware counters. The main issue,
however, is that these tables are generated using hand-picked instructions
and benchmarks, which depend on hardware counters and features specific to
some CPU manufacturers. As such, while the tables are very helpful for the
supported x86 CPUs, the method does not scale to the abundance of CPUs on
which such tables may be useful ---~for instance, ARM processors, embedded
platforms, etc.
\medskip
Following the work of Agner Fog, Andreas Abel and Jan Reineke have designed
the \uopsinfo{} framework~\cite{uopsinfo}, striving to automate the previous
methodology. Their work, providing data tables for the vast majority of
instructions on many recent Intel microarchitectures, has recently been
enhanced to also support AMD architectures.
The \uopsinfo{} approach, detailed in their article, consists in finding
so-called \textit{blocking instructions} for each port which, used in
combination with the instruction to be benchmarked and with port-specific
hardware counters, yield a detailed analysis of the port usage of each
instruction ---~and even its breakdown into \uops{}. This makes for an
accurate and robust approach, but also limits it to microarchitectures
offering such counters, and requires a manual analysis of each
microarchitecture to be supported in order to find a fitting set of
blocking instructions. Although there is no theoretical guarantee that such
instructions exist, this should never be a problem in practice, as any
pragmatic microarchitecture design will lead to their existence.
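The blocking-instruction idea can be sketched as follows. This is a toy
illustration only, not \uopsinfo{}'s actual algorithm or data: the
three-port machine, the port mapping, the instruction names and the greedy
least-loaded scheduler standing in for the hardware are all assumptions made
for the sake of the example.

```python
# Toy sketch of the blocking-instruction idea: to test whether an
# instruction may execute on port p, saturate every *other* port with
# single-port "blocking" instructions; if the per-port counter of p then
# shows activity, the instruction under test can use port p.

# Hypothetical port mapping of a 3-port machine (assumed, not real data):
# each instruction lists the set of ports its uop may execute on.
PORT_MAP = {
    "blk0": {0}, "blk1": {1}, "blk2": {2},  # blocking instructions
    "add":  {0, 1},                         # instruction under test
}

def port_counters(kernel):
    """Stand-in for per-port hardware counters: dispatch each uop
    greedily to its least-loaded admissible port, return port loads."""
    load = [0, 0, 0]
    for insn in kernel:
        p = min(PORT_MAP[insn], key=lambda q: load[q])
        load[p] += 1
    return load

def admissible_ports(insn, reps=100):
    """Infer the set of ports `insn` may use via blocking instructions."""
    blockers = ["blk0", "blk1", "blk2"]
    found = set()
    for p in range(3):
        # Saturate every port except p with its blocking instruction,
        # then run the instruction under test many times.
        kernel = [b for i, b in enumerate(blockers) if i != p] * reps
        kernel += [insn] * reps
        if port_counters(kernel)[p] > 0:  # activity on p => admissible
            found.add(p)
    return found
```

On this toy model, `admissible_ports("add")` recovers `{0, 1}`: with all
other ports saturated, the scheduler is forced to reveal every port the
instruction can fall back on, which is exactly the information that a
single unconstrained run may hide.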
\subsection{Code analyzers and their models}
Going further than data extraction at the individual instruction level,
academics and industry actors interested in this domain now mostly work on
code analyzers, as described in \autoref{ssec:code_analyzers} above. Each
such tool embeds a model ---~or collection of models~--- on which its
inference is based, and whose definition, embedded data and method of
derivation vary from tool to tool. These tools often use, to some extent,
the data on individual instructions obtained either from the manufacturer
or from the third-party efforts mentioned above.
\medskip{}
The Intel Architecture Code Analyzer (\iaca)~\cite{iaca}, released by
Intel, is a fully closed-source analyzer, able to analyze assembly code for
Intel microarchitectures only. It draws on Intel's own knowledge of their
microarchitectures to make accurate predictions. This accuracy made it very
helpful to experts performing performance debugging on supported
microarchitectures. Yet, being closed-source and relying on data partially
unavailable to the public, the model is not entirely satisfactory to
academics or engineers trying to understand specific performance results.
This closedness also makes the tool vulnerable to deprecation, as the
community is unable to \textit{fork} the project ---~and indeed, \iaca{}
was discontinued by Intel in 2019. Thus, \iaca{} does not support recent
microarchitectures, and its binary was recently removed from official
download pages.
\medskip{}
In the meantime, the LLVM Machine Code Analyzer ---~or \llvmmca{}~--- was
developed as an internal tool at Sony, and was proposed for inclusion in
\llvm{} in 2018~\cite{llvm_mca_rfc}. This code analyzer is based on the
data tables that \llvm{} ---~a compiler~--- must maintain for each
microarchitecture in order to produce optimized code. The project has since
evolved to be fairly accurate, as seen in the experiments presented later
in this manuscript. It is the alternative Intel offers to \iaca{} following
the latter's deprecation.
\medskip{}
Another model, \osaca{}, was developed by Jan Laukemann \textit{et al.}
starting in 2017~\cite{osaca1,osaca2}. Its development stemmed from the
lack (at the time) of an open-source ---~and thus, open-model~---
alternative to \iaca{}. As a data source, \osaca{} makes use of Agner Fog's
data tables or \uopsinfo{}. It still lacks, however, a good model of the
frontend and of data dependencies, making it less accurate than other code
analyzers in our experiments later in this manuscript.
\medskip{}
Taking another approach entirely, \ithemal{} is a machine-learning-based
code analyzer striving to predict the reciprocal throughput of a given
kernel. The necessity of its training resulted in the development of
\bhive{}, a benchmark suite of kernels extracted from real-life programs
and libraries, along with a profiler measuring the runtime, in CPU cycles,
of a basic block isolated from its context. In our experiments, this
approach was significantly less accurate than those not based on machine
learning. In our opinion, its main issue, however, is that it is a
\textit{black-box model}: given a kernel, it can only predict its
reciprocal throughput. Doing so, even with perfect accuracy, does not
explain the source of a performance problem: the model is unable to help
detect which resource is the performance bottleneck of a kernel; in other
words, it quantifies a potential issue, but does not help in
\emph{explaining} it ---~or debugging it.
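The reciprocal throughput predicted by such tools is commonly formalised as
the asymptotic number of cycles per kernel iteration ---~the notation below
is introduced here only for illustration:
\[
    \operatorname{rthroughput}(\mathcal{K})
    \;=\;
    \lim_{n \to \infty} \frac{\operatorname{cycles}(\mathcal{K}^n)}{n},
\]
where $\mathcal{K}^n$ denotes $n$ back-to-back repetitions of the kernel
$\mathcal{K}$, executed in steady state.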
\medskip{}
In yet another approach, \pmevo{}~\cite{PMEvo} uses genetic algorithms to
infer, from scratch and through a benchmark-based approach, a port-mapping
of the processor it is running on. It is, to the best of our knowledge, the
first tool striving to compute a port-mapping model in a fully automated
way, as \palmed{} does (see \autoref{chap:palmed} later), albeit through a
completely different methodology. As detailed in \palmed{}'s
article~\cite{palmed}, it however suffers from a lack of scalability: as
generating a port-mapping for the few thousand x86-64 instructions would be
extremely time-consuming with this approach, the authors limit the
evaluation of their tool to the roughly 300 most common instructions.
\medskip{}
Abel and Reineke, the authors of \uopsinfo{}, recently released
\uica{}~\cite{uica}, a code analyzer for Intel microarchitectures based, on
the one hand, on the \uopsinfo{} tables as a port model and, on the other
hand, on manual reverse-engineering through hardware counters to model the
frontend and pipelines. We found this tool to be very accurate (see
experiments later in this manuscript), with results comparable to those of
\llvmmca{}. Its source code ---~under a free software license~--- is
self-contained and reasonably concise (about 2,000 lines of Python for the
main part), making it a good basis and baseline for experiments. It is,
however, closely tied by design to Intel microarchitectures, or to
microarchitectures very close to Intel's.