\section{State of the art}\label{sec:sota}

Performance models for CPUs have previously been studied and applied to static
code performance analysis.

\subsection{Manufacturer-sourced data}

Manufacturers of CPUs are expected to provide optimisation data for software
compiled for their processors. This data may be used by compiler authors,
within highly-optimised libraries, or in the optimisation of
performance-critical sections of programs.

\medskip{}

Intel provides its \textit{Intel® 64 and IA-32 Architectures Optimization
Reference Manual}~\cite{ref:intel64_architectures_optim_reference_vol1},
regularly updated, whose nearly 1,000 pages give relevant details on Intel's
microarchitectures, such as block diagrams, pipelines, available ports, etc. It
further provides data tables with the throughput and latency of some
instructions. While the manual provides a wealth of important insights ---~from
the optimisation perspective~--- into these microarchitectures, it lacks
exhaustive and (conveniently) machine-parsable data tables, and does not detail
the port usage of each instruction.

ARM typically releases considerably more complete optimisation manuals for its
microarchitectures, such as the Cortex-A72 optimisation
manual~\cite{ref:a72_optim}.

Since 2020, AMD has released lengthy and complete optimisation manuals for its
microarchitectures. For instance, the Zen4 optimisation
manual~\cite{ref:amd_zen4_optim_manual} contains both detailed insights into
the processor's workflow and ports, and a spreadsheet of about 3,400 x86
instructions ---~with operand variants broken down~--- along with their port
usage, throughput and latencies. Such an effort, which certainly translates
into a non-negligible financial cost for the company, showcases the importance
of, and the recent expectations placed on, such documents.

\medskip{}

As a part of its EXEgesis project~\cite{tool:google_exegesis}, Google made an
effort to parse Intel's microarchitecture manuals, resulting in a
machine-usable source of instruction data. The extracted data has since been
contributed to the \llvm{} compiler's data model. The project, however, is no
longer developed.

\subsection{Third-party instruction data}

The lack, for many microarchitectures, of reliable, exhaustive and
machine-usable data on individual instructions has driven academics to
independently obtain this data through an experimental approach.

\medskip{}

Since 1996, Agner Fog has been maintaining tables of values useful for the
optimisation of x86 code~\cite{AgnerFog}. These tables, still maintained and
updated today, are widely considered very accurate. They are the result of
benchmarking scripts developed by the author, subjected to manual ---~and thus
tedious, given the size of microarchitectures~--- analysis, and mainly based on
hardware counter measurements. The main issue, however, is that these tables
are generated using hand-picked instructions and benchmarks, relying on
hardware counters and features specific to some CPU manufacturers. As such,
while these tables are very helpful for the supported x86 CPUs, the method does
not scale to the abundance of CPUs on which such tables would be useful ---~for
instance, ARM processors, embedded platforms, etc.

\medskip{}

Following the work of Agner Fog, Andreas Abel and Jan Reineke designed the
\uopsinfo{} framework~\cite{uopsinfo}, striving to automate the previous
methodology. Their work, which provides data tables for the vast majority of
instructions on many recent Intel microarchitectures, was recently enhanced to
also support AMD architectures.

The \uopsinfo{} approach, detailed in their article, consists in finding, for
each port, so-called \textit{blocking instructions}: instructions which, used
in combination with the instruction to be benchmarked and with port-specific
hardware counters, yield a detailed analysis of the port usage of each
instruction ---~and even its break-down into \uops{}. This makes for an
accurate and robust approach, but also limits it to microarchitectures offering
such counters, and requires a manual analysis of each targeted
microarchitecture in order to find a fitting set of blocking instructions.
Although there is no theoretical guarantee that such instructions exist, this
should never be a problem in practice, as any pragmatic microarchitecture
design leads to their existence.

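The blocking-instruction idea can be illustrated with a toy simulation. The port mapping, instruction names and greedy scheduler below are all hypothetical stand-ins, not \uopsinfo{}'s actual algorithm: for each port, all other ports are saturated with their blocking instructions, so that any \uops{} of the target instruction able to execute on the remaining port land there and show up in its (simulated) counter.

```python
# Toy model of the blocking-instruction methodology (hypothetical port
# mapping and instruction names; not uops.info's actual algorithm).
# An instruction is a list of micro-ops; each micro-op is the set of
# ports it may execute on.
PORT_MAP = {
    "add": [{0, 1, 5}],   # one micro-op, three candidate ports
    "div": [{0}],         # one micro-op, bound to port 0
    "blk0": [{0}], "blk1": [{1}], "blk5": [{5}],  # blocking instructions
}

def run(kernel):
    """Greedy scheduler standing in for the CPU: each micro-op goes to its
    least-loaded allowed port; returns per-port micro-op counters (the
    stand-in for port-specific hardware counters)."""
    counters = {p: 0 for p in (0, 1, 5)}
    for instr in kernel:
        for ports in PORT_MAP[instr]:
            p = min(ports, key=lambda q: counters[q])
            counters[p] += 1
    return counters

def port_usage(instr, n=100):
    """Infer which ports `instr` can use: for each port p, saturate all
    *other* ports with their blocking instructions, so that any micro-op
    of `instr` able to run on p lands on p and raises its counter."""
    blockers = {0: "blk0", 1: "blk1", 5: "blk5"}
    usable = set()
    for p in blockers:
        kernel = [b for q, b in blockers.items() if q != p for _ in range(n)]
        baseline = run(kernel)[p]
        loaded = run(kernel + [instr] * n)[p]
        if loaded - baseline >= n:
            usable.add(p)
    return usable
```

Under this toy mapping, `port_usage("add")` recovers the full set of ports 0, 1 and 5, while `port_usage("div")` recovers only port 0; the real methodology must additionally handle instructions that decompose into several \uops{}.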
\subsection{Code analyzers and their models}

Going further than data extraction at the level of individual instructions,
academics and industry practitioners interested in this domain now mostly work
on code analyzers, as described in \autoref{ssec:code_analyzers} above. Each
such tool embeds a model ---~or collection of models~--- on which its inference
is based, and whose definition, embedded data and derivation vary from tool to
tool. These tools often use, to some extent, the data on individual
instructions obtained either from the manufacturer or from the third-party
efforts mentioned above.

\medskip{}

The Intel Architecture Code Analyzer (\iaca)~\cite{iaca}, released by Intel, is
a closed-source analyzer able to analyze assembly code for Intel
microarchitectures only. It draws on Intel's own knowledge of its
microarchitectures to make accurate predictions. This accuracy made it very
helpful to experts performing performance debugging on the supported
microarchitectures. Yet, being closed-source and relying on data partially
unavailable to the public, the model is not fully satisfactory to academics or
engineers trying to understand specific performance results. This also makes
the tool vulnerable to deprecation, as the community is unable to
\textit{fork} the project ---~and indeed, \iaca{} was discontinued by Intel in
2019. Thus, \iaca{} does not support recent microarchitectures, and its binary
was recently removed from official download pages.

\medskip{}

In the meantime, the LLVM Machine Code Analyzer ---~or \llvmmca{}~--- was
developed as an internal tool at Sony, and was proposed for inclusion in
\llvm{} in 2018~\cite{llvm_mca_rfc}. This code analyzer is based on the data
tables that \llvm{} ---~a compiler~--- has to maintain for each
microarchitecture in order to produce optimized code. The project has since
evolved to be fairly accurate, as seen in the experiments presented later in
this manuscript. It is the alternative Intel has offered since \iaca{}'s
deprecation.

\medskip{}

Another model, \osaca{}, was developed by Jan Laukemann \textit{et al.}
starting in 2017~\cite{osaca1,osaca2}. Its development stemmed from the lack,
at the time, of an open-source ---~and thus, open-model~--- alternative to
\iaca{}. As a data source, \osaca{} uses Agner Fog's data tables or
\uopsinfo{}. It still lacks, however, a good model of the frontend and of data
dependencies, making it less accurate than other code analyzers in the
experiments presented later in this manuscript.

\medskip{}

Taking another approach entirely, \ithemal{} is a machine-learning-based code
analyzer striving to predict the reciprocal throughput of a given kernel. The
needs of its training led to the development of \bhive{}, a benchmark suite of
kernels extracted from real-life programs and libraries, along with a profiler
measuring the runtime, in CPU cycles, of a basic block isolated from its
context. This approach, in our experiments, was significantly less accurate
than those not based on machine learning. In our opinion, however, its main
issue is that it is a \textit{black-box model}: given a kernel, it can only
predict its reciprocal throughput. Doing so, even with perfect accuracy, does
not explain the source of a performance problem: the model is unable to help
detect which resource is the performance bottleneck of a kernel; in other
words, it quantifies a potential issue, but does not help in
\emph{explaining} ---~or debugging~--- it.

\medskip{}

In yet another approach, \pmevo{}~\cite{PMEvo} uses genetic algorithms to
infer, from scratch and through a benchmark-oriented approach, a port mapping
of the processor it runs on. It is, to the best of our knowledge, the first
tool striving to compute a port-mapping model in a fully automated way, as
\palmed{} does (see \autoref{chap:palmed} later), although through a completely
different methodology. As detailed in \palmed{}'s article~\cite{palmed}, it
however suffers from a lack of scalability: as generating a port mapping for
the few thousand x86-64 instructions would be extremely time-consuming with
this approach, the authors limit the evaluation of their tool to the roughly
300 most common instructions.

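The benchmark-driven idea behind this family of tools can be sketched in miniature: candidate port mappings are scored by how well a simple throughput model reproduces measured benchmark results, and the search keeps the best-scoring candidate. Everything below is an assumed toy, not \pmevo{}'s implementation: a hypothetical two-port machine, an invented three-instruction set, measurements synthesised from a hidden ground truth, and exhaustive enumeration in place of the evolutionary search.

```python
import itertools

PORTS = (0, 1)
INSTRS = ("add", "mul", "shl")

def predict(mapping, kernel):
    """Toy throughput model: each instruction issues one micro-op, spread
    evenly over its candidate ports; the busiest port bounds the kernel's
    reciprocal throughput (cycles per iteration)."""
    load = {p: 0.0 for p in PORTS}
    for instr in kernel:
        for p in mapping[instr]:
            load[p] += 1.0 / len(mapping[instr])
    return max(load.values())

# Benchmarks: every single instruction and every pair of instructions.
KERNELS = ([[i] for i in INSTRS]
           + [list(k) for k in itertools.combinations(INSTRS, 2)])

# "Measurements", synthesised here from a hidden ground-truth mapping.
TRUTH = {"add": {0, 1}, "mul": {0}, "shl": {1}}
MEASURED = [predict(TRUTH, k) for k in KERNELS]

def fitness_error(mapping):
    """Squared error between model predictions and the measurements."""
    return sum((predict(mapping, k) - m) ** 2
               for k, m in zip(KERNELS, MEASURED))

# Stand-in for the evolutionary search: the space is tiny here (3
# non-empty port subsets per instruction), so we enumerate all candidates.
subsets = [set(s) for n in (1, 2) for s in itertools.combinations(PORTS, n)]
best = min(
    (dict(zip(INSTRS, choice))
     for choice in itertools.product(subsets, repeat=len(INSTRS))),
    key=fitness_error,
)
```

A mapping recovered this way is only identifiable up to a permutation of ports (swapping ports 0 and 1 in the ground truth yields the exact same measurements), so the search is judged on its error reaching zero rather than on exact equality with the hidden mapping.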
\medskip{}

Abel and Reineke, the authors of \uopsinfo{}, recently released
\uica{}~\cite{uica}, a code analyzer for Intel microarchitectures based on the
\uopsinfo{} tables as a port model on the one hand, and on manual
reverse-engineering through hardware counters to model the frontend and
pipelines on the other hand. We found this tool to be very accurate (see the
experiments later in this manuscript), with results comparable to those of
\llvmmca{}. Its source code ---~under a free software license~--- is
self-contained and reasonably concise (about 2,000 lines of Python for the
main part), making it a good basis and baseline for experiments. It is,
however, closely tied by design to Intel microarchitectures, or to
microarchitectures very similar to Intel's.