SotA: some writeup. Lacks uiCA.
parent bd7b3b8ad6, commit bb932f93c6
6 changed files with 171 additions and 4 deletions

@@ -1,4 +1,4 @@
\section{Kernel optimization and code analyzers}\label{ssec:code_analyzers}

Optimizing a program, in most contexts, mainly means optimizing it from an
algorithmic point of view ---~using efficient data structures, running some

@@ -3,11 +3,60 @@
Performance models for CPUs have previously been studied and applied to
static code performance analysis.

\subsection{Manufacturer-sourced data}

CPU manufacturers are expected to provide optimisation data for software
compiled for their processors. This data may be used by compiler authors,
within highly-optimized libraries, or in the optimisation of critical
sections of programs that require very high performance.

\medskip{}

Intel provides its \textit{Intel® 64 and IA-32 Architectures Optimization
Reference Manual}~\cite{ref:intel64_architectures_optim_reference_vol1},
regularly updated, whose nearly 1,000 pages give relevant details of Intel's
microarchitectures, such as block diagrams, pipelines, available ports, etc. It
further gives data tables with throughput and latencies for some instructions.
While the manual provides a huge collection of important insights --~from the
optimisation perspective~-- on its microarchitectures, it lacks exhaustive
and (conveniently) machine-parsable data tables and does not detail the port
usage of each instruction.

ARM typically releases far more complete optimisation manuals for its
microarchitectures, such as the Cortex A72 optimisation
manual~\cite{ref:a72_optim}.

AMD has, since 2020, released lengthy and thorough optimisation manuals for its
microarchitectures. For instance, the Zen4 optimisation
manual~\cite{ref:amd_zen4_optim_manual} contains both detailed insights on the
processor's workflow and ports, and a spreadsheet of about 3,400 x86
instructions --~with operand variants broken down~-- listing their port usage,
throughput and latencies. Such an effort, which certainly translates to a
non-negligible financial cost for the company, showcases the importance of,
and the recent expectations placed on, such documents.

\medskip{}

As part of its EXEgesis project~\cite{tool:google_exegesis}, Google made an
effort to parse Intel's microarchitecture manuals, resulting in a
machine-usable data source of instruction details. The extracted data has
since been contributed to the llvm compiler's data model. The project,
however, is no longer developed.

\subsection{Third-party instruction data}

The lack, for many microarchitectures, of reliable, exhaustive and
machine-usable data for individual instructions has driven academics to
obtain this data independently, through an experimental approach.

\medskip{}

Since 1996, Agner Fog has been maintaining tables of values useful for
optimisation purposes for x86 instructions~\cite{AgnerFog}. These tables, still
maintained and updated today, are often considered very accurate. They are the
result of benchmarking scripts developed by the author, subject to manual
---~and thus tedious, given the size of microarchitectures~--- analysis, and
mainly rely on hardware counter measurements. The main
issue, however, is that those tables are generated through the use of
hand-picked instructions and benchmarks, depending on hardware
counters and features specific to some CPU manufacturers. As such, while these

@@ -21,5 +70,94 @@ Following the work of Agner Fog, Andreas Abel and Jan Reineke have designed the
\uopsinfo{} framework~\cite{uopsinfo}, striving to automate the previous
methodology. Their work, providing data tables for the vast majority of
instructions on many recent Intel microarchitectures, has recently been
enhanced to also support AMD architectures.

The \uopsinfo{} approach, detailed in their article, consists in finding
so-called \textit{blocking instructions} for each port which, used in
combination with the instruction to be benchmarked and port-specific hardware
counters, yield a detailed analysis of the port usage of each instruction
---~and even its breakdown into \uops{}. This makes for an accurate and robust
approach, but it also limits the method to microarchitectures offering such
counters, and requires a manual analysis of each microarchitecture to be
supported in order to find a fitting set of blocking instructions. Although
there is no theoretical guarantee that such instructions exist, this should
never be a problem in practice, as any pragmatic microarchitecture design
gives rise to suitable candidates.
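
As an illustration of this methodology, the following Python sketch shows one
way the blocking-instruction idea can be expressed. It is a deliberate
simplification rather than \uopsinfo{}'s actual procedure, and the helper
\texttt{run\_with\_counters}, which would execute a measurement kernel and
return per-port \uops{} counts read from hardware counters, is hypothetical.

\begin{verbatim}
# Simplified illustration (not uops.info's actual algorithm).
# `run_with_counters(kernel)` is a hypothetical helper that executes the
# kernel in a measurement loop and returns, for each port, the number of
# uops dispatched to it, as reported by port-specific hardware counters.

def port_usage(instr, blockers, run_with_counters, n=100):
    """For each port p, saturate p with its blocking instruction while
    executing n copies of `instr`; uops of `instr` that can execute
    elsewhere migrate away, so the residual count on p (minus the
    blocker's own uops) approximates how many of its uops need p."""
    must_use = {}
    for port, blocker in blockers.items():
        kernel = [instr] * n + [blocker] * n
        counts = run_with_counters(kernel)        # {port: uop count}
        must_use[port] = max(0, counts[port] - n) / n
    return must_use
\end{verbatim}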
\subsection{Code analyzers and their models}

Going further than data extraction at the individual instruction level,
academic and industrial work in this domain now mostly focuses on
code analyzers, as described in \autoref{ssec:code_analyzers} above. Each such
tool embeds a model --~or collection of models~-- on which to base its
inference, and whose definition, embedded data and derivation method vary from
tool to tool. These tools often use, to some extent, the data on individual
instructions obtained either from the manufacturer or from the third-party
efforts mentioned above.

\medskip{}

The Intel Architecture Code Analyzer (\iaca)~\cite{iaca}, released by
Intel, is a fully closed-source analyzer able to analyze assembly code for
Intel microarchitectures only. It draws on Intel's own knowledge of its
microarchitectures to make accurate predictions. This accuracy made it very
helpful to experts doing performance debugging on supported
microarchitectures. Yet, being closed-source and relying on data that is
partially unavailable to the public, the model is not fully satisfactory to
academics or engineers trying to understand specific performance results. This
also makes it vulnerable to deprecation, as the community is unable to
\textit{fork} the project --~and indeed, \iaca{} was discontinued by Intel
in 2019. As a result, \iaca{} does not support recent microarchitectures, and
its binary was recently removed from official download pages.

\medskip{}

In the meantime, the LLVM Machine Code Analyzer --~or \llvmmca{}~-- was
developed as an internal tool at Sony and was proposed for inclusion in
\llvm{} in 2018~\cite{llvm_mca_rfc}. This code analyzer is based on the
data tables that \llvm{} --~a compiler~-- has to maintain for each
microarchitecture in order to produce optimized code. The project has since
evolved to be fairly accurate, as seen in the experiments presented later in
this manuscript. It is the alternative Intel offers following \iaca{}'s
deprecation.
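
As an illustration, \llvmmca{} is typically driven directly on a small
assembly kernel. The Python sketch below runs it on a two-instruction kernel
and extracts the predicted block reciprocal throughput; it assumes that
\llvmmca{} is available in the \texttt{PATH}, and the exact output layout may
vary across \llvm{} versions.

\begin{verbatim}
# Run llvm-mca on a small AT&T-syntax kernel and extract its predicted
# block reciprocal throughput.  Assumes llvm-mca is installed; the exact
# output layout may differ slightly between LLVM versions.
import subprocess

KERNEL = """
vaddps  %ymm0, %ymm1, %ymm1
vmulps  %ymm2, %ymm3, %ymm3
"""

result = subprocess.run(
    ["llvm-mca", "-mcpu=skylake", "-iterations=100"],
    input=KERNEL, capture_output=True, text=True, check=True,
)
for line in result.stdout.splitlines():
    if line.startswith("Block RThroughput"):
        print(line)   # predicted reciprocal throughput of the kernel
\end{verbatim}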
\medskip{}

Another model, \osaca{}, was developed by Jan Laukemann \textit{et al.}
|
||||
starting in 2017~\cite{osaca1,osaca2}. Its development stemmed from the lack
|
||||
(at the time) of an open-source --~and thus, open-model~-- alternative to IACA.
|
||||
As a data source, \osaca{} makes use of Agner Fog's data tables or \uopsinfo{}.
|
||||
It still lacks, however, a good model of frontend and data dependencies, making
|
||||
it less performant than other code analyzers in our experiments later in this
|
||||
manuscript.
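
To make the dependency aspect concrete, the toy Python sketch below computes
the latency bound of one loop iteration as the longest path through its
dependency graph --~the kind of information a dependency model must capture on
top of port pressure. It is a generic illustration under simplified
assumptions, not \osaca{}'s implementation.

\begin{verbatim}
# Toy latency-bound computation: the longest path through the dependency
# DAG of one loop iteration.  Generic illustration only.
def latency_bound(latencies, deps):
    """latencies[i]: latency of instruction i, in cycles;
    deps[i]: indices of earlier instructions that i depends on."""
    finish = []
    for i, lat in enumerate(latencies):
        start = max((finish[j] for j in deps[i]), default=0)
        finish.append(start + lat)
    return max(finish)

# A three-instruction chain (4-cycle mul, 4-cycle mul, 1-cycle add):
print(latency_bound([4, 4, 1], [[], [0], [1]]))   # -> 9 cycles
\end{verbatim}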
\medskip{}

Taking another approach entirely, \ithemal{} is a machine-learning-based code
analyzer striving to predict the reciprocal throughput of a given kernel. The
need to train it resulted in the development of \bhive{}, a benchmark
suite of kernels extracted from real-life programs and libraries, along with a
profiler measuring the runtime, in CPU cycles, of a basic block isolated from
its context. This approach, in our experiments, was significantly less accurate
than those not based on machine learning. In our opinion, its main issue,
however, is that it is a \textit{black-box model}: given a kernel, it can only
predict its reciprocal throughput. Doing so, even with perfect accuracy, does
not explain the source of a performance problem: the model is unable to help in
detecting which resource is the performance bottleneck of a kernel; in other
words, it quantifies a potential issue, but does not help in \emph{explaining}
it --~or debugging it.

\medskip{}

In yet another approach, \pmevo{}~\cite{PMEvo} uses genetic algorithms to
infer, from scratch and through a benchmark-oriented approach, a port mapping
of the processor it runs on. It is, to the best of our knowledge, the first
tool striving to compute a port-mapping model in a fully automated way, as
\palmed{} does (see \autoref{chap:palmed} later), although through a completely
different methodology. As detailed in \palmed{}'s article~\cite{palmed}, it
however suffers from a lack of scalability: as generating a port mapping for
the few thousand x86-64 instructions would be extremely time-consuming with
this approach, the authors limit the evaluation of their tool to the roughly
300 most common instructions.
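
As a rough illustration of the underlying principle --~and not of \pmevo{}'s
actual formulation~--, the Python sketch below scores a candidate port mapping
by comparing a simple port-pressure bound derived from it against measured
benchmark throughputs; such a score is the kind of fitness function an
evolutionary search can optimize. The data structures used here (instructions
mapped to sets of admissible ports, measured cycle counts) are assumed inputs.

\begin{verbatim}
# Simplified illustration of scoring a candidate port mapping against
# measured benchmarks (not PMEvo's actual algorithm).
from itertools import combinations

def predicted_cycles(kernel, mapping, ports):
    """Port-pressure bound: for every subset S of ports, the uops that
    can only run on ports of S need at least (number of such uops)/|S|
    cycles per iteration; take the worst such bound."""
    uops = [u for instr in kernel for u in mapping[instr]]
    bound = 0.0
    for r in range(1, len(ports) + 1):
        for subset in combinations(ports, r):
            stuck = sum(1 for u in uops if u <= set(subset))
            bound = max(bound, stuck / r)
    return bound

def fitness(mapping, benchmarks, ports):
    """Mean relative error between predicted and measured cycle counts;
    `benchmarks` maps kernels (tuples of instructions) to measurements."""
    errors = [abs(predicted_cycles(k, mapping, ports) - cycles) / cycles
              for k, cycles in benchmarks.items()]
    return sum(errors) / len(errors)

# Example: two ports, one instruction that may use either port.
mapping = {"addq": [{"p0", "p1"}]}
print(predicted_cycles(("addq", "addq"), mapping, ["p0", "p1"]))  # 1.0
\end{verbatim}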
\todo{uiCA}

@@ -77,6 +77,14 @@
  title = {{LLVM} Machine Code Analyzer},
  howpublished = {\url{https://llvm.org/docs/CommandGuide/llvm-mca.html}},
}
@misc{llvm_mca_rfc,
  title = {{[RFC] llvm-mca: a static performance analysis tool}},
  author = {Andrea Di Biagio},
  howpublished = {\url{https://lists.llvm.org/pipermail/llvm-dev/2018-March/121490.html}},
  month = {March},
  year = {2018},
  note = {Request for comments on the llvm-dev mailing-list},
}

@misc{iaca,
  title = {Intel Architecture Code Analyzer ({IACA})},

@@ -19,6 +19,20 @@
  publisher = {JSTOR}
}

@manual{ref:amd_zen4_optim_manual,
  title = {Software Optimization Guide for the AMD Zen4 Microarchitecture},
  organization = {Advanced Micro Devices (AMD)},
  year = {2023},
  month = {January},
  note = {Publication number 57647},
}
@manual{ref:intel64_architectures_optim_reference_vol1,
  title = {Intel® 64 and IA-32 Architectures Optimization Reference Manual
           Volume 1},
  organization = {Intel Corporation},
  year = {2023},
  month = {September},
}
@manual{ref:intel64_software_dev_reference_vol1,
  title = {Intel® 64 and IA-32 Architectures Software Developer’s Manual,
           Volume 1},

@@ -116,6 +116,12 @@
  howpublished = {\url{https://www.qemu.org}}
}

@misc{tool:google_exegesis,
  title = {{EXEgesis}},
  author = {{Google}},
  howpublished = {\url{https://github.com/google/EXEgesis}},
}

@misc{intel_mkl,
  title = {oneAPI Math Kernel Library ({oneMKL})},
  author = {{Intel}},

@@ -42,6 +42,7 @@
\newcommand{\perf}{\texttt{perf}}
\newcommand{\qemu}{\texttt{QEMU}}
\newcommand{\iaca}{\texttt{IACA}}
\newcommand{\llvm}{\texttt{llvm}}
\newcommand{\llvmmca}{\texttt{llvm-mca}}
\newcommand{\uopsinfo}{\texttt{uops.info}}
\newcommand{\uica}{\texttt{uiCA}}