SotA: some writeup. Lacks uiCA.
parent bd7b3b8ad6, commit bb932f93c6
6 changed files with 171 additions and 4 deletions

@@ -1,4 +1,4 @@
\section{Kernel optimization and code analyzers}\label{ssec:code_analyzers}

Optimizing a program, in most contexts, mainly means optimizing it from an
algorithmic point of view ---~using efficient data structures, running some

@@ -3,11 +3,60 @@
Performance models for CPUs have previously been studied and applied to
static code performance analysis.

\subsection{Manufacturer-sourced data}

CPU manufacturers are expected to provide optimisation data for software
compiled for their processors. This data may be used by compiler authors,
within highly-optimized libraries, or in the optimisation of critical
sections of programs that require very high performance.

\medskip{}

Intel provides its \textit{Intel® 64 and IA-32 Architectures Optimization
Reference Manual}~\cite{ref:intel64_architectures_optim_reference_vol1},
regularly updated, whose nearly 1,000 pages give relevant details of Intel's
microarchitectures, such as block diagrams, pipelines, available ports, etc. It
further gives data tables with throughput and latencies for some instructions.
While the manual provides a huge collection of important insights --~from the
optimisation perspective~-- on its microarchitectures, it lacks exhaustive
and (conveniently) machine-parsable data tables and does not detail the port
usage of each instruction.

ARM typically releases far more complete optimisation manuals for its
microarchitectures, such as the Cortex A72 optimisation
manual~\cite{ref:a72_optim}.

AMD has, since 2020, released lengthy and thorough optimisation manuals for its
microarchitectures. For instance, the Zen4 optimisation
manual~\cite{ref:amd_zen4_optim_manual} contains both detailed insights on the
processor's workflow and ports, and a spreadsheet of about 3,400 x86
instructions --~with operand variants broken down~-- listing their port usage,
throughput and latencies. Such an effort, which certainly translates to a
non-negligible financial cost for the company, showcases the importance of,
and the recent expectations placed on, such documents.

\medskip{}

As part of its EXEgesis project~\cite{tool:google_exegesis}, Google made an
effort to parse Intel's microarchitecture manuals, resulting in a
machine-usable data source of instruction details. The extracted data has
since been contributed to the llvm compiler's data model. The project,
however, is no longer developed.

\subsection{Third-party instruction data}

The lack, for many microarchitectures, of reliable, exhaustive and
machine-usable data for individual instructions has driven academics to
obtain this data independently, through an experimental approach.

\medskip{}

Since 1996, Agner Fog has been maintaining tables of values useful for
optimisation purposes for x86 instructions~\cite{AgnerFog}. These tables, still
maintained and updated today, are often considered very accurate. They are the
result of benchmarking scripts developed by the author, subject to manual
---~and thus tedious, given the size of microarchitectures~--- analysis, and
mainly rely on hardware counter measurements. The main
issue, however, is that those tables are generated through the use of
hand-picked instructions and benchmarks, depending on hardware
counters and features specific to some CPU manufacturers. As such, while these

@@ -21,5 +70,94 @@ Following the work of Agner Fog, Andreas Abel and Jan Reineke have designed the
\uopsinfo{} framework~\cite{uopsinfo}, striving to automate the previous
methodology. Their work, providing data tables for the vast majority of
instructions on many recent Intel microarchitectures, has recently been
enhanced to also support AMD architectures.

The \uopsinfo{} approach, detailed in their article, consists in finding
so-called \textit{blocking instructions} for each port which, used in
combination with the instruction to be benchmarked and port-specific hardware
counters, yield a detailed analysis of the port usage of each instruction
---~and even its breakdown into \uops{}. This makes for an accurate and robust
approach, but it also limits the method to microarchitectures offering such
counters, and requires a manual analysis of each microarchitecture to be
supported in order to find a fitting set of blocking instructions. Although
there is no theoretical guarantee that such instructions exist, this should
never be a problem in practice, as any pragmatic microarchitecture design
gives rise to suitable candidates.
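
As an illustration of this methodology, the following Python sketch shows one
way the blocking-instruction idea can be expressed. It is a deliberate
simplification rather than \uopsinfo{}'s actual procedure, and the helper
\texttt{run\_with\_counters}, which would execute a measurement kernel and
return per-port \uops{} counts read from hardware counters, is hypothetical.

\begin{verbatim}
# Simplified illustration (not uops.info's actual algorithm).
# `run_with_counters(kernel)` is a hypothetical helper that executes the
# kernel in a measurement loop and returns, for each port, the number of
# uops dispatched to it, as reported by port-specific hardware counters.

def port_usage(instr, blockers, run_with_counters, n=100):
    """For each port p, saturate p with its blocking instruction while
    executing n copies of `instr`; uops of `instr` that can execute
    elsewhere migrate away, so the residual count on p (minus the
    blocker's own uops) approximates how many of its uops need p."""
    must_use = {}
    for port, blocker in blockers.items():
        kernel = [instr] * n + [blocker] * n
        counts = run_with_counters(kernel)        # {port: uop count}
        must_use[port] = max(0, counts[port] - n) / n
    return must_use
\end{verbatim}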
\subsection{Code analyzers and their models}

Going further than data extraction at the individual instruction level,
academic and industrial work in this domain now mostly focuses on
code analyzers, as described in \autoref{ssec:code_analyzers} above. Each such
tool embeds a model --~or collection of models~-- on which to base its
inference, and whose definition, embedded data and derivation method vary from
tool to tool. These tools often use, to some extent, the data on individual
instructions obtained either from the manufacturer or from the third-party
efforts mentioned above.

\medskip{}

The Intel Architecture Code Analyzer (\iaca)~\cite{iaca}, released by
Intel, is a fully closed-source analyzer able to analyze assembly code for
Intel microarchitectures only. It draws on Intel's own knowledge of its
microarchitectures to make accurate predictions. This accuracy made it very
helpful to experts doing performance debugging on supported
microarchitectures. Yet, being closed-source and relying on data that is
partially unavailable to the public, the model is not fully satisfactory to
academics or engineers trying to understand specific performance results. This
also makes it vulnerable to deprecation, as the community is unable to
\textit{fork} the project --~and indeed, \iaca{} was discontinued by Intel
in 2019. As a result, \iaca{} does not support recent microarchitectures, and
its binary was recently removed from official download pages.

\medskip{}

In the meantime, the LLVM Machine Code Analyzer --~or \llvmmca{}~-- was
developed as an internal tool at Sony and was proposed for inclusion in
\llvm{} in 2018~\cite{llvm_mca_rfc}. This code analyzer is based on the
data tables that \llvm{} --~a compiler~-- has to maintain for each
microarchitecture in order to produce optimized code. The project has since
evolved to be fairly accurate, as seen in the experiments presented later in
this manuscript. It is the alternative Intel offers following \iaca{}'s
deprecation.
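
As an illustration, \llvmmca{} is typically driven directly on a small
assembly kernel. The Python sketch below runs it on a two-instruction kernel
and extracts the predicted block reciprocal throughput; it assumes that
\llvmmca{} is available in the \texttt{PATH}, and the exact output layout may
vary across \llvm{} versions.

\begin{verbatim}
# Run llvm-mca on a small AT&T-syntax kernel and extract its predicted
# block reciprocal throughput.  Assumes llvm-mca is installed; the exact
# output layout may differ slightly between LLVM versions.
import subprocess

KERNEL = """
vaddps  %ymm0, %ymm1, %ymm1
vmulps  %ymm2, %ymm3, %ymm3
"""

result = subprocess.run(
    ["llvm-mca", "-mcpu=skylake", "-iterations=100"],
    input=KERNEL, capture_output=True, text=True, check=True,
)
for line in result.stdout.splitlines():
    if line.startswith("Block RThroughput"):
        print(line)   # predicted reciprocal throughput of the kernel
\end{verbatim}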
\medskip{}

Another model, \osaca{}, was developed by Jan Laukemann \textit{et al.}
|
||||
starting in 2017~\cite{osaca1,osaca2}. Its development stemmed from the lack
|
||||
(at the time) of an open-source --~and thus, open-model~-- alternative to IACA.
|
||||
As a data source, \osaca{} makes use of Agner Fog's data tables or \uopsinfo{}.
|
||||
It still lacks, however, a good model of frontend and data dependencies, making
|
||||
it less performant than other code analyzers in our experiments later in this
|
||||
manuscript.
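
To make the dependency aspect concrete, the toy Python sketch below computes
the latency bound of one loop iteration as the longest path through its
dependency graph --~the kind of information a dependency model must capture on
top of port pressure. It is a generic illustration under simplified
assumptions, not \osaca{}'s implementation.

\begin{verbatim}
# Toy latency-bound computation: the longest path through the dependency
# DAG of one loop iteration.  Generic illustration only.
def latency_bound(latencies, deps):
    """latencies[i]: latency of instruction i, in cycles;
    deps[i]: indices of earlier instructions that i depends on."""
    finish = []
    for i, lat in enumerate(latencies):
        start = max((finish[j] for j in deps[i]), default=0)
        finish.append(start + lat)
    return max(finish)

# A three-instruction chain (4-cycle mul, 4-cycle mul, 1-cycle add):
print(latency_bound([4, 4, 1], [[], [0], [1]]))   # -> 9 cycles
\end{verbatim}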
\medskip{}

Taking another approach entirely, \ithemal{} is a machine-learning-based code
analyzer striving to predict the reciprocal throughput of a given kernel. The
need to train it resulted in the development of \bhive{}, a benchmark
suite of kernels extracted from real-life programs and libraries, along with a
profiler measuring the runtime, in CPU cycles, of a basic block isolated from
its context. This approach, in our experiments, was significantly less accurate
than those not based on machine learning. In our opinion, its main issue,
however, is that it is a \textit{black-box model}: given a kernel, it can only
predict its reciprocal throughput. Doing so, even with perfect accuracy, does
not explain the source of a performance problem: the model is unable to help in
detecting which resource is the performance bottleneck of a kernel; in other
words, it quantifies a potential issue, but does not help in \emph{explaining}
it --~or debugging it.

\medskip{}

In yet another approach, \pmevo{}~\cite{PMEvo} uses genetic algorithms to
infer, from scratch and through a benchmark-oriented approach, a port mapping
of the processor it runs on. It is, to the best of our knowledge, the first
tool striving to compute a port-mapping model in a fully automated way, as
\palmed{} does (see \autoref{chap:palmed} later), although through a completely
different methodology. As detailed in \palmed{}'s article~\cite{palmed}, it
however suffers from a lack of scalability: as generating a port mapping for
the few thousand x86-64 instructions would be extremely time-consuming with
this approach, the authors limit the evaluation of their tool to the roughly
300 most common instructions.
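
As a rough illustration of the underlying principle --~and not of \pmevo{}'s
actual formulation~--, the Python sketch below scores a candidate port mapping
by comparing a simple port-pressure bound derived from it against measured
benchmark throughputs; such a score is the kind of fitness function an
evolutionary search can optimize. The data structures used here (instructions
mapped to sets of admissible ports, measured cycle counts) are assumed inputs.

\begin{verbatim}
# Simplified illustration of scoring a candidate port mapping against
# measured benchmarks (not PMEvo's actual algorithm).
from itertools import combinations

def predicted_cycles(kernel, mapping, ports):
    """Port-pressure bound: for every subset S of ports, the uops that
    can only run on ports of S need at least (number of such uops)/|S|
    cycles per iteration; take the worst such bound."""
    uops = [u for instr in kernel for u in mapping[instr]]
    bound = 0.0
    for r in range(1, len(ports) + 1):
        for subset in combinations(ports, r):
            stuck = sum(1 for u in uops if u <= set(subset))
            bound = max(bound, stuck / r)
    return bound

def fitness(mapping, benchmarks, ports):
    """Mean relative error between predicted and measured cycle counts;
    `benchmarks` maps kernels (tuples of instructions) to measurements."""
    errors = [abs(predicted_cycles(k, mapping, ports) - cycles) / cycles
              for k, cycles in benchmarks.items()]
    return sum(errors) / len(errors)

# Example: two ports, one instruction that may use either port.
mapping = {"addq": [{"p0", "p1"}]}
print(predicted_cycles(("addq", "addq"), mapping, ["p0", "p1"]))  # 1.0
\end{verbatim}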
\todo{uiCA}

@@ -77,6 +77,14 @@
  title = {{LLVM} Machine Code Analyzer},
  howpublished = {\url{https://llvm.org/docs/CommandGuide/llvm-mca.html}},
}
@misc{llvm_mca_rfc,
  title = {{[RFC] llvm-mca: a static performance analysis tool}},
  author = {Andrea Di Biagio},
  howpublished = {\url{https://lists.llvm.org/pipermail/llvm-dev/2018-March/121490.html}},
  month = {March},
  year = {2018},
  note = {Request for comments on the llvm-dev mailing-list},
}

@misc{iaca,
  title = {Intel Architecture Code Analyzer ({IACA})},

@@ -19,6 +19,20 @@
  publisher = {JSTOR}
}

@manual{ref:amd_zen4_optim_manual,
  title = {Software Optimization Guide for the AMD Zen4 Microarchitecture},
  organization = {Advanced Micro Devices (AMD)},
  year = {2023},
  month = {January},
  note = {Publication number 57647},
}
@manual{ref:intel64_architectures_optim_reference_vol1,
  title = {Intel® 64 and IA-32 Architectures Optimization Reference Manual
           Volume 1},
  organization = {Intel Corporation},
  year = {2023},
  month = {September},
}
@manual{ref:intel64_software_dev_reference_vol1,
  title = {Intel® 64 and IA-32 Architectures Software Developer’s Manual,
           Volume 1},

@@ -116,6 +116,12 @@
  howpublished = {\url{https://www.qemu.org}}
}

@misc{tool:google_exegesis,
  title = {{EXEgesis}},
  author = {{Google}},
  howpublished = {\url{https://github.com/google/EXEgesis}},
}

@misc{intel_mkl,
  title = {oneAPI Math Kernel Library ({oneMKL})},
  author = {{Intel}},

@@ -42,6 +42,7 @@
\newcommand{\perf}{\texttt{perf}}
\newcommand{\qemu}{\texttt{QEMU}}
\newcommand{\iaca}{\texttt{IACA}}
\newcommand{\llvm}{\texttt{llvm}}
\newcommand{\llvmmca}{\texttt{llvm-mca}}
\newcommand{\uopsinfo}{\texttt{uops.info}}
\newcommand{\uica}{\texttt{uiCA}}