diff --git a/manuscrit/20_foundations/20_code_analyzers.tex b/manuscrit/20_foundations/20_code_analyzers.tex
index 0bcc2da..9c66b4a 100644
--- a/manuscrit/20_foundations/20_code_analyzers.tex
+++ b/manuscrit/20_foundations/20_code_analyzers.tex
@@ -1,4 +1,4 @@
-\section{Kernel optimization and code analyzers}
+\section{Kernel optimization and code analyzers}\label{ssec:code_analyzers}
 
 Optimizing a program, in most contexts, mainly means optimizing it from an
 algorithmic point of view ---~using efficient data structures, running some
diff --git a/manuscrit/20_foundations/30_sota.tex b/manuscrit/20_foundations/30_sota.tex
index 2af64f5..aaf80b9 100644
--- a/manuscrit/20_foundations/30_sota.tex
+++ b/manuscrit/20_foundations/30_sota.tex
@@ -3,11 +3,60 @@
 Performance models for CPUs have been previously studied, and applied to static
 code performance analysis.
 
+\subsection{Manufacturer-sourced data}
+
+Manufacturers of CPUs are expected to offer optimisation data for software
+compiled for their processors. This data may be used by compiler authors,
+within highly-optimised libraries or in the optimisation process of critical
+sections of programs that require very high performance.
+
+\medskip
+
+Intel provides its \textit{Intel® 64 and IA-32 Architectures Optimization
+Reference Manual}~\cite{ref:intel64_architectures_optim_reference_vol1},
+regularly updated, whose nearly 1,000 pages give details relevant to Intel's
+microarchitectures, such as block diagrams, pipelines, available ports, etc. It
+further gives data tables with throughputs and latencies for some instructions.
+While the manual provides a wealth of insights --~important from the
+optimisation perspective~-- into these microarchitectures, it lacks exhaustive
+and (conveniently) machine-parsable data tables, and does not detail the port
+usage of each instruction.
+
+ARM typically releases considerably more complete optimisation manuals for its
+microarchitectures, such as the Cortex A72 optimisation
+manual~\cite{ref:a72_optim}.
+
+AMD, since 2020, has released lengthy and thorough optimisation manuals for its
+microarchitectures. For instance, the Zen4 optimisation
+manual~\cite{ref:amd_zen4_optim_manual} contains both detailed insights into the
+processor's workflow and ports, and a spreadsheet of about 3,400 x86
+instructions --~with operand variants broken down~-- and their port usage,
+throughput and latencies. Such an effort, which certainly translates to a
+non-negligible financial cost to the company, showcases the importance of, and
+the recent expectations placed on, such documents.
+
+\medskip{}
+
+As a part of its EXEgesis project~\cite{tool:google_exegesis}, Google made an
+effort to parse Intel's microarchitecture manuals, resulting in a
+machine-usable data source of instruction details. The extracted data has since
+been contributed to the \llvm{} compiler's data model. The project, however,
+is no longer developed.
+
+\subsection{Third-party instruction data}
+
+The lack, for many microarchitectures, of reliable, exhaustive and
+machine-usable data on individual instructions has driven academics to
+obtain this data independently, through an experimental approach.
+
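+For instance, the reciprocal throughput of an instruction is typically
+estimated by timing unrolled loops of increasing lengths, then taking the
+slope of the cycle count as a function of the unroll factor, which cancels
+out loop and measurement overheads. The Python sketch below illustrates this
+computation on hypothetical cycle counts; it is only meant to convey the
+idea, and does not reproduce any specific tool's scripts.
+
+\begin{verbatim}
+# Sketch: estimating reciprocal throughput from cycle counts of
+# unrolled loops. The measurements below are hypothetical and
+# correspond to 0.25 cycle/instruction plus a constant overhead.
+cycles = {16: 20.0, 32: 24.0, 64: 32.0, 128: 48.0}
+
+def reciprocal_throughput(samples):
+    """Least-squares slope of cycles w.r.t. the unroll factor."""
+    n = len(samples)
+    mean_x = sum(samples.keys()) / n
+    mean_y = sum(samples.values()) / n
+    cov = sum((x - mean_x) * (y - mean_y) for x, y in samples.items())
+    var = sum((x - mean_x) ** 2 for x in samples)
+    return cov / var
+
+print(f"{reciprocal_throughput(cycles):.2f} cycles/instruction")
+\end{verbatim}
+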
 \medskip
 
 Since 1996, Agner Fog has been maintaining tables of values useful for
 optimisation purposes for x86 instructions~\cite{AgnerFog}. These tables, still
-maintained and updated today, are often considered very accurate. The main
+maintained and updated today, are often considered very accurate. They are the
+result of benchmarking scripts developed by the author; the measurements are
+mainly conducted through hardware counters, and their results are subject to
+manual ---~and thus tedious, given the size of microarchitectures~--- analysis.
+The main
 issue, however, is that those tables are generated through the use of
 hand-picked instructions and benchmarks, depending on specific hardware
 counters and features specific to some CPU manufacturers. As such, while these
@@ -21,5 +70,94 @@
 Following the work of Agner Fog, Andreas Abel and Jan Reineke have designed the
 \uopsinfo{} framework~\cite{uopsinfo}, striving to automate the previous
 methodology. Their work, providing data tables for the vast majority of
 instructions on many recent Intel microarchitectures, has been recently
-enhanced to also support AMD architectures. It is, however, still limited to
-% TODO HW counters, relevant microarchs
+enhanced to also support AMD architectures.
+
+The \uopsinfo{} approach, detailed in their article, consists in finding, for
+each port, so-called \textit{blocking instructions} which, used in combination
+with the instruction to be benchmarked and with port-specific hardware
+counters, yield a detailed analysis of the port usage of each instruction
+---~and even its breakdown into \uops{}. This makes for an accurate and robust
+approach, but also limits it to microarchitectures offering such counters, and
+requires a manual analysis of each microarchitecture to be supported, in order
+to find a fitting set of blocking instructions. Although there is no
+theoretical guarantee that such instructions exist, this should never be a
+problem in practice, as any pragmatic microarchitecture design leads to their
+existence.
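+
+To give an intuition of this methodology, the simplified Python sketch below
+mocks the hardware with a hidden port mapping, and shows how saturating a set
+of ports with blocking instructions while reading per-port \uops{} counters
+reveals how many \uops{} of the measured instruction are confined to that
+set. Port numbers and the mapping are illustrative; the actual \uopsinfo{}
+algorithm handles many complications (measurement noise, fused \uops{},
+frontend limits) that are ignored here.
+
+\begin{verbatim}
+# Simplified illustration of the uops.info blocking-instruction
+# technique, with the hardware mocked by a hidden port mapping.
+# Port numbers and the mapping are illustrative only.
+from itertools import combinations
+
+PORTS = (0, 1, 5, 6)
+
+# Hidden ground truth: one uop may go to any of ports {0,1,5,6},
+# a second one is restricted to ports {0,6}.
+HIDDEN_UOPS = [frozenset({0, 1, 5, 6}), frozenset({0, 6})]
+
+def counters_when_blocking(blocked):
+    """Mocked per-port uop counters, summed over the blocked set:
+    a scheduler avoiding contended ports only sends a uop to a
+    blocked port if *all* of its ports are blocked."""
+    return sum(1 for u in HIDDEN_UOPS if u <= blocked)
+
+def infer_port_usage():
+    """For growing port sets, deduce how many uops are confined to
+    each set -- mirroring the measurements uops.info automates."""
+    usage = {}
+    for size in range(1, len(PORTS) + 1):
+        for s in map(frozenset, combinations(PORTS, size)):
+            confined = counters_when_blocking(s)
+            # discount uops already attributed to strict subsets
+            confined -= sum(n for t, n in usage.items() if t < s)
+            if confined:
+                usage[s] = confined
+    return usage
+
+print(infer_port_usage())
+# -> {frozenset({0, 6}): 1, frozenset({0, 1, 5, 6}): 1}
+\end{verbatim}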
+
+\subsection{Code analyzers and their models}
+
+Going further than data extraction at the individual instruction level,
+academic and industrial actors interested in this domain now mostly work on
+code analyzers, as described in \autoref{ssec:code_analyzers} above. Each such
+tool embeds a model --~or collection of models~-- on which it bases its
+inference; the definition of this model, the data it embeds and the method
+used to obtain it vary from tool to tool. These tools often use, to some
+extent, the data on individual instructions obtained either from manufacturers
+or from the third-party efforts mentioned above.
+
+\medskip{}
+
+The Intel Architecture Code Analyzer (\iaca)~\cite{iaca}, released by Intel,
+is a closed-source analyzer of assembly code for Intel microarchitectures
+only. It draws on Intel's own knowledge of their microarchitectures to make
+accurate predictions, which made it very helpful to experts performing
+performance debugging on supported microarchitectures. Yet, being
+closed-source and relying on data that is partially unavailable to the public,
+the model is not totally satisfactory to academics or engineers trying to
+understand specific performance results. This opacity also makes it vulnerable
+to deprecation, as the community is unable to \textit{fork} the project --~and
+indeed, \iaca{} was discontinued by Intel in 2019. Thus, \iaca{} does not
+support recent microarchitectures, and its binary was recently removed from
+official download pages.
+
+\medskip{}
+
+In the meantime, the LLVM Machine Code Analyzer --~or \llvmmca{}~-- was
+developed as an internal tool at Sony, and was proposed for inclusion in
+\llvm{} in 2018~\cite{llvm_mca_rfc}. This code analyzer is based on the data
+tables that \llvm{} --~a compiler~-- has to maintain for each
+microarchitecture in order to produce optimized code. The project has since
+evolved to be fairly accurate, as shown by the experiments presented later in
+this manuscript. It is also the alternative to \iaca{} that Intel has put
+forward since the latter's deprecation.
+
+\medskip{}
+
+Another model, \osaca{}, was developed by Jan Laukemann \textit{et al.}
+starting in 2017~\cite{osaca1,osaca2}. Its development stemmed from the lack
+(at the time) of an open-source --~and thus, open-model~-- alternative to
+\iaca{}. As a data source, \osaca{} makes use of Agner Fog's data tables or of
+\uopsinfo{}. It still lacks, however, a good model of the frontend and of data
+dependencies, making it less accurate than other code analyzers in the
+experiments presented later in this manuscript.
+
+\medskip{}
+
+Taking another approach entirely, \ithemal{} is a machine-learning-based code
+analyzer striving to predict the reciprocal throughput of a given kernel.
+Training it required the development of \bhive{}, a benchmark suite of kernels
+extracted from real-life programs and libraries, along with a profiler
+measuring the runtime, in CPU cycles, of a basic block isolated from its
+context. This approach, in our experiments, was significantly less accurate
+than those not based on machine learning. In our opinion, however, its main
+issue is that it is a \textit{black-box model}: given a kernel, it is only
+able to predict its reciprocal throughput. Doing so, even with perfect
+accuracy, does not explain the source of a performance problem: the model is
+unable to help in detecting which resource is the performance bottleneck of a
+kernel; in other words, it quantifies a potential issue, but does not help in
+\emph{explaining} it --~or debugging it.
+
+\medskip{}
+
+In yet another approach, \pmevo{}~\cite{PMEvo} uses genetic algorithms to
+infer, from scratch and in a benchmark-oriented approach, a port-mapping of
+the processor it is running on. It is, to the best of our knowledge, the first
+tool striving to compute a port-mapping model in a fully-automated way, as
+\palmed{} does (see \autoref{chap:palmed} later), although through a
+completely different methodology. As detailed in \palmed{}'s
+article~\cite{palmed}, it suffers, however, from a lack of scalability: as
+generating a port-mapping for the few thousands of x86-64 instructions would
+be extremely time-consuming with this approach, the authors limit the
+evaluation of their tool to the roughly 300 most common instructions.
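+
+To make the notion of port-mapping model concrete, the Python sketch below
+shows how such a mapping --~whether measured, as by \uopsinfo{}, or inferred,
+as by \pmevo{}~-- translates into a throughput prediction: ignoring
+dependencies and frontend effects, the \uops{} confined to any subset of ports
+need at least their number divided by the subset's size in cycles per
+iteration, and the worst such subset is the predicted bottleneck. The kernel
+and port mapping used here are purely illustrative. Note that, contrary to a
+black-box predictor, such a model names the resource limiting performance.
+
+\begin{verbatim}
+# Sketch: throughput prediction from a port mapping (illustrative
+# mapping, ignoring dependencies and frontend effects).
+from itertools import combinations
+
+# One entry per uop of the kernel: the ports it may execute on.
+KERNEL_UOPS = [
+    frozenset({0, 1}),  # e.g. a vector add
+    frozenset({0, 1}),  # another vector add
+    frozenset({5}),     # e.g. a shuffle, only executable on port 5
+    frozenset({2, 3}),  # e.g. a load
+]
+PORTS = sorted(frozenset().union(*KERNEL_UOPS))
+
+def cycles_per_iteration(uops):
+    """For every port subset S, the uops confined to S need at least
+    confined/|S| cycles on S; the worst subset is the bottleneck."""
+    best, bottleneck = 0.0, None
+    for size in range(1, len(PORTS) + 1):
+        for s in map(frozenset, combinations(PORTS, size)):
+            confined = sum(1 for u in uops if u <= s)
+            if confined / size > best:
+                best, bottleneck = confined / size, s
+    return best, bottleneck
+
+cycles, ports = cycles_per_iteration(KERNEL_UOPS)
+print(f"{cycles} cycles/iteration, bottleneck on ports {sorted(ports)}")
+# -> 1.0 cycles/iteration, bottleneck on ports [5]
+\end{verbatim}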
+
+Lastly, \uica{}~\cite{uica}, developed by the authors of \uopsinfo{}, couples
+the port data of \uopsinfo{} with a detailed simulation of the pipeline --~and
+in particular of the frontend~-- of recent Intel microarchitectures. In its
+authors' evaluation, it is significantly more accurate on these
+microarchitectures than the other code analyzers mentioned above.
diff --git a/manuscrit/biblio/code_analyzers.bib b/manuscrit/biblio/code_analyzers.bib
index e02ad87..f08e682 100644
--- a/manuscrit/biblio/code_analyzers.bib
+++ b/manuscrit/biblio/code_analyzers.bib
@@ -77,6 +77,14 @@
   title = {{LLVM} Machine Code Analyzer},
   howpublished = {\url{https://llvm.org/docs/CommandGuide/llvm-mca.html}},
 }
+@misc{llvm_mca_rfc,
+  title={{[RFC] llvm-mca: a static performance analysis tool}},
+  author={Andrea Di Biagio},
+  howpublished={\url{https://lists.llvm.org/pipermail/llvm-dev/2018-March/121490.html}},
+  month={March},
+  year={2018},
+  note={Request for comments on the llvm-dev mailing-list},
+}
+@inproceedings{uica,
+  title={{uiCA}: Accurate Throughput Prediction of Basic Blocks on Recent
+         Intel Microarchitectures},
+  author={Abel, Andreas and Reineke, Jan},
+  booktitle={Proceedings of the 36th ACM International Conference on
+             Supercomputing (ICS)},
+  year={2022},
+}
 
 @misc{iaca,
   title={Intel Architecture Code Analyzer ({IACA})},
diff --git a/manuscrit/biblio/misc.bib b/manuscrit/biblio/misc.bib
index 13770b3..4269594 100644
--- a/manuscrit/biblio/misc.bib
+++ b/manuscrit/biblio/misc.bib
@@ -19,6 +19,20 @@
   publisher={JSTOR}
 }
 
+@manual{ref:amd_zen4_optim_manual,
+  title = {Software Optimization Guide for the AMD Zen4 Microarchitecture},
+  organization = {Advanced Micro Devices (AMD)},
+  year = {2023},
+  month = {January},
+  note = {Publication number 57647},
+}
+@manual{ref:intel64_architectures_optim_reference_vol1,
+  title = {Intel® 64 and IA-32 Architectures Optimization Reference Manual,
+           Volume 1},
+  organization = {Intel Corporation},
+  year = {2023},
+  month = {September},
+}
 @manual{ref:intel64_software_dev_reference_vol1,
   title = {Intel® 64 and IA-32 Architectures Software Developer’s Manual,
            volume 1},
diff --git a/manuscrit/biblio/tools.bib b/manuscrit/biblio/tools.bib
index af8ecc5..848bee6 100644
--- a/manuscrit/biblio/tools.bib
+++ b/manuscrit/biblio/tools.bib
@@ -116,6 +116,12 @@
   howpublished={\url{https://www.qemu.org}}
 }
 
+@misc{tool:google_exegesis,
+  title={{EXEgesis}},
+  author={{Google}},
+  howpublished={\url{https://github.com/google/EXEgesis}},
+}
+
 @misc{intel_mkl,
   title={oneAPI Math Kernel Library ({oneMKL})},
   author={{Intel}},
diff --git a/manuscrit/include/macros.tex b/manuscrit/include/macros.tex
index abe2722..4895ae3 100644
--- a/manuscrit/include/macros.tex
+++ b/manuscrit/include/macros.tex
@@ -42,6 +42,7 @@
 \newcommand{\perf}{\texttt{perf}}
 \newcommand{\qemu}{\texttt{QEMU}}
 \newcommand{\iaca}{\texttt{IACA}}
+\newcommand{\llvm}{\texttt{llvm}}
 \newcommand{\llvmmca}{\texttt{llvm-mca}}
 \newcommand{\uopsinfo}{\texttt{uops.info}}
 \newcommand{\uica}{\texttt{uiCA}}