diff --git a/manuscrit/40_A72-frontend/10_cortex_a72.tex b/manuscrit/40_A72-frontend/10_cortex_a72.tex
index 53e251e..c674791 100644
--- a/manuscrit/40_A72-frontend/10_cortex_a72.tex
+++ b/manuscrit/40_A72-frontend/10_cortex_a72.tex
@@ -1,3 +1,103 @@
+The usual reverse-engineering methods for CPU models make abundant use
+of hardware counters ---~and legitimately so, as they are the natural and
+accurate way to gain insight into the internals of a CPU\@. Such methods
+include, among others, the optimisation guides from Agner Fog~\cite{AgnerFog},
+as well as \uopsinfo{}~\cite{uopsinfo} and \uica{}~\cite{uica}, which model
+the CPU's back- and front-end respectively. In \autoref{chap:palmed}, we
+introduced \palmed{}, whose main goal is to automatically produce port-mappings of
+CPUs without assuming the presence of specific hardware counters.
+
+\smallskip{}
+
+The ARM architectures occupy a growing space in the global computing ecosystem.
+They are already pervasive among embedded and mobile devices, with most mobile
+phones featuring an ARM CPU~\cite{arm_mobile}. Processors based on ARM are also
+emerging in datacenters and supercomputers: the Fugaku supercomputer ---~formerly
+first in the TOP500 ranking, and still first in the HPCG and Graph500 rankings in
+November 2022~\cite{fugaku_top500}~--- runs on ARM-based CPUs~\cite{fugaku_arm},
+and the MareNostrum 4 supercomputer has an ARM-based cluster~\cite{marenostrum4_arm}.
+
+Yet, the ARM ecosystem is still lacking in performance debugging tooling. While
+\llvmmca{} supports ARM, it is one of very few tools to do so: \iaca{}, made by
+Intel, does not support it ---~and never will, as it is end-of-life~---; \uica{} is
+focused on Intel architectures, and cannot be easily ported as it heavily
+relies on reverse engineering specific to Intel, enabled by specific
+hardware counters; Intel \texttt{VTune}, a commonly used profiling and performance
+analysis tool, supports only x86-64.
+
+\smallskip{}
+
+In this context, modelling an ARM CPU ---~the Cortex A72~--- with \palmed{} seemed
+to be an important goal, especially meaningful as this particular CPU has only
+very few hardware counters. However, it yielded only mixed results, as shown in
+\autoref{sec:palmed_results}.
+
+In this chapter, we show that a major cause of imprecision in these results is
+the absence of a frontend model.
+
+
+\section{Necessity to go beyond ports}
+
+The resource models produced by \palmed{} are mainly concerned with the backend
+of the CPUs modelled. However, the importance of the frontend in the accuracy of
+a model's prediction cannot be ignored. Its effect can be clearly seen in the
+evaluation heatmaps of various code analyzers in \autoref{fig:palmed_heatmaps}.
+Each heatmap has a clear-cut limit on the horizontal axis: independently of the
+benchmark's content, it is impossible to reach more than a given number of
+instructions per cycle for a given processor ---~4 instructions for the
+\texttt{SKL-SP}, 5 for the \texttt{ZEN1}. This limit is imposed by the
+frontend.
+
+Some analyzers, such as \palmed{} and \iaca{}, model this limit: their heatmaps
+show that the predicted IPC never exceeds it. The other three
+analyzers studied, however, do not model this limit; for instance, \uopsinfo{}
+has a high density of benchmarks predicted at 8 instructions per cycle on
+SPEC2017 on the \texttt{SKL-SP} CPU, while the native measurement yielded only
+4 instructions per cycle. The same effect is visible in the \pmevo{} and
+\llvmmca{} heatmaps.
+
+\begin{example}{High back-end throughput on \texttt{SKL-SP}}
+    On the \texttt{SKL-SP} microarchitecture, assuming an infinitely large
+    frontend, a number of instructions per cycle higher than 4 is easy to
+    reach.
+
+    According to \uopsinfo{} data, a 64-bit
+    integer \lstxasm{addq} is processed with a single \uop{}, dispatched on
+    port 0, 1, 5 or 6. Meanwhile, a simple 64-bit register store
+    to an address held directly in a register ---~\eg{} a \lstxasm{movq \%rax,
+    (\%rbx)}~--- is also processed with a single \uop{}, dispatched on port 2
+    or 3.
+
+    Thus, backend-wise, the kernel $4\times \texttt{addq} + 2\times
+    \texttt{mov}$ has a throughput of 6 instructions per cycle. In
+    reality, however, this kernel is frontend-bound, with a theoretical maximum
+    throughput of 4 instructions per cycle ---~in fact, a \pipedream{}
+    measurement only yields 3 instructions per cycle.
+\end{example}
+
+\bigskip{}
+
+To account for this, \palmed{} tries to detect an additional resource, apart
+from the backend ports and combined ports, on which every \uop{} incurs a load.
+This allows \palmed{} to avoid large errors on frontend-bound kernels.
+
+The approach is, however, far from perfect. The clearest reason for this is
+that the frontend, both on x86-64 and ARM architectures, works in-order, while
+\palmed{} inherently models kernels as multisets of instructions, thus
+completely ignoring ordering. This resource model is purely linear: an
+instruction incurs a load on the frontend resource in a fully commutative way,
+independently of the previous instructions executed this cycle and of many
+other effects.
+
+The article introducing \uica{}~\cite{uica} explores this question in detail
+for x86-64 Intel architectures. The authors, having previously developed
+\uopsinfo{}, discuss the importance of correctly modelling the frontend to
+accurately predict throughput. Their approach, based on the exploration and
+reverse-engineering of the crucial parts of the frontend, showcases many
+important and non-trivial aspects of frontends that are usually neglected, such
+as the switching between the decoders and the \uop{}-cache as the source of
+instructions ---~which cannot be linearly modelled.
+
 \section{The Cortex A72 CPU}
 
 The Cortex A72~\cite{a72_doc} is a CPU based on the ARMv8-A ISA ---~the first
@@ -8,6 +108,62 @@ high-performance core for low-power applications. The Raspberry Pi 4 uses a
 4-cores A72 CPU, implemented by Broadcom as BCM2711; it is thus easy to have
 access to an A72 to run experiments.
 
-\paragraph{Backend.}
+\begin{figure}
+    \centering
+    \includegraphics[width=\linewidth]{A72_pipeline_diagram.svg}
+    \caption{Simplified overview of the Cortex A72
+    pipeline}\label{fig:a72_pipeline}
+\end{figure}
+
+\paragraph{Backend.} As can be seen in \autoref{fig:a72_pipeline} (adapted from
+the software optimization guide for the Cortex A72, published by
+ARM~\cite{ref:a72_optim}), the Cortex A72 has eight execution ports:
+\begin{itemize}
+    \item a branch port (branch instructions, equivalent to x86 jumps);
+    \item two identical integer ports (integer arithmetic operations);
+    \item an integer multi-cycle port (complex integer operations, \eg{} divisions);
+    \item two floating point and SIMD ports (nearly identical,
+        with slight specializations: \eg{} only port FP0 can do SIMD
+        multiplication, while only port FP1 can do floating point comparisons);
+    \item a load port;
+    \item a store port.
+\end{itemize}
+
+\paragraph{Frontend.} The Cortex A72 frontend can only decode three
+instructions and dispatch three \uops{} per cycle~\cite{ref:a72_optim}.
+By comparison, the frontend of Intel's \texttt{SKL-SP}, which we considered before,
+bottlenecks at four \uops{} per cycle~\cite{agnerfog_skl_front4}. This
+difference of one \uop{} per cycle is meaningful: on the A72, at most
+three of the eight backend ports can be fed each cycle.
+
+\begin{example}[2nd order polynomial evaluation]
+    Consider a kernel evaluating a 2nd order polynomial for
+    each element $X[i]$ of an array:
+    \begin{align*}
+        P[i] &= a{X[i]}^2 + bX[i] + c \\
+        &= \left( aX[i] + b \right) \times X[i] + c
+    \end{align*}
+    which directly translates to four operations: a load of $X[i]$, two floating
+    point multiply-adds, and a store of the result $P[i]$. The backend, having a load
+    port, two SIMD ports and a store port, can execute one iteration of such a
+    kernel every cycle; in steady-state, out-of-order execution can hide latencies.
+    However, as the frontend bottlenecks at three \uops{} per cycle, these four
+    operations cannot all be dispatched in one cycle: an iteration takes at least $4/3$ cycles on average.
+\end{example}
+
+\paragraph{Lack of hardware counters.}
+The Cortex A72 features only a very limited set of specialized hardware counters.
+While the CPU is able to report the number of elapsed cycles,
+retired instructions, branch misses and various metrics on cache misses, it
+does not report any event regarding macro- or micro-operations, dispatching or
+issuing to specific ports. This makes it, as pointed out before, a particularly
+relevant target for \palmed{}.
+
+
+\section{Manually modelling the A72 frontend}
+
+% TODO
+
+\subsection{Methodology}
+
 
-\cite{ref:a72_optim}
diff --git a/manuscrit/assets/imgs/40_A72-frontend/A72_pipeline_diagram.svg b/manuscrit/assets/imgs/40_A72-frontend/A72_pipeline_diagram.svg
index 5ec0780..6cad889 100644
--- a/manuscrit/assets/imgs/40_A72-frontend/A72_pipeline_diagram.svg
+++ b/manuscrit/assets/imgs/40_A72-frontend/A72_pipeline_diagram.svg
[SVG markup changes not reproduced; text labels in the diagram: Fetch; Decode, Rename, Dispatch; In-order; Out-of-order; Back-end; 3μOPs]
diff --git a/manuscrit/biblio/code_analyzers.bib b/manuscrit/biblio/code_analyzers.bib
index 0865874..e02ad87 100644
--- a/manuscrit/biblio/code_analyzers.bib
+++ b/manuscrit/biblio/code_analyzers.bib
@@ -107,6 +107,14 @@
   doi={10.1109/PMBS49563.2019.00006}
 }
 
+@online{AgnerFog,
+  author = {Agner Fog},
+  title = {Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, {AMD} and {VIA} {CPU}s},
+  publisher = {Technical University of Denmark},
+  year = {2020},
+  url = {http://www.agner.org/optimize/instruction_tables.pdf},
+}
+
 @inproceedings{uopsinfo,
   title = {uops.info: Characterizing Latency, Throughput, and Port Usage of Instructions on Intel Microarchitectures},
   acmid = {3304062},
diff --git a/manuscrit/biblio/misc.bib b/manuscrit/biblio/misc.bib
index 99ea59c..954bed6 100644
--- a/manuscrit/biblio/misc.bib
+++ b/manuscrit/biblio/misc.bib
@@ -90,3 +90,46 @@
   year = {2015},
   month = {March},
 }
+
+@misc{agnerfog_skl_front4,
+  title={Discussion on blogpost},
+  author={Fog, Agner},
+  year=2016,
+  howpublished={\url{https://www.agner.org/optimize/blog/read.php?i=581}}
+}
+
+@INPROCEEDINGS{fugaku_arm,
+  author={Matsuoka, Satoshi},
+  booktitle={2021 Symposium on VLSI Circuits},
+  title={Fugaku and A64FX: the First Exascale Supercomputer and its Innovative Arm CPU},
+  year={2021},
+  volume={},
+  number={},
+  pages={1--3},
+  doi={10.23919/VLSICircuits52068.2021.9492415}
+}
+
+@misc{fugaku_top500,
+  title={Supercomputer Fugaku retains first place worldwide in HPCG and Graph500 rankings},
+  year=2022,
+  month={November},
+  author={{Fujitsu Limited}},
+  howpublished={\url{https://www.fujitsu.com/global/about/resources/news/press-releases/2022/1115-01.html}}
+}
+
+@misc{marenostrum4_arm,
+  title={Technical information on the MareNostrum 4 supercomputer's ARM cluster},
+  author={{Barcelona Supercomputing Center}},
+  year=2020,
+  howpublished={\url{https://www.bsc.es/innovation-and-services/technical-information-cte-arm}}
+}
+
+@misc{arm_mobile,
+  title={Together, we are building the future of computing, on Arm},
+  author={Rene Haas},
+  organization = {ARM},
+  year=2023,
+  month={September},
+  howpublished={\url{https://www.arm.com/company/news/2023/09/building-the-future-of-computing-on-arm}},
+}
+
diff --git a/plan/40_a72_frontend.md b/plan/40_a72_frontend.md
index 207e837..6a710d5 100644
--- a/plan/40_a72_frontend.md
+++ b/plan/40_a72_frontend.md
@@ -12,10 +12,19 @@
 * Notion of bottleneck
 [[END]]
 
+* Palmed was made to produce models for architectures with limited hardware
+  counters
+* ARM is an important architecture:
+  * already pervasive in embedded devices
+  * starts to emerge in the datacenter [Fugaku, MareNostrum 4]
+  * …and lacks HW counters
+* Cf prev. chapter: Palmed results on the Cortex A72 are not that good. Why?
+
 ## Necessity to go beyond ports
 
 * Palmed: concerned mostly with ports
 * Noticed the importance of the frontend while investigating its performances
+  on x86
 * heatmap representation: uops predicts unreachably high IPCs (eg. 8 on SKX)
 * example of a frontend-bound microkernel
 * Palmed's vision of a frontend
@@ -50,8 +59,6 @@
 * Very few hardware counters regarding the frontend! In particular, no access
   *at all* to macro-ops. No micro-op count.
 
-* Pure Palmed results
-
 ## Manual frontend
 
 ### Base methodology