A72: proper section split
This commit is contained in:
parent
41bb653013
commit
16a647e3d4
7 changed files with 171 additions and 173 deletions
@@ -1 +1,40 @@
\todo{Intro}

Reverse-engineering methods for CPU models usually make abundant use of
hardware counters ---~and legitimately so, as they are the natural and
accurate way to obtain insight into the internals of a CPU\@. Such methods
include, among others, those behind the optimisation guides from Agner
Fog~\cite{AgnerFog}, as well as \uopsinfo{}~\cite{uopsinfo} and
\uica{}'s~\cite{uica} approaches to modelling respectively the CPU's backend
and frontend. In \autoref{chap:palmed}, we introduced Palmed, whose main goal
is to automatically produce port mappings of CPUs without assuming the
presence of specific hardware counters.

\smallskip{}

The ARM architectures occupy a growing space in the global computing
ecosystem. They are already pervasive among embedded and mobile devices, with
most mobile phones featuring an ARM CPU~\cite{arm_mobile}. Processors based on
ARM are also emerging in datacenters and supercomputers: the Fugaku
supercomputer ---~considered the fastest supercomputer in the world by the
TOP500 ranking~\cite{fugaku_top500}~--- runs on ARM-based
CPUs~\cite{fugaku_arm}, and the MareNostrum 4 supercomputer features an
ARM-based cluster~\cite{marenostrum4_arm}.

Yet, the ARM ecosystem still lacks performance debugging tooling. While
\llvmmca{} supports ARM, it is one of only a few tools that do: \iaca{}, made
by Intel, does not support ARM ---~and never will, as it is end-of-life~---;
\uica{} focuses on Intel architectures and cannot easily be ported, as it
heavily relies on reverse engineering specific to Intel and enabled by
specific hardware counters; Intel \texttt{VTune}, a commonly used performance
profiling tool, supports only x86-64.

\smallskip{}

In this context, modelling an ARM CPU ---~the Cortex A72~--- with Palmed
seemed an important goal, all the more meaningful as this particular CPU
offers only very few hardware counters. However, it yielded only mixed
results, as shown in \autoref{sec:palmed_results}.

\bigskip{}

In this chapter, we show that a major cause of imprecision in these results is
the absence of a frontend model. We manually model the Cortex A72 frontend to
compare a raw \palmed{}-generated model to one naively augmented with a
frontend model. \todo{discuss automated future work}

61 manuscrit/40_A72-frontend/10_beyond_ports.tex Normal file
@@ -0,0 +1,61 @@
\section{Necessity to go beyond ports}

The resource models produced by \palmed{} are mainly concerned with the
backend of the modelled CPUs. However, the importance of the frontend in the
accuracy of a model's predictions cannot be ignored. Its effect can be clearly
seen in the evaluation heatmaps of various code analyzers in
\autoref{fig:palmed_heatmaps}. Each heatmap has a clear-cut limit on the
horizontal axis: independently of the benchmark's content, it is impossible to
reach more than a given number of instructions per cycle on a given processor
---~4 instructions for the \texttt{SKL-SP}, 5 for the \texttt{ZEN1}. This
limit is imposed by the frontend.

Some analyzers, such as \palmed{} and \iaca{}, model this limit: their
heatmaps show that the predicted IPC never surpasses it. The other three
analyzers studied, however, do not model it; for instance, on SPEC2017 on the
\texttt{SKL-SP} CPU, \uopsinfo{} predicts a high density of benchmarks at 8
instructions per cycle, while the native measurement yields only 4
instructions per cycle. The same effect is visible on the \pmevo{} and
\llvmmca{} heatmaps.

\begin{example}{High back-end throughput on \texttt{SKL-SP}}
On the \texttt{SKL-SP} microarchitecture, assuming an infinitely large
frontend, an IPC higher than 4 is easy to reach.

According to \uopsinfo{} data, a 64-bit integer \lstxasm{addq} is processed as
a single \uop{}, dispatched on port 0, 1, 5 or 6. Meanwhile, a simple 64-bit
register store to an address held in a register ---~\eg{} a \lstxasm{movq
\%rax, (\%rbx)}~--- is also processed as a single \uop{}, dispatched on port 2
or 3.

Thus, backend-wise, the kernel $4\times \texttt{addq} + 2\times \texttt{mov}$
has a throughput of 6 instructions per cycle. However, in reality, this kernel
is frontend-bound, with a theoretical maximum throughput of 4 instructions per
cycle ---~in fact, a \pipedream{} measurement yields only 3 instructions per
cycle.
\end{example}
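
The arithmetic behind these figures can be made explicit. The computation
below is only a back-of-the-envelope sketch under the example's simplifying
assumptions (a single \uop{} per instruction, \uops{} spread evenly over their
admissible ports):

% Back-of-the-envelope sketch (not a measurement): assumes one uop per
% instruction, with uops spread evenly over their admissible ports.
\begin{align*}
\text{backend-bound:}\quad & \max\left(\tfrac{4}{4},\ \tfrac{2}{2}\right)
    = 1 \text{ cycle/iteration}
    &&\Rightarrow 6 \text{ instructions per cycle,}\\
\text{frontend-bound:}\quad & \tfrac{6}{4} = 1.5 \text{ cycles/iteration}
    &&\Rightarrow 4 \text{ instructions per cycle.}
\end{align*}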

\bigskip{}

To account for this, \palmed{} tries to detect an additional resource, apart
from the backend ports and combined ports, on which every \uop{} incurs a
load. This allows \palmed{} to avoid large errors on frontend-bound kernels.

The approach is, however, far from perfect. The clearest reason for this is
that the frontend, both on x86-64 and ARM architectures, works in order, while
\palmed{} inherently models kernels as multisets of instructions, thus
completely ignoring ordering. This resource model is purely linear: an
instruction incurs a load on the frontend resource in a fully commutative way,
independently of the other instructions executed in the same cycle and of many
other effects.
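
As a rough sketch (the notation below is ours, for illustration only), such a
linear model predicts the steady-state execution time of a kernel $K$ as a
maximum of summed per-resource loads, a quantity that is invariant under any
reordering of $K$:

% Illustrative notation: rho_{i,r} is the load that instruction i places on
% resource r (a port, a combined port, or the extra frontend resource).
\[
T(K) \;=\; \max_{r \in \mathcal{R}} \sum_{i \in K} \rho_{i,r}
\]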

The article introducing \uica{}~\cite{uica} explores this question in detail
for x86-64 Intel architectures. The authors, having previously developed
\uopsinfo{}, discuss the importance of correctly modelling the frontend to
accurately predict throughput. Their approach, based on the exploration and
reverse-engineering of the crucial parts of the frontend, showcases many
important and non-trivial aspects of frontends that are usually neglected,
such as the switching between the decoders and the \uop{}-cache as the source
of instructions ---~which cannot be modelled linearly.

@@ -1,169 +0,0 @@

60 manuscrit/40_A72-frontend/20_cortex_a72.tex Normal file
@@ -0,0 +1,60 @@
\section{The Cortex A72 CPU}

The Cortex A72~\cite{a72_doc} is a CPU based on the ARMv8-A ISA ---~the first
ARM ISA to implement AArch64, the 64-bit ARM extension. It is an out-of-order
CPU with Neon SIMD support, designed as a general-purpose, high-performance
core for low-power applications.

The Raspberry Pi 4 uses a quad-core A72 CPU, implemented by Broadcom as the
BCM2711; it is thus easy to get access to an A72 to run experiments.

\begin{figure}
    \centering
    \includegraphics[width=\linewidth]{A72_pipeline_diagram.svg}
    \caption{Simplified overview of the Cortex A72
    pipeline}\label{fig:a72_pipeline}
\end{figure}

\paragraph{Backend.} As can be seen in \autoref{fig:a72_pipeline} (adapted
from the software optimization guide for the Cortex A72, published by
ARM~\cite{ref:a72_optim}), the Cortex A72 has eight execution ports:
\begin{itemize}
    \item a branch port (branch instructions, equivalent to x86 jumps);
    \item two identical integer ports (integer arithmetic operations);
    \item an integer multi-cycle port (complex integer operations, \eg{}
        divisions);
    \item two floating-point and SIMD ports (mostly identical, with slight
        specializations: \eg{} only port FP0 can do SIMD multiplication,
        while only port FP1 can do floating-point comparisons);
    \item a load port;
    \item a store port.
\end{itemize}

\paragraph{Frontend.} The Cortex A72 frontend can only decode three
instructions and dispatch three \uops{} per cycle~\cite{ref:a72_optim}.
Intel's \texttt{SKL-SP}, which we considered before, has a frontend that
bottlenecks at four \uops{} per cycle~\cite{agnerfog_skl_front4}. This
difference of a single \uop{} per cycle is meaningful: it implies that at most
three of the eight backend ports can be used in any given cycle.

\begin{example}[2nd order polynomial evaluation]
Consider a kernel evaluating a 2nd order polynomial for each element $X[i]$ of
an array:
\begin{align*}
P[i] &= a{X[i]}^2 + bX[i] + c \\
     &= \left( aX[i] + b \right) \times X[i] + c
\end{align*}
which directly translates to four operations: a load of $X[i]$, two
floating-point multiply-adds, and a store of the result $P[i]$. The backend,
having a load port, two SIMD ports and a store port, can execute one iteration
of such a kernel every cycle; in steady state, out-of-order execution can lift
the latency-induced pressure. However, as the frontend bottlenecks at three
\uops{} per cycle, one iteration of this kernel does not fit in a single
cycle.
\end{example}
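
The frontend bound can be quantified with a short computation; the sketch
below is ours and assumes that each of the four operations maps to a single
\uop{}:

% Illustrative sketch: four uops per iteration, with a frontend width of
% three uops per cycle on the Cortex A72.
\begin{align*}
\text{backend-bound:}\quad & 1 \text{ cycle/iteration}
    &&\Rightarrow 4 \text{ instructions per cycle,}\\
\text{frontend-bound:}\quad & \tfrac{4}{3} \approx 1.33 \text{ cycles/iteration}
    &&\Rightarrow 3 \text{ instructions per cycle.}
\end{align*}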

\paragraph{Lack of hardware counters.}
The Cortex A72 features only a very limited set of specialized hardware
counters. While the CPU is able to report the number of elapsed cycles,
retired instructions, branch misses and various metrics on cache misses, it
does not report any event regarding macro- or micro-operations, dispatching or
issuing to specific ports. This makes it, as pointed out before, a
particularly relevant target for \palmed{}.

5 manuscrit/40_A72-frontend/30_manual_frontend.tex Normal file
@@ -0,0 +1,5 @@
\section{Manually modelling the A72 frontend}

% TODO

\subsection{Methodology}

@@ -1,4 +1,6 @@
\chapter{Beyond ports: manually modelling the A72 frontend}\label{chap:frontend}

\input{00_intro.tex}
\input{10_cortex_a72.tex}
\input{10_beyond_ports.tex}
\input{20_cortex_a72.tex}
\input{30_manual_frontend.tex}
@@ -111,8 +111,8 @@

@misc{fugaku_top500,
  title={Supercomputer Fugaku retains first place worldwide in HPCG and Graph500 rankings},
  year=2022,
  month=November,
  year=2023,
  month=05,
  author={{Fujitsu Limited}},
  howpublished={\url{https://www.fujitsu.com/global/about/resources/news/press-releases/2022/1115-01.html}}
}