\section{A dive into processors' microarchitecture}
A modern computer can roughly be broken down into a number of functional parts:
a processor, a general-purpose computation unit; accelerators, such as GPUs,
computation units specialized for specific tasks; memory, both volatile but
fast (RAM) and persistent but slower (SSD, HDD); hardware specialized for
interfacing, such as network cards or USB controllers; and power supplies,
responsible for providing smoothed, adequate electric power to the previous
components.

This manuscript will largely focus on the processor. While some of the
techniques described here might possibly be used for accelerators, we did not
experiment in this direction, nor are we aware of efforts to do so.
\subsection{High-level abstraction of processors}
A processor, in its coarsest view, is simply a piece of hardware that can be
fed with a flow of instructions which, one after the other, modify the
machine's internal state.

The processor's state, the available instructions themselves and their effect
on the state are defined by an \emph{Instruction Set Architecture}, or ISA\@,
such as x86-64 or A64 (ARM's ISA). More generally, the ISA defines how software
will interact with a given processor, including the registers available to the
programmer, the instructions' semantics ---~broadly speaking, as these are
often informal~---, etc. These instructions are represented, at a
human-readable level, by \emph{assembly code}, such as \lstxasm{add (\%rax),
\%rbx} in x86-64. Assembly code is then transcribed, or \emph{assembled}, into
a binary representation in order to be fed to the processor ---~for instance,
\lstxasm{0x480318} for the previous instruction. This instruction computes the
sum of the value held at the memory address contained in \reg{rax} and of the
value in \reg{rbx}, but it does not, strictly speaking, \emph{return} or
\emph{produce} a result: instead, it stores the result of the computation in
register \reg{rbx}, altering the machine's state.
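This state-altering view can be made concrete with a toy model ---~a
hypothetical Python sketch for illustration only, in which registers and
memory are plain dictionaries, nothing like an actual implementation:

```python
# Toy model (hypothetical, for illustration only) of how an instruction
# alters the machine state: registers and memory are plain dictionaries.
def execute_add_mem_reg(state, addr_reg, dst_reg):
    """Mimics x86-64 `add (%addr_reg), %dst_reg`: add the value stored
    at the memory address held in addr_reg into dst_reg."""
    addr = state["regs"][addr_reg]
    state["regs"][dst_reg] += state["mem"][addr]

state = {"regs": {"rax": 0x1000, "rbx": 2}, "mem": {0x1000: 40}}
execute_add_mem_reg(state, "rax", "rbx")
print(state["regs"]["rbx"])  # → 42
```

Note that the function returns nothing: like the instruction it models, its
only effect is to mutate the state.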
This state, generally, is composed of a small number of \emph{registers}, small
pieces of memory on which the processor can directly operate ---~to perform
arithmetic operations, index the main memory, etc. It is also composed of the
whole memory hierarchy, including the persistent memory, the main memory
(usually RAM) and the hierarchy of caches between the processor and the main
memory. This state can also be extended to encompass external effects, such as
network communications, peripherals, etc.

The way an ISA is implemented, in order for the instructions to alter the state
as specified, is called a \emph{microarchitecture}. Many microarchitectures can
implement the same ISA, as is the case for instance with the x86-64 ISA,
implemented by both Intel and AMD, each with multiple generations, which
translates into multiple microarchitectures. It is also frequent for ISAs to
have many extensions, which each microarchitecture may or may not implement.
\subsection{Microarchitectures}
\begin{figure}
    \centering
    \includegraphics[width=0.9\textwidth]{cpu_big_picture.svg}
    \caption{Simplified and generalized global representation of a CPU
    microarchitecture}\label{fig:cpu_big_picture}
\end{figure}

While many different ISAs are available and used, and even many more
microarchitectures are industrially implemented and widely distributed, some
generalities still hold for the vast majority of processors found in commercial
or server-grade computers. Such a generic view is obviously an approximation
and will miss many details and specificities; it should, however, be sufficient
for the purposes of this manuscript.

A microarchitecture can be broken down into a few functional blocks, shown in
\autoref{fig:cpu_big_picture}, roughly amounting to a \emph{frontend}, a
\emph{backend}, a \emph{register file}, multiple \emph{data caches} and a
\emph{retire buffer}.
\medskip{}

\paragraph{Frontend and backend.} The frontend is responsible for fetching the
flow of instruction bytes to be executed, breaking it down into operations
executable by the backend and issuing them to execution units. The backend, in
turn, is responsible for the actual computations made by the processor.

\paragraph{Register file.} The register file holds the processor's registers,
on which computations are made.

\paragraph{Data caches.} The cache hierarchy (usually L1, L2 and L3) caches
lines of data from the main memory, whose access latency would slow
computation down by several orders of magnitude if it were accessed directly.
Usually, the L1 cache resides directly in the computation core, while the L2
and L3 caches are shared between multiple cores.
\bigskip{}

The CPU frontend constantly fetches a flow of instruction bytes. This flow must
first be broken down into a sequence of instructions. While on some ISAs, each
instruction is made of a fixed number of bytes ---~\eg{} ARM~---, this is not
always the case: for instance, x86-64 instructions can be as short as one byte,
while the ISA only limits an instruction to 15 bytes. This task is performed
by the \emph{decoder}, which usually outputs a flow of \emph{micro-operations},
or \uops.
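To illustrate why variable-length encodings make this step non-trivial, here
is a toy decoder for a hypothetical encoding ---~much simpler than real
x86-64~--- in which the first byte of each instruction gives its total length
in bytes; note that instruction boundaries only become known as decoding
progresses:

```python
# Toy decoder for a hypothetical variable-length encoding where the
# first byte of each instruction states its total length in bytes.
def decode(stream):
    insns, i = [], 0
    while i < len(stream):
        length = stream[i]           # boundary known only after reading
        insns.append(stream[i:i + length])
        i += length
    return insns

raw = bytes([2, 0xAA, 3, 0xBB, 0xCC, 1])
print([insn.hex() for insn in decode(raw)])  # → ['02aa', '03bbcc', '01']
```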
Some microarchitectures rely on complex decoding phases, first splitting
instructions into \emph{macro-operations}, to be split again into \uops{}
further down the line. Part of this decoding may also be cached, \eg{} to
optimize loop decoding, where the same sequence of instructions will be decoded
many times.
\smallskip{}

Microarchitectures typically implement more physical registers in their
register file than the ISA exposes to the programmer. The CPU takes advantage
of these additional registers by including a \emph{renamer} in the frontend,
which maps the ISA-defined registers used explicitly in instructions to
concrete registers in the register file. As long as enough concrete registers
are available, this phase eliminates certain categories of data dependencies;
this aspect is explored briefly below, and later in \autoref{chap:staticdeps}.
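A minimal sketch of this renaming, assuming a simplified three-operand \uop{}
format (all names and the physical-register naming scheme are hypothetical):
each destination receives a fresh physical register, so the reuse of \reg{rax}
below no longer serializes two otherwise independent additions.

```python
# Minimal register-renaming sketch: uops are (dst, src1, src2) tuples
# over architectural registers; each dst gets a fresh physical register.
def rename(uops):
    mapping = {}              # architectural -> physical register
    next_phys = 0
    renamed = []
    for dst, src1, src2 in uops:
        s1 = mapping.get(src1, src1)      # sources read the current mapping
        s2 = mapping.get(src2, src2)
        mapping[dst] = f"p{next_phys}"    # fresh physical register for dst
        next_phys += 1
        renamed.append((mapping[dst], s1, s2))
    return renamed

uops = [("rax", "rbx", "rcx"),   # rax <- rbx + rcx
        ("rax", "rdx", "rsi")]   # rax <- rdx + rsi (independent!)
print(rename(uops))  # → [('p0', 'rbx', 'rcx'), ('p1', 'rdx', 'rsi')]
```

True (read-after-write) dependencies are preserved, since sources are looked
up in the current mapping before the destination is remapped.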
\smallskip{}

Depending on the microarchitecture, the decoded operations ---~be they macro-
or micro-operations at this stage~--- may undergo several more phases, specific
to each processor.
\smallskip{}

Typically, however, \uops{} will eventually be fed into a \emph{Reorder
Buffer}, or ROB\@. Today, most consumer- or server-grade CPUs are
\emph{out-of-order}, with effects detailed below; the reorder buffer makes this
possible.
\smallskip{}

Finally, the \uops{} are \emph{issued} to the backend through \emph{execution
ports}. Each port usually processes at most one \uop{} per CPU cycle, and acts
as a sort of gateway towards the actual execution units of the processor.

Each execution port may be (and usually is) connected to multiple different
execution units: for instance, Intel Skylake's port 6 is responsible for both
branch \uops{} and integer arithmetic, while ARM's Cortex-A72 has a single
port for both memory loads and stores.
\smallskip{}

In most cases, execution units are \emph{fully pipelined}, meaning that while
processing a single \uop{} takes multiple cycles, the unit is able to start
processing a new \uop{} every cycle: multiple \uops{} are then being processed,
at different stages, during each cycle.
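As a back-of-the-envelope model of this behaviour (idealized, ignoring issue
width and stalls): if a fully pipelined unit has a latency of $L$ cycles, $n$
independent \uops{} complete after $L + n - 1$ cycles, approaching a
throughput of one \uop{} per cycle.

```python
# Idealized model of a fully pipelined execution unit: a new uop may
# start every cycle, so n independent uops finish in latency + n - 1
# cycles rather than latency * n.
def pipelined_cycles(latency, n):
    return latency + n - 1

print(pipelined_cycles(4, 1))    # → 4
print(pipelined_cycles(4, 100))  # → 103, i.e. about one uop per cycle
```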
\smallskip{}

Finally, when a \uop{} has been entirely processed and exits its processing
unit's pipeline, it is committed to the \emph{retire buffer}, marking the
\uop{} as complete.
\subsubsection{Dependencies handling}

In this flow of \uops{}, some are dependent on the result computed by a
previous \uop{}. If, for instance, two successive identical \uops{} compute
$\reg{r10} \gets \reg{r10} + \reg{r11}$, the second instance must wait for the
completion of the first one, as the value of \reg{r10} is not known before the
first \uop{} completes.
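Under the same idealized latency model as above (a hypothetical sketch, not a
measurement), a chain of $n$ \uops{} in which each one waits for the previous
result cannot overlap in the pipeline: with a latency of $L$ cycles per
\uop{}, the chain takes $n \times L$ cycles instead of $L + n - 1$.

```python
# Idealized cost models: a dependency chain serializes execution,
# while independent uops overlap in the pipeline.
def dependent_chain_cycles(latency, n):
    return latency * n          # each uop waits for the previous one

def independent_cycles(latency, n):
    return latency + n - 1      # fully overlapped in the pipeline

print(dependent_chain_cycles(4, 100))  # → 400
print(independent_cycles(4, 100))      # → 103
```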
The \uops{} that depend on a previous \uop{} are not \emph{issued} until the
latter is marked as completed by entering the retire buffer.

Since computation units are pipelined, they reach their best efficiency only
when \uops{} can be fed to them in a constant flow. Yet a single dependency may
block the computation entirely until the required result is computed,
throttling the CPU's performance.
The \emph{renamer} helps relieve this dependency pressure when the dependency
can be broken by simply renaming one of the registers.
\subsubsection{Out-of-order vs. in-order processors}

When computation is stalled by a dependency, it may however be possible to
immediately issue a \uop{} that comes later in the instruction stream, provided
it does not need results that are not yet available.
For this reason, many processors are now \emph{out-of-order}, while processors
issuing \uops{} strictly in their original order are called \emph{in-order}.
Out-of-order microarchitectures feature a \emph{reorder buffer}, from which
instructions are picked to be issued. The reorder buffer acts as a sliding
window of microarchitecturally-fixed size over \uops{}, from which the oldest
\uop{} whose dependencies are satisfied will be executed. Thus, out-of-order
CPUs are only able to execute operations out of order as long as the
\uop{} to be executed is not too far ahead of the oldest \uop{} awaiting issue
---~specifically, not more than the size of the reorder buffer ahead.
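The selection performed over this window can be sketched as follows ---~a
deliberately naive model with hypothetical names; real schedulers are
considerably more involved~---: each cycle, the oldest \uop{} whose source
registers are all ready is picked.

```python
# Naive sketch of reorder-buffer scheduling: scan the window oldest
# first, and pick the first uop whose sources are all available.
def pick_next(window, ready_regs):
    for idx, (dst, srcs) in enumerate(window):
        if all(s in ready_regs for s in srcs):
            return idx
    return None  # every uop in the window is blocked on a dependency

window = [("r1", ("r0",)),       # oldest: waits on r0, not ready yet
          ("r2", ("r3", "r4"))]  # younger, but all sources ready
print(pick_next(window, ready_regs={"r3", "r4"}))  # → 1
```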
It is also important to note that out-of-order processors are only out-of-order
\emph{from a certain point on}: a substantial part of the processor's frontend
is typically still in-order.
\subsubsection{Hardware counters}

Many processors provide \emph{hardware counters} to help (low-level)
programmers understand how their code is executed. The available counters vary
widely from one processor to another. The majority of processors, however,
offer counters to determine the number of elapsed cycles between two
instructions, as well as the number of retired instructions. Some processors
further offer counters for the number of cache misses and hits in the various
caches, or even the number of \uops{} executed on a specific port.
While access to these counters is vendor-dependent, abstraction layers are
available: for instance, the Linux kernel abstracts these counters through the
\perf{} interface, while \papi{} further attempts to unify similar counters
from different vendors under a common name.
\subsubsection{SIMD operations}
\todo{}