\section{A dive into processors' microarchitecture}\label{sec:intro_microarch}

A modern computer can roughly be broken down into a number of functional parts:
a processor, a general-purpose computation unit; accelerators, such
as GPUs, computation units specialized for specific tasks; memory, both volatile
but fast (RAM) and persistent but slower (SSD, HDD); hardware specialized for
interfacing, such as network cards or USB controllers; power supplies,
responsible for providing smoothed, adequate electric power to the previous
components.

This manuscript will largely focus on the processor. While some of the
techniques described here might be applicable to accelerators, we did not
experiment in this direction, nor are we aware of efforts in this direction.

\subsection{High-level abstraction of processors}

A processor, in its coarsest view, is simply a piece of hardware that can be
fed with a flow of instructions, which will, one after the other, modify the
machine's internal state.

The processor's state, the available instructions themselves and their effect
on the state are defined by an \emph{Instruction Set Architecture}, or ISA,
such as x86-64 or A64 (ARM's ISA). More generally, the ISA defines how software
will interact with a given processor, including the registers available to the
programmer, the instructions' semantics ---~broadly speaking, as these are
often informal~---, etc. These instructions are represented, at a
human-readable level, by \emph{assembly code}, such as \lstxasm{add (\%rax),
\%rbx} in x86-64. Assembly code is then transcribed, or \emph{assembled}, to a
binary representation in order to be fed to the processor ---~for instance,
\lstxasm{0x480318} for the previous instruction. This instruction computes the
sum of the value held at the memory address stored in \reg{rax} and of the
value held in \reg{rbx},
but it does not, strictly speaking, \emph{return} or \emph{produce} a result:
instead, it stores the result of the computation in register \reg{rbx},
altering the machine's state.
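
As an illustration, the three bytes of the encoding \lstxasm{0x480318} can be
read as follows, according to the usual x86-64 encoding rules; this is only a
sketch of the decomposition, not a full description of the encoding scheme:
% Assumes the listings package, presumably already used for \lstxasm, is loaded.
\begin{lstlisting}
48   REX prefix with W=1: 64-bit operand size
03   opcode: ADD r64, r/m64
18   ModRM byte: mod=00, reg=011 (rbx), r/m=000 (memory at [rax])
\end{lstlisting}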

This state, generally, is composed of a small number of \emph{registers}, small
pieces of memory on which the processor can directly operate ---~to perform
arithmetic operations, index the main memory, etc. It is also composed of the
whole memory hierarchy, including the persistent memory, the main memory
(usually RAM) and the hierarchy of caches between the processor and the main
memory. This state can also be extended to encompass external effects, such as
network communication, peripherals, etc.

The way an ISA is implemented, in order for the instructions to alter the state
as specified, is called a \emph{microarchitecture}. Many microarchitectures can
implement the same ISA, as is the case for instance with the x86-64 ISA,
implemented both by Intel and AMD, each with multiple generations, which
translates into multiple microarchitectures. It is thus frequent for ISAs to
have many extensions, which each microarchitecture may or may not implement.

\subsection{Microarchitectures}

\begin{figure}
    \centering
    \includegraphics[width=0.9\textwidth]{cpu_big_picture.svg}
    \caption{Simplified and generalized global representation of a CPU
    microarchitecture}\label{fig:cpu_big_picture}
\end{figure}

While many different ISAs are available and used, and even more
microarchitectures are industrially implemented and widely distributed, some
generalities still hold for the vast majority of processors found in commercial
or server-grade computers. Such a generic view is obviously an approximation
and will miss many details and specificities; it should, however, be sufficient
for the purposes of this manuscript.

A microarchitecture can be broken down into a few functional blocks, shown in
\autoref{fig:cpu_big_picture}, roughly amounting to a \emph{frontend}, a \emph{backend}, a
\emph{register file}, multiple \emph{data caches} and a \emph{retire buffer}.

\medskip{}

\paragraph{Frontend and backend.} The frontend is responsible for fetching the
flow of instruction bytes to be executed, breaking it down into operations
executable by the backend and issuing them to execution units. The backend, in
turn, is responsible for the actual computations made by the processor.

As such, the frontend can be seen as a manager for the backend: the latter
actually executes the work, while the former ensures that work is made
available to it, orchestrates its execution and scheduling, and ensures each
``worker'' in the backend is assigned tasks within its skill set.

\paragraph{Register file.} The register file holds the processor's registers,
small amounts of fast memory directly built into the processor's cores, on
which computations are made.

\paragraph{Data caches.} The cache hierarchy (usually L1, L2 and L3) caches
data rows from the main memory, whose access latency would slow computation
down by several orders of magnitude if it were accessed directly. Usually, the
L1 cache resides directly in the computation core, while the L3 cache, and on
some processors the L2 cache, is shared between multiple cores.

\subsubsection{An instruction's walk through the processor}

Several CPU cycles may pass from the moment an instruction is first fetched by
the processor, until the time this instruction is considered completed and
discarded. Let us follow the path of one such instruction through the
processor.

\smallskip{}

The CPU frontend constantly fetches a flow of instruction bytes. This flow must
first be broken down into a sequence of instructions. While on some ISAs, each
instruction is made of a constant number of bytes ---~\eg{} ARM~---, this is not
always the case: for instance, x86-64 instructions can be as short as one byte,
while the ISA only limits an instruction to 15
bytes~\cite{ref:intel64_software_dev_reference_vol1}. This task is performed by
the \emph{decoder}, which usually outputs a flow of \emph{micro-operations}, or
\uops.

Some microarchitectures rely on complex decoding phases, first splitting
instructions into \emph{macro-operations}, to be split again into \uops{}
further down the line. Part of this decoding may also be cached, \eg{} to
optimize loop decoding, where the same sequence of instructions will be decoded
many times.

\smallskip{}

Microarchitectures typically implement more physical registers in their
register file than the ISA exposes to the programmer. The CPU takes advantage
of those additional registers by including a \emph{renamer} in the frontend, to
which the just-decoded operations are fed.
The renamer maps the ISA-defined registers used explicitly in instructions to
concrete registers in the register file. As long as enough concrete registers
are available, this phase eliminates certain categories of data dependencies;
this aspect is explored briefly below, and later in \autoref{chap:staticdeps}.

\smallskip{}

Depending on the microarchitecture, the decoded operations ---~be they macro-
or micro-operations at this stage~--- may undergo several more phases, specific
to each processor.

\smallskip{}

Typically, however, \uops{} will eventually be fed into a \emph{Reorder
Buffer}, or ROB\@. Today, most consumer- or server-grade CPUs are
\emph{out-of-order}, with effects detailed below; the reorder buffer makes this
possible. The \uops{} may wait for a few cycles in this reorder buffer, before
being pulled by the \emph{issuer}.

\smallskip{}

The \uops{} are then \emph{issued} to the backend towards \emph{execution
ports}. Each port usually processes at most one \uop{} per CPU cycle, and acts
as a sort of gateway towards the actual execution units of the processor.

Each execution port may be (and usually is) connected to multiple different
execution units: for instance, Intel Skylake's port 6 is responsible for both
branch \uops{} and integer arithmetic, while ARM's Cortex A72 has a single
port for both memory loads and stores.

\smallskip{}

In most cases, execution units are \emph{fully pipelined}, meaning that while
processing a single \uop{} takes multiple cycles, the unit is able to start
processing a new \uop{} every cycle: multiple \uops{} are then being processed,
at different stages, during each cycle, akin to a factory's assembly line.
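
To make the benefit of pipelining concrete, consider a hypothetical fully
pipelined unit with a latency of $\ell$ cycles, fed with $n$ independent
\uops{} on consecutive cycles. The first \uop{} completes after $\ell$ cycles
and each following one completes one cycle later, whereas a unit that is not
pipelined would process them one at a time:
\[
    T_\text{pipelined} = \ell + (n - 1)
    \qquad \text{versus} \qquad
    T_\text{unpipelined} = n \cdot \ell.
\]
With, say, $\ell = 4$ and $n = 100$, this amounts to $103$ cycles instead of
$400$; the throughput thus tends towards one \uop{} per cycle as $n$ grows.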

\smallskip{}

Finally, when a \uop{} has been entirely processed and exits its processing
unit's pipeline, it is committed to the \emph{retire buffer}, marking the
\uop{} as complete.

\subsubsection{Dependency handling}

In this flow of \uops{}, some are dependent on the result computed by a
previous \uop{} ---~or, more precisely, await the change of state
induced by a previous \uop{}. If, for instance, two successive identical
\uops{} compute $\reg{r10} \gets \reg{r10} + \reg{r11}$, the second instance
must wait for the completion of the first one, as the value of \reg{r10} after
the execution of the first one is not known before its completion.

The \uops{} that depend on a previous \uop{} are not \emph{issued} until the
latter is marked as completed by entering the retire buffer\footnote{Some
    processors, however, introduce ``shortcuts'' when a \uop{} can yield a
    result before its full completion. In such cases, while the \uop{} depended
    on is not yet complete and retired, the dependent \uop{} can still be
issued.}.

Since computation units are pipelined, they reach their best efficiency only
when \uops{} can be fed to them in a constant flow. Yet, as such, a dependency
may block the computation entirely until the result it depends on is computed,
throttling down the CPU's performance.
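
As a minimal illustration, written in the same AT\&T syntax as above, the
following two x86-64 instructions form such a dependency chain through
\reg{r10}:
% Assumes the listings package, presumably already used for \lstxasm, is loaded.
\begin{lstlisting}
add %r11, %r10    # r10 <- r10 + r11
add %r11, %r10    # r10 <- r10 + r11, needs the r10 written just above
\end{lstlisting}
Assuming a hypothetical latency of $\ell$ cycles for each of these operations,
a chain of $n$ such dependent \uops{} takes about $n \cdot \ell$ cycles, as
each one must wait for the previous result, no matter how well the execution
unit is pipelined.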

The \emph{renamer} helps relieve this dependency pressure when the dependency
can be broken by simply renaming one of the registers. We detail this later in
\autoref{chap:staticdeps}, but such dependencies may be \eg{}
\emph{write-after-read}: if $\reg{r11} \gets \reg{r10}$ is followed by
$\reg{r10} \gets \reg{r12}$, then the latter must wait for the former's
completion, as it would otherwise overwrite $\reg{r10}$, which is read by the
former. However, the second instruction may be \emph{renamed} to write to
$\reg{r10}_\text{alt}$ instead ---~also renaming every subsequent read of
$\reg{r10}$ to this same register~---, thus avoiding the dependency.
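
This renaming can be sketched as follows, where \texttt{p1} to \texttt{p4} are
hypothetical physical register names chosen for the illustration, and do not
correspond to any actual processor:
% Assumes the listings package, presumably already used for \lstxasm, is loaded.
\begin{lstlisting}
# As written by the programmer, in AT&T syntax:
mov %r10, %r11    # r11 <- r10   (reads r10)
mov %r12, %r10    # r10 <- r12   (write-after-read on r10)

# After renaming, with r10 -> p1 and r12 -> p2 on entry:
mov p1, p3        # r11 now maps to p3
mov p2, p4        # r10 now maps to p4: p1 is left untouched
\end{lstlisting}
The second operation no longer writes to the register read by the first one, so
both can be executed in any order; subsequent reads of \reg{r10} are directed
to \texttt{p4}.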
\subsubsection{Out-of-order vs. in-order processors}

When computation is stalled by a dependency, it may however be possible to
immediately issue a \uop{} which comes later in the instruction stream, but
depends only on results already available.

For this reason, many processors are now \emph{out-of-order}, while processors
issuing \uops{} strictly in their original order are called \emph{in-order}.
Out-of-order microarchitectures feature a \emph{reorder buffer}, from which
instructions are picked to be issued. The reorder buffer acts as a sliding
window of microarchitecturally-fixed size over \uops{}, from which the oldest
\uop{} whose dependencies are satisfied will be executed. Thus, out-of-order
CPUs are only able to execute operations out of order as long as the
\uop{} to be executed is not too far ahead of the oldest \uop{} waiting to
be issued ---~specifically, not more than the size of the reorder buffer ahead.

It is also important to note that out-of-order processors are only out-of-order
\emph{from a certain point on}: a substantial part of the processor's frontend
is typically still in-order.
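
The sliding-window behaviour of the reorder buffer can be illustrated on a toy
example, assuming a hypothetical reorder buffer of only $R = 4$ entries and a
sequence of \uops{} $u_0, u_1, u_2, \ldots$ in program order. If $u_0$ is the
oldest \uop{} waiting to be issued, the window only spans
\[
    u_0,\ u_1,\ u_2,\ u_3,
\]
so that, while $u_0$ is stalled, any of $u_1$, $u_2$ and $u_3$ may be issued
ahead of it once its own dependencies are satisfied. The \uop{} $u_4$, however,
lies outside of the window and cannot be issued before $u_0$, even if all of
its operands are already available. More generally, with $u_i$ the oldest
\uop{} waiting to be issued, $u_j$ is a candidate only if $j - i < R$.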
\subsubsection{Hardware counters}\label{sssec:hw_counters}

Many processors provide \emph{hardware counters}, to help (low-level)
programmers understand how their code is executed. The available counters
vary widely from one processor to another. The majority of processors, however,
offer counters to determine the number of elapsed cycles between two
instructions, as well as the number of retired instructions. Some processors
further offer counters for the number of cache misses and hits on the various
caches, or even the number of \uops{} executed on a specific port.

While access to these counters is vendor-dependent, abstraction layers are
available: for instance, the Linux kernel abstracts these counters through the
\perf{} interface, while \papi{} further attempts to unify similar counters
from different vendors under a common name.
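
As a simple example of how such counters are typically combined, the number of
retired instructions $I$ and the number of elapsed cycles $C$, measured over
the same region of code, yield the average number of instructions per cycle:
\[
    \text{IPC} = \frac{I}{C}.
\]
With purely illustrative values of $I = 4000$ and $C = 1000$, the region would
have retired on average four instructions per cycle.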
\subsubsection{SIMD operations}

Processors operate at a given \emph{word size}, fixed by the ISA ---~typically
32 or 64 bits nowadays, even though embedded processors might operate at
smaller word sizes.

Some instructions, however, operate on chunks of multiple words at once. These
instructions are called \emph{vector instructions}, or \emph{SIMD} for Single
Instruction, Multiple Data. A SIMD ``add'' instruction may, for instance, add
two chunks of 128 bits, each of which can be treated as four integers
of 32 bits bundled together, as illustrated in \autoref{fig:cpu_simd}.

\begin{figure}
    \centering
    \includegraphics[width=0.6\textwidth]{simd.svg}
    \caption{Example of SIMD add instruction on 128 bits}\label{fig:cpu_simd}
\end{figure}
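
On x86-64, for instance, the SSE2 instruction \lstxasm{paddd} performs such an
addition. The sketch below, in which the choice of registers is arbitrary,
contrasts four scalar 32-bit additions with the single vector instruction
operating on four 32-bit lanes at once:
% Assumes the listings package, presumably already used for \lstxasm, is loaded.
\begin{lstlisting}
# Four scalar additions of 32-bit integers
add %r8d,  %eax
add %r9d,  %ebx
add %r10d, %ecx
add %r11d, %edx

# One vector addition: each of the four 32-bit lanes of xmm1
# is added to the matching lane of xmm0
paddd %xmm1, %xmm0
\end{lstlisting}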

Such instructions present clear efficiency advantages. If the processor is able
to handle one such instruction every cycle ---~even if it is pipelined for
multiple cycles~---, it multiplies the processor's throughput by the number of
vector elements, making it able to process \eg{} four add operations per
cycle instead of one, as long as the data is arranged in memory in an
appropriate way. Some processors, however, are not able to process the full
vector instruction at once, for lack of backend units ---~they may, for
instance, only process two 32-bit adds at once, making the processor able to
execute only one such instruction every two cycles. Even in this case, there
are clear efficiency benefits: while there is no real gain in the backend, the
frontend has only one instruction to decode, rename, etc., greatly alleviating
frontend pressure. This is for instance the case of an implementation of the
RISC-V~\cite{riscv_isa} vector extension, supporting up to 256 double-precision
floats in a single operation, while the hardware supports far fewer in one
cycle~\cite{filippo_riscv_vector, filippo_acaces23}.
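
The arithmetic behind the 128-bit example above can be spelled out, under the
simplifying assumption of vectors made of 32-bit elements only:
\[
    \frac{128 \text{ bits per vector}}{32 \text{ bits per element}}
    = 4 \text{ elements per instruction}.
\]
If one such instruction is executed every cycle, the processor sustains
$4 \times 1 = 4$ additions per cycle. If the backend can only process two
elements per cycle, each instruction occupies it for $4 / 2 = 2$ cycles, for a
sustained $4 / 2 = 2$ additions per cycle; the frontend, however, still handles
a single instruction where four scalar ones would otherwise be needed.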