\section{A dive into processors' microarchitecture}
A modern computer can roughly be broken down into a number of functional parts:
a processor, a general-purpose computation unit; accelerators, such as GPUs,
computation units specialized for specific tasks; memory, both volatile but
fast (RAM) and persistent but slower (SSD, HDD); hardware specialized for
interfacing, such as network cards or USB controllers; and power supplies,
responsible for providing smoothed, adequate electric power to the previous
components.

This manuscript will largely focus on the processor. While some of the
techniques described here might possibly be used for accelerators, we did not
experiment in this direction, nor are we aware of efforts to do so.
\subsection{High-level abstraction of processors}
A processor, in its coarsest view, is simply a piece of hardware that can be
fed with a flow of instructions which, one after the other, modify the
machine's internal state.

The processor's state, the available instructions themselves and their effect
on the state are defined by an \emph{Instruction Set Architecture}, or ISA\@,
such as x86-64 or A64 (ARM's ISA). More generally, the ISA defines how software
will interact with a given processor, including the registers available to the
programmer, the instructions' semantics ---~broadly speaking, as these are
often informal~---, etc. These instructions are represented, at a
human-readable level, by \emph{assembly code}, such as \lstxasm{add (\%rax),
\%rbx} in x86-64. Assembly code is then transcribed, or \emph{assembled}, into
a binary representation in order to be fed to the processor ---~for instance,
\lstxasm{0x480318} for the previous instruction. This instruction computes the
sum of the value held at the memory address contained in \reg{rax} and of the
value in \reg{rbx}, but it does not, strictly speaking, \emph{return} or
\emph{produce} a result: instead, it stores the result of the computation in
register \reg{rbx}, altering the machine's state.
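This state-altering view can be made concrete with a toy model ---~a
hypothetical Python sketch for illustration only, in which registers and
memory are plain dictionaries, nothing like an actual implementation:

```python
# Toy model (hypothetical, for illustration only) of how an instruction
# alters the machine state: registers and memory are plain dictionaries.
def execute_add_mem_reg(state, addr_reg, dst_reg):
    """Mimics x86-64 `add (%addr_reg), %dst_reg`: add the value stored
    at the memory address held in addr_reg into dst_reg."""
    addr = state["regs"][addr_reg]
    state["regs"][dst_reg] += state["mem"][addr]

state = {"regs": {"rax": 0x1000, "rbx": 2}, "mem": {0x1000: 40}}
execute_add_mem_reg(state, "rax", "rbx")
print(state["regs"]["rbx"])  # → 42
```

Note that the function returns nothing: like the instruction it models, its
only effect is to mutate the state.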
This state, generally, is composed of a small number of \emph{registers}, small
pieces of memory on which the processor can directly operate ---~to perform
arithmetic operations, index the main memory, etc. It is also composed of the
whole memory hierarchy, including the persistent memory, the main memory
(usually RAM) and the hierarchy of caches between the processor and the main
memory. This state can also be extended to encompass external effects, such as
network communications, peripherals, etc.

The way an ISA is implemented, in order for the instructions to alter the state
as specified, is called a \emph{microarchitecture}. Many microarchitectures can
implement the same ISA, as is the case for instance with the x86-64 ISA,
implemented by both Intel and AMD, each with multiple generations, which
translates into multiple microarchitectures. It is also frequent for ISAs to
have many extensions, which each microarchitecture may or may not implement.
\subsection{Microarchitectures}
\begin{figure}
    \centering
    \includegraphics[width=0.9\textwidth]{cpu_big_picture.svg}
    \caption{Simplified and generalized global representation of a CPU
    microarchitecture}\label{fig:cpu_big_picture}
\end{figure}

While many different ISAs are available and used, and even many more
microarchitectures are industrially implemented and widely distributed, some
generalities still hold for the vast majority of processors found in commercial
or server-grade computers. Such a generic view is obviously an approximation
and will miss many details and specificities; it should, however, be sufficient
for the purposes of this manuscript.

A microarchitecture can be broken down into a few functional blocks, shown in
\autoref{fig:cpu_big_picture}, roughly amounting to a \emph{frontend}, a
\emph{backend}, a \emph{register file}, multiple \emph{data caches} and a
\emph{retire buffer}.
\medskip{}

\paragraph{Frontend and backend.} The frontend is responsible for fetching the
flow of instruction bytes to be executed, breaking it down into operations
executable by the backend and issuing them to execution units. The backend, in
turn, is responsible for the actual computations made by the processor.

\paragraph{Register file.} The register file holds the processor's registers,
on which computations are made.

\paragraph{Data caches.} The cache hierarchy (usually L1, L2 and L3) caches
lines of data from the main memory, whose access latency would slow
computation down by several orders of magnitude if it were accessed directly.
Usually, the L1 cache resides directly in the computation core, while the L2
and L3 caches are shared between multiple cores.
\bigskip{}

The CPU frontend constantly fetches a flow of instruction bytes. This flow must
first be broken down into a sequence of instructions. While on some ISAs, each
instruction is made of a fixed number of bytes ---~\eg{} ARM~---, this is not
always the case: for instance, x86-64 instructions can be as short as one byte,
while the ISA only limits an instruction to 15 bytes. This task is performed
by the \emph{decoder}, which usually outputs a flow of \emph{micro-operations},
or \uops.
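To illustrate why variable-length encodings make this step non-trivial, here
is a toy decoder for a hypothetical encoding ---~much simpler than real
x86-64~--- in which the first byte of each instruction gives its total length
in bytes; note that instruction boundaries only become known as decoding
progresses:

```python
# Toy decoder for a hypothetical variable-length encoding where the
# first byte of each instruction states its total length in bytes.
def decode(stream):
    insns, i = [], 0
    while i < len(stream):
        length = stream[i]           # boundary known only after reading
        insns.append(stream[i:i + length])
        i += length
    return insns

raw = bytes([2, 0xAA, 3, 0xBB, 0xCC, 1])
print([insn.hex() for insn in decode(raw)])  # → ['02aa', '03bbcc', '01']
```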
Some microarchitectures rely on complex decoding phases, first splitting
instructions into \emph{macro-operations}, to be split again into \uops{}
further down the line. Part of this decoding may also be cached, \eg{} to
optimize loop decoding, where the same sequence of instructions will be decoded
many times.
\smallskip{}

Microarchitectures typically implement more physical registers in their
register file than the ISA exposes to the programmer. The CPU takes advantage
of these additional registers by including a \emph{renamer} in the frontend,
which maps the ISA-defined registers used explicitly in instructions to
concrete registers in the register file. As long as enough concrete registers
are available, this phase eliminates certain categories of data dependencies;
this aspect is explored briefly below, and later in \autoref{chap:staticdeps}.
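A minimal sketch of this renaming, assuming a simplified three-operand \uop{}
format (all names and the physical-register naming scheme are hypothetical):
each destination receives a fresh physical register, so the reuse of \reg{rax}
below no longer serializes two otherwise independent additions.

```python
# Minimal register-renaming sketch: uops are (dst, src1, src2) tuples
# over architectural registers; each dst gets a fresh physical register.
def rename(uops):
    mapping = {}              # architectural -> physical register
    next_phys = 0
    renamed = []
    for dst, src1, src2 in uops:
        s1 = mapping.get(src1, src1)      # sources read the current mapping
        s2 = mapping.get(src2, src2)
        mapping[dst] = f"p{next_phys}"    # fresh physical register for dst
        next_phys += 1
        renamed.append((mapping[dst], s1, s2))
    return renamed

uops = [("rax", "rbx", "rcx"),   # rax <- rbx + rcx
        ("rax", "rdx", "rsi")]   # rax <- rdx + rsi (independent!)
print(rename(uops))  # → [('p0', 'rbx', 'rcx'), ('p1', 'rdx', 'rsi')]
```

True (read-after-write) dependencies are preserved, since sources are looked
up in the current mapping before the destination is remapped.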
\smallskip{}

Depending on the microarchitecture, the decoded operations ---~be they macro-
or micro-operations at this stage~--- may undergo several more phases, specific
to each processor.
\smallskip{}

Typically, however, \uops{} will eventually be fed into a \emph{Reorder
Buffer}, or ROB\@. Today, most consumer- or server-grade CPUs are
\emph{out-of-order}, with effects detailed below; the reorder buffer makes this
possible.
\smallskip{}

Finally, the \uops{} are \emph{issued} to the backend through \emph{execution
ports}. Each port usually processes at most one \uop{} per CPU cycle, and acts
as a sort of gateway towards the actual execution units of the processor.

Each execution port may be (and usually is) connected to multiple different
execution units: for instance, Intel Skylake's port 6 is responsible for both
branch \uops{} and integer arithmetic, while ARM's Cortex-A72 has a single
port for both memory loads and stores.
\smallskip{}

In most cases, execution units are \emph{fully pipelined}, meaning that while
processing a single \uop{} takes multiple cycles, the unit is able to start
processing a new \uop{} every cycle: multiple \uops{} are then being processed,
at different stages, during each cycle.
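As a back-of-the-envelope model of this behaviour (idealized, ignoring issue
width and stalls): if a fully pipelined unit has a latency of $L$ cycles, $n$
independent \uops{} complete after $L + n - 1$ cycles, approaching a
throughput of one \uop{} per cycle.

```python
# Idealized model of a fully pipelined execution unit: a new uop may
# start every cycle, so n independent uops finish in latency + n - 1
# cycles rather than latency * n.
def pipelined_cycles(latency, n):
    return latency + n - 1

print(pipelined_cycles(4, 1))    # → 4
print(pipelined_cycles(4, 100))  # → 103, i.e. about one uop per cycle
```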
\smallskip{}

Finally, when a \uop{} has been entirely processed and exits its processing
unit's pipeline, it is committed to the \emph{retire buffer}, marking the
\uop{} as complete.
\subsubsection{Dependencies handling}

In this flow of \uops{}, some are dependent on the result computed by a
previous \uop{}. If, for instance, two successive identical \uops{} compute
$\reg{r10} \gets \reg{r10} + \reg{r11}$, the second instance must wait for the
completion of the first one, as the value of \reg{r10} is not known before the
first \uop{} completes.
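Under the same idealized latency model as above (a hypothetical sketch, not a
measurement), a chain of $n$ \uops{} in which each one waits for the previous
result cannot overlap in the pipeline: with a latency of $L$ cycles per
\uop{}, the chain takes $n \times L$ cycles instead of $L + n - 1$.

```python
# Idealized cost models: a dependency chain serializes execution,
# while independent uops overlap in the pipeline.
def dependent_chain_cycles(latency, n):
    return latency * n          # each uop waits for the previous one

def independent_cycles(latency, n):
    return latency + n - 1      # fully overlapped in the pipeline

print(dependent_chain_cycles(4, 100))  # → 400
print(independent_cycles(4, 100))      # → 103
```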
The \uops{} that depend on a previous \uop{} are not \emph{issued} until the
latter is marked as completed by entering the retire buffer.

Since computation units are pipelined, they reach their best efficiency only
when \uops{} can be fed to them in a constant flow. Yet a single dependency may
block the computation entirely until the required result is computed,
throttling the CPU's performance.
The \emph{renamer} helps relieve this dependency pressure when the dependency
can be broken by simply renaming one of the registers.
\subsubsection{Out-of-order vs. in-order processors}

When computation is stalled by a dependency, it may however be possible to
immediately issue a \uop{} that comes later in the instruction stream, provided
it does not need results that are not yet available.
For this reason, many processors are now \emph{out-of-order}, while processors
issuing \uops{} strictly in their original order are called \emph{in-order}.
Out-of-order microarchitectures feature a \emph{reorder buffer}, from which
instructions are picked to be issued. The reorder buffer acts as a sliding
window of microarchitecturally-fixed size over \uops{}, from which the oldest
\uop{} whose dependencies are satisfied will be executed. Thus, out-of-order
CPUs are only able to execute operations out of order as long as the
\uop{} to be executed is not too far ahead of the oldest \uop{} awaiting issue
---~specifically, not more than the size of the reorder buffer ahead.
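The selection performed over this window can be sketched as follows ---~a
deliberately naive model with hypothetical names; real schedulers are
considerably more involved~---: each cycle, the oldest \uop{} whose source
registers are all ready is picked.

```python
# Naive sketch of reorder-buffer scheduling: scan the window oldest
# first, and pick the first uop whose sources are all available.
def pick_next(window, ready_regs):
    for idx, (dst, srcs) in enumerate(window):
        if all(s in ready_regs for s in srcs):
            return idx
    return None  # every uop in the window is blocked on a dependency

window = [("r1", ("r0",)),       # oldest: waits on r0, not ready yet
          ("r2", ("r3", "r4"))]  # younger, but all sources ready
print(pick_next(window, ready_regs={"r3", "r4"}))  # → 1
```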
It is also important to note that out-of-order processors are only out-of-order
\emph{from a certain point on}: a substantial part of the processor's frontend
is typically still in-order.
\subsubsection{Hardware counters}

Many processors provide \emph{hardware counters} to help (low-level)
programmers understand how their code is executed. The available counters vary
widely from one processor to another. The majority of processors, however,
offer counters to determine the number of elapsed cycles between two
instructions, as well as the number of retired instructions. Some processors
further offer counters for the number of cache misses and hits in the various
caches, or even the number of \uops{} executed on a specific port.
While access to these counters is vendor-dependent, abstraction layers are
available: for instance, the Linux kernel abstracts these counters through the
\perf{} interface, while \papi{} further attempts to unify similar counters
from different vendors under a common name.
\subsubsection{SIMD operations}
\todo{}