Foundations: further microarch writeup
2 changed files with 128 additions and 9 deletions
@ -71,13 +71,10 @@ A microarchitecture can be broken down into a few functional blocks, shown in
\medskip{}

\paragraph{Frontend and backend.} The frontend is responsible for fetching the
flow of instruction bytes to be executed, breaking it down into operations
executable by the backend, and issuing them to the execution units. The
backend, in turn, is responsible for the actual computations made by the
processor.

\paragraph{Register file.} The register file holds the processor's registers,
on which computations are made.
@ -87,3 +84,124 @@ data rows from the main memory, whose access latency would slow computation
down by several orders of magnitude if it were accessed directly. Usually, the
L1 cache resides directly in the computation core, while the L2 and L3 caches
are shared between multiple cores.

\bigskip{}

The CPU frontend constantly fetches a flow of instruction bytes. This flow must
first be broken down into a sequence of instructions. While on some ISAs, each
instruction is made of a fixed number of bytes ---~\eg{} ARM~---, this is not
always the case: for instance, x86-64 instructions can be as short as one byte,
while the ISA only limits an instruction to 15 bytes. This task is performed
by the \emph{decoder}, which usually outputs a flow of \emph{micro-operations},
or \uops.
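
For illustration, here are a few common x86-64 instructions of different
lengths (byte counts for their usual encodings, ignoring optional prefixes;
this short list is only an informal example, not part of the ISA text):
\begin{itemize}
  \item \texttt{nop} --- 1 byte;
  \item \texttt{add rax, rbx} --- 3 bytes (REX prefix, opcode, ModRM);
  \item \texttt{movabs rax, imm64} --- 10 bytes (REX prefix, one opcode byte
    and an 8-byte immediate).
\end{itemize}
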

Some microarchitectures rely on complex decoding phases, first splitting
instructions into \emph{macro-operations}, to be split again into \uops{}
further down the line. Part of this decoding may also be cached, \eg{} to
optimize loop decoding, where the same sequence of instructions will be decoded
many times.

\smallskip{}

Microarchitectures typically implement more physical registers in their
register file than the ISA exposes to the programmer. The CPU takes advantage
of those additional registers by including a \emph{renamer} in the frontend,
which maps the ISA-defined registers used explicitly in instructions to
concrete registers in the register file. As long as enough concrete registers
are available, this phase eliminates certain categories of data dependencies;
this aspect is explored briefly below, and later in \autoref{chap:staticdeps}.

\smallskip{}

Depending on the microarchitecture, the decoded operations ---~be they macro-
or micro-operations at this stage~--- may undergo several more phases, specific
to each processor.

\smallskip{}

Typically, however, \uops{} will eventually be fed into a \emph{Reorder
Buffer}, or ROB. Today, most consumer- or server-grade CPUs are
\emph{out-of-order}, with effects detailed below; the reorder buffer makes this
possible.

\smallskip{}

Finally, the \uops{} are \emph{issued} to the backend towards \emph{execution
ports}. Each port usually processes at most one \uop{} per CPU cycle, and acts
as a sort of gateway towards the actual execution units of the processor.

Each execution port may be ---~and usually is~--- connected to multiple
different execution units: for instance, Intel Skylake's port 6 is responsible
for both branch \uops{} and integer arithmetic, while ARM's Cortex A72 has a
single port for both memory loads and stores.

\smallskip{}

In most cases, execution units are \emph{fully pipelined}, meaning that while
processing a single \uop{} takes multiple cycles, the unit is able to start
processing a new \uop{} every cycle: multiple \uops{} are then being processed,
at different stages, during each cycle.
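
As an illustrative back-of-the-envelope calculation (the latency here is
hypothetical, not tied to any particular unit): if a fully pipelined unit has a
latency of $\ell$ cycles, then $n$ independent \uops{} complete in roughly
$n + \ell - 1$ cycles, whereas a chain of $n$ dependent \uops{} requires
roughly $n \cdot \ell$ cycles, since each one must wait for the previous
result.
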

\smallskip{}

Finally, when a \uop{} has been entirely processed and exits its processing
unit's pipeline, it is committed to the \emph{retire buffer}, marking the
\uop{} as complete.

\subsubsection{Dependency handling}

In this flow of \uops{}, some depend on the result computed by a previous
\uop{}. If, for instance, two successive identical \uops{} compute
$\reg{r10} \gets \reg{r10} + \reg{r11}$, the second instance must wait for the
first one to complete, as the value of \reg{r10} it consumes is not known
before the first \uop{} has finished executing.

The \uops{} that depend on a previous \uop{} are not \emph{issued} until the
latter is marked as completed by entering the retire buffer.

Since computation units are pipelined, they reach their best efficiency only
when \uops{} can be fed to them in a constant flow. A dependency, however, may
stall the computation entirely until the result it waits for has been computed,
throttling the CPU's performance.

The \emph{renamer} helps relieve this dependency pressure when the dependency
can be broken by simply renaming one of the registers.
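
For instance ---~a hypothetical sequence, using the same notation as
above~--- consider:
\begin{enumerate}
  \item $\reg{r10} \gets \reg{r10} + \reg{r11}$
  \item $\reg{r12} \gets \reg{r10} + 1$
  \item $\reg{r10} \gets \reg{r13} + \reg{r14}$
\end{enumerate}
The third \uop{} overwrites \reg{r10} but does not use the value produced by
the first one; it conflicts with the earlier \uops{} only because all three
reuse the same architectural register. By mapping this last write of \reg{r10}
to a fresh physical register, the renamer lets it be issued without waiting for
the first two \uops{} to complete.
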

\subsubsection{Out-of-order vs. in-order processors}

When computation is stalled by a dependency, it may nevertheless be possible to
immediately issue a \uop{} that comes later in the instruction stream, provided
it does not need results that are not yet available.
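
Again with a hypothetical sequence:
\begin{enumerate}
  \item $\reg{r10} \gets \reg{r10} + \reg{r11}$
  \item $\reg{r12} \gets \reg{r10} + 1$
  \item $\reg{r14} \gets \reg{r14} + \reg{r15}$
\end{enumerate}
While the second \uop{} waits for the first one to produce \reg{r10}, the third
one depends on neither of them: an out-of-order core may issue it immediately,
whereas an in-order core must issue the \uops{} in program order and leaves the
execution unit idle in the meantime.
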

For this reason, many processors are now \emph{out-of-order}, while processors
issuing \uops{} strictly in their original order are called \emph{in-order}.
Out-of-order microarchitectures feature a \emph{reorder buffer}, from which
instructions are picked to be issued. The reorder buffer acts as a sliding
window of microarchitecturally-fixed size over \uops{}, from which the oldest
\uop{} whose dependencies are satisfied will be executed. Thus, out-of-order
CPUs are only able to execute operations out of order as long as the
\uop{} to be executed is not too far ahead of the oldest \uop{} awaiting to
be issued ---~specifically, not more than the size of the reorder buffer ahead.
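
As an illustration ---~the figures are rough, hypothetical orders of magnitude,
not measurements~---: if a load misses every cache level and takes on the order
of 300~cycles to complete, a core with a 224-entry reorder buffer can only
continue past the stalled load until the window fills up, \ie{} for at most
224~\uops{}, after which it stalls even if later \uops{} are independent.
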

It is also important to note that out-of-order processors are only out-of-order
\emph{from a certain point on}: a substantial part of the processor's frontend
is typically still in-order.

\subsubsection{Hardware counters}

Many processors provide \emph{hardware counters} to help (low-level)
programmers understand how their code is executed. The available counters vary
widely from one processor to another. The majority of processors, however,
offer counters to determine the number of elapsed cycles between two
instructions, as well as the number of retired instructions. Some processors
further offer counters for the number of cache misses and hits in the various
caches, or even the number of \uops{} executed on a specific port.

While access to these counters is vendor-dependent, abstraction layers are
available: for instance, the Linux kernel abstracts these counters through the
\perf{} interface, while \papi{} further attempts to unify similar counters
from different vendors under a common name.
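
As a minimal sketch of how such a counter can be read on Linux ---~assuming a
system exposing the \perf{} interface, and with error handling kept to a bare
minimum~--- the following C program counts the instructions retired while
executing a small loop, through the \texttt{perf\_event\_open} system call:

\begin{verbatim}
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

/* Thin wrapper: glibc does not provide a perf_event_open() wrapper. */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;  /* retired instructions */
    attr.disabled = 1;                         /* start disabled */
    attr.exclude_kernel = 1;                   /* user-space only */
    attr.exclude_hv = 1;

    int fd = perf_event_open(&attr, 0, -1, -1, 0);  /* this thread, any CPU */
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    volatile uint64_t acc = 0;                 /* region of interest */
    for (uint64_t i = 0; i < 1000000; i++)
        acc += i;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t count;
    read(fd, &count, sizeof(count));
    printf("retired instructions: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}
\end{verbatim}

Depending on the system configuration, running this as an unprivileged user may
require relaxing the \texttt{kernel.perf\_event\_paranoid} sysctl; the same
measurement could also be expressed through \papi{}'s portable API.
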

\subsubsection{SIMD operations}

\todo{}

@ -19,13 +19,14 @@
* Instruction --[frontend]--> Mop, muop
* muop --[backend port]--> retired [side effects]
* vast majority of cases: execution units are fully pipelined
* Dependencies are breaking the pipeline!
* Renamer: helps up to a point
* out of order CPUs:
  * Frontend in order up to some point
  * ROB
  * backend out-of-order
* ROB: execution window. ILP limited to this window.
* Dependencies handling
  * Dependencies are breaking the pipeline!
  * Renamer: helps up to a point

* Hardware counters