Foundations: further microarch writeup
\medskip{}

\paragraph{Frontend and backend.} The frontend is responsible for fetching the
flow of instruction bytes to be executed, breaking it down into operations
executable by the backend, and issuing them to execution units. The backend, in
turn, is responsible for the actual computations made by the processor.

\paragraph{Register file.} The register file holds the processor's registers,
on which computations are made.

data rows from the main memory, whose access latency would slow computation
down by several orders of magnitude if it were accessed directly. Usually, the
L1 cache resides directly in the computation core, while the L2 and L3 caches
are shared between multiple cores.

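The cost of this hierarchy can be sketched with the classical average memory
access time (AMAT) formula; the latencies and miss rates below are illustrative
orders of magnitude only, not measurements of any specific processor:

```python
# AMAT = hit_latency + miss_rate * (cost of going one level down),
# applied recursively over the cache hierarchy.
# All cycle counts below are hypothetical, for illustration only.

def amat(levels, memory_latency):
    # levels: list of (hit_latency, miss_rate) pairs, from L1 downwards
    if not levels:
        return memory_latency
    hit_latency, miss_rate = levels[0]
    return hit_latency + miss_rate * amat(levels[1:], memory_latency)

# L1: 4 cycles, 10% misses; L2: 12 cycles, 20% of those miss again; L3: 40 cycles
print(amat([(4, 0.10), (12, 0.20), (40, 0.50)], memory_latency=200))
```

Even with a main memory two orders of magnitude slower than the L1 cache, the
hierarchy keeps the average access cost at a few cycles.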
\bigskip{}

The CPU frontend constantly fetches a flow of instruction bytes. This flow must
first be broken down into a sequence of instructions. While on some ISAs each
instruction is made of a fixed number of bytes ---~\eg{} ARM~---, this is not
always the case: for instance, x86-64 instructions can be as short as one byte,
while the ISA only limits an instruction to 15 bytes. This task is performed
by the \emph{decoder}, which usually outputs a flow of \emph{micro-operations},
or \uops{}.

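For illustration, the encodings of a few common x86-64 instructions (these byte
sequences are standard, but the snippet itself is only a sketch):

```python
# Machine-code bytes of a few x86-64 instructions (standard encodings),
# showing that the instruction length varies widely under the 15-byte limit.
encodings = {
    "ret":          bytes([0xC3]),        # 1 byte
    "push rax":     bytes([0x50]),        # 1 byte
    "xor eax, eax": bytes([0x31, 0xC0]),  # 2 bytes
    # REX.W + B8: mov rax, imm64 carries an 8-byte little-endian immediate
    "mov rax, imm64":
        bytes([0x48, 0xB8]) + (0x1122334455667788).to_bytes(8, "little"),
}

for insn, code in encodings.items():
    assert 1 <= len(code) <= 15
    print(f"{insn}: {len(code)} byte(s)")
```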
Some microarchitectures rely on complex decoding phases, first splitting
instructions into \emph{macro-operations}, to be split again into \uops{}
further down the line. Part of this decoding may also be cached, \eg{} to
optimize loop decoding, where the same sequence of instructions will be decoded
many times.

\smallskip{}

Microarchitectures typically implement more physical registers in their
register file than the ISA exposes to the programmer. The CPU takes advantage
of those additional registers by including a \emph{renamer} in the frontend,
which maps the ISA-defined registers used explicitly in instructions to
concrete registers in the register file. As long as enough concrete registers
are available, this phase eliminates certain categories of data dependencies;
this aspect is explored briefly below, and later in \autoref{chap:staticdeps}.

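The effect of renaming can be sketched with a toy model (a simplification, not
any particular microarchitecture): every write is mapped to a fresh physical
register, so successive writes to the same ISA register no longer conflict.

```python
# Toy register renamer: architectural registers r0..r15 are mapped to
# physical registers; each write allocates a fresh physical register.

def rename(instructions, n_arch_regs=16):
    mapping = {f"r{i}": f"p{i}" for i in range(n_arch_regs)}
    next_phys = n_arch_regs
    renamed = []
    for dst, srcs in instructions:        # each instruction: (dest, [sources])
        srcs = [mapping[s] for s in srcs]  # sources read the current mapping
        mapping[dst] = f"p{next_phys}"     # the write gets a fresh register
        next_phys += 1
        renamed.append((mapping[dst], srcs))
    return renamed

# r1 <- r2 + r3 ; r1 <- r4 + r5: a write-after-write conflict on r1...
print(rename([("r1", ["r2", "r3"]), ("r1", ["r4", "r5"])]))
# ...which disappears: the two writes target distinct physical registers.
```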
\smallskip{}

Depending on the microarchitecture, the decoded operations ---~be they macro-
or micro-operations at this stage~--- may undergo several more phases, specific
to each processor.

\smallskip{}

Typically, however, \uops{} will eventually be fed into a \emph{Reorder
Buffer}, or ROB. Today, most consumer- or server-grade CPUs are
\emph{out-of-order}, with effects detailed below; the reorder buffer makes this
possible.

\smallskip{}

Finally, the \uops{} are \emph{issued} to the backend, towards \emph{execution
ports}. Each port usually processes at most one \uop{} per CPU cycle, and acts
as a sort of gateway towards the actual execution units of the processor.

Each execution port may be ---~and usually is~--- connected to multiple
different execution units: for instance, Intel Skylake's port 6 is responsible
for both branch \uops{} and integer arithmetic, while ARM's Cortex A72 has a
single port for both memory loads and stores.

\smallskip{}

In most cases, execution units are \emph{fully pipelined}, meaning that while
processing a single \uop{} takes multiple cycles, the unit is able to start
processing a new \uop{} every cycle: multiple \uops{} are then being processed,
at different stages, during each cycle.

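A back-of-the-envelope model (with a hypothetical 4-cycle latency) shows why
full pipelining matters for throughput:

```python
# Cycles needed to process n independent uops on a unit of given latency.

def pipelined_cycles(n, latency):
    # a new uop enters every cycle; the last one enters at cycle n - 1
    # and completes `latency` cycles later
    return n - 1 + latency

def unpipelined_cycles(n, latency):
    # the unit accepts a new uop only once the previous one has completed
    return n * latency

print(pipelined_cycles(100, 4))    # 103
print(unpipelined_cycles(100, 4))  # 400
```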
\smallskip{}

Finally, when a \uop{} has been entirely processed and exits its processing
unit's pipeline, it is committed to the \emph{retire buffer}, marking the
\uop{} as complete.

\subsubsection{Dependency handling}

In this flow of \uops{}, some are dependent on the result computed by a
previous \uop{}. If, for instance, two successive identical \uops{} compute
$\reg{r10} \gets \reg{r10} + \reg{r11}$, the second instance must wait for the
completion of the first one, as the value of \reg{r10} produced by the first
\uop{} is not known before it completes.

The \uops{} that depend on a previous \uop{} are not \emph{issued} until the
latter is marked as completed by entering the retire buffer.

Since computation units are pipelined, they reach their best efficiency only
when \uops{} can be fed to them in a constant flow. A dependency, however, may
stall the computation entirely until the result it depends on is available,
throttling the CPU's performance.

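The cost of a dependency chain can be made concrete with a toy model
(hypothetical fully pipelined unit with a 3-cycle latency): dependent \uops{}
complete one result every 3 cycles, while independent ones complete one per
cycle.

```python
# Cycle at which each uop completes on a fully pipelined unit.
# A dependent uop cannot be issued before its predecessor's result is ready.

def completion_times(n, latency, dependent):
    done = []
    for i in range(n):
        if dependent and done:
            issue = done[-1]   # wait for the previous result
        else:
            issue = i          # one independent uop issued per cycle
        done.append(issue + latency)
    return done

print(completion_times(4, 3, dependent=False))  # [3, 4, 5, 6]
print(completion_times(4, 3, dependent=True))   # [3, 6, 9, 12]
```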
The \emph{renamer} helps relieve this dependency pressure when the dependency
can be broken by simply renaming one of the registers.

\subsubsection{Out-of-order vs.\ in-order processors}

When computation is stalled by a dependency, it may nevertheless be possible to
immediately issue a \uop{} that comes later in the instruction stream, provided
it does not need results that are not yet available.

For this reason, many processors are now \emph{out-of-order}, while processors
issuing \uops{} strictly in their original order are called \emph{in-order}.
Out-of-order microarchitectures feature a \emph{reorder buffer}, from which
instructions are picked to be issued. The reorder buffer acts as a sliding
window of microarchitecturally-fixed size over \uops{}, from which the oldest
\uop{} whose dependencies are satisfied will be executed. Thus, out-of-order
CPUs can only execute operations out of order as long as the \uop{} to be
executed is not too far ahead of the oldest \uop{} waiting to be issued
---~specifically, not more than the size of the reorder buffer ahead.

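A toy model of this sliding window (illustrative only: single issue per cycle
and a uniform, hypothetical 3-cycle latency):

```python
# Toy out-of-order issue from a reorder buffer (ROB).
# Each uop is (name, set of names it depends on). Every cycle, the oldest
# uop inside the ROB window whose dependencies are all complete is issued.

def issue_cycles(uops, rob_size, latency=3):
    pending = list(uops)
    finish = {}   # name -> cycle at which its result becomes available
    issued = {}   # name -> cycle at which it was issued
    cycle = 0
    while pending:
        for name, deps in pending[:rob_size]:   # only the window is visible
            if all(finish.get(d, float("inf")) <= cycle for d in deps):
                issued[name] = cycle
                finish[name] = cycle + latency
                pending.remove((name, deps))
                break                           # at most one issue per cycle
        cycle += 1
    return issued

# 'b' depends on 'a'; 'c' is independent and may overtake the stalled 'b'.
print(issue_cycles([("a", set()), ("b", {"a"}), ("c", set())], rob_size=2))
```

With a ROB of size 2, the independent \uop{} is issued while the dependent one
waits; with a ROB of size 1, the window never reaches it and issue degenerates
to in-order.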
It is also important to note that out-of-order processors are only out-of-order
\emph{from a certain point on}: a substantial part of the processor's frontend
is typically still in-order.

\subsubsection{Hardware counters}

Many processors provide \emph{hardware counters} to help (low-level)
programmers understand how their code is executed. The available counters
depend widely on the specific processor. The majority of processors, however,
offer counters to determine the number of elapsed cycles between two
instructions, as well as the number of retired instructions. Some processors
further offer counters for the number of cache misses and hits in the various
caches, or even the number of \uops{} executed on a specific port.

While access to these counters is vendor-dependent, abstraction layers are
available: for instance, the Linux kernel abstracts these counters through the
\perf{} interface, while \papi{} further attempts to unify similar counters
from different vendors under a common name.

\subsubsection{SIMD operations}

\todo{}

* Instruction --[frontend]--> Mop, muop
* muop --[backend port]--> retired [side effects]
* vast majority of cases: execution units are fully pipelined
* out of order CPUs:
  * Frontend in order up to some point
  * ROB
  * backend out-of-order
  * ROB: execution window. ILP limited to this window.
* Dependencies handling
  * Dependencies are breaking the pipeline!
  * Renamer: helps up to a point
* Hardware counters