Foundations: enhancements to μarch
This commit is contained in:
parent 2a7b3fc2a3
commit 58fa37391e
1 changed file with 50 additions and 18 deletions
@@ -43,7 +43,7 @@ memory. This state can also be extended to encompass external effects, such as
network communication, peripherals, etc.

The way an ISA is implemented, in order for the instructions to alter the state
as specified, is called a \emph{microarchitecture}. Many microarchitectures can
implement the same ISA, as is the case for instance with the x86-64 ISA,
implemented both by Intel and AMD, each with multiple generations, which
translates into multiple microarchitectures. It is thus frequent for ISAs to
@@ -76,8 +76,14 @@ flow of instruction bytes to be executed, break it down into operations
executable by the backend and issue them to execution units. The backend, in
turn, is responsible for the actual computations made by the processor.

As such, the frontend can be seen as a manager for the backend: the latter
actually executes the work, while the former ensures that work is made
available to it, orchestrates its execution and scheduling, and ensures each
``worker'' in the backend is assigned tasks within its skill set.

\paragraph{Register file.} The register file holds the processor's registers,
small amounts of fast memory directly built into the processor's cores, on
which computations are made.

\paragraph{Data caches.} The cache hierarchy (usually L1, L2 and L3) caches
data rows from the main memory, whose access latency would slow computation
@@ -85,7 +91,14 @@ down by several orders of magnitude if it was accessed directly. Usually, the
L1 cache resides directly in the computation core, while the L2 and L3 caches
are shared between multiple cores.

\subsubsection{An instruction's walk through the processor}

Several CPU cycles may pass from the moment an instruction is first fetched by
the processor until the time this instruction is considered completed and
discarded. Let us follow the path of one such instruction through the
processor.

\smallskip{}

The CPU frontend constantly fetches a flow of instruction bytes. This flow must
first be broken down into a sequence of instructions. While on some ISAs, each
@@ -105,8 +118,9 @@ many times.

Microarchitectures typically implement more physical registers in their
register file than the ISA exposes to the programmer. The CPU takes advantage
of those additional registers by including a \emph{renamer} in the frontend, to
which the just-decoded operations are fed.
The renamer maps the ISA-defined registers used explicitly in instructions to
concrete registers in the register file. As long as enough concrete registers
are available, this phase eliminates certain categories of data dependencies;
this aspect is explored briefly below, and later in \autoref{chap:staticdeps}.
@@ -120,9 +134,10 @@ to each processor.
\smallskip{}

Typically, however, \uops{} will eventually be fed into a \emph{Reorder
Buffer}, or ROB\@. Today, most consumer- or server-grade CPUs are
\emph{out-of-order}, with effects detailed below; the reorder buffer makes this
possible. The \uops{} may wait for a few cycles in this reorder buffer before
being pulled by the \emph{issuer}.
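The discipline the reorder buffer enforces ---~in-order entry, possibly
out-of-order completion, strictly in-order retirement~--- can be sketched with
a toy model (illustrative structure only; real ROBs track far more state):

```python
from collections import deque

# Toy reorder buffer: uops enter in program order, may complete in any
# order, but retire (leave the buffer) strictly in program order.

class ReorderBuffer:
    def __init__(self):
        self.buf = deque()            # entries: [uop_id, completed?]

    def insert(self, uop_id):
        self.buf.append([uop_id, False])

    def complete(self, uop_id):
        for entry in self.buf:
            if entry[0] == uop_id:
                entry[1] = True

    def retire(self):
        retired = []
        # Only the oldest uops may leave, and only once completed.
        while self.buf and self.buf[0][1]:
            retired.append(self.buf.popleft()[0])
        return retired

rob = ReorderBuffer()
for i in range(3):
    rob.insert(i)
rob.complete(1)                       # uop 1 finishes first...
assert rob.retire() == []             # ...but cannot retire before uop 0
rob.complete(0)
assert rob.retire() == [0, 1]         # both retire, in program order
```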

\smallskip{}
@@ -130,7 +145,7 @@ Finally, the \uops{} are \emph{issued} to the backend towards \emph{execution
ports}. Each port usually processes at most one \uop{} per CPU cycle, and acts
as a sort of gateway towards the actual execution units of the processor.

Each execution port may be (and usually is) connected to multiple different
execution units: for instance, Intel Skylake's port 6 is responsible for both
branch \uops{} and integer arithmetic, while ARM's Cortex A72 has a single
port for both memory loads and stores.
@@ -140,7 +155,7 @@ port for both memory loads and stores.

In most cases, execution units are \emph{fully pipelined}, meaning that while
processing a single \uop{} takes multiple cycles, the unit is able to start
processing a new \uop{} every cycle: multiple \uops{} are then being processed,
at different stages, during each cycle, akin to a factory's assembly line.
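Under simplifying assumptions (one new \uop{} accepted per cycle, a fixed
latency, fully independent \uops{}), the gain from full pipelining can be
quantified with a back-of-the-envelope model:

```python
# Cycles to process n independent uops on a unit of latency `lat`:
# a fully pipelined unit starts one uop per cycle, so the last uop
# starts at cycle n-1 and finishes `lat` cycles later; an unpipelined
# unit must finish each uop before starting the next.

def pipelined_cycles(n, lat):
    return (n - 1) + lat if n > 0 else 0

def unpipelined_cycles(n, lat):
    return n * lat

# 8 independent uops on a 4-cycle unit:
print(pipelined_cycles(8, 4))    # 11 cycles
print(unpipelined_cycles(8, 4))  # 32 cycles
```

With a steady flow of independent \uops{}, throughput thus approaches one
\uop{} per cycle regardless of the unit's latency.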

\smallskip{}
@@ -151,13 +166,18 @@ unit's pipeline, it is committed to the \emph{retire buffer}, marking the
\subsubsection{Dependency handling}

In this flow of \uops{}, some are dependent on the result computed by a
previous \uop{} ---~or, more precisely, await the change of state induced by a
previous \uop{}. If, for instance, two successive identical \uops{} compute
$\reg{r10} \gets \reg{r10} + \reg{r11}$, the second instance must wait for the
completion of the first one, as the value of \reg{r10} after the execution of
the first is not known before its completion.

The \uops{} that depend on a previous \uop{} are not \emph{issued} until the
latter is marked as completed by entering the retire buffer\footnote{Some
  processors, however, introduce ``shortcuts'' when a \uop{} can yield a
  result before its full completion. In such cases, while the \uop{} depended
  on is not yet complete and retired, the dependent \uop{} can still be
  issued.}.

Since computation units are pipelined, they reach their best efficiency only
when \uops{} can be fed to them in a constant flow. Yet a dependency
@@ -165,13 +185,20 @@ may block the computation entirely until the result it depends on is computed,
throttling down the CPU's performance.

The \emph{renamer} helps relieve this dependency pressure when the dependency
can be broken by simply renaming one of the registers. We detail this later in
\autoref{chap:staticdeps}, but such dependencies may be \eg{}
\emph{write-after-read}: if $\reg{r11} \gets \reg{r10}$ is followed by
$\reg{r10} \gets \reg{r12}$, then the latter must wait for the former's
completion, as it would otherwise overwrite $\reg{r10}$, which is read by the
former. However, the second instruction may be \emph{renamed} to write to
$\reg{r10}_\text{alt}$ instead ---~also renaming every subsequent read to the
same value~---, thus avoiding the dependency.

\subsubsection{Out-of-order vs. in-order processors}

When computation is stalled by a dependency, it may however be possible to
immediately issue a \uop{} which comes later in the instruction stream, but
depends only on results already available.

For this reason, many processors are now \emph{out-of-order}, while processors
issuing \uops{} strictly in their original order are called \emph{in-order}.
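The difference can be illustrated with a toy issue model (assumptions: one
\uop{} issued per cycle, a uniform three-cycle latency, hypothetical register
names; real schedulers are considerably more involved):

```python
# Toy issue model: one uop issued per cycle, unit latency 3 cycles.
# Each uop is (dest, srcs). An out-of-order core may skip past a
# stalled uop to issue a later, independent one; an in-order core
# may not. Illustrative only.

LAT = 3

def issue_cycles(uops, out_of_order):
    ready_at = {}                 # register -> cycle its value is ready
    pending = list(range(len(uops)))
    cycle = 0
    while pending:
        for i in pending:
            dest, srcs = uops[i]
            if all(ready_at.get(s, 0) <= cycle for s in srcs):
                ready_at[dest] = cycle + LAT
                pending.remove(i)
                break             # at most one uop issued per cycle
            if not out_of_order:
                break             # in-order: cannot look past a stall
        cycle += 1
    return cycle

prog = [("r1", ["r0"]),           # uop 0
        ("r2", ["r1"]),           # uop 1: depends on uop 0
        ("r3", ["r0"])]           # uop 2: independent

print(issue_cycles(prog, out_of_order=True))   # 4 cycles
print(issue_cycles(prog, out_of_order=False))  # 5 cycles
```

While uop~1 waits on uop~0's result, the out-of-order core slips the
independent uop~2 into the otherwise wasted cycle.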
@@ -204,4 +231,9 @@ from different vendors under a common name.

\subsubsection{SIMD operations}

Processors operate at a given \emph{word size}, fixed by the ISA ---~typically
32 or 64 bits nowadays, even though embedded processors might operate at lower
word sizes.

Some operations, however, are able to work on chunks of data larger than the
word size.
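As a software analogy ---~a sketch only; actual SIMD units use dedicated
vector registers and execute all lanes at once in hardware~---, lane-wise
addition of four 32-bit values packed into a single 128-bit word can be
emulated as follows:

```python
# Emulate a 4-lane, 32-bit SIMD addition on a 128-bit Python integer:
# each 32-bit lane is added independently, with carries prevented from
# crossing lane boundaries. Illustrative only.

LANES, WIDTH = 4, 32
MASK = (1 << WIDTH) - 1

def pack(values):
    word = 0
    for i, v in enumerate(values):
        word |= (v & MASK) << (i * WIDTH)
    return word

def unpack(word):
    return [(word >> (i * WIDTH)) & MASK for i in range(LANES)]

def simd_add(a, b):
    # Add lane by lane, truncating each lane to 32 bits (wrap-around).
    return pack([(x + y) & MASK for x, y in zip(unpack(a), unpack(b))])

a = pack([1, 2, 3, 0xFFFFFFFF])
b = pack([10, 20, 30, 1])
print(unpack(simd_add(a, b)))    # [11, 22, 33, 0] -- last lane wraps
```

Note how the overflow in the last lane wraps within that lane instead of
carrying into its neighbour, which is precisely what distinguishes a lane-wise
SIMD addition from a plain 128-bit addition.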