Foundations: enhancements to μarch
This commit is contained in:
parent 2a7b3fc2a3
commit 58fa37391e
1 changed file with 50 additions and 18 deletions
@@ -43,7 +43,7 @@ memory. This state can also be extended to encompass external effects, such as
network communication, peripherals, etc.

The way an ISA is implemented, in order for the instructions to alter the state
as specified, is called a \emph{microarchitecture}. Many microarchitectures can
implement the same ISA, as is the case for instance with the x86-64 ISA,
implemented both by Intel and AMD, each with multiple generations, which
translates into multiple microarchitectures. It is thus frequent for ISAs to
@@ -76,8 +76,14 @@ flow of instruction bytes to be executed, break it down into operations
executable by the backend and issue them to execution units. The backend, in
turn, is responsible for the actual computations made by the processor.

As such, the frontend can be seen as a manager for the backend: the latter
actually executes the work, while the former ensures that work is made
available to it, orchestrates its execution and scheduling, and ensures each
``worker'' in the backend is assigned tasks within its skill set.

\paragraph{Register file.} The register file holds the processor's registers,
small amounts of fast memory directly built into the processor's cores, on
which computations are made.

\paragraph{Data caches.} The cache hierarchy (usually L1, L2 and L3) caches
data rows from the main memory, whose access latency would slow computation
@@ -85,7 +91,14 @@ down by several orders of magnitude if it was accessed directly. Usually, the
L1 cache resides directly in the computation core, while the L2 and L3 caches
are shared between multiple cores.

\subsubsection{An instruction's walk through the processor}

Several CPU cycles may pass from the moment an instruction is first fetched by
the processor until the time this instruction is considered completed and
discarded. Let us follow the path of one such instruction through the
processor.

\smallskip{}

The CPU frontend constantly fetches a flow of instruction bytes. This flow must
first be broken down into a sequence of instructions. While on some ISAs, each
@@ -105,8 +118,9 @@ many times.

Microarchitectures typically implement more physical registers in their
register file than the ISA exposes to the programmer. The CPU takes advantage
of those additional registers by including a \emph{renamer} in the frontend, to
which the just-decoded operations are fed.
The renamer maps the ISA-defined registers used explicitly in instructions to
concrete registers in the register file. As long as enough concrete registers
are available, this phase eliminates certain categories of data dependencies;
this aspect is explored briefly below, and later in \autoref{chap:staticdeps}.
@@ -120,9 +134,10 @@ to each processor.
\smallskip{}

Typically, however, \uops{} will eventually be fed into a \emph{Reorder
Buffer}, or ROB\@. Today, most consumer- or server-grade CPUs are
\emph{out-of-order}, with effects detailed below; the reorder buffer makes this
possible. The \uops{} may wait for a few cycles in this reorder buffer before
being pulled by the \emph{issuer}.
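The discipline the reorder buffer enforces ---~in-order entry, possibly
out-of-order completion, strictly in-order retirement~--- can be sketched with
a toy model (illustrative structure only; real ROBs track far more state):

```python
from collections import deque

# Toy reorder buffer: uops enter in program order, may complete in any
# order, but retire (leave the buffer) strictly in program order.

class ReorderBuffer:
    def __init__(self):
        self.buf = deque()            # entries: [uop_id, completed?]

    def insert(self, uop_id):
        self.buf.append([uop_id, False])

    def complete(self, uop_id):
        for entry in self.buf:
            if entry[0] == uop_id:
                entry[1] = True

    def retire(self):
        retired = []
        # Only the oldest uops may leave, and only once completed.
        while self.buf and self.buf[0][1]:
            retired.append(self.buf.popleft()[0])
        return retired

rob = ReorderBuffer()
for i in range(3):
    rob.insert(i)
rob.complete(1)                       # uop 1 finishes first...
assert rob.retire() == []             # ...but cannot retire before uop 0
rob.complete(0)
assert rob.retire() == [0, 1]         # both retire, in program order
```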

\smallskip{}
@@ -130,7 +145,7 @@ Finally, the \uops{} are \emph{issued} to the backend towards \emph{execution
ports}. Each port usually processes at most one \uop{} per CPU cycle, and acts
as a sort of gateway towards the actual execution units of the processor.

Each execution port may be (and usually is) connected to multiple different
execution units: for instance, Intel Skylake's port 6 is responsible for both
branch \uops{} and integer arithmetic, while ARM's Cortex A72 has a single
port for both memory loads and stores.
@@ -140,7 +155,7 @@ port for both memory loads and stores.

In most cases, execution units are \emph{fully pipelined}, meaning that while
processing a single \uop{} takes multiple cycles, the unit is able to start
processing a new \uop{} every cycle: multiple \uops{} are then being processed,
at different stages, during each cycle, akin to a factory's assembly line.
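Under simplifying assumptions (one new \uop{} accepted per cycle, a fixed
latency, fully independent \uops{}), the gain from full pipelining can be
quantified with a back-of-the-envelope model:

```python
# Cycles to process n independent uops on a unit of latency `lat`:
# a fully pipelined unit starts one uop per cycle, so the last uop
# starts at cycle n-1 and finishes `lat` cycles later; an unpipelined
# unit must finish each uop before starting the next.

def pipelined_cycles(n, lat):
    return (n - 1) + lat if n > 0 else 0

def unpipelined_cycles(n, lat):
    return n * lat

# 8 independent uops on a 4-cycle unit:
print(pipelined_cycles(8, 4))    # 11 cycles
print(unpipelined_cycles(8, 4))  # 32 cycles
```

With a steady flow of independent \uops{}, throughput thus approaches one
\uop{} per cycle regardless of the unit's latency.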

\smallskip{}
@@ -151,13 +166,18 @@ unit's pipeline, it is committed to the \emph{retire buffer}, marking the
\subsubsection{Dependency handling}

In this flow of \uops{}, some are dependent on the result computed by a
previous \uop{} ---~or, more precisely, await the change of state induced by a
previous \uop{}. If, for instance, two successive identical \uops{} compute
$\reg{r10} \gets \reg{r10} + \reg{r11}$, the second instance must wait for the
completion of the first one, as the value of \reg{r10} after the execution of
the first is not known before its completion.

The \uops{} that depend on a previous \uop{} are not \emph{issued} until the
latter is marked as completed by entering the retire buffer\footnote{Some
  processors, however, introduce ``shortcuts'' when a \uop{} can yield a
  result before its full completion. In such cases, while the \uop{} depended
  on is not yet complete and retired, the dependent \uop{} can still be
  issued.}.

Since computation units are pipelined, they reach their best efficiency only
when \uops{} can be fed to them in a constant flow. Yet a dependency
@@ -165,13 +185,20 @@ may block the computation entirely until the result it depends on is computed,
throttling down the CPU's performance.

The \emph{renamer} helps relieve this dependency pressure when the dependency
can be broken by simply renaming one of the registers. We detail this later in
\autoref{chap:staticdeps}, but such dependencies may be \eg{}
\emph{write-after-read}: if $\reg{r11} \gets \reg{r10}$ is followed by
$\reg{r10} \gets \reg{r12}$, then the latter must wait for the former's
completion, as it would otherwise overwrite $\reg{r10}$, which is read by the
former. However, the second instruction may be \emph{renamed} to write to
$\reg{r10}_\text{alt}$ instead ---~also renaming every subsequent read to the
same value~---, thus avoiding the dependency.

\subsubsection{Out-of-order vs. in-order processors}

When computation is stalled by a dependency, it may however be possible to
immediately issue a \uop{} which comes later in the instruction stream, but
depends only on results already available.

For this reason, many processors are now \emph{out-of-order}, while processors
issuing \uops{} strictly in their original order are called \emph{in-order}.
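The difference can be illustrated with a toy issue model (assumptions: one
\uop{} issued per cycle, a uniform three-cycle latency, hypothetical register
names; real schedulers are considerably more involved):

```python
# Toy issue model: one uop issued per cycle, unit latency 3 cycles.
# Each uop is (dest, srcs). An out-of-order core may skip past a
# stalled uop to issue a later, independent one; an in-order core
# may not. Illustrative only.

LAT = 3

def issue_cycles(uops, out_of_order):
    ready_at = {}                 # register -> cycle its value is ready
    pending = list(range(len(uops)))
    cycle = 0
    while pending:
        for i in pending:
            dest, srcs = uops[i]
            if all(ready_at.get(s, 0) <= cycle for s in srcs):
                ready_at[dest] = cycle + LAT
                pending.remove(i)
                break             # at most one uop issued per cycle
            if not out_of_order:
                break             # in-order: cannot look past a stall
        cycle += 1
    return cycle

prog = [("r1", ["r0"]),           # uop 0
        ("r2", ["r1"]),           # uop 1: depends on uop 0
        ("r3", ["r0"])]           # uop 2: independent

print(issue_cycles(prog, out_of_order=True))   # 4 cycles
print(issue_cycles(prog, out_of_order=False))  # 5 cycles
```

While uop~1 waits on uop~0's result, the out-of-order core slips the
independent uop~2 into the otherwise wasted cycle.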
@@ -204,4 +231,9 @@ from different vendors under a common name.

\subsubsection{SIMD operations}

Processors operate at a given \emph{word size}, fixed by the ISA ---~typically
32 or 64 bits nowadays, even though embedded processors might operate at lower
word sizes.

Some operations, however, are able to work on chunks of data larger than the
word size.
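As a software analogy ---~a sketch only; actual SIMD units use dedicated
vector registers and execute all lanes at once in hardware~---, lane-wise
addition of four 32-bit values packed into a single 128-bit word can be
emulated as follows:

```python
# Emulate a 4-lane, 32-bit SIMD addition on a 128-bit Python integer:
# each 32-bit lane is added independently, with carries prevented from
# crossing lane boundaries. Illustrative only.

LANES, WIDTH = 4, 32
MASK = (1 << WIDTH) - 1

def pack(values):
    word = 0
    for i, v in enumerate(values):
        word |= (v & MASK) << (i * WIDTH)
    return word

def unpack(word):
    return [(word >> (i * WIDTH)) & MASK for i in range(LANES)]

def simd_add(a, b):
    # Add lane by lane, truncating each lane to 32 bits (wrap-around).
    return pack([(x + y) & MASK for x, y in zip(unpack(a), unpack(b))])

a = pack([1, 2, 3, 0xFFFFFFFF])
b = pack([10, 20, 30, 1])
print(unpack(simd_add(a, b)))    # [11, 22, 33, 0] -- last lane wraps
```

Note how the overflow in the last lane wraps within that lane instead of
carrying into its neighbour, which is precisely what distinguishes a lane-wise
SIMD addition from a plain 128-bit addition.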