Foundations: further microarch writeup

This commit is contained in:
Théophile Bastian 2023-11-03 20:17:04 +01:00
parent c20ba8ea7e
commit 2a7b3fc2a3
2 changed files with 128 additions and 9 deletions

View file

@ -71,13 +71,10 @@ A microarchitecture can be broken down into a few functional blocks, shown in
\medskip{}

\paragraph{Frontend and backend.} The frontend is responsible for fetching the
flow of instruction bytes to be executed, breaking it down into operations
executable by the backend and issuing them to execution units. The backend, in
turn, is responsible for the actual computations made by the processor.

\paragraph{Register file.} The register file holds the processor's registers,
on which computations are made.
@ -87,3 +84,124 @@ data rows from the main memory, whose access latency would slow computation
down by several orders of magnitude if it was accessed directly. Usually, the
L1 cache resides directly in the computation core, while the L2 and L3 caches
are shared between multiple cores.
\bigskip{}
The CPU frontend constantly fetches a flow of instruction bytes. This flow must
first be broken down into a sequence of instructions. While on some ISAs each
instruction is made of a fixed number of bytes ---~\eg{} ARM~---, this is not
always the case: for instance, x86-64 instructions can be as short as one byte,
while the ISA only limits an instruction to 15 bytes. This task is performed
by the \emph{decoder}, which usually outputs a flow of \emph{micro-operations},
or \uops.
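As an illustration (a hand-written sketch, not an excerpt from any real
program), the following C array encodes a short but valid x86-64 byte stream
mixing 1-, 3- and 10-byte instructions, which the decoder must delimit on the
fly:

\begin{verbatim}
static const unsigned char code[] = {
    0x90,                               /* nop                    (1 byte)   */
    0x48, 0x01, 0xd8,                   /* add    rax, rbx        (3 bytes)  */
    0x48, 0xb8, 0xef, 0xbe, 0xad, 0xde,
    0x00, 0x00, 0x00, 0x00,             /* movabs rax, 0xdeadbeef (10 bytes) */
    0xc3,                               /* ret                    (1 byte)   */
};
\end{verbatim}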
Some microarchitectures rely on complex decoding phases, first splitting
instructions into \emph{macro-operations}, to be split again into \uops{}
further down the line. Part of this decoding may also be cached, \eg{} to
optimize loop decoding, where the same sequence of instructions will be decoded
many times.
\smallskip{}
Microarchitectures typically implement more physical registers in their
register file than the ISA exposes to the programmer. The CPU takes advantage
of those additional registers by including a \emph{renamer} in the frontend,
which maps the ISA-defined registers used explicitly in instructions to
concrete registers in the register file. As long as enough concrete registers
are available, this phase eliminates \emph{false} data dependencies
(write-after-write and write-after-read), which stem only from the reuse of a
register name; this aspect is explored briefly below, and later in
\autoref{chap:staticdeps}.
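A minimal sketch of such a false dependency, written in C for readability (the
hardware renamer performs the analogous transformation on architectural
registers, not on source variables): the second assignment to \texttt{x} does
not need the result of the first one, yet without renaming it would have to
wait until the first value of \texttt{x} is no longer needed.

\begin{verbatim}
long false_dep(long a, long b, long c, long d) {
    long x;
    x = a + b;          /* first producer of x                          */
    long r1 = x << 1;   /* reads the first value of x                   */
    x = c + d;          /* reuses the name x: false (write-after-read)
                           dependency, broken by renaming               */
    long r2 = x << 1;   /* reads the second value of x                  */
    return r1 + r2;
}
\end{verbatim}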
\smallskip{}
Depending on the microarchitecture, the decoded operations ---~be they macro-
or micro-operations at this stage~--- may undergo several more phases, specific
to each processor.
\smallskip{}
Typically, however, \uops{} will eventually be fed into a \emph{Reorder
Buffer}, or ROB. Today, most consumer- or server-grade CPUs are
\emph{out-of-order}, with effects detailed below; the reorder buffer makes this
possible.
\smallskip{}
Finally, the \uops{} are \emph{issued} to the backend towards \emph{execution
ports}. Each port usually processes at most one \uop{} per CPU cycle, and acts
as a sort of gateway towards the actual execution units of the processor.
Each execution port may be (and usually is) connected to multiple different
execution units: for instance, Intel Skylake's port 6 is responsible for both
branch \uops{} and integer arithmetic, while ARM's Cortex A72 has a single
port for both memory loads and stores.
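This port mapping directly bounds the attainable throughput. As an illustrative
example (the figures are made up and do not describe any specific processor),
if a loop body contains four arithmetic \uops{} that can only execute on two
ports, each accepting one \uop{} per cycle, then an iteration cannot complete
in fewer than
\[
  \left\lceil \frac{4}{2 \times 1} \right\rceil = 2 \quad \mbox{cycles,}
\]
regardless of how many other ports remain idle.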
\smallskip{}
In most cases, execution units are \emph{fully pipelined}: while processing a
single \uop{} takes multiple cycles, the unit is able to start processing a new
\uop{} every cycle, so that multiple \uops{} are in flight, at different stages
of the pipeline, during each cycle.
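As a rough, illustrative model (the figures are not those of any specific
execution unit), a fully pipelined unit with a latency of 4 cycles that can
start one new \uop{} per cycle processes $N$ independent \uops{} in about
\[
  N + (4 - 1) \quad \mbox{cycles, instead of} \quad 4N \quad \mbox{cycles}
\]
if each \uop{} had to wait for the previous one to leave the unit entirely.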
\smallskip{}
Finally, when a \uop{} has been entirely processed and exits its processing
unit's pipeline, it is committed to the \emph{retire buffer}, marking the
\uop{} as complete.
\subsubsection{Dependencies handling}
In this flow of \uops{}, some are dependent on the result computed by a
previous \uop{}. If, for instance, two successive identical \uops{} compute
$\reg{r10} \gets \reg{r10} + \reg{r11}$, the second instance must wait for the
completion of the first one, as the value of \reg{r10} it reads is not known
before the first \uop{} completes.
The \uops{} that depend on a previous \uop{} are not \emph{issued} until the
latter is marked as completed by entering the retire buffer.
Since computation units are pipelined, they reach their best efficiency only
when \uops{} can be fed to them in a constant flow. Yet a dependency may stall
the computation entirely until the result it waits for is available,
throttling down the CPU's performance.
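A classic illustration, sketched here in C under the assumption of a pipelined
floating-point adder (the source-level picture only approximates what happens
at the \uop{} level): summing an array through a single accumulator forms one
long dependency chain and runs at roughly one addition per add-latency, while
splitting the sum across four independent accumulators lets the pipelined unit
start a new addition nearly every cycle.

\begin{verbatim}
/* Latency-bound: every addition depends on the previous one. */
double sum_serial(const double *a, long n) {
    double s = 0.0;
    for (long i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Throughput-bound: four independent dependency chains.
 * (Assumes n is a multiple of 4, for brevity.)               */
double sum_unrolled(const double *a, long n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (long i = 0; i < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
\end{verbatim}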
The \emph{renamer} helps relieve this dependency pressure when the dependency
can be broken by simply renaming one of the registers.
\subsubsection{Out-of-order vs. in-order processors}
When computation is stalled by a dependency, it may however be possible to
immediately issue a \uop{} that comes later in the instruction stream, provided
that all the results it needs are already available.
For this reason, many processors are now \emph{out-of-order}, while processors
issuing \uops{} strictly in their original order are called \emph{in-order}.
Out-of-order microarchitectures feature a \emph{reorder buffer}, from which
instructions are picked to be issued. The reorder buffer acts as a sliding
window of microarchitecturally-fixed size over \uops{}, from which the oldest
\uop{} whose dependencies are satisfied will be executed. Thus, out-of-order
CPUs are only able to execute operations out of order as long as the
\uop{} to be executed is not too far ahead of the oldest \uop{} still waiting
to be issued ---~specifically, no more than the size of the reorder buffer ahead.
It is also important to note that out-of-order processors are only out-of-order
\emph{from a certain point on}: a substantial part of the processor's frontend
is typically still in-order.
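The following C sketch (an illustration, not a measured benchmark) shows the
kind of situation out-of-order execution helps with: the load may miss in the
cache and take hundreds of cycles, while the arithmetic on \texttt{x} does not
depend on it. An out-of-order core can execute that arithmetic while the load
is still pending, as long as everything fits within the reorder buffer; an
in-order core stalls instead.

\begin{verbatim}
long overlap(const long *p, long x) {
    long v = *p;           /* may miss in cache: long latency       */
    long y = x * 3 + 7;    /* independent of the load               */
    long z = y * y - x;    /* still independent                     */
    return v + z;          /* first actual use of the loaded value  */
}
\end{verbatim}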
\subsubsection{Hardware counters}
Many processors provide \emph{hardware counters}, to help (low-level)
programmers understand how their code is executed. The available counters vary
widely from one processor to another. The majority of processors, however,
offer counters to determine the number of elapsed cycles between two
instructions, as well as the number of retired instructions. Some processors
further offer counters for the number of cache misses and hits on the various
caches, or even the number of \uops{} executed on a specific port.

While access to these counters is vendor-dependent, abstraction layers are
available: for instance, the Linux kernel abstracts these counters through the
\perf{} interface, while \papi{} further attempts to unify similar counters
from different vendors under a common name.
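As an example of the \perf{} interface mentioned above, the following C sketch
counts the core cycles elapsed around a placeholder loop; the counter is
requested through the \texttt{perf\_event\_open} system call, and error
handling is omitted for brevity.

\begin{verbatim}
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES;  /* elapsed core cycles      */
    attr.disabled = 1;                       /* start counting on demand */
    attr.exclude_kernel = 1;                 /* user-space cycles only   */

    int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    volatile uint64_t acc = 0;               /* placeholder measured code */
    for (int i = 0; i < 1000000; i++)
        acc += i;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t cycles = 0;
    read(fd, &cycles, sizeof(cycles));
    printf("cycles: %llu\n", (unsigned long long)cycles);
    return 0;
}
\end{verbatim}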
\subsubsection{SIMD operations}
\todo{}

View file

@ -19,13 +19,14 @@
* Instruction --[frontend]--> Mop, muop
* muop --[backend port]--> retired [side effects]
* vast majority of cases: execution units are fully pipelined
* out of order CPUs:
* Frontend in order up to some point
* ROB
* backend out-of-order
* ROB: execution window. ILP limited to this window.
* Dependencies handling
* Dependencies are breaking the pipeline!
* Renamer: helps up to a point
* Hardware counters