Parametric frontend: add Fabrice's suggestions

Théophile Bastian 2024-06-18 12:06:42 +02:00
parent 7dc4ec9935
commit 8c0e5e4710
2 changed files with 49 additions and 13 deletions


@@ -92,10 +92,10 @@ may prove to be a huge frontend slowdown, especially when such instructions
 cross an instruction cache line boundary~\cite{uica}.
 Processors implementing ISAs subject to decoding bottleneck typically also
-feature a decoded \uop{} cache. The typical hit rate of this cache is about
-80\%~\cites[Section
-B.5.7.2]{ref:intel64_software_dev_reference_vol1}{dead_uops}. However,
-code analyzers are concerned with loops and, more generally, hot code portions.
+feature a decoded \uop{} cache, or \emph{decoded stream buffer} (DSB). The
+typical hit rate of this cache is about 80\%~\cites[Section
+B.5.7.2]{ref:intel64_software_dev_reference_vol1}{dead_uops}. However, code
+analyzers are concerned with loops and, more generally, hot code portions.
 Under such conditions, we expect this cache, once hot in steady-state, to be
 very close to a 100\% hit rate. In this case, only the dispatch throughput will
 be limiting, and modeling the decoding bottlenecks becomes irrelevant.
@@ -109,12 +109,30 @@ be investigated if the model does not reach the expected accuracy.
 \begin{itemize}
-\item{} Intel CPUs use a Loop Stream Detector (LSD) to keep
-in the decode queue a whole loop's body of \uops{} if the frontend detects that a
+\item{} We introduced just above the DSB (\uop{} cache). This model
+considers that the DSB will never be the cause of a bottleneck and
+that, instead, the number of dispatched \uops{} per cycle will always
+bottleneck before. This might not be true, as DSBs are complex in
+themselves already~\cite{uica}.
+\item{} Intel CPUs use a Loop Stream Detector (LSD) to keep in the decode
+queue a whole loop's body of \uops{} if the frontend detects that a
 small enough loop is repeated~\cite{uica, dead_uops}. In this case,
 \uops{} are repeatedly streamed from the decode queue, without even the
-necessity to hit a cache. We are unaware of
-other architectures with such a feature.
+necessity to hit a cache. We are unaware of similar features in other
+commercial processors. In embedded programming, however, \emph{hardware
+loops} --~which are set up explicitly by the programmer~-- achieve,
+among others, the same goal~\cite{hardware_loops_patent}.
+\item{} The \emph{branch predictor} of a CPU is responsible for guessing,
+before the actual logic is computed, whether a conditional jump will be
+taken. A misprediction forces the frontend to re-populate its queues
+with instructions from the branch actually taken and typically stalls
+the pipeline for several cycles~\cite{branch_pred_penalty}. Our model,
+however, does not include a branch predictor for much the same reason
+that it does not include complex decoder: in steady-state, in a hot
+code portion, we expect the branch predictor to always predict
+correctly.
 \item{} In reality, there is an intermediary step between instructions and
 \uops{}: macro-ops. Although it serves a designing and semantic


@@ -230,3 +230,21 @@
 abstract = {The article discusses the features of modern processors microarchitecture, the method of instructions and micro-operations accelerated execution. The research focuses on the organization of the decoding stage in the CPU core pipeline and Macro- and Micro-fusion algorithms. The Macro- and Micro-fusion mechanisms are defined. A computer simulator has been developed to explore these mechanisms. The developed software has a user-friendly interface, is easy to use, and combines training and research options. The computer simulator demonstrates the sequence of mechanism s implementation; the resulting macro-or microoperations set after Macro- and Micro-fusion, and also reflects each algorithm features for different processors families. The software allows you to use either a pre-prepared file with Assembler (x86) code fragments as source data, or enter/change the source code fragments at your request. The main combinations of machine instructions that can be fused into a single macro-operation are considered, as well as instructions that can be decoded into fused micro-operations. The simulator can be useful both for in Computer Science & Engineering students, especially for on-line education and for researchers and General-purpose CPU cores developers.}
 }
+@inproceedings{branch_pred_penalty,
+author={Eyerman, S. and Smith, J.E. and Eeckhout, L.},
+booktitle={2006 IEEE International Symposium on Performance Analysis of Systems and Software},
+title={Characterizing the branch misprediction penalty},
+year={2006},
+volume={},
+number={},
+pages={48-58},
+keywords={Pipelines;Delay;Performance analysis;Impedance;Length measurement;Clocks;Analytical models;Time measurement;Data analysis},
+doi={10.1109/ISPASS.2006.1620789}}
+@misc{hardware_loops_patent,
+title={Hardware loops},
+author={Singh, Ravi P and Roth, Charles P and Overkamp, Gregory A},
+year={2004},
+month=jun # "~8",
+note={US Patent 6,748,523}
+}
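
The branch-predictor item added in this commit argues that, in steady state on a hot code portion, the predictor is expected to predict correctly. The following minimal sketch illustrates that argument with a textbook 2-bit saturating counter. This is an assumption made for illustration only: it is not the predictor of any specific CPU, nor part of the model described in the commit. The function name, the initial counter state, and the 10-iteration warm-up window are all hypothetical.

```python
def simulate_two_bit_predictor(outcomes, initial_state=1):
    """Return, for each branch outcome, whether a 2-bit counter mispredicted it.

    Assumed scheme: states 0-1 predict not-taken, states 2-3 predict taken.
    """
    state = initial_state
    mispredictions = []
    for taken in outcomes:
        predicted_taken = state >= 2
        mispredictions.append(predicted_taken != taken)
        # Saturating update toward the observed outcome.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return mispredictions


# Hypothetical hot loop: the back-edge branch is taken on every iteration.
outcomes = [True] * 1000
misses = simulate_two_bit_predictor(outcomes)

warmup_misses = sum(misses[:10])        # mispredictions while the counter warms up
steady_state_misses = sum(misses[10:])  # mispredictions once the counter is saturated

print(warmup_misses, steady_state_misses)
```

Under these assumptions the only misprediction occurs during warm-up, which is the behaviour the steady-state hypothesis relies on. A loop that eventually exits would add a misprediction at the exit, which a throughput model of the loop body does not account for.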