Parametric frontend: add Fabrice's suggestions

2024-06-18 12:06:42 +02:00 · 2024-06-18 12:06:42 +02:00 · 8c0e5e4710
commit 8c0e5e4710
parent 7dc4ec9935
2 changed files with 49 additions and 13 deletions
--- a/manuscrit/40_A72-frontend/50_future_works.tex
+++ b/manuscrit/40_A72-frontend/50_future_works.tex
@ -92,10 +92,10 @@ may prove to be a huge frontend slowdown, especially when such instructions
 cross an instruction cache line boundary~\cite{uica}.
 Processors implementing ISAs subject to decoding bottleneck typically also
-feature a decoded \uop{} cache. The typical hit rate of this cache is about
+feature a decoded \uop{} cache, or \emph{decoded stream buffer} (DSB). The
-80\%~\cites[Section
+typical hit rate of this cache is about 80\%~\cites[Section
-B.5.7.2]{ref:intel64_software_dev_reference_vol1}{dead_uops}. However,
+B.5.7.2]{ref:intel64_software_dev_reference_vol1}{dead_uops}. However, code
-code analyzers are concerned with loops and, more generally, hot code portions.
+analyzers are concerned with loops and, more generally, hot code portions.
 Under such conditions, we expect this cache, once hot in steady-state, to be
 very close to a 100\% hit rate. In this case, only the dispatch throughput will
 be limiting, and modeling the decoding bottlenecks becomes irrelevant.
@ -109,12 +109,30 @@ be investigated if the model does not reach the expected accuracy.
 \begin{itemize}
-    \item{} Intel CPUs use a Loop Stream Detector (LSD) to keep
+    \item{} We introduced just above the DSB (\uop{} cache). This model
-        in the decode queue a whole loop's body of \uops{} if the frontend detects that a
+        considers that the DSB will never be the cause of a bottleneck and
        that, instead, the number of dispatched \uops{} per cycle will always
        bottleneck before. This might not be true, as DSBs are complex in
        themselves already~\cite{uica}.
    \item{} Intel CPUs use a Loop Stream Detector (LSD) to keep in the decode
        queue a whole loop's body of \uops{} if the frontend detects that a
        small enough loop is repeated~\cite{uica, dead_uops}. In this case,
        \uops{} are repeatedly streamed from the decode queue, without even the
-        necessity to hit a cache. We are unaware of
+        necessity to hit a cache. We are unaware of similar features in other
-        other architectures with such a feature.
+        commercial processors. In embedded programming, however, \emph{hardware
        loops} --~which are set up explicitly by the programmer~-- achieve,
        among others, the same goal~\cite{hardware_loops_patent}.
    \item{} The \emph{branch predictor} of a CPU is responsible for guessing,
        before the actual logic is computed, whether a conditional jump will be
        taken. A misprediction forces the frontend to re-populate its queues
        with instructions from the branch actually taken and typically stalls
        the pipeline for several cycles~\cite{branch_pred_penalty}. Our model,
        however, does not include a branch predictor for much the same reason
        that it does not include complex decoder: in steady-state, in a hot
        code portion, we expect the branch predictor to always predict
        correctly.
    \item{} In reality, there is an intermediary step between instructions and
        \uops{}: macro-ops. Although it serves a designing and semantic
--- a/manuscrit/biblio/misc.bib
+++ b/manuscrit/biblio/misc.bib
@ -230,3 +230,21 @@
    abstract = {The article discusses the features of modern processor’s microarchitecture, the method of instruction’s and micro-operation’s accelerated execution. The research focuses on the organization of the decoding stage in the CPU core pipeline and Macro- and Micro-fusion algorithms. The Macro- and Micro-fusion mechanisms are defined. A computer simulator has been developed to explore these mechanisms. The developed software has a user-friendly interface, is easy to use, and combines training and research options. The computer simulator demonstrates the sequence of mechanism’ s implementation; the resulting macro-or microoperations set after Macro- and Micro-fusion, and also reflects each algorithm features for different processor’s families. The software allows you to use either a pre-prepared file with Assembler (x86) code fragments as source data, or enter/change the source code fragments at your request. The main combinations of machine instructions that can be fused into a single macro-operation are considered, as well as instructions that can be decoded into fused micro-operations. The simulator can be useful both for in Computer Science &amp; Engineering students, especially for on-line education and for researchers and General-purpose CPU cores developers.}
 }
@inproceedings{branch_pred_penalty,
  author={Eyerman, S. and Smith, J.E. and Eeckhout, L.},
  booktitle={2006 IEEE International Symposium on Performance Analysis of Systems and Software},
  title={Characterizing the branch misprediction penalty},
  year={2006},
  volume={},
  number={},
  pages={48-58},
  keywords={Pipelines;Delay;Performance analysis;Impedance;Length measurement;Clocks;Analytical models;Time measurement;Data analysis},
  doi={10.1109/ISPASS.2006.1620789}}
@misc{hardware_loops_patent,
  title={Hardware loops},
  author={Singh, Ravi P and Roth, Charles P and Overkamp, Gregory A},
  year={2004},
  month=jun # "~8",
  note={US Patent 6,748,523}
 }