Parametric frontend: first writeup

2024-06-18 09:50:28 +02:00
commit ff7157993d
parent e717475763
2 changed files with 29 additions and 4 deletions
--- a/manuscrit/40_A72-frontend/50_future_works.tex
+++ b/manuscrit/40_A72-frontend/50_future_works.tex
@@ -91,8 +91,8 @@ cross an instruction cache line boundary~\cite{uica}.
 Processors implementing ISAs subject to decoding bottleneck typically also
 feature a decoded \uop{} cache. The typical hit rate of this cache is about
-80\%~\cite[Section
+80\%~\cites[Section
-B.5.7.2]{ref:intel64_software_dev_reference_vol1}\cite{dead_uops}. However,
+B.5.7.2]{ref:intel64_software_dev_reference_vol1}{dead_uops}. However,
 code analyzers are concerned with loops and, more generally, hot code portions.
 Under such conditions, we expect this cache, once hot in steady-state, to be
 very close to a 100\% hit rate. In this case, only the dispatch throughput will
@@ -114,8 +114,17 @@ be investigated if the model does not reach the expected accuracy.
        necessity to hit a cache. We are unaware of
        other architectures with such a feature.
-    \item{} macro-ops \todo{}
+    \item{} In reality, there is an intermediary step between instructions and
+        \uops{}: macro-ops. Although it serves a design and semantic
+        purpose, we omit this step in the current model as --~we
+        believe~-- it is of little importance for predicting performance.
-    \item{} fusion, lamination \todo{}
+    \item{} On x86 architectures at least, common pairs of micro- or
+        macro-operations may be ``fused'' into a single one, up to various
+        stages of the pipeline, to save space in some queues or artificially
+        raise dispatch limits. This mechanism is implemented in Intel
+        architectures, and to some extent in AMD architectures since
+        Zen~\cites[§3.4.2]{ref:intel64_architectures_optim_reference_vol1}{uica}{Vishnekov_2021}.
+        This may make some kernels seem to ``bypass'' dispatch limits.
 \end{itemize}
--- a/manuscrit/biblio/misc.bib
+++ b/manuscrit/biblio/misc.bib
@@ -214,3 +214,19 @@
    pages={361-374},
    keywords={Program processors;Microarchitecture;Computer architecture;Timing;System-on-chip;Transient analysis},
    doi={10.1109/ISCA52012.2021.00036}}
+@article{Vishnekov_2021,
+    doi = {10.1088/1742-6596/1740/1/012053},
+    url = {https://dx.doi.org/10.1088/1742-6596/1740/1/012053},
+    year = {2021},
+    month = {jan},
+    publisher = {IOP Publishing},
+    volume = {1740},
+    number = {1},
+    pages = {012053},
+    author = {A V Vishnekov and E M Ivanova and N A Stepanov and N D Shaimov},
+    title = {A Simulation Model for Macro- and Micro-Fusion Algorithms in the CPU Core},
+    journal = {Journal of Physics: Conference Series},
+    abstract = {The article discusses the features of modern processor’s microarchitecture, the method of instruction’s and micro-operation’s accelerated execution. The research focuses on the organization of the decoding stage in the CPU core pipeline and Macro- and Micro-fusion algorithms. The Macro- and Micro-fusion mechanisms are defined. A computer simulator has been developed to explore these mechanisms. The developed software has a user-friendly interface, is easy to use, and combines training and research options. The computer simulator demonstrates the sequence of mechanism’ s implementation; the resulting macro-or microoperations set after Macro- and Micro-fusion, and also reflects each algorithm features for different processor’s families. The software allows you to use either a pre-prepared file with Assembler (x86) code fragments as source data, or enter/change the source code fragments at your request. The main combinations of machine instructions that can be fused into a single macro-operation are considered, as well as instructions that can be decoded into fused micro-operations. The simulator can be useful both for in Computer Science \& Engineering students, especially for on-line education and for researchers and General-purpose CPU cores developers.}
+ }
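
Note on the fusion item added in 50_future_works.tex: the claim that fusion may make some kernels seem to "bypass" dispatch limits can be illustrated with a small sketch. This is not the model used in the manuscript. The fusible mnemonic sets, the dispatch width of 4, and the one-instruction-one-micro-op simplification are all assumptions chosen for illustration; real fusion rules depend on the microarchitecture and on operand forms.

```python
# Hypothetical fusion rules, for illustration only: a compare-like
# instruction immediately followed by a conditional jump shares one slot.
FUSIBLE_FIRST = {"cmp", "test"}
FUSIBLE_SECOND = {"je", "jne", "jl", "jge"}


def fused_slot_count(mnemonics):
    """Count dispatch slots when adjacent fusible pairs occupy a single slot.

    Simplification (assumption): every non-fused instruction is one micro-op.
    """
    slots = 0
    i = 0
    while i < len(mnemonics):
        is_pair = (
            i + 1 < len(mnemonics)
            and mnemonics[i] in FUSIBLE_FIRST
            and mnemonics[i + 1] in FUSIBLE_SECOND
        )
        i += 2 if is_pair else 1  # a fused pair consumes two instructions
        slots += 1                # but only one dispatch slot
    return slots


def cycles_per_iteration(slots, dispatch_width):
    """Steady-state lower bound on cycles per iteration from dispatch alone."""
    return slots / dispatch_width


DISPATCH_WIDTH = 4  # assumed value, not taken from any specific processor
kernel = ["add", "add", "add", "cmp", "jne"]  # hypothetical loop body

unfused_bound = cycles_per_iteration(len(kernel), DISPATCH_WIDTH)
fused_bound = cycles_per_iteration(fused_slot_count(kernel), DISPATCH_WIDTH)
print(unfused_bound, fused_bound)
```

Under these assumptions, a model that ignores fusion bounds the five-instruction loop at 1.25 cycles per iteration, while the fused count of four slots allows 1.0. A measurement close to 1.0 would then look like the kernel exceeding the dispatch limit of a fusion-unaware model.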