diff --git a/manuscrit/40_A72-frontend/50_future_works.tex b/manuscrit/40_A72-frontend/50_future_works.tex
index f09db29..3c79686 100644
--- a/manuscrit/40_A72-frontend/50_future_works.tex
+++ b/manuscrit/40_A72-frontend/50_future_works.tex
@@ -91,8 +91,8 @@ cross an instruction cache line boundary~\cite{uica}.
 Processors implementing ISAs subject to decoding bottleneck typically also
 feature a decoded \uop{} cache. The typical hit rate of this cache is about
-80\%~\cite[Section
-B.5.7.2]{ref:intel64_software_dev_reference_vol1}\cite{dead_uops}. However,
+80\%~\cites[Section
+B.5.7.2]{ref:intel64_software_dev_reference_vol1}{dead_uops}. However,
 code analyzers are concerned with loops and, more generally, hot code
 portions. Under such conditions, we expect this cache, once hot in
 steady-state, to be very close to a 100\% hit rate. In this case, only the
 dispatch throughput will
@@ -114,8 +114,17 @@ be investigated if the model does not reach the expected accuracy.
     necessity to hit a cache. We are unaware of other architectures with such
     a feature.
 
-    \item{} macro-ops \todo{}
+    \item{} In reality, there is an intermediate step between instructions
+        and \uops{}: macro-ops. Although it serves a design and semantic
+        purpose, we omit this step in the current model as --~we
+        believe~-- it is of little importance for performance prediction.
 
-    \item{} fusion, lamination \todo{}
+    \item{} On x86 architectures at least, common pairs of micro- or
+        macro-operations may be ``fused'' into a single one up to various
+        stages of the pipeline, saving space in some queues or artificially
+        relaxing dispatch limits. This mechanism is implemented in Intel
+        architectures and, to some extent, in AMD architectures since
+        Zen~\cites[§3.4.2]{ref:intel64_architectures_optim_reference_vol1}{uica}{Vishnekov_2021}.
+        This may make some kernels seem to ``bypass'' dispatch limits.
 
 \end{itemize}
diff --git a/manuscrit/biblio/misc.bib b/manuscrit/biblio/misc.bib
index fc792a7..b5e8849 100644
--- a/manuscrit/biblio/misc.bib
+++ b/manuscrit/biblio/misc.bib
@@ -214,3 +214,19 @@
 pages={361-374},
 keywords={Program processors;Microarchitecture;Computer architecture;Timing;System-on-chip;Transient analysis},
 doi={10.1109/ISCA52012.2021.00036}}
+
+@article{Vishnekov_2021,
+  doi = {10.1088/1742-6596/1740/1/012053},
+  url = {https://dx.doi.org/10.1088/1742-6596/1740/1/012053},
+  year = {2021},
+  month = {jan},
+  publisher = {IOP Publishing},
+  volume = {1740},
+  number = {1},
+  pages = {012053},
+  author = {A V Vishnekov and E M Ivanova and N A Stepanov and N D Shaimov},
+  title = {A Simulation Model for Macro- and Micro-Fusion Algorithms in the CPU Core},
+  journal = {Journal of Physics: Conference Series},
+  abstract = {The article discusses the features of modern processor’s microarchitecture, the method of instruction’s and micro-operation’s accelerated execution. The research focuses on the organization of the decoding stage in the CPU core pipeline and Macro- and Micro-fusion algorithms. The Macro- and Micro-fusion mechanisms are defined. A computer simulator has been developed to explore these mechanisms. The developed software has a user-friendly interface, is easy to use, and combines training and research options. The computer simulator demonstrates the sequence of mechanism’s implementation; the resulting macro- or micro-operations set after Macro- and Micro-fusion, and also reflects each algorithm features for different processor’s families. The software allows you to use either a pre-prepared file with Assembler (x86) code fragments as source data, or enter/change the source code fragments at your request. The main combinations of machine instructions that can be fused into a single macro-operation are considered, as well as instructions that can be decoded into fused micro-operations. The simulator can be useful both for Computer Science \& Engineering students, especially for on-line education, and for researchers and general-purpose CPU core developers.}
+}
+