\section{Resource models}\label{sec:palmed_resource_models}

\subsection{Usual representation: tripartite disjunctive graph}

As we saw earlier in \autoref{sec:intro_microarch}, the behaviour of a CPU's backend can be, throughput-wise, characterized by the behaviour of its ports. Thus, a throughput model of the backend consists of a mapping of the ISA's instructions to the execution ports of the backend, called a \emph{port mapping}. The mapping, however, is not direct: we also saw in \autoref{sec:intro_microarch} that instructions are themselves broken down into a number of micro-operations (\uops{}), which all have to be executed. Each of those \uops{} is then scheduled on one of the compatible execution ports of the CPU\@. A port mapping, thus, is actually a tripartite graph: a first layer mapping instructions to \uops{}, followed by a second layer mapping \uops{} to ports. In \autoref{fig:sample_port_mapping}, we show such a port mapping for a few x86-64 instructions on the SKL-SP microarchitecture. The \uopsinfo{} framework~\cite{uopsinfo}, for instance, produces such a model: each instruction's mapping is described as a string, \eg{} \texttt{VCVTT}\footnote{The precise variant is \texttt{VCVTTSD2SI (R32, XMM)}} is described as \texttt{1*p0+1*p01}.

The two layers of such a model play very different roles. Indeed, the top layer (instructions to \uops{}) can be seen as an \emph{and}, or \emph{conjunctive}, layer: an instruction is decomposed into each of its \uops{}, which must all be executed for the instruction to be completed. The bottom layer (\uops{} to ports), however, can be seen as an \emph{or}, or \emph{disjunctive}, layer: a \uop{} must be executed on \emph{one} of those ports, each of them able to execute this \uop{}. This can be seen in the example from \uopsinfo{} above: \texttt{VCVTT} is decomposed into two \uops{}, the first necessarily executed on port 0, the second on either port 0 or port 1.

\medskip{}

\begin{figure}
\centering
\includegraphics[width=0.65\textwidth]{p016_tri.svg}
\caption{\label{fig:sample_port_mapping}Port mapping and maximum port throughput for a few SKL-SP instructions.}
\end{figure}

We also saw that on modern CPUs, ports and computation units are, most of the time, fully pipelined; that is, each port can start a new \uop{} every cycle, even though actually executing a \uop{} may take multiple cycles. Thus, instruction latencies are not needed to compute the steady-state throughput of a kernel without dependencies: a port mapping is sufficient. As some \uops{} are compatible with multiple ports, however, the number of cycles required to run one occurrence of a kernel is not trivial to determine. An assignment, for a given kernel, of its constitutive \uops{} to ports is a \emph{schedule} ---~the number of cycles taken by a kernel under a fixed schedule is well-defined. The throughput of a kernel is then defined as its throughput under an optimal schedule.

\begin{example}[Kernel throughputs with port mappings]
The kernel $\kerK_1 = \texttt{DIVPS} + \texttt{BSR} + \texttt{JMP}$ can complete in one cycle: $\cyc{\kerK_1} = 1$. Indeed, according to the port mapping in \autoref{fig:sample_port_mapping}, each of those instructions is decoded into a single \uop{}, each compatible with a single, distinct port. Thus, the three instructions can be issued in parallel in one cycle. The same goes for $\kerK_2 = \texttt{ADDSS} + \texttt{BSR}$, although it is a bit less trivial. Both instructions decode to a single \uop{}.
\texttt{BSR} can only be executed by port $p_1$, while \texttt{ADDSS} can be executed by either port $p_0$ or $p_1$: by picking $p_0$, both instructions can be executed in a single cycle in steady state, hence $\cyc{\kerK_2} = 1$. The kernel $\kerK_3 = \texttt{ADDSS} + 2\times\texttt{BSR}$, however, needs at least two cycles to be executed: \texttt{BSR} can only be executed on port $p_1$, which can execute at most one \uop{} per cycle; hence, $\cyc{\kerK_3} = 2$. The instruction \texttt{ADDSS} alone, however, can be executed twice per cycle: once on $p_0$ and once on $p_1$. The kernel $\kerK_4 = 2\times\texttt{ADDSS} + \texttt{BSR}$ can thus be executed in 1.5 cycles on average: $\cyc{\kerK_4} = 1.5$.
\medskip
The following tables present an optimal schedule for each of the kernels $\kerK_2, \kerK_3, \kerK_4$; each row represents a cycle.
\begin{minipage}[t]{0.3\textwidth}
\centering
$\kerK_2$
\smallskip
\begin{tabular}{c c}
\toprule
$p_0$ & $p_1$ \\ \midrule
\texttt{ADDSS} & \texttt{BSR} \\
\texttt{ADDSS} & \texttt{BSR} \\
\multicolumn{2}{c}{$\vdots$}\\ \bottomrule
\end{tabular}
\end{minipage}\hfill\begin{minipage}[t]{0.3\textwidth}
\centering
$\kerK_3$
\smallskip
\begin{tabular}{c c}
\toprule
$p_0$ & $p_1$ \\ \midrule
\texttt{ADDSS} & \texttt{BSR} \\
$\emptyset$ & \texttt{BSR} \\
\texttt{ADDSS} & \texttt{BSR} \\
$\emptyset$ & \texttt{BSR} \\
\multicolumn{2}{c}{$\vdots$}\\ \bottomrule
\end{tabular}
\end{minipage}\hfill\begin{minipage}[t]{0.3\textwidth}
\centering
$\kerK_4$
\smallskip
\begin{tabular}{c c}
\toprule
$p_0$ & $p_1$ \\ \midrule
\texttt{ADDSS} & \texttt{BSR} \\
\texttt{ADDSS} & \texttt{BSR} \\
\texttt{ADDSS} & \texttt{ADDSS} \\
\multicolumn{2}{c}{$\vdots$}\\ \bottomrule
\end{tabular}
\end{minipage}
\end{example}

Finding the throughput of the kernels presented above is easy enough, as these kernels involve only a few \uops{} that are compatible with several ports. In the general case, however, finding an optimal schedule is more involved; it can in fact be expressed as a flow problem, described in Section~3.5.1 of \fgruber{}'s PhD thesis~\cite{phd:gruber}.
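To make this optimisation problem concrete, the following sketch (an illustration written for this section, not code from \palmed{}) computes the optimal steady-state cycle count of a kernel under a disjunctive port mapping as a small linear program, using \texttt{scipy}: each \uop{} occurrence is fractionally spread over its compatible ports so as to minimise the maximum port load. It assumes, as above, fully pipelined ports executing at most one \uop{} per cycle, and computes the same quantity as the flow formulation mentioned above.

\begin{verbatim}
from scipy.optimize import linprog

def port_model_cycles(uop_ports, counts):
    """uop_ports: one set of compatible ports per kind of uop;
    counts: occurrences of each kind per kernel iteration."""
    ports = sorted(set().union(*uop_ports))
    n, m = len(uop_ports), len(ports)
    nvar = n * m + 1                    # x[u, p] (row-major), then t
    c = [0.0] * (n * m) + [1.0]         # objective: minimise t
    A_eq, b_eq = [], []                 # every occurrence is scheduled
    for u in range(n):
        row = [0.0] * nvar
        for j in range(m):
            row[u * m + j] = 1.0
        A_eq.append(row)
        b_eq.append(counts[u])
    A_ub, b_ub = [], []                 # load of each port is at most t
    for j in range(m):
        row = [0.0] * nvar
        for u in range(n):
            row[u * m + j] = 1.0
        row[-1] = -1.0
        A_ub.append(row)
        b_ub.append(0.0)
    # Forbid scheduling a uop on a port that cannot execute it.
    bounds = [(0, None) if ports[j] in uop_ports[u] else (0, 0)
              for u in range(n) for j in range(m)] + [(0, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    return res.x[-1]

# Running example: ADDSS -> {p0, p1}, BSR -> {p1}, one uop each.
print(port_model_cycles([{0, 1}, {1}], [1, 1]))   # K2: 1.0
print(port_model_cycles([{0, 1}, {1}], [1, 2]))   # K3: 2.0
print(port_model_cycles([{0, 1}, {1}], [2, 1]))   # K4: 1.5
\end{verbatim}

On the running example, this recovers the values found by hand: $\cyc{\kerK_2} = 1$, $\cyc{\kerK_3} = 2$ and $\cyc{\kerK_4} = 1.5$.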
\subsection{Dual representation: conjunctive resource mapping}

\begin{figure}[b]
\centering
\begin{subfigure}[b]{0.65\textwidth}\centering
\includegraphics[width=\textwidth]{p016_bi.svg}
\caption{Full resource mapping}\label{fig:sample_resource_mapping}
\end{subfigure}\hfill
\begin{subfigure}[b]{0.30\textwidth}\centering
\includegraphics[width=0.9\textwidth]{p016_norm.svg}
\caption{Normalized}\label{fig:norm_sample_resource_mapping}
\end{subfigure}
\caption{Abstract resource mapping (conjunctive form) and maximum resource throughput for a few SKL-SP instructions.}
\end{figure}

The method behind \palmed{} is based on the observation that a port mapping admits a dual representation, in which the bottom layer is no longer expressed as an ``or'' but, like the top layer, as an ``and''. In this dual model, an instruction such as \texttt{ADDSS} does not use \emph{either} $p_0$ or $p_1$; instead, it uses the combined resource $r_{01}$ once, a resource whose throughput is 2. Instructions such as \texttt{BSR}, which use only $p_1$, use \emph{both} $r_1$ and $r_{01}$. In \autoref{fig:sample_resource_mapping}, we present the resource mapping equivalent to the port mapping of \autoref{fig:sample_port_mapping}. As both the top and bottom layers are now conjunctive, we remove the intermediate nodes altogether: an instruction directly consumes resources. We then normalize this graph to resources with a unitary throughput by dividing each edge's weight by the throughput of the corresponding resource. The normalized mapping for \texttt{ADDSS} and \texttt{BSR} is presented in \autoref{fig:norm_sample_resource_mapping}. The construction of this dual model, and its equivalence to the original, disjunctive model, is detailed in the extended version of the \palmed{} article~\cite{palmed}.

Finding the throughput of a kernel with this conjunctive representation does not require solving an optimisation problem: the number of cycles required by a kernel is simply the maximum load over all resources, as the short sketch at the end of this section illustrates. More formally, we write $\rho_{i,r}$ for the weight of the edge between instruction $i$ and resource $r$, and $\calR$ for the set of resources. We consider a kernel $\kerK = \sum_{i \in \kerK} \sigma_{i,\kerK} \times i$, where $\sigma_{i,\kerK}$ is the number of occurrences of instruction $i$ in $\kerK$. Then, in steady state, the backend needs $\cyc{\kerK}$ cycles to execute $\kerK$, with
\begin{align}
\cyc{\kerK} = \max_{r \in \calR}\left( \sum_{i \in \kerK} \sigma_{i,\kerK} \times \rho_{i,r} \right) \label{eqn:res_model_rthroughput}
\end{align}

\begin{example}
The throughputs of the previous kernels can instead be computed using the conjunctive resource model.
\begin{minipage}[t]{0.3\textwidth}
\centering
$\kerK_2$
\smallskip
\begin{tabular}{l r r r}
\toprule
 & $r_0$ & $r_1$ & $r_{01}$ \\ \midrule
\texttt{ADDSS} & & & $\sfrac{1}{2}$ \\
\texttt{BSR} & & 1 & $\sfrac{1}{2}$ \\ \midrule
Total & 0 & \textbf{1} & \textbf{1} \\ \bottomrule
\end{tabular}
\smallskip{}
$\implies{} \cyc{\kerK_2} = 1$
\end{minipage}\hfill\begin{minipage}[t]{0.3\textwidth}
\centering
$\kerK_3$
\smallskip
\begin{tabular}{l r r r}
\toprule
 & $r_0$ & $r_1$ & $r_{01}$ \\ \midrule
\texttt{ADDSS} & & & $\sfrac{1}{2}$ \\
$2\times$\texttt{BSR} & & 2 & 1 \\ \midrule
Total & 0 & \textbf{2} & 1.5 \\ \bottomrule
\end{tabular}
\smallskip{}
$\implies{} \cyc{\kerK_3} = 2$
\end{minipage}\hfill\begin{minipage}[t]{0.3\textwidth}
\centering
$\kerK_4$
\smallskip
\begin{tabular}{l r r r}
\toprule
 & $r_0$ & $r_1$ & $r_{01}$ \\ \midrule
$2\times$\texttt{ADDSS} & & & 1 \\
\texttt{BSR} & & 1 & $\sfrac{1}{2}$ \\ \midrule
Total & 0 & 1 & \textbf{1.5} \\ \bottomrule
\end{tabular}
\smallskip{}
$\implies{} \cyc{\kerK_4} = 1.5$
\end{minipage}
\end{example}

The drawback of this conjunctive model, however, is that it may, in theory, generate a combinatorial number of new resources. In practice, this blow-up does not happen: a combined resource is only necessary if at least one \uop{} is supported by this set of combined ports. On real processors, ports are not arbitrary, but instead each has a well-defined set of functions, \eg{} arithmetic, memory access, etc. Thus, only a very limited number of combined resources are necessary.
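As an illustration of \autoref{eqn:res_model_rthroughput}, the short sketch below (again ours, not \palmed{}'s code) recomputes these throughputs. The normalized weights $\rho_{i,r}$ of \autoref{fig:norm_sample_resource_mapping} are hard-coded; building the set of resources itself is the construction detailed in~\cite{palmed} and is not reproduced here. The cycle count is then a plain maximum of per-resource loads, with no optimisation involved.

\begin{verbatim}
# Normalized conjunctive mapping rho_{i,r} of the running example;
# r_0 is omitted since neither instruction uses it.
RHO = {
    "ADDSS": {"r01": 0.5},
    "BSR":   {"r1": 1.0, "r01": 0.5},
}

def resource_model_cycles(kernel):
    """kernel: {instruction: multiplicity sigma_i}.
    Computes cyc(K) = max over r of sum_i sigma_i * rho_{i,r}."""
    load = {}
    for insn, sigma in kernel.items():
        for res, rho in RHO[insn].items():
            load[res] = load.get(res, 0.0) + sigma * rho
    return max(load.values())

print(resource_model_cycles({"ADDSS": 1, "BSR": 1}))   # K2: 1.0
print(resource_model_cycles({"ADDSS": 1, "BSR": 2}))   # K3: 2.0
print(resource_model_cycles({"ADDSS": 2, "BSR": 1}))   # K4: 1.5
\end{verbatim}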