101 lines
5 KiB
TeX
101 lines
5 KiB
TeX
\section{Measuring a kernel's throughput: \pipedream{}}
|
|
|
|
\todo{Introduce pipedream notation for instructions}
|
|
|
|
To build a mapping of a CPU, Palmed fundamentally depends on the ability to
|
|
measure the execution time $\cyc{\kerK}$ of a kernel $\kerK$. However, as we
|
|
saw above, Palmed defines a kernel as a multiset of instructions, and makes
|
|
hypotheses on the measures accordingly. Specifically, this measure should
|
|
reflect only out-of-order behaviours, without any dependency between
|
|
instructions. The behaviour should be purely the CPU's; specifically, no memory
|
|
effect should be accounted for, and thus, data should be L1-resident. Finally,
|
|
even if hardware counters could be used to provide such a metric, they should
|
|
be ignored, as Palmed aims to be as universal as possible and should avoid
|
|
depending on vendor-specific counters.
|
|
|
|
For this purpose, we use \pipedream{} as a benchmarking backend, a tool
|
|
initially written by \fgruber{} and described in his PhD
|
|
thesis~\cite{phd:gruber}, with further developments from \nderumig{} and
|
|
\cguillon{}. Originally written for a broader purpose, \pipedream{} is able to
|
|
generate assembly code to benchmark a multiset of instructions, fitting the
|
|
constraints mentioned above. The generated assembly uses the following
|
|
high-level shape:
|
|
|
|
\begin{lstlisting}[language=Python]
|
|
for NUM_MEASURES:
|
|
measure_start()
|
|
for NUM_ITER:
|
|
kernel
|
|
kernel
|
|
...
|
|
kernel
|
|
measure_stop()
|
|
\end{lstlisting}
|
|
|
|
The kernel is unrolled enough times so that the body of the innermost loop has
|
|
at least \lstc{UNROLL_SIZE} instructions; and \lstc{NUM_ITER} is defined so
|
|
that $\texttt{\footnotesize{}UNROLL\_SIZE} \times
|
|
\texttt{\footnotesize{}NUM\_ITER} \geq \texttt{\footnotesize{}TOTAL\_INSN}$.
|
|
\lstc{UNROLL_SIZE}, \lstc{TOTAL_INSN} and \lstc{NUM_MEASURES} are
|
|
parameters of the benchmark generation.
|
|
|
|
As \pipedream{} gets a multiset of instructions as a kernel, these
|
|
instructions' arguments must be instantiated to turn them into actual assembly
|
|
code ---~that is, turn \eg{} \lstxasm{ADD_GPR64i64_IMMi8} (the x86-64 variant
|
|
of \texttt{add} taking as arguments a general purpose regsiter of 64 bits and
|
|
an immediate of 8 bits) into \lstxasm{addq \$0x10, \%rdx}. There is no
|
|
particular issue to instantiate immediates; however, allocating registers and
|
|
memory operands requires some care, as no dependency must be created between
|
|
instructions.
|
|
|
|
\paragraph{Register operands.} For each type of register of the ISA (general
|
|
purpose, vector, \ldots{}), the registers are split into a read and a write
|
|
pool. The read pool only contains the maximum number of registers of this type
|
|
needed by one instruction of the kernel; while the write pool contains the
|
|
rest. The registers are then allocated for each instruction as follows.
|
|
\begin{itemize}
|
|
\item{} The read operands are allocated registers from the read pool
|
|
without specific care, as read-after-read does not create a dependency
|
|
between two instructions.
|
|
\item{} The written operands are allocated registers from the write pool in
|
|
a round-robin fashion, to maximize the distance between two reuses of
|
|
the same register. Indeed, on some architectures, write-after-write
|
|
dependencies do not allow full parallelism; while on some others,
|
|
some instructions' operands are both read and written, resulting in
|
|
read-after-write dependencies.
|
|
\end{itemize}
|
|
|
|
\paragraph{Memory operands.} At startup, a memory arena small enough to fit
|
|
into the L1 cache is allocated. This arena is again split into a read and a
|
|
write pool.
|
|
\begin{itemize}
|
|
\item In direct register addressing mode ---~\eg{} \lstxasm{movq \%rax,
|
|
(\%rbx)} in x86-64~---, the same address is always reused (one for
|
|
reads and one for writes).
|
|
\item In base-index-displacement mode, the same base address is always
|
|
reused; displacements are used in a round-robin fashion\footnote{On
|
|
ARM, a displacement of 0 is always used, resulting in the same
|
|
accesses as direct register addressing.}.
|
|
\end{itemize}
|
|
|
|
\bigskip{}
|
|
|
|
These two allocation policies are meant to ensure that, whenever possible, no
|
|
dependency will be created between two instructions: a dependency should only
|
|
appear when write-after-write dependencies matter, and not enough registers of
|
|
a kind are available in the architecture ---~in this case, the same register
|
|
may be reused too early.
|
|
|
|
To finally ensure that data is always L1-resident, warm-up rounds are performed
|
|
before actually measuring the inner loop. As the memory area is small enough,
|
|
and no other memory access is made during the measure, the memory area is
|
|
subsequently L1-resident. This also has the effect to warm up the branch
|
|
predictor.
|
|
|
|
\medskip{}
|
|
|
|
Finally, the generated assembly is assembled into a shared library object, and
|
|
invoked with a lightweight C harness. Using the \papi{}~\cite{tool:papi}
|
|
library, it measures and records two standard and omnipresent hardware events:
|
|
the number of elapsed cycles (\lstxasm{PAPI_TOT_CYC}) and the number of
|
|
completed instructions (\lstxasm{PAPI_TOT_INS}).
|