
\section{Measuring a kernel's throughput: \pipedream{}}

\todo{Introduce pipedream notation for instructions}

To build a mapping of a CPU, Palmed fundamentally depends on the ability to
measure the execution time $\cyc{\kerK}$ of a kernel $\kerK$. However, as we
saw above, Palmed defines a kernel as a multiset of instructions, and makes
hypotheses on the measurements accordingly. Specifically, a measurement should
reflect only the out-of-order behaviour of the CPU, without any dependency
between instructions. It should also be free of memory effects, which requires
the data to be L1-resident. Finally, even though vendor-specific hardware
counters could provide such a metric, Palmed should not rely on them, as it
aims to be as universal as possible.

For this purpose, we use \pipedream{} as a benchmarking backend, a tool
initially written by \fgruber{} and described in his PhD
thesis~\cite{phd:gruber}, with further developments from \nderumig{} and
\cguillon{}. Originally written for a broader purpose, \pipedream{} is able to
generate assembly code to benchmark a multiset of instructions, fitting the
constraints mentioned above. The generated assembly uses the following
high-level shape:

\begin{lstlisting}[language=Python]
for _ in range(NUM_MEASURES):
    measure_start()
    for _ in range(NUM_ITER):
        kernel  # unrolled copies of the kernel
        kernel
        ...
        kernel
    measure_stop()
\end{lstlisting}

The kernel is unrolled enough times so that the body of the innermost loop has
at least \lstc{UNROLL_SIZE} instructions; and \lstc{NUM_ITER} is defined so
that $\texttt{\footnotesize{}UNROLL\_SIZE} \times
\texttt{\footnotesize{}NUM\_ITER} \geq \texttt{\footnotesize{}TOTAL\_INSN}$.
\lstc{UNROLL_SIZE}, \lstc{TOTAL_INSN} and \lstc{NUM_MEASURES} are
parameters of the benchmark generation.
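
As a worked example, with purely illustrative values that are not taken from
\pipedream{}'s defaults, consider a kernel of $3$ instructions, with
\lstc{UNROLL_SIZE} set to $100$ and \lstc{TOTAL_INSN} set to $10^6$. Assuming
that the smallest values satisfying the constraints above are picked, and
writing $u$ for the number of copies of the kernel in the loop body and $n$ for
\lstc{NUM_ITER} (two notations introduced only for this example), we get
\[
    u = \left\lceil \frac{100}{3} \right\rceil = 34,
    \qquad
    n = \left\lceil \frac{10^6}{100} \right\rceil = 10^4.
\]
The loop body then holds $3u = 102$ instructions, and each measure executes
$102 \times 10^4 = 1.02 \times 10^6$ kernel instructions, slightly more than
\lstc{TOTAL_INSN}.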

As \pipedream{} takes a multiset of instructions as a kernel, the arguments of
these instructions must be instantiated to turn them into actual assembly
code; that is, to turn \eg{} \lstxasm{ADD_GPR64i64_IMMi8} (the x86-64 variant
of \texttt{add} taking as arguments a 64-bit general-purpose register and
an 8-bit immediate) into \lstxasm{addq \$0x10, \%rdx}. Instantiating
immediates raises no particular issue; however, allocating registers and
memory operands requires some care, as no dependency must be created between
instructions.

\paragraph{Register operands.} For each type of register of the ISA
(general-purpose, vector, \ldots{}), the registers are split into a read pool
and a write pool. The read pool contains only as many registers as the most
demanding instruction of the kernel needs for this type, while the write pool
contains the rest. The registers are then allocated for each instruction as
follows.
\begin{itemize}
    \item{} The read operands are allocated registers from the read pool
        without specific care, as read-after-read does not create a dependency
        between two instructions.
    \item{} The written operands are allocated registers from the write pool in
        a round-robin fashion, to maximize the distance between two reuses of
        the same register. Indeed, on some architectures, write-after-write
        dependencies prevent full parallelism; on others, some operands are
        both read and written by an instruction, so that reusing a register
        too early creates a read-after-write dependency.
\end{itemize}
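
The allocation policy described above can be sketched as follows. This is a
minimal illustration and not \pipedream{}'s actual code: the function names
\texttt{split\_pools} and \texttt{allocate}, the encoding of an instruction as
a \texttt{(name, n\_reads, n\_writes)} tuple and the register names are all
assumptions made for the example. For simplicity, the sketch also treats read
and written operands as disjoint.

\begin{lstlisting}[language=Python]
from itertools import cycle


def split_pools(registers, max_reads):
    """Split one register class into a read pool and a write pool.

    The read pool holds `max_reads` registers; the write pool holds the rest.
    """
    if max_reads >= len(registers):
        raise ValueError("no register left for the write pool")
    return registers[:max_reads], registers[max_reads:]


def allocate(kernel, registers):
    """Allocate registers of a single class for every instruction.

    `kernel` is a list of (name, n_reads, n_writes) tuples: a hypothetical
    encoding chosen for this sketch only.
    """
    # The read pool is sized for the most demanding instruction.
    max_reads = max((n_reads for _, n_reads, _ in kernel), default=0)
    read_pool, write_pool = split_pools(registers, max_reads)

    # Written operands walk the write pool in a round-robin fashion.
    next_write = cycle(write_pool)

    allocated = []
    for name, n_reads, n_writes in kernel:
        # Read-after-read is harmless: always reuse the same registers.
        reads = read_pool[:n_reads]
        writes = [next(next_write) for _ in range(n_writes)]
        allocated.append((name, reads, writes))
    return allocated


if __name__ == "__main__":
    demo_registers = ["r8", "r9", "r10", "r11", "r12"]
    demo_kernel = [("insn_a", 1, 1), ("insn_b", 2, 1), ("insn_c", 1, 1)]
    for name, reads, writes in allocate(demo_kernel, demo_registers):
        print(name, "reads", reads, "writes", writes)
\end{lstlisting}

With a write pool of $k$ registers, a written register is reused only after
$k - 1$ other writes, which is the largest reuse distance this register class
allows.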

\paragraph{Memory operands.} At startup, a memory arena small enough to fit
into the L1 cache is allocated. This arena is again split into a read and a
write pool.
\begin{itemize}
    \item In register indirect addressing mode (\eg{} \lstxasm{movq \%rax,
        (\%rbx)} in x86-64), the same address is always reused (one for
        reads and one for writes).
    \item In base-index-displacement mode, the same base address is always
        reused; displacements are used in a round-robin fashion\footnote{On
            ARM, a displacement of 0 is always used, resulting in the same
        accesses as register indirect addressing.}.
\end{itemize}
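
The round-robin choice of displacements can be sketched as below. The text
above does not specify how the displacements are enumerated, so this sketch
assumes that they step through a pool by the access width and wrap around at
the end of the pool. The helper \texttt{make\_displacement\_allocator}, the
4\,KiB arena size and the 8-byte access width are hypothetical.

\begin{lstlisting}[language=Python]
from itertools import cycle


def make_displacement_allocator(pool_offset, pool_size, access_size):
    """Return a function yielding the next displacement inside one pool.

    pool_offset: offset of the pool from the base address, in bytes
    pool_size:   size of the pool, in bytes
    access_size: width of one memory access, in bytes
    """
    slots = cycle(range(pool_offset, pool_offset + pool_size, access_size))
    return lambda: next(slots)


# Hypothetical arena: first half for reads, second half for writes.
ARENA_SIZE = 4096
next_read_disp = make_displacement_allocator(0, ARENA_SIZE // 2, 8)
next_write_disp = make_displacement_allocator(ARENA_SIZE // 2, ARENA_SIZE // 2, 8)
\end{lstlisting}

Keeping reads and writes in disjoint halves of the arena means that, in this
sketch, a load never targets an address written by a store of the kernel.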

\bigskip{}

These two allocation policies are meant to ensure that, whenever possible, no
dependency will be created between two instructions: a dependency should only
appear when write-after-write dependencies matter, and not enough registers of
a kind are available in the architecture ---~in this case, the same register
may be reused too early.

To ensure that data is always L1-resident, warm-up rounds are performed
before actually measuring the inner loop. As the memory arena is small enough,
and no other memory access is made during the measurement, the arena remains
L1-resident afterwards. This also has the effect of warming up the branch
predictor.

\medskip{}

The generated assembly is then assembled into a shared library, and invoked by
a lightweight C harness. Using the \papi{}~\cite{tool:papi} library, this
harness measures and records two standard, widely available hardware events:
the number of elapsed cycles (\lstxasm{PAPI_TOT_CYC}) and the number of
completed instructions (\lstxasm{PAPI_TOT_INS}).
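
A natural way to recover $\cyc{\kerK}$ from these two counters, given here as
an assumption rather than as the exact post-processing performed by
\pipedream{} or Palmed, is to divide the cycle count by the number of kernel
repetitions actually executed. Writing $C$ for the measured cycles, $I$ for
the measured instructions and $|\kerK|$ for the number of instructions of the
kernel (three notations introduced only for this illustration), and neglecting
the few loop-control instructions, this gives
\[
    \cyc{\kerK} \approx \frac{C}{I / |\kerK|} = \frac{C \times |\kerK|}{I}.
\]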