Staticdeps: up to evaluation (not yet started)

This commit is contained in:
Théophile Bastian 2023-09-28 17:45:31 +02:00
parent 094985307d
commit 3511d27516
7 changed files with 200 additions and 6 deletions


@ -1,5 +1,5 @@
\chapter{A more systematic approach to throughput prediction performance
analysis: \cesasme{}}\label{chap:CesASMe}
\input{00_intro.tex}
\input{02_measuring_exec_time.tex}


@ -50,9 +50,10 @@ however, other channels.
As we saw in the introduction to this chapter, as well as in the previous
chapter, dependencies can also be \emph{memory-carried}, in more or less
straightforward ways, such as in the examples from
\autoref{lst:mem_carried_exn}, where the last line always depends on the first.
\begin{lstfloat}[h!]
\begin{minipage}[t]{0.32\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
add %rax, (%rbx)
@ -70,6 +71,8 @@ lea 16(%rbx), %r10
add %rax, (%rbx)
add -16(%r10), %rcx\end{lstlisting}
\end{minipage}\hfill
\caption{Examples of memory-carried dependencies.}\label{lst:mem_carried_exn}
\end{lstfloat}
\smallskip{}
@ -90,8 +93,10 @@ with a large emphasis on memory-carried dependencies.
\paragraph{Presence of loops.} The previous examples were all pieces of
\emph{straight-line code} in which a dependency arose. However, many
dependencies are actually \emph{loop-carried}, such as those in
\autoref{lst:loop_carried_exn}.
\begin{lstfloat}
\begin{minipage}[t]{0.48\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
# Compute sum(A), %rax points to A
@ -103,11 +108,13 @@ loop:
\end{minipage}\hfill
\begin{minipage}[t]{0.48\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
# Compute B[i] = A[i] + B[i-2]
loop:
mov -16(%rbx, %r10), %rdx
add (%rax, %r10), %rdx
mov %rdx, (%rbx, %r10)
add $8, %r10
jmp loop
\end{lstlisting}
\end{minipage}\hfill
\caption{Examples of loop-carried dependencies.}\label{lst:loop_carried_exn}
\end{lstfloat}


@ -38,3 +38,43 @@ Valgrind, which will re-compile it to a native binary before running it.
While this intermediate representation, called \vex{}, is convenient to
instrument a binary, it may further be used as a way to obtain \emph{semantics}
for some assembly code, independently of the Valgrind framework.
\subsection{Depsim}\label{ssec:depsim}
The tool we write to extract runtime-gathered dependencies, \depsim{}, is
able to extract dependencies through registers, memory and temporary
variables ---~in its intermediate representation, Valgrind keeps some values
assigned to temporary variables in static single-assignment (SSA) form.
It however supports a flag to detect only memory-carried dependencies, as this
will be useful to evaluate our static algorithm later.
For a dynamic tool, the distinction between straight-line and loop-carried
dependencies is irrelevant, as the analysis follows the actual program flow.
\medskip{}
In order to track dependencies, each basic block of the program is
instrumented. Dependencies are represented as pairs of source and destination
program counters, and stored in a hash table mapping each pair to its number
of encountered occurrences.
Dependencies through temporaries are, by construction, confined to a single
basic block ---~they are thus statically detected at instrumentation time. At
runtime, the occurrence count of those dependencies is updated whenever the
basic block is executed.
For both register- and memory-carried dependencies, each write is instrumented
by adding a runtime write to a \emph{shadow} register file or memory, noting
that the written register or memory address was last written at the current
program counter. Each read, in turn, is instrumented by adding a fetch to this
shadow register file or memory, retrieving the last program counter at which
this location was written to; the dependency count between this program counter
and the current program counter is then incremented.
In practice, the shadow register file is simply implemented as an array
holding, for each register id, the last program counter that wrote to this
location. The shadow memory is instead implemented as a hash table.
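The shadow-state bookkeeping described above can be sketched as follows ---~a minimal Python model over a hypothetical access trace, not \depsim{}'s actual Valgrind instrumentation, where the trace format and function name are illustrative assumptions:

```python
from collections import defaultdict

def track_deps(trace, num_regs=16):
    """Count (writer_pc, reader_pc) dependency occurrences over a toy
    access trace, mirroring the shadow register file / shadow memory
    bookkeeping described above.

    Each trace event is (pc, kind, space, loc):
      kind: "w" (write) or "r" (read)
      space: "reg" or "mem"
      loc: register id or memory address
    """
    shadow_regs = [None] * num_regs   # per register id: last writer pc
    shadow_mem = {}                   # hash table: address -> last writer pc
    deps = defaultdict(int)           # (src_pc, dst_pc) -> occurrence count

    for pc, kind, space, loc in trace:
        if kind == "w":               # note the current pc as last writer
            if space == "reg":
                shadow_regs[loc] = pc
            else:
                shadow_mem[loc] = pc
        else:                         # read: fetch the last writer, count the dep
            src = shadow_regs[loc] if space == "reg" else shadow_mem.get(loc)
            if src is not None:
                deps[(src, pc)] += 1
    return dict(deps)
```

For instance, a block where pc~0 writes a register, pc~1 reads it and stores to memory, and pc~2 loads that address, executed twice, yields the two dependencies each with occurrence count 2.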
At the end of the run, all the retrieved dependencies are reported. Care is
taken to translate the runtime program counters back to addresses in the
original ELF files, using the running process' memory map.


@ -1 +1,76 @@
\section{Static dependency detection}
Depending on the type of dependency considered, static detection is more or
less difficult.
\paragraph{Register-carried dependencies in straight-line code.} This case is
the easiest to statically detect, and is most often supported by code analyzers
---~for instance, \llvmmca{} supports it. The same strategy that was used to
dynamically find dependencies in \autoref{ssec:depsim} can still be used: a
shadow register file simply keeps track of which instruction last wrote each
register.
\paragraph{Register-carried, loop-carried dependencies.} Loop-carried
dependencies can, to some extent, be detected the same way. As the basic block
is always assumed to be the body of an infinite loop, a straight-line analysis
can be performed on a duplicated kernel. This strategy is \eg{} adopted by
\osaca{}~\cite{osaca2} (§II.D).
When dealing only with register accesses, this
strategy is always sufficient: as each iteration executes the same basic
block, any register read in an iteration was last written either within this
same iteration or during the previous one ---~never two or more iterations
earlier.
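The duplicated-kernel strategy can be sketched as follows ---~a Python model over a hypothetical per-instruction summary of written and read registers, not \osaca{}'s actual implementation:

```python
def loop_carried_reg_deps(kernel):
    """Detect loop-carried register dependencies by analyzing two copies
    of the kernel back to back. Each instruction is a hypothetical
    (written_regs, read_regs) pair of sets; the result is a set of
    (src_insn, dst_insn) index pairs in the original kernel."""
    n = len(kernel)
    unrolled = kernel + kernel            # second copy models iteration i+1
    last_writer = {}                      # register -> instruction index
    deps = set()
    for idx, (writes, reads) in enumerate(unrolled):
        for reg in reads:
            if reg in last_writer:
                src = last_writer[reg]
                if src < n <= idx:        # dependency crosses the iteration boundary
                    deps.add((src, idx - n))
        for reg in writes:                # reads happen before writes
            last_writer[reg] = idx
    return deps
```

On a two-instruction accumulation kernel (an add into an accumulator, then a pointer increment), this reports the accumulator and pointer dependencies from each iteration to the next.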
\paragraph{Memory-carried dependencies in straight-line code.} Memory
dependencies, however, are significantly harder to tackle. While basic
heuristics can handle some simple cases, in the general case two main
difficulties arise:
\begin{enumerate}[(i)]
\item{}\label{memcarried_difficulty_alias} pointers may \emph{alias}, \ie{}
point to the same address or array; for instance, if \reg{rax} points
to an array, it may be that \reg{rbx} points to $\reg{rax} + 8$, making
the detection of such a dependency difficult;
\item{}\label{memcarried_difficulty_arith} arbitrary arithmetic operations
may be performed on pointers, possibly through diverting paths: \eg{}
it might be necessary to detect that $\reg{rax} + (16 << 2)$ is
identical to $\reg{rax} + (128 / 2)$; this requires semantics for assembly
instructions and tracking formal expressions across register values
---~and possibly even memory.
\end{enumerate}
Tracking memory-carried dependencies is, to the best of our knowledge, not done
in code analyzers, as our results in \autoref{chap:CesASMe} suggest.
\paragraph{Loop-carried, memory-carried dependencies.} While the strategy
previously used for register-carried dependencies is sufficient to detect
loop-carried dependencies from one occurrence to the next, it is not always
sufficient when the dependencies tracked are memory-carried. For
instance, in the second example from \autoref{lst:loop_carried_exn}, an
instruction depends on another from two iterations earlier.
Dependencies can reach arbitrarily old iterations of a loop: in this example,
\lstxasm{-8192(\%rbx, \%r10)} may be used to reach 1\,024 iterations back.
However, while far-reaching dependencies may \emph{exist}, they are not
necessarily \emph{relevant} from a performance analysis point of view. Indeed,
if an instruction $i_2$ depends on a result previously produced by an
instruction $i_1$, this dependency is only relevant if it is possible that
$i_1$ is not yet completed when $i_2$ is considered for issuing ---~else, the
result is already produced, and $i_2$ need not wait to execute.
The reorder buffer (ROB) of a CPU can be modelled as a sliding window of fixed
size over \uops{}. In particular, if a \uop{} $\mu_1$ is not yet retired, the
ROB may not contain \uops{} more than the ROB's size ahead of $\mu_1$. The
same holds for instructions, as the vast majority of instructions
decode to at least one \uop{}\footnote{Some \texttt{mov} instructions from
register to register may, for instance, only have an impact on the renamer;
no \uops{} are dispatched to the backend.}.
A possible solution to detect loop-carried dependencies in a kernel $\kerK$ is
thus to unroll it until it contains about $\card{\text{ROB}} +
\card{\kerK}$ instructions. This ensures that every instruction in the last
kernel copy can find dependencies reaching up to $\card{\text{ROB}}$
instructions back.
On Intel CPUs, the reorder buffer could hold 224 \uops{} on Skylake (2015), and
512 \uops{} on Golden Cove (2021)~\cite{wikichip_intel_rob_size}. These
sizes are small enough to reasonably use this solution without excessive
slowdown.
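One way to turn the bound above into an unroll factor ---~a sketch that approximates one \uop{} per instruction; the exact factor is an assumption here, not a value from the text:

```python
import math

def unroll_factor(kernel_len, rob_size):
    """Number of kernel copies needed so the unrolled code contains at
    least rob_size + kernel_len instructions (assuming ~1 uop per
    instruction)."""
    return math.ceil((rob_size + kernel_len) / kernel_len)

# Skylake's 224-entry ROB with a 5-instruction kernel:
print(unroll_factor(5, 224))   # → 46
```

Even for Golden Cove's 512-entry ROB, a 5-instruction kernel only needs about a hundred copies, which keeps the analysis cheap.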


@ -1 +1,64 @@
\section{The \staticdeps{} heuristic}
The static analyzer we present, \staticdeps{}, only aims to tackle the
difficulty~\ref{memcarried_difficulty_arith} mentioned above: tracking
dependencies across arbitrarily complex pointer arithmetic.
To do so, \staticdeps{} works at the basic-block level, unrolled enough times
to fill the reorder buffer as detailed above; this way, arbitrarily
long-reaching relevant loop-carried dependencies can be detected.
This problem could be solved using symbolic computation. However, such
algorithms are not straightforward to implement, and the equality test between
two arbitrary expressions can be costly.
\medskip{}
Instead, we use a heuristic based on random values. We consider the set $\calR
= \left\{0, 1, \ldots, 2^{64}-1\right\}$ of values representable by a 64-bit
unsigned integer; we extend this set to $\bar\calR = \calR \cup \{\bot\}$,
where $\bot$ denotes an invalid value. We then proceed as previously for
register-carried dependencies, applying the following principles.
\smallskip{}
\begin{itemize}
\item{} Whenever an unknown value is read, either from a register or from
memory, generate a fresh value from $\calR$, uniformly sampled at
random. This value is saved to a shadow register file or memory, and
will be used again the next time this same data is accessed.
\item{} Whenever an integer arithmetic operation is encountered, compute
the result of the operation and save the result to the shadow register
file or memory.
\item{} Whenever any other kind of operation, or an unsupported one, is
encountered, save the destination operand as $\bot$;
such an operation is assumed not to be valid pointer arithmetic.
Operations on $\bot$ always yield $\bot$ as a result.
\item{} Whenever writing to a memory location, compute the written address
using the above principles, and proceed as with a dynamic analysis,
keeping track of the instruction that last wrote to a memory address.
\item{} Whenever reading from a memory location, compute the read address
using the above principles, and generate a dependency from the current
instruction to the instruction that last wrote to this address (if
known).
\end{itemize}
The semantics needed to compute encountered operations are obtained by lifting
the kernel's assembly to \valgrind{}'s \vex{} intermediate representation.
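The principles above can be sketched on a toy expression language ---~a Python model whose expression IR is a hypothetical stand-in, far simpler than \vex{}:

```python
import random

BOT = None  # "bottom": not valid pointer arithmetic

def eval_addr(expr, env):
    """Evaluate a toy address expression under random 64-bit register values.
    expr is an int literal, a register name (str), or (op, lhs, rhs) with
    op in {"+", "-", "*", "<<", ">>", "/"}; any other operation yields BOT."""
    if isinstance(expr, int):
        return expr % 2**64
    if isinstance(expr, str):           # register: sample once, reuse afterwards
        if expr not in env:
            env[expr] = random.randrange(2**64)
        return env[expr]
    op, lhs, rhs = expr
    a, b = eval_addr(lhs, env), eval_addr(rhs, env)
    if a is BOT or b is BOT:            # bottom is absorbing
        return BOT
    ops = {"+": lambda: a + b, "-": lambda: a - b, "*": lambda: a * b,
           "<<": lambda: a << b, ">>": lambda: a >> b, "/": lambda: a // b}
    return ops[op]() % 2**64 if op in ops else BOT

env = {}
x = eval_addr(("+", "rax", ("<<", 16, 2)), env)   # %rax + (16 << 2)
y = eval_addr(("+", "rax", ("/", 128, 2)), env)   # %rax + (128 / 2)
z = eval_addr(("+", "rbx", 64), env)              # different base register
```

Here `x == y` always holds, while `x == z` only happens with probability $2^{-64}$: equal evaluations are taken as evidence that two addresses are the same.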
\medskip{}
This first analysis provides us with a raw list of dependencies across
iterations of the considered basic block. We then ``re-roll'' the unrolled
kernel by transcribing each dependency to a triplet $(\texttt{source\_insn},
\texttt{dest\_insn}, \Delta{}k)$, where the first two elements are the source
and destination instruction of the dependency \emph{in the original,
non-unrolled kernel}, and $\Delta{}k$ is the number of iterations of the kernel
between the source and destination instruction of the dependency.
Finally, we filter out spurious dependencies: each dependency found should
occur for each kernel iteration $i$ at which $i + \Delta{}k$ is within bounds.
If the dependency is found for less than $80\,\%$ of those iterations, the
dependency is declared spurious and is dropped.
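The re-rolling and filtering steps can be sketched as follows ---~a Python model assuming raw dependencies come as instruction indices in the unrolled kernel; the function name and input format are illustrative:

```python
from collections import Counter

def reroll_and_filter(raw_deps, kernel_len, n_unroll, threshold=0.8):
    """Map unrolled-kernel dependencies (src_idx, dst_idx) back to
    (src_insn, dst_insn, delta_k) triplets in the original kernel, then
    drop triplets observed in fewer than `threshold` of the iterations
    where they could occur."""
    counts = Counter()
    for src, dst in raw_deps:
        delta_k = dst // kernel_len - src // kernel_len
        counts[(src % kernel_len, dst % kernel_len, delta_k)] += 1
    kept = set()
    for (src, dst, dk), count in counts.items():
        possible = n_unroll - dk      # iterations i with i + delta_k in bounds
        if possible > 0 and count / possible >= threshold:
            kept.add((src, dst, dk))
    return kept
```

For a 2-instruction kernel unrolled 4 times, a dependency present at all 3 possible iteration boundaries is kept, while one seen in a single place out of 4 is declared spurious and dropped.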


@ -133,3 +133,10 @@
howpublished={\url{https://www.arm.com/company/news/2023/09/building-the-future-of-computing-on-arm}},
}
@misc{wikichip_intel_rob_size,
title={Intel Details Golden Cove: Next-Generation Big Core For Client and Server SoCs},
author={{WikiChip}},
year=2021,
month=08,
howpublished={\url{https://fuse.wikichip.org/news/6111/intel-details-golden-cove-next-generation-big-core-for-client-and-server-socs/}}
}


@ -21,6 +21,8 @@
\newcommand{\mucount}{\#_{\mu}}
\newcommand{\ceil}[1]{\left\lceil{} #1 \right\rceil{}}
\newcommand{\card}[1]{\left| #1 \right|}
% Names
\newcommand{\fgruber}{Fabian \textsc{Gruber}}