\section{Related works}
\paragraph{Another comparative study: \anica{}.} The \anica{}
framework~\cite{anica} also attempts to comparatively evaluate various throughput predictors by
finding examples on which they are inaccurate. \anica{} starts with randomly
generated assembly snippets fed to various code analyzers. Once it finds a
snippet on which (some) code analyzers yield unsatisfactory results, it refines
it through a process derived from abstract interpretation to reach a
more general category of input, \eg{} ``a load to a SIMD register followed by a
SIMD arithmetic operation''.
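\anica{}'s actual abstraction process is considerably richer than this, but its overall loop ---~search for a snippet on which predictors disagree, then greedily generalize it~--- can be caricatured in a few lines of Python. Everything below (the mock predictors, the instruction names, the disagreement threshold) is invented for illustration and is in no way \anica{}'s implementation:

```python
import random

# Mock predictors: predictor_a overestimates snippets pairing a SIMD load
# with a SIMD arithmetic op -- the kind of pattern AnICA surfaces.
def predictor_a(snippet):
    return 2.0 if "simd_load" in snippet and "simd_add" in snippet else 1.0

def predictor_b(snippet):
    return 1.0

def interesting(snippet):
    """A snippet is 'interesting' when the predictors disagree widely."""
    a, b = predictor_a(snippet), predictor_b(snippet)
    return max(a, b) / min(a, b) >= 1.5

def generalize(snippet):
    """Greedily drop every instruction whose removal keeps the snippet
    interesting, isolating the core pattern behind the disagreement."""
    core = list(snippet)
    i = 0
    while i < len(core):
        trial = core[:i] + core[i + 1:]
        if trial and interesting(trial):
            core = trial  # still interesting: keep the smaller snippet
        else:
            i += 1        # this instruction is essential, keep it
    return core

INSTRS = ["simd_load", "simd_add", "scalar_add", "store", "nop"]
random.seed(0)
snippet = []
while not interesting(snippet):  # random generation phase
    snippet = [random.choice(INSTRS) for _ in range(6)]
print(sorted(generalize(snippet)))  # → ['simd_add', 'simd_load']
```

Here, the greedy generalization pass keeps only the instructions essential to the disagreement, mirroring how \anica{} widens a concrete snippet into a category such as ``a load to a SIMD register followed by a SIMD arithmetic operation''.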
\paragraph{A dynamic code analyzer: \gus{}.}
So far, this manuscript has mostly been concerned with static code analyzers.
Throughput prediction tools, however, are not all static.
\gus{} is a dynamic tool first introduced in \fgruber{}'s PhD
thesis~\cite{phd:gruber}. It leverages \qemu{}'s instrumentation capabilities to
dynamically predict the throughput of user-defined regions of interest in whole
programs.
In these regions, it instruments every instruction, memory access, \ldots{} in
order to retrieve the exact events occurring throughout the program's
execution. \gus{} then leverages throughput, latency and microarchitectural
models to analyze resource usage and produce an accurate theoretical prediction
of the elapsed cycle count.
Its main strength, however, resides in its \emph{sensitivity analysis}
capabilities: by applying an arbitrary factor to some parts of the model (\eg{}
latencies, arithmetic ports, \ldots{}), it is possible to investigate the
impact of a specific resource on the final execution time of a region of
interest. It can also accurately determine if a resource is actually a
bottleneck for a region, \ie{} if increasing this resource's capabilities would
reduce the execution time. The output of \gus{} on a region of interest
provides a very detailed insight into each instruction's resource consumption and
its contribution to the final execution time. As a dynamic analysis tool, it
is also able to extract the dependencies an instruction exhibits on a real run.
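The principle of such a sensitivity analysis can be sketched with a deliberately naive bottleneck model: predict the execution time as the maximum cost over resources, scale one resource at a time, and flag it as a bottleneck whenever strengthening it lowers the prediction. The model and every figure below are invented for illustration and bear no relation to \gus{}'s actual microarchitectural models:

```python
# Toy bottleneck model: the predicted cycle count is the maximum cost over
# resources (per-port load, latency of the critical dependency chain).
# Every name and figure below is invented for illustration.

def predicted_cycles(port_load, chain_latency, factors):
    """Predict cycles, scaling each resource's cost by `factors` (default 1.0)."""
    costs = {p: c * factors.get(p, 1.0) for p, c in port_load.items()}
    costs["latency"] = chain_latency * factors.get("latency", 1.0)
    return max(costs.values())

port_load = {"p0": 8, "p5": 12}   # cycles of work queued on each port
chain_latency = 10                # critical dependency chain, in cycles
baseline = predicted_cycles(port_load, chain_latency, {})

# Sensitivity analysis: halve each resource's cost in turn; a resource is a
# bottleneck iff strengthening it reduces the predicted execution time.
# Here only p5 qualifies: max(8, 12, 10) is set by p5's 12 cycles.
for res in ("p0", "p5", "latency"):
    scaled = predicted_cycles(port_load, chain_latency, {res: 0.5})
    print(res, "is a bottleneck" if scaled < baseline else "is not a bottleneck")
```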
The main downside of \gus{}, however, is its slowness. Like most dynamic tools,
it suffers from a heavy slowdown compared to a native execution of the binary,
oftentimes about $100\times$ slower. While it remains a precious tool for the
user willing to deeply optimize an execution kernel, this slowdown makes \gus{} highly
impractical to run on a large collection of execution kernels.
\paragraph{An isolated basic-block profiler: \bhive{}.} In
\autoref{sec:redefine_exec_time} above, we advocated for measuring a basic
block's execution time \emph{in-context}. The \bhive{} profiler~\cite{bhive},
initially written by \ithemal{}'s authors~\cite{ithemal} to provide their model
with sufficient ---~and sufficiently accurate~--- training data, takes an
orthogonal approach to basic block throughput measurement. By mapping memory at
any address accessed by a basic block, it can effectively run and measure
arbitrary code without context, often ---~but not always, as we discuss
later~--- yielding good results.
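The memory-mapping trick can be caricatured in Python: treat the first access to any page as a fault and transparently back that page with zeroes, so that an arbitrary basic block runs to completion regardless of the addresses it touches. The \texttt{FaultingMemory} class below is purely illustrative; \bhive{} itself relies on OS-level memory mapping and hardware fault handling:

```python
PAGE_SIZE = 4096  # illustrative page granularity

class FaultingMemory:
    """Toy memory that maps a fresh zeroed page on first access at *any*
    address, instead of rejecting out-of-context accesses."""

    def __init__(self):
        self.pages = {}  # page base address -> bytearray backing store

    def _page(self, addr):
        base = addr - addr % PAGE_SIZE
        if base not in self.pages:
            # "Fault handler": back the page with zeroes on first touch.
            self.pages[base] = bytearray(PAGE_SIZE)
        return self.pages[base], addr % PAGE_SIZE

    def load(self, addr):
        page, off = self._page(addr)
        return page[off]

    def store(self, addr, value):
        page, off = self._page(addr)
        page[off] = value & 0xFF

mem = FaultingMemory()
# A context-free "basic block" touching an arbitrary, never-mapped address:
mem.store(0xDEADBEEF, 42)
print(mem.load(0xDEADBEEF), len(mem.pages))  # → 42 1
```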