\section{Related works}

\paragraph{Another comparative study: \anica{}.} The \anica{} framework~\cite{anica} also comparatively evaluates throughput predictors, by searching for examples on which they are inaccurate. \anica{} starts with randomly generated assembly snippets, which it feeds to the code analyzers under study. Once it finds a snippet on which some analyzers yield unsatisfying results, it generalizes this snippet through a process derived from abstract interpretation, reaching a broader category of inputs on which the analyzers disagree, \eg{} ``a load to a SIMD register followed by a SIMD arithmetic operation''. A sketch of this generalization loop is given at the end of this section.

\paragraph{A dynamic code analyzer: \gus{}.} So far, this manuscript has mostly been concerned with static code analyzers; throughput prediction tools, however, are not all static. \gus{} is a dynamic tool, first introduced in \fgruber{}'s PhD thesis~\cite{phd:gruber}. It leverages \qemu{}'s instrumentation capabilities to dynamically predict the throughput of user-defined regions of interest in a whole program. In these regions, it instruments every instruction, memory access, \ldots{} in order to retrieve the exact events occurring throughout the program's execution. \gus{} then leverages throughput, latency and microarchitectural models to analyze resource usage and produce an accurate theoretical prediction of the elapsed cycles.

Its main strength, however, resides in its \emph{sensitivity analysis} capabilities: by applying an arbitrary scaling factor to some parts of the model (\eg{} latencies, arithmetic ports, \ldots{}), it is possible to investigate the impact of a specific resource on the final execution time of a region of interest. In particular, \gus{} can accurately determine whether a resource is actually a bottleneck for a region, \ie{} whether increasing this resource's capabilities would reduce the execution time; a toy version of this analysis is sketched at the end of this section. The output of \gus{} on a region of interest provides very detailed insight into each instruction's resource consumption and its contribution to the final execution time. As a dynamic analysis tool, it is also able to extract the dependencies an instruction exhibits on a real run.

The main downside of \gus{}, however, is its slowness. Like most dynamic tools, it suffers from a heavy slowdown compared to a native execution of the binary, oftentimes about $100\times$ slower. While it remains a precious tool for a user willing to deeply optimize an execution kernel, this overhead makes \gus{} highly impractical to run on a large collection of execution kernels.

\paragraph{An isolated basic-block profiler: \bhive{}.} In \autoref{sec:redefine_exec_time} above, we advocated for measuring a basic block's execution time \emph{in-context}. The \bhive{} profiler~\cite{bhive}, initially written by \ithemal{}'s authors~\cite{ithemal} to provide their model with sufficient ---~and sufficiently accurate~--- training data, takes an orthogonal approach to basic-block throughput measurement. By mapping memory at any address accessed by a basic block, it can effectively run and measure arbitrary code out of context, often ---~but not always, as we discuss later~--- yielding good results. A sketch of this memory-mapping trick concludes this section.
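
To give a concrete flavour of \anica{}'s approach, the following C sketch implements a greedy variant of such a generalization loop. It is only a toy reconstruction: the flattened snippet representation, the abstract value \texttt{ANY} and the hard-coded disagreement oracle are all invented for illustration, and bear no relation to \anica{}'s actual abstract domains.

\begin{verbatim}
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define NFEAT 4

/* A two-instruction snippet, flattened into features:
 * {opcode 1, operand 1, opcode 2, operand 2}. */
static const char *concrete[NFEAT] = {"vmovupd", "ymm0", "vaddpd", "ymm1"};

/* Hypothetical stand-in for actually running the analyzers under study:
 * here, they are assumed to diverge on every snippet made of a SIMD load
 * followed by a SIMD addition, whatever the registers involved. */
static bool analyzers_disagree(const char *snip[NFEAT]) {
    return strcmp(snip[0], "vmovupd") == 0 && strcmp(snip[2], "vaddpd") == 0;
}

int main(void) {
    const char *gen[NFEAT];
    memcpy(gen, concrete, sizeof gen);

    /* Greedy generalization: lift each feature to the abstract value ANY,
     * keeping the lift only if the analyzers still disagree. */
    for (int i = 0; i < NFEAT; i++) {
        const char *saved = gen[i];
        gen[i] = "ANY";
        if (!analyzers_disagree(gen))
            gen[i] = saved;            /* over-generalized: roll back */
    }

    printf("generalized snippet:");
    for (int i = 0; i < NFEAT; i++)
        printf(" %s", gen[i]);
    printf("\n");   /* prints: vmovupd ANY vaddpd ANY */
    return 0;
}
\end{verbatim}

Starting from the concrete snippet, the loop lifts exactly the two register operands, yielding the general pattern ``a SIMD load followed by a SIMD addition, whatever the operands''.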
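
The sensitivity analysis of \gus{} mentioned above can likewise be illustrated on a toy resource model. In the sketch below, the resource names and per-iteration pressure values are made up; the predicted cycles per iteration is simply the highest pressure on any resource, and a resource is flagged as a bottleneck whenever doubling its capacity lowers this prediction. \gus{}'s actual models are, of course, far more detailed.

\begin{verbatim}
#include <stdio.h>

#define NRES 3
static const char *res_name[NRES] = {"FP ports", "load ports", "dep. chain"};

/* Hypothetical per-iteration pressure (in cycles) that some kernel puts
 * on each abstract resource; the numbers are invented for illustration. */
static const double pressure[NRES] = {6.0, 8.0, 5.0};

/* Toy throughput model: in steady state, the most pressured resource
 * dictates the number of cycles per iteration. */
static double predict(const double capacity[NRES]) {
    double cycles = 0.0;
    for (int r = 0; r < NRES; r++) {
        double c = pressure[r] / capacity[r];
        if (c > cycles) cycles = c;
    }
    return cycles;
}

int main(void) {
    const double unit[NRES] = {1.0, 1.0, 1.0};
    double base = predict(unit);
    printf("baseline: %.2f cycles/iteration\n", base);

    /* Sensitivity analysis: double each resource's capacity in turn and
     * observe the effect; only a bottleneck makes the prediction drop. */
    for (int r = 0; r < NRES; r++) {
        double cap[NRES] = {1.0, 1.0, 1.0};
        cap[r] = 2.0;
        double t = predict(cap);
        printf("%-10s x2 -> %.2f cycles/iteration (%s)\n", res_name[r], t,
               t < base ? "bottleneck" : "not a bottleneck");
    }
    return 0;
}
\end{verbatim}

With these numbers, only doubling the load ports lowers the prediction (from $8$ to $6$ cycles per iteration), so the load ports are the sole bottleneck of this hypothetical kernel.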
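
Finally, the memory-mapping trick at the heart of \bhive{} can be sketched as follows, assuming an x86-64 Linux host: a \texttt{SIGSEGV} handler maps a fresh page at whatever address the measured code faults on, then lets the faulting instruction be retried. The real \bhive{} is considerably more careful ---~notably, it runs the block in a separate, dedicated process~--- so this sketch only shows the principle.

\begin{verbatim}
#define _GNU_SOURCE   /* for MAP_ANONYMOUS */
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

static long page_size;

/* On a segmentation fault, map a zero-filled page at the faulting
 * address and return: the kernel retries the interrupted load/store,
 * which now succeeds. */
static void on_segv(int sig, siginfo_t *info, void *ctx) {
    (void)sig; (void)ctx;
    void *base = (void *)((uintptr_t)info->si_addr
                          & ~(uintptr_t)(page_size - 1));
    if (mmap(base, (size_t)page_size, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) == MAP_FAILED)
        _exit(1);   /* address not mappable: reject this basic block */
}

int main(void) {
    page_size = sysconf(_SC_PAGESIZE);

    struct sigaction sa = {0};
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    /* Stand-in for an arbitrary basic block run out of context: it
     * dereferences a pointer holding whatever value the registers
     * happened to contain -- here, an arbitrary unmapped address. */
    volatile int *p = (volatile int *)0x700000000000ull;
    *p = 42;        /* faults once, then succeeds after the mmap */
    printf("survived out-of-context store, read back %d\n", *p);
    return 0;
}
\end{verbatim}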