CesASMe: brutal paper import. Not compiling yet.
This commit is contained in:
parent 0b089085e0
commit fc9182428d
14 changed files with 1143 additions and 0 deletions
141  manuscrit/50_CesASMe/00_intro.tex  Normal file
@@ -0,0 +1,141 @@
\begin{abstract}
A variety of code analyzers, such as \iaca, \uica, \llvmmca{} or
\ithemal{}, strive to statically predict the throughput of a computation
kernel. Each analyzer is based on its own simplified CPU model,
reasoning at the scale of an isolated basic block.
Given this diversity, evaluating their strengths and
weaknesses is important to guide both their usage and their enhancement.

We argue that reasoning at the scale of a single basic block is not
always sufficient and that a lack of context can mislead analyses. We present
\tool, a fully-tooled solution to evaluate code analyzers on C-level
benchmarks. It is composed of a benchmark derivation procedure that feeds an
evaluation harness. We use it to evaluate state-of-the-art code analyzers and
to provide insights on their precision. We use \tool's results to show that
memory-carried data dependencies are a major source of imprecision for these
tools.
\end{abstract}

\section{Introduction}\label{sec:intro}

At a time when software is expected to perform more computations, faster and in
more constrained environments, tools that statically predict the resources (and
in particular the CPU resources) it consumes are very useful to guide its
optimization. This need is reflected in the diversity of binary or assembly
code analyzers following the deprecation of \iaca~\cite{iaca}, which Intel has
maintained through 2019. Whether it is \llvmmca{}~\cite{llvm-mca},
\uica{}~\cite{uica}, \ithemal~\cite{ithemal} or \gus~\cite{phd:gruber}, all
these tools strive to extract various performance metrics, including the
number of CPU cycles a computation kernel will take ---~which roughly
translates to execution time.
In addition to raw measurements (relying on hardware counters), these
model-based analyses provide higher-level and refined data, to expose the
bottlenecks and guide the optimization of a given code. This feedback is
useful to experts optimizing computation kernels, including scientific
simulations and deep-learning kernels.

An exact throughput prediction would require a cycle-accurate simulator of the
processor, based on microarchitectural data that is most often not publicly
available, and would be prohibitively slow in any case. These tools thus each
solve in their own way the challenge of modeling complex CPUs while remaining
simple enough to yield a prediction in a reasonable time, ending up with
different models. For instance, on the following x86-64 basic block computing a
general matrix multiplication,
\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
movsd (%rcx, %rax), %xmm0
mulsd %xmm1, %xmm0
addsd (%rdx, %rax), %xmm0
movsd %xmm0, (%rdx, %rax)
addq $8, %rax
cmpq $0x2260, %rax
jne 0x16e0
\end{lstlisting}
\end{minipage}

\noindent\llvmmca{} predicts 1.5 cycles, \iaca{} and \ithemal{} predict 2
cycles, while \uica{} predicts 3 cycles. One may wonder which tool is correct.


The obvious solution to assess their predictions is to compare them to an
actual measurement. However, as these tools reason at the basic block level,
such a measurement is not as trivially defined as it would seem. Take for
instance the following kernel:

\begin{minipage}{0.90\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
mov (%rax, %rcx, 1), %r10
mov %r10, (%rbx, %rcx, 1)
add $8, %rcx
\end{lstlisting}
\end{minipage}

\input{overview}

\noindent{}At first, it looks like an array copy from location \reg{rax} to
\reg{rbx}. Yet, if, before the loop, \reg{rbx} is initialized to
\reg{rax}\texttt{+8}, there is a read-after-write dependency between the first
instruction and the second instruction of the previous iteration, which makes
the throughput drop significantly. As we shall see in
Section~\ref{ssec:bhive_errors}, \emph{without proper context, a basic
block's throughput is not well-defined}.

To recover the context of each basic block, we reason instead at the scale of
a whole C source code. This
makes the measurements unambiguous: one can use hardware counters to measure
the elapsed cycles during a loop nest. This requires a suite of benchmarks, in
C, that is both representative of the domain studied and wide enough to provide
good coverage of that domain. However, this is not in itself sufficient to
evaluate static tools: on the preceding matrix multiplication kernel, counters
report 80,059 elapsed cycles ---~for the total loop. This number can hardly be
compared to the basic block-level predictions of \llvmmca{}, \iaca{},
\ithemal{} and \uica{} seen above.

A common practice to make these numbers comparable is to renormalize them to
instructions per cycle (IPC). Here, \llvmmca{} reports an IPC of
$\frac{7}{1.5}~=~4.67$, \iaca{} and \ithemal{} report an IPC of
$\frac{7}{2}~=~3.5$, and \uica{} reports an IPC of $\frac{7}{3}~=~2.33$. In
this case, the measured IPC is 3.45, which is closest to \iaca{} and \ithemal.
Yet, IPC is a metric of microarchitectural load, and \textit{tells nothing
about a kernel's efficiency}. Indeed, the static number of instructions is
affected by many compiler passes, such as scalar evolution, strength
reduction, register allocation or instruction selection\ldots{} Thus, when
comparing two compiled versions of the same code, IPC alone does not
necessarily point to the most efficient version. For instance, a kernel using
SIMD instructions will use fewer instructions than one using only scalars, and
thus exhibit a lower or equal IPC; yet, its performance will unquestionably
increase.

The total number of cycles elapsed to solve a given problem, on the other
hand, is a sound metric of the efficiency of an implementation. We thus
instead \emph{lift} the basic block-level predictions to a total number of
cycles. In simple cases, this simply means multiplying the block-level
prediction by the number of loop iterations; however, this trip count might
not be known in general. More importantly, the compiler may apply any number
of transformations: unrolling, for instance, changes this number. Control flow
may also be complicated by code versioning.

%In the general case, instrumenting the generated code to obtain the number of
%occurrences of the basic block yields accurate results.

\bigskip

In this article, we present a fully-tooled solution to evaluate and compare the
diversity of static throughput predictors. Our tool, \tool, solves two main
issues in this direction. In Section~\ref{sec:bench_gen}, we describe how
\tool{} generates a wide variety of computation kernels stressing different
parameters of the architecture, and thus of the predictors' models, while
staying close to representative workloads. To achieve this, we use
Polybench~\cite{polybench}, a C-level benchmark suite representative of
scientific computation workloads, which we combine with a variety of
optimizations, including polyhedral loop transformations.
In Section~\ref{sec:bench_harness}, we describe how \tool{} evaluates
throughput predictors on this set of benchmarks by lifting their
predictions to a total number of cycles that can be compared to a
hardware-counter-based measurement. A
high-level view of \tool{} is shown in Figure~\ref{fig:contrib}.

In Section~\ref{sec:exp_setup}, we detail our experimental setup and assess our
methodology. In Section~\ref{sec:results_analysis}, we compare the predictors'
results and analyze \tool's output.
In addition to statistical studies, we use \tool's results
to investigate analyzers' flaws. We show that code
analyzers do not always correctly model data dependencies through memory
accesses, substantially impacting their precision.
56  manuscrit/50_CesASMe/05_related_works.tex  Normal file
@@ -0,0 +1,56 @@
\section{Related works}

The static throughput analyzers studied rely on a variety of models.
\iaca~\cite{iaca}, developed by Intel and now deprecated, is closed-source and
relies on Intel's expertise on its own processors.
To guide its optimization passes, the LLVM compiler ecosystem maintains models
of many architectures. These models are used in the LLVM Machine Code
Analyzer, \llvmmca~\cite{llvm-mca}, to statically evaluate the performance of
a segment of assembly.
Independently, Abel and Reineke used an automated microbenchmark generation
approach to generate port mappings of many architectures in
\texttt{uops.info}~\cite{nanobench, uopsinfo} to model processors' backends.
This work was continued with \uica~\cite{uica}, which extends this model with
an extensive frontend description.
Following a completely different approach, \ithemal~\cite{ithemal} uses a deep
neural network to predict basic block throughput. To obtain enough data to
train its model, the authors also developed \bhive~\cite{bhive}, a profiling
tool working on basic blocks.

Another static tool, \osaca~\cite{osaca2}, provides lower and
upper bounds on the execution time of a basic block. As this kind of
information cannot be fairly compared with tools yielding an exact throughput
prediction, we exclude it from our scope.

All these tools statically predict the number of cycles taken by a piece of
assembly or binary code that is assumed to be the body of an infinite ---~or
sufficiently large~--- loop in steady state, all its data being L1-resident. As
discussed by \uica's authors~\cite{uica}, hypotheses can subtly vary between
analyzers, \eg{} by assuming that the loop is either unrolled or has control
instructions, with non-negligible impact. Some of the tools, \eg{} \ithemal,
necessarily work on a single basic block, while others, \eg{} \iaca, work
on a section of code delimited by markers. However, even in the second case,
the code is assumed to be \emph{straight-line code}: branch instructions, if
any, are assumed not taken.

\smallskip

Throughput prediction tools, however, are not all static.
\gus~\cite{phd:gruber} dynamically predicts the throughput of a whole program
region, instrumenting it to retrieve the exact events occurring throughout its
execution. This way, \gus{} can detect bottlenecks more finely through
sensitivity analysis, at the cost of a significantly longer run time.

\smallskip

The \bhive{} profiler~\cite{bhive} takes another approach to basic block
throughput measurement: by mapping memory at any address accessed by a basic
block, it can effectively run and measure arbitrary code without context, often
---~but not always, as we discuss later~--- yielding good results.

\smallskip

The \anica{} framework~\cite{anica} also attempts to evaluate throughput
predictors, by finding examples on which they are inaccurate. \anica{} starts
with randomly generated assembly snippets, and refines them through a process
derived from abstract interpretation to reach general categories of problems.
109  manuscrit/50_CesASMe/10_bench_gen.tex  Normal file
@@ -0,0 +1,109 @@
\section{Generating microbenchmarks}\label{sec:bench_gen}

Our framework aims to generate \emph{microbenchmarks} relevant to a specific
domain.
A microbenchmark is a piece of code that is as simplified as possible while
still exposing the behaviour under consideration.
The specified computations should be representative of the considered domain,
and at the same time they should stress the different aspects of the
target architecture ---~which is modeled by code analyzers.

In practice, a microbenchmark's \textit{computational kernel} is a simple
\texttt{for} loop, whose
body contains no loops and whose bounds are statically known.
A \emph{measure} is a number of repetitions $n$ of this computational
kernel, $n$ being a user-specified parameter.
The measure may be repeated an arbitrary number of times to improve
stability.

Furthermore, such a microbenchmark should be a function whose computation
happens without leaving the L1 cache.
This requirement helps measurements and analyses to be
undisturbed by memory accesses, but it is also a matter of comparability.
Indeed, most of the static analyzers assume that the code under
consideration is L1-resident; if it is not, their results are meaningless and
cannot be compared with an actual measurement.

The generation of such microbenchmarks is achieved through four distinct
components, whose parameter variations are specified in configuration files:
a benchmark suite, C-to-C loop nest optimizers, a constraining utility
and a C-to-binary compiler.

\subsection{Benchmark suite}\label{ssec:bench_suite}
Our first component is an initial set of benchmarks, which materializes
the human expertise we intend to exploit for the generation of relevant codes.
The considered suite must embed computation kernels
delimited by ad-hoc \texttt{\#pragma}s,
whose arrays are accessed
directly (no indirections) and whose loops are affine.
These constraints are necessary to ensure that the microkernelification phase,
presented below, generates segfault-free code.

In this case, we use Polybench~\cite{polybench}, a suite of 30
benchmarks for polyhedral compilation ---~of which we use only 26. The
\texttt{nussinov}, \texttt{ludcmp} and \texttt{deriche} benchmarks are
removed because they are incompatible with PoCC (introduced below). The
\texttt{lu} benchmark is left out as its execution alone takes longer than all
the others together, making its dynamic analysis (\eg{} with \gus)
impractical.
Apart from the prominence of linear algebra within the suite, one of its
important features is that it does not include computational kernels with
conditional control flow (\eg{} \texttt{if-then-else}) ---~it does, however,
include conditional data flow, using the ternary conditional operator of C.

\subsection{C-to-C loop nest optimizers}\label{ssec:loop_nest_optimizer}
Loop nest optimizers transform an initial benchmark in different ways
(generating different \textit{versions} of the same benchmark), varying the
stress on resources of the target architecture, and by extension on the
models on which the static analyzers are based.

In this case, we chose to use the \textsc{Pluto}~\cite{pluto} and
PoCC~\cite{pocc} polyhedral compilers, to easily access common loop nest
optimizations: register tiling, tiling,
skewing, vectorization/simdization, loop unrolling, loop permutation,
loop fusion.
These transformations are meant to maximize variety within the initial
benchmark suite. Eventually, the generated benchmarks are expected to
highlight the impact of the resulting behaviours on performance.
For instance, \textit{skewing} introduces non-trivial pointer arithmetic,
increasing the pressure on address computation units; \textit{loop unrolling},
among many things, opens the way to register promotion, which exposes
dependencies and relieves the load-store units;
\textit{vectorization} stresses SIMD units and decreases
pressure on the frontend; and so on.

\subsection{Constraining utility}\label{ssec:kernelify}

A constraining utility transforms the code so that it respects an arbitrary
number of non-functional properties.
In this case, we apply a pass of \emph{microkernelification}: we
extract a computational kernel from the arbitrarily deep and arbitrarily
long loop nest generated by the previous component.
The loop chosen to form the microkernel is the one considered to be
the \textit{hottest}; the \textit{hotness} of a loop is obtained by
multiplying the number of arithmetic operations it contains by the number of
times it is iterated. This metric allows us to prioritize the parts of the
code that have the greatest impact on performance.
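
A minimal sketch of this selection, in Python, could look as follows ---~the
\texttt{trip\_count}, \texttt{arith\_ops} and \texttt{inner\_loops} fields are
hypothetical names for data extracted from the loop nest, not the actual
implementation:

\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language=Python]
def hottest_loop(nest):
    """Return the loop maximizing
    #arithmetic ops x #times the loop runs."""
    best = None
    stack = [(nest, 1)]  # (loop, runs of its header)
    while stack:
        loop, runs = stack.pop()
        hotness = loop.arith_ops * loop.trip_count * runs
        if best is None or hotness > best[0]:
            best = (hotness, loop)
        for inner in loop.inner_loops:
            # An inner loop runs once per iteration
            # of its enclosing loop.
            stack.append((inner, runs * loop.trip_count))
    return best[1]
\end{lstlisting}
\end{minipage}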

At this point, the resulting code can
compute a different result from the initial code;
for instance, the composition of tiling and
kernelification reduces the number of loop iterations.
Indeed, our framework is not meant to preserve the
functional semantics of the benchmarks.
Our goal is only to generate codes that are relevant from the point of view of
performance analysis.

\subsection{C-to-binary compiler}\label{ssec:compile}

A C-to-binary compiler varies binary optimization options by
enabling/disabling auto-vectorization, extended instruction
sets, \textit{etc}. We use \texttt{gcc}.

\bigskip

Eventually, the relevance of the set of microbenchmarks generated using this
approach derives not only from the initial benchmark suite and the relevance
of the transformations chosen at each
stage, but also from the combinatorial explosion generated by the composition
of the four stages. In our experimental setup, this yields up to 144
microbenchmarks per benchmark of the original suite.
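
This combinatorial structure amounts to a Cartesian product over per-stage
parameters, as in the following Python sketch; the parameter values listed are
hypothetical placeholders, the real ones being specified in each component's
configuration files:

\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language=Python]
import itertools

# Hypothetical per-stage parameter spaces.
optimizers = ["none", "pluto-tile", "pluto-unroll",
              "pocc-vectorize"]
constraints = ["kernelify"]
compilers = ["-O1", "-O1 -ftree-vectorize",
             "-O3", "-O3 -fno-tree-vectorize"]

def versions(benchmark):
    """All derived microbenchmarks of one
    initial benchmark."""
    return [(benchmark, opt, cstr, cc)
            for opt, cstr, cc in itertools.product(
                optimizers, constraints, compilers)]

print(len(versions("gemm")))  # 4 x 1 x 4 = 16 here
\end{lstlisting}
\end{minipage}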
87  manuscrit/50_CesASMe/15_harness.tex  Normal file
@@ -0,0 +1,87 @@
\section{Benchmarking harness}\label{sec:bench_harness}

To compare full-kernel cycle measurements to throughput predictions on
individual basic blocks, we lift predictions by adding the weighted basic
block predictions:

\[
\text{lifted\_pred}(\mathcal{K}) =
\sum_{b \in \operatorname{BBs}(\mathcal{K})}
\operatorname{occurrences}(b) \times \operatorname{pred}(b)
\]
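
In code, this lifting is a plain weighted sum. A minimal sketch, in Python
(the dictionaries \texttt{occurrences} and \texttt{pred} stand for the
measured occurrence counts and a given tool's block-level predictions):

\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language=Python]
def lifted_pred(kernel_bbs, occurrences, pred):
    """Lift block-level predictions (cycles per
    block execution) to a whole-kernel cycle
    count, weighting by execution counts."""
    return sum(occurrences[bb] * pred[bb]
               for bb in kernel_bbs)
\end{lstlisting}
\end{minipage}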

Our benchmarking harness works in three successive stages. It first
extracts the basic blocks constituting a computation kernel, and instruments
the kernel to retrieve their respective occurrences in the original context.
It then runs all the studied tools on each basic block, while also measuring
the whole computation kernel. Finally, the block-level results are lifted to
kernel-level results thanks to the occurrences previously measured.

\subsection{Basic block extraction}\label{ssec:bb_extr}

Using the Capstone disassembler~\cite{tool:capstone}, we split the assembly
code at each control flow instruction (jump, call, return, \ldots) and each
jump site.
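
A simplified sketch of this splitting, using Capstone's Python bindings (the
handling of indirect targets and a few edge cases is omitted):

\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language=Python]
from capstone import (Cs, CS_ARCH_X86, CS_MODE_64,
    CS_GRP_JUMP, CS_GRP_CALL, CS_GRP_RET)
from capstone.x86 import X86_OP_IMM

def basic_blocks(code, base_addr):
    md = Cs(CS_ARCH_X86, CS_MODE_64)
    md.detail = True  # needed for groups/operands
    insns = list(md.disasm(code, base_addr))
    leaders = {base_addr}  # block entry points
    for i in insns:
        if any(i.group(g) for g in
               (CS_GRP_JUMP, CS_GRP_CALL, CS_GRP_RET)):
            # Split after the control flow
            # instruction...
            leaders.add(i.address + i.size)
            # ...and at its direct target, if any.
            if i.operands and \
               i.operands[0].type == X86_OP_IMM:
                leaders.add(i.operands[0].imm)
    blocks, cur = [], []
    for i in insns:
        if i.address in leaders and cur:
            blocks.append(cur)
            cur = []
        cur.append(i)
    return blocks + ([cur] if cur else [])
\end{lstlisting}
\end{minipage}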

To accurately obtain the occurrences of each basic block in the whole kernel's
computation,
we then instrument it with \texttt{gdb} by placing a breakpoint
at each basic block's first instruction in order to count the occurrences
of each basic block between two calls to the \perf{} counters\footnote{We
assume the program under analysis to be deterministic.}. While this
instrumentation takes about 50 to 100$\times$ more time than a regular run, it
can safely be run in parallel, as the performance results are discarded.
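
The counting logic can be sketched as a GDB Python script along the following
lines ---~the block addresses are hypothetical, and this is a simplified
rendition, not our actual instrumentation:

\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language=Python]
# Run with: gdb -x count_bbs.py ./kernel
import gdb

class CountingBP(gdb.Breakpoint):
    """Counts its hits, never stops execution."""
    def __init__(self, addr):
        super().__init__("*%#x" % addr,
                         internal=True)
        self.count = 0
    def stop(self):
        self.count += 1
        return False  # just count, keep running

bb_addrs = [0x401180, 0x4011a0]  # hypothetical
bps = {a: CountingBP(a) for a in bb_addrs}
gdb.execute("run")
for a, bp in bps.items():
    print("%#x: %d occurrences" % (a, bp.count))
\end{lstlisting}
\end{minipage}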

\subsection{Throughput predictions and measures}\label{ssec:throughput_pred_meas}

The harness leverages a variety of tools: actual CPU measurement; the \bhive{}
basic block profiler~\cite{bhive}; \llvmmca~\cite{llvm-mca}, \uica~\cite{uica}
and \iaca~\cite{iaca}, which leverage microarchitectural
models to predict a block's throughput; \ithemal~\cite{ithemal}, a machine
learning model; and \gus~\cite{phd:gruber}, a dynamic analyzer based on \qemu{}
that works at the whole binary level.

The execution time of the full kernel is measured using Linux
\perf~\cite{tool:perf} CPU counters around the full computation kernel. The
measurement is repeated four times and the smallest value is kept; this
ensures that the cache is warm and compensates for context switching or other
measurement artifacts. \gus{} instruments the whole function body. The other
tools included all work at basic block level; these are run on each basic
block of each benchmark.

We emphasize the importance, throughout the whole evaluation chain, of keeping
the exact same assembled binary. Indeed, recompiling the kernel from source
\emph{cannot} be assumed to produce the same assembly kernel. This is even more
important in the presence of slight changes: for instance, inserting \iaca{}
markers at the C level ---~as is intended~--- around the kernel \emph{might}
change the compiled kernel, if only for alignment regions. We argue that, in
the case of \iaca{} markers, the problem is even more critical, as those
markers prevent a binary from being run by overwriting registers with arbitrary
values. This forces a user to run and measure a version which is different from
the analyzed one. In our harness, we circumvent this issue by adding markers
directly at the assembly level, editing the already compiled version. Our
\texttt{gdb} instrumentation procedure also respects this principle of
single compilation. As \qemu{} breaks the \perf{} interface, we have to run
\gus{} with a preloaded stub shared library to be able to instrument binaries
containing calls to \perf.

\subsection{Prediction lifting and filtering}\label{ssec:harness_lifting}

We finally lift single basic block predictions to a whole-kernel cycle
prediction by summing the block-level results, weighted by the occurrences of
each basic block in the original context (formula above). If an analyzer fails
on one of the basic blocks of a benchmark, the whole benchmark is discarded for
this analyzer.

In the presence of complex control flow, \eg{} with conditionals inside loops,
our approach based on basic block occurrences is arguably less precise than an
approach based on path occurrences, as we have less information available
---~for instance, whether a branch is taken with a regular pattern, whether we
have constraints on register values, etc. We nevertheless chose this
block-based approach, as most throughput prediction tools work at basic block
level, and are thus readily available and can be directly plugged into our
harness.

Finally, we control the proportion of cache misses in the program's execution
using \texttt{Cachegrind}~\cite{valgrind} and \gus; programs that have more
than 15\,\% of cache misses on a warm cache are not considered L1-resident and
are discarded.
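
A possible rendition of this filter, parsing the totals found in a Cachegrind
output file (the event names follow Cachegrind's output format; the function
names and threshold handling are ours):

\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language=Python]
def l1_miss_ratio(cachegrind_out):
    """D1 miss ratio from a cachegrind.out file,
    which holds an `events:' header naming the
    counters and a `summary:' line of totals."""
    events, totals = None, None
    with open(cachegrind_out) as f:
        for line in f:
            if line.startswith("events:"):
                events = line.split()[1:]
            elif line.startswith("summary:"):
                totals = [int(x)
                          for x in line.split()[1:]]
    c = dict(zip(events, totals))
    return ((c["D1mr"] + c["D1mw"])
            / (c["Dr"] + c["Dw"]))

def is_l1_resident(path, threshold=0.15):
    return l1_miss_ratio(path) < threshold
\end{lstlisting}
\end{minipage}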
213  manuscrit/50_CesASMe/20_evaluation.tex  Normal file
@@ -0,0 +1,213 @@
\section{Experimental setup and evaluation}\label{sec:exp_setup}

Running the harness described above provides us with 3,500
benchmarks ---~after filtering out non-L1-resident
benchmarks~--- on which each throughput predictor is run. We make the full
output of our tool available in our artifact. Before analyzing these results in
Section~\ref{sec:results_analysis}, we evaluate the relevance of the
methodology presented in Section~\ref{sec:bench_harness} to make the tools'
predictions comparable to baseline hardware counter measurements.

\subsection{Experimental environment}

The experiments presented in this paper were all carried out on a Dell
PowerEdge C6420 machine from the Grid5000 cluster~\cite{grid5000}, equipped
with 192\,GB of DDR4 SDRAM ---~only a small fraction of which was used~--- and
two Intel Xeon Gold 6130 CPUs (x86-64, Skylake microarchitecture) with 16
cores each.

The experiments themselves were run inside a Docker environment very close to
our artifact, based on Debian Bullseye. Care was taken to disable
hyperthreading to improve measurement stability. For tools whose output is
based on a direct measurement (\perf, \bhive), the benchmarks were run
sequentially on a single core with no experiments on the other cores. No such
care was taken for \gus{} as, although based on a dynamic run, its prediction
is purely a function of recorded program events, not of timing measurements.
All other tools were run in parallel.

We use \llvmmca{} \texttt{v13.0.1}, \iaca{} \texttt{v3.0-28-g1ba2cbb}, \bhive{}
at commit \texttt{5f1d500}, \uica{} at commit \texttt{9cbbe93}, \gus{} at
commit \texttt{87463c9} and \ithemal{} at commit \texttt{b3c39a8}.

\subsection{Comparability of the results}

We define the relative error of a time prediction
$C_\text{pred}$ (in cycles) with respect to a baseline $C_\text{baseline}$ as
\[
\operatorname{err} = \frac{\left| C_\text{pred} - C_\text{baseline}
\right|}{C_\text{baseline}}
\]

We assess the comparability of whole-benchmark measurements, made with
\perf{}, to lifted block-based results by measuring the statistical
distribution of the relative error of two series: the predictions made by
\bhive, and the series of the best block-based prediction for each benchmark.

We single out \bhive{} as it is the only tool able to \textit{measure}
---~instead of predicting~--- an isolated basic block's timing. This, however,
is not sufficient: as discussed later in Section~\ref{ssec:bhive_errors},
\bhive{} is not able to yield a result for about $40\,\%$ of the benchmarks,
and is subject to large errors in some cases. For this reason, we also
consider, for each benchmark, the best block-based prediction: we argue that
if, for most benchmarks, at least one of these predictors is able to yield a
satisfactorily accurate result, then the lifting methodology is sound in
practice.

The result of this analysis is presented in Table~\ref{table:exp_comparability}
and in Figure~\ref{fig:exp_comparability}. The results are in a range
compatible with common results of the field, as seen \eg{} in~\cite{uica},
which reports a Mean Absolute Percentage Error (MAPE, corresponding to the
``Average'' row) of about 10-15\,\% in many cases. While lifted \bhive's
average error is driven high by large errors on certain benchmarks,
investigated later in this article, its median error is still comparable to the
errors of state-of-the-art tools. From this, we conclude that lifted cycle
measurements and predictions are consistent with whole-benchmark measurements;
and consequently, lifted predictions can reasonably be compared to one another.

\begin{figure}
\centering
\includegraphics[width=\linewidth]{figs/results_comparability_hist.pdf}
\caption{Relative error distribution \wrt{} \perf}\label{fig:exp_comparability}
\end{figure}

\begin{table}
\centering
\caption{Relative error statistics \wrt{} \perf}\label{table:exp_comparability}
\begin{tabular}{l r r}
\toprule
& \textbf{Best block-based} & \textbf{BHive} \\
\midrule
Datapoints & 3500 & 2198 \\
Errors & 0 & 1302 \\
& (0\,\%) & (37.20\,\%) \\
Average (\%) & 11.60 & 27.95 \\
Median (\%) & 5.81 & 7.78 \\
Q1 (\%) & 1.99 & 3.01 \\
Q3 (\%) & 15.41 & 23.01 \\
\bottomrule
\end{tabular}
\end{table}


\begin{table*}[!htbp]
\centering
\caption{Bottleneck reports from the studied tools}\label{table:coverage}

\begin{tabular}{l | r r r | r r r | r r r}
\toprule
& \multicolumn{3}{c|}{\textbf{Frontend}}
& \multicolumn{3}{c|}{\textbf{Ports}}
& \multicolumn{3}{c}{\textbf{Dependencies}} \\
& \textbf{yes} & \textbf{no} & \textbf{disagr.} & \textbf{yes} & \textbf{no} & \textbf{disagr.} & \textbf{yes} & \textbf{no} & \textbf{disagr.} \\
\midrule
2mm & 34 & 61 & 25.8 \% & 25 & 13 & 70.3 \% & 18 & 29 & 63.3 \% \\
3mm & 44 & 61 & 18.0 \% & 30 & 13 & 66.4 \% & 23 & 37 & 53.1 \% \\
atax & 13 & 72 & 41.0 \% & 25 & 17 & 70.8 \% & 23 & 30 & 63.2 \% \\
bicg & 19 & 59 & 45.8 \% & 25 & 25 & 65.3 \% & 21 & 37 & 59.7 \% \\
doitgen & 51 & 25 & 40.6 \% & 36 & 30 & 48.4 \% & 17 & 22 & 69.5 \% \\
mvt & 27 & 53 & 33.3 \% & 9 & 18 & 77.5 \% & 7 & 32 & 67.5 \% \\
gemver & 62 & 13 & 39.5 \% & 2 & 48 & 59.7 \% & 1 & 28 & 76.6 \% \\
gesummv & 16 & 69 & 41.0 \% & 17 & 23 & 72.2 \% & 24 & 28 & 63.9 \% \\
syr2k & 51 & 37 & 38.9 \% & 8 & 42 & 65.3 \% & 19 & 34 & 63.2 \% \\
trmm & 69 & 27 & 25.0 \% & 16 & 30 & 64.1 \% & 15 & 30 & 64.8 \% \\
symm & 0 & 121 & 11.0 \% & 5 & 20 & 81.6 \% & 9 & 5 & 89.7 \% \\
syrk & 54 & 46 & 30.6 \% & 12 & 42 & 62.5 \% & 20 & 48 & 52.8 \% \\
gemm & 42 & 41 & 42.4 \% & 30 & 41 & 50.7 \% & 16 & 57 & 49.3 \% \\
gramschmidt & 48 & 52 & 21.9 \% & 16 & 20 & 71.9 \% & 24 & 39 & 50.8 \% \\
cholesky & 24 & 72 & 33.3 \% & 0 & 19 & 86.8 \% & 5 & 14 & 86.8 \% \\
durbin & 49 & 52 & 29.9 \% & 0 & 65 & 54.9 \% & 2 & 39 & 71.5 \% \\
trisolv & 53 & 84 & 4.9 \% & 6 & 22 & 80.6 \% & 4 & 16 & 86.1 \% \\
jacobi-1d & 18 & 78 & 33.3 \% & 66 & 9 & 47.9 \% & 0 & 13 & 91.0 \% \\
heat-3d & 32 & 8 & 72.2 \% & 26 & 0 & 81.9 \% & 0 & 0 & 100.0 \% \\
seidel-2d & 0 & 112 & 22.2 \% & 32 & 0 & 77.8 \% & 0 & 0 & 100.0 \% \\
fdtd-2d & 52 & 22 & 47.1 \% & 20 & 41 & 56.4 \% & 0 & 40 & 71.4 \% \\
jacobi-2d & 6 & 31 & 73.6 \% & 24 & 61 & 39.3 \% & 0 & 44 & 68.6 \% \\
adi & 12 & 76 & 21.4 \% & 40 & 0 & 64.3 \% & 0 & 0 & 100.0 \% \\
correlation & 18 & 36 & 51.8 \% & 19 & 30 & 56.2 \% & 23 & 45 & 39.3 \% \\
covariance & 39 & 36 & 37.5 \% & 4 & 34 & 68.3 \% & 19 & 53 & 40.0 \% \\
floyd-warshall & 74 & 16 & 29.7 \% & 16 & 24 & 68.8 \% & 20 & 8 & 78.1 \% \\
\textbf{Total} & 907 & 1360 & 35.2 \% & 509 & 687 & 65.8 \% & 310 & 728 & 70.3 \% \\
\bottomrule
\end{tabular}
\end{table*}

\subsection{Relevance and representativity (bottleneck
analysis)}\label{ssec:bottleneck_diversity}

The results provided by our harness are only relevant to evaluate the parts of
the tools' models that are stressed by the generated benchmarks; it is hence
critical that our benchmark generation procedure from
Section~\ref{sec:bench_gen} yields diverse results. This should be true by
construction, as the various polyhedral compilation techniques used stress
different parts of the microarchitecture.

To assess this, we study the generated benchmarks' bottlenecks, \ie{} the
architectural resources on which a release of pressure improves execution
time. Note that a saturated resource is not necessarily a bottleneck: a code
that uses \eg{} 100\,\% of the available arithmetic units for computations
outside of the critical path, at a point where a chain of dependencies is
blocking, will not run faster if the arithmetic operations are removed; hence,
hardware counters alone are not sufficient to find bottlenecks.

However, some static analyzers report the bottlenecks they detect. To unify
their results and keep things simple, we study three general kinds of
bottlenecks.

\begin{itemize}
    \item{} \emph{Frontend:} the CPU's frontend is not able to issue
        micro-operations to the backend fast enough. \iaca{} and \uica{} are
        able to detect this.
    \item{} \emph{Ports:} at least one of the backend ports has too much work;
        reducing its pressure would accelerate the computation.
        \llvmmca, \iaca{} and \uica{} are able to detect this.
    \item{} \emph{Dependencies:} there is a chain of data dependencies slowing
        down the computation.
        \llvmmca, \iaca{} and \uica{} are able to detect this.
\end{itemize}

For each source benchmark from Polybench and each type of bottleneck, we report
in Table~\ref{table:coverage} the number of derived benchmarks on which all the
tools agree that the bottleneck is present or absent. We also report the
proportion of cases in which the tools failed to agree. We analyze those
results later in Section~\ref{ssec:bottleneck_pred_analysis}.
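
The agreement count per bottleneck kind can be sketched as follows (the
\texttt{reports} layout ---~one Boolean per tool and per bottleneck kind for
each microbenchmark~--- is a hypothetical rendition of our data):

\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language=Python]
def consensus(reports, kind):
    """reports: {bench: {tool: {kind: bool}}}"""
    agree_yes = agree_no = disagree = 0
    for per_tool in reports.values():
        answers = {t[kind] for t in per_tool.values()
                   if kind in t}
        if answers == {True}:
            agree_yes += 1
        elif answers == {False}:
            agree_no += 1
        else:
            disagree += 1
    return agree_yes, agree_no, disagree
\end{lstlisting}
\end{minipage}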

As we have no source of truth indicating whether a bottleneck is effectively
present in a microbenchmark, we adopt a conservative approach, and consider
only the subset of the microbenchmarks on which the tools agree on the status
of all three resources; for those, we have good confidence in the bottlenecks
reported. Obviously, this approach is limited, because it excludes
microbenchmarks that might be worth considering, and is most probably subject
to selection bias.

Of the 3,500 microbenchmarks we have generated, 261 (7.5\,\%) are the subject
of the above-mentioned consensus. This sample is made up of microbenchmarks
generated from 21 benchmarks ---~\ie{} for 5 benchmarks, none of the derived
microbenchmarks reached a consensus among the tools~---, yielding a wide
variety of calculations, including floating-point arithmetic, pointer
arithmetic or Boolean arithmetic. Of these, 200 (76.6\,\%) are bottlenecked on
the CPU frontend, 19 (7.3\,\%) on backend ports, and 81 (31.0\,\%) on latency
introduced by dependencies. As mentioned above, this distribution
probably does not reflect the distribution among the 3,500 original
benchmarks, as the 261 were not uniformly sampled. However, we argue that, as
all categories are represented in the sample, the initial hypothesis that the
generated benchmarks are diverse and representative is confirmed ---~thanks to
the transformations described in Section~\ref{sec:bench_gen}.

\subsection{Carbon footprint}

Generating and running the full suite of benchmarks required about 30\,h of
continuous computation on a single machine. During the experiments, the power
supply units reported a near-constant consumption of about 350\,W. The carbon
intensity of the power grid for the region where the experiment was run, at
the time of the experiments, was about 29\,g\coeq/kWh~\cite{electricitymaps}.

The electricity consumed directly by the server thus amounts to about
10.50\,kWh. Assuming a Power Usage Effectiveness of 1.5, the total electricity
consumption roughly amounts to 15.75\,kWh, or about 450\,g\coeq.

A carbon footprint estimate of the machine's manufacture itself was conducted
by the manufacturer~\cite{poweredgeC6420lca}. Additionally accounting for the
extra 160\,GB of DDR4 SDRAM~\cite{meta_ACT}, the hardware manufacturing,
transport and end-of-life is evaluated at 1,266\,kg\coeq. In 2023, this
computation cluster's usage rate was 35\,\%. Assuming 6 years of product life,
30\,h of usage represents about 2,050\,g\coeq{}. The whole experiment thus
amounts to 2.5\,kg\coeq.
338  manuscrit/50_CesASMe/25_results_analysis.tex  Normal file
@@ -0,0 +1,338 @@
\section{Results analysis}\label{sec:results_analysis}

The raw complete output from our benchmarking harness ---~roughly speaking, a
large table with, for each benchmark, a cycle measurement, a cycle count for
each throughput analyzer, the resulting relative error, and a synthesis of the
bottlenecks reported by each tool~--- enables many analyses that, we believe,
could be useful both to throughput analysis tool developers and users. Tool
designers can draw insights into their tool's strengths and weaknesses, and
work towards improving them with a clearer vision. Users can gain a better
understanding of which tool is best suited for each situation.

\subsection{Throughput results}\label{ssec:overall_results}

\begin{table*}
\centering
\caption{Statistical analysis of overall results}\label{table:overall_analysis_stats}
\begin{tabular}{l r r r r r r r r r}
\toprule
\textbf{Bencher} & \textbf{Datapoints} & \textbf{Failures} & \textbf{(\%)} &
\textbf{MAPE} & \textbf{Median} & \textbf{Q1} & \textbf{Q3} & \textbf{K. tau} & \textbf{Time (CPU$\cdot$h)}\\
\midrule
BHive & 2198 & 1302 & (37.20\,\%) & 27.95\,\% & 7.78\,\% & 3.01\,\% & 23.01\,\% & 0.81 & 1.37\\
llvm-mca & 3500 & 0 & (0.00\,\%) & 36.71\,\% & 27.80\,\% & 12.92\,\% & 59.80\,\% & 0.57 & 0.96 \\
UiCA & 3500 & 0 & (0.00\,\%) & 29.59\,\% & 18.26\,\% & 7.11\,\% & 52.99\,\% & 0.58 & 2.12 \\
Ithemal & 3500 & 0 & (0.00\,\%) & 57.04\,\% & 48.70\,\% & 22.92\,\% & 75.69\,\% & 0.39 & 0.38 \\
Iaca & 3500 & 0 & (0.00\,\%) & 30.23\,\% & 18.51\,\% & 7.13\,\% & 57.18\,\% & 0.59 & 1.31 \\
Gus & 3500 & 0 & (0.00\,\%) & 20.37\,\% & 15.01\,\% & 7.82\,\% & 30.59\,\% & 0.82 & 188.04 \\
\bottomrule
\end{tabular}
\end{table*}

The distribution of relative errors, for each tool, is presented as a
box plot in Figure~\ref{fig:overall_analysis_boxplot}. Statistical indicators
are also given in Table~\ref{table:overall_analysis_stats}. We also give, for
each tool, its Kendall's tau indicator~\cite{kendall1938tau}: this indicator,
used to evaluate \eg{} \uica~\cite{uica} and Palmed~\cite{palmed}, measures how
well the pair-wise ordering of benchmarks is preserved, $-1$ being a full
anti-correlation and $1$ a full correlation. This is especially useful when one
is not interested in a program's absolute throughput, but rather in comparing
which of two programs has the better throughput.
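
Computing this indicator is straightforward, \eg{} with SciPy (the cycle
counts below are hypothetical placeholders):

\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language=Python]
from scipy.stats import kendalltau

# Per-benchmark cycle counts (placeholders).
measured  = [80059, 12500, 43210, 9800]
predicted = [70100, 14000, 39000, 11000]

# tau close to 1: the tool ranks the benchmarks
# in the same order as the hardware does.
tau, _pvalue = kendalltau(measured, predicted)
print("Kendall's tau: %.2f" % tau)
\end{lstlisting}
\end{minipage}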

\begin{figure}
\includegraphics[width=\linewidth]{figs/overall_analysis_boxplot.pdf}
\caption{Statistical distribution of relative errors}\label{fig:overall_analysis_boxplot}
\end{figure}


These results are, overall, significantly worse than what each tool's article
presents. We attribute this difference mostly to the specificities of
Polybench: being composed of computation kernels, it intrinsically stresses the
CPU more than basic blocks extracted from the SPEC benchmark suite. This
difference is clearly reflected in the experimental section of the Palmed
article~\cite{palmed}: the accuracy of most tools is worse on Polybench than on
SPEC, often by more than a factor of two.

As \bhive{} and \ithemal{} do not support control flow instructions
(\eg{} \texttt{jump} instructions), those had
to be removed from the blocks before analysis. While none of these tools, apart
from \gus{} ---~which is dynamic~---, is able to account for branching costs,
these two analyzers were also unable to account for the front- and backend
costs of the control flow instructions themselves ---~corresponding to the
$TP_U$ mode introduced by \uica~\cite{uica}, while the others
measure $TP_L$.


\subsection{Understanding \bhive's results}\label{ssec:bhive_errors}

The error distribution of \bhive{} against \perf{}, plotted on the right of
Figure~\ref{fig:exp_comparability}, highlights irregularities in \bhive's
results. Since \bhive{} is based on measurements through hardware counters
---~instead of predictions~---, excellent accuracy is expected. Its lack of
support for control flow instructions can be held accountable for a portion of
this accuracy drop; our lifting method, based on block occurrences instead of
paths, can explain another portion. We also find that \bhive{} fails to produce
a result in about 40\,\% of the kernels explored ---~which means that, for
those cases, \bhive{} failed to produce a result on at least one of the
constituent basic blocks. In fact, this is due to the difficulties we mentioned
in Section~\ref{sec:intro} related to the need to reconstruct the context of
each basic block \textit{ex nihilo}.

The basis of \bhive's method is to run the code to be measured, unrolled a
number of times depending on the code size, with all memory pages but the
code unmapped. As the code tries to access memory, it will raise segfaults,
caught by \bhive's harness, which allocates a single shared-memory page, filled
with a repeated constant, that it will map wherever segfaults occur before
restarting the program.
The main causes of \bhive{} failure are bad code behaviour (\eg{} control flow
not reaching the exit point of the measure if a bad jump is inserted), too many
segfaults to be handled, or a segfault that occurs even after mapping a page at
the problematic address.

The registers are also initialized, at the beginning of the measurement, to the
fixed constant \texttt{0x2324000}. We show through two examples that this
initial value can be of crucial importance.

The following experiments are executed on an Intel(R) Xeon(R) Gold 6230R CPU
(Cascade Lake), with hyperthreading disabled.

\paragraph{Imprecise analysis} We consider the following x86-64 kernel.

\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
vmulsd (%rax), %xmm3, %xmm0
vmovsd %xmm0, (%r10)
\end{lstlisting}
\end{minipage}

When executed with all the general purpose registers initialized to the default
constant, \bhive{} reports 9 cycles per iteration, since \reg{rax} and
\reg{r10} hold the same value, inducing a read-after-write dependency between
the two instructions. If, however, \bhive{} is tweaked to initialize \reg{r10}
to a value that aliases (\wrt{} physical addresses) with the value in
\reg{rax}, \eg{} between \texttt{0x10000} and \texttt{0x10007} (inclusive), it
reports 19 cycles per iteration instead; while a value between \texttt{0x10008}
and \texttt{0x1009f} (inclusive) yields the expected 1 cycle ---~except for
values in \texttt{0x10039}-\texttt{0x1003f} and
\texttt{0x10079}-\texttt{0x1007f}, yielding 2 cycles, as the store crosses a
cache line boundary.

In the same way, the value used to initialize the shared memory page can
influence the results whenever it gets loaded into registers.

\vspace{0.5em}

\paragraph{Failed analysis} Some memory accesses will always result in an
error; for instance, it is impossible to \texttt{mmap} at an address lower
than \texttt{/proc/sys/vm/mmap\_min\_addr}, defaulting to \texttt{0x10000}.
Thus, with equal initial values for all registers, the following kernel would
fail, since the second operation attempts to load at address 0:

\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
subq %r11, %r10
movq (%r10), %rax
\end{lstlisting}
\end{minipage}

Such errors can occur in more convoluted ways. The following x86-64 kernel,
for instance, is extracted from a version of the \texttt{durbin}
kernel\footnote{\texttt{durbin.pocc.noopt.default.unroll8.MEDIUM.kernel21.s}
in the full results}.

\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
vmovsd 0x10(%r8, %rcx), %xmm6
subl %eax, %esi
movslq %esi, %rsi
vfmadd231sd -8(%r9, %rsi, 8), \
            %xmm6, %xmm0
\end{lstlisting}
\end{minipage}

Here, \bhive{} fails to measure the kernel when run with the general purpose
registers initialized to the default constant at the 2\textsuperscript{nd}
occurrence of the unrolled loop body, failing to recover from an error at the
\texttt{vfmadd231sd} instruction with the \texttt{mmap} strategy. Indeed, after
the first iteration, the value in \reg{rsi} becomes zero, then negative at the
second iteration; thus, the second occurrence of the last instruction fetches
at address \texttt{0xfffffffff0a03ff8}, which is in kernel space. This
microkernel can be benchmarked with \bhive{} \eg{} by initializing \reg{rax}
to 1.

Some other microkernels fail in a similar way when trying to access addresses
that are not a virtual address in \emph{canonical form} space for x86-64 with
48-bit virtual addresses, as defined in Section~3.3.7.1 of Intel's Software
Developer's Manual~\cite{ref:intel64_software_dev_reference_vol1} and
Section~5.3.1 of the AMD64 Architecture Programmer's
Manual~\cite{ref:amd64_architecture_dev_reference_vol2}. Others still fail with
accesses relative to the instruction pointer, as \bhive{} read-protects the
unrolled microkernel's instructions page.

\subsection{Bottleneck prediction}\label{ssec:bottleneck_pred_analysis}

We noted in Section~\ref{ssec:bottleneck_diversity} that some of
the tools studied are also able to report suspected bottlenecks for the
evaluated program; these reports are presented in Table~\ref{table:coverage}.
This feature might be even more useful than raw throughput predictions to the
users of these tools willing to optimize their program, as such reports
strongly hint towards what needs to be enhanced.

In the majority of the cases studied, the tools are not able to agree on the
presence or absence of a type of bottleneck. Although it might seem that the
tools perform better on frontend bottleneck detection, it must be
recalled that only two tools (versus three in the other cases) report
frontend bottlenecks, thus making it easier for them to agree.

\begin{table}
\centering
\caption{Diverging bottleneck prediction per tool}\label{table:bottleneck_diverging_pred}
\begin{tabular}{l r r r r}
\toprule
\textbf{Tool}
& \multicolumn{2}{c}{\textbf{Ports}}
& \multicolumn{2}{c}{\textbf{Dependencies}} \\
\midrule
\llvmmca{} & 567 & (24.6 \%) & 1032 & (41.9 \%) \\
\uica{} & 516 & (22.4 \%) & 530 & (21.5 \%) \\
\iaca{} & 1221 & (53.0 \%) & 900 & (36.6 \%) \\
\bottomrule
\end{tabular}
\end{table}

Table~\ref{table:bottleneck_diverging_pred}, in turn, breaks down the cases
on which the three tools disagree into the number of times each tool makes a
diverging prediction ---~\ie{} the tool predicts differently from the two
others. In the case of ports, \iaca{} is responsible for half of the
divergences ---~which is not sufficient to conclude that the prediction of the
other tools is correct. In the case of dependencies, however, there is no clear
outlier, even though \uica{} seems to fare better than the others.

In no case does one tool seem to be responsible for the vast majority of
disagreements, which could have hinted towards it failing to predict this kind
of bottleneck correctly. In the absence of a source of truth indicating
whether a bottleneck is effectively present, and with no clear-cut result for
(a subset of) tool predictions, we cannot conclude on the quality of the
predictions from each tool for each kind of bottleneck.

\subsection{Impact of dependency-boundness}\label{ssec:memlatbound}

\begin{table*}
\centering
\caption{Statistical analysis of overall results, without rows latency bound
through memory-carried dependencies}\label{table:nomemdeps_stats}
\begin{tabular}{l r r r r r r r r r}
\toprule
\textbf{Bencher} & \textbf{Datapoints} & \textbf{Failures} & \textbf{(\%)} &
\textbf{MAPE} & \textbf{Median} & \textbf{Q1} & \textbf{Q3} & \textbf{K. tau}\\
\midrule
BHive & 1365 & 1023 & (42.84\,\%) & 34.07\,\% & 8.62\,\% & 4.30\,\% & 24.25\,\% & 0.76\\
llvm-mca & 2388 & 0 & (0.00\,\%) & 27.06\,\% & 21.04\,\% & 9.69\,\% & 32.73\,\% & 0.79\\
UiCA & 2388 & 0 & (0.00\,\%) & 18.42\,\% & 11.96\,\% & 5.42\,\% & 23.32\,\% & 0.80\\
Ithemal & 2388 & 0 & (0.00\,\%) & 62.66\,\% & 53.84\,\% & 24.12\,\% & 81.95\,\% & 0.40\\
Iaca & 2388 & 0 & (0.00\,\%) & 17.55\,\% & 12.17\,\% & 4.64\,\% & 22.35\,\% & 0.82\\
Gus & 2388 & 0 & (0.00\,\%) & 23.18\,\% & 20.23\,\% & 8.78\,\% & 32.73\,\% & 0.83\\
\bottomrule
\end{tabular}
\end{table*}

An overview of the full results table (available in our artifact) hints towards
two main tendencies: on a significant number of rows, the static tools
---~that is, leaving \gus{} and \bhive{} aside~--- except \ithemal{} often
yield comparatively bad throughput predictions \emph{together}; and many of
these rows are those using the \texttt{O1} and \texttt{O1autovect} compilation
settings (\texttt{gcc} with \texttt{-O1}, plus vectorization options for the
latter).

To confirm the first observation, we look at the 30\,\% worst benchmarks ---~in
terms of MAPE relative to \perf~--- for \llvmmca, \uica{} and \iaca{}
---~yielding 1050 rows each. All of these share 869 rows (82.8\,\%), which we
call \textit{jointly bad rows}.
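
This selection boils down to intersecting, for each tool, the indices of its
worst rows. A sketch with pandas (the column names are hypothetical; the full
results table in our artifact uses its own naming):

\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language=Python]
import pandas as pd

def jointly_bad_rows(df, tools=("llvm-mca",
                                "uica", "iaca"),
                     frac=0.30):
    """Rows among the `frac' worst (by relative
    error) for every tool at once."""
    n = int(len(df) * frac)
    worst = None
    for tool in tools:
        idx = set(df["err_" + tool]
                  .nlargest(n).index)
        worst = idx if worst is None else worst & idx
    return df.loc[sorted(worst)]
\end{lstlisting}
\end{minipage}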

Among these 869 jointly bad rows, we further find that respectively 342
(39.4\,\%) and 337 (38.8\,\%) are compiled using \texttt{O1} and
\texttt{O1autovect}, totalling 679 (78.1\,\%) \texttt{O1}-based rows,
against 129 (14.8\,\%) for \texttt{default} and 61 (7.0\,\%) for
\texttt{O3nosimd}. This result is significant enough to be used as a hint for
investigating the issue.

\begin{figure}
\includegraphics[width=\linewidth]{figs/nomemdeps_boxplot.pdf}
\caption{Statistical distribution of relative errors, with and without
pruning rows latency bound through memory-carried dependencies}\label{fig:nomemdeps_boxplot}
\end{figure}

Insofar as our approach maintains a strong link between the basic blocks
studied and the source codes from which they are extracted, it is possible to
identify the high-level characteristics of the concerned microbenchmarks.
In the overwhelming majority (97.5\,\%) of those jointly bad rows, the tools
predicted fewer cycles than measured, meaning that a bottleneck is either
missed or underestimated.
Manual investigation of a few simple benchmarks (no polyhedral transformation
applied, \texttt{O1} mode, not unrolled) further hints towards dependencies:
for instance, the \texttt{gemver} benchmark, which is \emph{not} among the
badly predicted benchmarks, has this kernel:

\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[ANSI]C}]
for(c3)
    A[c1][c3] += u1[c1] * v1[c3]
               + u2[c1] * v2[c3];
\end{lstlisting}
\end{minipage}

while the \texttt{atax} benchmark, which is among the badly predicted ones, has
this kernel:

\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[ANSI]C}]
for(c3)
    tmp[c1] += A[c1][c3] * x[c3];
\end{lstlisting}
\end{minipage}

The first one exhibits no obvious dependency-boundness, while the second,
accumulating into \texttt{tmp[c1]} (independent of the iteration variable),
lacks instruction-level parallelism. Among the simple benchmarks (as described
above), 8 are in the badly predicted list, all of which exhibit a
read-after-write data dependency on the preceding iteration.

Looking at the assembly code generated for those in the \texttt{O1} modes, it
appears that the dependencies exhibited at the C level are compiled into
\emph{memory-carried} dependencies: the read-after-write happens on a given
memory address, instead of on a register. This kind of dependency, prone to
aliasing and dependent on the values of the registers, is hard to infer for a
static tool and is not supported by the analyzers under scrutiny in the general
case; it could thus reasonably explain the results observed.

There is no easy way, however, to know for certain which of the 3,500
benchmarks are latency bound: no hardware counter reports this. We investigate
this further using \gus's sensitivity analysis: as a complement to the
``normal'' throughput estimation of \gus, we run it a second time, disabling
the accounting for latency through memory dependencies. By construction, this
second measurement should be either very close to the first one, or
significantly below. We then assume a benchmark to be latency bound due to
memory-carried dependencies when it is at least 40\,\% faster when this
latency is disabled; there are 1112 (31.8\,\%) such benchmarks.
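
The flagging criterion itself is a one-liner; a sketch, under our reading of
``40\,\% faster'' as a 40\,\% reduction in predicted cycles:

\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language=Python]
def is_memlat_bound(cycles_normal,
                    cycles_nolat,
                    threshold=0.40):
    """cycles_nolat: Gus's prediction with
    memory-dependency latency disabled."""
    return (cycles_normal - cycles_nolat) \
        >= threshold * cycles_normal
\end{lstlisting}
\end{minipage}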

Of the 869 jointly bad rows, 745 (85.7\,\%) are declared latency
bound through memory-carried dependencies by \gus. We conclude that the main
reason for these jointly badly predicted benchmarks is that the predictors
under scrutiny failed to correctly detect these dependencies.

In Section~\ref{ssec:overall_results}, we presented in
Figure~\ref{fig:overall_analysis_boxplot} and
Table~\ref{table:overall_analysis_stats} general statistics on the tools
over the full set of benchmarks. We now remove the 1112 benchmarks
flagged as latency bound through memory-carried dependencies by \gus{} from the
dataset, and present in Figure~\ref{fig:nomemdeps_boxplot} a comparative
box plot for the tools under scrutiny. We also present in
Table~\ref{table:nomemdeps_stats} the same statistics on this pruned dataset.
While the results for \llvmmca, \uica{} and \iaca{} globally improve
significantly, the most noticeable improvements are the reduced spread of the
results and the increase of the Kendall's $\tau$ correlation coefficient.

From this,
we argue that detecting memory-carried dependencies is a weak point in current
state-of-the-art static analyzers, and that their results could be
significantly more accurate if improvements are made in this direction.
106  manuscrit/50_CesASMe/30_future_works.tex  Normal file
@@ -0,0 +1,106 @@
\section{Conclusion and future work}

In this article, we have presented a fully-tooled approach that enables:

\begin{itemize}
\item the generation of a wide variety of microbenchmarks, reflecting both the
    expertise contained in an initial benchmark suite and the diversity of
    code transformations that stress different aspects of a performance model
    ---~or even of a measurement environment, \eg{} \bhive; and
\item the comparability of various measurements and
    analyses applied to each of these microbenchmarks.
\end{itemize}
Thanks to this tooling, we were able to show the limits and strengths of
various performance models in relation to the expertise contained in the
Polybench suite. We discuss throughput results in
Section~\ref{ssec:overall_results} and bottleneck prediction in
Section~\ref{ssec:bottleneck_pred_analysis}.
We were also able to demonstrate the difficulties of reasoning at the level of
a basic block isolated from its context. We specifically study those
difficulties in the case of \bhive{} in Section~\ref{ssec:bhive_errors}.
Indeed, the actual values ---~both in registers and in memory~--- involved in a
basic block's computation determine not only its functional
properties (\ie{} the result of the calculation), but also some of its
non-functional properties (\eg{} latency, throughput).
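
For instance, the timing of a block as small as the following two instructions
(the registers are arbitrary) depends on values it does not itself define:

\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
movsd %xmm0, (%rdi)   # store a double to the address in %rdi
movsd (%rsi), %xmm1   # load a double from the address in %rsi
\end{lstlisting}
\end{minipage}

If \texttt{\%rdi} and \texttt{\%rsi} happen to hold the same address, the load
depends on the store and the two instructions execute serially; if they do
not, the CPU may execute them in parallel. Nothing in the block itself
distinguishes the two cases.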
We were also able to show in Section~\ref{ssec:memlatbound}
that state-of-the-art static analyzers struggle to
account for memory-carried dependencies, a weakness that significantly impacts
their overall results on our benchmarks. We believe that detecting
and accounting for these dependencies is an important direction for future
work.
Moreover, we present this work in the form of a modular software package, each
component of which exposes numerous adjustable parameters. These components can
also be replaced by others fulfilling the same abstract function: another
initial benchmark suite in place of Polybench, other loop nest
optimizers in place of PLUTO and PoCC, other code
analyzers, and so on. This software modularity reflects the fact that our
contribution focuses on the interfaces and communication between otherwise
independent components.

\medskip

Furthermore, we believe that the contributions we made in the course of this
work may eventually be used to address different, yet neighbouring, problems.
These perspectives can also be read as future work:

\smallskip

\paragraph{Program optimization} The whole program-processing pipeline we have
designed can be used not only to evaluate the performance model underlying a
static analyzer, but also to guide program optimization itself. In such a
perspective, we would generate different versions of the same program using the
transformations discussed in Section~\ref{sec:bench_gen} and colored blue in
Figure~\ref{fig:contrib}. These different versions would then feed the
execution and measurement environment outlined in
Section~\ref{sec:bench_harness} and colored orange in Figure~\ref{fig:contrib}.
Indeed, thanks to our previous work, we know that the results of these
comparable analyses and measurements would make it possible to identify which
version is the most efficient, and even to reconstruct information indicating
why (which bottlenecks, etc.).
However, this approach would require these different versions of the same
program to be functionally equivalent, \ie{} to compute the same
result from the same inputs; yet we saw in Section~\ref{sec:bench_harness}
that, as it stands, the transformations we apply are not concerned with
preserving the semantics of the input codes. To recover this
semantic-preservation property, it suffices to abandon the kernelification pass
we have presented; this, however, would require controlling L1-residence by
other means.

\smallskip

\paragraph{Dataset building} Our microbenchmark generation phase outputs a
large, diverse and representative dataset of microkernels. In addition to our
harness, we believe that such a dataset could be used to improve existing
data-dependent solutions.

%the measurement and execution environment we
%propose is not the only type of tool whose function is to process a large
%dataset (\ie{} the microbenchmarks generated earlier) to automatically
%abstract its characteristics. We can also think of:

Inductive methods, for instance in \anica, strive to preserve the properties of
a basic block through successive abstractions of the instructions it contains,
so as to draw the most general conclusions possible from a particular
experiment. Currently, \anica{} starts from randomly generated basic blocks.
This approach guarantees a certain variety and avoids over-specialization,
which would prevent it from finding interesting cases too far from an initial
dataset. However, it may well lead to the samples under consideration being
systematically outside the relevant area of the search space ---~\ie{} having
no relation to real-life programs or to those in the user's application domain.
On the other hand, machine-learning methods based on neural networks, for
instance in \ithemal, seek to correlate the result of a function with the
characteristics of its input ---~in this case, to correlate a throughput
prediction with the instructions making up a basic block~--- by backpropagating
the gradient of a cost function. \ithemal{} itself is trained on benchmarks
originating from an existing benchmark suite. As opposed to random generation,
this approach offers representative samples, but comes with a risk of lacking
variety and of over-specialization.
Comparatively, our microbenchmark generation method is meant, by design, to
produce a large, varied and representative dataset. We believe that enriching
the dataset of the above-mentioned methods with our benchmarks might extend
their results and reach.

2
manuscrit/50_CesASMe/99_conclusion.tex
Normal file
@ -0,0 +1,2 @@
%% \section*{Conclusion}
%% \todo{}

@ -1 +1,9 @@
\chapter{A more systematic approach to throughput prediction performance analysis}

\input{00_intro.tex}
\input{05_related_works.tex}
\input{10_bench_gen.tex}
\input{20_evaluation.tex}
\input{25_results_analysis.tex}
\input{30_future_works.tex}
\input{99_conclusion.tex}

82
manuscrit/50_CesASMe/overview.tex
Normal file
@ -0,0 +1,82 @@
\begin{figure*}[ht!]
    \definecolor{col_bench_gen}{HTML}{5a7eff}
    \definecolor{col_bench_gen_bg}{HTML}{dbeeff}
    \definecolor{col_bench_harness}{HTML}{ffa673}
    \definecolor{col_results}{HTML}{000000}
    \centerline{
    \begin{tikzpicture}[
        hiddennode/.style={rectangle,draw=white, very thick, minimum size=5mm, align=center, font=\footnotesize},
        normnode/.style={rectangle,draw=black, very thick, minimum size=5mm, align=center, font=\footnotesize},
        resultnode/.style={rectangle,draw=col_results, fill=black!2, very thick, minimum size=5mm, align=center, font=\footnotesize},
        bluenode/.style={rectangle, draw=col_bench_gen, fill=col_bench_gen_bg, very thick, minimum height=5mm, minimum width=4cm, align=center, font=\footnotesize},
        rednode/.style={rectangle, draw=col_bench_harness, fill=orange!5, very thick, minimum size=5mm, align=center, font=\footnotesize},
        bencher/.style={rednode, minimum width=2.5cm, minimum height=5mm},
        genarrow/.style={draw=col_bench_gen},
        harnarrow/.style={draw=col_bench_harness},
    ]
    \centering
    % Nodes
    \node[bluenode] (bench) {Benchmark suite \figref{ssec:bench_suite}};
    \node[bluenode] (pocc) [below=of bench] {Loop nest optimizers \figref{ssec:loop_nest_optimizer}};
    \node[bluenode] (kernel) [below=of pocc] {Constraining utility \figref{ssec:kernelify}};
    \node[bluenode] (gcc) [below=of kernel] {Compilations \figref{ssec:compile}};
    \node[rednode] (gdb) [right=0.1\textwidth of gcc] {Basic block \\extraction \figref{ssec:bb_extr}};
    \node[bencher] (ithemal) [right=4cm of gdb] {Ithemal};
    \node[bencher] (iaca) [above=0.5em of ithemal] {IACA};
    \node[bencher] (uica) [above=0.5em of iaca] {uiCA};
    \node[bencher] (llvm) [above=0.5em of uica] {llvm-mca};
    \node[bencher] (bhive) [above=0.5em of llvm] {BHive (measure)};
    \node[rednode] (ppapi) [left=1cm of bhive] {perf (measure)};
    \node[rednode] (gus) [below=0.5em of ppapi] {Gus};
    %% \node[rednode] (uica) [below=of gdb] {uiCA};
    \node[rednode] (lifting) [right=of bhive] {
        Prediction lifting\\\figref{ssec:harness_lifting}};
    \node[
        draw=black,
        very thick,
        dotted,
        fit=(ppapi) (gus) (bhive) (llvm) (uica) (iaca) (ithemal)
    ] (comps) {};
    \node (throughput_label) [above=0.2em of comps,align=center] {
        \footnotesize Throughput predictions \\\footnotesize \& measures
        \figref{ssec:throughput_pred_meas}};
    \node[draw=black,
        very thick,
        dotted,
        %% label={below:\footnotesize Variations},
        label={[above,xshift=1cm]\footnotesize Variations},
        fit=(pocc) (kernel) (gcc)
    ] (vars) {};
    \node[resultnode] (bench2) [below=of lifting] {Evaluation metrics \\ for
        code analyzers};

    % Key
    \node[] (keyblue1) [below left=0.7cm and 0cm of vars] {};
    \node[hiddennode] (keyblue2) [right=0.5cm of keyblue1] {Section~\ref{sec:bench_gen}~: generating microbenchmarks};
    \node[] (keyred1) [right=0.6cm of keyblue2] {};
    \node[hiddennode] (keyred2) [right=0.5cm of keyred1] {Section~\ref{sec:bench_harness}~: benchmarking harness};
    \node[] (keyresult1) [right=0.6cm of keyred2] {};
    \node[hiddennode] (keyresult2) [right=0.5cm of keyresult1]
        {Section~\ref{sec:results_analysis}~: results analysis};

    % Lines
    \draw[-, very thick, harnarrow] (keyred1.east) -- (keyred2.west);
    \draw[-, very thick, genarrow] (keyblue1.east) -- (keyblue2.west);
    \draw[-, very thick] (keyresult1.east) -- (keyresult2.west);
    \draw[->, very thick, genarrow] (bench.south) -- (pocc.north);
    \draw[->, very thick, genarrow] (pocc.south) -- (kernel.north);
    \draw[->, very thick, genarrow] (kernel.south) -- (gcc.north);
    \draw[->, very thick, genarrow] (gcc.east) -- (gdb.west);
    \draw[->, very thick, genarrow] (gcc.east) -- (ppapi.west);
    \draw[->, very thick, genarrow] (gcc.east) -- (gus.west);
    \draw[->, very thick, harnarrow] (gdb.east) -- (uica.west);
    \draw[->, very thick, harnarrow] (gdb.east) -- (iaca.west);
    \draw[->, very thick, harnarrow] (gdb.east) -- (ithemal.west);
    \draw[->, very thick, harnarrow] (gdb.east) -- (bhive.west);
    \draw[->, very thick, harnarrow] (gdb.east) -- (llvm.west);
    \draw[->, very thick, harnarrow] (comps.east|-lifting) -- (lifting.west);
    \draw[->, very thick] (lifting.south) -- (bench2.north);
    \end{tikzpicture}
    }
    \caption{Our analysis and measurement environment.\label{fig:contrib}}
\end{figure*}

1
manuscrit/assets/imgs/50_CesASMe/.gitignore
vendored
Normal file
@ -0,0 +1 @@
!*.pdf

BIN
manuscrit/assets/imgs/50_CesASMe/nomemdeps_boxplot.pdf
Normal file
Binary file not shown.

BIN
manuscrit/assets/imgs/50_CesASMe/overall_analysis_boxplot.pdf
Normal file
Binary file not shown.

BIN
manuscrit/assets/imgs/50_CesASMe/results_comparability_hist.pdf
Normal file
Binary file not shown.