CesASMe: another integration pass
parent b44b6fcf10
commit 4c4f70c246

8 changed files with 133 additions and 93 deletions
@@ -40,13 +40,19 @@ assess more generally the landscape of code analyzers. What are the key
bottlenecks to account for if one aims to predict the execution time of a
kernel correctly? Are some of these badly accounted for by state-of-the-art
code analyzers? This chapter, by conducting a broad experimental analysis of
these tools, strives to answer these questions.

\input{overview}

\bigskip{}

In \autoref{sec:redefine_exec_time}, we investigate how a kernel's execution time
may be measured if we want to correctly account for its dependencies. We
advocate for the measurement of the total execution time of a computation
kernel in its original context, coupled with a precise measure of its number of
iterations to normalize this measurement.

We then present a fully-tooled solution to evaluate and compare the
diversity of static throughput predictors. Our tool, \cesasme, solves two main
issues in this direction. In Section~\ref{sec:bench_gen}, we describe how
\cesasme{} generates a wide variety of computation kernels stressing different
@@ -56,8 +62,6 @@ Polybench~\cite{bench:polybench}, a C-level benchmark suite representative of
scientific computation workloads, that we combine with a variety of
optimisations, including polyhedral loop transformations.

In Section~\ref{sec:bench_harness}, we describe how \cesasme{} is able to
evaluate throughput predictors on this set of benchmarks by lifting their
predictions to a total number of cycles that can be compared to a hardware
@@ -71,63 +75,3 @@ analyze the results of \cesasme{}.
to investigate analyzers' flaws. We show that code
analyzers do not always correctly model data dependencies through memory
accesses, substantially impacting their precision.
[The lines removed here (the section ``Re-defining the execution time of a kernel'') reappear, lightly edited, in the new file manuscrit/50_CesASMe/02_measuring_exec_time.tex below.]
manuscrit/50_CesASMe/02_measuring_exec_time.tex (new file, 64 lines)
@@ -0,0 +1,64 @@
\section{Re-defining the execution time of a
kernel}\label{sec:redefine_exec_time}

We saw above that state-of-the-art code analyzers disagreed by up to 100\,\% on
the execution time of a relatively simple kernel. The obvious solution to
assess their predictions is to compare them to an actual measure. However,
accounting for dependencies at the scale of a basic block makes this
\emph{actual measure} not as trivially defined as it would seem. Take for
instance the following kernel:

\begin{minipage}{0.90\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
mov (%rax, %rcx, 1), %r10
mov %r10, (%rbx, %rcx, 1)
add $8, %rcx
\end{lstlisting}
\end{minipage}

\noindent{}At first, it looks like an array copy from location \reg{rax} to
\reg{rbx}. Yet, if before the loop, \reg{rbx} is initialized to
\reg{rax}\texttt{+8}, there is a read-after-write dependency between the first
instruction and the second instruction at the previous iteration, which makes
the throughput drop significantly. As we shall see in
Section~\ref{ssec:bhive_errors}, \emph{without proper context, a basic
block's throughput is not well-defined}.
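
To make the dependence explicit, consider the accesses of two consecutive
iterations (a sketch with illustrative indices, writing $A$ for the
8-byte-element array starting at \reg{rax}):
\[
\text{iteration } i:\ \text{load } A[i],\ \text{store } A[i+1];\qquad
\text{iteration } i+1:\ \text{load } A[i+1],\ \text{store } A[i+2].
\]
\noindent{}Each load reads the value stored by the previous iteration, so the
loop-carried chain serializes iterations that could otherwise overlap.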

To recover the context of each basic block, we reason instead at the scale of
a C source code. This
makes the measures unambiguous: one can use hardware counters to measure the
elapsed cycles during a loop nest. This requires a suite of benchmarks, in C,
that is both representative of the domain studied and wide enough to have a
good coverage of the domain. However, this is not in itself sufficient to
evaluate static tools: on the preceding matrix multiplication kernel, counters
report 80,059 elapsed cycles ---~for the total loop.
This number can hardly be compared to the \llvmmca{}, \iaca{}, \ithemal{}, and \uica{}
basic block-level predictions seen above.

A common practice to make these numbers comparable is to renormalize them to
instructions per cycle (IPC). Here, \llvmmca{} reports an IPC of
$\sfrac{7}{1.5}~=~4.67$, \iaca{} and \ithemal{} report an IPC of
$\sfrac{7}{2}~=~3.5$, and \uica{} reports an IPC of $\sfrac{7}{3}~=~2.3$. In this
case, the measured IPC is 3.45, which is closest to \iaca{} and \ithemal. Yet,
IPC is a metric for microarchitectural load, and \textit{tells nothing about a
kernel's efficiency}. Indeed, the static number of instructions is affected by
many compiler passes, such as scalar evolution, strength reduction, register
allocation, instruction selection\ldots{} Thus, when comparing two compiled
versions of the same code, IPC alone does not necessarily point to the most
efficient version. For instance, a kernel using SIMD instructions will use
fewer instructions than one using only scalars, and thus exhibit a lower or
constant IPC; yet, its performance will unquestionably increase.
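
Concretely, with illustrative numbers rather than measurements: a loop summing
$1\,024$ doubles retires about $1\,024$ scalar additions; at an IPC of $4$, it
needs roughly $256$ cycles. An AVX2 version packs four doubles per instruction
and retires only $256$ additions; even at a lower IPC of $2$, it finishes in
roughly $128$ cycles:
\[
\text{cycles} \simeq \frac{\#\text{instructions}}{\text{IPC}}:\qquad
\frac{1\,024}{4} = 256 \ \text{(scalar)},\qquad
\frac{256}{2} = 128 \ \text{(AVX2)}.
\]
\noindent{}The ``worse'' IPC here belongs to the faster kernel.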

The total number of cycles elapsed to solve a given problem, on the other
hand, is a sound metric of the efficiency of an implementation. We thus
instead \emph{lift} the predictions at basic-block level to a total number of
cycles. In simple cases, this simply means multiplying the block-level
prediction by the number of loop iterations; however, this bound might not
generally be known. More importantly, the compiler may apply any number of
transformations: unrolling, for instance, changes this number. Control flow may
also be complicated by code versioning.
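
In its simplest form (a sketch of the idea; the actual procedure is detailed in
\autoref{sec:bench_harness}), the lifted prediction for a kernel $\mathcal{K}$
weights each basic block's predicted cost by the number of times it actually
executes:
\[
\widehat{C}(\mathcal{K}) \;=\; \sum_{b \in \mathcal{K}} n_b \cdot \widehat{c}_b,
\]
\noindent{}where $n_b$ is the measured number of occurrences of block $b$ and
$\widehat{c}_b$ the analyzer's predicted cycle count for one execution of $b$.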

Instead of guessing this final number of iterations at the assembly level, a
sounder alternative is to measure it on the final binary. In
\autoref{sec:bench_harness}, we present our solution to do so, using \gdb{} to
instrument an execution of the binary.
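
As a rough sketch of what such an instrumentation can look like (using
\gdb{}'s Python API; the block addresses and script layout below are
illustrative, not the exact harness of \autoref{sec:bench_harness}), one can
count how many times each basic block is entered without ever stopping the
program:

\begin{minipage}{0.90\linewidth}
\begin{lstlisting}[language=Python]
# count_blocks.py -- sketch; run as: gdb -batch -x count_blocks.py ./kernel
import gdb

# First-instruction addresses of the blocks of interest (hypothetical values).
BLOCK_ENTRIES = ["*0x401180", "*0x4011a0", "*0x4011c8"]

class CountingBreakpoint(gdb.Breakpoint):
    """Breakpoint that counts its hits and never halts execution."""
    def __init__(self, spec):
        super().__init__(spec)
        self.hits = 0

    def stop(self):
        self.hits += 1
        return False   # do not stop: just count and keep running

breakpoints = [CountingBreakpoint(spec) for spec in BLOCK_ENTRIES]
gdb.execute("run")
for bp in breakpoints:
    print(f"{bp.location}: {bp.hits} occurrences")
\end{lstlisting}
\end{minipage}

\noindent{}The real harness derives the breakpoint addresses from the compiled
binary instead of hard-coding them, and assumes the program under analysis to
be deterministic (a point we come back to in \autoref{sec:bench_harness}).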

@@ -1,20 +1,50 @@
\section{Related works}

\paragraph{Another comparative study: \anica{}.} The \anica{}
framework~\cite{anica} also attempts to comparatively evaluate various throughput predictors by
finding examples on which they are inaccurate. \anica{} starts with randomly
generated assembly snippets fed to various code analyzers. Once it finds a
snippet on which (some) code analyzers yield unsatisfying results, it refines
it through a process derived from abstract interpretation to reach a
more general category of input, \eg{} ``a load to a SIMD register followed by a
SIMD arithmetic operation''.

\paragraph{A dynamic code analyzer: \gus{}.}
So far, this manuscript was mostly concerned with static code analyzers.
Throughput prediction tools, however, are not all static.
\gus{} is a dynamic tool first introduced in \fgruber{}'s PhD
thesis~\cite{phd:gruber}. It leverages \qemu{}'s instrumentation capabilities to
dynamically predict the throughput of user-defined regions of interest in a whole
program.
In these regions, it instruments every instruction, memory access, \ldots{} in
order to retrieve the exact events occurring during the program's
execution. \gus{} then leverages throughput, latency and microarchitectural
models to analyze resource usage and produce an accurate theoretical elapsed
cycles prediction.

Its main strength, however, resides in its \emph{sensitivity analysis}
capabilities: by applying an arbitrary factor to some parts of the model (\eg{}
latencies, arithmetic ports, \ldots{}), it is possible to investigate the
impact of a specific resource on the final execution time of a region of
interest. It can also accurately determine if a resource is actually a
bottleneck for a region, \ie{} if increasing this resource's capabilities would
reduce the execution time. The output of \gus{} on a region of interest
provides a very detailed insight into each instruction's resource consumption and
its contribution to the final execution time. As a dynamic analysis tool, it
is also able to extract the dependencies an instruction exhibits on a real run.

The main downside of \gus{}, however, is its slowness. Like most dynamic tools,
it suffers from a heavy slowdown compared to a native execution of the binary,
oftentimes about $100\times$ slower. While it remains a precious tool for a
user willing to deeply optimize an execution kernel, this slowdown makes
\gus{} highly impractical to run on a large collection of execution kernels.

\paragraph{An isolated basic-block profiler: \bhive{}.} In
\autoref{sec:redefine_exec_time} above, we advocated for measuring a basic
block's execution time \emph{in-context}. The \bhive{} profiler~\cite{bhive},
initially written by \ithemal{}'s authors~\cite{ithemal} to provide their model
with sufficient ---~and sufficiently accurate~--- training data, takes an
orthogonal approach to basic block throughput measurement. By mapping memory at
any address accessed by a basic block, it can effectively run and measure
arbitrary code without context, often ---~but not always, as we discuss
later~--- yielding good results.
@@ -25,7 +25,7 @@ jump site.

To accurately obtain the occurrences of each basic block in the whole kernel's
computation,
we then instrument it with \gdb{} by placing a breakpoint
at each basic block's first instruction in order to count the occurrences
of each basic block between two calls to the \perf{} counters\footnote{We
assume the program under analysis to be deterministic.}. While this
@@ -60,7 +60,7 @@ markers prevent a binary from being run by overwriting registers with arbitrary
values. This forces a user to run and measure a version which is different from
the analyzed one. In our harness, we circumvent this issue by adding markers
directly at the assembly level, editing the already compiled version. Our
\gdb{} instrumentation procedure also respects this principle of
single-compilation. As \qemu{} breaks the \perf{} interface, we have to run
\gus{} with a preloaded stub shared library to be able to instrument binaries
containing calls to \perf.
@@ -11,9 +11,10 @@ predictions comparable to baseline hardware counter measures.
\subsection{Experimental environment}

The experiments presented in this paper were all realized on a Dell PowerEdge
C6420 machine, from the \textit{Dahu} cluster of Grid5000 in
Grenoble~\cite{grid5000}. The server is equipped with 192\,GB of DDR4 SDRAM
---~only a small fraction of which was used~--- and two Intel Xeon Gold 6130
CPUs (x86-64, Skylake microarchitecture) with 16 cores each.

The experiments themselves were run inside a Docker environment based on Debian
Bullseye. Care was taken to disable hyperthreading to improve measurements
@@ -72,11 +72,11 @@ through hardware counters, an excellent accuracy is expected. Its lack of
support for control flow instructions can be held accountable for a portion of
this accuracy drop; our lifting method, based on block occurrences instead of
paths, can explain another portion. We also find that \bhive{} fails to produce
a result in about 40\,\% of the kernels explored ---~which means that, for
those cases, \bhive{} failed to produce a result on at least one of the
constituent basic blocks. In fact, this is due to the difficulties we mentioned
in \autoref{sec:redefine_exec_time} earlier, related to the need to reconstruct
the context of each basic block \textit{ex nihilo}.

The basis of \bhive's method is to run the code to be measured, unrolled a
number of times depending on the code size, with all memory pages but the
@@ -96,7 +96,7 @@ initial value can be of crucial importance.
The following experiments are executed on an Intel(R) Xeon(R) Gold 6230R CPU
(Cascade Lake), with hyperthreading disabled.

\paragraph{Imprecise analysis.} We consider the following x86-64 kernel.

\begin{minipage}{0.95\linewidth}
\begin{lstlisting}[language={[x86masm]Assembler}]
@@ -122,7 +122,7 @@ influence the results whenever it gets loaded into registers.

\vspace{0.5em}

\paragraph{Failed analysis.} Some memory accesses will always result in an
error; for instance, it is impossible to \texttt{mmap} at an address lower
than \texttt{/proc/sys/vm/mmap\_min\_addr}, defaulting to \texttt{0x10000}. Thus,
with equal initial values for all registers, the following kernel would fail,
@@ -47,7 +47,7 @@ These perspectives can also be seen as future works:

\smallskip

\paragraph{Program optimization.} The whole program processing we have designed
can be used not only to evaluate the performance model underlying a static
analyzer, but also to guide program optimization itself. In such a perspective,
we would generate different versions of the same program using the
@@ -70,7 +70,7 @@ suffices; this however would require to control L1-residence otherwise.

\smallskip

\paragraph{Dataset building.} Our microbenchmark generation phase outputs a
large, diverse and representative dataset of microkernels. In addition to our
harness, we believe that such a dataset could be used to improve existing
data-dependent solutions.
@ -2,6 +2,7 @@
|
||||||
analysis: \cesasme{}}
|
analysis: \cesasme{}}
|
||||||
|
|
||||||
\input{00_intro.tex}
|
\input{00_intro.tex}
|
||||||
|
\input{02_measuring_exec_time.tex}
|
||||||
\input{05_related_works.tex}
|
\input{05_related_works.tex}
|
||||||
\input{10_bench_gen.tex}
|
\input{10_bench_gen.tex}
|
||||||
\input{15_harness.tex}
|
\input{15_harness.tex}
|
||||||
|
|