From 6a24e7a4c587ba79a61d04785ad42f2889918a79 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Th=C3=A9ophile=20Bastian?= <contact@tobast.fr>
Date: Wed, 27 Dec 2023 20:14:44 +0100
Subject: [PATCH] Foundations: start writeup on code analyzers

---
 manuscrit/20_foundations/10_cpu_arch.tex      |  2 +-
 .../20_foundations/20_code_analyzers.tex      | 90 +++++++++++++++++++
 manuscrit/20_foundations/main.tex             |  1 +
 manuscrit/biblio/tools.bib                    | 37 ++++++++
 4 files changed, 129 insertions(+), 1 deletion(-)
 create mode 100644 manuscrit/20_foundations/20_code_analyzers.tex

diff --git a/manuscrit/20_foundations/10_cpu_arch.tex b/manuscrit/20_foundations/10_cpu_arch.tex
index b0a6cea..1d64a45 100644
--- a/manuscrit/20_foundations/10_cpu_arch.tex
+++ b/manuscrit/20_foundations/10_cpu_arch.tex
@@ -214,7 +214,7 @@ It is also important to note that out-of-order processors are only
 out-of-order \emph{from a certain point on}: a substantial part of the
 processor's frontend is typically still in-order.
 
-\subsubsection{Hardware counters}
+\subsubsection{Hardware counters}\label{sssec:hw_counters}
 
 Many processors provide \emph{hardware counters}, to help (low-level)
 programmers understand how their code is executed. The counters available
diff --git a/manuscrit/20_foundations/20_code_analyzers.tex b/manuscrit/20_foundations/20_code_analyzers.tex
new file mode 100644
index 0000000..ca6a3dd
--- /dev/null
+++ b/manuscrit/20_foundations/20_code_analyzers.tex
@@ -0,0 +1,90 @@
+\section{Kernel optimization and code analyzers}
+
+Optimizing a program, in most contexts, mainly means optimizing it from an
+algorithmic point of view ---~using efficient data structures, running some
+computations in parallel on multiple cores, etc. As pointed out in our
+introduction, though, optimizations close to the machine's microarchitecture
+can yield large efficiency benefits, sometimes up to two orders of
+magnitude~\cite{dgemm_finetune}. These optimizations, however, are difficult
+to carry out for multiple reasons: they depend on the specific machine on
+which the code is run; they require deep expert knowledge; and they are most
+often done by hand, requiring expert time ---~and are thus expensive.
+
+Such optimizations are nevertheless routinely applied in some domains.
+Scientific computations ---~ocean simulation, weather forecasting,
+\ldots{}~--- often rely on the same basic operations, implemented by
+low-level libraries optimized in this way, such as
+OpenBLAS~\cite{openblas_webpage, openblas_2013} or Intel's
+MKL~\cite{intel_mkl}, which implement low-level mathematical operations such
+as linear algebra. Machine learning applications, on the other hand, are
+typically trained for extensive periods of time, on many cores and
+accelerators, on well-defined hardware, with small portions of code executed
+many times on different data; as such, they are well suited to this kind of
+specific, low-level optimization.
+
+\medskip{}
+
+When optimizing these short fragments of code whose efficiency is critical,
+or \emph{computation kernels}, insights into what limits the code's
+performance, or \emph{performance bottlenecks}, are precious to the expert.
+These insights can be gained by reading the processor's hardware counters,
+described above in \autoref{sssec:hw_counters}, which are typically accurate
+but of limited versatility. Specialized profilers, such as Intel's
+VTune~\cite{intel_vtune}, integrate these counters with profiling to derive
+further performance metrics at runtime.
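+As a purely illustrative example ---~not taken from the libraries cited
+above~---, a computation kernel may be as small as an \texttt{axpy}-style
+loop, one of the basic BLAS operations:
+
+\begin{verbatim}
+#include <stddef.h>
+
+/* y <- a*x + y: the loop below is the kind of kernel a code
+   analyzer reasons about. */
+void saxpy(size_t n, float a, const float *x, float *y)
+{
+    for (size_t i = 0; i < n; i++)
+        y[i] = a * x[i] + y[i];
+}
+\end{verbatim}
+
+On such a kernel, microarchitecture-level optimization is concerned, for
+instance, with how the loop body is vectorized and how its instructions are
+scheduled on the processor's execution units, rather than with the (trivial)
+algorithm itself.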
+
+\subsection{Code analyzers}
+
+Another approach is to rely on \emph{code analyzers}: pieces of software that
+analyze a code fragment ---~typically at the assembly or binary level~--- and
+provide insights into its performance on given hardware. Code analyzers work
+statically, that is, without executing the code.
+
+\paragraph{Common hypotheses.} Code analyzers operate under a common set of
+hypotheses, derived from their typical intended usage.
+
+The kernel analyzed is expected to be the body of a loop, or nest of loops,
+iterated enough times to be well approximated by an infinite loop. The
+kernel is further analyzed under the assumption that it is in
+\emph{steady-state}; startup and border effects occurring in extremal cases
+are thus ignored. As the kernels analyzed are those worth optimizing
+manually, it is reasonable to assume that they are executed many times, and
+to focus on their steady-state behaviour.
+
+The kernel is further assumed to be \emph{L1-resident}, that is, to work only
+on data that resides in the L1 cache. This assumption is reasonable for two
+reasons. First, if data must be fetched from more distant cache levels, or
+even from main memory, these fetch operations are orders of magnitude slower
+than the computation being analyzed, making it pointless to optimize the
+kernel for CPU efficiency ---~the expert should, in this case, focus instead
+on data locality, prefetching, etc. Second, code analyzers typically focus
+only on the CPU itself, and ignore memory effects. This hypothesis formalizes
+that focus: a code analyzer's metrics are to be read \textit{assuming the CPU
+is the bottleneck}.
+
+Code analyzers also disregard control flow, and thus assume the code to be
+\emph{straight-line code}: the kernel analyzed is treated as a sequence of
+instructions with no influence on control flow, executed in order, and
+jumping unconditionally back to the first instruction after the last ---~or,
+more accurately, the final jump is always assumed taken, while any
+control-flow instruction in the middle is assumed not taken, although its
+computational cost is still accounted for.
+
+\paragraph{Metrics produced.} The insights provided as output vary from one
+code analyzer to another. All of them are able to predict either the
+throughput or the reciprocal throughput ---~defined below~--- of the kernel
+studied, that is, how many cycles one iteration of the loop takes, on average
+and in steady-state. Although throughput can be measured at runtime with
+hardware counters, a static estimation ---~if reliable~--- is already an
+improvement, as a static analyzer typically runs much faster than the actual
+program under a profiler.
+
+Each code analyzer relies on a model, or a collection of models, of the
+hardware for which it provides analyses. Depending on what is, or is not,
+modelled by a specific code analyzer, it may further extract any available
+and relevant metric from its model: whether the frontend is saturated, which
+backend computation units are stressed and by which precise instructions,
+when the CPU stalls and why, etc. Code analyzers may further point towards
+the resources limiting the kernel's performance, or \emph{bottlenecks}.
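+
+One way to make this steady-state notion precise ---~the notation $C(n)$ is
+ours, and individual tools may define it slightly differently~--- is the
+following: if $C(n)$ denotes the number of cycles needed to execute $n$
+successive iterations of the kernel, its reciprocal throughput may be seen as
+\[
+  \bar{t} \;=\; \lim_{n \to \infty} \frac{C(n)}{n}
+\]
+expressed in cycles per iteration, the throughput itself being its inverse
+$1/\bar{t}$, in iterations per cycle.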
+
+
+\paragraph{Static vs.\ dynamic analyzers.}
diff --git a/manuscrit/20_foundations/main.tex b/manuscrit/20_foundations/main.tex
index f732f0e..63b3a0d 100644
--- a/manuscrit/20_foundations/main.tex
+++ b/manuscrit/20_foundations/main.tex
@@ -2,3 +2,4 @@
 
 \input{00_intro.tex}
 \input{10_cpu_arch.tex}
+\input{20_code_analyzers.tex}
diff --git a/manuscrit/biblio/tools.bib b/manuscrit/biblio/tools.bib
index a62f603..af8ecc5 100644
--- a/manuscrit/biblio/tools.bib
+++ b/manuscrit/biblio/tools.bib
@@ -91,3 +91,40 @@
   howpublished={\url{https://www.qemu.org}}
 }
 
+% OpenBLAS
+@inproceedings{openblas_2013,
+  author = {Wang, Qian and Zhang, Xianyi and Zhang, Yunquan and Yi, Qing},
+  title = {AUGEM: Automatically Generate High Performance Dense Linear Algebra Kernels on X86 CPUs},
+  year = {2013},
+  isbn = {9781450323789},
+  publisher = {Association for Computing Machinery},
+  address = {New York, NY, USA},
+  url = {https://doi.org/10.1145/2503210.2503219},
+  doi = {10.1145/2503210.2503219},
+  abstract = {Basic Liner algebra subprograms (BLAS) is a fundamental library in scientific computing. In this paper, we present a template-based optimization framework, AUGEM, which can automatically generate fully optimized assembly code for several dense linear algebra (DLA) kernels, such as GEMM, GEMV, AXPY and DOT, on varying multi-core CPUs without requiring any manual interference from developers. In particular, based on domain-specific knowledge about algorithms of the DLA kernels, we use a collection of parameterized code templates to formulate a number of commonly occurring instruction sequences within the optimized low-level C code of these DLA kernels. Then, our framework uses a specialized low-level C optimizer to identify instruction sequences that match the pre-defined code templates and thereby translates them into extremely efficient SSE/AVX instructions. The DLA kernels generated by our template-based approach surpass the implementations of Intel MKL and AMD ACML BLAS libraries, on both Intel Sandy Bridge and AMD Piledriver processors.},
+  booktitle = {Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis},
+  articleno = {25},
+  numpages = {12},
+  keywords = {auto-tuning, code generation, DLA code optimization},
+  location = {Denver, Colorado},
+  series = {SC '13}
+}
+
+@misc{openblas_webpage,
+  title = {{OpenBLAS}: an optimized {BLAS} library},
+  author = {Zhang, Xianyi},
+  howpublished = {\url{https://www.openblas.net}}
+}
+
+@misc{intel_mkl,
+  title = {oneAPI Math Kernel Library ({oneMKL})},
+  author = {{Intel}},
+  howpublished = {\url{https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl.html}},
+  year = {2003},
+}
+
+@misc{intel_vtune,
+  title = {{VTune} profiler},
+  author = {{Intel}},
+  howpublished = {\url{https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler.html}},
+}