commit 16c00db4bb607a7b42359858386fd54392f90377 Author: Linus Torvalds Date: Tue May 15 10:48:36 2018 -0700 Merge tag 'afs-fixes-20180514' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs Pull AFS fixes from David Howells: "Here's a set of patches that fix a number of bugs in the in-kernel AFS client, including: - Fix directory locking to not use individual page locks for directory reading/scanning but rather to use a semaphore on the afs_vnode struct as the directory contents must be read in a single blob and data from different reads must not be mixed as the entire contents may be shuffled about between reads. - Fix address list parsing to handle port specifiers correctly. - Only give up callback records on a server if we actually talked to that server (we might not be able to access a server). - Fix some callback handling bugs, including refcounting, whole-volume callbacks and when callbacks actually get broken in response to a CB.CallBack op. - Fix some server/address rotation bugs, including giving up if we can't probe a server; giving up if a server says it doesn't have a volume, but there are more servers to try. - Fix the decoding of fetched statuses to be OpenAFS compatible. - Fix the handling of server lookups in Cache Manager ops (such as CB.InitCallBackState3) to use a UUID if possible and to handle no server being found. - Fix a bug in server lookup where not all addresses are compared. - Fix the non-encryption of calls that prevents some servers from being accessed (this also requires an AF_RXRPC patch that has already gone in through the net tree). There's also a patch that adds tracepoints to log Cache Manager ops that don't find a matching server, either by UUID or by address" * tag 'afs-fixes-20180514' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: afs: Fix the non-encryption of calls afs: Fix CB.CallBack handling afs: Fix whole-volume callback handling afs: Fix afs_find_server search loop afs: Fix the handling of an unfound server in CM operations afs: Add a tracepoint to record callbacks from unlisted servers afs: Fix the handling of CB.InitCallBackState3 to find the server by UUID afs: Fix VNOVOL handling in address rotation afs: Fix AFSFetchStatus decoder to provide OpenAFS compatibility afs: Fix server rotation's handling of fileserver probe failure afs: Fix refcounting in callback registration afs: Fix giving up callbacks on server destruction afs: Fix address list parsing afs: Fix directory page locking diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..3e5135d --- /dev/null +++ b/.gitignore @@ -0,0 +1,36 @@ +PERF-CFLAGS +PERF-GUI-VARS +PERF-VERSION-FILE +FEATURE-DUMP +perf +perf-read-vdso32 +perf-read-vdsox32 +perf-help +perf-record +perf-report +perf-stat +perf-top +perf*.1 +perf*.xml +perf*.html +common-cmds.h +perf.data +perf.data.old +output.svg +perf-archive +perf-with-kcore +tags +TAGS +cscope* +config.mak +config.mak.autogen +*-bison.* +*-flex.* +*.pyc +*.pyo +.config-detected +util/intel-pt-decoder/inat-tables.c +arch/*/include/generated/ +trace/beauty/generated/ +pmu-events/pmu-events.c +pmu-events/jevents diff --git a/Build b/Build new file mode 100644 index 0000000..e5232d5 --- /dev/null +++ b/Build @@ -0,0 +1,55 @@ +perf-y += builtin-bench.o +perf-y += builtin-annotate.o +perf-y += builtin-config.o +perf-y += builtin-diff.o +perf-y += builtin-evlist.o +perf-y += builtin-ftrace.o +perf-y += builtin-help.o +perf-y += builtin-sched.o +perf-y += builtin-buildid-list.o +perf-y += builtin-buildid-cache.o +perf-y += builtin-kallsyms.o +perf-y += builtin-list.o +perf-y += builtin-record.o +perf-y += builtin-report.o +perf-y += builtin-stat.o +perf-y += builtin-timechart.o +perf-y += builtin-top.o +perf-y += builtin-script.o +perf-y += builtin-kmem.o +perf-y += builtin-lock.o +perf-y += builtin-kvm.o +perf-y += builtin-inject.o +perf-y += builtin-mem.o +perf-y += builtin-data.o +perf-y += builtin-version.o +perf-y += builtin-c2c.o + +perf-$(CONFIG_TRACE) += builtin-trace.o +perf-$(CONFIG_LIBELF) += builtin-probe.o + +perf-y += bench/ +perf-y += tests/ + +perf-y += perf.o + +paths += -DPERF_HTML_PATH="BUILD_STR($(htmldir_SQ))" +paths += -DPERF_INFO_PATH="BUILD_STR($(infodir_SQ))" +paths += -DPERF_MAN_PATH="BUILD_STR($(mandir_SQ))" + +CFLAGS_builtin-help.o += $(paths) +CFLAGS_builtin-timechart.o += $(paths) +CFLAGS_perf.o += -DPERF_HTML_PATH="BUILD_STR($(htmldir_SQ))" \ + -DPERF_EXEC_PATH="BUILD_STR($(perfexecdir_SQ))" \ + -DPREFIX="BUILD_STR($(prefix_SQ))" +CFLAGS_builtin-trace.o += -DSTRACE_GROUPS_DIR="BUILD_STR($(STRACE_GROUPS_DIR_SQ))" +CFLAGS_builtin-report.o += -DTIPDIR="BUILD_STR($(tipdir_SQ))" +CFLAGS_builtin-report.o += -DDOCDIR="BUILD_STR($(srcdir_SQ)/Documentation)" + +libperf-y += util/ +libperf-y += arch/ +libperf-y += ui/ +libperf-y += scripts/ +libperf-$(CONFIG_TRACE) += trace/beauty/ + +gtk-y += ui/gtk/ diff --git a/CREDITS b/CREDITS new file mode 100644 index 0000000..c2ddcb3 --- /dev/null +++ b/CREDITS @@ -0,0 +1,30 @@ +Most of the infrastructure that 'perf' uses here has been reused +from the Git project, as of version: + + 66996ec: Sync with 1.6.2.4 + +Here is an (incomplete!) list of main contributors to those files +in util/* and elsewhere: + + Alex Riesen + Christian Couder + Dmitry Potapov + Jeff King + Johannes Schindelin + Johannes Sixt + Junio C Hamano + Linus Torvalds + Matthias Kestenholz + Michal Ostrowski + Miklos Vajna + Petr Baudis + Pierre Habouzit + René Scharfe + Samuel Tardieu + Shawn O. Pearce + Steffen Prohaska + Steve Haslam + +Thanks guys! + +The full history of the files can be found in the upstream Git commits. diff --git a/Documentation/Build.txt b/Documentation/Build.txt new file mode 100644 index 0000000..f6fc650 --- /dev/null +++ b/Documentation/Build.txt @@ -0,0 +1,49 @@ + +1) perf build +============= +The perf build process consists of several separated building blocks, +which are linked together to form the perf binary: + - libperf library (static) + - perf builtin commands + - traceevent library (static) + - GTK ui library + +Several makefiles govern the perf build: + + - Makefile + top level Makefile working as a wrapper that calls the main + Makefile.perf with a -j option to do parallel builds. + + - Makefile.perf + main makefile that triggers build of all perf objects including + installation and documentation processing. + + - tools/build/Makefile.build + main makefile of the build framework + + - tools/build/Build.include + build framework generic definitions + + - Build makefiles + makefiles that defines build objects + +Please refer to tools/build/Documentation/Build.txt for more +information about build framework. + + +2) perf build +============= +The Makefile.perf triggers the build framework for build objects: + perf, libperf, gtk + +resulting in following objects: + $ ls *-in.o + gtk-in.o libperf-in.o perf-in.o + +Those objects are then used in final linking: + libperf-gtk.so <- gtk-in.o libperf-in.o + perf <- perf-in.o libperf-in.o + + +NOTE this description is omitting other libraries involved, only + focusing on build framework outcomes diff --git a/Documentation/Makefile b/Documentation/Makefile new file mode 100644 index 0000000..db11478 --- /dev/null +++ b/Documentation/Makefile @@ -0,0 +1,343 @@ +include ../../scripts/Makefile.include +include ../../scripts/utilities.mak + +MAN1_TXT= \ + $(filter-out $(addsuffix .txt, $(ARTICLES) $(SP_ARTICLES)), \ + $(wildcard perf-*.txt)) \ + perf.txt +MAN5_TXT= +MAN7_TXT= + +MAN_TXT = $(MAN1_TXT) $(MAN5_TXT) $(MAN7_TXT) +_MAN_XML=$(patsubst %.txt,%.xml,$(MAN_TXT)) +_MAN_HTML=$(patsubst %.txt,%.html,$(MAN_TXT)) + +MAN_XML=$(addprefix $(OUTPUT),$(_MAN_XML)) +MAN_HTML=$(addprefix $(OUTPUT),$(_MAN_HTML)) + +ARTICLES = +# with their own formatting rules. +SP_ARTICLES = +API_DOCS = $(patsubst %.txt,%,$(filter-out technical/api-index-skel.txt technical/api-index.txt, $(wildcard technical/api-*.txt))) +SP_ARTICLES += $(API_DOCS) +SP_ARTICLES += technical/api-index + +_DOC_HTML = $(_MAN_HTML) +_DOC_HTML+=$(patsubst %,%.html,$(ARTICLES) $(SP_ARTICLES)) +DOC_HTML=$(addprefix $(OUTPUT),$(_DOC_HTML)) + +_DOC_MAN1=$(patsubst %.txt,%.1,$(MAN1_TXT)) +_DOC_MAN5=$(patsubst %.txt,%.5,$(MAN5_TXT)) +_DOC_MAN7=$(patsubst %.txt,%.7,$(MAN7_TXT)) + +DOC_MAN1=$(addprefix $(OUTPUT),$(_DOC_MAN1)) +DOC_MAN5=$(addprefix $(OUTPUT),$(_DOC_MAN5)) +DOC_MAN7=$(addprefix $(OUTPUT),$(_DOC_MAN7)) + +# Make the path relative to DESTDIR, not prefix +ifndef DESTDIR +prefix?=$(HOME) +endif +bindir?=$(prefix)/bin +htmldir?=$(prefix)/share/doc/perf-doc +pdfdir?=$(prefix)/share/doc/perf-doc +mandir?=$(prefix)/share/man +man1dir=$(mandir)/man1 +man5dir=$(mandir)/man5 +man7dir=$(mandir)/man7 + +ASCIIDOC=asciidoc +ASCIIDOC_EXTRA = --unsafe +MANPAGE_XSL = manpage-normal.xsl +XMLTO_EXTRA = +INSTALL?=install +RM ?= rm -f +DOC_REF = origin/man +HTML_REF = origin/html + +infodir?=$(prefix)/share/info +MAKEINFO=makeinfo +INSTALL_INFO=install-info +DOCBOOK2X_TEXI=docbook2x-texi +DBLATEX=dblatex +XMLTO=xmlto +ifndef PERL_PATH + PERL_PATH = /usr/bin/perl +endif + +-include ../config.mak.autogen +-include ../config.mak + +_tmp_tool_path := $(call get-executable,$(ASCIIDOC)) +ifeq ($(_tmp_tool_path),) + missing_tools = $(ASCIIDOC) +endif + +_tmp_tool_path := $(call get-executable,$(XMLTO)) +ifeq ($(_tmp_tool_path),) + missing_tools += $(XMLTO) +endif + +# +# For asciidoc ... +# -7.1.2, no extra settings are needed. +# 8.0-, set ASCIIDOC8. +# + +# +# For docbook-xsl ... +# -1.68.1, set ASCIIDOC_NO_ROFF? (based on changelog from 1.73.0) +# 1.69.0, no extra settings are needed? +# 1.69.1-1.71.0, set DOCBOOK_SUPPRESS_SP? +# 1.71.1, no extra settings are needed? +# 1.72.0, set DOCBOOK_XSL_172. +# 1.73.0-, set ASCIIDOC_NO_ROFF +# + +# +# If you had been using DOCBOOK_XSL_172 in an attempt to get rid +# of 'the ".ft C" problem' in your generated manpages, and you +# instead ended up with weird characters around callouts, try +# using ASCIIDOC_NO_ROFF instead (it works fine with ASCIIDOC8). +# + +ifdef ASCIIDOC8 +ASCIIDOC_EXTRA += -a asciidoc7compatible +endif +ifdef DOCBOOK_XSL_172 +ASCIIDOC_EXTRA += -a perf-asciidoc-no-roff +MANPAGE_XSL = manpage-1.72.xsl +else + ifdef ASCIIDOC_NO_ROFF + # docbook-xsl after 1.72 needs the regular XSL, but will not + # pass-thru raw roff codes from asciidoc.conf, so turn them off. + ASCIIDOC_EXTRA += -a perf-asciidoc-no-roff + endif +endif +ifdef MAN_BOLD_LITERAL +XMLTO_EXTRA += -m manpage-bold-literal.xsl +endif +ifdef DOCBOOK_SUPPRESS_SP +XMLTO_EXTRA += -m manpage-suppress-sp.xsl +endif + +SHELL_PATH ?= $(SHELL) +# Shell quote; +SHELL_PATH_SQ = $(subst ','\'',$(SHELL_PATH)) + +# +# Please note that there is a minor bug in asciidoc. +# The version after 6.0.3 _will_ include the patch found here: +# http://marc.theaimsgroup.com/?l=perf&m=111558757202243&w=2 +# +# Until that version is released you may have to apply the patch +# yourself - yes, all 6 characters of it! +# + +QUIET_SUBDIR0 = +$(MAKE) -C # space to separate -C and subdir +QUIET_SUBDIR1 = + +ifneq ($(findstring $(MAKEFLAGS),w),w) +PRINT_DIR = --no-print-directory +else # "make -w" +NO_SUBDIR = : +endif + +ifneq ($(findstring $(MAKEFLAGS),s),s) +ifneq ($(V),1) + QUIET_ASCIIDOC = @echo ' ASCIIDOC '$@; + QUIET_XMLTO = @echo ' XMLTO '$@; + QUIET_DB2TEXI = @echo ' DB2TEXI '$@; + QUIET_MAKEINFO = @echo ' MAKEINFO '$@; + QUIET_DBLATEX = @echo ' DBLATEX '$@; + QUIET_XSLTPROC = @echo ' XSLTPROC '$@; + QUIET_GEN = @echo ' GEN '$@; + QUIET_STDERR = 2> /dev/null + QUIET_SUBDIR0 = +@subdir= + QUIET_SUBDIR1 = ;$(NO_SUBDIR) \ + echo ' SUBDIR ' $$subdir; \ + $(MAKE) $(PRINT_DIR) -C $$subdir + export V +endif +endif + +all: html man + +html: $(DOC_HTML) + +$(DOC_HTML) $(DOC_MAN1) $(DOC_MAN5) $(DOC_MAN7): asciidoc.conf + +man: man1 man5 man7 +man1: $(DOC_MAN1) +man5: $(DOC_MAN5) +man7: $(DOC_MAN7) + +info: $(OUTPUT)perf.info $(OUTPUT)perfman.info + +pdf: $(OUTPUT)user-manual.pdf + +install: install-man + +check-man-tools: +ifdef missing_tools + $(error "You need to install $(missing_tools) for man pages") +endif + +do-install-man: man + $(call QUIET_INSTALL, Documentation-man) \ + $(INSTALL) -d -m 755 $(DESTDIR)$(man1dir); \ +# $(INSTALL) -d -m 755 $(DESTDIR)$(man5dir); \ +# $(INSTALL) -d -m 755 $(DESTDIR)$(man7dir); \ + $(INSTALL) -m 644 $(DOC_MAN1) $(DESTDIR)$(man1dir); \ +# $(INSTALL) -m 644 $(DOC_MAN5) $(DESTDIR)$(man5dir); \ +# $(INSTALL) -m 644 $(DOC_MAN7) $(DESTDIR)$(man7dir) + +install-man: check-man-tools man do-install-man + +ifdef missing_tools + DO_INSTALL_MAN = $(warning Please install $(missing_tools) to have the man pages installed) +else + DO_INSTALL_MAN = do-install-man +endif + +try-install-man: $(DO_INSTALL_MAN) + +install-info: info + $(call QUIET_INSTALL, Documentation-info) \ + $(INSTALL) -d -m 755 $(DESTDIR)$(infodir); \ + $(INSTALL) -m 644 $(OUTPUT)perf.info $(OUTPUT)perfman.info $(DESTDIR)$(infodir); \ + if test -r $(DESTDIR)$(infodir)/dir; then \ + $(INSTALL_INFO) --info-dir=$(DESTDIR)$(infodir) perf.info ;\ + $(INSTALL_INFO) --info-dir=$(DESTDIR)$(infodir) perfman.info ;\ + else \ + echo "No directory found in $(DESTDIR)$(infodir)" >&2 ; \ + fi + +install-pdf: pdf + $(call QUIET_INSTALL, Documentation-pdf) \ + $(INSTALL) -d -m 755 $(DESTDIR)$(pdfdir); \ + $(INSTALL) -m 644 $(OUTPUT)user-manual.pdf $(DESTDIR)$(pdfdir) + +#install-html: html +# '$(SHELL_PATH_SQ)' ./install-webdoc.sh $(DESTDIR)$(htmldir) + + +# +# Determine "include::" file references in asciidoc files. +# +$(OUTPUT)doc.dep : $(wildcard *.txt) build-docdep.perl + $(QUIET_GEN)$(RM) $@+ $@ && \ + $(PERL_PATH) ./build-docdep.perl >$@+ $(QUIET_STDERR) && \ + mv $@+ $@ + +-include $(OUPTUT)doc.dep + +_cmds_txt = cmds-ancillaryinterrogators.txt \ + cmds-ancillarymanipulators.txt \ + cmds-mainporcelain.txt \ + cmds-plumbinginterrogators.txt \ + cmds-plumbingmanipulators.txt \ + cmds-synchingrepositories.txt \ + cmds-synchelpers.txt \ + cmds-purehelpers.txt \ + cmds-foreignscminterface.txt +cmds_txt=$(addprefix $(OUTPUT),$(_cmds_txt)) + +$(cmds_txt): $(OUTPUT)cmd-list.made + +$(OUTPUT)cmd-list.made: cmd-list.perl ../command-list.txt $(MAN1_TXT) + $(QUIET_GEN)$(RM) $@ && \ + $(PERL_PATH) ./cmd-list.perl ../command-list.txt $(QUIET_STDERR) && \ + date >$@ + +CLEAN_FILES = \ + $(MAN_XML) $(addsuffix +,$(MAN_XML)) \ + $(MAN_HTML) $(addsuffix +,$(MAN_HTML)) \ + $(DOC_HTML) $(DOC_MAN1) $(DOC_MAN5) $(DOC_MAN7) \ + $(OUTPUT)*.texi $(OUTPUT)*.texi+ $(OUTPUT)*.texi++ \ + $(OUTPUT)perf.info $(OUTPUT)perfman.info \ + $(OUTPUT)howto-index.txt $(OUTPUT)howto/*.html $(OUTPUT)doc.dep \ + $(OUTPUT)technical/api-*.html $(OUTPUT)technical/api-index.txt \ + $(cmds_txt) $(OUTPUT)*.made +clean: + $(call QUIET_CLEAN, Documentation) $(RM) $(CLEAN_FILES) + +$(MAN_HTML): $(OUTPUT)%.html : %.txt + $(QUIET_ASCIIDOC)$(RM) $@+ $@ && \ + $(ASCIIDOC) -b xhtml11 -d manpage -f asciidoc.conf \ + $(ASCIIDOC_EXTRA) -aperf_version=$(PERF_VERSION) -o $@+ $< && \ + mv $@+ $@ + +$(OUTPUT)%.1 $(OUTPUT)%.5 $(OUTPUT)%.7 : $(OUTPUT)%.xml + $(QUIET_XMLTO)$(RM) $@ && \ + $(XMLTO) -o $(OUTPUT). -m $(MANPAGE_XSL) $(XMLTO_EXTRA) man $< + +$(OUTPUT)%.xml : %.txt + $(QUIET_ASCIIDOC)$(RM) $@+ $@ && \ + $(ASCIIDOC) -b docbook -d manpage -f asciidoc.conf \ + $(ASCIIDOC_EXTRA) -aperf_version=$(PERF_VERSION) -o $@+ $< && \ + mv $@+ $@ + +XSLT = docbook.xsl +XSLTOPTS = --xinclude --stringparam html.stylesheet docbook-xsl.css + +$(OUTPUT)user-manual.html: $(OUTPUT)user-manual.xml + $(QUIET_XSLTPROC)xsltproc $(XSLTOPTS) -o $@ $(XSLT) $< + +$(OUTPUT)perf.info: $(OUTPUT)user-manual.texi + $(QUIET_MAKEINFO)$(MAKEINFO) --no-split -o $@ $(OUTPUT)user-manual.texi + +$(OUTPUT)user-manual.texi: $(OUTPUT)user-manual.xml + $(QUIET_DB2TEXI)$(RM) $@+ $@ && \ + $(DOCBOOK2X_TEXI) $(OUTPUT)user-manual.xml --encoding=UTF-8 --to-stdout >$@++ && \ + $(PERL_PATH) fix-texi.perl <$@++ >$@+ && \ + rm $@++ && \ + mv $@+ $@ + +$(OUTPUT)user-manual.pdf: $(OUTPUT)user-manual.xml + $(QUIET_DBLATEX)$(RM) $@+ $@ && \ + $(DBLATEX) -o $@+ -p /etc/asciidoc/dblatex/asciidoc-dblatex.xsl -s /etc/asciidoc/dblatex/asciidoc-dblatex.sty $< && \ + mv $@+ $@ + +$(OUTPUT)perfman.texi: $(MAN_XML) cat-texi.perl + $(QUIET_DB2TEXI)$(RM) $@+ $@ && \ + ($(foreach xml,$(MAN_XML),$(DOCBOOK2X_TEXI) --encoding=UTF-8 \ + --to-stdout $(xml) &&) true) > $@++ && \ + $(PERL_PATH) cat-texi.perl $@ <$@++ >$@+ && \ + rm $@++ && \ + mv $@+ $@ + +$(OUTPUT)perfman.info: $(OUTPUT)perfman.texi + $(QUIET_MAKEINFO)$(MAKEINFO) --no-split --no-validate $*.texi + +$(patsubst %.txt,%.texi,$(MAN_TXT)): %.texi : %.xml + $(QUIET_DB2TEXI)$(RM) $@+ $@ && \ + $(DOCBOOK2X_TEXI) --to-stdout $*.xml >$@+ && \ + mv $@+ $@ + +howto-index.txt: howto-index.sh $(wildcard howto/*.txt) + $(QUIET_GEN)$(RM) $@+ $@ && \ + '$(SHELL_PATH_SQ)' ./howto-index.sh $(wildcard howto/*.txt) >$@+ && \ + mv $@+ $@ + +$(patsubst %,%.html,$(ARTICLES)) : %.html : %.txt + $(QUIET_ASCIIDOC)$(ASCIIDOC) -b xhtml11 $*.txt + +WEBDOC_DEST = /pub/software/tools/perf/docs + +$(patsubst %.txt,%.html,$(wildcard howto/*.txt)): %.html : %.txt + $(QUIET_ASCIIDOC)$(RM) $@+ $@ && \ + sed -e '1,/^$$/d' $< | $(ASCIIDOC) -b xhtml11 - >$@+ && \ + mv $@+ $@ + +# UNIMPLEMENTED +#install-webdoc : html +# '$(SHELL_PATH_SQ)' ./install-webdoc.sh $(WEBDOC_DEST) + +# quick-install: quick-install-man + +# quick-install-man: +# '$(SHELL_PATH_SQ)' ./install-doc-quick.sh $(DOC_REF) $(DESTDIR)$(mandir) + +#quick-install-html: +# '$(SHELL_PATH_SQ)' ./install-doc-quick.sh $(HTML_REF) $(DESTDIR)$(htmldir) diff --git a/Documentation/android.txt b/Documentation/android.txt new file mode 100644 index 0000000..24a5999 --- /dev/null +++ b/Documentation/android.txt @@ -0,0 +1,78 @@ +How to compile perf for Android +========================================= + +I. Set the Android NDK environment +------------------------------------------------ + +(a). Use the Android NDK +------------------------------------------------ +1. You need to download and install the Android Native Development Kit (NDK). +Set the NDK variable to point to the path where you installed the NDK: + export NDK=/path/to/android-ndk + +2. Set cross-compiling environment variables for NDK toolchain and sysroot. +For arm: + export NDK_TOOLCHAIN=${NDK}/toolchains/arm-linux-androideabi-4.9/prebuilt/linux-x86_64/bin/arm-linux-androideabi- + export NDK_SYSROOT=${NDK}/platforms/android-24/arch-arm +For x86: + export NDK_TOOLCHAIN=${NDK}/toolchains/x86-4.9/prebuilt/linux-x86_64/bin/i686-linux-android- + export NDK_SYSROOT=${NDK}/platforms/android-24/arch-x86 + +This method is only tested for Android NDK versions Revision 11b and later. +perf uses some bionic enhancements that are not included in prior NDK versions. +You can use method (b) described below instead. + +(b). Use the Android source tree +----------------------------------------------- +1. Download the master branch of the Android source tree. +Set the environment for the target you want using: + source build/envsetup.sh + lunch + +2. Build your own NDK sysroot to contain latest bionic changes and set the +NDK sysroot environment variable. + cd ${ANDROID_BUILD_TOP}/ndk +For arm: + ./build/tools/build-ndk-sysroot.sh --abi=arm + export NDK_SYSROOT=${ANDROID_BUILD_TOP}/ndk/build/platforms/android-3/arch-arm +For x86: + ./build/tools/build-ndk-sysroot.sh --abi=x86 + export NDK_SYSROOT=${ANDROID_BUILD_TOP}/ndk/build/platforms/android-3/arch-x86 + +3. Set the NDK toolchain environment variable. +For arm: + export NDK_TOOLCHAIN=${ANDROID_TOOLCHAIN}/arm-linux-androideabi- +For x86: + export NDK_TOOLCHAIN=${ANDROID_TOOLCHAIN}/i686-linux-android- + +II. Compile perf for Android +------------------------------------------------ +You need to run make with the NDK toolchain and sysroot defined above: +For arm: + make WERROR=0 ARCH=arm CROSS_COMPILE=${NDK_TOOLCHAIN} EXTRA_CFLAGS="-pie --sysroot=${NDK_SYSROOT}" +For x86: + make WERROR=0 ARCH=x86 CROSS_COMPILE=${NDK_TOOLCHAIN} EXTRA_CFLAGS="-pie --sysroot=${NDK_SYSROOT}" + +III. Install perf +----------------------------------------------- +You need to connect to your Android device/emulator using adb. +Install perf using: + adb push perf /data/perf + +If you also want to use perf-archive you need busybox tools for Android. +For installing perf-archive, you first need to replace #!/bin/bash with #!/system/bin/sh: + sed 's/#!\/bin\/bash/#!\/system\/bin\/sh/g' perf-archive >> /tmp/perf-archive + chmod +x /tmp/perf-archive + adb push /tmp/perf-archive /data/perf-archive + +IV. Environment settings for running perf +------------------------------------------------ +Some perf features need environment variables to run properly. +You need to set these before running perf on the target: + adb shell + # PERF_PAGER=cat + +IV. Run perf +------------------------------------------------ +Run perf on your device/emulator to which you previously connected using adb: + # ./data/perf diff --git a/Documentation/asciidoc.conf b/Documentation/asciidoc.conf new file mode 100644 index 0000000..356b23a --- /dev/null +++ b/Documentation/asciidoc.conf @@ -0,0 +1,91 @@ +## linkperf: macro +# +# Usage: linkperf:command[manpage-section] +# +# Note, {0} is the manpage section, while {target} is the command. +# +# Show PERF link as: (
); if section is defined, else just show +# the command. + +[macros] +(?su)[\\]?(?Plinkperf):(?P\S*?)\[(?P.*?)\]= + +[attributes] +asterisk=* +plus=+ +caret=^ +startsb=[ +endsb=] +tilde=~ + +ifdef::backend-docbook[] +[linkperf-inlinemacro] +{0%{target}} +{0#} +{0#{target}{0}} +{0#} +endif::backend-docbook[] + +ifdef::backend-docbook[] +ifndef::perf-asciidoc-no-roff[] +# "unbreak" docbook-xsl v1.68 for manpages. v1.69 works with or without this. +# v1.72 breaks with this because it replaces dots not in roff requests. +[listingblock] +{title} + +ifdef::doctype-manpage[] + .ft C +endif::doctype-manpage[] +| +ifdef::doctype-manpage[] + .ft +endif::doctype-manpage[] + +{title#} +endif::perf-asciidoc-no-roff[] + +ifdef::perf-asciidoc-no-roff[] +ifdef::doctype-manpage[] +# The following two small workarounds insert a simple paragraph after screen +[listingblock] +{title} + +| + +{title#} + +[verseblock] +{title} +{title%} +{title#} +| + +{title#} +{title%} +endif::doctype-manpage[] +endif::perf-asciidoc-no-roff[] +endif::backend-docbook[] + +ifdef::doctype-manpage[] +ifdef::backend-docbook[] +[header] +template::[header-declarations] + + +{mantitle} +{manvolnum} +perf +{perf_version} +perf Manual + + + {manname} + {manpurpose} + +endif::backend-docbook[] +endif::doctype-manpage[] + +ifdef::backend-xhtml11[] +[linkperf-inlinemacro] +{target}{0?({0})} +endif::backend-xhtml11[] diff --git a/Documentation/callchain-overhead-calculation.txt b/Documentation/callchain-overhead-calculation.txt new file mode 100644 index 0000000..1a75792 --- /dev/null +++ b/Documentation/callchain-overhead-calculation.txt @@ -0,0 +1,108 @@ +Overhead calculation +-------------------- +The overhead can be shown in two columns as 'Children' and 'Self' when +perf collects callchains. The 'self' overhead is simply calculated by +adding all period values of the entry - usually a function (symbol). +This is the value that perf shows traditionally and sum of all the +'self' overhead values should be 100%. + +The 'children' overhead is calculated by adding all period values of +the child functions so that it can show the total overhead of the +higher level functions even if they don't directly execute much. +'Children' here means functions that are called from another (parent) +function. + +It might be confusing that the sum of all the 'children' overhead +values exceeds 100% since each of them is already an accumulation of +'self' overhead of its child functions. But with this enabled, users +can find which function has the most overhead even if samples are +spread over the children. + +Consider the following example; there are three functions like below. + +----------------------- +void foo(void) { + /* do something */ +} + +void bar(void) { + /* do something */ + foo(); +} + +int main(void) { + bar() + return 0; +} +----------------------- + +In this case 'foo' is a child of 'bar', and 'bar' is an immediate +child of 'main' so 'foo' also is a child of 'main'. In other words, +'main' is a parent of 'foo' and 'bar', and 'bar' is a parent of 'foo'. + +Suppose all samples are recorded in 'foo' and 'bar' only. When it's +recorded with callchains the output will show something like below +in the usual (self-overhead-only) output of perf report: + +---------------------------------- +Overhead Symbol +........ ..................... + 60.00% foo + | + --- foo + bar + main + __libc_start_main + + 40.00% bar + | + --- bar + main + __libc_start_main +---------------------------------- + +When the --children option is enabled, the 'self' overhead values of +child functions (i.e. 'foo' and 'bar') are added to the parents to +calculate the 'children' overhead. In this case the report could be +displayed as: + +------------------------------------------- +Children Self Symbol +........ ........ .................... + 100.00% 0.00% __libc_start_main + | + --- __libc_start_main + + 100.00% 0.00% main + | + --- main + __libc_start_main + + 100.00% 40.00% bar + | + --- bar + main + __libc_start_main + + 60.00% 60.00% foo + | + --- foo + bar + main + __libc_start_main +------------------------------------------- + +In the above output, the 'self' overhead of 'foo' (60%) was add to the +'children' overhead of 'bar', 'main' and '\_\_libc_start_main'. +Likewise, the 'self' overhead of 'bar' (40%) was added to the +'children' overhead of 'main' and '\_\_libc_start_main'. + +So '\_\_libc_start_main' and 'main' are shown first since they have +same (100%) 'children' overhead (even though they have zero 'self' +overhead) and they are the parents of 'foo' and 'bar'. + +Since v3.16 the 'children' overhead is shown by default and the output +is sorted by its values. The 'children' overhead is disabled by +specifying --no-children option on the command line or by adding +'report.children = false' or 'top.children = false' in the perf config +file. diff --git a/Documentation/examples.txt b/Documentation/examples.txt new file mode 100644 index 0000000..a4e3921 --- /dev/null +++ b/Documentation/examples.txt @@ -0,0 +1,225 @@ + + ------------------------------ + ****** perf by examples ****** + ------------------------------ + +[ From an e-mail by Ingo Molnar, http://lkml.org/lkml/2009/8/4/346 ] + + +First, discovery/enumeration of available counters can be done via +'perf list': + +titan:~> perf list + [...] + kmem:kmalloc [Tracepoint event] + kmem:kmem_cache_alloc [Tracepoint event] + kmem:kmalloc_node [Tracepoint event] + kmem:kmem_cache_alloc_node [Tracepoint event] + kmem:kfree [Tracepoint event] + kmem:kmem_cache_free [Tracepoint event] + kmem:mm_page_free [Tracepoint event] + kmem:mm_page_free_batched [Tracepoint event] + kmem:mm_page_alloc [Tracepoint event] + kmem:mm_page_alloc_zone_locked [Tracepoint event] + kmem:mm_page_pcpu_drain [Tracepoint event] + kmem:mm_page_alloc_extfrag [Tracepoint event] + +Then any (or all) of the above event sources can be activated and +measured. For example the page alloc/free properties of a 'hackbench +run' are: + + titan:~> perf stat -e kmem:mm_page_pcpu_drain -e kmem:mm_page_alloc + -e kmem:mm_page_free_batched -e kmem:mm_page_free ./hackbench 10 + Time: 0.575 + + Performance counter stats for './hackbench 10': + + 13857 kmem:mm_page_pcpu_drain + 27576 kmem:mm_page_alloc + 6025 kmem:mm_page_free_batched + 20934 kmem:mm_page_free + + 0.613972165 seconds time elapsed + +You can observe the statistical properties as well, by using the +'repeat the workload N times' feature of perf stat: + + titan:~> perf stat --repeat 5 -e kmem:mm_page_pcpu_drain -e + kmem:mm_page_alloc -e kmem:mm_page_free_batched -e + kmem:mm_page_free ./hackbench 10 + Time: 0.627 + Time: 0.644 + Time: 0.564 + Time: 0.559 + Time: 0.626 + + Performance counter stats for './hackbench 10' (5 runs): + + 12920 kmem:mm_page_pcpu_drain ( +- 3.359% ) + 25035 kmem:mm_page_alloc ( +- 3.783% ) + 6104 kmem:mm_page_free_batched ( +- 0.934% ) + 18376 kmem:mm_page_free ( +- 4.941% ) + + 0.643954516 seconds time elapsed ( +- 2.363% ) + +Furthermore, these tracepoints can be used to sample the workload as +well. For example the page allocations done by a 'git gc' can be +captured the following way: + + titan:~/git> perf record -e kmem:mm_page_alloc -c 1 ./git gc + Counting objects: 1148, done. + Delta compression using up to 2 threads. + Compressing objects: 100% (450/450), done. + Writing objects: 100% (1148/1148), done. + Total 1148 (delta 690), reused 1148 (delta 690) + [ perf record: Captured and wrote 0.267 MB perf.data (~11679 samples) ] + +To check which functions generated page allocations: + + titan:~/git> perf report + # Samples: 10646 + # + # Overhead Command Shared Object + # ........ ............... .......................... + # + 23.57% git-repack /lib64/libc-2.5.so + 21.81% git /lib64/libc-2.5.so + 14.59% git ./git + 11.79% git-repack ./git + 7.12% git /lib64/ld-2.5.so + 3.16% git-repack /lib64/libpthread-2.5.so + 2.09% git-repack /bin/bash + 1.97% rm /lib64/libc-2.5.so + 1.39% mv /lib64/ld-2.5.so + 1.37% mv /lib64/libc-2.5.so + 1.12% git-repack /lib64/ld-2.5.so + 0.95% rm /lib64/ld-2.5.so + 0.90% git-update-serv /lib64/libc-2.5.so + 0.73% git-update-serv /lib64/ld-2.5.so + 0.68% perf /lib64/libpthread-2.5.so + 0.64% git-repack /usr/lib64/libz.so.1.2.3 + +Or to see it on a more finegrained level: + +titan:~/git> perf report --sort comm,dso,symbol +# Samples: 10646 +# +# Overhead Command Shared Object Symbol +# ........ ............... .......................... ...... +# + 9.35% git-repack ./git [.] insert_obj_hash + 9.12% git ./git [.] insert_obj_hash + 7.31% git /lib64/libc-2.5.so [.] memcpy + 6.34% git-repack /lib64/libc-2.5.so [.] _int_malloc + 6.24% git-repack /lib64/libc-2.5.so [.] memcpy + 5.82% git-repack /lib64/libc-2.5.so [.] __GI___fork + 5.47% git /lib64/libc-2.5.so [.] _int_malloc + 2.99% git /lib64/libc-2.5.so [.] memset + +Furthermore, call-graph sampling can be done too, of page +allocations - to see precisely what kind of page allocations there +are: + + titan:~/git> perf record -g -e kmem:mm_page_alloc -c 1 ./git gc + Counting objects: 1148, done. + Delta compression using up to 2 threads. + Compressing objects: 100% (450/450), done. + Writing objects: 100% (1148/1148), done. + Total 1148 (delta 690), reused 1148 (delta 690) + [ perf record: Captured and wrote 0.963 MB perf.data (~42069 samples) ] + + titan:~/git> perf report -g + # Samples: 10686 + # + # Overhead Command Shared Object + # ........ ............... .......................... + # + 23.25% git-repack /lib64/libc-2.5.so + | + |--50.00%-- _int_free + | + |--37.50%-- __GI___fork + | make_child + | + |--12.50%-- ptmalloc_unlock_all2 + | make_child + | + --6.25%-- __GI_strcpy + 21.61% git /lib64/libc-2.5.so + | + |--30.00%-- __GI_read + | | + | --83.33%-- git_config_from_file + | git_config + | | + [...] + +Or you can observe the whole system's page allocations for 10 +seconds: + +titan:~/git> perf stat -a -e kmem:mm_page_pcpu_drain -e +kmem:mm_page_alloc -e kmem:mm_page_free_batched -e +kmem:mm_page_free sleep 10 + + Performance counter stats for 'sleep 10': + + 171585 kmem:mm_page_pcpu_drain + 322114 kmem:mm_page_alloc + 73623 kmem:mm_page_free_batched + 254115 kmem:mm_page_free + + 10.000591410 seconds time elapsed + +Or observe how fluctuating the page allocations are, via statistical +analysis done over ten 1-second intervals: + + titan:~/git> perf stat --repeat 10 -a -e kmem:mm_page_pcpu_drain -e + kmem:mm_page_alloc -e kmem:mm_page_free_batched -e + kmem:mm_page_free sleep 1 + + Performance counter stats for 'sleep 1' (10 runs): + + 17254 kmem:mm_page_pcpu_drain ( +- 3.709% ) + 34394 kmem:mm_page_alloc ( +- 4.617% ) + 7509 kmem:mm_page_free_batched ( +- 4.820% ) + 25653 kmem:mm_page_free ( +- 3.672% ) + + 1.058135029 seconds time elapsed ( +- 3.089% ) + +Or you can annotate the recorded 'git gc' run on a per symbol basis +and check which instructions/source-code generated page allocations: + + titan:~/git> perf annotate __GI___fork + ------------------------------------------------ + Percent | Source code & Disassembly of libc-2.5.so + ------------------------------------------------ + : + : + : Disassembly of section .plt: + : Disassembly of section .text: + : + : 00000031a2e95560 <__fork>: + [...] + 0.00 : 31a2e95602: b8 38 00 00 00 mov $0x38,%eax + 0.00 : 31a2e95607: 0f 05 syscall + 83.42 : 31a2e95609: 48 3d 00 f0 ff ff cmp $0xfffffffffffff000,%rax + 0.00 : 31a2e9560f: 0f 87 4d 01 00 00 ja 31a2e95762 <__fork+0x202> + 0.00 : 31a2e95615: 85 c0 test %eax,%eax + +( this shows that 83.42% of __GI___fork's page allocations come from + the 0x38 system call it performs. ) + +etc. etc. - a lot more is possible. I could list a dozen of +other different usecases straight away - neither of which is +possible via /proc/vmstat. + +/proc/vmstat is not in the same league really, in terms of +expressive power of system analysis and performance +analysis. + +All that the above results needed were those new tracepoints +in include/tracing/events/kmem.h. + + Ingo + + diff --git a/Documentation/intel-bts.txt b/Documentation/intel-bts.txt new file mode 100644 index 0000000..8bdc93b --- /dev/null +++ b/Documentation/intel-bts.txt @@ -0,0 +1,86 @@ +Intel Branch Trace Store +======================== + +Overview +======== + +Intel BTS could be regarded as a predecessor to Intel PT and has some +similarities because it can also identify every branch a program takes. A +notable difference is that Intel BTS has no timing information and as a +consequence the present implementation is limited to per-thread recording. + +While decoding Intel BTS does not require walking the object code, the object +code is still needed to pair up calls and returns correctly, consequently much +of the Intel PT documentation applies also to Intel BTS. Refer to the Intel PT +documentation and consider that the PMU 'intel_bts' can usually be used in +place of 'intel_pt' in the examples provided, with the proviso that per-thread +recording must also be stipulated i.e. the --per-thread option for +'perf record'. + + +perf record +=========== + +new event +--------- + +The Intel BTS kernel driver creates a new PMU for Intel BTS. The perf record +option is: + + -e intel_bts// + +Currently Intel BTS is limited to per-thread tracing so the --per-thread option +is also needed. + + +snapshot option +--------------- + +The snapshot option is the same as Intel PT (refer Intel PT documentation). + + +auxtrace mmap size option +----------------------- + +The mmap size option is the same as Intel PT (refer Intel PT documentation). + + +perf script +=========== + +By default, perf script will decode trace data found in the perf.data file. +This can be further controlled by option --itrace. The --itrace option is +the same as Intel PT (refer Intel PT documentation) except that neither +"instructions" events nor "transactions" events (and consequently call +chains) are supported. + +To disable trace decoding entirely, use the option --no-itrace. + + +dump option +----------- + +perf script has an option (-D) to "dump" the events i.e. display the binary +data. + +When -D is used, Intel BTS packets are displayed. + +To disable the display of Intel BTS packets, combine the -D option with +--no-itrace. + + +perf report +=========== + +By default, perf report will decode trace data found in the perf.data file. +This can be further controlled by new option --itrace exactly the same as +perf script. + + +perf inject +=========== + +perf inject also accepts the --itrace option in which case tracing data is +removed and replaced with the synthesized events. e.g. + + perf inject --itrace -i perf.data -o perf.data.new diff --git a/Documentation/intel-pt.txt b/Documentation/intel-pt.txt new file mode 100644 index 0000000..76971d2 --- /dev/null +++ b/Documentation/intel-pt.txt @@ -0,0 +1,891 @@ +Intel Processor Trace +===================== + +Overview +======== + +Intel Processor Trace (Intel PT) is an extension of Intel Architecture that +collects information about software execution such as control flow, execution +modes and timings and formats it into highly compressed binary packets. +Technical details are documented in the Intel 64 and IA-32 Architectures +Software Developer Manuals, Chapter 36 Intel Processor Trace. + +Intel PT is first supported in Intel Core M and 5th generation Intel Core +processors that are based on the Intel micro-architecture code name Broadwell. + +Trace data is collected by 'perf record' and stored within the perf.data file. +See below for options to 'perf record'. + +Trace data must be 'decoded' which involves walking the object code and matching +the trace data packets. For example a TNT packet only tells whether a +conditional branch was taken or not taken, so to make use of that packet the +decoder must know precisely which instruction was being executed. + +Decoding is done on-the-fly. The decoder outputs samples in the same format as +samples output by perf hardware events, for example as though the "instructions" +or "branches" events had been recorded. Presently 3 tools support this: +'perf script', 'perf report' and 'perf inject'. See below for more information +on using those tools. + +The main distinguishing feature of Intel PT is that the decoder can determine +the exact flow of software execution. Intel PT can be used to understand why +and how did software get to a certain point, or behave a certain way. The +software does not have to be recompiled, so Intel PT works with debug or release +builds, however the executed images are needed - which makes use in JIT-compiled +environments, or with self-modified code, a challenge. Also symbols need to be +provided to make sense of addresses. + +A limitation of Intel PT is that it produces huge amounts of trace data +(hundreds of megabytes per second per core) which takes a long time to decode, +for example two or three orders of magnitude longer than it took to collect. +Another limitation is the performance impact of tracing, something that will +vary depending on the use-case and architecture. + + +Quickstart +========== + +It is important to start small. That is because it is easy to capture vastly +more data than can possibly be processed. + +The simplest thing to do with Intel PT is userspace profiling of small programs. +Data is captured with 'perf record' e.g. to trace 'ls' userspace-only: + + perf record -e intel_pt//u ls + +And profiled with 'perf report' e.g. + + perf report + +To also trace kernel space presents a problem, namely kernel self-modifying +code. A fairly good kernel image is available in /proc/kcore but to get an +accurate image a copy of /proc/kcore needs to be made under the same conditions +as the data capture. A script perf-with-kcore can do that, but beware that the +script makes use of 'sudo' to copy /proc/kcore. If you have perf installed +locally from the source tree you can do: + + ~/libexec/perf-core/perf-with-kcore record pt_ls -e intel_pt// -- ls + +which will create a directory named 'pt_ls' and put the perf.data file and +copies of /proc/kcore, /proc/kallsyms and /proc/modules into it. Then to use +'perf report' becomes: + + ~/libexec/perf-core/perf-with-kcore report pt_ls + +Because samples are synthesized after-the-fact, the sampling period can be +selected for reporting. e.g. sample every microsecond + + ~/libexec/perf-core/perf-with-kcore report pt_ls --itrace=i1usge + +See the sections below for more information about the --itrace option. + +Beware the smaller the period, the more samples that are produced, and the +longer it takes to process them. + +Also note that the coarseness of Intel PT timing information will start to +distort the statistical value of the sampling as the sampling period becomes +smaller. + +To represent software control flow, "branches" samples are produced. By default +a branch sample is synthesized for every single branch. To get an idea what +data is available you can use the 'perf script' tool with no parameters, which +will list all the samples. + + perf record -e intel_pt//u ls + perf script + +An interesting field that is not printed by default is 'flags' which can be +displayed as follows: + + perf script -Fcomm,tid,pid,time,cpu,event,trace,ip,sym,dso,addr,symoff,flags + +The flags are "bcrosyiABEx" which stand for branch, call, return, conditional, +system, asynchronous, interrupt, transaction abort, trace begin, trace end, and +in transaction, respectively. + +While it is possible to create scripts to analyze the data, an alternative +approach is available to export the data to a sqlite or postgresql database. +Refer to script export-to-sqlite.py or export-to-postgresql.py for more details, +and to script call-graph-from-sql.py for an example of using the database. + +There is also script intel-pt-events.py which provides an example of how to +unpack the raw data for power events and PTWRITE. + +As mentioned above, it is easy to capture too much data. One way to limit the +data captured is to use 'snapshot' mode which is explained further below. +Refer to 'new snapshot option' and 'Intel PT modes of operation' further below. + +Another problem that will be experienced is decoder errors. They can be caused +by inability to access the executed image, self-modified or JIT-ed code, or the +inability to match side-band information (such as context switches and mmaps) +which results in the decoder not knowing what code was executed. + +There is also the problem of perf not being able to copy the data fast enough, +resulting in data lost because the buffer was full. See 'Buffer handling' below +for more details. + + +perf record +=========== + +new event +--------- + +The Intel PT kernel driver creates a new PMU for Intel PT. PMU events are +selected by providing the PMU name followed by the "config" separated by slashes. +An enhancement has been made to allow default "config" e.g. the option + + -e intel_pt// + +will use a default config value. Currently that is the same as + + -e intel_pt/tsc,noretcomp=0/ + +which is the same as + + -e intel_pt/tsc=1,noretcomp=0/ + +Note there are now new config terms - see section 'config terms' further below. + +The config terms are listed in /sys/devices/intel_pt/format. They are bit +fields within the config member of the struct perf_event_attr which is +passed to the kernel by the perf_event_open system call. They correspond to bit +fields in the IA32_RTIT_CTL MSR. Here is a list of them and their definitions: + + $ grep -H . /sys/bus/event_source/devices/intel_pt/format/* + /sys/bus/event_source/devices/intel_pt/format/cyc:config:1 + /sys/bus/event_source/devices/intel_pt/format/cyc_thresh:config:19-22 + /sys/bus/event_source/devices/intel_pt/format/mtc:config:9 + /sys/bus/event_source/devices/intel_pt/format/mtc_period:config:14-17 + /sys/bus/event_source/devices/intel_pt/format/noretcomp:config:11 + /sys/bus/event_source/devices/intel_pt/format/psb_period:config:24-27 + /sys/bus/event_source/devices/intel_pt/format/tsc:config:10 + +Note that the default config must be overridden for each term i.e. + + -e intel_pt/noretcomp=0/ + +is the same as: + + -e intel_pt/tsc=1,noretcomp=0/ + +So, to disable TSC packets use: + + -e intel_pt/tsc=0/ + +It is also possible to specify the config value explicitly: + + -e intel_pt/config=0x400/ + +Note that, as with all events, the event is suffixed with event modifiers: + + u userspace + k kernel + h hypervisor + G guest + H host + p precise ip + +'h', 'G' and 'H' are for virtualization which is not supported by Intel PT. +'p' is also not relevant to Intel PT. So only options 'u' and 'k' are +meaningful for Intel PT. + +perf_event_attr is displayed if the -vv option is used e.g. + + ------------------------------------------------------------ + perf_event_attr: + type 6 + size 112 + config 0x400 + { sample_period, sample_freq } 1 + sample_type IP|TID|TIME|CPU|IDENTIFIER + read_format ID + disabled 1 + inherit 1 + exclude_kernel 1 + exclude_hv 1 + enable_on_exec 1 + sample_id_all 1 + ------------------------------------------------------------ + sys_perf_event_open: pid 31104 cpu 0 group_fd -1 flags 0x8 + sys_perf_event_open: pid 31104 cpu 1 group_fd -1 flags 0x8 + sys_perf_event_open: pid 31104 cpu 2 group_fd -1 flags 0x8 + sys_perf_event_open: pid 31104 cpu 3 group_fd -1 flags 0x8 + ------------------------------------------------------------ + + +config terms +------------ + +The June 2015 version of Intel 64 and IA-32 Architectures Software Developer +Manuals, Chapter 36 Intel Processor Trace, defined new Intel PT features. +Some of the features are reflect in new config terms. All the config terms are +described below. + +tsc Always supported. Produces TSC timestamp packets to provide + timing information. In some cases it is possible to decode + without timing information, for example a per-thread context + that does not overlap executable memory maps. + + The default config selects tsc (i.e. tsc=1). + +noretcomp Always supported. Disables "return compression" so a TIP packet + is produced when a function returns. Causes more packets to be + produced but might make decoding more reliable. + + The default config does not select noretcomp (i.e. noretcomp=0). + +psb_period Allows the frequency of PSB packets to be specified. + + The PSB packet is a synchronization packet that provides a + starting point for decoding or recovery from errors. + + Support for psb_period is indicated by: + + /sys/bus/event_source/devices/intel_pt/caps/psb_cyc + + which contains "1" if the feature is supported and "0" + otherwise. + + Valid values are given by: + + /sys/bus/event_source/devices/intel_pt/caps/psb_periods + + which contains a hexadecimal value, the bits of which represent + valid values e.g. bit 2 set means value 2 is valid. + + The psb_period value is converted to the approximate number of + trace bytes between PSB packets as: + + 2 ^ (value + 11) + + e.g. value 3 means 16KiB bytes between PSBs + + If an invalid value is entered, the error message + will give a list of valid values e.g. + + $ perf record -e intel_pt/psb_period=15/u uname + Invalid psb_period for intel_pt. Valid values are: 0-5 + + If MTC packets are selected, the default config selects a value + of 3 (i.e. psb_period=3) or the nearest lower value that is + supported (0 is always supported). Otherwise the default is 0. + + If decoding is expected to be reliable and the buffer is large + then a large PSB period can be used. + + Because a TSC packet is produced with PSB, the PSB period can + also affect the granularity to timing information in the absence + of MTC or CYC. + +mtc Produces MTC timing packets. + + MTC packets provide finer grain timestamp information than TSC + packets. MTC packets record time using the hardware crystal + clock (CTC) which is related to TSC packets using a TMA packet. + + Support for this feature is indicated by: + + /sys/bus/event_source/devices/intel_pt/caps/mtc + + which contains "1" if the feature is supported and + "0" otherwise. + + The frequency of MTC packets can also be specified - see + mtc_period below. + +mtc_period Specifies how frequently MTC packets are produced - see mtc + above for how to determine if MTC packets are supported. + + Valid values are given by: + + /sys/bus/event_source/devices/intel_pt/caps/mtc_periods + + which contains a hexadecimal value, the bits of which represent + valid values e.g. bit 2 set means value 2 is valid. + + The mtc_period value is converted to the MTC frequency as: + + CTC-frequency / (2 ^ value) + + e.g. value 3 means one eighth of CTC-frequency + + Where CTC is the hardware crystal clock, the frequency of which + can be related to TSC via values provided in cpuid leaf 0x15. + + If an invalid value is entered, the error message + will give a list of valid values e.g. + + $ perf record -e intel_pt/mtc_period=15/u uname + Invalid mtc_period for intel_pt. Valid values are: 0,3,6,9 + + The default value is 3 or the nearest lower value + that is supported (0 is always supported). + +cyc Produces CYC timing packets. + + CYC packets provide even finer grain timestamp information than + MTC and TSC packets. A CYC packet contains the number of CPU + cycles since the last CYC packet. Unlike MTC and TSC packets, + CYC packets are only sent when another packet is also sent. + + Support for this feature is indicated by: + + /sys/bus/event_source/devices/intel_pt/caps/psb_cyc + + which contains "1" if the feature is supported and + "0" otherwise. + + The number of CYC packets produced can be reduced by specifying + a threshold - see cyc_thresh below. + +cyc_thresh Specifies how frequently CYC packets are produced - see cyc + above for how to determine if CYC packets are supported. + + Valid cyc_thresh values are given by: + + /sys/bus/event_source/devices/intel_pt/caps/cycle_thresholds + + which contains a hexadecimal value, the bits of which represent + valid values e.g. bit 2 set means value 2 is valid. + + The cyc_thresh value represents the minimum number of CPU cycles + that must have passed before a CYC packet can be sent. The + number of CPU cycles is: + + 2 ^ (value - 1) + + e.g. value 4 means 8 CPU cycles must pass before a CYC packet + can be sent. Note a CYC packet is still only sent when another + packet is sent, not at, e.g. every 8 CPU cycles. + + If an invalid value is entered, the error message + will give a list of valid values e.g. + + $ perf record -e intel_pt/cyc,cyc_thresh=15/u uname + Invalid cyc_thresh for intel_pt. Valid values are: 0-12 + + CYC packets are not requested by default. + +pt Specifies pass-through which enables the 'branch' config term. + + The default config selects 'pt' if it is available, so a user will + never need to specify this term. + +branch Enable branch tracing. Branch tracing is enabled by default so to + disable branch tracing use 'branch=0'. + + The default config selects 'branch' if it is available. + +ptw Enable PTWRITE packets which are produced when a ptwrite instruction + is executed. + + Support for this feature is indicated by: + + /sys/bus/event_source/devices/intel_pt/caps/ptwrite + + which contains "1" if the feature is supported and + "0" otherwise. + +fup_on_ptw Enable a FUP packet to follow the PTWRITE packet. The FUP packet + provides the address of the ptwrite instruction. In the absence of + fup_on_ptw, the decoder will use the address of the previous branch + if branch tracing is enabled, otherwise the address will be zero. + Note that fup_on_ptw will work even when branch tracing is disabled. + +pwr_evt Enable power events. The power events provide information about + changes to the CPU C-state. + + Support for this feature is indicated by: + + /sys/bus/event_source/devices/intel_pt/caps/power_event_trace + + which contains "1" if the feature is supported and + "0" otherwise. + + +new snapshot option +------------------- + +The difference between full trace and snapshot from the kernel's perspective is +that in full trace we don't overwrite trace data that the user hasn't collected +yet (and indicated that by advancing aux_tail), whereas in snapshot mode we let +the trace run and overwrite older data in the buffer so that whenever something +interesting happens, we can stop it and grab a snapshot of what was going on +around that interesting moment. + +To select snapshot mode a new option has been added: + + -S + +Optionally it can be followed by the snapshot size e.g. + + -S0x100000 + +The default snapshot size is the auxtrace mmap size. If neither auxtrace mmap size +nor snapshot size is specified, then the default is 4MiB for privileged users +(or if /proc/sys/kernel/perf_event_paranoid < 0), 128KiB for unprivileged users. +If an unprivileged user does not specify mmap pages, the mmap pages will be +reduced as described in the 'new auxtrace mmap size option' section below. + +The snapshot size is displayed if the option -vv is used e.g. + + Intel PT snapshot size: %zu + + +new auxtrace mmap size option +--------------------------- + +Intel PT buffer size is specified by an addition to the -m option e.g. + + -m,16 + +selects a buffer size of 16 pages i.e. 64KiB. + +Note that the existing functionality of -m is unchanged. The auxtrace mmap size +is specified by the optional addition of a comma and the value. + +The default auxtrace mmap size for Intel PT is 4MiB/page_size for privileged users +(or if /proc/sys/kernel/perf_event_paranoid < 0), 128KiB for unprivileged users. +If an unprivileged user does not specify mmap pages, the mmap pages will be +reduced from the default 512KiB/page_size to 256KiB/page_size, otherwise the +user is likely to get an error as they exceed their mlock limit (Max locked +memory as shown in /proc/self/limits). Note that perf does not count the first +512KiB (actually /proc/sys/kernel/perf_event_mlock_kb minus 1 page) per cpu +against the mlock limit so an unprivileged user is allowed 512KiB per cpu plus +their mlock limit (which defaults to 64KiB but is not multiplied by the number +of cpus). + +In full-trace mode, powers of two are allowed for buffer size, with a minimum +size of 2 pages. In snapshot mode, it is the same but the minimum size is +1 page. + +The mmap size and auxtrace mmap size are displayed if the -vv option is used e.g. + + mmap length 528384 + auxtrace mmap length 4198400 + + +Intel PT modes of operation +--------------------------- + +Intel PT can be used in 2 modes: + full-trace mode + snapshot mode + +Full-trace mode traces continuously e.g. + + perf record -e intel_pt//u uname + +Snapshot mode captures the available data when a signal is sent e.g. + + perf record -v -e intel_pt//u -S ./loopy 1000000000 & + [1] 11435 + kill -USR2 11435 + Recording AUX area tracing snapshot + +Note that the signal sent is SIGUSR2. +Note that "Recording AUX area tracing snapshot" is displayed because the -v +option is used. + +The 2 modes cannot be used together. + + +Buffer handling +--------------- + +There may be buffer limitations (i.e. single ToPa entry) which means that actual +buffer sizes are limited to powers of 2 up to 4MiB (MAX_ORDER). In order to +provide other sizes, and in particular an arbitrarily large size, multiple +buffers are logically concatenated. However an interrupt must be used to switch +between buffers. That has two potential problems: + a) the interrupt may not be handled in time so that the current buffer + becomes full and some trace data is lost. + b) the interrupts may slow the system and affect the performance + results. + +If trace data is lost, the driver sets 'truncated' in the PERF_RECORD_AUX event +which the tools report as an error. + +In full-trace mode, the driver waits for data to be copied out before allowing +the (logical) buffer to wrap-around. If data is not copied out quickly enough, +again 'truncated' is set in the PERF_RECORD_AUX event. If the driver has to +wait, the intel_pt event gets disabled. Because it is difficult to know when +that happens, perf tools always re-enable the intel_pt event after copying out +data. + + +Intel PT and build ids +---------------------- + +By default "perf record" post-processes the event stream to find all build ids +for executables for all addresses sampled. Deliberately, Intel PT is not +decoded for that purpose (it would take too long). Instead the build ids for +all executables encountered (due to mmap, comm or task events) are included +in the perf.data file. + +To see buildids included in the perf.data file use the command: + + perf buildid-list + +If the perf.data file contains Intel PT data, that is the same as: + + perf buildid-list --with-hits + + +Snapshot mode and event disabling +--------------------------------- + +In order to make a snapshot, the intel_pt event is disabled using an IOCTL, +namely PERF_EVENT_IOC_DISABLE. However doing that can also disable the +collection of side-band information. In order to prevent that, a dummy +software event has been introduced that permits tracking events (like mmaps) to +continue to be recorded while intel_pt is disabled. That is important to ensure +there is complete side-band information to allow the decoding of subsequent +snapshots. + +A test has been created for that. To find the test: + + perf test list + ... + 23: Test using a dummy software event to keep tracking + +To run the test: + + perf test 23 + 23: Test using a dummy software event to keep tracking : Ok + + +perf record modes (nothing new here) +------------------------------------ + +perf record essentially operates in one of three modes: + per thread + per cpu + workload only + +"per thread" mode is selected by -t or by --per-thread (with -p or -u or just a +workload). +"per cpu" is selected by -C or -a. +"workload only" mode is selected by not using the other options but providing a +command to run (i.e. the workload). + +In per-thread mode an exact list of threads is traced. There is no inheritance. +Each thread has its own event buffer. + +In per-cpu mode all processes (or processes from the selected cgroup i.e. -G +option, or processes selected with -p or -u) are traced. Each cpu has its own +buffer. Inheritance is allowed. + +In workload-only mode, the workload is traced but with per-cpu buffers. +Inheritance is allowed. Note that you can now trace a workload in per-thread +mode by using the --per-thread option. + + +Privileged vs non-privileged users +---------------------------------- + +Unless /proc/sys/kernel/perf_event_paranoid is set to -1, unprivileged users +have memory limits imposed upon them. That affects what buffer sizes they can +have as outlined above. + +The v4.2 kernel introduced support for a context switch metadata event, +PERF_RECORD_SWITCH, which allows unprivileged users to see when their processes +are scheduled out and in, just not by whom, which is left for the +PERF_RECORD_SWITCH_CPU_WIDE, that is only accessible in system wide context, +which in turn requires CAP_SYS_ADMIN. + +Please see the 45ac1403f564 ("perf: Add PERF_RECORD_SWITCH to indicate context +switches") commit, that introduces these metadata events for further info. + +When working with kernels < v4.2, the following considerations must be taken, +as the sched:sched_switch tracepoints will be used to receive such information: + +Unless /proc/sys/kernel/perf_event_paranoid is set to -1, unprivileged users are +not permitted to use tracepoints which means there is insufficient side-band +information to decode Intel PT in per-cpu mode, and potentially workload-only +mode too if the workload creates new processes. + +Note also, that to use tracepoints, read-access to debugfs is required. So if +debugfs is not mounted or the user does not have read-access, it will again not +be possible to decode Intel PT in per-cpu mode. + + +sched_switch tracepoint +----------------------- + +The sched_switch tracepoint is used to provide side-band data for Intel PT +decoding in kernels where the PERF_RECORD_SWITCH metadata event isn't +available. + +The sched_switch events are automatically added. e.g. the second event shown +below: + + $ perf record -vv -e intel_pt//u uname + ------------------------------------------------------------ + perf_event_attr: + type 6 + size 112 + config 0x400 + { sample_period, sample_freq } 1 + sample_type IP|TID|TIME|CPU|IDENTIFIER + read_format ID + disabled 1 + inherit 1 + exclude_kernel 1 + exclude_hv 1 + enable_on_exec 1 + sample_id_all 1 + ------------------------------------------------------------ + sys_perf_event_open: pid 31104 cpu 0 group_fd -1 flags 0x8 + sys_perf_event_open: pid 31104 cpu 1 group_fd -1 flags 0x8 + sys_perf_event_open: pid 31104 cpu 2 group_fd -1 flags 0x8 + sys_perf_event_open: pid 31104 cpu 3 group_fd -1 flags 0x8 + ------------------------------------------------------------ + perf_event_attr: + type 2 + size 112 + config 0x108 + { sample_period, sample_freq } 1 + sample_type IP|TID|TIME|CPU|PERIOD|RAW|IDENTIFIER + read_format ID + inherit 1 + sample_id_all 1 + exclude_guest 1 + ------------------------------------------------------------ + sys_perf_event_open: pid -1 cpu 0 group_fd -1 flags 0x8 + sys_perf_event_open: pid -1 cpu 1 group_fd -1 flags 0x8 + sys_perf_event_open: pid -1 cpu 2 group_fd -1 flags 0x8 + sys_perf_event_open: pid -1 cpu 3 group_fd -1 flags 0x8 + ------------------------------------------------------------ + perf_event_attr: + type 1 + size 112 + config 0x9 + { sample_period, sample_freq } 1 + sample_type IP|TID|TIME|IDENTIFIER + read_format ID + disabled 1 + inherit 1 + exclude_kernel 1 + exclude_hv 1 + mmap 1 + comm 1 + enable_on_exec 1 + task 1 + sample_id_all 1 + mmap2 1 + comm_exec 1 + ------------------------------------------------------------ + sys_perf_event_open: pid 31104 cpu 0 group_fd -1 flags 0x8 + sys_perf_event_open: pid 31104 cpu 1 group_fd -1 flags 0x8 + sys_perf_event_open: pid 31104 cpu 2 group_fd -1 flags 0x8 + sys_perf_event_open: pid 31104 cpu 3 group_fd -1 flags 0x8 + mmap size 528384B + AUX area mmap length 4194304 + perf event ring buffer mmapped per cpu + Synthesizing auxtrace information + Linux + [ perf record: Woken up 1 times to write data ] + [ perf record: Captured and wrote 0.042 MB perf.data ] + +Note, the sched_switch event is only added if the user is permitted to use it +and only in per-cpu mode. + +Note also, the sched_switch event is only added if TSC packets are requested. +That is because, in the absence of timing information, the sched_switch events +cannot be matched against the Intel PT trace. + + +perf script +=========== + +By default, perf script will decode trace data found in the perf.data file. +This can be further controlled by new option --itrace. + + +New --itrace option +------------------- + +Having no option is the same as + + --itrace + +which, in turn, is the same as + + --itrace=ibxwpe + +The letters are: + + i synthesize "instructions" events + b synthesize "branches" events + x synthesize "transactions" events + w synthesize "ptwrite" events + p synthesize "power" events + c synthesize branches events (calls only) + r synthesize branches events (returns only) + e synthesize tracing error events + d create a debug log + g synthesize a call chain (use with i or x) + l synthesize last branch entries (use with i or x) + s skip initial number of events + +"Instructions" events look like they were recorded by "perf record -e +instructions". + +"Branches" events look like they were recorded by "perf record -e branches". "c" +and "r" can be combined to get calls and returns. + +"Transactions" events correspond to the start or end of transactions. The +'flags' field can be used in perf script to determine whether the event is a +tranasaction start, commit or abort. + +Note that "instructions", "branches" and "transactions" events depend on code +flow packets which can be disabled by using the config term "branch=0". Refer +to the config terms section above. + +"ptwrite" events record the payload of the ptwrite instruction and whether +"fup_on_ptw" was used. "ptwrite" events depend on PTWRITE packets which are +recorded only if the "ptw" config term was used. Refer to the config terms +section above. perf script "synth" field displays "ptwrite" information like +this: "ip: 0 payload: 0x123456789abcdef0" where "ip" is 1 if "fup_on_ptw" was +used. + +"Power" events correspond to power event packets and CBR (core-to-bus ratio) +packets. While CBR packets are always recorded when tracing is enabled, power +event packets are recorded only if the "pwr_evt" config term was used. Refer to +the config terms section above. The power events record information about +C-state changes, whereas CBR is indicative of CPU frequency. perf script +"event,synth" fields display information like this: + cbr: cbr: 22 freq: 2189 MHz (200%) + mwait: hints: 0x60 extensions: 0x1 + pwre: hw: 0 cstate: 2 sub-cstate: 0 + exstop: ip: 1 + pwrx: deepest cstate: 2 last cstate: 2 wake reason: 0x4 +Where: + "cbr" includes the frequency and the percentage of maximum non-turbo + "mwait" shows mwait hints and extensions + "pwre" shows C-state transitions (to a C-state deeper than C0) and + whether initiated by hardware + "exstop" indicates execution stopped and whether the IP was recorded + exactly, + "pwrx" indicates return to C0 +For more details refer to the Intel 64 and IA-32 Architectures Software +Developer Manuals. + +Error events show where the decoder lost the trace. Error events +are quite important. Users must know if what they are seeing is a complete +picture or not. + +The "d" option will cause the creation of a file "intel_pt.log" containing all +decoded packets and instructions. Note that this option slows down the decoder +and that the resulting file may be very large. + +In addition, the period of the "instructions" event can be specified. e.g. + + --itrace=i10us + +sets the period to 10us i.e. one instruction sample is synthesized for each 10 +microseconds of trace. Alternatives to "us" are "ms" (milliseconds), +"ns" (nanoseconds), "t" (TSC ticks) or "i" (instructions). + +"ms", "us" and "ns" are converted to TSC ticks. + +The timing information included with Intel PT does not give the time of every +instruction. Consequently, for the purpose of sampling, the decoder estimates +the time since the last timing packet based on 1 tick per instruction. The time +on the sample is *not* adjusted and reflects the last known value of TSC. + +For Intel PT, the default period is 100us. + +Setting it to a zero period means "as often as possible". + +In the case of Intel PT that is the same as a period of 1 and a unit of +'instructions' (i.e. --itrace=i1i). + +Also the call chain size (default 16, max. 1024) for instructions or +transactions events can be specified. e.g. + + --itrace=ig32 + --itrace=xg32 + +Also the number of last branch entries (default 64, max. 1024) for instructions or +transactions events can be specified. e.g. + + --itrace=il10 + --itrace=xl10 + +Note that last branch entries are cleared for each sample, so there is no overlap +from one sample to the next. + +To disable trace decoding entirely, use the option --no-itrace. + +It is also possible to skip events generated (instructions, branches, transactions) +at the beginning. This is useful to ignore initialization code. + + --itrace=i0nss1000000 + +skips the first million instructions. + +dump option +----------- + +perf script has an option (-D) to "dump" the events i.e. display the binary +data. + +When -D is used, Intel PT packets are displayed. The packet decoder does not +pay attention to PSB packets, but just decodes the bytes - so the packets seen +by the actual decoder may not be identical in places where the data is corrupt. +One example of that would be when the buffer-switching interrupt has been too +slow, and the buffer has been filled completely. In that case, the last packet +in the buffer might be truncated and immediately followed by a PSB as the trace +continues in the next buffer. + +To disable the display of Intel PT packets, combine the -D option with +--no-itrace. + + +perf report +=========== + +By default, perf report will decode trace data found in the perf.data file. +This can be further controlled by new option --itrace exactly the same as +perf script, with the exception that the default is --itrace=igxe. + + +perf inject +=========== + +perf inject also accepts the --itrace option in which case tracing data is +removed and replaced with the synthesized events. e.g. + + perf inject --itrace -i perf.data -o perf.data.new + +Below is an example of using Intel PT with autofdo. It requires autofdo +(https://github.com/google/autofdo) and gcc version 5. The bubble +sort example is from the AutoFDO tutorial (https://gcc.gnu.org/wiki/AutoFDO/Tutorial) +amended to take the number of elements as a parameter. + + $ gcc-5 -O3 sort.c -o sort_optimized + $ ./sort_optimized 30000 + Bubble sorting array of 30000 elements + 2254 ms + + $ cat ~/.perfconfig + [intel-pt] + mispred-all = on + + $ perf record -e intel_pt//u ./sort 3000 + Bubble sorting array of 3000 elements + 58 ms + [ perf record: Woken up 2 times to write data ] + [ perf record: Captured and wrote 3.939 MB perf.data ] + $ perf inject -i perf.data -o inj --itrace=i100usle --strip + $ ./create_gcov --binary=./sort --profile=inj --gcov=sort.gcov -gcov_version=1 + $ gcc-5 -O3 -fauto-profile=sort.gcov sort.c -o sort_autofdo + $ ./sort_autofdo 30000 + Bubble sorting array of 30000 elements + 2155 ms + +Note there is currently no advantage to using Intel PT instead of LBR, but +that may change in the future if greater use is made of the data. diff --git a/Documentation/itrace.txt b/Documentation/itrace.txt new file mode 100644 index 0000000..a3abe04 --- /dev/null +++ b/Documentation/itrace.txt @@ -0,0 +1,36 @@ + i synthesize instructions events + b synthesize branches events + c synthesize branches events (calls only) + r synthesize branches events (returns only) + x synthesize transactions events + w synthesize ptwrite events + p synthesize power events + e synthesize error events + d create a debug log + g synthesize a call chain (use with i or x) + l synthesize last branch entries (use with i or x) + s skip initial number of events + + The default is all events i.e. the same as --itrace=ibxwpe + + In addition, the period (default 100000) for instructions events + can be specified in units of: + + i instructions + t ticks + ms milliseconds + us microseconds + ns nanoseconds (default) + + Also the call chain size (default 16, max. 1024) for instructions or + transactions events can be specified. + + Also the number of last branch entries (default 64, max. 1024) for + instructions or transactions events can be specified. + + It is also possible to skip events generated (instructions, branches, transactions, + ptwrite, power) at the beginning. This is useful to ignore initialization code. + + --itrace=i0nss1000000 + + skips the first million instructions. diff --git a/Documentation/jit-interface.txt b/Documentation/jit-interface.txt new file mode 100644 index 0000000..a8656f5 --- /dev/null +++ b/Documentation/jit-interface.txt @@ -0,0 +1,15 @@ +perf supports a simple JIT interface to resolve symbols for dynamic code generated +by a JIT. + +The JIT has to write a /tmp/perf-%d.map (%d = pid of process) file + +This is a text file. + +Each line has the following format, fields separated with spaces: + +START SIZE symbolname + +START and SIZE are hex numbers without 0x. +symbolname is the rest of the line, so it could contain special characters. + +The ownership of the file has to match the process. diff --git a/Documentation/jitdump-specification.txt b/Documentation/jitdump-specification.txt new file mode 100644 index 0000000..4c62b07 --- /dev/null +++ b/Documentation/jitdump-specification.txt @@ -0,0 +1,170 @@ +JITDUMP specification version 2 +Last Revised: 09/15/2016 +Author: Stephane Eranian + +-------------------------------------------------------- +| Revision | Date | Description | +-------------------------------------------------------- +| 1 | 09/07/2016 | Initial revision | +-------------------------------------------------------- +| 2 | 09/15/2016 | Add JIT_CODE_UNWINDING_INFO | +-------------------------------------------------------- + + +I/ Introduction + + +This document describes the jitdump file format. The file is generated by Just-In-time compiler runtimes to save meta-data information about the generated code, such as address, size, and name of generated functions, the native code generated, the source line information. The data may then be used by performance tools, such as Linux perf to generate function and assembly level profiles. + +The format is not specific to any particular programming language. It can be extended as need be. + +The format of the file is binary. It is self-describing in terms of endianness and is portable across multiple processor architectures. + + +II/ Overview of the format + + +The format requires only sequential accesses, i.e., append only mode. The file starts with a fixed size file header describing the version of the specification, the endianness. + +The header is followed by a series of records, each starting with a fixed size header describing the type of record and its size. It is, itself, followed by the payload for the record. Records can have a variable size even for a given type. + +Each entry in the file is timestamped. All timestamps must use the same clock source. The CLOCK_MONOTONIC clock source is recommended. + + +III/ Jitdump file header format + +Each jitdump file starts with a fixed size header containing the following fields in order: + + +* uint32_t magic : a magic number tagging the file type. The value is 4-byte long and represents the string "JiTD" in ASCII form. It is 0x4A695444 or 0x4454694a depending on the endianness. The field can be used to detect the endianness of the file +* uint32_t version : a 4-byte value representing the format version. It is currently set to 2 +* uint32_t total_size: size in bytes of file header +* uint32_t elf_mach : ELF architecture encoding (ELF e_machine value as specified in /usr/include/elf.h) +* uint32_t pad1 : padding. Reserved for future use +* uint32_t pid : JIT runtime process identification (OS specific) +* uint64_t timestamp : timestamp of when the file was created +* uint64_t flags : a bitmask of flags + +The flags currently defined are as follows: + * bit 0: JITDUMP_FLAGS_ARCH_TIMESTAMP : set if the jitdump file is using an architecture-specific timestamp clock source. For instance, on x86, one could use TSC directly + +IV/ Record header + +The file header is immediately followed by records. Each record starts with a fixed size header describing the record that follows. + +The record header is specified in order as follows: +* uint32_t id : a value identifying the record type (see below) +* uint32_t total_size: the size in bytes of the record including the header. +* uint64_t timestamp : a timestamp of when the record was created. + +The following record types are defined: + * Value 0 : JIT_CODE_LOAD : record describing a jitted function + * Value 1 : JIT_CODE_MOVE : record describing an already jitted function which is moved + * Value 2 : JIT_CODE_DEBUG_INFO: record describing the debug information for a jitted function + * Value 3 : JIT_CODE_CLOSE : record marking the end of the jit runtime (optional) + * Value 4 : JIT_CODE_UNWINDING_INFO: record describing a function unwinding information + + The payload of the record must immediately follow the record header without padding. + +V/ JIT_CODE_LOAD record + + + The record has the following fields following the fixed-size record header in order: + * uint32_t pid: OS process id of the runtime generating the jitted code + * uint32_t tid: OS thread identification of the runtime thread generating the jitted code + * uint64_t vma: virtual address of jitted code start + * uint64_t code_addr: code start address for the jitted code. By default vma = code_addr + * uint64_t code_size: size in bytes of the generated jitted code + * uint64_t code_index: unique identifier for the jitted code (see below) + * char[n]: function name in ASCII including the null termination + * native code: raw byte encoding of the jitted code + + The record header total_size field is inclusive of all components: + * record header + * fixed-sized fields + * function name string, including termination + * native code length + * record specific variable data (e.g., array of data entries) + +The code_index is used to uniquely identify each jitted function. The index can be a monotonically increasing 64-bit value. Each time a function is jitted it gets a new number. This value is used in case the code for a function is moved and avoids having to issue another JIT_CODE_LOAD record. + +The format supports empty functions with no native code. + + +VI/ JIT_CODE_MOVE record + + The record type is optional. + + The record has the following fields following the fixed-size record header in order: + * uint32_t pid : OS process id of the runtime generating the jitted code + * uint32_t tid : OS thread identification of the runtime thread generating the jitted code + * uint64_t vma : new virtual address of jitted code start + * uint64_t old_code_addr: previous code address for the same function + * uint64_t new_code_addr: alternate new code started address for the jitted code. By default it should be equal to the vma address. + * uint64_t code_size : size in bytes of the jitted code + * uint64_t code_index : index referring to the JIT_CODE_LOAD code_index record of when the function was initially jitted + + +The MOVE record can be used in case an already jitted function is simply moved by the runtime inside the code cache. + +The JIT_CODE_MOVE record cannot come before the JIT_CODE_LOAD record for the same function name. The function cannot have changed name, otherwise a new JIT_CODE_LOAD record must be emitted. + +The code size of the function cannot change. + + +VII/ JIT_DEBUG_INFO record + +The record type is optional. + +The record contains source lines debug information, i.e., a way to map a code address back to a source line. This information may be used by the performance tool. + +The record has the following fields following the fixed-size record header in order: + * uint64_t code_addr: address of function for which the debug information is generated + * uint64_t nr_entry : number of debug entries for the function + * debug_entry[n]: array of nr_entry debug entries for the function + +The debug_entry describes the source line information. It is defined as follows in order: +* uint64_t code_addr: address of function for which the debug information is generated +* uint32_t line : source file line number (starting at 1) +* uint32_t discrim : column discriminator, 0 is default +* char name[n] : source file name in ASCII, including null termination + +The debug_entry entries are saved in sequence but given that they have variable sizes due to the file name string, they cannot be indexed directly. +They need to be walked sequentially. The next debug_entry is found at sizeof(debug_entry) + strlen(name) + 1. + +IMPORTANT: + The JIT_CODE_DEBUG for a given function must always be generated BEFORE the JIT_CODE_LOAD for the function. This facilitates greatly the parser for the jitdump file. + + +VIII/ JIT_CODE_CLOSE record + + +The record type is optional. + +The record is used as a marker for the end of the jitted runtime. It can be replaced by the end of the file. + +The JIT_CODE_CLOSE record does not have any specific fields, the record header contains all the information needed. + + +IX/ JIT_CODE_UNWINDING_INFO + + +The record type is optional. + +The record is used to describe the unwinding information for a jitted function. + +The record has the following fields following the fixed-size record header in order: + +uint64_t unwind_data_size : the size in bytes of the unwinding data table at the end of the record +uint64_t eh_frame_hdr_size : the size in bytes of the DWARF EH Frame Header at the start of the unwinding data table at the end of the record +uint64_t mapped_size : the size of the unwinding data mapped in memory +const char unwinding_data[n]: an array of unwinding data, consisting of the EH Frame Header, followed by the actual EH Frame + + +The EH Frame header follows the Linux Standard Base (LSB) specification as described in the document at https://refspecs.linuxfoundation.org/LSB_1.3.0/gLSB/gLSB/ehframehdr.html + + +The EH Frame follows the LSB specicfication as described in the document at https://refspecs.linuxbase.org/LSB_3.0.0/LSB-PDA/LSB-PDA/ehframechpt.html + + +NOTE: The mapped_size is generally either the same as unwind_data_size (if the unwinding data was mapped in memory by the running process) or zero (if the unwinding data is not mapped by the process). If the unwinding data was not mapped, then only the EH Frame Header will be read, which can be used to specify FP based unwinding for a function which does not have unwinding information. diff --git a/Documentation/manpage-1.72.xsl b/Documentation/manpage-1.72.xsl new file mode 100644 index 0000000..b4d315c --- /dev/null +++ b/Documentation/manpage-1.72.xsl @@ -0,0 +1,14 @@ + + + + + + + + + + diff --git a/Documentation/manpage-base.xsl b/Documentation/manpage-base.xsl new file mode 100644 index 0000000..a264fa6 --- /dev/null +++ b/Documentation/manpage-base.xsl @@ -0,0 +1,35 @@ + + + + + + + + + + + + + + sp + + + + + + + + br + + + diff --git a/Documentation/manpage-bold-literal.xsl b/Documentation/manpage-bold-literal.xsl new file mode 100644 index 0000000..608eb5d --- /dev/null +++ b/Documentation/manpage-bold-literal.xsl @@ -0,0 +1,17 @@ + + + + + + + fB + + + fR + + + diff --git a/Documentation/manpage-normal.xsl b/Documentation/manpage-normal.xsl new file mode 100644 index 0000000..a48f5b1 --- /dev/null +++ b/Documentation/manpage-normal.xsl @@ -0,0 +1,13 @@ + + + + + + +\ +. + + diff --git a/Documentation/manpage-suppress-sp.xsl b/Documentation/manpage-suppress-sp.xsl new file mode 100644 index 0000000..a63c763 --- /dev/null +++ b/Documentation/manpage-suppress-sp.xsl @@ -0,0 +1,21 @@ + + + + + + + + + + + + + + + diff --git a/Documentation/perf-annotate.txt b/Documentation/perf-annotate.txt new file mode 100644 index 0000000..749cc60 --- /dev/null +++ b/Documentation/perf-annotate.txt @@ -0,0 +1,123 @@ +perf-annotate(1) +================ + +NAME +---- +perf-annotate - Read perf.data (created by perf record) and display annotated code + +SYNOPSIS +-------- +[verse] +'perf annotate' [-i | --input=file] [symbol_name] + +DESCRIPTION +----------- +This command reads the input file and displays an annotated version of the +code. If the object file has debug symbols then the source code will be +displayed alongside assembly code. + +If there is no debug info in the object, then annotated assembly is displayed. + +OPTIONS +------- +-i:: +--input=:: + Input file name. (default: perf.data unless stdin is a fifo) + +-d:: +--dsos=:: + Only consider symbols in these dsos. +-s:: +--symbol=:: + Symbol to annotate. + +-f:: +--force:: + Don't do ownership validation. + +-v:: +--verbose:: + Be more verbose. (Show symbol address, etc) + +-q:: +--quiet:: + Do not show any message. (Suppress -v) + +-n:: +--show-nr-samples:: + Show the number of samples for each symbol + +-D:: +--dump-raw-trace:: + Dump raw trace in ASCII. + +-k:: +--vmlinux=:: + vmlinux pathname. + +--ignore-vmlinux:: + Ignore vmlinux files. + +-m:: +--modules:: + Load module symbols. WARNING: use only with -k and LIVE kernel. + +-l:: +--print-line:: + Print matching source lines (may be slow). + +-P:: +--full-paths:: + Don't shorten the displayed pathnames. + +--stdio:: Use the stdio interface. + +--stdio2:: Use the stdio2 interface, non-interactive, uses the TUI formatting. + +--stdio-color=:: + 'always', 'never' or 'auto', allowing configuring color output + via the command line, in addition to via "color.ui" .perfconfig. + Use '--stdio-color always' to generate color even when redirecting + to a pipe or file. Using just '--stdio-color' is equivalent to + using 'always'. + +--tui:: Use the TUI interface. Use of --tui requires a tty, if one is not + present, as when piping to other commands, the stdio interface is + used. This interfaces starts by centering on the line with more + samples, TAB/UNTAB cycles through the lines with more samples. + +--gtk:: Use the GTK interface. + +-C:: +--cpu=:: Only report samples for the list of CPUs provided. Multiple CPUs can + be provided as a comma-separated list with no space: 0,1. Ranges of + CPUs are specified with -: 0-2. Default is to report samples on all + CPUs. + +--asm-raw:: + Show raw instruction encoding of assembly instructions. + +--show-total-period:: Show a column with the sum of periods. + +--source:: + Interleave source code with assembly code. Enabled by default, + disable with --no-source. + +--symfs=:: + Look for files with symbols relative to this directory. + +-M:: +--disassembler-style=:: Set disassembler style for objdump. + +--objdump=:: + Path to objdump binary. + +--skip-missing:: + Skip symbols that cannot be annotated. + +--group:: + Show event group information together + +SEE ALSO +-------- +linkperf:perf-record[1], linkperf:perf-report[1] diff --git a/Documentation/perf-archive.txt b/Documentation/perf-archive.txt new file mode 100644 index 0000000..ac6ecbb --- /dev/null +++ b/Documentation/perf-archive.txt @@ -0,0 +1,22 @@ +perf-archive(1) +=============== + +NAME +---- +perf-archive - Create archive with object files with build-ids found in perf.data file + +SYNOPSIS +-------- +[verse] +'perf archive' [file] + +DESCRIPTION +----------- +This command runs perf-buildid-list --with-hits, and collects the files with the +buildids found so that analysis of perf.data contents can be possible on another +machine. + + +SEE ALSO +-------- +linkperf:perf-record[1], linkperf:perf-buildid-list[1], linkperf:perf-report[1] diff --git a/Documentation/perf-bench.txt b/Documentation/perf-bench.txt new file mode 100644 index 0000000..34750fc --- /dev/null +++ b/Documentation/perf-bench.txt @@ -0,0 +1,209 @@ +perf-bench(1) +============= + +NAME +---- +perf-bench - General framework for benchmark suites + +SYNOPSIS +-------- +[verse] +'perf bench' [] [] + +DESCRIPTION +----------- +This 'perf bench' command is a general framework for benchmark suites. + +COMMON OPTIONS +-------------- +-r:: +--repeat=:: +Specify amount of times to repeat the run (default 10). + +-f:: +--format=:: +Specify format style. +Current available format styles are: + +'default':: +Default style. This is mainly for human reading. +--------------------- +% perf bench sched pipe # with no style specified +(executing 1000000 pipe operations between two tasks) + Total time:5.855 sec + 5.855061 usecs/op + 170792 ops/sec +--------------------- + +'simple':: +This simple style is friendly for automated +processing by scripts. +--------------------- +% perf bench --format=simple sched pipe # specified simple +5.988 +--------------------- + +SUBSYSTEM +--------- + +'sched':: + Scheduler and IPC mechanisms. + +'mem':: + Memory access performance. + +'numa':: + NUMA scheduling and MM benchmarks. + +'futex':: + Futex stressing benchmarks. + +'all':: + All benchmark subsystems. + +SUITES FOR 'sched' +~~~~~~~~~~~~~~~~~~ +*messaging*:: +Suite for evaluating performance of scheduler and IPC mechanisms. +Based on hackbench by Rusty Russell. + +Options of *messaging* +^^^^^^^^^^^^^^^^^^^^^^ +-p:: +--pipe:: +Use pipe() instead of socketpair() + +-t:: +--thread:: +Be multi thread instead of multi process + +-g:: +--group=:: +Specify number of groups + +-l:: +--nr_loops=:: +Specify number of loops + +Example of *messaging* +^^^^^^^^^^^^^^^^^^^^^^ + +--------------------- +% perf bench sched messaging # run with default +options (20 sender and receiver processes per group) +(10 groups == 400 processes run) + + Total time:0.308 sec + +% perf bench sched messaging -t -g 20 # be multi-thread, with 20 groups +(20 sender and receiver threads per group) +(20 groups == 800 threads run) + + Total time:0.582 sec +--------------------- + +*pipe*:: +Suite for pipe() system call. +Based on pipe-test-1m.c by Ingo Molnar. + +Options of *pipe* +^^^^^^^^^^^^^^^^^ +-l:: +--loop=:: +Specify number of loops. + +Example of *pipe* +^^^^^^^^^^^^^^^^^ + +--------------------- +% perf bench sched pipe +(executing 1000000 pipe operations between two tasks) + + Total time:8.091 sec + 8.091833 usecs/op + 123581 ops/sec + +% perf bench sched pipe -l 1000 # loop 1000 +(executing 1000 pipe operations between two tasks) + + Total time:0.016 sec + 16.948000 usecs/op + 59004 ops/sec +--------------------- + +SUITES FOR 'mem' +~~~~~~~~~~~~~~~~ +*memcpy*:: +Suite for evaluating performance of simple memory copy in various ways. + +Options of *memcpy* +^^^^^^^^^^^^^^^^^^^ +-l:: +--size:: +Specify size of memory to copy (default: 1MB). +Available units are B, KB, MB, GB and TB (case insensitive). + +-f:: +--function:: +Specify function to copy (default: default). +Available functions are depend on the architecture. +On x86-64, x86-64-unrolled, x86-64-movsq and x86-64-movsb are supported. + +-l:: +--nr_loops:: +Repeat memcpy invocation this number of times. + +-c:: +--cycles:: +Use perf's cpu-cycles event instead of gettimeofday syscall. + +*memset*:: +Suite for evaluating performance of simple memory set in various ways. + +Options of *memset* +^^^^^^^^^^^^^^^^^^^ +-l:: +--size:: +Specify size of memory to set (default: 1MB). +Available units are B, KB, MB, GB and TB (case insensitive). + +-f:: +--function:: +Specify function to set (default: default). +Available functions are depend on the architecture. +On x86-64, x86-64-unrolled, x86-64-stosq and x86-64-stosb are supported. + +-l:: +--nr_loops:: +Repeat memset invocation this number of times. + +-c:: +--cycles:: +Use perf's cpu-cycles event instead of gettimeofday syscall. + +SUITES FOR 'numa' +~~~~~~~~~~~~~~~~~ +*mem*:: +Suite for evaluating NUMA workloads. + +SUITES FOR 'futex' +~~~~~~~~~~~~~~~~~~ +*hash*:: +Suite for evaluating hash tables. + +*wake*:: +Suite for evaluating wake calls. + +*wake-parallel*:: +Suite for evaluating parallel wake calls. + +*requeue*:: +Suite for evaluating requeue calls. + +*lock-pi*:: +Suite for evaluating futex lock_pi calls. + + +SEE ALSO +-------- +linkperf:perf[1] diff --git a/Documentation/perf-buildid-cache.txt b/Documentation/perf-buildid-cache.txt new file mode 100644 index 0000000..73c2650 --- /dev/null +++ b/Documentation/perf-buildid-cache.txt @@ -0,0 +1,74 @@ +perf-buildid-cache(1) +===================== + +NAME +---- +perf-buildid-cache - Manage build-id cache. + +SYNOPSIS +-------- +[verse] +'perf buildid-cache ' + +DESCRIPTION +----------- +This command manages the build-id cache. It can add, remove, update and purge +files to/from the cache. In the future it should as well set upper limits for +the space used by the cache, etc. +This also scans the target binary for SDT (Statically Defined Tracing) and +record it along with the buildid-cache, which will be used by perf-probe. +For more details, see linkperf:perf-probe[1]. + +OPTIONS +------- +-a:: +--add=:: + Add specified file to the cache. +-f:: +--force:: + Don't complain, do it. +-k:: +--kcore:: + Add specified kcore file to the cache. For the current host that is + /proc/kcore which requires root permissions to read. Be aware that + running 'perf buildid-cache' as root may update root's build-id cache + not the user's. Use the -v option to see where the file is created. + Note that the copied file contains only code sections not the whole core + image. Note also that files "kallsyms" and "modules" must also be in the + same directory and are also copied. All 3 files are created with read + permissions for root only. kcore will not be added if there is already a + kcore in the cache (with the same build-id) that has the same modules at + the same addresses. Use the -v option to see if a copy of kcore is + actually made. +-r:: +--remove=:: + Remove a cached binary which has same build-id of specified file + from the cache. +-p:: +--purge=:: + Purge all cached binaries including older caches which have specified + path from the cache. +-M:: +--missing=:: + List missing build ids in the cache for the specified file. +-u:: +--update=:: + Update specified file of the cache. Note that this doesn't remove + older entires since those may be still needed for annotating old + (or remote) perf.data. Only if there is already a cache which has + exactly same build-id, that is replaced by new one. It can be used + to update kallsyms and kernel dso to vmlinux in order to support + annotation. + +-v:: +--verbose:: + Be more verbose. + +--target-ns=PID: + Obtain mount namespace information from the target pid. This is + used when creating a uprobe for a process that resides in a + different mount namespace from the perf(1) utility. + +SEE ALSO +-------- +linkperf:perf-record[1], linkperf:perf-report[1], linkperf:perf-buildid-list[1] diff --git a/Documentation/perf-buildid-list.txt b/Documentation/perf-buildid-list.txt new file mode 100644 index 0000000..25c52ef --- /dev/null +++ b/Documentation/perf-buildid-list.txt @@ -0,0 +1,43 @@ +perf-buildid-list(1) +==================== + +NAME +---- +perf-buildid-list - List the buildids in a perf.data file + +SYNOPSIS +-------- +[verse] +'perf buildid-list ' + +DESCRIPTION +----------- +This command displays the buildids found in a perf.data file, so that other +tools can be used to fetch packages with matching symbol tables for use by +perf report. + +It can also be used to show the build id of the running kernel or in an ELF +file using -i/--input. + +OPTIONS +------- +-H:: +--with-hits:: + Show only DSOs with hits. +-i:: +--input=:: + Input file name. (default: perf.data unless stdin is a fifo) +-f:: +--force:: + Don't do ownership validation. +-k:: +--kernel:: + Show running kernel build id. +-v:: +--verbose:: + Be more verbose. + +SEE ALSO +-------- +linkperf:perf-record[1], linkperf:perf-top[1], +linkperf:perf-report[1] diff --git a/Documentation/perf-c2c.txt b/Documentation/perf-c2c.txt new file mode 100644 index 0000000..095aebd --- /dev/null +++ b/Documentation/perf-c2c.txt @@ -0,0 +1,290 @@ +perf-c2c(1) +=========== + +NAME +---- +perf-c2c - Shared Data C2C/HITM Analyzer. + +SYNOPSIS +-------- +[verse] +'perf c2c record' [] +'perf c2c record' [] -- [] +'perf c2c report' [] + +DESCRIPTION +----------- +C2C stands for Cache To Cache. + +The perf c2c tool provides means for Shared Data C2C/HITM analysis. It allows +you to track down the cacheline contentions. + +The tool is based on x86's load latency and precise store facility events +provided by Intel CPUs. These events provide: + - memory address of the access + - type of the access (load and store details) + - latency (in cycles) of the load access + +The c2c tool provide means to record this data and report back access details +for cachelines with highest contention - highest number of HITM accesses. + +The basic workflow with this tool follows the standard record/report phase. +User uses the record command to record events data and report command to +display it. + + +RECORD OPTIONS +-------------- +-e:: +--event=:: + Select the PMU event. Use 'perf mem record -e list' + to list available events. + +-v:: +--verbose:: + Be more verbose (show counter open errors, etc). + +-l:: +--ldlat:: + Configure mem-loads latency. + +-k:: +--all-kernel:: + Configure all used events to run in kernel space. + +-u:: +--all-user:: + Configure all used events to run in user space. + +REPORT OPTIONS +-------------- +-k:: +--vmlinux=:: + vmlinux pathname + +-v:: +--verbose:: + Be more verbose (show counter open errors, etc). + +-i:: +--input:: + Specify the input file to process. + +-N:: +--node-info:: + Show extra node info in report (see NODE INFO section) + +-c:: +--coalesce:: + Specify sorting fields for single cacheline display. + Following fields are available: tid,pid,iaddr,dso + (see COALESCE) + +-g:: +--call-graph:: + Setup callchains parameters. + Please refer to perf-report man page for details. + +--stdio:: + Force the stdio output (see STDIO OUTPUT) + +--stats:: + Display only statistic tables and force stdio mode. + +--full-symbols:: + Display full length of symbols. + +--no-source:: + Do not display Source:Line column. + +--show-all:: + Show all captured HITM lines, with no regard to HITM % 0.0005 limit. + +-f:: +--force:: + Don't do ownership validation. + +-d:: +--display:: + Switch to HITM type (rmt, lcl) to display and sort on. Total HITMs as default. + +C2C RECORD +---------- +The perf c2c record command setup options related to HITM cacheline analysis +and calls standard perf record command. + +Following perf record options are configured by default: +(check perf record man page for details) + + -W,-d,--phys-data,--sample-cpu + +Unless specified otherwise with '-e' option, following events are monitored by +default: + + cpu/mem-loads,ldlat=30/P + cpu/mem-stores/P + +User can pass any 'perf record' option behind '--' mark, like (to enable +callchains and system wide monitoring): + + $ perf c2c record -- -g -a + +Please check RECORD OPTIONS section for specific c2c record options. + +C2C REPORT +---------- +The perf c2c report command displays shared data analysis. It comes in two +display modes: stdio and tui (default). + +The report command workflow is following: + - sort all the data based on the cacheline address + - store access details for each cacheline + - sort all cachelines based on user settings + - display data + +In general perf report output consist of 2 basic views: + 1) most expensive cachelines list + 2) offsets details for each cacheline + +For each cacheline in the 1) list we display following data: +(Both stdio and TUI modes follow the same fields output) + + Index + - zero based index to identify the cacheline + + Cacheline + - cacheline address (hex number) + + Total records + - sum of all cachelines accesses + + Rmt/Lcl Hitm + - cacheline percentage of all Remote/Local HITM accesses + + LLC Load Hitm - Total, Lcl, Rmt + - count of Total/Local/Remote load HITMs + + Store Reference - Total, L1Hit, L1Miss + Total - all store accesses + L1Hit - store accesses that hit L1 + L1Hit - store accesses that missed L1 + + Load Dram + - count of local and remote DRAM accesses + + LLC Ld Miss + - count of all accesses that missed LLC + + Total Loads + - sum of all load accesses + + Core Load Hit - FB, L1, L2 + - count of load hits in FB (Fill Buffer), L1 and L2 cache + + LLC Load Hit - Llc, Rmt + - count of LLC and Remote load hits + +For each offset in the 2) list we display following data: + + HITM - Rmt, Lcl + - % of Remote/Local HITM accesses for given offset within cacheline + + Store Refs - L1 Hit, L1 Miss + - % of store accesses that hit/missed L1 for given offset within cacheline + + Data address - Offset + - offset address + + Pid + - pid of the process responsible for the accesses + + Tid + - tid of the process responsible for the accesses + + Code address + - code address responsible for the accesses + + cycles - rmt hitm, lcl hitm, load + - sum of cycles for given accesses - Remote/Local HITM and generic load + + cpu cnt + - number of cpus that participated on the access + + Symbol + - code symbol related to the 'Code address' value + + Shared Object + - shared object name related to the 'Code address' value + + Source:Line + - source information related to the 'Code address' value + + Node + - nodes participating on the access (see NODE INFO section) + +NODE INFO +--------- +The 'Node' field displays nodes that accesses given cacheline +offset. Its output comes in 3 flavors: + - node IDs separated by ',' + - node IDs with stats for each ID, in following format: + Node{cpus %hitms %stores} + - node IDs with list of affected CPUs in following format: + Node{cpu list} + +User can switch between above flavors with -N option or +use 'n' key to interactively switch in TUI mode. + +COALESCE +-------- +User can specify how to sort offsets for cacheline. + +Following fields are available and governs the final +output fields set for caheline offsets output: + + tid - coalesced by process TIDs + pid - coalesced by process PIDs + iaddr - coalesced by code address, following fields are displayed: + Code address, Code symbol, Shared Object, Source line + dso - coalesced by shared object + +By default the coalescing is setup with 'pid,iaddr'. + +STDIO OUTPUT +------------ +The stdio output displays data on standard output. + +Following tables are displayed: + Trace Event Information + - overall statistics of memory accesses + + Global Shared Cache Line Event Information + - overall statistics on shared cachelines + + Shared Data Cache Line Table + - list of most expensive cachelines + + Shared Cache Line Distribution Pareto + - list of all accessed offsets for each cacheline + +TUI OUTPUT +---------- +The TUI output provides interactive interface to navigate +through cachelines list and to display offset details. + +For details please refer to the help window by pressing '?' key. + +CREDITS +------- +Although Don Zickus, Dick Fowles and Joe Mario worked together +to get this implemented, we got lots of early help from Arnaldo +Carvalho de Melo, Stephane Eranian, Jiri Olsa and Andi Kleen. + +C2C BLOG +-------- +Check Joe's blog on c2c tool for detailed use case explanation: + https://joemario.github.io/blog/2016/09/01/c2c-blog/ + +SEE ALSO +-------- +linkperf:perf-record[1], linkperf:perf-mem[1] diff --git a/Documentation/perf-config.txt b/Documentation/perf-config.txt new file mode 100644 index 0000000..32f4a89 --- /dev/null +++ b/Documentation/perf-config.txt @@ -0,0 +1,520 @@ +perf-config(1) +============== + +NAME +---- +perf-config - Get and set variables in a configuration file. + +SYNOPSIS +-------- +[verse] +'perf config' [] [section.name[=value] ...] +or +'perf config' [] -l | --list + +DESCRIPTION +----------- +You can manage variables in a configuration file with this command. + +OPTIONS +------- + +-l:: +--list:: + Show current config variables, name and value, for all sections. + +--user:: + For writing and reading options: write to user + '$HOME/.perfconfig' file or read it. + +--system:: + For writing and reading options: write to system-wide + '$(sysconfdir)/perfconfig' or read it. + +CONFIGURATION FILE +------------------ + +The perf configuration file contains many variables to change various +aspects of each of its tools, including output, disk usage, etc. +The '$HOME/.perfconfig' file is used to store a per-user configuration. +The file '$(sysconfdir)/perfconfig' can be used to +store a system-wide default configuration. + +When reading or writing, the values are read from the system and user +configuration files by default, and options '--system' and '--user' +can be used to tell the command to read from or write to only that location. + +Syntax +~~~~~~ + +The file consist of sections. A section starts with its name +surrounded by square brackets and continues till the next section +begins. Each variable must be in a section, and have the form +'name = value', for example: + + [section] + name1 = value1 + name2 = value2 + +Section names are case sensitive and can contain any characters except +newline (double quote `"` and backslash have to be escaped as `\"` and `\\`, +respectively). Section headers can't span multiple lines. + +Example +~~~~~~~ + +Given a $HOME/.perfconfig like this: + +# +# This is the config file, and +# a '#' and ';' character indicates a comment +# + + [colors] + # Color variables + top = red, default + medium = green, default + normal = lightgray, default + selected = white, lightgray + jump_arrows = blue, default + addr = magenta, default + root = white, blue + + [tui] + # Defaults if linked with libslang + report = on + annotate = on + top = on + + [buildid] + # Default, disable using /dev/null + dir = ~/.debug + + [annotate] + # Defaults + hide_src_code = false + use_offset = true + jump_arrows = true + show_nr_jumps = false + + [help] + # Format can be man, info, web or html + format = man + autocorrect = 0 + + [ui] + show-headers = true + + [call-graph] + # fp (framepointer), dwarf + record-mode = fp + print-type = graph + order = caller + sort-key = function + + [report] + # Defaults + sort-order = comm,dso,symbol + percent-limit = 0 + queue-size = 0 + children = true + group = true + +You can hide source code of annotate feature setting the config to false with + + % perf config annotate.hide_src_code=true + +If you want to add or modify several config items, you can do like + + % perf config ui.show-headers=false kmem.default=slab + +To modify the sort order of report functionality in user config file(i.e. `~/.perfconfig`), do + + % perf config --user report sort-order=srcline + +To change colors of selected line to other foreground and background colors +in system config file (i.e. `$(sysconf)/perfconfig`), do + + % perf config --system colors.selected=yellow,green + +To query the record mode of call graph, do + + % perf config call-graph.record-mode + +If you want to know multiple config key/value pairs, you can do like + + % perf config report.queue-size call-graph.order report.children + +To query the config value of sort order of call graph in user config file (i.e. `~/.perfconfig`), do + + % perf config --user call-graph.sort-order + +To query the config value of buildid directory in system config file (i.e. `$(sysconf)/perfconfig`), do + + % perf config --system buildid.dir + +Variables +~~~~~~~~~ + +colors.*:: + The variables for customizing the colors used in the output for the + 'report', 'top' and 'annotate' in the TUI. They should specify the + foreground and background colors, separated by a comma, for example: + + medium = green, lightgray + + If you want to use the color configured for you terminal, just leave it + as 'default', for example: + + medium = default, lightgray + + Available colors: + red, yellow, green, cyan, gray, black, blue, + white, default, magenta, lightgray + + colors.top:: + 'top' means a overhead percentage which is more than 5%. + And values of this variable specify percentage colors. + Basic key values are foreground-color 'red' and + background-color 'default'. + colors.medium:: + 'medium' means a overhead percentage which has more than 0.5%. + Default values are 'green' and 'default'. + colors.normal:: + 'normal' means the rest of overhead percentages + except 'top', 'medium', 'selected'. + Default values are 'lightgray' and 'default'. + colors.selected:: + This selects the colors for the current entry in a list of entries + from sub-commands (top, report, annotate). + Default values are 'black' and 'lightgray'. + colors.jump_arrows:: + Colors for jump arrows on assembly code listings + such as 'jns', 'jmp', 'jane', etc. + Default values are 'blue', 'default'. + colors.addr:: + This selects colors for addresses from 'annotate'. + Default values are 'magenta', 'default'. + colors.root:: + Colors for headers in the output of a sub-commands (top, report). + Default values are 'white', 'blue'. + +tui.*, gtk.*:: + Subcommands that can be configured here are 'top', 'report' and 'annotate'. + These values are booleans, for example: + + [tui] + top = true + + will make the TUI be the default for the 'top' subcommand. Those will be + available if the required libs were detected at tool build time. + +buildid.*:: + buildid.dir:: + Each executable and shared library in modern distributions comes with a + content based identifier that, if available, will be inserted in a + 'perf.data' file header to, at analysis time find what is needed to do + symbol resolution, code annotation, etc. + + The recording tools also stores a hard link or copy in a per-user + directory, $HOME/.debug/, of binaries, shared libraries, /proc/kallsyms + and /proc/kcore files to be used at analysis time. + + The buildid.dir variable can be used to either change this directory + cache location, or to disable it altogether. If you want to disable it, + set buildid.dir to /dev/null. The default is $HOME/.debug + +annotate.*:: + These options work only for TUI. + These are in control of addresses, jump function, source code + in lines of assembly code from a specific program. + + annotate.hide_src_code:: + If a program which is analyzed has source code, + this option lets 'annotate' print a list of assembly code with the source code. + For example, let's see a part of a program. There're four lines. + If this option is 'true', they can be printed + without source code from a program as below. + + │ push %rbp + │ mov %rsp,%rbp + │ sub $0x10,%rsp + │ mov (%rdi),%rdx + + But if this option is 'false', source code of the part + can be also printed as below. Default is 'false'. + + │ struct rb_node *rb_next(const struct rb_node *node) + │ { + │ push %rbp + │ mov %rsp,%rbp + │ sub $0x10,%rsp + │ struct rb_node *parent; + │ + │ if (RB_EMPTY_NODE(node)) + │ mov (%rdi),%rdx + │ return n; + + annotate.use_offset:: + Basing on a first address of a loaded function, offset can be used. + Instead of using original addresses of assembly code, + addresses subtracted from a base address can be printed. + Let's illustrate an example. + If a base address is 0XFFFFFFFF81624d50 as below, + + ffffffff81624d50 + + an address on assembly code has a specific absolute address as below + + ffffffff816250b8:│ mov 0x8(%r14),%rdi + + but if use_offset is 'true', an address subtracted from a base address is printed. + Default is true. This option is only applied to TUI. + + 368:│ mov 0x8(%r14),%rdi + + annotate.jump_arrows:: + There can be jump instruction among assembly code. + Depending on a boolean value of jump_arrows, + arrows can be printed or not which represent + where do the instruction jump into as below. + + │ ┌──jmp 1333 + │ │ xchg %ax,%ax + │1330:│ mov %r15,%r10 + │1333:└─→cmp %r15,%r14 + + If jump_arrow is 'false', the arrows isn't printed as below. + Default is 'false'. + + │ ↓ jmp 1333 + │ xchg %ax,%ax + │1330: mov %r15,%r10 + │1333: cmp %r15,%r14 + + annotate.show_linenr:: + When showing source code if this option is 'true', + line numbers are printed as below. + + │1628 if (type & PERF_SAMPLE_IDENTIFIER) { + │ ↓ jne 508 + │1628 data->id = *array; + │1629 array++; + │1630 } + + However if this option is 'false', they aren't printed as below. + Default is 'false'. + + │ if (type & PERF_SAMPLE_IDENTIFIER) { + │ ↓ jne 508 + │ data->id = *array; + │ array++; + │ } + + annotate.show_nr_jumps:: + Let's see a part of assembly code. + + │1382: movb $0x1,-0x270(%rbp) + + If use this, the number of branches jumping to that address can be printed as below. + Default is 'false'. + + │1 1382: movb $0x1,-0x270(%rbp) + + annotate.show_total_period:: + To compare two records on an instruction base, with this option + provided, display total number of samples that belong to a line + in assembly code. If this option is 'true', total periods are printed + instead of percent values as below. + + 302 │ mov %eax,%eax + + But if this option is 'false', percent values for overhead are printed i.e. + Default is 'false'. + + 99.93 │ mov %eax,%eax + + annotate.offset_level:: + Default is '1', meaning just jump targets will have offsets show right beside + the instruction. When set to '2' 'call' instructions will also have its offsets + shown, 3 or higher will show offsets for all instructions. + +hist.*:: + hist.percentage:: + This option control the way to calculate overhead of filtered entries - + that means the value of this option is effective only if there's a + filter (by comm, dso or symbol name). Suppose a following example: + + Overhead Symbols + ........ ....... + 33.33% foo + 33.33% bar + 33.33% baz + + This is an original overhead and we'll filter out the first 'foo' + entry. The value of 'relative' would increase the overhead of 'bar' + and 'baz' to 50.00% for each, while 'absolute' would show their + current overhead (33.33%). + +ui.*:: + ui.show-headers:: + This option controls display of column headers (like 'Overhead' and 'Symbol') + in 'report' and 'top'. If this option is false, they are hidden. + This option is only applied to TUI. + +call-graph.*:: + When sub-commands 'top' and 'report' work with -g/—-children + there're options in control of call-graph. + + call-graph.record-mode:: + The record-mode can be 'fp' (frame pointer), 'dwarf' and 'lbr'. + The value of 'dwarf' is effective only if perf detect needed library + (libunwind or a recent version of libdw). + 'lbr' only work for cpus that support it. + + call-graph.dump-size:: + The size of stack to dump in order to do post-unwinding. Default is 8192 (byte). + When using dwarf into record-mode, the default size will be used if omitted. + + call-graph.print-type:: + The print-types can be graph (graph absolute), fractal (graph relative), + flat and folded. This option controls a way to show overhead for each callchain + entry. Suppose a following example. + + Overhead Symbols + ........ ....... + 40.00% foo + | + ---foo + | + |--50.00%--bar + | main + | + --50.00%--baz + main + + This output is a 'fractal' format. The 'foo' came from 'bar' and 'baz' exactly + half and half so 'fractal' shows 50.00% for each + (meaning that it assumes 100% total overhead of 'foo'). + + The 'graph' uses absolute overhead value of 'foo' as total so each of + 'bar' and 'baz' callchain will have 20.00% of overhead. + If 'flat' is used, single column and linear exposure of call chains. + 'folded' mean call chains are displayed in a line, separated by semicolons. + + call-graph.order:: + This option controls print order of callchains. The default is + 'callee' which means callee is printed at top and then followed by its + caller and so on. The 'caller' prints it in reverse order. + + If this option is not set and report.children or top.children is + set to true (or the equivalent command line option is given), + the default value of this option is changed to 'caller' for the + execution of 'perf report' or 'perf top'. Other commands will + still default to 'callee'. + + call-graph.sort-key:: + The callchains are merged if they contain same information. + The sort-key option determines a way to compare the callchains. + A value of 'sort-key' can be 'function' or 'address'. + The default is 'function'. + + call-graph.threshold:: + When there're many callchains it'd print tons of lines. So perf omits + small callchains under a certain overhead (threshold) and this option + control the threshold. Default is 0.5 (%). The overhead is calculated + by value depends on call-graph.print-type. + + call-graph.print-limit:: + This is a maximum number of lines of callchain printed for a single + histogram entry. Default is 0 which means no limitation. + +report.*:: + report.sort_order:: + Allows changing the default sort order from "comm,dso,symbol" to + some other default, for instance "sym,dso" may be more fitting for + kernel developers. + report.percent-limit:: + This one is mostly the same as call-graph.threshold but works for + histogram entries. Entries having an overhead lower than this + percentage will not be printed. Default is '0'. If percent-limit + is '10', only entries which have more than 10% of overhead will be + printed. + + report.queue-size:: + This option sets up the maximum allocation size of the internal + event queue for ordering events. Default is 0, meaning no limit. + + report.children:: + 'Children' means functions called from another function. + If this option is true, 'perf report' cumulates callchains of children + and show (accumulated) total overhead as well as 'Self' overhead. + Please refer to the 'perf report' manual. The default is 'true'. + + report.group:: + This option is to show event group information together. + Example output with this turned on, notice that there is one column + per event in the group, ref-cycles and cycles: + + # group: {ref-cycles,cycles} + # ======== + # + # Samples: 7K of event 'anon group { ref-cycles, cycles }' + # Event count (approx.): 6876107743 + # + # Overhead Command Shared Object Symbol + # ................ ....... ................. ................... + # + 99.84% 99.76% noploop noploop [.] main + 0.07% 0.00% noploop ld-2.15.so [.] strcmp + 0.03% 0.00% noploop [kernel.kallsyms] [k] timerqueue_del + +top.*:: + top.children:: + Same as 'report.children'. So if it is enabled, the output of 'top' + command will have 'Children' overhead column as well as 'Self' overhead + column by default. + The default is 'true'. + +man.*:: + man.viewer:: + This option can assign a tool to view manual pages when 'help' + subcommand was invoked. Supported tools are 'man', 'woman' + (with emacs client) and 'konqueror'. Default is 'man'. + + New man viewer tool can be also added using 'man..cmd' + or use different path using 'man..path' config option. + +pager.*:: + pager.:: + When the subcommand is run on stdio, determine whether it uses + pager or not based on this value. Default is 'unspecified'. + +kmem.*:: + kmem.default:: + This option decides which allocator is to be analyzed if neither + '--slab' nor '--page' option is used. Default is 'slab'. + +record.*:: + record.build-id:: + This option can be 'cache', 'no-cache' or 'skip'. + 'cache' is to post-process data and save/update the binaries into + the build-id cache (in ~/.debug). This is the default. + But if this option is 'no-cache', it will not update the build-id cache. + 'skip' skips post-processing and does not update the cache. + +diff.*:: + diff.order:: + This option sets the number of columns to sort the result. + The default is 0, which means sorting by baseline. + Setting it to 1 will sort the result by delta (or other + compute method selected). + + diff.compute:: + This options sets the method for computing the diff result. + Possible values are 'delta', 'delta-abs', 'ratio' and + 'wdiff'. Default is 'delta'. + +SEE ALSO +-------- +linkperf:perf[1] diff --git a/Documentation/perf-data.txt b/Documentation/perf-data.txt new file mode 100644 index 0000000..c871807 --- /dev/null +++ b/Documentation/perf-data.txt @@ -0,0 +1,48 @@ +perf-data(1) +============ + +NAME +---- +perf-data - Data file related processing + +SYNOPSIS +-------- +[verse] +'perf data' [] []", + +DESCRIPTION +----------- +Data file related processing. + +COMMANDS +-------- +convert:: + Converts perf data file into another format (only CTF [1] format is support by now). + It's possible to set data-convert debug variable to get debug messages from conversion, + like: + perf --debug data-convert data convert ... + +OPTIONS for 'convert' +--------------------- +--to-ctf:: + Triggers the CTF conversion, specify the path of CTF data directory. + +-i:: + Specify input perf data file path. + +-f:: +--force:: + Don't complain, do it. + +-v:: +--verbose:: + Be more verbose (show counter open errors, etc). + +--all:: + Convert all events, including non-sample events (comm, fork, ...), to output. + Default is off, only convert samples. + +SEE ALSO +-------- +linkperf:perf[1] +[1] Common Trace Format - http://www.efficios.com/ctf diff --git a/Documentation/perf-diff.txt b/Documentation/perf-diff.txt new file mode 100644 index 0000000..a79c84a --- /dev/null +++ b/Documentation/perf-diff.txt @@ -0,0 +1,227 @@ +perf-diff(1) +============ + +NAME +---- +perf-diff - Read perf.data files and display the differential profile + +SYNOPSIS +-------- +[verse] +'perf diff' [baseline file] [data file1] [[data file2] ... ] + +DESCRIPTION +----------- +This command displays the performance difference amongst two or more perf.data +files captured via perf record. + +If no parameters are passed it will assume perf.data.old and perf.data. + +The differential profile is displayed only for events matching both +specified perf.data files. + +If no parameters are passed the samples will be sorted by dso and symbol. +As the perf.data files could come from different binaries, the symbols addresses +could vary. So perf diff is based on the comparison of the files and +symbols name. + +OPTIONS +------- +-D:: +--dump-raw-trace:: + Dump raw trace in ASCII. + +--kallsyms=:: + kallsyms pathname + +-m:: +--modules:: + Load module symbols. WARNING: use only with -k and LIVE kernel + +-d:: +--dsos=:: + Only consider symbols in these dsos. CSV that understands + file://filename entries. This option will affect the percentage + of the Baseline/Delta column. See --percentage for more info. + +-C:: +--comms=:: + Only consider symbols in these comms. CSV that understands + file://filename entries. This option will affect the percentage + of the Baseline/Delta column. See --percentage for more info. + +-S:: +--symbols=:: + Only consider these symbols. CSV that understands + file://filename entries. This option will affect the percentage + of the Baseline/Delta column. See --percentage for more info. + +-s:: +--sort=:: + Sort by key(s): pid, comm, dso, symbol, cpu, parent, srcline. + Please see description of --sort in the perf-report man page. + +-t:: +--field-separator=:: + + Use a special separator character and don't pad with spaces, replacing + all occurrences of this separator in symbol names (and other output) + with a '.' character, that thus it's the only non valid separator. + +-v:: +--verbose:: + Be verbose, for instance, show the raw counts in addition to the + diff. + +-q:: +--quiet:: + Do not show any message. (Suppress -v) + +-f:: +--force:: + Don't do ownership validation. + +--symfs=:: + Look for files with symbols relative to this directory. + +-b:: +--baseline-only:: + Show only items with match in baseline. + +-c:: +--compute:: + Differential computation selection - delta, ratio, wdiff, delta-abs + (default is delta-abs). Default can be changed using diff.compute + config option. See COMPARISON METHODS section for more info. + +-p:: +--period:: + Show period values for both compared hist entries. + +-F:: +--formula:: + Show formula for given computation. + +-o:: +--order:: + Specify compute sorting column number. 0 means sorting by baseline + overhead and 1 (default) means sorting by computed value of column 1 + (data from the first file other base baseline). Values more than 1 + can be used only if enough data files are provided. + The default value can be set using the diff.order config option. + +--percentage:: + Determine how to display the overhead percentage of filtered entries. + Filters can be applied by --comms, --dsos and/or --symbols options. + + "relative" means it's relative to filtered entries only so that the + sum of shown entries will be always 100%. "absolute" means it retains + the original value before and after the filter is applied. + +COMPARISON +---------- +The comparison is governed by the baseline file. The baseline perf.data +file is iterated for samples. All other perf.data files specified on +the command line are searched for the baseline sample pair. If the pair +is found, specified computation is made and result is displayed. + +All samples from non-baseline perf.data files, that do not match any +baseline entry, are displayed with empty space within baseline column +and possible computation results (delta) in their related column. + +Example files samples: +- file A with samples f1, f2, f3, f4, f6 +- file B with samples f2, f4, f5 +- file C with samples f1, f2, f5 + +Example output: + x - computation takes place for pair + b - baseline sample percentage + +- perf diff A B C + + baseline/A compute/B compute/C samples + --------------------------------------- + b x f1 + b x x f2 + b f3 + b x f4 + b f6 + x x f5 + +- perf diff B A C + + baseline/B compute/A compute/C samples + --------------------------------------- + b x x f2 + b x f4 + b x f5 + x x f1 + x f3 + x f6 + +- perf diff C B A + + baseline/C compute/B compute/A samples + --------------------------------------- + b x f1 + b x x f2 + b x f5 + x f3 + x x f4 + x f6 + +COMPARISON METHODS +------------------ +delta +~~~~~ +If specified the 'Delta' column is displayed with value 'd' computed as: + + d = A->period_percent - B->period_percent + +with: + - A/B being matching hist entry from data/baseline file specified + (or perf.data/perf.data.old) respectively. + + - period_percent being the % of the hist entry period value within + single data file + + - with filtering by -C, -d and/or -S, period_percent might be changed + relative to how entries are filtered. Use --percentage=absolute to + prevent such fluctuation. + +delta-abs +~~~~~~~~~ +Same as 'delta` method, but sort the result with the absolute values. + +ratio +~~~~~ +If specified the 'Ratio' column is displayed with value 'r' computed as: + + r = A->period / B->period + +with: + - A/B being matching hist entry from data/baseline file specified + (or perf.data/perf.data.old) respectively. + + - period being the hist entry period value + +wdiff:WEIGHT-B,WEIGHT-A +~~~~~~~~~~~~~~~~~~~~~~~ +If specified the 'Weighted diff' column is displayed with value 'd' computed as: + + d = B->period * WEIGHT-A - A->period * WEIGHT-B + + - A/B being matching hist entry from data/baseline file specified + (or perf.data/perf.data.old) respectively. + + - period being the hist entry period value + + - WEIGHT-A/WEIGHT-B being user supplied weights in the the '-c' option + behind ':' separator like '-c wdiff:1,2'. + - WEIGHT-A being the weight of the data file + - WEIGHT-B being the weight of the baseline data file + +SEE ALSO +-------- +linkperf:perf-record[1], linkperf:perf-report[1] diff --git a/Documentation/perf-evlist.txt b/Documentation/perf-evlist.txt new file mode 100644 index 0000000..c0a6640 --- /dev/null +++ b/Documentation/perf-evlist.txt @@ -0,0 +1,45 @@ +perf-evlist(1) +============== + +NAME +---- +perf-evlist - List the event names in a perf.data file + +SYNOPSIS +-------- +[verse] +'perf evlist ' + +DESCRIPTION +----------- +This command displays the names of events sampled in a perf.data file. + +OPTIONS +------- +-i:: +--input=:: + Input file name. (default: perf.data unless stdin is a fifo) + +-f:: +--force:: + Don't complain, do it. + +-F:: +--freq=:: + Show just the sample frequency used for each event. + +-v:: +--verbose=:: + Show all fields. + +-g:: +--group:: + Show event group information. + +--trace-fields:: + Show tracepoint field names. + +SEE ALSO +-------- +linkperf:perf-record[1], linkperf:perf-list[1], +linkperf:perf-report[1] diff --git a/Documentation/perf-ftrace.txt b/Documentation/perf-ftrace.txt new file mode 100644 index 0000000..b80c843 --- /dev/null +++ b/Documentation/perf-ftrace.txt @@ -0,0 +1,87 @@ +perf-ftrace(1) +============== + +NAME +---- +perf-ftrace - simple wrapper for kernel's ftrace functionality + + +SYNOPSIS +-------- +[verse] +'perf ftrace' + +DESCRIPTION +----------- +The 'perf ftrace' command is a simple wrapper of kernel's ftrace +functionality. It only supports single thread tracing currently and +just reads trace_pipe in text and then write it to stdout. + +The following options apply to perf ftrace. + +OPTIONS +------- + +-t:: +--tracer=:: + Tracer to use: function_graph or function. + +-v:: +--verbose=:: + Verbosity level. + +-p:: +--pid=:: + Trace on existing process id (comma separated list). + +-a:: +--all-cpus:: + Force system-wide collection. Scripts run without a + normally use -a by default, while scripts run with a + normally don't - this option allows the latter to be run in + system-wide mode. + +-C:: +--cpu=:: + Only trace for the list of CPUs provided. Multiple CPUs can + be provided as a comma separated list with no space like: 0,1. + Ranges of CPUs are specified with -: 0-2. + Default is to trace on all online CPUs. + +-T:: +--trace-funcs=:: + Only trace functions given by the argument. Multiple functions + can be given by using this option more than once. The function + argument also can be a glob pattern. It will be passed to + 'set_ftrace_filter' in tracefs. + +-N:: +--notrace-funcs=:: + Do not trace functions given by the argument. Like -T option, + this can be used more than once to specify multiple functions + (or glob patterns). It will be passed to 'set_ftrace_notrace' + in tracefs. + +-G:: +--graph-funcs=:: + Set graph filter on the given function (or a glob pattern). + This is useful for the function_graph tracer only and enables + tracing for functions executed from the given function. + This can be used more than once to specify multiple functions. + It will be passed to 'set_graph_function' in tracefs. + +-g:: +--nograph-funcs=:: + Set graph notrace filter on the given function (or a glob pattern). + Like -G option, this is useful for the function_graph tracer only + and disables tracing for function executed from the given function. + This can be used more than once to specify multiple functions. + It will be passed to 'set_graph_notrace' in tracefs. + +-D:: +--graph-depth=:: + Set max depth for function graph tracer to follow + +SEE ALSO +-------- +linkperf:perf-record[1], linkperf:perf-trace[1] diff --git a/Documentation/perf-help.txt b/Documentation/perf-help.txt new file mode 100644 index 0000000..5143918 --- /dev/null +++ b/Documentation/perf-help.txt @@ -0,0 +1,38 @@ +perf-help(1) +============ + +NAME +---- +perf-help - display help information about perf + +SYNOPSIS +-------- +'perf help' [-a|--all] [COMMAND] + +DESCRIPTION +----------- + +With no options and no COMMAND given, the synopsis of the 'perf' +command and a list of the most commonly used perf commands are printed +on the standard output. + +If the option '--all' or '-a' is given, then all available commands are +printed on the standard output. + +If a perf command is named, a manual page for that command is brought +up. The 'man' program is used by default for this purpose, but this +can be overridden by other options or configuration variables. + +Note that `perf --help ...` is identical to `perf help ...` because the +former is internally converted into the latter. + +OPTIONS +------- +-a:: +--all:: + Prints all the available commands on the standard output. This + option supersedes any other option. + +PERF +---- +Part of the linkperf:perf[1] suite diff --git a/Documentation/perf-inject.txt b/Documentation/perf-inject.txt new file mode 100644 index 0000000..a64d658 --- /dev/null +++ b/Documentation/perf-inject.txt @@ -0,0 +1,69 @@ +perf-inject(1) +============== + +NAME +---- +perf-inject - Filter to augment the events stream with additional information + +SYNOPSIS +-------- +[verse] +'perf inject ' + +DESCRIPTION +----------- +perf-inject reads a perf-record event stream and repipes it to stdout. At any +point the processing code can inject other events into the event stream - in +this case build-ids (-b option) are read and injected as needed into the event +stream. + +Build-ids are just the first user of perf-inject - potentially anything that +needs userspace processing to augment the events stream with additional +information could make use of this facility. + +OPTIONS +------- +-b:: +--build-ids=:: + Inject build-ids into the output stream +-v:: +--verbose:: + Be more verbose. +-i:: +--input=:: + Input file name. (default: stdin) +-o:: +--output=:: + Output file name. (default: stdout) +-s:: +--sched-stat:: + Merge sched_stat and sched_switch for getting events where and how long + tasks slept. sched_switch contains a callchain where a task slept and + sched_stat contains a timeslice how long a task slept. + +--kallsyms=:: + kallsyms pathname + +--itrace:: + Decode Instruction Tracing data, replacing it with synthesized events. + Options are: + +include::itrace.txt[] + +--strip:: + Use with --itrace to strip out non-synthesized events. + +-j:: +--jit:: + Process jitdump files by injecting the mmap records corresponding to jitted + functions. This option also generates the ELF images for each jitted function + found in the jitdumps files captured in the input perf.data file. Use this option + if you are monitoring environment using JIT runtimes, such as Java, DART or V8. + +-f:: +--force:: + Don't complain, do it. + +SEE ALSO +-------- +linkperf:perf-record[1], linkperf:perf-report[1], linkperf:perf-archive[1] diff --git a/Documentation/perf-kallsyms.txt b/Documentation/perf-kallsyms.txt new file mode 100644 index 0000000..f3c6209 --- /dev/null +++ b/Documentation/perf-kallsyms.txt @@ -0,0 +1,24 @@ +perf-kallsyms(1) +================ + +NAME +---- +perf-kallsyms - Searches running kernel for symbols + +SYNOPSIS +-------- +[verse] +'perf kallsyms' [] symbol_name[,symbol_name...] + +DESCRIPTION +----------- +This command searches the running kernel kallsyms file for the given symbol(s) +and prints information about it, including the DSO, the kallsyms begin/end +addresses and the addresses in the ELF kallsyms symbol table (for symbols in +modules). + +OPTIONS +------- +-v:: +--verbose=:: + Increase verbosity level, showing details about symbol table loading, etc. diff --git a/Documentation/perf-kmem.txt b/Documentation/perf-kmem.txt new file mode 100644 index 0000000..85b8ac6 --- /dev/null +++ b/Documentation/perf-kmem.txt @@ -0,0 +1,77 @@ +perf-kmem(1) +============ + +NAME +---- +perf-kmem - Tool to trace/measure kernel memory properties + +SYNOPSIS +-------- +[verse] +'perf kmem' {record|stat} [] + +DESCRIPTION +----------- +There are two variants of perf kmem: + + 'perf kmem record ' to record the kmem events + of an arbitrary workload. + + 'perf kmem stat' to report kernel memory statistics. + +OPTIONS +------- +-i :: +--input=:: + Select the input file (default: perf.data unless stdin is a fifo) + +-f:: +--force:: + Don't do ownership validation + +-v:: +--verbose:: + Be more verbose. (show symbol address, etc) + +--caller:: + Show per-callsite statistics + +--alloc:: + Show per-allocation statistics + +-s :: +--sort=:: + Sort the output (default: 'frag,hit,bytes' for slab and 'bytes,hit' + for page). Available sort keys are 'ptr, callsite, bytes, hit, + pingpong, frag' for slab and 'page, callsite, bytes, hit, order, + migtype, gfp' for page. This option should be preceded by one of the + mode selection options - i.e. --slab, --page, --alloc and/or --caller. + +-l :: +--line=:: + Print n lines only + +--raw-ip:: + Print raw ip instead of symbol + +--slab:: + Analyze SLAB allocator events. + +--page:: + Analyze page allocator events + +--live:: + Show live page stat. The perf kmem shows total allocation stat by + default, but this option shows live (currently allocated) pages + instead. (This option works with --page option only) + +--time=,:: + Only analyze samples within given time window: ,. Times + have the format seconds.microseconds. If start is not given (i.e., time + string is ',x.y') then analysis starts at the beginning of the file. If + stop time is not given (i.e, time string is 'x.y,') then analysis goes + to end of file. + +SEE ALSO +-------- +linkperf:perf-record[1] diff --git a/Documentation/perf-kvm.txt b/Documentation/perf-kvm.txt new file mode 100644 index 0000000..6a5bb2b --- /dev/null +++ b/Documentation/perf-kvm.txt @@ -0,0 +1,164 @@ +perf-kvm(1) +=========== + +NAME +---- +perf-kvm - Tool to trace/measure kvm guest os + +SYNOPSIS +-------- +[verse] +'perf kvm' [--host] [--guest] [--guestmount= + [--guestkallsyms= --guestmodules= | --guestvmlinux=]] + {top|record|report|diff|buildid-list} [] +'perf kvm' [--host] [--guest] [--guestkallsyms= --guestmodules= + | --guestvmlinux=] {top|record|report|diff|buildid-list|stat} [] +'perf kvm stat [record|report|live] [] + +DESCRIPTION +----------- +There are a couple of variants of perf kvm: + + 'perf kvm [options] top ' to generates and displays + a performance counter profile of guest os in realtime + of an arbitrary workload. + + 'perf kvm record ' to record the performance counter profile + of an arbitrary workload and save it into a perf data file. We set the + default behavior of perf kvm as --guest, so if neither --host nor --guest + is input, the perf data file name is perf.data.guest. If --host is input, + the perf data file name is perf.data.kvm. If you want to record data into + perf.data.host, please input --host --no-guest. The behaviors are shown as + following: + Default('') -> perf.data.guest + --host -> perf.data.kvm + --guest -> perf.data.guest + --host --guest -> perf.data.kvm + --host --no-guest -> perf.data.host + + 'perf kvm report' to display the performance counter profile information + recorded via perf kvm record. + + 'perf kvm diff' to displays the performance difference amongst two perf.data + files captured via perf record. + + 'perf kvm buildid-list' to display the buildids found in a perf data file, + so that other tools can be used to fetch packages with matching symbol tables + for use by perf report. As buildid is read from /sys/kernel/notes in os, then + if you want to list the buildid for guest, please make sure your perf data file + was captured with --guestmount in perf kvm record. + + 'perf kvm stat ' to run a command and gather performance counter + statistics. + Especially, perf 'kvm stat record/report' generates a statistical analysis + of KVM events. Currently, vmexit, mmio (x86 only) and ioport (x86 only) + events are supported. 'perf kvm stat record ' records kvm events + and the events between start and end . + And this command produces a file which contains tracing results of kvm + events. + + 'perf kvm stat report' reports statistical data which includes events + handled time, samples, and so on. + + 'perf kvm stat live' reports statistical data in a live mode (similar to + record + report but with statistical data updated live at a given display + rate). + +OPTIONS +------- +-i:: +--input=:: + Input file name. +-o:: +--output=:: + Output file name. +--host:: + Collect host side performance profile. +--guest:: + Collect guest side performance profile. +--guestmount=:: + Guest os root file system mount directory. Users mounts guest os + root directories under by a specific filesystem access method, + typically, sshfs. For example, start 2 guest os. The one's pid is 8888 + and the other's is 9999. + #mkdir ~/guestmount; cd ~/guestmount + #sshfs -o allow_other,direct_io -p 5551 localhost:/ 8888/ + #sshfs -o allow_other,direct_io -p 5552 localhost:/ 9999/ + #perf kvm --host --guest --guestmount=~/guestmount top +--guestkallsyms=:: + Guest os /proc/kallsyms file copy. 'perf' kvm' reads it to get guest + kernel symbols. Users copy it out from guest os. +--guestmodules=:: + Guest os /proc/modules file copy. 'perf' kvm' reads it to get guest + kernel module information. Users copy it out from guest os. +--guestvmlinux=:: + Guest os kernel vmlinux. +-v:: +--verbose:: + Be more verbose (show counter open errors, etc). + +STAT REPORT OPTIONS +------------------- +--vcpu=:: + analyze events which occur on this vcpu. (default: all vcpus) + +--event=:: + event to be analyzed. Possible values: vmexit, mmio (x86 only), + ioport (x86 only). (default: vmexit) +-k:: +--key=:: + Sorting key. Possible values: sample (default, sort by samples + number), time (sort by average time). +-p:: +--pid=:: + Analyze events only for given process ID(s) (comma separated list). + +STAT LIVE OPTIONS +----------------- +-d:: +--display:: + Time in seconds between display updates + +-m:: +--mmap-pages=:: + Number of mmap data pages (must be a power of two) or size + specification with appended unit character - B/K/M/G. The + size is rounded up to have nearest pages power of two value. + +-a:: +--all-cpus:: + System-wide collection from all CPUs. + +-p:: +--pid=:: + Analyze events only for given process ID(s) (comma separated list). + +--vcpu=:: + analyze events which occur on this vcpu. (default: all vcpus) + + +--event=:: + event to be analyzed. Possible values: vmexit, + mmio (x86 only), ioport (x86 only). + (default: vmexit) + +-k:: +--key=:: + Sorting key. Possible values: sample (default, sort by samples + number), time (sort by average time). + +--duration=:: + Show events other than HLT (x86 only) or Wait state (s390 only) + that take longer than duration usecs. + +--proc-map-timeout:: + When processing pre-existing threads /proc/XXX/mmap, it may take + a long time, because the file may be huge. A time out is needed + in such cases. + This option sets the time out limit. The default value is 500 ms. + +SEE ALSO +-------- +linkperf:perf-top[1], linkperf:perf-record[1], linkperf:perf-report[1], +linkperf:perf-diff[1], linkperf:perf-buildid-list[1], +linkperf:perf-stat[1] diff --git a/Documentation/perf-list.txt b/Documentation/perf-list.txt new file mode 100644 index 0000000..2549c34 --- /dev/null +++ b/Documentation/perf-list.txt @@ -0,0 +1,281 @@ +perf-list(1) +============ + +NAME +---- +perf-list - List all symbolic event types + +SYNOPSIS +-------- +[verse] +'perf list' [--no-desc] [--long-desc] + [hw|sw|cache|tracepoint|pmu|sdt|metric|metricgroup|event_glob] + +DESCRIPTION +----------- +This command displays the symbolic event types which can be selected in the +various perf commands with the -e option. + +OPTIONS +------- +--no-desc:: +Don't print descriptions. + +-v:: +--long-desc:: +Print longer event descriptions. + +--details:: +Print how named events are resolved internally into perf events, and also +any extra expressions computed by perf stat. + + +[[EVENT_MODIFIERS]] +EVENT MODIFIERS +--------------- + +Events can optionally have a modifier by appending a colon and one or +more modifiers. Modifiers allow the user to restrict the events to be +counted. The following modifiers exist: + + u - user-space counting + k - kernel counting + h - hypervisor counting + I - non idle counting + G - guest counting (in KVM guests) + H - host counting (not in KVM guests) + p - precise level + P - use maximum detected precise level + S - read sample value (PERF_SAMPLE_READ) + D - pin the event to the PMU + W - group is weak and will fallback to non-group if not schedulable, + only supported in 'perf stat' for now. + +The 'p' modifier can be used for specifying how precise the instruction +address should be. The 'p' modifier can be specified multiple times: + + 0 - SAMPLE_IP can have arbitrary skid + 1 - SAMPLE_IP must have constant skid + 2 - SAMPLE_IP requested to have 0 skid + 3 - SAMPLE_IP must have 0 skid, or uses randomization to avoid + sample shadowing effects. + +For Intel systems precise event sampling is implemented with PEBS +which supports up to precise-level 2, and precise level 3 for +some special cases + +On AMD systems it is implemented using IBS (up to precise-level 2). +The precise modifier works with event types 0x76 (cpu-cycles, CPU +clocks not halted) and 0xC1 (micro-ops retired). Both events map to +IBS execution sampling (IBS op) with the IBS Op Counter Control bit +(IbsOpCntCtl) set respectively (see AMD64 Architecture Programmer’s +Manual Volume 2: System Programming, 13.3 Instruction-Based +Sampling). Examples to use IBS: + + perf record -a -e cpu-cycles:p ... # use ibs op counting cycles + perf record -a -e r076:p ... # same as -e cpu-cycles:p + perf record -a -e r0C1:p ... # use ibs op counting micro-ops + +RAW HARDWARE EVENT DESCRIPTOR +----------------------------- +Even when an event is not available in a symbolic form within perf right now, +it can be encoded in a per processor specific way. + +For instance For x86 CPUs NNN represents the raw register encoding with the +layout of IA32_PERFEVTSELx MSRs (see [Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 3B: System Programming Guide] Figure 30-1 Layout +of IA32_PERFEVTSELx MSRs) or AMD's PerfEvtSeln (see [AMD64 Architecture Programmer’s Manual Volume 2: System Programming], Page 344, +Figure 13-7 Performance Event-Select Register (PerfEvtSeln)). + +Note: Only the following bit fields can be set in x86 counter +registers: event, umask, edge, inv, cmask. Esp. guest/host only and +OS/user mode flags must be setup using <>. + +Example: + +If the Intel docs for a QM720 Core i7 describe an event as: + + Event Umask Event Mask + Num. Value Mnemonic Description Comment + + A8H 01H LSD.UOPS Counts the number of micro-ops Use cmask=1 and + delivered by loop stream detector invert to count + cycles + +raw encoding of 0x1A8 can be used: + + perf stat -e r1a8 -a sleep 1 + perf record -e r1a8 ... + +You should refer to the processor specific documentation for getting these +details. Some of them are referenced in the SEE ALSO section below. + +ARBITRARY PMUS +-------------- + +perf also supports an extended syntax for specifying raw parameters +to PMUs. Using this typically requires looking up the specific event +in the CPU vendor specific documentation. + +The available PMUs and their raw parameters can be listed with + + ls /sys/devices/*/format + +For example the raw event "LSD.UOPS" core pmu event above could +be specified as + + perf stat -e cpu/event=0xa8,umask=0x1,name=LSD.UOPS_CYCLES,cmask=1/ ... + +PER SOCKET PMUS +--------------- + +Some PMUs are not associated with a core, but with a whole CPU socket. +Events on these PMUs generally cannot be sampled, but only counted globally +with perf stat -a. They can be bound to one logical CPU, but will measure +all the CPUs in the same socket. + +This example measures memory bandwidth every second +on the first memory controller on socket 0 of a Intel Xeon system + + perf stat -C 0 -a uncore_imc_0/cas_count_read/,uncore_imc_0/cas_count_write/ -I 1000 ... + +Each memory controller has its own PMU. Measuring the complete system +bandwidth would require specifying all imc PMUs (see perf list output), +and adding the values together. To simplify creation of multiple events, +prefix and glob matching is supported in the PMU name, and the prefix +'uncore_' is also ignored when performing the match. So the command above +can be expanded to all memory controllers by using the syntaxes: + + perf stat -C 0 -a imc/cas_count_read/,imc/cas_count_write/ -I 1000 ... + perf stat -C 0 -a *imc*/cas_count_read/,*imc*/cas_count_write/ -I 1000 ... + +This example measures the combined core power every second + + perf stat -I 1000 -e power/energy-cores/ -a + +ACCESS RESTRICTIONS +------------------- + +For non root users generally only context switched PMU events are available. +This is normally only the events in the cpu PMU, the predefined events +like cycles and instructions and some software events. + +Other PMUs and global measurements are normally root only. +Some event qualifiers, such as "any", are also root only. + +This can be overriden by setting the kernel.perf_event_paranoid +sysctl to -1, which allows non root to use these events. + +For accessing trace point events perf needs to have read access to +/sys/kernel/debug/tracing, even when perf_event_paranoid is in a relaxed +setting. + +TRACING +------- + +Some PMUs control advanced hardware tracing capabilities, such as Intel PT, +that allows low overhead execution tracing. These are described in a separate +intel-pt.txt document. + +PARAMETERIZED EVENTS +-------------------- + +Some pmu events listed by 'perf-list' will be displayed with '?' in them. For +example: + + hv_gpci/dtbp_ptitc,phys_processor_idx=?/ + +This means that when provided as an event, a value for '?' must +also be supplied. For example: + + perf stat -C 0 -e 'hv_gpci/dtbp_ptitc,phys_processor_idx=0x2/' ... + +EVENT GROUPS +------------ + +Perf supports time based multiplexing of events, when the number of events +active exceeds the number of hardware performance counters. Multiplexing +can cause measurement errors when the workload changes its execution +profile. + +When metrics are computed using formulas from event counts, it is useful to +ensure some events are always measured together as a group to minimize multiplexing +errors. Event groups can be specified using { }. + + perf stat -e '{instructions,cycles}' ... + +The number of available performance counters depend on the CPU. A group +cannot contain more events than available counters. +For example Intel Core CPUs typically have four generic performance counters +for the core, plus three fixed counters for instructions, cycles and +ref-cycles. Some special events have restrictions on which counter they +can schedule, and may not support multiple instances in a single group. +When too many events are specified in the group some of them will not +be measured. + +Globally pinned events can limit the number of counters available for +other groups. On x86 systems, the NMI watchdog pins a counter by default. +The nmi watchdog can be disabled as root with + + echo 0 > /proc/sys/kernel/nmi_watchdog + +Events from multiple different PMUs cannot be mixed in a group, with +some exceptions for software events. + +LEADER SAMPLING +--------------- + +perf also supports group leader sampling using the :S specifier. + + perf record -e '{cycles,instructions}:S' ... + perf report --group + +Normally all events in a event group sample, but with :S only +the first event (the leader) samples, and it only reads the values of the +other events in the group. + +OPTIONS +------- + +Without options all known events will be listed. + +To limit the list use: + +. 'hw' or 'hardware' to list hardware events such as cache-misses, etc. + +. 'sw' or 'software' to list software events such as context switches, etc. + +. 'cache' or 'hwcache' to list hardware cache events such as L1-dcache-loads, etc. + +. 'tracepoint' to list all tracepoint events, alternatively use + 'subsys_glob:event_glob' to filter by tracepoint subsystems such as sched, + block, etc. + +. 'pmu' to print the kernel supplied PMU events. + +. 'sdt' to list all Statically Defined Tracepoint events. + +. 'metric' to list metrics + +. 'metricgroup' to list metricgroups with metrics. + +. If none of the above is matched, it will apply the supplied glob to all + events, printing the ones that match. + +. As a last resort, it will do a substring search in all event names. + +One or more types can be used at the same time, listing the events for the +types specified. + +Support raw format: + +. '--raw-dump', shows the raw-dump of all the events. +. '--raw-dump [hw|sw|cache|tracepoint|pmu|event_glob]', shows the raw-dump of + a certain kind of events. + +SEE ALSO +-------- +linkperf:perf-stat[1], linkperf:perf-top[1], +linkperf:perf-record[1], +http://www.intel.com/sdm/[Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 3B: System Programming Guide], +http://support.amd.com/us/Processor_TechDocs/24593_APM_v2.pdf[AMD64 Architecture Programmer’s Manual Volume 2: System Programming] diff --git a/Documentation/perf-lock.txt b/Documentation/perf-lock.txt new file mode 100644 index 0000000..74d7745 --- /dev/null +++ b/Documentation/perf-lock.txt @@ -0,0 +1,70 @@ +perf-lock(1) +============ + +NAME +---- +perf-lock - Analyze lock events + +SYNOPSIS +-------- +[verse] +'perf lock' {record|report|script|info} + +DESCRIPTION +----------- +You can analyze various lock behaviours +and statistics with this 'perf lock' command. + + 'perf lock record ' records lock events + between start and end . And this command + produces the file "perf.data" which contains tracing + results of lock events. + + 'perf lock report' reports statistical data. + + 'perf lock script' shows raw lock events. + + 'perf lock info' shows metadata like threads or addresses + of lock instances. + +COMMON OPTIONS +-------------- + +-i:: +--input=:: + Input file name. (default: perf.data unless stdin is a fifo) + +-v:: +--verbose:: + Be more verbose (show symbol address, etc). + +-D:: +--dump-raw-trace:: + Dump raw trace in ASCII. + +-f:: +--force:: + Don't complan, do it. + +REPORT OPTIONS +-------------- + +-k:: +--key=:: + Sorting key. Possible values: acquired (default), contended, + avg_wait, wait_total, wait_max, wait_min. + +INFO OPTIONS +------------ + +-t:: +--threads:: + dump thread list in perf.data + +-m:: +--map:: + dump map of lock instances (address:name table) + +SEE ALSO +-------- +linkperf:perf[1] diff --git a/Documentation/perf-mem.txt b/Documentation/perf-mem.txt new file mode 100644 index 0000000..f8d2167 --- /dev/null +++ b/Documentation/perf-mem.txt @@ -0,0 +1,92 @@ +perf-mem(1) +=========== + +NAME +---- +perf-mem - Profile memory accesses + +SYNOPSIS +-------- +[verse] +'perf mem' [] (record [] | report) + +DESCRIPTION +----------- +"perf mem record" runs a command and gathers memory operation data +from it, into perf.data. Perf record options are accepted and are passed through. + +"perf mem report" displays the result. It invokes perf report with the +right set of options to display a memory access profile. By default, loads +and stores are sampled. Use the -t option to limit to loads or stores. + +Note that on Intel systems the memory latency reported is the use-latency, +not the pure load (or store latency). Use latency includes any pipeline +queueing delays in addition to the memory subsystem latency. + +OPTIONS +------- +...:: + Any command you can specify in a shell. + +-i:: +--input=:: + Input file name. + +-f:: +--force:: + Don't do ownership validation + +-t:: +--type=:: + Select the memory operation type: load or store (default: load,store) + +-D:: +--dump-raw-samples:: + Dump the raw decoded samples on the screen in a format that is easy to parse with + one sample per line. + +-x:: +--field-separator=:: + Specify the field separator used when dump raw samples (-D option). By default, + The separator is the space character. + +-C:: +--cpu=:: + Monitor only on the list of CPUs provided. Multiple CPUs can be provided as a + comma-separated list with no space: 0,1. Ranges of CPUs are specified with -: 0-2. Default + is to monitor all CPUS. +-U:: +--hide-unresolved:: + Only display entries resolved to a symbol. + +-p:: +--phys-data:: + Record/Report sample physical addresses + +RECORD OPTIONS +-------------- +-e:: +--event :: + Event selector. Use 'perf mem record -e list' to list available events. + +-K:: +--all-kernel:: + Configure all used events to run in kernel space. + +-U:: +--all-user:: + Configure all used events to run in user space. + +-v:: +--verbose:: + Be more verbose (show counter open errors, etc) + +--ldlat :: + Specify desired latency for loads event. + +In addition, for report all perf report options are valid, and for record +all perf record options. + +SEE ALSO +-------- +linkperf:perf-record[1], linkperf:perf-report[1] diff --git a/Documentation/perf-probe.txt b/Documentation/perf-probe.txt new file mode 100644 index 0000000..b6866a0 --- /dev/null +++ b/Documentation/perf-probe.txt @@ -0,0 +1,299 @@ +perf-probe(1) +============= + +NAME +---- +perf-probe - Define new dynamic tracepoints + +SYNOPSIS +-------- +[verse] +'perf probe' [options] --add='PROBE' [...] +or +'perf probe' [options] PROBE +or +'perf probe' [options] --del='[GROUP:]EVENT' [...] +or +'perf probe' --list[=[GROUP:]EVENT] +or +'perf probe' [options] --line='LINE' +or +'perf probe' [options] --vars='PROBEPOINT' +or +'perf probe' [options] --funcs +or +'perf probe' [options] --definition='PROBE' [...] + +DESCRIPTION +----------- +This command defines dynamic tracepoint events, by symbol and registers +without debuginfo, or by C expressions (C line numbers, C function names, +and C local variables) with debuginfo. + + +OPTIONS +------- +-k:: +--vmlinux=PATH:: + Specify vmlinux path which has debuginfo (Dwarf binary). + Only when using this with --definition, you can give an offline + vmlinux file. + +-m:: +--module=MODNAME|PATH:: + Specify module name in which perf-probe searches probe points + or lines. If a path of module file is passed, perf-probe + treat it as an offline module (this means you can add a probe on + a module which has not been loaded yet). + +-s:: +--source=PATH:: + Specify path to kernel source. + +-v:: +--verbose:: + Be more verbose (show parsed arguments, etc). + Can not use with -q. + +-q:: +--quiet:: + Be quiet (do not show any messages including errors). + Can not use with -v. + +-a:: +--add=:: + Define a probe event (see PROBE SYNTAX for detail). + +-d:: +--del=:: + Delete probe events. This accepts glob wildcards('*', '?') and character + classes(e.g. [a-z], [!A-Z]). + +-l:: +--list[=[GROUP:]EVENT]:: + List up current probe events. This can also accept filtering patterns of + event names. + When this is used with --cache, perf shows all cached probes instead of + the live probes. + +-L:: +--line=:: + Show source code lines which can be probed. This needs an argument + which specifies a range of the source code. (see LINE SYNTAX for detail) + +-V:: +--vars=:: + Show available local variables at given probe point. The argument + syntax is same as PROBE SYNTAX, but NO ARGs. + +--externs:: + (Only for --vars) Show external defined variables in addition to local + variables. + +--no-inlines:: + (Only for --add) Search only for non-inlined functions. The functions + which do not have instances are ignored. + +-F:: +--funcs[=FILTER]:: + Show available functions in given module or kernel. With -x/--exec, + can also list functions in a user space executable / shared library. + This also can accept a FILTER rule argument. + +-D:: +--definition=:: + Show trace-event definition converted from given probe-event instead + of write it into tracing/[k,u]probe_events. + +--filter=FILTER:: + (Only for --vars and --funcs) Set filter. FILTER is a combination of glob + pattern, see FILTER PATTERN for detail. + Default FILTER is "!__k???tab_* & !__crc_*" for --vars, and "!_*" + for --funcs. + If several filters are specified, only the last filter is used. + +-f:: +--force:: + Forcibly add events with existing name. + +-n:: +--dry-run:: + Dry run. With this option, --add and --del doesn't execute actual + adding and removal operations. + +--cache:: + (With --add) Cache the probes. Any events which successfully added + are also stored in the cache file. + (With --list) Show cached probes. + (With --del) Remove cached probes. + +--max-probes=NUM:: + Set the maximum number of probe points for an event. Default is 128. + +--target-ns=PID: + Obtain mount namespace information from the target pid. This is + used when creating a uprobe for a process that resides in a + different mount namespace from the perf(1) utility. + +-x:: +--exec=PATH:: + Specify path to the executable or shared library file for user + space tracing. Can also be used with --funcs option. + +--demangle:: + Demangle application symbols. --no-demangle is also available + for disabling demangling. + +--demangle-kernel:: + Demangle kernel symbols. --no-demangle-kernel is also available + for disabling kernel demangling. + +In absence of -m/-x options, perf probe checks if the first argument after +the options is an absolute path name. If its an absolute path, perf probe +uses it as a target module/target user space binary to probe. + +PROBE SYNTAX +------------ +Probe points are defined by following syntax. + + 1) Define event based on function name + [[GROUP:]EVENT=]FUNC[@SRC][:RLN|+OFFS|%return|;PTN] [ARG ...] + + 2) Define event based on source file with line number + [[GROUP:]EVENT=]SRC:ALN [ARG ...] + + 3) Define event based on source file with lazy pattern + [[GROUP:]EVENT=]SRC;PTN [ARG ...] + + 4) Pre-defined SDT events or cached event with name + %[sdt_PROVIDER:]SDTEVENT + or, + sdt_PROVIDER:SDTEVENT + +'EVENT' specifies the name of new event, if omitted, it will be set the name of the probed function, and for return probes, a "\_\_return" suffix is automatically added to the function name. You can also specify a group name by 'GROUP', if omitted, set 'probe' is used for kprobe and 'probe_' is used for uprobe. +Note that using existing group name can conflict with other events. Especially, using the group name reserved for kernel modules can hide embedded events in the +modules. +'FUNC' specifies a probed function name, and it may have one of the following options; '+OFFS' is the offset from function entry address in bytes, ':RLN' is the relative-line number from function entry line, and '%return' means that it probes function return. And ';PTN' means lazy matching pattern (see LAZY MATCHING). Note that ';PTN' must be the end of the probe point definition. In addition, '@SRC' specifies a source file which has that function. +It is also possible to specify a probe point by the source line number or lazy matching by using 'SRC:ALN' or 'SRC;PTN' syntax, where 'SRC' is the source file path, ':ALN' is the line number and ';PTN' is the lazy matching pattern. +'ARG' specifies the arguments of this probe point, (see PROBE ARGUMENT). +'SDTEVENT' and 'PROVIDER' is the pre-defined event name which is defined by user SDT (Statically Defined Tracing) or the pre-cached probes with event name. +Note that before using the SDT event, the target binary (on which SDT events are defined) must be scanned by linkperf:perf-buildid-cache[1] to make SDT events as cached events. + +For details of the SDT, see below. +https://sourceware.org/gdb/onlinedocs/gdb/Static-Probe-Points.html + +ESCAPED CHARACTER +----------------- + +In the probe syntax, '=', '@', '+', ':' and ';' are treated as a special character. You can use a backslash ('\') to escape the special characters. +This is useful if you need to probe on a specific versioned symbols, like @GLIBC_... suffixes, or also you need to specify a source file which includes the special characters. +Note that usually single backslash is consumed by shell, so you might need to pass double backslash (\\) or wrapping with single quotes (\'AAA\@BBB'). +See EXAMPLES how it is used. + +PROBE ARGUMENT +-------------- +Each probe argument follows below syntax. + + [NAME=]LOCALVAR|$retval|%REG|@SYMBOL[:TYPE] + +'NAME' specifies the name of this argument (optional). You can use the name of local variable, local data structure member (e.g. var->field, var.field2), local array with fixed index (e.g. array[1], var->array[0], var->pointer[2]), or kprobe-tracer argument format (e.g. $retval, %ax, etc). Note that the name of this argument will be set as the last member name if you specify a local data structure member (e.g. field2 for 'var->field1.field2'.) +'$vars' and '$params' special arguments are also available for NAME, '$vars' is expanded to the local variables (including function parameters) which can access at given probe point. '$params' is expanded to only the function parameters. +'TYPE' casts the type of this argument (optional). If omitted, perf probe automatically set the type based on debuginfo (*). Currently, basic types (u8/u16/u32/u64/s8/s16/s32/s64), hexadecimal integers (x/x8/x16/x32/x64), signedness casting (u/s), "string" and bitfield are supported. (see TYPES for detail) +On x86 systems %REG is always the short form of the register: for example %AX. %RAX or %EAX is not valid. + +TYPES +----- +Basic types (u8/u16/u32/u64/s8/s16/s32/s64) and hexadecimal integers (x8/x16/x32/x64) are integer types. Prefix 's' and 'u' means those types are signed and unsigned respectively, and 'x' means that is shown in hexadecimal format. Traced arguments are shown in decimal (sNN/uNN) or hex (xNN). You can also use 's' or 'u' to specify only signedness and leave its size auto-detected by perf probe. Moreover, you can use 'x' to explicitly specify to be shown in hexadecimal (the size is also auto-detected). +String type is a special type, which fetches a "null-terminated" string from kernel space. This means it will fail and store NULL if the string container has been paged out. You can specify 'string' type only for the local variable or structure member which is an array of or a pointer to 'char' or 'unsigned char' type. +Bitfield is another special type, which takes 3 parameters, bit-width, bit-offset, and container-size (usually 32). The syntax is; + + b@/ + +LINE SYNTAX +----------- +Line range is described by following syntax. + + "FUNC[@SRC][:RLN[+NUM|-RLN2]]|SRC[:ALN[+NUM|-ALN2]]" + +FUNC specifies the function name of showing lines. 'RLN' is the start line +number from function entry line, and 'RLN2' is the end line number. As same as +probe syntax, 'SRC' means the source file path, 'ALN' is start line number, +and 'ALN2' is end line number in the file. It is also possible to specify how +many lines to show by using 'NUM'. Moreover, 'FUNC@SRC' combination is good +for searching a specific function when several functions share same name. +So, "source.c:100-120" shows lines between 100th to l20th in source.c file. And "func:10+20" shows 20 lines from 10th line of func function. + +LAZY MATCHING +------------- + The lazy line matching is similar to glob matching but ignoring spaces in both of pattern and target. So this accepts wildcards('*', '?') and character classes(e.g. [a-z], [!A-Z]). + +e.g. + 'a=*' can matches 'a=b', 'a = b', 'a == b' and so on. + +This provides some sort of flexibility and robustness to probe point definitions against minor code changes. For example, actual 10th line of schedule() can be moved easily by modifying schedule(), but the same line matching 'rq=cpu_rq*' may still exist in the function.) + +FILTER PATTERN +-------------- + The filter pattern is a glob matching pattern(s) to filter variables. + In addition, you can use "!" for specifying filter-out rule. You also can give several rules combined with "&" or "|", and fold those rules as one rule by using "(" ")". + +e.g. + With --filter "foo* | bar*", perf probe -V shows variables which start with "foo" or "bar". + With --filter "!foo* & *bar", perf probe -V shows variables which don't start with "foo" and end with "bar", like "fizzbar". But "foobar" is filtered out. + +EXAMPLES +-------- +Display which lines in schedule() can be probed: + + ./perf probe --line schedule + +Add a probe on schedule() function 12th line with recording cpu local variable: + + ./perf probe schedule:12 cpu + or + ./perf probe --add='schedule:12 cpu' + +Add one or more probes which has the name start with "schedule". + + ./perf probe schedule* + or + ./perf probe --add='schedule*' + +Add probes on lines in schedule() function which calls update_rq_clock(). + + ./perf probe 'schedule;update_rq_clock*' + or + ./perf probe --add='schedule;update_rq_clock*' + +Delete all probes on schedule(). + + ./perf probe --del='schedule*' + +Add probes at zfree() function on /bin/zsh + + ./perf probe -x /bin/zsh zfree or ./perf probe /bin/zsh zfree + +Add probes at malloc() function on libc + + ./perf probe -x /lib/libc.so.6 malloc or ./perf probe /lib/libc.so.6 malloc + +Add a uprobe to a target process running in a different mount namespace + + ./perf probe --target-ns -x /lib64/libc.so.6 malloc + +Add a USDT probe to a target process running in a different mount namespace + + ./perf probe --target-ns -x /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.121-0.b13.el7_3.x86_64/jre/lib/amd64/server/libjvm.so %sdt_hotspot:thread__sleep__end + +Add a probe on specific versioned symbol by backslash escape + + ./perf probe -x /lib64/libc-2.25.so 'malloc_get_state\@GLIBC_2.2.5' + +Add a probe in a source file using special characters by backslash escape + + ./perf probe -x /opt/test/a.out 'foo\+bar.c:4' + + +SEE ALSO +-------- +linkperf:perf-trace[1], linkperf:perf-record[1], linkperf:perf-buildid-cache[1] diff --git a/Documentation/perf-record.txt b/Documentation/perf-record.txt new file mode 100644 index 0000000..cc37b3a --- /dev/null +++ b/Documentation/perf-record.txt @@ -0,0 +1,502 @@ +perf-record(1) +============== + +NAME +---- +perf-record - Run a command and record its profile into perf.data + +SYNOPSIS +-------- +[verse] +'perf record' [-e | --event=EVENT] [-a] +'perf record' [-e | --event=EVENT] [-a] -- [] + +DESCRIPTION +----------- +This command runs a command and gathers a performance counter profile +from it, into perf.data - without displaying anything. + +This file can then be inspected later on, using 'perf report'. + + +OPTIONS +------- +...:: + Any command you can specify in a shell. + +-e:: +--event=:: + Select the PMU event. Selection can be: + + - a symbolic event name (use 'perf list' to list all events) + + - a raw PMU event (eventsel+umask) in the form of rNNN where NNN is a + hexadecimal event descriptor. + + - a symbolically formed PMU event like 'pmu/param1=0x3,param2/' where + 'param1', 'param2', etc are defined as formats for the PMU in + /sys/bus/event_source/devices//format/*. + + - a symbolically formed event like 'pmu/config=M,config1=N,config3=K/' + + where M, N, K are numbers (in decimal, hex, octal format). Acceptable + values for each of 'config', 'config1' and 'config2' are defined by + corresponding entries in /sys/bus/event_source/devices//format/* + param1 and param2 are defined as formats for the PMU in: + /sys/bus/event_source/devices//format/* + + There are also some parameters which are not defined in ...//format/*. + These params can be used to overload default config values per event. + Here are some common parameters: + - 'period': Set event sampling period + - 'freq': Set event sampling frequency + - 'time': Disable/enable time stamping. Acceptable values are 1 for + enabling time stamping. 0 for disabling time stamping. + The default is 1. + - 'call-graph': Disable/enable callgraph. Acceptable str are "fp" for + FP mode, "dwarf" for DWARF mode, "lbr" for LBR mode and + "no" for disable callgraph. + - 'stack-size': user stack size for dwarf mode + + See the linkperf:perf-list[1] man page for more parameters. + + Note: If user explicitly sets options which conflict with the params, + the value set by the parameters will be overridden. + + Also not defined in ...//format/* are PMU driver specific + configuration parameters. Any configuration parameter preceded by + the letter '@' is not interpreted in user space and sent down directly + to the PMU driver. For example: + + perf record -e some_event/@cfg1,@cfg2=config/ ... + + will see 'cfg1' and 'cfg2=config' pushed to the PMU driver associated + with the event for further processing. There is no restriction on + what the configuration parameters are, as long as their semantic is + understood and supported by the PMU driver. + + - a hardware breakpoint event in the form of '\mem:addr[/len][:access]' + where addr is the address in memory you want to break in. + Access is the memory access type (read, write, execute) it can + be passed as follows: '\mem:addr[:[r][w][x]]'. len is the range, + number of bytes from specified addr, which the breakpoint will cover. + If you want to profile read-write accesses in 0x1000, just set + 'mem:0x1000:rw'. + If you want to profile write accesses in [0x1000~1008), just set + 'mem:0x1000/8:w'. + + - a group of events surrounded by a pair of brace ("{event1,event2,...}"). + Each event is separated by commas and the group should be quoted to + prevent the shell interpretation. You also need to use --group on + "perf report" to view group events together. + +--filter=:: + Event filter. This option should follow a event selector (-e) which + selects either tracepoint event(s) or a hardware trace PMU + (e.g. Intel PT or CoreSight). + + - tracepoint filters + + In the case of tracepoints, multiple '--filter' options are combined + using '&&'. + + - address filters + + A hardware trace PMU advertises its ability to accept a number of + address filters by specifying a non-zero value in + /sys/bus/event_source/devices//nr_addr_filters. + + Address filters have the format: + + filter|start|stop|tracestop [/ ] [@] + + Where: + - 'filter': defines a region that will be traced. + - 'start': defines an address at which tracing will begin. + - 'stop': defines an address at which tracing will stop. + - 'tracestop': defines a region in which tracing will stop. + + is the name of the object file, is the offset to the + code to trace in that file, and is the size of the region to + trace. 'start' and 'stop' filters need not specify a . + + If no object file is specified then the kernel is assumed, in which case + the start address must be a current kernel memory address. + + can also be specified by providing the name of a symbol. If the + symbol name is not unique, it can be disambiguated by inserting #n where + 'n' selects the n'th symbol in address order. Alternately #0, #g or #G + select only a global symbol. can also be specified by providing + the name of a symbol, in which case the size is calculated to the end + of that symbol. For 'filter' and 'tracestop' filters, if is + omitted and is a symbol, then the size is calculated to the end + of that symbol. + + If is omitted and is '*', then the start and size will + be calculated from the first and last symbols, i.e. to trace the whole + file. + + If symbol names (or '*') are provided, they must be surrounded by white + space. + + The filter passed to the kernel is not necessarily the same as entered. + To see the filter that is passed, use the -v option. + + The kernel may not be able to configure a trace region if it is not + within a single mapping. MMAP events (or /proc//maps) can be + examined to determine if that is a possibility. + + Multiple filters can be separated with space or comma. + +--exclude-perf:: + Don't record events issued by perf itself. This option should follow + a event selector (-e) which selects tracepoint event(s). It adds a + filter expression 'common_pid != $PERFPID' to filters. If other + '--filter' exists, the new filter expression will be combined with + them by '&&'. + +-a:: +--all-cpus:: + System-wide collection from all CPUs (default if no target is specified). + +-p:: +--pid=:: + Record events on existing process ID (comma separated list). + +-t:: +--tid=:: + Record events on existing thread ID (comma separated list). + This option also disables inheritance by default. Enable it by adding + --inherit. + +-u:: +--uid=:: + Record events in threads owned by uid. Name or number. + +-r:: +--realtime=:: + Collect data with this RT SCHED_FIFO priority. + +--no-buffering:: + Collect data without buffering. + +-c:: +--count=:: + Event period to sample. + +-o:: +--output=:: + Output file name. + +-i:: +--no-inherit:: + Child tasks do not inherit counters. + +-F:: +--freq=:: + Profile at this frequency. Use 'max' to use the currently maximum + allowed frequency, i.e. the value in the kernel.perf_event_max_sample_rate + sysctl. Will throttle down to the currently maximum allowed frequency. + See --strict-freq. + +--strict-freq:: + Fail if the specified frequency can't be used. + +-m:: +--mmap-pages=:: + Number of mmap data pages (must be a power of two) or size + specification with appended unit character - B/K/M/G. The + size is rounded up to have nearest pages power of two value. + Also, by adding a comma, the number of mmap pages for AUX + area tracing can be specified. + +--group:: + Put all events in a single event group. This precedes the --event + option and remains only for backward compatibility. See --event. + +-g:: + Enables call-graph (stack chain/backtrace) recording. + +--call-graph:: + Setup and enable call-graph (stack chain/backtrace) recording, + implies -g. Default is "fp". + + Allows specifying "fp" (frame pointer) or "dwarf" + (DWARF's CFI - Call Frame Information) or "lbr" + (Hardware Last Branch Record facility) as the method to collect + the information used to show the call graphs. + + In some systems, where binaries are build with gcc + --fomit-frame-pointer, using the "fp" method will produce bogus + call graphs, using "dwarf", if available (perf tools linked to + the libunwind or libdw library) should be used instead. + Using the "lbr" method doesn't require any compiler options. It + will produce call graphs from the hardware LBR registers. The + main limitation is that it is only available on new Intel + platforms, such as Haswell. It can only get user call chain. It + doesn't work with branch stack sampling at the same time. + + When "dwarf" recording is used, perf also records (user) stack dump + when sampled. Default size of the stack dump is 8192 (bytes). + User can change the size by passing the size after comma like + "--call-graph dwarf,4096". + +-q:: +--quiet:: + Don't print any message, useful for scripting. + +-v:: +--verbose:: + Be more verbose (show counter open errors, etc). + +-s:: +--stat:: + Record per-thread event counts. Use it with 'perf report -T' to see + the values. + +-d:: +--data:: + Record the sample virtual addresses. + +--phys-data:: + Record the sample physical addresses. + +-T:: +--timestamp:: + Record the sample timestamps. Use it with 'perf report -D' to see the + timestamps, for instance. + +-P:: +--period:: + Record the sample period. + +--sample-cpu:: + Record the sample cpu. + +-n:: +--no-samples:: + Don't sample. + +-R:: +--raw-samples:: +Collect raw sample records from all opened counters (default for tracepoint counters). + +-C:: +--cpu:: +Collect samples only on the list of CPUs provided. Multiple CPUs can be provided as a +comma-separated list with no space: 0,1. Ranges of CPUs are specified with -: 0-2. +In per-thread mode with inheritance mode on (default), samples are captured only when +the thread executes on the designated CPUs. Default is to monitor all CPUs. + +-B:: +--no-buildid:: +Do not save the build ids of binaries in the perf.data files. This skips +post processing after recording, which sometimes makes the final step in +the recording process to take a long time, as it needs to process all +events looking for mmap records. The downside is that it can misresolve +symbols if the workload binaries used when recording get locally rebuilt +or upgraded, because the only key available in this case is the +pathname. You can also set the "record.build-id" config variable to +'skip to have this behaviour permanently. + +-N:: +--no-buildid-cache:: +Do not update the buildid cache. This saves some overhead in situations +where the information in the perf.data file (which includes buildids) +is sufficient. You can also set the "record.build-id" config variable to +'no-cache' to have the same effect. + +-G name,...:: +--cgroup name,...:: +monitor only in the container (cgroup) called "name". This option is available only +in per-cpu mode. The cgroup filesystem must be mounted. All threads belonging to +container "name" are monitored when they run on the monitored CPUs. Multiple cgroups +can be provided. Each cgroup is applied to the corresponding event, i.e., first cgroup +to first event, second cgroup to second event and so on. It is possible to provide +an empty cgroup (monitor all the time) using, e.g., -G foo,,bar. Cgroups must have +corresponding events, i.e., they always refer to events defined earlier on the command +line. If the user wants to track multiple events for a specific cgroup, the user can +use '-e e1 -e e2 -G foo,foo' or just use '-e e1 -e e2 -G foo'. + +If wanting to monitor, say, 'cycles' for a cgroup and also for system wide, this +command line can be used: 'perf stat -e cycles -G cgroup_name -a -e cycles'. + +-b:: +--branch-any:: +Enable taken branch stack sampling. Any type of taken branch may be sampled. +This is a shortcut for --branch-filter any. See --branch-filter for more infos. + +-j:: +--branch-filter:: +Enable taken branch stack sampling. Each sample captures a series of consecutive +taken branches. The number of branches captured with each sample depends on the +underlying hardware, the type of branches of interest, and the executed code. +It is possible to select the types of branches captured by enabling filters. The +following filters are defined: + + - any: any type of branches + - any_call: any function call or system call + - any_ret: any function return or system call return + - ind_call: any indirect branch + - call: direct calls, including far (to/from kernel) calls + - u: only when the branch target is at the user level + - k: only when the branch target is in the kernel + - hv: only when the target is at the hypervisor level + - in_tx: only when the target is in a hardware transaction + - no_tx: only when the target is not in a hardware transaction + - abort_tx: only when the target is a hardware transaction abort + - cond: conditional branches + - save_type: save branch type during sampling in case binary is not available later + ++ +The option requires at least one branch type among any, any_call, any_ret, ind_call, cond. +The privilege levels may be omitted, in which case, the privilege levels of the associated +event are applied to the branch filter. Both kernel (k) and hypervisor (hv) privilege +levels are subject to permissions. When sampling on multiple events, branch stack sampling +is enabled for all the sampling events. The sampled branch type is the same for all events. +The various filters must be specified as a comma separated list: --branch-filter any_ret,u,k +Note that this feature may not be available on all processors. + +--weight:: +Enable weightened sampling. An additional weight is recorded per sample and can be +displayed with the weight and local_weight sort keys. This currently works for TSX +abort events and some memory events in precise mode on modern Intel CPUs. + +--namespaces:: +Record events of type PERF_RECORD_NAMESPACES. + +--transaction:: +Record transaction flags for transaction related events. + +--per-thread:: +Use per-thread mmaps. By default per-cpu mmaps are created. This option +overrides that and uses per-thread mmaps. A side-effect of that is that +inheritance is automatically disabled. --per-thread is ignored with a warning +if combined with -a or -C options. + +-D:: +--delay=:: +After starting the program, wait msecs before measuring. This is useful to +filter out the startup phase of the program, which is often very different. + +-I:: +--intr-regs:: +Capture machine state (registers) at interrupt, i.e., on counter overflows for +each sample. List of captured registers depends on the architecture. This option +is off by default. It is possible to select the registers to sample using their +symbolic names, e.g. on x86, ax, si. To list the available registers use +--intr-regs=\?. To name registers, pass a comma separated list such as +--intr-regs=ax,bx. The list of register is architecture dependent. + +--user-regs:: +Capture user registers at sample time. Same arguments as -I. + +--running-time:: +Record running and enabled time for read events (:S) + +-k:: +--clockid:: +Sets the clock id to use for the various time fields in the perf_event_type +records. See clock_gettime(). In particular CLOCK_MONOTONIC and +CLOCK_MONOTONIC_RAW are supported, some events might also allow +CLOCK_BOOTTIME, CLOCK_REALTIME and CLOCK_TAI. + +-S:: +--snapshot:: +Select AUX area tracing Snapshot Mode. This option is valid only with an +AUX area tracing event. Optionally the number of bytes to capture per +snapshot can be specified. In Snapshot Mode, trace data is captured only when +signal SIGUSR2 is received. + +--proc-map-timeout:: +When processing pre-existing threads /proc/XXX/mmap, it may take a long time, +because the file may be huge. A time out is needed in such cases. +This option sets the time out limit. The default value is 500 ms. + +--switch-events:: +Record context switch events i.e. events of type PERF_RECORD_SWITCH or +PERF_RECORD_SWITCH_CPU_WIDE. + +--clang-path=PATH:: +Path to clang binary to use for compiling BPF scriptlets. +(enabled when BPF support is on) + +--clang-opt=OPTIONS:: +Options passed to clang when compiling BPF scriptlets. +(enabled when BPF support is on) + +--vmlinux=PATH:: +Specify vmlinux path which has debuginfo. +(enabled when BPF prologue is on) + +--buildid-all:: +Record build-id of all DSOs regardless whether it's actually hit or not. + +--all-kernel:: +Configure all used events to run in kernel space. + +--all-user:: +Configure all used events to run in user space. + +--timestamp-filename +Append timestamp to output file name. + +--timestamp-boundary:: +Record timestamp boundary (time of first/last samples). + +--switch-output[=mode]:: +Generate multiple perf.data files, timestamp prefixed, switching to a new one +based on 'mode' value: + "signal" - when receiving a SIGUSR2 (default value) or + - when reaching the size threshold, size is expected to + be a number with appended unit character - B/K/M/G +