perf-stat(1)
============

NAME
----
perf-stat - Run a command and gather performance counter statistics

SYNOPSIS
--------
[verse]
'perf stat' [-e <EVENT> | --event=EVENT] [-a] <command>
'perf stat' [-e <EVENT> | --event=EVENT] [-a] -- <command> [<options>]
'perf stat' [-e <EVENT> | --event=EVENT] [-a] record [-o file] -- <command> [<options>]
'perf stat' report [-i file]

DESCRIPTION
-----------
This command runs a command and gathers performance counter statistics
from it.


OPTIONS
-------
<command>...::
Any command you can specify in a shell.

record::
See STAT RECORD.

report::
See STAT REPORT.

-e::
--event=::
Select the PMU event. Selection can be:

- a symbolic event name (use 'perf list' to list all events)

- a raw PMU event (eventsel+umask) in the form of rNNN where NNN is a
hexadecimal event descriptor.

- a symbolically formed event like 'pmu/param1=0x3,param2/' where
param1 and param2 are defined as formats for the PMU in
/sys/bus/event_source/devices/<pmu>/format/*

- a symbolically formed event like 'pmu/config=M,config1=N,config2=K/'
where M, N, K are numbers (in decimal, hex, octal format).
Acceptable values for each of 'config', 'config1' and 'config2'
parameters are defined by corresponding entries in
/sys/bus/event_source/devices/<pmu>/format/*

Note that the last two syntaxes support prefix and glob matching in
the PMU name to simplify creation of events across multiple instances
of the same type of PMU in large systems (e.g. memory controller PMUs).
Multiple PMU instances are typical for uncore PMUs, so the prefix
'uncore_' is also ignored when performing this match.
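
As a rough illustration of the rNNN form, the sketch below assumes the common
x86 layout, in which the unit mask occupies bits 8-15 of the descriptor and the
event select bits 0-7. The helper name raw_event is hypothetical and not part
of perf, and other architectures encode raw events differently.

```shell
# raw_event is a hypothetical helper, not part of perf.
# Assumption: x86 layout, descriptor = (umask << 8) | eventsel.
raw_event() {
    # $1 = event select, $2 = unit mask (decimal or 0x-prefixed hex)
    printf 'r%02x%02x\n' "$(($2))" "$(($1))"
}

# Intel's architectural "LLC reference" event: eventsel 0x2e, umask 0x4f.
raw_event 0x2e 0x4f    # prints r4f2e
```

The resulting string could then be passed as 'perf stat -e r4f2e -- <command>'.
Whether the event counts anything meaningful depends on the CPU.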

-i::
--no-inherit::
child tasks do not inherit counters

-p::
--pid=<pid>::
stat events on existing process id (comma separated list)

-t::
--tid=<tid>::
stat events on existing thread id (comma separated list)

-a::
--all-cpus::
system-wide collection from all CPUs (default if no target is specified)

-c::
--scale::
scale/normalize counter values

-d::
--detailed::
print more detailed statistics, can be specified up to 3 times

          -d:  detailed events, L1 and LLC data cache
       -d -d:  more detailed events, dTLB and iTLB events
    -d -d -d:  very detailed events, adding prefetch events

-r::
--repeat=<n>::
repeat command and print average + stddev (max: 100). 0 means forever.

-B::
--big-num::
print large numbers with thousands' separators according to locale

-C::
--cpu=::
Count only on the list of CPUs provided. Multiple CPUs can be provided as a
comma-separated list with no space: 0,1. Ranges of CPUs are specified with -: 0-2.
In per-thread mode, this option is ignored. The -a option is still necessary
to activate system-wide monitoring. Default is to count on all CPUs.
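
To make the CPU list syntax concrete, the following sketch expands a list such
as 0,2-4 into the individual CPU numbers it names. The function expand_cpus is
a hypothetical helper written for this illustration. It only models the
comma and range syntax described above and does not reflect how perf itself
parses the option.

```shell
# expand_cpus is a hypothetical helper, not part of perf.
# It expands a -C style list (commas and N-M ranges) into single CPU numbers.
expand_cpus() {
    printf '%s\n' "$1" | tr ',' '\n' | while read -r part; do
        case $part in
            *-*) seq "${part%-*}" "${part#*-}" ;;   # range such as 2-4
            *)   printf '%s\n' "$part" ;;           # single CPU
        esac
    done | paste -sd ' ' -
}

expand_cpus 0,2-4    # prints 0 2 3 4
```

A matching invocation would be 'perf stat -a -C 0,2-4 -- sleep 1', which counts
on CPUs 0, 2, 3 and 4 only.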

-A::
--no-aggr::
Do not aggregate counts across all monitored CPUs.

-n::
--null::
null run - don't start any counters

-v::
--verbose::
be more verbose (show counter open errors, etc.)

-x SEP::
--field-separator SEP::
print counts using a CSV-style output to make it easy to import directly into
spreadsheets. Columns are separated by the string specified in SEP.

-G name::
--cgroup name::
monitor only in the container (cgroup) called "name". This option is available only
in per-cpu mode. The cgroup filesystem must be mounted. All threads belonging to
container "name" are monitored when they run on the monitored CPUs. Multiple cgroups
can be provided. Each cgroup is applied to the corresponding event, i.e., first cgroup
to first event, second cgroup to second event and so on. It is possible to provide
an empty cgroup (monitor all the time) using, e.g., -G foo,,bar. Cgroups must have
corresponding events, i.e., they always refer to events defined earlier on the command
line. If the user wants to track multiple events for a specific cgroup, the user can
use '-e e1 -e e2 -G foo,foo' or just use '-e e1 -e e2 -G foo'.

To monitor, say, 'cycles' both for a cgroup and system-wide, this
command line can be used: 'perf stat -e cycles -G cgroup_name -a -e cycles'.
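
The positional pairing of events and cgroups can be modelled with a short
sketch. The function pair_cgroups is hypothetical and only restates the rules
given above: the Nth cgroup goes with the Nth event, an empty entry means no
cgroup filter, and a single cgroup name covers all events. It is not perf's
actual implementation.

```shell
# pair_cgroups is a hypothetical helper that models the pairing rule in the text.
# $1 = comma-separated events, $2 = comma-separated cgroups (entries may be empty)
pair_cgroups() {
    awk -v ev="$1" -v cg="$2" 'BEGIN {
        n = split(ev, e, ",")
        m = split(cg, c, ",")
        for (i = 1; i <= n; i++) {
            # A single cgroup name is applied to every event.
            g = (m == 1) ? c[1] : c[i]
            printf "%s -> %s\n", e[i], (g == "" ? "(no cgroup)" : g)
        }
    }'
}

pair_cgroups e1,e2,e3 foo,,bar
# e1 -> foo
# e2 -> (no cgroup)
# e3 -> bar
```

With these rules, 'pair_cgroups e1,e2 foo' and 'pair_cgroups e1,e2 foo,foo'
print the same pairing, which mirrors the two equivalent command lines above.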

-o file::
--output file::
Print the output into the designated file.

--append::
Append to the output file designated with the -o option. Ignored if -o is not specified.

--log-fd::

Log output to fd, instead of stderr. Complementary to --output, and mutually exclusive
with it. --append may be used here. Examples:
3>results perf stat --log-fd 3 -- $cmd
3>>results perf stat --log-fd 3 --append -- $cmd

--pre::
--post::
Pre and post measurement hooks, e.g.:

perf stat --repeat 10 --null --sync --pre 'make -s O=defconfig-build/clean' -- make -s -j64 O=defconfig-build/ bzImage

-I msecs::
--interval-print msecs::
Print count deltas every N milliseconds (minimum: 1ms).
The overhead percentage could be high in some cases, for instance with small, sub-100ms intervals. Use with caution.
example: 'perf stat -I 1000 -e cycles -a sleep 5'

--interval-count times::
Print count deltas a fixed number of times.
This option should be used together with the "-I" option.
example: 'perf stat -I 1000 --interval-count 2 -e cycles -a'

--timeout msecs::
Stop the 'perf stat' session and print count deltas after N milliseconds (minimum: 10 ms).
This option is not supported with the "-I" option.
example: 'perf stat --timeout 2000 -e cycles -a'

--metric-only::
Only print computed metrics. Print them in a single line.
Don't show any raw values. Not supported with --per-thread.

--per-socket::
Aggregate counts per processor socket for system-wide mode measurements. This
is a useful mode to detect imbalance between sockets. To enable this mode,
use --per-socket in addition to -a (system-wide). The output includes the
socket number and the number of online processors on that socket. This is
useful to gauge the amount of aggregation.

--per-core::
Aggregate counts per physical processor for system-wide mode measurements. This
is a useful mode to detect imbalance between physical cores. To enable this mode,
use --per-core in addition to -a (system-wide). The output includes the
core number and the number of online logical processors on that physical processor.

--per-thread::
Aggregate counts per monitored thread, when monitoring threads (-t option)
or processes (-p option).

-D msecs::
--delay msecs::
After starting the program, wait msecs before measuring. This is useful to
filter out the startup phase of the program, which is often very different.

-T::
--transaction::

Print statistics of transactional execution if supported.

STAT RECORD
-----------
Stores stat data into a perf data file.

-o file::
--output file::
Output file name.

STAT REPORT
-----------
Reads and reports stat data from a perf data file.

-i file::
--input file::
Input file name.

--per-socket::
Aggregate counts per processor socket for system-wide mode measurements.

--per-core::
Aggregate counts per physical processor for system-wide mode measurements.

-M::
--metrics::
Print metrics or metricgroups specified in a comma separated list.
For a group all metrics from the group are added.
The events from the metrics are automatically measured.
See perf list output for the possible metrics and metricgroups.

-A::
--no-aggr::
Do not aggregate counts across all monitored CPUs.

--topdown::
Print top down level 1 metrics if supported by the CPU. This makes it possible
to determine bottlenecks in the CPU pipeline for CPU bound workloads,
by breaking the cycles consumed down into frontend bound, backend bound,
bad speculation and retiring.

Frontend bound means that the CPU cannot fetch and decode instructions fast
enough. Backend bound means that computation or memory access is the
bottleneck. Bad Speculation means that the CPU wasted cycles due to branch
mispredictions and similar issues. Retiring means that the CPU computed without
an apparent bottleneck. The bottleneck is only the real bottleneck
if the workload is actually bound by the CPU and not by something else.

For best results it is usually a good idea to use it with interval
mode like -I 1000, as the bottleneck of workloads can change often.

The top down metrics are collected per core instead of per
CPU thread. Per core mode is automatically enabled
and -a (global monitoring) is needed, requiring root rights or
kernel.perf_event_paranoid=-1.

Topdown uses the full Performance Monitoring Unit, and needs
disabling of the NMI watchdog (as root):
echo 0 > /proc/sys/kernel/nmi_watchdog
for best results. Otherwise the bottlenecks may be inconsistent
on workloads with changing phases.

This enables --metric-only, unless overridden with --no-metric-only.

To interpret the results it is usually necessary to know which
CPUs the workload runs on. If needed the CPUs can be forced using
taskset.

--no-merge::
Do not merge results from the same PMUs.

When multiple events are created from a single event specification,
stat will, by default, aggregate the event counts and show the result
in a single row. This option disables that behavior and shows
the individual events and counts.

Multiple events are created from a single event specification when:
1. Prefix or glob matching is used for the PMU name.
2. Aliases, which are listed immediately after the Kernel PMU events
by perf list, are used.

--smi-cost::
Measure SMI cost if msr/aperf/ and msr/smi/ events are supported.

During the measurement, /sys/devices/cpu/freeze_on_smi will be set to
freeze core counters on SMI.
The aperf counter will not be affected by the setting.
The cost of SMI can be measured by (aperf - unhalted core cycles).

In practice, the percentage of SMI cycles is very useful for performance
oriented analysis. --metric-only will be applied by default.
The output is SMI cycles%, equal to (aperf - unhalted core cycles) / aperf.

Users who want to get the actual value can apply --no-metric-only.
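
The SMI cycles% formula can be worked through by hand. The sketch below uses a
hypothetical helper, smi_cycles_pct, and made-up counter values. It only
evaluates (aperf - unhalted core cycles) / aperf as a percentage and does not
read any counters itself.

```shell
# smi_cycles_pct is a hypothetical helper, not part of perf.
# $1 = aperf count, $2 = unhalted core cycles count
smi_cycles_pct() {
    awk -v aperf="$1" -v unhalted="$2" 'BEGIN {
        printf "%.2f\n", (aperf - unhalted) * 100 / aperf
    }'
}

# Made-up values: 10000 of 2000000 aperf cycles were spent in SMI.
smi_cycles_pct 2000000 1990000    # prints 0.50
```

The raw aperf and cycle counts needed as input are what --no-metric-only
exposes.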

EXAMPLES
--------

$ perf stat -- make -j

 Performance counter stats for 'make -j':

    8117.370256  task clock ticks     #      11.281 CPU utilization factor
            678  context switches     #       0.000 M/sec
            133  CPU migrations       #       0.000 M/sec
         235724  pagefaults           #       0.029 M/sec
    24821162526  CPU cycles           #    3057.784 M/sec
    18687303457  instructions         #    2302.138 M/sec
      172158895  cache references     #      21.209 M/sec
       27075259  cache misses         #       3.335 M/sec

 Wall-clock time elapsed:   719.554352 msecs
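
A few derived figures follow directly from the counts in this example. The
sketch below recomputes them with awk, assuming the task clock value is in
milliseconds like the wall-clock time. The function name derived_metrics is
hypothetical, and the instructions-per-cycle and miss-ratio lines are
illustrative calculations that do not appear in the output above.

```shell
# derived_metrics is a hypothetical helper. The numbers are copied from the
# example output above; task clock is assumed to be in milliseconds.
derived_metrics() {
    awk 'BEGIN {
        task_clock_ms = 8117.370256
        wall_ms       = 719.554352
        cycles        = 24821162526
        instructions  = 18687303457
        cache_refs    = 172158895
        cache_misses  = 27075259

        printf "CPU utilization factor: %.3f\n", task_clock_ms / wall_ms
        printf "instructions per cycle: %.2f\n", instructions / cycles
        printf "cache miss ratio: %.1f%%\n", 100 * cache_misses / cache_refs
    }'
}

derived_metrics
# CPU utilization factor: 11.281
# instructions per cycle: 0.75
# cache miss ratio: 15.7%
```

The first line reproduces the 11.281 utilization factor that perf printed,
which is the task clock divided by the elapsed wall-clock time.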

CSV FORMAT
----------

With -x, perf stat is able to output a not-quite-CSV format.
Commas in the output are not put into "". To make it easy to parse
it is recommended to use a different character like -x \;

The fields are in this order:

- optional usec time stamp in fractions of second (with -I xxx)
- optional CPU, core, or socket identifier
- optional number of logical CPUs aggregated
- counter value
- unit of the counter value or empty
- event name
- run time of counter
- percentage of measurement time the counter was running
- optional variance if multiple values are collected with -r
- optional metric value
- optional unit of metric

Additional metrics may be printed with all earlier fields being empty.
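
The field order above can be consumed with a small awk filter. This sketch
assumes output produced with -x \; and without -I, -A or per-socket/per-core
aggregation, so that the counter value is the first field and the event name
the third. The function name perf_csv_counts and the sample lines are made up
for illustration.

```shell
# perf_csv_counts is a hypothetical helper, not part of perf.
# Assumption: ';' separator and no optional leading fields, so
# field 1 = counter value, field 2 = unit, field 3 = event name.
perf_csv_counts() {
    # Skip lines whose value is not numeric, e.g. "<not counted>".
    awk -F';' '$1 ~ /^[0-9.]+$/ { print $3 "=" $1 }'
}

# Made-up sample lines in the documented field order.
printf '%s\n' \
    '24821162526;;cycles;8117370256;100.00;;' \
    '<not counted>;;cache-misses;0;0.00;;' \
    '18687303457;;instructions;8117370256;100.00;0.75;insn per cycle' |
    perf_csv_counts
# cycles=24821162526
# instructions=18687303457
```

Because perf stat writes its counts to stderr by default, a live run would
need a redirection along the lines of
'perf stat -x \; -e cycles,instructions -- sleep 1 2>&1 >/dev/null | perf_csv_counts'.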

SEE ALSO
--------
linkperf:perf-top[1], linkperf:perf-list[1]