perf is a utility set added to Linux kernel 2.6.31. A quick example showing what perf output looks like is at Using Performance Counters for Linux.
How to profile
perf offers two major modes: record then report, or real-time "top" mode. Both are useful in different situations.
TIP: On many systems where -g is shown below, you may have to write --call-graph dwarf instead to get working stack traces.
Record then report
Recording data
To profile the system for a specific length of time, for example 60 seconds:
perf record -a -g -s sleep 60
To profile the system for the duration of a command:
perf record -a -g -s -- pg_ctl -D pgdata start
To profile the system until cancelled, just omit any command:
perf record -a -g
By default perf record will generate a perf.data file in the current working directory.
See "what to profile" for ways to narrow what you record.
Reporting
By default perf report will look for perf.data in the current working directory.
perf report -n
You can include call-graph output, if captured, with:
perf report -g
and a variety of options are available to sort, filter and aggregate the output; see the documentation for perf.
In real time
To profile system-wide in real time:
perf top
To profile in real time without accumulating stats, i.e. instantaneous profiling:
perf top -z
You can use the usual target-selection options with perf top, but it's most commonly used for system wide profiling with perf top -a.
Troubleshooting
Many systems will produce incomplete call-graphs with only the top stack entry when -g is used, if the PostgreSQL binaries are built with -fomit-frame-pointer. That's the default on x64. To work around this, you may be able to use --call-graph dwarf on a new enough perf with libunwind support. Otherwise you'll need to recompile PostgreSQL with -fno-omit-frame-pointer.
What to profile
perf lets you choose what to profile in a variety of ways, the most useful of which are:
-a whole system, all CPUs
-p pid one process ID
-u user processes for one user, e.g. "postgres"
You can use these commands with perf record and perf top.
Including user-space stacks
Newer perf releases support capturing the whole user-space
call stack, not just the top function on the stack. This lets it provide
extremely useful call graphs. Use the --call-graph dwarf option to perf record to enable this, and the -g option to perf report to display the results.
On x64 hosts, you must either rebuild PostgreSQL with ./configure CFLAGS="-fno-omit-frame-pointer -ggdb" ... or use a version of perf built with libunwind support.
More detail in Tracing PostgreSQL with perf.
Tracepoints
perf is not limited to cpu call sampling.
It can also trace syscalls, dynamic userspace probes, and more. See sudo perf list for events and look for the tracepoint events.
You can observe things like kernel block layer activity via this interface. For example, this event:
syscalls:sys_enter_fsync [Tracepoint event]
can be used to trace fsyncs, e.g.
perf top -e syscalls:sys_enter_fsync
perf record -e syscalls:sys_enter_fsync -u postgres
perf record -e syscalls:sys_enter_fsync -p $SOME_PID
etc.
You can use wildcards in events, and you can use multiple comma-separated events or separate -e arguments.
PostgreSQL pre-defined tracepoint events
PostgreSQL predefines some handy perf events via the SDT (statically
defined tracepoint) interface, like dtrace. This lets you see critical
PostgreSQL performance innards, e.g.
sdt_postgresql:buffer__sync__done [SDT event]
sdt_postgresql:buffer__sync__start [SDT event]
...
sdt_postgresql:checkpoint__done [SDT event]
sdt_postgresql:checkpoint__start [SDT event]
If you don't see these in perf list, see static probe points (gdb) and perf-buildid-cache. The exact steps required to enable probe points vary between perf versions. If you see mention of perf sdt-cache you're reading something outdated.
Also: Internally PostgreSQL generates these using systemtap on
Linux, so you must have had systemtap installed at build time and passed
--enable-dtrace to configure to get these tracepoints. They're enabled by default in the PGDG binary builds.
You must initialize sdt events with perf probe before using them. perf probe may report multiple event names, as in the example below.
# perf probe sdt_postgresql:query__execute__start
Added new events:
sdt_postgresql:query__execute__start (on %query__execute__start in /usr/pgsql-9.5/bin/postgres)
sdt_postgresql:query__execute__start_1 (on %query__execute__start in /usr/pgsql-9.5/bin/postgres)
You can now use it in all perf tools, such as:
perf record -e sdt_postgresql:query__execute__start_1 -aR sleep 1
# perf record -e sdt_postgresql:query__execute__start,sdt_postgresql:query__execute__start_1 -a
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 2.211 MB perf.data (2 samples) ]
# perf script
postgres 19995 [000] 430223.752970: sdt_postgresql:query__execute__start_1: (6c42e7)
postgres 20003 [002] 430227.276924: sdt_postgresql:query__execute__start_1: (6c42e7)
#
Adding new trace points
It's trivial to add new tracepoints in your own builds.
All you have to do is edit src/backend/utils/probes.d to add the tracepoint name, then add macros to call it where you want it to fire. The tracepoint name in probes.d must be lower case, have all underscores doubled, and lacks a leading TRACE_POSTGRESQL_. The tracepoint invocations are upper case, with single underscores, and lead with TRACE_POSTGRESQL_. For example, probes.d has
probe transaction__start(LocalTransactionId);
and it's called in StartTransaction in src/backend/access/transam/xact.c as
TRACE_POSTGRESQL_TRANSACTION_START(vxid.localTransactionId);
Note that you cannot use typedefs provided by the PostgreSQL headers; the probes.d file lists its own typedefs.
If you think it's generally useful, submit a patch to the
pgsql-hackers mailing list. A tracepoint is probably useful if you
cannot trivially achieve the same thing with a dynamic userspace probe
and it touches on something that's likely to be relevant to a broader
user base. Even probes that are easily done with dynamic tracepoints are
useful if they can serve as documentation for how to find and trace
some event many people will be interested in.
Dynamic tracepoints
Often there isn't a SDT event for just what you want. But that's OK.
If you have debug symbols and a vaguely recent perf (tested: Linux 4.4)
you can inject dynamic tracepoints where you want them, and even capture function arguments!
e.g. if you want to see the rate at which queries are executed you might inject a tracepoint into standard_ExecutorStart:
perf probe -x /usr/lib/postgresql/10/bin/postgres standard_ExecutorStart
This will then be available as the perf event probe_postgres:standard_ExecutorStart for use in perf record etc.
Probing extensions
Extensions are valid probe targets too. Just specify them as the
executable, e.g. to see the lsn of every 100th change processed by
pglogical's logical decoding output plugin as it examines it, you could:
perf probe -x /usr/lib/postgresql/10/lib/pglogical.so pg_decode_change 'change->lsn'
perf record -e probe_pglogical:pg_decode_change -aR -o - | perf script -i - -F event,trace | awk 'NR%100==0' -
Less common reports
Annotated source code, from a recorded session:
perf annotate -l -P
Trace output, from a recorded session:
perf script -L
Timecharts
perf timechart record
perf timechart
Benchmarking and statistics
If you're interested in particular tracepoints or statistics, the perf stat command is also useful. For example, the following will count fsync and write calls, and block device activity, during a command. It will run the command five times and summarize the results.
sudo perf stat -e block:block_rq_*,syscalls:sys_enter_write,syscalls:sys_enter_fsync -a -r 5 -- psql -q -U postgres postgres -c "drop table if exists x; create table x as select a FROM generate_series(1,1000000) a;";
To learn more about using perf for tracing see tracing PostgreSQL with perf.
Availability of perf
The perf tool is available on recent Linux distributions such as:
- RedHat Linux Enterprise 6: perf package
- Debian Squeeze: add the linux-tools package.
It may be built and installed from the linux source code in the tools/perf subdirectory. You must use a perf that matches your running kernel version.
Advanced: Viewing and capturing function arguments
Lets say you want to collect information on the kinds of queries that are run. You can capture the CmdType operation field of the QueryDesc passed to standard_ExecutorStart with perf!. No patching postgres, no gdb.
perf probe -x /usr/lib/postgresql/10/bin/postgres standard_ExecutorStart 'queryDesc->operation'
You can capture multiple arguments. To get strings as c-strings not pointer values append :string to the argument defintion. e.g.
perf probe -x /usr/lib/postgresql/10/bin/postgres standard_ExecutorStart 'queryDesc->operation' 'queryDesc->sourceText:string'
The overhead of perf and size of traces is greatly increased by capturing arguments in probes, so only do it if you need to.
perf report doesn't seem to do anything with captured tracepoint arguments right now, but perf script
can show them, and you can filter the results through its perl/python
interpreters (if enabled) or via a pipe to another tool/script.
e.g. if you want to see every planned query run in realtime, you might set up the tracepoint above, then
perf record -e probe_postgres:standard_ExecutorStart -u postgres -o - | perf script -i -
(Some perf versions don't seem to play well with stdio, in which case
you may need to use separate record and script steps. If you get empty
output, try it.)
You should see results like
postgres 19499 [002] 429675.072758: probe_postgres:standard_ExecutorStart: (5bd2c0) operation=0x1 sourceText="SELECT 1;"
postgres 19499 [002] 429675.399764: probe_postgres:standard_ExecutorStart: (5bd2c0) operation=0x1 sourceText="SELECT 2;"
Of course this is a bit of a silly example since you'd just enable
query logging, and it ignores unplanned queries (utility statements) but
it illustrates the methods required for capturing data with probes.
Resources
perf
, what’s that?
The Linux perf
tool is an incredibly powerful tool, that can amongst other things be used for:
- Sampling CPU-bound processes (or the whole) system to analyse which part of your application is consuming the CPU time
- Accessing CPU performance counters (PMU)
- “user probes” (uprobes) which trigger for example when a certain function in your application runs
In general, perf
can count and/or record the call stacks of your threads when a certain event occurs. These events can be triggered by:
- Time (e.g. 1000 times per second), useful for time profiling. For an example use, see the CPU performance debugging guide.
- System calls, useful to see where your system calls are happening.
- Various system events, for example if you’d like to know when context switches occur.
- CPU performance counters, useful if your performance issues can be
traced down to micro-architectural details of your CPU (such as branch
mispredictions). For an example, see SwiftNIO’s Advanced Performance Analysis guide.
- and much more
Getting perf
to work
Unfortunately, getting perf
to work depends on your environment. Below, please find a selection of environments and how to get perf
to work there.
Installing perf
Technically, perf
is part of the Linux kernel sources and you’d want a perf
version that exactly matches your Linux kernel version. In many cases however a “close-enough” perf
version will do too. If in doubt, use a perf
version that’s slightly older than your kernel over one that’s newer.
You can confirm that your perf
installation works using perf stat -- sleep 0.1
(if you’re already root
) or sudo perf stat -- sleep 0.1
.
perf
on Ubuntu when you can’t match the kernel version
On Ubuntu (and other distributions that package perf
per kernel version) you may see an error after installing linux-tools-generic
. The error message will look similar to
$ perf stat -- sleep 0.1
WARNING: perf not found for kernel 5.10.25
You may need to install the following packages for this specific kernel:
linux-tools-5.10.25-linuxkit
linux-cloud-tools-5.10.25-linuxkit
You may also want to install one of the following packages to keep up to date:
linux-tools-linuxkit
linux-cloud-tools-linuxkit
The best fix for this is to follow what perf
says and to install one of the above packages. If you’re in a Docker
container, this may not be possible because you’d need to match the
kernel version (which is especially difficult in Docker for Mac because
it uses a VM). For example, the suggested linux-tools-5.10.25-linuxkit
is not actually available.
As a workaround, you can try one of the following options
After this, you should be able to use perf stat -- sleep 0.1
(if you’re already root
) or sudo perf stat -- sleep 0.1
successfully.
For a bare metal Linux machine, all you need to do is to install perf
which should then work in full fidelity.
You will need to launch your container with docker run --privileged
(don’t run this in production) and then you should have full access to perf (including the PMU).
To validate that perf
works correctly, run for example perf stat -- sleep 0.1
. Whether you’ll see the <not supported>
next to some information will depend on if you have access to the CPU’s
performance counters (the PMU). In Docker on bare metal, this should
work, ie. no <not supported>
s should show up.