Tuesday, October 24, 2023

Ad-hoc Postgres monitoring

 In some cases, we need to observe some values in Postgres or the environment it's running on (OS, FS, etc.), and:

  • we don't have good monitoring at all, or
  • the existing monitoring is lacking something we need, or
  • we don't fully trust the existing monitoring (e.g., you're an engineer who has only started working with some system).

Being able to organize an "ad hoc" observation of specific metrics is an important skill to master. Here we'll describe some tips and an approach that you may find useful.

In some cases, you can quickly install tools like Netdata (a very powerful modern monitoring tool that can be installed quickly and has a Postgres plugin) or ad-hoc console tools such as pgCenter or pg_top. But you still may want to monitor some specific aspects manually.

Below, we assume that we are using Linux, but most of the considerations can be applied to macOS or BSD systems as well.

Key principles

  • observation should not depend on your internet connection
  • save results for long-term storage and later analysis
  • know timestamps
  • protect from accidental program interruption (Ctrl-C)
  • sustain Postgres restarts
  • observe both normal messages and errors
  • prefer collecting data in a form useful for programmed processing (e.g., CSV)

An example

Let's assume we need to collect samples of pg_stat_activity (pgsa) to study long-running transactions – those that last longer than 1 minute.

Here is the recipe – and below we discuss it in detail.

1) Start a tmux session. Here is an idempotent snippet that you can use – it either connects to an existing session called "observe" or creates it if it's not found:

export TMUX_NAME=observe
tmux a -t $TMUX_NAME || tmux new -s $TMUX_NAME

2) Now, start collecting pgsa samples in a log every second infinitely (until we interrupt this process):

while sleep 1; do
    psql -XAtc "
        copy (
        with samples as (
            select
            clock_timestamp(),
            clock_timestamp() - xact_start as xact_duration,
            * 
            from pg_stat_activity
        )
        select *
        from samples
        where xact_duration > interval '1 minute'
        order by xact_duration desc
        ) to stdout delimiter ',' csv
    " 2>&1 \
    | tee -a long_tx_$(date +%Y%m%d).log.csv
done

Here is how it looks in action:

Below we discuss various tricks and tips used in this example.

Use tmux (or screen)

The benefits here are straightforward: if you have internet connectivity issues and get disconnected, the work is not lost as long as the tmux session keeps running on the server. Following a predefined session naming convention can also be helpful when you work in a team.

How to use loops / batches

Some programs you'll use support batched reporting (examples: iostat -x 5, top -b -n 100 -d 5), while others don't. In the latter case, use a while loop.

I prefer using an infinite loop like while sleep 5; do ... ; done. This approach has a small downside – it sleeps first and only then performs the useful work – but it has the benefit that, most of the time, you can interrupt it with a single Ctrl-C (the loop spends most of its time in sleep, and an interrupted sleep returns a non-zero code, which ends the loop).

Use psql options: -X, -A, -t

Best practices of using psql for observability-related sampling (and work automation in general):

  1. Always use -X – this is going to be very helpful one day, when you happen to work with a server that has an unexpected ~/.psqlrc file containing something that could make psql output unpredictable (e.g., a simple \timing on). The option -X tells psql to ignore ~/.psqlrc.
  2. Options -A and -t – provide unaligned output without headers. This makes the result easier to parse and process in large volumes.

Use CSV as output

There are several ways to produce a CSV:

  • use psql's command \copy – in this case, results will be saved to a file on the client side
  • copy (...) to '/path/to/file' – this will save results on the server (there might be a permissions issue if the path is not writable by the OS user under which Postgres is running)
  • psql --csv -F, – produce a CSV (but there may be issues with escaping when values collide with the field separator)
  • copy (...) to stdout – this approach is, perhaps, the most convenient for sampling/logging purposes

Master log-fu: don't lose STDERR, have timestamps, append

For later analysis (who knows what we'll decide to check a couple of hours later?) it is better to save everything to a file.

But it is also critically important not to lose the errors – usually, they are printed to STDERR, so we need to capture them too, either in a separate file or together with the normal output. We also might want to keep the existing content of the file, so instead of overwriting (>) we append (>>):

command   2>>error.log >>messages.log

Or just redirect everything to a single file:

command  &>>everything.log

If you want to both see everything and log it, use tee – or, with append mode, tee -a (here, 2>&1 redirects STDERR to STDOUT first, and then tee gets everything from STDOUT):

command 2>&1 | tee -a everything.log

If the output lacks timestamps (not the case with the psql snippet we used above, though), then use ts to prepend each line with a timestamp:

command 2>&1 \
  | ts \
  | tee -a everything.log

Finally, it is usually wise to name the result file with some details and the current date (or even date + time, depending on the situation):

command 2>&1 \
  | ts \
  | tee -a observing_our_command_$(date +%Y%m%d).log

One downside of using tee is that, in some cases, you might accidentally stop it (e.g., by pressing Ctrl-C in the wrong tmux window/pane). Due to this, some people prefer using nohup ... & to run observability actions in the background and observe the result using tail -f.
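
For example, here is a minimal sketch of that background approach (the query and file names are just placeholders):

nohup bash -c 'while sleep 1; do
  psql -XAtc "select now()" 2>&1 | ts
done' >> observe_$(date +%Y%m%d).log 2>&1 &

tail -f observe_$(date +%Y%m%d).log  # Ctrl-C here stops only tail, not the background loop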

Processing result

The result, if it's CSV, is easy to process later – Postgres suits really well for it. We just need to remember the set of columns we've used, so we can create a table and load the data:

nik=# create table log_loaded as select clock_timestamp(), clock_timestamp() - xact_start as xact_duration, * from pg_stat_activity limit 0;
SELECT 0
nik=# copy log_loaded from '/Users/nik/long_tx_20231006.log.csv' csv delimiter ',';
COPY 4

And then analyze it with SQL. Or you may want to analyze and visualize the data somehow – for instance, load the CSV to some spreadsheet tool and then analyze the data and create some charts there.
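
For example, here is a quick sketch that aggregates the loaded samples per minute (the column names follow the table we just created):

select
  date_trunc('minute', clock_timestamp) as minute,
  count(distinct pid) as sessions_with_long_tx,
  max(xact_duration) as max_xact_duration
from log_loaded
group by 1
order by 1;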

Alternative sampling – right from psql

In psql, you can use \o | tee -a logfile and \watch to both see the data and log it at the same time – although, notice that it won't capture errors in the file. Example:

nik=# \o | tee -a alternative.log

nik=# \t
Tuples only is on.
nik=# \a
Output format is unaligned.
nik=# \f ┃
Field separator is "┃".
nik=# select clock_timestamp(), 'test' as c1 \watch 3
2023-10-06 21:01:05.33987-07┃test
2023-10-06 21:01:08.342017-07┃test
2023-10-06 21:01:11.339183-07┃test
^C
nik=# \! cat  alternative.log
2023-10-06 21:01:05.33987-07┃test
2023-10-06 21:01:08.342017-07┃test
2023-10-06 21:01:11.339183-07┃test
nik=#

Monday, October 23, 2023

How to troubleshoot Postgres performance using FlameGraphs and eBPF (or perf)

 Today we'll discuss how to understand what exactly a Postgres backend is doing on the CPU on a Linux machine (in-CPU analysis).

// Users of RDS and other managed Postgres services – sorry, this cannot be used in your case. Talk to your provider: you're paying much more than just for a VM, yet they give you only SQL-level access – that's not fair.

I'll keep this howto as simple as possible, and also demonstrate an interesting example that can be helpful in certain cases – we'll try to understand what's happening under the hood during query planning.

To perform in-CPU analysis, we'll be using Brendan Gregg's fantastic flamegraphs.

Per Brendan Gregg, a flamegraph is "a visualization of hierarchical data, created to visualize stack traces of profiled software so that the most frequent code-paths can be identified quickly and accurately" (discussed here).

To create a flamegraph, we need to perform a few quite simple steps:

  1. Install tools to collect data for analysis. Two options here: tools for eBPF or perf.
  2. Install packages with debug symbols for Postgres and, if needed, extensions we're using
  3. Clone the FlameGraph repo
  4. Get the process ID of the backend we're going to analyze
  5. Collect data
  6. Generate a flamegraph.

Demo setup (schema and workload)

Before we discuss the analysis steps, here is what we're going to explore. The steps below are in psql connected to Postgres locally.

Table schema:

create table t1 as select i from generate_series(1, 1000000) i;
create table t2 as select i from generate_series(1, 10000000) i;
alter table t1 add primary key (i);
alter table t2 add primary key (i);
vacuum analyze;

Then run some Postgres planner's activity in an infinite loop:

select pg_backend_pid(); -- remember it for step 4
-- (but check again if Postgres is restarted)

explain select from t1
join t2 using (i)
where i between 1000 and 2000 \watch .01

– and leave it running.

Now, the analysis steps.

Step 1: Install tools to collect data

Here we're going to install both bcc-tools (eBPF) and linux-tools-*** (perf) just to show how it works – but you may want to choose only one of them.

Example for eBPF and Ubuntu 22.04:

sudo apt update
sudo apt install -y \
  linux-headers-$(uname -r) \
  bpfcc-tools # might be called bcc-tools

To install perf:

sudo apt update
sudo apt install -y linux-tools-$(uname -r)

If that didn't work, try checking what you have:

uname -r
sudo apt search linux-tools | grep $(uname -r)

This page has some good tips for perf installation, including a few words about working with containers.

One good way to use perf is to just run the top command:

perf top

– and observe what's happening with the system as a whole. Note that many functions will appear in an unreadable form such as 0x00000000001d6d88 – memory addresses of functions whose names cannot be resolved:

This is because we don't have debug symbols for Postgres yet. Let's fix it.

Step 2: Install Postgres debug symbols

I recommend having debug symbols installed on all systems, including production.

For Postgres 16:

apt install -y postgresql-16-dbgsym

You may need more packages for other Postgres-related software – to check what's available:

apt search postgres | grep dbgsym

Note that not every Postgres-related package has postgres in its name – e.g., for pgBouncer, you need pgbouncer-dbgsym.

Once packages with debug symbols are installed, it is important not to forget to restart Postgres (and our infinite loop with EXPLAIN .. \watch in psql).

Now perf top is going to look much better – all the Postgres-related lines have function names:

Step 3: Get FlameGraph tooling

git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph

Step 4: Get Postgres PID

If you need to know the PID of Postgres backend you're dealing with:

select pg_backend_pid();

If the session you're interested in is already running elsewhere, you can find its PID via pg_stat_activity, for example:

select pid, query
from pg_stat_activity
where
  application_name = 'psql'
  and pid <> pg_backend_pid();

Once we know the PID, let's remember it in a variable:

export PGPID=<your value>

Step 5: Collect data for analysis

eBPF version, collecting data for process with pid $PGPID during 10 seconds:

profile-bpfcc -p $PGPID -F 99 -adf 10 > out

If you prefer perf:

perf record -F 99 -a -g -- sleep 10
perf script | ./stackcollapse-perf.pl > out

Step 6: Generate a flamegraph

./flamegraph.pl --colors=blue out > profile.svg

Result

That's it. Now you need to copy profile.svg to your machine and open it in, for example, a browser – it will render the SVG, and clicking on various areas is supported (really helpful for exploring the profile).

Here is the result for our process running an infinite EXPLAIN loop:

It's very interesting that ~35% of CPU time is spent analyzing whether a Merge Join is worth using, while eventually the planner picks a Nested Loop:

postgres=# explain (costs off) select
from t1 join t2 using (i)
where i between 1000 and 2000;
                    QUERY PLAN
---------------------------------------------------
 Nested Loop
   ->  Index Only Scan using t1_pkey on t1
         Index Cond: ((i >= 1000) AND (i <= 2000))
   ->  Index Only Scan using t2_pkey on t2
         Index Cond: (i = t1.i)
(5 rows)

In this case, the planning time is really low, sub-millisecond – but I have encountered cases when planning happened to be extremely slow, taking many seconds or even dozens of seconds. It turned out (thanks to flamegraphs!) that analyzing the Merge Join paths was the reason, so with "set enable_mergejoin = off" the planning time dropped to very low, sane values. But that is another story.

Some good materials:

Profiling with perf

perf is a utility set added to Linux kernel 2.6.31. A quick example showing what perf output looks like is at Using Performance Counters for Linux.

How to profile

perf offers two major modes: record then report, or real-time "top" mode. Both are useful in different situations.

TIP: On many systems where -g is shown below, you may have to write --call-graph dwarf instead to get working stack traces.

Record then report

Recording data

To profile the system for a specific length of time, for example 60 seconds:

 perf record -a -g -s sleep 60

To profile the system for the duration of a command:

 perf record -a -g -s -- pg_ctl -D pgdata start

To profile the system until cancelled, just omit any command:

 perf record -a -g

By default perf record will generate a perf.data file in the current working directory.

See "what to profile" for ways to narrow what you record.

Reporting

By default perf report will look for perf.data in the current working directory.

 perf report -n

You can include call-graph output, if captured, with:

 perf report -g

and a variety of options are available to sort, filter and aggregate the output; see the documentation for perf.

In real time

To profile system-wide in real time:

 perf top

To profile in real time without accumulating stats, i.e. instantaneous profiling:

 perf top -z

You can use the usual target-selection options with perf top, but it's most commonly used for system wide profiling with perf top -a.

Troubleshooting

Many systems will produce incomplete call-graphs with only the top stack entry when -g is used, if the PostgreSQL binaries are built with -fomit-frame-pointer. That's the default on x64. To work around this, you may be able to use --call-graph dwarf on a new enough perf with libunwind support. Otherwise you'll need to recompile PostgreSQL with -fno-omit-frame-pointer.
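
For example, a hedged sketch (the PID is a placeholder) of recording with DWARF-based call graphs instead of relying on frame pointers:

 perf record --call-graph dwarf -F 99 -p $SOME_PID -- sleep 10
 perf report -g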

What to profile

perf lets you choose what to profile in a variety of ways, the most useful of which are:

 -a          whole system, all CPUs
 -p pid      one process ID
 -u user     processes for one user, e.g. "postgres"
 

You can use these commands with perf record and perf top.

Including user-space stacks

Newer perf releases support capturing the whole user-space call stack, not just the top function on the stack. This lets it provide extremely useful call graphs. Use the --call-graph dwarf option to perf record to enable this, and the -g option to perf report to display the results.

On x64 hosts, you must either rebuild PostgreSQL with ./configure CFLAGS="-fno-omit-frame-pointer -ggdb" ... or use a version of perf built with libunwind support.

More detail in Tracing PostgreSQL with perf.

Tracepoints

perf is not limited to cpu call sampling.

It can also trace syscalls, dynamic userspace probes, and more. See sudo perf list for events and look for the tracepoint events.

You can observe things like kernel block layer activity via this interface. For example, this event:

 syscalls:sys_enter_fsync                           [Tracepoint event]

can be used to trace fsyncs, e.g.

 perf top -e syscalls:sys_enter_fsync
 perf record -e syscalls:sys_enter_fsync -u postgres
 perf record -e syscalls:sys_enter_fsync -p $SOME_PID

etc.

You can use wildcards in events, and you can use multiple comma-separated events or separate -e arguments.

PostgreSQL pre-defined tracepoint events

PostgreSQL predefines some handy perf events via the SDT (statically defined tracepoint) interface, like dtrace. This lets you see critical PostgreSQL performance innards, e.g.

 sdt_postgresql:buffer__sync__done                  [SDT event]
 sdt_postgresql:buffer__sync__start                 [SDT event]
 ...
 sdt_postgresql:checkpoint__done                    [SDT event]
 sdt_postgresql:checkpoint__start                   [SDT event]

If you don't see these in perf list, see static probe points (gdb) and perf-buildid-cache. The exact steps required to enable probe points vary between perf versions. If you see mention of perf sdt-cache you're reading something outdated.

Also: Internally PostgreSQL generates these using systemtap on Linux, so you must have had systemtap installed at build time and passed --enable-dtrace to configure to get these tracepoints. They're enabled by default in the PGDG binary builds.

You must initialize sdt events with perf probe before using them. perf probe may report multiple event names, as in the example below.

 # perf probe sdt_postgresql:query__execute__start
 Added new events:
   sdt_postgresql:query__execute__start (on %query__execute__start in /usr/pgsql-9.5/bin/postgres)
   sdt_postgresql:query__execute__start_1 (on %query__execute__start in /usr/pgsql-9.5/bin/postgres)
 
 You can now use it in all perf tools, such as:
 
 	perf record -e sdt_postgresql:query__execute__start_1 -aR sleep 1
 
 # perf record -e sdt_postgresql:query__execute__start,sdt_postgresql:query__execute__start_1 -a
 ^C[ perf record: Woken up 1 times to write data ]
 [ perf record: Captured and wrote 2.211 MB perf.data (2 samples) ]
 
 # perf script
       postgres 19995 [000] 430223.752970: sdt_postgresql:query__execute__start_1: (6c42e7)
       postgres 20003 [002] 430227.276924: sdt_postgresql:query__execute__start_1: (6c42e7)
 #

Adding new trace points

It's trivial to add new tracepoints in your own builds.

All you have to do is edit src/backend/utils/probes.d to add the tracepoint name, then add macros to call it where you want it to fire. The tracepoint name in probes.d must be lower case, have all underscores doubled, and lacks a leading TRACE_POSTGRESQL_. The tracepoint invocations are upper case, with single underscores, and lead with TRACE_POSTGRESQL_. For example, probes.d has

   probe transaction__start(LocalTransactionId);

and it's called in StartTransaction in src/backend/access/transam/xact.c as

   TRACE_POSTGRESQL_TRANSACTION_START(vxid.localTransactionId);

Note that you cannot use typedefs provided by the PostgreSQL headers; the probes.d file lists its own typedefs.

If you think it's generally useful, submit a patch to the pgsql-hackers mailing list. A tracepoint is probably useful if you cannot trivially achieve the same thing with a dynamic userspace probe and it touches on something that's likely to be relevant to a broader user base. Even probes that are easily done with dynamic tracepoints are useful if they can serve as documentation for how to find and trace some event many people will be interested in.

Dynamic tracepoints

Often there isn't a SDT event for just what you want. But that's OK. If you have debug symbols and a vaguely recent perf (tested: Linux 4.4) you can inject dynamic tracepoints where you want them, and even capture function arguments!

e.g. if you want to see the rate at which queries are executed you might inject a tracepoint into standard_ExecutorStart:

 perf probe -x /usr/lib/postgresql/10/bin/postgres standard_ExecutorStart

This will then be available as the perf event probe_postgres:standard_ExecutorStart for use in perf record etc.

Probing extensions

Extensions are valid probe targets too. Just specify them as the executable, e.g. to see the lsn of every 100th change processed by pglogical's logical decoding output plugin as it examines it, you could:

 perf probe -x /usr/lib/postgresql/10/lib/pglogical.so pg_decode_change 'change->lsn'
 perf record -e probe_pglogical:pg_decode_change -aR -o - | perf script -i - -F event,trace | awk 'NR%100==0' -

Less common reports

Annotated source code, from a recorded session:

 perf annotate -l -P

Trace output, from a recorded session:

 perf script -L

Timecharts

 perf timechart record
 perf timechart

Benchmarking and statistics

If you're interested in particular tracepoints or statistics, the perf stat command is also useful. For example, the following will count fsync and write calls, and block device activity, during a command. It will run the command five times and summarize the results.

 sudo perf stat -e block:block_rq_*,syscalls:sys_enter_write,syscalls:sys_enter_fsync -a -r 5 -- psql -q -U postgres postgres -c "drop table if exists x; create table x as select a FROM generate_series(1,1000000) a;";

To learn more about using perf for tracing see tracing PostgreSQL with perf.

Availability of perf

The perf tool is available on recent Linux distributions such as:

  • Red Hat Enterprise Linux 6: perf package
  • Debian Squeeze: add the linux-tools package.

It may be built and installed from the linux source code in the tools/perf subdirectory. You must use a perf that matches your running kernel version.


Advanced: Viewing and capturing function arguments

Let's say you want to collect information on the kinds of queries that are run. You can capture the CmdType operation field of the QueryDesc passed to standard_ExecutorStart with perf! No patching Postgres, no gdb.

 perf probe -x /usr/lib/postgresql/10/bin/postgres standard_ExecutorStart 'queryDesc->operation'

You can capture multiple arguments. To get strings as C strings rather than pointer values, append :string to the argument definition, e.g.

 perf probe -x /usr/lib/postgresql/10/bin/postgres standard_ExecutorStart 'queryDesc->operation' 'queryDesc->sourceText:string'

The overhead of perf and size of traces is greatly increased by capturing arguments in probes, so only do it if you need to.

perf report doesn't seem to do anything with captured tracepoint arguments right now, but perf script can show them, and you can filter the results through its perl/python interpreters (if enabled) or via a pipe to another tool/script.

e.g. if you want to see every planned query run in realtime, you might set up the tracepoint above, then

 perf record -e probe_postgres:standard_ExecutorStart -u postgres -o - | perf script -i -

(Some perf versions don't seem to play well with stdio, in which case you may need to use separate record and script steps. If you get empty output, try it.)

You should see results like

       postgres 19499 [002] 429675.072758: probe_postgres:standard_ExecutorStart: (5bd2c0) operation=0x1 sourceText="SELECT 1;"
       postgres 19499 [002] 429675.399764: probe_postgres:standard_ExecutorStart: (5bd2c0) operation=0x1 sourceText="SELECT 2;"

Of course this is a bit of a silly example since you'd just enable query logging, and it ignores unplanned queries (utility statements) but it illustrates the methods required for capturing data with probes.

Resources

perf, what’s that?

The Linux perf tool is an incredibly powerful tool that can, amongst other things, be used for:

  • Sampling CPU-bound processes (or the whole system) to analyse which part of your application is consuming the CPU time
  • Accessing CPU performance counters (PMU)
  • “user probes” (uprobes) which trigger for example when a certain function in your application runs

In general, perf can count and/or record the call stacks of your threads when a certain event occurs. These events can be triggered by:

  • Time (e.g. 1000 times per second), useful for time profiling. For an example use, see the CPU performance debugging guide.
  • System calls, useful to see where your system calls are happening.
  • Various system events, for example if you’d like to know when context switches occur.
  • CPU performance counters, useful if your performance issues can be traced down to micro-architectural details of your CPU (such as branch mispredictions). For an example, see SwiftNIO’s Advanced Performance Analysis guide.
  • and much more

Getting perf to work

Unfortunately, getting perf to work depends on your environment. Below, please find a selection of environments and how to get perf to work there.

Installing perf

Technically, perf is part of the Linux kernel sources and you’d want a perf version that exactly matches your Linux kernel version. In many cases however a “close-enough” perf version will do too. If in doubt, use a perf version that’s slightly older than your kernel over one that’s newer.

  • Ubuntu

      apt-get update && apt-get -y install linux-tools-generic
    

    See below for more information because Ubuntu packages a different perf per kernel version.

  • Debian

      apt-get update && apt-get -y install linux-perf
    
  • Fedora/RedHat derivates

     yum install -y perf
    

You can confirm that your perf installation works using perf stat -- sleep 0.1 (if you’re already root) or sudo perf stat -- sleep 0.1.

perf on Ubuntu when you can’t match the kernel version

On Ubuntu (and other distributions that package perf per kernel version) you may see an error after installing linux-tools-generic. The error message will look similar to

$ perf stat -- sleep 0.1
WARNING: perf not found for kernel 5.10.25

  You may need to install the following packages for this specific kernel:
    linux-tools-5.10.25-linuxkit
    linux-cloud-tools-5.10.25-linuxkit

  You may also want to install one of the following packages to keep up to date:
    linux-tools-linuxkit
    linux-cloud-tools-linuxkit

The best fix for this is to follow what perf says and to install one of the above packages. If you’re in a Docker container, this may not be possible because you’d need to match the kernel version (which is especially difficult in Docker for Mac because it uses a VM). For example, the suggested linux-tools-5.10.25-linuxkit is not actually available.

As a workaround, you can try one of the following options

  • If you’re already root and prefer a shell alias (only valid in this shell)

      alias perf=$(find /usr/lib/linux-tools/*/perf | head -1)
    
  • If you’re a user and would like to prefer to link /usr/local/bin/perf

      sudo ln -s "$(find /usr/lib/linux-tools/*/perf | head -1)" /usr/local/bin/perf
    

After this, you should be able to use perf stat -- sleep 0.1 (if you’re already root) or sudo perf stat -- sleep 0.1 successfully.

Bare metal

For a bare metal Linux machine, all you need to do is to install perf which should then work in full fidelity.

In Docker (running on bare-metal Linux)

You will need to launch your container with docker run --privileged (don’t run this in production) and then you should have full access to perf (including the PMU).

To validate that perf works correctly, run for example perf stat -- sleep 0.1. Whether you'll see <not supported> next to some information depends on whether you have access to the CPU's performance counters (the PMU). In Docker on bare metal, this should work, i.e., no <not supported>s should show up.

Friday, October 20, 2023

How to understand LSN values and WAL filenames

 

How to read LSN values

LSN – Log Sequence Number, a pointer to a location in the Write-Ahead Log (WAL). Understanding it and knowing how to work with it is important for dealing with physical and logical replication, backups, and recovery (see the Postgres docs).

LSN is an 8-byte (64-bit) value (source code). It can be represented in the form A/B (more specifically, A/BBbbbbbb, see below), where both A and B are 4-byte values. For example:

nik=# select pg_current_wal_lsn();
 pg_current_wal_lsn
--------------------
 5D/257E19B0
(1 row)

  • 5D here is the higher 4-byte (32-bit) section of the LSN
  • 257E19B0 can, in its turn, be split into two parts:
      • 25 – the highest byte of the lower 4-byte section of the LSN
      • 7E19B0 – the offset inside the WAL file (the WAL segment size is 16 MiB by default; in some cases, it's changed – e.g., RDS changed it to 64 MiB)

It is interesting that LSN values can be compared, and even subtracted from one another – assuming we use the pg_lsn data type. The result will be in bytes:

nik=# select pg_lsn '5D/257D6780' - '5D/251A8118';
 ?column?
----------
  6481512
(1 row)
nik=# select pg_size_pretty(pg_lsn '5D/257D6780' - '5D/251A8118');
 pg_size_pretty
----------------
 6330 kB
(1 row)

This also means that we can get the integer value of LSN just seeing how far we went from the "point 0" – value 0/0:

nik=# select pg_lsn '5D/257D6780' - '0/0';
   ?column?
--------------
 400060934016
(1 row)

How to read WAL filenames

Now let's see how the LSN values correspond to WAL file names (files located in $PGDATA/pg_wal). We can get WAL file name for any given LSN using function pg_walfile_name():

nik=# select pg_current_wal_lsn(), pg_walfile_name(pg_current_wal_lsn());
 pg_current_wal_lsn |     pg_walfile_name
--------------------+--------------------------
 5D/257E19B0        | 000000010000005D00000025
(1 row)

Here 000000010000005D00000025 is WAL filename, it consists of three 4-byte (32-bit) words:

1) 00000001 – timeline ID (TimeLineID), a sequential "history number" that starts with 1 when a Postgres cluster is initialized. It "identifies different database histories to prevent confusion after restoring a prior state of a database installation" (source code).

2) 0000005D – higher 4-byte section of sequence number.

3) 00000025 – can be viewed as two parts:

  • 000000 – 6 leading zeroes,
  • 25 – the highest byte of the lower section of sequence number.

Important to remember: the third 4-byte word in the WAL filename has 6 leading zeroes – often, this leads to confusion and mistakes when comparing two WAL filenames to understand which LSNs to expect inside them.

Useful illustration (from this post):

LSN:                5D/      25   7E19B0
WAL: 00000001 0000005D 00000025
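
To check this mapping directly, you can use the function pg_walfile_name_offset(), which returns both the file name and the in-file offset for a given LSN. For the example above, on the same server (timeline 1), it should return 000000010000005D00000025 with file_offset 8264112 (hex 7E19B0):

select * from pg_walfile_name_offset('5D/257E19B0');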

This can be very helpful if you need to work with LSN values or WAL filenames (or both), quickly navigating or comparing their values to understand the distance between them. Some examples when it is useful to understand (see also the query sketch right after this list):

  1. How many bytes of WAL the server generates per day
  2. How much WAL has accumulated since a replication slot was created
  3. What's the distance between two backups
  4. How much WAL data needs to be replayed to reach the consistency point
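
For instance, here is a minimal sketch measuring how far each replication slot's restart_lsn is behind the current position:

select
  slot_name,
  pg_size_pretty(pg_current_wal_lsn() - restart_lsn) as behind
from pg_replication_slots;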

Good blog posts worth reading besides the official docs:


How to speed up pg_dump when dumping large Postgres databases

 

After a series of rather hefty posts, let's take a breather with some lighter material.

Today we'll discuss how to speed up pg_dump when dealing with large data volumes.

Speeding up options discussed here:

  1. Compression
  2. Dump/restore without saving dumps on disk
  3. Parallelized pg_dump
  4. Advanced custom parallelization

Option 1: Consider applying compression

In cases of weak disks or network, it makes sense to apply compression. If the plain (default) format is used, you can just use pg_dump ... | gzip. Note that this is not going to speed the process up if disk IO and network (if it's used) are not saturated (inspect resource utilization with standard tools like iostat, top, and so on).
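
For example, a simple sketch with the plain format (the database name is a placeholder):

pg_dump -d mydb | gzip > mydb_$(date +%Y%m%d).sql.gz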

Option 2: Avoid dumping to disk, restore on the fly

When the directory format is used (option -Fd; this format is the most flexible, and I usually use it unless I have a specific situation), then compression is applied by default (gzip by default; lz4 and zstd are also available).

It also makes sense to pipe the dump straight into the restore side, avoiding writing to disk and restoring the dump "on the fly" – e.g., a plain dump piped to psql, or a custom-format dump (-Fc) piped to pg_restore. Unfortunately, this doesn't work with the directory format, and a piped restore cannot be parallelized. To solve this problem, there is a 3rd-party tool called pgcopydb.
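
A hedged sketch of the "on the fly" approach (host and database names are placeholders):

# plain format, restored by psql
pg_dump -h source-host -d mydb | psql -h target-host -d mydb

# custom format, restored by pg_restore (single-threaded when reading from stdin)
pg_dump -Fc -h source-host -d mydb | pg_restore -d mydb -h target-host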

Option 3: pg_dump -j$N

For servers with a high number of CPUs, when you deal with multiple tables and create dumps in the directory format, parallelization (option -j$N) can be very helpful. A single, but partitioned, table is going to behave similarly to multiple tables – because physically, the dumping will be applied to multiple tables (partitions).

Consider a standard table pgbench_accounts, created by pgbench, partitioned into 30 partitions and relatively large, 100M rows – the size of this data (without indexes) is ~12 GiB:

❯ pgbench -i -s1000 --partitions=30 test
dropping old tables...
creating tables...
creating 30 partitions...
generating data (client-side)...
100000000 of 100000000 tuples (100%) done (elapsed 53.64 s, remaining 0.00 s)
vacuuming...
creating primary keys...
done in 146.30 s (drop tables 0.15 s, create tables 0.22 s, client-side generate 54.02 s, vacuum 25.19 s, primary keys 66.71 s).

The size of data in the database test is now ~12 GiB, and the majority of it is in the partitioned table pgbench_accounts.

The most flexible format of dumps is directory – option -Fd. Measuring time a few times to make sure caches are warmed up (macbook m1, 16 GiB RAM, PG15, shared_buffers='4GB'):

❯ time pg_dump -Fd -f ./test_dump test
pg_dump -Fd -f ./test_dump test  45.94s user 2.65s system 77% cpu 1:02.50 total
❯ rm -rf ./test_dump
❯ time pg_dump -Fd -f ./test_dump test
pg_dump -Fd -f ./test_dump test  45.83s user 2.69s system 79% cpu 1:01.06 total

– ~61 sec.

Speeding up with -j8 to have 8 parallel pg_dump workers:

❯ time pg_dump -Fd -j8 -f ./test_dump test
pg_dump -Fd -j8 -f ./test_dump test  57.29s user 6.02s system 259% cpu 24.363 total
❯ rm -rf ./test_dump
❯ time pg_dump -Fd -j8 -f ./test_dump test
pg_dump -Fd -j8 -f ./test_dump test  57.59s user 6.06s system 261% cpu 24.327 total

– ~24 sec (vs. ~61 sec above).

Option 4: Advanced parallelization for large unpartitioned tables

When dumping a database where there is a single unpartitioned table that is much larger than other tables (e.g., a huge unpartitioned "log" table containing some historical data), standard pg_dump parallelization, unfortunately, won't help. Using the same example as above, but without partitioning:

❯ pgbench -i -s1000 test
dropping old tables...
creating tables...
generating data (client-side)...
100000000 of 100000000 tuples (100%) done (elapsed 51.71 s, remaining 0.00 s)
vacuuming...
creating primary keys...
done in 116.23 s (drop tables 0.73 s, create tables 0.03 s, client-side generate 51.93 s, vacuum 4.41 s, primary keys 59.14 s).
❯
❯ rm -rf ./test_dump
❯ time pg_dump -Fd -j8 -f ./test_dump test
pg_dump -Fd -j8 -f ./test_dump test  48.24s user 3.25s system 83% cpu 1:01.83 total

– multiple workers are not really helpful here because, most of the time, only a single worker is doing the work (dumping the largest unpartitioned table).

This is because pg_dump parallelization works at the table level and cannot parallelize dumping a single table.

To parallelize dumping a single large table, a custom solution is needed. To do that, we need to use multiple SQL clients such as psql, each one working in a transaction at the REPEATABLE READ isolation level (pg_dump also uses this level; see the docs), and (important!) all of the dumping transactions need to use the same snapshot.

The process can be as follows:

1) In one connection (e.g., in one psql session), start a transaction at the REPEATABLE READ level:

test=# start transaction isolation level repeatable read;
START TRANSACTION

2) This transaction must remain open until the very end of the process – we need to make sure of that.

3) In the same session, use the function pg_export_snapshot() to obtain the snapshot ID:

test=*# select pg_export_snapshot();
 pg_export_snapshot
---------------------
 00000004-000BF714-1
(1 row)

4) In other sessions, we also open REPEATABLE READ transactions and set them to use the very same snapshot (of course, we run all of them in parallel, this is the whole point of speeding up):

test=# start transaction isolation level repeatable read;
START TRANSACTION
test=*# set transaction snapshot '00000004-000BF714-1';
SET

5) Then in each session, we dump a chunk of the large table, making sure the access method is fast (Index Scan; alternatively, for example when no proper indexes are present, you can use ranges of the hidden column ctid and benefit from TID Scan, avoiding Seq Scan). For example, dumping chunk 1 of 8 for pgbench_accounts:

test=# start transaction isolation level repeatable read;
START TRANSACTION
test=*# set transaction snapshot '00000004-000BF714-1';
SET
test=*# copy (select * from pgbench_accounts where aid <= 12500000) to stdout;

6) To dump the other, smaller tables, we can involve pg_dump – it also supports working with a particular snapshot, via the option --snapshot=.... In this case, we need to exclude the data of our large table using --exclude-table-data=... because we take care of it separately. Here we can also involve parallelization. For example:

❯ pg_dump \
  -Fd \
  -j2 \
  -f ./everything_but_pgba_data.dump \
  --snapshot="00000004-000BF714-1" \
  --exclude-table-data="pgbench_accounts" \
  test

7) Do not forget to close the first transaction when everything is done – long-running transactions are harmful for OLTP workloads.

8) To restore, we need to follow the usual pg_dump order: DDL defining objects except indexes; then data load; and finally, constraint validation and index creation. For this, we can benefit from having the dump in the directory format and use pg_restore's options -l and -L to list the objects in the dump and to filter which ones to restore, respectively.
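
A rough sketch of that filtering step (the list file name is arbitrary):

pg_restore -l ./everything_but_pgba_data.dump > objects.list
# edit objects.list: comment out the entries you don't want at this stage (e.g., index creation)
pg_restore -j8 -L objects.list -d test ./everything_but_pgba_data.dump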

A good post about dealing with snapshots when making database dumps: "Postgres 9.5 feature highlight - pg_dump and external snapshots". A very interesting additional consideration in that post is related to a special case of dumping: initialization of logical replicas. It is possible to use custom dumping methods synchronized with the position of a logical slot, but the creation of such a slot has to be done via the replication protocol (CREATE_REPLICATION_SLOT foo3 LOGICAL test_decoding;), not using SQL (select * from pg_create_logical_replication_slot(...);).

How to work with pg_stat_statements?

                                                 PART 1

Two branches of Postgres query optimization

There are two big branches of #PostgreSQL query optimization:

  1. "Micro" optimization: analysis and improvement of particular queries. Main tool: EXPLAIN.
  2. "Macro" optimization: analysis of whole or large parts of workload, segmentation of it, studying characteristics, going from top to down, to identify and improve the parts that behave the worst. Main tools: pg_stat_statements (and additions or alternatives), wait event analysis, and Postgres logs.

Today we focus on how to read and use pg_stat_statements, starting from basics and proceeding to using the data from it for macro optimization.

pg_stat_statements basics

Extension pg_stat_statements (for short, "pgss") became the de-facto standard for macro-analysis.

It tracks all queries, aggregating them to query groups – called "normalized queries" – where parameters are removed.

There are certain limitations, some of which are worth remembering:

  • it doesn't show anything about ongoing queries (can be found in pg_stat_activity)
  • a big issue: it doesn't track failing queries, which can sometimes lead to wrong conclusions (example: CPU and disk IO load are high, but 99% of our queries fail on statement_timeout, loading our system but not producing any useful results – in this case, pgss is blind)
  • if there are SQL comments, they are not removed, but only the first comment value is going to be present in the query column for each normalized query

The view pg_stat_statements has 3 kinds of columns:

  1. queryid – an identifier of the normalized query. In the latest PG versions it can also be used to connect (JOIN) data from pgss to pgsa (pg_stat_activity) and Postgres logs. Surprise: the queryid value can be negative.
  2. Descriptive columns: ID of database (dbid), user (userid), and the query itself (query).
  3. Metrics. Almost all of them are cumulative: calls, total_time, rows, etc. Non-cumulative: stddev_plan_time, stddev_exec_time, min_exec_time, etc. In this post, we'll focus only on cumulative ones.

// Below I sometimes call normalized query "query group" or simply "group".

Let's mention some metrics that are usually most frequently used in macro optimization (full list):

1) calls – how many query calls happened for this query group (normalized query)

2) total_plan_time and total_exec_time – aggregated duration for planning and execution for this group (again, remember: failed queries are not tracked, including those that failed on statement_timeout)

3) rows – how many rows returned by queries in this group

4) shared_blks_hit and shared_blks_read – numbers of hit and read operations in the buffer pool. Two important notes here:

  • "read" here means a read from the buffer pool – it is not necessarily a physical read from disk, since data can be cached in the OS page cache. So we cannot say these reads are reads from disk. Some monitoring systems make this mistake, but there are cases that this nuance is essential for our analysis to produce correct results and conclusions.
  • The names "blocks hit" and "blocks read" might be a little bit misleading, suggesting that here we talk about data volumes – number of blocks (buffers). While aggregation here definitely make sense, we must keep in mind that the same buffers may be read or hit multiple times. So instead of "blocks have been hit" it is better to say "block hits".

5) wal_bytes – how many bytes are written to WAL by queries in this group

There are many more interesting metrics; it is recommended to explore all of them (see the docs).

Dealing with cumulative metrics in pgss

To read and interpret data from pgss, you need three steps:

1) Take two snapshots corresponding to two points of time.

2) Calculate the diff for each cumulative metric, as well as the time difference between the two points. A special case is when the first point in time is the beginning of stats collection – in PG14+, there is a separate view, pg_stat_statements_info, that shows when the last pgss stats reset happened; in PG13 and older, this info is not stored, unfortunately.

3) (the most interesting part!) Calculate three types of derived metrics for each cumulative metric diff – assuming that M is our metric and remembering some basics of calculus from high school (see the snapshot sketch right after this list):

  • a) dM/dt – time-based differentiation of the metric M;
  • b) dM/dc – calls-based differentiation (I'll explain it in detail in the next post);
  • c) %M – percentage that this normalized query takes in the whole workload considering metric M.
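
A rough sketch of the snapshot-and-diff approach in SQL (the table names here are arbitrary; a monitoring system would normally do this for you):

-- snapshot 1
create table pgss_snap1 as
  select clock_timestamp() as ts, * from pg_stat_statements;

-- ... wait some time (e.g., 10 minutes) ...

-- snapshot 2
create table pgss_snap2 as
  select clock_timestamp() as ts, * from pg_stat_statements;

-- dM/dt for M = calls, i.e., QPS per query group
select
  s2.queryid,
  (s2.calls - s1.calls)
    / extract(epoch from s2.ts - s1.ts) as qps
from pgss_snap2 s2
join pgss_snap1 s1 using (queryid, dbid, userid)
order by qps desc
limit 10;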

Step 3 can also be applied not to particular normalized queries on a single host but to bigger groups – for example:

  • aggregated workload for all standby nodes
  • whole workload on a node (e.g., the primary)
  • bigger segments such as all queries from specific user or to specific database
  • all queries of specific type – e.g., all UPDATE queries

If your monitoring system supports pgss, you don't need to deal with snapshots manually – although, keep in mind that I personally don't know any monitoring that works with pgss perfectly, preserving all kinds of information discussed in this post (and I have studied quite a few Postgres monitoring tools).

Assuming you have successfully obtained two snapshots of pgss (remembering the timestamps when they were collected) or use a proper monitoring tool, let's consider the practical meaning of the three derivatives we discussed.

Derivative 1. Time-based differentiation

  • dM/dt, where M is calls – the meaning is simple. It's QPS (queries per second). If we talk about a particular group (normalized query), it's the QPS of that group. 10,000 is pretty large, so you probably need to improve the client (app) behavior to reduce it; 10 is pretty small (of course, depending on the situation). If we consider this derivative for the whole node, it's our "global QPS".
  • dM/dt, where M is total_plan_time + total_exec_time – this is the most interesting and key metric in query macro-analysis targeted at resource consumption optimization (goal: reduce the time the server spends processing queries). Interesting fact: it is measured in "seconds per second", meaning: how many seconds our server spends each second to process queries in this query group. A very rough (but illustrative) interpretation: if we have 2 sec/sec here, it means that we spend 2 seconds each second processing such queries – we definitely would like to have more than 2 vCPUs to do that. Although, this is a very rough interpretation, because pgss doesn't distinguish situations when a query is waiting for lock acquisition vs. performing actual work on CPU (for that, we need to involve wait event analysis) – so there may be cases when the value here is high without having a significant effect on CPU load.
  • dM/dt, where M is rows – this is the "stream" of rows returned by queries in the group, per second. For example, 1000 rows/sec means a noticeable "stream" from the Postgres server to clients. An interesting consideration here is that sometimes we need to think about how much load the results produced by our Postgres server put on the application nodes – returning too many rows may require significant resources on the client side.
  • dM/dt, where M is shared_blks_hit + shared_blks_read – buffer operations per second (only to read data, not to write it). This is another key metric for optimization. It is worth converting buffer operation numbers to bytes. In most cases, the buffer size is 8 KiB (check: show block_size;), so 500,000 buffer hits&reads per second translates to 500,000 * 8 KiB = ~3.8 GiB/s of internal data-reading flow (again: the same buffer in the pool can be processed multiple times). This is a significant load – you might want to check the other metrics to understand whether it is reasonable or a candidate for optimization.
  • dM/dt, where M is wal_bytes – the stream of WAL bytes written. This is a relatively new metric (PG13+) and can be used to understand which queries contribute to WAL writes the most – of course, the more WAL is written, the higher the pressure on physical and logical replication and on the backup systems we have. An example of a highly pathological workload here is a series of transactions like begin; delete from ...; rollback; deleting many rows and reverting this action – this produces a lot of WAL without performing any useful work. (Note that despite the ROLLBACK and pgss's inability to track failed statements, these statements are tracked because they complete successfully inside the transaction.)


That's it for part 1 of the pgss-related howto. In the next parts, we'll talk about dM/dc and %M, and other practical aspects of pgss-based macro-optimization.

                                                     PART 2

In part 1, we discussed some basics of working with pgss and the first set of derived metrics, dM/dt – time-based differentiation. Today we'll focus on the second set: dM/dc, where c is the number of calls (column calls in pgss).

Derivative 2. Calls-based differentiation

This set of metrics is no less important than time-based differentiation because it can provide a systematic view of the characteristics of your workload and be a good tool for macro-optimization of query performance.

The metrics in this set help us understand the characteristics of a query, on average, for each query group.

Unfortunately, many monitoring systems disregard this kind of derived metric. A good system has to present all or at least most of them, showing graphs of how these values change over time (dM/dc time series).

Obtaining results for derived metrics of this kind is pretty straightforward (see the sketch right after this list):

  • calculate difference of values M (the metric being studied) between two pgss snapshots: M2 - M1
  • then, instead of using timestamps, get the difference of the "calls" values: c2 - c1
  • then get (M2 - M1) / (c2 - c1)
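
Continuing the snapshot sketch from part 1 (the same hypothetical tables pgss_snap1 and pgss_snap2; column names are from PG13+), the average latency per query group could be computed like this:

select
  s2.queryid,
  (  (s2.total_plan_time + s2.total_exec_time)
   - (s1.total_plan_time + s1.total_exec_time) )
    / nullif(s2.calls - s1.calls, 0) as avg_latency_ms
from pgss_snap2 s2
join pgss_snap1 s1 using (queryid, dbid, userid)
order by avg_latency_ms desc nulls last
limit 10;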

Let's consider the meanings of various derived metrics obtained in such way:

1) dM/dc, where M is calls – a degenerate case, the value is always 1 (number of calls divided by the same number of calls).

2) dM/dc, where M is total_plan_time + total_exec_time – the average query duration in a particular pgss group, a critically important metric for query performance observability. It can also be called "query latency". When applied to the aggregated value for all normalized queries in pgss, its meaning is "average query latency on the server" (with two important caveats: pgss doesn't track failing queries, and it can sometimes have skewed data due to the pg_stat_statements.max limit). The main cumulative statistics system in #PostgreSQL doesn't provide this kind of information – pg_stat_database tracks some time metrics, blk_read_time and blk_write_time if track_io_timing is enabled, and, in PG14+, active_time – but it doesn't have information about the number of statements (!); only the numbers for transactions, xact_commit & xact_rollback, are present. In some cases, we can obtain this data from other sources – e.g., pgbench reports it if we use it for benchmarks, and pgBouncer reports stats for both transaction and query average latencies – but in the general case, in observability tools, pgss can be considered the most generic way to get query latency information. Its importance is hard to overestimate – for example:

  • If we know that normally the avg query duration is <1 ms, then any spike to 10ms should be considered as a serious incident (if it happened after a deployment, this deployment should be reconsidered/reverted). For troubleshooting, it also helps to apply segmentation and determine which particular query groups contributed to this latency spike – was it all of them or just particular ones?
  • In many cases, this can be taken as the most important metric for large load testing, benchmarks (for example: comparing average query duration for PG 15 vs. PG 16 when preparing for a major upgrade to PG 16).

3) dM/dc, where M is rows – the average number of rows returned by a query in a given query group. For OLTP cases, groups having large values (starting at a few hundred or more, depending on the case) should be reviewed: if it's intentional (say, data dumps), no action is needed; if it's a user-facing query not related to data exports, then probably there is a mistake such as a missing LIMIT and lack of proper pagination, and such queries should be fixed.

4) dM/dc, where M is shared_blks_hit + shared_blks_read – the average number of "hits + reads" from the buffer pool per query. It is worth translating this to bytes: for example, 500,000 buffer hits&reads translates to 500,000 * 8 KiB = ~3.8 GiB – a significant number for a single query, especially if its goal is to return just a row or a few. Large numbers here should be considered a strong call for query optimization. Additional notes:

  • in many cases, it makes sense to consider hits and reads separately – there may be cases when, for example, queries in some pgss group do not lead to high disk IO and page cache reads, but they have so many hits in the buffer pool that their performance is suboptimal, even with all the data cached in the buffer pool
  • to have real disk IO numbers, it is worth using https://github.com/powa-team/pg_stat_kcache
  • a sudden change in the values of this metric for a particular group that persists over time can be a sign of a plan flip and needs to be studied
  • high-level aggregated values are also interesting to observe, answering questions like "how many MiB do all queries, on average, read on this server?"

5) dM/dc, where M is wal_bytes (PG13+) – the average amount of WAL generated by a query in the studied pgss group, measured in bytes. It is helpful for identifying the query groups that contribute the most to WAL generation. A "global" aggregated value for all pgss records represents the average number of WAL bytes per statement on the server. Having graphs for this and for "dM/dc, where M is wal_fpi" can be very helpful in certain situations such as checkpoint tuning: with full_page_writes = on, increasing the distance between checkpoints should reduce the values in this area, and it may be interesting to study particular pgss groups separately.

                                             PART 3

3rd type of derived metrics: percentage

Now, let's examine the third type of derived metrics: the percentage that a considered query group (a normalized query or a bigger group such as "all statements from a particular user" or "all UPDATE statements") takes in the whole workload with respect to metric M.

How to calculate it: first, apply time-based differentiation to all considered groups (as discussed in part 1) – dM/dt – and then divide the value for the particular group by the sum of values for all groups.
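
A simple sketch: if we are fine with percentages computed over the whole period since the last stats reset (i.e., a single snapshot instead of a diff of two), a window function makes it a one-liner; the same trick applies to diffed values from two snapshots:

select
  queryid,
  round(
    (100 * total_exec_time / sum(total_exec_time) over ())::numeric,
    2
  ) as pct_total_exec_time
from pg_stat_statements
order by pct_total_exec_time desc
limit 10;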

Visualization and interpretation of %M

While dM/dt gives us absolute values such as calls/sec or GiB/sec, the %M values are relative metrics. These values help us identify the "major players" in our workload considering various aspects of it — frequency, timing, IO operations, and so forth.

Analysis of relative values helps us understand how big the potential win from each optimization vector is and prioritize our optimization activities, focusing first on those with the most potential. For example:

  • If the absolute QPS value seems high – say, 1000 calls/sec – but it represents just 3% of the whole workload, an attempt to reduce this query group's QPS won't give a big win, and if we are concerned about QPS, we need to optimize other query groups.
  • However, if we have 1000 calls/sec and see that it's 50% of the whole, this single optimization step — say, reducing it to 10 calls/sec — helps us shave off almost half of all the QPS we have.

One of the ways to deal with proportion values in larger systems is to react to large percentage values, considering the corresponding query groups as candidates for optimization. For example, in systems with a large number of query groups, it might make sense to apply the following approach:

  • Periodically, for certain metrics (for example, calls, total_exec_time, total_plan_time, shared_blks_dirtied, wal_bytes), build Top-10 lists showing query groups having the largest %M values.
  • If a particular query group turns out to be a major contributor – say, >20% – for certain metrics, consider it a candidate for optimization. For example, in most cases, we don't want a single query group to be responsible for half of the aggregated total_exec_time. In certain cases, it is OK to decide that the query doesn't require optimization – then we mark such a group as an exclusion and skip it in subsequent analyses.

The analysis of proportions can also be performed implicitly, visually, in a monitoring system: observing graphs of dM/dt (e.g., QPS, block hits per second), we can visually understand which query group contributes the most to the whole workload, considering a particular metric M. However, for this, the graphs need to be "stacked".

If we deal with 2 snapshots, then it makes sense to obtain such values explicitly. Additionally, for visualization purposes, it makes sense to draw a pie chart for each metric we are analyzing.

%M examples

1) %M, where M is calls — this gives us proportions of QPS. For example, if we normally have ~10k QPS, but if some query group is responsible for ~7k QPS, this might be considered as abnormal, requiring optimizations on client side (usually, application code).

2) %M, where M is total_plan_time + total_exec_time – the percentage of time the server spends processing queries in a particular group. For example, if the absolute value is 20 seconds/second (quite a loaded system – each second, #PostgreSQL needs to spend 20 seconds to process queries) and a particular group has 75% of this metric, it means we need to focus on optimizing this particular query group. Ways to optimize:

  • If QPS (calls/second) is high, then, first of all, we need to reduce it (usually on the client side, in the application code).
  • If average latency (total_exec_time, less often total_plan_time or both) is high, then we need to apply micro-optimization using EXPLAIN and EXPLAIN (ANALYZE, BUFFERS).
  • In some cases, we need to combine both directions of optimization.

3) %M, where M is shared_blks_dirtied — percentage of changes in the buffer pool performed by the considered query group. This analysis may help us identify the write-intensive parts of the workload and find opportunities to reduce the volume of checkpoints and amount of disk IO.

4) %M, where M is wal_bytes — percentage of bytes written to WAL. This helps us identify those query groups where optimization will be more impactful in reducing the WAL volumes being produced.

Instead of a summary: three macro-optimization goals and what to use for them

Now, with the analysis methods described here and in the previous 2 parts, let's consider three popular types of macro-optimization with respect to just a single metric — total_exec_time.

Understanding each of these three approaches (and then applying this logic to other metrics as well) can help you understand how your monitoring dashboards should appear.

  1. Macro-optimization aimed at reducing resource consumption. Here we want to reduce, first of all, CPU utilization, memory usage, and disk IO operations. For that, we need to use the dM/dt type of derived metric – the number of seconds Postgres spends each second to process the queries. Reducing this metric – the aggregated "total" value for the whole server, and for the Top-N groups playing the biggest role – has to be our goal. Additionally, we may want to consider other metrics such as shared_blks_***, but the timing is probably the best starting point. This kind of optimization can help with capacity planning, infrastructure cost optimization, and reducing the risks of resource saturation.
  2. Macro-optimization aimed at improving user experience. Here we want our users to have the best experience; therefore, in the OLTP context, we should focus on average latencies with a goal to reduce them. Therefore, we are going to use dM/dc – the number of seconds (or milliseconds) each query lasts on average. If in the previous type of optimization we would like to see Top-N query groups ordered by dM/dt values (measured in seconds/second) in our monitoring system, here we want Top-N ordered by average latency (measured in seconds). Usually, this gives us a very different set of queries – perhaps with lower QPS. But these very queries have the worst latencies and annoy our users the most. In some cases, in this type of analysis, we might want to exclude query groups with low QPS (e.g., those having QPS < 1 call/sec) or exclude specific parts of the workload such as data dump activities that inevitably have long latencies.
  3. Macro-optimization aimed at balancing the workload. This is a less common type of optimization, but this is exactly where %M plays its role. Developing our application, we might want, from time to time, to check the Top-N percentages for total_exec_time + total_plan_time and identify the query groups holding the biggest portions – exactly as we discussed above.

Bonus: podcast episodes

Related Postgres.FM episodes: