Friday, October 20, 2023

How to work with pg_stat_statements?

                                                 PART - 1

Two branches of Postgres query optimization

There are two big branches of #PostgreSQL query optimization:

  1. "Micro" optimization: analysis and improvement of particular queries. Main tool: EXPLAIN.
  2. "Macro" optimization: analysis of whole or large parts of workload, segmentation of it, studying characteristics, going from top to down, to identify and improve the parts that behave the worst. Main tools: pg_stat_statements (and additions or alternatives), wait event analysis, and Postgres logs.

Today we focus on how to read and use pg_stat_statements, starting from basics and proceeding to using the data from it for macro optimization.

pg_stat_statements basics

The pg_stat_statements extension ("pgss" for short) has become the de facto standard for macro-analysis.

It tracks all queries, aggregating them into query groups – called "normalized queries" – where parameter values are removed.
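To use it, the module has to be preloaded and the extension created in the database you are going to inspect – a minimal setup sketch (changing shared_preload_libraries requires a restart):

  -- postgresql.conf (requires a restart):
  --   shared_preload_libraries = 'pg_stat_statements'

  -- then, in the database you are going to inspect:
  create extension if not exists pg_stat_statements;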

There are certain limitations, some of which are worth remembering:

  • it doesn't show anything about currently running queries (those can be found in pg_stat_activity)
  • a big issue: it doesn't track failing queries, which can sometimes lead to wrong conclusions (example: CPU and disk IO load are high, but 99% of our queries fail on statement_timeout, loading our system but not producing any useful results – in this case, pgss is blind)
  • if there are SQL comments, they are not removed, but only the comment that came with the first query observed in a group will be present in the query column of that normalized query

The view pg_stat_statements has 3 kinds of columns:

  1. queryid – an identifier of the normalized query. In recent Postgres versions (PG14+), it can also be used to connect (JOIN) data from pgss to pgsa (pg_stat_activity) and Postgres logs. Surprise: queryid values can be negative.
  2. Descriptive columns: ID of database (dbid), user (userid), and the query itself (query).
  3. Metrics. Almost all of them are cumulative: calls, total_plan_time, total_exec_time, rows, etc. Non-cumulative: stddev_plan_time, stddev_exec_time, min_exec_time, etc. In this post, we'll focus only on the cumulative ones.

// Below, I sometimes call a normalized query a "query group" or simply a "group".
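For a first look at the view, here is a minimal example query (PG13+ column names; in older versions, total_exec_time was total_time):

  select s.queryid, s.dbid, s.userid, left(s.query, 50) as query,
         s.calls, round(s.total_exec_time::numeric, 2) as total_exec_time_ms, s.rows
  from pg_stat_statements as s
  order by s.total_exec_time desc
  limit 5;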

Let's mention the metrics that are most frequently used in macro optimization (see the docs for the full list):

1) calls – how many query calls happened for this query group (normalized query)

2) total_plan_time and total_exec_time – aggregated duration for planning and execution for this group (again, remember: failed queries are not tracked, including those that failed on statement_timeout)

3) rows – how many rows were returned by queries in this group

4) shared_blks_hit and shared_blks_read – the number of hit and read operations from the buffer pool. Two important notes here:

  • "read" here means a read from the buffer pool – it is not necessarily a physical read from disk, since data can be cached in the OS page cache. So we cannot say these reads are reads from disk. Some monitoring systems make this mistake, but there are cases that this nuance is essential for our analysis to produce correct results and conclusions.
  • The names "blocks hit" and "blocks read" might be a little misleading, suggesting that we are talking about data volumes – numbers of blocks (buffers). While aggregation here definitely makes sense, we must keep in mind that the same buffer may be read or hit multiple times. So instead of "blocks have been hit" it is better to say "block hits".

5) wal_bytes – how many bytes are written to WAL by queries in this group

There are many more interesting metrics; it is recommended to explore all of them (see the docs).

Dealing with cumulative metrics in pgss

To read and interpret data from pgss, you need three steps:

1) Take two snapshots corresponding to two points of time.

2) Calculate the diff for each cumulative metric, as well as the time difference between the two points in time. A special case is when the first point in time is the beginning of stats collection – in PG14+, there is a separate view, pg_stat_statements_info, that shows when the last pgss stats reset happened; in PG13 and older, this info is not stored, unfortunately.

3) (the most interesting part!) Calculate three types of derived metrics for each cumulative metric diff – assuming that M is our metric and remembering some basics of calculus from high school:

  • a) dM/dt – time-based differentiation of the metric M;
  • b) dM/dc – calls-based differentiation (I'll explain it in detail in the next post);
  • c) %M – the percentage that this normalized query takes in the whole workload, considering metric M. (A SQL sketch of the snapshot approach follows below.)
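Here is a minimal SQL sketch of these steps (pgss_snap1 and pgss_snap2 are arbitrary, hypothetical table names; the example computes dM/dt and dM/dc – the %M case is covered in part 3; it also assumes no stats reset happened between the snapshots, which in PG14+ can be checked via pg_stat_statements_info.stats_reset):

  -- step 1: take two snapshots, some time apart
  create table pgss_snap1 as select clock_timestamp() as ts, * from pg_stat_statements;
  -- ... wait, e.g., 10-60 minutes of normal workload ...
  create table pgss_snap2 as select clock_timestamp() as ts, * from pg_stat_statements;

  -- steps 2-3: diff the cumulative metrics and compute time-based (dM/dt)
  -- and calls-based (dM/dc) derivatives
  with t as (
    select extract(epoch from (select min(ts) from pgss_snap2)
                             - (select min(ts) from pgss_snap1)) as seconds
  )
  select
    s2.queryid,
    left(s2.query, 50) as query,
    (s2.calls - coalesce(s1.calls, 0)) / t.seconds as calls_per_sec,            -- dM/dt
    (s2.total_exec_time - coalesce(s1.total_exec_time, 0)) / 1000 / t.seconds
      as exec_seconds_per_sec,                                                   -- dM/dt
    (s2.total_exec_time - coalesce(s1.total_exec_time, 0))
      / nullif(s2.calls - coalesce(s1.calls, 0), 0) as avg_exec_time_ms          -- dM/dc
  from pgss_snap2 as s2
  left join pgss_snap1 as s1 using (queryid, dbid, userid)
  cross join t
  order by exec_seconds_per_sec desc nulls last
  limit 20;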

Step 3 can also be applied not only to particular normalized queries on a single host but also to bigger groups – for example:

  • aggregated workload for all standby nodes
  • whole workload on a node (e.g., the primary)
  • bigger segments such as all queries from a specific user or to a specific database
  • all queries of a specific type – e.g., all UPDATE queries

If your monitoring system supports pgss, you don't need to work with snapshots manually – although, keep in mind that I personally don't know any monitoring tool that works with pgss perfectly, preserving all kinds of information discussed in this post (and I have studied quite a few Postgres monitoring tools).

Assuming you have successfully obtained two snapshots of pgss (remembering the timestamps when they were collected) or use a proper monitoring tool, let's consider the practical meaning of the three types of derivatives we discussed.

Derivative 1. Time-based differentiation

  • dM/dt, where M is calls – the meaning is simple. It's QPS (queries per second). If we talk about a particular group (normalized query), it's the QPS of that group. 10,000 calls/sec is pretty high, so you probably need to improve the client (app) behavior to reduce it; 10 is pretty low (of course, depending on the situation). If we consider this derivative for the whole node, it's our "global QPS".
  • dM/dt, where M is total_plan_time + total_exec_time – this is the most interesting and key metric in query macro analysis targeted at resource consumption optimization (goal: reduce the time the server spends processing queries). Interesting fact: it is measured in "seconds per second", meaning: how many seconds our server spends processing queries in this query group. Very rough (but illustrative) interpretation: if we have 2 sec/sec here, it means that we spend 2 seconds each second to process such queries – we definitely would like to have more than 2 vCPUs for that. Although this is a very rough interpretation, because pgss doesn't distinguish situations when a query is waiting for a lock acquisition vs. doing actual work on CPU (for that, we need to involve wait event analysis) – so there may be cases when the value here is high without a significant effect on CPU load.
  • dM/dt, where M is rows – this is the "stream" of rows returned by queries in the group, per second. For example, 1000 rows/sec means a noticeable "stream" from the Postgres server to the client. An interesting point here is that sometimes we need to think about how much load the results produced by our Postgres server put on the application nodes – returning too many rows may require significant resources on the client side.
  • dM/dt, where M is shared_blks_hit + shared_blks_read – buffer operations per second (only to read data, not to write it). This is another key metric for optimization. It is worth converting buffer operation numbers to bytes. In most cases, the buffer size is 8 KiB (check: show block_size;), so 500,000 buffer hits&reads per second translates to 500000 * 8 KiB = ~3.8 GiB/s of internal data reading flow (again: the same buffer in the pool can be processed multiple times). This is a significant load – you might want to check the other metrics to understand whether it is reasonable to have or a candidate for optimization.
  • dM/dt, where M is wal_bytes – the stream of WAL bytes written. This is a relatively new metric (PG13+) and can be used to understand which queries contribute the most to WAL writes – of course, the more WAL is written, the higher the pressure on physical and logical replication and on the backup systems we have. An example of a highly pathological workload here is a series of transactions like begin; delete from ...; rollback; – deleting many rows and reverting this action produces a lot of WAL without performing any useful work. (Note: despite the ROLLBACK here and the inability of pgss to track failed statements, these statements will be tracked because they complete successfully inside the transaction.) A quick SQL sketch for estimating dM/dt without snapshots follows this list.
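If you don't have two snapshots yet, a rough approximation of dM/dt can be obtained by averaging over the whole period since the last pgss stats reset (this assumes PG14+, where pg_stat_statements_info is available; naturally, such averaging hides how the workload changed over time):

  select
    s.queryid,
    left(s.query, 50) as query,
    round(s.calls / extract(epoch from now() - i.stats_reset)) as calls_per_sec,
    round(s.wal_bytes / extract(epoch from now() - i.stats_reset)) as wal_bytes_per_sec
  from pg_stat_statements as s, pg_stat_statements_info as i
  order by wal_bytes_per_sec desc
  limit 10;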


That's it for part 1 of this pgss-related how-to. In the next parts, we'll talk about dM/dc and %M, and other practical aspects of pgss-based macro optimization.

                                                     PART - 2

In part 1, we've discussed some basics of working with pgss, and the first set of derived metrics, dM/dt – time-based differentiation. Today we'll focus on the second set: dM/dc, where c is the number of calls (column calls in pgss).

Derivative 2. Calls-based differentiation

This set of metrics is no less important than time-based differentiation, because it can provide a systematic view of the characteristics of your workload and be a good tool for macro-optimization of query performance.

The metrics in this set help us understand the characteristics of a query, on average, for each query group.

Unfortunately, many monitoring systems disregard this kind of derived metric. A good system has to present all or at least most of them, showing graphs of how these values change over time (dM/dc time series).

Obtaining results for derived metrics of this kind is pretty straightforward:

  • calculate difference of values M (the metric being studied) between two pgss snapshots: M2 - M1
  • then, instead of using timestamps, get the difference of the "calls" values: c2 - c1
  • then get (M2 - M1) / (c2 - c1) – see the sketch below
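Continuing the two-snapshot sketch from part 1 (the hypothetical pgss_snap1/pgss_snap2 tables), per-call averages for several metrics can be computed like this; the second query aggregates over all groups to estimate the server-wide average latency discussed below:

  select
    s2.queryid,
    left(s2.query, 50) as query,
    (s2.total_exec_time - coalesce(s1.total_exec_time, 0))
      / nullif(s2.calls - coalesce(s1.calls, 0), 0) as avg_exec_time_ms,
    (s2.rows - coalesce(s1.rows, 0))::numeric
      / nullif(s2.calls - coalesce(s1.calls, 0), 0) as avg_rows,
    (s2.shared_blks_hit + s2.shared_blks_read
      - coalesce(s1.shared_blks_hit + s1.shared_blks_read, 0))::numeric
      / nullif(s2.calls - coalesce(s1.calls, 0), 0) as avg_buffer_read_ops,
    (s2.wal_bytes - coalesce(s1.wal_bytes, 0))
      / nullif(s2.calls - coalesce(s1.calls, 0), 0) as avg_wal_bytes
  from pgss_snap2 as s2
  left join pgss_snap1 as s1 using (queryid, dbid, userid)
  where s2.calls - coalesce(s1.calls, 0) > 0
  order by avg_exec_time_ms desc
  limit 20;

  -- server-wide average query latency over the same period (same pgss caveats apply):
  select sum(s2.total_plan_time + s2.total_exec_time
             - coalesce(s1.total_plan_time + s1.total_exec_time, 0))
         / nullif(sum(s2.calls - coalesce(s1.calls, 0)), 0) as avg_latency_ms
  from pgss_snap2 as s2
  left join pgss_snap1 as s1 using (queryid, dbid, userid);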

Let's consider the meanings of the various derived metrics obtained in this way:

1) dM/dc, where M is calls – a degenerate case, the value is always 1 (number of calls divided by the same number of calls).

2) dM/dc, where M is total_plan_time + total_exec_time – the average query duration in a particular pgss group, a critically important metric for query performance observability. It can also be called "query latency". When applied to the aggregated value for all normalized queries in pgss, its meaning is "average query latency on the server" (with two important caveats: pgss doesn't track failing queries and can sometimes have skewed data due to the pg_stat_statements.max limit). The main cumulative statistics system in #PostgreSQL doesn't provide this kind of information – pg_stat_database tracks some time metrics, blk_read_time and blk_write_time if track_io_timing is enabled, and, in PG14+, active_time – but it doesn't have information about the number of statements (!); only the number of transactions, xact_commit & xact_rollback, is present. In some cases, we can obtain this data from other sources – e.g., pgbench reports it if we use it for benchmarks, and pgBouncer reports stats for both transaction and query average latencies – but in the general case, among observability tools, pgss can be considered the most generic way to get query latency information. Its importance is hard to overestimate – for example:

  • If we know that normally the avg query duration is <1 ms, then any spike to 10 ms should be considered a serious incident (if it happened after a deployment, this deployment should be reconsidered/reverted). For troubleshooting, it also helps to apply segmentation and determine which particular query groups contributed to this latency spike – was it all of them or just particular ones?
  • In many cases, this can be taken as the most important metric for load testing and benchmarks (for example: comparing average query duration for PG 15 vs. PG 16 when preparing for a major upgrade to PG 16).

3) dM/dc, where M is rows – the average number of rows returned by a query in a given query group. For OLTP cases, the groups having large values (starting at a few hundred or more, depending on the case) should be reviewed: if it's intentional (say, data dumps), no action is needed; if it's a user-facing query not related to data exports, then there is probably a mistake such as a missing LIMIT and lack of proper pagination, and such queries should be fixed.

4) dM/dc, where M is shared_blks_hit + shared_blks_read – the average number of "hits + reads" from the buffer pool per query. It is worth translating this to bytes: for example, 500,000 buffer hits&reads translates to 500000 * 8 KiB = ~3.8 GiB; this is a significant number for a single query, especially if its goal is to return just a row or a few. Large numbers here should be considered a strong call for query optimization. Additional notes:

  • in many cases, it makes sense to consider hits and reads separately – there may be cases when, for example, queries in some pgss group do not lead to high disk IO or reads from the page cache, but they have so many hits in the buffer pool that their performance is suboptimal, even with all the data being cached in the buffer pool
  • to have real disk IO numbers, it is worth using https://github.com/powa-team/pg_stat_kcache
  • a sudden change in the values of this metric for a particular group that persists over time can be a sign of a plan flip and needs to be studied
  • high-level aggregated values are also interesting to observe, answering questions like "how many MiB do all queries, on average, read on this server?"

5) dM/dc, where M is wal_bytes (PG13+) – the average amount of WAL generated by a query in the studied pgss group, measured in bytes. It is helpful for identifying the query groups that contribute most to WAL generation. A "global" aggregated value over all pgss records represents the average number of WAL bytes per statement on the server. Having graphs for this and for "dM/dc, where M is wal_fpi" can be very helpful in certain situations such as checkpoint tuning: with full_page_writes = on, when we increase the distance between checkpoints, we should observe a reduction of values in this area, and it may be interesting to study particular groups in pgss separately.

                                             PART - 3

3rd type of derived metrics: percentage

Now, let's examine the third type of derived metrics: the percentage that a considered query group (normalized query or bigger groups such as "all statements from particular user" or "all UPDATE statements") takes in the whole workload with respect to metric M.

How to calculate it: first, apply time-based differentiation to all considered groups (as discussed in part 1) – dM/dt – and then divide the value for the particular group by the sum of values for all groups.
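A minimal sketch using the same hypothetical snapshot tables from part 1; since the time interval is identical for all groups, it cancels out, so we can divide the raw diffs directly:

  select
    s2.queryid,
    left(s2.query, 50) as query,
    100 * (s2.total_exec_time - coalesce(s1.total_exec_time, 0))
        / nullif(sum(s2.total_exec_time - coalesce(s1.total_exec_time, 0)) over (), 0)
      as pct_total_exec_time
  from pgss_snap2 as s2
  left join pgss_snap1 as s1 using (queryid, dbid, userid)
  order by pct_total_exec_time desc nulls last
  limit 10;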

Visualization and interpretation of %M

While dM/dt gives us absolute values such as calls/sec or GiB/sec, the %M values are relative metrics. These values help us identify the "major players" in our workload considering various aspects of it — frequency, timing, IO operations, and so forth.

Analysis of relative values helps us understand how big the potential win from each optimization vector is and prioritize our optimization activities, focusing first on those with the most potential. For example:

  • If the absolute QPS value seems high – say, 1000 calls/sec – but it represents just 3% of the whole workload, an attempt to reduce this query's frequency won't give a big win, and if we are concerned about overall QPS, we need to optimize other query groups.
  • However, if we have 1000 calls/sec and see that it's 50% of the whole, this single optimization step — say, reducing it to 10 calls/sec — helps us shave off almost half of all the QPS we have.

One of the ways to deal with proportion values in larger systems is to react to large percentage values and consider the corresponding query groups as candidates for optimization. For example, in systems with a large number of query groups, it might make sense to apply the following approach:

  • Periodically, for certain metrics (for example, calls, total_exec_time, total_plan_time, shared_blks_dirtied, wal_bytes), build Top-10 lists showing query groups having the largest %M values (see the sketch after this list).
  • If a particular query group turns out to be a major contributor – say, >20% – on certain metrics, consider it a candidate for optimization. For example, in most cases, we don't want a single query group to be responsible for half of the aggregated total_exec_time ("total total_exec_time", apologies for the tautology). In certain cases, it is OK to decide that a query doesn't require optimization – then we mark such a group as an exclusion and skip it in the next analyses.
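Here is a sketch of such an overview for several metrics at once, using the accumulated since-reset values directly (with two snapshots, apply the same expressions to the diffs):

  select
    s.queryid,
    left(s.query, 50) as query,
    round(100 * s.calls / sum(s.calls) over ()) as pct_calls,
    round(100 * (s.total_plan_time + s.total_exec_time)
          / nullif(sum(s.total_plan_time + s.total_exec_time) over (), 0)) as pct_time,
    round(100 * s.shared_blks_dirtied
          / nullif(sum(s.shared_blks_dirtied) over (), 0)) as pct_dirtied,
    round(100 * s.wal_bytes / nullif(sum(s.wal_bytes) over (), 0)) as pct_wal
  from pg_stat_statements as s
  order by pct_time desc
  limit 10;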

The analysis of proportions can also be performed implicitly, visually in a monitoring system: observing graphs of dM/dt (e.g., QPS, block hits per second), we can visually understand which query group contributes the most to the whole workload, considering a particular metric M. However, for this, graphs need to be "stacked".

If we deal with 2 snapshots, then it makes sense to obtain such values explicitly. Additionally, for visualization purposes, it makes sense to draw a pie chart for each metric we are analyzing.

%M examples

1) %M, where M is calls – this gives us the proportions of QPS. For example, if we normally have ~10k QPS, but some query group is responsible for ~7k of it, this might be considered abnormal, requiring optimization on the client side (usually, application code).

2) %M, where M is total_plan_time + total_exec_time – the percentage of time the server spends processing queries in a particular group. For example, if the absolute value is 20 seconds/second (quite a loaded system – each second #PostgreSQL needs to spend 20 seconds to process queries), and a particular group has 75% on this metric, it means we need to focus on optimizing this particular query group. Ways to optimize:

  • If QPS (calls/second) is high, then, first of all, we need to reduce the number of calls (usually on the application side).
  • If average latency (total_exec_time, less often total_plan_time or both) is high, then we need to apply micro-optimization using EXPLAIN and EXPLAIN (ANALYZE, BUFFERS).
  • In some cases, we need to combine both directions of optimization.

3) %M, where M is shared_blks_dirtied — percentage of changes in the buffer pool performed by the considered query group. This analysis may help us identify the write-intensive parts of the workload and find opportunities to reduce the volume of checkpoints and amount of disk IO.

4) %M, where M is wal_bytes — percentage of bytes written to WAL. This helps us identify those query groups where optimization will be more impactful in reducing the WAL volumes being produced.

Instead of a summary: three macro-optimization goals and what to use for them

Now, with the analysis methods described here and in the previous 2 parts, let's consider three popular types of macro-optimization with respect to just a single metric — total_exec_time.

Understanding each of these three approaches (and then applying this logic to other metrics as well) can help you understand how your monitoring dashboards should appear.

  1. Macro-optimization aimed at reducing resource consumption. Here we want to reduce, first of all, CPU utilization, memory usage, and disk IO operations. For that, we need to use the dM/dt type of derived metric – the number of seconds Postgres spends each second to process the queries. Reducing this metric – the aggregated "total" value for the whole server, and for the Top-N groups playing the biggest role – has to be our goal. Additionally, we may want to consider other metrics such as shared_blks_***, but timing is probably the best starting point. This kind of optimization can help us with capacity planning, infrastructure cost optimization, and reducing the risks of resource saturation.
  2. Macro-optimization aimed at improving user experience. Here we want our users to have the best experience; therefore, in the OLTP context, we should focus on average latencies, with a goal to reduce them. Therefore, we are going to use dM/dc – the number of seconds (or milliseconds) each query lasts on average. If in the previous type of optimization we would like to see the Top-N query groups ordered by dM/dt values (measured in seconds/second) in our monitoring system, here we want the Top-N ordered by avg. latency (measured in seconds). Usually, this gives us a very different set of queries – perhaps ones having lower QPS. But these are the queries with the worst latencies, and these are the queries that annoy our users the most. In some cases, in this type of analysis, we might want to exclude query groups with low QPS (e.g., those having QPS < 1 call/sec) or exclude specific parts of the workload such as data dump activities that inevitably have long latencies.
  3. Macro-optimization aimed at balancing the workload. This is a less common type of optimization, but this is exactly where %M plays its role. While developing our application, we might want, from time to time, to check the Top-N percentages on total_exec_time + total_plan_time and identify query groups holding the biggest portion – exactly as we discussed above.

Bonus: podcast episodes

Related Postgres.FM episodes:

