Monday, June 14, 2021

PG_SQUEEZE – SHRINKS TABLES BETTER THAN VACUUM

 pg_squeeze is an open source PostgreSQL extension that enables automatic and transparent fixing of bloated tables – without extensive table locking. The process fully works in the background.

KEY BENEFITS

  • More aggressive space reduction
  • Close-to lock-free table reorganization
  • Ability to move tables between tablespaces (without downtime)
  • Ability to cluster tables (without downtime)
  • Builtin advanced scheduling
  • Fully Open Source

pg_squeeze is not a replacement for autovacuum – it is an add-on to perform cleanups even better

PostgreSQL uses a mechanism called “MVCC” (Multi Version Concurrency Control) to store data. As the name already suggests, MVCC will hold various versions of a row to support as much concurrency as possible. At some point, those additional rows must be removed from the storage system and this is where VACUUM comes along.

Unlike with built-in commands “VACUUM FULL” or “CLUSTER”, with “pg_squeeze” there are no extended periods of full-table locking, and consequently reads and writes are not blocked during the rebuild. Also the rebuilding process is very efficient due to a novel approach of using transaction log files and logical decoding (instead of triggers) to capture possible data changes to the table being rebuilt. First of all, this helps save disk space and IO throughput, but even more importantly, it enables very short locking times, making it a perfect fit for mission-critical OLTP systems.

How does pg_squeeze work?

  1. The extension is implemented as a background-worker process (a framework introduced in version 9.4) that periodically monitors user-defined tables.
  2. When it detects that a table has exceeded the “bloat threshold”,
  3. it kicks in and rebuilds that table automatically.

 

 

Rebuilding happens concurrently in the background with minimal storage and computational overhead due to the use of Postgres’ built-in replication slots together with logical decoding to extract possible table changes happening during the rebuild from XLOG. Bloat threshold is of course configurable and the bloat ratio calculation is based on the Free Space Map (taking FILLFACTOR also into account) or under certain conditions on the “pgstattuple” extension when it’s available. Additionally, many customization parameters like “minimum table size” can be set with non-suitable tables being ignored. Moreover, reordering using an index or moving the table or indexes to a new tablespace are possible.

PG_SQUEEZE 1.3 DOWNLOAD

The newest version of pg_squeeze can be downloaded from github.com.

DOWNLOAD PG_SQUEEZE >>

INTRODUCING PG_SQUEEZE – A POSTGRESQL EXTENSION TO AUTO-REBUILD BLOATED TABLES

One of the few areas where out-of-the-box functionality by PostgreSQL is not 100% satisfying, is the “bloat problem”. Combating bloat, or just trying to ensure that your table data is physically ordered according to some column(s) (a.k.a. clustering) required accepting some inconvenient compromises until now. Extended periods of full table locking (no read or write activities) with built-in VACUUM FULL or CLUSTER commands or involving third party tooling, usually meaning “pg_repack”, were necessary. “pg_repack” offers good benefits like a lot smaller full-lock time, ordering by specific columns, but needs a bit of fiddling around – installing the extension, identifying bloated tables, running their command line client, and for larger tables it could also temporarily grow the disk size unnecessarily as it uses triggers to store the modifications made to tables during the pre-building phase.

To alleviate the situation, on behalf of the Cybertec development team, I’m really glad to announce a new bloat-painkiller called “pg_squeeze”! I myself, with my stereotypically calm Nordic temper, don’t usually get too excited by a piece of software, but this time as a day-to-day PostgreSQL user I must say that I’m really impressed – absolutely great piece of work! And also I wonder why nothing like that came about earlier.

What does pg_squeeze do exactly?

pg_squeeze is a PostgreSQL extension implementing a background worker process (one per DB) that periodically monitors tables defined by the user and when it detects a table crossing the “bloat threshold”it kicks in and rebuilds that table automatically! Rebuilding happens concurrently in the background with minimal storage and computational overhead due to using Postgres’ built-in replication slots together with logical decoding to extract possible table changes happening during the rebuild from XLOG. Bloat threshold is configurable and bloat ratio calculation is based on the free space map or under certain conditions based on concepts of “pgstattuple” extension. Additionally minimum table size can be set, with smaller tables being ignored. Additional requirement for the table to be considered for rebuilding is that they need to have a primary key or unique constraint defined.

Sample setup

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
# Download and install the extension
 
git clone …
 
export PGCONFIG=/usr/bin/pg_config       # point it to your desired Postgres installation
 
make && sudo make install
 
cat <<-EOF>> testcluster/postgresql.conf
 
wal_level=logical
 
max_replication_slots = 1
 
shared_preload_libraries = 'pg_squeeze'
 
EOF
 
pg_ctl -D  testcluster restart
 
psql -c “CREATE EXTENSION pg_squeeze”
 
psql -c “INSERT INTO squeeze.tables \
 
(tabschema, tabname, first_check) \
 
VALUES ('public', 'foo', now());”
 
psql -c “SELECT squeeze.start_worker()”    # PS! not needed when we define the list of “squeezed”
 
# databases in postgresql.conf

Details

In addition to the above mentioned option to list databases and tables taking part in the auto-rebuild, also following “rebuild decision” aspects can be configured for every table by adjusting values in the “squeeze.tables” table. NB! Only this table and start/stop_worker() calls are meant to be the only “user interface” provided for the extension, with other tables/functions meant for internal use (although it is possible to launcher).

  • Bloat threshold in percentage (“free_space_extra” param), defaulting to 50%
  • Minimum disk space in megabytes the table must occupy to be eligible for processing
  • An index name to physically order tuples according to keys of that index
  • New tablespace for table/indexes can be specified
  • Minimum time between two rebuilds (safety)

Additionallyy provided on global (database) level:

  • squeeze.log – table storing rebuild events and their durations to see which tables get bloated the most.
  • squeeze.errors – table contains errors that happened during squeezing. Normally errors should only be of type “DDL being performed” (adding of a new column for example) or “max_xlock_time” reached.
  • squeeze.max_xlock_time – parameter specifying maximum exclusive lock holding time during the “table switch”.

Also due to using logical decoding, this means that only newer versions of PostgreSQL starting from version 9.4 can be taken into consideration.

Currently “pg_squeeze” only supports the latest PostgreSQL 9.6, but we’re thinking of backporting it also to 9.5.

Configuring automatic start of “squeeze” workers

If you want the background worker to start automatically during startup of the whole PostgreSQL cluster, add entries like this to “postgresql.conf”, selecting appropriate databases and a role for the worker. More explanations on that are available from the README.

1
2
3
squeeze.worker_autostart = 'mydb1 mydb2'
 
squeeze.worker_role = postgres

Grab the code here a try it out! Questions and feedback welcome.