Introducing pg_squeeze: auto-rebuild bloated tables

What does pg_squeeze do exactly?

pg_squeeze is a PostgreSQL extension implementing a background worker process (one per DB) that periodically monitors tables defined by the user and when it detects a table crossing the “bloat threshold”, it kicks in and rebuilds that table automatically! Rebuilding happens concurrently in the background with minimal storage and computational overhead due to using Postgres’ built-in replication slots together with logical decoding to extract possible table changes happening during the rebuild from XLOG. Bloat threshold is configurable and bloat ratio calculation is based on the free space map or under certain conditions based on concepts of “pgstattuple” extension. Additionally minimum table size can be set, with smaller tables being ignored. Additional requirement for the table to be considered for rebuilding is that they need to have a primary key or unique constraint defined.

Sample setup

# Download and install the extension

git clone …

export PGCONFIG=/usr/bin/pg_config       # point it to your desired Postgres installation

make && sudo make install

cat <<-EOF>> testcluster/postgresql.conf

wal_level=logical

max_replication_slots = 1

shared_preload_libraries = 'pg_squeeze'

EOF

pg_ctl -D  testcluster restart

psql -c “CREATE EXTENSION pg_squeeze”

psql -c “INSERT INTO squeeze.tables 

(tabschema, tabname, first_check) 

VALUES ('public', 'foo', now());”

psql -c “SELECT squeeze.start_worker()”    # PS! not needed when we define the list of “squeezed”

# databases in postgresql.conf

# Download and install the extension

git clone …

export PGCONFIG=/usr/bin/pg_config # point it to your desired Postgres installation

make && sudo make install

cat <<-EOF>> testcluster/postgresql.conf

wal_level=logical

max_replication_slots = 1

shared_preload_libraries = 'pg_squeeze'

EOF

pg_ctl -D testcluster restart

psql -c “CREATE EXTENSION pg_squeeze”

psql -c “INSERT INTO squeeze.tables

(tabschema, tabname, first_check)

VALUES ('public', 'foo', now());”

psql -c “SELECT squeeze.start_worker()” # PS! not needed when we define the list of “squeezed”

# databases in postgresql.conf

Details

In addition to the above-mentioned option to list databases and tables taking part in the auto-rebuild, also following “rebuild decision” aspects can be configured for every table by adjusting values in the “squeeze.tables” table. NB! Only this table and start/stop_worker() calls are meant to be the only “user interface” provided for the extension, with other tables/functions meant for internal use (although it is possible to launcher).

Bloat threshold in percentage (“free_space_extra” param), defaulting to 50%

Minimum disk space in megabytes the table must occupy to be eligible for processing

An index name to physically order tuples according to keys of that index

New tablespace for table/indexes can be specified

Minimum time between two rebuilds (safety)

Additionally provided on the global (database) level:

squeeze.log – table storing rebuild events and their durations to see which tables get bloated the most.

squeeze.errors - table contains errors that happened during squeezing. Normally errors should only be of type “DDL being performed” (adding of a new column for example) or “max_xlock_time” reached.

squeeze.max_xlock_time – parameter specifying maximum exclusive lock holding time during the “table switch”.

Also due to using logical decoding, this means that only newer versions of PostgreSQL starting from version 9.4 can be taken into consideration.

Currently “pg_squeeze” supports the latest PostgreSQL 16, see the Github page for the latest info.

Configuring automatic start of “squeeze” workers

If you want the background worker to start automatically during startup of the whole PostgreSQL cluster, add entries like this to “postgresql.conf”, selecting appropriate databases and a role for the worker. More explanations on that are available from the README.

squeeze.worker_autostart = 'mydb1 mydb2'

squeeze.worker_role = postgres

squeeze.worker_autostart = 'mydb1 mydb2'

squeeze.worker_role = postgres

21 responses to “Introducing pg_squeeze - a PostgreSQL extension to auto-rebuild bloated tables”

Vincent dP says:

October 28, 2016 at 10:00 am

The perspective of automatic debloat is very tempting but also rather scary : considering the ressource usage of pg_repack it's not an operation that I start lightly. Is pg_squeeze's debloating technique more efficient than pg_repack's ?

Reply
- Hans-Jürgen Schönig says:
  
  October 28, 2016 at 4:13 pm
  
  well, the point is that you can register tables, which are supposed to be cleaned automatically. it is not meant to brutally work through all tables regardless of your requirements. you can control that.
  the advantage over pg_repack is that pg_squeeze runs directly in the backend. it decides when to shrink a table. it simply gives you more automation than an external tool.
  
  Reply
man_brain says:

October 28, 2016 at 3:58 pm

> git clone …
I think, repo address is missing.

Reply
- Hans-Jürgen Schönig says:
  
  October 28, 2016 at 4:11 pm
  
  we will fire it up soon
  
  Reply
GeorgeDeglin says:

October 29, 2016 at 1:16 am

Have you considered using the pgcompact (https://github.com/grayhemp/pgtoolkit#pgcompact) bloat removal technique instead of repacking? We've had trouble with repacking on busy tables related to locking.

Reply
xedsdsss says:

October 29, 2016 at 1:53 pm

One thing missing to stop me from using it: being able to specify the window(s) when it can run. I want to control when it can run so I can avoid congestion during normal business hours.

Also, I hope it doesn't do what pg_repack does: act as a DBA and kill other PIDs that conflict with it getting locks!

Reply
- Kaarel says:
  
  October 31, 2016 at 9:21 am
  
  Thanks for the "user specified squeezing window" idea! Would be a nice feature indeed and we'll look into it.
  About the 2nd question - nope, other PIDs won't be killed, squeeze waits politely for a lock as a normal session.
  
  Reply
  - Michael says:
    
    November 1, 2016 at 4:39 pm
    
    Great about not doing it the pg_repack way! Regarding this "wait", is it something that can be configured as a parm like pg_repack?
    
    Reply
GeorgeDeglin says:

November 1, 2016 at 12:00 am

This is great!

Have you considered using the pgcompact bloat removal technique instead of repacking? We've had issues with repack's locking behavior on very busy tables.

Reply
knorke says:

March 16, 2017 at 7:34 am

pg_squeeze seemed to be the solution for all my problems with PG. But there ain’t no such thing as a free lunch. 🙁

Is it possible that pg_squeeze conflicts with UPSERT statements (INSERT … ON CONFLICT UPDATE)?

ERROR: ON CONFLICT is not supported on table "mytablename" used as a catalog table at character 147

Is there any way around this problem? The error message occurs exactly at the time of the first (scheduled) squeeze worker run. 🙁

Reply
- Antonín Houska says:
  
  March 24, 2017 at 6:48 pm
  
  Yes, I think the error message is related. pg_squeeze temporarily sets the user_catalog_table storage option on the table being squeezed, so that data changes can be captured while creation of new storage is in progress. This option ensures that VACUUM honors replication slots when trying to process the table, so the concurrent data changes are not deleted before pg_squeeze can decode them.
  
  I'll think if I should propose a change in PostgregreSQL core so that the error is only raised for regular catalog tables, as opposed to the "user catalogs".
  
  Otherwise the only thing you can do is to (try to) change the pg_squeeze schedule so that the conflict is less likely.
  
  Reply
  - knorke says:
    
    March 27, 2017 at 8:51 am
    
    Sadly, I have UPSERTs 24/7 and thus cannot find a "safe" time range for pg_squeeze to work.
    
    I haven't worked with logical decoding and user_catalog_table storage option yet and will do some RTFM as soon as I find time. But maybe you can clarify one little thing right now: Is the problem tied to usage of ON CONFLICT and could an "SELECT => IF FOUND THEN UPDATE ELSE INSERT;" (in application code) instead of ON CONFLICT work around the problem? At least in theory?
    My UPSERT has to be atomic so UPDATE or INSERT after testing with SELECT is not an option. I'm just asking to learn something, still searching for workarounds.
    
    Thanks in advance.
    
    Reply
    - Antonín Houska says:
      
      March 28, 2017 at 4:18 pm
      
      I think the error you reported is really specific to the INSERT ... ON CONFLICT command. So the workaround you propose should avoid the problem. However I'm not sure this good enough in terms of concurrency: things may change between your SELECT and the actual INSERT / UPDATE. So if you want to INSERT and another transaction does it in between, you'll either insert a duplicate row or violate unique constraint.
      
      Reply
      - knorke says:
        
        January 15, 2019 at 11:50 am
        
        Just FYI: I'm testing pg_squeeze (master branch) with PG11.1 and with the same UPSERT command from above. The table schema changed a bit but not much. Pg_squeeze seems to work nicely, even with UPSERTs on hash-partitioned tables. I like! Thanks! 🙂
      - Pavlo Golub says:
        
        January 15, 2019 at 12:30 pm
        
        Great! Thanks for the information.
      - Antonín Houska says:
        
        January 16, 2019 at 6:30 am
        
        I'm glad the new version helps. Note that the tradeoff is that with the new approach postgres does not make WAL available for archiving until the current call of squeeze_table() has completed. I'm about to check if this can be changed, but that's about PG core, not pg_squeeze. BTW, I recommend you to use the 1.1 branch instead of master (they should be identical right now, but can become different in the future).
      - knorke says:
        
        January 18, 2019 at 1:57 pm
        
        Now I found a bug and tracked it down. Issue 17
Dr. Yan says:

April 26, 2018 at 2:51 am

Does pg_squeeze support postgres 10?

Reply
- Antonín Houska says:
  
  June 18, 2018 at 6:27 am
  
  Yes, it does. See https://github.com/cybertec-postgresql/pg_squeeze/releases
  
  Reply
Steven Chang says:

May 1, 2020 at 1:11 am

Does this extension support AWS Postgres Aurora ??

Reply
Jon Nani says:

October 4, 2022 at 9:50 pm

I tried to run pg_squeeze on a partitioned table and got this error " ERROR: initial snapshot too large".

Reply

Introducing pg_squeeze - a PostgreSQL extension to auto-rebuild bloated tables

What does pg_squeeze do exactly?

Sample setup

Details

Configuring automatic start of “squeeze” workers

21 responses to “Introducing pg_squeeze - a PostgreSQL extension to auto-rebuild bloated tables”

Leave a Reply Cancel reply

CYBERTEC Guest

Blog Tags

NEWSLETTER

Articles by our PostgreSQL Experts