By Kaarel Moppel - Auto-rebuild bloated tables with pg_squeeze: One of the few areas where PostgreSQL's out-of-the-box functionality is not 100% satisfying is the "bloat problem". Until now, combating bloat, or just ensuring that your table data is physically ordered according to some column(s) (a.k.a. clustering), required accepting some inconvenient compromises: extended periods of full table locking (no read or write activity) with the built-in VACUUM FULL or CLUSTER commands, or third-party tooling, usually meaning "pg_repack". "pg_repack" offers good benefits, like a much shorter full-lock time and ordering by specific columns, but needs a bit of fiddling around - installing the extension, identifying bloated tables, and running its command-line client. For larger tables it can also temporarily grow the disk size unnecessarily, as it uses triggers to store the modifications made to tables during the rebuild phase.
To alleviate the situation, on behalf of the Cybertec development team, I'm really glad to announce a new bloat-painkiller called "pg_squeeze"! I myself, with my stereotypically calm Nordic temper, don't usually get too excited about a piece of software, but this time, as a day-to-day PostgreSQL user, I must say that I'm really impressed - an absolutely great piece of work! I also wonder why nothing like it came about earlier.
pg_squeeze is a PostgreSQL extension implementing a background worker process (one per DB) that periodically monitors tables defined by the user, and when it detects a table crossing the "bloat threshold", it kicks in and rebuilds that table automatically! Rebuilding happens concurrently in the background with minimal storage and computational overhead, because Postgres' built-in replication slots together with logical decoding are used to extract table changes happening during the rebuild from the XLOG. The bloat threshold is configurable, and the bloat ratio calculation is based on the free space map or, under certain conditions, on the concepts of the "pgstattuple" extension. Additionally, a minimum table size can be set, with smaller tables being ignored. A further requirement for a table to be considered for rebuilding is that it has a primary key or unique constraint defined.
```shell
# Download and install the extension
git clone …
export PGCONFIG=/usr/bin/pg_config   # point it to your desired Postgres installation
make && sudo make install

cat <<-EOF >> testcluster/postgresql.conf
wal_level = logical
max_replication_slots = 1
shared_preload_libraries = 'pg_squeeze'
EOF

pg_ctl -D testcluster restart

psql -c "CREATE EXTENSION pg_squeeze"
psql -c "INSERT INTO squeeze.tables (tabschema, tabname, first_check) VALUES ('public', 'foo', now());"
# PS! not needed when we define the list of "squeezed" databases in postgresql.conf
psql -c "SELECT squeeze.start_worker()"
```
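The "bloat threshold" logic described above can also be approximated by hand. A minimal sketch using the "pgstattuple" extension's sampling-based function (the table name "foo" continues the example; interpreting a particular percentage as "bloated" is up to you):

```sql
-- Requires: CREATE EXTENSION pgstattuple;
-- approx_free_percent estimates the reclaimable share of the table's storage,
-- dead_tuple_percent the share occupied by dead (not yet vacuumed) rows
SELECT approx_free_percent, dead_tuple_percent
FROM pgstattuple_approx('public.foo');
```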
In addition to the above-mentioned option of listing the databases and tables taking part in the auto-rebuild, the following "rebuild decision" aspects can be configured for every table by adjusting values in the "squeeze.tables" table. NB! This table and the start/stop_worker() calls are meant to be the only "user interface" of the extension; the other tables/functions are meant for internal use (although it is possible to launch a table rebuild manually).
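As a sketch of such per-table tuning (the column names and units follow the extension's README at the time of writing; treat the exact names and values as assumptions to be checked against your installed version):

```sql
-- Hypothetical tuning of the earlier 'foo' registration:
-- only rebuild once the estimated extra free space exceeds 30 percent,
-- and ignore the table entirely while it is smaller than 8 MB
UPDATE squeeze.tables
SET free_space_extra = 30,
    min_size = 8
WHERE tabschema = 'public' AND tabname = 'foo';
```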
Some additional options are provided on the global (database) level.
Also, due to the use of logical decoding, only newer versions of PostgreSQL, starting from version 9.4, can be taken into consideration.
Currently "pg_squeeze" supports the latest PostgreSQL 16; see the GitHub page for the latest info.
If you want the background worker to start automatically during startup of the whole PostgreSQL cluster, add entries like this to “postgresql.conf”, selecting appropriate databases and a role for the worker. More explanations on that are available from the README.
```shell
squeeze.worker_autostart = 'mydb1 mydb2'
squeeze.worker_role = postgres
```
Grab the code here and try it out! Questions and feedback welcome.
In case you need any assistance, please feel free to contact us.
The prospect of automatic debloating is very tempting but also rather scary: considering the resource usage of pg_repack, it's not an operation that I start lightly. Is pg_squeeze's debloating technique more efficient than pg_repack's?
Well, the point is that you can register the tables which are supposed to be cleaned automatically. It is not meant to brutally work through all tables regardless of your requirements; you can control that.
The advantage over pg_repack is that pg_squeeze runs directly in the backend. It decides when to shrink a table. It simply gives you more automation than an external tool.
> git clone …
I think the repo address is missing.
we will fire it up soon
Have you considered using the pgcompact (https://github.com/grayhemp/pgtoolkit#pgcompact) bloat removal technique instead of repacking? We've had trouble with repacking on busy tables related to locking.
One thing missing to stop me from using it: being able to specify the window(s) when it can run. I want to control when it can run so I can avoid congestion during normal business hours.
Also, I hope it doesn't do what pg_repack does: act as a DBA and kill other PIDs that conflict with it getting locks!
Thanks for the "user specified squeezing window" idea! Would be a nice feature indeed and we'll look into it.
About the 2nd question - nope, other PIDs won't be killed, squeeze waits politely for a lock as a normal session.
Great about not doing it the pg_repack way! Regarding this "wait", is it something that can be configured as a parameter, like in pg_repack?
This is great!
pg_squeeze seemed to be the solution for all my problems with PG. But there ain’t no such thing as a free lunch. 🙁
Is it possible that pg_squeeze conflicts with UPSERT statements (INSERT … ON CONFLICT UPDATE)?
ERROR: ON CONFLICT is not supported on table "mytablename" used as a catalog table at character 147
Is there any way around this problem? The error message occurs exactly at the time of the first (scheduled) squeeze worker run. 🙁
Yes, I think the error message is related. pg_squeeze temporarily sets the user_catalog_table storage option on the table being squeezed, so that data changes can be captured while creation of new storage is in progress. This option ensures that VACUUM honors replication slots when trying to process the table, so the concurrent data changes are not deleted before pg_squeeze can decode them.
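The storage option mentioned here is an ordinary table reloption, so the temporary state pg_squeeze creates during a rebuild can be reproduced and inspected manually; a sketch for illustration only (the table name is the one from the error message above):

```sql
-- What pg_squeeze sets internally for the duration of a rebuild.
-- Do not leave this enabled on a production table: while it is set,
-- INSERT ... ON CONFLICT against the table fails as reported above.
ALTER TABLE mytablename SET (user_catalog_table = true);

-- ... logical decoding may now access the table as a "user catalog",
-- and VACUUM honors replication slots when processing it ...

ALTER TABLE mytablename RESET (user_catalog_table);
```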
I'll think about whether I should propose a change in PostgreSQL core so that the error is only raised for regular catalog tables, as opposed to the "user catalogs".
Otherwise the only thing you can do is to (try to) change the pg_squeeze schedule so that the conflict is less likely.
Sadly, I have UPSERTs 24/7 and thus cannot find a "safe" time range for pg_squeeze to work.
I haven't worked with logical decoding and the user_catalog_table storage option yet and will do some RTFM as soon as I find time. But maybe you can clarify one little thing right now: is the problem tied to the usage of ON CONFLICT, and could a "SELECT => IF FOUND THEN UPDATE ELSE INSERT;" (in application code) instead of ON CONFLICT work around the problem? At least in theory?
My UPSERT has to be atomic so UPDATE or INSERT after testing with SELECT is not an option. I'm just asking to learn something, still searching for workarounds.
Thanks in advance.
I think the error you reported is really specific to the INSERT ... ON CONFLICT command, so the workaround you propose should avoid the problem. However, I'm not sure this is good enough in terms of concurrency: things may change between your SELECT and the actual INSERT / UPDATE. So if you want to INSERT and another transaction does it in between, you'll either insert a duplicate row or violate a unique constraint.
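The race described in this reply can be sketched in plain SQL; under the default READ COMMITTED isolation, two sessions interleaving like this both take the INSERT branch (table and key names are illustrative):

```sql
-- Session A                          -- Session B
SELECT 1 FROM t WHERE k = 42;         SELECT 1 FROM t WHERE k = 42;
-- both SELECTs find nothing, so both sessions decide to insert
INSERT INTO t VALUES (42, 'a');       INSERT INTO t VALUES (42, 'b');
-- with a unique constraint on k, one INSERT now fails;
-- without one, a duplicate row is created.
-- INSERT ... ON CONFLICT exists precisely to make this step atomic.
```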
Just FYI: I'm testing pg_squeeze (master branch) with PG11.1 and with the same UPSERT command from above. The table schema changed a bit but not much. Pg_squeeze seems to work nicely, even with UPSERTs on hash-partitioned tables. I like! Thanks! 🙂
Great! Thanks for the information.
I'm glad the new version helps. Note that the tradeoff is that with the new approach Postgres does not make WAL available for archiving until the current call of squeeze_table() has completed. I'm about to check whether this can be changed, but that's about PG core, not pg_squeeze. BTW, I recommend using the 1.1 branch instead of master (they should be identical right now, but may diverge in the future).
Now I found a bug and tracked it down. Issue 17
Does pg_squeeze support postgres 10?
Yes, it does. See https://github.com/cybertec-postgresql/pg_squeeze/releases
Does this extension support AWS Postgres Aurora?
I tried to run pg_squeeze on a partitioned table and got this error " ERROR: initial snapshot too large".