CYBERTEC PostgreSQL Logo

Introducing pg_squeeze - a PostgreSQL extension to auto-rebuild bloated tables

10.2016 / Category: / Tags: |

By Kaarel Moppel - Auto-rebuild bloated tables with pg_squeeze: One of the few areas where out-of-the-box functionality by PostgreSQL is not 100% satisfying, is the “bloat problem”. Combating bloat, or just trying to ensure that your table data is physically ordered according to some column(s) (a.k.a. clustering) required accepting some inconvenient compromises until now. Extended periods of full table locking (no read or write activities) with built-in VACUUM FULL or CLUSTER commands or involving third party tooling, usually meaning “pg_repack”, were necessary. “pg_repack” offers good benefits like a lot smaller full-lock time, ordering by specific columns, but needs a bit of fiddling around - installing the extension, identifying bloated tables, running their command line client, and for larger tables it could also temporarily grow the disk size unnecessarily as it uses triggers to store the modifications made to tables during the pre-building phase.

To alleviate the situation, on behalf of the Cybertec development team, I’m really glad to announce a new bloat-painkiller called "pg_squeeze"! I myself, with my stereotypically calm Nordic temper, don’t usually get too excited by a piece of software, but this time as a day-to-day PostgreSQL user I must say that I’m really impressed - absolutely great piece of work! And also I wonder why nothing like that came about earlier.

What does pg_squeeze do exactly?

pg_squeeze is a PostgreSQL extension implementing a background worker process (one per DB) that periodically monitors tables defined by the user and when it detects a table crossing the “bloat threshold”, it kicks in and rebuilds that table automatically! Rebuilding happens concurrently in the background with minimal storage and computational overhead due to using Postgres’ built-in replication slots together with logical decoding to extract possible table changes happening during the rebuild from XLOG. Bloat threshold is configurable and bloat ratio calculation is based on the free space map or under certain conditions based on concepts of “pgstattuple” extension. Additionally minimum table size can be set, with smaller tables being ignored. Additional requirement for the table to be considered for rebuilding is that they need to have a primary key or unique constraint defined.

Sample setup

Details

In addition to the above-mentioned option to list databases and tables taking part in the auto-rebuild, also following “rebuild decision” aspects can be configured for every table by adjusting values in the “squeeze.tables” table. NB! Only this table and start/stop_worker() calls are meant to be the only “user interface” provided for the extension, with other tables/functions meant for internal use (although it is possible to launcher).

  • Bloat threshold in percentage (“free_space_extra” param), defaulting to 50%
  • Minimum disk space in megabytes the table must occupy to be eligible for processing
  • An index name to physically order tuples according to keys of that index
  • New tablespace for table/indexes can be specified
  • Minimum time between two rebuilds (safety)

Additionally provided on the global (database) level:

  • squeeze.log – table storing rebuild events and their durations to see which tables get bloated the most.
  • squeeze.errors - table contains errors that happened during squeezing. Normally errors should only be of type “DDL being performed” (adding of a new column for example) or “max_xlock_time” reached.
  • squeeze.max_xlock_time – parameter specifying maximum exclusive lock holding time during the “table switch”.

Also due to using logical decoding, this means that only newer versions of PostgreSQL starting from version 9.4 can be taken into consideration.

Currently “pg_squeeze” supports the latest PostgreSQL 16, see the Github page for the latest info.

Configuring automatic start of “squeeze” workers

If you want the background worker to start automatically during startup of the whole PostgreSQL cluster, add entries like this to “postgresql.conf”, selecting appropriate databases and a role for the worker. More explanations on that are available from the README.

Grab the code here and try it out! Questions and feedback welcome.

In case you need any assistance, please feel free to contact us.

21 responses to “Introducing pg_squeeze - a PostgreSQL extension to auto-rebuild bloated tables”

  1. The perspective of automatic debloat is very tempting but also rather scary : considering the ressource usage of pg_repack it's not an operation that I start lightly. Is pg_squeeze's debloating technique more efficient than pg_repack's ?

    • well, the point is that you can register tables, which are supposed to be cleaned automatically. it is not meant to brutally work through all tables regardless of your requirements. you can control that.
      the advantage over pg_repack is that pg_squeeze runs directly in the backend. it decides when to shrink a table. it simply gives you more automation than an external tool.

  2. One thing missing to stop me from using it: being able to specify the window(s) when it can run. I want to control when it can run so I can avoid congestion during normal business hours.

    Also, I hope it doesn't do what pg_repack does: act as a DBA and kill other PIDs that conflict with it getting locks!

    • Thanks for the "user specified squeezing window" idea! Would be a nice feature indeed and we'll look into it.
      About the 2nd question - nope, other PIDs won't be killed, squeeze waits politely for a lock as a normal session.

      • Great about not doing it the pg_repack way! Regarding this "wait", is it something that can be configured as a parm like pg_repack?

  3. This is great!

    Have you considered using the pgcompact bloat removal technique instead of repacking? We've had issues with repack's locking behavior on very busy tables.

  4. pg_squeeze seemed to be the solution for all my problems with PG. But there ain’t no such thing as a free lunch. 🙁

    Is it possible that pg_squeeze conflicts with UPSERT statements (INSERT … ON CONFLICT UPDATE)?

    ERROR: ON CONFLICT is not supported on table "mytablename" used as a catalog table at character 147

    Is there any way around this problem? The error message occurs exactly at the time of the first (scheduled) squeeze worker run. 🙁

    • Yes, I think the error message is related. pg_squeeze temporarily sets the user_catalog_table storage option on the table being squeezed, so that data changes can be captured while creation of new storage is in progress. This option ensures that VACUUM honors replication slots when trying to process the table, so the concurrent data changes are not deleted before pg_squeeze can decode them.

      I'll think if I should propose a change in PostgregreSQL core so that the error is only raised for regular catalog tables, as opposed to the "user catalogs".

      Otherwise the only thing you can do is to (try to) change the pg_squeeze schedule so that the conflict is less likely.

      • Sadly, I have UPSERTs 24/7 and thus cannot find a "safe" time range for pg_squeeze to work.

        I haven't worked with logical decoding and user_catalog_table storage option yet and will do some RTFM as soon as I find time. But maybe you can clarify one little thing right now: Is the problem tied to usage of ON CONFLICT and could an "SELECT => IF FOUND THEN UPDATE ELSE INSERT;" (in application code) instead of ON CONFLICT work around the problem? At least in theory?
        My UPSERT has to be atomic so UPDATE or INSERT after testing with SELECT is not an option. I'm just asking to learn something, still searching for workarounds.

        Thanks in advance.

        • I think the error you reported is really specific to the INSERT ... ON CONFLICT command. So the workaround you propose should avoid the problem. However I'm not sure this good enough in terms of concurrency: things may change between your SELECT and the actual INSERT / UPDATE. So if you want to INSERT and another transaction does it in between, you'll either insert a duplicate row or violate unique constraint.

          • Just FYI: I'm testing pg_squeeze (master branch) with PG11.1 and with the same UPSERT command from above. The table schema changed a bit but not much. Pg_squeeze seems to work nicely, even with UPSERTs on hash-partitioned tables. I like! Thanks! 🙂

          • I'm glad the new version helps. Note that the tradeoff is that with the new approach postgres does not make WAL available for archiving until the current call of squeeze_table() has completed. I'm about to check if this can be changed, but that's about PG core, not pg_squeeze. BTW, I recommend you to use the 1.1 branch instead of master (they should be identical right now, but can become different in the future).

  5. I tried to run pg_squeeze on a partitioned table and got this error " ERROR: initial snapshot too large".

Leave a Reply

Your email address will not be published. Required fields are marked *

CYBERTEC Logo white
CYBERTEC PostgreSQL International GmbH
Römerstraße 19
2752 Wöllersdorf
Austria

+43 (0) 2622 93022-0
office@cybertec.at

Get the newest PostgreSQL Info & Tools


    This site is protected by reCAPTCHA and the Google Privacy Policy & Terms of Service apply.

    ©
    2024
    CYBERTEC PostgreSQL International GmbH
    phone-handsetmagnifiercrosscross-circle
    linkedin facebook pinterest youtube rss twitter instagram facebook-blank rss-blank linkedin-blank pinterest youtube twitter instagram