BLOB cleanup in PostgreSQL

12.2020

PostgreSQL offers a nice BLOB interface which is widely used. However, recently we came across problems faced by various customers, and it makes sense to reflect a bit and figure out how PostgreSQL handles BLOBs - and especially BLOB cleanup.

Using the PostgreSQL BLOB interface

In PostgreSQL, you can use various means to store binary data. The simplest form is definitely to make use of the “bytea” (= byte array) data type. In this case a binary field is basically seen as part of a row.
Here is how it works:
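A minimal sketch, assuming a hypothetical table called t_image (the table and column names are illustrative only):

```sql
-- Hypothetical table: the binary value is stored as part of the row
CREATE TABLE t_image (id int, name text, image bytea);

-- Insert four bytes using the hex input format
INSERT INTO t_image VALUES (1, 'logo', '\xdeadbeef'::bytea);

-- The bytea column is queried like any other column
SELECT id, name, image FROM t_image;
```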

As you can see, this is a normal column and it can be used just like a normal column. The only thing worth mentioning is the encoding one has to use on the SQL level. PostgreSQL uses a variable to configure this behavior:

bytea_output

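The bytea_output variable accepts two values: “hex” (the default) tells PostgreSQL to send the data in hex format, while “escape” makes it send the traditional escape format, in which non-printable bytes appear as octal escape sequences. The setting only affects output; on input, PostgreSQL accepts both formats regardless of the setting. There is not much the application has to worry about here, apart from the maximum size of 1 GB per field.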
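The difference between the two settings can be seen in a short session. This is only a sketch; the literals are arbitrary examples:

```sql
SET bytea_output = 'hex';
SELECT 'abc'::bytea;       -- displayed as \x616263

SET bytea_output = 'escape';
SELECT 'abc'::bytea;       -- printable bytes are shown as-is: abc
SELECT '\x00ff'::bytea;    -- non-printable bytes become octal escapes: \000\377

-- Go back to the configured default
RESET bytea_output;
```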
However, PostgreSQL has a second interface to handle binary data: The BLOB interface. Let me show an example of this powerful tool in action:
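A minimal sketch of such an import, using the server-side lo_import function. Note that the file is read on the database server, and that calling this function requires superuser privileges by default:

```sql
-- Reads /etc/hosts on the database server and stores a copy as a large object.
-- The return value is the OID of the new object; the actual number will differ
-- from system to system.
SELECT lo_import('/etc/hosts');
```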

In this case, the content of /etc/hosts has been imported into the database. Note that PostgreSQL has a copy of the data - it is not a link to the filesystem. What is noteworthy here is that the database will return the OID (object ID) of the new entry. To keep track of these OIDs, some developers do the following:
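One way this could look, assuming a hypothetical tracking table called t_file (names are illustrative only):

```sql
-- Hypothetical table that remembers which large object belongs to which entry
CREATE TABLE t_file (id int, name text, object_id oid);

-- Import the file and store the returned OID in the same statement
INSERT INTO t_file VALUES (1, 'some_name', lo_import('/etc/hosts'));

SELECT * FROM t_file;
```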

This is absolutely fine, unless you do something like below:
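A sketch of the problematic pattern, again assuming the hypothetical t_file table with an object_id column of type oid:

```sql
-- Hypothetical tracking table (created here only so the example is self-contained)
CREATE TABLE IF NOT EXISTS t_file (id int, name text, object_id oid);

-- This removes only the row. The large object that object_id pointed to
-- is NOT removed and stays in the database.
DELETE FROM t_file WHERE id = 1;
```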

The problem is that the object ID has been forgotten, but the object itself is still there. pg_largeobject is the system table in charge of storing the binary data inside PostgreSQL. All lo_* functions simply talk to this system table in order to handle these things:
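The leftover data can be inspected directly in the catalogs. This is a sketch; reading pg_largeobject normally requires superuser privileges, while pg_largeobject_metadata is readable by everybody:

```sql
-- The data is stored in chunks ("pages"), one row per chunk
SELECT loid, pageno, length(data) AS bytes
FROM pg_largeobject
ORDER BY loid, pageno;

-- One row per large object, including its owner
SELECT oid, lomowner
FROM pg_largeobject_metadata;
```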

Why is that a problem? The reason is simple: Your database will grow and the number of “dead objects” will accumulate.

Therefore the correct way to kill a BLOB entry is as follows:
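A minimal sketch, assuming the hypothetical t_file table from before. The important part is that lo_unlink is called before the reference to the object is thrown away:

```sql
-- Hypothetical tracking table and one entry to clean up
CREATE TABLE IF NOT EXISTS t_file (id int, name text, object_id oid);
INSERT INTO t_file VALUES (2, 'hosts', lo_import('/etc/hosts'));

BEGIN;
-- Remove the large object itself; lo_unlink returns 1 on success
SELECT lo_unlink(object_id) FROM t_file WHERE id = 2;
-- Only then remove the row that referenced it
DELETE FROM t_file WHERE id = 2;
COMMIT;
```

Running both steps in one transaction means that either the object and its row disappear together, or neither does.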

If you forget to unlink the object, you will suffer in the long run - and we have often seen that happen. It is a major issue if you are using the BLOB interface.

vacuumlo: Cleaning up dead large objects

However, how can you fix the problem once you have accumulated thousands, or maybe millions, of dead BLOBs? The answer is a command line tool called “vacuumlo”.
Let us first create a dead entry:
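A dead entry is simply a large object whose OID is not stored anywhere. A sketch:

```sql
-- The OID is returned to the client but not saved in any table,
-- so nothing in the database references this object
SELECT lo_import('/etc/hosts');
```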

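Then we can run vacuumlo from any client, for example as vacuumlo -v test for a database called test. The -v flag prints what the tool is doing, and -n performs a dry run that only reports what would be removed.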

In our case, two dead objects have been killed by the tool. vacuumlo is the easiest way to clean out orphaned objects.
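Conceptually, vacuumlo collects the OIDs of all large objects, discards those that are still referenced from columns of type oid or lo, and unlinks whatever remains. The following query is only a rough manual approximation of that idea, under the assumption that the hypothetical t_file.object_id column is the only place where large object OIDs are stored:

```sql
-- Hypothetical tracking table (created here only so the example is self-contained)
CREATE TABLE IF NOT EXISTS t_file (id int, name text, object_id oid);

-- List large objects that no row in t_file points to
SELECT m.oid AS orphaned_oid
FROM pg_largeobject_metadata AS m
WHERE NOT EXISTS (
    SELECT 1
    FROM t_file AS f
    WHERE f.object_id = m.oid
);
```

This is useful as a check before running the real tool, which inspects every oid and lo column in the database rather than a single table.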

Additional functionality

However, there is more than just lo_import and lo_unlink. PostgreSQL offers a variety of functions to handle large objects in a nice way:
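The available functions can be listed from the catalog, and a few of them are shown in action below. This is a sketch; the content and offsets are arbitrary examples:

```sql
-- List all functions whose name starts with lo_
SELECT proname, pg_get_function_arguments(oid) AS arguments
FROM pg_proc
WHERE proname LIKE 'lo\_%'
ORDER BY proname;

-- Create, modify, read and remove a large object without touching the filesystem
DO $$
DECLARE
    v_oid oid;
BEGIN
    -- Passing 0 lets PostgreSQL pick a free OID
    v_oid := lo_from_bytea(0, 'hello world'::bytea);

    -- Overwrite 5 bytes starting at offset 6
    PERFORM lo_put(v_oid, 6, 'there'::bytea);

    -- Reports 'hello there'
    RAISE NOTICE 'content: %', convert_from(lo_get(v_oid), 'UTF8');

    -- Clean up so that no dead object is left behind
    PERFORM lo_unlink(v_oid);
END;
$$;
```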

There are two more functions which don’t follow the naming convention for historic reasons: loread and lowrite:
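Both functions work on a descriptor returned by lo_open, which is only valid inside the current transaction. A minimal sketch; the numeric mode flags are the values of INV_WRITE (131072) and INV_READ (262144):

```sql
DO $$
DECLARE
    v_oid oid;
    v_fd  integer;
BEGIN
    -- Create an empty large object and open it for reading and writing
    v_oid := lo_create(0);
    v_fd  := lo_open(v_oid, 131072 | 262144);

    -- lowrite(fd, data) writes at the current position
    PERFORM lowrite(v_fd, 'some data'::bytea);

    -- Jump back to the start (whence 0 = SEEK_SET) before reading
    PERFORM lo_lseek(v_fd, 0, 0);

    -- loread(fd, len) returns up to len bytes from the current position
    RAISE NOTICE 'content: %', convert_from(loread(v_fd, 100), 'UTF8');

    PERFORM lo_close(v_fd);
    PERFORM lo_unlink(v_oid);
END;
$$;
```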

They are functions whose names cannot easily be changed anymore. However, it is worth noting that they exist.

Finally …

The PostgreSQL BLOB interface is really useful and can be used for many things. The beauty is that it is fully transactional and therefore binary content and metadata cannot go out of sync anymore.

If you want to learn more about triggers to enforce constraints in PostgreSQL, we recommend you check out our blog post written by Laurenz Albe. It will shed some light on this important topic.

 



5 responses to “BLOB cleanup in PostgreSQL”

  1. The BLOB system of PostgreSQL is quite horrible and I think should not be used. As noted, there is no referential integrity for blob values -- you can accidentally delete blobs that are still referenced, and also forget to delete them, for which the vacuumlo command is a workaround. Especially, nobody should use the shown lo_import() function, which requires superuser permissions and reads files on the database server's file system, both bad practices.

    In most cases it's far better to create a table with a bytea field to store your binary data, and use foreign key constraints to ensure referential integrity. For files larger than 1 GB one should consider an external data store.

  2. My table with bytea column is bloating.
    Every day my backup grows, as if deleted records were being backed up.
    Is it possible that deleted records are being backed up by pg_dump?
