Did you ever wonder why VACUUM does not make your PostgreSQL tables smaller? Did you ever wonder why VACUUM does not shrink data files? Well, maybe this is the article you have been looking for. The point is: usually, space is not returned to the operating system after a cleanup, and it is important to understand why. People often make wrong assumptions about the inner workings of VACUUM, so it makes sense to dig deeper and understand what's really going on.
This post reveals the most useful secrets about VACUUM.
To understand VACUUM in PostgreSQL, it is important to see how PostgreSQL handles visibility in the first place. The entire concept is based around a set of hidden columns which are part of the row. Here's how it works:
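The original snippet is missing in this version of the text. A minimal session along these lines, assuming a table called t_test with a single id column (the table name reappears later in the article; the exact transaction IDs will differ on your system), might look like this:

```sql
-- one column only, to keep things simple
test=# CREATE TABLE t_test (id int);
CREATE TABLE
-- first transaction inserts 3 rows
test=# INSERT INTO t_test VALUES (1), (2), (3);
INSERT 0 3
-- second transaction inserts 3 more rows
test=# INSERT INTO t_test VALUES (4), (5), (6);
INSERT 0 3
-- the hidden columns have to be selected explicitly
test=# SELECT xmin, xmax, cmin, cmax, id FROM t_test;
 xmin | xmax | cmin | cmax | id
------+------+------+------+----
  764 |    0 |    0 |    0 |  1
  764 |    0 |    0 |    0 |  2
  764 |    0 |    0 |    0 |  3
  765 |    0 |    0 |    0 |  4
  765 |    0 |    0 |    0 |  5
  765 |    0 |    0 |    0 |  6
(6 rows)
```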
The table we have just created consists of just one column, to keep things simple. Note that the data has been loaded using two separate transactions, each inserting 3 rows, which is clearly visible in the hidden columns: xmin and xmax contain transaction IDs, while cmin and cmax contain command IDs within a transaction. As you can see, the first three rows have been written by transaction number 764, while the rest of the data has been created by transaction number 765.
These hidden columns handle visibility: based on them, PostgreSQL will (often, but not always) determine whether a row can be seen by a certain transaction or not.
Running UPDATE statements will show what's really going on here:
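The actual UPDATE is not shown in this version of the text. A session consistent with the description, assuming we update the row sitting at ctid (0, 6) (the multiplication is purely illustrative), could be:

```sql
test=# UPDATE t_test SET id = id * 10 WHERE id = 6;
UPDATE 1
-- ctid is a hidden column too and has to be selected explicitly
test=# SELECT ctid, xmin, xmax, id FROM t_test;
 ctid  | xmin | xmax | id
-------+------+------+----
 (0,1) |  764 |    0 |  1
 (0,2) |  764 |    0 |  2
 (0,3) |  764 |    0 |  3
 (0,4) |  765 |    0 |  4
 (0,5) |  765 |    0 |  5
 (0,7) |  766 |    0 | 60
(6 rows)
```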
One row has been changed. But let's focus on the ctid, which represents the physical position of the row on disk. Notice that (0, 6) is gone, because PostgreSQL had to copy the row. If we run a second UPDATE, the row is copied again:
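A sketch of that second UPDATE (again, the concrete values are illustrative; the point is only that the row moves to the next free position):

```sql
test=# UPDATE t_test SET id = id * 10 WHERE id = 60;
UPDATE 1
-- the row has been copied once more, this time to position 8
test=# SELECT ctid, xmin, xmax, id FROM t_test WHERE id = 600;
 ctid  | xmin | xmax | id
-------+------+------+-----
 (0,8) |  767 |    0 | 600
(1 row)
```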
Copying these rows is important because we have to keep the old version around. Otherwise ROLLBACK would not work; therefore the old version has to remain.
Let's inspect the table once again:
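Assuming the updates sketched above, the inspection would show live tuples only; the positions vacated by the old row versions (and any further gaps left behind by repeating the UPDATE) are invisible to SELECT, but still occupy space on disk:

```sql
test=# SELECT ctid, xmin, xmax, id FROM t_test;
 ctid  | xmin | xmax | id
-------+------+------+-----
 (0,1) |  764 |    0 |   1
 (0,2) |  764 |    0 |   2
 (0,3) |  764 |    0 |   3
 (0,4) |  765 |    0 |   4
 (0,5) |  765 |    0 |   5
 (0,8) |  767 |    0 | 600
(6 rows)
```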
Everything between positions 5 and 9 is full of dead rows, which have to be removed.
Keep in mind that COMMIT must not kill dead rows either. Therefore the cleanup has to happen asynchronously, and this is exactly what VACUUM does. Let's run it and see what happens:
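A run along these lines (using pg_relation_size to check the on-disk size; with so little data the table fits into a single 8 kB page) shows that the file does not get smaller:

```sql
test=# VACUUM t_test;
VACUUM
-- the dead tuples are now reusable, but the file keeps its size
test=# SELECT pg_relation_size('t_test');
 pg_relation_size
------------------
             8192
(1 row)
```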
VACUUM actively looks for rows which are no longer visible to anyone. Those rows can be somewhere in the middle of the data file. What happens is that VACUUM allows PostgreSQL to reuse that space - however, it does not return that space to the operating system. It can't do that: if you have a data file that is 1 GB in size, you can't simply return "the middle of the file" to the operating system in case it is empty - there is no file system operation which supports that. Instead, PostgreSQL has to remember this free space and reuse it later.
However, there is an exception to the rule. Consider the following code snippet:
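The snippet is missing here as well; presumably it deleted every row in the table, along these lines:

```sql
-- kill all rows at once
test=# DELETE FROM t_test;
DELETE 6
-- the file still has its old size
test=# SELECT pg_relation_size('t_test');
 pg_relation_size
------------------
             8192
(1 row)
```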
The table will retain its size even after the DELETE statement. Remember: cleanup is done asynchronously. Therefore VACUUM has to be called to remove those rows:
```sql
test=# VACUUM t_test;
VACUUM
```
This is a bit of a special case. The rule is: if, from a certain position in a table onward, ALL rows are dead, VACUUM can truncate the table. That's exactly what happened in this example:
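Since every single row in our table was dead, VACUUM could truncate the file all the way down to zero bytes, which pg_relation_size confirms:

```sql
test=# SELECT pg_relation_size('t_test');
 pg_relation_size
------------------
                0
(1 row)
```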
In real life, however, large tables almost always contain at least a handful of live rows towards the end of the data file. For that reason, don't count on VACUUM to shrink tables.
One way to battle table bloat (the technical term for tables growing out of proportion) is to use VACUUM FULL. The problem is that VACUUM FULL needs a table lock, which can be a real issue in a business-critical environment. Therefore we have developed a method to reorganize large tables without extensive locking.
The solution is called pg_squeeze. Download it for free from GitHub or our website.
To find out more, you can watch my YouTube video about VACUUM:
https://www.youtube.com/watch?v=lhIUoEsRlZg
If your VACUUM process won't remove dead rows and you don't know why, check out Laurenz' blog post, 4 Reasons Why VACUUM Won't Remove Dead Rows. In case you are looking to dig deeper, check out our other blog posts related to VACUUM.
To get important tips about how disabling AUTOCOMMIT affects autovacuum maintenance, read Laurenz Albe's blog, Disabling Autocommit in PostgreSQL Can Damage Your Health.
One thing that isn't clear to me is how much free space overhead is required with pg_squeeze? Do I need to store an entire duplicate copy of the table indexes during processing?
Yes, just like with VACUUM (FULL).