Table of Contents
Most people know that autovacuum is necessary to get rid of dead tuples. These dead tuples are a side effect of PostgreSQL's MVCC implementation. So many people will be confused when they read that from PostgreSQL v13 on, commit b07642dbc adds support for autovacuuming insert-only tables (also known as “append-only tables”).
This article explains the reasons behind that and gives some advice on how to best use the new feature. It will also explain how to achieve similar benefits in older PostgreSQL releases.
Note that all that I say here about insert-only tables also applies to insert-mostly tables, which are tables that receive only few updates and deletes.
From v13 on, PostgreSQL will gather statistics on how many rows were inserted since a table last received a VACUUM
. You can see this new value in the new “n_ins_since_vacuum
” column of the pg_stat_all_tables
catalog view (and in pg_stat_user_tables
and pg_stat_sys_tables
).
Autovacuum runs on a table whenever that count exceeds a certain value. This value is calculated from the two new parameters “autovacuum_vacuum_insert_threshold
” (default 1000) and “autovacuum_vacuum_insert_scale_factor
” (default 0.2) as follows:
1 |
insert_threshold + insert_scale_factor * reltuples |
where reltuples
is the estimate for the number of rows in the table, taken from the pg_class
catalog.
Like other autovacuum parameters, you can override autovacuum_vacuum_insert_threshold
and autovacuum_vacuum_insert_scale_factor
with storage parameters of the same name for individual tables. You can disable the new feature by setting autovacuum_vacuum_insert_threshold
to -1.
You can use “toast.autovacuum_vacuum_insert_threshold
” and “toast.autovacuum_vacuum_insert_scale_factor
” to change the parameters for the associated TOAST table.
VACUUM
?PostgreSQL stores transaction IDs in the xmin
and xmax
system columns to determine which row version is visible to which query. These transaction IDs are unsigned 4-byte integer values, so after slightly more than 4 billion transactions the counter hits the upper limit. Then it “wraps around” and starts again at 3.
As described in this blog post, that method would cause data loss after about 2 billion transactions. So old table rows must be “frozen” (marked as unconditionally visible) before that happens. This is one of the many jobs of the autovacuum daemon.
The problem is that PostgreSQL only triggers such “anti-wraparound” runs once the oldest unfrozen table row is more than 200 million transactions old. For an insert-only table, this is normally the first time ever that autovacuum runs on a table. There are two potential problems with that:
VACUUM
manually. You can find a description of such cases in this and this blog.ACCESS EXCLUSIVE
lock (like DDL statements on the table). Such a blocked operation will block all other access to the table, and processing comes to a standstill. You can find such a case described in this blog.From PostgreSQL v13 on, the default settings should already protect you from this problem. This was indeed the motivation behind the new feature.
For PostgreSQL versions older than v13, you can achieve a similar effect by triggering anti-wraparound vacuum earlier, so that it becomes less disruptive. For example, if you want to vacuum a table every 100000 transactions, you can set this storage parameter:
1 2 3 |
ALTER TABLE mytable SET ( autovacuum_freeze_max_age = 100000 ); |
If all tables in your database are insert_only, you can reduce the overhead from autovacuum by setting vacuum_freeze_min_age
to 0, so that tuples get frozen right when the table is first vacuumed.
As mentioned above, each row contains the information for which transactions it is visible. However, the index does not contain this information. Now if you consider an SQL query like this:
1 |
SELECT count(*) FROM mytables WHERE id < 100; |
where you have an index on id
, all the information you need is available in the index. So you should not need to fetch the actual table row (“heap fetch”), which is the expensive part of an index scan. But unfortunately you have to visit the table row anyway, just to check if the index entry is visible or not.
To work around that, PostgreSQL has a shortcut that makes index-only scans possible: the visibility map. This data structure stores two bits per 8kB table block, one of which indicates if all rows in the block are visible to all transactions. If a query scans an index entry and finds that the block containing the referenced table row is all-visible, it can skip checking visibility for that entry.
So you can have index-only scans in PostgreSQL if most blocks of a table are marked all-visible in the visibility map.
Since VACUUM
removes dead tuples, which is required to make a table block all-visible, it is also VACUUM
that updates the visibility map. So to have most blocks all-visible in order to get an index-only scan, VACUUM
needs to run on the table often enough.
Now if a table receives enough UPDATE
s or DELETE
s, you can set autovacuum_vacuum_scale_factor
to a low value like 0.005. Then autovacuum will keep the visibility map in good shape.
But with an insert-only table, it is not as simple to get index-only scans before PostgreSQL v13. One report of a problem related to that is here.
From PostgreSQL v13 on, all you have to do is to lower autovacuum_vacuum_insert_scale_factor
on the table:
1 2 3 |
ALTER TABLE mytable SET ( autovacuum_vacuum_insert_scale_factor = 0.005 ); |
In older PostgreSQL versions, this is more difficult. You have two options:
VACUUM
runs with cron
or a different schedulerautovacuum_freeze_max_age
low for that table, so that autovacuum processes it often enoughIn PostgreSQL, the first query that reads a newly created row has to consult the commit log to figure out if the transaction that created the row was committed or not. It then sets a hint bit on the row that persists that information. That way, the first reader saves future readers the effort of checking the commit log.
As a consequence, the first reader of a new row “dirties” (modifies in memory) the block that contains it. If a lot of rows were recently inserted in a table, that can cause a performance hit for the first reader. Therefore, it is considered good practice in PostgreSQL to VACUUM
a table after you insert (or COPY
) a lot of rows into it.
But people don't always follow that recommendation. Also, if you want to write software that supports several database systems, it is annoying to have to add special cases for individual systems. With the new feature, PostgreSQL automatically vacuums insert-only tables after large inserts, so you have one less thing to worry about.
During the discussion for the new feature we saw that there is still a lot of room for improvement. Autovacuum is already quite complicated (just look at the many configuration parameters) and still does not do everything right. For example, truly insert-only tables would benefit from freezing rows right away. On the other hand, for tables that receive some updates or deletes as well as for table partitions that don't live long enough to reach wraparound age, such aggressive freezing can lead to unnecessary I/O activity.
One promising idea Andres Freund propagated was to freeze all tuples in a block whenever the block becomes dirty, that is, has to be written anyway.
The fundamental problem is that autovacuum serves so many different purposes. Basically, it is the silver bullet that should solve all of the problems of PostgreSQL's MVCC architecture. That is why it is so complicated. However, it would take a major redesign to improve that situation.
While it seems to be an oxymoron at first glance, autovacuum for insert-only tables mitigates several problems that large databases used to suffer from.
In a world where people collect “big data”, it becomes even more important to keep such databases running smoothly. With careful tuning, that was possible even before PostgreSQL v13. But autovacuum is not simple to tune, and many people lack the required knowledge. So it is good to have new autovacuum functionality that takes care of more potential problems automatically.
You need to load content from reCAPTCHA to submit the form. Please note that doing so will share data with third-party providers.
More InformationYou are currently viewing a placeholder content from Facebook. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More InformationYou are currently viewing a placeholder content from X. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More Information
Great read. Thank you for the blog post.
Regarding one of the example:
ALTER TABLE mytable SET ( autovacuum_vacuum_insert_threshold = 0.005);
I hope you want to say autovacuum_vacuum_insert_scale_factor
Of course, thanks for the report. Fixed.
ALTER TABLE mytable SET (vacuum_freeze_min_age = 0);
does not work for me, givingERROR: unrecognized parameter "vacuum_freeze_min_age"
.Shouldn't that be
ALTER TABLE mytable SET (autovacuum_freeze_min_age = 0);
, because the per-table parameter is calledautovacuum_*
?Thanks!
You are absolutely right, I forgot that
vacuum_freeze_min_age
is a parameter that cannot be set as storage parameter on a single table, only globally.Thanks for the correction, I have fixed the paragraph.