
Transaction ID wraparound: a walk on the wild side

01.2023
transaction ID wraparound protection by freezing the server room?
© Laurenz Albe 2022

 

Most people are aware of transaction ID wraparound. The concept has been well explained in Hans' article, so I won't repeat all that here. But for most people it is an abstract concept, a bogeyman lurking in the dark around the corner. Many people know horror stories about anti-wraparound autovacuum tanking performance or databases that stop working, but who has ever seen actual data loss? I decided to face the beast (having developed a fondness for breaking things), and I learned something along the way. If you are curious, come along on this trip! We'll do some scary things that you shouldn't do to your databases at home.

Is data loss from transaction ID wraparound a real danger?

No, it isn't. There are strong protections in PostgreSQL to prevent that. As you will see, I have to use evil tricks to overcome those protections. Still, these protections can disrupt operation, so it is good to be aware of the problem. But don't panic: usually, you won't even notice when your transaction IDs wrap around.

Preparing a database cluster

Since I will cause data corruption, I'll create a new cluster that I can discard afterwards:

Let's start the cluster. I will use a free, non-standard port and allow a prepared transaction. That prepared transaction will later be used to stir trouble.

Now let's connect to the cluster.

Now we can create some data (that we will destroy later). I'll return xmin for each row I create, which stores the transaction ID of the creating transaction. Similarly, I'll return xmax for all deleting statements. For detailed information about xmax, you can read my article on that topic.

It is very important that we don't SELECT from the table at this stage, since that would set hint bits, which would spoil the effect.

Causing trouble to facilitate transaction ID wraparound

There are a few things that are known to keep autovacuum from cleaning up. In addition, data corruption can keep autovacuum from processing a table. I decided to use a prepared transaction that holds a lock that keeps autovacuum from running:

Prepared transactions are normally used for two-phase commit. They stay around until they are committed or rolled back, and even survive server restarts. A prepared transaction is a dangerous thing!

Proceeding towards transaction ID wraparound

At this point, I would have to run about two billion transactions, but I am too lazy for that. Fortunately, PostgreSQL comes with a ready-made tool to facilitate data corruption: pg_resetwal.

There are some legitimate use cases of pg_resetwal for the normal user, for example, to change the WAL segment size. But in general, it is a tool for the expert, and is used for salvaging data from a broken PostgreSQL cluster. The idea is to get a broken instance into a state where it can be started, then dump whatever data you can. We will use it to wantonly advance the transaction ID. For that, it is important to perform a clean shutdown of the cluster; otherwise pg_resetwal has to be called with the option -f, which will likely destroy some data.

We will set the transaction ID to 2^31 - 10000000:
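To see why a value just below 2^31 is interesting, here is a minimal Python sketch of circular transaction ID comparison. It is a simplified model, not PostgreSQL's actual code: the function name xid_precedes and the old transaction ID 725 are assumptions made for illustration, and the special transaction IDs below 3 are ignored.

```python
XID_SPACE = 1 << 32   # transaction IDs are 32-bit unsigned integers
HALF = 1 << 31        # roughly two billion IDs count as "past", two billion as "future"


def xid_precedes(a, b):
    """Return True if a counts as older than b in circular 32-bit arithmetic.

    Simplified model (hypothetical helper): special IDs below 3 are ignored.
    """
    diff = (a - b) % XID_SPACE
    # A difference of 2^31 or more corresponds to a negative signed 32-bit value,
    # which means "a is older than b".
    return diff >= HALF


target = HALF - 10_000_000
old_xid = 725         # assumed ID of an old transaction, for illustration only

print(target)                                    # 2137483648
print(xid_precedes(old_xid, target))             # True: 725 is still "in the past"
print(xid_precedes(old_xid, old_xid + HALF + 1)) # False: 725 now looks like the future
```

The last line shows the danger: once the counter gets more than about two billion transactions ahead of an old, unfrozen transaction ID, that old ID no longer compares as "older".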

We also have to “fake” a commit log file for this transaction:
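Which commit log file is needed follows from simple arithmetic. The sketch below assumes the default block size of 8192 bytes, two status bits per transaction, 32 pages per commit log segment, and segment files named with four uppercase hexadecimal digits; the helper name clog_location is made up for this example.

```python
BLOCK_SIZE = 8192          # assumed default PostgreSQL block size
XACTS_PER_BYTE = 4         # two status bits per transaction
PAGES_PER_SEGMENT = 32     # assumed number of pages per commit log segment file
XACTS_PER_PAGE = BLOCK_SIZE * XACTS_PER_BYTE
XACTS_PER_SEGMENT = XACTS_PER_PAGE * PAGES_PER_SEGMENT


def clog_location(xid):
    """Return (segment file name, pages the file needs so that it covers xid)."""
    segment = xid // XACTS_PER_SEGMENT
    page_in_segment = (xid % XACTS_PER_SEGMENT) // XACTS_PER_PAGE
    return f"{segment:04X}", page_in_segment + 1


print(clog_location((1 << 31) - 10_000_000))   # ('07F6', 15)
```

Under these assumptions, the faked file would be segment 07F6 and would have to be at least 15 pages long to cover the new transaction ID. The same function can be applied to any other target transaction ID.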

Encounter the first safety against transaction ID wraparound

Let's start the server as indicated above and connect. Then we consume a transaction ID. The cheapest way to do this is by calling the function pg_current_xact_id():

This warning is shown when we are less than 40 million transactions from the point of data corruption. It is still easy to ignore, if you don't look into the log files.

Encounter the second safety against transaction ID wraparound

Again, we stop the server and move on to transaction 2^31 - 1000000, and again we create a matching commit log file:

We start the server and try to get a transaction ID:

Now it is game over for normal database activity. At this point, all you can do is to manually VACUUM the tables that still contain old, unfrozen tuples. Note that the hint is not quite accurate: since VACUUM doesn't consume a transaction ID, you don't need to start PostgreSQL in single-user mode for that. It is actually somewhat dangerous to use single-user mode, since that disarms the safety, and you can continue consuming transaction IDs. However, since the above error occurs three million transactions before data corruption, there is plenty of headroom for failed attempts.
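The two safeties can be summarized in a few lines of Python. This is only a rough model under stated assumptions: the wraparound point is taken to lie 2^31 - 1 transactions after the oldest unfrozen transaction ID, the margins of 40 million and 3 million are the ones mentioned in this article, and the oldest unfrozen ID 725 and the function name xid_state are assumptions for illustration. Plain integer comparison is used, which only works as long as the counter itself has not wrapped.

```python
MAX_XID_DISTANCE = (1 << 31) - 1   # assumed distance from oldest unfrozen ID to the wraparound point
WARN_MARGIN = 40_000_000           # warnings start this far before the wraparound point
STOP_MARGIN = 3_000_000            # new transaction IDs are refused this far before it


def xid_state(next_xid, oldest_xid):
    """Classify the next transaction ID as 'ok', 'warn' or 'stop' (hypothetical helper)."""
    wrap_limit = oldest_xid + MAX_XID_DISTANCE
    if next_xid >= wrap_limit - STOP_MARGIN:
        return "stop"
    if next_xid >= wrap_limit - WARN_MARGIN:
        return "warn"
    return "ok"


oldest = 725   # assumed oldest unfrozen transaction ID

print(xid_state(1_000_000, oldest))               # ok
print(xid_state((1 << 31) - 10_000_000, oldest))  # warn
print(xid_state((1 << 31) - 1_000_000, oldest))   # stop
```

With these numbers, the first target lands in the warning zone and the second one beyond the point where PostgreSQL refuses to hand out new transaction IDs, which matches what we saw.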

Crossing the red line

Undeterred by all these warnings, we wrap around all the way to transaction 725. Note that we don't have to fake a commit log, because the old files are still there (autovacuum never could clean up anything).

Now we start PostgreSQL in single-user mode and reuse transactions 725 to 727. Then we commit the prepared transaction to remove the obstacle:

We terminate the session with Ctrl+D, which will shut down PostgreSQL.

But your dead will live; their bodies will rise. (Isa 26,19)

We start the server in the usual way. We don't need a prepared transaction any more:

Let's look at the table:

Indeed, row 1, which we deleted in the beginning, is alive again. Row 2, which should be visible, is gone. What happened? By rolling back transactions 726 and 727, we undid the DELETE and INSERT. Note that row 2 is invisible, but still present:
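A toy model may help to understand the effect. The following Python sketch is a drastic simplification of PostgreSQL's visibility rules: it ignores snapshots, in-progress transactions, frozen rows and hint bits, and always looks up the transaction status in a commit log that is modeled as a dictionary. The xmin value 725 for row 1 is an assumption; all names are invented for this example.

```python
COMMITTED, ABORTED = "committed", "aborted"


def visible(row, clog):
    """Toy visibility check without hint bits: ask the commit log every time."""
    if clog.get(row["xmin"]) != COMMITTED:
        return False                 # the inserting transaction did not commit
    if row["xmax"] is None:
        return True                  # the row was never deleted
    # A delete only counts if the deleting transaction committed.
    return clog.get(row["xmax"]) != COMMITTED


rows = [
    {"id": 1, "xmin": 725, "xmax": 726},    # xmin 725 is an assumption
    {"id": 2, "xmin": 727, "xmax": None},
]

# Commit log as originally written, and after reusing the transaction IDs
# and rolling back 726 and 727.
clog_before = {725: COMMITTED, 726: COMMITTED, 727: COMMITTED}
clog_after = {725: COMMITTED, 726: ABORTED, 727: ABORTED}


def visible_ids(clog):
    return [row["id"] for row in rows if visible(row, clog)]


print(visible_ids(clog_before))   # [2]
print(visible_ids(clog_after))    # [1]
```

This is also why it was important not to SELECT from the table earlier: hint bits would have recorded the original transaction status in the rows themselves, and the overwritten commit log would no longer have been consulted.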

What a mess!

Performing surgery on the damage from transaction ID wraparound

We can use the pageinspect extension to examine what is on disk:

PostgreSQL v14 introduced a new contrib module pg_surgery. This module can be used to deal with damage like we have here. It has a function to kill (remove) table rows and a function to freeze rows (make them unconditionally visible). Let's try that:

The functions in this module are dangerous. We only use them here because we already suffered damage. The idea is not to repair the damage and go on, but to get the database into a state where the data can be dumped and restored to a new, healthy cluster. You should never continue working with a PostgreSQL instance that has suffered data corruption, even if the corruption seems to be fixed. There might be invisible surprises lurking somewhere.

What have we learned?

We have seen that it is impossible to suffer data corruption from transaction ID wraparound by normal means. We saw the dead rise. We got to use dangerous tools like pg_resetwal and pg_surgery. I hope you enjoyed the ride!

© 2024 CYBERTEC PostgreSQL International GmbH