Of course, most people do not want to corrupt their databases, and they will benefit from avoiding the techniques shown in this article. But for some, it can be useful to corrupt a database on purpose, for example to test a tool or procedure that is supposed to detect or repair data corruption.
We need a database with some data in it, and for some of our experiments, we will need some ongoing activity. For that, we can use the built-in PostgreSQL benchmark pgbench. We use scale factor 100, so that the largest table contains 10 million rows:
$ pgbench -q -i -s 100
dropping old tables...
creating tables...
generating data (client-side)...
10000000 of 10000000 tuples (100%) done (elapsed 7.44 s, remaining 0.00 s)
vacuuming...
creating primary keys...
done in 10.12 s (drop tables 0.18 s, create tables 0.01 s, client-side generate 7.52 s, vacuum 0.14 s, primary keys 2.28 s).
Load will be generated with 5 concurrent client sessions:
$ pgbench -c 5 -T 3600
fsync = off
Let's set fsync = off in postgresql.conf and power off the server while it is under load.
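If you prefer not to edit postgresql.conf by hand, the same change can be made from SQL; a minimal sketch, assuming superuser access (fsync can be changed with a reload, no restart required):

postgres=# ALTER SYSTEM SET fsync = off;  -- written to postgresql.auto.conf
postgres=# SELECT pg_reload_conf();       -- apply the change without restarting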
After a few attempts, we can detect data corruption with the amcheck extension:
postgres=# CREATE EXTENSION amcheck;
CREATE EXTENSION
postgres=# SELECT bt_index_parent_check('pgbench_accounts_pkey', TRUE, TRUE);
WARNING: concurrent delete in progress within table "pgbench_accounts"
ERROR: could not access status of transaction 1949706
DETAIL: Could not read from file "pg_subtrans/001D" at offset 196608: read too few bytes.
CONTEXT: while checking uniqueness of tuple (131074,45) in relation "pgbench_accounts"
What happened? Data were no longer flushed to disk in the correct order, so that data modifications could hit the disk before the WAL did. This leads to data corruption during crash recovery.
While pgbench is running, we create a base backup:
$ psql
postgres=# SELECT pg_backup_start('test');
 pg_backup_start 
═════════════════
 1/47F8A130
(1 row)
Note that since I am using PostgreSQL v15, the function to start backup mode is pg_backup_start() rather than pg_start_backup(). This is because the exclusive backup API, which had been deprecated since PostgreSQL 9.6, was finally removed in v15. To find out more, read my updated post in the link.
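Keep in mind that the non-exclusive API requires pg_backup_start() and pg_backup_stop() to be called from the same database session; if that session disconnects, the backup is aborted. A minimal sketch of the intended flow (the label "mybackup" is arbitrary):

postgres=# SELECT pg_backup_start('mybackup');
-- keep this session open while you copy the data directory with an external tool
postgres=# SELECT * FROM pg_backup_stop();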
Let's figure out the object IDs of the database and of the primary key index of pgbench_accounts:
postgres=# SELECT relfilenode FROM pg_class WHERE relname = 'pgbench_accounts_pkey';
 relfilenode 
═════════════
       16430
(1 row)

postgres=# SELECT oid FROM pg_database WHERE datname = 'postgres';
 oid 
═════
   5
(1 row)
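As a shortcut, pg_relation_filepath() combines both lookups and returns the file path relative to the data directory; for example:

postgres=# SELECT pg_relation_filepath('pgbench_accounts_pkey');
-- returns something like "base/5/16430" (the numbers will differ on your system)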
We create a backup by copying the data directory. Afterwards, we copy the primary key index of pgbench_accounts and the commit log again to make sure that they are more recent than the rest:
$ cp -r data backup
$ cp data/base/5/16430* backup/base/5
$ cp data/pg_xact/* backup/pg_xact/
$ rm backup/postmaster.pid
backup_label
Now we exit backup mode, but ignore the contents of the backup_label file returned from pg_backup_stop():
postgres=# SELECT labelfile FROM pg_backup_stop();
NOTICE:  WAL archiving is not enabled; you must ensure that all required WAL segments are copied through other means to complete the backup
                           labelfile                            
════════════════════════════════════════════════════════════════
 START WAL LOCATION: 1/47F8A130 (file 000000010000000100000047)↵
 CHECKPOINT LOCATION: 1/65CD24F0                                ↵
 BACKUP METHOD: streamed                                        ↵
 BACKUP FROM: primary                                           ↵
 START TIME: 2022-07-05 08:32:47 CEST                           ↵
 LABEL: test                                                    ↵
 START TIMELINE: 1                                              ↵
(1 row)
Then, let's make sure that the last checkpoint in the control file is different from the one recorded in the backup_label we are ignoring:
$ pg_controldata -D backup | grep REDO
Latest checkpoint's REDO location:    1/890077D0
Latest checkpoint's REDO WAL file:    000000010000000100000089
Great! Let's start the server:
$ echo 'port = 5555' >> backup/postgresql.auto.conf
$ pg_ctl -D backup start
waiting for server to start..... done
server started
Now an index scan on pgbench_accounts fails, because the index contains more recent data than the table:
postgres=# SELECT * FROM pgbench_accounts ORDER BY aid;
ERROR: could not read block 166818 in file "base/5/16422.1": read only 0 of 8192 bytes
What happened? By omitting the backup_label file from the backup, we recovered from the wrong checkpoint, so the data in the table and its index were no longer consistent. Note that we could get the same effect without pg_backup_start() and pg_backup_stop(); I only wanted to emphasize the importance of backup_label.
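For contrast, a correct low-level backup would save the returned label into a file named backup_label inside the backup directory. A rough sketch, assuming the same psql session that ran pg_backup_start() and a backup directory called "backup" (in a real backup you would also have to archive the WAL written in the meantime):

postgres=# \pset tuples_only on
postgres=# \pset format unaligned
postgres=# \o backup/backup_label
postgres=# SELECT labelfile FROM pg_backup_stop();
postgres=# \o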
pg_resetwal
While the database is under load from pgbench, we crash it with
pg_ctl stop -m immediate -D data
Then we run pg_resetwal:
pg_resetwal -D data
The database server was not shut down cleanly.
Resetting the write-ahead log might cause data to be lost.
If you want to proceed anyway, use -f to force reset.

$ pg_resetwal -f -D data
Write-ahead log reset
Then we start the server and use amcheck like before to check the index for integrity:
postgres=# CREATE EXTENSION amcheck;
CREATE EXTENSION
postgres=# SELECT bt_index_parent_check('pgbench_accounts_pkey', TRUE, TRUE);
WARNING: concurrent delete in progress within table "pgbench_accounts"
ERROR: could not access status of transaction 51959
DETAIL: Could not read from file "pg_subtrans/0000" at offset 204800: read too few bytes.
CONTEXT: while checking uniqueness of tuple (1,1) in relation "pgbench_accounts"
What happened? pg_resetwal is only safe to use on a cluster that was shut down cleanly. The option -f is intended as a last-ditch effort to get a corrupted server to start and salvage some data. Only experts should use it.
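A quick way to see whether pg_resetwal would be safe at all is to look at the cluster state in the control file; a sketch:

$ # anything other than "shut down" means the cluster did not stop cleanly
$ pg_controldata -D data | grep 'cluster state'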
pg_upgrade --link
We create a second cluster with initdb:
$ initdb -E UTF8 --locale=C -U postgres data2
Then we edit postgresql.conf and choose a different port number. After shutting down the original cluster, we run an “upgrade” in link mode:
$ pg_upgrade -d /home/laurenz/data -D /home/laurenz/data2 \
>            -b /usr/pgsql-15/bin -B /usr/pgsql-15/bin -U postgres --link
Performing Consistency Checks
...
Performing Upgrade
...
Adding ".old" suffix to old global/pg_control                 ok

If you want to start the old cluster, you will need to remove
the ".old" suffix from /home/laurenz/data/global/pg_control.old.
Because "link" mode was used, the old cluster cannot be safely
started once the new cluster has been started.
...
Upgrade Complete
----------------
Optimizer statistics are not transferred by pg_upgrade.
Once you start the new server, consider running:
    /usr/pgsql-15/bin/vacuumdb -U postgres --all --analyze-in-stages

Running this script will delete the old cluster's data files:
    ./delete_old_cluster.sh
pg_upgrade renamed the control file of the old cluster, so that it cannot get started by accident. We'll undo that:
mv /home/laurenz/data/global/pg_control.old \
>  /home/laurenz/data/global/pg_control
Now we can start both clusters and run pgbench on both. Soon we will see error messages like
ERROR: unexpected data beyond EOF in block 1 of relation base/5/16397
HINT: This has been seen to occur with buggy kernels; consider updating your system.

ERROR: duplicate key value violates unique constraint "pgbench_accounts_pkey"
DETAIL: Key (aid)=(8040446) already exists.

WARNING: could not write block 13 of base/5/16404
DETAIL: Multiple failures --- write error might be permanent.

ERROR: xlog flush request 0/98AEE3E0 is not satisfied --- flushed only to 0/648CDC58
CONTEXT: writing block 13 of relation base/5/16404

ERROR: could not access status of transaction 39798
DETAIL: Could not read from file "pg_subtrans/0000" at offset 155648: read too few bytes.
What happened? Because of link mode, both clusters share the same data files, so we ended up running two servers on the same files at once. This leads to data corruption.
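You can see the sharing directly: in link mode, pg_upgrade hard-links the data files instead of copying them, so the same inode shows up under both data directories. A sketch (the paths are illustrative; look up the actual relfilenode first):

$ # identical inode numbers mean both paths point to the same physical file
$ ls -li data/base/5/16396 data2/base/5/16396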
Next, let's corrupt the database by writing garbage directly into a data file. For that, we figure out the file name that belongs to the table pgbench_accounts:
postgres=# SELECT relfilenode FROM pg_class WHERE relname = 'pgbench_accounts';
 relfilenode 
═════════════
       16396
(1 row)
Now we stop the server and write some garbage into the first data block:
yes 'this is garbage' | dd of=data/base/5/16396 bs=1024 seek=2 count=1 conv=notrunc
0+1 records in
0+1 records out
1024 bytes (1.0 kB, 1.0 KiB) copied, 0.00031255 s, 3.3 MB/s
Then we start the server and try to select from the table:
postgres=# TABLE pgbench_accounts ;
ERROR: compressed pglz data is corrupt
What happened? We tampered with the data files, so it's unsurprising that the table is corrupted.
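As an aside, this kind of tampering is easier to detect if the cluster uses data checksums. A sketch, assuming the cluster was created with initdb --data-checksums and the server is stopped:

$ pg_checksums --check -D data   # reports blocks whose checksum does not match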
Who needs ALTER TABLE to drop a table column? We can simply run
DELETE FROM pg_attribute
WHERE attrelid = 'pgbench_accounts'::regclass
  AND attname = 'bid';
After that, an attempt to query the table will result in an error:
ERROR: pg_attribute catalog is missing 1 attribute(s) for relation OID 16396
What happened? We ignored that dropping a column sets attisdropped to TRUE in pg_attribute rather than actually removing the entry. Moreover, we didn't check for dependencies in pg_depend, nor did we properly lock the table against concurrent access. Modifying catalog tables is unsupported, and if it breaks the database, you get to keep both pieces.
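For comparison, here is a sketch of the supported way, which keeps the catalog consistent (note that this really drops the column):

-- let the server maintain pg_attribute, pg_depend and the necessary locks
ALTER TABLE pgbench_accounts DROP COLUMN filler;

-- the dropped column is still listed in pg_attribute, but flagged with attisdropped = true
SELECT attname, attisdropped
FROM pg_attribute
WHERE attrelid = 'pgbench_accounts'::regclass AND attnum > 0;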
We have seen a number of ways you can corrupt a PostgreSQL database. Some of these were obvious, some might surprise the beginner. If you don't want a corrupted database, avoid these techniques:
- running with fsync = off
- running pg_resetwal -f on a crashed server
- starting the old cluster again after pg_upgrade --link
- deleting or ignoring backup_label when restoring a backup
I hope you can save some databases with this information! If you'd like to know more about troubleshooting PostgreSQL performance, read my post on join strategies.
This is a great article detailing the PG corruption side of things!
Could you please write an article on repairing corrupt databases?
Corrupt catalog tables, user tables, missing files, etc.
That would be nice, wouldn't it?
But it is much easier to break things than to fix them. You need a lot of knowledge and patience to deal with data corruption. I am afraid that it exceeds the scope of a blog post.
You can get some ideas from the slides from my 2023 PGConf.EU talk on the topic.
Perhaps I can put one or the other aspect of that into a blog some day.