Table of Contents
A while ago, I wrote about B-tree improvements in v12. PostgreSQL v13, which will come out later this year, will feature index entry deduplication as an even more impressive improvement. So I thought it was time for a follow-up.
If the indexed keys for different table rows are identical, a GIN index will store that as a single index entry. The pointers to the several table rows (tuple IDs) are stored in the posting list of that entry.
In contrast, B-tree indexes used to store an index entry for each referenced table row. This makes maintenance faster but can lead to duplicate index entries. Now commit 0d861bbb70 has changed that by introducing deduplication of index entries for B-tree indexes.
The big difference is that duplicate index entries can still occur in B-tree indexes. GIN indexes have to integrate new entries into the posting list every time, which makes data modifications less efficient (this is mitigated by the pending list). In contrast, PostgreSQL deduplicates B-tree entries only when it would otherwise have to split the index page. So it has to do the extra work only occasionally, and only when it would have to do extra work anyway. This extra work is balanced by the reduced need for page splits and the resulting smaller index size.
Oh yes, it does. Every UPDATE
creates a new version of a table row, and each row version has to be indexed. Consequently, even a unique index can contain the same index entry multiple times. The new feature is useful for unique indexes because it reduces the index bloat that normally occurs if the same table row is updated frequently.
Deduplication results in a smaller index size for indexes with repeating entries. This saves disk space and, even more importantly, RAM when the index is cached in shared buffers. Scanning the index becomes faster, and index bloat is reduced.
If you want to make use of the new feature after upgrading with pg_upgrade
, you'll have to REINDEX
promising indexes that contain many duplicate entries. Upgrading with pg_dumpall
and restore or using logical replication rebuilds the index, and deduplication is automatically enabled.
If you want to retain the old behavior, you can disable deduplication by setting deduplicate_items = off
on the index.
I'll use the same test case as in my previous article:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
CREATE TABLE rel ( aid bigint NOT NULL, bid bigint NOT NULL ); ALTER TABLE rel ADD CONSTRAINT rel_pkey PRIMARY KEY (aid, bid); CREATE INDEX rel_bid_idx ON rel (bid); INSERT INTO rel (aid, bid) SELECT i, i / 10000 FROM generate_series(1, 20000000) AS i; /* set hint bits and calculate statistics */ VACUUM (ANALYZE) rel; |
The interesting index here is rel_bid_idx
. I'll measure its size, then the size after a REINDEX
. Finally, I'll execute the following query several times:
1 2 3 |
DO $$BEGIN PERFORM * FROM rel WHERE bid < 100::bigint; END;$$; |
This executes an index scan without actually transferring the result to the client. I'll measure the time with psql
's timing
on the client side.
I'll run that test on PostgreSQL v12 and v13. As a comparison, I'll do the same thing with a GIN index on bid
(this requires the btree_gin
contrib module).
v13 B-tree index | v12 B-tree index | v12 GIN index | |
---|---|---|---|
size | 126MB | 408MB | 51MB |
size after REINDEX |
133MB | 429MB | 47MB |
query duration | 130ms | 130ms | 140ms |
In my example, the deduplicated B-tree index is more than three times smaller than the v12 index, although it is still considerably larger than the GIN index.
The performance is comparable. I observed a bigger variance in the execution time with the deduplicated index, which I cannot explain.
This feature is another nice step ahead for PostgreSQL's B-tree indexes!
You need to load content from reCAPTCHA to submit the form. Please note that doing so will share data with third-party providers.
More InformationYou are currently viewing a placeholder content from Facebook. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More InformationYou are currently viewing a placeholder content from X. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More Information
Awesome !
Thanks Laurenz
So postgresql 13 deduplication is the default for btrees?
Yes.