Since I only recently learned about the concept of “killed index tuples”, I thought there might be others who are not yet familiar with this interesting PostgreSQL feature.
This may give you an explanation the next time you encounter wildly varying execution times for the same execution plan of the same PostgreSQL query.
Before we look more closely at the index, let's review the life cycle of a table row version (“heap tuple”).
It is widely known that the visibility of heap tuples is determined by the system columns xmin and xmax (though there is more to xmax than meets the eye). A heap tuple is “dead” if its xmax is less than the xmin of all active transactions.

Now xmin and xmax are only valid if the respective transactions have been marked committed in the “commit log”. Consequently, any transaction that needs to know if it can see a tuple has to consult the commit log. To save future readers that extra work, the first one that consults the commit log will save the information in the tuple's “hint bits”.
Dead tuples are eventually reclaimed by VACUUM.
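As a small illustration (the table t below is made up and not part of the article's example), you can look at those system columns directly:

-- inspect xmin and xmax of a freshly inserted row
CREATE TABLE t (id integer);
INSERT INTO t VALUES (1);
-- xmin is the ID of the inserting transaction, xmax is 0 while the row is live;
-- after a DELETE, xmax holds the ID of the deleting transaction, and later
-- readers consult the commit log (and then the hint bits) to decide visibility
SELECT xmin, xmax, id FROM t;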
This is all fairly well known, but what is the situation with index entries?
To avoid redundancy and to keep index tuples small, the visibility information is not stored in the index.
The status of an index tuple is determined by the heap tuple it points to, and both are removed by VACUUM at the same time.
As a consequence, an index scan has to inspect the heap tuple to determine if it can “see” an entry. This is the case even if all the columns needed are in the index tuple itself. Even worse, this “heap access” will result in random I/O, which is not very efficient on spinning disks.
This makes index scans in PostgreSQL more expensive than in other database management systems that use a different architecture. To mitigate that, several features have been introduced over the years:

- bitmap index scans (introduced in 8.1), which collect the matching heap tuples in a bitmap and visit the heap pages in physical order, reducing the cost of the random I/O
- index-only scans (introduced in 9.2), which can skip the heap access entirely for pages that the visibility map marks as all-visible (see the sketch below)
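To make that concrete, here is a minimal sketch (the table demo and the row counts are made up, not part of the article's example). An index-only scan reports the heap visits it still has to make as “Heap Fetches”, and they disappear once VACUUM has marked the pages as all-visible in the visibility map:

-- even though only indexed columns are queried, the heap must be visited
-- as long as the visibility map does not mark the pages as all-visible
CREATE TABLE demo (id integer PRIMARY KEY, val text NOT NULL);
INSERT INTO demo
SELECT i, 'text number ' || i
FROM generate_series(1, 100000) AS i;
ANALYZE demo;

-- should show an Index Only Scan with a non-zero "Heap Fetches" count
EXPLAIN (ANALYZE, BUFFERS)
SELECT id FROM demo WHERE id BETWEEN 1 AND 1000;

-- after VACUUM has set the visibility map, the heap fetches are gone
VACUUM demo;
EXPLAIN (ANALYZE, BUFFERS)
SELECT id FROM demo WHERE id BETWEEN 1 AND 1000;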
But I want to focus on another feature, one that goes all the way back to version 7.3 (the commit below is from 2002), yet never made it into the release notes.
The feature was introduced with this commit:
commit 3f4d48802271126b1343289a9d2267ff1ed3788a
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Fri May 24 18:57:57 2002 +0000

    Mark index entries 'killed' when they are no longer visible to any
    transaction, so as to avoid returning them out of the index AM.  Saves
    repeated heap_fetch operations on frequently-updated rows.  Also detect
    queries on unique keys (equality to all columns of a unique index), and
    don't bother continuing scan once we have found first match.

    Killing is implemented in the btree and hash AMs, but not yet in rtree
    or gist, because there isn't an equally convenient place to do it in
    those AMs (the outer amgetnext routine can't do it without re-pinning
    the index page).

    Did some small cleanup on APIs of HeapTupleSatisfies, heap_fetch, and
    index_insert to make this a little easier.
Whenever an index scan fetches a heap tuple only to find that it is dead (that the entire “HOT chain” of tuples is dead, to be more precise), it marks the index tuple as “killed”. Then future index scans can simply ignore it.
This “hint bit for indexes” can speed up future index scans considerably.
Let's create a table to demonstrate killed index tuples. We have to disable autovacuum so that it does not spoil the effect by removing the dead tuples.
CREATE UNLOGGED TABLE hole (
   id integer PRIMARY KEY,
   val text NOT NULL
) WITH (autovacuum_enabled = off);

INSERT INTO hole
SELECT i, 'text number ' || i
FROM generate_series(1, 1000000) AS i;

ANALYZE hole;
We create a number of dead tuples by deleting rows from the table:
DELETE FROM hole
WHERE id BETWEEN 501 AND 799500;

ANALYZE hole;
Now we run the same query twice:
EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, TIMING OFF)
SELECT * FROM hole
WHERE id BETWEEN 1 AND 800000;

                           QUERY PLAN
---------------------------------------------------------------
 Index Scan using hole_pkey on hole (actual rows=1000 loops=1)
   Index Cond: ((id >= 1) AND (id <= 800000))
   Buffers: shared hit=7284
 Planning Time: 0.346 ms
 Execution Time: 222.129 ms
(5 rows)

EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, TIMING OFF)
SELECT * FROM hole
WHERE id BETWEEN 1 AND 800000;

                           QUERY PLAN
---------------------------------------------------------------
 Index Scan using hole_pkey on hole (actual rows=1000 loops=1)
   Index Cond: ((id >= 1) AND (id <= 800000))
   Buffers: shared hit=2196
 Planning Time: 0.217 ms
 Execution Time: 14.540 ms
(5 rows)
There are a number of things that cannot account for the difference:

- caching effects: both executions found all the blocks they needed in shared buffers (“shared hit”), so no disk reads were involved
- autovacuum: we disabled it on the table, so nothing can have cleaned up the dead tuples left behind by the DELETE

What happened is that the first execution had to visit all the table blocks and killed the 799000 index tuples that pointed to dead tuples.
The second execution didn't have to do that, which is why it was more than ten times faster.
Note that the difference in blocks touched by the two queries is not as big as one might suspect, since each table block contains many dead tuples (there is a high correlation).
It is also worth noting that even though they are not shown in the EXPLAIN output, some index pages must have been dirtied in the process, which will cause some disk writes from this read-only query.
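If you want to see the killed entries with your own eyes, the pageinspect extension can show them. This is just a sketch: the dead column of bt_page_items is only available in PostgreSQL 13 and later, and the block number is chosen arbitrarily (block 0 is the metapage):

-- look for index tuples that have been marked dead ("killed") on one leaf page
CREATE EXTENSION IF NOT EXISTS pageinspect;
SELECT itemoffset, ctid, dead
FROM bt_page_items('hole_pkey', 1)
WHERE dead;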
Next time you encounter mysterious query run-time differences that cannot be explained by caching effects, consider killed index tuples as the cause. This may be an indication to configure autovacuum to run more often on the affected table, since that will get rid of the dead tuples and speed up the first query execution.
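One way to do that, sketched here with placeholder numbers that you would adapt to your workload, is to lower the autovacuum thresholds for just that table:

-- re-enable autovacuum for the demo table and make it trigger after
-- roughly 1% of the rows have changed (the global default is 20%)
ALTER TABLE hole SET (
   autovacuum_enabled = on,
   autovacuum_vacuum_scale_factor = 0.01,
   autovacuum_vacuum_threshold = 1000
);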
In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Twitter, Facebook, or LinkedIn.
zheap behaves similarly?
btpg12=# create unlogged table zhole
(id integer PRIMARY KEY,
val text NOT NULL)
with (storage_engine='zheap');
CREATE TABLE
btpg12=# INSERT INTO zhole
SELECT i, 'text number ' || i
FROM generate_series(1, 1000000) AS i;
INSERT 0 1000000
Time: 4429.463 ms (00:04.429)
btpg12=# ANALYZE zhole;
ANALYZE
Time: 961.388 ms
btpg12=# delete from zhole
btpg12-# WHERE id BETWEEN 501 AND 799500;
DELETE 799000
Time: 2437.737 ms (00:02.438)
btpg12=# ANALYZE zhole;
ANALYZE
Time: 2025.126 ms (00:02.025)
btpg12=# EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, TIMING OFF)
SELECT * FROM zhole
WHERE id BETWEEN 1 AND 800000;
QUERY PLAN
-----------------------------------------------------------------
Index Scan using zhole_pkey on zhole (actual rows=1000 loops=1)
Index Cond: ((id >= 1) AND (id <= 800000))
Buffers: shared hit=79
Planning Time: 0.349 ms
Execution Time: 1.133 ms
(5 rows)
Time: 2.313 ms
Yes, that definitely looks like the same thing is happening.
This is more a feature of the index than of the heap, so I guess they just didn't modify the behaviour there.
the vacuum comparison is interesting
btpg12=# vacuum zhole;
VACUUM
Time: 34.849 ms
btpg12=# vacuum verbose zhole;
INFO: vacuuming "public.zhole"
INFO: "zhole": found 0 removable, 0 nonremovable row versions in 0 out of 3952 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 712
There were 0 unused item pointers.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: vacuuming "pg_toast.pg_toast_16875"
INFO: "pg_toast_16875": found 0 removable, 0 nonremovable row versions in 0 out of 1 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 713
There were 0 unused item pointers.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
btpg12=# vacuum verbose hole;
INFO: vacuuming "public.hole"
INFO: scanned index "hole_pkey" to remove 799000 row versions
DETAIL: CPU: user: 0.27 s, system: 0.00 s, elapsed: 0.27 s
INFO: "hole": removed 799000 row versions in 5090 pages
DETAIL: CPU: user: 0.02 s, system: 0.00 s, elapsed: 0.02 s
INFO: index "hole_pkey" now contains 201000 row versions in 2745 pages
DETAIL: 799000 index row versions were removed.
2181 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: "hole": found 0 removable, 201000 nonremovable row versions in 6370 out of 6370 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 714
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.36 s, system: 0.00 s, elapsed: 0.36 s.
INFO: vacuuming "pg_toast.pg_toast_16867"
INFO: index "pg_toast_16867_index" now contains 0 row versions in 1 pages
DETAIL: 0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: "pg_toast_16867": found 0 removable, 0 nonremovable row versions in 0 out of 0 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 714
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
thanks - this is a very good article
Those killed entries are considered live on a standby, potentially leading to weird performance differences between primary and standby clusters on particular select queries (https://www.postgresql.org/message-id/flat/7067.1529246768@sss.pgh.pa.us#d9e2e570ba34fc96c4300a362cbe8c38)
Thanks for sharing this example. The final thing that was missing was re-running the same query after vacuum. For me, timing goes from 138 ms before the delete, to 93 ms on the first run after it, 6.3 ms on the second run, and 0.22 ms if I vacuum and re-run, with shared hits dropping from 2196 to 15. Tune autovacuum!
This article was very helpful for me! Thank you!
Thanks for the article and examples.
> some index pages must have been dirtied in the process, which will cause some disk writes from this read-only query.
How does that work with physical replication, where a read-only query runs on a read-only replica? Are those hint bits replicated? I imagine what must happen is that those hint bits are written on the primary and then replicated. But I had never thought about that before. I'm checking this out. https://wiki.postgresql.org/wiki/Hint_Bits Any other summaries of this would be helpful. Cheers.