Since I only recently learned about the concept of “killed index tuples”, I thought there might be others who are not yet familiar with this interesting PostgreSQL feature.
This may give you an explanation the next time you encounter wildly varying execution times for the same execution plan of the same PostgreSQL query.
Before we look more closely at the index, let's review the life cycle of a table row version (“heap tuple”).
It is widely known that the visibility of heap tuples is determined by the system columns xmin and xmax (though there is more to xmax than meets the eye). A heap tuple is “dead” if its xmax is less than the xmin of all active transactions.

Now xmin and xmax are only valid if the respective transactions have been marked committed in the “commit log”. Consequently, any transaction that needs to know if it can see a tuple has to consult the commit log. To save future readers that extra work, the first one that consults the commit log will save the information in the tuple's “hint bits”.
Dead tuples are eventually reclaimed by VACUUM.
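As a small illustration (the table t below is made up and not part of the article's example), you can look at those system columns directly:

-- inspect xmin and xmax of a freshly inserted row
CREATE TABLE t (id integer);
INSERT INTO t VALUES (1);
-- xmin is the ID of the inserting transaction, xmax is 0 while the row is live;
-- after a DELETE, xmax holds the ID of the deleting transaction, and later
-- readers consult the commit log (and then the hint bits) to decide visibility
SELECT xmin, xmax, id FROM t;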
This is all fairly well known, but what is the situation with index entries?
To avoid redundancy and to keep index tuples small, the visibility information is not stored in the index.
The status of an index tuple is determined by the heap tuple it points to, and both are removed by VACUUM at the same time.
As a consequence, an index scan has to inspect the heap tuple to determine if it can “see” an entry. This is the case even if all the columns needed are in the index tuple itself. Even worse, this “heap access” will result in random I/O, which is not very efficient on spinning disks.
This makes index scans in PostgreSQL more expensive than in other database management systems that use a different architecture. To mitigate that, several features have been introduced over the years:

- bitmap index scans (introduced in 8.1), which collect the matching heap tuples in a bitmap and visit the heap pages in physical order, reducing the cost of the random I/O
- index-only scans (introduced in 9.2), which can skip the heap access entirely for pages that the visibility map marks as all-visible (see the sketch below)
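To make that concrete, here is a minimal sketch (the table demo and the row counts are made up, not part of the article's example). An index-only scan reports the heap visits it still has to make as “Heap Fetches”, and they disappear once VACUUM has marked the pages as all-visible in the visibility map:

-- even though only indexed columns are queried, the heap must be visited
-- as long as the visibility map does not mark the pages as all-visible
CREATE TABLE demo (id integer PRIMARY KEY, val text NOT NULL);
INSERT INTO demo
SELECT i, 'text number ' || i
FROM generate_series(1, 100000) AS i;
ANALYZE demo;

-- should show an Index Only Scan with a non-zero "Heap Fetches" count
EXPLAIN (ANALYZE, BUFFERS)
SELECT id FROM demo WHERE id BETWEEN 1 AND 1000;

-- after VACUUM has set the visibility map, the heap fetches are gone
VACUUM demo;
EXPLAIN (ANALYZE, BUFFERS)
SELECT id FROM demo WHERE id BETWEEN 1 AND 1000;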
But I want to focus on another feature, one that goes all the way back to version 7.3 (the commit below is from 2002), yet never made it into the release notes.
The feature was introduced with this commit:
commit 3f4d48802271126b1343289a9d2267ff1ed3788a
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Fri May 24 18:57:57 2002 +0000

    Mark index entries 'killed' when they are no longer visible to any
    transaction, so as to avoid returning them out of the index AM.  Saves
    repeated heap_fetch operations on frequently-updated rows.  Also detect
    queries on unique keys (equality to all columns of a unique index), and
    don't bother continuing scan once we have found first match.

    Killing is implemented in the btree and hash AMs, but not yet in rtree
    or gist, because there isn't an equally convenient place to do it in
    those AMs (the outer amgetnext routine can't do it without re-pinning
    the index page).

    Did some small cleanup on APIs of HeapTupleSatisfies, heap_fetch, and
    index_insert to make this a little easier.
Whenever an index scan fetches a heap tuple only to find that it is dead (that the entire “HOT chain” of tuples is dead, to be more precise), it marks the index tuple as “killed”. Then future index scans can simply ignore it.
This “hint bit for indexes” can speed up future index scans considerably.
Let's create a table to demonstrate killed index tuples. We have to disable autovacuum so that it does not spoil the effect by removing the dead tuples.
CREATE UNLOGGED TABLE hole (
   id integer PRIMARY KEY,
   val text NOT NULL
) WITH (autovacuum_enabled = off);

INSERT INTO hole
SELECT i, 'text number ' || i
FROM generate_series(1, 1000000) AS i;

ANALYZE hole;
We create a number of dead tuples by deleting rows from the table:
DELETE FROM hole
WHERE id BETWEEN 501 AND 799500;

ANALYZE hole;
Now we run the same query twice:
EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, TIMING OFF)
SELECT * FROM hole
WHERE id BETWEEN 1 AND 800000;

                           QUERY PLAN
---------------------------------------------------------------
 Index Scan using hole_pkey on hole (actual rows=1000 loops=1)
   Index Cond: ((id >= 1) AND (id <= 800000))
   Buffers: shared hit=7284
 Planning Time: 0.346 ms
 Execution Time: 222.129 ms
(5 rows)

EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, TIMING OFF)
SELECT * FROM hole
WHERE id BETWEEN 1 AND 800000;

                           QUERY PLAN
---------------------------------------------------------------
 Index Scan using hole_pkey on hole (actual rows=1000 loops=1)
   Index Cond: ((id >= 1) AND (id <= 800000))
   Buffers: shared hit=2196
 Planning Time: 0.217 ms
 Execution Time: 14.540 ms
(5 rows)
There are a number of things that cannot account for the difference:

- caching effects: both executions found all the blocks they needed in shared buffers (“shared hit”), so no disk reads were involved
- autovacuum: we disabled it on the table, so nothing can have cleaned up the dead tuples left behind by the DELETE

What happened is that the first execution had to visit all the table blocks and killed the 799000 index tuples that pointed to dead tuples.
The second execution didn't have to do that, which is why it was more than ten times faster.
Note that the difference in blocks touched by the two queries is not as big as one might suspect, since each table block contains many dead tuples (there is a high correlation).
It is also worth noting that even though they are not shown in the EXPLAIN output, some index pages must have been dirtied in the process, which will cause some disk writes from this read-only query.
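If you want to see the killed entries with your own eyes, the pageinspect extension can show them. This is just a sketch: the dead column of bt_page_items is only available in PostgreSQL 13 and later, and the block number is chosen arbitrarily (block 0 is the metapage):

-- look for index tuples that have been marked dead ("killed") on one leaf page
CREATE EXTENSION IF NOT EXISTS pageinspect;
SELECT itemoffset, ctid, dead
FROM bt_page_items('hole_pkey', 1)
WHERE dead;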
Next time you encounter mysterious query run-time differences that cannot be explained by caching effects, consider killed index tuples as the cause. This may be an indication to configure autovacuum to run more often on the affected table, since that will get rid of the dead tuples and speed up the first query execution.
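One way to do that, sketched here with placeholder numbers that you would adapt to your workload, is to lower the autovacuum thresholds for just that table:

-- re-enable autovacuum for the demo table and make it trigger after
-- roughly 1% of the rows have changed (the global default is 20%)
ALTER TABLE hole SET (
   autovacuum_enabled = on,
   autovacuum_vacuum_scale_factor = 0.01,
   autovacuum_vacuum_threshold = 1000
);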
In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Twitter, Facebook, or LinkedIn.
zheap behaves similarly?
btpg12=# create unlogged table zhole
(id integer PRIMARY KEY,
val text NOT NULL)
with (storage_engine='zheap');
CREATE TABLE
btpg12=# INSERT INTO zhole
SELECT i, 'text number ' || i
FROM generate_series(1, 1000000) AS i;
INSERT 0 1000000
Time: 4429.463 ms (00:04.429)
btpg12=# ANALYZE zhole;
ANALYZE
Time: 961.388 ms
btpg12=# delete from zhole
btpg12-# WHERE id BETWEEN 501 AND 799500;
DELETE 799000
Time: 2437.737 ms (00:02.438)
btpg12=# ANALYZE zhole;
ANALYZE
Time: 2025.126 ms (00:02.025)
btpg12=# EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, TIMING OFF)
SELECT * FROM zhole
WHERE id BETWEEN 1 AND 800000;
QUERY PLAN
-----------------------------------------------------------------
Index Scan using zhole_pkey on zhole (actual rows=1000 loops=1)
Index Cond: ((id >= 1) AND (id <= 800000))
Buffers: shared hit=79
Planning Time: 0.349 ms
Execution Time: 1.133 ms
(5 rows)
Time: 2.313 ms
Yes, that definitely looks like the same thing is happening.
This is more a feature of the index than of the heap, so I guess they just didn't modify the behaviour there.
the vacuum comparison is interesting
btpg12=# vacuum zhole;
VACUUM
Time: 34.849 ms
btpg12=# vacuum verbose zhole;
INFO: vacuuming "public.zhole"
INFO: "zhole": found 0 removable, 0 nonremovable row versions in 0 out of 3952 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 712
There were 0 unused item pointers.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: vacuuming "pg_toast.pg_toast_16875"
INFO: "pg_toast_16875": found 0 removable, 0 nonremovable row versions in 0 out of 1 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 713
There were 0 unused item pointers.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
btpg12=# vacuum verbose hole;
INFO: vacuuming "public.hole"
INFO: scanned index "hole_pkey" to remove 799000 row versions
DETAIL: CPU: user: 0.27 s, system: 0.00 s, elapsed: 0.27 s
INFO: "hole": removed 799000 row versions in 5090 pages
DETAIL: CPU: user: 0.02 s, system: 0.00 s, elapsed: 0.02 s
INFO: index "hole_pkey" now contains 201000 row versions in 2745 pages
DETAIL: 799000 index row versions were removed.
2181 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: "hole": found 0 removable, 201000 nonremovable row versions in 6370 out of 6370 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 714
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.36 s, system: 0.00 s, elapsed: 0.36 s.
INFO: vacuuming "pg_toast.pg_toast_16867"
INFO: index "pg_toast_16867_index" now contains 0 row versions in 1 pages
DETAIL: 0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: "pg_toast_16867": found 0 removable, 0 nonremovable row versions in 0 out of 0 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 714
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
thanks - this is a very good article
Those killed entries are considered live on a standby, potentially leading to weird performance differences between primary and standby clusters on particular select queries (https://www.postgresql.org/message-id/flat/7067.1529246768@sss.pgh.pa.us#d9e2e570ba34fc96c4300a362cbe8c38)
Thanks for sharing this example. The final thing that was missing was re-running the same query after vacuum. For me, timing goes from 138 ms before the delete, to 93 ms on the first run after it, 6.3 ms on the second run, and 0.22 ms if I vacuum and re-run, with shared hits dropping from 2196 to 15. Tune autovacuum!
This article was very helpful for me! Thank you!
Thanks for the article and examples.
> some index pages must have been dirtied in the process, which will cause some disk writes from this read-only query.
How does that work with physical replication, where a read-only query runs on a read-only replica? Are those hint bits replicated? I imagine what must happen is that those hint bits are written on the primary and then replicated. But I had never thought about that before. I'm checking this out. https://wiki.postgresql.org/wiki/Hint_Bits Any other summaries of this would be helpful. Cheers.