Enough has been written about the connection between VACUUM, the visibility map and index-only scans in PostgreSQL. But recently I demonstrated index-only scans to a class I was teaching and made some observations that surprised me at first. I could eventually explain what I saw, but I may have given the class more to chew on than I intended. I myself found the experience enlightening, and if you would like some insight into the implementation details and performance features of PostgreSQL, welcome to the show.
Every PostgreSQL table has a visibility map that stores two bits for each of the table's 8kB pages: the all-frozen flag, which allows anti-wraparound autovacuum to skip some pages for better performance, and the all-visible flag, which will be our focus in this article. If the all-visible bit for a page is set, PostgreSQL knows that all tuples in that page are visible to all transactions. It can then skip fetching the table row if the sole purpose is to check it for visibility, which leads to index-only scans that actually deserve that name. Since VACUUM removes dead tuples, which renders table pages all-visible, it also has the job of maintaining the visibility map. Consequently, you need to make sure that a table is VACUUMed often enough if you want efficient index-only scans.
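By the way, if you want a quick impression of how well the visibility map of a table is maintained, you can compare relallvisible with relpages in pg_class. This is only a rough check, since both counters are estimates that get updated by VACUUM and ANALYZE, and the table name below is just a placeholder:

-- approximate fraction of the table's pages that are marked all-visible
-- (replace "mytable" with the table you are interested in)
SELECT relpages,
       relallvisible,
       round(relallvisible * 100.0 / greatest(relpages, 1), 1) AS pct_all_visible
FROM pg_class
WHERE relname = 'mytable';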
All we need is a reasonably big table with an index (in this case, the primary key index). I use an unlogged table to avoid writing WAL; that choice won't affect our observations.
CREATE UNLOGGED TABLE vis (
   id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
   val double precision NOT NULL
);

INSERT INTO vis (val)
SELECT random() FROM generate_series(1, 1000000);
We VACUUM the table to set the hint bits and build the visibility map. We then ANALYZE it to get good optimizer statistics.
VACUUM (ANALYZE) vis;
EXPLAIN (ANALYZE) to check the quality of the index-only scan

Among much other valuable information, EXPLAIN (ANALYZE) shows us how many table tuples had to be fetched to check them for visibility because the page's all-visible flag wasn't set:
EXPLAIN (ANALYZE, COSTS OFF) SELECT id FROM vis WHERE id < 1000;

                                     QUERY PLAN
═══════════════════════════════════════════════════════════════════════════════════
 Index Only Scan using vis_pkey on vis (actual time=0.083..0.522 rows=999 loops=1)
   Index Cond: (id < 1000)
   Heap Fetches: 0
 Planning Time: 0.283 ms
 Execution Time: 0.713 ms
We see that PostgreSQL didn’t have to fetch a single table row, since all pages were all-visible.
If we update a single row in the table, PostgreSQL has to reset the all-visible flag in the visibility map for two table pages: the page that holds the old row version and the page where the new row version is stored. Here is a demonstration:
UPDATE vis SET val = 0.0 WHERE id = 7;

EXPLAIN (ANALYZE, COSTS OFF) SELECT id FROM vis WHERE id < 1000;

                                     QUERY PLAN
═══════════════════════════════════════════════════════════════════════════════════
 Index Only Scan using vis_pkey on vis (actual time=0.117..0.507 rows=999 loops=1)
   Index Cond: (id < 1000)
   Heap Fetches: 186
 Planning Time: 0.255 ms
 Execution Time: 0.660 ms
The row with id = 7 was in the first page of the table, and that page must contain 185 table rows (the remaining heap fetch is for the updated row version, which is stored on a different page). Since the page is no longer all-visible, PostgreSQL has to fetch all tuples in it and test whether they are visible to the transaction.
It is good to have an explanation for what happened, but it is better to verify that our reasoning is correct. We can do that by looking at the current tuple IDs (ctid) of the table rows:
SELECT ctid, * FROM vis;

    ctid    │   id    │         val
════════════╪═════════╪════════════════════════
 (0,1)      │       1 │    0.9428522231691654
 (0,2)      │       2 │    0.8280668280240011
 (0,3)      │       3 │    0.7906647703465222
 (0,4)      │       4 │   0.19579683924366442
 (0,5)      │       5 │    0.5798129594075341
 (0,6)      │       6 │    0.5612578971780957
 (0,8)      │       8 │   0.29021282947733007
 (0,9)      │       9 │    0.2606336534585936
 (0,10)     │      10 │    0.7574579143204805
 ...
 (0,184)    │     184 │   0.38348399704071734
 (0,185)    │     185 │   0.08413532836604021
 (1,1)      │     186 │    0.9946325078631779
 (1,2)      │     187 │     0.694070825069186
 ...
 (5405,74)  │  999999 │    0.8119554439812944
 (5405,75)  │ 1000000 │    0.3859512822842295
 (5405,76)  │       7 │                     0
(1000000 rows)
The first number in the ctid is the page number. So we indeed have 185 tuples in the first page (page number 0): 184 visible ones and one invisible (updated) tuple. Page 5405 is also no longer all-visible, so scanning the index for the row with id = 7 causes another heap fetch, and we arrive at a total of 186 heap fetches.
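As a cross-check (this query is merely a convenience and was not part of the original demonstration), we can extract the page number from the ctid and count the tuples per page:

-- the first coordinate of the ctid, converted via point, is the block number
SELECT (ctid::text::point)[0]::bigint AS page_number,
       count(*) AS tuples
FROM vis
GROUP BY page_number
ORDER BY page_number
LIMIT 3;

Note that this counts only visible tuples, so page 0 shows 184 rather than 185: the invisible old version of id = 7 is not returned by a plain SELECT.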
So far, we have only inferred that the UPDATE we executed above reset the all-visible flag on two pages, but we can also prove that the deduction was correct. Using the standard extension pg_visibility, we can examine the visibility map directly:
CREATE EXTENSION pg_visibility;

SELECT * FROM pg_visibility_map('vis');

 blkno │ all_visible │ all_frozen
═══════╪═════════════╪════════════
     0 │ f           │ f
     1 │ t           │ f
     2 │ t           │ f
     3 │ t           │ f
 ...
  5401 │ t           │ f
  5402 │ t           │ f
  5403 │ t           │ f
  5404 │ t           │ f
  5405 │ f           │ f
(5406 rows)
This confirms that all but the first and the last page of the table are all-visible.
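If you just want the totals rather than a per-page listing, pg_visibility also offers a summary function (an aside; with the state above it should report 5404 all-visible and 0 all-frozen pages):

-- aggregated counts from the visibility map of the whole table
SELECT * FROM pg_visibility_map_summary('vis');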
So far, everything was as I expected. My first surprise came when I repeated the EXPLAIN (ANALYZE):
EXPLAIN (ANALYZE, COSTS OFF) SELECT id FROM vis WHERE id < 1000;

                                     QUERY PLAN
═══════════════════════════════════════════════════════════════════════════════════
 Index Only Scan using vis_pkey on vis (actual time=0.085..0.742 rows=999 loops=1)
   Index Cond: (id < 1000)
   Heap Fetches: 185
 Planning Time: 0.293 ms
 Execution Time: 0.966 ms
Suddenly, there is one heap fetch less. What happened? Did autovacuum run on the table? That cannot be: by default, autovacuum doesn't start before dead tuples make up 20% of the table, and we only updated a single row!
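As a reminder, autovacuum processes a table once the dead tuples exceed autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples (50 plus 20% of the table with the default settings). A rough way to check where we stand, approximating reltuples with n_live_tup:

-- dead tuples versus the autovacuum trigger point for our table
SELECT n_dead_tup,
       current_setting('autovacuum_vacuum_threshold')::bigint
          + current_setting('autovacuum_vacuum_scale_factor')::float8 * n_live_tup
          AS autovacuum_trigger
FROM pg_stat_user_tables
WHERE relname = 'vis';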
Fortunately, I remembered something I blogged about a while ago: killed index tuples. If an index scan detects that a table row is no longer visible to any transaction, it marks the index entry LP_DEAD, so that the next index scan knows it can ignore the entry. Our first EXPLAIN (ANALYZE) must have killed the index entry pointing to the old, now dead row version, so the second one could avoid one heap fetch.
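If you want to see the LP_DEAD flags with your own eyes, the pageinspect extension can show them: bt_page_items() has a dead column in recent PostgreSQL versions. Block 1 happens to be the leaf page holding the lowest keys of our primary key index, but treat the block number as an assumption that may differ in other setups:

CREATE EXTENSION pageinspect;

-- the "dead" column shows which index entries an index scan has killed
SELECT itemoffset, ctid, dead
FROM bt_page_items('vis_pkey', 1)
ORDER BY itemoffset
LIMIT 10;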
VACUUM doesn't update the visibility map as it should

Next, I wanted to demonstrate that VACUUM removes dead tuples, which makes pages all-visible, and updates the visibility map:
VACUUM vis;

EXPLAIN (ANALYZE, COSTS OFF) SELECT id FROM vis WHERE id < 1000;

                                     QUERY PLAN
═══════════════════════════════════════════════════════════════════════════════════
 Index Only Scan using vis_pkey on vis (actual time=0.125..1.097 rows=999 loops=1)
   Index Cond: (id < 1000)
   Heap Fetches: 184
 Planning Time: 1.931 ms
 Execution Time: 1.535 ms
Huh? Why do we still have 184 heap fetches, only one less than before?
VACUUM's failure to set all pages all-visible

I had to think a while before I remembered an optimization that went into PostgreSQL v14 with commit 5100010ee4:
Author: Peter Geoghegan <pg@bowt.ie>
Date:   Wed Apr 7 16:14:54 2021 -0700

    Teach VACUUM to bypass unnecessary index vacuuming.

    [...]

    VACUUM has generally managed to avoid index scans when there were
    clearly no index tuples to delete from indexes.  But cases with
    _close to_ no index tuples to delete were another matter -- a round
    of ambulkdelete() calls took place (one per index), each of which
    performed a full index scan.

    VACUUM now behaves just as if there were zero index tuples to delete
    in cases where there are in fact "virtually zero" such tuples.  That
    is, it can now bypass index vacuuming and heap vacuuming as an
    optimization (though not index cleanup).  Whether or not VACUUM
    bypasses indexes is determined dynamically, based on the just-observed
    number of heap pages in the table that have one or more LP_DEAD items
    (LP_DEAD items in heap pages have a 1:1 correspondence with index
    tuples that still need to be deleted from each index in the worst
    case).

    We only skip index vacuuming when 2% or less of the table's pages
    have one or more LP_DEAD items -- bypassing index vacuuming as an
    optimization must not noticeably impede setting bits in the
    visibility map.

    [...]
Since we updated only a single row, only one of the table's 5406 pages contains a dead row, which is clearly less than 2%. So VACUUM skipped the expensive step of cleaning up the index. If VACUUM does not clean up the index and does not remove the index entry that points to the dead row version, it cannot remove the dead row version itself, and the first table page cannot become all-visible.
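On v14 and later, you can watch that decision directly: VACUUM (VERBOSE) reports whether index vacuuming was performed or bypassed (the exact wording of the message depends on the server version):

-- the VERBOSE output should mention that index vacuuming was bypassed,
-- because only a tiny fraction of the pages contain dead item identifiers
VACUUM (VERBOSE) vis;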
However, when VACUUM scans the last table page, which contains the updated row, it notices that all entries in that page are visible to everybody and sets the all-visible flag for that page. As a consequence, the table row with id = 7 is now on an all-visible page, and PostgreSQL can skip the heap fetch for that one row.
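We can verify that with pg_visibility again, this time looking only at the two pages in question (block numbers taken from the ctid output above):

-- block 0 still holds the dead row version, block 5405 only holds live rows,
-- so we expect all_visible = f for block 0 and t for block 5405
SELECT 0 AS blkno, * FROM pg_visibility_map('vis', 0)
UNION ALL
SELECT 5405, * FROM pg_visibility_map('vis', 5405);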
Forcing VACUUM to clean up the index

If we desperately need to get the heap fetches down to zero again, we can use an option of VACUUM that was added in commit 3499df0dee:
Author: Peter Geoghegan <pg@bowt.ie>
Date:   Fri Jun 18 20:04:07 2021 -0700

    Support disabling index bypassing by VACUUM.

    Generalize the INDEX_CLEANUP VACUUM parameter (and the corresponding
    reloption): make it into a ternary style boolean parameter.  It now
    exposes a third option, "auto".  The "auto" option (which is now the
    default) enables the "bypass index vacuuming" optimization added by
    commit 1e55e7d1.

    "VACUUM (INDEX_CLEANUP TRUE)" is redefined to once again make VACUUM
    simply do any required index vacuuming, regardless of how few dead
    tuples are encountered during the first scan of the target heap
    relation (unless there are exactly zero).  This gives users a way of
    opting out of the "bypass index vacuuming" optimization, if for
    whatever reason that proves necessary.

    [...]
Using INDEX_CLEANUP ON, we can make VACUUM clean up the index, no matter what:
VACUUM (INDEX_CLEANUP ON) vis;

EXPLAIN (ANALYZE, COSTS OFF) SELECT id FROM vis WHERE id < 1000;

                                     QUERY PLAN
═══════════════════════════════════════════════════════════════════════════════════
 Index Only Scan using vis_pkey on vis (actual time=0.084..0.482 rows=999 loops=1)
   Index Cond: (id < 1000)
   Heap Fetches: 0
 Planning Time: 0.585 ms
 Execution Time: 0.678 ms
It is also possible to configure the table so that autovacuum always cleans up the index:
ALTER TABLE vis SET (vacuum_index_cleanup = on);
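If you later want to return to the default behavior ("auto", which allows the bypass optimization), you can reset the storage parameter:

-- revert to the default index cleanup behavior for this table
ALTER TABLE vis RESET (vacuum_index_cleanup);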
We had a thorough look at the PostgreSQL visibility map and how VACUUM keeps it up to date. An experiment showed that from v14 on, a performance optimization in PostgreSQL sometimes keeps VACUUM from making some pages all-visible. We also got to see the pg_visibility extension, which allows us to examine the visibility map directly.
If you'd like to read more about how the visibility map shows up in EXPLAIN (ANALYZE), check out Ants' article about the performance downsides of UUIDs.
Hello,
Very interesting article, I am wondering if it is related to my problem. I have a very complex query, and the critical part of the execution plan can vary greatly between runs, as you can see here.
-> Parallel Index Only Scan using zie5tikondat on tikondat (cost=0.70..1229.69 rows=13080 width=27) (actual time=0.110..6.950 rows=9972 loops=3)
-> Index Only Scan using zie5tikondat on tikondat (cost=0.70..140.39 rows=2673 width=27) (actual time=0.053..16.115 rows=29915 loops=3)
-> Index Only Scan using zie5tikondat on tikondat (cost=0.70..252.71 rows=2254 width=27) (actual time=0.049..83422.021 rows=261612559 loops=3)
The table is very large, and changes are being made throughout the day. But I can't get my head around the last scan, where it returned 261612559 rows, while only 29915 actually match the condition. This scan also had a large number of heap fetches.
I am thinking of using the INDEX_CLEANUP ON option. What do you think?
Regards
Richard Hennessy
From that little bit of information, it is hard to say what is going on. There are several possibilities:
- The table statistics are outdated. In that case, ANALYZE tikondat should improve the estimate.
- The table statistics are not detailed enough. In that case, ALTER TABLE tikondat ALTER the_column SET STATISTICS 500 followed by ANALYZE tikondat might fix the problem.

Note that in the last example, there really were 261612559 rows returned from the index scan, not 29915. Perhaps other filters later in the execution plan got rid of many of them.
I recommend using EXPLAIN (ANALYZE, BUFFERS) so that you get the relevant buffer statistics as well. Also, enabling track_io_timing can provide valuable information in the EXPLAIN output.

Thank you for your reply.
ANALYZE is run regularly, both manually and automatically. I did try setting the statistics target to 1000, though to be honest that was earlier on in the analysis, and I cannot remember if that large index scan was present then. I will try it again.
The table has around 282000000 rows; I have no idea how the figure 261612559 came about, let alone why. The rows will have been filtered out later through various joins.
I have been using EXPLAIN (ANALYZE, BUFFERS) all the way through here, but I'll also try track_io_timing to see if it sheds more light on this issue.
Thanks again.
I have a few doubts:
1. Why is the heap fetch count 186 just after updating the row? What I am unable to understand is this: you mentioned that on page 0 it had to scan 185 tuples, and on page 5405 it had to scan one additional tuple. But why only 1 tuple on page 5405? Why not all of them again, since the all-visible flag for that page is also off?
2. Does the index also contain multiple entries for the same row if we update the row?
3. How does the query processor know which pages from the heap to scan?
4. Let's assume we found an index entry that says id = 7. Why does it need to check the visibility map? And how exactly does it check (what happens when it sees an all-visible page)?
5. Let's assume a few more rows were updated along with id = 7. When the query processor is looking for pages in the heap, does it have to go through all partially visible pages, even if the tuple we are interested in is not on those pages?
1. PostgreSQL only has to read a single tuple from page 5405: the updated version of the row. Scanning any other index entry that points to that page would also cause a heap fetch, but we don't have to read any of those entries.
2. Yes.
3. Because the page number is part of the physical address of the tuple (the ctid), which is the pointer to the table row stored with the index leaf entry.
4. Because the table row that belongs to this index entry might not be visible in the query's snapshot. But if the page containing the tuple is all-visible in the visibility map, we are done, because that means that all tuples in that page are visible to all transactions.
5. No, because (as explained above) the physical address of the tuple, and hence the page number, is stored in the index entry.
Thanks for answering my questions. However, I have a few more:
1. If the page number is stored in the index entry, then initially the dead row (for id = 7) should say page = 0. Now, if we know the page is 0 and it's not all-visible, do we have to scan all the tuples inside that page?
2. Assuming it requires scanning all tuples on page 0, it then figures out that the required tuple is not present on that page, and then it finds a new index entry that says page = 5405. This page also doesn't have the all-visible flag set, so how does it find the required tuple in just one index scan instead of going through all tuples?
3. I read that LP_DEAD is not propagated to replica instances, which means they would end up going through all the heap fetches on each query. What is the recommendation to reduce heap fetches on replica instances? Any way other than aggressive vacuum/analyze? Also, does table partitioning help with this situation in any way?