Enough has been written about the connection between VACUUM, the visibility map and index-only scans in PostgreSQL. But recently I demonstrated index-only scans to a class I was teaching and made some observations that surprised me at first. I could eventually explain what I saw, but I may have given the class more to chew on than I intended. I myself found the experience enlightening, and if you would like some insight into the implementation details and performance features of PostgreSQL, welcome to the show.
Every PostgreSQL table has a visibility map that stores two bits for each of the table's 8kB pages: the all-frozen flag, which allows anti-wraparound autovacuum to skip some pages for better performance, and the all-visible flag, which will be our focus in this article. If the all-visible bit for a page is set, PostgreSQL knows that all tuples in that page are visible to all transactions. It can then skip fetching the table row if the sole purpose is to check it for visibility, which leads to index-only scans that actually deserve that name. Since VACUUM removes dead tuples, which renders table pages all-visible, it also has the job of maintaining the visibility map. Consequently, you need to make sure that a table is VACUUMed often enough if you want efficient index-only scans.
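By the way, if you want a quick impression of how well the visibility map of a table is maintained, you can compare relallvisible with relpages in pg_class. This is only a rough check, since both counters are estimates that get updated by VACUUM and ANALYZE, and the table name below is just a placeholder:

-- approximate fraction of the table's pages that are marked all-visible
-- (replace "mytable" with the table you are interested in)
SELECT relpages,
       relallvisible,
       round(relallvisible * 100.0 / greatest(relpages, 1), 1) AS pct_all_visible
FROM pg_class
WHERE relname = 'mytable';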
All we need is a reasonably big table with an index (in this case, the primary key index). I use an unlogged table to avoid writing WAL; that choice won't affect our observations.
CREATE UNLOGGED TABLE vis (
   id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
   val double precision NOT NULL
);

INSERT INTO vis (val)
SELECT random() FROM generate_series(1, 1000000);
We VACUUM the table to set the hint bits and build the visibility map. We then ANALYZE it to get good optimizer statistics.
VACUUM (ANALYZE) vis;
EXPLAIN (ANALYZE) to check the quality of the index-only scan

Among much other valuable information, EXPLAIN (ANALYZE) shows us how many table tuples had to be fetched to check them for visibility because the page's all-visible flag wasn't set:
EXPLAIN (ANALYZE, COSTS OFF) SELECT id FROM vis WHERE id < 1000;

                                     QUERY PLAN
═══════════════════════════════════════════════════════════════════════════════════
 Index Only Scan using vis_pkey on vis (actual time=0.083..0.522 rows=999 loops=1)
   Index Cond: (id < 1000)
   Heap Fetches: 0
 Planning Time: 0.283 ms
 Execution Time: 0.713 ms
We see that PostgreSQL didn’t have to fetch a single table row, since all pages were all-visible.
If we update a single row in the table, PostgreSQL has to reset the all-visible flag in the visibility map for two table pages: the page that holds the old row version and the page where the new row version is stored. Here is a demonstration:
UPDATE vis SET val = 0.0 WHERE id = 7;

EXPLAIN (ANALYZE, COSTS OFF) SELECT id FROM vis WHERE id < 1000;

                                     QUERY PLAN
═══════════════════════════════════════════════════════════════════════════════════
 Index Only Scan using vis_pkey on vis (actual time=0.117..0.507 rows=999 loops=1)
   Index Cond: (id < 1000)
   Heap Fetches: 186
 Planning Time: 0.255 ms
 Execution Time: 0.660 ms
The row with id = 7 was in the first page of the table, and that page must contain 185 table rows (the remaining heap fetch is for the updated row version, which is stored on a different page). Since the page is no longer all-visible, PostgreSQL has to fetch all tuples in it and test whether they are visible to the transaction.
It is good to have an explanation for what happened, but it is better to verify that our reasoning is correct. We can do that by looking at the current tuple IDs (ctid) of the table rows:
SELECT ctid, * FROM vis;

    ctid    │   id    │         val
════════════╪═════════╪════════════════════════
 (0,1)      │       1 │    0.9428522231691654
 (0,2)      │       2 │    0.8280668280240011
 (0,3)      │       3 │    0.7906647703465222
 (0,4)      │       4 │   0.19579683924366442
 (0,5)      │       5 │    0.5798129594075341
 (0,6)      │       6 │    0.5612578971780957
 (0,8)      │       8 │   0.29021282947733007
 (0,9)      │       9 │    0.2606336534585936
 (0,10)     │      10 │    0.7574579143204805
 ...
 (0,184)    │     184 │   0.38348399704071734
 (0,185)    │     185 │   0.08413532836604021
 (1,1)      │     186 │    0.9946325078631779
 (1,2)      │     187 │     0.694070825069186
 ...
 (5405,74)  │  999999 │    0.8119554439812944
 (5405,75)  │ 1000000 │    0.3859512822842295
 (5405,76)  │       7 │                     0
(1000000 rows)
The first number in the ctid is the page number. So we indeed have 185 tuples in the first page (page number 0): 184 visible ones and one invisible (updated) tuple. Page 5405 is also no longer all-visible, so scanning the index for the row with id = 7 causes another heap fetch, and we arrive at a total of 186 heap fetches.
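As a cross-check (this query is merely a convenience and was not part of the original demonstration), we can extract the page number from the ctid and count the tuples per page:

-- the first coordinate of the ctid, converted via point, is the block number
SELECT (ctid::text::point)[0]::bigint AS page_number,
       count(*) AS tuples
FROM vis
GROUP BY page_number
ORDER BY page_number
LIMIT 3;

Note that this counts only visible tuples, so page 0 shows 184 rather than 185: the invisible old version of id = 7 is not returned by a plain SELECT.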
So far, we have only inferred that the UPDATE we executed above reset the all-visible flag on two pages, but we can also prove that the deduction was correct. Using the standard extension pg_visibility, we can examine the visibility map directly:
CREATE EXTENSION pg_visibility;

SELECT * FROM pg_visibility_map('vis');

 blkno │ all_visible │ all_frozen
═══════╪═════════════╪════════════
     0 │ f           │ f
     1 │ t           │ f
     2 │ t           │ f
     3 │ t           │ f
 ...
  5401 │ t           │ f
  5402 │ t           │ f
  5403 │ t           │ f
  5404 │ t           │ f
  5405 │ f           │ f
(5406 rows)
This confirms that all but the first and the last page of the table are all-visible.
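If you just want the totals rather than a per-page listing, pg_visibility also offers a summary function (an aside; with the state above it should report 5404 all-visible and 0 all-frozen pages):

-- aggregated counts from the visibility map of the whole table
SELECT * FROM pg_visibility_map_summary('vis');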
So far, everything was as I expected. My first surprise came when I repeated the EXPLAIN (ANALYZE):
EXPLAIN (ANALYZE, COSTS OFF) SELECT id FROM vis WHERE id < 1000;

                                     QUERY PLAN
═══════════════════════════════════════════════════════════════════════════════════
 Index Only Scan using vis_pkey on vis (actual time=0.085..0.742 rows=999 loops=1)
   Index Cond: (id < 1000)
   Heap Fetches: 185
 Planning Time: 0.293 ms
 Execution Time: 0.966 ms
Suddenly, there is one heap fetch less. What happened? Did autovacuum run on the table? That cannot be: by default, autovacuum doesn't start before dead tuples make up 20% of the table, and we only updated a single row!
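As a reminder, autovacuum processes a table once the dead tuples exceed autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples (50 plus 20% of the table with the default settings). A rough way to check where we stand, approximating reltuples with n_live_tup:

-- dead tuples versus the autovacuum trigger point for our table
SELECT n_dead_tup,
       current_setting('autovacuum_vacuum_threshold')::bigint
          + current_setting('autovacuum_vacuum_scale_factor')::float8 * n_live_tup
          AS autovacuum_trigger
FROM pg_stat_user_tables
WHERE relname = 'vis';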
Fortunately, I remembered something I blogged about a while ago: killed index tuples. If an index scan detects that a table row is no longer visible to any transaction, it marks the index entry LP_DEAD, so that the next index scan knows it can ignore the entry. Our first EXPLAIN (ANALYZE) must have killed the index entry pointing to the old, now dead row version, so the second one could avoid one heap fetch.
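If you want to see the LP_DEAD flags with your own eyes, the pageinspect extension can show them: bt_page_items() has a dead column in recent PostgreSQL versions. Block 1 happens to be the leaf page holding the lowest keys of our primary key index, but treat the block number as an assumption that may differ in other setups:

CREATE EXTENSION pageinspect;

-- the "dead" column shows which index entries an index scan has killed
SELECT itemoffset, ctid, dead
FROM bt_page_items('vis_pkey', 1)
ORDER BY itemoffset
LIMIT 10;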
VACUUM doesn't update the visibility map as it should

Next, I wanted to demonstrate that VACUUM removes dead tuples, which makes pages all-visible, and updates the visibility map:
VACUUM vis;

EXPLAIN (ANALYZE, COSTS OFF) SELECT id FROM vis WHERE id < 1000;

                                     QUERY PLAN
═══════════════════════════════════════════════════════════════════════════════════
 Index Only Scan using vis_pkey on vis (actual time=0.125..1.097 rows=999 loops=1)
   Index Cond: (id < 1000)
   Heap Fetches: 184
 Planning Time: 1.931 ms
 Execution Time: 1.535 ms
Huh? Why do we still have 184 heap fetches, only one less than before?
VACUUM's failure to set all pages all-visible

I had to think a while before I remembered an optimization that went into PostgreSQL v14 with commit 5100010ee4:
Author: Peter Geoghegan <pg@bowt.ie>
Date:   Wed Apr 7 16:14:54 2021 -0700

    Teach VACUUM to bypass unnecessary index vacuuming.

    [...]

    VACUUM has generally managed to avoid index scans when there were
    clearly no index tuples to delete from indexes.  But cases with
    _close to_ no index tuples to delete were another matter -- a round
    of ambulkdelete() calls took place (one per index), each of which
    performed a full index scan.

    VACUUM now behaves just as if there were zero index tuples to delete
    in cases where there are in fact "virtually zero" such tuples.  That
    is, it can now bypass index vacuuming and heap vacuuming as an
    optimization (though not index cleanup).  Whether or not VACUUM
    bypasses indexes is determined dynamically, based on the just-observed
    number of heap pages in the table that have one or more LP_DEAD items
    (LP_DEAD items in heap pages have a 1:1 correspondence with index
    tuples that still need to be deleted from each index in the worst
    case).

    We only skip index vacuuming when 2% or less of the table's pages
    have one or more LP_DEAD items -- bypassing index vacuuming as an
    optimization must not noticeably impede setting bits in the
    visibility map.

    [...]
Since we updated only a single row, only one of the table's 5406 pages contains a dead row, which is clearly less than 2%. So VACUUM skipped the expensive step of cleaning up the index. If VACUUM does not clean up the index and does not remove the index entry that points to the dead row version, it cannot remove the dead row version itself, and the first table page cannot become all-visible.
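On v14 and later, you can watch that decision directly: VACUUM (VERBOSE) reports whether index vacuuming was performed or bypassed (the exact wording of the message depends on the server version):

-- the VERBOSE output should mention that index vacuuming was bypassed,
-- because only a tiny fraction of the pages contain dead item identifiers
VACUUM (VERBOSE) vis;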
However, when VACUUM scans the last table page, which contains the updated row, it notices that all entries in that page are visible to everybody and sets the all-visible flag for that page. As a consequence, the table row with id = 7 is now on an all-visible page, and PostgreSQL can skip the heap fetch for that one row.
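We can verify that with pg_visibility again, this time looking only at the two pages in question (block numbers taken from the ctid output above):

-- block 0 still holds the dead row version, block 5405 only holds live rows,
-- so we expect all_visible = f for block 0 and t for block 5405
SELECT 0 AS blkno, * FROM pg_visibility_map('vis', 0)
UNION ALL
SELECT 5405, * FROM pg_visibility_map('vis', 5405);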
Forcing VACUUM to clean up the index

If we desperately need to get the heap fetches down to zero again, we can use an option of VACUUM that was added in commit 3499df0dee:
Author: Peter Geoghegan <pg@bowt.ie>
Date:   Fri Jun 18 20:04:07 2021 -0700

    Support disabling index bypassing by VACUUM.

    Generalize the INDEX_CLEANUP VACUUM parameter (and the corresponding
    reloption): make it into a ternary style boolean parameter.  It now
    exposes a third option, "auto".  The "auto" option (which is now the
    default) enables the "bypass index vacuuming" optimization added by
    commit 1e55e7d1.

    "VACUUM (INDEX_CLEANUP TRUE)" is redefined to once again make VACUUM
    simply do any required index vacuuming, regardless of how few dead
    tuples are encountered during the first scan of the target heap
    relation (unless there are exactly zero).  This gives users a way of
    opting out of the "bypass index vacuuming" optimization, if for
    whatever reason that proves necessary.

    [...]
Using INDEX_CLEANUP ON, we can make VACUUM clean up the index, no matter what:
VACUUM (INDEX_CLEANUP ON) vis;

EXPLAIN (ANALYZE, COSTS OFF) SELECT id FROM vis WHERE id < 1000;

                                     QUERY PLAN
═══════════════════════════════════════════════════════════════════════════════════
 Index Only Scan using vis_pkey on vis (actual time=0.084..0.482 rows=999 loops=1)
   Index Cond: (id < 1000)
   Heap Fetches: 0
 Planning Time: 0.585 ms
 Execution Time: 0.678 ms
It is also possible to configure the table so that autovacuum always cleans up the index:
ALTER TABLE vis SET (vacuum_index_cleanup = on);
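If you later want to return to the default behavior ("auto", which allows the bypass optimization), you can reset the storage parameter:

-- revert to the default index cleanup behavior for this table
ALTER TABLE vis RESET (vacuum_index_cleanup);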
We had a thorough look at the PostgreSQL visibility map and how VACUUM keeps it up to date. An experiment showed that from v14 on, a performance optimization in PostgreSQL sometimes keeps VACUUM from making some pages all-visible. We also got to see the pg_visibility extension, which allows us to examine the visibility map directly.
If you'd like to read more about how the visibility map shows up in EXPLAIN (ANALYZE), check out Ants' article about the performance downsides of UUIDs.
Hello,
Very interesting article, I am wondering if it is related to my problem. I have a very complex query, and the critical part of the execution plan can vary greatly between runs, as you can see here.
-> Parallel Index Only Scan using zie5tikondat on tikondat (cost=0.70..1229.69 rows=13080 width=27) (actual time=0.110..6.950 rows=9972 loops=3)
-> Index Only Scan using zie5tikondat on tikondat (cost=0.70..140.39 rows=2673 width=27) (actual time=0.053..16.115 rows=29915 loops=3)
-> Index Only Scan using zie5tikondat on tikondat (cost=0.70..252.71 rows=2254 width=27) (actual time=0.049..83422.021 rows=261612559 loops=3)
The table is very large, and changes are being made throughout the day. But I can't get my head around the last scan, where it returned 261612559 rows, while only 29915 actually match the condition. This scan also had a large number of heap fetches.
I am thinking of using the INDEX_CLEANUP ON option. What do you think?
Regards
Richard Hennessy
From that little bit of information, it is hard to say what is going on. There are several possibilities:
- The table statistics are outdated. In that case, ANALYZE tikondat should improve the estimate.
- The table statistics are not detailed enough. In that case, ALTER TABLE tikondat ALTER the_column SET STATISTICS 500 followed by ANALYZE tikondat might fix the problem.

Note that in the last example, there really were 261612559 rows returned from the index scan, not 29915. Perhaps other filters later in the execution plan got rid of many of them.
I recommend using EXPLAIN (ANALYZE, BUFFERS) so that you get the relevant buffer statistics as well. Also, enabling track_io_timing can provide valuable information in the EXPLAIN output.

Thank you for your reply.
ANALYZE is run regularly, both manually and automatically. I did try setting the statistics target to 1000, though to be honest that was earlier on in the analysis, and I cannot remember if that large index scan was present then. I will try it again.
The table has around 282000000 rows; I have no idea how the figure 261612559 came about, let alone why. The rows will have been filtered out later through various joins.
I have been using EXPLAIN (ANALYZE, BUFFERS) all the way through here, but I'll also try track_io_timing to see if it sheds more light on this issue.
Thanks again.
I have a few doubts:
1. Why is the heap fetch count 186 just after updating the row? What I am unable to understand is this: you mentioned that on page 0 it had to scan 185 tuples, and on page 5405 it had to scan one additional tuple. But why only 1 tuple on page 5405? Why not all of them again, since the all-visible flag for that page is also off?
2. Does the index also contain multiple entries for the same row if we update the row?
3. How does the query processor know which pages from the heap to scan?
4. Let's assume we found an index entry that says id = 7. Why does it need to check the visibility map? And how exactly does it check (what happens when it sees an all-visible page)?
5. Let's assume a few more rows were updated along with id = 7. When the query processor is looking for pages in the heap, does it have to go through all partially visible pages, even if the tuple we are interested in is not on those pages?
1. PostgreSQL only has to read a single tuple from page 5405: the updated version of the row. Scanning any other index entry that points to that page would also cause a heap fetch, but we don't have to read any of those entries.
2. Yes.
3. Because the page number is part of the physical address of the tuple (the ctid), which is the pointer to the table row stored with the index leaf entry.
4. Because the table row that belongs to this index entry might not be visible in the query's snapshot. But if the page containing the tuple is all-visible in the visibility map, we are done, because that means that all tuples in that page are visible to all transactions.
5. No, because (as explained above) the physical address of the tuple, and hence the page number, is stored in the index entry.
Thanks for answering my questions. However, I have a few more:
1. If the page number is stored in the index entry, then initially the dead row (for id = 7) should say page = 0. Now, if we know the page is 0 and it's not all-visible, do we have to scan all the tuples inside that page?
2. Assuming it requires scanning all tuples on page 0, it then figures out that the required tuple is not present on that page, and then it finds a new index entry that says page = 5405. This page also doesn't have the all-visible flag set, so how does it find the required tuple in just one index scan instead of going through all tuples?
3. I read that LP_DEAD is not propagated to replica instances, which means they would end up going through all the heap fetches on each query. What is the recommendation to reduce heap fetches on replica instances? Any way other than aggressive vacuum/analyze? Also, does table partitioning help with this situation in any way?