Table of Contents
Streaming replication in PostgreSQL is a well-established master-slave replication technique. It is simple to set up, stable and performs well. So many people are surprised when they learn about replication conflicts — after all, the standby server is read-only.
This article describes replication conflicts and tells you how to deal with them.
A replication conflict occurs whenever the recovery process cannot apply WAL information from the primary server to the standby, because the change would disrupt query processing there. These conflicts cannot happen with queries on the primary server, but they happen on the streaming replication standby server because the primary server has limited knowledge about what is going on on the standby.
There are several kinds of replication conflicts:
This is the most frequent replication conflict.
Snapshot conflicts can occur if VACUUM
processes a table and removes dead tuples. This removal is replayed on the standby. Now a query on the standby may have started before VACUUM
on the primary (it has an older snapshot), so it can still see the tuples that should be removed. This constitutes a snapshot conflict.
The queries on a standby server take an ACCESS SHARE
lock on the tables they are reading. So any ACCESS EXCLUSIVE
lock on the primary (which conflicts with ACCESS SHARE
) must be replayed on the standby to keep incompatible operations on the table from happening. PostgreSQL takes such a lock for operations that conflict with SELECT
, for example DROP TABLE
, TRUNCATE
and many ALTER TABLE
statements. If the standby should replay such a lock on a table that a query uses, we have a lock conflict.
One way to reduce the need for VACUUM
is to use HOT updates. Then any query on the primary that accesses a page with dead heap-only tuples and can get an exclusive lock on it will prune the HOT chains. PostgreSQL always holds such page locks for a short time, so there is no conflict with processing on the primary. There are other causes for page locks, but this is perhaps the most frequent one.
When the standby server should replay such an exclusive page lock and a query is using the page (“has the page pinned” in PostgreSQL jargon), you get a buffer pin replication conflict. Pages can be pinned for a while, for example during a sequential scan of a table on the outer side of a nested loop join.
HOT chain pruning can of course also lead to snapshot replication conflicts.
The following types of conflict are rare and will not bother you:
temp_tablespaces
on the standby server, and a query has temporary files there. When a DROP TABLESPACE
occurs from the primary, we get a conflict. PostgreSQL cancels all queries on the standby in that case.DROP DATABASE
causes a conflict if the standby has active sessions on the database. PostgreSQL terminates all connections to the database on the standby in that case.The statistics view pg_stat_database_conflicts
contains a detailed account of all replication conflicts that happened since the last statistics reset. You have to look at that view on the standby server, not the primary, because that is where replication conflicts occur.
Note that this view does not show all replication conflicts that occurred, it only shows the ones that led to a canceled query on the standby (see the next section).
The parameter max_standby_streaming_delay
determines what happens when WAL replay encounters a replication conflict (there is a similar parameter max_standby_archive_delay
that does the same thing for archive recovery). PostgreSQL suspends replay of the WAL information for at most max_standby_streaming_delay
milliseconds. If the conflicting query is still running after that time, PostgreSQL cancels it with an error message like:
1 2 |
ERROR: canceling statement due to conflict with recovery DETAIL: User query might have needed to see row versions that must be removed. |
The detail message shows that this was from a snapshot replication conflict.
max_standby_streaming_delay
has a default value of 30 seconds, so queries on the standby get a “grace time” of half a minute to finish before they get canceled if they cause a replication conflict. This is a middle ground between the extreme settings of 0 (PostgreSQL cancels queries immediately, no delay for replay) and the special value -1 (PostgreSQL never cancels queries, arbitrarily long replay delay).
To understand how to best configure PostgreSQL to deal with replication conflicts, we have to take a look at the use cases for streaming replication.
Streaming replication is the basis for most high availability solutions. Together with a software like Patroni that manages failover, it provides a robust shared-nothing architecture to build a fault tolerant system.
Clearly, the main objective with high availability is to have as little replication delay as possible. This way, standby promotion is fast and little information is lost during the failover. In this case, you want to set max_standby_streaming_delay
low.
Note that you need not lose more data during a failover if the standby is behind with applying WAL – the WAL information is still streamed to the standby and written to pg_wal
. But it will take longer for the standby to catch up, so the failover time increases.
Big queries for reporting or data analysis can generate load that may overload the productive system. The best solution for that is a data warehouse specifically designed for such queries. But often a standby server can serve as a “poor man's data warehouse”.
Another example for off-loading are database backups: backing up a standby server puts no stress on the primary system.
The main objective in this case is to let the queries (or the backup) complete without interruption.
In this case, you want to set max_standby_streaming_delay
to a value greater than the lifetime of the longest query, and it is no problem if there is a delay replaying WAL.
You can use standby servers to distribute database workload across several machines. In practice, there are several limitations to the usefulness of this approach:
Another difficulty you face is that there is no good setting for max_standby_streaming_delay
: a low value will make queries on the standby fail, while a high value will cause queries on the standby to see stale data.
Ideally, a standby server serves only a single purpose, so that you can adjust max_standby_streaming_delay
or hot_standby
accordingly. So the king's way it to have dedicated standby servers for failover and off-loading work.
But sometimes you cannot afford a second standby server, or you may be stuck in a situation like in the “horizontal scaling” scenario above. Then your only option is to reduce the number of replication conflicts as much as possible.
Obviously there can be no replication conflicts if there are no queries on the standby server. So if you set hot_standby = off
on the standby, you don't have to worry about this at all.
But while this method is simple and effective, it will only be feasible if the standby server is exclusively used for high availability. If you are not in that fortunate position, read on.
The obvious measure to avoid lock conflicts is not to issue statements that take an ACCESS EXCLUSIVE
lock on the table. The most important statements that do so are:
But there is one kind of ACCESS EXCLUSIVE
lock that you cannot avoid that way: locks from VACUUM
truncation. When VACUUM
has finished processing a table, and the pages at the end of the table have become empty, it tries to get a short ACCESS EXCLUSIVE
lock on the table. If that succeeds, it will truncate the empty pages and immediately release the lock. While such locks don't disrupt processing on the primary, they can cause replication conflicts on the standby.
There are two ways to avoid VACUUM
truncation:
1 |
ALTER TABLE some_table SET (vacuum_truncate = off); |
old_snapshot_threshold
on the primary to a value other than -1. This disables VACUUM
truncation as an undocumented side effect.The way to reduce such conflicts is to keep the primary from removing dead tuples that might still be visible on the standby. There are two parameters that help with this:
hot_standby_feedback
on the standby to on
. Then the feedback messages from standby to primary server will contain the snapshot xmin
of the oldest active transaction on the standby, and the primary will not remove any tuples that this transaction still could see.vacuum_defer_cleanup_age
on the primary to a value greater than 0. Then VACUUM
will not clean up dead tuples unless they are more than vacuum_defer_cleanup_age
transactions old. This is less specific than hot_standby_feedback
and can also lead to table bloat.Note that while hot_standby_feedback = on
will get rid of most of the snapshot replication conflicts, it will not necessarily eliminate buffer pin conflicts, since the page that the standby is using could contain some very old tuples. Moreover, I have seen snapshot conflicts in databases even when hot_standby_feedback
was on
, although after consulting the source I don't understand how that could happen. Maybe a reader can enlighten me :^)
There is no very good way to avoid these conflicts. Perhaps you can reduce the number of HOT updates, but that would harm performance on the primary.
The best way to avoid replication conflicts is to have dedicated standby servers: one for high availability and one for off-loading queries or backups. Then you can easily configure each to avoid replication conflicts.
If you cannot afford that expense, or you want to use a standby for horizontal scaling, you will have to adjust hot_standby_feedback
, max_standby_streaming_delay
and the vacuum_truncate
storage parameter to get as few canceled queries as possible while avoiding excessive table bloat and long replication delays.
You need to load content from reCAPTCHA to submit the form. Please note that doing so will share data with third-party providers.
More InformationYou are currently viewing a placeholder content from Facebook. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More InformationYou are currently viewing a placeholder content from X. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More Information
Another increasingly more popular query conflict avoiding technique is setting up a logical replication replica. It's a bit more work for sure than simple streaming standbys and needs monitoring and maintenance when schema is evolving, etc...but you can avoid most downsides - no extra bloat on the primary, no replication lag, no killed queries
Minor correction. View name is mentioned as "pg_stat_replication_conflicts" it is database_conflicts
Thanks, fixed.
I wrote "replication conflict" so often in this article that it got stuck...
I have a standby server on pg13. there is not any transaction or an request on it. when I display the pg_stat_replication view I see 46 hours replay lag. how could be this?
I guess you mean
pg_stat_replication
.Perhaps there is no data modification activity.
The relevant measure is lag in bytes. Try this query on the primary server:
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) FROM pg_stat_replication;
Yes I mean pg_stat_replicaiton. I see the replay_lag in this view. and it is about 2 days.. on the master there are lots of insert statements. but on the slave there is no any workload. Is there any way to force replaying the wal on the standby?
Make sure there are no queries running on the standby. Look at the standby's log file for errors. Buy some consulting from us.
Thank you Laurenz. I have resolved it.
Hi Laurenz, thank you for this very interesting blog post!!
I'd like to report a typo in "The queries on a standby server take an ACCESS EXCLUSIVE lock on the tables they are reading"
(replace "ACCESS EXCLUSIVE" by "ACCESS SHARE")
Thank you, what an embarrassing bug. Fixed.
minor correction :
Avoiding snapshot conflicts:
hot_standby_feedback should be set on standby
Regards,
Harikrishna kuchipudi
What an embarrassing mistake. Thanks! I have fixed the text.
Nice explanation...learnt lot of things..thank you so much
I admired your article. Thank you so much for writing such a nice article.
I have a question.
You explained as follows.
------------------------------------------------------
max_standby_streaming_delay has a default value of 30 seconds, so queries on the standby get a “grace time” of half a minute to finish before they get canceled if they cause a replication conflict. This is a middle ground between the extreme settings of 0 and the special value -1
------------------------------------------------------
Q1. Does "half a minute" mean 30minute? Or 30 seconds?
Q2. You explained that the intermediate value between the values "0" and "-1" of "max_standby_streaming_delay" is "grace time".
A value of "-1" means does not end the query, and is an arbitrary long duration.
In other words, it is not a quantitatively determined time, and I do not understand why it is explained that half of the time that is not quantitatively defined is "grace time" and that "grace time" is specified as "half a minute", a quantitative number.
Could you please explain this part as a supplement?
What I'm really curious about is how many minutes or seconds "grace time" is exactly, and why?
If there is a way to know the "grace time" when referring to the source code or the official documentation, please let me know.
Lastly, I would like to say thank you once again for writing such a good article.
Much of the language that confused you is not technical language, so let me try to clarify:
- "half a minute" is English for 30 seconds.
- "grace time" is not a technical time, all it means is "some extra time before I get nasty".
By default that is 30 seconds; if the query that causes the conflict has not finished after that time, it will be canceled.
- "30 seconds" is not halfway between 0 (immediately) and -1 (infinitely long).
But most queries take less than 30 seconds, so most queries will not get canceled. That's what I mean be "middle ground": most queries don't get canceled, and at the same time replay is not delayed for more than 30 seconds.
You will be truly blessed for being kind enough to answer my poor English skills. You are not only technically good, but you have a great personality.
@laurenz_albe:disqus thanks for your wonderful blog posts, they explain concepts in a very easy to understand manner. I prefer reading your blog first then go to official documentation. Now for "Buffer pin replication conflicts" - does it mean the way to avoid them is to disable (auto) vacuum? So that it wont hold exclusive lock on the dead tuple blocks and prune them? In streaming replication, will standby have same blocks at the storage level?
Thank you for the praise.
Even if you disable autovacuum (which would be a mistake), you would still get buffer pins and conflicts because of them. HOT updates lead to page locks even if autovacuum is disabled.
Don't try to avoid replication conflicts. Have two standby servers: one for high availability and one for running queries on.
How "two standby servers: one for high availability and one for running queries on" can solve replication conflicts ?
High availability standby server should have hot_standby = off
but second standby server with hot_standby = on will still have conflicts.
Yes, but it won't cancel queries in the case of a conflict if you set
max_standby_streaming_delay
to -1. Instead, replay of the WAL information will be delayed.I have master and standby server with hot_standby = on (pgpool with disable_load_balance_on_write = 'transaction'), last week I have checked settings with "max_standby_streaming_delay to -1" and synchronous_commit=remote_apply.
write_lag and flush_lag was ok but replay_lag had terrible delays, some SELECT FOR UPDATE and UPDATES waits for almost 2h and in logs "remaining connection slots are reserved for non-replication superuser connections....." eventually I had to turn off replication. It looked like never-ending queries.
Synchronous replication with
synchronous_commit = remote_apply
andmax_standby_streaming_delay = -1
are a toxic combination. That cannot work: queries on the standby will be allowed to delay replay of the WAL indefinitely, andCOMMIT
on the primary will wait for WAL to be replayed, so any write on the primary will start taking forever.You either have to tolerate stale reads or canceled queries on the standby. There is no third way.
Thank you
Hi Laurenz, thanks for the article!
About this bit:
> Moreover, I have seen snapshot conflicts in databases even when hot_standby_feedback was on, although after consulting the source I don't understand how that could happen. Maybe a reader can enlighten me :^)
Were you able to figure out eventually how this can happen?
The only 2 ideas I had were:
1. walreceiver has a network issue and the backend_xmin sent to the primary is invalidated. A VACUUM then cleans up a record that normally would have not been cleaned up.
2. Race condition where standby is too late to send its latest backend_xmin to the primary and a VACUUM already sends a cleanup record. Could maybe lower wal_receiver_status_interval to reduce the chance of this?
Do these ideas make sense? Do you have any other thoughts?
Thanks,
Engineer who is very tired of snapshot conflicts 🙁
No, I didn't get any wiser.
Thanks for your ideas, but that can't be it: if the feedback from the standby doesn't make it to the primary, all that can happen is that the
xmin
horizon is not advanced, so the primary doesn't discard old row versions that could actually be safely deleted.Thanks, after reading the walreceiver and walsender code (for version 13), I agree.
This comment for GetOldestXmin() is also insightful:
https://github.com/postgres/postgres/blob/0dc08bbfcc93ac4f04d2a3f4b9d1231a80e2cc0c/src/backend/storage/ipc/procarray.c#L1345-L1349
Specifically, this point:
> data is only protected if the walsender runs continuously while queries are executed on the standby.
So, it seems like it could happen with these edge cases:
- standby is not using a physical replication slot, walsender reboots, VACUUM runs before walreciver sends next status update
- hot_standby_feedback turned off and back on for a standby
However, I have seen snapshot conflicts in any environments where neither of those edge cases are happening.
So, my next guess is that the snapshot conflicts are due to non-VACUUM cleanup records.
After searching "ResolveRecoveryConflictWithSnapshot" in code, which seems to be the function used to find conflicting vxids on the standby and kill conflicting backends, I have found these cleanup records that can cause snapshot conflicts:
- XLOG_GIST_DELETE
- XLOG_GIST_PAGE_REUSE
- XLOG_HASH_VACUUM_ONE_PAGE
- XLOG_HEAP2_CLEANUP_INFO
- XLOG_HEAP2_CLEAN
- XLOG_HEAP2_VISIBLE
- XLOG_HEAP2_FREEZE_PAGE
- XLOG_BTREE_DELETE
- XLOG_BTREE_REUSE_PAGE
- XLOG_SPGIST_VACUUM_REDIRECT
As far as I can tell, most of these are generated by VACUUM and avoidable by hot_standby_feedback, however, it seems like a couple can be generated outside VACUUM:
- XLOG_GIST_DELETE -> new tuple insertion for gist index
- XLOG_GIST_PAGE_REUSE -> new index page creation for gist index
- XLOG_HEAP2_CLEAN -> early page pruning for HOT chains (?)
- XLOG_BTREE_DELETE -> index deduplication on writes
- XLOG_BTREE_REUSE_PAGE -> index page reuse on writes
These are probably where I will take my investigation next - but so far not able to reliably reproduce the issue and also not sure what the best test suites would be to produce these specific cleanup records either
Okay, after some digging, I think this describes a bug that can cause snapshot recovery conflict despite hot_standby_feedback being set to on:
https://www.postgresql.org/message-id/9aae233b-72ec-b1b8-5716-2a092909f89f%40amazon.com
It seems quite *rare*, but it is at least a paper trail of how it could happen.
Thanks for your interesting research.
Sure, a bug like that might have been the cause.
My theory so far was that it is not a bug, but that something else can cause legitimate snapshot conflicts, along the lines of your research in your previous comment.
It shouldn't be HOT chain pruning, because with
hot_standby_feedback
, a heap-only tuple also won't be removed if it is still needed on the standby.Thanks Laurenz
I am working on a system that was upgraded from an earlier version before index deduplication to v13. This is also a system that had a high amount of churn due to data pruning without any reindexing or repacking done after the fact.
My only guess at this point is that there are a lot of dead index pages which generate conflicts on reuse or deduplication and somehow in some cases, this causes conflicts.
If I find anymore interesting insights, or if the problem goes away after reindexing or major version upgrades, I'll share an update here. But otherwise, I don't know where to look next! So elusive!
(Also, I cannot figure out how to reply to your message rather than create a new message. The reply button seems to not work for me.)
I'm occasionally seeing queries canceled this way:
ERROR: canceling statement due to conflict with recovery (SQLSTATE 40P01)
User transaction caused buffer deadlock with recovery.
I was wondering if these are buffer pin conflicts due to the buffer part of the message, but pg_stat_database_conflicts.confl_bufferpin is 0, whereas confl_snapshot is 3. So it sounds like they are snapshot conflicts. Do you think that's right?
I have hot_standby_feedback enabled, so it seems I am encountering one of the mysterious cases of this that you mentioned and others have commented on. I'm running 16.6.
I'm currently thinking of trying to reindex my catalog tables as this system has been upgraded through many versions of Postgres (other tables have been reindexed). Maybe old indexes could be related as theorized in comments above.
Interestingly the only query that has been getting canceled is one that is looking at a catalog table (pg_inherits). However I am not completely sure that it is the catalog that is the issue as the query is part of a PL/pgSQL function that does another few things. However the part that gets shown at the top of the error message is the catalog table part.
Interesting. If buffer deadlocks are counted as snapshot conflicts, that would definitely explain why you can see shapshot conflicts even if
hot_standby_feedback = on
.I am kind of doubtful that rebuilding an index will make a difference, but on the other hand it is not difficult and certainly worth a try.
Pinning a buffer in
pg_inherits
can certainly lead to a conflict. Do you have any idea what the conflicting operation would be? Creating a new partition?