CYBERTEC PostgreSQL Logo

Dealing with streaming replication conflicts in PostgreSQL

11.2020 / Category: / Tags: | | |
replication conflict?
© Laurenz Albe 2020

 

Streaming replication in PostgreSQL is a well-established master-slave replication technique. It is simple to set up, stable and performs well. So many people are surprised when they learn about replication conflicts — after all, the standby server is read-only.

This article describes replication conflicts and tells you how to deal with them.

What is a replication conflict?

A replication conflict occurs whenever the recovery process cannot apply WAL information from the primary server to the standby, because the change would disrupt query processing there. These conflicts cannot happen with queries on the primary server, but they happen on the streaming replication standby server because the primary server has limited knowledge about what is going on on the standby.

There are several kinds of replication conflicts:

Snapshot replication conflicts

This is the most frequent replication conflict.

Snapshot conflicts can occur if VACUUM processes a table and removes dead tuples. This removal is replayed on the standby. Now a query on the standby may have started before VACUUM on the primary (it has an older snapshot), so it can still see the tuples that should be removed. This constitutes a snapshot conflict.

Lock replication conflicts

The queries on a standby server take an ACCESS SHARE lock on the tables they are reading. So any ACCESS EXCLUSIVE lock on the primary (which conflicts with ACCESS SHARE) must be replayed on the standby to keep incompatible operations on the table from happening. PostgreSQL takes such a lock for operations that conflict with SELECT, for example DROP TABLE, TRUNCATE and many ALTER TABLE statements. If the standby should replay such a lock on a table that a query uses, we have a lock conflict.

Buffer pin replication conflicts

One way to reduce the need for VACUUM is to use HOT updates. Then any query on the primary that accesses a page with dead heap-only tuples and can get an exclusive lock on it will prune the HOT chains. PostgreSQL always holds such page locks for a short time, so there is no conflict with processing on the primary. There are other causes for page locks, but this is perhaps the most frequent one.

When the standby server should replay such an exclusive page lock and a query is using the page (“has the page pinned” in PostgreSQL jargon), you get a buffer pin replication conflict. Pages can be pinned for a while, for example during a sequential scan of a table on the outer side of a nested loop join.

HOT chain pruning can of course also lead to snapshot replication conflicts.

Rare kinds of replication conflicts

The following types of conflict are rare and will not bother you:

  • Deadlock replication conflicts: A query on the standby blocks while using the shared buffer that is needed to replay WAL from the primary. PostgreSQL will cancel such a query immediately.
  • Tablespace replication conflicts: A tablespace is in temp_tablespaces on the standby server, and a query has temporary files there. When a DROP TABLESPACE occurs from the primary, we get a conflict. PostgreSQL cancels all queries on the standby in that case.
  • Database replication conflicts: Replication of DROP DATABASE causes a conflict if the standby has active sessions on the database. PostgreSQL terminates all connections to the database on the standby in that case.

Monitoring replication conflicts

The statistics view pg_stat_database_conflicts contains a detailed account of all replication conflicts that happened since the last statistics reset. You have to look at that view on the standby server, not the primary, because that is where replication conflicts occur.

Note that this view does not show all replication conflicts that occurred, it only shows the ones that led to a canceled query on the standby (see the next section).

How does the standby server resolve replication conflicts?

The parameter max_standby_streaming_delay determines what happens when WAL replay encounters a replication conflict (there is a similar parameter max_standby_archive_delay that does the same thing for archive recovery). PostgreSQL suspends replay of the WAL information for at most max_standby_streaming_delay milliseconds. If the conflicting query is still running after that time, PostgreSQL cancels it with an error message like:

The detail message shows that this was from a snapshot replication conflict.

max_standby_streaming_delay has a default value of 30 seconds, so queries on the standby get a “grace time” of half a minute to finish before they get canceled if they cause a replication conflict. This is a middle ground between the extreme settings of 0 (PostgreSQL cancels queries immediately, no delay for replay) and the special value -1 (PostgreSQL never cancels queries, arbitrarily long replay delay).

To understand how to best configure PostgreSQL to deal with replication conflicts, we have to take a look at the use cases for streaming replication.

Use cases for streaming replication

High availability

Streaming replication is the basis for most high availability solutions. Together with a software like Patroni that manages failover, it provides a robust shared-nothing architecture to build a fault tolerant system.

Clearly, the main objective with high availability is to have as little replication delay as possible. This way, standby promotion is fast and little information is lost during the failover. In this case, you want to set max_standby_streaming_delay low.

Note that you need not lose more data during a failover if the standby is behind with applying WAL – the WAL information is still streamed to the standby and written to pg_wal. But it will take longer for the standby to catch up, so the failover time increases.

Off-loading big queries

Big queries for reporting or data analysis can generate load that may overload the productive system. The best solution for that is a data warehouse specifically designed for such queries. But often a standby server can serve as a “poor man's data warehouse”.

Another example for off-loading are database backups: backing up a standby server puts no stress on the primary system.

The main objective in this case is to let the queries (or the backup) complete without interruption.

In this case, you want to set max_standby_streaming_delay to a value greater than the lifetime of the longest query, and it is no problem if there is a delay replaying WAL.

Horizontal scaling

You can use standby servers to distribute database workload across several machines. In practice, there are several limitations to the usefulness of this approach:

  • all writing statements have to go to the primary server, so only reading can be scaled
  • the application has to be able to direct queries and data modifications to different databases
  • the application has to cope with the problem that data modifications may not immediately be visible to queries (synchronous replication avoids this, but the performance impact on writing transactions is staggering)

Another difficulty you face is that there is no good setting for max_standby_streaming_delay: a low value will make queries on the standby fail, while a high value will cause queries on the standby to see stale data.

How to deal with conflicting requirements

Ideally, a standby server serves only a single purpose, so that you can adjust max_standby_streaming_delay or hot_standby accordingly. So the king's way it to have dedicated standby servers for failover and off-loading work.

But sometimes you cannot afford a second standby server, or you may be stuck in a situation like in the “horizontal scaling” scenario above. Then your only option is to reduce the number of replication conflicts as much as possible.

Avoiding replication conflicts

Avoiding all conflicts by disabling hot standby

Obviously there can be no replication conflicts if there are no queries on the standby server. So if you set hot_standby = off on the standby, you don't have to worry about this at all.

But while this method is simple and effective, it will only be feasible if the standby server is exclusively used for high availability. If you are not in that fortunate position, read on.

Avoiding lock conflicts

The obvious measure to avoid lock conflicts is not to issue statements that take an ACCESS EXCLUSIVE lock on the table. The most important statements that do so are:

  • DROP TABLE
  • TRUNCATE
  • LOCK
  • DROP INDEX
  • DROP TRIGGER
  • ALTER TABLE

But there is one kind of ACCESS EXCLUSIVE lock that you cannot avoid that way: locks from VACUUM truncation. When VACUUM has finished processing a table, and the pages at the end of the table have become empty, it tries to get a short ACCESS EXCLUSIVE lock on the table. If that succeeds, it will truncate the empty pages and immediately release the lock. While such locks don't disrupt processing on the primary, they can cause replication conflicts on the standby.

There are two ways to avoid VACUUM truncation:

  • From PostgreSQL v12 on, you can disable the feature for individual tables with

  • You can set old_snapshot_threshold on the primary to a value other than -1. This disables VACUUM truncation as an undocumented side effect.

Avoiding snapshot conflicts

The way to reduce such conflicts is to keep the primary from removing dead tuples that might still be visible on the standby. There are two parameters that help with this:

  • Set hot_standby_feedback on the standby to on. Then the feedback messages from standby to primary server will contain the snapshot xmin of the oldest active transaction on the standby, and the primary will not remove any tuples that this transaction still could see.
    That will get rid most of these replication conflicts, but now long running queries on the standby can lead to table bloat on the primary, which is why that setting is not enabled by default. Consider the risk carefully.
  • Set vacuum_defer_cleanup_age on the primary to a value greater than 0. Then VACUUM will not clean up dead tuples unless they are more than vacuum_defer_cleanup_age transactions old. This is less specific than hot_standby_feedback and can also lead to table bloat.

Note that while hot_standby_feedback = on will get rid of most of the snapshot replication conflicts, it will not necessarily eliminate buffer pin conflicts, since the page that the standby is using could contain some very old tuples. Moreover, I have seen snapshot conflicts in databases even when hot_standby_feedback was on, although after consulting the source I don't understand how that could happen. Maybe a reader can enlighten me :^)

Avoiding buffer pin conflicts

There is no very good way to avoid these conflicts. Perhaps you can reduce the number of HOT updates, but that would harm performance on the primary.

Conclusion

The best way to avoid replication conflicts is to have dedicated standby servers: one for high availability and one for off-loading queries or backups. Then you can easily configure each to avoid replication conflicts.

If you cannot afford that expense, or you want to use a standby for horizontal scaling, you will have to adjust hot_standby_feedback, max_standby_streaming_delay and the vacuum_truncate storage parameter to get as few canceled queries as possible while avoiding excessive table bloat and long replication delays.

29 responses to “Dealing with streaming replication conflicts in PostgreSQL”

  1. Another increasingly more popular query conflict avoiding technique is setting up a logical replication replica. It's a bit more work for sure than simple streaming standbys and needs monitoring and maintenance when schema is evolving, etc...but you can avoid most downsides - no extra bloat on the primary, no replication lag, no killed queries

  2. I have a standby server on pg13. there is not any transaction or an request on it. when I display the pg_stat_replication view I see 46 hours replay lag. how could be this?

    • I guess you mean pg_stat_replication.
      Perhaps there is no data modification activity.
      The relevant measure is lag in bytes. Try this query on the primary server:

      SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) FROM pg_stat_replication;

      • Yes I mean pg_stat_replicaiton. I see the replay_lag in this view. and it is about 2 days.. on the master there are lots of insert statements. but on the slave there is no any workload. Is there any way to force replaying the wal on the standby?

  3. Hi Laurenz, thank you for this very interesting blog post!!

    I'd like to report a typo in "The queries on a standby server take an ACCESS EXCLUSIVE lock on the tables they are reading"

    (replace "ACCESS EXCLUSIVE" by "ACCESS SHARE")

  4. minor correction :

    Avoiding snapshot conflicts:
    hot_standby_feedback should be set on standby
    Regards,
    Harikrishna kuchipudi

  5. I admired your article. Thank you so much for writing such a nice article.
    I have a question.

    You explained as follows.
    ------------------------------------------------------
    max_standby_streaming_delay has a default value of 30 seconds, so queries on the standby get a “grace time” of half a minute to finish before they get canceled if they cause a replication conflict. This is a middle ground between the extreme settings of 0 and the special value -1
    ------------------------------------------------------
    Q1. Does "half a minute" mean 30minute? Or 30 seconds?
    Q2. You explained that the intermediate value between the values "0" and "-1" of "max_standby_streaming_delay" is "grace time".
    A value of "-1" means does not end the query, and is an arbitrary long duration.
    In other words, it is not a quantitatively determined time, and I do not understand why it is explained that half of the time that is not quantitatively defined is "grace time" and that "grace time" is specified as "half a minute", a quantitative number.

    Could you please explain this part as a supplement?
    What I'm really curious about is how many minutes or seconds "grace time" is exactly, and why?
    If there is a way to know the "grace time" when referring to the source code or the official documentation, please let me know.
    Lastly, I would like to say thank you once again for writing such a good article.

    • Much of the language that confused you is not technical language, so let me try to clarify:
      - "half a minute" is English for 30 seconds.
      - "grace time" is not a technical time, all it means is "some extra time before I get nasty".
      By default that is 30 seconds; if the query that causes the conflict has not finished after that time, it will be canceled.
      - "30 seconds" is not halfway between 0 (immediately) and -1 (infinitely long).
      But most queries take less than 30 seconds, so most queries will not get canceled. That's what I mean be "middle ground": most queries don't get canceled, and at the same time replay is not delayed for more than 30 seconds.

      • You will be truly blessed for being kind enough to answer my poor English skills. You are not only technically good, but you have a great personality.

  6. @laurenz_albe:disqus thanks for your wonderful blog posts, they explain concepts in a very easy to understand manner. I prefer reading your blog first then go to official documentation. Now for "Buffer pin replication conflicts" - does it mean the way to avoid them is to disable (auto) vacuum? So that it wont hold exclusive lock on the dead tuple blocks and prune them? In streaming replication, will standby have same blocks at the storage level?

    • Thank you for the praise.
      Even if you disable autovacuum (which would be a mistake), you would still get buffer pins and conflicts because of them. HOT updates lead to page locks even if autovacuum is disabled.
      Don't try to avoid replication conflicts. Have two standby servers: one for high availability and one for running queries on.

      • How "two standby servers: one for high availability and one for running queries on" can solve replication conflicts ?

        High availability standby server should have hot_standby = off
        but second standby server with hot_standby = on will still have conflicts.

        • Yes, but it won't cancel queries in the case of a conflict if you set max_standby_streaming_delay to -1. Instead, replay of the WAL information will be delayed.

          • I have master and standby server with hot_standby = on (pgpool with disable_load_balance_on_write = 'transaction'), last week I have checked settings with "max_standby_streaming_delay to -1" and synchronous_commit=remote_apply.

            write_lag and flush_lag was ok but replay_lag had terrible delays, some SELECT FOR UPDATE and UPDATES waits for almost 2h and in logs "remaining connection slots are reserved for non-replication superuser connections....." eventually I had to turn off replication. It looked like never-ending queries.

          • Synchronous replication with synchronous_commit = remote_apply and max_standby_streaming_delay = -1 are a toxic combination. That cannot work: queries on the standby will be allowed to delay replay of the WAL indefinitely, and COMMIT on the primary will wait for WAL to be replayed, so any write on the primary will start taking forever.
            You either have to tolerate stale reads or canceled queries on the standby. There is no third way.

  7. Hi Laurenz, thanks for the article!

    About this bit:
    > Moreover, I have seen snapshot conflicts in databases even when hot_standby_feedback was on, although after consulting the source I don't understand how that could happen. Maybe a reader can enlighten me :^)

    Were you able to figure out eventually how this can happen?

    The only 2 ideas I had were:
    1. walreceiver has a network issue and the backend_xmin sent to the primary is invalidated. A VACUUM then cleans up a record that normally would have not been cleaned up.
    2. Race condition where standby is too late to send its latest backend_xmin to the primary and a VACUUM already sends a cleanup record. Could maybe lower wal_receiver_status_interval to reduce the chance of this?

    Do these ideas make sense? Do you have any other thoughts?

    Thanks,
    Engineer who is very tired of snapshot conflicts 🙁

    • No, I didn't get any wiser.

      Thanks for your ideas, but that can't be it: if the feedback from the standby doesn't make it to the primary, all that can happen is that the xmin horizon is not advanced, so the primary doesn't discard old row versions that could actually be safely deleted.

  8. Thanks, after reading the walreceiver and walsender code (for version 13), I agree.

    This comment for GetOldestXmin() is also insightful:
    https://github.com/postgres/postgres/blob/0dc08bbfcc93ac4f04d2a3f4b9d1231a80e2cc0c/src/backend/storage/ipc/procarray.c#L1345-L1349

    Specifically, this point:
    > data is only protected if the walsender runs continuously while queries are executed on the standby.

    So, it seems like it could happen with these edge cases:
    - standby is not using a physical replication slot, walsender reboots, VACUUM runs before walreciver sends next status update
    - hot_standby_feedback turned off and back on for a standby

    However, I have seen snapshot conflicts in any environments where neither of those edge cases are happening.

    So, my next guess is that the snapshot conflicts are due to non-VACUUM cleanup records.

    After searching "ResolveRecoveryConflictWithSnapshot" in code, which seems to be the function used to find conflicting vxids on the standby and kill conflicting backends, I have found these cleanup records that can cause snapshot conflicts:
    - XLOG_GIST_DELETE
    - XLOG_GIST_PAGE_REUSE
    - XLOG_HASH_VACUUM_ONE_PAGE
    - XLOG_HEAP2_CLEANUP_INFO
    - XLOG_HEAP2_CLEAN
    - XLOG_HEAP2_VISIBLE
    - XLOG_HEAP2_FREEZE_PAGE
    - XLOG_BTREE_DELETE
    - XLOG_BTREE_REUSE_PAGE
    - XLOG_SPGIST_VACUUM_REDIRECT

    As far as I can tell, most of these are generated by VACUUM and avoidable by hot_standby_feedback, however, it seems like a couple can be generated outside VACUUM:

    - XLOG_GIST_DELETE -> new tuple insertion for gist index
    - XLOG_GIST_PAGE_REUSE -> new index page creation for gist index
    - XLOG_HEAP2_CLEAN -> early page pruning for HOT chains (?)
    - XLOG_BTREE_DELETE -> index deduplication on writes
    - XLOG_BTREE_REUSE_PAGE -> index page reuse on writes

    These are probably where I will take my investigation next - but so far not able to reliably reproduce the issue and also not sure what the best test suites would be to produce these specific cleanup records either

    • Thanks for your interesting research.
      Sure, a bug like that might have been the cause.
      My theory so far was that it is not a bug, but that something else can cause legitimate snapshot conflicts, along the lines of your research in your previous comment.
      It shouldn't be HOT chain pruning, because with hot_standby_feedback, a heap-only tuple also won't be removed if it is still needed on the standby.

  9. Thanks Laurenz

    I am working on a system that was upgraded from an earlier version before index deduplication to v13. This is also a system that had a high amount of churn due to data pruning without any reindexing or repacking done after the fact.

    My only guess at this point is that there are a lot of dead index pages which generate conflicts on reuse or deduplication and somehow in some cases, this causes conflicts.

    If I find anymore interesting insights, or if the problem goes away after reindexing or major version upgrades, I'll share an update here. But otherwise, I don't know where to look next! So elusive!

    (Also, I cannot figure out how to reply to your message rather than create a new message. The reply button seems to not work for me.)

Leave a Reply

Your email address will not be published. Required fields are marked *

CYBERTEC Logo white
CYBERTEC PostgreSQL International GmbH
Römerstraße 19
2752 Wöllersdorf
Austria

+43 (0) 2622 93022-0
office@cybertec.at

Get the newest PostgreSQL Info & Tools


    This site is protected by reCAPTCHA and the Google Privacy Policy & Terms of Service apply.

    ©
    2024
    CYBERTEC PostgreSQL International GmbH
    phone-handsetmagnifiercrosscross-circle
    linkedin facebook pinterest youtube rss twitter instagram facebook-blank rss-blank linkedin-blank pinterest youtube twitter instagram