
End of the road for PostgreSQL streaming replication?

02.2025

PostgreSQL streaming replication is pretty fast. It is based on the transaction log (WAL) and the crash recovery mechanism. This is work the primary must handle anyway for safe operation, so the overhead on the primary side is negligible. The only extra work is pushing WAL out onto the network in addition to writing it to disk.

On the replica side, the apply process is also quite lightweight, because the transaction log contains physical-layer changes. When the database system receives an update, tasks such as finding the row, locating space for the new version, finding the correct place to insert new references into the indexes, and splitting pages if necessary are all done by the primary. This work is encoded in the transaction log stream as modifications to specific pages. The replay process only has to find the page and replay the modification, so no “thinking” is necessary.
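You can see this physical encoding for yourself with the pg_walinspect extension (PostgreSQL 15 and later). A small psql sketch, assuming a pgbench database and superuser privileges; the exact records you see will of course differ:

```sql
-- Requires PostgreSQL 15+; by default only superusers may call these functions.
CREATE EXTENSION IF NOT EXISTS pg_walinspect;

-- Remember the current WAL position, do a small update, then inspect the
-- records it produced: each one names a resource manager, a record type
-- and the blocks it modifies.
SELECT pg_current_wal_lsn() AS before_lsn \gset
UPDATE pgbench_accounts SET abalance = abalance + 1 WHERE aid = 1;
SELECT resource_manager, record_type, record_length, fpi_length, block_ref
FROM pg_get_wal_records_info(:'before_lsn', pg_current_wal_lsn());
```

Typically you will see a heap record for the new row version, possibly b-tree records for index maintenance, and a transaction commit record, all referring to specific blocks rather than to the SQL statement that caused them.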


However, there is one problem here. On the primary side, transaction log sequencing is highly optimized. Each database backend prepares its transaction log record in local memory, then reserves space in WAL for it, and then copies it into a shared memory buffer. The space reservation is the only thing that happens sequentially; everything else happens concurrently. And the sequential part is only incrementing a location pointer under a spinlock, which is about as fast as it can get. Therefore, on the primary we have lots and lots of backends running through write transactions, encoding the work done as transaction log, and inserting it in parallel. On the replica side, there is only one process applying all of this work. The job it is doing is simpler, but at some scale, the larger number of workers will be able to overwhelm it.
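Whether the single apply process is keeping up is easy to check with the standard monitoring functions and views. A minimal sketch, nothing here is specific to any particular workload:

```sql
-- On the standby: how far replay lags behind what has already been received
SELECT pg_size_pretty(pg_wal_lsn_diff(pg_last_wal_receive_lsn(),
                                       pg_last_wal_replay_lsn())) AS replay_lag;

-- On the primary: per-standby lag as measured by the walsender (PostgreSQL 10+)
SELECT application_name, write_lag, flush_lag, replay_lag
FROM pg_stat_replication;
```

If the replay lag keeps growing while the network and the standby's disks are mostly idle, the apply process itself is the bottleneck.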

Finding the limits

So where is the point where replication will fall behind? As with everything in databases, the answer is “it depends”. Different workloads will have different characteristics. If the working set of modified pages does not fit into cache, the replay process will need to go and fetch them. The replay can skip this read for the first modification during a checkpoint cycle, as the primary will include a copy of the page in the WAL (a full page image, or FPI) to be able to recover from partial writes. If the same pages are modified many times during a checkpoint cycle, the primary is going to generate loads of small WAL records. If pages are only modified once per cycle, the WAL usage per transaction can be two orders of magnitude larger due to FPIs.
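How big the FPI effect is for a given workload can be read from the cumulative statistics in pg_stat_wal (PostgreSQL 14 and later). A rough sketch; the ratio depends entirely on how often the same pages are touched within one checkpoint cycle:

```sql
-- Share of WAL records that carried a full page image, and total WAL volume,
-- since the statistics were last reset
SELECT wal_records,
       wal_fpi,
       round(100.0 * wal_fpi / nullif(wal_records, 0), 1) AS fpi_pct,
       pg_size_pretty(wal_bytes) AS wal_volume
FROM pg_stat_wal;
```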

To try to quantify the crossover point, I ran a small test. I initialized an 8GB pgbench database that fits into shared buffers, set up WAL archiving, and took a backup. Next, I ran 30 minutes of pgbench with synchronous_commit=off to generate some WAL. On a 24-thread workstation, this generated 70GB of WAL containing 66M pgbench transactions, at an average speed of 36.6k tps. Finally, I configured Postgres to run recovery on the backup. This recovery completed in 372 seconds, or 177k tps.
In theory, then, a 120-thread server running updates as fast as it can, without waiting for commits to hit disk, would be able to outrun its replica (36.6k tps over 24 threads is roughly 1.5k tps per thread, and 177k / 1.5k ≈ 120). Keep in mind that this is done with pgbench, which is a notoriously write-heavy and simplistic benchmark. In a more realistic workload, the crossover point is going to be significantly higher because there is more work to do on the primary. That is a good thing, because we don't want to be anywhere near that point; otherwise, operating the cluster would be like walking on the edge of a cliff.
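If you want to measure the replay rate on your own hardware, one simple psql sketch is to sample the replay position on a hot standby that still has a backlog to chew through (the 30-second window is arbitrary):

```sql
-- Run on a hot standby: sample the replay position, wait, sample again,
-- and convert the difference into a per-second replay rate.
SELECT pg_last_wal_replay_lsn() AS lsn_before \gset
SELECT pg_sleep(30);
SELECT pg_size_pretty(pg_wal_lsn_diff(pg_last_wal_replay_lsn(), :'lsn_before') / 30)
       || '/s' AS replay_rate;
```

This gives bytes of WAL replayed per second rather than transactions; compare it with the WAL volume your primary produces per second to see how much headroom is left.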

Pushing the limits

However, can we introduce concurrency on the replica too? It is possible, but only to some extent. The sequencing of WAL records implicitly encodes the dependencies between records. As a simple example, two modifications to the same block must be applied in the same order in which they were generated to get the same end result. There are also more complex dependencies: a transaction commit cannot be replayed before all the modifications done in that transaction, and an index pointer cannot be added before the row it points to has been replayed. If we ignore those dependencies, queries running on the replica will see incorrect results or even crash the database.

What helps is that these dependencies apply to the actual modifications themselves. Non-essential responsibilities in the apply process could be peeled off to run concurrently, allowing the apply process to focus solely on the modifications. Some improvements in this direction have already been made in PostgreSQL.

Starting from version 15, the replay process looks ahead in the WAL stream, decodes the referenced block numbers, attempts to look up each block, and, if it is missing, instructs the operating system to prefetch it into the page cache. All of this work still happens in the startup process, so no parallelism is gained there. However, the prefetch ensures that any necessary disk reads happen while the replay process is still handling previous records. Even more importantly, there can be many such reads in flight at the same time. The readahead distance and the number of I/Os in flight can be set via the wal_decode_buffer_size and maintenance_io_concurrency parameters.
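On a PostgreSQL 15+ standby you can check the relevant settings and watch the prefetcher work through the pg_stat_recovery_prefetch view. A quick sketch; useful values for the two parameters depend on your storage, so none are suggested here:

```sql
-- Current prefetch-related settings
SHOW recovery_prefetch;
SHOW wal_decode_buffer_size;
SHOW maintenance_io_concurrency;

-- Prefetcher activity on the standby (PostgreSQL 15+)
SELECT prefetch, hit, skip_fpw, wal_distance, block_distance, io_depth
FROM pg_stat_recovery_prefetch;
```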

When the long-awaited asynchronous I/O patch lands in PostgreSQL, we could go one step further and initiate the read into shared buffers directly, instead of just prefetching into the operating system page cache.

The above step helps with I/O, which, while very important, was not the problem in the completely CPU-bound replay benchmark above. CPUs are no longer getting any faster, so if we want to improve speed, we will need to do less work. So, what does that replay process actually do? To understand, I captured a few perf profiles from the process and built some flamegraphs.

[Flame graph of the WAL replay process]

Surprisingly, what could be considered the actual replay work turns out to be a minority of the total. The largest parts involve reading WAL and decoding page references from it, followed by looking up those pages in the cache and pinning them so they are not evicted while in use. All of this work could be performed concurrently with the replay loop. For example, a separate read-ahead process could handle these tasks, handing the replay process a queue of transaction log records with the associated cache references already pinned and ready for application.

If we do all this, it looks like we can push the wall out by a factor of a few, possibly hitting half a million transactions per second. Along the way, there will definitely be other challenges. At that speed, the time between wraparound vacuums would be measured in minutes, not days. But at that point, we have hit the limit of what a single transaction log can do, and PostgreSQL will need to start adopting techniques from distributed databases to scale further. That is something to discuss another day.
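To put the wraparound remark into numbers: at half a million write transactions per second, the default autovacuum_freeze_max_age of 200 million transactions is used up in under seven minutes, so the usual once-a-day check below would have to become near-continuous monitoring. The query itself is the standard one and assumes nothing beyond a stock installation:

```sql
-- Age of the oldest unfrozen transaction ID per database; autovacuum must
-- freeze rows long before this approaches ~2 billion
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;
```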

2 responses to “End of the road for PostgreSQL streaming replication?”

  1. This is a helpful blog post. There has been a lot of desire to try to do WAL replay in parallel, but as you stated clearly, that is very complex given the sequential dependency of the WAL contents. You are suggesting additional _pipelining_, which I think is a valid approach to improve performance.
