CYBERTEC PostgreSQL Logo

Why "SELECT * FROM table" is NOT a read

09.2024 / Category: / Tags: |

People tend to think that a SELECT-operation is always a read. This might look obvious and make sense but does it actually hold true? Is a SELECT-statement a read? Can we really determine from the statement what it does so that we can safely load balance? In fact, no. The fact that something seems to be a pure read does not make it a read - and that has huge implications which will be discussed in this short but
important post.

Load balancing: How to do it the wrong way

Often people try to build their own replication solution assumung that load balancing of queries can be simply handled by some kind of middleware. However, this is definitely doomed to fail. In short: Do NOT build software that magically decides in some middleware what to do.

SELECT FROM table graphic 1

Why is that? Consider the following:

But what if the following happens?

In this case I have defined a simple table and a function that just happens to insert some simple data. But what if we add the following view to the scenario?

This will change things because we can access the view easily:

If we call it again we will already see the leftovers of the previous execution:

From the application or middleware point of view we are still talking about a simple SELECT statement. What can possibly go wrong? Well, as you can see: A lot can go wrong because the string we send to the database does not carry all the information the middleware needs to make a decision whether we are talking about reading and writing.

Send reads to the secondary and writes to the primary?

Let us focus on what we want to solve here: We want to use our read-replicas which we have connected via streaming replication to do load balancing. To do that we have to know what is a read and what is a write. But who else is there who could have this information? There is only one component that knows: The application. To do load balancing the application has to be aware of what is going on.

Consider the following example:

We start a transaction assuming that we will just read. However, during the transaction it turns out (application logic) that we indeed have to write some data. The question is: What shall we do with the open transaction? It is impossible to move it to the primary as we would immediately…

  • destroy transactional integrity
  • loss of connection settings
  • destruction of open cursors
  • a lot more …

There is no other way: If you want to do load balancing you need AT LEAST two database connections. Your application has to be aware of what is a read and what is a write. There is no way around it.

Why SELECT FROM table is NOT a read graphic 2

Load balancing and replication

There is one more thing which has to be kept here: Suppose we have made our application smart enough. It knows when it does a write and when it does a read. Can we finally simply route reads to the replicas and load balance things? In PostgreSQL there is a thing called "synchronous replication". Most people
assume that synchronous replication (synchronous_commit = on) is enough to immediately see data on the replicas. This is not true. Let reflect on what we are doing here? By default synchronous replication means that data has been written (= as in "written to the WAL") on at least two machines. This is not enough to ensure that data is visible (= SELECT) on a replica. Why can this happen? The reason for this is the concept of a "replication conflict". There are two blogs which are recommended to learn about
this:

"Dealing with replication conflicts" by Laurenz Albe
"What hot_standby_feedback in PostgreSQL really does" by Hans-Jürgen Schönig

The core idea is that a read can prevent WAL from being applied on the replica (as described in our blogs). The fact that WAL has been written to disk on the replica therefore does not ensure that you can see the data that has changed unless synchronous_commit is changed to "remote_write" which of course comes with a handful of implications which is beyond the scope of this article.

Load balancing: Conclusion

The conclusion here is: If you are aiming for load balancing, please stop writing middleware that tries to figure out somehow how to load balance and what to load balance. It might backfire badly unless you are totally aware of what you are doing and unless you know exactly what can break things and what does not. In case transactional integrity is an issue, coming up with your own load balancing solution is most likely to fail miserably. We are posting this as a warning because we have seen too many people fail recently so that we decided to write a small article which hopefully prevents more disaster from happening.

In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Facebook or LinkedIn.

3.3 4 votes
Article Rating
Subscribe
Notify of
guest
4 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
Hari
Hari
9 days ago

What would be the preferred load balancers then? pgbouncer, pgpool? any etc?

Laurenz Albe
8 days ago
Reply to  Hari

I have the impression that you missed the point of the article: no load balancer is going to help you with that in a reliable fashion. You have to teach the application to explicitly perform reads and writes on different data sources.

Hari
Hari
5 days ago
Reply to  Laurenz Albe

Ah, when you mentioned writing your middleware, I understood that you are referring own poolers solutions apart from regular poolers in market. Do you think replication parameters like read_only_function_list will help here?

Last edited 5 days ago by Harisai M
Laurenz Albe
5 days ago
Reply to  Hari

Yes, that pgpool parameter is an attempt to address that problem. But we don't consider that a very robust solution: it requires that you configure pgpool with application knowledge, and if you forget something or the application changes, things will go wrong.
It is much better to have the decision where to route the statement right in the application code, so that it can be easily maintained whenever you change the code.

CYBERTEC Logo white
CYBERTEC PostgreSQL International GmbH
Römerstraße 19
2752 Wöllersdorf
Austria

+43 (0) 2622 93022-0
office@cybertec.at

Get the newest PostgreSQL Info & Tools


    This site is protected by reCAPTCHA and the Google Privacy Policy & Terms of Service apply.

    ©
    2024
    CYBERTEC PostgreSQL International GmbH
    phone-handsetmagnifiercrosscross-circle
    linkedin facebook pinterest youtube rss twitter instagram facebook-blank rss-blank linkedin-blank pinterest youtube twitter instagram