
Avoiding “OR” for better query performance

05.2018
To be OR not to be...
 © Laurenz Albe 2018

 

PostgreSQL query tuning is our daily bread at CYBERTEC, and once you have done some of that, you'll start bristling whenever you see an OR in a query, because it is usually the cause of bad query performance.

Of course there is a reason why there is an OR in SQL, and if you cannot avoid it, you have to use it. But you should be aware of the performance implications.

In this article I'll explore “good” and “bad” ORs and what you can do to avoid the latter.

A little sample schema

We'll use this simple setup for demonstration:
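-- the exact test data are arbitrary; 100000 rows per table
-- match the numbers discussed further down
CREATE TABLE a (
   id    integer NOT NULL,
   a_val text    NOT NULL
);

CREATE TABLE b (
   id    integer NOT NULL,
   b_val text    NOT NULL
);

INSERT INTO a
SELECT i, 'value ' || i
FROM generate_series(1, 100000) AS i;

INSERT INTO b
SELECT i, 'value ' || i
FROM generate_series(1, 100000) AS i;

ALTER TABLE a ADD PRIMARY KEY (id);
ALTER TABLE b ADD PRIMARY KEY (id);

VACUUM (ANALYZE) a;
VACUUM (ANALYZE) b;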

Suppose that we want to run queries with equality and LIKE conditions on the text columns, so we need some indexes:
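-- the index names are arbitrary
CREATE INDEX a_val_idx ON a (a_val text_pattern_ops);
CREATE INDEX b_val_idx ON b (b_val text_pattern_ops);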

Have a look at the documentation if you don't understand text_pattern_ops.

The “good” OR

An OR is fine in most parts of an SQL query: if it is not used to filter out rows from your query result, it will have no negative effect on query performance.

So if your OR appears in a CASE expression in the SELECT list, don't worry.

Unfortunately you usually find the OR where it hurts: in the WHERE clause.

The “bad” OR

Now for an example of an OR in a WHERE clause that is still pretty nice:
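EXPLAIN (COSTS off)
SELECT id FROM a
WHERE id = 42 OR a_val = 'value 42';  -- arbitrary example values

                        QUERY PLAN
-----------------------------------------------------------
 Bitmap Heap Scan on a
   Recheck Cond: ((id = 42) OR (a_val = 'value 42'::text))
   ->  BitmapOr
         ->  Bitmap Index Scan on a_pkey
               Index Cond: (id = 42)
         ->  Bitmap Index Scan on a_val_idx
               Index Cond: (a_val = 'value 42'::text)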

PostgreSQL can actually use an index scan for the query, because it can combine the bitmaps for both indexes with a “bitmap OR”.
Note, however, that a bitmap index scan is more expensive than a normal index scan, since it has to build the bitmap. Moreover, it uses much more RAM; each of these bitmaps can use up to work_mem memory.

A multi-column index on (id, a_val) won't help at all with this query, so there is no cheaper way to execute it.

IN is better than OR

Now for a more stupid variant of the above query:
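EXPLAIN (COSTS off)
SELECT id FROM a
WHERE id = 42 OR id = 4711;  -- arbitrary example values

               QUERY PLAN
-----------------------------------------
 Bitmap Heap Scan on a
   Recheck Cond: ((id = 42) OR (id = 4711))
   ->  BitmapOr
         ->  Bitmap Index Scan on a_pkey
               Index Cond: (id = 42)
         ->  Bitmap Index Scan on a_pkey
               Index Cond: (id = 4711)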

Again, a bitmap index scan is used. But there is a simple method to rewrite that query without the pesky OR:
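EXPLAIN (COSTS off)
SELECT id FROM a
WHERE id IN (42, 4711);

                    QUERY PLAN
--------------------------------------------------
 Index Only Scan using a_pkey on a
   Index Cond: (id = ANY ('{42,4711}'::integer[]))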

You see? As soon as you get rid of the OR, an efficient index scan can be used!

You might say that this is good for equality conditions, but what about the following query:
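SELECT id FROM a
WHERE a_val LIKE 'value 4%'   -- the patterns are arbitrary
   OR a_val LIKE 'value 9%';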

To improve that query, observe that the PostgreSQL optimizer rewrote the IN in the previous query to = ANY.

This is a case of the standard SQL “quantified comparison predicate”: ANY is true if the comparison is TRUE for any of the values on the right-hand side (the standard only defines this for subqueries on the right-hand side, but PostgreSQL extends the syntax to arrays).

Now LIKE is a comparison operator as well, so we can write:
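EXPLAIN (COSTS off)
SELECT id FROM a
WHERE a_val LIKE ANY (ARRAY['value 4%', 'value 9%']);

                       QUERY PLAN
---------------------------------------------------------
 Seq Scan on a
   Filter: (a_val ~~ ANY ('{value 4%,value 9%}'::text[]))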

Unfortunately, the index cannot be used here.

pg_trgm to the rescue

But we are not at the end of our wits yet! There is such a wealth of indexes in PostgreSQL; let's try a different one. For this, we need the pg_trgm extension:
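CREATE EXTENSION pg_trgm;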

Then we can create a GIN trigram index on the column:
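-- the index name is arbitrary
CREATE INDEX a_val_trgm_idx ON a USING gin (a_val gin_trgm_ops);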

Now things are looking better:
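EXPLAIN (COSTS off)
SELECT id FROM a
WHERE a_val LIKE ANY (ARRAY['value 4%', 'value 9%']);

                          QUERY PLAN
-----------------------------------------------------------------
 Bitmap Heap Scan on a
   Recheck Cond: (a_val ~~ ANY ('{value 4%,value 9%}'::text[]))
   ->  Bitmap Index Scan on a_val_trgm_idx
         Index Cond: (a_val ~~ ANY ('{value 4%,value 9%}'::text[]))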

Feel the power of trigram indexes!

Note 1: This index can also be used if the search pattern starts with %.

Note 2: The GIN index can become quite large. To avoid that, you can also use a GiST index, which is much smaller, but less efficient to search.

The “ugly” OR

Things become really bad if OR combines conditions from different tables:
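EXPLAIN (COSTS off)
SELECT a.id, a.a_val, b.b_val
FROM a JOIN b ON b.id = a.id
WHERE a.id = 42 OR b.id = 42;  -- arbitrary example value

                 QUERY PLAN
---------------------------------------------
 Merge Join
   Merge Cond: (a.id = b.id)
   Join Filter: ((a.id = 42) OR (b.id = 42))
   ->  Index Scan using a_pkey on a
   ->  Index Scan using b_pkey on b

Depending on your PostgreSQL version and statistics, you may get a hash join instead; either way, the whole join has to be computed.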

Here we have to compute the complete join between the two tables and afterwards filter out all rows that do not match the condition. In our example, that means computing 100000 rows only to throw away the 99999 that do not match the condition.

Avoiding the ugly OR

Fortunately, there is an equivalent query that is longer to write, but much cheaper to execute:
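EXPLAIN (COSTS off)
SELECT a.id, a.a_val, b.b_val
FROM a JOIN b ON b.id = a.id
WHERE a.id = 42
UNION
SELECT a.id, a.a_val, b.b_val
FROM a JOIN b ON b.id = a.id
WHERE b.id = 42;

                       QUERY PLAN
--------------------------------------------------------
 Unique
   ->  Sort
         Sort Key: a.id, a.a_val, b.b_val
         ->  Append
               ->  Nested Loop
                     ->  Index Scan using a_pkey on a
                           Index Cond: (id = 42)
                     ->  Index Scan using b_pkey on b
                           Index Cond: (id = 42)
               ->  Nested Loop
                     ->  Index Scan using a_pkey on a a_1
                           Index Cond: (id = 42)
                     ->  Index Scan using b_pkey on b b_1
                           Index Cond: (id = 42)

(The exact plan shape will vary with version and statistics.)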

Both parts of the query can make use of efficient index scans and return one row, and since the rows happen to be identical, UNION will reduce them to one row.

If you can be certain that both branches of the query will return distinct sets, it is better to use UNION ALL instead of UNION, because that doesn't have to do the extra processing to remove duplicates.

When using this trick, you should be aware that rewriting a query in that fashion does not always result in an equivalent query: if the original query can return identical rows, these would be removed by the UNION. In our case, we don't have to worry, because the primary keys were included in the query result. I find that this is hardly ever a problem in practice.

 



15 responses to “Avoiding “OR” for better query performance”

  1. Perhaps an alternative to the "ugly or" query could be to pass the filter to the join condition for the secondary table, which won't change the results as it's an INNER JOIN.

    EXPLAIN (COSTS off)
    SELECT a.id, a.a_val, b.b_val
    FROM a JOIN b ON (a.id=b.id and b.id = 42)
    WHERE a.id = 42;

    QUERY PLAN
    ------------------------------------
    Nested Loop
    -> Index Scan using a_pkey on a
    Index Cond: (id = 42)
    -> Index Scan using b_pkey on b
    Index Cond: (id = 42)
    (5 rows)

  2. Great post! I think I don't quite understand one part. You wrote that "a bitmap index scan is much more expensive than a normal index scan — it has to scan the complete index. ". Why is it so? AFAIK, the main difference between an index scan and a bitmap index scan is that the former goes to the table immediately after finding a matching tuple, while the latter adds them to bitmap, merges bitmaps from all indexes and visits the tuples in physical order.

    • OR is hard for all query optimizers, so in principle yes.
      Many of the techniques and workarounds in the article are PostgreSQL specific though, so they are not applicable to other databases.
      For example, I would be surprised if a database that knows no join algorithm other than nested loops has anything as sophisticated as a bitmap index scan.

  3. These "human powered query optimizations" usually result in poor performance down the road when the machine-powered query optimizer has adapted to the anti-case. I've seen many nearly-impossible-to-read SQL queries written 'for performance' that, when re-written to be naive, are 10 times faster.

    This seems like a good feature request for a new query optimizer step.

  4. In PostgreSQL 12, the statement

    EXPLAIN (COSTS OFF)
    SELECT a.id, a.a_val, b.b_val
    FROM a JOIN b ON b.id = a.id
    WHERE a.id = 42
       OR b.id = 42;

    gives

    Hash Join
    Hash Cond: (a.id = b.id)
    Join Filter: ((a.id = 42) OR (b.id = 42))
    -> Seq Scan on a
    -> Hash
    -> Seq Scan on b

    I have turned this over in my head and I don't understand why it gives a different result.

  5. Very interesting post, thanks.
    I get a different plan for the "ugly or" (increasing the size of the tables does not change it):

    QUERY PLAN
    ---------------------------------------------
    Hash Join
    Hash Cond: (a.id = b.id)
    Join Filter: ((a.id = 42) OR (b.id = 42))
    -> Seq Scan on a
    -> Hash
    -> Seq Scan on b
    (6 rows)

    • The join strategy doesn't matter. Still, it will calculate the whole join and filter away whatever does not satisfy the condition.

  6. Hey Laurenz!

    Nice blog post (and kudos for your blog in general - I've read many articles here).
    I've met most of the mentioned problems multiple times in my work around DB performance.

    I've long been wondering about OR involving fields from two tables - you write that "Here we have to compute the complete join between the two tables and afterwards filter out all rows matching the condition".
    Indeed, optimizers have problems with this scenario, but do you have any details as to why that is the case?
    Why can't optimizer first filter rows from one table, and then join to the second (also potentially having filtered it beforehand), given that suitable indices are present?
    Or rewrite it to something involving UNION?
    I never found a good explanation of the essential complexity here. Perhaps you know some subtler details, or there's some paper about it, or a nice thread on Postgres hackers or someplace else?
    Thanks in advance!

    • The only thing the optimizer could possibly do is to rewrite the query to use UNION or UNION ALL, but there is a problem. Consider this query:

      SELECT col FROM a JOIN b USING (q) WHERE a.x = 1 OR b.y = 2;

      If we rewrite it with UNION ALL, we could get a bad duplicate result if a result row satisfies both conditions. If, on the other hand, we use UNION, that could eliminate a legitimate duplicate result.
      In some cases it may be possible to prove that one of the alternatives is indeed correct, but that is beyond the capabilities of the optimizer.

      • Thanks - makes sense regarding UNION!
        But why is filtering either table before the join not a viable strategy?

        • Think about it. Given this query:
          SELECT * FROM a JOIN b USING (col) WHERE a.x = 1 OR b.y = 2;
          there can be results where a.x is not 1 and results where b.y is not 2. You would miss those results if you filter before you join. That strategy only works for AND.

      • yep, it was funny - I listened to some Postgres related podcast, and they mentioned your (new) article, saying something along the lines "Someone asked Laurenz in the comments, and he wrote an article on UNION vs OR". And I thought - hmm, that someone is me!
        Thanks for the article 😀
