Table of Contents
PostgreSQL query tuning is our daily bread at CYBERTEC, and once you have done some of that, you'll start bristling whenever you see an OR
in a query, because they are usually the cause for bad query performance.
Of course there is a reason why there is an OR
in SQL, and if you cannot avoid it, you have to use it. But you should be aware of the performance implications.
In this article I'll explore “good” and “bad” OR
s and what you can do to avoid the latter.
We'll use this simple setup for demonstration:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
CREATE TABLE a(id integer NOT NULL, a_val text NOT NULL); INSERT INTO a SELECT i, md5(i::text) FROM generate_series(1, 100000) i; CREATE TABLE b(id integer NOT NULL, b_val text NOT NULL); INSERT INTO b SELECT i, md5(i::text) FROM generate_series(1, 100000) i; ALTER TABLE a ADD PRIMARY KEY (id); ALTER TABLE b ADD PRIMARY KEY (id); ALTER TABLE b ADD FOREIGN KEY (id) REFERENCES a; VACUUM (ANALYZE) a; VACUUM (ANALYZE) b; |
Suppose that we want to run queries with equality and LIKE
conditions on the text
columns, so we need some indexes:
1 2 |
CREATE INDEX a_val_idx ON a(a_val text_pattern_ops); CREATE INDEX b_val_idx ON b(b_val text_pattern_ops); |
Have a look at the documentation if you don't understand text_pattern_ops
.
OR
An OR
is fine in most parts of an SQL query: if it is not used to filter out rows from your query result, it will have no negative effect on query performance.
So if your OR
appears in a CASE
expression in the SELECT
list, don't worry.
Unfortunately you usually find the OR
where it hurts: in the WHERE
clause.
OR
Now for an example of an OR
in a WHERE
clause that is still pretty nice:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
EXPLAIN (COSTS off) SELECT id FROM a WHERE id = 42 OR a_val = 'value 42'; QUERY PLAN ----------------------------------------------------------- Bitmap Heap Scan on a Recheck Cond: ((id = 42) OR (a_val = 'value 42'::text)) -> BitmapOr -> Bitmap Index Scan on a_pkey Index Cond: (id = 42) -> Bitmap Index Scan on a_val_idx Index Cond: (a_val = 'value 42'::text) (7 rows) |
PostgreSQL can actually use an index scan for the query, because it can combine the bitmaps for both indexes with a “bitmap OR”.
Note, however, that a bitmap index scan is more expensive than a normal index scan, since it has to build the bitmap. Moreover, it uses much more RAM; each of these bitmaps can use up to work_mem
memory.
A multi-column index on (id, a_val)
won't help at all with this query, so there is no cheaper way to execute it.
IN
is better than OR
Now for a more stupid variant of the above query:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
EXPLAIN (COSTS off) SELECT id FROM a WHERE id = 42 OR id = 4711; QUERY PLAN -------------------------------------------- Bitmap Heap Scan on a Recheck Cond: ((id = 42) OR (id = 4711)) -> BitmapOr -> Bitmap Index Scan on a_pkey Index Cond: (id = 42) -> Bitmap Index Scan on a_pkey Index Cond: (id = 4711) (7 rows) |
Again, a bitmap index scan is used. But there is a simple method to rewrite that query without the pesky OR
:
1 2 3 4 5 6 7 8 9 |
EXPLAIN (COSTS off) SELECT id FROM a WHERE id IN (42, 4711); QUERY PLAN --------------------------------------------------- Index Only Scan using a_pkey on a Index Cond: (id = ANY ('{42,4711}'::integer[])) (2 rows) |
You see? As soon as you get rid of the OR
, an efficient index scan can be used!
You might say that this is good for equality conditions, but what about the following query:
1 2 3 |
SELECT id FROM a WHERE a_val LIKE 'something%' OR a_val LIKE 'other%'; |
To improve that query, observe that the PostgreSQL optimizer rewrote the IN
in the previous query to = ANY
.
This is a case of the standard SQL “quantified comparison predicate”:
is true if the comparison is TRUE
for any of the values on the right-hand side (the standard only defines this for subqueries on the right-hand side, but PostgreSQL extends the syntax to arrays).
Now LIKE
is a comparison operator as well, so we can write:
1 2 3 4 5 6 7 8 9 |
EXPLAIN (COSTS off) SELECT id FROM a WHERE a_val LIKE ANY (ARRAY['something%', 'other%']); QUERY PLAN ---------------------------------------------------------- Seq Scan on a Filter: (a_val ~~ ANY ('{something%,other%}'::text[])) (2 rows) |
Unfortunately, the index cannot be used here.
pg_trgm
to the rescueBut we are not at the end of our wits yet! There is such a wealth of indexes in PostgreSQL; let's try a different one. For this, we need the pg_trgm
extension:
1 |
CREATE EXTENSION pg_trgm; |
Then we can create a GIN trigram index on the column:
1 |
CREATE INDEX a_val_trgm_idx ON a USING gin (a_val gin_trgm_ops); |
Now things are looking better:
1 2 3 4 5 6 7 8 9 10 11 |
EXPLAIN (COSTS off) SELECT id FROM a WHERE a_val LIKE ANY (ARRAY['something%', 'other%']); QUERY PLAN -------------------------------------------------------------------- Bitmap Heap Scan on a Recheck Cond: (a_val ~~ ANY ('{something%,other%}'::text[])) -> Bitmap Index Scan on a_val_trgm_idx Index Cond: (a_val ~~ ANY ('{something%,other%}'::text[])) (4 rows) |
Feel the power of trigram indexes!
Note 1: This index can also be used if the search pattern starts with %
Note 2: The GIN index can become quite large. To avoid that, you can also use a GiST index, which is much smaller, but less efficient to search.
OR
Things become really bad if OR
combines conditions from different tables:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
EXPLAIN (COSTS off) SELECT id, a.a_val, b.b_val FROM a JOIN b USING (id) WHERE a.id = 42 OR b.id = 42; QUERY PLAN --------------------------------------------- Merge Join Merge Cond: (a.id = b.id) Join Filter: ((a.id = 42) OR (b.id = 42)) -> Index Scan using a_pkey on a -> Index Scan using b_pkey on b (5 rows) |
Here we have to compute the complete join between the two tables and afterwards filter out all rows matching the condition. In our example, that would mean computing 100000 rows only to throw away the 99999 that do not match the condition.
OR
Fortunately, there is an equivalent query that is longer to write, but much cheaper to execute:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 |
EXPLAIN (COSTS off) SELECT id, a.a_val, b.b_val FROM a JOIN b USING (id) WHERE a.id = 42 UNION SELECT id, a.a_val, b.b_val FROM a JOIN b USING (id) WHERE b.id = 42; QUERY PLAN ---------------------------------------------------------- Unique -> Sort Sort Key: a.id, a.a_val, b.b_val -> Append -> Nested Loop -> Index Scan using a_pkey on a Index Cond: (id = 42) -> Index Scan using b_pkey on b Index Cond: (id = 42) -> Nested Loop -> Index Scan using a_pkey on a a_1 Index Cond: (id = 42) -> Index Scan using b_pkey on b b_1 Index Cond: (id = 42) (14 rows) |
Both parts of the query can make use of efficient index scans and return one row, and since the rows happen to be identical, UNION
will reduce them to one row.
If you can be certain that both branches of the query will return distinct sets, it is better to use UNION ALL
instead of UNION
, because that doesn't have to do the extra processing to remove duplicates.
When using this trick, you should be aware that rewriting a query in that fashion does not always result in an equivalent query: if the original query can return identical rows, these would be removed by the UNION
. In our case, we don't have to worry, because the primary keys were included in the query result. I find that this is hardly ever a problem in practice.
In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Twitter, Facebook, or LinkedIn.
You need to load content from reCAPTCHA to submit the form. Please note that doing so will share data with third-party providers.
More InformationYou are currently viewing a placeholder content from Facebook. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More InformationYou are currently viewing a placeholder content from X. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More Information
Perhaps an alternative to the "ugly or" query could be to pass the filter to the join condition for the secondary table, which won't change the results as it's an INNER JOIN.
EXPLAIN (COSTS off)
SELECT a.id, a.a_val, b.b_val
FROM a JOIN b ON (a.id=b.id and b.id = 42)
WHERE a.id = 42;
QUERY PLAN
------------------------------------
Nested Loop
-> Index Scan using a_pkey on a
Index Cond: (id = 42)
-> Index Scan using b_pkey on b
Index Cond: (id = 42)
(5 rows)
Great post! I think I don't quite understand one part. You wrote that "a bitmap index scan is much more expensive than a normal index scan — it has to scan the complete index. ". Why is it so? AFAIK, the main difference between an index scan and a bitmap index scan is that the former goes to the table immediately after finding a matching tuple, while the latter adds them to bitmap, merges bitmaps from all indexes and visits the tuples in physical order.
Hello, I'd like to translate the article https://www.cybertec-postgresql.com/avoid-or-for-better-performance/ into Japanese and publish on our tech blog https://techracho.bpsinc.jp/ for sharing it. Is it OK for you?
I make sure to indicate the link to original, title, author name in the case.
Best regards,
Forgive me for being so gross: does this apply to MySQL too?
OR
is hard for all query optimizers, so in principle yes.Many of the techniques and workarounds in the article are PostgreSQL specific though, so they are not applicable to other databases.
For example, I would be surprised if a database that knows no join algorithm other than nested loops has anything as sophisticated as a bitmap index scan.
These "human powered query optimizations" usually result in poor performance down the road when the machine-powered query optimizer has adapted to the anti-case. I've seen many nearly-imposible-to-read SQL queries written 'for performance' that, when re-written to be naive, are 10 times faster.
This seems like a good feature request for a new query optimizer step.
In PostgreSQL 12 the sentence
EXPLAIN (COSTS OFF)
SELECT a.id, a.a_val, b.b_val
FROM TO JOIN INTERNAL b EN b.id = a.id
WHERE a.id = 42
or b.id = 42;
throws
Hash Join "
Hash Cond: (a.id = b.id)
Join Filter: ((a.id = 42) OR (b.id = 42)) "
-> Seq Scan on a
-> Hash
-> Seq Scan on b
I have turned the matter around and I don't understand why it gives a different result
Very interesting post, thanks.
I get a different plan for the "ugly or" (increasing the size of the tables does not change it):
QUERY PLAN
---------------------------------------------
Hash Join
Hash Cond: (a.id = b.id)
Join Filter: ((a.id = 42) OR (b.id = 42))
-> Seq Scan on a
-> Hash
-> Seq Scan on b
(6 lignes)
The join strategy doesn't matter. Still, it will calculate the whole join and filter away whatever does not satisfy the condition.
Hey Laurenz!
Nice blog post (and kudos for your blog in general - I've read many articles here).
I've met most of the mentioned problems multiple times in my work around DB performance.
I've long been wondering about OR involving fields from two tables - you write that "Here we have to compute the complete join between the two tables and afterwards filter out all rows matching the condition".
Indeed, optimizers have problems with these scenario, but do you have any details as to why it's the case?
Why can't optimizer first filter rows from one table, and then join to the second (also potentially having filtered it beforehand), given that suitable indices are present?
Or rewrite it to something involving UNION?
I never found a good explanation of the essential complexity here.. Probably you know some subtler details, or there's some paper about it, or a nice thread on Postgres hackers or someplace else?
Thanks in advance!
The only thing the optimizer could possibly do is to rewrite the query to use
UNION
orUNION ALL
, but there is a problem. Consider this query:SELECT col FROM a JOIN b USING (q) WHERE a.x = 1 OR b.y = 2;
If we rewrite it with
UNION ALL
, we could get a bad duplicate result if a result row satisfies both conditions. If, on the other hand, we useUNION
, that could eliminate a legitimate duplicate result.In some cases it may be possible to prove that one of the alternatives is indeed correct, but that is beyond the capabilities of the optimizer.
Thanks - makes sense regarding UNION!
But why is filtering either table before the join not a viable strategy?
Think about it. Given this query:
SELECT * FROM a JOIN b USING (col) WHERE a.x = 1 OR b.y = 2;
there can be results where
a.x
is not 1 and results whereb.y
is not 2. You would miss those results if you filter before you join. That strategy only works forAND
.I actually wrote a whole new article about that.
yep, it was funny - I listened to some Postgres related podcast, and they mentioned your (new) article, saying something along the lines "Someone asked Laurenz in the comments, and he wrote an article on UNION vs OR". And I thought - hmm, that someone is me!
Thanks for the article 😀