Table of Contents
In my article that reviles OR
, I showed how in certain cases, it is possible to rewrite OR
in a WHERE
condition to a longer query with UNION
that can perform much better (the “ugly” OR
). Now people have asked me repeatedly why the PostgreSQL optimizer does not perform such a transformation automatically. This article explains the problems involved.
OR
In the article referenced , I show the following query that will never perform well with bigger tables:
1 2 3 4 |
SELECT id, a.a_val, b.b_val FROM a JOIN b USING (id) WHERE a.id = 42 OR b.id = 42; |
The solution was to rewrite OR
like this:
1 2 3 4 5 6 7 |
SELECT id, a.a_val, b.b_val FROM a JOIN b USING (id) WHERE a.id = 42 UNION SELECT id, a.a_val, b.b_val FROM a JOIN b USING (id) WHERE b.id = 42; |
If both WHERE
conditions are selective, then indexes on a.id
and b.id
can turn both parts of the UNION
into fast nested loop joins.
OR
to UNION
?The above transformation is certainly no secret, and many people know about it. But I keep hearing comments that it is the job of the PostgreSQL optimizer to automatically transform queries like this. The following example shows why that is not always possible:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |
CREATE TABLE a ( id integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY, x integer, p integer ); CREATE TABLE b ( id integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY, x integer, q integer ); INSERT INTO a (x, p) VALUES (1, 1), (1, 1), (2, 1); INSERT INTO b (x, q) VALUES (1, 3), (2, 3); |
The query with OR
in this case is:
1 2 3 4 5 6 7 8 9 10 |
SELECT x, a.p, b.q FROM a JOIN b USING (x) WHERE a.p = 1 OR b.q = 3; x │ p │ q ═══╪═══╪═══ 1 │ 1 │ 3 1 │ 1 │ 3 2 │ 1 │ 3 (3 rows) |
UNION
is no solutionIndeed, if we try it, we get a different result:
1 2 3 4 5 6 7 8 9 10 11 12 13 |
SELECT x, a.p, b.q FROM a JOIN b USING (x) WHERE a.p = 1 UNION SELECT x, a.p, b.q FROM a JOIN b USING (x) WHERE b.q = 3; x │ p │ q ═══╪═══╪═══ 2 │ 1 │ 3 1 │ 1 │ 3 (2 rows) |
The reason for the difference is that UNION
removes all duplicates in the result set, similar to DISTINCT
. Consequently, PostgreSQL removes the legitimate duplicate result row.
Why then did this transformation work in the query from the beginning of this article? Well, I made a tacit assumption there, namely that the column id
is the primary key of both tables. If we have a unique identifier in the result set, there cannot be any duplicate rows in the result set, and the above transformation is always correct.
UNION ALL
is no solution eitherWell, if UNION
removes duplicate results that we need, then we should try the more efficient UNION ALL
, which does not remove duplicates:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
SELECT x, a.p, b.q FROM a JOIN b USING (x) WHERE a.p = 1 UNION ALL SELECT x, a.p, b.q FROM a JOIN b USING (x) WHERE b.q = 3; x │ p │ q ═══╪═══╪═══ 1 │ 1 │ 3 1 │ 1 │ 3 2 │ 1 │ 3 1 │ 1 │ 3 1 │ 1 │ 3 2 │ 1 │ 3 (6 rows) |
Now we have another problem: all rows that satisfy both conditions in the original OR
query appear twice in the result set, once from every branch in the UNION ALL
. In this example, all rows satisfy both conditions, so the whole result set is duplicated.
So neither of the transformations produces the correct result set!
OR
to UNION
When is it safe to transform OR
to UNION ALL
? It is safe if there is no result row that satisfies both of the conditions combined with OR
. This is probably often true, but I cannot think of a case where the database could verify that automatically.
On the other hand, let's consider when it is safe to transform OR
to UNION
. The condition is that the original result set does not contain any duplicates. The only way to guarantee this lack of duplicates is if the SELECT
list contains a non-NULL unique key from both tables involved. Now this is a frequent case: way too many people select more columns than they actually need!
So the query optimizer could in theory check if the SELECT
list contains a non-NULL unique key from both tables. If the tables in the join are actual base tables, such a check might not be too hard. If the tables are derived tables (for examples, the result of a join), then that check would be more difficult.
Such a change would certainly make query optimization more complicated. Every query with an OR
in the WHERE
condition would have to be checked:
FROM
expression is the join of two relationsSELECT
list contains a non-NULL unique key of both base tablesThat is, many queries would require more planning time, but only a few queries would benefit. Moreover, I suspect that a change like this is at the very least complicated, because the alternative query would be radically different from the original. Such changes are typically done in the query rewriter, but we cannot rewrite the query unconditionally, since the original query may perform better than the rewritten one (for example if the WHERE
condition is not selective, or if the necessary indexes are missing).
We have seen that it is not always possible to rewrite OR
to UNION
, and it is questionable whether automatic rewriting by PostgreSQL is feasible at all.
In practice, you will probably always be able to make the transformation: either you can determine that duplicate result don't matter and use UNION ALL
, or you can exclude that there cannot be duplicate results anyway and use UNION
. At any rate, you will have to do it by hand.
In case you're interested in improving query performance, check out Query Parameter Data Types and Performance
If you would learn the basics about SQL queries, see What is an Inner Join? And What is an Outer Join?
You need to load content from reCAPTCHA to submit the form. Please note that doing so will share data with third-party providers.
More InformationYou are currently viewing a placeholder content from Facebook. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More InformationYou are currently viewing a placeholder content from X. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More Information
This could be a safe transformation when adding the condition of the first branch into the second branch with IS NOT TRUE:
SELECT x, a.p, b.q
FROM a JOIN b USING (x)
WHERE a.p = 1
UNION ALL
SELECT x, a.p, b.q
FROM a JOIN b USING (x)
WHERE b.q = 3 and a.p = 1 is not true
;
Ah, that is a good idea!
Very useful for manually rewriting such queries.