Table of Contents
SQL allows you to use subqueries almost anywhere where you could have a table or column name. All you have to do is surround the query with parentheses, like (SELECT ...)
, and you can use it in arbitrary expressions. This makes SQL a powerful language – and one that can be hard to read. But I don't want to discuss the beauty or ugliness of SQL. In this article, I want to tell you how to write subqueries that perform well. I'll start simple, but get to more surprising and complicated topics later.
In a subquery you can use table columns from the outside, like
1 2 3 |
SELECT a.col1, (SELECT b.col2 FROM b WHERE b.x = a.x) FROM a; |
The subquery will be different for each row in “a
”. Such a subquery is usually called a correlated subquery. An uncorrelated subquery is one that does not reference anything from the outside.
Uncorrelated subqueries are simple. If the PostgreSQL optimizer does not “pull it up” (integrate it in the main query tree), the executor will calculate it in a separate step. You can see that as an InitPlan
(initial plan) in the output of EXPLAIN
. Uncorrelated subqueries are almost never a performance problem. In the rest of this article, I will mostly deal with correlated subqueries.
If you write a subquery in a place in an SQL statement where you would otherwise have to write a single value, it is a scalar subquery. An example for a scalar subquery is the one in the previous section. A different example would be
1 2 3 4 5 |
SELECT a.col1 FROM a WHERE 1 = (SELECT count(*) FROM b WHERE b.x = a.x); |
If a scalar subquery returns no result, the resulting value is NULL. If the query returns more than a single row, you will receive a run-time error:
1 |
ERROR: more than one row returned by a subquery used as an expression |
A tabular subquery appears in a context where it can return more than one value:
FROM
list entry: FROM (SELECT ...) AS alias
WITH q AS (SELECT ...) SELECT ...
IN
or NOT IN
expression: WHERE a.x IN (SELECT ...)
EXISTS
or NOT EXISTS
expression: WHERE NOT EXISTS (SELECT ...)
My rule of thumb is: avoid correlated scalar subqueries whenever you can. The reason is that PostgreSQL can only execute a scalar subquery as a nested loop. For example, PostgreSQL will execute the subquery from the first section once for each row in table “a
”. This can be fine if “a
” is a small table (remember, my recommendation is just a rule of thumb). However, if table “a
” is large, even a fast subquery will make the query execution unpleasantly slow.
SELECT
list or WHERE
clauseIf correlated scalar subqueries are bad for performance, how can we avoid them? There is no single, straightforward answer, and you probably won't be able to rewrite the query to avoid such subqueries in all cases. But usually the solution is to convert the subquery into a join. For our first query, that will look like this:
1 2 3 4 |
SELECT a.col1, b.col2 FROM a LEFT JOIN b ON b.x = a.x; |
That query is semantically equivalent, with the exception that we don't get a run-time error if a row in “a
” matches more than one row in “b
”. We need an outer join to account for the case where the subquery returns no result.
For our second example, the rewritten query would look like this:
1 2 3 4 5 |
SELECT a.col1 FROM a JOIN b ON b.x = a.x GROUP BY a.pkey, a.col1 HAVING count(*) = 1; |
Here, a.pkey
is the primary key of “a
”. Grouping by a.col1
would not be sufficient, because two different rows from table “a
” could have the same value for col1
.
The advantage of rewriting the queries as shown above is that PostgreSQL can choose the optimal join strategy and is not restricted to nested loops. If the table “a
” has few rows, that may not make a difference, since a nested loop join may be the most efficient join strategy anyway. But you also won't lose by rewriting the query in that case. And if “a
” is large, you will be much faster with a hash or a merge join.
While correlated scalar subqueries are usually bad, the case is not so simple with tabular subqueries. Let's consider the different cases separately.
FROM
These cases are almost identical, because you can always rewrite a CTE to a subquery in FROM
unless it is a recursive, a MATERIALIZED
or a data modifying CTE. CTEs are never correlated, so they are never problematic. However, a FROM
clause entry can be correlated in a lateral join:
1 2 3 4 5 6 7 8 |
SELECT a.col1, sub.col2 FROM a CROSS JOIN LATERAL (SELECT b.col2 FROM b WHERE b.x = a.x ORDER BY b.sort LIMIT 1) AS sub; |
Again, PostgreSQL will execute such a subquery in a nested loop, which can perform badly for large table “a
”. Therefore, it's usually a good idea to rewrite the query to avoid a correlated subquery:
1 2 3 4 5 |
SELECT DISTINCT ON (a.pkey) a.col1, b.col2 FROM a JOIN b ON b.x = a.x ORDER BY a.pkey, b.sort; |
The rewritten query will perform better if “a
” has many rows, but it could perform worse if “a
” is small and “b
“ is large, but has an index on (x, sort)
.
EXISTS
and NOT EXISTS
This is a special case. So far I have always recommended avoiding correlated subqueries. But with EXISTS
and NOT EXISTS
, the PostgreSQL optimizer is able to transform the clause to a semi-join and anti-join, respectively. That allows PostgreSQL to use all join strategies, not only nested loops.
Consequently, PostgreSQL can process correlated subqueries in EXISTS
and NOT EXISTS
efficiently.
IN
and NOT IN
You will maybe expect that these two cases behave alike, but that is not the case. A query using IN
with a subquery can always be rewritten to use EXISTS
. For example, the following statement:
1 2 3 4 5 |
SELECT a.col1 FROM a WHERE a.foo IN (SELECT b.col2 FROM b WHERE a.x = b.x); |
is equivalent to
1 2 3 4 5 6 |
SELECT a.col1 FROM a WHERE EXISTS (SELECT NULL FROM b WHERE a.x = b.x AND a.foo = b.col2); |
The PostgreSQL optimizer can do that and will process the subquery in IN
as efficiently as the one in EXISTS
.
However, the case with NOT IN
is quite different. You can rewrite NOT IN
to NOT EXISTS
similar to the above, but that is not a transformation that PostgreSQL can do automatically, because the rewritten statement is semantically different: If the subquery returns at least one NULL value, NOT IN
will never be TRUE. The NOT EXISTS
clause does not exhibit this surprising behavior.
Now people normally don't care about this property of NOT IN
(and in fact, too few people know about it). Most people would prefer the behavior of NOT EXISTS
anyway. But you have to rewrite the SQL statement yourself and cannot expect PostgreSQL to do it automatically. So my recommendation is that you never use NOT IN
with a subquery and always use NOT EXISTS
instead.
So far, I have told you how to rewrite an SQL statement to avoid forcing the optimizer to use a nested loop. Yet sometimes you need the exact opposite: you want the optimizer to use a nested loop join, because you happen to know that that is the best join strategy. Then you can deliberately rewrite a regular join to a lateral cross join to force a nested loop. For example, this query
1 2 3 |
SELECT a.col1, b.col2 FROM a JOIN b ON a.x = b.x; |
is semantically equivalent to
1 2 3 4 5 6 |
SELECT a.col1, sub.col2 FROM a CROSS JOIN LATERAL (SELECT b.col2 FROM b WHERE a.x = b.x) AS sub; |
If you want good performance with subqueries, it is often a good idea to follow these guidelines:
EXISTS
, NOT EXISTS
and IN
clausesNOT IN
to NOT EXISTS
Don't take these rules as iron laws. Sometimes a correlated subquery can actually perform better, and sometimes you can use a correlated subquery to force the optimizer to use a nested loop if you are certain that is the right strategy to use.
If you are interested in improving the performance of a query by rewriting it, you may want to read my article about forcing the join order in PostgreSQL.
In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Twitter, Facebook, or LinkedIn.
You need to load content from reCAPTCHA to submit the form. Please note that doing so will share data with third-party providers.
More InformationYou are currently viewing a placeholder content from Facebook. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More InformationYou are currently viewing a placeholder content from X. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More Information
Leave a Reply