Table of Contents
(Updated 2023-02-24) There are three join strategies in PostgreSQL that work quite differently. If PostgreSQL chooses the wrong strategy, query performance can suffer a lot. This article explains the join strategies, how you can support them with indexes, what can go wrong with them and how you can tune your joins for better performance.
A join combines data from two relations. Such a relation can be a table (also called base relation) or the result of any plan node. For example, in a join like
1 2 3 4 |
SELECT ... FROM a JOIN (b LEFT JOIN c ON ...) ON ... |
the base relation a
will be joined to the result of the join of b
and c
.
A relation can also be the result of an index scan. So when I say below that “PostgreSQL scans the relation sequentially”, I don't mean that there has to be a sequential scan on a table. It is just that PostgreSQL reads “the thing that is joined” in its entirety.
The execution plan for any join looks like this:
1 2 3 4 5 6 7 8 9 10 11 |
EXPLAIN (COSTS OFF) SELECT * FROM a JOIN b USING (id); QUERY PLAN ---------------------------- Hash Join Hash Cond: (a.id = b.id) -> Seq Scan on a -> Hash -> Seq Scan on b (5 rows) |
We call the upper of the joined relations (in this case the sequential scan on a
) the outer relation of the join, and we call the lower relation (the hash computed from b
) the inner relation.
A Cartesian product or cross join of two relations is what you get if you combine each row from one relation with each row of the other. The join condition is a filter that excludes some of these combinations. There are several ways to write a join condition, but all can be transformed to
1 |
a <join type> JOIN b ON <join condition> |
If the join condition is of the form
1 |
a.col1 <operator> b.col2 [AND ...] |
I call “a.col1
” and “b.col2
” join keys.
Note that for inner joins there is no distinction between the join condition and the WHERE
condition, but that doesn't hold for outer joins.
This is the simplest and most general join strategy of all.
PostgreSQL scans the outer relation sequentially, and for each result row it scans the inner relation for matching rows.
Since we scan the outer relation sequentially, no index on the outer relation will help. But an index on the join key of the inner relation can speed up a nested loop join considerably.
Nested loop joins are particularly efficient if the outer relation is small, because then the inner loop won't be executed too often. It is the typical join strategy used in OLTP workloads with a normalized data model, where it is highly efficient. If the outer relation is large, nested loop joins are usually very inefficient, even if they are supported by an index on the inner relation.
Apart from that, it is the only join strategy that can be used if no join condition uses the =
operator. So it also serves as a fall-back strategy if no other strategy can be used.
First, PostgreSQL scans the inner relation sequentially and builds a hash table, where the hash key consists of all join keys that use the =
operator. Then it scans the outer relation sequentially and probes the hash for each row found to find matching join keys.
This is somewhat similar to a nested loop join. Building the hash table is an extra start-up effort, but probing the hash is much faster than scanning the inner relation.
Since we scan both relations sequentially, an index on the join condition will not help with a hash join.
Hash joins are best if none of the involved relations are small, but the hash table for the smaller table fits in work_mem
. This is because otherwise PostgreSQL would build the hash in several batches and store them in temporary disk files, which hurts performance. In such cases the optimizer usually chooses a different join strategy like a merge join.
Looking up values in a hash table only works if the operator in the join condition is =
, so you need at least one join condition with that operator.
In a merge join, PostgreSQL picks all join conditions with the =
operator. It then sorts both tables by the join keys (which means that the data types must be sortable). Then it iterates through both sorted lists and finds matching entries.
An index on the sort keys can speed up sorting, so an index on the join keys on both relations can speed up a merge join. However, an explicit sort is often cheaper unless an index only scan can be used.
The optimizer usually chooses a merge join if the involved relations are both too big for a hash that fits into work_mem
. So this is the best strategy for joining really large tables.
Like a hash join, a merge join is only feasible if there is at least one join condition with the =
operator.
Nested Loop Join | Hash Join | Merge Join | |
---|---|---|---|
Algorithm | For each outer relation row, scan the inner relation | Build a hash from the inner relation, scan the outer relation, probe the hash | Sort both relations and merge rows |
Indexes that help | Index on the join keys of the inner relation | None | Indexes on the join keys of both relations |
Good strategy if | the outer table is small | the hash table fits into work_mem |
both tables are large |
Choosing the wrong join strategy leads to bad performance:
In both cases, a bad row count estimate is the cause of the problem. So while the join may be where we spend most of the execution time, the cause is a misestimate that happened earlier on.
Find out what the best join strategy is (perhaps PostgreSQL is doing the right thing anyway). You can disable different join strategies temporarily with the SET
command, which changes a parameter in your current database session:
1 2 3 |
SET enable_hashjoin = off; SET enable_mergejoin = off; SET enable_nestloop = off; |
Note that you cannot really disable nested loop joins, only discourage PostgreSQL from using them. If there is no join condition with an =
operator, a nested loop join is the only way.
Tuning queries is often not a simple, straightforward task. However, here are some guidelines and ideas:
ANALYZE
the table, perhaps with increased default_statistics_target
, and see if that makes a difference. Try rewriting the query with simpler WHERE
conditions to makes the optimizer's task easier.work_mem
and see if you get a cheaper hash join.random_page_cost
, effective_cache_size
and effective_io_concurrency
. That will allow it to price an index scan correctly.INCLUDE
clause) and make sure that the table is vacuumed often enough.Understanding the 3 join strategies - nested loop join, hash join, and merge join - is crucial for anyone who wants to understand execution plans and tune queries. The art of query tuning cannot be conveyed in a single article, but I hope I could collect some relevant information here.
If you want to read more about tuning queries with joins, read some of our other articles on the topic, like Joining 1 million tables or Speeding up GROUP BY and joins.
Please leave your comments below. In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Twitter, Facebook, or LinkedIn.
+43 (0) 2622 93022-0
office@cybertec.at
You are currently viewing a placeholder content from Facebook. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More InformationYou are currently viewing a placeholder content from X. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More Information
Perfect article. Thanks a lot!
Nice Article. Fan of Cybertec.
muy bueno!
Hi Laurenz,
Are you sure that the outer relation is always scanned sequentially with nested loop join?
Are you sure that both relations are always scanned sequentially with hash join?
I think that an index can be used for both relations and both strategies.
Here is an example :
explain (costs off) select a,foo.b from foo join bar using(a) where foo.b = 'c81e728d9d4c2f636f067f89cc14862c';
QUERY PLAN
--------------------------------------------------------------------------
Hash Join
Hash Cond: (foo.a = bar.a)
-> Bitmap Heap Scan on foo
Recheck Cond: (b = 'c81e728d9d4c2f636f067f89cc14862c'::text)
-> Bitmap Index Scan on foo_b_idx
Index Cond: (b = 'c81e728d9d4c2f636f067f89cc14862c'::text)
-> Hash
-> Index Only Scan using bar_a_idx on bar
But I might have misunderstood what you wrote. If not, I can paste the full test case somewhere.
Best regards,
Frédéric
With "outer relation" In mean "the thing that is on the outer side of the join". That relation does not necessarily have to be a "base relation" (a table); in your case it is a Bitmap Heap Scan. That "relation" has to be read in its entirety from front to back.
I am aware that this use of the word "relation" is somewhat specific to PostgreSQL optimizer jargon and can lead to misunderstanding. I have reworked the definition of "relation" at the beginning of the article to clarify that "read the relation sequentially" does not have to mean a sequential scan on a table.
OK I got it, thank you very much!