CYBERTEC PostgreSQL Logo

Join strategies and performance in PostgreSQL

06.2020 / Category: / Tags: | |
join strategies misunderstood
© Laurenz Albe 2020

 

(Updated 2023-02-24) There are three join strategies in PostgreSQL that work quite differently. If PostgreSQL chooses the wrong strategy, query performance can suffer a lot. This article explains the join strategies, how you can support them with indexes, what can go wrong with them and how you can tune your joins for better performance.

Terminology of Join Strategies

Relation

A join combines data from two relations. Such a relation can be a table (also called base relation) or the result of any plan node. For example, in a join like

the base relation a will be joined to the result of the join of b and c.

A relation can also be the result of an index scan. So when I say below that “PostgreSQL scans the relation sequentially”, I don't mean that there has to be a sequential scan on a table. It is just that PostgreSQL reads “the thing that is joined” in its entirety.

Inner and outer relation

The execution plan for any join looks like this:

We call the upper of the joined relations (in this case the sequential scan on a) the outer relation of the join, and we call the lower relation (the hash computed from b) the inner relation.

Join condition and join key

A Cartesian product or cross join of two relations is what you get if you combine each row from one relation with each row of the other. The join condition is a filter that excludes some of these combinations. There are several ways to write a join condition, but all can be transformed to

If the join condition is of the form

I call “a.col1” and “b.col2join keys.

Note that for inner joins there is no distinction between the join condition and the WHERE condition, but that doesn't hold for outer joins.

Nested loop join strategy

This is the simplest and most general join strategy of all.

PostgreSQL scans the outer relation sequentially, and for each result row it scans the inner relation for matching rows.

Indexes that can help with nested loop joins

Since we scan the outer relation sequentially, no index on the outer relation will help. But an index on the join key of the inner relation can speed up a nested loop join considerably.

Use cases for the nested loop join strategy

Nested loop joins are particularly efficient if the outer relation is small, because then the inner loop won't be executed too often. It is the typical join strategy used in OLTP workloads with a normalized data model, where it is highly efficient. If the outer relation is large, nested loop joins are usually very inefficient, even if they are supported by an index on the inner relation.

Apart from that, it is the only join strategy that can be used if no join condition uses the = operator. So it also serves as a fall-back strategy if no other strategy can be used.

Hash join strategy

First, PostgreSQL scans the inner relation sequentially and builds a hash table, where the hash key consists of all join keys that use the = operator. Then it scans the outer relation sequentially and probes the hash for each row found to find matching join keys.

This is somewhat similar to a nested loop join. Building the hash table is an extra start-up effort, but probing the hash is much faster than scanning the inner relation.

Indexes that can help with hash joins

Since we scan both relations sequentially, an index on the join condition will not help with a hash join.

Use cases for the hash join strategy

Hash joins are best if none of the involved relations are small, but the hash table for the smaller table fits in work_mem. This is because otherwise PostgreSQL would build the hash in several batches and store them in temporary disk files, which hurts performance. In such cases the optimizer usually chooses a different join strategy like a merge join.

Looking up values in a hash table only works if the operator in the join condition is =, so you need at least one join condition with that operator.

Merge join strategy

In a merge join, PostgreSQL picks all join conditions with the = operator. It then sorts both tables by the join keys (which means that the data types must be sortable). Then it iterates through both sorted lists and finds matching entries.

Indexes that help with a merge join

An index on the sort keys can speed up sorting, so an index on the join keys on both relations can speed up a merge join. However, an explicit sort is often cheaper unless an index only scan can be used.

Use cases for the merge join strategy

The optimizer usually chooses a merge join if the involved relations are both too big for a hash that fits into work_mem. So this is the best strategy for joining really large tables.

Like a hash join, a merge join is only feasible if there is at least one join condition with the = operator.

Summary table for PostgreSQL join strategies

Nested Loop Join Hash Join Merge Join
Algorithm For each outer relation row, scan the inner relation Build a hash from the inner relation, scan the outer relation, probe the hash Sort both relations and merge rows
Indexes that help Index on the join keys of the inner relation None Indexes on the join keys of both relations
Good strategy if the outer table is small the hash table fits into work_mem both tables are large

Impact on query performance

Choosing the wrong join strategy leads to bad performance:

  • If the optimizer underestimates a row count, it may choose a nested loop join by mistake. Then it scans the inner relation more often than it bargained for, which leads to bad performance.
  • If the optimizer overestimates a row count, it may choose a hash or merge join by mistake. Then it has to scan both relations completely, which can perform much worse than a nested loop join with an index on the inner relation.

In both cases, a bad row count estimate is the cause of the problem. So while the join may be where we spend most of the execution time, the cause is a misestimate that happened earlier on.

How to make PostgreSQL choose the correct join strategy

Find out what the best join strategy is (perhaps PostgreSQL is doing the right thing anyway). You can disable different join strategies temporarily with the SET command, which changes a parameter in your current database session:

Note that you cannot really disable nested loop joins, only discourage PostgreSQL from using them. If there is no join condition with an = operator, a nested loop join is the only way.

Tuning queries is often not a simple, straightforward task. However, here are some guidelines and ideas:

  • If the bad join strategy is chosen because of a misestimate, try to improve that estimate. ANALYZE the table, perhaps with increased default_statistics_target, and see if that makes a difference. Try rewriting the query with simpler WHERE conditions to makes the optimizer's task easier.
  • Try increasing work_mem and see if you get a cheaper hash join.
  • Configure the parameters that tell PostgreSQL about your hardware and resources: random_page_cost, effective_cache_size and effective_io_concurrency. That will allow it to price an index scan correctly.
  • You can speed up nested loop and merge joins with index-only scans. For that you must add all required columns to the index (ideally with the INCLUDE clause) and make sure that the table is vacuumed often enough.

Conclusion - join strategies and query tuning

Understanding the 3 join strategies - nested loop join, hash join, and merge join - is crucial for anyone who wants to understand execution plans and tune queries. The art of query tuning cannot be conveyed in a single article, but I hope I could collect some relevant information here.

If you want to read more about tuning queries with joins, read some of our other articles on the topic, like Joining 1 million tables or Speeding up GROUP BY and joins.


Please leave your comments below. In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on TwitterFacebook, or LinkedIn.

6 responses to “Join strategies and performance in PostgreSQL”

  1. Hi Laurenz,

    Are you sure that the outer relation is always scanned sequentially with nested loop join?

    Are you sure that both relations are always scanned sequentially with hash join?

    I think that an index can be used for both relations and both strategies.

    Here is an example :

    explain (costs off) select a,foo.b from foo join bar using(a) where foo.b = 'c81e728d9d4c2f636f067f89cc14862c';
    QUERY PLAN
    --------------------------------------------------------------------------
    Hash Join
    Hash Cond: (foo.a = bar.a)
    -> Bitmap Heap Scan on foo
    Recheck Cond: (b = 'c81e728d9d4c2f636f067f89cc14862c'::text)
    -> Bitmap Index Scan on foo_b_idx
    Index Cond: (b = 'c81e728d9d4c2f636f067f89cc14862c'::text)
    -> Hash
    -> Index Only Scan using bar_a_idx on bar

    But I might have misunderstood what you wrote. If not, I can paste the full test case somewhere.

    Best regards,
    Frédéric

    • With "outer relation" In mean "the thing that is on the outer side of the join". That relation does not necessarily have to be a "base relation" (a table); in your case it is a Bitmap Heap Scan. That "relation" has to be read in its entirety from front to back.

      I am aware that this use of the word "relation" is somewhat specific to PostgreSQL optimizer jargon and can lead to misunderstanding. I have reworked the definition of "relation" at the beginning of the article to clarify that "read the relation sequentially" does not have to mean a sequential scan on a table.

Leave a Reply

Your email address will not be published. Required fields are marked *

CYBERTEC Logo white
Get the newest PostgreSQL Info & Tools


    This site is protected by reCAPTCHA and the Google Privacy Policy & Terms of Service apply.

    ©
    2024
    CYBERTEC PostgreSQL International GmbH
    phone-handsetmagnifiercrosscross-circle
    linkedin facebook pinterest youtube rss twitter instagram facebook-blank rss-blank linkedin-blank pinterest youtube twitter instagram