For many people, “cross join” is something to be afraid of. They remember the time when they forgot the join condition and the DBA was angry, because the query hogged the CPU and filled the disk. However, there are valid use cases for cross joins which I want to explore in this article.
The term comes from relational algebra, which also calls the Cartesian product of two relations the cross product and denotes it by A × B. This is the most basic kind of join: it combines every row of one table with every row of another table.
A simple example:
Table A:

| name | birthday |
|---|---|
| John | 1989-08-24 |
| Paul | 2001-10-03 |
| Claude | 1996-04-18 |

Table B:

| street | city |
|---|---|
| Roman Road | Kexborough |
| Earl Street | Fitzwilliam |
The cross product of the tables would be:
| name | birthday | street | city |
|---|---|---|---|
| John | 1989-08-24 | Roman Road | Kexborough |
| Paul | 2001-10-03 | Roman Road | Kexborough |
| Claude | 1996-04-18 | Roman Road | Kexborough |
| John | 1989-08-24 | Earl Street | Fitzwilliam |
| Paul | 2001-10-03 | Earl Street | Fitzwilliam |
| Claude | 1996-04-18 | Earl Street | Fitzwilliam |
There are two ways to write the cross join of A and B in SQL.

A comma-separated list in the `FROM` clause:
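The original code block did not survive extraction; a minimal sketch, assuming the two example tables are named `a` and `b`:

```sql
SELECT * FROM a, b;
```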
With the explicit `CROSS JOIN` syntax:
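Again a sketch with the assumed table names `a` and `b`; this query returns exactly the same result as the previous one:

```sql
SELECT * FROM a CROSS JOIN b;
```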
Cross joins are the most basic joins, and you can think of an inner join as a cross join with an additional filter condition. Of course, PostgreSQL doesn't calculate inner joins that way. It uses more efficient join strategies.
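To illustrate that equivalence (with hypothetical tables `a` and `b` joined on an `id` column), the following two queries produce the same result; the first conceptually filters the cross product, the second is the inner-join spelling:

```sql
-- cross join with an additional filter condition
SELECT * FROM a CROSS JOIN b WHERE a.id = b.id;

-- equivalent inner join
SELECT * FROM a JOIN b ON a.id = b.id;
```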
If you write your joins as a comma-separated table list (like in the first example above), an inner join and a cross join look very similar; the only difference is the `WHERE` condition. It is a frequent mistake to forget a join condition when developing an SQL query. The result of such an omission is that you get far more result rows than you reckoned with: a cross join between two tables with a million rows each would result in a trillion rows!
Such a query will run for a very long time. If the query contains an `ORDER BY` clause, the database server has to cache the whole result set in order to sort it. Since this result set doesn't fit into memory, PostgreSQL will start writing temporary files to hold the data. These temporary files can fill up the disk. As soon as the query runs out of disk space, PostgreSQL rolls it back and deletes the temporary files. However, if the timing is bad, even a short out-of-disk condition can cause the database server to crash.
Never use the “comma separated list” syntax to write joins!
If you write your inner joins as `a JOIN b`, it is a syntax error to omit the join condition (`ON` or `USING`). A cross join then has to be written explicitly as `CROSS JOIN` and cannot happen by mistake.
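A quick sketch of the difference (table and column names are assumptions):

```sql
SELECT * FROM a JOIN b;                 -- ERROR: syntax error
SELECT * FROM a JOIN b ON a.id = b.id;  -- fine: the join condition is required
SELECT * FROM a CROSS JOIN b;           -- a cross join must be spelled out
```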
The above sounds pretty discouraging, but there are situations when a cross join is just what you need. In the following, I present two typical cases:
In a lateral join, a join relation (an expression or subquery) can refer to earlier entries in the `FROM` clause. This is very often used in combination with table functions: if you want to join a row with all the table function results that belong to it, you use a lateral join. In that case, `LATERAL` already implies that each row is only joined to the function results that belong to it, so there is no need for an extra join condition.
Here is an example (not recommended for your production database!):
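The original code block was lost in extraction; a plausible stand-in, assuming a table `mytab` with a `jsonb` column that stores arrays (storing arrays like this instead of normalizing is exactly why it is not recommended for production):

```sql
CREATE TABLE mytab (
   id   bigint PRIMARY KEY,
   data jsonb
);

INSERT INTO mytab VALUES
   (1, '{"key": ["five", "six"]}'),
   (2, '{"key": ["pick", "up", "sticks"]}');
```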
Then the arrays could be unpacked with a lateral expression like this:
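Again a reconstruction under the same assumptions (table `mytab`, column `data`, array under the JSON key `key`); note that no join condition is needed:

```sql
SELECT mytab.id, j.elem
FROM mytab
   CROSS JOIN LATERAL
      jsonb_array_elements_text(mytab.data -> 'key') AS j(elem);
```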
Sometimes you have a more complicated or expensive expression that you want to use in several places within one query. In that case, it can be a good idea to write a common table expression. You then use that CTE in the `FROM` clause, typically with a cross join:
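A sketch of the pattern (the function `expensive_function` and the table `tab` are hypothetical): since the CTE produces exactly one row, the cross join simply attaches its value to every row of the other table.

```sql
WITH val AS (
   SELECT expensive_function() AS v   -- hypothetical expensive computation
)
SELECT tab.*, val.v
FROM tab
   CROSS JOIN val;
```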
You don't need cross joins very often, but sometimes they do come in handy.
Avoid the “comma separated list” join syntax, so that you don't get cross joins by mistake. Such a statement can run forever and use up your database machine's resources.
In case you need help to run your database in the most efficient way possible, CYBERTEC offers 24/7 support services to customers around the world.
Back in the days I used to write reporting queries for a company with multiple brands. The requested format was often a csv with "day, brand, fact columns..." with one row per brand per day, even if a brand had no activity. I often found myself doing "FROM generate_series(...) CROSS JOIN brands", then LEFT JOINing various things onto that, and wrapping fact columns in COALESCE(... ,0).
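That pattern might look roughly like this (the `brands` and `sales` tables and their columns are invented for illustration):

```sql
SELECT d.day::date, b.brand,
       coalesce(sum(s.amount), 0) AS revenue
FROM generate_series('2024-01-01'::date,
                     '2024-01-31'::date,
                     '1 day') AS d(day)
   CROSS JOIN brands AS b
   LEFT JOIN sales AS s
      ON s.day = d.day::date AND s.brand_id = b.id
GROUP BY d.day::date, b.brand
ORDER BY d.day::date, b.brand;
```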
A similar trick that often comes in handy for such reporting queries is to cross join a hardcoded set of values to use as one of the leftmost columns: `CROSS JOIN (VALUES ('a'),('b'),('c')) foo(bar)`
Another interesting thing you can do is create an index column from the array using `WITH ORDINALITY`; you then have an index key you can use:

```sql
SELECT mytab.id, j.elem, j.ordinality - 1 AS index
FROM mytab
   CROSS JOIN LATERAL
      jsonb_array_elements_text(mytab.data -> 'key')
      WITH ORDINALITY AS j(elem);
```