UPDATED AUGUST 2023: After my previous post, I received mails asking about what was meant by using “generate_series
” in an outer join.
Table of Contents
Let me try to address this issue with an example: Many people use PostgreSQL to aggregate data. In this example, we assume that we want to aggregate data on an hourly basis.
Here's some sample data:
1 2 |
test=# CREATE TABLE t_data (t timestamp, payload text); CREATE TABLE |
We create a table with just two fields:
1 2 3 4 5 6 7 8 9 10 11 |
test=# INSERT INTO t_data VALUES ('2012-04-04 03:12', 'data 1'); INSERT 0 1 test=# INSERT INTO t_data VALUES ('2012-04-04 04:16', 'data 2'); INSERT 0 1 test=# INSERT INTO t_data VALUES ('2012-04-04 04:28', 'data 3'); INSERT 0 1 test=# INSERT INTO t_data VALUES ('2012-04-04 06:45', 'data 4'); INSERT 0 1 |
We have data for 3am, 4am and 6am ... we've intentionally left 5am out. If we want to start to aggregate the data to see how many rows there are per hour, we can use the following query:
1 2 3 4 5 6 7 8 9 10 |
test=# SELECT date_trunc('hour', t), count(payload) FROM t_data GROUP BY 1 ORDER BY 1; date_trunc | count ---------------------+------- 2012-04-04 03:00:00 | 1 2012-04-04 04:00:00 | 2 2012-04-04 06:00:00 | 1 (3 rows) |
A simple GROUP BY
will reveal that there were no entries for the 5am group. However, the zero is not displayed here, because GROUP BY
will only return data for valid groups.
In order to create a list of times we want to see in our final report, we can execute generate_series
. Let's assume we want to have a complete table from midnight to 6am:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
test=# SELECT * FROM generate_series( '2012-04-04 00:00'::timestamp, '2012-04-04 06:00'::timestamp, '1 hour') AS x; x --------------------- 2012-04-04 00:00:00 2012-04-04 01:00:00 2012-04-04 02:00:00 2012-04-04 03:00:00 2012-04-04 04:00:00 2012-04-04 05:00:00 2012-04-04 06:00:00 (7 rows) |
This will return a complete list – exactly what we need to display in the final results. Remember, we also want to display hours with no rows.
Let's now join things together using a standard PostgreSQL outer join:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
test=# SELECT * FROM generate_series('2012-04-04 00:00'::timestamp, '2012-04-04 06:00'::timestamp, '1 hour') AS x LEFT JOIN t_data ON (date_trunc('hour', t) = x); x | t | payload ---------------------+---------------------+--------- 2012-04-04 00:00:00 | | 2012-04-04 01:00:00 | | 2012-04-04 02:00:00 | | 2012-04-04 03:00:00 | 2012-04-04 03:12:00 | data 1 2012-04-04 04:00:00 | 2012-04-04 04:16:00 | data 2 2012-04-04 04:00:00 | 2012-04-04 04:28:00 | data 3 2012-04-04 05:00:00 | | 2012-04-04 06:00:00 | 2012-04-04 06:45:00 | data 4 (8 rows) |
We'll run date_trunc
to make sure that the precision of our dates is actually in hours. Then we can use the date column as the join key. Note that the non-existing fields on the “data” side of the join will stay empty.
Once we have joined the data we have to aggregate to fetch the final result. When doing this aggregation, watch out for a common mistake:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
test=# SELECT x, count(*) FROM generate_series('2012-04-04 00:00'::timestamp, '2012-04-04 06:00'::timestamp, '1 hour') AS x LEFT JOIN t_data ON (date_trunc('hour', t) = x) GROUP BY 1; x | count ---------------------+------- 2012-04-04 00:00:00 | 1 2012-04-04 01:00:00 | 1 2012-04-04 02:00:00 | 1 2012-04-04 03:00:00 | 1 2012-04-04 04:00:00 | 2 2012-04-04 05:00:00 | 1 2012-04-04 06:00:00 | 1 (7 rows) |
Never EVER use count(*)
along with an outer join.
The problem is, count(*)
will count ROWS – we don't want to count rows because we're not allowed to count NULL
values. As you can see, the result is just plain wrong.
To correct the problem, I advise you to count non-NULL
values. In SQL, this means that you have to use count(column)
rather than count(*)
. count(column)
will only count values which are not NULL
. That's exactly what we want:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
test=# SELECT x, count(payload) FROM generate_series('2012-04-04 00:00'::timestamp, '2012-04-04 06:00'::timestamp, '1 hour') AS x LEFT JOIN t_data ON (date_trunc('hour', t) = x) GROUP BY 1; x | count ---------------------+------- 2012-04-04 00:00:00 | 0 2012-04-04 01:00:00 | 0 2012-04-04 02:00:00 | 0 2012-04-04 03:00:00 | 1 2012-04-04 04:00:00 | 2 2012-04-04 05:00:00 | 0 2012-04-04 06:00:00 | 1 (7 rows) |
Voilà, the result is exactly as desired.
In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Facebook or LinkedIn.
You need to load content from reCAPTCHA to submit the form. Please note that doing so will share data with third-party providers.
More InformationYou are currently viewing a placeholder content from Facebook. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More InformationYou are currently viewing a placeholder content from X. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More Information