While doing PostgreSQL consulting for a German client, I stumbled over an interesting issue this week that might be worth sharing with folks out on the Internet. It is all about grouping.
Suppose you are measuring the same thing several times on different sensors every, say, 15 minutes: maybe a temperature, an air pressure reading, or whatever. The data might look like this:
CREATE TABLE t_data (t time, val int);

COPY t_data FROM stdin;
14:00	12
14:01	22
14:01	43
14:14	32
14:15	33
14:16	27
14:30	19
\.
The human eye can instantly spot that 14:00 and 14:01 could be candidates for grouping (maybe the differences are just related to latency or some slightly inconsistent timing). The same applies to 14:14 to 14:16. You might want to have this data in the same group during aggregation.
The question now is: How can that be achieved with PostgreSQL?
The first thing to do is to check out the difference from one timestamp to the next:
SELECT *, lag(t, 1) OVER (ORDER BY t)
FROM t_data;
The lag function offers a nice way to solve this kind of problem. It returns the value of t from the previous row:
    t     | val |   lag
----------+-----+----------
 14:00:00 |  12 |
 14:01:00 |  22 | 14:00:00
 14:01:00 |  43 | 14:01:00
 14:14:00 |  32 | 14:01:00
 14:15:00 |  33 | 14:14:00
 14:16:00 |  27 | 14:15:00
 14:30:00 |  19 | 14:16:00
(7 rows)
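To look at the gap itself rather than at the previous timestamp, the subtraction can be done directly in the select list. This is only a small sketch on top of the t_data table from above; the column alias gap is made up for this example. Subtracting one time value from another yields an interval, and the first row has no predecessor, so its gap is NULL.

```sql
-- Gap between each measurement and the one before it.
-- "gap" is an illustrative alias, not part of the original queries.
SELECT *, t - lag(t, 1) OVER (ORDER BY t) AS gap
FROM t_data;
```

With the sample data, the gaps of 13 and 14 minutes at 14:14 and 14:30 stand out against the gaps of a minute or less everywhere else.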
Now that we have used lag to "move" the time to the next row, there is a simple trick which can be applied:
-- The query relies on a sequence, which has to exist beforehand.
CREATE SEQUENCE seq_a;

SELECT *, CASE WHEN t - lag < '10 minutes'
               THEN currval('seq_a')
               ELSE nextval('seq_a')
          END AS g
FROM (SELECT *, lag(t, 1) OVER (ORDER BY t)
      FROM t_data
     ) AS x;
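Moving the lag to a subselect allows us to refer to its result by name and to create those groups. The trick now is: if the difference from one line to the next is 10 minutes or more, start a new group; otherwise stay within the current group. The first row has no predecessor, so its lag is NULL, the comparison is not true, and the ELSE branch opens the first group.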
This leaves us with a simple result set:
    t     | val |   lag    | g
----------+-----+----------+---
 14:00:00 |  12 |          | 1
 14:01:00 |  22 | 14:00:00 | 1
 14:01:00 |  43 | 14:01:00 | 1
 14:14:00 |  32 | 14:01:00 | 2
 14:15:00 |  33 | 14:14:00 | 2
 14:16:00 |  27 | 14:15:00 | 2
 14:30:00 |  19 | 14:16:00 | 3
(7 rows)
From now on, life is easy. We can take this output and quickly aggregate on this data. "GROUP BY g" will give us nice groups for each value of "g".
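As a rough sketch of that last step, the grouping query can be wrapped in one more subselect and aggregated. It assumes that t_data and the sequence seq_a used by the grouping query already exist; the output column names (first_t, last_t, n, avg_val) are invented for illustration. It also relies on the rows reaching the CASE expression in t order. That works here in practice because the window function sorts its input, but SQL does not formally promise an evaluation order for volatile functions such as nextval.

```sql
-- Aggregate per group "g".
-- Assumes t_data and the sequence seq_a already exist.
-- first_t, last_t, n and avg_val are illustrative names.
SELECT g,
       min(t)   AS first_t,
       max(t)   AS last_t,
       count(*) AS n,
       avg(val) AS avg_val
FROM (
    SELECT *, CASE WHEN t - lag < '10 minutes'
                   THEN currval('seq_a')
                   ELSE nextval('seq_a')
              END AS g
    FROM (
        SELECT *, lag(t, 1) OVER (ORDER BY t)
        FROM t_data
    ) AS x
) AS y
GROUP BY g
ORDER BY g;
```

On the sample data this should return three groups containing 3, 3, and 1 rows. The actual values of g depend on the current state of the sequence.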
Here's a way to do that without sequences:
select x.*, sum(edge) over (order by t) as group_num
from (select *, case when (t - lag(t,1) over (order by t)) >= '10 minutes' then 1 else 0 end as edge
from t_data) x
order by t;
    t     | val | edge | group_num
----------+-----+------+-----------
 14:00:00 |  12 |    0 |         0
 14:01:00 |  22 |    0 |         0
 14:01:00 |  43 |    0 |         0
 14:14:00 |  32 |    1 |         1
 14:15:00 |  33 |    0 |         1
 14:16:00 |  27 |    0 |         1
 14:30:00 |  19 |    1 |         2
(7 rows)
This is a cool one as well. We used the sequence because you can run jobs continuously.
I like your query :). I've got to remember that one.
Ah, didn't realize you wanted the group_num to be unique across all queries, not just within this query.
In this business case, yes. But your query is pretty cool as well :). Maybe I just communicated badly.