These days everybody is talking about time series, time series analysis, and the like for performance tuning. Analyzing time series data in PostgreSQL can provide valuable insights, help in making informed decisions, and deepen our understanding of the data. By utilizing PostgreSQL’s powerful features, we can efficiently query all types of measurement data to track trends, patterns, and anomalies over time. However, there is often one tiny requirement people struggle with. Consider the following data:
```sql
CREATE TABLE t_sample (
    tstamp       timestamptz,
    sensor_id    int,
    measurement  numeric
);
CREATE TABLE

INSERT INTO t_sample
SELECT  x, y, random() * 10000
FROM    generate_series('2025-01-01', '2025-01-06',
                        '1 second'::interval) AS x,
        generate_series(1, 100) AS y
ORDER BY random();
INSERT 0 43200100
```
What we see here is a classic, simple table containing time series data in PostgreSQL. For a couple of days we store one measurement per second for each of 100 sensors. Overall this provides us with roughly 43 million rows of data.
Here is one of the most typical questions somebody might have given this data set: What is the highest value ever measured for each sensor?
In this post we will discuss how this can be done in the most efficient way possible.
This is most likely the most common use case we have seen: for each sensor in our series we want to see the highest value in the data set. Sounds easy? Well, it is … Let us take a look at the query:
```sql
test=# explain analyze SELECT sensor_id, max(measurement)
       FROM   t_sample
       GROUP BY 1
       ORDER BY 1;
                              QUERY PLAN
--------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=546553.97..546579.31 rows=100 width=36)
         (actual time=3620.266..3621.127 rows=100 loops=1)
   Group Key: sensor_id
   ->  Gather Merge  (cost=546553.97..546577.31 rows=200 width=36)
           (actual time=3620.260..3621.079 rows=300 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Sort  (cost=545553.95..545554.20 rows=100 width=36)
                 (actual time=3616.385..3616.388 rows=100 loops=3)
               Sort Key: sensor_id
               Sort Method: quicksort  Memory: 28kB
               Worker 0:  Sort Method: quicksort  Memory: 28kB
               Worker 1:  Sort Method: quicksort  Memory: 28kB
               ->  Partial HashAggregate
                       (cost=545549.63..545550.63 rows=100 width=36)
                       (actual time=3616.322..3616.328 rows=100 loops=3)
                     Group Key: sensor_id
                     Batches: 1  Memory Usage: 32kB
                     Worker 0:  Batches: 1  Memory Usage: 32kB
                     Worker 1:  Batches: 1  Memory Usage: 32kB
                     ->  Parallel Seq Scan on t_sample
                             (cost=0.00..455549.42 rows=18000042 width=15)
                             (actual time=0.545..1873.428 rows=14400033 loops=3)
 Planning Time: 0.154 ms
 Execution Time: 3621.199 ms
(18 rows)
```
What happens here is that PostgreSQL will read ALL the data and run the aggregation. To speed things up, we see a parallel query using multiple CPU cores at a time. However, the core problem remains: we have to read all the data. Of course that leads to long runtimes, which keep growing the more data we have.
Let us try to deploy an index and see if it changes anything:
```sql
test=# CREATE INDEX ON t_sample (sensor_id, measurement);
CREATE INDEX
test=# \d t_sample
                        Table "public.t_sample"
   Column    |           Type           | Collation | Nullable | Default
-------------+--------------------------+-----------+----------+---------
 tstamp      | timestamp with time zone |           |          |
 sensor_id   | integer                  |           |          |
 measurement | numeric                  |           |          |
Indexes:
    "t_sample_sensor_id_measurement_idx" btree (sensor_id, measurement)
```
Running the query again will not lead to a better execution plan. The way to address this problem is to simulate something that is generally known as a “skip scan”. What does that mean? For each distinct sensor_id we find the maximum separately: the system can find the maximum value for a single sensor quite quickly.
```sql
test=# explain analyze SELECT max(measurement)
       FROM   t_sample
       WHERE  sensor_id = 10;
                              QUERY PLAN
-----------------------------------------------------------------------------
 Result  (cost=0.60..0.61 rows=1 width=32)
         (actual time=0.111..0.112 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Limit  (cost=0.56..0.60 rows=1 width=11)
             (actual time=0.106..0.107 rows=1 loops=1)
           ->  Index Only Scan Backward
                   using t_sample_sensor_id_measurement_idx on t_sample
                   (cost=0.56..13321.39 rows=375841 width=11)
                   (actual time=0.105..0.105 rows=1 loops=1)
                 Index Cond: ((sensor_id = 10) AND (measurement IS NOT NULL))
                 Heap Fetches: 0
 Planning Time: 0.269 ms
 Execution Time: 0.143 ms
(8 rows)
```
Wow, we can find a single value in a fraction of a millisecond. Unfortunately we have to apply some trickery to do this for more than just a single sensor. There are two ways to emulate a skip scan: a LATERAL join, or a recursive CTE.
The following example shows how we can use a lateral join to solve the problem:
```sql
test=# explain analyze SELECT *
       FROM   generate_series(1, 100) AS x,
              LATERAL (SELECT max(measurement)
                       FROM   t_sample
                       WHERE  sensor_id = x);
                              QUERY PLAN
---------------------------------------------------------------------------
 Nested Loop  (cost=0.60..63.05 rows=100 width=36)
         (actual time=2.656..5.002 rows=100 loops=1)
   ->  Function Scan on generate_series x
           (cost=0.00..1.00 rows=100 width=4)
           (actual time=2.520..2.532 rows=100 loops=1)
   ->  Result  (cost=0.60..0.61 rows=1 width=32)
           (actual time=0.024..0.024 rows=1 loops=100)
         InitPlan 1 (returns $1)
           ->  Limit  (cost=0.56..0.60 rows=1 width=11)
                   (actual time=0.023..0.023 rows=1 loops=100)
                 ->  Index Only Scan Backward
                         using t_sample_sensor_id_measurement_idx on t_sample
                         (cost=0.56..15312.59 rows=432001 width=11)
                         (actual time=0.023..0.023 rows=1 loops=100)
                       Index Cond: ((sensor_id = x.x) AND (measurement IS NOT NULL))
                       Heap Fetches: 0
 Planning Time: 0.481 ms
 Execution Time: 5.124 ms
(10 rows)
```
What is a LATERAL? Let us step back a bit and see SQL from a more philosophical point of view:
```sql
SELECT whatever FROM foo;
```
In code this means:
```
foreach x in foo
loop
    do_whatever
end loop;
```
One can see a FROM clause as a kind of loop. Now, if we want to find the max for a list of values, what we really need is a loop around that loop. That is exactly what LATERAL does for us: for each value returned by the generate_series call (which generates the list of sensors), we run the LATERAL part of the query. The drawback is that generating the list of sensors makes the query a bit too “static”, so it is better to fetch the sensor list from a second relation (normalization).
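As a sketch of that idea, the sensor list could come from a separate catalog table instead of generate_series. The table t_sensor and its column id are assumptions made up for illustration, not part of the example schema above:

```sql
-- Hypothetical catalog table listing all known sensors:
--   CREATE TABLE t_sensor (id int PRIMARY KEY, name text);
-- For each sensor in the catalog, run one fast index lookup.
SELECT s.id AS sensor_id, m.max_measurement
FROM   t_sensor AS s,
       LATERAL (SELECT max(measurement) AS max_measurement
                FROM   t_sample
                WHERE  t_sample.sensor_id = s.id) AS m;
```

This way, adding or removing a sensor only requires changing the catalog table; the query itself stays unchanged.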
However, we can also achieve the same thing using a simple recursion:
```sql
test=# explain analyze WITH RECURSIVE x AS
(
    SELECT min(sensor_id) AS s
    FROM   t_sample
    UNION ALL
    SELECT (SELECT min(sensor_id)
            FROM   t_sample
            WHERE  sensor_id > x.s)
    FROM   x
    WHERE  x.s IS NOT NULL
)
SELECT s AS sensor_id,
       (SELECT max(measurement)
        FROM   t_sample
        WHERE  t_sample.sensor_id = x.s) AS measurement
FROM   x
WHERE  s IS NOT NULL;
                              QUERY PLAN
-----------------------------------------------------------------
 CTE Scan on x  (cost=64.66..127.73 rows=100 width=36)
         (actual time=0.103..2.410 rows=100 loops=1)
   Filter: (s IS NOT NULL)
   Rows Removed by Filter: 1
   CTE x
     ->  Recursive Union  (cost=0.60..64.66 rows=101 width=4)
             (actual time=0.060..0.838 rows=101 loops=1)
           ->  Result  (cost=0.60..0.61 rows=1 width=4)
                   (actual time=0.058..0.059 rows=1 loops=1)
                 InitPlan 3 (returns $1)
                   ->  Limit  (cost=0.56..0.60 rows=1 width=4)
                           (actual time=0.051..0.052 rows=1 loops=1)
                         ->  Index Only Scan
                                 using t_sample_sensor_id_measurement_idx
                                 on t_sample t_sample_1
                                 (cost=0.56..1423086.31 rows=43200100 width=4)
                                 (actual time=0.049..0.050 rows=1 loops=1)
                               Index Cond: (sensor_id IS NOT NULL)
                               Heap Fetches: 0
           ->  WorkTable Scan on x x_1  (cost=0.00..6.30 rows=10 width=4)
                   (actual time=0.007..0.007 rows=1 loops=101)
                 Filter: (s IS NOT NULL)
                 Rows Removed by Filter: 0
                 SubPlan 2
                   ->  Result  (cost=0.60..0.61 rows=1 width=4)
                           (actual time=0.007..0.007 rows=1 loops=100)
                         InitPlan 1 (returns $3)
                           ->  Limit  (cost=0.56..0.60 rows=1 width=4)
                                   (actual time=0.006..0.006 rows=1 loops=100)
                                 ->  Index Only Scan
                                         using t_sample_sensor_id_measurement_idx
                                         on t_sample
                                         (cost=0.56..510365.23 rows=14400033 width=4)
                                         (actual time=0.006..0.006 rows=1 loops=100)
                                       Index Cond: ((sensor_id IS NOT NULL)
                                                    AND (sensor_id > x_1.s))
                                       Heap Fetches: 0
   SubPlan 6
     ->  Result  (cost=0.60..0.61 rows=1 width=32)
             (actual time=0.015..0.015 rows=1 loops=100)
           InitPlan 5 (returns $6)
             ->  Limit  (cost=0.56..0.60 rows=1 width=11)
                     (actual time=0.015..0.015 rows=1 loops=100)
                   ->  Index Only Scan Backward
                           using t_sample_sensor_id_measurement_idx
                           on t_sample t_sample_2
                           (cost=0.56..15312.59 rows=432001 width=11)
                           (actual time=0.014..0.014 rows=1 loops=100)
                         Index Cond: ((sensor_id = x.s) AND (measurement IS NOT NULL))
                         Heap Fetches: 0
 Planning Time: 0.446 ms
 Execution Time: 2.495 ms
(30 rows)
```
The core idea behind the query is as follows: we find the smallest sensor_id in the table, which is really fast. Then we look for the next value larger than the one we found before. We keep doing this in the recursion until we have quickly compiled a unique list of sensor_ids. This recursion is indeed the fastest way to find a list of DISTINCT sensor_ids without having to read the entire table; essentially, this is what it means to simulate a “skip scan” as mentioned before. Finally we use this list of unique sensor IDs (“x”) to find the maximum value for each of them, which is of course a fast index lookup.
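Stripped down to its core, the same recursion can stand on its own as a drop-in for SELECT DISTINCT sensor_id. The following is a minimal sketch of the bare skip scan without the max() lookup, using only the table and index defined above:

```sql
-- Emulates SELECT DISTINCT sensor_id FROM t_sample:
-- each recursive step fetches only the next-larger sensor_id
-- via the (sensor_id, measurement) index instead of scanning the table.
WITH RECURSIVE d AS
(
    SELECT min(sensor_id) AS s
    FROM   t_sample
    UNION ALL
    SELECT (SELECT min(sensor_id)
            FROM   t_sample
            WHERE  sensor_id > d.s)
    FROM   d
    WHERE  d.s IS NOT NULL
)
SELECT s AS sensor_id
FROM   d
WHERE  s IS NOT NULL;
```

This building block is handy whenever a column has few distinct values relative to the table size, not just for the max() use case discussed here.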
If you want to know more about time series in general you might want to check out my post about simple pattern matching and string encoding in PostgreSQL which is available on our website as well.
In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Twitter, Facebook, or LinkedIn.