Analyzing time series often comes with challenges. This is especially true if you want to do a bit more than just run simple counts on values in a column. More often than not, you'll want to create a monthly report, a weekly overview and so on. This requires you to group by expressions rather than by plain column values. What is not widely known is that this can cause issues with the optimizer - issues which are best addressed using CREATE STATISTICS.
This blog post will show you how to address wrong estimates and how to detect such issues in the first place.
To demonstrate what we are talking about in this blog, I've created a simple data set:
test=# CREATE TABLE t_timeseries (
           t    timestamptz,
           val  numeric DEFAULT random()
       );
CREATE TABLE
Basically, all you need is a time series table containing two columns. You can easily fill it with some data:
test=# INSERT INTO t_timeseries
       SELECT * FROM generate_series('2022-01-01', '2024-12-31', '1 second'::interval);
INSERT 0 94608001
All that was done here was to insert roughly 94 million rows - one row per second for a period of three years. Finally, you can create optimizer statistics for this data:
test=# ANALYZE;
ANALYZE
What you want to do now is to count how many entries we can expect per day. For that you can use the date_trunc function, which truncates a precise timestamp to the day, week, month and so on. In this case we have chosen to reduce the timestamps to full days.
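In case you have never used date_trunc before, here is a minimal example of what it does (shown with the TimeZone setting at UTC - the exact output depends on your session's time zone):

test=# SELECT date_trunc('day', TIMESTAMPTZ '2024-05-17 14:31:07+00');
       date_trunc
------------------------
 2024-05-17 00:00:00+00
(1 row)

Every timestamp within the same calendar day maps to the same value, which is exactly what we want to group on. Applying this to our table and grouping by the truncated value yields the per-day counts: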
test=# explain analyze SELECT date_trunc('day', t), count(*)
       FROM t_timeseries
       GROUP BY 1;
                                      QUERY PLAN
-------------------------------------------------------------------------------------------
 HashAggregate  (cost=11955563.80..14616414.92 rows=94608040 width=16) (actual time=28929.936..28930.176 rows=1096 loops=1)
   Group Key: date_trunc('day'::text, t)
   Planned Partitions: 256  Batches: 1  Memory Usage: 1681kB
   ->  Seq Scan on t_timeseries  (cost=0.00..1785199.50 rows=94608040 width=8) (actual time=0.030..19006.497 rows=94608001 loops=1)
 Planning Time: 0.187 ms
 Execution Time: 28930.333 ms
(6 rows)
However, this leads to a problem. Let's inspect the output of EXPLAIN ANALYZE in more detail: the planner estimates that the GROUP BY will return the staggering number of 94 million rows. If we grouped by "t" directly, this guess would be correct - however, what the planner was not able to grasp is that we are grouping by day, not by second. Therefore the estimate is way off. In many cases a bad estimate like this leads to a really bad plan, so you have to be cautious. So what causes this problem, and how can you address it?
Behind the scenes, PostgreSQL stores statistics about the content of every single column. In the case of "t" it has figured out that all rows contain different data (n_distinct in pg_stats is therefore -1). This guess is then used to handle our GROUP BY statement, which is the safe choice, because an overestimate is usually less risky than an underestimate. However, in this case it's not just a slight overestimation - it's a major failure. The reason for this failure is that PostgreSQL does have statistics for the column, but it does not have statistics for the expression (= the output of date_trunc).
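You can inspect these per-column statistics yourself. The following query against the pg_stats view is just a quick sanity check - the exact figures will vary slightly, because ANALYZE only looks at a sample of the table:

test=# SELECT attname, n_distinct
       FROM pg_stats
       WHERE tablename = 't_timeseries';

For "t" you will typically see n_distinct = -1, meaning the column is considered unique - but there is no entry at all describing date_trunc('day', t).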
CREATE STATISTICS is a good way to tell the system to keep track of additional statistics which can help the planner:
test=# \h CREATE STATISTICS
Command:     CREATE STATISTICS
Description: define extended statistics
Syntax:
CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
    ON ( expression )
    FROM table_name

CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
    [ ( statistics_kind [, ... ] ) ]
    ON { column_name | ( expression ) }, { column_name | ( expression ) } [, ...]
    FROM table_name

URL: https://www.postgresql.org/docs/15/sql-createstatistics.html
The beauty of this is that you can tell PostgreSQL to keep track not only of multi-column statistics, but also of statistics on expressions. Here's how it works:
test=# CREATE STATISTICS mystats ON (date_trunc('day', t)) FROM t_timeseries;
CREATE STATISTICS
test=# ANALYZE;
ANALYZE
What you've just told the system is to create a sample for date_trunc('day', t) and to maintain this information just like plain column statistics.
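If you want to double-check that the expression statistics really exist, you can query the pg_stats_ext_exprs view (available since PostgreSQL 14; this is only a sketch and the numbers of course depend on your data):

test=# SELECT statistics_name, expr, n_distinct
       FROM pg_stats_ext_exprs
       WHERE tablename = 't_timeseries';

Once ANALYZE has populated these statistics, the plan changes immediately: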
test=# explain SELECT date_trunc('day', t), count(*)
       FROM t_timeseries
       GROUP BY 1;
                                     QUERY PLAN
----------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=1293518.28..1293798.44 rows=1095 width=16)
   Group Key: (date_trunc('day'::text, t))
   ->  Gather Merge  (cost=1293518.28..1293773.80 rows=2190 width=16)
         Workers Planned: 2
         ->  Sort  (cost=1292518.26..1292521.00 rows=1095 width=16)
               Sort Key: (date_trunc('day'::text, t))
               ->  Partial HashAggregate  (cost=1292449.29..1292462.98 rows=1095 width=16)
                     Group Key: date_trunc('day'::text, t)
                     ->  Parallel Seq Scan on t_timeseries  (cost=0.00..1095349.21 rows=39420017 width=8)
(9 rows)
What's noteworthy here is that PostgreSQL has instantly switched to a parallel query. Why is that? PostgreSQL now expects small intermediate results, which makes the inter-process communication required for parallelism a lot more attractive. Using a parallel query can be very beneficial to overall query performance. To make the estimates easier to compare with the first plan, let's disable parallelism for a moment:
test=# SET max_parallel_workers_per_gather TO 0;
SET
test=# explain SELECT date_trunc('day', t), count(*)
       FROM t_timeseries
       GROUP BY 1;
                                   QUERY PLAN
-------------------------------------------------------------------------------
 HashAggregate  (cost=2258239.70..2258253.39 rows=1095 width=16)
   Group Key: date_trunc('day'::text, t)
   ->  Seq Scan on t_timeseries  (cost=0.00..1785199.50 rows=94608040 width=8)
(3 rows)
What you see here is that PostgreSQL estimates both operations correctly: the sequential scan as well as the output of the aggregation are right on target. This becomes even more obvious when you look at EXPLAIN ANALYZE:
test=# explain analyze SELECT date_trunc('day', t), count(*)
       FROM t_timeseries
       GROUP BY 1;
                                   QUERY PLAN
----------------------------------------------------------------------------------
 HashAggregate  (cost=2258239.70..2258253.39 rows=1095 width=16) (actual time=30192.973..30193.087 rows=1096 loops=1)
   Group Key: date_trunc('day'::text, t)
   Batches: 1  Memory Usage: 193kB
   ->  Seq Scan on t_timeseries  (cost=0.00..1785199.50 rows=94608040 width=8) (actual time=1.133..20028.498 rows=94608001 loops=1)
 Planning Time: 0.082 ms
 Execution Time: 30193.168 ms
(6 rows)
PostgreSQL expected 1095 rows - and there are in fact 1096 rows. That is absolutely on target. However, CREATE STATISTICS is not only useful when you want to group by expressions - it's also really helpful if you are facing wrong estimates caused by a phenomenon known as "cross-column correlation", which can be very painful in a data warehousing context.
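To give you a rough idea - using a hypothetical t_address table with strongly correlated zip and city columns, which is made up purely for this sketch - multi-column statistics are created in much the same way:

test=# CREATE STATISTICS zip_city_stats (ndistinct, dependencies)
           ON zip, city FROM t_address;
CREATE STATISTICS
test=# ANALYZE t_address;
ANALYZE

With those statistics in place, the planner knows that zip and city are not independent, so it no longer multiplies their selectivities blindly when both columns show up in the same WHERE clause or GROUP BY.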
The PostgreSQL optimizer is a powerful piece of software. If you want to know more about its features and inner workings, I recommend checking out Laurenz Albe's post about optimizer support functions, which discusses a powerful method to handle wrong estimates when dealing with set-returning functions.