Analyzing time series often comes with challenges. This is especially true if you want to do a bit more than just run simple counts on values in a column. More often than not, you'll want to create a monthly report, a weekly overview and so on. This requires you to group by expressions rather than by plain column values. What is not widely known is that this can cause issues with the optimizer - issues which are best addressed using CREATE STATISTICS.
This blog post will show you how to address wrong estimates and how to detect such issues in the first place.
To demonstrate what we are talking about in this blog, I've created a simple data set:
test=# CREATE TABLE t_timeseries (
           t    timestamptz,
           val  numeric DEFAULT random()
       );
CREATE TABLE
Basically, all you need is a time series table containing two columns. You can easily fill it with some data:
test=# INSERT INTO t_timeseries
       SELECT * FROM generate_series('2022-01-01', '2024-12-31', '1 second'::interval);
INSERT 0 94608001
All that was done here was to insert roughly 94 million rows - one row per second for a period of three years. Finally, you can create optimizer statistics for this data:
test=# ANALYZE;
ANALYZE
What you want to do now is to count how many entries we can expect per day. For that you can use the date_trunc function, which truncates a precise timestamp to the day, week, month and so on. In this case we have chosen to reduce the timestamps to full days.
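In case you have never used date_trunc before, here is a minimal example of what it does (shown with the TimeZone setting at UTC - the exact output depends on your session's time zone):

test=# SELECT date_trunc('day', TIMESTAMPTZ '2024-05-17 14:31:07+00');
       date_trunc
------------------------
 2024-05-17 00:00:00+00
(1 row)

Every timestamp within the same calendar day maps to the same value, which is exactly what we want to group on. Applying this to our table and grouping by the truncated value yields the per-day counts: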
test=# explain analyze SELECT date_trunc('day', t), count(*)
       FROM t_timeseries
       GROUP BY 1;
                                      QUERY PLAN
-------------------------------------------------------------------------------------------
 HashAggregate  (cost=11955563.80..14616414.92 rows=94608040 width=16) (actual time=28929.936..28930.176 rows=1096 loops=1)
   Group Key: date_trunc('day'::text, t)
   Planned Partitions: 256  Batches: 1  Memory Usage: 1681kB
   ->  Seq Scan on t_timeseries  (cost=0.00..1785199.50 rows=94608040 width=8) (actual time=0.030..19006.497 rows=94608001 loops=1)
 Planning Time: 0.187 ms
 Execution Time: 28930.333 ms
(6 rows)
However, this leads to a problem. Let's inspect the output of EXPLAIN ANALYZE in more detail: the planner estimates that the GROUP BY will return the staggering number of 94 million rows. If we grouped by "t" directly, this guess would be correct - however, what the planner was not able to grasp is that we are grouping by day, not by second. Therefore the estimate is way off. In many cases a bad estimate like this leads to a really bad plan, so you have to be cautious. So what causes this problem, and how can you address it?
Behind the scenes, PostgreSQL stores statistics about the content of every single column. In the case of "t" it has figured out that all rows contain different data (n_distinct in pg_stats is therefore -1). This guess is then used to handle our GROUP BY statement, which is the safe choice, because an overestimate is usually less risky than an underestimate. However, in this case it's not just a slight overestimation - it's a major failure. The reason for this failure is that PostgreSQL does have statistics for the column, but it does not have statistics for the expression (= the output of date_trunc).
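You can inspect these per-column statistics yourself. The following query against the pg_stats view is just a quick sanity check - the exact figures will vary slightly, because ANALYZE only looks at a sample of the table:

test=# SELECT attname, n_distinct
       FROM pg_stats
       WHERE tablename = 't_timeseries';

For "t" you will typically see n_distinct = -1, meaning the column is considered unique - but there is no entry at all describing date_trunc('day', t).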
CREATE STATISTICS is a good way to tell the system to keep track of additional statistics which can help the planner:
test=# \h CREATE STATISTICS
Command:     CREATE STATISTICS
Description: define extended statistics
Syntax:
CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
    ON ( expression )
    FROM table_name

CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
    [ ( statistics_kind [, ... ] ) ]
    ON { column_name | ( expression ) }, { column_name | ( expression ) } [, ...]
    FROM table_name

URL: https://www.postgresql.org/docs/15/sql-createstatistics.html
The beauty of this is that you can tell PostgreSQL to keep track not only of multi-column statistics, but also of statistics on expressions. Here's how it works:
test=# CREATE STATISTICS mystats ON (date_trunc('day', t)) FROM t_timeseries;
CREATE STATISTICS
test=# ANALYZE;
ANALYZE
What you've just told the system is to create a sample for date_trunc('day', t) and to maintain this information just like plain column statistics.
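If you want to double-check that the expression statistics really exist, you can query the pg_stats_ext_exprs view (available since PostgreSQL 14; this is only a sketch and the numbers of course depend on your data):

test=# SELECT statistics_name, expr, n_distinct
       FROM pg_stats_ext_exprs
       WHERE tablename = 't_timeseries';

Once ANALYZE has populated these statistics, the plan changes immediately: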
test=# explain SELECT date_trunc('day', t), count(*)
       FROM t_timeseries
       GROUP BY 1;
                                     QUERY PLAN
----------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=1293518.28..1293798.44 rows=1095 width=16)
   Group Key: (date_trunc('day'::text, t))
   ->  Gather Merge  (cost=1293518.28..1293773.80 rows=2190 width=16)
         Workers Planned: 2
         ->  Sort  (cost=1292518.26..1292521.00 rows=1095 width=16)
               Sort Key: (date_trunc('day'::text, t))
               ->  Partial HashAggregate  (cost=1292449.29..1292462.98 rows=1095 width=16)
                     Group Key: date_trunc('day'::text, t)
                     ->  Parallel Seq Scan on t_timeseries  (cost=0.00..1095349.21 rows=39420017 width=8)
(9 rows)
What's noteworthy here is that PostgreSQL has instantly switched to a parallel query. Why is that? PostgreSQL now expects small intermediate results, which makes the inter-process communication required for parallelism a lot more attractive. Using a parallel query can be very beneficial to overall query performance. To make the estimates easier to compare with the first plan, let's disable parallelism for a moment:
test=# SET max_parallel_workers_per_gather TO 0;
SET
test=# explain SELECT date_trunc('day', t), count(*)
       FROM t_timeseries
       GROUP BY 1;
                                   QUERY PLAN
-------------------------------------------------------------------------------
 HashAggregate  (cost=2258239.70..2258253.39 rows=1095 width=16)
   Group Key: date_trunc('day'::text, t)
   ->  Seq Scan on t_timeseries  (cost=0.00..1785199.50 rows=94608040 width=8)
(3 rows)
What you see here is that PostgreSQL estimates both operations correctly: the sequential scan as well as the output of the aggregation are right on target. This becomes even more obvious when you look at EXPLAIN ANALYZE:
test=# explain analyze SELECT date_trunc('day', t), count(*)
       FROM t_timeseries
       GROUP BY 1;
                                   QUERY PLAN
----------------------------------------------------------------------------------
 HashAggregate  (cost=2258239.70..2258253.39 rows=1095 width=16) (actual time=30192.973..30193.087 rows=1096 loops=1)
   Group Key: date_trunc('day'::text, t)
   Batches: 1  Memory Usage: 193kB
   ->  Seq Scan on t_timeseries  (cost=0.00..1785199.50 rows=94608040 width=8) (actual time=1.133..20028.498 rows=94608001 loops=1)
 Planning Time: 0.082 ms
 Execution Time: 30193.168 ms
(6 rows)
PostgreSQL expected 1095 rows - and there are in fact 1096 rows. That is absolutely on target. However, CREATE STATISTICS is not only useful when you want to group by expressions - it's also really helpful if you are facing wrong estimates caused by a phenomenon known as "cross-column correlation", which can be very painful in a data warehousing context.
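To give you a rough idea - using a hypothetical t_address table with strongly correlated zip and city columns, which is made up purely for this sketch - multi-column statistics are created in much the same way:

test=# CREATE STATISTICS zip_city_stats (ndistinct, dependencies)
           ON zip, city FROM t_address;
CREATE STATISTICS
test=# ANALYZE t_address;
ANALYZE

With those statistics in place, the planner knows that zip and city are not independent, so it no longer multiplies their selectivities blindly when both columns show up in the same WHERE clause or GROUP BY.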
The PostgreSQL optimizer is a powerful piece of software. If you want to know more about its features and inner workings, I recommend checking out Laurenz Albe's post about optimizer support functions, which discusses a powerful method to handle wrong estimates when dealing with set-returning functions.