If you are using PostgreSQL for analytics or large-scale aggregations, you might occasionally notice the planner misestimating the number of rows a query returns. While this isn't a problem for small aggregates, it is indeed an issue for large-scale aggregations featuring many different dimensions.
In short: The more columns your GROUP BY clause contains, the more likely it is that the optimizer overestimates the row count.
This blog post explains how this can be handled in PostgreSQL.
For this example, I have created a small table which contains a couple of integer columns:
blog=# CREATE TABLE t_sample (
           a    int,
           b    int,
           c    int,
           d    int,
           e    int,
           f    int DEFAULT random()*100000
       );
CREATE TABLE
As we can see, we have five integer columns (a to e) that serve as dimensions, and a column f that contains a value. The easiest way to add 10 million rows to this table is to make use of the generate_series function:
blog=# INSERT INTO t_sample
       SELECT x % 10, x % 11, x % 12, x % 13, x % 14
       FROM generate_series(1, 10000000) AS x;
INSERT 0 10000000
The key point here is that our data has some interesting properties: each column can only contain a fixed set of values (10 distinct values in the first column, 11 distinct values in the second column, and so on).
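If you want to double-check this property, a quick sanity check (not part of the original example) is to count the distinct values per column; given the modulo expressions above, it should return 10, 11, 12, 13 and 14:

blog=# SELECT count(DISTINCT a) AS a,
              count(DISTINCT b) AS b,
              count(DISTINCT c) AS c,
              count(DISTINCT d) AS d,
              count(DISTINCT e) AS e
       FROM t_sample;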
Once the data has been loaded, I am calling the ANALYZE command to create fresh optimizer statistics:
blog=# ANALYZE;
ANALYZE
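On a larger database you may not want to rebuild statistics for everything. ANALYZE also accepts a table name, so restricting it to the table we just loaded works just as well:

blog=# ANALYZE t_sample;
ANALYZE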
How does the PostgreSQL optimizer work? The most fundamental concept is column statistics. For each column, PostgreSQL keeps important information such as the number of distinct values, the most common values, a histogram and a lot more:
blog=# SELECT tablename, attname, n_distinct, most_common_vals
       FROM pg_stats
       WHERE tablename = 't_sample'
             AND attname IN ('a', 'b');
 tablename | attname | n_distinct |     most_common_vals
-----------+---------+------------+--------------------------
 t_sample  | b       |         11 | {9,6,7,5,0,8,10,1,4,3,2}
 t_sample  | a       |         10 | {9,0,4,5,8,1,2,3,6,7}
(2 rows)
Note that this information is stored for every single column. Under normal circumstances, autovacuum will ensure that these statistics are kept up to date.
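In case you want to verify that the statistics are reasonably fresh, pg_stat_user_tables shows when a table was last analyzed, either manually or by autovacuum (a small check on top of the example above):

blog=# SELECT relname, last_analyze, last_autoanalyze
       FROM pg_stat_user_tables
       WHERE relname = 't_sample';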
The following query is quite a common one: We want to determine how many entries exist in each group. The key detail is that the GROUP BY clause contains five dimensions.
To make the example easier to understand, we first disable parallel queries in PostgreSQL. This is not a general performance recommendation; it is done solely to simplify the execution plan:
blog=# SET max_parallel_workers_per_gather TO 0;
SET
blog=# explain analyze SELECT a, b, c, d, e, count(*)
       FROM t_sample
       GROUP BY 1, 2, 3, 4, 5;
                                QUERY PLAN
--------------------------------------------------------------------------
 HashAggregate  (cost=982455.57..1102046.81 rows=240240 width=28)
                (actual time=1638.327..1784.077 rows=60060 loops=1)
   Group Key: a, b, c, d, e
   Planned Partitions: 4  Batches: 5  Memory Usage: 8241kB  Disk Usage: 53072kB
   ->  Seq Scan on t_sample  (cost=0.00..163696.15 rows=10000115 width=20)
                             (actual time=0.051..335.137 rows=10000000 loops=1)
 Planning Time: 0.167 ms
 Execution Time: 1787.892 ms
(6 rows)
What stands out here? It comes down to two numbers: "240240" and "60060". The first number is the optimizer's estimate - it assumes that more than 240,000 rows will be returned. However, the actual result is only 60060 rows. Why is this the case? The reason is that certain combinations of a, b, c, d and e simply don't exist - but PostgreSQL does not know that. Remember, the optimizer maintains statistics about each column individually; it knows nothing about which combinations of values actually occur.
How does PostgreSQL come up with "240240"? Consider the following calculation:
blog=# SELECT 10 * 11 * 12 * 13 * 14;
 ?column?
----------
   240240
(1 row)
PostgreSQL simply multiplied the number of distinct values in each dimension, which leads to an overestimation (often less problematic than an underestimation).
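The real number of combinations is much smaller than this product. If you want to confirm that yourself, the following query (admittedly not free on 10 million rows) counts the combinations that actually exist and should return 60060, matching the actual row count shown in the plan above:

blog=# SELECT count(*)
       FROM (SELECT DISTINCT a, b, c, d, e FROM t_sample) AS x;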
The way to fix those estimates is to use extended statistics. The following command can be used to approach the problem:
blog=# \h CREATE STATISTICS
Command:     CREATE STATISTICS
Description: define extended statistics
Syntax:
CREATE STATISTICS [ [ IF NOT EXISTS ] statistics_name ]
    ON ( expression )
    FROM table_name

CREATE STATISTICS [ [ IF NOT EXISTS ] statistics_name ]
    [ ( statistics_kind [, ... ] ) ]
    ON { column_name | ( expression ) }, { column_name | ( expression ) } [, ...]
    FROM table_name

URL: https://www.postgresql.org/docs/17/sql-createstatistics.html
In our case we need the "ndistinct" option of CREATE STATISTICS:
blog=# CREATE STATISTICS mystats (ndistinct)
       ON a, b, c, d, e
       FROM t_sample;
CREATE STATISTICS
blog=# ANALYZE;
ANALYZE
This creates statistics about combinations of columns, not just about individual columns. When we run the query again, we will see that the estimates have improved significantly:
blog=# explain analyze SELECT a, b, c, d, e, count(*)
       FROM t_sample
       GROUP BY 1, 2, 3, 4, 5;
                                 QUERY PLAN
----------------------------------------------------------------------------
 HashAggregate  (cost=313697.88..314288.34 rows=59046 width=28)
                (actual time=1587.681..1733.595 rows=60060 loops=1)
   Group Key: a, b, c, d, e
   Batches: 5  Memory Usage: 8241kB  Disk Usage: 53072kB
   ->  Seq Scan on t_sample  (cost=0.00..163696.15 rows=10000115 width=20)
                             (actual time=0.060..309.344 rows=10000000 loops=1)
 Planning Time: 0.623 ms
 Execution Time: 1737.495 ms
(6 rows)
Wow, this is close: "59046" vs. "60060" - PostgreSQL is almost spot on. In this case, the better estimate doesn't impact performance. However, in real-world scenarios, fixing statistics like this can have a huge impact on speed and overall database efficiency.
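As a side note, ndistinct is not the only statistics kind CREATE STATISTICS supports. The syntax shown above also allows dependencies and mcv (most common values) statistics, which help with correlated filter conditions rather than GROUP BY estimates. A sketch of what that could look like (the object name mystats_full is made up for illustration):

blog=# CREATE STATISTICS mystats_full (ndistinct, dependencies, mcv)
       ON a, b FROM t_sample;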
The question may naturally arise: What information does the optimizer use to come up with such an accurate estimate? Here is the answer:
blog=# \x
Expanded display is on.
blog=# SELECT * FROM pg_stats_ext;
-[ RECORD 1 ]----------+----------------------------------------------------------------
schemaname             | public
tablename              | t_sample
statistics_schemaname  | public
statistics_name        | mystats
statistics_owner       | hs
attnames               | {a,b,c,d,e}
exprs                  |
kinds                  | {d}
inherited              | f
n_distinct             | {"1, 2": 110, "1, 3": 60, "1, 4": 130, "1, 5": 70, "2, 3": 132, "2, 4": 143, "2, 5": 154, "3, 4": 156, "3, 5": 84, "4, 5": 182, "1, 2, 3": 660, "1, 2, 4": 1430, "1, 2, 5": 770, "1, 3, 4": 780, "1, 3, 5": 420, "1, 4, 5": 910, "2, 3, 4": 1716, "2, 3, 5": 924, "2, 4, 5": 2002, "3, 4, 5": 1092, "1, 2, 3, 4": 8576, "1, 2, 3, 5": 4621, "1, 2, 4, 5": 10026, "1, 3, 4, 5": 5457, "2, 3, 4, 5": 11949, "1, 2, 3, 4, 5": 59046}
dependencies           |
most_common_vals       |
most_common_val_nulls  |
most_common_freqs      |
most_common_base_freqs |
The pg_stats_ext view tells us how PostgreSQL handles n_distinct in this case: a distinct-value estimate is stored for every combination of the columns covered by the statistics object.
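The keys in the n_distinct column ("1, 2", "1, 2, 3, 4, 5" and so on) refer to the attribute numbers of the columns covered by the statistics object. If you want to map them back to column names, a small helper query against pg_attribute (not part of the original example) does the trick:

blog=# SELECT attnum, attname
       FROM pg_attribute
       WHERE attrelid = 't_sample'::regclass
             AND attnum > 0
       ORDER BY attnum;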
Optimizing queries and the optimizer in general are fascinating topics worth exploring. If you want to learn more and dig deeper, consider reading my post about "How the PostgreSQL query optimizer works".