If you are using PostgreSQL for analytics or large-scale aggregations, you might occasionally notice the planner misestimating the number of rows a query returns. While this isn't a problem for small aggregates, it is indeed an issue for large-scale aggregations featuring many different dimensions.
In short: The more columns your GROUP BY clause contains, the more likely it is that the optimizer overestimates the row count.
This blog post explains how this can be handled in PostgreSQL.
For this example, I have created a small table which contains a couple of integer columns:
blog=# CREATE TABLE t_sample (
           a    int,
           b    int,
           c    int,
           d    int,
           e    int,
           f    int DEFAULT random()*100000
       );
CREATE TABLE
As we can see, we have five integer columns (a to e) that serve as dimensions, and a column f that contains a value. The easiest way to add 10 million rows to this table is to make use of the generate_series function:
blog=# INSERT INTO t_sample
       SELECT x % 10, x % 11, x % 12, x % 13, x % 14
       FROM generate_series(1, 10000000) AS x;
INSERT 0 10000000
The key point here is that our data has some interesting properties: each column can only contain a fixed set of values (10 distinct values in the first column, 11 distinct values in the second column, and so on).
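If you want to double-check this property, a quick sanity check (not part of the original example) is to count the distinct values per column; given the modulo expressions above, it should return 10, 11, 12, 13 and 14:

blog=# SELECT count(DISTINCT a) AS a,
              count(DISTINCT b) AS b,
              count(DISTINCT c) AS c,
              count(DISTINCT d) AS d,
              count(DISTINCT e) AS e
       FROM t_sample;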
Once the data has been loaded, I am calling the ANALYZE command to create fresh optimizer statistics:
blog=# ANALYZE;
ANALYZE
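On a larger database you may not want to rebuild statistics for everything. ANALYZE also accepts a table name, so restricting it to the table we just loaded works just as well:

blog=# ANALYZE t_sample;
ANALYZE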
How does the PostgreSQL optimizer work? The most fundamental concept is column statistics. For each column, PostgreSQL keeps important information such as the number of distinct values, the most common values, a histogram and a lot more:
blog=# SELECT tablename, attname, n_distinct, most_common_vals
       FROM pg_stats
       WHERE tablename = 't_sample'
             AND attname IN ('a', 'b');
 tablename | attname | n_distinct |     most_common_vals
-----------+---------+------------+--------------------------
 t_sample  | b       |         11 | {9,6,7,5,0,8,10,1,4,3,2}
 t_sample  | a       |         10 | {9,0,4,5,8,1,2,3,6,7}
(2 rows)
Note that this information is stored for every single column. Under normal circumstances, autovacuum will ensure that these statistics are kept up to date.
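In case you want to verify that the statistics are reasonably fresh, pg_stat_user_tables shows when a table was last analyzed, either manually or by autovacuum (a small check on top of the example above):

blog=# SELECT relname, last_analyze, last_autoanalyze
       FROM pg_stat_user_tables
       WHERE relname = 't_sample';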
The following query is quite a common one: We want to determine how many entries exist in each group. The key detail is that the GROUP BY clause contains five dimensions.
To make the example easier to understand, we first disable parallel queries in PostgreSQL. This is not a general performance recommendation; it is done solely to simplify the execution plan:
blog=# SET max_parallel_workers_per_gather TO 0;
SET
blog=# explain analyze SELECT a, b, c, d, e, count(*)
       FROM t_sample
       GROUP BY 1, 2, 3, 4, 5;
                                QUERY PLAN
--------------------------------------------------------------------------
 HashAggregate  (cost=982455.57..1102046.81 rows=240240 width=28)
                (actual time=1638.327..1784.077 rows=60060 loops=1)
   Group Key: a, b, c, d, e
   Planned Partitions: 4  Batches: 5  Memory Usage: 8241kB  Disk Usage: 53072kB
   ->  Seq Scan on t_sample  (cost=0.00..163696.15 rows=10000115 width=20)
                             (actual time=0.051..335.137 rows=10000000 loops=1)
 Planning Time: 0.167 ms
 Execution Time: 1787.892 ms
(6 rows)
What stands out here? It comes down to two numbers: "240240" and "60060". The first number is the optimizer's estimate - it assumes that more than 240,000 rows will be returned. However, the actual result is only 60060 rows. Why is this the case? The reason is that certain combinations of a, b, c, d and e simply don't exist - but PostgreSQL does not know that. Remember, the optimizer maintains statistics about each column individually; it knows nothing about which combinations of values actually occur.
How does PostgreSQL come up with "240240"? Consider the following calculation:
blog=# SELECT 10 * 11 * 12 * 13 * 14;
 ?column?
----------
   240240
(1 row)
PostgreSQL simply multiplied the number of distinct values in each dimension, which leads to an overestimation (often less problematic than an underestimation).
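The real number of combinations is much smaller than this product. If you want to confirm that yourself, the following query (admittedly not free on 10 million rows) counts the combinations that actually exist and should return 60060, matching the actual row count shown in the plan above:

blog=# SELECT count(*)
       FROM (SELECT DISTINCT a, b, c, d, e FROM t_sample) AS x;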
The way to fix those estimates is to use extended statistics. The following command can be used to approach the problem:
blog=# \h CREATE STATISTICS
Command:     CREATE STATISTICS
Description: define extended statistics
Syntax:
CREATE STATISTICS [ [ IF NOT EXISTS ] statistics_name ]
    ON ( expression )
    FROM table_name

CREATE STATISTICS [ [ IF NOT EXISTS ] statistics_name ]
    [ ( statistics_kind [, ... ] ) ]
    ON { column_name | ( expression ) }, { column_name | ( expression ) } [, ...]
    FROM table_name

URL: https://www.postgresql.org/docs/17/sql-createstatistics.html
In our case we need the "ndistinct" option of CREATE STATISTICS:
blog=# CREATE STATISTICS mystats (ndistinct)
       ON a, b, c, d, e
       FROM t_sample;
CREATE STATISTICS
blog=# ANALYZE;
ANALYZE
This creates statistics about combinations of columns, not just about individual columns. When we run the query again, we will see that the estimates have improved significantly:
blog=# explain analyze SELECT a, b, c, d, e, count(*)
       FROM t_sample
       GROUP BY 1, 2, 3, 4, 5;
                                 QUERY PLAN
----------------------------------------------------------------------------
 HashAggregate  (cost=313697.88..314288.34 rows=59046 width=28)
                (actual time=1587.681..1733.595 rows=60060 loops=1)
   Group Key: a, b, c, d, e
   Batches: 5  Memory Usage: 8241kB  Disk Usage: 53072kB
   ->  Seq Scan on t_sample  (cost=0.00..163696.15 rows=10000115 width=20)
                             (actual time=0.060..309.344 rows=10000000 loops=1)
 Planning Time: 0.623 ms
 Execution Time: 1737.495 ms
(6 rows)
Wow, this is close: "59046" vs. "60060" - PostgreSQL is almost spot on. In this case, the better estimate doesn't impact performance. However, in real-world scenarios, fixing statistics like this can have a huge impact on speed and overall database efficiency.
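As a side note, ndistinct is not the only statistics kind CREATE STATISTICS supports. The syntax shown above also allows dependencies and mcv (most common values) statistics, which help with correlated filter conditions rather than GROUP BY estimates. A sketch of what that could look like (the object name mystats_full is made up for illustration):

blog=# CREATE STATISTICS mystats_full (ndistinct, dependencies, mcv)
       ON a, b FROM t_sample;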
The question may naturally arise: What information does the optimizer use to come up with such an accurate estimate? Here is the answer:
blog=# \x
Expanded display is on.
blog=# SELECT * FROM pg_stats_ext;
-[ RECORD 1 ]----------+----------------------------------------------------------------
schemaname             | public
tablename              | t_sample
statistics_schemaname  | public
statistics_name        | mystats
statistics_owner       | hs
attnames               | {a,b,c,d,e}
exprs                  |
kinds                  | {d}
inherited              | f
n_distinct             | {"1, 2": 110, "1, 3": 60, "1, 4": 130, "1, 5": 70, "2, 3": 132, "2, 4": 143, "2, 5": 154, "3, 4": 156, "3, 5": 84, "4, 5": 182, "1, 2, 3": 660, "1, 2, 4": 1430, "1, 2, 5": 770, "1, 3, 4": 780, "1, 3, 5": 420, "1, 4, 5": 910, "2, 3, 4": 1716, "2, 3, 5": 924, "2, 4, 5": 2002, "3, 4, 5": 1092, "1, 2, 3, 4": 8576, "1, 2, 3, 5": 4621, "1, 2, 4, 5": 10026, "1, 3, 4, 5": 5457, "2, 3, 4, 5": 11949, "1, 2, 3, 4, 5": 59046}
dependencies           |
most_common_vals       |
most_common_val_nulls  |
most_common_freqs      |
most_common_base_freqs |
The pg_stats_ext view tells us how PostgreSQL handles n_distinct in this case: a distinct-value estimate is stored for every combination of the columns covered by the statistics object.
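The keys in the n_distinct column ("1, 2", "1, 2, 3, 4, 5" and so on) refer to the attribute numbers of the columns covered by the statistics object. If you want to map them back to column names, a small helper query against pg_attribute (not part of the original example) does the trick:

blog=# SELECT attnum, attname
       FROM pg_attribute
       WHERE attrelid = 't_sample'::regclass
             AND attnum > 0
       ORDER BY attnum;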
Optimizing queries and the optimizer in general are fascinating topics worth exploring. If you want to learn more and dig deeper, consider reading my post about "How the PostgreSQL query optimizer works".