CYBERTEC PostgreSQL Logo

# GROUP BY: Fixing optimizer estimates

01.2025 / Category: / Tags:

If you are using PostgreSQL for analytics or large-scale aggregations, you might occasionally notice the planner making false assumptions regarding the number of rows. While this isn't a problem for small aggregates, it is indeed an issue for large-scale aggregations featuring many different dimensions.

In short: The more columns your GROUP BY statement contains, the more likely it is that optimizer overestimates the row count.

This blog explains how this can be handled in PostgreSQL.

Populating sample data in PostgreSQL

For this example, I have created a small table which contains a couple of integer values:

As we can see, we have 5 integer columns (a-e), and a column that contains a value. The easiest way to add 10 million rows to this table is to make use of the generate_series function:

The key point here is that our data has some interesting properties. Each column can only contain a fixed set of values (= 10 entries on the first column,11 distinct values in the second column and so on).

Once the data has been loaded, I am calling the ANALYZE command to create fresh optimizer statistics:

What does "optimizer statistics" mean?

How does the PostgreSQL optimizer work? The most fundamental idea is column-statistics. For each column, PostgreSQL keeps important information such as the number of distinct values, the most common values, a histogram and a lot more:

Note that this information is stored for every single column. Under normal circumstances, autovacuum will ensure that the statistics is kept up to date.

When columns depend on each other

The following query is quite a common one: We want to determine how many entries exist in each group. The key detail is that the GROUP BY clause contains five dimensions.

To make the example easier to understand, we first disabled parallel queries in PostgreSQL. This is not a general performance recommendation for PostgreSQL, it is solely done to simplify the execution plan:

What stands out here? It comes down to two numbers: "240240" and "60060". The first number indicates what the optimizer estimates - it assumes that over 200,000 lines will be returned. However, the actual result is only 60060 rows. Why is this the case? The reason is that certain combinations of a, b, c, d, e simply don't exist. However, PostgreSQL does not know that. Remember, the optimizer maintains statistics about each column - it does not know about all those real permutations.

How does PostgreSQL come up with "240240"? Consider the following calculation:

PostgreSQL simply multiplied the number of elements in each dimension, which leads to an overestimation (which is often less problematic than an underestimation).

CREATE STATISTICS: Fixing estimates

The solution to fix those estimates is the use of extended statistics. The following command can be used to approach the problem:

In our case we need the "ndistinct" option of CREATE STATISTICS:

This creates statistics for permutations, not just for single values and columns. When we run the query again, we will see that the estimates have improved significantly:

Wow, this is close: "59046" vs "60060" - PostgreSQL is spot on. In this case, it doesn't impact performance. However, in real-world scenarios, fixing statistics can have a huge impact on speed and overall database
efficiency.

The question may naturally arise: What information does the optimizer use to come up with such an accurate estimate? Here is the answer:

The pg_stats_ext view tells us how PostgreSQL PostgreSQL handles the n_distinct column in this case.

Finally …

Optimizing queries and the optimizer in general are fascinating topics worth exploring. If you want to learn more and dig deeper, consider reading my post about "How the PostgreSQL query optimizer works"

One response to “# GROUP BY: Fixing optimizer estimates”

Leave a Reply

Your email address will not be published. Required fields are marked *

CYBERTEC Logo white
Get the newest PostgreSQL Info & Tools


    This site is protected by reCAPTCHA and the Google Privacy Policy & Terms of Service apply.

    ©
    2025
    CYBERTEC PostgreSQL International GmbH
    phone-handsetmagnifiercrosscross-circle
    linkedin facebook pinterest youtube rss twitter instagram facebook-blank rss-blank linkedin-blank pinterest youtube twitter instagram