
Improving GROUP BY with CREATE STATISTICS


Analyzing time series often comes with challenges. This is especially true if you want to do a bit more than just run simple counts on the values in a column. More often than not, you'll want to create a monthly report, a weekly overview and so on. This requires you to group by expressions rather than by plain columns. What is not widely known is that this can cause issues with the optimizer, which should be addressed properly using CREATE STATISTICS.

This blog post will show you how to address wrong estimates and how to detect such issues in the first place.

Creating sample data

To demonstrate what we are talking about in this blog, I've created a simple data set.
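A minimal sketch could look like this; the table name t_timeseries and the columns t and val are illustrative assumptions:

    -- a simple time series: one timestamp plus one measured value
    CREATE TABLE t_timeseries (
        t    timestamptz,
        val  numeric
    );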

Basically, all you need is a time series containing two columns. You can easily fill this table with some data.
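One way to do that is generate_series; the exact date range below is an assumption, chosen so that three years of one-second steps yield roughly 94 million rows:

    -- insert one row per second for three years (~94 million rows)
    INSERT INTO t_timeseries
        SELECT x, random()
        FROM generate_series('2022-01-01'::timestamptz,
                             '2024-12-31 23:59:59'::timestamptz,
                             '1 second') AS x;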

All that was done here was to insert roughly 94 million rows: one row per second for a duration of three years. Finally, you can create optimizer statistics for this data.
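A plain VACUUM ANALYZE on the table is enough:

    -- collect per-column optimizer statistics for the freshly loaded table
    VACUUM ANALYZE t_timeseries;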

GROUP BY queries featuring expressions

What you want to do now is to count how many entries we can expect per day. For that purpose you can use the date_trunc function, which truncates a precise timestamp to a coarser unit such as day, week or month. In this case, we have chosen to reduce the timestamp to full days.
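The query could look as follows, reusing the sketched t_timeseries table:

    -- count entries per day; EXPLAIN ANALYZE shows estimated and actual rows
    EXPLAIN ANALYZE
    SELECT date_trunc('day', t) AS day, count(*)
    FROM   t_timeseries
    GROUP BY 1;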

However, this leads to a problem. Let's inspect the output of EXPLAIN ANALYZE in more detail: the planner estimates that the GROUP BY will return the staggering number of 94 million rows. If we grouped by “t” directly, this guess would be correct – but what the planner was not able to grasp is that we group by day, not by second. Therefore the estimate is way off. In many cases a bad estimate leads to a really bad plan, so you have to be very cautious. So what causes the issue, and how can you address it?

CREATE STATISTICS: Helping the PostgreSQL optimizer

Behind the scenes, PostgreSQL stores statistics about the content of every single column. In the case of “t” it has figured out that all rows contain distinct values (n_distinct in pg_stats is therefore -1, which indicates a unique column). This guess is then used to handle our GROUP BY statement. Usually that is safe, because an overestimate is less risky than an underestimate. However, in this case it's not just a slight overestimation – it's a major failure.
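You can see what the optimizer believes by querying the pg_stats system view:

    -- n_distinct = -1 means: every row contains a distinct value
    SELECT attname, n_distinct
    FROM   pg_stats
    WHERE  tablename = 't_timeseries';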

The reason for this failure is that PostgreSQL does have statistics for the column but it does not have statistics about the expression (= the output of date_trunc).

CREATE STATISTICS is a good way to tell the system to keep track of additional statistics which can help the planner.

The beauty of this is that you can tell PostgreSQL to keep track not only of multi-column statistics, but also of statistics about expressions.
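Here is how it might look; this is a sketch, the statistics name mystats is arbitrary, and expression statistics require PostgreSQL 14 or later:

    -- keep statistics about the expression, not just about the raw column
    CREATE STATISTICS mystats ON (date_trunc('day', t)) FROM t_timeseries;

    -- run ANALYZE again so that the new statistics object is populated
    ANALYZE t_timeseries;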

What you've just told the system is to create a sample for date_trunc('day', t) and to maintain this information just like simple column-related statistics. Once this has been done, the plan will immediately change.
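Simply plan the same aggregation again:

    -- same query as before; the GROUP BY estimate now reflects days, not seconds
    EXPLAIN
    SELECT date_trunc('day', t) AS day, count(*)
    FROM   t_timeseries
    GROUP BY 1;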

What's noteworthy here is that PostgreSQL has instantly switched to a parallel query. Why is that? PostgreSQL now expects a small intermediate result, which makes the inter-process communication required for parallelism a lot more attractive. Using a parallel query can be very beneficial to overall query performance.

Let's try the same thing using a single-core operation.
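One way to do that is to disable parallelism for the session via the max_parallel_workers_per_gather setting and to plan the query again:

    -- force a non-parallel plan for comparison
    SET max_parallel_workers_per_gather TO 0;

    EXPLAIN
    SELECT date_trunc('day', t) AS day, count(*)
    FROM   t_timeseries
    GROUP BY 1;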

What you see here is that PostgreSQL estimates both operations correctly: The sequential scans as well as the output of the aggregation are right on target.

You can easily verify this by running EXPLAIN ANALYZE.
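For example, with the same sketch query as above, now actually executed:

    -- EXPLAIN ANALYZE prints estimated and actual row counts side by side
    EXPLAIN ANALYZE
    SELECT date_trunc('day', t) AS day, count(*)
    FROM   t_timeseries
    GROUP BY 1;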

PostgreSQL expected 1095 rows – and there are in fact 1096 rows. It's absolutely on target. However, CREATE STATISTICS is not only useful if you want to group by expressions – it's also really helpful if you are facing wrong estimates caused by a phenomenon known as “cross-column correlation”, which can be very painful in a data warehousing context.

Finally …

The PostgreSQL optimizer is a powerful piece of software. If you want to know more about its features and its inner workings, I recommend you check out Laurenz Albe’s post about optimizer support functions which discusses a powerful method to handle wrong estimates when dealing with set returning functions.
