This post will deal with how to speed up analytics and window functions in PostgreSQL. As a professional PostgreSQL support company, we see a lot of SQL performance stuff which is worth sharing. One of those noteworthy things happened this week when I was out on a business trip to Berlin, Germany. This (excellent) customer was making extensive use of window functions and analytics. However, there is always room to speed things up.
Table of Contents
Most people simply write their SQL code and execute it, assuming that the optimizer will take care of things on its own. While this is usually the case there are still corner cases, where some clever optimization - done by a professional - can give you an edge and better database performance.
One way to improve speed is by rearranging window functions in a clever way. Consider the following example:
1 2 3 4 |
test=# CREATE TABLE data (id int); CREATE TABLE test=# INSERT INTO data SELECT * FROM generate_series(1, 5); INSERT 0 5 |
Our example is pretty simple: All we need is a table containing 5 rows:
1 2 3 4 5 6 7 8 9 |
test=# SELECT * FROM data; id ---- 1 2 3 4 5 (5 rows) |
1 2 3 4 5 6 7 8 9 |
test=# SELECT *, array_agg(id) OVER (ORDER BY id) FROM data; id | array_agg ----+------------- 1 | {1} 2 | {1,2} 3 | {1,2,3} 4 | {1,2,3,4} 5 | {1,2,3,4,5} (5 rows) |
What we have here is a simple aggregation. For the sake of simplicity I have used array_agg, which simply shows, how our data is aggregated. Of course we could also use min, max, sum, count or any other window function.
1 2 3 4 5 6 7 8 9 10 11 |
test=# SELECT *, array_agg(id) OVER (ORDER BY id), array_agg(id) OVER (ORDER BY id DESC) FROM data; id | array_agg | array_agg ----+-------------+------------- 5 | {1,2,3,4,5} | {5} 4 | {1,2,3,4} | {5,4} 3 | {1,2,3} | {5,4,3} 2 | {1,2} | {5,4,3,2} 1 | {1} |
In this case, there are two columns with two different OVER-clauses. Note that those two aggregations are using different sort orders. One column needs ascending data and one needs descending data.
To understand what is really going on here, we can take a look at the execution plan provided by PostgreSQL:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
test=# explain SELECT *, array_agg(id) OVER (ORDER BY id), array_agg(id) OVER (ORDER BY id DESC) FROM data ORDER BY id; QUERY PLAN ----------------------------------------------------------------- Sort (cost=557.60..563.97 rows=2550 width=68) Sort Key: id <- WindowAgg (cost=368.69..413.32 rows=2550 width=68) <- Sort (cost=368.69..375.07 rows=2550 width=36) Sort Key: id DESC <- WindowAgg (cost=179.78..224.41 rows=2550 width=36) <- Sort (cost=179.78..186.16 rows=2550 width=4) Sort Key: id <- Seq Scan on data (cost=0.00..35.50 rows=2550 width=4) (9 rows) |
First of all, PostgreSQL has to read the data and sort by “id”. This sorted data is fed to the first aggregation, before it is passed on to the next sort step (to sort descending). The second sort passes its output to its window aggregation. Finally, the data is again sorted by id because we want the final output to be ordered by the first column. Overall our data has to be sorted three times, which is not a good thing to do.
Sorting data three times is clearly not a good idea. Maybe we can do better. Let us simply swap two columns in the SELECT clause:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
test=# explain SELECT *, array_agg(id) OVER (ORDER BY id DESC), array_agg(id) OVER (ORDER BY id) FROM data ORDER BY id; QUERY PLAN ------------------------------------------------------------------- WindowAgg (cost=368.69..413.32 rows=2550 width=68) <- Sort (cost=368.69..375.07 rows=2550 width=36) Sort Key: id <- WindowAgg (cost=179.78..224.41 rows=2550 width=36) <- Sort (cost=179.78..186.16 rows=2550 width=4) Sort Key: id DESC <- Seq Scan on data (cost=0.00..35.50 rows=2550 width=4) (7 rows) |
Wow, by making this little change we have actually managed to skip one sorting step. First of all, the data is sorted descending because we need it for the first window function. However, the next column will need data in exactly the same order as the final ORDER BY at the end of the query. PostgreSQL knows that and can already use the sorted input. If you are processing a big data set, this kind of optimization can make a huge difference and speed up your queries tremendously.
At this point PostgreSQL is not able (yet?) to make those adjustments for you so some manual improvements will definitely help. Try to adjust your window functions in a way that columns needing identical sorting are actually next to each other.
Read more about window functions in my post about SQL Trickery: Configuring Window Functions
In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Facebook or LinkedIn.
You need to load content from reCAPTCHA to submit the form. Please note that doing so will share data with third-party providers.
More InformationYou are currently viewing a placeholder content from Facebook. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More InformationYou are currently viewing a placeholder content from X. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More Information
Leave a Reply