In my previous post on PostgreSQL 9.4 I wrote about aggregate FILTER clauses, which are a neat way to make partial aggregates more readable. Inspired by some comments on that post, I decided to write a follow-up to see what impact the new FILTER clause has on performance.
To see the performance impact people can expect, here is some demo data:
```sql
test=# BEGIN;
BEGIN
test=# CREATE TABLE t_test (id int, dummy char(200));
CREATE TABLE
test=# INSERT INTO t_test
       SELECT *, 'dummy'
       FROM generate_series(1, 20000000) AS x;
INSERT 0 20000000
test=# COMMIT;
COMMIT
```
The data set is roughly 4.5 GB, which still fits nicely into memory on the test box:
```sql
test=# SELECT pg_size_pretty(pg_relation_size('t_test'));
 pg_size_pretty
----------------
 4596 MB
(1 row)
```
For years I have used the classical approach. The trick is that aggregate functions ignore NULL values, so a partial aggregate can be computed by mapping unwanted rows to NULL on the fly. The code is as already shown in the previous post:
```sql
test=# SELECT count(CASE WHEN id % 2 = 0 THEN 1 ELSE NULL END) AS even,
              count(CASE WHEN id % 2 = 1 THEN 1 ELSE NULL END) AS odd
       FROM   t_test;
   even   |   odd
----------+----------
 10000000 | 10000000
(1 row)

Time: 6208.092 ms
```
In this example the aggregation takes slightly over 6 seconds (with all hint bits set, etc.). This is not too bad.
Let us see now what happens when the new approach is used:
```sql
SELECT count(id) FILTER (WHERE id % 2 = 0) AS even,
       count(id) FILTER (WHERE id % 2 = 1) AS odd
FROM   t_test;
   even   |   odd
----------+----------
 10000000 | 10000000
(1 row)

Time: 5353.172 ms
```
The query is almost one second faster than the original one. The overhead of the new FILTER clause is simply far smaller than that of the CASE expressions in the original version of the query. Of course, the more clauses you add, the larger the difference will become.
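To illustrate how this scales, here is a sketch of a query with more FILTER clauses; the extra buckets are hypothetical and were not part of the benchmark above, but each of them would require a full CASE expression in the classical approach:

```sql
-- Sketch only: additional buckets, not timed in this benchmark.
SELECT count(id) FILTER (WHERE id % 2 = 0)  AS even,
       count(id) FILTER (WHERE id % 2 = 1)  AS odd,
       count(id) FILTER (WHERE id % 5 = 0)  AS div_by_5,
       count(id) FILTER (WHERE id % 10 = 0) AS div_by_10
FROM   t_test;
```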
Before a more complex query is fired I decided to increase work_mem to make sure that all aggregations can happen in memory:
```sql
test=# SET work_mem TO '1 GB';
SET
Time: 0.375 ms
```
This time an additional GROUP BY clause is added to split the data set into smaller chunks, since most real-world queries using FILTER will contain a GROUP BY anyway:
```sql
test=# SELECT id % 3 AS g,
              count(CASE WHEN id % 2 = 0 THEN 1 ELSE NULL END) AS even,
              count(CASE WHEN id % 2 = 1 THEN 1 ELSE NULL END) AS odd
       FROM   t_test
       GROUP  BY 1;
 g |  even   |   odd
---+---------+---------
 0 | 3333333 | 3333333
 1 | 3333333 | 3333334
 2 | 3333334 | 3333333
(3 rows)

Time: 22119.977 ms
```
The result is computed in roughly 22 seconds, which is quite acceptable.
The next query shows the same thing using the new syntax:
```sql
SELECT id % 3 AS g,
       count(id) FILTER (WHERE id % 2 = 0) AS even,
       count(id) FILTER (WHERE id % 2 = 1) AS odd
FROM   t_test
GROUP  BY 1;
 g |  even   |   odd
---+---------+---------
 0 | 3333333 | 3333333
 1 | 3333333 | 3333334
 2 | 3333334 | 3333333
(3 rows)

Time: 21379.657 ms
```
The important point is that the performance difference is again around one second. This is expected: the savings provided by the FILTER clause are independent of the grouping around it.
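As a side note, FILTER is not restricted to count(); it works with any aggregate function. Here is a small sketch against the same demo table (these queries were not part of the measurements above):

```sql
-- Sketch: FILTER combined with other aggregates on the demo table.
SELECT id % 3 AS g,
       sum(id) FILTER (WHERE id % 2 = 0) AS even_sum,
       avg(id) FILTER (WHERE id % 2 = 1) AS odd_avg
FROM   t_test
GROUP  BY 1;
```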
My motivation was SQL code clarity, with performance as a nice-to-have that can be improved later.