PostgreSQL is one of the best OLTP databases (OLTP = online transaction processing) in the world. However, it can do more than just OLTP. PostgreSQL offers many additional features relevant to a more OLAP-style workload. One of those features is called “GROUPING SETS”.
Table of Contents
Before we dive into the details, I've compiled some sample data which you can easily load into your SQL database:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 |
CREATE TABLE t_sales ( country text, product_name text, year int, amount_sold numeric ); INSERT INTO t_sales VALUES ('Argentina', 'Shoes', 2020, 12), ('Argentina', 'Shoes', 2021, 14), ('Argentina', 'Hats', 2020, 54), ('Argentina', 'Hats', 2021, 57), ('Germany', 'Shoes', 2020, 34), ('Germany', 'Shoes', 2021, 29), ('Germany', 'Hats', 2020, 19), ('Germany', 'Hats', 2021, 22), ('USA', 'Shoes', 2020, 99), ('USA', 'Shoes', 2021, 103), ('USA', 'Hats', 2020, 81), ('USA', 'Hats', 2021, 90) ; |
Note that everything you are going to see in this blog is pretty SQL-standard compliant, so you can expect most of the stuff to work in other professional SQL databases as well.
Let's get started with a simple aggregation:
1 2 3 4 5 6 7 8 9 |
test=# SELECT country, sum(amount_sold) FROM t_sales GROUP BY 1; country | sum -----------+----- USA | 373 Germany | 104 Argentina | 137 (3 rows) |
There's not much to say here, apart from the fact that we will get one sum for each group. However, there's a bit of a philosophical discussion going on. “GROUP BY 1” basically means “GROUP BY country” which is the equivalent of the first column in the SELECT clause. Therefore “GROUP BY country” and “GROUP BY 1” are the same thing:
1 2 3 4 5 6 7 8 9 10 11 12 13 |
test=# SELECT country, product_name, sum(amount_sold) FROM t_sales GROUP BY 1, 2 ORDER BY 1, 2; country | product_name | sum -----------+--------------+----- Argentina | Hats | 111 Argentina | Shoes | 26 Germany | Hats | 41 Germany | Shoes | 63 USA | Hats | 171 USA | Shoes | 202 (6 rows) |
Of course, this works with more than one column as well. However, I want to point out something else. Consider the following example:
1 2 3 4 5 6 7 8 9 10 11 12 |
test=# SELECT CASE WHEN country = 'USA' THEN 'USA' ELSE 'non-US' END, sum(amount_sold) FROM t_sales GROUP BY 1; case | sum --------+----- USA | 373 non-US | 241 (2 rows) |
Most people group by a column. In some cases, it can make sense to group by an expression. In my case, we are forming groups on the fly (= one group for the US and one for non-US sales). This feature is often underappreciated. However, it is useful in many real-world scenarios. Keep in mind that all the things you are going to see also work with expressions, meaning more flexible grouping is possible.
GROUP BY will turn every distinct entry in a column into a group. Sometimes you might want to do more grouping at once. Why is that necessary? Suppose you are processing a 10 TB table. Clearly, reading this data is usually the limiting factor in terms of performance. So reading the data once and producing more results at once is appealing. That's exactly what you can do with GROUP BY GROUP SETS. Suppose we want to produce two results at once:
Here's how it works:
1 2 3 4 5 6 7 8 9 10 11 12 |
test=# SELECT country, product_name, sum(amount_sold) FROM t_sales GROUP BY GROUPING SETS ((1), (2)) ORDER BY 1, 2; country | product_name | sum -----------+--------------+----- Argentina | | 137 Germany | | 104 USA | | 373 | Hats | 323 | Shoes | 291 (5 rows) |
In this case, PostgreSQL simply appends the results. The first three lines represent “GROUP BY country”. The next two lines contain the result of “GROUP BY product_name”. Logically, it's the equivalent of the following query:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
test=# SELECT NULL AS country , product_name, sum(amount_sold) FROM t_sales GROUP BY 1, 2 UNION ALL SELECT country, NULL, sum(amount_sold) FROM t_sales GROUP BY 1, 2 ORDER BY 1, 2; country | product_name | sum -----------+--------------+----- Argentina | | 137 Germany | | 104 USA | | 373 | Hats | 323 | Shoes | 291 (5 rows) |
However, the GROUPING SETS version is ways more efficient because it only has to read the data once.
When creating reports, you will often need the “bottom line” which sums up what has been shown in the table. The way to do that in SQL is to use “GROUP BY ROLLUP”:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
test=# SELECT country, product_name, sum(amount_sold) FROM t_sales GROUP BY ROLLUP (1, 2) ORDER BY 1, 2; country | product_name | sum -----------+--------------+----- Argentina | Hats | 111 Argentina | Shoes | 26 Argentina | | 137 Germany | Hats | 41 Germany | Shoes | 63 Germany | | 104 USA | Hats | 171 USA | Shoes | 202 USA | | 373 | | 614 (10 rows) |
PostgreSQL will inject a couple of rows into the result. As you can see, “Argentina” returns 3 and not just 2 rows. The “product_name = NULL” entry was added by ROLLUP. It contains the sum of all argentinian sales (116 + 27 = 137). Additional rows are injected for both other countries. Finally, a row is added for the overall sales worldwide.
Often those NULL entries are not what people want to see, thus it can make sense to replace them with some other kind of entry. The way to do that is to use a subselect which checks for the NULL entry and does the replacement. Here's how it works:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 |
test=# SELECT CASE WHEN country IS NULL THEN 'TOTAL' ELSE country END, CASE WHEN product_name IS NULL THEN 'TOTAL' ELSE product_name END, sum FROM (SELECT country, product_name, sum(amount_sold) FROM t_sales GROUP BY ROLLUP (1, 2) ORDER BY 1, 2 ) AS x; country | product_name | sum -----------+--------------+----- Argentina | Hats | 111 Argentina | Shoes | 26 Argentina | TOTAL | 137 Germany | Hats | 41 Germany | Shoes | 63 Germany | TOTAL | 104 USA | Hats | 171 USA | Shoes | 202 USA | TOTAL | 373 TOTAL | TOTAL | 614 (10 rows) |
As you can see, all NULL entries have been replaced with “TOTAL”, which in many cases is the more desirable way to display this data.
ROLLUP is useful if you want to add the “bottom line”. However, you often want to see all combinations of countries and products. GROUP BY CUBE will do exactly that:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
test=# SELECT country, product_name, sum(amount_sold) FROM t_sales GROUP BY CUBE (1, 2) ORDER BY 1, 2; country | product_name | sum -----------+--------------+----- Argentina | Hats | 111 Argentina | Shoes | 26 Argentina | | 137 Germany | Hats | 41 Germany | Shoes | 63 Germany | | 104 USA | Hats | 171 USA | Shoes | 202 USA | | 373 | Hats | 323 | Shoes | 291 | | 614 (12 rows) |
In this case, we've got all the combinations. Technically, it's the same as: GROUP BY country + GROUP BY product_name + GROUP BY country_product_name + GROUP BY (). We could do that using more than just one statement, but doing it at once is easier - and a lot more efficient.
Again, NULL values have been added to indicate various aggregation levels. For a tutorial, check out my YouTube video about GROUP BY CUBE.
Grouping sets don’t just simply rewrite the query to turn it into a UNION ALL - there is actually specific code in the database engine to perform those aggregations.
What you will see is a “MixedAggregate” which is capable of aggregating at various levels at once. Here is an example:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
test=# explain SELECT country, product_name, sum(amount_sold) FROM t_sales GROUP BY CUBE (1, 2) ORDER BY 1, 2; QUERY PLAN ----------------------------------------------------------- Sort (cost=64.15..65.65 rows=601 width=96) Sort Key: country, product_name -> MixedAggregate (cost=0.00..36.41 rows=601 width=96) Hash Key: country, product_name Hash Key: country Hash Key: product_name Group Key: () -> Seq Scan on t_sales ... (8 rows) |
Looking at the MixedAggregate also reveals which aggregations are performed as part of the grouping set.
In general, grouping sets are a really cool feature which is often unknown or overlooked. We highly recommend making use of this awesome stuff to speed up your aggregations. It's particularly useful if you are dealing with a large data set.
If you want to know more about PostgreSQL and SQL in general, you might also like my post about “string encoding using SQL".
You need to load content from reCAPTCHA to submit the form. Please note that doing so will share data with third-party providers.
More InformationYou are currently viewing a placeholder content from Facebook. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More InformationYou are currently viewing a placeholder content from X. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More Information
Leave a Reply