PostgreSQL grouping sets: ROLLUP & CUBE

PostgreSQL is one of the best OLTP databases (OLTP = online transaction processing) in the world. However, it can do more than just OLTP. PostgreSQL offers many additional features relevant to a more OLAP-style workload. One of those features is called “GROUPING SETS”.

Table of Contents

Before we dive into the details, I've compiled some sample data which you can easily load into your SQL database:

CREATE TABLE t_sales
(
    country        text,
    product_name   text,
    year           int,
    amount_sold    numeric
);

INSERT INTO t_sales VALUES
    ('Argentina', 'Shoes', 2020, 12),
    ('Argentina', 'Shoes', 2021, 14),
    ('Argentina', 'Hats', 2020, 54),
    ('Argentina', 'Hats', 2021, 57),
    ('Germany', 'Shoes', 2020, 34),
    ('Germany', 'Shoes', 2021, 29),
    ('Germany', 'Hats', 2020, 19),
    ('Germany', 'Hats', 2021, 22),
    ('USA', 'Shoes', 2020, 99),
    ('USA', 'Shoes', 2021, 103),
    ('USA', 'Hats', 2020, 81),
    ('USA', 'Hats', 2021, 90)
;

CREATE TABLE t_sales

(

country text,

product_name text,

year int,

amount_sold numeric

);

INSERT INTO t_sales VALUES

('Argentina', 'Shoes', 2020, 12),

('Argentina', 'Shoes', 2021, 14),

('Argentina', 'Hats', 2020, 54),

('Argentina', 'Hats', 2021, 57),

('Germany', 'Shoes', 2020, 34),

('Germany', 'Shoes', 2021, 29),

('Germany', 'Hats', 2020, 19),

('Germany', 'Hats', 2021, 22),

('USA', 'Shoes', 2020, 99),

('USA', 'Shoes', 2021, 103),

('USA', 'Hats', 2020, 81),

('USA', 'Hats', 2021, 90)

;

Note that everything you are going to see in this blog is pretty SQL-standard compliant, so you can expect most of the stuff to work in other professional SQL databases as well.

Let's get started with a simple aggregation:

test=# SELECT  country, sum(amount_sold)
       FROM    t_sales
       GROUP BY 1;
 country   | sum
-----------+-----
 USA       | 373
 Germany   | 104
 Argentina | 137
(3 rows)

test=# SELECT country, sum(amount_sold)

FROM t_sales

GROUP BY 1;

country | sum

-----------+-----

USA | 373

Germany | 104

Argentina | 137

(3 rows)

There's not much to say here, apart from the fact that we will get one sum for each group. However, there's a bit of a philosophical discussion going on. “GROUP BY 1” basically means “GROUP BY country” which is the equivalent of the first column in the SELECT clause. Therefore “GROUP BY country” and “GROUP BY 1” are the same thing:

test=# SELECT  country, product_name, sum(amount_sold)
       FROM    t_sales
       GROUP BY 1, 2
       ORDER BY 1, 2;
 country   | product_name | sum
-----------+--------------+-----
 Argentina | Hats         | 111
 Argentina | Shoes        |  26
 Germany   | Hats         |  41
 Germany   | Shoes        |  63
 USA       | Hats         | 171
 USA       | Shoes        | 202
(6 rows)

test=# SELECT country, product_name, sum(amount_sold)

FROM t_sales

GROUP BY 1, 2

ORDER BY 1, 2;

country | product_name | sum

-----------+--------------+-----

Argentina | Hats | 111

Argentina | Shoes | 26

Germany | Hats | 41

Germany | Shoes | 63

USA | Hats | 171

USA | Shoes | 202

(6 rows)

Of course, this works with more than one column as well. However, I want to point out something else. Consider the following example:

test=# SELECT CASE WHEN country = 'USA'
                THEN 'USA'
                ELSE 'non-US'
              END,
              sum(amount_sold)
       FROM t_sales
       GROUP BY 1;
 case   | sum
--------+-----
 USA    | 373
 non-US | 241
(2 rows)

test=# SELECT CASE WHEN country = 'USA'

THEN 'USA'

ELSE 'non-US'

END,

sum(amount_sold)

FROM t_sales

GROUP BY 1;

case | sum

--------+-----

USA | 373

non-US | 241

(2 rows)

Most people group by a column. In some cases, it can make sense to group by an expression. In my case, we are forming groups on the fly (= one group for the US and one for non-US sales). This feature is often underappreciated. However, it is useful in many real-world scenarios. Keep in mind that all the things you are going to see also work with expressions, meaning more flexible grouping is possible.

GROUPING SETS: The basic building blocks

GROUP BY will turn every distinct entry in a column into a group. Sometimes you might want to do more grouping at once. Why is that necessary? Suppose you are processing a 10 TB table. Clearly, reading this data is usually the limiting factor in terms of performance. So reading the data once and producing more results at once is appealing. That's exactly what you can do with GROUP BY GROUP SETS. Suppose we want to produce two results at once:

GROUP BY country
GROUP BY product_name

Here's how it works:

test=# SELECT  country, product_name, sum(amount_sold)
       FROM    t_sales
       GROUP BY GROUPING SETS ((1), (2))
       ORDER BY 1, 2;
 country   | product_name | sum
-----------+--------------+-----
 Argentina |              | 137
 Germany   |              | 104
 USA       |              | 373
           | Hats         | 323
           | Shoes        | 291
(5 rows)

test=# SELECT country, product_name, sum(amount_sold)

FROM t_sales

GROUP BY GROUPING SETS ((1), (2))

ORDER BY 1, 2;

country | product_name | sum

-----------+--------------+-----

Argentina | | 137

Germany | | 104

USA | | 373

| Hats | 323

| Shoes | 291

(5 rows)

In this case, PostgreSQL simply appends the results. The first three lines represent “GROUP BY country”. The next two lines contain the result of “GROUP BY product_name”. Logically, it's the equivalent of the following query:

test=# SELECT  NULL AS country , product_name, sum(amount_sold)
       FROM    t_sales
       GROUP BY  1, 2
       UNION ALL
       SELECT  country, NULL, sum(amount_sold)
       FROM    t_sales
       GROUP BY  1, 2
       ORDER BY 1, 2;
 country   | product_name | sum
-----------+--------------+-----
 Argentina |              | 137
 Germany   |              | 104
 USA       |              | 373
           | Hats         | 323
           | Shoes        | 291
(5 rows)

test=# SELECT NULL AS country , product_name, sum(amount_sold)

FROM t_sales

GROUP BY 1, 2

UNION ALL

SELECT country, NULL, sum(amount_sold)

FROM t_sales

GROUP BY 1, 2

ORDER BY 1, 2;

country | product_name | sum

-----------+--------------+-----

Argentina | | 137

Germany | | 104

USA | | 373

| Hats | 323

| Shoes | 291

(5 rows)

However, the GROUPING SETS version is ways more efficient because it only has to read the data once.

ROLLUP: Adding the “bottom line”

When creating reports, you will often need the “bottom line” which sums up what has been shown in the table. The way to do that in SQL is to use “GROUP BY ROLLUP”:

test=# SELECT  country, product_name, sum(amount_sold)
       FROM    t_sales
       GROUP BY ROLLUP (1, 2)
       ORDER BY 1, 2;
 country   | product_name | sum
-----------+--------------+-----
 Argentina | Hats         | 111
 Argentina | Shoes        |  26
 Argentina |              | 137
 Germany   | Hats         |  41
 Germany   | Shoes        |  63
 Germany   |              | 104
 USA       | Hats         | 171
 USA       | Shoes        | 202
 USA       |              | 373
           |              | 614
(10 rows)

test=# SELECT country, product_name, sum(amount_sold)

FROM t_sales

GROUP BY ROLLUP (1, 2)

ORDER BY 1, 2;

country | product_name | sum

-----------+--------------+-----

Argentina | Hats | 111

Argentina | Shoes | 26

Argentina | | 137

Germany | Hats | 41

Germany | Shoes | 63

Germany | | 104

USA | Hats | 171

USA | Shoes | 202

USA | | 373

| | 614

(10 rows)

PostgreSQL will inject a couple of rows into the result. As you can see, “Argentina” returns 3 and not just 2 rows. The “product_name = NULL” entry was added by ROLLUP. It contains the sum of all argentinian sales (116 + 27 = 137). Additional rows are injected for both other countries. Finally, a row is added for the overall sales worldwide.

Often those NULL entries are not what people want to see, thus it can make sense to replace them with some other kind of entry. The way to do that is to use a subselect which checks for the NULL entry and does the replacement. Here's how it works:

test=# SELECT   CASE WHEN country IS NULL
                      THEN 'TOTAL' ELSE country END,
                CASE WHEN product_name IS NULL
                      THEN 'TOTAL' ELSE product_name END,
                sum
       FROM (SELECT    country, product_name, sum(amount_sold)
             FROM      t_sales
             GROUP BY ROLLUP (1, 2)
             ORDER BY 1, 2
           ) AS x;
 country   | product_name | sum
-----------+--------------+-----
 Argentina | Hats         | 111
 Argentina | Shoes        |  26
 Argentina | TOTAL        | 137
 Germany   | Hats         |  41
 Germany   | Shoes        |  63
 Germany   | TOTAL        | 104
 USA       | Hats         | 171
 USA       | Shoes        | 202
 USA       | TOTAL        | 373
 TOTAL     | TOTAL        | 614
(10 rows)

test=# SELECT CASE WHEN country IS NULL

THEN 'TOTAL' ELSE country END,

CASE WHEN product_name IS NULL

THEN 'TOTAL' ELSE product_name END,

sum

FROM (SELECT country, product_name, sum(amount_sold)

FROM t_sales

GROUP BY ROLLUP (1, 2)

ORDER BY 1, 2

) AS x;

country | product_name | sum

-----------+--------------+-----

Argentina | Hats | 111

Argentina | Shoes | 26

Argentina | TOTAL | 137

Germany | Hats | 41

Germany | Shoes | 63

Germany | TOTAL | 104

USA | Hats | 171

USA | Shoes | 202

USA | TOTAL | 373

TOTAL | TOTAL | 614

(10 rows)

As you can see, all NULL entries have been replaced with “TOTAL”, which in many cases is the more desirable way to display this data.

CUBE: Creating data cubes in PostgreSQL efficiently

ROLLUP is useful if you want to add the “bottom line”. However, you often want to see all combinations of countries and products. GROUP BY CUBE will do exactly that:

test=# SELECT  country, product_name, sum(amount_sold)
       FROM    t_sales
       GROUP BY CUBE (1, 2)
       ORDER BY 1, 2;
 country   | product_name | sum
-----------+--------------+-----
 Argentina | Hats         | 111
 Argentina | Shoes        |  26
 Argentina |              | 137
 Germany   | Hats         |  41
 Germany   | Shoes        |  63
 Germany   |              | 104
 USA       | Hats         | 171
 USA       | Shoes        | 202
 USA       |              | 373
           | Hats         | 323
           | Shoes        | 291
           |              | 614
(12 rows)

test=# SELECT country, product_name, sum(amount_sold)

FROM t_sales

GROUP BY CUBE (1, 2)

ORDER BY 1, 2;

country | product_name | sum

-----------+--------------+-----

Argentina | Hats | 111

Argentina | Shoes | 26

Argentina | | 137

Germany | Hats | 41

Germany | Shoes | 63

Germany | | 104

USA | Hats | 171

USA | Shoes | 202

USA | | 373

| Hats | 323

| Shoes | 291

| | 614

(12 rows)

In this case, we've got all the combinations. Technically, it's the same as: GROUP BY country + GROUP BY product_name + GROUP BY country_product_name + GROUP BY (). We could do that using more than just one statement, but doing it at once is easier - and a lot more efficient.

Again, NULL values have been added to indicate various aggregation levels. For a tutorial, check out my YouTube video about GROUP BY CUBE.

Grouping sets: Execution plans

Grouping sets don’t just simply rewrite the query to turn it into a UNION ALL - there is actually specific code in the database engine to perform those aggregations.

What you will see is a “MixedAggregate” which is capable of aggregating at various levels at once. Here is an example:

test=# explain SELECT country, product_name, sum(amount_sold)
           FROM t_sales
           GROUP BY CUBE (1, 2)
           ORDER BY 1, 2;
                          QUERY PLAN
-----------------------------------------------------------
 Sort  (cost=64.15..65.65 rows=601 width=96)
   Sort Key: country, product_name
   ->  MixedAggregate  (cost=0.00..36.41 rows=601 width=96)
     Hash Key: country, product_name
     Hash Key: country
     Hash Key: product_name
     Group Key: ()
     ->  Seq Scan on t_sales ...
(8 rows)

test=# explain SELECT country, product_name, sum(amount_sold)

FROM t_sales

GROUP BY CUBE (1, 2)

ORDER BY 1, 2;

QUERY PLAN

-----------------------------------------------------------

Sort (cost=64.15..65.65 rows=601 width=96)

Sort Key: country, product_name

-> MixedAggregate (cost=0.00..36.41 rows=601 width=96)

Hash Key: country, product_name

Hash Key: country

Hash Key: product_name

Group Key: ()

-> Seq Scan on t_sales ...

(8 rows)

Looking at the MixedAggregate also reveals which aggregations are performed as part of the grouping set.

Finally ...

In general, grouping sets are a really cool feature which is often unknown or overlooked. We highly recommend making use of this awesome stuff to speed up your aggregations. It's particularly useful if you are dealing with a large data set.

If you want to know more about PostgreSQL and SQL in general, you might also like my post about “string encoding using SQL".

PostgreSQL grouping sets: ROLLUP & CUBE

GROUPING SETS: The basic building blocks

ROLLUP: Adding the “bottom line”

CUBE: Creating data cubes in PostgreSQL efficiently

Grouping sets: Execution plans

Finally ...

Leave a Reply Cancel reply

Hans-Jürgen Schönig

Blog Tags

NEWSLETTER

Articles by our PostgreSQL Experts