GROUP BY is nothing new and is available in any relational database I am aware of. It is an integral part of SQL and PostgreSQL but what many people might not know is the fact that the GROUP BY expression can do more than just group by simple fields. You can use expressions to group in an even more sophisticated way and here is how:
Table of Contents
To show you what you can do with GROUP BY I have compiled some test data:
1 2 3 4 5 6 7 8 |
test=# CREATE TABLE t_oil ( region text, country text, year int, production int, consumption int ); CREATE TABLE |
This data set is pretty easy to understand. It tells us how much oil was produced and consumed by which country in the past. To load the data you can either download the file or simply ask COPY to load it directly from the web.
1 2 3 |
test=# COPY t_oil FROM PROGRAM 'curl /secret/oil_ext.txt'; COPY 644 |
To use COPY … FROM PROGRAM you have to be superuser otherwise it does not work for security reasons. In my example 644 rows have been loaded successfully.
Using GROUP BY is pretty simple. If we want to make some kind of analysis for each group we basically got two choices:
1 2 3 4 5 6 |
test=# SELECT region, avg(production) FROM t_oil GROUP BY 1; region | avg ---------------+----------------------- North America | 4541.3623188405797101 Middle East | 1992.6036866359447005 (2 rows) |
“GROUP BY 1” basically means the same as the following query:
1 2 3 4 5 6 |
test=# SELECT region, avg(production) FROM t_oil GROUP BY region; region | avg ---------------+----------------------- North America | 4541.3623188405797101 Middle East | 1992.6036866359447005 (2 rows) |
“GROUP BY region” and “GROUP BY 1” are therefore identical. It is a question of faith which type of syntax you prefer. People have told me once than once that one or the other syntax is “evil”. In reality it makes absolutely no difference. It is just syntactic sugar.
Most people group by one or more columns or no column at all:
1 2 3 4 5 |
test=# SELECT avg(production) FROM t_oil WHERE country = 'USA'; avg ----------------------- 9141.3478260869565217 (1 row) |
However, there is more: You can also use an expression to determine the groups on the fly.
1 2 3 4 5 6 7 8 9 |
test=# SELECT production > 9000, count(production) FROM t_oil WHERE country = 'USA' GROUP BY production > 9000; ?column? | count ----------+------- f | 20 t | 26 (2 rows) |
In this case we got two groups: One group is for rows greater than 9000 and one for rows lower or equal than 9000. The name of those two groups is therefore “true” (> 9000) or “false” (<= 9000). You can use any expression to calculate those groups on your own. Consider the next example:
1 2 3 4 5 6 7 8 9 |
test=# SELECT count(production) FROM t_oil WHERE country = 'USA' GROUP BY CASE WHEN year % 2 = 0 THEN true ELSE false END; count ------- 23 23 (2 rows) |
In this example we are counting odd and even years. Note that it is not necessary to list the grouping criterial in the SELECT clause. You might not understand the result if you miss half of the column list but you are free to do that. What is also interesting is that you can use a full SQL query in the GROUP BY clause.
1 2 3 4 5 6 7 8 |
test=# SELECT count(production) FROM t_oil WHERE country = 'USA' GROUP BY (SELECT CASE WHEN year % 2 = 0 THEN true ELSE false END); count ------- 23 23 (2 rows) |
If you run those queries it makes sense to take a look at the execution plan. As you can see the plan is just like any other GROUP BY statement. It is also noteworthy that PostgreSQL does a really good job to estimate the number of groups:
1 2 3 4 5 6 7 8 9 10 11 12 13 |
test=# explain SELECT count(production) FROM t_oil WHERE country = 'USA' GROUP BY (SELECT CASE WHEN year % 2 = 0 THEN true ELSE false END); QUERY PLAN ------------------------------------------------------------- HashAggregate (cost=14.97..15.02 rows=2 width=9) Group Key: (SubPlan 1) -> Seq Scan on t_oil (cost=0.00..14.74 rows=46 width=5) Filter: (country = 'USA'::text) SubPlan 1 -> Result (cost=0.00..0.01 rows=1 width=1) (6 rows) |
PostgreSQL successfully figured out that two groups are to be expected.
HAVING clauses done right
Recently people have asked us about aliases in the HAVING clause: Can one use aliases in a HAVING clause?
1 2 3 4 5 6 7 |
test=# SELECT count(production) AS x FROM t_oil WHERE country = 'USA' GROUP BY year < 1990 HAVING x > 22; ERROR: column 'x' does not exist LINE 1: ...il WHERE country = 'USA' GROUP BY year < 1990 HAVING x > 22; ^ |
The answer is simply no. SQL does not allow that. I remember seeing people who actually wanted to implement this feature in PostgreSQL over the years but such a feature is not going to be supported.
If you want to use a HAVING clause it is necessary to explicitly use the entire expression:
1 2 3 4 5 6 7 8 |
test=# SELECT count(production) AS x FROM t_oil WHERE country = 'USA' GROUP BY year < 1990 HAVING count(production) > 22; x ---- 25 (1 row) |
1 2 3 4 5 6 7 8 9 |
test=# SELECT count(production) AS x FROM t_oil WHERE country = 'USA' GROUP BY year < 1990 HAVING avg(production) > 0; x ---- 21 25 (2 rows) |
As you can see we use “count” in the SELECT clause, but we use “avg” in the HAVING clause. This is definitely possible and in some cases really useful. You can use different aggregate functions in GROUP BY and HAVING clauses without any problems.
So far you have seen what GROUP BY can do for you in the simple case butthere is more. I don't want to go into too much detail in this blog but just one thing: You can do more than just one aggregation at the same time to speed up your query. Here is an example:
1 2 3 4 5 6 7 8 9 10 |
test=# SELECT year < 1990, count(production) AS x FROM t_oil WHERE country = 'USA' GROUP BY GROUPING SETS ((year < 1990), ()); ?column? | x ----------+---- | 46 f | 21 t | 25 (3 rows) |
In this case we got two grouping operations: One containing all rows and one for before respectively after 1990.
If you don't like this syntax you can also try ROLLUP which is equivalent to the query you have just seen:
1 2 3 4 5 6 7 8 9 10 |
test=# SELECT year < 1990, count(production) AS x FROM t_oil WHERE country = 'USA' GROUP BY ROLLUP (year < 1990); ?column? | x ----------+---- | 46 f | 21 t | 25 (3 rows) |
ROLLUP is basically the “bottom line” - it adds one additional row counting everything.
If you want to find out more about grouping in general I recommend checking out my post about speeding up GROUP BY in general.
In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Facebook or LinkedIn.
You need to load content from reCAPTCHA to submit the form. Please note that doing so will share data with third-party providers.
More InformationYou are currently viewing a placeholder content from Facebook. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More InformationYou are currently viewing a placeholder content from X. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More Information
Leave a Reply