SQL is a must, if you want to be a Data Analyst or a Data Scientist. However, every once in a while people wonder why a result is the way it is. While on the road in Berlin (Germany) the other day, I found a fairly interesting window function scenario which is pretty counter-intuitive to most people, and which might be worth sharing.
Table of Contents
PostgreSQL has provided window functions and analytics for quite some time now and this vital feature has been widely adopted by users, who are using PostgreSQL or SQL in general for more than just trivial queries. A modern database is just so much more than a simple data store and window functions are therefore certainly something to look into.
1 2 3 4 5 6 7 8 9 10 11 |
test=# SELECT *, first_value(x) OVER (ORDER BY x) FROM generate_series(1, 5) AS x; x | first_value ---+------------- 1 | 1 2 | 1 3 | 1 4 | 1 5 | 1 (5 rows) |
What we want, is the first value in our data set. The ORDER BY
clause will ensure that data is fed to first_value in the right order. The result is therefore not surprising.
DESC
to our ORDER BY
. The result is totally obvious:
1 2 3 4 5 6 7 8 9 10 11 12 |
test=# SELECT *, first_value(x) OVER (ORDER BY x), first_value(x) OVER (ORDER BY x DESC) FROM generate_series(1, 5) AS x; x | first_value | first_value ---+-------------+------------- 5 | 1 | 5 4 | 1 | 5 3 | 1 | 5 2 | 1 | 5 1 | 1 | 5 (5 rows) |
However, what if we use last_value. Here is an example:
1 2 3 4 5 6 7 8 9 10 11 12 |
test=# SELECT *, last_value(x) OVER (ORDER BY x), last_value(x) OVER (ORDER BY x DESC) FROM generate_series(1, 5) AS x; x | last_value | last_value ---+------------+------------ 5 | 5 | 5 4 | 4 | 4 3 | 3 | 3 2 | 2 | 2 1 | 1 | 1 (5 rows) |
What you can see here is that both columns will return the SAME data – regardless of the different sort order provided by the ORDER BY clause. That comes as a surprise to most people. Actually most people would accept one column to contain only “5” and the other column to contain only “1”.
1 2 3 4 5 6 7 8 9 10 11 12 |
test=# SELECT *, array_agg(x) OVER (ORDER BY x), array_agg(x) OVER (ORDER BY x DESC) FROM generate_series(1, 5) AS x; x | array_agg | array_agg ---+-------------+------------- 5 | {1,2,3,4,5} | {5} 4 | {1,2,3,4} | {5,4} 3 | {1,2,3} | {5,4,3} 2 | {1,2} | {5,4,3,2} 1 | {1} | {5,4,3,2,1} (5 rows) |
Let us take a look at which values last_value will actually see: array_agg will simply put them all into an array so that we can expect things in detail. As you can see, the last value in the array is identical in both cases, which means that both columns will produce exactly the identical output.
You need to load content from reCAPTCHA to submit the form. Please note that doing so will share data with third-party providers.
More InformationYou are currently viewing a placeholder content from Facebook. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More InformationYou are currently viewing a placeholder content from X. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More Information
Funny enough:
SELECT *,
last_value(x) OVER (ORDER BY x DESC),
last_value(x) OVER (ORDER BY x)
FROM generate_series(1, 5) AS x;
would result with:
x | last_value | last_value
--- ------------ ------------
1 | 1 | 1
2 | 2 | 2
3 | 3 | 3
4 | 4 | 4
5 | 5 | 5
(5 rows)
You should mention the frame_clause, and how it defaults to RANGE UNBOUNDED PRECEDING, which is the same as RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. The reason that this is the default is to facilitate running totals and the like, which are very common uses for windowing functions.
Making the default RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING would have been far less useful and made more common cases require far more verbose SQL.
Hans, this is a great article and what you explain is "spot on," which is great! However, Colin's response is EXACTLY the "reason" behind why postgres behaves this way. I wish the default was "UNBOUND PRECEDING AND UNBOUNDED FOLLOWING" because having windows functions "choose" default filtering is actually more confusing, IMO.
The postgres team made a decision based on what their experience has been instead of sticking to the facts that a window function is like a "join" to table itself and in other "JOINs," the table/view is not automatically filtered. I cannot think of the exact scenario, but I feel like this would be similar to filtering NULLs in a LEFT JOIN (because it is a "common" thing to do).
No disrespect to the postgres team as they have made an amazing product that I will continue to use for a long time, but this decision to implement a pattern that by default excludes rows from a table/view is certainly confusing because other patterns do not seem to go down this path.
In any case, this article lead me to the explanation of my exact problem (i.e. attempting to find the latest appointment date for a client using windows functions).
For that, I applaud you both!!
This ia good tutorial. Thanks. But i have a different situation as follows;
I want to select and get Aquila with 20 as the last value under the marks colum and omit the null rows.
How do i archive this? Please help.
My table looks like this.
.....................................................................
Student --------------Mark
.....................................................................
1. John ----------------25
2. Peter ---------------30
3. Jack -------------- Null
4. Aquila --------------20
5. Judy ----------------null
6. Kessy --------------null