UPDATED July 2023: Window functions and analytics have been around for quite some time and many people already make use of this awesome stuff in the PostgreSQL world. Timeseries are an especially important area in this context. However, not all features have been widely adopted and thus many developers have to implement functionality at the application level in a painful way instead of just using some of the more advanced SQL techniques.
Table of Contents
The idea of this blog is to demonstrate some of the advanced analytics so that more people out there can make use of PostgreSQL's true power.
For the purpose of this post I have created a basic data set:
1 2 3 4 5 |
test=# CREATE TABLE t_demo AS SELECT ordinality, day, date_part('week', day) AS week FROM generate_series('2020-01-02', '2020-01-15', '1 day'::interval) WITH ORDINALITY AS day; SELECT 14 |
In PostgreSQL, the generate_series
function will return one row each day spanning January 2nd, 2020 to January 15th, 2020. The WITH ORDINALITY clause tells PostgreSQL to add an “id” column to the result set of the function. The date_part
function will extract the number of the week out of our date. The purpose of this column is to have a couple of identical values in our timeseries.
In the next list you see the data set we'll use:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
test=# SELECT * FROM t_demo; ordinality | day | week ------------+------------------------+------ 1 | 2020-01-02 00:00:00+01 | 1 2 | 2020-01-03 00:00:00+01 | 1 3 | 2020-01-04 00:00:00+01 | 1 4 | 2020-01-05 00:00:00+01 | 1 5 | 2020-01-06 00:00:00+01 | 2 6 | 2020-01-07 00:00:00+01 | 2 7 | 2020-01-08 00:00:00+01 | 2 8 | 2020-01-09 00:00:00+01 | 2 9 | 2020-01-10 00:00:00+01 | 2 10 | 2020-01-11 00:00:00+01 | 2 11 | 2020-01-12 00:00:00+01 | 2 12 | 2020-01-13 00:00:00+01 | 3 13 | 2020-01-14 00:00:00+01 | 3 14 | 2020-01-15 00:00:00+01 | 3 (14 rows) |
One of the things you often have to do is to use a sliding window. In SQL this can easily be achieved using the OVER clause. Here is an example:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 |
test=# SELECT *, array_agg(ordinality) OVER (ORDER BY day ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING), avg(ordinality) OVER (ORDER BY day ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) FROM t_demo; ordinality | day | week | array_agg | avg ------------+------------------------+------+------------+--------------------- 1 | 2020-01-02 00:00:00+01 | 1 | {1,2} | 1.5000000000000000 2 | 2020-01-03 00:00:00+01 | 1 | {1,2,3} | 2.0000000000000000 3 | 2020-01-04 00:00:00+01 | 1 | {2,3,4} | 3.0000000000000000 4 | 2020-01-05 00:00:00+01 | 1 | {3,4,5} | 4.0000000000000000 5 | 2020-01-06 00:00:00+01 | 2 | {4,5,6} | 5.0000000000000000 6 | 2020-01-07 00:00:00+01 | 2 | {5,6,7} | 6.0000000000000000 7 | 2020-01-08 00:00:00+01 | 2 | {6,7,8} | 7.0000000000000000 8 | 2020-01-09 00:00:00+01 | 2 | {7,8,9} | 8.0000000000000000 9 | 2020-01-10 00:00:00+01 | 2 | {8,9,10} | 9.0000000000000000 10 | 2020-01-11 00:00:00+01 | 2 | {9,10,11} | 10.0000000000000000 11 | 2020-01-12 00:00:00+01 | 2 | {10,11,12} | 11.0000000000000000 12 | 2020-01-13 00:00:00+01 | 3 | {11,12,13} | 12.0000000000000000 13 | 2020-01-14 00:00:00+01 | 3 | {12,13,14} | 13.0000000000000000 14 | 2020-01-15 00:00:00+01 | 3 | {13,14} | 13.5000000000000000 (14 rows) |
array_agg
The OVER clause allows you to feed the data to the aggregate function. For the sake of simplicity, I have used the array_agg
function, which simply returns the data fed to the aggregate as an array. In a real-life scenario you would use something more common such as the avg, sum, min, max, or any other aggregation function. However, array_agg
is pretty useful, in that it shows which data is really passed to the function, and what values we have in use. ROWS BETWEEN … PRECEDING AND 1 … FOLLOWING
tells the system that we want to use 3 rows: The previous, the current one, as well as the one after the current row. At the beginning of the list, there is no previous row, so we will only see two values in the array column. At the end of the data set, there are also only two rows, because there are no more values after the last one.
In some cases, it can be useful to exclude the current row from the data passed to the aggregate. The EXCLUDE CURRENT ROW
clause has been developed to do exactly that. Here is an example:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 |
test=# SELECT *, array_agg(ordinality) OVER (ORDER BY day ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING EXCLUDE CURRENT ROW) FROM t_demo; ordinality | day | week | array_agg ------------+------------------------+------+----------- 1 | 2020-01-02 00:00:00+01 | 1 | {2} 2 | 2020-01-03 00:00:00+01 | 1 | {1,3} 3 | 2020-01-04 00:00:00+01 | 1 | {2,4} 4 | 2020-01-05 00:00:00+01 | 1 | {3,5} 5 | 2020-01-06 00:00:00+01 | 2 | {4,6} 6 | 2020-01-07 00:00:00+01 | 2 | {5,7} 7 | 2020-01-08 00:00:00+01 | 2 | {6,8} 8 | 2020-01-09 00:00:00+01 | 2 | {7,9} 9 | 2020-01-10 00:00:00+01 | 2 | {8,10} 10 | 2020-01-11 00:00:00+01 | 2 | {9,11} 11 | 2020-01-12 00:00:00+01 | 2 | {10,12} 12 | 2020-01-13 00:00:00+01 | 3 | {11,13} 13 | 2020-01-14 00:00:00+01 | 3 | {12,14} 14 | 2020-01-15 00:00:00+01 | 3 | {13} (14 rows) |
As you can see, the array is a little shorter now. The current value is not part of the array anymore.
The idea behind EXCLUDE TIES is to remove duplicates. Let's take a look at the following to make things clear:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 |
test=# SELECT day, week, array_agg(week) OVER (ORDER BY week ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING) AS all, array_agg(week) OVER (ORDER BY week ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING EXCLUDE TIES) AS ties FROM t_demo; day | week | all | ties ------------------------+------+-------------+--------- 2020-01-02 00:00:00+01 | 1 | {1,1,1} | {1} 2020-01-03 00:00:00+01 | 1 | {1,1,1,1} | {1} 2020-01-04 00:00:00+01 | 1 | {1,1,1,1,2} | {1,2} 2020-01-05 00:00:00+01 | 1 | {1,1,1,2,2} | {1,2,2} 2020-01-06 00:00:00+01 | 2 | {1,1,2,2,2} | {1,1,2} 2020-01-07 00:00:00+01 | 2 | {1,2,2,2,2} | {1,2} 2020-01-08 00:00:00+01 | 2 | {2,2,2,2,2} | {2} 2020-01-09 00:00:00+01 | 2 | {2,2,2,2,2} | {2} 2020-01-10 00:00:00+01 | 2 | {2,2,2,2,2} | {2} 2020-01-11 00:00:00+01 | 2 | {2,2,2,2,3} | {2,3} 2020-01-12 00:00:00+01 | 2 | {2,2,2,3,3} | {2,3,3} 2020-01-13 00:00:00+01 | 3 | {2,2,3,3,3} | {2,2,3} 2020-01-14 00:00:00+01 | 3 | {2,3,3,3} | {2,3} 2020-01-15 00:00:00+01 | 3 | {3,3,3} | {3} (14 rows) |
The first array_agg simply collects all values in the frame we have defined. The “ties” column is a bit more complicated to understand: Let's take a look at the 5th of January. The result says (1, 2, 2). As you can see, two incarnations of 1 have been removed. EXCLUDE TIES made sure that those duplicates are gone. However, this has no impact on those values in the “future”, because the future values are different from the current row. The documentation states what EXCLUDE TIES is all about: “EXCLUDE TIES excludes any peers of the current row from the frame, but not the current row itself.”
After this first example, we should also consider a fairly common mistake:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 |
test=# SELECT day, week, array_agg(week) OVER (ORDER BY week, day ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING EXCLUDE TIES) AS ties FROM t_demo; day | week | ties ------------------------+------+------------- 2020-01-02 00:00:00+01 | 1 | {1,1,1} 2020-01-03 00:00:00+01 | 1 | {1,1,1,1} 2020-01-04 00:00:00+01 | 1 | {1,1,1,1,2} 2020-01-05 00:00:00+01 | 1 | {1,1,1,2,2} 2020-01-06 00:00:00+01 | 2 | {1,1,2,2,2} 2020-01-07 00:00:00+01 | 2 | {1,2,2,2,2} 2020-01-08 00:00:00+01 | 2 | {2,2,2,2,2} 2020-01-09 00:00:00+01 | 2 | {2,2,2,2,2} 2020-01-10 00:00:00+01 | 2 | {2,2,2,2,2} 2020-01-11 00:00:00+01 | 2 | {2,2,2,2,3} 2020-01-12 00:00:00+01 | 2 | {2,2,2,3,3} 2020-01-13 00:00:00+01 | 3 | {2,2,3,3,3} 2020-01-14 00:00:00+01 | 3 | {2,3,3,3} 2020-01-15 00:00:00+01 | 3 | {3,3,3} (14 rows) |
Can you spot the difference between this and the previous example? Take all the time you need …
The problem is that the ORDER BY clause has two columns in this case. That means that there are no duplicates anymore from the ORDER BY's perspective. Thus, PostgreSQL is not going to prune values from the result set. I can assure you that this is a common mistake seen in many cases. The problem can be very subtle and go unnoticed for quite some time.
In some cases you might want to remove an entire set of rows from the result set. To do that, you can make use of EXCLUDE GROUP.
The following example shows how that works, and how our timeseries data can be analyzed:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 |
test=# SELECT *, array_agg(week) OVER (ORDER BY week ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING EXCLUDE GROUP) AS week, array_agg(week) OVER (ORDER BY day ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING EXCLUDE GROUP) AS all FROM t_demo; ordinality | day | week | week | all ------------+------------------------+------+-------+----------- 1 | 2020-01-02 00:00:00+01 | 1 | | {1,1} 2 | 2020-01-03 00:00:00+01 | 1 | | {1,1,1} 3 | 2020-01-04 00:00:00+01 | 1 | {2} | {1,1,1,2} 4 | 2020-01-05 00:00:00+01 | 1 | {2,2} | {1,1,2,2} 5 | 2020-01-06 00:00:00+01 | 2 | {1,1} | {1,1,2,2} 6 | 2020-01-07 00:00:00+01 | 2 | {1} | {1,2,2,2} 7 | 2020-01-08 00:00:00+01 | 2 | | {2,2,2,2} 8 | 2020-01-09 00:00:00+01 | 2 | | {2,2,2,2} 9 | 2020-01-10 00:00:00+01 | 2 | | {2,2,2,2} 10 | 2020-01-11 00:00:00+01 | 2 | {3} | {2,2,2,3} 11 | 2020-01-12 00:00:00+01 | 2 | {3,3} | {2,2,3,3} 12 | 2020-01-13 00:00:00+01 | 3 | {2,2} | {2,2,3,3} 13 | 2020-01-14 00:00:00+01 | 3 | {2} | {2,3,3} 14 | 2020-01-15 00:00:00+01 | 3 | | {3,3} (14 rows) |
The first aggregation function does ORDER BY week and the array_agg will aggregate on the very same column. In this case, we will see a couple of NULL columns, because in some cases all entries within the frame are simply identical. The last row in the result set is a good example: All array entries are 3 and the current row contains 3 as well. Thus all incarnations of 3 are simply removed, which leaves us with an empty array.
To calculate the last column, the data is ordered by day. In this case, there are no duplicates and therefore no data can be removed. For more information on EXCLUDE GROUP, see the documentation on window functions.
In PostgreSQL, an aggregate function is capable of handling DISTINCT. The following example shows what I mean:
1 2 3 4 5 6 7 8 9 10 11 12 13 |
test=# SELECT DISTINCT week FROM t_demo; week ------ 3 1 2 (3 rows) test=# SELECT count(DISTINCT week) FROM t_demo; count ------- 3 (1 row) |
“count” does not simply count all columns, but filters the duplicates beforehand. Therefore the result is simply 3.
In PostgreSQL there is no way to use DISTINCT as part of a window function. PostgreSQL will simply error out:
1 2 3 4 5 6 |
test=# SELECT *, array_agg(DISTINCT week) OVER (ORDER BY day ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING) FROM t_demo; ERROR: DISTINCT is not implemented for window functions LINE 2: array_agg(DISTINCT week) OVER (ORDER BY day ROWS |
The natural question which arises is: How can we achieve the same result without using DISTINCT inside the window function? What you have to do is to filter the duplicates on a higher level. You can use a subselect, unroll the array, remove the duplicates and assemble the array again. It is not hard to do, but it is not as elegant as one might expect.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 |
test=# SELECT *, (SELECT array_agg(DISTINCT unnest) FROM unnest(x)) AS b FROM ( SELECT *, array_agg(week) OVER (ORDER BY day ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING) AS x FROM t_demo ) AS a; ordinality | day | week | x | b ------------+------------------------+------+-------------+------- 1 | 2020-01-02 00:00:00+01 | 1 | {1,1,1} | {1} 2 | 2020-01-03 00:00:00+01 | 1 | {1,1,1,1} | {1} 3 | 2020-01-04 00:00:00+01 | 1 | {1,1,1,1,2} | {1,2} 4 | 2020-01-05 00:00:00+01 | 1 | {1,1,1,2,2} | {1,2} 5 | 2020-01-06 00:00:00+01 | 2 | {1,1,2,2,2} | {1,2} 6 | 2020-01-07 00:00:00+01 | 2 | {1,2,2,2,2} | {1,2} 7 | 2020-01-08 00:00:00+01 | 2 | {2,2,2,2,2} | {2} 8 | 2020-01-09 00:00:00+01 | 2 | {2,2,2,2,2} | {2} 9 | 2020-01-10 00:00:00+01 | 2 | {2,2,2,2,2} | {2} 10 | 2020-01-11 00:00:00+01 | 2 | {2,2,2,2,3} | {2,3} 11 | 2020-01-12 00:00:00+01 | 2 | {2,2,2,3,3} | {2,3} 12 | 2020-01-13 00:00:00+01 | 3 | {2,2,3,3,3} | {2,3} 13 | 2020-01-14 00:00:00+01 | 3 | {2,3,3,3} | {2,3} 14 | 2020-01-15 00:00:00+01 | 3 | {3,3,3} | {3} (14 rows) |
In case you are interested in timeseries and aggregation in general, consider checking out some of our other blog posts including “Speeding up count(*)” by Laurenz Albe. If you are interested in high-performance PostgreSQL check out my blog post about finding slow queries.
In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Facebook or LinkedIn.
+43 (0) 2622 93022-0
office@cybertec.at
You are currently viewing a placeholder content from Facebook. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More InformationYou are currently viewing a placeholder content from X. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More Information
Leave a Reply