Recently we had some clients who had the desire to store timeseries in PostgreSQL. One of the questions which interested them is related to calculating the difference between values in timeseries data. How can you calculate the difference between the current and the previous row?
To answer this question I have decided to share some simple queries outlining what can be done. Note that this is not a complete tutorial about analytics and windowing functions but just a short introduction to what can be done in general.
Table of Contents
Let us load some sample data:
1 2 3 4 5 6 7 8 9 10 11 12 |
cypex=# CREATE TABLE t_oil ( region text, country text, year int, production int, consumption int ); cypex=# COPY t_oil FROM PROGRAM 'curl /secret/oil_ext.txt'; COPY 644 |
If you are a superuser you can use COPY ... FROM PROGRAM directly. Otherwise you have to load the text file in a different way.
If you want to calculate the difference between two rows, you can make use of the “lag” function. However, there is a question naturally arising here: If we want to access the previous row. What is the previous row? We will need some kind of order. To do that SQL provides the OVER-clause:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
cypex=# SELECT country, year, production, lag(production, 1) OVER (ORDER BY year) FROM t_oil WHERE country = 'USA' LIMIT 10; country | year | production | lag ---------+------+------------+------- USA | 1965 | 9014 | USA | 1966 | 9579 | 9014 USA | 1967 | 10219 | 9579 USA | 1968 | 10600 | 10219 USA | 1969 | 10828 | 10600 USA | 1970 | 11297 | 10828 USA | 1971 | 11156 | 11297 USA | 1972 | 11185 | 11156 USA | 1973 | 10946 | 11185 USA | 1974 | 10461 | 10946 (10 rows) |
In my example I have ordered the data by year to make sure that the previous year can indeed be found in the previous row.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
cypex=# SELECT country, year, production - lag(production, 1) OVER (ORDER BY year) AS diff FROM t_oil WHERE country = 'USA' LIMIT 10; country | year | diff ---------+------+------ USA | 1965 | USA | 1966 | 565 USA | 1967 | 640 USA | 1968 | 381 USA | 1969 | 228 USA | 1970 | 469 USA | 1971 | -141 USA | 1972 | 29 USA | 1973 | -239 USA | 1974 | -485 (10 rows) |
What is important to see here is that the first row contains a NULL entry because there is no known difference to the previous row.
What many people need is the difference between the current and the first row in the data set. PostgreSQL (or ANSI SQL in general to be more precise) offers the “first_value” function which returns the first row given the order provided by us. Here is how it works:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |
cypex=# SELECT t_oil.country, t_oil.year, t_oil.production, t_oil.production - first_value(t_oil.production) OVER (ORDER BY t_oil.year) AS diff_first FROM t_oil WHERE t_oil.country = 'USA'::text LIMIT 10; country | year | production | diff_first ---------+------+------------+------------ USA | 1965 | 9014 | 0 USA | 1966 | 9579 | 565 USA | 1967 | 10219 | 1205 USA | 1968 | 10600 | 1586 USA | 1969 | 10828 | 1814 USA | 1970 | 11297 | 2283 USA | 1971 | 11156 | 2142 USA | 1972 | 11185 | 2171 USA | 1973 | 10946 | 1932 USA | 1974 | 10461 | 1447 (10 rows) |
As you can see, in this case everything is relative to the first row.
The data looks correct so we can move on to the next examples.
But what happens if we start to look at two countries? If we order by year we might hit the wrong row. If we order by both columns we might still hit a row associated to a different country. The solution is the PARTITION BY clause. PostgreSQL will break up the data in various groups and calculate the difference again (for each group). Here is how it works:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
cypex=# SELECT country, year, production, lag(production) OVER (PARTITION BY country ORDER BY year) AS diff FROM t_oil WHERE country IN ('Canada', 'Mexico') AND year < 1970; country | year | production | diff ---------+------+------------+------ Canada | 1965 | 920 | Canada | 1966 | 1012 | 920 Canada | 1967 | 1106 | 1012 Canada | 1968 | 1194 | 1106 Canada | 1969 | 1306 | 1194 Mexico | 1965 | 362 | Mexico | 1966 | 370 | 362 Mexico | 1967 | 411 | 370 Mexico | 1968 | 439 | 411 Mexico | 1969 | 461 | 439 (10 rows) |
In this example each group contains a NULL value because there is no “previous” value. This is proof that PostgreSQL handles the groups separately.
One more thing many people are interested in is the necessity to calculate moving averages. I decided to include this example in this post about differences because the problem pops up so often that it deserves some more attention. In many cases this type of operation is calculated on the application level which is clearly the wrong place to do because of performance reasons:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 |
cypex=# SELECT country, year, production, avg(production) OVER (ORDER BY year ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING) AS mov FROM t_oil WHERE country IN ('Saudi Arabien') AND year BETWEEN 1975 AND 1990; country | year | production | mov ---------------+------+------------+----------------------- Saudi Arabien | 1975 | 7216 | 8465.6666666666666667 Saudi Arabien | 1976 | 8762 | 8487.7500000000000000 Saudi Arabien | 1977 | 9419 | 8758.4000000000000000 Saudi Arabien | 1978 | 8554 | 9369.2000000000000000 Saudi Arabien | 1979 | 9841 | 9668.0000000000000000 Saudi Arabien | 1980 | 10270 | 9176.4000000000000000 Saudi Arabien | 1981 | 10256 | 8455.8000000000000000 Saudi Arabien | 1982 | 6961 | 7394.4000000000000000 Saudi Arabien | 1983 | 4951 | 6060.6000000000000000 Saudi Arabien | 1984 | 4534 | 5051.0000000000000000 Saudi Arabien | 1985 | 3601 | 4578.6000000000000000 Saudi Arabien | 1986 | 5208 | 4732.4000000000000000 Saudi Arabien | 1987 | 4599 | 4952.6000000000000000 Saudi Arabien | 1988 | 5720 | 5653.4000000000000000 Saudi Arabien | 1989 | 5635 | 5764.7500000000000000 Saudi Arabien | 1990 | 7105 | 6153.3333333333333333 (16 rows) |
This defines the number of rows going into the aggregate function (in our case “avg”). The idea of a moving average is to flatten the curve and create a smoother line. The following picture shows how this works:
As you can see in the CYPEX visualization the moving average is a lot smoother than the rate underlying data. Saudi Arabia is a so called “swing producer”. Depending on the political situation the production rate might vary significantly so using a moving average actually makes a lot of sense.
However, there is more: Some of you might know that PostgreSQL supports composite data type. Basically every row can be seen as a single element containing various components. Usually a SELECT clause lists all desired fields, but you can also see a table as a single field as shown in the next example:
1 2 3 4 5 6 7 8 9 |
cypex=# SELECT t_oil, lag(t_oil) OVER (ORDER BY year) FROM t_oil WHERE country = 'Mexico' AND year IN (1980, 1981); t_oil | lag -----------------------------------------+----------------------------------------- ('North America',Mexico,1980,2129,1048) | ('North America',Mexico,1981,2553,1172) | ('North America',Mexico,1980,2129,1048) (2 rows) |
In this case all columns of a row are packed into a single field. You can use the “lag” function normally …
The trick now is: You can use “=” to compare two rows directly. Why is that important? Sometimes you want to see if two rows were imported twice or you simply want to know if two consecutive rows are identical. This is how it works:
1 2 3 4 5 6 7 8 9 |
cypex=# SELECT t_oil = lag(t_oil) OVER (ORDER BY year) FROM t_oil WHERE country = 'Mexico' AND year IN (1980, 1981); ?column? ---------- f (2 rows) |
It is possible to compare entire rows to each other. PostgreSQL will inspect one field after the other and only issue true in case all fields are the same. In other words: “lag” can even be abused to detect duplicate rows.
In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Facebook or LinkedIn.
+43 (0) 2622 93022-0
office@cybertec.at
You are currently viewing a placeholder content from Facebook. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More InformationYou are currently viewing a placeholder content from X. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More Information
I think it's better to demonstrate row comparison with IS NOT DISTINCT FROM, rather than =, because of the possibility of NULL values.