Calculating differences between rows in SQL

06.2020 | Category: | Tags: sql help , timeseries

06.2020

Category:

Recently we had some clients who had the desire to store timeseries in PostgreSQL. One of the questions which interested them is related to calculating the difference between values in timeseries data. How can you calculate the difference between the current and the previous row?
To answer this question I have decided to share some simple queries outlining what can be done. Note that this is not a complete tutorial about analytics and windowing functions but just a short introduction to what can be done in general.

Table of Contents

Loading sample data

Let us load some sample data:

cypex=# CREATE TABLE t_oil 
        (
          region       text, 
          country      text, 
          year         int, 
          production   int, 
          consumption  int
        );

cypex=# COPY t_oil 
        FROM PROGRAM 'curl /secret/oil_ext.txt';
COPY 644

cypex=# CREATE TABLE t_oil

(

region text,

country text,

year int,

production int,

consumption int

);

cypex=# COPY t_oil

FROM PROGRAM 'curl /secret/oil_ext.txt';

COPY 644

If you are a superuser you can use COPY ... FROM PROGRAM directly. Otherwise you have to load the text file in a different way.

lag: How to access a different row in SQL

If you want to calculate the difference between two rows, you can make use of the “lag” function. However, there is a question naturally arising here: If we want to access the previous row. What is the previous row? We will need some kind of order. To do that SQL provides the OVER-clause:

cypex=# SELECT country, year, production, 
               lag(production, 1) OVER (ORDER BY year) 
        FROM   t_oil 
        WHERE country = 'USA' 
        LIMIT 10;
 country | year | production |  lag  
---------+------+------------+-------
 USA     | 1965 |       9014 |      
 USA     | 1966 |       9579 |  9014
 USA     | 1967 |      10219 |  9579
 USA     | 1968 |      10600 | 10219
 USA     | 1969 |      10828 | 10600
 USA     | 1970 |      11297 | 10828
 USA     | 1971 |      11156 | 11297
 USA     | 1972 |      11185 | 11156
 USA     | 1973 |      10946 | 11185
 USA     | 1974 |      10461 | 10946
(10 rows)

cypex=# SELECT country, year, production,

lag(production, 1) OVER (ORDER BY year)

FROM t_oil

WHERE country = 'USA'

LIMIT 10;

country | year | production | lag

---------+------+------------+-------

USA | 1965 | 9014 |

USA | 1966 | 9579 | 9014

USA | 1967 | 10219 | 9579

USA | 1968 | 10600 | 10219

USA | 1969 | 10828 | 10600

USA | 1970 | 11297 | 10828

USA | 1971 | 11156 | 11297

USA | 1972 | 11185 | 11156

USA | 1973 | 10946 | 11185

USA | 1974 | 10461 | 10946

(10 rows)

In my example I have ordered the data by year to make sure that the previous year can indeed be found in the previous row.

Once you find the correct row, the rest is easy:

cypex=# SELECT country, year, 
               production - lag(production, 1) OVER (ORDER BY year) AS diff 
        FROM   t_oil 
        WHERE  country = 'USA' 
        LIMIT 10;
 country | year | diff 
---------+------+------
 USA     | 1965 |     
 USA     | 1966 |  565
 USA     | 1967 |  640
 USA     | 1968 |  381
 USA     | 1969 |  228
 USA     | 1970 |  469
 USA     | 1971 | -141
 USA     | 1972 |   29
 USA     | 1973 | -239
 USA     | 1974 | -485
(10 rows)

cypex=# SELECT country, year,

production - lag(production, 1) OVER (ORDER BY year) AS diff

FROM t_oil

WHERE country = 'USA'

LIMIT 10;

country | year | diff

---------+------+------

USA | 1965 |

USA | 1966 | 565

USA | 1967 | 640

USA | 1968 | 381

USA | 1969 | 228

USA | 1970 | 469

USA | 1971 | -141

USA | 1972 | 29

USA | 1973 | -239

USA | 1974 | -485

(10 rows)

What is important to see here is that the first row contains a NULL entry because there is no known difference to the previous row.

first_value function

What many people need is the difference between the current and the first row in the data set. PostgreSQL (or ANSI SQL in general to be more precise) offers the “first_value” function which returns the first row given the order provided by us. Here is how it works:

cypex=# SELECT t_oil.country, 
               t_oil.year, 
               t_oil.production, 
               t_oil.production - first_value(t_oil.production) OVER (ORDER BY t_oil.year) AS diff_first
        FROM   t_oil
        WHERE  t_oil.country = 'USA'::text 
        LIMIT 10;
 country | year | production | diff_first 
---------+------+------------+------------
 USA     | 1965 |       9014 |          0
 USA     | 1966 |       9579 |        565
 USA     | 1967 |      10219 |       1205
 USA     | 1968 |      10600 |       1586
 USA     | 1969 |      10828 |       1814
 USA     | 1970 |      11297 |       2283
 USA     | 1971 |      11156 |       2142
 USA     | 1972 |      11185 |       2171
 USA     | 1973 |      10946 |       1932
 USA     | 1974 |      10461 |       1447
(10 rows)

cypex=# SELECT t_oil.country,

t_oil.year,

t_oil.production,

t_oil.production - first_value(t_oil.production) OVER (ORDER BY t_oil.year) AS diff_first

FROM t_oil

WHERE t_oil.country = 'USA'::text

LIMIT 10;

country | year | production | diff_first

---------+------+------------+------------

USA | 1965 | 9014 | 0

USA | 1966 | 9579 | 565

USA | 1967 | 10219 | 1205

USA | 1968 | 10600 | 1586

USA | 1969 | 10828 | 1814

USA | 1970 | 11297 | 2283

USA | 1971 | 11156 | 2142

USA | 1972 | 11185 | 2171

USA | 1973 | 10946 | 1932

USA | 1974 | 10461 | 1447

(10 rows)

As you can see, in this case everything is relative to the first row.

To visualize the results I have quickly built a mini CYPEX dashboard:

The data looks correct so we can move on to the next examples.

Mixing data sets in analytics

But what happens if we start to look at two countries? If we order by year we might hit the wrong row. If we order by both columns we might still hit a row associated to a different country. The solution is the PARTITION BY clause. PostgreSQL will break up the data in various groups and calculate the difference again (for each group). Here is how it works:

cypex=# SELECT country, year, production, 
               lag(production) OVER (PARTITION BY country ORDER BY year) AS diff 
        FROM   t_oil 
        WHERE country IN ('Canada', 'Mexico') 
               AND year < 1970;
 country | year | production | diff 
---------+------+------------+------
 Canada  | 1965 |        920 |     
 Canada  | 1966 |       1012 |  920
 Canada  | 1967 |       1106 | 1012
 Canada  | 1968 |       1194 | 1106
 Canada  | 1969 |       1306 | 1194
 Mexico  | 1965 |        362 |     
 Mexico  | 1966 |        370 |  362
 Mexico  | 1967 |        411 |  370
 Mexico  | 1968 |        439 |  411
 Mexico  | 1969 |        461 |  439
(10 rows)

cypex=# SELECT country, year, production,

lag(production) OVER (PARTITION BY country ORDER BY year) AS diff

FROM t_oil

WHERE country IN ('Canada', 'Mexico')

AND year < 1970;

country | year | production | diff

---------+------+------------+------

Canada | 1965 | 920 |

Canada | 1966 | 1012 | 920

Canada | 1967 | 1106 | 1012

Canada | 1968 | 1194 | 1106

Canada | 1969 | 1306 | 1194

Mexico | 1965 | 362 |

Mexico | 1966 | 370 | 362

Mexico | 1967 | 411 | 370

Mexico | 1968 | 439 | 411

Mexico | 1969 | 461 | 439

(10 rows)

In this example each group contains a NULL value because there is no “previous” value. This is proof that PostgreSQL handles the groups separately.

Using sliding windows

One more thing many people are interested in is the necessity to calculate moving averages. I decided to include this example in this post about differences because the problem pops up so often that it deserves some more attention. In many cases this type of operation is calculated on the application level which is clearly the wrong place to do because of performance reasons:

cypex=# SELECT country, year, production, 
	       avg(production) OVER (ORDER BY year 
		   ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING) AS mov 
        FROM 	t_oil 
        WHERE country IN ('Saudi Arabien') 
              AND year BETWEEN 1975 AND 1990;
    country    | year | production |         mov          
---------------+------+------------+-----------------------
 Saudi Arabien | 1975 |       7216 | 8465.6666666666666667
 Saudi Arabien | 1976 |       8762 | 8487.7500000000000000
 Saudi Arabien | 1977 |       9419 | 8758.4000000000000000
 Saudi Arabien | 1978 |       8554 | 9369.2000000000000000
 Saudi Arabien | 1979 |       9841 | 9668.0000000000000000
 Saudi Arabien | 1980 |      10270 | 9176.4000000000000000
 Saudi Arabien | 1981 |      10256 | 8455.8000000000000000
 Saudi Arabien | 1982 |       6961 | 7394.4000000000000000
 Saudi Arabien | 1983 |       4951 | 6060.6000000000000000
 Saudi Arabien | 1984 |       4534 | 5051.0000000000000000
 Saudi Arabien | 1985 |       3601 | 4578.6000000000000000
 Saudi Arabien | 1986 |       5208 | 4732.4000000000000000
 Saudi Arabien | 1987 |       4599 | 4952.6000000000000000
 Saudi Arabien | 1988 |       5720 | 5653.4000000000000000
 Saudi Arabien | 1989 |       5635 | 5764.7500000000000000
 Saudi Arabien | 1990 |       7105 | 6153.3333333333333333
(16 rows)

cypex=# SELECT country, year, production,

avg(production) OVER (ORDER BY year

ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING) AS mov

FROM t_oil

WHERE country IN ('Saudi Arabien')

AND year BETWEEN 1975 AND 1990;

country | year | production | mov

---------------+------+------------+-----------------------

Saudi Arabien | 1975 | 7216 | 8465.6666666666666667

Saudi Arabien | 1976 | 8762 | 8487.7500000000000000

Saudi Arabien | 1977 | 9419 | 8758.4000000000000000

Saudi Arabien | 1978 | 8554 | 9369.2000000000000000

Saudi Arabien | 1979 | 9841 | 9668.0000000000000000

Saudi Arabien | 1980 | 10270 | 9176.4000000000000000

Saudi Arabien | 1981 | 10256 | 8455.8000000000000000

Saudi Arabien | 1982 | 6961 | 7394.4000000000000000

Saudi Arabien | 1983 | 4951 | 6060.6000000000000000

Saudi Arabien | 1984 | 4534 | 5051.0000000000000000

Saudi Arabien | 1985 | 3601 | 4578.6000000000000000

Saudi Arabien | 1986 | 5208 | 4732.4000000000000000

Saudi Arabien | 1987 | 4599 | 4952.6000000000000000

Saudi Arabien | 1988 | 5720 | 5653.4000000000000000

Saudi Arabien | 1989 | 5635 | 5764.7500000000000000

Saudi Arabien | 1990 | 7105 | 6153.3333333333333333

(16 rows)

In SQL you can use ROWS BETWEEN ... PRECEDING AND ... FOLLOWING.

This defines the number of rows going into the aggregate function (in our case “avg”). The idea of a moving average is to flatten the curve and create a smoother line. The following picture shows how this works:

As you can see in the CYPEX visualization the moving average is a lot smoother than the rate underlying data. Saudi Arabia is a so called “swing producer”. Depending on the political situation the production rate might vary significantly so using a moving average actually makes a lot of sense.

Composite types and row comparisons

However, there is more: Some of you might know that PostgreSQL supports composite data type. Basically every row can be seen as a single element containing various components. Usually a SELECT clause lists all desired fields, but you can also see a table as a single field as shown in the next example:

cypex=# SELECT t_oil, lag(t_oil) OVER (ORDER BY year) 
        FROM   t_oil 
        WHERE  country = 'Mexico'  
               AND year IN (1980, 1981);
                  t_oil                  |                   lag                   
-----------------------------------------+-----------------------------------------
 ('North America',Mexico,1980,2129,1048) | 
 ('North America',Mexico,1981,2553,1172) | ('North America',Mexico,1980,2129,1048)
(2 rows)

cypex=# SELECT t_oil, lag(t_oil) OVER (ORDER BY year)

FROM t_oil

WHERE country = 'Mexico'

AND year IN (1980, 1981);

t_oil | lag

-----------------------------------------+-----------------------------------------

('North America',Mexico,1980,2129,1048) |

('North America',Mexico,1981,2553,1172) | ('North America',Mexico,1980,2129,1048)

(2 rows)

In this case all columns of a row are packed into a single field. You can use the “lag” function normally …

The trick now is: You can use “=” to compare two rows directly. Why is that important? Sometimes you want to see if two rows were imported twice or you simply want to know if two consecutive rows are identical. This is how it works:

cypex=# SELECT t_oil = lag(t_oil) OVER (ORDER BY year) 
        FROM   t_oil 
        WHERE  country = 'Mexico' 
               AND year IN (1980, 1981);
 ?column? 
----------
 
 f
(2 rows)

cypex=# SELECT t_oil = lag(t_oil) OVER (ORDER BY year)

FROM t_oil

WHERE country = 'Mexico'

AND year IN (1980, 1981);

?column?

----------

(2 rows)

It is possible to compare entire rows to each other. PostgreSQL will inspect one field after the other and only issue true in case all fields are the same. In other words: “lag” can even be abused to detect duplicate rows.

Read more about PostgreSQL in Laurenz Albe's blog about Forcing the Join Order.

In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Facebook or LinkedIn.

One response to “Calculating differences between rows in SQL”

Gábor Szabó says:

June 30, 2020 at 3:45 pm

I think it's better to demonstrate row comparison with IS NOT DISTINCT FROM, rather than =, because of the possibility of NULL values.

Reply

Calculating differences between rows in SQL

Loading sample data

lag: How to access a different row in SQL

Once you find the correct row, the rest is easy:

first_value function

To visualize the results I have quickly built a mini CYPEX dashboard:

Mixing data sets in analytics

Using sliding windows

In SQL you can use ROWS BETWEEN ... PRECEDING AND ... FOLLOWING.

Composite types and row comparisons

One response to “Calculating differences between rows in SQL”

Leave a Reply Cancel reply

Hans-Jürgen Schönig

Blog Tags

NEWSLETTER

Articles by our PostgreSQL Experts