CYBERTEC PostgreSQL Logo

Calculating differences between rows in SQL

06.2020 / Category: / Tags: |

Recently we had some clients who had the desire to store timeseries in PostgreSQL. One of the questions which interested them is related to calculating the difference between values in timeseries data. How can you calculate the difference between the current and the previous row?
To answer this question I have decided to share some simple queries outlining what can be done. Note that this is not a complete tutorial about analytics and windowing functions but just a short introduction to what can be done in general.

Loading sample data

Let us load some sample data:

If you are a superuser you can use COPY ... FROM PROGRAM directly. Otherwise you have to load the text file in a different way.

lag: How to access a different row in SQL

If you want to calculate the difference between two rows, you can make use of the “lag” function. However, there is a question naturally arising here: If we want to access the previous row. What is the previous row? We will need some kind of order. To do that SQL provides the OVER-clause:

In my example I have ordered the data by year to make sure that the previous year can indeed be found in the previous row.

Once you find the correct row, the rest is easy:

What is important to see here is that the first row contains a NULL entry because there is no known difference to the previous row.

first_value function

What many people need is the difference between the current and the first row in the data set. PostgreSQL (or ANSI SQL in general to be more precise) offers the “first_value” function which returns the first row given the order provided by us. Here is how it works:

As you can see, in this case everything is relative to the first row.

To visualize the results I have quickly built a mini CYPEX dashboard:

cypex visualization

The data looks correct so we can move on to the next examples.

Mixing data sets in analytics

But what happens if we start to look at two countries? If we order by year we might hit the wrong row. If we order by both columns we might still hit a row associated to a different country. The solution is the PARTITION BY clause. PostgreSQL will break up the data in various groups and calculate the difference again (for each group). Here is how it works:

In this example each group contains a NULL value because there is no “previous” value. This is proof that PostgreSQL handles the groups separately.

Using sliding windows

One more thing many people are interested in is the necessity to calculate moving averages. I decided to include this example in this post about differences because the problem pops up so often that it deserves some more attention. In many cases this type of operation is calculated on the application level which is clearly the wrong place to do because of performance reasons:

In SQL you can use ROWS BETWEEN ... PRECEDING AND ... FOLLOWING.

This defines the number of rows going into the aggregate function (in our case “avg”). The idea of a moving average is to flatten the curve and create a smoother line. The following picture shows how this works:

Cypex dashboard

As you can see in the CYPEX visualization the moving average is a lot smoother than the rate underlying data. Saudi Arabia is a so called “swing producer”. Depending on the political situation the production rate might vary significantly so using a moving average actually makes a lot of sense.

Composite types and row comparisons

However, there is more: Some of you might know that PostgreSQL supports composite data type. Basically every row can be seen as a single element containing various components. Usually a SELECT clause lists all desired fields, but you can also see a table as a single field as shown in the next example:

In this case all columns of a row are packed into a single field. You can use the “lag” function normally …

The trick now is: You can use “=” to compare two rows directly. Why is that important? Sometimes you want to see if two rows were imported twice or you simply want to know if two consecutive rows are identical. This is how it works:

It is possible to compare entire rows to each other. PostgreSQL will inspect one field after the other and only issue true in case all fields are the same. In other words: “lag” can even be abused to detect duplicate rows.


In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Facebook or LinkedIn.

One response to “Calculating differences between rows in SQL”

  1. I think it's better to demonstrate row comparison with IS NOT DISTINCT FROM, rather than =, because of the possibility of NULL values.

Leave a Reply

Your email address will not be published. Required fields are marked *

CYBERTEC Logo white
CYBERTEC PostgreSQL International GmbH
Römerstraße 19
2752 Wöllersdorf
Austria

+43 (0) 2622 93022-0
office@cybertec.at

Get the newest PostgreSQL Info & Tools


    This site is protected by reCAPTCHA and the Google Privacy Policy & Terms of Service apply.

    ©
    2024
    CYBERTEC PostgreSQL International GmbH
    phone-handsetmagnifiercrosscross-circle
    linkedin facebook pinterest youtube rss twitter instagram facebook-blank rss-blank linkedin-blank pinterest youtube twitter instagram