CYBERTEC PostgreSQL Logo

PostgreSQL: Creating checksums for tables

03.2022 / Category: / Tags: |

There are certain types of questions which accompany any PostgreSQL consultant throughout his or her entire career. Invariably, one of those questions is: “How can we compare the content of two tables?” This is an important thing to know, because we might want to ensure that data is identical after a migration, or perhaps we want to verify that data has not been corrupted during storage. These questions arise so frequently that it was necessary to come up with a speedy solution for comparing the content of tables. The answer lies in creating checksums for tables.

md5sum on Linux

On a Linux system you normally use a program such as md5sum to get the checksum, or "digital fingerprint", of a file. Here's how it works:

md5sum is a widely-used method to obtain checksums. Checksums can be compared more easily than entire files - especially when those files are large.

The question which naturally arises is: How can we achieve the same thing in a database? What can we do in SQL to create such a checksum?

Ordered sets in SQL

The solution to the problem in SQL is to use an “ordered set”. Now what is that? Here is an example of a simple ordered set:

This is the SQL way to calculate a median value. What happens is: The generate_series function creates values from 1 to 43243 (one row per value). The GROUP BY clause dissects them into 2 groups. One group will contain odd numbers and the other group will contain even numbers. The trick is: If we calculate the average (mean) number, the order of data does not matter. Suppose we have 1, 2, 3, 4, 5: The average will always be 3 - regardless of the order. However, how does a median work? We order the data and then we go 50% into the ordered data set - percentile_disc(0.5) does exactly that. Finding the “middle” value does not work without order.

The same is true for checksums. If we want to create a fingerprint for a table, the order of entries does matter. Identical data in a different order will lead to different results - and we want to avoid that.

Checksums in PostgreSQL

To solve this problem, we implement an ordered set which consumes sorted data and produces a checksum.

Before we implement the ordered set we need to see what's already available in PostgreSQL:

A simple md5 function is available, but there's no corresponding aggregate. We need to define one. Here's how it works:

An “ordered set” needs an INITCOND. We initialize an empty string. The general idea is: We take a string, add the next row and turn it into a checksum. Then we add a row, turn it into a checksum and so on. We do that as long as there is data. Basically, it is a checksum over checksums, to ensure that the amount of data we're working on is never too large to fit into memory. For every row in the loop, we call SFUNC (in our case, md5_agg_sfunc).

Order

In the context of an ordered set, one thing is of great importance: You might have guessed it - it is order. We use a little trick here:

After loading some data, we can identify what's possible:

Note that we're actually sorting by the “table”. What that means is that we're sorting by every column (so in our case, ORDER BY a, b). This is important, as you will see in a moment.

It's also possible is to use an entire row just like a column:

That is convenient: Let's recap the definition of the aggregate:

CREATE  AGGREGATE md5_agg (ORDER BY anyelement)...

What we see here is that the aggregate can basically order by “anything”. So, as in this case, we fully exploit that. As long as a row can be sorted, we can use it for our checksums:

We've successfully created a checksum for this table. Of course, we can also checksum subsets of data:

We can always use a GROUP BY statement to split the data into groups.

Finally…

Creating checksums can be incredibly useful to help ensure the quality of your data. However, there is more. If you really want to protect your data and ensure maximum security, we suggest taking a look at PostgreSQL TDE which offers a way to encrypt data on disk. TDE is part of PGEE. Check out our blog post about managing encryption keys with PostgreSQL TDE to read more.


In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Facebook or LinkedIn.

One response to “PostgreSQL: Creating checksums for tables”

  1. Hi, i use the same kind of process, to check data.
    As an oracle user, I did not know that we could use
    select my_view from my_view.
    so i build the same on oracle side, and also using a user defined aggregate function.
    I am changing my scribedb process, to use your approach
    thanks a lot.

Leave a Reply

Your email address will not be published. Required fields are marked *

CYBERTEC Logo white
Get the newest PostgreSQL Info & Tools


    This site is protected by reCAPTCHA and the Google Privacy Policy & Terms of Service apply.

    ©
    2025
    CYBERTEC PostgreSQL International GmbH
    phone-handsetmagnifiercrosscross-circle
    linkedin facebook pinterest youtube rss twitter instagram facebook-blank rss-blank linkedin-blank pinterest youtube twitter instagram