There are certain types of questions which accompany any PostgreSQL consultant throughout his or her entire career. Invariably, one of those questions is: “How can we compare the content of two tables?” This is an important thing to know, because we might want to ensure that data is identical after a migration, or perhaps we want to verify that data has not been corrupted during storage. These questions arise so frequently that it was necessary to come up with a speedy solution for comparing the content of tables. The answer lies in creating checksums for tables.
On a Linux system you normally use a program such as md5sum to get the checksum, or "digital fingerprint", of a file. Here's how it works:
```shell
[hs@fedora ~]$ md5sum /tmp/file.txt
d41d8cd98f00b204e9800998ecf8427e  /tmp/file.txt
```
md5sum is a widely-used method to obtain checksums. Checksums can be compared more easily than entire files - especially when those files are large.
The question which naturally arises is: How can we achieve the same thing in a database? What can we do in SQL to create such a checksum?
The solution to the problem in SQL is to use an “ordered set”. Now what is that? Here is an example of a simple ordered set:
```sql
test=# SELECT id % 2 AS grp,
              percentile_disc(0.5) WITHIN GROUP (ORDER BY id)
       FROM   generate_series(1, 43243) AS id
       GROUP  BY 1;
 grp | percentile_disc
-----+-----------------
   0 |           21622
   1 |           21621
(2 rows)
```
This is the SQL way to calculate a median value. What happens is: the generate_series function creates values from 1 to 43243 (one row per value). The GROUP BY clause dissects them into two groups: one containing the odd numbers, the other containing the even numbers. The trick is: if we calculate the average (mean), the order of the data does not matter. Suppose we have 1, 2, 3, 4, 5: the average will always be 3, regardless of the order. A median, however, works differently: we order the data and then go 50% into the ordered data set - which is exactly what percentile_disc(0.5) does. Finding the “middle” value does not work without order.
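The same distinction can be sketched outside the database. Here is a short Python illustration (not from the original post) of why the mean is order-independent while a discrete median needs sorted input:

```python
from statistics import mean, median_low

data = [3, 1, 5, 2, 4]

# The mean is order-independent: any permutation gives the same result.
assert mean(data) == mean(sorted(data)) == 3

# A discrete median (what percentile_disc(0.5) computes) needs order:
# sort first, then pick the element in the middle of the ordered data.
ordered = sorted(data)
assert median_low(data) == ordered[len(ordered) // 2] == 3
```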
The same is true for checksums. If we want to create a fingerprint for a table, the order of entries does matter. Identical data in a different order will lead to different results - and we want to avoid that.
To solve this problem, we implement an ordered set which consumes sorted data and produces a checksum.
Before we implement the ordered set we need to see what's already available in PostgreSQL:
```sql
test=# SELECT md5('hans');
               md5
----------------------------------
 f2a0ffe83ec8d44f2be4b624b0f47dde
(1 row)
```
A simple md5 function is available, but there's no corresponding aggregate. We need to define one. Here's how it works:
```sql
CREATE FUNCTION md5_agg_sfunc(text, anyelement)
   RETURNS text
   LANGUAGE sql
AS $$
   SELECT md5($1 || $2::text)
$$;

CREATE AGGREGATE md5_agg (ORDER BY anyelement)
(
   STYPE    = text,
   SFUNC    = md5_agg_sfunc,
   INITCOND = ''
);
```
An “ordered set” needs an INITCOND, which we initialize to an empty string. The general idea is: we take the current state, append the next row, and turn the result into a checksum. Then we append the next row, checksum again, and so on, as long as there is data. Basically, it is a checksum over checksums, which ensures that the amount of data we're working on never grows too large to fit into memory. For every row, the aggregate calls SFUNC (in our case, md5_agg_sfunc).
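To make the mechanics tangible, here is a small Python simulation of what the aggregate does (an illustration, not the author's code): fold the sorted rows into a running md5, mirroring md5(state || row):

```python
import hashlib

def md5_agg_sfunc(state: str, row: str) -> str:
    # Mirrors the SQL transition function: md5(state || row)
    return hashlib.md5((state + row).encode()).hexdigest()

def md5_agg(rows):
    state = ""                    # INITCOND = ''
    for row in sorted(rows):      # WITHIN GROUP (ORDER BY ...)
        state = md5_agg_sfunc(state, row)
    return state

rows = ["(1,11)", "(2,12)", "(3,13)"]
digest = md5_agg(rows)
assert len(digest) == 32          # a single md5 hex digest

# Feeding the same rows in a different order changes the result,
# which is why the ORDER BY part is essential:
state = ""
for row in reversed(sorted(rows)):
    state = md5_agg_sfunc(state, row)
assert state != digest
```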
In the context of an ordered set, one thing is of great importance: You might have guessed it - it is order. We use a little trick here:
```sql
test=# CREATE TABLE t_test (a int, b int);
CREATE TABLE
test=# INSERT INTO t_test
       SELECT x, x + 10
       FROM generate_series(1, 1000) AS x;
INSERT 0 1000
```
After loading some data, we can identify what's possible:
```sql
test=# SELECT * FROM t_test ORDER BY t_test LIMIT 3;
 a | b
---+----
 1 | 11
 2 | 12
 3 | 13
(3 rows)
```
Note that we're actually sorting by the “table”. What that means is that we're sorting by every column (in our case, ORDER BY a, b). This is important, as you will see in a moment.
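This column-by-column comparison behaves like tuple ordering in most programming languages. A quick Python analogy (purely illustrative):

```python
# Whole-row ordering compares column by column, left to right -
# just like Python compares tuples element by element.
rows = [(2, 12), (1, 11), (1, 9), (3, 13)]
assert sorted(rows) == [(1, 9), (1, 11), (2, 12), (3, 13)]
```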
It's also possible to use an entire row just like a column:
```sql
test=# SELECT t_test FROM t_test ORDER BY t_test LIMIT 3;
 t_test
--------
 (1,11)
 (2,12)
 (3,13)
(3 rows)
```
That is convenient. Let's recap the definition of the aggregate:

```sql
CREATE AGGREGATE md5_agg (ORDER BY anyelement) ...
```

What we see here is that the aggregate can order by basically “anything”, and that is exactly what we exploit: as long as a row can be sorted, we can use it for our checksums:
```sql
test=# SELECT md5_agg() WITHIN GROUP (ORDER BY t_test) FROM t_test;
             md5_agg
----------------------------------
 d8586008a105a1cac121818f30841eec
(1 row)
```
We've successfully created a checksum for this table. Of course, we can also checksum subsets of data:
```sql
test=# SELECT a % 2 AS grp,
              md5_agg() WITHIN GROUP (ORDER BY t_test)
       FROM   t_test
       GROUP  BY a % 2;
 grp |             md5_agg
-----+----------------------------------
   0 | 7e310dd10088ae69ab8d8a2b66e36aa2
   1 | 62f385154b241e74be8c91de150550c6
(2 rows)
```
We can always use a GROUP BY clause to split the data into groups.
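This is what makes the technique useful for migration checks: compute the checksum on both sides and compare. The Python sketch below (an illustration with made-up data, not code from the post) simulates comparing an original table with a migrated copy and a corrupted copy:

```python
import hashlib

def table_checksum(rows):
    # Sort the rows first, then fold them into a running md5 -
    # the same scheme as the md5_agg ordered-set aggregate.
    state = ""
    for row in sorted(rows):
        state = hashlib.md5((state + repr(row)).encode()).hexdigest()
    return state

original  = [(1, 11), (2, 12), (3, 13)]
migrated  = [(3, 13), (1, 11), (2, 12)]   # same data, different physical order
corrupted = [(1, 11), (2, 12), (3, 14)]   # one value changed

assert table_checksum(original) == table_checksum(migrated)
assert table_checksum(original) != table_checksum(corrupted)
```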
Creating checksums can be incredibly useful to help ensure the quality of your data. However, there is more. If you really want to protect your data and ensure maximum security, we suggest taking a look at PostgreSQL TDE which offers a way to encrypt data on disk. TDE is part of PGEE. Check out our blog post about managing encryption keys with PostgreSQL TDE to read more.
In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Facebook or LinkedIn.
Hi, I use the same kind of process to check data. As an Oracle user, I did not know that we could use

select my_view from my_view

so I built the same on the Oracle side, also using a user-defined aggregate function. I am changing my scribedb process to use your approach. Thanks a lot.