Many people use PostgreSQL to store data and do some basic analysis. However, there is more to data than many people might think. Benford's law, for instance, is a mechanism, which is widely used in many fields. Especially accounting fraud and so on can be detected nicely using this very simple law of mathematics.
Table of Contents
What is the idea of Benford's law? Basically the concept is simple: Not all numbers are created equally. If you look at data, which has been observed in a natural way, not all numbers are equally likely. If you look at the first digit of a number, you will see that “1” is a lot more likely than, say, “8” or “9”.
Implementing Benford's law in PostgreSQL is actually simple and easy to achieve. All yyou have to do is to look at the first digit of numbers you want to analyze and see, if these numbers are distributed as proposed by this simple law. If the distribution of numbers is not as expected we can safely assume that something is fishy.
To give Benford's law a try we just take some accounting data provided by the “Iowa Department of Management” (http://www.dom.state.ia.us/local/county/index.html). Note, this is just some random data I have found on the web.
For our experiment we have chosen the first document in the list (“County Budget Summary FY14”).
So, let us see now, if those total expenditures listed in the report are expected to be fraudulent or not:
First of all we create ourselves a little helper table to store the data:
1 2 |
test=# CREATE TABLE t_accounting (data int4); CREATE TABLE |
Then we copy the total expenditure column to a simple text file and remove those commas shown in the Excel sheet using some simple Vim magic (:1,$s/,//gi). This leaves us with a simple text file:
1 2 3 4 5 6 7 8 9 10 11 |
[hs@paula tmp]$ head data.txt 11772966 7705043 14454112 11044060 7699062 23530265 62224226 21602215 18415355 18589757 |
We can easily import this data into PostgreSQL:
1 2 |
test=# COPY t_accounting FROM '/tmp/data.txt'; COPY 99 |
In SQL it is pretty easy to extract the first digit of a number. All we have to do now is to run a trivial aggregation and see, what we get:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
test=# SELECT substring(data::text, 1, 1), count(*) FROM t_accounting GROUP BY 1 ORDER BY 1; substring | count -----------+------- 1 | 54 2 | 17 3 | 8 4 | 1 5 | 1 6 | 2 7 | 4 8 | 8 9 | 4 (9 rows) |
Obviously smaller numbers are ways more likely than larger ones. It is exactly what we should expect from “proper” accounting data.
NOTE: If you want to check your observation, we suggest using a chi-square or a Kolmogorov-Smirnow test.
In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Facebook or LinkedIn.
You need to load content from reCAPTCHA to submit the form. Please note that doing so will share data with third-party providers.
More InformationYou are currently viewing a placeholder content from Facebook. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More InformationYou are currently viewing a placeholder content from X. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More Information
Leave a Reply