CYBERTEC PostgreSQL Logo

How to run a clustering algorithm within PostgreSQL

02.2020 / Category: / Tags: |

By Kevin Speyer - In this post, we will cover a step by step guide on how to implement a clustering algorithm within PostgreSQL. The clustering algorithm is the well-known k-means, which splits data into groups by minimising the sum of the squared distances of each point to the corresponding group average. This method belongs to the so called unsupervised learning algorithms, because it does not require labels to train the model. Instead, the model extracts the underlying structure by observing the distribution of data.
In a previous post we've covered how to execute a python k-means algorithm inside Postgres. In this case, we will use another library written in c, from this repository.

Set up

First, let's clone the repository into our computer:

[sourcecode language="bash" wraplines="false" collapse="false"]
$ git clone git@github.com:cybertec-postgresql/kmeans-postgresql.git
$ cd kmeans-postgresql
[/sourcecode]

Now, we have to install the package:

[sourcecode language="bash" wraplines="false" collapse="false"]
$ sudo make install
[/sourcecode]

To test the extension, I will create a new database. So let's start the PostgreSQL client:

[sourcecode language="bash" wraplines="false" collapse="false"]
$ psql -U postgres
[/sourcecode]

[sourcecode language="sql" wraplines="false" collapse="false"]
# CREATE DATABASE kmeans
[/sourcecode]

Finally, let's run the kmeans.sql script to create the functions in the database. This will allow us to run the k-means algorithm from within the database.

[sourcecode language="bash" wraplines="false" collapse="false"]
$ psql -f kmeans--1.1.0.sql -U user -d kmeans
[/sourcecode]

We are done with the setup!

Using the cluster algorithm

Now, let's test the algorithm. We can use the test data from the repository, which is located in the "data/" directory. We have to create a table to copy the data into:

[sourcecode language="bash" wraplines="false" collapse="false"]
$ psql -U user -d kmeans
[/sourcecode]

[sourcecode language="sql" wraplines="false" collapse="false"]
kmeans=> CREATE TABLE testdata(val1 float, val2 float, val3 float);
kmeans=> i input/kmeans.source
kmeans=> COPY testdata FROM './data/testdata.txt' ( FORMAT CSV, DELIMITER(' ') );
[/sourcecode]

Now, we can call the kmeans function from the database and see the results:

[sourcecode language="sql" wraplines="false" collapse="false"]
kmeans=> SELECT kmeans(ARRAY[val1, val2], 5) OVER (), val1, val2 FROM testdata limit 10;
kmeans | val1 | val2
--------+-----------+-----------
2 | 1.208985 | 0.421448
3 | 0.504542 | -0.28573
2 | 0.630568 | 1.054712
2 | 1.056364 | 0.601873
0 | 1.095326 | -1.447579
1 | -0.210165 | 0.000284
0 | -0.367151 | -1.255189
0 | 0.868013 | -1.063465
3 | 1.704441 | -0.644833
0 | 0.565619 | -1.637858
(10 rows)
[/sourcecode]

The last parameter (5 in this case) is the number of clusters to divide the data into. The OVER clause gives us the possibility to perform the clusterization in a window. This can be achieved using PARTITION BY, like this:

[sourcecode language="sql" wraplines="false" collapse="false"]
kmeans=> SELECT kmeans(ARRAY[val1, val2], 2) OVER (partition by val1 > 0), val1, val2 FROM testdata  limit10;
[/sourcecode]

In this case, the algorithm runs separately for all data with val1 > 0, exactly like a window function.

Visualizing the clusters

To visualize the results, we can run the perl script in the repository. If you have never used gnuplot, it is necessary to install it first:

[sourcecode language="bash" wraplines="false" collapse="false"]
$ sudo dnf install -y gnuplot
[/sourcecode]

or for APT package manager

[sourcecode language="bash" wraplines="false" collapse="false"]
$ sudo apt-get install -y gnuplot
[/sourcecode]

Lastly, run the plot script and observe the results!

[sourcecode language="bash" wraplines="false" collapse="false"] perl plot-cybertec.pl -d kmeans -k 4 [/sourcecode]

The output of the script is displayed: the algorithm divides the data into four clusters

 Test data divided into four clusters by the algorithm

 If you are not working on a local machine, or just want to look at the plot in the terminal, you can add the -t option:

[sourcecode language="bash" wraplines="false" collapse="false"]

perl plot-cybertec.pl -d kmeans -k 4 -t

[/sourcecode]

And you should see something like this in the terminal:

One response to “How to run a clustering algorithm within PostgreSQL”

  1. git clone git@github.com:cybertec-postgresql/kmeans-postgresql.git
    Cloning into 'kmeans-postgresql'...
    git@github.com: Permission denied (publickey).
    fatal: Could not read from remote repository.

    Please make sure you have the correct access rights
    and the repository exists.

    Please let me know how to clone your git.

Leave a Reply

Your email address will not be published. Required fields are marked *

CYBERTEC Logo white
Get the newest PostgreSQL Info & Tools


    This site is protected by reCAPTCHA and the Google Privacy Policy & Terms of Service apply.

    ©
    2024
    CYBERTEC PostgreSQL International GmbH
    phone-handsetmagnifiercrosscross-circle
    linkedin facebook pinterest youtube rss twitter instagram facebook-blank rss-blank linkedin-blank pinterest youtube twitter instagram