CYBERTEC PostgreSQL Logo

CLUSTER: Improving PostgreSQL performance

03.2020 / Category: / Tags: |

When designing a database application, the layout of data in disk is often neglected. However, the way data is clustered by PostgreSQL can have a major performance impact. Therefore it makes sense to take a look at what can be done to improve speed and throughput. In this post you will learn one of the most important tricks.

PostgreSQL: To sort or not to sort

To demonstrate the importance of the on-disk layout I have created a simple test set:

Note that both data sets are absolutely identical. I have loaded 10 million rows into a simple table. However, in the first case data has been sorted, then inserted. generate_series returns data in ascending order and because the table is new data will be written to disk in that order.
In the second case I decided to shuffle the data before insertion. We are still talking about the same data set. However, it is not in the same order:

In both cases the size on disk is the same. There are no changes in terms of space consumption which can be an important factor as well.

Creating an index in PostgreSQL

Let us create an index on both tables:

Even creating the index is already faster on sorted data for various reasons. However, creating initial indexes does not happen too often, so you should not worry too much.

In the next step we can already create optimizer statistics and make sure that all hint bits are set to ensure a fair performance comparison:

Reading blocks of database

Now that all the test data sets are in place we can run a simple test: Let us fetch 49000 rows from the sorted data set first:

Not bad. We need 11.785 milliseconds to read the data. What is most important to consider here is that the number of 8k blocks needed is 138, which is not much. “shared hit” means that all the data has come from memory.

Let me run the same test again:

In this case the query took a bit longer: 13.4 ms. However, let us talk about the most important number here: The number of blocks needed to return this result. 18799 blocks. Wow. That is roughly 150 times more.

One could argue that the query is not really that much slower. This is true. However, in my example all data is coming from memory. Let us assume for a moment that data has to be read from disk because for some reason we get no cache hits. The situation would change dramatically. Let us assume that reading one block from disk takes 0.1 ms:

138 * 0.1 + 11.7 = 25.5 ms
vs.
18799 * 0.1 + 13.4 = 1893.3 ms

That is a major difference. This is why the number of blocks does make a difference – even if it might not appear to be the case at first glance. The lower your cache hit rates are, the bigger the problem will become.

There is one more aspect to consider in this example: Note that if you want to read a handful of rows only the on-disk layout does not make too much of a difference. However, if the subset of data contains thousands of rows, the way is ordered on disk does have an impact on performance.

CLUSTER: PostgreSQL comes to the rescue

The CLUSTER command has been introduced many years ago to address exactly the issues I have just outlined. It allows you to organize data according to an index. Here is the syntax:

URL: https://www.postgresql.org/docs/12/sql-cluster.html

Utilizing the CLUSTER command is easy. The following code snipped will show how you can do that:

To see what happens I have executed the same query as before again. However, there is something important to be seen:

PostgreSQL has changed the execution plan. This happens due to wrong statistics. Therefore it is important to run ANALYZE to make sure that the optimizer has up-to date information:

Once the new optimizer statistics is in place the execution plan will be as expected again:

Maintaining order

If you have decided to cluster a table it does NOT mean that order on disk is maintained forever. If you run UPDATES etc. frequently the table might gradually loose order again. Therefore, CLUSTER is especially useful if your data is rather static. It can also make sense to order data as you import it to ensure physical order.

Finally …

If you want to learn more about database performance and storage consider checking out my post about shrinking the storage footprint of PostgreSQL.


In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Facebook or LinkedIn.

3 responses to “CLUSTER: Improving PostgreSQL performance”

  1. Thanks for sharing! I like the idea in the docs to reduce the table fillfactor a little if the data is not static. It sounds like this could avoid requiring a re-cluster, or at least less often. Is it worth adding to the maintaining order section? Or is this a bad idea? Thanks again!

  2. I don't get it. If I am understanding it correctly, the last example is actually the worst performing of the whole article: 15 ms, when before CLUSTERing it was 13ms.

  3. Getting columns that are not part of the index would be a better way to demonstrate the advantages of clustering a table.

Leave a Reply

Your email address will not be published. Required fields are marked *

CYBERTEC Logo white
Get the newest PostgreSQL Info & Tools


    This site is protected by reCAPTCHA and the Google Privacy Policy & Terms of Service apply.

    ©
    2024
    CYBERTEC PostgreSQL International GmbH
    phone-handsetmagnifiercrosscross-circle
    linkedin facebook pinterest youtube rss twitter instagram facebook-blank rss-blank linkedin-blank pinterest youtube twitter instagram