When designing a database application, the layout of data on disk is often neglected. However, the way PostgreSQL clusters data on disk can have a major performance impact, so it makes sense to take a look at what can be done to improve speed and throughput. In this post you will learn one of the most important tricks.
To demonstrate the importance of the on-disk layout I have created a simple test set:
test=# CREATE TABLE t_test AS
           SELECT * FROM generate_series(1, 10000000);
SELECT 10000000
test=# CREATE TABLE t_random AS
           SELECT * FROM t_test ORDER BY random();
SELECT 10000000
Note that both data sets are absolutely identical. I have loaded 10 million rows into a simple table. However, in the first case the data was sorted before insertion: generate_series returns values in ascending order, and because the table is new, the rows are written to disk in exactly that order.
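If you want to convince yourself of the physical row order, you can inspect the system column ctid, which exposes each row's (block number, tuple index) position on disk. This is just a quick sanity check, not part of the benchmark; for the freshly loaded table the positions will simply count upwards in step with the values:

test=# SELECT ctid, generate_series FROM t_test LIMIT 3;
test=# SELECT ctid, generate_series FROM t_random LIMIT 3;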
In the second case I decided to shuffle the data before insertion. We are still talking about the same data set, but it is not stored in the same order:
test=# \d+
                     List of relations
 Schema |   Name   | Type  | Owner |  Size  | Description
--------+----------+-------+-------+--------+-------------
 public | t_random | table | hs    | 346 MB |
 public | t_test   | table | hs    | 346 MB |
(2 rows)
In both cases the size on disk is the same: there is no difference in terms of space consumption, which can be an important factor as well.
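If you prefer checking table sizes from SQL rather than via \d+, pg_relation_size combined with pg_size_pretty returns the same information. A small convenience sketch:

test=# SELECT pg_size_pretty(pg_relation_size('t_test'));
test=# SELECT pg_size_pretty(pg_relation_size('t_random'));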
Let us create an index on both tables:
test=# \timing
Timing is on.
test=# CREATE INDEX idx_test ON t_test (generate_series);
CREATE INDEX
Time: 3699.416 ms (00:03.699)
test=# CREATE INDEX idx_random ON t_random (generate_series);
CREATE INDEX
Time: 5084.823 ms (00:05.085)
Even creating the index is faster on sorted data, for various reasons. However, the initial index creation does not happen too often, so it is usually nothing to worry about.
In the next step we create optimizer statistics and make sure that all hint bits are set, to ensure a fair performance comparison:
test=# VACUUM ANALYZE;
VACUUM
Now that all the test data sets are in place, we can run a simple test. Let us fetch roughly 49,000 rows from the sorted data set first:
test=# explain (analyze, buffers) SELECT * FROM t_test
           WHERE generate_series BETWEEN 1000 AND 50000;
                                QUERY PLAN
-------------------------------------------------------------------------------------
 Index Only Scan using idx_test on t_test
       (cost=0.43..1362.62 rows=43909 width=4)
       (actual time=0.017..7.810 rows=49001 loops=1)
   Index Cond: ((generate_series >= 1000) AND (generate_series <= 50000))
   Heap Fetches: 0
   Buffers: shared hit=138
 Planning Time: 0.149 ms
 Execution Time: 11.785 ms
(6 rows)
Not bad. We need 11.785 milliseconds to read the data. The most important thing to consider here is the number of 8k blocks needed: 138, which is not much. "shared hit" means that all the data came from memory.
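As a side note: if you want EXPLAIN to show how much time is actually spent on block I/O, you can enable track_io_timing (a superuser can set it per session). EXPLAIN (ANALYZE, BUFFERS) will then include an "I/O Timings" line whenever blocks are read from the operating system:

test=# SET track_io_timing = on;
SET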
Let me run the same query against the shuffled table:
test=# explain (analyze, buffers) SELECT * FROM t_random
           WHERE generate_series BETWEEN 1000 AND 50000;
                                QUERY PLAN
------------------------------------------------------------------------------------------
 Index Only Scan using idx_random on t_random
       (cost=0.43..1665.17 rows=53637 width=4)
       (actual time=0.013..9.892 rows=49001 loops=1)
   Index Cond: ((generate_series >= 1000) AND (generate_series <= 50000))
   Heap Fetches: 0
   Buffers: shared hit=18799
 Planning Time: 0.102 ms
 Execution Time: 13.386 ms
(6 rows)
In this case the query took a bit longer: 13.4 ms. However, let us talk about the most important number here: the number of blocks needed to return this result. 18799 blocks. Wow. That is more than 130 times as many.
One could argue that the query is not really that much slower. This is true – but only because in my example all the data is coming from memory. Let us assume for a moment that the data has to be read from disk because for some reason we get no cache hits. The situation would change dramatically. Assuming that reading one block from disk takes 0.1 ms:
138 * 0.1 + 11.7 = 25.5 ms
vs.
18799 * 0.1 + 13.4 = 1893.3 ms
That is a major difference, and it is why the number of blocks matters – even if it might not appear to be the case at first glance. The lower your cache hit rates are, the bigger the problem becomes.
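If you want to know how good your cache hit rates actually are, PostgreSQL keeps per-table counters in the pg_statio_user_tables view. A small sketch (the hit_pct alias is mine, and keep in mind that this view only sees shared-buffer hits, not the operating system's page cache):

test=# SELECT relname,
              heap_blks_read,
              heap_blks_hit,
              round(100.0 * heap_blks_hit
                    / nullif(heap_blks_hit + heap_blks_read, 0), 2) AS hit_pct
       FROM   pg_statio_user_tables
       WHERE  relname IN ('t_test', 't_random');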
There is one more aspect to consider in this example: if you only want to read a handful of rows, the on-disk layout does not make much of a difference. However, if the subset of data contains thousands of rows, the way it is ordered on disk does have an impact on performance.
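The optimizer is well aware of this effect: the correlation column in the pg_stats view describes how closely the physical row order follows the logical column order. For our two tables you would expect a value close to 1 for t_test and close to 0 for t_random:

test=# SELECT tablename, attname, correlation
       FROM   pg_stats
       WHERE  tablename IN ('t_test', 't_random')
              AND attname = 'generate_series';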
The CLUSTER command was introduced many years ago to address exactly the issue I have just outlined: it reorganizes a table according to an index. Here is the syntax:
test=# \h CLUSTER
Command:     CLUSTER
Description: cluster a table according to an index
Syntax:
CLUSTER [VERBOSE] table_name [ USING index_name ]
CLUSTER [VERBOSE]

URL: https://www.postgresql.org/docs/12/sql-cluster.html
Utilizing the CLUSTER command is easy. The following code snippet shows how you can do that:
test=# CLUSTER t_random USING idx_random;
CLUSTER
To see what happens, I executed the same query as before once again. However, there is something important to notice:
test=# explain (analyze, buffers) SELECT * FROM t_random
           WHERE generate_series BETWEEN 1000 AND 50000;
                                QUERY PLAN
---------------------------------------------------------------------------------
 Bitmap Heap Scan on t_random
       (cost=1142.21..48491.32 rows=53637 width=4)
       (actual time=3.328..9.564 rows=49001 loops=1)
   Recheck Cond: ((generate_series >= 1000) AND (generate_series <= 50000))
   Heap Blocks: exact=218
   Buffers: shared hit=2 read=353
   ->  Bitmap Index Scan on idx_random
           (cost=0.00..1128.80 rows=53637 width=0)
           (actual time=3.284..3.284 rows=49001 loops=1)
         Index Cond: ((generate_series >= 1000) AND (generate_series <= 50000))
         Buffers: shared hit=2 read=135
 Planning Time: 1.024 ms
 Execution Time: 13.077 ms
(9 rows)
PostgreSQL has changed the execution plan. This happens because the optimizer statistics are out of date after the rewrite. Therefore it is important to run ANALYZE to make sure that the optimizer has up-to-date information:
test=# ANALYZE;
ANALYZE
Once the new optimizer statistics are in place, the execution plan will be as expected again:
test=# explain (analyze, buffers) SELECT * FROM t_random
           WHERE generate_series BETWEEN 1000 AND 50000;
                                QUERY PLAN
-----------------------------------------------------------------------------------------
 Index Only Scan using idx_random on t_random
       (cost=0.43..1807.12 rows=50884 width=4)
       (actual time=0.012..11.737 rows=49001 loops=1)
   Index Cond: ((generate_series >= 1000) AND (generate_series <= 50000))
   Heap Fetches: 49001
   Buffers: shared hit=355
 Planning Time: 0.220 ms
 Execution Time: 15.267 ms
(6 rows)

The number that really matters here is the buffer count: 355 blocks instead of 18799 – a reduction of more than 50x. On a fully cached data set the runtime barely changes (it is even slightly higher here), but with a cold cache the difference would be dramatic, as the calculation above has shown. The 49001 heap fetches appear because CLUSTER rewrites the table and leaves the visibility map unset; they will go away once VACUUM has processed the table again.
If you have decided to cluster a table, it does NOT mean that the order on disk is maintained forever. If you run UPDATEs etc. frequently, the table might gradually lose its order again. Therefore, CLUSTER is especially useful if your data is rather static. It can also make sense to sort data as you import it, to ensure the desired physical order from the start, as sketched below.
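To sketch the "sort while importing" idea: if the data already lives in some staging table, a sorted copy can be created in one go (t_sorted is just an illustrative name):

test=# CREATE TABLE t_sorted AS
           SELECT * FROM t_random ORDER BY generate_series;

Unlike CLUSTER, this requires no index up front – but of course it also has to rewrite the entire table.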
If you want to learn more about database performance and storage consider checking out my post about shrinking the storage footprint of PostgreSQL.
Thanks for sharing! I like the idea in the docs to reduce the table fillfactor a little if the data is not static. It sounds like this could avoid requiring a re-cluster, or at least less often. Is it worth adding to the maintaining order section? Or is this a bad idea? Thanks again!
I don't get it. If I am understanding it correctly, the last example is actually the worst performing of the whole article: 15 ms, when before CLUSTERing it was 13ms.
Getting columns that are not part of the index would be a better way to demonstrate the advantages of clustering a table.