
Speeding up index creation in PostgreSQL

07.2024

Indexes are, by far, the most important performance-related feature every general purpose relational database provides. Without indexes, there are no efficient search operations, no unique constraints, and no primary keys. Achieving good performance on reasonable amounts of data without indexes is therefore totally impossible.

Generating sample data

The main question now is: what happens during index creation, and how can we speed up the process? Creating indexes on billions of rows is quite costly, because data has to be sorted and turned into an efficient search structure. To demonstrate how we can build indexes quickly, I have created a simple set of demo data:
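
A minimal sketch of how such a data set could look (the table name t_demo and the column names rand_int, rand_double and rand_numeric are just placeholders; only the column layout follows the description below):

    CREATE TABLE t_demo AS
        SELECT  id,                                            -- ascending number
                (random() * 1000000000)::int4  AS rand_int,    -- random value scaled to the row count
                random()                       AS rand_double, -- "double precision" column
                random()::numeric(20, 10)      AS rand_numeric -- "numeric" variant of similar data
        FROM    generate_series(1, 1000000000) AS id;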

This sample data has a couple of interesting properties: the “id” column is an ascending number, which makes a huge difference during index creation. The second column contains a random value multiplied by the number of rows, stored as an integer. The third column contains a “double precision” number, and finally, at the end, we are storing similar data as “numeric”, an arbitrary-precision type that does not use the FPU (the floating point unit of the CPU) internally.

Once these billion rows have been created, we can easily check the size of the table by using the following command: 
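
One simple way to do that is pg_size_pretty on top of pg_total_relation_size (assuming the placeholder table name t_demo from above):

    SELECT pg_size_pretty(pg_total_relation_size('t_demo'));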

Once the test data have been created, it makes sense to set the so-called “hint bits” in PostgreSQL, which will allow us to make a fair comparison between various runs. If you are interested in hint bits and their relation to VACUUM, consider checking out our blog post about this topic. 
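
One straightforward way to set the hint bits is to run VACUUM on the freshly loaded table, which reads every row once (again assuming the placeholder table name t_demo):

    VACUUM (VERBOSE) t_demo;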

While VACUUM is running, we can check the progress of the operation. In case a VACUUM run takes a long time, this can give us some interesting insights into what is going on inside the database:
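
A sketch of such a progress query against the pg_stat_progress_vacuum system view, run from a second session:

    SELECT  pid, relid::regclass AS relation, phase,
            heap_blks_total, heap_blks_scanned, heap_blks_vacuumed
    FROM    pg_stat_progress_vacuum;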

After creating the data set, we can move forward and create simple indexes on our table.

Creating simple indexes

The following listing shows how each column is indexed, and how long it takes to create the index. Note that all those operations are performed with PostgreSQL default settings, which is far from ideal, as we will see later. All tests have been executed on an “AMD Ryzen Threadripper 2950X 16-Core” processor:
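
A minimal sketch of such a run in psql, using the placeholder names from above (the timings quoted in the text are the measurements from the test machine, not something this sketch produces):

    \timing on
    CREATE INDEX idx_id ON t_demo (id);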

As we can see, index creation takes almost 5 minutes. Remember, we are dealing with 1 billion rows and we have not even started to tune the database. What is interesting to see is what PostgreSQL does while the index is created. We can inspect a system view to give us some clues:
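
The system view to look at is pg_stat_progress_create_index; a query along these lines, run from a second session while the index is being built, shows the current phase and how far along it is:

    SELECT  pid, phase,
            blocks_done, blocks_total,
            tuples_done, tuples_total
    FROM    pg_stat_progress_create_index;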

In the first step, PostgreSQL will scan the table and prepare the data for the sort operation which is about to come next:
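
During this step the phase column typically reads “building index: scanning table”, and progress is reported in table blocks; a small sketch of watching the scan percentage:

    SELECT  phase,
            round(100.0 * blocks_done / nullif(blocks_total, 0), 2) AS pct_scanned
    FROM    pg_stat_progress_create_index;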

Sorting can be very time-consuming, and it is important to tune this process properly to achieve good performance. After the sort process, PostgreSQL will add those sorted tuples to the index, as shown in the next listing:
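
In this final step the phase column typically reads “building index: loading tuples in tree”, and progress is reported in tuples rather than blocks; a similar sketch against the same view:

    SELECT  phase,
            round(100.0 * tuples_done / nullif(tuples_total, 0), 2) AS pct_loaded
    FROM    pg_stat_progress_create_index;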

After creating the first index, we can take a look at the second column. Note that the data type and the amount of data are the same. Still, we can observe that the process takes longer, because the data is fed to the index creation in random order, which can make a huge difference in most cases:
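
Again only a sketch, using the placeholder column name from above:

    CREATE INDEX idx_rand_int ON t_demo (rand_int);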

Wow, the index creation takes almost 2 minutes longer, which is really relevant: we are talking about a substantially longer sort operation than before.

However, the physical order of data is not the only thing that can make a huge difference. The same is true for the data type we want to index. Just take a look and see how the runtime will change depending on the data type:
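
For the “double precision” column, the statement is the same apart from the column name (placeholder names as before):

    CREATE INDEX idx_rand_double ON t_demo (rand_double);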

The “double precision” column is again 40 seconds slower. One has to keep in mind that a “double precision” value is larger than an integer value, which definitely contributes to those runtime differences.

The last thing to take a look at is indexing the “numeric” column, which behaves in a totally different way than everything we have seen before:
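
And finally the “numeric” column (again with placeholder names):

    CREATE INDEX idx_rand_numeric ON t_demo (rand_numeric);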

What is important to see here is that two things really matter:

  • Are the data already sorted?
  • Which data type is used?

Both aspects are often totally underestimated, because most people only think of “more RAM” and “more CPUs” (as if nothing else mattered at all).

Trying too many performance improvements

What people often do is throw hardware and configuration parameters at the problem.

Often, the following parameters are changed:
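
The exact values depend heavily on the available hardware; the following ALTER SYSTEM calls are merely an illustration of the kind of settings involved, not the values used for this benchmark:

    ALTER SYSTEM SET max_wal_size = '100GB';
    ALTER SYSTEM SET max_parallel_maintenance_workers = 16;
    ALTER SYSTEM SET maintenance_work_mem = '4GB';
    ALTER SYSTEM SET shared_buffers = '32GB';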

Once this is done, we can restart the database. Note that if shared_buffers had not been changed, “SELECT pg_reload_conf()” would have been totally sufficient.

Let us discuss those parameters one at a time:

  • max_wal_size: This controls how much WAL can accumulate between checkpoints and therefore the distance between them. A larger value helps a lot to reduce checkpoint-related I/O and to speed up I/O in general.
  • max_parallel_maintenance_workers: This variable controls how many worker processes PostgreSQL is allowed to fire up to build the indexes. It defines an upper limit of workers.
  • maintenance_work_mem: This defines how much memory each maintenance operation, such as CREATE INDEX, is allowed to use.
  • shared_buffers: The size of PostgreSQL's I/O cache.

Let us run the index creation again:
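
Rebuilding an index simply means dropping it and creating it again; sketched here for the “numeric” column (placeholder names as before):

    DROP INDEX idx_rand_numeric;
    CREATE INDEX idx_rand_numeric ON t_demo (rand_numeric);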

This is interesting: we managed to reduce the time from around 13 min 19 sec down to under 8 minutes. Note that this is not even double the speed. However, what does the CPU say?
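
One way to see the parallelism from inside PostgreSQL, instead of a system monitor such as top, is to look for the parallel worker backends while the build is running; a small sketch:

    SELECT  pid, backend_type, wait_event_type, wait_event
    FROM    pg_stat_activity
    WHERE   backend_type = 'parallel worker';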

We caught the CPU while sorting the data. This is relevant, because the sort phase can be parallelized nicely, while many other stages cannot happen in RAM or cannot run concurrently. That is the reason why, even with countless CPU cores, we only managed to speed things up a bit. What definitely limits us here is the throughput of the local SSD, which yields around 600 MB/sec during sorting. What is even more noteworthy is the fact that sorting integer values with the default configuration is actually faster than a fully parallel index creation with much higher parameter settings.

The fact that the data type indeed matters shows us that tweaking configuration parameters is definitely not the only path to enlightenment. Deciding on the right data type can make a huge difference as well.
