CYBERTEC PostgreSQL Logo

PostgreSQL: Create indexes after bulk loading

09.2021 / Category: / Tags: |

Over the years, many of our PostgreSQL clients have asked whether it makes sense to create indexes before - or after - importing data. Does it make sense to disable indexes when bulk loading data, or is it better to keep them enabled? This is an important question for people involved in data warehousing and large-scale data ingestion. So let’s dig in and figure it out:

B-tree indexes in PostgreSQL

Before we dive into the difference in performance, we need to take a look at how B-trees in PostgreSQL are actually structured:

create indexes after bulk loading

An index points to pages in the heap (= “table”). In PostgreSQL, a B-tree index always contains at least as much data as the underlying table. Inserting content into the table also means that data has to be added to the B-tree structure, which can easily turn a single change into a more expensive process. Keep in mind that this happens for each row (unless you are using partial indexes).

In reality, the additional data can pile up, resulting in a really expensive operation.

Some basic performance data

To show the difference between indexing after an import and before an import, I have created a simple test on my old Mac OS X machine. However, one can expect to see similar results on almost all other operating systems, including Linux.

Here are some results:

As you can see, we need roughly 2 minutes and 50 seconds to load the data, and a little more than one minute to create the index.

But what happens if we drop the table, and create the index before loading the data?

Wow, the runtime has skyrocketed to close to 5 minutes! The overhead of the additional index is relevant.

bulk loading

The situation can become even worse if there is more than just one index.
In general, our recommendation is therefore:

Create indexes after bulk loading!

Incremental loading

The natural question which then arises is: What should I do if there is already data in a table, and even more data is added? In most cases, there is no other way to deal with it but to take on the overhead of the existing indexes, because in most cases, you can’t live without indexes if there is already data in the database.

A second option is to consider PostgreSQL partitioning. If you load data on a monthly basis, it might make sense to create a partition for each month, and attach it after the loading process.

In general, adding to existing data is way more critical, and not so straightforward, as creating data from scratch.

Finally...

Bulk loading is an important topic and everybody has to load large amounts of data once in a while.

However, bulk loading and indexes are not the only important issues. Docker is also a growing technology that has found many friends around the world. If you want to figure out how to best run PostgreSQL in a container, check out our article about this topic.

 


In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Facebook or LinkedIn.

2 responses to “PostgreSQL: Create indexes after bulk loading”

  1. It's a shame that PostgreSQL doesn't have the equivalent of Onracle's direct path loading, as the index maintenance mechanism on that can really even out the total times.

    I wish PostgreSQL had the same ability to log wait events also, as direct comparisons of wall clock time often fail to show the real drivers of a delay.

Leave a Reply

Your email address will not be published. Required fields are marked *

CYBERTEC Logo white
Get the newest PostgreSQL Info & Tools


    This site is protected by reCAPTCHA and the Google Privacy Policy & Terms of Service apply.

    ©
    2024
    CYBERTEC PostgreSQL International GmbH
    phone-handsetmagnifiercrosscross-circle
    linkedin facebook pinterest youtube rss twitter instagram facebook-blank rss-blank linkedin-blank pinterest youtube twitter instagram