
PostgreSQL: Bulk loading huge amounts of data

05.2021

Bulk loading is the quickest way to import large amounts of data into a PostgreSQL database. There are various ways to facilitate large-scale imports, and many different ways to scale as well. This post will show you how to use some of these tricks and explain how fast importing works. You can use this knowledge to optimize data warehousing or any other data-intensive workload.

There are several things to take into consideration in order to speed up bulk loading of massive amounts of data using PostgreSQL:

  • INSERT vs. COPY
  • Optimizing checkpoints
  • Logged vs. unlogged tables
  • Recreating indexes
  • Enabled and disabled triggers
  • Improving column order and space consumption

Let us take a look at these things in greater detail.

INSERT vs. COPY

The first thing to consider is that COPY is usually a LOT better than plain inserts. The reason is that INSERT has a lot of overhead. People often ask: What kind of overhead is there? What makes COPY so much faster than INSERT? There are a variety of reasons: In the case of INSERT, every statement has to check for locks, check for the existence of the table and its columns, check for permissions, look up data types and so on. In the case of COPY, this is only done once, which is a lot faster. Whenever you want to write large amounts of data, COPY is usually the way to go.

To show what kind of impact this change has in terms of performance, I have compiled a short example.

Let’s create a table as well as some sample data:
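
As a sketch, a table matching this description might look like the following; the name t_sample and the concrete column definitions are assumptions, chosen to match the description of a simple 4-column table:

CREATE TABLE t_sample
(
        a       varchar(50),
        b       int,
        c       varchar(50),
        d       int
);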

The sample table is pretty simple: it consists of just 4 columns.

In the next step, we will compile a script containing 1 million INSERT statements in a single transaction:
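
One way to generate such a script is to let psql write it for us; the file name insert.sql and the dummy values are assumptions:

\t on
\pset format unaligned
\o insert.sql
SELECT 'BEGIN;';
SELECT 'INSERT INTO t_sample VALUES (''abcd'', 1, ''abcd'', 1);'
   FROM generate_series(1, 1000000);
SELECT 'COMMIT;';
\o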

Running the script can be done using psql:
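
Assuming the database is called test and the script was saved as insert.sql, the run might look like this:

$ time psql test -f insert.sql > /dev/null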

We need around 81 seconds to run this simple test, which is A LOT of time. Single INSERT statements are therefore clearly not the way to achieve quick imports and efficient bulk loading.

As I have already mentioned, COPY is a lot more efficient than INSERT, so let’s use the same data, but feed it to COPY instead:
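
Again as a sketch, the COPY script can be generated the same way; the file name copy.sql and the tab-separated dummy values are assumptions:

\t on
\pset format unaligned
\o copy.sql
SELECT 'COPY t_sample FROM stdin;';
SELECT 'abcd' || chr(9) || '1' || chr(9) || 'abcd' || chr(9) || '1'
   FROM generate_series(1, 1000000);
SELECT '\.';
\o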

Running the script is again an easy thing to do:
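
Again assuming the test database and the copy.sql file from above:

$ time psql test -f copy.sql > /dev/null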

The speed has improved from 81 to just 2.6 seconds.

Keep in mind that I have executed this test on a fairly old machine, on a totally untuned database. On modern hardware and on a more modern operating system, a lot more can be achieved than on my local iMac desktop machine. Loading 1 million lines or more is not uncommon in the real world. Of course, these numbers depend on the length of a “record” and so on. However, it is important to get a feeling for what is possible and what is not.

NOTE: Runtimes might vary. This has many reasons. One of them is certainly related to the hardware in use here. We have seen that many SSDs provide us with quite unstable response times.

Adjusting checkpoints for faster bulk loading

The PostgreSQL configuration does have an impact on bulk loading performance. There are many configuration parameters which are vital to database performance, and to loading in particular. However, I explicitly want to focus your attention on checkpoints and I/O performance. If you want to load billions of rows, I/O is king. There are various angles to approach the topic:

  • Reduce the amount of data written
    • Make tables smaller if possible (column order)
    • Reduce the amount of WAL written
  • Write data more efficiently
    • Longer checkpoint distances
    • Better I/O scheduling

The following settings are important:

  • max_wal_size: Maximum amount of WAL to be created (soft limit)
  • checkpoint_completion_target: Control checkpoint behavior

In general, it is a REALLY good idea to stretch checkpoints by a significant amount. Setting max_wal_size to 100 or 200 GB for bulk-load-intensive workloads is definitely not out of scope.
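
As a sketch, a bulk-load friendly configuration could look like this in postgresql.conf; the concrete values are assumptions and have to be adapted to your hardware and workload:

max_wal_size = 150GB                    # stretch checkpoint distances considerably
checkpoint_completion_target = 0.9      # spread checkpoint I/O over a longer period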

Keep in mind that increased checkpoint distances DO NOT put your server at risk. They merely affect the way PostgreSQL writes data. Also keep in mind that more disk space will be consumed and recovery might take longer in case of a crash.

If you want to learn more about checkpointing, check out this article about reducing the amount of WAL written.

CREATE TABLE vs. CREATE UNLOGGED TABLE

However, what if there were a way to get rid of WAL altogether? Well, there is one. It is called an “unlogged table”. What is the general idea? Often we see the following sequence of events:

  • Load a large amount of data into PostgreSQL (a “staging area”)
  • Perform some aggregations
  • Drop the initial import

This is the ideal scenario to use the WAL bypass provided by unlogged tables:
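
A minimal sketch, reusing the assumed t_sample structure from above:

DROP TABLE IF EXISTS t_sample;       -- start from scratch for the comparison

CREATE UNLOGGED TABLE t_sample
(
        a       varchar(50),
        b       int,
        c       varchar(50),
        d       int
);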

Let’s load the same data again:
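
Again assuming the insert.sql and copy.sql scripts generated earlier:

$ time psql test -f insert.sql > /dev/null
$ time psql test -f copy.sql > /dev/null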

As you can see, the entire thing is a lot faster: 81 vs. 59 seconds for the INSERT script, and 2.6 vs. 0.6 seconds for COPY. The difference is massive.

The reason is that an unlogged table does not have to write the data twice (no WAL needed). However, this comes with a price tag attached to it:

  • In case of a normal shutdown, an unlogged table is just like a normal table
  • In case of a crash, an unlogged table is guaranteed to be empty
  • The content of an unlogged table is not replicated

These restrictions imply that an unlogged table is not suitable for storing “normal” data. However, it is ideal for staging areas and bulk loading.

Tables can be switched from LOGGED to UNLOGGED and back. Many people expect these to be cheap operations, but this is not true.

Let’s take a look and see what happens:
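
A sketch of the measurement in psql; pg_current_wal_insert_lsn() and pg_wal_lsn_diff() are standard PostgreSQL functions, the table name is the assumed t_sample:

SELECT pg_current_wal_insert_lsn() AS before_lsn \gset

ALTER TABLE t_sample SET LOGGED;

SELECT pg_current_wal_insert_lsn() AS after_lsn \gset

-- how much WAL did the ALTER TABLE produce?
SELECT pg_size_pretty(pg_wal_lsn_diff(:'after_lsn', :'before_lsn'));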

In addition to setting the table from UNLOGGED to LOGGED, I have measured the current WAL position.

What we can see is that a lot of data has been written: we have produced 80 MB of WAL for a single COPY into an empty table (the amount will grow if you run more imports). In case of COPY + INSERT, the volume will be a lot higher.

From this we draw the conclusion that, if we want to do efficient bulk loading, setting a table from LOGGED to UNLOGGED, importing the data and setting it back to LOGGED might not be the best of all ideas - because as soon as a table is set back to LOGGED, its entire content has to be sent to the WAL to make sure that the replicas can receive it.

Direct import vs. recreating indexes
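
As a sketch, the first variant loads the data with the index already in place; the index name idx_sample_b and the reuse of the copy.sql script are assumptions:

TRUNCATE t_sample;                    -- start with an empty table
CREATE INDEX idx_sample_b ON t_sample (b);

\timing on
\i copy.sql                           -- COPY with the index already in place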

With the index already in place, it takes around 8 seconds to copy the data over. Let’s try the same thing, but create the index after loading the data:
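
The second variant, again as a sketch:

TRUNCATE t_sample;
DROP INDEX idx_sample_b;

\timing on
\i copy.sql                                   -- load without any index in place
CREATE INDEX idx_sample_b ON t_sample (b);    -- build the index afterwards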

We can see that the copy process (the actual data insertion) is a lot faster than before. In total, it is quicker to build the index afterwards. Also keep in mind that I am using synthetic data on macOS (not too efficient) here. If you repeat the test with a lot more real data, the difference is even bigger.

Create indexes after importing data if possible.

Enabled triggers vs. disabled triggers

Triggers are also an important factor. One could say that triggers are “the natural enemy” of bulk loading performance. Let’s take a look at the following example:
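
A sketch of such a trigger; the function name, the trigger name and what exactly gets modified are assumptions:

CREATE FUNCTION trig_func() RETURNS trigger AS
$$
BEGIN
        -- modify two entries of the incoming row
        NEW.b := NEW.b + 1;
        NEW.d := NEW.d + 1;
        RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trig_sample
        BEFORE INSERT ON t_sample
        FOR EACH ROW
        EXECUTE FUNCTION trig_func();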

Our trigger is really simple. All it does is modify two entries in our data. However, the trigger adds an extra function call to every row, which really adds up.

In our case, the variant with the trigger is around 3 times slower. However, the real difference highly depends on the complexity of the trigger, the size of a row and a lot more. There is no way to state that “a trigger slows things down by a factor of X”. One has to look, case by case, at what happens.

Optimizing column order for bulk loading

There is more to importing large amounts of data into PostgreSQL than meets the eye. So far, we have optimized checkpoints, touched indexes, triggers and so on. But what about the column order? Let’s try to find out.

In PostgreSQL, column order does make a real difference. It is generally a good idea to put “fixed length” columns in front. In other words: int8, int4, timestamptz and so on should be at the beginning of the table, while variable-length data types such as varchar, text and so on should be at the end. The reason is CPU alignment: fixed-length types have alignment requirements on disk, so interleaving them with variable-length columns wastes space through padding. This is true for normal heap tables (not for zheap).
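
As a sketch, the second definition packs more tightly on disk than the first one, simply because the fixed-length columns come first (table and column names are made up):

-- suboptimal: fixed-length columns interleaved with variable-length ones
CREATE TABLE t_bad  (a varchar(50), b int8, c varchar(50), d int8);

-- better: fixed-length columns first, variable-length columns at the end
CREATE TABLE t_good (b int8, d int8, a varchar(50), c varchar(50));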

Shrinking the size of a table without changing the content can speed things up, because it helps to avoid or reduce one of the key bottlenecks when bulk loading data: I/O. Check out this article to find out more.

Tooling for bulk loading

If what you have seen so far is still not enough, we can recommend some tools to improve bulk loading even more:

Both tools are very well known and widely used. You can use them safely.

If you have further questions regarding these tools, please don’t hesitate to ask in the comment section, or send us an email.

Finally …

If you want to know more about PostgreSQL performance, we also recommend checking out our consulting services. We help you to tune your database and make sure that your servers are operating perfectly.

4 responses to “PostgreSQL: Bulk loading huge amounts of data”

  1. Thanks for a good article Hans-Jürgen. Say, do you know whether PostgreSQL does some optimization when inserting multiple records in "batch", e.g. using multiple threads when processing multiple INSERT statements in a single transaction?

  2. The first test copy vs insert isn't a like for like. Either have 3 copy statements or 1 insert statement with multiple values

