Table of Contents
There are several techniques to bulk load data into PostgreSQL. I decided to compare their performance in a simple test case. I'll add some recommendations for parameter settings to improve the performance even more.
The table is simple enough:
1 2 3 4 |
CREATE TABLE instest ( id bigint PRIMARY KEY, value text NOT NULL ); |
It is a narrow table (only two columns), but it has a primary key index. Loading data would be much faster without the index, but in real life you cannot always drop all indexes and constraints before loading data. Moreover, dropping indexes before the bulk load and re-creating them afterwards can be slower, if the table already contains data.
For the performance test, I'm going to load 10 million rows that look like (1, '1')
, (2, '2')
and so on, counting up to 10000000. The test will be performed on PostgreSQL v16 with the default configuration. I am aware that the default configuration is not perfect for bulk loading, but that does not bother me, since I am only interested in comparing the different methods.
I'll try the following six techniques:
INSERT
s in autocommit mode (the fool's way)This will be terribly slow, since each statement will run in its own transaction. For each transaction, PostgreSQL has to write the WAL (the transaction log) out to disk, which leads to 10 million I/O requests.
Naturally, this is not the correct thing to do for a bulk load. But it is the way a transactional application loads data into the database, with many clients inserting small amounts of data, and no way to bundle the individual requests into bigger transactions. There are some remedies available to boost performance in such a case:
synchronous_commit
to off
. That will boost performance amazingly, but an operating system crash could lead to some committed transactions getting lost.commit_delay
to a value greater than 0 and tune commit_siblings
. That can reduce the number of I/O requests. The effect won't be as marked as with synchronous_commit
, but you can never lose a committed transaction.INSERT
s in one transactionThe only difference to the previous test is that we will insert all 10 million rows in a single transaction. That is bound to boost performance quite a lot.
INSERT
s with a prepared statement in one transactionThe difference to the previous test is that we use a prepared statement:
1 2 |
PREPARE stmt(bigint,text) AS INSERT INTO instest (id, value) VALUES ($1, $2); |
and perform the inserts like this:
1 |
EXECUTE stmt(1, '1'); |
The performance should be better, because PostgreSQL can reuse the execution plan for the INSERT
statement rather than planning it 10 million times. With short statements, that can be a notable performance improvement.
INSERT
s in one transactionYou can insert several rows with a single INSERT
statement:
1 2 3 4 5 |
INSERT INTO instest (id, value) VALUES (1, '1'), (2, '2'), ... (1000, '1000'); |
For this test, I'll insert 1000 rows per statement, so there will be 10000 such INSERT
statements. The benefit is that there are fewer client-server round trips, and PostgreSQL has to plan and execute fewer statements.
INSERT
s with a prepared statement in one transactionThis test is like the previous one, except that I'll use a prepared statement:
1 2 3 4 5 6 |
PREPARE stmt(bigint,text,bigint,text,...) AS INSERT INTO instest (id, value) VALUES ($1, $2), ($3, $4), ... ($1999, $2000); |
The idea is to combine the benefits of prepared statements and multi-line INSERT
s.
COPY
(the king's way)It is well known that COPY
is the fastest way to load data into PostgreSQL. Let's see how much better it really is! For this test, I'll load all 10 million rows with a single COPY
statement.
The down side of COPY
is that it is a non-standard SQL statement, so not all APIs support it, and you cannot use it in programs that are to support other database systems as well.
Note that you can use COPY (FREEZE)
if you create or truncate the table in the same transaction. That won't speed up loading, but it avoids the overhead of setting hint bits on the rows by the first reader and anti-wraparound autovacuum on the table.
The tests were run on my Laptop with Fedora Linux 37, untuned PostgreSQL v16 with local domain socket connections and ext4 file systems on an NVMe disk.
Bulk load method | Duration |
---|---|
single INSERT s, many transactions |
8954 s |
single INSERT s |
841 s |
single INSERT s, prepared statement |
688 s |
bulk INSERT s |
52 s |
bulk INSERT s, prepared statement |
57 s |
COPY |
14 s |
Most test results are as expected:
INSERT
in its own transaction is unbearable slowINSERT
s is a winCOPY
What may surprise is that using prepared statements for the multi-line statements slows down processing. I am not certain what the reason is, but it could be the overhead of the extended query protocol.
For completeness' sake, I will add some hints how you can configure PostgreSQL to speed up bulk loads. There is little you can to to speed up writes, but you can reduce the number of redundant, unnecessary writes.
max_wal_size
. The default value is 1GB. If more than that amount of WAL has been written since the latest checkpoint, PostgreSQL will trigger another checkpoint. The idea is to keep crash recovery time short. But 1GB is very little when it comes to bulk loading, and you can end up having a checkpoint every couple of seconds. That is an unnecessary overhead, particularly since PostgreSQL might have to flush the same (index) pages to disk again and again.checkpoint_timeout
. Fewer checkpoints mean fewer redundant writes. Moreover, reducing the number of checkpoints will reduce the size of the WAL, since fewer “full-page images” have to be written.Note that tuning these parameters won't improve my special test case a lot, since the table is only extended and the index values are increasing. B-tree indexes are optimized for that usage pattern. If I had chosen UUIDs as primary keys, the index modifications would not be localized, and reducing the number of full-page images written would matter more.
For bulk load, COPY
is the way to go.
Speed up PostgreSQL data loading with COPY (FREEZE) Learn how this feature works with hint bits.
If you are interested in PostgreSQL performance, you may also want to read about the performance of various methods of auto-generated primary keys or ways of storing large binary data to enhance performance.
In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Twitter, Facebook, or LinkedIn.
+43 (0) 2622 93022-0
office@cybertec.at
You are currently viewing a placeholder content from Facebook. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More InformationYou are currently viewing a placeholder content from X. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More Information
Does this command make the task load faster?
Which command?
network speed between servers and disk speed is also important when moving and processing data. filesystem types, how to mount then (ext4 nodiratime, nobarrier, noatime)
I have added some more about the connections and the file system used, thanks.
I don't think that the file system has a big impact, but the connections probably do.
> For this test, I’ll insert 1000 rows per statement, so there will be 10000 such INSERT statements
How does tuning that affect the performance, i.e. more rows/statement ?
Thanks for an interesting article.
I didn't test it, so I don't know. My guess is that beyond a certain number of rows (less than 1000) there will be no more performance improvement, and if you keep increasing, the performance will go down again.