Table partitioning is one of the most popular of the more recent PostgreSQL features. However, there is no support for automatic partition creation yet. This article shows what you can do to remedy that.
There are essentially two use cases:

- create partitions ahead of time, before the data arrive
- create partitions on demand, when rows arrive that do not fit any existing partition

I will call the first option time-triggered partitioning and the latter on-demand partitioning.
For time-triggered partitioning, the lack of PostgreSQL support is at worst an inconvenience. There are several ways to work around the limitation:
You can use the operating system scheduler (cron on Unix-like systems or "Task Scheduler" on Windows). The job would connect to the database using psql and run the appropriate CREATE TABLE and ALTER TABLE ... ATTACH PARTITION statements.
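For illustration, here is a sketch of such a script for a hypothetical table meas that is range partitioned by month; a crontab entry like 0 0 1 * * psql -d mydb -f next_partition.sql would run it on the first of each month:

DO $$
DECLARE
   next_start date := date_trunc('month', current_date) + INTERVAL '1 month';
   next_end   date := date_trunc('month', current_date) + INTERVAL '2 months';
   part_name  text := 'meas_' || to_char(next_start, 'YYYY_MM');
BEGIN
   /* create the table first and attach it afterwards: ATTACH PARTITION
      needs a weaker lock than CREATE TABLE ... PARTITION OF */
   EXECUTE format('CREATE TABLE %I (LIKE meas INCLUDING ALL)', part_name);
   EXECUTE format(
      'ALTER TABLE meas ATTACH PARTITION %I FOR VALUES FROM (%L) TO (%L)',
      part_name, next_start, next_end
   );
END;$$;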
Typically, you would want to schedule such a job directly on the database server machine. Sometimes you cannot do that, for example, because you are using a hosted database and have no access to the operating system. In that case, you can schedule the job on a different machine.
PostgreSQL doesn't have a built-in job scheduling system. However, there are extensions like pg_timetable or pg_cron that allow you to schedule database jobs. Installing a third-party extension adds another dependency to your software architecture, and it may not be possible on a hosted database. On the other hand, it will make your job independent from the operating system scheduler and is less likely to be forgotten when you move the database.
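With pg_cron, for example, the same monthly job could be scheduled from inside the database; create_next_partition() stands in for a procedure you would have to write yourself:

SELECT cron.schedule(
   'create-next-partition',   -- job name
   '0 0 1 * *',               -- midnight on the first of each month
   $$CALL create_next_partition()$$
);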
pg_partman is a PostgreSQL extension that specializes in partition management. It was useful and necessary in the bad old days before v10 introduced “declarative partitioning”: it created triggers and constraints for you. Today, most of this functionality is unnecessary, but the support for automatic partition creation can still come in handy.
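As a sketch, with the pg_partman 4.x API (the function signature changed in version 5), you would register a hypothetical table meas that is range partitioned by its timestamp column ts like this:

SELECT partman.create_parent(
   p_parent_table => 'public.meas',
   p_control      => 'ts',       -- the partitioning column
   p_type         => 'native',   -- use declarative partitioning
   p_interval     => 'daily'
);

New partitions are then created by calling partman.run_maintenance() periodically, for example from one of the schedulers mentioned above.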
Creating partitions on demand is more difficult, and here the lack of support from PostgreSQL core is more painful.
The naïve approach to try is to have a BEFORE INSERT trigger that creates and attaches a new partition. However, that will cause one of the following "object in use" errors:
ERROR: cannot CREATE TABLE .. PARTITION OF "tab" because it is being used by active queries in this session
or
ERROR: cannot ALTER TABLE "tab" because it is being used by active queries in this session
Essentially, by the time the INSERT has started, you cannot modify the table definition any more.
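For concreteness, here is a minimal sketch of such a naïve trigger function, using the example table defined below (naive_trig is a made-up name; this is what provokes the first error):

CREATE FUNCTION naive_trig() RETURNS trigger
   LANGUAGE plpgsql AS
$$BEGIN
   /* try to create the missing partition on the fly - this fails */
   EXECUTE format(
      'CREATE TABLE %I PARTITION OF tab FOR VALUES IN (%L)',
      'tab_' || to_char(NEW.ts, 'YYYY-MM-DD'),
      NEW.ts::date
   );
   RETURN NEW;
END;$$;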
LISTEN and NOTIFY as a workaround

What we need is asynchronous processing: we have to create the new partition in a different transaction. Since we certainly cannot create the partition before the INSERT, it has to happen afterwards. However, the partition has to exist at the time of the INSERT, so we would seem to be at an impasse.
But there is another option: instead of creating the partition in a single step, we can first create a table and then attach it later, as a partition of the partitioned table. The sequence of events could then be like this:
- in the BEFORE INSERT trigger, create a new table that is not yet a partition (skipped if the table already exists)
- insert the new row into that table
- at some later time, attach the new table as a partition

The remaining question is how to perform the attachment "at some later time". There is a convenient feature in PostgreSQL for asynchronous processing: LISTEN and NOTIFY. A session can register for notifications on a channel (often a table name) with LISTEN and will then be notified asynchronously whenever another backend calls NOTIFY for that channel. The sending backend can add a message string to the notification.
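Here is a minimal demonstration in psql; the channel name and payload are arbitrary:

-- session 1: subscribe to the channel
LISTEN tab;

-- session 2: send a notification with a payload string
NOTIFY tab, '2024-06-01';
-- equivalent function call: SELECT pg_notify('tab', '2024-06-01');

-- session 1 sees, when it next talks to the server:
-- Asynchronous notification "tab" with payload "2024-06-01"
-- received from server process with PID 12345.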
So we can proceed in the BEFORE INSERT trigger as follows:

- create a table for the new partition, unless it already exists
- NOTIFY with the new partition key as the argument (only if a new table was created)
- insert the new row into the new table
- return NULL, which skips the original INSERT, since we already inserted the row in the partition
Note that this requires that we have a default partition; otherwise, inserting a row that does not match an existing partition will fail before the trigger even runs, because tuple routing happens first.
In a different session, a daemon process waits for notifications and attaches the new table to the partitioned table.
Here is the definition of the partitioned table:
CREATE TABLE tab (
   id bigint GENERATED ALWAYS AS IDENTITY,
   ts timestamp NOT NULL,
   data text
) PARTITION BY LIST ((ts::date));

CREATE TABLE tab_def PARTITION OF tab DEFAULT;
I chose list partitioning to simplify the listener code, although that is not required.
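Concretely, when the daemon receives a notification with payload 2024-06-01 (a made-up date), the statement it ends up running is:

ALTER TABLE tab ATTACH PARTITION "tab_2024-06-01"
   FOR VALUES IN ('2024-06-01');

Note that the generated table name contains dashes, so it has to be quoted as an identifier.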
The trigger looks like this:
CREATE FUNCTION part_trig() RETURNS trigger
   LANGUAGE plpgsql AS
$$BEGIN
   BEGIN
      /* try to create a table for the new partition */
      EXECUTE
         format(
            'CREATE TABLE %I (LIKE tab INCLUDING INDEXES)',
            'tab_' || to_char(NEW.ts, 'YYYY-MM-DD')
         );
      /*
       * tell listener to attach the partition
       * (only if a new table was created)
       */
      EXECUTE
         format(
            'NOTIFY tab, %L',
            to_char(NEW.ts, 'YYYY-MM-DD')
         );
   EXCEPTION
      WHEN duplicate_table THEN
         NULL;  -- ignore
   END;

   /* insert into the new partition */
   EXECUTE
      format(
         'INSERT INTO %I VALUES ($1.*)',
         'tab_' || to_char(NEW.ts, 'YYYY-MM-DD')
      )
      USING NEW;

   /* skip insert into the partitioned table */
   RETURN NULL;
END;$$;

CREATE TRIGGER part_trig
   BEFORE INSERT ON tab FOR EACH ROW
   WHEN (pg_trigger_depth() < 1)
   EXECUTE FUNCTION part_trig();
To understand why the trigger function returns NULL, read my blog post on the topic. The WHEN clause in the trigger definition avoids infinite recursion: without it, inserting a row into a partition would call the trigger function again.
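A quick way to watch the machinery, with a made-up timestamp: until the listener has attached the new table, the row is only visible in the standalone table, not through the partitioned table.

INSERT INTO tab (ts, data)
   VALUES ('2024-06-01 12:00:00', 'something');

SELECT count(*) FROM tab;                -- 0: not yet attached
SELECT count(*) FROM "tab_2024-06-01";   -- 1: the row is here

As soon as the daemon has processed the notification, the row appears in tab.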
For the listener, we need to use a PostgreSQL client API that supports asynchronous notification. I chose to use C:
#include <libpq-fe.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/select.h>
#include <errno.h>

static int attach_partition(PGconn *conn, char *table, char *datestr)
{
    PGresult *res;
    char *esc_tab, *esc_part, *esc_date;
    /* identifiers are at most 63 bytes long */
    char stmt[400], part[64];

    /* escape table and partition name */
    esc_tab = PQescapeIdentifier(conn, table, 63);
    snprintf(part, 63, "%s_%s", table, datestr);
    esc_part = PQescapeIdentifier(conn, part, 63);
    esc_date = PQescapeLiteral(conn, datestr, 10);

    /* construct statement */
    sprintf(stmt,
            "ALTER TABLE %s ATTACH PARTITION %s FOR VALUES IN (%s)",
            esc_tab, esc_part, esc_date);

    PQfreemem(esc_tab);
    PQfreemem(esc_part);
    PQfreemem(esc_date);

    /* execute statement */
    if ((res = PQexec(conn, stmt)) == NULL)
    {
        fprintf(stderr, "Out of memory sending statement\n");
        return 0;
    }
    if (PQresultStatus(res) != PGRES_COMMAND_OK)
    {
        fprintf(stderr, "Error attaching partition: %s\n",
                PQresultErrorMessage(res));
        PQclear(res);
        return 0;
    }
    PQclear(res);

    return 1;
}

int main(int argc, char **argv)
{
    PGconn *conn;
    int sock;
    PGresult *res;

    /* connect to PostgreSQL with default parameters */
    if ((conn = PQconnectdb("application_name=listener")) == NULL)
    {
        fprintf(stderr, "Out of memory connecting to PostgreSQL\n");
        return 1;
    }
    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "Error connecting to PostgreSQL: %s\n",
                PQerrorMessage(conn));
        PQfinish(conn);
        return 1;
    }

    /* get network socket for the connection */
    if ((sock = PQsocket(conn)) < 0)
    {
        fprintf(stderr, "Error getting connection network socket\n");
        PQfinish(conn);
        return 1;
    }

    /* listen on a channel */
    if ((res = PQexec(conn, "LISTEN tab")) == NULL)
    {
        fprintf(stderr, "Out of memory sending LISTEN\n");
        PQfinish(conn);
        return 1;
    }
    if (PQresultStatus(res) != PGRES_COMMAND_OK)
    {
        fprintf(stderr, "Error listening on channel: %s\n",
                PQresultErrorMessage(res));
        PQfinish(conn);
        return 1;
    }
    PQclear(res);

    while (1)
    {
        fd_set sockets;
        PGnotify *note;

        /* block waiting for activity on the network socket */
        FD_ZERO(&sockets);
        FD_SET(sock, &sockets);
        errno = 0;
        if (select(sock + 1, &sockets, NULL, NULL, NULL) < 0)
        {
            perror("Error waiting for notification");
            PQfinish(conn);
            return 1;
        }

        /* consume the input */
        if (PQconsumeInput(conn) == 0)
        {
            fprintf(stderr, "Error receiving data: %s\n",
                    PQerrorMessage(conn));
            PQfinish(conn);
            return 1;
        }

        /* process all pending notifications */
        while ((note = PQnotifies(conn)) != NULL)
        {
            if (!attach_partition(conn, note->relname, note->extra))
            {
                PQfinish(conn);
                return 1;
            }
            PQfreemem(note);
        }
    }
}
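If you want to try the listener, save it as listener.c and build it with something like cc -o listener listener.c -lpq (you may have to add the include and library paths reported by pg_config); it takes its connection parameters from the usual libpq environment variables such as PGDATABASE.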
The sample code above should be free from race conditions. However, performance will not be good. Let's compare the speed of 100000 inserts with and without the trigger:
test=# TRUNCATE tab;
TRUNCATE TABLE
test=# \timing
Timing is on.
test=# INSERT INTO tab (ts, data)
       SELECT clock_timestamp(), 'something'
       FROM generate_series(1, 100000);
INSERT 0 0
Time: 3354.275 ms (00:03.354)
test=# ALTER TABLE tab DISABLE TRIGGER part_trig;
ALTER TABLE
test=# TRUNCATE tab;
TRUNCATE TABLE
Time: 20.005 ms
test=# INSERT INTO tab (ts, data)
       SELECT clock_timestamp(), 'something'
       FROM generate_series(1, 100000);
INSERT 0 100000
Time: 120.869 ms
So while the code works, the trigger incurs an overhead of about 0.032 milliseconds (32 microseconds) per row, which is quite a lot. The whole operation takes 28 times as long as without the trigger, but that factor is only so high because there are no indexes on the table. It would be possible to reduce the overhead by writing the trigger function in C as well, but that would add still more complexity.
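Short of rewriting the trigger in C, one way to reduce the overhead would be to enter the subtransaction only when the table does not exist yet. Here is a sketch using to_regclass(); the EXCEPTION handler has to stay, because two sessions can still race to create the same table:

CREATE OR REPLACE FUNCTION part_trig() RETURNS trigger
   LANGUAGE plpgsql AS
$$DECLARE
   part_name text := 'tab_' || to_char(NEW.ts, 'YYYY-MM-DD');
BEGIN
   /* cheap existence probe; skips the subtransaction in the common case */
   IF to_regclass(quote_ident(part_name)) IS NULL THEN
      BEGIN
         EXECUTE format('CREATE TABLE %I (LIKE tab INCLUDING INDEXES)', part_name);
         EXECUTE format('NOTIFY tab, %L', to_char(NEW.ts, 'YYYY-MM-DD'));
      EXCEPTION
         WHEN duplicate_table THEN
            NULL;  -- somebody else was faster
      END;
   END IF;

   /* insert into the new partition */
   EXECUTE format('INSERT INTO %I VALUES ($1.*)', part_name) USING NEW;

   /* skip insert into the partitioned table */
   RETURN NULL;
END;$$;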
While it is fairly easy to work around the lack of time-triggered partition creation, it is more difficult to create partitions on demand when rows are inserted. We found a solution using triggers, LISTEN and NOTIFY, but the solution was complicated and the performance impact was severe. It would be good to get better support from PostgreSQL core for this!
If you are interested in partitioning, you may want to read our articles about partition management and partitioning live data on foot and with pg_rewrite.
In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Twitter, Facebook, or LinkedIn.
A clever solution, despite the problem of performance.
Just out of curiosity: is there a reason why you "EXECUTE" a "NOTIFY" instead of using pg_notify?
Simple: I didn't think of it. It should work just the same (and be more elegant).
Could you simply create the partitions ahead of time, so that they are ready and waiting for the data?
Sure, if you can do that, everything is fine.
My article explores what you can do if it is hard to figure out ahead of time which partitions will be needed.
Thanks for the article! We're trying to figure out a simple and robust way to automate range partition maintenance here and I'd be very curious to hear your thoughts.
First, it seems like a good idea to create one "spare" partition ahead of time for the next range of keys, so that when the "current" partition fills up, the rows can start flowing into the next one and it will already have the right constraints. At this point, we need to go create the new "next" partition.
For peace of mind, I'd like that to happen automatically within the database, not through an external cron job which might not be running properly. We'll probably create a "DEFAULT" partition as a fallback but we actually don't want it to ever contain any rows.
So for creating the next partition upon the first INSERT, maybe we can work around the problem of a trigger being unable to modify the definition of the table it's firing on by using a deferred constraint trigger.
Specifically, we define a CONSTRAINT trigger AFTER INSERT on the "next" partition DEFERRABLE INITIALLY DEFERRED FOR EACH ROW.
In the trigger function, it creates the new "next" partition, creates the same trigger on that, and drops itself from the current table. So in effect, the trigger only runs once when the partition becomes "current", and so it should have no performance impact.
Do you see any problems with this approach?
Thank you!
I didn't try your approach, but I doubt that it will work. Specifically, dropping a trigger on the table being modified should fail for the same reasons as creating a new partition.
Frankly, I doubt that your construction would be more robust and reliable than a cron job.
I also want to warn you not to use a default partition if you want to create new partitions in the future: a conflicting value in the default partition will prevent creating a new partition.
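For example, with a made-up date: if a row for 2024-06-01 has already ended up in the default partition, attaching a partition for that value will fail like this:

ALTER TABLE tab ATTACH PARTITION "tab_2024-06-01" FOR VALUES IN ('2024-06-01');
ERROR:  updated partition constraint for default partition "tab_def" would be violated by some row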
Thanks a lot for your reply! I didn't realize the DEFAULT partition would do that; but evidently it does and also seems to be triggering more exclusive locks than we'd want.
As for the constraint trigger, I think it works because by the time it fires (on COMMIT), the original INSERT statement has completed and released whatever "schema stability" lock it was holding on its table, and only holds on to the row locks on the inserted tuples.
You are probably right about the deferred constraint!
You could try the insert first and fall back to creating the partition only when it fails. That would avoid trying to create a table each time.
Hello,
Can we do list partitioning based on a varchar column using the pg_partman extension? I am getting this error:

SQL Error [P0001]: ERROR: pg_partman only supports partitioning of data types that are integer or date/timestamp.
CONTEXT: PL/pgSQL function create_parent

If there is any workaround, please let me know.
The error message seems pretty clear.
You should ask the developers of the software for workarounds.
Great read, thanks Laurenz!
Maybe another solution for on-demand partitioning that could work for some setups could be the following?
- Trigger on the default partition (not the whole partitioned table) that, if it receives a record, creates a new table (if a table for that partition does not exist yet) and puts a copy of that record into that table
- Scheduled job that every e.g. 5 minutes checks if there are records in the default partition and if yes truncates the default partition and attaches the tables that the trigger created
If the workload is not too bursty this might be a solution for some? Only drawback I can think of is the short interruptions that the job causes, but those should be super short? What do you think?
Sounds complicated, but why not.