(Updated 2024-09-29) This article was prompted by version 2.28 of the GNU C library, which changed the definition of many collations. Such a change has the potential to cause data corruption in PostgreSQL. I will describe the problem and point out how ICU collations can solve it. The problem is not limited to glibc 2.28; it can hit you whenever you upgrade glibc.
A collation is the set of rules that describe how strings are compared and ordered. This seems trivial at first glance. Indeed, there is a special collation, known as the C or POSIX collation, that simply compares strings byte for byte and thus ranks characters by their numeric encoding value (for UTF-8, essentially the Unicode code point). This collation is simple and performs very well, and it avoids all the problems that this article talks about, but it is usually not good enough. For example, it sorts all upper-case characters before all lower-case characters, and most people would argue that “Z” should not be sorted before “a”. The matter is further complicated because sequences of characters (like the Czech digraph “ch”) can be sorted differently from their constituent characters. Think of punctuation, accents and numbers (is the string “12” less than “2” or not?), and you will begin to grasp the complexity.
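The difference is easy to see by comparing the C collation with an English ICU collation (the latter assumes that PostgreSQL was built with ICU support):

```sql
SELECT c FROM (VALUES ('a'), ('Z'), ('12'), ('2')) AS t(c)
ORDER BY c COLLATE "C";         -- 12, 2, Z, a

SELECT c FROM (VALUES ('a'), ('Z'), ('12'), ('2')) AS t(c)
ORDER BY c COLLATE "en-x-icu";  -- 12, 2, a, Z
```

In the C collation, all upper-case letters rank before all lower-case letters; the ICU collation sorts case-insensitively at the primary level. Neither sorts “2” before “12” — for that you need natural ordering, described below.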
PostgreSQL does not define its own collations; it uses collations defined by third-party libraries. By default, it uses the collations defined by the operating system's C library.
Since v10, PostgreSQL does not only support these “libc collations”, but also “ICU collations” provided by the ICU library. ICU (short for “International Components for Unicode”) describes itself as “a mature, widely used set of C/C++ and Java libraries providing Unicode and Globalization support” and implements, among other things, collations.
From PostgreSQL v15 on, you can use an ICU collation as the database collation (the default collation used in a database). For that, you have to create the database with options like LOCALE_PROVIDER icu ICU_LOCALE "en-US". With older PostgreSQL releases, you have to specify the ICU collation explicitly for each string column you create.
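As a sketch (the database and table names are made up; the CREATE DATABASE variant needs v15 or later):

```sql
-- v15 and later: ICU collation as the database collation
CREATE DATABASE mydb
   TEMPLATE template0
   LOCALE_PROVIDER icu
   ICU_LOCALE "en-US"
   LOCALE "en_US.utf8";

-- older releases: specify the ICU collation per string column
CREATE TABLE people (name text COLLATE "en-x-icu");
```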
PostgreSQL v17 has introduced a built-in collation provider, but currently it only supports binary collations. Once this provider supports natural language collations, it may be the way out of PostgreSQL's problems with collations!
Collations determine how strings are compared, so they affect everything in the database that has to do with the order of strings:

- the ORDER BY clause in SELECT statements
- indexes on string columns

Now, changes in the behavior of ORDER BY can be annoying, but they don't threaten data integrity. But if your index is suddenly ordered in the wrong fashion, that is data corruption. It can lead to incorrect query results or to duplicate entries in UNIQUE indexes. You can hit this problem whenever the library that provides your collations is updated.
The straightforward remedy is to rebuild all indexes on strings.
The simple way to do this is to run REINDEX INDEX for all affected indexes. However, that will block all attempts to use the index and will prevent data modifications on the table. Since PostgreSQL v12, you can use REINDEX CONCURRENTLY to avoid these disadvantages. In older versions of PostgreSQL, you can work around this by using CREATE INDEX CONCURRENTLY to create a copy of the index, then drop the original one.
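For illustration, with a hypothetical index on mytable.textcol:

```sql
-- PostgreSQL v12 and later
REINDEX INDEX CONCURRENTLY mytable_textcol_idx;

-- older releases: build a copy without blocking, then swap it in
CREATE INDEX CONCURRENTLY mytable_textcol_idx_new ON mytable (textcol);
DROP INDEX mytable_textcol_idx;
ALTER INDEX mytable_textcol_idx_new RENAME TO mytable_textcol_idx;
```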
But I wanted to tell you about ICU collations and their benefits.
At first glance, it is not obvious why ICU collations should be any better than C library collations, but there are several advantages. When it comes to avoiding the danger of data corruption, the biggest one is that ICU collations are versioned, as described below. Let's look at how to use ICU collations in PostgreSQL.
If you run

```sql
SELECT * FROM pg_collation WHERE collprovider = 'i';
```

on a v10 or later database, you'll get a long list of collations with names like “vai-Latn-LR-x-icu”.
The “-x-icu” is just a suffix that PostgreSQL adds so that you can see at a glance that it is an ICU collation. The rest of the name is identical to the “collcollate” column (although that is not a requirement), which contains the name of the ICU collation.
ICU collation names are defined in the Unicode Technical Standard #35 and describe the collation. They are a list of subtags, separated by “-” or “_”. The subtags are:

- a language subtag, for example “en” for “English”. The special “und” language is for the language-independent “root” locale.
- an optional script subtag, for example “Latn” for the Latin alphabet.
- after the separator “-u-” follows an optional list of variant subtags and their values that define further modifications:
  - co: the collation variant, which can take values like “phonebk” for the German phone book ordering or “trad” for traditional Spanish ordering
  - kn: if “true”, sequences of digits are ordered numerically rather than alphabetically (“natural ordering”)
  - ks: the collation strength, where “level2” means “case insensitive” and “level1” means “case and accent insensitive”; such collations require the CREATE COLLATION option deterministic = false
  - kc: if “true”, marks collations that would otherwise be case insensitive as case sensitive (if you use it together with “ks-level1”, you end up with a case sensitive, but accent insensitive collation)
  - va: can take the value “posix” to specify binary sort order (by Unicode code point)

Once you know the nomenclature, it is easy to create your own ICU collations.
For example, if the pope wants a Latin collation with natural sort order (so that “Can. 101” sorts after “Can. 22”), he or she could simply run

```sql
CREATE COLLATION "vat-lat" (provider = icu, locale = 'la-VA-u-kn-true');
```

Note that the collation name is an identifier, not a string, so it must be in double quotes. The script subtag is omitted here, since Latin is only available in the Latin alphabet.
From PostgreSQL v12 on, it is also possible to use case- or accent-insensitive collations in PostgreSQL (an often requested feature!). For that, you have to add the clause “deterministic = false” so that PostgreSQL knows that it cannot compare strings for equality byte-wise. You'll have to pay a certain performance penalty for that, but it won't be any more expensive (and may be more intuitive) than the current techniques of using upper() or the citext extension.
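A minimal sketch of such a collation (the name “en-ci” is made up; requires v12 or later):

```sql
CREATE COLLATION "en-ci" (
   provider = icu,
   locale = 'en-US-u-ks-level2',
   deterministic = false
);

-- equality comparisons now ignore case
SELECT 'Hello' = 'hello' COLLATE "en-ci";
```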
Each ICU collation carries the version with which it was created. The main benefit of that is that PostgreSQL knows if an upgrade of the ICU library changes the collation. In that case, you will get a warning whenever you use the collation. You should then REINDEX all indexes that depend on the collation (the documentation has a query to find them) and record the new version with ALTER COLLATION ... REFRESH VERSION. (Update: PostgreSQL v13 has added a similar warning for collations provided by the GNU C library, but unfortunately not for the default collation of the database, which reduces the usefulness considerably.)
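A minimal sketch of that workflow, with hypothetical index and collation names:

```sql
-- rebuild the indexes that depend on the changed collation, for example
REINDEX INDEX mytable_textcol_idx;

-- then record the new collation version to silence the warning
ALTER COLLATION "vat-lat" REFRESH VERSION;
```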
Now this is nice, but it still means down time or the risk of database corruption. If you are using ICU collations, and the libicu that ships with your operating system changes, you are in the same predicament as with glibc collations. However, there is a simple solution: don't upgrade the ICU library. While the C library is an integral component of the operating system that you have to keep updated, for example to get security fixes, the ICU library is comparatively unimportant and doesn't have to be updated.
It is easy to change the collation of an existing column:

```sql
ALTER TABLE mytable ALTER textcol TYPE text COLLATE "en-x-icu";
```

Note that collation names are not strings, but identifiers, so you have to surround them with double quotes if they contain upper-case letters or dashes.
Changing the collation of a table column is very fast, because the table does not have to be rewritten. However, all indexes on the column that use the column's collation have to be rebuilt. Since ALTER TABLE requires an ACCESS EXCLUSIVE lock on the table, this can effectively mean a longer down time. So ALTER TABLE is not the method of choice. How can we do better?
PostgreSQL v10 has introduced logical replication, which makes it possible to replicate between databases that are not identical. In particular, the table columns on the source and on the destination database need not have the same collation. We can make use of that.
These are the steps to move a database to ICU collations:
Create a schema-only dump of the source database:

```shell
pg_dump -s -f schema.sql mydb
```

Next, modify the dump so that every string column gets the desired COLLATE clause. This can be done by hand, but it is safer to use a tool like awk for that.
Create an awk script named “modify.awk” like this:

```awk
# in CREATE TABLE statements, append a COLLATE clause to all string columns
/^CREATE TABLE/, /^$/ {
    print gensub("^( *([a-z_0-9]*|\"[^\"]*\")) (text\\y|character( varying)?\\([0-9]*\\)|([a-z_0-9]*|\"[^\"]*\")\\.citext\\y)( COLLATE ([a-z_0-9]*|\"[^\"]*\")(\\.([a-z_0-9]*|\"[^\"]*\"))?)?",
                 "\\1 \\3 COLLATE " collation,
                 1)
    processed = 1
}

# unmodified copy of all lines that have not been processed above
{
    if (processed == 1)
        processed = 0
    else
        print
}
```
Then you can call awk like this to modify the file:

```shell
awk -f modify.awk -v collation='"en-x-icu"' schema.sql >newschema.sql
```

Note that this will overwrite all pre-existing non-default column collations.
Create the new database and restore the modified schema:

```shell
createdb mydbcopy
psql -d mydbcopy -1 -f newschema.sql
```

If you create the new database in a different database cluster, you must first create all the users and tablespaces that the database needs. This can easily be done with the script generated by pg_dumpall -g on the original database.
On the original database, make sure that “wal_level” is set to “logical”. Then create a publication for all tables:

```sql
CREATE PUBLICATION coll_pub FOR ALL TABLES;
```
Then you create a subscription for that publication on the destination database.
Since both databases are in the same database cluster in our example, we have to create the replication slot explicitly to avoid a deadlock. This is not necessary when you replicate to a different cluster.
On the original database, create a logical replication slot:

```sql
SELECT pg_create_logical_replication_slot('coll_sub', 'pgoutput');
```
On the new database, create a subscription that uses the replication slot:

```sql
CREATE SUBSCRIPTION coll_sub
   CONNECTION 'dbname=mydb user=postgres'
   PUBLICATION coll_pub
   WITH (slot_name = coll_sub, create_slot = false);
```
This starts logical replication. First, the existing contents of the tables are copied over, then replication catches up and keeps the two databases synchronized.
First, you'll have to wait until replication catches up. You can monitor the view pg_stat_replication on the original database and wait until “state” is “streaming” and “flush_lsn” is close to the current log sequence number.
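A query along these lines can serve as a progress check (it assumes the subscription name coll_sub from above, which logical replication uses as the application_name):

```sql
SELECT state, flush_lsn, pg_current_wal_lsn()
FROM pg_stat_replication
WHERE application_name = 'coll_sub';
```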
Once logical replication has caught up, stop the application, change the connection information to the new database and restart the application. It is a good idea to allow for a few seconds of down time because replication is asynchronous and you want all the changes to be replicated when you start using the new database.
If you want to achieve zero down time, you could use the following approach:

- Configure the application with a connection string that contains both databases, like “host=host1,host2” (if the same port is used for both).
- Make replication synchronous by setting “synchronous_standby_names” on the primary server. Then no transaction can get lost.

Once you are done, you should drop the subscription on the new database.
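The cleanup is a single statement on the new database:

```sql
DROP SUBSCRIPTION coll_sub;
```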
ICU collations not only provide many features that operating system collations lack, they are also a way to escape the collation changes that C library collations regularly undergo. This reduces the danger of data corruption from misordered indexes on string columns.

To avoid the danger of collation changes from libicu updates, you can keep that library's version stable with little impact on the rest of the system. And if your collations do change, PostgreSQL will warn you.

With logical replication, changing your collations with little down time is not too difficult.
For further information about ICU collations, see my blog about case-insensitive pattern matching.
In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Twitter, Facebook, or LinkedIn.
> One can hope that ICU collation changes don’t happen as often as glibc collation changes do.
Last I checked (a few years ago), every ICU version changes every collation version.
Hmm, that's bad news. They seem to come out pretty often. That would essentially mean that ICU collations are pretty much unusable if you use your operating system's version of libicu, right? Fedora seems to come out with a release every couple of months.
Using your OS libicu may or may not be more stable than using libc. However,
PostgreSQL does detect libicu updates and emit "collation ... has version
mismatch" warnings. PostgreSQL doesn't detect libc updates that modify
collation.
Red Hat Enterprise Linux 7.6 has libicu-50.1.2-17.el7. ICU 50.1.2 is from
2012, suggesting Red Hat may be maintaining collation compatibility throughout
the life of RHEL 7. Ten years of collation stability would be usable.
I recommend using a libicu build private to PostgreSQL, separate from the OS
build.
You are right, that's the way to go.
I'll add something to this effect to the post.
Why the following statement?
> the C collation avoids all the problems that this article talks about
Doesn't the C collation ever change?
It doesn't ever change. It is just the order defined by memcmp() of the strings, which only depends on the encoding.

Is there any official documentation that states that (for me to link to in my project)?
Look here:
Hello, thanks a lot for sharing. I have an

ERROR: character with byte sequence 0xe2 0x80 0xa6 in encoding "UTF8" has no equivalent in encoding "LATIN5"

but my database is about 2 TB, so I hesitate to change the collation. What could I do?
That has nothing to do with collations, but with the encoding of the database.
You have a character (…) that cannot be represented in LATIN5.
You cannot change the encoding of a database.
Don't store that character, or use a different database with encoding UTF8.