Database people dealing with natural languages are all painfully aware of the fact that encodings, special characters, accents and alike are usually hard to deal with. This is especially true if you want to implement search in a user-friendly way. This post describes the PG extension unaccent.
Table of Contents
Consider the following example in PostgreSQL: My name contains a couple of super inconvenient special characters, which will cause issues for people around the globe. The correct spelling of my surname is “Schönig”, which is pretty hard to type on most keyboards I have seen around the world. And: Who cares about my special characters anyway? People might just want to type in “Schonig” into some search form and find information about me (ideally related to PostgreSQL and database work).
1 2 3 4 5 6 7 8 9 10 11 |
test=# SELECT 'Schönig' = 'Schonig'; ?column? ---------- f (1 row) test=# SELECT 'Schönig' = 'Schoenig'; ?column? ---------- f (1 row) |
The “=” operator compares those two strings and concludes that those two strings are not identical. Therefore, the correct answer is “false”. While that is true from a technical point of view it might be a real issue and end users might be unsatisfied with the result. Problems like that can make daily life pretty hard. A solution to the problem is therefore needed.
If you want to improve your user experience you can turn to the “unaccent” extension, which is shipped as part of the PostgreSQL contrib package. Installing it is really easy:
1 2 |
test=# CREATE EXTENSION unaccent; CREATE EXTENSION |
In the next step you can call the “unaccent” function to clean a string and turn it into something more useful. This is what happens when we use this function on my name and the name of my PostgreSQL support company:
1 2 3 4 5 6 7 8 9 10 11 |
test=# SELECT unaccent('Hans-Jürgen Schönig, Gröhrmühlgasse 26, Wiener Neustadt'); unaccent --------------------------------------------------------- Hans-Jurgen Schonig, Grohrmuhlgasse 26, Wiener Neustadt (1 row) test=# SELECT unaccent('Cybertec Schönig & Schönig GmbH'); unaccent --------------------------------- Cybertec Schonig & Schonig GmbH (1 row) |
The beauty is that we can easily compare strings in a more tolerant and more user-friendly way:
1 2 3 4 5 6 7 8 9 10 11 |
test=# SELECT unaccent('Schönig') = unaccent('Schonig'); ?column? ---------- t (1 row) test=# SELECT unaccent('Schönig') = unaccent('Schönig'); ?column? ---------- t (1 row) |
In both cases, PostgreSQL will return true, which is exactly what we want.
When using unaccent there is one thing, which you should keep in mind. Here is an example:
1 2 3 4 |
test=# CREATE TABLE t_name (name text); CREATE TABLE test=# CREATE INDEX idx_accent ON t_name (unaccent(name)); ERROR: functions in index expression must be marked IMMUTABLE |
PostgreSQL supports the creation of indexes on functions. However, a functional index has to return an immutable result, which is not the case here. If you want to index on an unaccented string you have to create an additional column, which contains a pre-calculated value (“materialized”). Otherwise, it's just not possible.
In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Facebook or LinkedIn.
You need to load content from reCAPTCHA to submit the form. Please note that doing so will share data with third-party providers.
More InformationYou are currently viewing a placeholder content from Facebook. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More InformationYou are currently viewing a placeholder content from X. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More Information
Is then a difference in
ORDER BY unaccent(name)
and
ORDER BY name ?
Sure with precondition of the right language setting 🙂
If you have a German collation in
pg_collation
(something likeGerman
on Windows orde_DE.UTF-8
on UNIX), the results should be pretty similar (as long as you are not using characters likeč
oré
).One notable difference would be the sort performance, since
unaccent
would be called for every string that gets sorted. IfORDER BY
can use an index scan, that shouldn't make a difference.Interesting topic, which raises a unique question for me. Could the unaccent() function be used during data creation to strip out the UTF8 characters (our database was built with ASCII so these characters especially cause us problems) to help work around the limitationof ASCII databases not handling these characters?