unaccent: Getting rid of umlauts, accents and special characters

Here is the problem:

test=# SELECT 'Schönig' = 'Schonig';
 ?column? 
----------
 f
(1 row)

test=# SELECT 'Schönig' = 'Schoenig';
 ?column? 
----------
 f
(1 row)

The “=” operator compares the two strings and concludes that they are not identical, so “false” is the correct answer. While that is right from a technical point of view, it can be a real issue in practice: end users who search for a name without typing the umlaut expect to find it and will be unsatisfied with the result. A more tolerant comparison is needed.

PostgreSQL provides a useful extension

If you want to improve your user experience, you can turn to the “unaccent” extension, which is shipped as part of the PostgreSQL contrib package. Installing it is easy:

test=# CREATE EXTENSION unaccent;
CREATE EXTENSION

In the next step you can call the “unaccent” function to clean a string and turn it into something more useful. This is what happens when we use this function on my name and the name of my PostgreSQL support company:

test=# SELECT unaccent('Hans-Jürgen Schönig, Gröhrmühlgasse 26, Wiener Neustadt');
                        unaccent                         
---------------------------------------------------------
 Hans-Jurgen Schonig, Grohrmuhlgasse 26, Wiener Neustadt
(1 row)

test=# SELECT unaccent('Cybertec Schönig & Schönig GmbH');
            unaccent             
---------------------------------
 Cybertec Schonig & Schonig GmbH
(1 row)

The beauty is that we can easily compare strings in a more tolerant and more user-friendly way:

test=# SELECT unaccent('Schönig') = unaccent('Schonig');
 ?column? 
----------
 t
(1 row)

test=# SELECT unaccent('Schönig') = unaccent('Schönig');
 ?column? 
----------
 t
(1 row)

In both cases, PostgreSQL returns true, which is exactly what we want. Note that unaccent maps “ö” to “o”, not to “oe”, so 'Schoenig' would still not match 'Schönig'.
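
The same comparison can be used as a search filter. The following is a minimal sketch, assuming the unaccent extension is already installed; the inline VALUES list and the column alias "name" are made up for illustration and stand in for a real table:

```sql
-- Hypothetical sample data supplied inline; no table is required.
SELECT name
FROM (VALUES ('Schönig'), ('Schonig'), ('Schoenig'), ('Müller')) AS t(name)
-- Unaccent both sides so that the stored value and the search term
-- are compared in the same normalized form.
WHERE unaccent(name) = unaccent('Schonig');
```

The filter should match 'Schönig' and 'Schonig', while 'Schoenig' and 'Müller' are filtered out. Applying unaccent to the search term as well means that it does not matter whether the user typed the umlaut or not.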

Indexing

When using unaccent, there is one thing you should keep in mind. Here is an example:

test=# CREATE TABLE t_name (name text);
CREATE TABLE
test=# CREATE INDEX idx_accent ON t_name (unaccent(name));
ERROR:  functions in index expression must be marked IMMUTABLE

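PostgreSQL supports the creation of indexes on expressions. However, every function used in an index expression has to be marked IMMUTABLE, which is not the case here: unaccent is only marked STABLE, because its result depends on the dictionary (rules file) in use. Indexing unaccent(name) directly is therefore not possible. If you want to index an unaccented string, you can create an additional column that contains the pre-calculated (“materialized”) value and index that column instead.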
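
The additional-column approach could be sketched as follows. This is only an illustration, not the one definitive way to do it: the column name name_unaccent, the trigger function t_name_unaccent_trig, the trigger name and the index name are hypothetical, and a trigger is just one possible way to keep the column up to date (the application could also write the value itself).

```sql
-- Hypothetical extra column holding the pre-calculated value.
ALTER TABLE t_name ADD COLUMN name_unaccent text;

-- Fill the column for rows that already exist.
UPDATE t_name SET name_unaccent = unaccent(name);

-- Trigger function (hypothetical name) that keeps the column in sync.
CREATE FUNCTION t_name_unaccent_trig() RETURNS trigger
LANGUAGE plpgsql AS
$$
BEGIN
    NEW.name_unaccent := unaccent(NEW.name);
    RETURN NEW;
END;
$$;

CREATE TRIGGER t_name_unaccent
    BEFORE INSERT OR UPDATE ON t_name
    FOR EACH ROW EXECUTE PROCEDURE t_name_unaccent_trig();

-- An ordinary index on a plain column needs no IMMUTABLE function.
CREATE INDEX idx_name_unaccent ON t_name (name_unaccent);

-- Queries compare the stored value with the unaccented search term.
SELECT name FROM t_name WHERE name_unaccent = unaccent('Schönig');
```

One thing to keep in mind with this design: the stored values reflect the unaccent rules at the time they were written, so if the rules file is changed later, the column has to be recalculated.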

3 responses to “unaccent: Getting rid of umlauts, accents and special characters”

Georg Klimm says:

March 13, 2019 at 1:47 pm

Is there then a difference between
ORDER BY unaccent(name)
and
ORDER BY name ?
Assuming the right language setting, of course 🙂

- laurenz says:
  
  March 13, 2019 at 2:25 pm
  
  If you have a German collation in pg_collation (something like German on Windows or de_DE.UTF-8 on UNIX), the results should be pretty similar (as long as you are not using characters like č or é).
  
  One notable difference would be the sort performance, since unaccent would be called for every string that gets sorted. If ORDER BY can use an index scan, that shouldn't make a difference.
  
Javier Zayas says:

July 12, 2019 at 4:35 pm

Interesting topic, which raises a question for me. Could the unaccent() function be used during data creation to strip out the UTF8 characters (our database was built with ASCII, so these characters cause us particular problems) to help work around the limitation of ASCII databases not handling these characters?

Hans-Jürgen Schönig

Blog Tags

NEWSLETTER

Articles by our PostgreSQL Experts