(Inscription seen on a T-shirt)
Table of Contents
It's annoying to get error messages caused by encoding problems. But it is more annoying not to get error messages and end up with bad data in the database. I'll show you how to fix bad encoding.
Each PostgreSQL database has a server encoding. You can inspect (but not change) the PostgreSQL parameter server_encoding
to see the encoding of the database you are connected to. You can choose the database encoding when you create a database, but you cannot change the server encoding after creating the database. This restriction is necessary because changing the server encoding will render string data corrupt.
Each database session has a client encoding, which you can view and change using the PostgreSQL parameter client_encoding
. If your client uses the C client library libpq, you can determine the client encoding with the environment variable PGCLIENTENCODING
. In addition, psql
has the command encoding
to view and change the client encoding. psql
tries to guess a good client encoding from the current locale. If the client encoding isn't specified anywhere, PostgreSQL will set it to the server encoding.
PostgreSQL expects that character data sent by a client are encoded in the client encoding and converts them to the server encoding. Conversely, PostgreSQL converts query results to the client encoding.
PostgreSQL supports many encodings. All supported server encodings must be a superset of ASCII.
The server encoding of my database is UTF8
. We'll use this simple table:
1 2 3 4 |
CREATE TABLE texts ( id bigint PRIMARY KEY, t text ); |
Using Linux' echo
command, I create an SQL script in UTF-8 encoding:
1 2 |
echo -e 'INSERT INTO texts VALUES (1, 'schxc3xb6n');' > insert_utf8.sql |
Then I create a second script, with the same string in LATIN-1 (= ISO 8859-1) encoding:
1 2 |
echo -e 'INSERT INTO texts VALUES (2, 'schxf6n');' > insert_latin1.sql |
Both strings contain the German letter “ö”. LATIN-1 encodes it as a single byte, while it is two bytes in UTF-8.
Many people have seen error messages like this one:
1 2 |
psql -c 'encoding UTF8' -f insert_latin1.sql psql:insert_latin1.sql:1: ERROR: invalid byte sequence for encoding 'UTF8': 0xf6 0x6e 0x27 0x29 |
PostgreSQL is very strict about data integrity and won't let you store strings that are not properly encoded. This may be annoying, but it saves you from worse trouble. Other database systems are more relaxed about data integrity (Oracle lets you store anything as long as client and server encoding are identical).
Given the above, you'd think that it is impossible to get badly encoded data in PostgreSQL. But that is not the case:
1 2 |
psql -c 'encoding LATIN1' -f insert_utf8.sql INSERT 0 1 |
There was no error, because both bytes that make up “ö” in UTF-8 also encode valid characters in LATIN-1. This becomes obvious if we query the table with a correctly set client encoding:
1 2 3 4 5 |
psql -c 'encoding UTF8' -c 'TABLE texts' id │ t ════╪════════ 1 │ schön (1 row) |
However, such data curruption can remain undetected for a long while. As long as you consistently set the client encoding to the wrong value, the data will appear to be correct:
1 2 3 4 5 |
psql -c 'encoding LATIN1' -c 'TABLE texts' id │ t ════╪════════ 1 │ schön (1 row) |
When diagnosing encoding problems, it is very helpful to know what data the database actually stores. After all, there are other reasons for improperly displayed characters:
To see the actual bytes that the database stores, make sure that you set the parameter bytea_output
to its default value hex
, then convert the offending string to the data type bytea
:
1 2 3 4 5 6 |
SELECT convert_to(t, 'SQL_ASCII') FROM texts WHERE id = 1; t ════════════════════ x736368c383c2b66e (1 row) |
The first three characters are ASCII characters and are encoded as single bytes. The “ö” now has become four bytes (c3 83 c2 b6), where the first two bytes encode “Ô and the last two bytes encode “¶”.
Since you cannot change the server encoding, the only way to do that is to export and import the database. How exactly you do that depends on how bad the mess is.
If all your non-ASCII characters have the same problem (the client encoding was always set to the same wrong value), the case is comparatively simple. You use plain format pg_dump
to export the database while explicitly setting the bad client encoding. In our case, that would look like
1 |
pg_dump -E LATIN1 -t texts -f dump.sql dbname |
Then you edit dump.sql
with a text editor and change the line that sets the client encoding to the value that actually matches the data:
1 |
SET client_encoding = 'UTF8'; |
Now import the modified dump.sql
using psql
, and you are done.
We import the second SQL script to get a second, correctly encoded row in our test table:
1 2 |
psql -c 'encoding LATIN1' -f insert_latin1.sql INSERT 0 1 |
With the client encoding that is correct for my terminal, the data now look like this:
1 2 3 4 5 6 |
psql -c 'encoding UTF8' -c 'TABLE texts' id │ t ════╪════════ 1 │ schön 2 │ schön (2 rows) |
Clearly, this case is more difficult than the previous one. There is no absolutely safe way to determine which data are correctly encoded and which ones are not. However, we can repair most of the damage as follows.
First, we create four dump files:
1 2 3 4 5 6 7 8 9 |
pg_dump --section=pre-data -t texts -f predata.sql dbname pg_dump --section=data -E LATIN1 --inserts -t texts -f data_latin1.sql dbname pg_dump --section=data -E UTF8 --inserts --on-conflict-do-nothing -t texts -f data_utf8.sql dbname pg_dump --section=post-data -t texts -f postdata.sql dbname |
The option --on-conflict-do-nothing
is new in PostgreSQL v12. You can omit that option if you have to work with an older version of pg_dump
, but in that case you will have to ignore errors during the import of data_utf8.sql
.
The next step is to edit data_latin1.sql
and change the line that sets client_encoding
to the value UTF8
.
Then simply import the four files in a slightly different order into a new database using psql
:
1 2 3 4 5 6 7 |
psql -d newdb -f predata.sql psql -d newdb -f data_latin1.sql 2>/dev/null psql -d newdb -f postdata.sql psql -d newdb -f data_utf8.sql |
During the import of data_latin1.sql
, we get an error for the rows with correctly encoded data, but psql
imports the rows with incorrectly encoded data correctly. Importing postdata.sql
creates the primary key constraint for the table. After that, importing data_utf8.sql
will skip the rows that we already managed to import, but insert the rows that are missing after the import of data_latin1.sql
.
The result is convincing:
1 2 3 4 5 6 |
psql -d newdb -c 'TABLE texts' id │ t ════╪═══════ 1 │ schön 2 │ schön (2 rows) |
Incorrectly encoded strings are an annoying problem in a database. If we know what encodings are present in the database, we can fix most of the corruption with a carefully designed series of exports and imports. The downtime incurred by such data repair is something we cannot avoid. It is possible that some incorrectly encoded strings remain; you will have to find and fix those manually.
If you are interested in data corruption, you may want to read my articles about collations and data corruption or what to watch out for if you want to avoid data corruption.
In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Twitter, Facebook, or LinkedIn.
+43 (0) 2622 93022-0
office@cybertec.at
You are currently viewing a placeholder content from Facebook. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More InformationYou are currently viewing a placeholder content from X. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More Information
Casting text to bytea isn't a good way to see how values are encoded; many text values will give errors on the cast. (It doesn't do what you apparently think: it expects the text value to be the input representation of a bytea value.)
Instead use
convert_to(col,'SQL_ASCII')
to see the bytes of the text value;SQL_ASCII
in this context means "no conversion", so you get the stored bytes regardless of what the server encoding is.There are a bunch of ways to fix certain kinds of systematic errors without needing to do wholesale dump/restore. The most common errors I've seen from people on IRC are: double-encoded UTF8, and mixed UTF8 and Latin1 or Win1252 in an SQL_ASCII or LATIN1 db. These are both fixable in an automated way with very high reliability, thanks to UTF8's properties.
Thanks for the tip with
convert_to()
, that is indeed better.It is certainly also possible to run
UPDATE
s with aWHERE
condition that finds the problematic strings. That is likely preferable if you know which columns are affected. My method of dump and restore is a simple and dumb approach to the double encoding problem; it might be a good idea if much of the data are affected or if you don't know which columns are affected.