Infrastructure at your Service

Daniel Westermann

Dealing with German “Umlaute” in PostgreSQL’s full text search

PostgreSQL comes with build-in Full Text Search and you can do quite amazing stuff with it. A question that popped up during one of the last PostgreSQL for developers workshop was: How can I deal with German “Umlaute” such as “ä”, “ö” and “ü” in such a way, that I can search for e.g. “Schnösel” and the result will give me “Schnösel” as well as “Schnoesel”? One way to deal with that would be to replace all occurrences of “ä”, “ö” and “ü” with “ae”, “oe” and “ue” before doing the search, and then automatically convert the search string to “ae”, “oe” and “ue”, once a search with an “Umlaut” comes in. Even if this can be done quite easily (but probably introduces other issues), there is a more elegant solution to that with PostreSQL’s build-in Full Text Search.

Lets start with a simple table containing two strings:

postgres=# create table umlaute ( string text );
CREATE TABLE
postgres=# insert into umlaute values ('Schnösel');
INSERT 0 1
postgres=# insert into umlaute values ('Schnoesel');
INSERT 0 1
postgres=# select * from umlaute;
  string   
-----------
 Schnösel
 Schnoesel
(2 rows)

If you ask for the two variations of the string in the default configuration, you only get the exact match:

postgres=# select * from umlaute where to_tsvector(string) @@ to_tsquery('Schnösel');
  string  
----------
 Schnösel
(1 row)

postgres=# select * from umlaute where to_tsvector(string) @@ to_tsquery('Schnoesel');
  string   
-----------
 Schnoesel
(1 row)

By default PostgreSQL uses the English text search configuration:

postgres=# show default_text_search_config;
 default_text_search_config 
----------------------------
 pg_catalog.english
(1 row)

… and you might think that switching this to “German” already resolves the issue:

postgres=# set default_text_search_config='german';
SET
postgres=# select current_setting('default_text_search_config');
  current_setting  
-------------------
 pg_catalog.german
(1 row)

postgres=# select * from umlaute where to_tsvector(string) @@ to_tsquery('Schnösel');
  string  
----------
 Schnösel
(1 row)

postgres=# select * from umlaute where to_tsvector(string) @@ to_tsquery('Schnoesel');
  string   
-----------
 Schnoesel
(1 row)

This is definitely not the case, so more work needs to be done to get this working. PostgreSQL allows you to create your own text search configurations easily, and this is what happens in the next step:

postgres=# create text search configuration my_de ( copy = german );
CREATE TEXT SEARCH CONFIGURATION

We now have a copy of the default German text search configuration available as “my_de”. To test, that this actually works you can explicitly specify this configuration in our text search (although, it of course does not change anything right now):

postgres=# select * from umlaute where to_tsvector('my_de',string) @@ to_tsquery('my_de','Schnösel');
  string  
----------
 Schnösel
(1 row)

postgres=# select * from umlaute where to_tsvector('my_de',string) @@ to_tsquery('my_de','Schnoesel');
  string   
-----------
 Schnoesel
(1 row)

Before we can alter our new text search configuration any further we need the extension unaccent. This extension “removes accents (diacritic signs) from lexemes”. Here is a little example of what it does:

postgres=# create extension unaccent;
CREATE EXTENSION
postgres=# select unaccent ('aaÄ');
 unaccent 
----------
 aaA
(1 row)

postgres=# select unaccent ('ÆÆÆ');
 unaccent 
----------
 AEAEAE
(1 row)

Having that in place we can modify the text search configuration we just created:

postgres=# ALTER TEXT SEARCH CONFIGURATION my_de ALTER MAPPING FOR hword, hword_part, word WITH unaccent, german_stem;
ALTER TEXT SEARCH CONFIGURATION

If you wonder, what “hword”, “word” and “hword_part” actually mean, then have a look here. We basically instructed our text search configuration to “unaccent” our search terms by default. Is that enough?

postgres=# select * from umlaute where to_tsvector('my_de',string) @@ to_tsquery('my_de','Schnösel');
  string  
----------
 Schnösel
(1 row)

postgres=# select * from umlaute where to_tsvector('my_de',string) @@ to_tsquery('my_de','Schnoesel');
  string   
-----------
 Schnoesel
(1 row)

No, we still get the same results, but why? The “unaccent” extension uses a rules file to do the translation of the specific characters. By default the “unaccent.rules” file is used and you can find that in the SHAREDIR of your PostgreSQL installation:

postgres=# \! pg_config | grep SHAREDIR
SHAREDIR = /u01/app/postgres/product/DEV/db_1/share
postgres=# \! ls -la /u01/app/postgres/product/DEV/db_1/share/tsearch_data/*.rules
-rw-r--r-- 1 postgres postgres 9549 Dec 26 15:45 /u01/app/postgres/product/DEV/db_1/share/tsearch_data/unaccent.rules
-rw-r--r-- 1 postgres postgres  139 Dec 26 15:44 /u01/app/postgres/product/DEV/db_1/share/tsearch_data/xsyn_sample.rules

This file contains the rules for translating special characters to their desired target:

postgres=# \! head /u01/app/postgres/product/DEV/db_1/share/tsearch_data/unaccent.rules
©       (C)
«       <>
¼        1/4
½        1/2
¾        3/4
À       A
Á       A

If you look for the German “Umlaute” you’ll get this:

postgres=# \! egrep "ä|ö|ü" /u01/app/postgres/product/DEV/db_1/share/tsearch_data/unaccent.rules
ä       a
ö       o
ü       u

This is the reason why we still do not get the desired result. I’ll ge ahead and modify the rules directly but you should rather create your own rules file before modifying an existing one:

postgres=# \! egrep "ä|ö|ü" /u01/app/postgres/product/DEV/db_1/share/tsearch_data/unaccent.rules
ä       ae
ö       oe
ü       ue

From now on “ä” will be translated to “ae”, “ü” to “ue” and “ö” to “oe” and our search works as expected:

postgres=# select * from umlaute where to_tsvector('my_de',string) @@ to_tsquery('my_de','Schnoesel');
  string   
-----------
 Schnösel
 Schnoesel
(2 rows)

postgres=# select * from umlaute where to_tsvector('my_de',string) @@ to_tsquery('my_de','Schnösel');
  string   
-----------
 Schnösel
 Schnoesel
(2 rows)

Great stuff.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Daniel Westermann
Daniel Westermann

Principal Consultant & Technology Leader Open Infrastructure