Infrastructure at your Service

When you work with databases sooner or later you will need to get in touch witch collations. It might be that you never needed to think about it as initdb picked a value that just worked for you and nobody in your database ever tried to use a collation that is not available in your installation. Once someone is trying that or there is the requirement to use a specific collation from the beginning you need to know at least the basics and this is the topic of this post. In contrast to Oracle, PostgreSQL relies on the operating system for collation support. As always, a demo says more than thousands words, so lets go.

I am running CentOS 8 (8.1.1911 (Core)), minimal installation plus the packages required for installing PostgreSQL from source code. To get a default environment lets create brand new user:

[email protected]:/home/postgres/ [pgdev] sudo useradd pg
[email protected]:/home/postgres/ [pgdev] sudo su - pg

Once we switch to that user and check the LANG and LC_* variable this is what we get:

[[email protected] ~]$ env | grep LC
[[email protected] ~]$ env | grep LANG
LANG=en_US.UTF-8

When we run initdb to create a new cluster this is what gets picked up and will become the default for the databases:

[[email protected] ~]$ /u01/app/postgres/product/12/db_0/bin/initdb -D /var/tmp/pg
[[email protected] ~]$ export PGPORT=9999
[[email protected] ~]$ /u01/app/postgres/product/12/db_0/bin/pg_ctl -D /var/tmp/pg start
[[email protected] ~]$ /u01/app/postgres/product/12/db_0/bin/psql -p 9999 postgres
psql (12.0)
Type "help" for help.

postgres=# \l
                              List of databases
   Name    | Owner | Encoding |   Collate   |    Ctype    | Access privileges 
-----------+-------+----------+-------------+-------------+-------------------
 postgres  | pg    | UTF8     | en_US.UTF-8 | en_US.UTF-8 | 
 template0 | pg    | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/pg            +
           |       |          |             |             | pg=CTc/pg
 template1 | pg    | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/pg            +
           |       |          |             |             | pg=CTc/pg
(3 rows)

With PostgreSQL you can define the default collation for a database at the time you create a database so creating a new database with the same collation as the ones already existing is not an issue at all:

[[email protected] ~]$ /u01/app/postgres/product/12/db_0/bin/psql -p 9999 postgres
postgres=# create database tmp with LC_COLLATE = "en_US.UTF-8";
CREATE DATABASE

The trouble starts when you want to use a collation that is not currently available from the operating system:

postgres=# create database tmp2 with LC_COLLATE = "de_DE.UTF-8";
ERROR:  invalid locale name: "de_DE.UTF-8"

At this point PostgreSQL can not create the database for you as it does not know anything about the German locale specified in the create database command. Apparently my CentOS installation does not have support for this collation. How then can you check what the operating system currently provides? The command to use on Linux flavors that use systemd is localectl:

[[email protected] ~]$ localectl list-locales
C.utf8
en_AG
en_AU
en_AU.utf8
en_BW
en_BW.utf8
en_CA
en_CA.utf8
en_DK
en_DK.utf8
en_GB
en_GB.iso885915
en_GB.utf8
en_HK
en_HK.utf8
en_IE
en_IE.utf8
[email protected]
en_IL
en_IN
en_NG
en_NZ
en_NZ.utf8
en_PH
en_PH.utf8
en_SC.utf8
en_SG
en_SG.utf8
en_US
en_US.iso885915
en_US.utf8
en_ZA
en_ZA.utf8
en_ZM
en_ZW
en_ZW.utf8

Given that output I can only use the various “en_*” and the “C” collations but nothing else. Let’s try to use a collation from the list above:

postgres=# create database tmpX with lc_collate='en_NZ.utf8';
ERROR:  new collation (en_NZ.utf8) is incompatible with the collation of the template database (en_US.UTF-8)
HINT:  Use the same collation as in the template database, or use template0 as template.

That fails because there already could be some data in template1 which is not compatible with the locale specific in the create database command. As template0 is read only (more on the template databases here) this one should be used when you want to create database using a locale which is different from template1:

postgres=# create database tmpX with lc_collate='en_NZ.utf8' template=template0;
CREATE DATABASE
postgres=# create database tmpXX with lc_collate='C' template=template0;
CREATE DATABASE
postgres=# \l
                              List of databases
   Name    | Owner | Encoding |   Collate   |    Ctype    | Access privileges 
-----------+-------+----------+-------------+-------------+-------------------
 postgres  | pg    | UTF8     | en_US.UTF-8 | en_US.UTF-8 | 
 template0 | pg    | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/pg            +
           |       |          |             |             | pg=CTc/pg
 template1 | pg    | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/pg            +
           |       |          |             |             | pg=CTc/pg
 tmp       | pg    | UTF8     | en_US.UTF-8 | en_US.UTF-8 | 
 tmpx      | pg    | UTF8     | en_NZ.utf8  | en_US.UTF-8 | 
 tmpxx     | pg    | UTF8     | C           | en_US.UTF-8 | 
(7 rows)

Basically we can use every locale that the operating provides but as there is nothing for German we can not use it. The question now is how we can install support for other locales on the operating system. For Red Hat 8 or CentOS 8 based systems this is quite easy as dnf can be used to search for, and then install the language packs you need:

[[email protected] ~]$ dnf search locale | grep de
Last metadata expiration check: 1:23:33 ago on Wed 04 Mar 2020 01:53:43 AM CET.
perl-Encode-Locale.noarch : Determine the locale encoding
perl-Locale-Codes.noarch : Distribution of modules to handle locale codes
perl-Locale-US.noarch : Two letter codes for state identification in the United States and vice versa
rubygem-locale.noarch : Pure ruby library which provides basic APIs for localization
glibc-langpack-de.x86_64 : Locale data for de
glibc-langpack-de.i686 : Locale data for de
langtable.noarch : Guessing reasonable defaults for locale, keyboard layout, territory, and language.

Once you’ve identified the correct package just install it:

[[email protected] ~]$ sudo dnf install -y glibc-langpack-de

If we ask the operating system supported for supported locales once more, the German ones now show up:

[[email protected] ~]$ localectl list-locales | grep de
de_AT
de_AT.utf8
[email protected]
de_BE
de_BE.utf8
[email protected]
de_CH
de_CH.utf8
de_DE
de_DE.utf8
[email protected]
de_IT
de_IT.utf8
de_LI.utf8
de_LU
de_LU.utf8
[email protected]

This should give PostgreSQL all it needs to create the database with the German locale. Before we proceed with creating that database have a look at the pg_collation catalog table:

postgres=# select * from pg_collation ;
  oid  |    collname     | collnamespace | collowner | collprovider | collisdeterministic | collencoding |   collcollate   |    collctype    | collversion 
-------+-----------------+---------------+-----------+--------------+---------------------+--------------+-----------------+-----------------+-------------
   100 | default         |            11 |        10 | d            | t                   |           -1 |                 |                 | 
   950 | C               |            11 |        10 | c            | t                   |           -1 | C               | C               | 
   951 | POSIX           |            11 |        10 | c            | t                   |           -1 | POSIX           | POSIX           | 
 12326 | ucs_basic       |            11 |        10 | c            | t                   |            6 | C               | C               | 
 12327 | C.utf8          |            11 |        10 | c            | t                   |            6 | C.utf8          | C.utf8          | 
 12328 | en_AG           |            11 |        10 | c            | t                   |            6 | en_AG           | en_AG           | 
 12329 | en_AU           |            11 |        10 | c            | t                   |            8 | en_AU           | en_AU           | 
 12330 | en_AU.utf8      |            11 |        10 | c            | t                   |            6 | en_AU.utf8      | en_AU.utf8      | 
 12331 | en_BW           |            11 |        10 | c            | t                   |            8 | en_BW           | en_BW           | 
 12332 | en_BW.utf8      |            11 |        10 | c            | t                   |            6 | en_BW.utf8      | en_BW.utf8      | 
 12333 | en_CA           |            11 |        10 | c            | t                   |            8 | en_CA           | en_CA           | 
 12334 | en_CA.utf8      |            11 |        10 | c            | t                   |            6 | en_CA.utf8      | en_CA.utf8      | 
 12335 | en_DK           |            11 |        10 | c            | t                   |            8 | en_DK           | en_DK           | 
 12336 | en_DK.utf8      |            11 |        10 | c            | t                   |            6 | en_DK.utf8      | en_DK.utf8      | 
 12337 | en_GB           |            11 |        10 | c            | t                   |            8 | en_GB           | en_GB           | 
 12338 | en_GB.iso885915 |            11 |        10 | c            | t                   |           16 | en_GB.iso885915 | en_GB.iso885915 | 
 12339 | en_GB.utf8      |            11 |        10 | c            | t                   |            6 | en_GB.utf8      | en_GB.utf8      | 
 12340 | en_HK           |            11 |        10 | c            | t                   |            8 | en_HK           | en_HK           | 
 12341 | en_HK.utf8      |            11 |        10 | c            | t                   |            6 | en_HK.utf8      | en_HK.utf8      | 
 12342 | en_IE           |            11 |        10 | c            | t                   |            8 | en_IE           | en_IE           | 
 12343 | [email protected]      |            11 |        10 | c            | t                   |           16 | [email protected]      | [email protected]      | 
 12344 | en_IE.utf8      |            11 |        10 | c            | t                   |            6 | en_IE.utf8      | en_IE.utf8      | 
 12345 | en_IL           |            11 |        10 | c            | t                   |            6 | en_IL           | en_IL           | 
 12346 | en_IN           |            11 |        10 | c            | t                   |            6 | en_IN           | en_IN           | 
 12347 | en_NG           |            11 |        10 | c            | t                   |            6 | en_NG           | en_NG           | 
 12348 | en_NZ           |            11 |        10 | c            | t                   |            8 | en_NZ           | en_NZ           | 
 12349 | en_NZ.utf8      |            11 |        10 | c            | t                   |            6 | en_NZ.utf8      | en_NZ.utf8      | 
 12350 | en_PH           |            11 |        10 | c            | t                   |            8 | en_PH           | en_PH           | 
 12351 | en_PH.utf8      |            11 |        10 | c            | t                   |            6 | en_PH.utf8      | en_PH.utf8      | 
 12352 | en_SC.utf8      |            11 |        10 | c            | t                   |            6 | en_SC.utf8      | en_SC.utf8      | 
 12353 | en_SG           |            11 |        10 | c            | t                   |            8 | en_SG           | en_SG           | 
 12354 | en_SG.utf8      |            11 |        10 | c            | t                   |            6 | en_SG.utf8      | en_SG.utf8      | 
 12355 | en_US           |            11 |        10 | c            | t                   |            8 | en_US           | en_US           | 
 12356 | en_US.iso885915 |            11 |        10 | c            | t                   |           16 | en_US.iso885915 | en_US.iso885915 | 
 12357 | en_US.utf8      |            11 |        10 | c            | t                   |            6 | en_US.utf8      | en_US.utf8      | 
 12358 | en_ZA           |            11 |        10 | c            | t                   |            8 | en_ZA           | en_ZA           | 
 12359 | en_ZA.utf8      |            11 |        10 | c            | t                   |            6 | en_ZA.utf8      | en_ZA.utf8      | 
 12360 | en_ZM           |            11 |        10 | c            | t                   |            6 | en_ZM           | en_ZM           | 
 12361 | en_ZW           |            11 |        10 | c            | t                   |            8 | en_ZW           | en_ZW           | 
 12362 | en_ZW.utf8      |            11 |        10 | c            | t                   |            6 | en_ZW.utf8      | en_ZW.utf8      | 
 12363 | en_AU           |            11 |        10 | c            | t                   |            6 | en_AU.utf8      | en_AU.utf8      | 
 12364 | en_BW           |            11 |        10 | c            | t                   |            6 | en_BW.utf8      | en_BW.utf8      | 
 12365 | en_CA           |            11 |        10 | c            | t                   |            6 | en_CA.utf8      | en_CA.utf8      | 
 12366 | en_DK           |            11 |        10 | c            | t                   |            6 | en_DK.utf8      | en_DK.utf8      | 
 12367 | en_GB           |            11 |        10 | c            | t                   |           16 | en_GB.iso885915 | en_GB.iso885915 | 
 12368 | en_GB           |            11 |        10 | c            | t                   |            6 | en_GB.utf8      | en_GB.utf8      | 
 12369 | en_HK           |            11 |        10 | c            | t                   |            6 | en_HK.utf8      | en_HK.utf8      | 
 12370 | en_IE           |            11 |        10 | c            | t                   |            6 | en_IE.utf8      | en_IE.utf8      | 
 12371 | en_NZ           |            11 |        10 | c            | t                   |            6 | en_NZ.utf8      | en_NZ.utf8      | 
 12372 | en_PH           |            11 |        10 | c            | t                   |            6 | en_PH.utf8      | en_PH.utf8      | 
 12373 | en_SC           |            11 |        10 | c            | t                   |            6 | en_SC.utf8      | en_SC.utf8      | 
 12374 | en_SG           |            11 |        10 | c            | t                   |            6 | en_SG.utf8      | en_SG.utf8      | 
 12375 | en_US           |            11 |        10 | c            | t                   |           16 | en_US.iso885915 | en_US.iso885915 | 
 12376 | en_US           |            11 |        10 | c            | t                   |            6 | en_US.utf8      | en_US.utf8      | 
 12377 | en_ZA           |            11 |        10 | c            | t                   |            6 | en_ZA.utf8      | en_ZA.utf8      | 
 12378 | en_ZW           |            11 |        10 | c            | t                   |            6 | en_ZW.utf8      | en_ZW.utf8      | 
(56 rows)

You will not see any of the collations you installed after the cluster was created. To actually make them visible in the catalog you need to manually add the collations from operating system:

postgres=# select pg_import_system_collations('pg_catalog');
 pg_import_system_collations 
-----------------------------
                          24
(1 row)

You actually can create databases with new collations without importing them into the catalog table but we recommend to keep the catalog and the collations provided by the operating system in sync. Asking the catalog table once more the new “de_* collations will show up:

postgres=# select * from pg_collation where collname ~ '^de_';
  oid  |  collname  | collnamespace | collowner | collprovider | collisdeterministic | collencoding | collcollate | collctype  | collversion 
-------+------------+---------------+-----------+--------------+---------------------+--------------+-------------+------------+-------------
 16386 | de_AT      |            11 |        10 | c            | t                   |            8 | de_AT       | de_AT      | 
 16387 | [email protected] |            11 |        10 | c            | t                   |           16 | [email protected]  | [email protected] | 
 16388 | de_AT.utf8 |            11 |        10 | c            | t                   |            6 | de_AT.utf8  | de_AT.utf8 | 
 16389 | de_BE      |            11 |        10 | c            | t                   |            8 | de_BE       | de_BE      | 
 16390 | [email protected] |            11 |        10 | c            | t                   |           16 | [email protected]  | [email protected] | 
 16391 | de_BE.utf8 |            11 |        10 | c            | t                   |            6 | de_BE.utf8  | de_BE.utf8 | 
 16392 | de_CH      |            11 |        10 | c            | t                   |            8 | de_CH       | de_CH      | 
 16393 | de_CH.utf8 |            11 |        10 | c            | t                   |            6 | de_CH.utf8  | de_CH.utf8 | 
 16394 | de_DE      |            11 |        10 | c            | t                   |            8 | de_DE       | de_DE      | 
 16395 | [email protected] |            11 |        10 | c            | t                   |           16 | [email protected]  | [email protected] | 
 16396 | de_DE.utf8 |            11 |        10 | c            | t                   |            6 | de_DE.utf8  | de_DE.utf8 | 
 16397 | de_IT      |            11 |        10 | c            | t                   |            8 | de_IT       | de_IT      | 
 16398 | de_IT.utf8 |            11 |        10 | c            | t                   |            6 | de_IT.utf8  | de_IT.utf8 | 
 16399 | de_LI.utf8 |            11 |        10 | c            | t                   |            6 | de_LI.utf8  | de_LI.utf8 | 
 16400 | de_LU      |            11 |        10 | c            | t                   |            8 | de_LU       | de_LU      | 
 16401 | [email protected] |            11 |        10 | c            | t                   |           16 | [email protected]  | [email protected] | 
 16402 | de_LU.utf8 |            11 |        10 | c            | t                   |            6 | de_LU.utf8  | de_LU.utf8 | 
 16403 | de_AT      |            11 |        10 | c            | t                   |            6 | de_AT.utf8  | de_AT.utf8 | 
 16404 | de_BE      |            11 |        10 | c            | t                   |            6 | de_BE.utf8  | de_BE.utf8 | 
 16405 | de_CH      |            11 |        10 | c            | t                   |            6 | de_CH.utf8  | de_CH.utf8 | 
 16406 | de_DE      |            11 |        10 | c            | t                   |            6 | de_DE.utf8  | de_DE.utf8 | 
 16407 | de_IT      |            11 |        10 | c            | t                   |            6 | de_IT.utf8  | de_IT.utf8 | 
 16408 | de_LI      |            11 |        10 | c            | t                   |            6 | de_LI.utf8  | de_LI.utf8 | 
 16409 | de_LU      |            11 |        10 | c            | t                   |            6 | de_LU.utf8  | de_LU.utf8 | 
(24 rows)

Creating the new database using one of “de_*” collations does work now:

postgres=# create database tmp2 with LC_COLLATE = "de_DE.UTF-8" template=template0;
CREATE DATABASE
postgres=# \l
                              List of databases
   Name    | Owner | Encoding |   Collate   |    Ctype    | Access privileges 
-----------+-------+----------+-------------+-------------+-------------------
 postgres  | pg    | UTF8     | en_US.UTF-8 | en_US.UTF-8 | 
 template0 | pg    | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/pg            +
           |       |          |             |             | pg=CTc/pg
 template1 | pg    | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/pg            +
           |       |          |             |             | pg=CTc/pg
 tmp       | pg    | UTF8     | en_US.UTF-8 | en_US.UTF-8 | 
 tmp2      | pg    | UTF8     | de_DE.UTF-8 | en_US.UTF-8 | 
 tmpx      | pg    | UTF8     | en_NZ.utf8  | en_US.UTF-8 | 
 tmpxx     | pg    | UTF8     | C           | en_US.UTF-8 | 

With PostgreSQL you can also specify the collation on a column level, so by now we can have a column using an “en” location and another one using a “de” collation:

postgres=# create table t1 ( a text COLLATE "de_DE.utf8"
postgres(#                 , b text COLLATE "en_US.utf8"
postgres(#                 );
CREATE TABLE
postgres=# \d t1
                Table "public.t1"
 Column | Type | Collation  | Nullable | Default 
--------+------+------------+----------+---------
 a      | text | de_DE.utf8 |          | 
 b      | text | en_US.utf8 |          | 

If you want to use another collation which is currently not provided by the operating system the same rules apply as above:

postgres=# create table t1 ( a text COLLATE "de_DE.utf8"
postgres(#                 , b text COLLATE "en_US.utf8"
postgres(#                 , c text COLLATE "fr_FR.utf8"
postgres(#                 );
ERROR:  collation "fr_FR.utf8" for encoding "UTF8" does not exist
LINE 3:                 , c text COLLATE "fr_FR.utf8"

You need to install support for that before you can use it. The same applies for queries, of course:

postgres=# SELECT a < b COLLATE "de_DE" FROM t1;
 ?column? 
----------
(0 rows)

postgres=# SELECT a < b COLLATE "fr_FR" FROM t1;
ERROR:  collation "fr_FR" for encoding "UTF8" does not exist
LINE 1: SELECT a < b COLLATE "fr_FR" FROM t1;

That’s it for the basics. More to come in another post.

One Comment

  • Olli says:

    Please explain differences between “utf8” and “utf-8” -form; is it same which form is used
    when defining collation or ctype? I know that both are ok for postgres, so it is confusing that dash can be used there.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Daniel Westermann
Daniel Westermann

Principal Consultant & Technology Leader Open Infrastructure