By Mouhamadou Diaw

I was searching a tool for anonymizing data in a PostgreSQL database and I have tested the extension pg_anonymizer.
PostgreSQL_anonymizer is a set of SQL functions that remove personally identifiable values from a PostgreSQL table and replace them with random-but-plausible values. The goal is to avoid any identification from the data record while remaining suitable for testing, data analysis and data processing.
In this blog I am showing how this extension can be used. I am using a PostgreSQL 10 database.
The first step is to install the extension pg_anonymizer. In my case I did it with with pgxn client

1
2
3
4
5
6
7
8
9
10
11
12
[postgres@pgserver2 ~]$ pgxn install postgresql_anonymizer --pg_config /u01/app/postgres/product/10/db_1/bin/pg_config
INFO: best version: postgresql_anonymizer 0.0.3
INFO: saving /tmp/tmpVf3psT/postgresql_anonymizer-0.0.3.zip
INFO: unpacking: /tmp/tmpVf3psT/postgresql_anonymizer-0.0.3.zip
INFO: building extension
gmake: Nothing to be done for `all'.
INFO: installing extension
/usr/bin/mkdir -p '/u01/app/postgres/product/10/db_1/share/extension'
/usr/bin/mkdir -p '/u01/app/postgres/product/10/db_1/share/extension/anon'
/usr/bin/install -c -m 644 .//anon.control '/u01/app/postgres/product/10/db_1/share/extension/'
/usr/bin/install -c -m 644 .//anon/anon--0.0.3.sql  '/u01/app/postgres/product/10/db_1/share/extension/anon/'
[postgres@pgserver2 ~]$

We can then verify that under /u01/app/postgres/product/10/db_1/share/extension we have a file anon.control and a directory named anon

1
2
3
4
5
6
7
8
[postgres@pgserver2 extension]$ ls -ltra anon*
-rw-r--r--. 1 postgres postgres 167 Sep 13 10:54 anon.control
anon:
total 18552
drwxrwxr-x. 3 postgres postgres    12288 Sep 13 10:54 ..
drwxrwxr-x. 2 postgres postgres       28 Sep 13 10:54 .
-rw-r--r--. 1 postgres postgres 18980156 Sep 13 10:54 anon--0.0.3.sql

Let’s create a database named prod and let’s create the required extensions. tsm_system_rows should delivered by the contrib.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
prod=# c prod
You are now connected to database "prod" as user "postgres".
prod=#
prod=# CREATE EXTENSION tsm_system_rows;;
CREATE EXTENSION
prod=#
prod=# CREATE EXTENSION anon;
CREATE EXTENSION
prod=#
prod=# dx
                                    List of installed extensions
      Name       | Version |   Schema   |                        Description
-----------------+---------+------------+----------------------------------------------------
--------
 anon            | 0.0.3   | anon       | Data anonymization tools
 plpgsql         | 1.0     | pg_catalog | PL/pgSQL procedural language
 tsm_system_rows | 1.0     | public     | TABLESAMPLE method which accepts number of rows as
a limit
(3 rows)
prod=#

The extension will create following functions in the schema anon. These functions can be used to mask some data.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
prod=# set search_path=anon;
SET
prod=# df
                                                               List of functions
 Schema |           Name           |     Result data type     |                          Argu
ment data types                           |  Type
--------+--------------------------+--------------------------+------------------------------
------------------------------------------+--------
 anon   | random_city              | text                     |
                                          | normal
 anon   | random_city_in_country   | text                     | country_name text
                                          | normal
 anon   | random_company           | text                     |
                                          | normal
 anon   | random_country           | text                     |
                                          | normal
 anon   | random_date              | timestamp with time zone |
                                          | normal
 anon   | random_date_between      | timestamp with time zone | date_start timestamp with tim
e zone, date_end timestamp with time zone | normal
 anon   | random_email             | text                     |
                                          | normal
 anon   | random_first_name        | text                     |
                                          | normal
 anon   | random_iban              | text                     |
                                          | normal
 anon   | random_int_between       | integer                  | int_start integer, int_stop integer
                            | normal
 anon   | random_last_name         | text                     |
                            | normal
 anon   | random_phone             | text                     | phone_prefix text DEFAULT '0'::text
                            | normal
 anon   | random_region            | text                     |
                            | normal
 anon   | random_region_in_country | text                     | country_name text
                            | normal
 anon   | random_siren             | text                     |
                            | normal
 anon   | random_siret             | text                     |
                            | normal
 anon   | random_string            | text                     | l integer
                            | normal
 anon   | random_zip               | text                     |
                            | normal
(18 rows)
prod=#

Now in the database prod let’s create a table with some data.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
prod=# d customers
                      Table "public.customers"
   Column   |         Type          | Collation | Nullable | Default
------------+-----------------------+-----------+----------+---------
 first_name | character varying(30) |           |          |
 last_name  | character varying(30) |           |          |
 email_add  | character varying(30) |           |          |
 country    | character varying(60) |           |          |
 iban       | character varying(60) |           |          |
 amount     | integer               |           |          |
prod=#
prod=# table customers;
 first_name | last_name |        email_add        |   country    |            iban            |   amount
------------+-----------+-------------------------+--------------+----------------------------+------------
 Michel     | Delco     | [email protected]    | FRANCE       | FR763000600001123456890189 |    5000000
 Denise     | Blanchot  | [email protected] | GERMANY      | DE91100000000123456789     | 1000000000
 Farid      | Dim       | [email protected]       | Saudi Arabia | SA4420000001234567891234   |    2500000
(3 rows)
prod=#

Let’s say that I want some people to access to all data for this table, but I don’t want them to see the real email, the real country and the real iban of the customers.
One solution should be to create a view with anonymous data for these columns. This will replace them with random-but-plausible values for these columns

1
2
prod=# create view Customers_anon as select first_name as Firstname ,last_name  as Lastnmame,anon.random_email() as Email ,anon.random_country() as Country, anon.random_iban() as Iban ,amount as Amount from customers;
CREATE VIEW

And then grant the access privilege to concerned people

1
2
3
4
5
6
7
8
9
prod=# select * from customers_anon ;
 firstname | lastnmame |             email             | country |            iban            |   amount
-----------+-----------+-------------------------------+---------+----------------------------+------------
 Michel    | Delco     | [email protected]        | Spain   |  AD1111112222C3C3C3C3C3C3  |    5000000
 Denise    | Blanchot  | [email protected]          | Myanmar |  AD1111112222C3C3C3C3C3C3  | 1000000000
 Farid     | Dim       | [email protected] | India   |  AD1111112222C3C3C3C3C3C3  |    2500000
(3 rows)
prod=#

Thanks to Damien for this project. Just not that this project is in its early development and should be used with care (see Damien comment below).