Infrastructure at your Service

Daniel Westermann

[Data]nymizer – Data anonymizer for PostgreSQL

Often there is a requirement to populate a test or development database with data from production, but this comes with a risk: do you really want developers or testers to have access to sensitive data? In some companies this might not be an issue, but in others, sensitive data must not be available in any database other than production. Oracle offers Data Masking, but there is nothing in community PostgreSQL that helps you with this. Of course you could develop something on your own, but there is another solution: [Data]nymizer. This tool produces a native dump file in which sensitive data is masked based on flexible rules. Because the result is a dump file, its size might become an issue if your source database is large and you want to dump the whole database. But usually you do not have sensitive data in all tables, and you have the option to dump only specific tables. Let's have a look at how this can be installed, and how it works.

The installation itself is straightforward:

[email protected]:/home/postgres/ [pgdev] curl -sSfL https://git.io/pg_datanymizer | sh -s -- -b bin v0.1.0
pg_datanymizer installer: Version v0.1.0 will be installed
pg_datanymizer installer: Successfully installed pg_datanymizer 0.1.0 to bin/pg_datanymizer
[email protected]:/home/postgres/ [pgdev] ls -l bin/
total 14452
-rwxr-xr-x 1 postgres postgres 14796464 Feb 24 10:51 pg_datanymizer

To check if it works in general, we can print the help:

[email protected]:/home/postgres/ [pgdev] bin/pg_datanymizer --help
pg_datanymizer 0.1.0

USAGE:
    pg_datanymizer [OPTIONS] <database>

FLAGS:
        --help       Prints help information
    -V, --version    Prints version information

OPTIONS:
    -f, --file <file>             Path to dump file, example: /tmp/dump.sql
    -c, --config <config>         Path to config file. Default: ./config.yml
    -d, --dbname <dbname>         database to dump [default: postgres]
    -h, --host <host>             database server host or socket directory [default: localhost]
    -W, --password                force password prompt (should happen automatically)
        --pg_dump <pg_dump>       pg_dump file location [default: pg_dump]
    -p, --port <port>             database server port number
    -U, --username <username>     connect as specified database user

ARGS:
    <database>

There are not many options, and you'll notice that the tool can be used over the network as well. This is quite important, as access to the production host is usually limited.
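Since all connection settings are regular options (see the help output above), the dump can be taken from any machine that can reach the database server. A minimal sketch; the host, user, and database names below are placeholders, not from a real setup:

```shell
# Hypothetical invocation from a jump host: dump and anonymize a database
# on a remote production server (host and user names are assumptions).
bin/pg_datanymizer \
  -h prod-pg.example.com \
  -p 5432 \
  -U app_reader \
  -c config.yaml \
  -f /tmp/appdb_anonymized.sql \
  appdb
```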

Before we proceed with the tool we need some data, so let's use a standard pgbench database for this:

[email protected]:/home/postgres/ [pgdev] psql
psql (14devel)
Type "help" for help.

postgres=# create database test;
CREATE DATABASE
postgres=# \! pgbench -i -s 10 test
dropping old tables...
NOTICE:  table "pgbench_accounts" does not exist, skipping
NOTICE:  table "pgbench_branches" does not exist, skipping
NOTICE:  table "pgbench_history" does not exist, skipping
NOTICE:  table "pgbench_tellers" does not exist, skipping
creating tables...
generating data (client-side)...
1000000 of 1000000 tuples (100%) done (elapsed 1.73 s, remaining 0.00 s)
vacuuming...
creating primary keys...
done in 2.83 s (drop tables 0.00 s, create tables 0.03 s, client-side generate 1.77 s, vacuum 0.33 s, primary keys 0.70 s).
postgres=# \c test
You are now connected to database "test" as user "postgres".
test=# \d
              List of relations
 Schema |       Name       | Type  |  Owner   
--------+------------------+-------+----------
 public | pgbench_accounts | table | postgres
 public | pgbench_branches | table | postgres
 public | pgbench_history  | table | postgres
 public | pgbench_tellers  | table | postgres
(4 rows)
test=# 

The “filler” column of pgbench_accounts is not populated by default. Let's assume we have some sensitive data there, e.g. email addresses:

test=# update pgbench_accounts set filler = '[email protected]_'||md5(bid::text)||'.com';
UPDATE 1000000
test=# select * from pgbench_accounts limit 5;
  aid   | bid | abalance |                                        filler                                        
--------+-----+----------+--------------------------------------------------------------------------------------
 999947 |  10 |        0 | [email protected]_d3d9446802a44259755d38e6d163e820.com                                         
 999948 |  10 |        0 | [email protected]_d3d9446802a44259755d38e6d163e820.com                                         
 999949 |  10 |        0 | [email protected]_d3d9446802a44259755d38e6d163e820.com                                         
 999950 |  10 |        0 | [email protected]_d3d9446802a44259755d38e6d163e820.com                                         
 999951 |  10 |        0 | [email protected]_d3d9446802a44259755d38e6d163e820.com                                         
(5 rows)

Using a simple configuration file like this:

[email protected]:/home/postgres/ [pgdev] cat config.yaml 
tables:
  - name: pgbench_accounts
    rules:
      filler:
        template:
          format: user-{{_1}}-{{_2}}
          rules:
            - random_num: {}
            - email:
                kind: Safe

… we can easily obfuscate this:

[email protected]:/home/postgres/ [pgdev] bin/pg_datanymizer -c config.yaml -U postgres -f output.sql test 
Prepare data scheme...
Fetch tables metadata...
[1 / 4] Prepare to dump table: public.pgbench_tellers
[Dumping: public.pgbench_tellers] [|##################################################|] 100 of 100 rows [100%] (0s)
[Dumping: public.pgbench_tellers] Finished in 0 seconds
[Dumping: public.pgbench_accounts] [|##################################################|] 1016397 of 1016397 rows [100%] (0s)
[Dumping: public.pgbench_accounts] Finished in 54 seconds
[Dumping: public.pgbench_history] [|##################################################|] 0 of 0 rows [100%] (0s)
[Dumping: public.pgbench_history] Finished in 0 seconds
[Dumping: public.pgbench_branches] [|##################################################|] 10 of 10 rows [100%] (0s)
[Dumping: public.pgbench_branches] Finished in 0 seconds
Finishing with indexes...

Looking at the result, the filler column still contains email addresses, but no longer with the original values:

[email protected]:/home/postgres/ [pgdev] grep "@" output.sql | head -10
999947  10      0       [email protected]
999948  10      0       [email protected]
999949  10      0       [email protected]
999950  10      0       [email protected]
999951  10      0       [email protected]
999952  10      0       [email protected]
999953  10      0       [email protected]
110689  2       0       [email protected]
110690  2       0       [email protected]
110691  2       0       [email protected]
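To gain some confidence that nothing slipped through, you can also scan the dump for the original pattern. A minimal sketch; the sample file below simulates a single dump line (the path and the replacement address are made up), while on a real system you would grep output.sql directly:

```shell
# Simulate one line of the anonymized dump (in reality: grep output.sql).
printf '999947\t10\t0\tuser-123-alice@example.com\n' > /tmp/dump_sample.sql

# The original sensitive values ended in "_<32 hex chars>.com"; grep exits
# non-zero when no such value is left in the file.
if ! grep -q '_[0-9a-f]\{32\}\.com' /tmp/dump_sample.sql; then
  echo "no original addresses found"
fi
```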

“Email” is only one of the available rules. Have a look at the Readme for the other options.
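For illustration, a config fragment using a few of the other rule names listed in the Readme; the table and column names here are made up, and you should verify that these rules exist in your version of the tool:

```yaml
tables:
  - name: customers
    rules:
      first_name:
        first_name: {}
      last_name:
        last_name: {}
      phone:
        phone: {}
```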

Filtering tables, either to include or to exclude them, is possible as well:

[email protected]:/home/postgres/ [pg14] cat config.yaml 
filter:
  only:
    - public.pgbench_accounts

tables:
  - name: pgbench_accounts
    rules:
      filler:
        template:
          format: user-{{_1}}-{{_2}}
          rules:
            - random_num: {}
            - email:
                kind: Safe

Using this configuration, only the pgbench_accounts table will be in the dump file:

[email protected]:/home/postgres/ [pgdev] bin/pg_datanymizer -c config.yaml -U postgres -f output.sql test 
Prepare data scheme...
Fetch tables metadata...
[1 / 4] Prepare to dump table: public.pgbench_tellers
[Dumping: public.pgbench_tellers] --- SKIP ---
[2 / 4] Prepare to dump table: public.pgbench_history
[Dumping: public.pgbench_history] --- SKIP ---
[3 / 4] Prepare to dump table: public.pgbench_branches
[Dumping: public.pgbench_branches] --- SKIP ---
[4 / 4] Prepare to dump table: public.pgbench_accounts
[Dumping: public.pgbench_accounts] [|##################################################|] 1016397 of 1016397 rows [100%] (0s)
[Dumping: public.pgbench_accounts] Finished in 35 seconds
Finishing with indexes...
Full Dump finished in 35 seconds
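Because the result is a plain SQL dump, loading it into a development database only requires the standard client tools. A sketch, assuming a reachable cluster; the database name "test_anon" is made up for illustration:

```shell
# Create an empty target database and load the anonymized dump into it.
createdb test_anon
psql -d test_anon -f output.sql
```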

Really a nice tool, and very flexible, if you have a requirement for data anonymization.


Daniel Westermann
Principal Consultant & Technology Leader Open Infrastructure