In the last posts in this little series we looked at how to get a ZFS file system up and running on a CentOS 7 host and how snapshots and clones can be used to simply processes such as testing and cloning PostgreSQL instances. In this post we’ll look at another feature of zfs: Compression.

The current status of my ZFS file systems is:

[root@centos7 ~] zfs list
NAME            USED  AVAIL  REFER  MOUNTPOINT
pgpool          170M  9.46G  20.5K  /pgpool
pgpool/pgdata   169M  9.46G   169M  /pgpool/pgdata

To check if compression is enabled:

[root@centos7 ~] zfs get compression pgpool/pgdata
NAME           PROPERTY     VALUE     SOURCE
pgpool/pgdata  compression  off       default

Lets create another file system and enable compression for it:

[root@centos7 ~] zfs create pgpool/pgdatacompressed
[root@centos7 ~] zfs list
NAME                      USED  AVAIL  REFER  MOUNTPOINT
pgpool                    170M  9.46G  20.5K  /pgpool
pgpool/pgdata             169M  9.46G   169M  /pgpool/pgdata
pgpool/pgdatacompressed    19K  9.46G    19K  /pgpool/pgdatacompressed
[root@centos7 ~] zfs get compression pgpool/pgdatacompressed
NAME                     PROPERTY     VALUE     SOURCE
pgpool/pgdatacompressed  compression  off       default
[root@centos7 ~] zfs set compression=on pgpool/pgdatacompressed
[root@centos7 ~] zfs get compression pgpool/pgdatacompressed
NAME                     PROPERTY     VALUE     SOURCE
pgpool/pgdatacompressed  compression  on        local

You can ask zfs to report the compression ratio for a file system:

[root@centos7 ~] zfs get compressratio pgpool/pgdatacompressed
NAME                     PROPERTY       VALUE  SOURCE
pgpool/pgdatacompressed  compressratio  1.00x  -
[root@centos7 ~] chown postgres:postgres /pgpool/pgdatacompressed/

The ratio is 1 which is because we do not have any data yet. Lets copy the PostgreSQL cluster from the uncompressed file system into our new compressed file system:

postgres@centos7:/home/postgres/ [PG1] cp -pr /pgpool/pgdata/* /pgpool/pgdatacompressed/
postgres@centos7:/home/postgres/ [PG1] ls -l /pgpool/pgdatacompressed/
total 30
drwx------. 6 postgres postgres     6 Sep 29 14:00 base
drwx------. 2 postgres postgres    54 Sep 29 14:27 global
drwx------. 2 postgres postgres     3 Sep 28 15:11 pg_clog
drwx------. 2 postgres postgres     2 Sep 28 15:11 pg_commit_ts
drwx------. 2 postgres postgres     2 Sep 28 15:11 pg_dynshmem
-rw-------. 1 postgres postgres  4468 Sep 28 15:11 pg_hba.conf
-rw-------. 1 postgres postgres  1636 Sep 28 15:11 pg_ident.conf
drwxr-xr-x. 2 postgres postgres     2 Sep 28 15:11 pg_log
drwx------. 4 postgres postgres     4 Sep 28 15:11 pg_logical
drwx------. 4 postgres postgres     4 Sep 28 15:11 pg_multixact
drwx------. 2 postgres postgres     3 Sep 29 14:27 pg_notify
drwx------. 2 postgres postgres     2 Sep 28 15:11 pg_replslot
drwx------. 2 postgres postgres     2 Sep 28 15:11 pg_serial
drwx------. 2 postgres postgres     2 Sep 28 15:11 pg_snapshots
drwx------. 2 postgres postgres     5 Sep 29 14:46 pg_stat
drwx------. 2 postgres postgres     2 Sep 29 14:46 pg_stat_tmp
drwx------. 2 postgres postgres     3 Sep 28 15:11 pg_subtrans
drwx------. 2 postgres postgres     2 Sep 28 15:11 pg_tblspc
drwx------. 2 postgres postgres     2 Sep 28 15:11 pg_twophase
-rw-------. 1 postgres postgres     4 Sep 28 15:11 PG_VERSION
drwx------. 3 postgres postgres     8 Sep 29 14:26 pg_xlog
-rw-------. 1 postgres postgres    88 Sep 28 15:11 postgresql.auto.conf
-rw-------. 1 postgres postgres 21270 Sep 28 15:11 postgresql.conf
-rw-------. 1 postgres postgres    69 Sep 29 14:27 postmaster.opts

We already should see a difference, shouldn’t we?

postgres@centos7:/home/postgres/ [PG1] df -h | grep pgdata
pgpool/pgdata            9.6G  170M  9.4G   2% /pgpool/pgdata
pgpool/pgdatacompressed  9.5G   82M  9.4G   1% /pgpool/pgdatacompressed

Not bad, less than half of the size. We should see another compression ratio than 1 now:

[root@centos7 ~] zfs get compressratio pgpool/pgdatacompressed
NAME                     PROPERTY       VALUE  SOURCE
pgpool/pgdatacompressed  compressratio  1.93x  -

Lets generate some data in our two PostgreSQL instances and check the time it takes as well as the size of the file systems afterwards. As in the last post the second instance just gets a different port, everything else is identical:

postgres@centos7:/home/postgres/ [PG1] pg_ctl start -D /pgpool/pgdata
postgres@centos7:/home/postgres/ [PG1] sed -i 's/#port = 5432/port=5433/g' /pgpool/pgdatacompressed/postgresql.conf
postgres@centos7:/home/postgres/ [PG1] FATAL:  data directory "/pgpool/pgdatacompressed" has group or world access
postgres@centos7:/home/postgres/ [PG1] chmod o-rwx,g-rwx /pgpool/pgdatacompressed/
postgres@centos7:/home/postgres/ [PG1] pg_ctl start -D /pgpool/pgdatacompressed/

This is the script to generate some data:

timing
c postgres
drop database if exists dataload;
create database dataload;
c dataload
create table dataload ( a bigint
                      , b varchar(100)
                      , c timestamp
                      );
with 
  data_generator_num as
     ( select *
         from generate_series ( 1
                              , 1000000 ) nums
     ) 
insert into dataload
select data_generator_num.nums
     , md5(data_generator_num.nums::varchar)
     , current_date+data_generator_num.nums
 from data_generator_num;

I will run the script two times on each instance. For the instance on the uncompressed file system:

-- FIRST RUN
postgres=# i generate_data.sql
Timing is on.
You are now connected to database "postgres" as user "postgres".
DROP DATABASE
Time: 720.626 ms
CREATE DATABASE
Time: 4631.212 ms
You are now connected to database "dataload" as user "postgres".
CREATE TABLE
Time: 6.517 ms
INSERT 0 1000000
Time: 28668.343 ms
-- SECOND RUN
dataload=# i generate_data.sql
Timing is on.
You are now connected to database "postgres" as user "postgres".
DROP DATABASE
Time: 774.061 ms
CREATE DATABASE
Time: 2721.169 ms
You are now connected to database "dataload" as user "postgres".
CREATE TABLE
Time: 7.374 ms
INSERT 0 1000000
Time: 32168.043 ms
dataload=# 

For the instance on the compressed file system:

-- FIRST RUN
postgres=# i generate_data.sql
Timing is on.
You are now connected to database "postgres" as user "postgres".
psql:generate_data.sql:3: NOTICE:  database "dataload" does not exist, skipping
DROP DATABASE
Time: 0.850 ms
CREATE DATABASE
Time: 4281.965 ms
You are now connected to database "dataload" as user "postgres".
CREATE TABLE
Time: 5.120 ms
INSERT 0 1000000
Time: 30606.966 ms
-- SECOND RUN
dataload=# i generate_data.sql
Timing is on.
You are now connected to database "postgres" as user "postgres".
DROP DATABASE
Time: 2359.120 ms
CREATE DATABASE
Time: 3267.151 ms
You are now connected to database "dataload" as user "postgres".
CREATE TABLE
Time: 8.665 ms
INSERT 0 1000000
Time: 23474.290 ms
dataload=# 

Despite that the numbers are quite bad (5 seconds to create an empty table) the fastest load was the second one on the compressed file system. So at least it is not slower. I have to admit that I did not do any tuning on the file systems and my VM does not have much memory (512m) which is far too less if you work with ZFS (ZFS needs much memory, at least 1gb).

So, what about the size of the data. First lets check what PostgreSQL is telling us:

-- instance on the uncompressed file system
dataload=# select * from pg_size_pretty ( pg_relation_size ( 'dataload' ));
 pg_size_pretty 
----------------
 81 MB
(1 row)
-- instance on the compressed file system
dataload=# select * from pg_size_pretty ( pg_relation_size ( 'dataload' ));
 pg_size_pretty 
----------------
 81 MB
(1 row)

Exactly the same, which is not surprising as PostgreSQL sees the files as if they would be uncompressed (please be aware that the my_app_table from the last post is still there which is why the file system usage in total is larger than you might expect). It is quite funny on how the size is reported on the compressed file system depending on how you ask.

You can use oid2name to map the file name to a table name:

postgres@centos7:/pgpool/pgdatacompressed/base/24580/ [PG1] oid2name -d dataload -p 5433 -f 24581
From database "dataload":
  Filenode  Table Name
----------------------
     24581    dataload

File 24581 is the table we generated. When you ask for the size by using “du” you get:

postgres@centos7:/pgpool/pgdatacompressed/base/24580/ [PG1] du -h 24581
48M	24581

This is the compressed size. When you use “ls” you get the uncompressed size:

postgres@centos7:/pgpool/pgdatacompressed/base/24580/ [PG1] ls -lh 24581
-rw-------. 1 postgres postgres 81M Sep 30 10:43 24581

What does “df” tell us:

postgres@centos7:/home/postgres/ [PG1] df -h | grep pgdata
pgpool/pgdata            9.5G  437M  9.1G   5% /pgpool/pgdata
pgpool/pgdatacompressed  9.2G  165M  9.1G   2% /pgpool/pgdatacompressed

Not bad, 437M of uncompressed data which is 165M compressed. So, if you are short on space this really can be an option.