
Mouhamadou Diaw

EDB Failover Manager 3.0 and PostgreSQL 10.1

EDB Failover Manager 3.0 (edb-efm30) is now released and supports PostgreSQL 10.1.
EFM is an EnterpriseDB tool that can be used to perform a switchover or a failover in a PostgreSQL standby environment.
In this blog we are going to see how to install and configure EFM in a standby environment. We suppose that the standby is already configured and running, and that no VIP is configured. I just use 3 virtual machines to test. Below is the configuration we are using: 3 servers with Oracle Linux 7.2.

pgservertools: 192.168.56.30 which is the witness
pgserver1: 192.168.56.36 which is the primary server
pgserver2: 192.168.56.37 which is the standby
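If name resolution is not handled by DNS, a minimal /etc/hosts on each node could look like this (just a sketch; EFM is configured with plain IP addresses below, so this is optional):

192.168.56.30   pgservertools
192.168.56.36   pgserver1
192.168.56.37   pgserver2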
EFM must be installed on all 3 nodes. To install EDB EFM, I used the rpm provided by EnterpriseDB. Note that you will need an EDB account. We will show the installation on only one node, but it is the same on all nodes.

[[email protected] tmp]# wget http://yum.enterprisedb.com/edbrepos/edb-repo-latest.noarch.rpm
[[email protected] tmp]# yum localinstall -y edb-repo-latest.noarch.rpm

Then we have to enable the corresponding EDB repository (you will need a login and password):

[[email protected] yum.repos.d]# cat /etc/yum.repos.d/edb.repo
[edbas10]
name=EnterpriseDB Advanced Server 10 $releasever - $basearch
baseurl=http://login:password@yum.enterprisedb.com/10/redhat/rhel-$releasever-$basearch
enabled=1
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/ENTERPRISEDB-GPG-KEY

And then we can search for the package

[[email protected] ~]# yum search efm
Loaded plugins: langpacks, ulninfo
=============================== N/S matched: efm ===============================
edb-efm30.x86_64 : EnterpriseDB Failover Manager
efm20.x86_64 : EnterpriseDB Failover Manager
efm21.x86_64 : EnterpriseDB Failover Manager

And then install it (outputs are truncated):

[[email protected] ~]# yum install edb-efm30.x86_64
Loaded plugins: langpacks, ulninfo
Resolving Dependencies
--> Running transaction check
---> Package edb-efm30.x86_64 0:3.0.0-1.rhel7 will be installed
--> Finished Dependency Resolution
…
…
Installed:
  edb-efm30.x86_64 0:3.0.0-1.rhel7
Complete!

EFM also requires OpenJDK, so we installed OpenJDK 1.8 on all nodes:

[[email protected] tmp]# yum install java-1.8.0-openjdk.x86_64
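As a quick check that the JDK is available on each node (optional):

java -version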

To manage efm we create a dedicated user in the primary cluster

postgres=# create user efm with login password 'root' superuser;
CREATE ROLE
postgres=#

We have to add the following entries to the pg_hba.conf of all database clusters to allow connections for the user efm.

##for efm
host  postgres    efm  192.168.56.36/32     md5
host  postgres    efm  192.168.56.37/32     md5
host  postgres    efm  192.168.56.30/32     md5
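After adding these entries, reload the configuration on each cluster (for example with select pg_reload_conf(); or pg_ctl reload) and, as a quick check, verify that the efm user can connect from the other nodes. A minimal sketch, assuming the clusters listen on port 5436 as configured below:

psql -h 192.168.56.36 -p 5436 -U efm -d postgres -c "select version();"
psql -h 192.168.56.37 -p 5436 -U efm -d postgres -c "select version();"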

The configuration of efm consists of editing 2 main configuration files: efm.nodes and efm.properties. In my case these files are located in /etc/edb/efm-3.0. There are already two sample files that we can copy and then edit.
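A possible way to start from the samples, assuming they are shipped as efm.properties.in and efm.nodes.in in the same directory:

cd /etc/edb/efm-3.0
cp efm.properties.in efm.properties
cp efm.nodes.in efm.nodes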
First we need to encrypt the password of the user efm, and then configure efm.nodes and efm.properties on all nodes.

[[email protected] efm-3.0]# /usr/edb/efm-3.0/bin/efm encrypt efm
This utility will generate an encrypted password for you to place in your
EFM cluster property file.
Please enter the password and hit enter:
Please enter the password again to confirm:
The encrypted password is: ff7f041651e5e864013c1102d26a5e08
Please paste this into your cluster properties file.
        db.password.encrypted= ff7f041651e5e864013c1102d26a5e08

Below we show the contents of our two configuration files.
On pgserver1, which is the primary:
efm.nodes

[[email protected] 101]# cat /etc/edb/efm-3.0/efm.nodes
# List of node address:port combinations separated by whitespace.
# The list should include at least the membership coordinator's address.
192.168.56.30:9998 192.168.56.37:9998
[[email protected] 101]#

efm.properties

[[email protected] 101]# cat /etc/edb/efm-3.0/efm.properties | grep -v ^#
db.user=efm
db.password.encrypted=ff7f041651e5e864013c1102d26a5e08
db.port=5436
db.database=postgres
db.service.owner=postgres
db.service.name=
db.bin=/u01/app/postgres/product/10/db_1/bin
db.recovery.conf.dir=/u90/mydata/101
jdbc.sslmode=disable
[email protected]
script.notification=
bind.address=192.168.56.36:9998
admin.port=9999
is.witness=false
local.period=10
local.timeout=60
local.timeout.final=10
remote.timeout=10
node.timeout=50
stop.isolated.master=false
pingServerIp=192.168.56.30
pingServerCommand=/bin/ping -q -c3 -w5
auto.allow.hosts=false
db.reuse.connection.count=0
auto.failover=true
auto.reconfigure=true
promotable=true
minimum.standbys=0
recovery.check.period=2
auto.resume.period=0
script.fence=
script.post.promotion=
script.resumed=
script.db.failure=
script.master.isolated=
script.remote.pre.promotion=
script.remote.post.promotion=
script.custom.monitor=
custom.monitor.interval=
custom.monitor.timeout=
custom.monitor.safe.mode=
sudo.command=sudo
sudo.user.command=sudo -u %u
log.dir=/var/log/efm-3.0
jgroups.loglevel=INFO
efm.loglevel=INFO
jvm.options=-Xmx32m
[[email protected] 101]#

On pgserver2, which is the standby:
efm.nodes

[[email protected] tmp]# cat /etc/edb/efm-3.0/efm.nodes
# List of node address:port combinations separated by whitespace.
# The list should include at least the membership coordinator's address.
192.168.56.30:9998 192.168.56.36:9998
[[email protected] tmp]#

efm.properties

[[email protected] tmp]# cat /etc/edb/efm-3.0/efm.properties | grep -v ^#
db.user=efm
db.password.encrypted=ff7f041651e5e864013c1102d26a5e08
db.port=5436
db.database=postgres
db.service.owner=postgres
db.service.name=
db.bin=/u01/app/postgres/product/10/db_1/bin
db.recovery.conf.dir=/u90/mydata/101
jdbc.sslmode=disable
[email protected]
script.notification=
bind.address=192.168.56.37:9998
admin.port=9999
is.witness=false
local.period=10
local.timeout=60
local.timeout.final=10
remote.timeout=10
node.timeout=50
stop.isolated.master=false
pingServerIp=192.168.56.30
pingServerCommand=/bin/ping -q -c3 -w5
auto.allow.hosts=true
db.reuse.connection.count=0
auto.failover=true
auto.reconfigure=true
promotable=true
minimum.standbys=0
recovery.check.period=2
auto.resume.period=0
script.fence=
script.post.promotion=
script.resumed=
script.db.failure=
script.master.isolated=
script.remote.pre.promotion=
script.remote.post.promotion=
script.custom.monitor=
custom.monitor.interval=
custom.monitor.timeout=
custom.monitor.safe.mode=
sudo.command=sudo
sudo.user.command=sudo -u %u
log.dir=/var/log/efm-3.0
jgroups.loglevel=INFO
efm.loglevel=INFO
jvm.options=-Xmx32m
[[email protected] tmp]#

On pgservertools, which is the witness:
efm.nodes

[[email protected] efm-3.0]# cat /etc/edb/efm-3.0/efm.nodes
# List of node address:port combinations separated by whitespace.
# The list should include at least the membership coordinator's address.
192.168.56.36:9998 192.168.56.37:9998
[[email protected] efm-3.0]#

efm.properties

[[email protected] efm-3.0]# cat /etc/edb/efm-3.0/efm.properties | grep -v ^#
db.user=efm
db.password.encrypted=ff7f041651e5e864013c1102d26a5e08
db.port=5436
db.database=postgres
db.service.owner=postgres
db.service.name=
db.bin=/u01/app/postgres/product/10/db_1/bin
db.recovery.conf.dir=/u90/mydata/101
jdbc.sslmode=disable
[email protected]
script.notification=
bind.address=192.168.56.30:9998
admin.port=9999
is.witness=true
local.period=10
local.timeout=60
local.timeout.final=10
remote.timeout=10
node.timeout=50
stop.isolated.master=false
pingServerIp=192.168.56.30
pingServerCommand=/bin/ping -q -c3 -w5
auto.allow.hosts=false
db.reuse.connection.count=0
auto.failover=true
auto.reconfigure=true
promotable=true
minimum.standbys=0
recovery.check.period=2
auto.resume.period=0
script.fence=
script.post.promotion=
script.resumed=
script.db.failure=
script.master.isolated=
script.remote.pre.promotion=
script.remote.post.promotion=
script.custom.monitor=
custom.monitor.interval=
custom.monitor.timeout=
custom.monitor.safe.mode=
sudo.command=sudo
sudo.user.command=sudo -u %u
log.dir=/var/log/efm-3.0
jgroups.loglevel=INFO
efm.loglevel=INFO
jvm.options=-Xmx32m
[[email protected] efm-3.0]#

Now let’s start EFM on all nodes. If there is any error, check the logs in /var/log/efm-3.0/.
I started the agents in the following order: pgserver1, pgserver2 and pgservertools. The services can also be configured to start automatically when the server boots (see the enable command after the status output below).
Below an example on pgservertools:

[[email protected] efm-3.0]# service efm-3.0 start
.
[[email protected] efm-3.0]# service efm-3.0 status
Redirecting to /bin/systemctl status  efm-3.0.service
● efm-3.0.service - EnterpriseDB Failover Manager 3.0
   Loaded: loaded (/usr/lib/systemd/system/efm-3.0.service; disabled; vendor preset: disabled)
   Active: active (running) since Tue 2018-03-06 15:58:51 CET; 1h 46min ago
  Process: 22260 ExecStart=/bin/bash -c /usr/edb/efm-3.0/bin/runefm.sh start ${CLUSTER} (code=exited, status=0/SUCCESS)
 Main PID: 22321 (java)
   CGroup: /system.slice/efm-3.0.service
           └─22321 /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.el7_4.x86_64/jre/bin/java -cp /usr/edb/efm-3.0/lib/EFM-3.0.0.jar -Xmx32m com.enterprisedb.efm.main.ServiceCommand __int_start /e...
Mar 06 15:58:45 pgservertools.localdomain systemd[1]: Starting EnterpriseDB Failover Manager 3.0...
Mar 06 15:58:51 pgservertools.localdomain systemd[1]: Started EnterpriseDB Failover Manager 3.0.
[[email protected] efm-3.0]#
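Since the unit is reported as disabled above, it can be enabled on each node so that the agent starts automatically at boot:

systemctl enable efm-3.0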

Once started, we can verify the status of our cluster from any node:

[[email protected] efm-3.0]# /usr/edb/efm-3.0/bin/efm cluster-status efm
Cluster Status: efm
VIP:
        Agent Type  Address              Agent  DB       Info
        --------------------------------------------------------------
        Witness     192.168.56.30        UP     N/A
        Master      192.168.56.36        UP     UP
        Standby     192.168.56.37        UP     UP
Allowed node host list:
        192.168.56.36 192.168.56.37 192.168.56.30
Membership coordinator: 192.168.56.30
Standby priority host list:
        192.168.56.37
Promote Status:
        DB Type     Address              XLog Loc         Info
        --------------------------------------------------------------
        Master      192.168.56.36        0/430001A8
        Standby     192.168.56.37        0/430001A8
        Standby database(s) in sync with master. It is safe to promote.
[[email protected] efm-3.0]#

Now that everything is OK, let’s do a switchover:

[[email protected] efm-3.0]# /usr/edb/efm-3.0/bin/efm promote efm -switchover
Promote/switchover command accepted by local agent. Proceeding with promotion and will reconfigure original master. Run the 'cluster-status' command for information about the new cluster state.

And if we run the cluster-status command during the switchover

[[email protected] efm-3.0]# /usr/edb/efm-3.0/bin/efm cluster-status efm
Cluster Status: efm
VIP:
        Agent Type  Address              Agent  DB       Info
        --------------------------------------------------------------
        Witness     192.168.56.30        UP     N/A
        Idle        192.168.56.36        UP     UNKNOWN
        Standby     192.168.56.37        UP     UP
Allowed node host list:
        192.168.56.36 192.168.56.37 192.168.56.30
Membership coordinator: 192.168.56.30
Standby priority host list:
        192.168.56.37
Promote Status:
        DB Type     Address              XLog Loc         Info
        --------------------------------------------------------------
        Standby     192.168.56.37        0/44000098
        No master database was found.
Idle Node Status (idle nodes ignored in XLog location comparisons):
        Address              XLog Loc         Info
        --------------------------------------------------------------
        192.168.56.36        UNKNOWN          Connection to 192.168.56.36:5436 refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.

And we can see the promoting phase

[[email protected] efm-3.0]# /usr/edb/efm-3.0/bin/efm cluster-status efm
Cluster Status: efm
VIP:
        Agent Type  Address              Agent  DB       Info
        --------------------------------------------------------------
        Witness     192.168.56.30        UP     N/A
        Idle        192.168.56.36        UP     UNKNOWN
        Promoting   192.168.56.37        UP     UP
Allowed node host list:
        192.168.56.36 192.168.56.37 192.168.56.30
Membership coordinator: 192.168.56.30
Standby priority host list:
        (List is empty.)
Promote Status:
        DB Type     Address              XLog Loc         Info
        --------------------------------------------------------------
        Master      192.168.56.37        0/44000170
        No standby databases were found.
Idle Node Status (idle nodes ignored in XLog location comparisons):
        Address              XLog Loc         Info
        --------------------------------------------------------------
        192.168.56.36        UNKNOWN          Connection to 192.168.56.36:5436 refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.

And after a short while we can see that the new master is on pgserver2:

[[email protected] efm-3.0]# /usr/edb/efm-3.0/bin/efm cluster-status efm
Cluster Status: efm
VIP:
        Agent Type  Address              Agent  DB       Info
        --------------------------------------------------------------
        Witness     192.168.56.30        UP     N/A
        Standby     192.168.56.36        UP     UP
        Master      192.168.56.37        UP     UP
Allowed node host list:
        192.168.56.36 192.168.56.37 192.168.56.30
Membership coordinator: 192.168.56.30
Standby priority host list:
        192.168.56.36
Promote Status:
        DB Type     Address              XLog Loc         Info
        --------------------------------------------------------------
        Master      192.168.56.37        0/440001A8
        Standby     192.168.56.36        0/440001A8
        Standby database(s) in sync with master. It is safe to promote.
[[email protected] efm-3.0]#

The purpose of the witness is to allow an automatic failover when the primary goes down. Let’s simulate a crash of our primary database by killing the corresponding process.

[[email protected] tmp]# ps -ef | grep postgres
...
...
root     17529 14103  0 16:45 pts/1    00:00:00 tail -f /u90/mydata/101/log/postgresql-2018-03-06.log
postgres 20612     1  0 17:56 ?        00:00:00 /u01/app/postgres/product/10/db_1/bin/postgres -D /u90/mydata/101
postgres 20613 20612  0 17:56 ?        00:00:00 postgres: logger process
postgres 20615 20612  0 17:56 ?        00:00:00 postgres: checkpointer process
postgres 20616 20612  0 17:56 ?        00:00:00 postgres: writer process
postgres 20617 20612  0 17:56 ?        00:00:00 postgres: stats collector process
postgres 20819 20612  0 18:00 ?        00:00:00 postgres: wal writer process
postgres 20820 20612  0 18:00 ?        00:00:00 postgres: autovacuum launcher process
postgres 20821 20612  0 18:00 ?        00:00:00 postgres: archiver process   last was 00000008.history
postgres 20822 20612  0 18:00 ?        00:00:00 postgres: bgworker: logical replication launcher
postgres 20832 20612  0 18:00 ?        00:00:00 postgres: wal sender process repliuser 192.168.56.36(45827) streaming 0/440001A8
root     21481 16868  0 18:16 pts/0    00:00:00 grep --color=auto postgres
[[email protected] tmp]#

And let’s execute the kill command

[[email protected] tmp]# kill -9 20612

If we check the cluster status from the witness server, we can see that the master is in an UNKNOWN status

[[email protected] efm-3.0]# /usr/edb/efm-3.0/bin/efm cluster-status efm
Cluster Status: efm
VIP:
        Agent Type  Address              Agent  DB       Info
        --------------------------------------------------------------
        Witness     192.168.56.30        UP     N/A
        Standby     192.168.56.36        UP     UP
        Idle        192.168.56.37        UP     UNKNOWN
Allowed node host list:
        192.168.56.36 192.168.56.37 192.168.56.30
Membership coordinator: 192.168.56.30
Standby priority host list:
        192.168.56.36
Promote Status:
        DB Type     Address              XLog Loc         Info
        --------------------------------------------------------------
        Standby     192.168.56.36        0/440001A8
        No master database was found.
Idle Node Status (idle nodes ignored in XLog location comparisons):
        Address              XLog Loc         Info
        --------------------------------------------------------------
        192.168.56.37        UNKNOWN          Connection to 192.168.56.37:5436 refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.
[[email protected] efm-3.0]#

In the PostgreSQL log on our standby server pgserver1, we can see that the database is promoted to primary a short time later.

2018-03-06 18:17:49.381 CET [18384] FATAL:  could not receive data from WAL stream: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.
2018-03-06 18:17:49.382 CET [18380] LOG:  invalid record length at 0/440001A8: wanted 24, got 0
2018-03-06 18:17:49.387 CET [19049] FATAL:  could not connect to the primary server: could not connect to server: Connection refused
                Is the server running on host "192.168.56.37" and accepting
                TCP/IP connections on port 5436?
2018-03-06 18:17:54.404 CET [19055] FATAL:  could not connect to the primary server: could not connect to server: Connection refused
                Is the server running on host "192.168.56.37" and accepting
                TCP/IP connections on port 5436?
2018-03-06 18:17:59.406 CET [19107] FATAL:  could not connect to the primary server: could not connect to server: Connection refused
                Is the server running on host "192.168.56.37" and accepting
…
….
….
                TCP/IP connections on port 5436?
                TCP/IP connections on port 5436?
2018-03-06 18:18:34.450 CET [19128] FATAL:  could not connect to the primary server: could not connect to server: Connection refused
                Is the server running on host "192.168.56.37" and accepting
                TCP/IP connections on port 5436?
2018-03-06 18:18:39.451 CET [19134] FATAL:  could not connect to the primary server: could not connect to server: Connection refused
                Is the server running on host "192.168.56.37" and accepting
                TCP/IP connections on port 5436?
2018-03-06 18:18:44.462 CET [19135] FATAL:  could not connect to the primary server: could not connect to server: Connection refused
                Is the server running on host "192.168.56.37" and accepting
                TCP/IP connections on port 5436?
2018-03-06 18:18:49.456 CET [18380] LOG:  trigger file found: /tmp/postgresql.trigger
2018-03-06 18:18:49.456 CET [18380] LOG:  redo done at 0/44000170
2018-03-06 18:18:49.479 CET [18380] LOG:  selected new timeline ID: 9
2018-03-06 18:18:50.128 CET [18380] LOG:  archive recovery complete
2018-03-06 18:18:50.229 CET [18378] LOG:  database system is ready to accept connections

We can confirm this by querying the cluster status:

[[email protected] efm-3.0]# /usr/edb/efm-3.0/bin/efm cluster-status efm
Cluster Status: efm
VIP:
        Agent Type  Address              Agent  DB       Info
        --------------------------------------------------------------
        Witness     192.168.56.30        UP     N/A
        Master      192.168.56.36        UP     UP
        Idle        192.168.56.37        UP     UNKNOWN
Allowed node host list:
        192.168.56.36 192.168.56.37 192.168.56.30
Membership coordinator: 192.168.56.30
Standby priority host list:
        (List is empty.)
Promote Status:
        DB Type     Address              XLog Loc         Info
        --------------------------------------------------------------
        Master      192.168.56.36        0/440002B8
        No standby databases were found.
Idle Node Status (idle nodes ignored in XLog location comparisons):
        Address              XLog Loc         Info
        --------------------------------------------------------------
        192.168.56.37        UNKNOWN          Connection to 192.168.56.37:5436 refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.
[[email protected] efm-3.0]#

and

[[email protected] efm-3.0]# /usr/edb/efm-3.0/bin/efm cluster-status efm
Cluster Status: efm
VIP:
        Agent Type  Address              Agent  DB       Info
        --------------------------------------------------------------
        Witness     192.168.56.30        UP     N/A
        Master      192.168.56.36        UP     UP
Allowed node host list:
        192.168.56.36 192.168.56.37 192.168.56.30
Membership coordinator: 192.168.56.30
Standby priority host list:
        (List is empty.)
Promote Status:
        DB Type     Address              XLog Loc         Info
        --------------------------------------------------------------
        Master      192.168.56.36        0/440002B8
        No standby databases were found.
[[email protected] efm-3.0]#

On the old primary pgserver2, we can see the contents of the recovery.conf file automatically created by EDB Failover Manager:

[[email protected] 101]$ cat recovery.conf
# EDB Failover Manager
# This generated recovery.conf file prevents the db server from accidentally
# being restarted as a master since a failover or promotion has occurred
standby_mode = on
restore_command = 'echo 2>"recovery suspended on failed server node"; exit 1'
[[email protected] 101]$

To rebuild our standby database, we have to edit the recovery.conf file.
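A minimal sketch of what the edited recovery.conf on pgserver2 could look like, assuming the new primary is pgserver1 (192.168.56.36) listening on port 5436 and that the replication user is repliuser, as seen in the process list above (adapt the connection string and password handling to your environment):

standby_mode = 'on'
primary_conninfo = 'host=192.168.56.36 port=5436 user=repliuser'
recovery_target_timeline = 'latest'

Depending on how far the old primary has diverged, a pg_rewind or a fresh base backup may also be needed before it can follow the new timeline.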
Conclusion
In this blog we have seen how to configure edb-efm30. Note that a virtual IP can also be configured; the official documentation can help.

