Infrastructure at your Service

Daniel Westermann

Maintenance scenarios with EDB Failover Manager (2) – Primary node

In the last post I looked at how you can do maintenance operations on the standby node when you are working in a PostgreSQL cluster protected by EDB Failover Manager. In this post I’ll look on how you can do maintenance on the primary node (better: the node where the primary instance currently runs on). This requires slightly more work and attention. Lets go.

As a quick reminder this is the setup:

IP Description
192.168.22.243 Current PostgreSQL hot standby instance
192.168.22.245 Currernt PostgreSQL primary instance
192.168.22.244 EDB Failover Manager Witness Node + EDB BART
192.168.22.250 Virtual IP that is used for client connections to the master database

When we want to do maintenance on the current primary node this will require a fail over of the PostgreSQL instance. In addition the VIP shall fail over too to provide the clients the same address to connect as before the fail over. Lets check the current status of the fail over cluster:

[email protected]:/home/postgres/ [pg950] efmstat 
Cluster Status: efm
Automatic failover is disabled.

	Agent Type  Address              Agent  DB       Info
	--------------------------------------------------------------
	Master      192.168.22.245       UP     UP        
	Witness     192.168.22.244       UP     N/A       
	Standby     192.168.22.243       UP     UP        

Allowed node host list:
	192.168.22.244 192.168.22.245 192.168.22.243

Standby priority host list:
	192.168.22.243

Promote Status:

	DB Type     Address              XLog Loc         Info
	--------------------------------------------------------------
	Master      192.168.22.245       0/350000D0       
	Standby     192.168.22.243       0/350000D0       

	Standby database(s) in sync with master. It is safe to promote.

The last line is telling us that it is probably safe to promote. Consider these two cases:

  • Case 1: You want to perform a maintenance which requires the reboot the whole node
  • Case 2: You want to perform a maintenance which requires only the restart of the PostgreSQL instance on that node

It is important to distinguish the two cases because it will impact on how you have to deal with EDB Failover Manager. Lets see what happens in case one when we do a promote and then reboot the node. I’ll do the promote command on the witness node but it doesn’t really matter as you can execute it from any node in the fail over cluster:

-bash-4.2$ /usr/efm-2.0/bin/efm promote efm
Promote command accepted by local agent. Proceeding with promotion. Run the 'cluster-status' command for information about the new cluster state.

What happened. First lets check the cluster status:

-bash-4.2$ /usr/efm-2.0/bin/efm cluster-status efm
Cluster Status: efm
Automatic failover is disabled.

	Agent Type  Address              Agent  DB       Info
	--------------------------------------------------------------
	Idle        192.168.22.245       UP     UNKNOWN   
	Witness     192.168.22.244       UP     N/A       
	Master      192.168.22.243       UP     UP        

Allowed node host list:
	192.168.22.244 192.168.22.245 192.168.22.243

Standby priority host list:
	(List is empty.)

Promote Status:

	DB Type     Address              XLog Loc         Info
	--------------------------------------------------------------
	Master      192.168.22.243       0/350001E0       

	No standby databases were found.

Idle Node Status (idle nodes ignored in XLog location comparisons):

	Address              XLog Loc         Info
	--------------------------------------------------------------
	192.168.22.245       0/350000D0       DB is not in recovery.

The old master is gone (Idle/UNKNOWN) and the old standby became the new master. You can double check this if you login to the new master and check if the instance is still in recovery mode:

[email protected]:/home/postgres/ [PGSITE1] ip a
1: lo:  mtu 65536 qdisc noqueue state UNKNOWN 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: enp0s3:  mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 08:00:27:6d:d8:b7 brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.15/24 brd 10.0.2.255 scope global dynamic enp0s3
       valid_lft 85491sec preferred_lft 85491sec
    inet6 fe80::a00:27ff:fe6d:d8b7/64 scope link 
       valid_lft forever preferred_lft forever
3: enp0s8:  mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 08:00:27:e4:19:ec brd ff:ff:ff:ff:ff:ff
    inet 192.168.22.243/24 brd 192.168.22.255 scope global enp0s8
       valid_lft forever preferred_lft forever
    inet 192.168.22.250/24 brd 192.168.22.255 scope global secondary enp0s8:0
       valid_lft forever preferred_lft forever
    inet6 fe80::a00:27ff:fee4:19ec/64 scope link 
       valid_lft forever preferred_lft forever
[email protected]:/home/postgres/ [PGSITE1] sqh
psql.bin (9.5.0.5)
Type "help" for help.

postgres=# select * from pg_is_in_recovery();
 pg_is_in_recovery 
-------------------
 f
(1 row)

From the IP configuration output it is also proved that the VIP failed over to the new master node. What is the status of the old master instance?

[email protected]:/home/postgres/ [PGSITE2] ip a
1: lo:  mtu 65536 qdisc noqueue state UNKNOWN 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: enp0s3:  mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 08:00:27:19:6f:0a brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.15/24 brd 10.0.2.255 scope global dynamic enp0s3
       valid_lft 85364sec preferred_lft 85364sec
    inet6 fe80::a00:27ff:fe19:6f0a/64 scope link 
       valid_lft forever preferred_lft forever
3: enp0s8:  mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 08:00:27:ba:c0:6a brd ff:ff:ff:ff:ff:ff
    inet 192.168.22.245/24 brd 192.168.22.255 scope global enp0s8
       valid_lft forever preferred_lft forever
    inet6 fe80::a00:27ff:feba:c06a/64 scope link 
       valid_lft forever preferred_lft forever
[email protected]:/home/postgres/ [PGSITE2] sqh
psql.bin (9.5.0.5)
Type "help" for help.

postgres=# select * from pg_is_in_recovery();                               
 pg_is_in_recovery 
-------------------
 f
(1 row)

This is dangerous, because: You now have two masters. If any of your applications uses the local IP address to connect then it would be still connected to the standby instance. Failover Manager created a new recovery.conf file though:

[email protected]:/home/postgres/ [PGSITE2] cat $PGDATA/recovery.conf
# EDB Failover Manager
# This generated recovery.conf file prevents the db server from accidentally
# being restarted as a master since a failover or promotion has occurred
standby_mode = on
restore_command = 'echo 2>"recovery suspended on failed server node"; exit 1'

But as long as you do not restart the old primary instance the instance will accept modifications. If you restart:

2016-04-11 11:37:40.114 GMT - 8 - 3454 -  - @ LOCATION:  ReadRecord, xlog.c:3983
2016-04-11 11:37:40.114 GMT - 6 - 3452 -  - @ LOG:  00000: database system is ready to accept read only connections

… this can not do any harm anymore. But the required step to restart automatically by EDB Failover Manager is missing. It is left to you to perform an immediate restart of the old master instance to prevent the two master scenario. This should definitely be improved in EDB Failover Manager.

Once maintenance is finished and the old master node is rebooted what is the status of the cluster? Lets check:

[[email protected] ~] /usr/edb-efm/bin/efm cluster-status efm
Cluster Status: efm
Automatic failover is disabled.

	Agent Type  Address              Agent  DB       Info
	--------------------------------------------------------------
	Master      192.168.22.243       UP     UP        
	Witness     192.168.22.244       UP     N/A       

Allowed node host list:
	192.168.22.244 192.168.22.243

Standby priority host list:
	(List is empty.)

Promote Status:

	DB Type     Address              XLog Loc         Info
	--------------------------------------------------------------
	Master      192.168.22.243       0/37021258       

	No standby databases were found.

The information of the old master completely disappeared. So how can we recover from that and bring back the old configuration? The first step is to add the old master node back to the allowed hosts of the cluster:

[[email protected] ~] /usr/edb-efm/bin/efm add-node efm 192.168.22.245
add-node signal sent to local agent.
[[email protected] ~] 

This should result in the following output:

[[email protected] ~] /usr/edb-efm/bin/efm cluster-status efm | grep -A 2 "Allowed"
Allowed node host list:
	192.168.22.244 192.168.22.243 192.168.22.245

Now we need to rebuild the old master as a new standby. This can be done in various ways, the two most common are to create a new basebackup of the new master or to use pg_rewind. I’ll us pg_rewind here. So to rebuild the old master as a new standby:

[email protected]:/u02/pgdata/PGSITE2/ [PGSITE2] pg_rewind -D /u02/pgdata/PGSITE2/ --source-server="port=4445 host=192.168.22.243 user=postgres dbname=postgres"
servers diverged at WAL position 0/350000D0 on timeline 2
rewinding from last common checkpoint at 0/35000028 on timeline 2
Done!

Make sure your recovery.conf matches your environment;

[email protected]:/u02/pgdata/PGSITE2/ [PGSITE2] cat recovery.conf
standby_mode = 'on'
primary_slot_name = 'standby1'
primary_conninfo = 'user=postgres password=admin123 host=192.168.22.243 port=4445 sslmode=prefer sslcompression=1'
recovery_target_timeline = 'latest'
trigger_file='/u02/pgdata/PGSITE2/trigger_file'

Start the new standby instance and check the log file:

2016-04-17 11:21:40.021 GMT - 1 - 2440 -  - @ LOG:  database system was interrupted while in recovery at log time 2016-04-17 11:09:28 GMT
2016-04-17 11:21:40.021 GMT - 2 - 2440 -  - @ HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2016-04-17 11:21:40.554 GMT - 3 - 2440 -  - @ LOG:  entering standby mode
2016-04-17 11:21:40.608 GMT - 4 - 2440 -  - @ LOG:  redo starts at 0/35000098
2016-04-17 11:21:40.704 GMT - 5 - 2440 -  - @ LOG:  consistent recovery state reached at 0/37047858
2016-04-17 11:21:40.704 GMT - 6 - 2440 -  - @ LOG:  invalid record length at 0/37047858
2016-04-17 11:21:40.704 GMT - 4 - 2438 -  - @ LOG:  database system is ready to accept read only connections
2016-04-17 11:21:40.836 GMT - 1 - 2444 -  - @ LOG:  started streaming WAL from primary at 0/37000000 on timeline 3

So far for the database part. To bring back the EDB failover manager configuration we need to adjust the efm.nodes file on the new standby to include all the hosts in the configuration:

[[email protected] efm-2.0] pwd
/etc/efm-2.0
[[email protected] efm-2.0] cat efm.nodes 
# List of node address:port combinations separated by whitespace.
192.168.22.244:9998 192.168.22.243:9998 192.168.22.245:9998

Once this is done EDB Failover Manager can be restarted and the configuration is fine again:

[[email protected] efm-2.0] systemctl start efm-2.0.service
[[email protected] efm-2.0] /usr/efm-2.0/bin/efm cluster-status efm
Cluster Status: efm
Automatic failover is disabled.

	Agent Type  Address              Agent  DB       Info
	--------------------------------------------------------------
	Witness     192.168.22.244       UP     N/A       
	Standby     192.168.22.245       UP     UP        
	Master      192.168.22.243       UP     UP        

Allowed node host list:
	192.168.22.244 192.168.22.243 192.168.22.245

Standby priority host list:
	192.168.22.245

Promote Status:

	DB Type     Address              XLog Loc         Info
	--------------------------------------------------------------
	Master      192.168.22.243       0/37048540       
	Standby     192.168.22.245       0/37048540       

	Standby database(s) in sync with master. It is safe to promote.

Coming to the second scenario: When you do not need to reboot the server but only need to take down the master database for maintenance what are the steps to follow? As with scenario 1 you’ll have to promote to activate the standby database and immediately shutdown the old master:

[[email protected] ~]# /usr/edb-efm/bin/efm promote efm
Promote command accepted by local agent. Proceeding with promotion. Run the 'cluster-status' command for information about the new cluster state.
[[email protected] ~]# /usr/edb-efm/bin/efm cluster-status efm 
Cluster Status: efm
Automatic failover is disabled.

	Agent Type  Address              Agent  DB       Info
	--------------------------------------------------------------
	Idle        192.168.22.243       UP     UNKNOWN   
	Witness     192.168.22.244       UP     N/A       
	Master      192.168.22.245       UP     UP        

Allowed node host list:
	192.168.22.244 192.168.22.243 192.168.22.245

Standby priority host list:
	(List is empty.)

Promote Status:

	DB Type     Address              XLog Loc         Info
	--------------------------------------------------------------
	Master      192.168.22.245       0/37048730       

	No standby databases were found.

Idle Node Status (idle nodes ignored in XLog location comparisons):

	Address              XLog Loc         Info
	--------------------------------------------------------------
	192.168.22.243       UNKNOWN          Connection refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.

Now you can do your maintenance operations and once you finished rebuild the old master as a new standby:

[email protected]:/u02/pgdata/PGSITE1/ [PGSITE1] pg_rewind -D /u02/pgdata/PGSITE1/ --source-server="port=4445 host=192.168.22.245 user=postgres dbname=postgres"
servers diverged at WAL position 0/37048620 on timeline 3
rewinding from last common checkpoint at 0/37048578 on timeline 3
Done!

[email protected]:/u02/pgdata/PGSITE1/ [PGSITE1] cat recovery.conf
standby_mode = 'on'
primary_slot_name = 'standby1'
primary_conninfo = 'user=postgres password=admin123 host=192.168.22.245 port=4445 sslmode=prefer sslcompression=1'
recovery_target_timeline = 'latest'
trigger_file='/u02/pgdata/PGSITE1/trigger_file'

Once you restarted the instance EDB Failover Manager shows:

[[email protected] ~]# /usr/edb-efm/bin/efm cluster-status efm 
Cluster Status: efm
Automatic failover is disabled.

	Agent Type  Address              Agent  DB       Info
	--------------------------------------------------------------
	Master      192.168.22.245       UP     UP        
	Witness     192.168.22.244       UP     N/A       
	Idle        192.168.22.243       UP     UNKNOWN   

Allowed node host list:
	192.168.22.244 192.168.22.243 192.168.22.245

Standby priority host list:
	(List is empty.)

Promote Status:

	DB Type     Address              XLog Loc         Info
	--------------------------------------------------------------
	Master      192.168.22.245       0/37068EF0       

	No standby databases were found.

Idle Node Status (idle nodes ignored in XLog location comparisons):

	Address              XLog Loc         Info
	--------------------------------------------------------------
	192.168.22.243       0/37068EF0       DB is in recovery.

The new standby is detected as “in recovery” but still shows “UNKNOWN”. To fix this execute the “resume” command on the new standby and then check the cluster status again:

[[email protected] ~] /usr/edb-efm/bin/efm resume efm
Resume command successful on local agent.
[[email protected] ~] /usr/edb-efm/bin/efm cluster-status efm
Cluster Status: efm
Automatic failover is disabled.

	Agent Type  Address              Agent  DB       Info
	--------------------------------------------------------------
	Standby     192.168.22.243       UP     UP        
	Witness     192.168.22.244       UP     N/A       
	Master      192.168.22.245       UP     UP        

Allowed node host list:
	192.168.22.244 192.168.22.243 192.168.22.245

Standby priority host list:
	192.168.22.243

Promote Status:

	DB Type     Address              XLog Loc         Info
	--------------------------------------------------------------
	Master      192.168.22.245       0/37068EF0       
	Standby     192.168.22.243       0/37068EF0       

	Standby database(s) in sync with master. It is safe to promote.

Everything back to normal operations.
Conclusion: It depends on what exactly you want to do to get the failover cluster configuration back to normal operations. The steps itself are easy, the main issue is to perform them in the right order. Having that documented is a must as you’ll probably not do these kind of tasks every day.

4 Comments

  • Alok says:

    Hi,
    thanks for this great series on EFM. Much appreciated.
    Regarding your point on two masters (after manual promotion), seems this situation is not there with EFM2.1 (-switchover) option.
    if you include the –switchover keyword, during switchover:
    -The master database is stopped.
    – The address of the new master node is added to the recovery.conf file.
    – The old master is restarted; the agent will resume monitoring it as a standby.

    Please correct me if I am wrong.

    -Alok

  • Daniel Westermann says:

    Hi Alok,

    yes, when using the “-switchover” option this is not an issue. You can check the details here:
    http://blog.dbi-services.com/edb-failover-manager-2-1-two-new-features/

    Cheers,
    Daniel

    • Alok says:

      Hi Daniel,

      yes, thats true. Thanks.
      Even if you perform unplanned failover (by terminating postmaster process or node reboot etc..), in that case too, EFM does
      not bring back old master as second master as in such cases EFM creates dummy recovery.conf file on down node to prevent
      2 masters at any given point in time.

      -Alok

  • Alok says:

    sorry, I did’t notice that you had already mentioned that in your post.
    Great work Daniel.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Daniel Westermann
Daniel Westermann

Principal Consultant & Technology Leader Open Infrastructure