Infrastructure at your Service

Daniel Westermann

Maintenance scenarios with EDB Failover Manager (1) – Standby node

Recently, at a customer, we again implemented a PostgreSQL architecture as described some time ago. When we delivered the first draft of the documentation, the question popped up of how to handle maintenance tasks in a failover cluster protected by EDB Failover Manager. So, here we go…

The setup of my little test environment is as follows:

IP              Description
--------------- -------------------------------------------------------------
192.168.22.243  Current PostgreSQL hot standby instance
192.168.22.245  Current PostgreSQL primary instance
192.168.22.244  EDB Failover Manager witness node + EDB BART
192.168.22.250  Virtual IP used for client connections to the master database

In words: one master database streams to a hot standby database. The third node hosts EDB BART for backup and recovery plus the EDB Failover Manager witness. For this specific setup we use a VIP that fails over as well if the master database goes down for any reason.

Using the EFM command line utility the current status of the cluster can be displayed:

[email protected]:/home/postgres/ [pg950] efm cluster-status efm
Cluster Status: efm
Automatic failover is disabled.

	Agent Type  Address              Agent  DB       Info
	--------------------------------------------------------------
	Standby     192.168.22.243       UP     UP        
	Witness     192.168.22.244       UP     N/A       
	Master      192.168.22.245       UP     UP        

Allowed node host list:
	192.168.22.244 192.168.22.245 192.168.22.243

Standby priority host list:
	192.168.22.243

Promote Status:

	DB Type     Address              XLog Loc         Info
	--------------------------------------------------------------
	Master      192.168.22.245       0/320001E8       
	Standby     192.168.22.243       0/320001E8       

	Standby database(s) in sync with master. It is safe to promote.

This clearly shows that the cluster is currently healthy and that a promote could safely be executed. Note that automatic failover is disabled here.
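Whether failover happens automatically is controlled in the cluster's properties file. As a sketch (property name as used by EFM 2.0; verify against the documentation of your version), the relevant entry in /etc/efm-2.0/efm.properties would look like this:

```
# /etc/efm-2.0/efm.properties (excerpt, hypothetical values)
# When set to false the agents only monitor the cluster;
# a promotion has to be triggered manually via "efm promote".
auto.failover=false
```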

What do you need to do when you want to perform maintenance on the current standby node and have to take the database down to do so? Actually, not much. Let's first confirm that this really is the standby node:

[email protected]:/home/postgres/ [PGSITE1] psql postgres
psql.bin (9.5.0.5)
Type "help" for help.

postgres=# select * from pg_is_in_recovery();
 pg_is_in_recovery 
-------------------
 t
(1 row)

postgres=# 

Fine. I'll do it the hard way and just reboot the server (autostart for the PostgreSQL instance is disabled in my case):

[[email protected] ~] reboot
Connection to 192.168.22.243 closed by remote host.
Connection to 192.168.22.243 closed.

What is the status of my cluster now?

[email protected]:/home/postgres/ [pg950] efm cluster-status efm
Cluster Status: efm
Automatic failover is disabled.

	Agent Type  Address              Agent  DB       Info
	--------------------------------------------------------------
	Witness     192.168.22.244       UP     N/A       
	Master      192.168.22.245       UP     UP        

Allowed node host list:
	192.168.22.244 192.168.22.245 192.168.22.243

Standby priority host list:
	(List is empty.)

Promote Status:

	DB Type     Address              XLog Loc         Info
	--------------------------------------------------------------
	Master      192.168.22.245       0/320002C8       

	No standby databases were found.

The standby database disappeared. Once the node comes up again, let's check the Failover Manager service on that node:

[[email protected] ~] systemctl status efm-2.0.service
efm-2.0.service - EnterpriseDB Failover Manager 2.0
   Loaded: loaded (/usr/lib/systemd/system/efm-2.0.service; enabled)
   Active: failed (Result: exit-code) since Fri 2016-04-08 15:51:51 CEST; 1min 39s ago
  Process: 583 ExecStart=/bin/java -cp /usr/efm-2.0/lib/EFM-2.0.4.jar com.enterprisedb.hal.main.ServiceCommand start /etc/efm-2.0/${CLUSTER}.properties (code=exited, status=1/FAILURE)

Apr 08 15:51:51 edbppas java[583]: [ 4/8/16 3:51:51 PM ]
Apr 08 15:51:51 edbppas java[583]: [ 4/8/16 3:51:51 PM ] 4/8/16 3:51:50 PM com.enterprisedb.hal.exec.ExecUtil performExe...ey efm]
Apr 08 15:51:51 edbppas java[583]: [ 4/8/16 3:51:51 PM ]
Apr 08 15:51:51 edbppas java[583]: [ 4/8/16 3:51:51 PM ] 4/8/16 3:51:50 PM com.enterprisedb.hal.exec.ExecUtil performExe...Out=''}
Apr 08 15:51:51 edbppas java[583]: [ 4/8/16 3:51:51 PM ]
Apr 08 15:51:51 edbppas java[583]: [ 4/8/16 3:51:51 PM ] 4/8/16 3:51:50 PM com.enterprisedb.hal.nodes.HalAgent shutdown ...xiting.
Apr 08 15:51:51 edbppas java[583]: [ 4/8/16 3:51:51 PM ]
Apr 08 15:51:51 edbppas systemd[1]: efm-2.0.service: control process exited, code=exited status=1
Apr 08 15:51:51 edbppas systemd[1]: Failed to start EnterpriseDB Failover Manager 2.0.
Apr 08 15:51:51 edbppas systemd[1]: Unit efm-2.0.service entered failed state.
Hint: Some lines were ellipsized, use -l to show in full.
[[email protected] ~] cat /var/log/efm-2.0/efm.log
[[email protected] ~] 

Hm, not so good. Although the service is enabled, it did not come up after the reboot. How do we fix this? The database is still configured for recovery:

[email protected]:/home/postgres/ [PGSITE1] cat $PGDATA/recovery.conf
standby_mode = 'on'
primary_slot_name = 'standby1'
primary_conninfo = 'user=postgres password=admin123 host=192.168.22.245 port=4445 sslmode=prefer sslcompression=1'
recovery_target_timeline = 'latest'
trigger_file='/u02/pgdata/PGSITE1/trigger_file'
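Since recovery.conf references the replication slot standby1, it can be double-checked on the master that the slot still exists before starting the standby again. A quick sanity check (not from the original walkthrough; the port is taken from primary_conninfo above):

```shell
# On the master (192.168.22.245): verify that the physical replication slot
# referenced by primary_slot_name is still present.
psql -p 4445 -d postgres \
     -c "select slot_name, slot_type, active from pg_replication_slots;"
```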

So it should be safe to start it up. Once it is up, the logfile shows:

2016-04-08 14:25:24.350 GMT - 3 - 2428 -  - @ LOG:  consistent recovery state reached at 0/320001E8
2016-04-08 14:25:24.350 GMT - 4 - 2428 -  - @ LOG:  redo starts at 0/320001E8
2016-04-08 14:25:24.350 GMT - 5 - 2428 -  - @ LOG:  invalid record length at 0/320002C8
2016-04-08 14:25:24.351 GMT - 4 - 2426 -  - @ LOG:  database system is ready to accept read only connections
2016-04-08 14:25:24.359 GMT - 1 - 2432 -  - @ LOG:  started streaming WAL from primary at 0/32000000 on timeline 2

Streaming started again, so from a database perspective everything is back to normal. But what about our Failover Manager? In this case we need to tell Failover Manager to rejoin the cluster by adding all nodes to the efm.nodes file:

[[email protected] efm-2.0] pwd
/etc/efm-2.0
[[email protected] efm-2.0] cat efm.nodes
# List of node address:port combinations separated by whitespace.
192.168.22.244:9998 192.168.22.245:9998 192.168.22.243:9998

Once this is in place we can restart the service and everything returns to normal operation:

[[email protected] efm-2.0] systemctl start efm-2.0.service
[[email protected] efm-2.0] /usr/efm-2.0/bin/efm cluster-status efm
Cluster Status: efm
Automatic failover is disabled.

	Agent Type  Address              Agent  DB       Info
	--------------------------------------------------------------
	Witness     192.168.22.244       UP     N/A       
	Standby     192.168.22.243       UP     UP        
	Master      192.168.22.245       UP     UP        

Allowed node host list:
	192.168.22.244 192.168.22.243 192.168.22.245

Standby priority host list:
	192.168.22.243

Promote Status:

	DB Type     Address              XLog Loc         Info
	--------------------------------------------------------------
	Master      192.168.22.245       0/330000D0       
	Standby     192.168.22.243       0/330000D0       

	Standby database(s) in sync with master. It is safe to promote.

Note 1: If you need to reboot the node the standby is currently running on, you'll need to rejoin it to the cluster afterwards.
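To summarize, the reboot scenario above boils down to the following steps (paths and service name as used in this setup; pgstart in my environment is an alias, shown here as a plain pg_ctl call):

```shell
# 1. Reboot the standby node; EFM drops it from the cluster.
# 2. After boot, start the standby database again, e.g.:
pg_ctl -D "$PGDATA" start
# 3. Make sure all cluster members are listed in efm.nodes:
cat /etc/efm-2.0/efm.nodes
# 4. Restart the EFM agent so the node rejoins the cluster:
systemctl start efm-2.0.service
# 5. Verify the cluster state:
/usr/efm-2.0/bin/efm cluster-status efm
```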

What happens if we stop the standby database without rebooting the node?

[email protected]:/home/postgres/ [PGSITE1] pgstop
waiting for server to shut down.... done
server stopped
[email protected]:/home/postgres/ [PGSITE1] efm cluster-status efm
Cluster Status: efm
Automatic failover is disabled.

	Agent Type  Address              Agent  DB       Info
	--------------------------------------------------------------
	Witness     192.168.22.244       UP     N/A       
	Master      192.168.22.245       UP     UP        
	Standby     192.168.22.243       UP     UP        

Allowed node host list:
	192.168.22.244 192.168.22.243 192.168.22.245

Standby priority host list:
	192.168.22.243

Promote Status:

	DB Type     Address              XLog Loc         Info
	--------------------------------------------------------------
	Master      192.168.22.245       0/330000D0       
	Unknown     192.168.22.243       UNKNOWN          Connection refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.

	One or more standby databases are not in sync with the master database.

The configuration itself is still intact, but the standby database is reported as "UNKNOWN". Nothing to worry about: in this case you can just restart the instance:

[email protected]:/home/postgres/ [PGSITE1] pgstart
server starting
[email protected]:/home/postgres/ [PGSITE1] 2016-04-08 14:35:50.411 GMT - 1 - 3202 -  - @ LOG:  redirecting log output to logging collector process
2016-04-08 14:35:50.411 GMT - 2 - 3202 -  - @ HINT:  Future log output will appear in directory "/u02/pgdata/PGSITE1/pg_log".

[email protected]:/home/postgres/ [PGSITE1] efm cluster-status efm
Cluster Status: efm
Automatic failover is disabled.

	Agent Type  Address              Agent  DB       Info
	--------------------------------------------------------------
	Witness     192.168.22.244       UP     N/A       
	Idle        192.168.22.243       UP     UNKNOWN   
	Master      192.168.22.245       UP     UP        

Allowed node host list:
	192.168.22.244 192.168.22.243 192.168.22.245

Standby priority host list:
	(List is empty.)

Promote Status:

	DB Type     Address              XLog Loc         Info
	--------------------------------------------------------------
	Master      192.168.22.245       0/330000D0       

	No standby databases were found.

Idle Node Status (idle nodes ignored in XLog location comparisons):

	Address              XLog Loc         Info
	--------------------------------------------------------------
	192.168.22.243       0/330000D0       DB is in recovery.

The standby is now reported as an idle node with "DB is in recovery." How do we fix this? Use the "resume" command:

[[email protected] ~] /usr/efm-2.0/bin/efm resume efm
Resume command successful on local agent.
[[email protected] ~] /usr/efm-2.0/bin/efm cluster-status efm
Cluster Status: efm
Automatic failover is disabled.

	Agent Type  Address              Agent  DB       Info
	--------------------------------------------------------------
	Standby     192.168.22.243       UP     UP        
	Witness     192.168.22.244       UP     N/A       
	Master      192.168.22.245       UP     UP        

Allowed node host list:
	192.168.22.244 192.168.22.243 192.168.22.245

Standby priority host list:
	192.168.22.243

Promote Status:

	DB Type     Address              XLog Loc         Info
	--------------------------------------------------------------
	Master      192.168.22.245       0/34000060       
	Standby     192.168.22.243       0/34000060       

	Standby database(s) in sync with master. It is safe to promote.

Ready again.

Note 2: If you take down only the standby database, use the "resume" command after starting it up again to return to normal operation.
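Again as a short recap, the standby-only restart scenario is (pgstop/pgstart in my environment are aliases, shown here as plain pg_ctl calls):

```shell
# 1. Stop the standby database; EFM reports it as UNKNOWN, the agent stays up.
pg_ctl -D "$PGDATA" stop
# 2. Start it again; EFM now lists the node as Idle.
pg_ctl -D "$PGDATA" start
# 3. Tell the local agent to resume monitoring the database:
/usr/efm-2.0/bin/efm resume efm
# 4. Verify that the standby is back in sync:
/usr/efm-2.0/bin/efm cluster-status efm
```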

