Recently, at a customer, we again implemented a PostgreSQL architecture as described some time ago. When we delivered the first draft of the documentation, the question came up of how to handle maintenance tasks in a failover cluster protected by EDB Failover Manager. So, here we go…
The setup of my little test environment is as follows:
IP | Description
---|---
192.168.22.243 | Current PostgreSQL hot standby instance
192.168.22.245 | Current PostgreSQL primary instance
192.168.22.244 | EDB Failover Manager witness node + EDB BART
192.168.22.250 | Virtual IP used for client connections to the master database
In words: there is one master database streaming to a hot standby database. The third node hosts EDB BART for backup and recovery and the EDB Failover Manager witness. For this specific setup we used a VIP that fails over as well if the master database goes down for any reason.
Using the EFM command line utility the current status of the cluster can be displayed:
```
postgres@edbbart:/home/postgres/ [pg950] efm cluster-status efm
Cluster Status: efm
Automatic failover is disabled.

	Agent Type  Address              Agent  DB     Info
	--------------------------------------------------------------
	Standby     192.168.22.243       UP     UP
	Witness     192.168.22.244       UP     N/A
	Master      192.168.22.245       UP     UP

Allowed node host list:
	192.168.22.244 192.168.22.245 192.168.22.243

Standby priority host list:
	192.168.22.243

Promote Status:

	DB Type     Address              XLog Loc         Info
	--------------------------------------------------------------
	Master      192.168.22.245       0/320001E8
	Standby     192.168.22.243       0/320001E8

	Standby database(s) in sync with master. It is safe to promote.
```
This clearly shows that the cluster is healthy at the moment and a promote could safely be executed. Note that automatic failover is disabled here.
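Scripts around such a cluster often need to react to this status. Below is a minimal, hypothetical Python sketch that parses the `efm cluster-status` text shown above; the function names are mine, and the naive string matching is tied to the literal output of this EFM 2.0 version:

```python
# Minimal sketch: decide from "efm cluster-status" output whether a
# promote would be safe, and which addresses are listed as standbys.
# Parsing is deliberately naive and matches the literal status text
# shown above; adapt it to your EFM version's output format.

def promote_is_safe(status_output: str) -> bool:
    """True when EFM reports the standby in sync with the master."""
    return "It is safe to promote." in status_output

def standby_addresses(status_output: str) -> list:
    """Collect the addresses EFM lists with agent/DB type 'Standby'."""
    addrs = []
    for line in status_output.splitlines():
        parts = line.split()
        # Require an address-like second column so lines such as
        # "Standby priority host list:" are not picked up.
        if len(parts) >= 2 and parts[0] == "Standby" and parts[1][:1].isdigit():
            if parts[1] not in addrs:
                addrs.append(parts[1])
    return addrs
```

You would feed it the captured output of `efm cluster-status efm` (e.g. via `subprocess.run`) before kicking off any maintenance.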
What do you need to do when you want to perform maintenance on the current standby node and have to take the database down to do so? Actually, not much. Let's confirm that this really is the standby node:
```
postgres@edbppas:/home/postgres/ [PGSITE1] psql postgres
psql.bin (9.5.0.5)
Type "help" for help.

postgres=# select * from pg_is_in_recovery();
 pg_is_in_recovery
-------------------
 t
(1 row)

postgres=#
```
Fine. I’ll do it the hard way and just reboot the server (autostart for the PostgreSQL instance is disabled in my case):
```
[root@edbppas ~] reboot
Connection to 192.168.22.243 closed by remote host.
Connection to 192.168.22.243 closed.
```
What is the status of my cluster now?
```
postgres@edbbart:/home/postgres/ [pg950] efm cluster-status efm
Cluster Status: efm
Automatic failover is disabled.

	Agent Type  Address              Agent  DB     Info
	--------------------------------------------------------------
	Witness     192.168.22.244       UP     N/A
	Master      192.168.22.245       UP     UP

Allowed node host list:
	192.168.22.244 192.168.22.245 192.168.22.243

Standby priority host list:
	(List is empty.)

Promote Status:

	DB Type     Address              XLog Loc         Info
	--------------------------------------------------------------
	Master      192.168.22.245       0/320002C8

	No standby databases were found.
```
The standby database disappeared. Once the node comes up again, let's check the failover manager on that node:
```
[root@edbppas ~] systemctl status efm-2.0.service
efm-2.0.service - EnterpriseDB Failover Manager 2.0
   Loaded: loaded (/usr/lib/systemd/system/efm-2.0.service; enabled)
   Active: failed (Result: exit-code) since Fri 2016-04-08 15:51:51 CEST; 1min 39s ago
  Process: 583 ExecStart=/bin/java -cp /usr/efm-2.0/lib/EFM-2.0.4.jar com.enterprisedb.hal.main.ServiceCommand start /etc/efm-2.0/${CLUSTER}.properties (code=exited, status=1/FAILURE)

Apr 08 15:51:51 edbppas java[583]: [ 4/8/16 3:51:51 PM ]
Apr 08 15:51:51 edbppas java[583]: [ 4/8/16 3:51:51 PM ] 4/8/16 3:51:50 PM com.enterprisedb.hal.exec.ExecUtil performExe...ey efm]
Apr 08 15:51:51 edbppas java[583]: [ 4/8/16 3:51:51 PM ]
Apr 08 15:51:51 edbppas java[583]: [ 4/8/16 3:51:51 PM ] 4/8/16 3:51:50 PM com.enterprisedb.hal.exec.ExecUtil performExe...Out=''}
Apr 08 15:51:51 edbppas java[583]: [ 4/8/16 3:51:51 PM ]
Apr 08 15:51:51 edbppas java[583]: [ 4/8/16 3:51:51 PM ] 4/8/16 3:51:50 PM com.enterprisedb.hal.nodes.HalAgent shutdown ...xiting.
Apr 08 15:51:51 edbppas java[583]: [ 4/8/16 3:51:51 PM ]
Apr 08 15:51:51 edbppas systemd[1]: efm-2.0.service: control process exited, code=exited status=1
Apr 08 15:51:51 edbppas systemd[1]: Failed to start EnterpriseDB Failover Manager 2.0.
Apr 08 15:51:51 edbppas systemd[1]: Unit efm-2.0.service entered failed state.
Hint: Some lines were ellipsized, use -l to show in full.
[root@edbppas ~] cat /var/log/efm-2.0/efm.log
[root@edbppas ~]
```
Hm, not so good. Although the service is enabled for automatic start, it did not come up after the reboot. How to fix it? The database itself is still configured for recovery:
```
postgres@edbppas:/home/postgres/ [PGSITE1] cat $PGDATA/recovery.conf
standby_mode = 'on'
primary_slot_name = 'standby1'
primary_conninfo = 'user=postgres password=admin123 host=192.168.22.245 port=4445 sslmode=prefer sslcompression=1'
recovery_target_timeline = 'latest'
trigger_file='/u02/pgdata/PGSITE1/trigger_file'
```
It should be safe to start it up. Once it comes up, the logfile shows:
```
2016-04-08 14:25:24.350 GMT - 3 - 2428 -  - @ LOG:  consistent recovery state reached at 0/320001E8
2016-04-08 14:25:24.350 GMT - 4 - 2428 -  - @ LOG:  redo starts at 0/320001E8
2016-04-08 14:25:24.350 GMT - 5 - 2428 -  - @ LOG:  invalid record length at 0/320002C8
2016-04-08 14:25:24.351 GMT - 4 - 2426 -  - @ LOG:  database system is ready to accept read only connections
2016-04-08 14:25:24.359 GMT - 1 - 2432 -  - @ LOG:  started streaming WAL from primary at 0/32000000 on timeline 2
```
Fine. Streaming started again, so from a database perspective we are fine. But what about our failover manager? In this case we need to tell the failover manager to rejoin the cluster by adding all nodes to the efm.nodes file:
```
[root@edbppas efm-2.0] pwd
/etc/efm-2.0
[root@edbppas efm-2.0] cat efm.nodes
# List of node address:port combinations separated by whitespace.
192.168.22.244:9998 192.168.22.245:9998 192.168.22.243:9998
```
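As the format is simply whitespace-separated `address:port` entries, regenerating the file content can easily be scripted. A small sketch; the helper name is mine, and the addresses and port 9998 come from this setup:

```python
# Hypothetical helper: render the efm.nodes content for this cluster.
# Writing the file to /etc/efm-2.0/efm.nodes and restarting the efm
# service is left to the operator.

def efm_nodes_content(addresses, port=9998):
    """Render the whitespace-separated address:port list EFM expects."""
    entries = " ".join(f"{addr}:{port}" for addr in addresses)
    return ("# List of node address:port combinations separated by whitespace.\n"
            + entries + "\n")
```
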
Once this is ready we can restart the service and everything is back to normal operation:
```
[root@edbppas efm-2.0] systemctl start efm-2.0.service
[root@edbppas efm-2.0] /usr/efm
efm/      efm-2.0/
[root@edbppas efm-2.0] /usr/efm-2.0/bin/efm cluster-status efm
Cluster Status: efm
Automatic failover is disabled.

	Agent Type  Address              Agent  DB     Info
	--------------------------------------------------------------
	Witness     192.168.22.244       UP     N/A
	Standby     192.168.22.243       UP     UP
	Master      192.168.22.245       UP     UP

Allowed node host list:
	192.168.22.244 192.168.22.243 192.168.22.245

Standby priority host list:
	192.168.22.243

Promote Status:

	DB Type     Address              XLog Loc         Info
	--------------------------------------------------------------
	Master      192.168.22.245       0/330000D0
	Standby     192.168.22.243       0/330000D0

	Standby database(s) in sync with master. It is safe to promote.
```
Note 1: If you need to reboot the whole node the standby is currently running on, you’ll need to rejoin the node to the cluster.
What happens if we stop the standby database without rebooting the node?
```
postgres@edbppas:/home/postgres/ [PGSITE1] pgstop
waiting for server to shut down.... done
server stopped
postgres@edbppas:/home/postgres/ [PGSITE1] efm cluster-status efm
Cluster Status: efm
Automatic failover is disabled.

	Agent Type  Address              Agent  DB     Info
	--------------------------------------------------------------
	Witness     192.168.22.244       UP     N/A
	Master      192.168.22.245       UP     UP
	Standby     192.168.22.243       UP     UP

Allowed node host list:
	192.168.22.244 192.168.22.243 192.168.22.245

Standby priority host list:
	192.168.22.243

Promote Status:

	DB Type     Address              XLog Loc         Info
	--------------------------------------------------------------
	Master      192.168.22.245       0/330000D0
	Unknown     192.168.22.243       UNKNOWN          Connection refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.

	One or more standby databases are not in sync with the master database.
```
The configuration itself is still fine, but the standby database is reported as “UNKNOWN”. This is expected; in this case you can just restart the instance:
```
postgres@edbppas:/home/postgres/ [PGSITE1] pgstart
server starting
postgres@edbppas:/home/postgres/ [PGSITE1] 2016-04-08 14:35:50.411 GMT - 1 - 3202 -  - @ LOG:  redirecting log output to logging collector process
2016-04-08 14:35:50.411 GMT - 2 - 3202 -  - @ HINT:  Future log output will appear in directory "/u02/pgdata/PGSITE1/pg_log".
postgres@edbppas:/home/postgres/ [PGSITE1] efm cluster-status efm
Cluster Status: efm
Automatic failover is disabled.

	Agent Type  Address              Agent  DB       Info
	--------------------------------------------------------------
	Witness     192.168.22.244       UP     N/A
	Idle        192.168.22.243       UP     UNKNOWN
	Master      192.168.22.245       UP     UP

Allowed node host list:
	192.168.22.244 192.168.22.243 192.168.22.245

Standby priority host list:
	(List is empty.)

Promote Status:

	DB Type     Address              XLog Loc         Info
	--------------------------------------------------------------
	Master      192.168.22.245       0/330000D0

	No standby databases were found.

Idle Node Status (idle nodes ignored in XLog location comparisons):

	Address              XLog Loc         Info
	--------------------------------------------------------------
	192.168.22.243       0/330000D0       DB is in recovery.
```
The standby is now reported as “DB is in recovery.” How to fix this? Use the “resume” command:
```
[root@edbppas ~] /usr/efm-2.0/bin/efm resume efm
Resume command successful on local agent.
[root@edbppas ~] /usr/efm-2.0/bin/efm cluster-status efm
Cluster Status: efm
Automatic failover is disabled.

	Agent Type  Address              Agent  DB     Info
	--------------------------------------------------------------
	Standby     192.168.22.243       UP     UP
	Witness     192.168.22.244       UP     N/A
	Master      192.168.22.245       UP     UP

Allowed node host list:
	192.168.22.244 192.168.22.243 192.168.22.245

Standby priority host list:
	192.168.22.243

Promote Status:

	DB Type     Address              XLog Loc         Info
	--------------------------------------------------------------
	Master      192.168.22.245       0/34000060
	Standby     192.168.22.243       0/34000060

	Standby database(s) in sync with master. It is safe to promote.
```
Ready again.
Note 2: If you take down only the standby database: once you have started it up again, use the “resume” command to return to normal operation.
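The two notes boil down to a simple decision rule based on the agent type and agent state that `efm cluster-status` reports for the standby node. A hypothetical sketch; the function name and return values are mine:

```python
# Sketch of the decision rule from the two notes above: if the EFM agent
# on the standby node is down (whole node was rebooted), rebuild
# efm.nodes and restart the efm service; if the agent is up but the node
# shows as "Idle" (only the database was stopped and started again),
# "efm resume" is enough. Agent type and state are the values printed
# by "efm cluster-status".

def recovery_action(agent_type: str, agent_state: str) -> str:
    if agent_state != "UP":
        return "rejoin"   # rebuild efm.nodes, then start efm-2.0.service
    if agent_type == "Idle":
        return "resume"   # /usr/efm-2.0/bin/efm resume efm
    return "none"         # node already participates normally
```
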