Infrastructure at your Service

Daniel Westermann

pg_auto_failover: Failover and switchover scenarios

In the last post we had a look at the installation and setup of pg_auto_failover. We currently have one primary cluster and two replicas synchronizing from it. But there is a potential issue in the setup: The monitor runs beside the primary instance on the same node, and if that node goes down the monitor is gone. What happens in that case, and how can we avoid it? We also did not look at controlled switchovers, and this is definitely something you want to have in production: from time to time you’ll need to do maintenance on one of the nodes, and switching the primary cluster to another node is very handy in such situations. Let’s start with the simple case and have a look at switchovers first.

This is the current state of the setup:

[email protected]:~$ pg_autoctl show state --pgdata /u02/pgdata/13/monitor/
  Name |  Node |                      Host:Port |       LSN | Reachable |       Current State |      Assigned State
-------+-------+--------------------------------+-----------+-----------+---------------------+--------------------
node_1 |     1 | pgaf1.it.dbi-services.com:5432 | 0/6002408 |       yes |             primary |             primary
node_2 |     2 | pgaf2.it.dbi-services.com:5432 | 0/6002408 |       yes |           secondary |           secondary
node_3 |     3 | pgaf3.it.dbi-services.com:5432 | 0/6002408 |       yes |           secondary |           secondary

Before we attempt a switchover, you should be aware of your replication settings:

[email protected]:~$ pg_autoctl get formation settings --pgdata /u02/pgdata/13/monitor/
  Context |    Name |                   Setting | Value                                                       
----------+---------+---------------------------+-------------------------------------------------------------
formation | default |      number_sync_standbys | 1                                                           
  primary |  node_1 | synchronous_standby_names | 'ANY 1 (pgautofailover_standby_2, pgautofailover_standby_3)'
     node |  node_1 |        candidate priority | 50                                                          
     node |  node_2 |        candidate priority | 50                                                          
     node |  node_3 |        candidate priority | 50                                                          
     node |  node_1 |        replication quorum | true                                                        
     node |  node_2 |        replication quorum | true                                                        
     node |  node_3 |        replication quorum | true                                     

What does this tell us?

  • synchronous_standby_names: We’re using synchronous replication, and at least one of the two replicas needs to confirm a commit (this is a PostgreSQL setting)
  • number_sync_standbys=1: At least one standby needs to confirm the commit (this is a pg_auto_failover setting)
  • candidate priority=50: This controls which replica gets promoted. At the default setting of 50 all replicas have the same chance of being selected for promotion, and the monitor will pick the one with the most advanced LSN (this is a pg_auto_failover setting)
  • replication quorum=true: This means synchronous replication; a value of false means asynchronous replication (this is a pg_auto_failover setting)
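Putting the last two settings together, the monitor’s choice of a promotion candidate can be sketched like this (a simplified Python sketch of my own, not pg_auto_failover’s actual code; the node list below is made up for illustration):

```python
# Simplified sketch of candidate selection: among reachable secondaries,
# keep those with the highest candidate priority, then take the one with
# the most advanced LSN. This is NOT the real pg_auto_failover code.

def lsn_to_int(lsn: str) -> int:
    """Convert a PostgreSQL LSN like '0/6002408' to a comparable integer."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def pick_candidate(nodes):
    """nodes: list of dicts with 'name', 'priority', 'lsn', 'reachable'."""
    healthy = [n for n in nodes if n["reachable"] and n["priority"] > 0]
    if not healthy:
        return None
    top = max(n["priority"] for n in healthy)
    contenders = [n for n in healthy if n["priority"] == top]
    return max(contenders, key=lambda n: lsn_to_int(n["lsn"]))

# hypothetical secondaries with equal priority but different LSNs
secondaries = [
    {"name": "node_2", "priority": 50, "lsn": "0/6002408", "reachable": True},
    {"name": "node_3", "priority": 50, "lsn": "0/6002300", "reachable": True},
]
print(pick_candidate(secondaries)["name"])  # node_2 has the more advanced LSN
```

With equal priorities the LSN decides; a node with candidate priority 0 would never be promoted.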

You may have noticed the “formation” keyword above. A formation is a set of PostgreSQL clusters that are managed together, which means you can use the same monitor to manage multiple sets of PostgreSQL clusters. We are using the default formation in this example.

Let’s assume we need to do some maintenance on our primary node and therefore want to switch the primary instance over to another node. The command to do that is simple:

[email protected]:~$ pg_autoctl perform switchover --pgdata /u02/pgdata/13/PG1/
16:10:05 15960 INFO  Targetting group 0 in formation "default"
16:10:05 15960 INFO  Listening monitor notifications about state changes in formation "default" and group 0
16:10:05 15960 INFO  Following table displays times when notifications are received
    Time |   Name |  Node |                      Host:Port |       Current State |      Assigned State
---------+--------+-------+--------------------------------+---------------------+--------------------
16:10:05 | node_1 |     1 | pgaf1.it.dbi-services.com:5432 |             primary |            draining
16:10:05 | node_1 |     1 | pgaf1.it.dbi-services.com:5432 |            draining |            draining
16:10:05 | node_2 |     2 | pgaf2.it.dbi-services.com:5432 |           secondary |          report_lsn
16:10:05 | node_3 |     3 | pgaf3.it.dbi-services.com:5432 |           secondary |          report_lsn
16:10:06 | node_2 |     2 | pgaf2.it.dbi-services.com:5432 |          report_lsn |          report_lsn
16:10:06 | node_3 |     3 | pgaf3.it.dbi-services.com:5432 |          report_lsn |          report_lsn
16:10:06 | node_2 |     2 | pgaf2.it.dbi-services.com:5432 |          report_lsn |   prepare_promotion
16:10:06 | node_2 |     2 | pgaf2.it.dbi-services.com:5432 |   prepare_promotion |   prepare_promotion
16:10:06 | node_2 |     2 | pgaf2.it.dbi-services.com:5432 |   prepare_promotion |    stop_replication
16:10:06 | node_1 |     1 | pgaf1.it.dbi-services.com:5432 |            draining |      demote_timeout
16:10:06 | node_3 |     3 | pgaf3.it.dbi-services.com:5432 |          report_lsn |      join_secondary
16:10:06 | node_1 |     1 | pgaf1.it.dbi-services.com:5432 |      demote_timeout |      demote_timeout
16:10:06 | node_3 |     3 | pgaf3.it.dbi-services.com:5432 |      join_secondary |      join_secondary
16:10:07 | node_2 |     2 | pgaf2.it.dbi-services.com:5432 |    stop_replication |    stop_replication
16:10:07 | node_2 |     2 | pgaf2.it.dbi-services.com:5432 |    stop_replication |        wait_primary
16:10:07 | node_1 |     1 | pgaf1.it.dbi-services.com:5432 |      demote_timeout |             demoted
16:10:07 | node_1 |     1 | pgaf1.it.dbi-services.com:5432 |             demoted |             demoted
16:10:07 | node_2 |     2 | pgaf2.it.dbi-services.com:5432 |        wait_primary |        wait_primary
16:10:07 | node_3 |     3 | pgaf3.it.dbi-services.com:5432 |      join_secondary |           secondary
16:10:07 | node_2 |     2 | pgaf2.it.dbi-services.com:5432 |        wait_primary |             primary
16:10:07 | node_1 |     1 | pgaf1.it.dbi-services.com:5432 |             demoted |          catchingup
16:10:07 | node_2 |     2 | pgaf2.it.dbi-services.com:5432 |        wait_primary |        join_primary
16:10:07 | node_2 |     2 | pgaf2.it.dbi-services.com:5432 |        join_primary |        join_primary
16:10:08 | node_3 |     3 | pgaf3.it.dbi-services.com:5432 |           secondary |           secondary
16:10:08 | node_1 |     1 | pgaf1.it.dbi-services.com:5432 |          catchingup |          catchingup
16:10:08 | node_1 |     1 | pgaf1.it.dbi-services.com:5432 |          catchingup |           secondary
16:10:08 | node_2 |     2 | pgaf2.it.dbi-services.com:5432 |        join_primary |             primary
16:10:08 | node_1 |     1 | pgaf1.it.dbi-services.com:5432 |           secondary |           secondary
16:10:08 | node_2 |     2 | pgaf2.it.dbi-services.com:5432 |             primary |             primary
[email protected]:~$ 

The progress messages are written to the screen, so you can actually see what happens. As the services are started with systemd, you can also have a look at the journal:

-- Logs begin at Thu 2020-12-10 15:17:38 CET, end at Thu 2020-12-10 16:11:26 CET. --
Dec 10 16:10:08 pgaf1 pg_autoctl[327]: 16:10:08 399 INFO  FSM transition from "catchingup" to "secondary": Convinced the monitor that I'm
Dec 10 16:10:08 pgaf1 pg_autoctl[327]: 16:10:08 399 INFO  Transition complete: current state is now "secondary"
Dec 10 16:10:08 pgaf1 pg_autoctl[341]: 16:10:08 397 INFO  node 1 "node_1" (pgaf1.it.dbi-services.com:5432) reported new state "secondary"
Dec 10 16:10:08 pgaf1 pg_autoctl[341]: 16:10:08 397 INFO  New state for node 1 "node_1" (pgaf1.it.dbi-services.com:5432): secondary ➜ sec
Dec 10 16:10:08 pgaf1 pg_autoctl[327]: 16:10:08 399 INFO  New state for this node (node 1, "node_1") (pgaf1.it.dbi-services.com:5432): se
Dec 10 16:10:08 pgaf1 pg_autoctl[341]: 16:10:08 397 INFO  node 2 "node_2" (pgaf2.it.dbi-services.com:5432) reported new state "primary"
Dec 10 16:10:08 pgaf1 pg_autoctl[341]: 16:10:08 397 INFO  New state for node 2 "node_2" (pgaf2.it.dbi-services.com:5432): primary ➜ prima
Dec 10 16:10:08 pgaf1 pg_autoctl[327]: 16:10:08 399 INFO  New state for node 2 "node_2" (pgaf2.it.dbi-services.com:5432): primary ➜ prima

The second node was selected as the new primary, and we can of course confirm that:

[email protected]:~$ pg_autoctl show state --pgdata /u02/pgdata/13/monitor/
  Name |  Node |                      Host:Port |       LSN | Reachable |       Current State |      Assigned State
-------+-------+--------------------------------+-----------+-----------+---------------------+--------------------
node_1 |     1 | pgaf1.it.dbi-services.com:5432 | 0/60026F8 |       yes |           secondary |           secondary
node_2 |     2 | pgaf2.it.dbi-services.com:5432 | 0/60026F8 |       yes |             primary |             primary
node_3 |     3 | pgaf3.it.dbi-services.com:5432 | 0/60026F8 |       yes |           secondary |           secondary

[email protected]:~$ 
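The state changes printed during the switchover are transitions of a small finite state machine that the monitor drives each node through. As a sketch (my own simplification, not pg_auto_failover’s actual FSM implementation), we can check that node_1 followed the expected demotion path from the output above:

```python
# Simplified sketch: verify that the states node_1 reported during the
# switchover follow the expected demotion path, in order. This is NOT
# the real pg_auto_failover FSM, just an illustration of the sequence.

DEMOTION_PATH = [
    "primary", "draining", "demote_timeout",
    "demoted", "catchingup", "secondary",
]

def follows_path(observed, path):
    """True if 'observed' visits the states of 'path' in this order."""
    it = iter(observed)
    return all(state in it for state in path)  # subsequence check

# states reported for node_1 in the switchover output above
node_1_states = [
    "primary", "draining", "draining", "demote_timeout", "demote_timeout",
    "demoted", "demoted", "catchingup", "catchingup", "secondary", "secondary",
]
print(follows_path(node_1_states, DEMOTION_PATH))  # True
```

The promoted node walks a different path (secondary, report_lsn, prepare_promotion, stop_replication, wait_primary, primary), as visible in the same output.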

Next test: What happens when we reboot a node that is currently running a replica? Let’s reboot pgaf3, as it currently is a replica and does not run the monitor:

[email protected]:~$ sudo reboot
[email protected]:~$ Connection to 192.168.22.192 closed by remote host.
Connection to 192.168.22.192 closed.

Watching the state, the “Reachable” status for the third instance changes to “no” and its LSN falls behind:

[email protected]:~$ pg_autoctl show state --pgdata /u02/pgdata/13/monitor/
  Name |  Node |                      Host:Port |       LSN | Reachable |       Current State |      Assigned State
-------+-------+--------------------------------+-----------+-----------+---------------------+--------------------
node_1 |     1 | pgaf1.it.dbi-services.com:5432 | 0/60026F8 |       yes |           secondary |           secondary
node_2 |     2 | pgaf2.it.dbi-services.com:5432 | 0/60026F8 |       yes |             primary |             primary
node_3 |     3 | pgaf3.it.dbi-services.com:5432 | 0/6000000 |        no |           secondary |           secondary
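If you want to react to this programmatically, the table is easy to parse. Here is a small helper (an assumption of mine, not part of pg_autoctl) that extracts the unreachable nodes from the output above:

```python
# Parse the fixed-width table printed by `pg_autoctl show state` and
# return the names of nodes the monitor cannot reach. Hypothetical
# helper for illustration, not part of the pg_auto_failover tooling.

def unreachable_nodes(output: str):
    names = []
    for line in output.strip().splitlines()[2:]:  # skip header + separator
        fields = [f.strip() for f in line.split("|")]
        if len(fields) >= 6 and fields[4] != "yes":
            names.append(fields[0])
    return names

state = """\
  Name |  Node |                      Host:Port |       LSN | Reachable |       Current State |      Assigned State
-------+-------+--------------------------------+-----------+-----------+---------------------+--------------------
node_1 |     1 | pgaf1.it.dbi-services.com:5432 | 0/60026F8 |       yes |           secondary |           secondary
node_2 |     2 | pgaf2.it.dbi-services.com:5432 | 0/60026F8 |       yes |             primary |             primary
node_3 |     3 | pgaf3.it.dbi-services.com:5432 | 0/6000000 |        no |           secondary |           secondary
"""
print(unreachable_nodes(state))  # ['node_3']
```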

Once it is back, the replica is brought back into the configuration and catches up with the primary:

[email protected]:~$ pg_autoctl show state --pgdata /u02/pgdata/13/monitor/
  Name |  Node |                      Host:Port |       LSN | Reachable |       Current State |      Assigned State
-------+-------+--------------------------------+-----------+-----------+---------------------+--------------------
node_1 |     1 | pgaf1.it.dbi-services.com:5432 | 0/60026F8 |       yes |           secondary |           secondary
node_2 |     2 | pgaf2.it.dbi-services.com:5432 | 0/60026F8 |       yes |             primary |             primary
node_3 |     3 | pgaf3.it.dbi-services.com:5432 | 0/6000000 |       yes |           secondary |           secondary

...
[email protected]:~$ pg_autoctl show state --pgdata /u02/pgdata/13/monitor/
  Name |  Node |                      Host:Port |       LSN | Reachable |       Current State |      Assigned State
-------+-------+--------------------------------+-----------+-----------+---------------------+--------------------
node_1 |     1 | pgaf1.it.dbi-services.com:5432 | 0/6013120 |       yes |           secondary |           secondary
node_2 |     2 | pgaf2.it.dbi-services.com:5432 | 0/6013120 |       yes |             primary |             primary
node_3 |     3 | pgaf3.it.dbi-services.com:5432 | 0/6013120 |       yes |           secondary |           secondary

But what happens if we shut down the monitor node?

[email protected]:~$ sudo systemctl poweroff
[email protected]:~$ Connection to 192.168.22.190 closed by remote host.
Connection to 192.168.22.190 closed.

Checking the status on the node which currently hosts the primary cluster:

[email protected]:~$ pg_autoctl show state --pgdata /u02/pgdata/13/PG1/
10:26:52 1293 WARN  Failed to connect to "postgres://[email protected]:5433/pg_auto_failover?sslmode=require", retrying until the server is ready
10:26:52 1293 ERROR Connection to database failed: timeout expired
10:26:52 1293 ERROR Failed to connect to "postgres://[email protected]:5433/pg_auto_failover?sslmode=require" after 1 attempts in 2 seconds, pg_autoctl stops retrying now
10:26:52 1293 ERROR Failed to retrieve current state from the monitor

As the monitor is down, we can no longer ask for the status. The primary and the remaining replica cluster are still up and running, but we lost the possibility to interact with pg_auto_failover. Booting the monitor node up brings it back into the game:

[email protected]:~$ pg_autoctl show state --pgdata /u02/pgdata/13/PG1/
  Name |  Node |                      Host:Port |       LSN | Reachable |       Current State |      Assigned State
-------+-------+--------------------------------+-----------+-----------+---------------------+--------------------
node_1 |     1 | pgaf1.it.dbi-services.com:5432 | 0/6000000 |       yes |           secondary |           secondary
node_2 |     2 | pgaf2.it.dbi-services.com:5432 | 0/6013240 |       yes |             primary |             primary
node_3 |     3 | pgaf3.it.dbi-services.com:5432 | 0/6013240 |       yes |           secondary |           secondary

This has a consequence: The monitor should not run on any of the PostgreSQL nodes but on a separate node dedicated to the monitor. As you can manage more than one HA setup with the same monitor, this should not be an issue. But it also means that the monitor is a single point of failure, and the health of the monitor is critical for pg_auto_failover.
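Given that, it makes sense to monitor the monitor itself from an external system. A minimal sketch (the hostname is an assumption from this lab; in this setup the monitor listens on port 5433, as seen in the connection string above) that checks whether the monitor’s PostgreSQL port accepts TCP connections at all:

```python
# Minimal external liveness probe for the pg_auto_failover monitor:
# just check that its PostgreSQL port accepts a TCP connection. A real
# check would additionally run a query against the monitor database.

import socket

def monitor_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example for this lab setup (hostname/port assumed from the logs above):
# monitor_reachable("pgaf1.it.dbi-services.com", 5433)
```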


Daniel Westermann

Principal Consultant & Technology Leader Open Infrastructure