One of the really cool things about PostgreSQL is that you have plenty of choices when it comes to tooling. For high availability we usually go with Patroni, but there is also pg_auto_failover, and that will be the topic of this post. Because of the recent announcements around CentOS we’ll go with Debian this time. The PostgreSQL installation (version 13.1) is already prepared, but nothing else. We start from scratch to see if the claim on the GitHub page, that pg_auto_failover “is optimized for simplicity and correctness”, holds true.
This is the setup we’ll start with:
| Hostname | IP-Address | Initial role |
|---|---|---|
| pgaf1.it.dbi-services.com | 192.168.22.190 | Primary and pg_auto_failover monitor |
| pgaf2.it.dbi-services.com | 192.168.22.191 | First replica |
| pgaf3.it.dbi-services.com | 192.168.22.192 | Second replica |
As said above, all three nodes have PostgreSQL 13.1 already installed at this location (PostgreSQL was installed from source code, but that should not really matter):
```
postgres@pgaf1:~$ ls /u01/app/postgres/product/13/db_1/
bin  include  lib  share
```
What I did in addition, is to create ssh keys and then copy those from each machine to all nodes so password-less ssh connections are available between the nodes (here is the example from the first node):
```
postgres@pgaf1:~$ ssh-keygen
postgres@pgaf1:~$ ssh-copy-id postgres@pgaf1
postgres@pgaf1:~$ ssh-copy-id postgres@pgaf2
postgres@pgaf1:~$ ssh-copy-id postgres@pgaf3
```
For installing pg_auto_failover from source make sure that pg_config is in your path:
```
postgres@pgaf1:~$ which pg_config
/u01/app/postgres/product/13/db_1//bin/pg_config
```
Once that is ready, getting pg_auto_failover installed is quite simple:
```
postgres@pgaf1:~$ git clone https://github.com/citusdata/pg_auto_failover.git
Cloning into 'pg_auto_failover'...
remote: Enumerating objects: 252, done.
remote: Counting objects: 100% (252/252), done.
remote: Compressing objects: 100% (137/137), done.
remote: Total 8131 (delta 134), reused 174 (delta 115), pack-reused 7879
Receiving objects: 100% (8131/8131), 5.07 MiB | 1.25 MiB/s, done.
Resolving deltas: 100% (6022/6022), done.
postgres@pgaf1:~$ cd pg_auto_failover/
postgres@pgaf1:~$ make
make -C src/monitor/ all
make[1]: Entering directory '/home/postgres/pg_auto_failover/src/monitor'
gcc -std=c99 -D_GNU_SOURCE -g -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -O2 -Wformat -Wall -Werror=implicit-int -Werror=implicit-function-declaration -Werror=return-type -Wno-declaration-after-statement -Wno-missing-braces -fPIC -std=c99 -Wall -Werror -Wno-unused-parameter -Iinclude -I/u01/app/postgres/product/13/db_1/include -g -I. -I./ -I/u01/app/postgres/product/13/db_1/include/server -I/u01/app/postgres/product/13/db_1/include/internal -D_GNU_SOURCE -I/usr/include/libxml2 -c -o metadata.o metadata.c
...
make[2]: Leaving directory '/home/postgres/pg_auto_failover/src/bin/pg_autoctl'
make[1]: Leaving directory '/home/postgres/pg_auto_failover/src/bin'
postgres@pgaf1:~$ make install
make -C src/monitor/ all
make[1]: Entering directory '/home/postgres/pg_auto_failover/src/monitor'
make[1]: Nothing to be done for 'all'.
...
```
This needs to be done on all hosts, of course. You will notice a new extension and new binaries in your PostgreSQL installation:
```
postgres@pgaf1:~$ ls /u01/app/postgres/product/13/db_1/share/extension/*pgauto*
/u01/app/postgres/product/13/db_1/share/extension/pgautofailover--1.0--1.1.sql
/u01/app/postgres/product/13/db_1/share/extension/pgautofailover--1.0.sql
/u01/app/postgres/product/13/db_1/share/extension/pgautofailover--1.1--1.2.sql
/u01/app/postgres/product/13/db_1/share/extension/pgautofailover--1.2--1.3.sql
/u01/app/postgres/product/13/db_1/share/extension/pgautofailover--1.3--1.4.sql
/u01/app/postgres/product/13/db_1/share/extension/pgautofailover--1.4--dummy.sql
/u01/app/postgres/product/13/db_1/share/extension/pgautofailover--1.4.sql
/u01/app/postgres/product/13/db_1/share/extension/pgautofailover.control
postgres@pgaf1:~$ ls /u01/app/postgres/product/13/db_1/bin/*auto*
/u01/app/postgres/product/13/db_1/bin/pg_autoctl
```
Having that available, we need to initialize the pg_auto_failover monitor, which is responsible for assigning roles and health-checking. We’ll do that on the first node:
```
postgres@pgaf1:~$ export PGDATA=/u02/pgdata/13/monitor
postgres@pgaf1:~$ export PGPORT=5433
postgres@pgaf1:~$ pg_autoctl create monitor --ssl-self-signed --hostname pgaf1.it.dbi-services.com --auth trust --run
14:45:40 13184 INFO  Using default --ssl-mode "require"
14:45:40 13184 INFO  Using --ssl-self-signed: pg_autoctl will create self-signed certificates, allowing for encrypted network traffic
14:45:40 13184 WARN  Self-signed certificates provide protection against eavesdropping; this setup does NOT protect against Man-In-The-Middle attacks nor Impersonation attacks.
14:45:40 13184 WARN  See https://www.postgresql.org/docs/current/libpq-ssl.html for details
14:45:40 13184 INFO  Initialising a PostgreSQL cluster at "/u02/pgdata/13/monitor"
14:45:40 13184 INFO  /u01/app/postgres/product/13/db_1/bin/pg_ctl initdb -s -D /u02/pgdata/13/monitor --option '--auth=trust'
14:45:42 13184 INFO  /usr/bin/openssl req -new -x509 -days 365 -nodes -text -out /u02/pgdata/13/monitor/server.crt -keyout /u02/pgdata/13/monitor/server.key -subj "/CN=pgaf1.it.dbi-services.com"
14:45:42 13184 INFO  Started pg_autoctl postgres service with pid 13204
14:45:42 13184 INFO  Started pg_autoctl listener service with pid 13205
14:45:42 13204 INFO  /u01/app/postgres/product/13/db_1/bin/pg_autoctl do service postgres --pgdata /u02/pgdata/13/monitor -v
14:45:42 13209 INFO  /u01/app/postgres/product/13/db_1/bin/postgres -D /u02/pgdata/13/monitor -p 5433 -h *
14:45:42 13205 ERROR Connection to database failed: could not connect to server: No such file or directory
14:45:42 13205 ERROR Is the server running locally and accepting
14:45:42 13205 ERROR connections on Unix domain socket "/tmp/.s.PGSQL.5433"?
14:45:42 13205 ERROR Failed to connect to local Postgres database at "port=5433 dbname=postgres", see above for details
14:45:42 13205 ERROR Failed to create user "autoctl" on local postgres server
14:45:42 13184 ERROR pg_autoctl service listener exited with exit status 12
14:45:42 13184 INFO  Restarting service listener
14:45:42 13204 INFO  Postgres is now serving PGDATA "/u02/pgdata/13/monitor" on port 5433 with pid 13209
14:45:43 13221 WARN  NOTICE:  installing required extension "btree_gist"
14:45:43 13221 INFO  Granting connection privileges on 192.168.22.0/24
14:45:43 13221 INFO  Your pg_auto_failover monitor instance is now ready on port 5433.
14:45:43 13221 INFO  Monitor has been successfully initialized.
14:45:43 13221 INFO  /u01/app/postgres/product/13/db_1/bin/pg_autoctl do service listener --pgdata /u02/pgdata/13/monitor -v
14:45:43 13221 INFO  Managing the monitor at postgres://autoctl_node@pgaf1.it.dbi-services.com:5433/pg_auto_failover?sslmode=require
14:45:43 13221 INFO  Reloaded the new configuration from "/home/postgres/.config/pg_autoctl/u02/pgdata/13/monitor/pg_autoctl.cfg"
14:45:44 13221 INFO  The version of extension "pgautofailover" is "1.4" on the monitor
14:45:44 13221 INFO  Contacting the monitor to LISTEN to its events.
```
This created a standard PostgreSQL cluster in the background:
```
postgres@pgaf1:~$ ls /u02/pgdata/13/monitor/
base              pg_dynshmem    pg_notify     pg_stat_tmp  pg_wal                         postmaster.opts
current_logfiles  pg_hba.conf    pg_replslot   pg_subtrans  pg_xact                        postmaster.pid
global            pg_ident.conf  pg_serial     pg_tblspc    postgresql.auto.conf           server.crt
log               pg_logical     pg_snapshots  pg_twophase  postgresql-auto-failover.conf  server.key
pg_commit_ts      pg_multixact   pg_stat       PG_VERSION   postgresql.conf                startup.log
postgres@pgaf1:~$ ps -ef | grep "postgres -D"
postgres 13209 13204  0 14:45 pts/0    00:00:00 /u01/app/postgres/product/13/db_1/bin/postgres -D /u02/pgdata/13/monitor -p 5433 -h *
```
Before we can initialize the primary instance we need to get the connection string to the monitor:
```
postgres@pgaf1:~$ pg_autoctl show uri --monitor --pgdata /u02/pgdata/13/monitor/
postgres://autoctl_node@pgaf1.it.dbi-services.com:5433/pg_auto_failover?sslmode=require
```
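The URI is a regular libpq connection string. As a purely illustrative sketch (pg_autoctl prints the string for you; `autoctl_node` is the monitor user it creates), it is assembled from the monitor's host and port like this:

```shell
# Illustrative only: how the monitor connection string is composed
# from the host and port used in this setup.
MONITOR_HOST="pgaf1.it.dbi-services.com"
MONITOR_PORT=5433
MONITOR_URI="postgres://autoctl_node@${MONITOR_HOST}:${MONITOR_PORT}/pg_auto_failover?sslmode=require"
echo "${MONITOR_URI}"
```

Any tool that speaks libpq (psql, for example) can use this string to connect to the monitor database.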
Create the primary:
```
postgres@pgaf1:~$ pg_autoctl create postgres \
>    --hostname pgaf1.it.dbi-services.com \
>    --auth trust \
>    --ssl-self-signed \
>    --monitor 'postgres://autoctl_node@pgaf1.it.dbi-services.com:5433/pg_auto_failover?sslmode=require' \
>    --run
14:52:11 13354 INFO  Using default --ssl-mode "require"
14:52:11 13354 INFO  Using --ssl-self-signed: pg_autoctl will create self-signed certificates, allowing for encrypted network traffic
14:52:11 13354 WARN  Self-signed certificates provide protection against eavesdropping; this setup does NOT protect against Man-In-The-Middle attacks nor Impersonation attacks.
14:52:11 13354 WARN  See https://www.postgresql.org/docs/current/libpq-ssl.html for details
14:52:11 13354 INFO  Started pg_autoctl postgres service with pid 13356
14:52:11 13354 INFO  Started pg_autoctl node-active service with pid 13357
14:52:11 13356 INFO  /u01/app/postgres/product/13/db_1/bin/pg_autoctl do service postgres --pgdata /u02/pgdata/13/PG1 -v
14:52:11 13357 INFO  Registered node 1 (pgaf1.it.dbi-services.com:5432) with name "node_1" in formation "default", group 0, state "single"
14:52:11 13357 INFO  Writing keeper state file at "/home/postgres/.local/share/pg_autoctl/u02/pgdata/13/PG1/pg_autoctl.state"
14:52:11 13357 INFO  Writing keeper init state file at "/home/postgres/.local/share/pg_autoctl/u02/pgdata/13/PG1/pg_autoctl.init"
14:52:11 13357 INFO  Successfully registered as "single" to the monitor.
14:52:11 13357 INFO  FSM transition from "init" to "single": Start as a single node
14:52:11 13357 INFO  Initialising postgres as a primary
14:52:11 13357 INFO  Initialising a PostgreSQL cluster at "/u02/pgdata/13/PG1"
14:52:11 13357 INFO  /u01/app/postgres/product/13/db_1/bin/pg_ctl initdb -s -D /u02/pgdata/13/PG1 --option '--auth=trust'
14:52:14 13357 INFO  /usr/bin/openssl req -new -x509 -days 365 -nodes -text -out /u02/pgdata/13/PG1/server.crt -keyout /u02/pgdata/13/PG1/server.key -subj "/CN=pgaf1.it.dbi-services.com"
14:52:14 13385 INFO  /u01/app/postgres/product/13/db_1/bin/postgres -D /u02/pgdata/13/PG1 -p 5432 -h *
14:52:14 13357 INFO  CREATE DATABASE postgres;
14:52:14 13356 INFO  Postgres is now serving PGDATA "/u02/pgdata/13/PG1" on port 5432 with pid 13385
14:52:14 13357 INFO  The database "postgres" already exists, skipping.
14:52:14 13357 INFO  CREATE EXTENSION pg_stat_statements;
14:52:14 13357 INFO  /usr/bin/openssl req -new -x509 -days 365 -nodes -text -out /u02/pgdata/13/PG1/server.crt -keyout /u02/pgdata/13/PG1/server.key -subj "/CN=pgaf1.it.dbi-services.com"
14:52:14 13357 INFO  Contents of "/u02/pgdata/13/PG1/postgresql-auto-failover.conf" have changed, overwriting
14:52:14 13357 INFO  Transition complete: current state is now "single"
14:52:14 13357 INFO  keeper has been successfully initialized.
14:52:14 13357 INFO  /u01/app/postgres/product/13/db_1/bin/pg_autoctl do service node-active --pgdata /u02/pgdata/13/PG1 -v
14:52:14 13357 INFO  Reloaded the new configuration from "/home/postgres/.config/pg_autoctl/u02/pgdata/13/PG1/pg_autoctl.cfg"
14:52:14 13357 INFO  pg_autoctl service is running, current state is "single"
```
Repeating the same command on the second node (with a different --hostname) initializes the first replica:
```
postgres@pgaf2:~$ export PGDATA=/u02/pgdata/13/PG1
postgres@pgaf2:~$ export PGPORT=5432
postgres@pgaf2:~$ pg_autoctl create postgres \
>    --hostname pgaf2.it.dbi-services.com \
>    --auth trust \
>    --ssl-self-signed \
>    --monitor 'postgres://autoctl_node@pgaf1.it.dbi-services.com:5433/pg_auto_failover?sslmode=require' \
>    --run
14:54:09 13010 INFO  Using default --ssl-mode "require"
14:54:09 13010 INFO  Using --ssl-self-signed: pg_autoctl will create self-signed certificates, allowing for encrypted network traffic
14:54:09 13010 WARN  Self-signed certificates provide protection against eavesdropping; this setup does NOT protect against Man-In-The-Middle attacks nor Impersonation attacks.
14:54:09 13010 WARN  See https://www.postgresql.org/docs/current/libpq-ssl.html for details
14:54:09 13010 INFO  Started pg_autoctl postgres service with pid 13012
14:54:09 13010 INFO  Started pg_autoctl node-active service with pid 13013
14:54:09 13012 INFO  /u01/app/postgres/product/13/db_1/bin/pg_autoctl do service postgres --pgdata /u02/pgdata/13/PG1 -v
14:54:09 13013 INFO  Registered node 2 (pgaf2.it.dbi-services.com:5432) with name "node_2" in formation "default", group 0, state "wait_standby"
14:54:09 13013 INFO  Writing keeper state file at "/home/postgres/.local/share/pg_autoctl/u02/pgdata/13/PG1/pg_autoctl.state"
14:54:09 13013 INFO  Writing keeper init state file at "/home/postgres/.local/share/pg_autoctl/u02/pgdata/13/PG1/pg_autoctl.init"
14:54:09 13013 INFO  Successfully registered as "wait_standby" to the monitor.
14:54:09 13013 INFO  FSM transition from "init" to "wait_standby": Start following a primary
14:54:09 13013 INFO  Transition complete: current state is now "wait_standby"
14:54:09 13013 INFO  New state for node 1 "node_1" (pgaf1.it.dbi-services.com:5432): single ➜ wait_primary
14:54:09 13013 INFO  New state for node 1 "node_1" (pgaf1.it.dbi-services.com:5432): wait_primary ➜ wait_primary
14:54:09 13013 INFO  Still waiting for the monitor to drive us to state "catchingup"
14:54:09 13013 WARN  Please make sure that the primary node is currently running `pg_autoctl run` and contacting the monitor.
14:54:09 13013 INFO  FSM transition from "wait_standby" to "catchingup": The primary is now ready to accept a standby
14:54:09 13013 INFO  Initialising PostgreSQL as a hot standby
14:54:09 13013 INFO  /u01/app/postgres/product/13/db_1/bin/pg_basebackup -w -d application_name=pgautofailover_standby_2 host=pgaf1.it.dbi-services.com port=5432 user=pgautofailover_replicator sslmode=require --pgdata /u02/pgdata/13/backup/node_2 -U pgautofailover_replicator --verbose --progress --max-rate 100M --wal-method=stream --slot pgautofailover_standby_2
14:54:09 13013 INFO  pg_basebackup: initiating base backup, waiting for checkpoint to complete
14:54:15 13013 INFO  pg_basebackup: checkpoint completed
14:54:15 13013 INFO  pg_basebackup: write-ahead log start point: 0/2000028 on timeline 1
14:54:15 13013 INFO  pg_basebackup: starting background WAL receiver
14:54:15 13013 INFO      0/23396 kB (0%), 0/1 tablespace (...ta/13/backup/node_2/backup_label)
14:54:16 13013 INFO   1752/23396 kB (7%), 0/1 tablespace (...ata/13/backup/node_2/base/1/2610)
14:54:16 13013 INFO  23406/23406 kB (100%), 0/1 tablespace (.../backup/node_2/global/pg_control)
14:54:16 13013 INFO  23406/23406 kB (100%), 1/1 tablespace
14:54:16 13013 INFO  pg_basebackup:
14:54:16 13013 INFO
14:54:16 13013 INFO  write-ahead log end point: 0/2000100
14:54:16 13013 INFO  pg_basebackup:
14:54:16 13013 INFO
14:54:16 13013 INFO  waiting for background process to finish streaming ...
14:54:16 13013 INFO  pg_basebackup: syncing data to disk ...
14:54:17 13013 INFO  pg_basebackup: renaming backup_manifest.tmp to backup_manifest
14:54:17 13013 INFO  pg_basebackup: base backup completed
14:54:17 13013 INFO  Creating the standby signal file at "/u02/pgdata/13/PG1/standby.signal", and replication setup at "/u02/pgdata/13/PG1/postgresql-auto-failover-standby.conf"
14:54:17 13013 INFO  /usr/bin/openssl req -new -x509 -days 365 -nodes -text -out /u02/pgdata/13/PG1/server.crt -keyout /u02/pgdata/13/PG1/server.key -subj "/CN=pgaf2.it.dbi-services.com"
14:54:17 13021 INFO  /u01/app/postgres/product/13/db_1/bin/postgres -D /u02/pgdata/13/PG1 -p 5432 -h *
14:54:19 13013 INFO  PostgreSQL started on port 5432
14:54:19 13013 INFO  Fetched current list of 1 other nodes from the monitor to update HBA rules, including 1 changes.
14:54:19 13013 INFO  Ensuring HBA rules for node 1 "node_1" (pgaf1.it.dbi-services.com:5432)
14:54:19 13013 INFO  Transition complete: current state is now "catchingup"
14:54:20 13012 INFO  Postgres is now serving PGDATA "/u02/pgdata/13/PG1" on port 5432 with pid 13021
14:54:20 13013 INFO  keeper has been successfully initialized.
14:54:20 13013 INFO  /u01/app/postgres/product/13/db_1/bin/pg_autoctl do service node-active --pgdata /u02/pgdata/13/PG1 -v
14:54:20 13013 INFO  Reloaded the new configuration from "/home/postgres/.config/pg_autoctl/u02/pgdata/13/PG1/pg_autoctl.cfg"
14:54:20 13013 INFO  pg_autoctl service is running, current state is "catchingup"
14:54:20 13013 INFO  Fetched current list of 1 other nodes from the monitor to update HBA rules, including 1 changes.
14:54:20 13013 INFO  Ensuring HBA rules for node 1 "node_1" (pgaf1.it.dbi-services.com:5432)
14:54:21 13013 INFO  Monitor assigned new state "secondary"
14:54:21 13013 INFO  FSM transition from "catchingup" to "secondary": Convinced the monitor that I'm up and running, and eligible for promotion again
14:54:21 13013 INFO  Creating replication slot "pgautofailover_standby_1"
14:54:21 13013 INFO  Transition complete: current state is now "secondary"
14:54:21 13013 INFO  New state for node 1 "node_1" (pgaf1.it.dbi-services.com:5432): primary ➜ primary
```
The last lines of the output confirm that pgaf1 hosts the primary cluster and pgaf2 now hosts a replica. Let’s do the same on the third node:
```
postgres@pgaf3:~$ pg_autoctl create postgres \
>    --hostname pgaf3.it.dbi-services.com \
>    --auth trust \
>    --ssl-self-signed \
>    --monitor 'postgres://autoctl_node@pgaf1.it.dbi-services.com:5433/pg_auto_failover?sslmode=require' \
>    --run
14:57:19 12831 INFO  Using default --ssl-mode "require"
14:57:19 12831 INFO  Using --ssl-self-signed: pg_autoctl will create self-signed certificates, allowing for encrypted network traffic
14:57:19 12831 WARN  Self-signed certificates provide protection against eavesdropping; this setup does NOT protect against Man-In-The-Middle attacks nor Impersonation attacks.
14:57:19 12831 WARN  See https://www.postgresql.org/docs/current/libpq-ssl.html for details
14:57:19 12831 INFO  Started pg_autoctl postgres service with pid 12833
14:57:19 12831 INFO  Started pg_autoctl node-active service with pid 12834
14:57:19 12833 INFO  /u01/app/postgres/product/13/db_1/bin/pg_autoctl do service postgres --pgdata /u02/pgdata/13/PG1 -v
14:57:19 12834 INFO  Registered node 3 (pgaf3.it.dbi-services.com:5432) with name "node_3" in formation "default", group 0, state "wait_standby"
14:57:19 12834 INFO  Writing keeper state file at "/home/postgres/.local/share/pg_autoctl/u02/pgdata/13/PG1/pg_autoctl.state"
14:57:19 12834 INFO  Writing keeper init state file at "/home/postgres/.local/share/pg_autoctl/u02/pgdata/13/PG1/pg_autoctl.init"
14:57:19 12834 INFO  Successfully registered as "wait_standby" to the monitor.
14:57:19 12834 INFO  FSM transition from "init" to "wait_standby": Start following a primary
14:57:19 12834 INFO  Transition complete: current state is now "wait_standby"
14:57:19 12834 INFO  New state for node 1 "node_1" (pgaf1.it.dbi-services.com:5432): primary ➜ join_primary
14:57:20 12834 INFO  New state for node 1 "node_1" (pgaf1.it.dbi-services.com:5432): join_primary ➜ join_primary
14:57:20 12834 INFO  Still waiting for the monitor to drive us to state "catchingup"
14:57:20 12834 WARN  Please make sure that the primary node is currently running `pg_autoctl run` and contacting the monitor.
14:57:20 12834 INFO  FSM transition from "wait_standby" to "catchingup": The primary is now ready to accept a standby
14:57:20 12834 INFO  Initialising PostgreSQL as a hot standby
14:57:20 12834 INFO  /u01/app/postgres/product/13/db_1/bin/pg_basebackup -w -d application_name=pgautofailover_standby_3 host=pgaf1.it.dbi-services.com port=5432 user=pgautofailover_replicator sslmode=require --pgdata /u02/pgdata/13/backup/node_3 -U pgautofailover_replicator --verbose --progress --max-rate 100M --wal-method=stream --slot pgautofailover_standby_3
14:57:20 12834 INFO  pg_basebackup: initiating base backup, waiting for checkpoint to complete
14:57:20 12834 INFO  pg_basebackup: checkpoint completed
14:57:20 12834 INFO  pg_basebackup: write-ahead log start point: 0/4000028 on timeline 1
14:57:20 12834 INFO  pg_basebackup: starting background WAL receiver
14:57:20 12834 INFO      0/23397 kB (0%), 0/1 tablespace (...ta/13/backup/node_3/backup_label)
14:57:20 12834 INFO  23406/23406 kB (100%), 0/1 tablespace (.../backup/node_3/global/pg_control)
14:57:20 12834 INFO  23406/23406 kB (100%), 1/1 tablespace
14:57:20 12834 INFO  pg_basebackup: write-ahead log end point: 0/4000100
14:57:20 12834 INFO  pg_basebackup: waiting for background process to finish streaming ...
14:57:20 12834 INFO  pg_basebackup: syncing data to disk ...
14:57:22 12834 INFO  pg_basebackup: renaming backup_manifest.tmp to backup_manifest
14:57:22 12834 INFO  pg_basebackup: base backup completed
14:57:22 12834 INFO  Creating the standby signal file at "/u02/pgdata/13/PG1/standby.signal", and replication setup at "/u02/pgdata/13/PG1/postgresql-auto-failover-standby.conf"
14:57:22 12834 INFO  /usr/bin/openssl req -new -x509 -days 365 -nodes -text -out /u02/pgdata/13/PG1/server.crt -keyout /u02/pgdata/13/PG1/server.key -subj "/CN=pgaf3.it.dbi-services.com"
14:57:22 12841 INFO  /u01/app/postgres/product/13/db_1/bin/postgres -D /u02/pgdata/13/PG1 -p 5432 -h *
14:57:22 12834 INFO  PostgreSQL started on port 5432
14:57:22 12834 INFO  Fetched current list of 2 other nodes from the monitor to update HBA rules, including 2 changes.
14:57:22 12834 INFO  Ensuring HBA rules for node 1 "node_1" (pgaf1.it.dbi-services.com:5432)
14:57:22 12834 INFO  Ensuring HBA rules for node 2 "node_2" (pgaf2.it.dbi-services.com:5432)
14:57:22 12834 ERROR Connection to database failed: could not connect to server: No such file or directory
14:57:22 12834 ERROR Is the server running locally and accepting
14:57:22 12834 ERROR connections on Unix domain socket "/tmp/.s.PGSQL.5432"?
14:57:22 12834 ERROR Failed to connect to local Postgres database at "port=5432 dbname=postgres", see above for details
14:57:22 12834 ERROR Failed to reload the postgres configuration after adding the standby user to pg_hba
14:57:22 12834 ERROR Failed to update the HBA entries for the new elements in the our formation "default" and group 0
14:57:22 12834 ERROR Failed to update HBA rules after a base backup
14:57:22 12834 ERROR Failed to transition from state "wait_standby" to state "catchingup", see above.
14:57:22 12831 ERROR pg_autoctl service node-active exited with exit status 12
14:57:22 12831 INFO  Restarting service node-active
14:57:22 12845 INFO  Continuing from a previous `pg_autoctl create` failed attempt
14:57:22 12845 INFO  PostgreSQL state at registration time was: PGDATA does not exists
14:57:22 12845 INFO  FSM transition from "wait_standby" to "catchingup": The primary is now ready to accept a standby
14:57:22 12845 INFO  Initialising PostgreSQL as a hot standby
14:57:22 12845 INFO  Target directory exists: "/u02/pgdata/13/PG1", stopping PostgreSQL
14:57:24 12833 INFO  Postgres is now serving PGDATA "/u02/pgdata/13/PG1" on port 5432 with pid 12841
14:57:24 12833 INFO  Stopping pg_autoctl postgres service
14:57:24 12833 INFO  /u01/app/postgres/product/13/db_1/bin/pg_ctl --pgdata /u02/pgdata/13/PG1 --wait stop --mode fast
14:57:24 12845 INFO  /u01/app/postgres/product/13/db_1/bin/pg_basebackup -w -d application_name=pgautofailover_standby_3 host=pgaf1.it.dbi-services.com port=5432 user=pgautofailover_replicator sslmode=require --pgdata /u02/pgdata/13/backup/node_3 -U pgautofailover_replicator --verbose --progress --max-rate 100M --wal-method=stream --slot pgautofailover_standby_3
14:57:24 12845 INFO  pg_basebackup:
14:57:24 12845 INFO
14:57:24 12845 INFO  initiating base backup, waiting for checkpoint to complete
14:57:24 12845 INFO  pg_basebackup:
14:57:24 12845 INFO
14:57:24 12845 INFO  checkpoint completed
14:57:24 12845 INFO  pg_basebackup:
14:57:24 12845 INFO
14:57:24 12845 INFO  write-ahead log start point: 0/5000028 on timeline 1
14:57:24 12845 INFO  pg_basebackup:
14:57:24 12845 INFO
14:57:24 12845 INFO  starting background WAL receiver
14:57:24 12845 INFO      0/23397 kB (0%), 0/1 tablespace (...ta/13/backup/node_3/backup_label)
14:57:25 12845 INFO  16258/23397 kB (69%), 0/1 tablespace (...3/backup/node_3/base/12662/12512)
14:57:25 12845 INFO  23406/23406 kB (100%), 0/1 tablespace (.../backup/node_3/global/pg_control)
14:57:25 12845 INFO  23406/23406 kB (100%), 1/1 tablespace
14:57:25 12845 INFO  pg_basebackup: write-ahead log end point: 0/5000100
14:57:25 12845 INFO  pg_basebackup: waiting for background process to finish streaming ...
14:57:25 12845 INFO  pg_basebackup: syncing data to disk ...
14:57:27 12845 INFO  pg_basebackup:
14:57:27 12845 INFO
14:57:27 12845 INFO  renaming backup_manifest.tmp to backup_manifest
14:57:27 12845 INFO  pg_basebackup:
14:57:27 12845 INFO
14:57:27 12845 INFO  base backup completed
14:57:27 12845 INFO  Creating the standby signal file at "/u02/pgdata/13/PG1/standby.signal", and replication setup at "/u02/pgdata/13/PG1/postgresql-auto-failover-standby.conf"
14:57:27 12845 INFO  /usr/bin/openssl req -new -x509 -days 365 -nodes -text -out /u02/pgdata/13/PG1/server.crt -keyout /u02/pgdata/13/PG1/server.key -subj "/CN=pgaf3.it.dbi-services.com"
14:57:27 12881 INFO  /u01/app/postgres/product/13/db_1/bin/postgres -D /u02/pgdata/13/PG1 -p 5432 -h *
14:57:29 12845 INFO  PostgreSQL started on port 5432
14:57:29 12845 INFO  Fetched current list of 2 other nodes from the monitor to update HBA rules, including 2 changes.
14:57:29 12845 INFO  Ensuring HBA rules for node 1 "node_1" (pgaf1.it.dbi-services.com:5432)
14:57:29 12845 INFO  Ensuring HBA rules for node 2 "node_2" (pgaf2.it.dbi-services.com:5432)
14:57:29 12845 INFO  Transition complete: current state is now "catchingup"
14:57:29 12845 INFO  keeper has been successfully initialized.
14:57:29 12845 INFO  /u01/app/postgres/product/13/db_1/bin/pg_autoctl do service node-active --pgdata /u02/pgdata/13/PG1 -v
14:57:29 12845 INFO  Reloaded the new configuration from "/home/postgres/.config/pg_autoctl/u02/pgdata/13/PG1/pg_autoctl.cfg"
14:57:29 12845 INFO  pg_autoctl service is running, current state is "catchingup"
14:57:29 12845 INFO  Fetched current list of 2 other nodes from the monitor to update HBA rules, including 2 changes.
14:57:29 12845 INFO  Ensuring HBA rules for node 1 "node_1" (pgaf1.it.dbi-services.com:5432)
14:57:29 12845 INFO  Ensuring HBA rules for node 2 "node_2" (pgaf2.it.dbi-services.com:5432)
14:57:29 12845 INFO  Monitor assigned new state "secondary"
14:57:29 12845 INFO  FSM transition from "catchingup" to "secondary": Convinced the monitor that I'm up and running, and eligible for promotion again
14:57:29 12833 WARN  PostgreSQL was not running, restarted with pid 12881
14:57:29 12845 INFO  Creating replication slot "pgautofailover_standby_1"
14:57:29 12845 INFO  Creating replication slot "pgautofailover_standby_2"
14:57:29 12845 INFO  Transition complete: current state is now "secondary"
14:57:29 12845 INFO  New state for node 1 "node_1" (pgaf1.it.dbi-services.com:5432): primary ➜ primary
```
That really was quite simple. We now have two replicas synchronizing from the same primary:
```
postgres=# select usename,application_name,client_hostname,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag from pg_stat_replication ;
          usename          |     application_name     |      client_hostname      | sent_lsn  | write_lsn | flush_lsn | replay_lsn | write_lag
---------------------------+--------------------------+---------------------------+-----------+-----------+-----------+------------+-----------
 pgautofailover_replicator | pgautofailover_standby_2 | pgaf2.it.dbi-services.com | 0/6000148 | 0/6000148 | 0/6000148 | 0/6000148  |
 pgautofailover_replicator | pgautofailover_standby_3 | pgaf3.it.dbi-services.com | 0/6000148 | 0/6000148 | 0/6000148 | 0/6000148  |
(2 rows)
```
If you prepare well, a setup like this is up and running in a matter of minutes. One bit is still missing, though: all these pg_autoctl commands did not detach from the console. They run in the foreground, and everything stops if we cancel the commands or close the terminal.
Luckily pg_auto_failover comes with a handy command to create a systemd service file:
```
postgres@pgaf1:~$ pg_autoctl -q show systemd --pgdata /u02/pgdata/13/monitor/ > pgautofailover.service
postgres@pgaf1:~$ cat pgautofailover.service
[Unit]
Description = pg_auto_failover

[Service]
WorkingDirectory = /home/postgres
Environment = 'PGDATA=/u02/pgdata/13/monitor/'
User = postgres
ExecStart = /u01/app/postgres/product/13/db_1/bin/pg_autoctl run
Restart = always
StartLimitBurst = 0

[Install]
WantedBy = multi-user.target
```
This can easily be added to systemd so the monitor will start automatically:
```
postgres@pgaf1:~$ sudo mv pgautofailover.service /etc/systemd/system
postgres@pgaf1:~$ sudo systemctl daemon-reload
postgres@pgaf1:~$ sudo systemctl enable pgautofailover.service
Created symlink /etc/systemd/system/multi-user.target.wants/pgautofailover.service → /etc/systemd/system/pgautofailover.service.
postgres@pgaf1:~$ sudo systemctl start pgautofailover.service
```
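The generated unit file can be customized with standard systemd drop-ins instead of editing it directly. As a hypothetical example (the drop-in file and the network-ordering idea are my own suggestion, not something pg_autoctl generates), you could make the service wait until the network is online:

```ini
# Hypothetical drop-in: /etc/systemd/system/pgautofailover.service.d/override.conf
# Delays service start until the network is reported online,
# so the monitor does not come up before name resolution works.
[Unit]
After = network-online.target
Wants = network-online.target
```

After adding a drop-in, run `sudo systemctl daemon-reload` again so systemd picks it up.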
From now on, the service will start automatically when the node boots. Let’s do the same for the PostgreSQL clusters:
```
postgres@pgaf1:~$ pg_autoctl -q show systemd --pgdata /u02/pgdata/13/PG1/ > postgresp1.service
postgres@pgaf1:~$ cat postgresp1.service
[Unit]
Description = pg_auto_failover

[Service]
WorkingDirectory = /home/postgres
Environment = 'PGDATA=/u02/pgdata/13/PG1/'
User = postgres
ExecStart = /u01/app/postgres/product/13/db_1/bin/pg_autoctl run
Restart = always
StartLimitBurst = 0

[Install]
WantedBy = multi-user.target
postgres@pgaf1:~$ sudo mv postgresp1.service /etc/systemd/system
postgres@pgaf1:~$ sudo systemctl daemon-reload
postgres@pgaf1:~$ sudo systemctl enable postgresp1.service
Created symlink /etc/systemd/system/multi-user.target.wants/postgresp1.service → /etc/systemd/system/postgresp1.service.
postgres@pgaf1:~$ sudo systemctl start postgresp1.service
```
Do the same on the remaining two nodes and reboot all systems. If all went fine, pg_auto_failover and the PostgreSQL clusters will come up automatically:
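Since the same handful of commands has to be repeated on every node, this is easy to script over the password-less ssh connections we set up earlier. A dry-run sketch (it only prints the commands it would run; swap the echo for a real ssh invocation to execute them):

```shell
# Dry run: print, per node, the command that would enable the service.
# Hostnames are the ones from the setup table; nothing is executed remotely.
for host in pgaf1 pgaf2 pgaf3; do
  echo "ssh postgres@${host} sudo systemctl enable postgresp1.service"
done
```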
```
postgres@pgaf1:~$ pg_autoctl show state --pgdata /u02/pgdata/13/monitor/
  Name |  Node |                      Host:Port |       LSN | Reachable |       Current State |      Assigned State
-------+-------+--------------------------------+-----------+-----------+---------------------+--------------------
node_1 |     1 | pgaf1.it.dbi-services.com:5432 | 0/6002320 |       yes |             primary |             primary
node_2 |     2 | pgaf2.it.dbi-services.com:5432 | 0/6002320 |       yes |           secondary |           secondary
node_3 |     3 | pgaf3.it.dbi-services.com:5432 | 0/6002320 |       yes |           secondary |           secondary
```
That’s it for the first part. In the next post we’ll look at how robust pg_auto_failover is, e.g. what happens when the first node, which also runs the monitor, goes down?