A few days ago I had the interesting mission to build a two-node Patroni PostgreSQL 9.6 cluster using etcd in our openDB Appliance environment. Sounds easy: one leader, one secondary, both running etcd, that’s it. But that’s only the first impression.

So we start with two newly created openDB Appliance servers and deploy the database home using the openDB Appliance.
Afterwards we install everything needed for Patroni and etcd and do the configuration for the cluster, etcd and Patroni (for more information about building a Patroni cluster click here).
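
Just as an orientation, a stripped-down patroni.yml for the first node could look roughly like the sketch below. The scope, host IPs, REST API port and data directory match what is used later in this post; the PostgreSQL port, user names and passwords are placeholders and not the real appliance configuration.

scope: PG1
name: opendb_primary

restapi:
  listen: 192.168.22.33:8008
  connect_address: 192.168.22.33:8008

etcd:
  # at this point only the two database hosts
  hosts: 192.168.22.33:2379,192.168.22.34:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576

postgresql:
  listen: 192.168.22.33:5432
  connect_address: 192.168.22.33:5432
  data_dir: /u02/opendb/pgdata/96/PG1
  authentication:
    superuser:
      username: postgres      # placeholder
      password: secret        # placeholder
    replication:
      username: replicator    # placeholder
      password: secret        # placeholder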

When we start the etcd and Patroni services, everything looks fine: both members are up and running, and replication works fine as well.
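
Starting the stack is just a matter of bringing up the two services on each node (a minimal sketch; both services are managed by systemd on these servers, as the systemctl calls later in this post show):

# start the DCS first, then Patroni (run on both nodes)
sudo systemctl start etcd
sudo systemctl start patroni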

opendb@opendb_stby:/home/opendb/ [PG1] patronictl list
/home/opendb/.local/lib/python2.7/site-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: <http://initd.org/psycopg/docs/install.html#binary-install-from-pypi>.
""")
+---------+----------------+---------------+--------+---------+----+-----------+
| Cluster | Member         | Host          | Role   | State   | TL | Lag in MB |
+---------+----------------+---------------+--------+---------+----+-----------+
| PG1     | opendb_primary | 192.168.22.33 | Leader | running |  3 |       0.0 |
| PG1     | opendb_stby    | 192.168.22.34 |        | running |  3 |       0.0 |
+---------+----------------+---------------+--------+---------+----+-----------+

Let’s try a manual switchover – this works as expected:

opendb@opendb_stby:/home/opendb/ [PG1] patronictl switchover
/home/opendb/.local/lib/python2.7/site-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: <http://initd.org/psycopg/docs/install.html#binary-install-from-pypi>.
""")
Master [opendb_primary]:
Candidate ['opendb_stby'] []: opendb_stby
When should the switchover take place (e.g. 2015-10-01T14:30) [now]:
Current cluster topology
+---------+----------------+---------------+--------+---------+----+-----------+
| Cluster | Member         | Host          | Role   | State   | TL | Lag in MB |
+---------+----------------+---------------+--------+---------+----+-----------+
| PG1     | opendb_primary | 192.168.22.33 | Leader | running |  3 |       0.0 |
| PG1     | opendb_stby    | 192.168.22.34 |        | running |  3 |       0.0 |
+---------+----------------+---------------+--------+---------+----+-----------+
Are you sure you want to switchover cluster PG1, demoting current master opendb_primary? [y/N]: y
2019-04-25 01:19:07.89086 Successfully switched over to "opendb_stby"
+---------+----------------+---------------+--------+---------+----+-----------+
| Cluster | Member         | Host          | Role   | State   | TL | Lag in MB |
+---------+----------------+---------------+--------+---------+----+-----------+
| PG1     | opendb_primary | 192.168.22.33 |        | stopped |    |   unknown |
| PG1     | opendb_stby    | 192.168.22.34 | Leader | running |  3 |           |
+---------+----------------+---------------+--------+---------+----+-----------+

Another check to confirm that opendb_stby is the leader now – looks perfect.

opendb@opendb_stby:/home/opendb/ [PG1] patronictl list
/home/opendb/.local/lib/python2.7/site-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: <http://initd.org/psycopg/docs/install.html#binary-install-from-pypi>.
""")
+---------+----------------+---------------+--------+---------+----+-----------+
| Cluster | Member         | Host          | Role   | State   | TL | Lag in MB |
+---------+----------------+---------------+--------+---------+----+-----------+
| PG1     | opendb_primary | 192.168.22.33 |        | running |  4 |       0.0 |
| PG1     | opendb_stby    | 192.168.22.34 | Leader | running |  4 |       0.0 |
+---------+----------------+---------------+--------+---------+----+-----------+

As a next step, let’s stop the Patroni service on the leader node to force an automatic failover.

opendb@opendb_stby:/home/opendb/ [PG1] patronictl list
/home/opendb/.local/lib/python2.7/site-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: <http://initd.org/psycopg/docs/install.html#binary-install-from-pypi>.
""")
+---------+----------------+---------------+--------+---------+----+-----------+
| Cluster | Member         | Host          | Role   | State   | TL | Lag in MB |
+---------+----------------+---------------+--------+---------+----+-----------+
| PG1     | opendb_primary | 192.168.22.33 |        | running |  4 |       0.0 |
| PG1     | opendb_stby    | 192.168.22.34 | Leader | running |  4 |       0.0 |
+---------+----------------+---------------+--------+---------+----+-----------+
opendb@opendb_stby:/home/opendb/ [PG1] sudo systemctl stop patroni
opendb@opendb_stby:/home/opendb/ [PG1] patronictl list
/home/opendb/.local/lib/python2.7/site-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: <http://initd.org/psycopg/docs/install.html#binary-install-from-pypi>.
""")
+---------+----------------+---------------+--------+---------+----+-----------+
| Cluster | Member         | Host          | Role   | State   | TL | Lag in MB |
+---------+----------------+---------------+--------+---------+----+-----------+
| PG1     | opendb_primary | 192.168.22.33 | Leader | running |  5 |       0.0 |
| PG1     | opendb_stby    | 192.168.22.34 |        | stopped |    |   unknown |
+---------+----------------+---------------+--------+---------+----+-----------+
opendb@opendb_stby:/home/opendb/ [PG1]

Yeah, this also works as expected!

Now let us try the “worst case” and shut down the leader node.
And with this the problems start! There are tons of error messages in the log file. The secondary recognizes that the leader is gone, but it does not become the new leader.


Apr 24 13:10:34 opendb_stby patroni: 2019-04-24 13:10:34,124 INFO: no action. i am a secondary and i am following a leader
Apr 24 13:10:36 opendb_stby patroni: 2019-04-24 13:10:36,766 WARNING: Request failed to opendb_primary: GET http://192.168.22.33:8008/patroni (('Connection aborted.', error(104, 'Connection reset by peer')))
Apr 24 13:10:36 opendb_stby etcd: lost the TCP streaming connection with peer 1154b56a13168e2a (stream MsgApp v2 reader)
Apr 24 13:10:36 opendb_stby etcd: lost the TCP streaming connection with peer 1154b56a13168e2a (stream Message reader)
Apr 24 13:10:36 opendb_stby etcd: failed to dial 1154b56a13168e2a on stream MsgApp v2 (peer 1154b56a13168e2a failed to find local node f13d668ae0cba84)
Apr 24 13:10:36 opendb_stby etcd: peer 1154b56a13168e2a became inactive (message send to peer failed)
Apr 24 13:10:36 opendb_stby patroni: 2019-04-24 13:10:36,838 WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by 'NewConnectionError(': Failed to establish a new connection: [Errno 111] Connection refused',)': /v2/keys/service/PG1/leader
Apr 24 13:10:36 opendb_stby patroni: 2019-04-24 13:10:36,839 ERROR: Request to server http://192.168.22.33:2379 failed: MaxRetryError("HTTPConnectionPool(host='192.168.22.33', port=2379): Max retries exceeded with url: /v2/keys/service/PG1/leader (Caused by NewConnectionError(': Failed to establish a new connection: [Errno 111] Connection refused',))",)
Apr 24 13:10:36 opendb_stby patroni: 2019-04-24 13:10:36,839 INFO: Reconnection allowed, looking for another server.
Apr 24 13:10:36 opendb_stby patroni: 2019-04-24 13:10:36,839 INFO: Selected new etcd server http://192.168.22.34:2379
Apr 24 13:10:37 opendb_stby etcd: lost the TCP streaming connection with peer 1154b56a13168e2a (stream Message writer)
Apr 24 13:10:38 opendb_stby etcd: lost the TCP streaming connection with peer 1154b56a13168e2a (stream MsgApp v2 writer)
Apr 24 13:10:38 opendb_stby etcd: f13d668ae0cba84 stepped down to follower since quorum is not active
Apr 24 13:10:38 opendb_stby etcd: f13d668ae0cba84 became follower at term 19197
Apr 24 13:10:38 opendb_stby etcd: raft.node: f13d668ae0cba84 lost leader f13d668ae0cba84 at term 19197
Apr 24 13:10:39 opendb_stby patroni: 2019-04-24 13:10:39,344 WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by 'ReadTimeoutError("HTTPConnectionPool(host=u'192.168.22.34', port=2379): Read timed out. (read timeout=2.5)",)': /v2/keys/service/PG1/leader
Apr 24 13:10:40 opendb_stby etcd: f13d668ae0cba84 is starting a new election at term 19197
Apr 24 13:10:40 opendb_stby etcd: f13d668ae0cba84 became candidate at term 19198
Apr 24 13:10:40 opendb_stby etcd: f13d668ae0cba84 received MsgVoteResp from f13d668ae0cba84 at term 19198
Apr 24 13:10:40 opendb_stby etcd: f13d668ae0cba84 [logterm: 19197, index: 447303] sent MsgVote request to 1154b56a13168e2a at term 19198
Apr 24 13:10:41 opendb_stby etcd: f13d668ae0cba84 is starting a new election at term 19198
Apr 24 13:10:41 opendb_stby etcd: f13d668ae0cba84 became candidate at term 19199
Apr 24 13:10:41 opendb_stby etcd: f13d668ae0cba84 received MsgVoteResp from f13d668ae0cba84 at term 19199
Apr 24 13:10:41 opendb_stby etcd: f13d668ae0cba84 [logterm: 19197, index: 447303] sent MsgVote request to 1154b56a13168e2a at term 19199
Apr 24 13:10:41 opendb_stby patroni: 2019-04-24 13:10:41,938 ERROR: Request to server http://192.168.22.34:2379 failed: MaxRetryError(u'HTTPConnectionPool(host=u'192.168.22.34', port=2379): Max retries exceeded with url: /v2/keys/service/PG1/leader (Caused by ReadTimeoutError("HTTPConnectionPool(host=u'192.168.22.34', port=2379): Read timed out. (read timeout=2.5)",))',)
Apr 24 13:10:41 opendb_stby patroni: 2019-04-24 13:10:41,938 INFO: Reconnection allowed, looking for another server.
Apr 24 13:10:41 opendb_stby patroni: 2019-04-24 13:10:41,939 ERROR: Machines cache is empty, no machines to try.
Apr 24 13:10:41 opendb_stby patroni: 2019-04-24 13:10:41,939 INFO: Selected new etcd server http://192.168.22.33:2379
Apr 24 13:10:43 opendb_stby etcd: f13d668ae0cba84 is starting a new election at term 19199

Why? What went wrong?

Searching the web brings up a lot of nice ideas, e.g. errors in the configuration. But nothing helps…

There is one main problem:
A two-node Patroni cluster with only two etcd members won’t work. How should the secondary node know that it has to become the new leader?

We need a third etcd host. etcd needs a majority of its members (a quorum) to agree on every update to the cluster state. With three members, one node can fail and the remaining two still form a majority, so they can agree that the leader is gone and the secondary can take over the leader role. With only two members, the surviving node can never reach a majority on its own.
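
In numbers (this is generic etcd behaviour, not specific to this setup):

# quorum = floor(n/2) + 1
# 2 etcd members -> quorum = 2 -> no failed member tolerated
# 3 etcd members -> quorum = 2 -> 1 failed member tolerated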

So we have to adapt the configuration on the two existing hosts, and on the third host we need to install etcd.


wget https://github.com/etcd-io/etcd/releases/download/v3.3.10/etcd-v3.3.10-linux-amd64.tar.gz
tar -axf etcd-v3.3.10-linux-amd64.tar.gz
cp etcd-v3.3.10-linux-amd64/etcd* /u01/app/opendb/local/dmk/dmk_postgres/bin/
mkdir -p /u02/opendb/pgdata/etcd
rm -rf etcd-v3.3.10-linux-amd64 etcd-v3.3.10-linux-amd64.tar.gz

Afterwards we have to create the etcd configuration file on the third host and change the files on both other hosts to include the new host.

cat > /u01/app/opendb/local/dmk/dmk_postgres/etc/etcd.conf << EOF
name: opendb-stby2
data-dir: /u02/opendb/pgdata/etcd
initial-advertise-peer-urls: http://192.168.22.35:2380
listen-peer-urls: http://192.168.22.35:2380
listen-client-urls: http://192.168.22.35:2379,http://localhost:2379
advertise-client-urls: http://192.168.22.35:2379
initial-cluster: opendb-primary=http://192.168.22.33:2380,opendb-stby=http://192.168.22.34:2380,opendb-stby2=http://192.168.22.35:2380
EOF
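
On the two existing hosts the etcd configuration has to be extended accordingly, so that all three members know each other. The relevant change is the initial-cluster line in their etcd.conf (the rest of their configuration stays as it is):

initial-cluster: opendb-primary=http://192.168.22.33:2380,opendb-stby=http://192.168.22.34:2380,opendb-stby2=http://192.168.22.35:2380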

Create the etcd.service on the third node

[opendb@opendb_standby2 etc]$ cat /etc/systemd/system/etcd.service
#
# systemd integration for etcd
# Put this file under /etc/systemd/system/etcd.service
# then: systemctl daemon-reload
# then: systemctl list-unit-files | grep etcd
# then: systemctl enable etcd.service
#
[Unit]
Description=dbi services etcd service
After=network.target
[Service]
User=opendb
Type=notify
ExecStart=/u01/app/opendb/local/dmk/dmk_postgres/bin/etcd --config-file /u01/app/opendb/local/dmk/dmk_postgres/etc/etcd.conf
Restart=always
RestartSec=10s
LimitNOFILE=40000
[Install]
WantedBy=multi-user.target

Enable etcd service and reboot the third node

sudo systemctl daemon-reload
sudo systemctl list-unit-files | grep etcd
sudo systemctl enable etcd.service
sudo systemctl reboot

Now Patroni can be installed on the third node as well:

sudo pip install --upgrade pip
pip install --upgrade --user setuptools
pip install --user psycopg2-binary
pip install --user patroni[etcd]

Afterwards we have to change the patroni.yml file on both database hosts and recreate the database cluster so that everything works as expected. Otherwise the third etcd member won’t be recognized by the cluster.

1. Shutdown patroni on both nodes

systemctl stop patroni

2. Clean up data directories

rm -rf /u02/opendb/pgdata/96/PG1

3. Clean up cluster configuration in etcd

patronictl -c /u01/app/opendb/local/dmk/dmk_postgres/etc/patroni.yml remove PG1

4. Change the patroni.yml file

etcd:
  hosts: 192.168.22.33:2379,192.168.22.34:2379,192.168.22.35:2379

5. Start patroni on both nodes where the database should run. This will build a new cluster.

systemctl start patroni

6. Start etcd on all THREE nodes

systemctl start etcd

Everything works as before. In the list you still only see the two Patroni nodes:

opendb@opendb_primary:/home/opendb/ [PG1] patronictl list
+---------+----------------+---------------+--------+---------+----+-----------+
| Cluster | Member         | Host          | Role   | State   | TL | Lag in MB |
+---------+----------------+---------------+--------+---------+----+-----------+
| PG1     | opendb_primary | 192.168.22.33 | Leader | running |  7 |       0.0 |
| PG1     | opendb_stby    | 192.168.22.34 |        | running |  7 |       0.0 |
+---------+----------------+---------------+--------+---------+----+-----------+

The etcd cluster also looks good:

opendb@opendb_primary:/home/opendb/ [PG1] etcdctl cluster-health
member f13d668ae0cba84 is healthy: got healthy result from http://192.168.22.34:2379
member 1154b56a13168e2a is healthy: got healthy result from http://192.168.22.33:2379
member c8520dba9907702d is healthy: got healthy result from http://192.168.22.35:2379
cluster is healthy
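
As an additional check (not part of the original procedure), etcdctl can also list the members:

# shows each member's ID, name, peer URL and client URL
etcdctl member list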

Now let’s try another shutdown of the leader node and see what happens:

opendb@opendb_stby:/home/opendb/ [PG1] patronictl list
/home/opendb/.local/lib/python2.7/site-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: <http://initd.org/psycopg/docs/install.html#binary-install-from-pypi>.
""")
2019-04-25 03:40:26,447 - WARNING - Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by 'ConnectTimeoutError(, 'Connection to 192.168.22.33 timed out. (connect timeout=2.5)')': /v2/machines
2019-04-25 03:40:28,953 - ERROR - Failed to get list of machines from http://192.168.22.33:2379/v2: MaxRetryError("HTTPConnectionPool(host='192.168.22.33', port=2379): Max retries exceeded with url: /v2/machines (Caused by ConnectTimeoutError(, 'Connection to 192.168.22.33 timed out. (connect timeout=2.5)'))",)
+---------+----------------+---------------+--------+---------+----+-----------+
| Cluster | Member         | Host          | Role   | State   | TL | Lag in MB |
+---------+----------------+---------------+--------+---------+----+-----------+
| PG1     | opendb_primary | 192.168.22.33 |        | stopped |    |   unknown |
| PG1     | opendb_stby    | 192.168.22.34 | Leader | running |  8 |       0.0 |
+---------+----------------+---------------+--------+---------+----+-----------+
03:40:29 opendb@opendb_stby:/home/opendb/ [PG1]

Also in /var/log/messages you can see the successful failover to opendb_stby

Apr 25 03:40:09 opendb_stby patroni: 2019-04-25 03:40:09,096 WARNING: Request failed to opendb_primary: GET http://192.168.22.33:8008/patroni (('Connection aborted.', error(104, 'Connection reset by peer')))
Apr 25 03:40:09 opendb_stby etcd: f13d668ae0cba84 [term 11] received MsgTimeoutNow from 1154b56a13168e2a and starts an election to get leadership.
Apr 25 03:40:09 opendb_stby etcd: f13d668ae0cba84 became candidate at term 12
Apr 25 03:40:09 opendb_stby etcd: f13d668ae0cba84 received MsgVoteResp from f13d668ae0cba84 at term 12
Apr 25 03:40:09 opendb_stby etcd: f13d668ae0cba84 [logterm: 11, index: 92328] sent MsgVote request to 1154b56a13168e2a at term 12
Apr 25 03:40:09 opendb_stby etcd: f13d668ae0cba84 [logterm: 11, index: 92328] sent MsgVote request to c8520dba9907702d at term 12
Apr 25 03:40:09 opendb_stby etcd: raft.node: f13d668ae0cba84 lost leader 1154b56a13168e2a at term 12
Apr 25 03:40:09 opendb_stby etcd: f13d668ae0cba84 received MsgVoteResp from c8520dba9907702d at term 12
Apr 25 03:40:09 opendb_stby etcd: f13d668ae0cba84 [quorum:2] has received 2 MsgVoteResp votes and 0 vote rejections
Apr 25 03:40:09 opendb_stby etcd: f13d668ae0cba84 became leader at term 12
Apr 25 03:40:09 opendb_stby etcd: raft.node: f13d668ae0cba84 elected leader f13d668ae0cba84 at term 12
Apr 25 03:40:09 opendb_stby patroni: 2019-04-25 03:40:09,190 INFO: Software Watchdog activated with 25 second timeout, timing slack 15 seconds
Apr 25 03:40:09 opendb_stby patroni: 2019-04-25 03:40:09,194 INFO: promoted self to leader by acquiring session lock
Apr 25 03:40:09 opendb_stby patroni: server promoting
Apr 25 03:40:09 opendb_stby patroni: 2019-04-25 03:40:09,200 INFO: cleared rewind state after becoming the leader
Apr 25 03:40:09 opendb_stby etcd: lost the TCP streaming connection with peer 1154b56a13168e2a (stream Message reader)
Apr 25 03:40:09 opendb_stby etcd: lost the TCP streaming connection with peer 1154b56a13168e2a (stream MsgApp v2 reader)
Apr 25 03:40:09 opendb_stby etcd: failed to dial 1154b56a13168e2a on stream Message (peer 1154b56a13168e2a failed to find local node f13d668ae0cba84)
Apr 25 03:40:09 opendb_stby etcd: peer 1154b56a13168e2a became inactive (message send to peer failed)
Apr 25 03:40:09 opendb_stby etcd: lost the TCP streaming connection with peer 1154b56a13168e2a (stream Message writer)
Apr 25 03:40:10 opendb_stby patroni: 2019-04-25 03:40:10,271 INFO: Lock owner: opendb_stby; I am opendb_stby
Apr 25 03:40:10 opendb_stby patroni: 2019-04-25 03:40:10,304 INFO: no action. i am the leader with the lock

Conclusion

Running, and especially failing over, a two-node Patroni cluster is only possible with a third etcd host. Without it the surviving etcd member cannot reach quorum, and the automatic failover won’t work when one node crashes.