Infrastructure at your Service

Daniel Westermann

What you can do when your Veritas cluster shows interfaces as down

Recently we had the situation that the Veritas cluster (InfoScale 7.3) showed interfaces as down on the two RedHat 7.3 nodes. This e.g. can happen when you change hardware. Although all service groups were up and running this is a situation you usually want to avoid as you never know what happens when the cluster is in such a state. When you have something like this:

[root@xxxxx-node1 ~]$ lltstat -nvv | head
LLT node information:
Node State Link Status Address
  * 0 xxxxx-node1 OPEN
      eth3 UP yy:yy:yy:yy:yy:yy
      eth1 UP xx:xx:xx:xx:xx:xx
      bond0 UP rr:rr:rr:rr:rr:rr
    1 xxxxx-node2 OPEN
      eth3 UP ee:ee:ee:ee:ee:ee
      eth1 DOWN tt:tt:tt:tt:tt:tt
      bond0 DOWN qq:qq:qq:qq:qq:qq

… what can you do?

In our configuration eth1 and eth3 are used for the interconnect and bond0 is the public network. As you can see above the eth1 and bond0 are reported as down for the second node. Of course, the first check you need to do is to check the interface status on the operating system level, but that was fine in our case.

Veritas comes with a tiny little utility (dlpiping) you can use to check connectivity on the Veritas level. Using the information from the lltstat command you can start dlpiping in “send” mode on the first node:

[root@xxxxx-node1 ~]$ /opt/VRTSllt/dlpiping -vs eth1

When that is running (will not detach from the terminal) you should start in “receive” mode on the second node:

[root@xxxxx-node1 ~]$ /opt/VRTSllt/dlpiping -vc eth1 xx:xx:xx:xx:xx:xx
using packet size = 78
dlpiping: sent a request to xx:xx:xx:xx:xx:xx
dlpiping: received a packet from xx:xx:xx:xx:xx:xx

This confirms that connectivity is fine for eth1. When you repeat that for the remaining interfaces (eth3 and bond0) and all is fine then you you can proceed. If not, then you have another issue than what we faced.

The next step is to freeze all your service groups so the cluster will not touch them:

[root@xxxxx-node1 ~]$ haconf -makerw
[root@xxxxx-node1 ~]$ hagrp -freeze SERVICE_GROUP -persistent # do that for all service groups you have defined in the cluster
[root@xxxxx-node1 ~]$ haconf -dump -makerw

Now the magic:

[root@xxxxx-node1 ~]$ hastop -all -force 

Why magic? This command will stop the cluster stack on all nodes BUT it will leave all the resources running. So you can do that without shutting down any user defined cluster services (Oracle databases in our case). Once the stack is down on all the nodes stop gab and ltt on both nodes as well:

[root@xxxxx-node1 ~]$ systemctl stop gab
[root@xxxxx-node1 ~]$ systemctl stop llt

Having stopped llt and gab you just need to start them again in the correct order on both systems:

[root@xxxxx-node1 ~]$ systemctl start llt
[root@xxxxx-node1 ~]$ systemctl start gab

… and after that start the cluster:

[root@xxxxx-node1 ~]$ systemctl start vcs

In our case that was enough to make llt work as expected again and the cluster is fine:

[root@xxxxx-node1 ~]$ gabconfig -a
GAB Port Memberships
===============================================================
Port a gen f44203 membership 01
Port h gen f44204 membership 01
[root@xxxxx-node1 ~]#

[root@xxxxx-node1 ~]$ lltstat -nvv | head
LLT node information:
   Node State Link Status Address
    * 0 xxxxx-node1 OPEN
      eth3 UP yy:yy:yy:yy:yy:yy
      eth1 UP xx:xx:xx:xx:xx:xx
      bond0 UP rr:rr:rr:rr:rr:rr
    1 xxxxx-node2 OPEN
      eth3 UP ee:ee:ee:ee:ee:ee
      eth1 UP qq:qq:qq:qq:qq:qq
      bond0 UP tt:tt:tt:tt:tt:tt 

Hope that helps …

Leave a Reply

Daniel Westermann
Daniel Westermann

Senior Consultant and Technology Leader Open Infrastructure