
Daniel Westermann

What you can do when your Veritas cluster shows interfaces as down

Recently we had the situation that the Veritas cluster (InfoScale 7.3) showed interfaces as down on the two Red Hat 7.3 nodes. This can happen, for example, after a hardware change. Although all service groups were up and running, this is a situation you usually want to avoid, as you never know how the cluster will behave in such a state. When you see something like this:

[root@xxxxx-node1 ~]$ lltstat -nvv | head
LLT node information:
Node State Link Status Address
  * 0 xxxxx-node1 OPEN
      eth3 UP yy:yy:yy:yy:yy:yy
      eth1 UP xx:xx:xx:xx:xx:xx
      bond0 UP rr:rr:rr:rr:rr:rr
    1 xxxxx-node2 OPEN
      eth3 UP ee:ee:ee:ee:ee:ee
      eth1 DOWN tt:tt:tt:tt:tt:tt
      bond0 DOWN qq:qq:qq:qq:qq:qq

… what can you do?

In our configuration eth1 and eth3 are used for the interconnect and bond0 is the public network. As you can see above, eth1 and bond0 are reported as down for the second node. Of course, the first thing to check is the interface status on the operating system level, but that was fine in our case.
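
In case you want to double check that yourself, on Red Hat 7 this could look like the following (just a quick sketch, nothing Veritas specific; adapt the interface names to your setup):

[root@xxxxx-node2 ~]$ for dev in eth1 eth3 bond0; do ip link show $dev; done
[root@xxxxx-node2 ~]$ ethtool eth1 | grep "Link detected"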

Veritas comes with a small utility (dlpiping) you can use to check connectivity on the Veritas level. Using the information from the lltstat command you can start dlpiping in server mode on the first node:

[root@xxxxx-node1 ~]$ /opt/VRTSllt/dlpiping -vs eth1

While that is running (it will not detach from the terminal) you start dlpiping in client mode on the second node, passing the MAC address of eth1 on the first node:

[root@xxxxx-node2 ~]$ /opt/VRTSllt/dlpiping -vc eth1 xx:xx:xx:xx:xx:xx
using packet size = 78
dlpiping: sent a request to xx:xx:xx:xx:xx:xx
dlpiping: received a packet from xx:xx:xx:xx:xx:xx

This confirms that connectivity is fine for eth1. Repeat that for the remaining interfaces (eth3 and bond0), as shown below; if all of them succeed you can proceed. If not, then you have a different issue than the one we faced.
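
For eth3, for example, this looks like this (the MAC address is the one reported by lltstat for eth3 on the first node; bond0 works the same way):

[root@xxxxx-node1 ~]$ /opt/VRTSllt/dlpiping -vs eth3
[root@xxxxx-node2 ~]$ /opt/VRTSllt/dlpiping -vc eth3 yy:yy:yy:yy:yy:yy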

The next step is to freeze all your service groups so the cluster will not touch them:

[root@xxxxx-node1 ~]$ haconf -makerw
[root@xxxxx-node1 ~]$ hagrp -freeze SERVICE_GROUP -persistent # do that for all service groups you have defined in the cluster
[root@xxxxx-node1 ~]$ haconf -dump -makero
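
If you have many service groups defined, a small loop saves some typing (a sketch; it assumes every group listed by hagrp -list should be frozen and that the group name is in the first column of the output):

[root@xxxxx-node1 ~]$ haconf -makerw
[root@xxxxx-node1 ~]$ for grp in $(hagrp -list | awk '{print $1}' | sort -u); do hagrp -freeze $grp -persistent; done
[root@xxxxx-node1 ~]$ haconf -dump -makero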

Now the magic:

[root@xxxxx-node1 ~]$ hastop -all -force

Why magic? This command stops the cluster stack on all nodes BUT it leaves all the resources running, so you can do this without shutting down any user-defined cluster services (Oracle databases in our case). Once the stack is down on all nodes, stop gab and llt on both nodes as well:

[root@xxxxx-node1 ~]$ systemctl stop gab
[root@xxxxx-node1 ~]$ systemctl stop llt
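
At this point it does not hurt to confirm on both nodes that your resources really survived, e.g. for Oracle (just a plain process check, nothing Veritas specific):

[root@xxxxx-node1 ~]$ ps -ef | grep [p]mon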

Having stopped llt and gab you just need to start them again in the correct order on both systems:

[root@xxxxx-node1 ~]$ systemctl start llt
[root@xxxxx-node1 ~]$ systemctl start gab
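
Before starting the cluster you can quickly confirm on both nodes that llt and gab came up fine:

[root@xxxxx-node1 ~]$ systemctl status llt gab
[root@xxxxx-node1 ~]$ lltstat -nvv | head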

… and after that start the cluster:

[root@xxxxx-node1 ~]$ systemctl start vcs

In our case that was enough to make LLT work as expected again and the cluster is fine (port a is the GAB membership, port h is the membership of the VCS engine; "01" means both nodes, 0 and 1, have joined):

[root@xxxxx-node1 ~]$ gabconfig -a
GAB Port Memberships
===============================================================
Port a gen f44203 membership 01
Port h gen f44204 membership 01
[root@xxxxx-node1 ~]#

[root@xxxxx-node1 ~]$ lltstat -nvv | head
LLT node information:
   Node State Link Status Address
    * 0 xxxxx-node1 OPEN
      eth3 UP yy:yy:yy:yy:yy:yy
      eth1 UP xx:xx:xx:xx:xx:xx
      bond0 UP rr:rr:rr:rr:rr:rr
    1 xxxxx-node2 OPEN
      eth3 UP ee:ee:ee:ee:ee:ee
      eth1 UP qq:qq:qq:qq:qq:qq
      bond0 UP tt:tt:tt:tt:tt:tt 
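
One last thing: as the service groups were frozen persistently at the beginning, do not forget to unfreeze them again once you are happy with the state of the cluster:

[root@xxxxx-node1 ~]$ haconf -makerw
[root@xxxxx-node1 ~]$ hagrp -unfreeze SERVICE_GROUP -persistent # again for all service groups
[root@xxxxx-node1 ~]$ haconf -dump -makero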

Hope that helps …

