Infrastructure at your Service

David Barbarin

SQL Server failover cluster, VSphere, & SCSI-3 reservation nightmares

When I have to install a virtualized SQL Server FCI at a customer place as an SQL Server consultant, the virtualized environment usally is ready. I guess this is the same for most part of the database consulting people. Since we therefore lack practice, I have to admit that we do not always know the good configuration settings to apply to the virtualized layer in order to correctly run our SQL Server FCI architecture.

A couple of days ago, I had an interesting case where I had to help my customer to correctly configure the storage layer on VSphere 5.1. First, I would like to thank my customer because I seldom have the opportunity to deal with VMWare (other than via my personal lab).

The story begins with a failover testing that fails randomly on a SQL Server FCI after switching the SQL Server volumes from VMFS to RDM in a physical compatibility mode. We had to switch because the first configuration was installed in CIB configuration (aka Cluster-In-Box configuration). As you certainly know, it does not provide a full additional layer of high availability over VMWare in this case, because all the virtual machines are on the same host. So we decided to move to a CAB configuration (Cluster across Box) scenario that is more reliable than the first configuration.

In the new configuration, a failover triggers a Windows 170 error randomly with this brief description: “The resource is busy”. At this point, I suspected that the ISCI-3 reservation was not performed correctly, but the cluster failover validate report didn’t show any errors concerning the SCSI-3 reservation. I then decided to generate the cluster log to see if I could find more information about my problem – and here is what I found:

 

00000fb8.000006e4::2014/10/07-17:50:34.116 INFO [RES] Physical Disk : HardDiskpQueryDiskFromStm: ClusterStmFindDisk returned device=’?scsi#disk&ven_dgc&prod_vraid#5&effe51&0&000c00#{53f56307-b6bf-11d0-94f2-00a0c91efb8b}’
00000fb8.000016a8::2014/10/07-17:50:34.116 INFO [RES] Physical Disk : ResHardDiskArbitrateInternal request Not a Space: Uses FastPath
00000fb8.000016a8::2014/10/07-17:50:34.116 INFO [RES] Physical Disk : ResHardDiskArbitrateInternal: Clusdisk driver handle or event handle is NULL.
00000fb8.000016a8::2014/10/07-17:50:34.116 INFO [RES] Physical Disk : HardDiskpQueryDiskFromStm: ClusterStmFindDisk returned device=’?scsi#disk&ven_dgc&prod_vraid#5&effe51&0&000d00#{53f56307-b6bf-11d0-94f2-00a0c91efb8b}’
00000fb8.000006e4::2014/10/07-17:50:34.117 INFO [RES] Physical Disk : Arbitrate – Node using PR key a6c936d60001734d
00000fb8.000016a8::2014/10/07-17:50:34.118 INFO [RES] Physical Disk : Arbitrate – Node using PR key a6c936d60001734d
00000fb8.000006e4::2014/10/07-17:50:34.120 INFO [RES] Physical Disk : HardDiskpPRArbitrate: Fast Path arbitration…
00000fb8.000016a8::2014/10/07-17:50:34.121 INFO [RES] Physical Disk : HardDiskpPRArbitrate: Fast Path arbitration…
00000fb8.000016a8::2014/10/07-17:50:34.122 WARN [RES] Physical Disk : PR reserve failed, status 170
00000fb8.000006e4::2014/10/07-17:50:34.122 INFO [RES] Physical Disk : Successful reserve, key a6c936d60001734d
00000fb8.000016a8::2014/10/07-17:50:34.123 ERR   [RES] Physical Disk : HardDiskpPRArbitrate: Error exit, unregistering key…
00000fb8.000016a8::2014/10/07-17:50:34.123 ERR   [RES] Physical Disk : ResHardDiskArbitrateInternal: PR Arbitration for disk Error: 170.
00000fb8.000016a8::2014/10/07-17:50:34.123 ERR   [RES] Physical Disk : OnlineThread: Unable to arbitrate for the disk. Error: 170.
00000fb8.000016a8::2014/10/07-17:50:34.124 ERR   [RES] Physical Disk : OnlineThread: Error 170 bringing resource online.
00000fb8.000016a8::2014/10/07-17:50:34.124 ERR   [RHS] Online for resource AS_LOG failed.

 

An arbitration problem! It looks to be related to my first guess doesn’t it? Unfortunately I didn’t have access to the vmkernel.log to see potential reservation conflicts. After that, and this is certainly the funny part of this story (and probably the trigger for this article), I took a look at the multi-pathing configuration for each RDM disk. The reason for this is that I remembered some of the conversations I had with one of my friends (he will surely recognize himself :-)) in which we talked about ISCSI-3 reservation issues with VMWare.

As a matter of fact, the path selection policy was configured to round robin here. According to the VMWare KB1010041, PSP_RR is not supported with VSphere 5.1 for Windows failover cluster and shared disks. This is however the default configuration when creating RDM disks with EMC VNX storage which is used by my customer. After changing this setting for each shared disk, no problem occurred!

My customer inquired about the difference between VMFS and RDM disks. I don’t presume to be a VMWare expert because I’m not, but I know that database administrators and consultants cannot just ignore anymore how VMWare (or Hyper-V) works.

Fortunately, most of the time there will be virtual administrators with strong skills, but sometimes not and in this case, you may feel alone facing such problem. So the brief answer I gave to the customer was the following: If we wouldn’t use physical mode RDMs and used VMDKs or virtual mode RDMs instead, the SCSI reservation would be translated to a file lock. In CIB configuration, there is no a problem, but not for CAB configuration, as you can imagine. Furthermore, using PSP_RR with older versions than VSphere 5.5 will free the reservation and can cause potential issues like the one described in this article.

Wishing you a happy and problem-free virtualization!

 

Leave a Reply


+ 6 = eight

David Barbarin
David Barbarin

Senior Consultant & Microsoft Technology Leader