I was playing with my lab environment which includes Windows 2016 and SQL Server 2016 and I noticed an interesting scenario while testing cluster node failures. After simulating some network outage scenarios, I was not able to bring back my cluster node online immediately by using traditional way. A quick look at the cluster manager led me to notice something new:

blog 113 - 0 - WSFC new state

A quarantined state value … interesting! But what does it mean exactly?

In respect of this Microsoft blog from the Server and management team, this is a new feature shipped with Windows Server 2016 that concerns Virtual Machine Compute Resiliency. To cut the long story short, this feature is intended to improve of response of the Windows Failover Cluster to transient failures. Initially, it was designed for Hyper-V virtual machines resiliency and it has been introduced to avoid “flapping nodes” phenomena which may impact negatively the overall cluster health.

To illustrate the benefit of this feature, let’s go back to my failure scenario which consisted in simulating random network failures. Experiencing such scenario in production environment may lead to an unstable cluster state including quorum loss in the worst scenario. In fact, after 3 failure events, the concerned cluster node was quarantined and we may confirm by looking at the new following parameters (QuarantineThreshold is our concern here). The other parameter QuarantineDuration defines the timeframe during which the cluster node is not able to join the cluster, so basically two hours.

blog 113 - 1 - WSFC quarantine parameters

In my case, the WIN20169SQL16 node will not automatically join the cluster until 22:56:38 in my case.

blog 113 - 2 - WSFC quarantine error log

What about my availability group? Well no surprise here, the quarantined cluster node means an availability replica disconnected and a synchronization issue as well as confirmed below:

 

blog 113 - 3 - availability group state

We many notice a corresponding error number 100060 with the message An error occurred while receiving data: ‘10060(A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.)’. There is no specific message from SQL Server error log about quarantine state. From the secondary replica, I got the following sample message into the SQL Server error log:

blog 113 - 4 - secondary replica error log

SQL Server is waiting for the cluster node to start and rejoin the WSFC. In short, the overall availability health state will not change while the quarantined node is active. So according to your context, it may be a good thing until you don’t fix the related issue on the concerned cluster node. Fortunately, as stated to the Microsoft document, this is not mandatory for us to wait for the quarantined period to finish by using the following PowerShell command:

blog 113 - 5 - WSFC start cluster node

Here we go! The cluster health state got back to normal

blog 113 - 6 - WSFC final state

Happy clustering!

By David Barbarin