Infrastructure at your Service

David Barbarin

When Read-Scale availability groups and Windows Failover Cluster are not good friends

A couple of days ago, with some fellow French data platform MVPS (@thesqlgrrrl and @Conseilit) we discussed an issue around Read-Scale availability groups and it made me think I had forgotten to share about a weird behavior I experienced with them.  

Basically, Read-Scale availability groups are clusterless infrastructures meaning there is no need to install an underlying cluster. Obviously, you will not benefit from resource orchestration and automatic failover, but this is obviously the intended behavior and their sole purpose and design is to scale out a read workload

Let’s say you have installed a Read-Scale availability group that includes 2 replicas with one primary and one secondary.

If there is a network failure between replicas the normal behavior should be:

        The primary replica continues to handle the R/W workload. Regarding the replication type (sync/async) the current queries may be blocked until reaching the session timeout and then everything should back to normal. Obviously, transactions will fill up the transaction log until the network outage is fixed.

        The secondary replica continues to handle the RO traffic assuming you use a direct connection to it and you don’t use the transparent redirection capability of the AG listener which may lead you to face the following error message:

Error: Microsoft ODBC Driver 17 for SQL Server : Unable to access the ‘<database>’ database because no online secondary replicas are enabled for read-only access. Check the availability group configuration to verify that at least one secondary replica is configured for read-only access. Wait for an enabled replica to come online, and retry your read-only operation..

But here comes the interesting part of the story.  Let’s introduce in the game the Windows Failover Cluster. As said previously there is no need to install such HA feature on Windows to make the Read-Scale AG working correctly. But suppose now you want to test read scale capabilities on an existing AG infrastructure that relies on the underlying cluster for HA. Why to do this? Well, because in my case with my customer we didn’t want to provision a dedicated infrastructure to only test Read-Scale AGs and we wanted to take advantage of the existing one for that. Anyway, referring to the Microsoft documentation and my google Fu as well, we didn’t find out any know evidence of “incompatibility” of using Read-Scale AG in conjunction of the Windows Failover Cluster. So, we just went ahead, and we added a new SQL Server 2017 Read-Scale AG to an existing infrastructure that already included other AGs for HA (and therefore implicitly an underlying WSFC installed).

The new installed Read-Scale AG was configured correctly as show below:

Simulating a failure scenario to voluntary shutdown the WSFC gave surprising results because we didn’t expect to see connections issues with the new Read-Scale AG both on the primary and the secondary. Let’s clarify this point: As mentioned above, for AG with HA capabilities, this result is consistent with a failed Windows Failover Cluster that has compromised the AG infrastructure. But with Read-Scale AGs because they are not tied to any HA mechanism, we didn’t expect to see the AG to get a “Resolving” state with no access to the primary database as confirmed with the following error message:

Msg 983, Level 14, State 1, Line 3
Unable to access availability database ‘<database>’ because the database replica is not in the PRIMARY or SECONDARY role.
Connections to an availability database is permitted only when the database replica is in the PRIMARY or SECONDARY role.
Try the operation again later.

Looking at the SQL Server error log gave us back some clues:


Always On Availability Groups: Local Windows Server Failover Clustering node is no longer online. This is an informational message only. No user action is required.
Always On: The availability replica manager is going offline because the local Windows Server Failover Clustering (WSFC) node has lost quorum. This is an informational message only. No user action is required.
Always On: The local replica of availability group ‘AGSCALE’ is stopping. This is an informational message only. No user action is required.
The state of the local availability replica in availability group ‘AGSCALE’ has changed from ‘PRIMARY_NORMAL’ to ‘RESOLVING_NORMAL’. The state changed because the local instance of SQL Server is shutting down. For more information, see the SQL Server error log or cluster log. If this is a Windows Server Failover Clustering (WSFC) availability group, you can also see the WSFC management console.
Always On Availability Groups connection with secondary database terminated for primary database ‘<database>’ on the availability replica ‘WIN20192\SQL17’ with Replica ID: {5e3d8c42-9c52-449a-aa90-a16665aca055}. This is an informational message only. No user action is required.
The availability group database “<database>” is changing roles from “PRIMARY” to “RESOLVING” because the mirroring session or availability group failed over due to role synchronization. This is an informational message only. No user action is required.
State information for database ‘<database>’ – Hardened Lsn: ‘(42:10776:1)’ Commit LSN: ‘(42:10768:3)’ Commit Time: ‘Sep 19 2019 1:45PM’
Always On: The availability replica manager is starting. This is an informational message only. No user action is required.
Always On Availability Groups: Waiting for local Windows Server Failover Clustering service to start. This is an informational message only. No user action is required.
Error: 983, Severity: 14, State: 73.
Unable to access availability database ‘<database>’ because the database replica is not in the PRIMARY or SECONDARY role. Connections to an availability database is permitted only when the database replica is in the PRIMARY or SECONDARY role. Try the operation again later.
Always On Availability Groups: Local Windows Server Failover Clustering service started. This is an informational message only. No user action is required.

Attempt to access non-existent or uninitialized availability group with ID ‘{890B3B2C-3BD5-EE9D-1594-899F989700CF}.’. This is usually an internal condition, such as the availability group is being dropped or the local WSFC node has lost quorum. In such cases, and no user action is required.
Error: 983, Severity: 14, State: 1.
Unable to access availability database ‘<database>’ because the database replica is not in the PRIMARY or SECONDARY role. Connections to an availability database is permitted only when the database replica is in the PRIMARY or SECONDARY role. Try the operation again later.

The Windows Failover Cluster has shutdown due to a quorum lost (expected result) but also bring the Read-Scale down … The above output tends to think that even with CLUSTER_TYPE=NONE there still exists a strong relationship between the AG and the underlying cluster. By the way, if you look at the sys.dm_hadr_cluster_members you get info of existing cluster. 

SELECT 
	member_name,
	member_type_desc AS member_type,
	member_state_desc AS member_state,
	number_of_quorum_votes
FROM sys.dm_hadr_cluster_members

 

From an architecture standpoint, I would say mixing both AGs with HA capabilities and Read-Scale AGs may be not a good practice at all and make no sense, and we probably came across a borderline scenario. However, my feeling is that technically speaking it should work as described in the Microsoft documentation 🙂 I noticed this behavior with SQL Server 2017 and different CUs and Windows Server 2016. I got the same behavior with Windows Server 2019 as well.

Please feel free to comment if there is something I omitted ! 

Anyway, we finally dedicated an environment for Read-Scale AGs and we got consist results this time 🙂

See you!

By David Barbarin

 

 

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

David Barbarin
David Barbarin

Principal Consultant & Microsoft Technology Leader