Ensuring data availability is an important part of the high availability design. SQL Server AlwaysOn features including SQL Server FCIs and availability groups address some aspects of the problem but we may also rely on the online operations features to maximize data availability. Indeed, for some high-critical workloads, offline operations during the maintenance phase are not permitted and may contribute to call into question the entire architecture.
But have you ever faced an outage from your primary replica leading to a failover to the secondary replica during the database maintenance timeline? Yes, your architecture is designed to behave in this way for sure but let’s say you were in the middle of a rebuild index operation concerning a big table and everything is rolled back now. In this case, you will probably have to wait the next windows maintenance time to initiate another rebuild operation, but from the start …
Fortunately, the new Resumable Online Indexes feature from SQL Server 2017 will come to the rescue and seems promising to address such situation. Some limitations are still around and well-documented in the Microsoft documentation but I guess Microsoft will work hard to remove them in the future.
Let’s play a little bit with the Resumable Online Indexes on my AG environment that includes two replicas with cluster_type = NONE meaning we don’t rely on an underlying orchestrator to handle automatic failover of the AG resources. In this context, it will be sufficient to demonstrate how Resumable Online Indexes work.
My initial configuration is as follows:
I used the AdventureWorks2016 database with a bigTransactionHistory table generated from the Adam Machanic script.
select o.name as table_name, i.name as index_name, p.index_id, au.type_desc, au.used_pages, p.rows from sys.allocation_units as au join sys.partitions as p on au.container_id = p.hobt_id join sys.objects as o on o.object_id = p.object_id left join sys.indexes as i on i.object_id = p.object_id and i.index_id = p.index_id where p.object_id = object_id('bigTransactionHistory');
Let’s then initiate a rebuild of the pk_bigTransactionHistory index by using the new RESUMABLE option as follows:
ALTER INDEX pk_bigTransactionHistory ON bigTransactionHistory REBUILD WITH (ONLINE = ON, RESUMABLE = ON); GO
By using this option, I made the rebuild phase “resumable” and I’m now able to perform additional actions like pausing or aborting the index rebuild operation. I may also use MAX_DURATION parameter to setup the duration (in minutes) of the operation before to pause it. We may easily identify resumable sessions which are running on the SQL Server instance by using the sys.dm_exec_requests DMV and new is_resumable column.
Let’s initiate a manual failover while the index is rebuilt by SQL Server on the primary replica. Then during the outage of the first primary replica, I faced the following error message:
A typical error message regarding the situation … but the most interesting part comes now. If we go through the new sys.index_resumable_operations DMV, we may notice our rebuild index operation is paused.
SELECT iro.last_max_dop_used, iro.sql_text, iro.start_time, iro.last_pause_time, iro.total_execution_time, iro.page_count, iro.percent_complete FROM sys.index_resumable_operations as iro JOIN sys.objects AS o ON o.object_id = iro.object_id
From the above output, we may extract interesting column values:
- percent_complete : the percent complete value at the moment of pausing the operation
- page_count = size of the second index structure at the moment of pausing the operation
- last_pause_time = it is self-explanatory
- last_max_dop_used = max dop value used during the last rebuild operation. It is interesting to notice we may change this value for the same operation between pausing sessions.
Before resuming the index operation, let’s have a brief look at the transaction log used space. I asked myself some questions about the transaction behavior when the index operation is paused: Does a pause state have an impact on the transaction log? Do the VLFs touched by this operation remain active until the index is completely rebuilt? Let’s answer to this question by the following test.
A first look at the transaction log space used says that the transaction log contains records that must be backed-up.
use AdventureWorks2016; select total_log_size_in_bytes / 1024 / 1024 / 1024 as total_log_size_GB, used_log_space_in_bytes / 1024 / 1024 as used_log_space_MB, log_space_in_bytes_since_last_backup / 1024 / 1024 as log_space_MB_since_last_backup, used_log_space_in_percent from sys.dm_db_log_space_usage; go
According to the previous result, let’s perform a backup log …
backup log AdventureWorks2016 to disk = 'NUL' with stats = 10; go
Well, pausing an index operation doesn’t rely on a remaining active transaction that may impact the transaction log retention but after a few thoughts, it seems to be obvious it works in this way.
What about storage? The resumable option is available only with online index operations meaning SQL Server must maintain the corresponding hidden structures as long as it is necessary to rebuild the underlying index. According to the BOL:
No extra resources are required for resumable index rebuild except for◦Additional space required to keep the index being built, including the time when index is being paused
Obviously, maintaining such structure may have a huge impact on your workload regarding your context. Out of curiosity, I went through the system view to see those hidden structures as follows:
select o.name as table_name, i.name as index_name, p.index_id, au.type_desc, SUM(au.used_pages) as total_used_pages, SUM(p.rows) as total_rows, COUNT(*) as nb_partitions from sys.allocation_units as au join sys.partitions as p on au.container_id = p.hobt_id join sys.objects as o on o.object_id = p.object_id left join sys.indexes as i on i.object_id = p.object_id and i.index_id = p.index_id where p.object_id = object_id('bigTransactionHistory') group by o.name, i.name, p.index_id, au.type_desc order by o.name, p.index_id; go
In regard to the second record, a simple math – 9261 * 100. / 190150 – confirms the structure is only 4.8 percent in size of the underlying cluster index. We may retrieve this result from the first output of the new sys.index_resumble_index system view. For the third one, my guess is it corresponds to the temporary mapping index used by SQL Server to determine which records to delete in the new indexes that are being built when rows in the underlying table are updated or deleted.
Here is an execution plan sample of an update query against the bigTransactionHistory table. We may notice DELETE / UPDATE operations to a “hidden” nonclustered index related to this special index highlighted in red.
Go ahead and let’s resume the pending index operation by using the new RESUME option. We may also add the MAX_DURATION option to guarantee the rebuild operation will go back to a pause state if we consider that reaching the maintenance windows limit cannot hurt the daily business workload in terms of resource for instance. An option we will definitely consider to add it with the next version of our DMK maintenance tool for SQL Server.
ALTER INDEX pk_bigTransactionHistory ON bigTransactionHistory RESUME WITH (MAX_DURATION = 120 MINUTES); GO
In this blog, we just have surfaced this new maintenance option and its capabilities. Other articles will probably come in the future.
Stay tuned!
By David Barbarin