Infrastructure at your Service

David Barbarin

SQL Server 2014 : New incremental statistics

Do you use partitioning with SQL Server? If yes this post is for you because there are good news about partitioning with SQL Server 2014. I remember an old job with a healthcare customer who we decided to implement a partitioning strategy for a big transaction archive table for medical records. We used SQL Server 2005 and we already faced to some challenges like statistics updates. Database administrators who use partitioning with version older than SQL Server 2014 know what I mean.

The first news that concern this blog post is the new incremental statistic strategy provided with SQL Server 2014. This feature allows to update statistics without reading the entire table that can be a problem when your table becomes very big! Imagine for instance you have only one small active partition with recent records and a lot of changes (update, insert or delete) against 30 read only big partitions with billions of records. With SQL Server you have different strategies to update statistics by using either the sample method or the full scan method. The latter is more reliable with partitioned tables because in this case we have to deal with skewed data but requires more time (and more resources) because SQL Server have to read all the table records. With big partitioned tables we talk about potentially many hours. Adding a new non empty partition could be also problematic because we have to deal with same issue. New data are not represented into the concerned statistic.

Using incremental statistics it is possible to update only one or several partitions as necessary. The information is then gathered or merged with existing information to create the final statistic. We will see later how works this new mechanism.

During my tests I will use a modified transaction history table in the AdventureWorks2012. The table records are generated from the Adam Machanic’s T-SQL script. This table will contain 49270382 records for a total size of 2 GB. We will partition the bigTransactionHistory table as following:

USE AdventureWorks2012;
GO
 
– Partition function with a partitionning strategy by year.
– The first day of the year will be used as boundary of the next partition
CREATE PARTITION FUNCTION BigTransactionRange(DATETIME)
AS RANGE RIGHT FOR VALUES
(
   ‘20050101’, ‘20060101’, ‘20070101’, ‘20080101’,
   ‘20090101’, ‘20100101’
);
GO
 
– Partition scheme
– All partitions will be in the primary filegroup
CREATE PARTITION SCHEME TransactionsPS1
AS PARTITION BigTransactionRange
ALL TO
(
[PRIMARY]
);
GO

Now we can create a clustered index by using the partition scheme TransactionPS1 and the new instruction STATISTICS_INCREMENTAL.

CREATE CLUSTERED INDEX idx_bigtransactionhistory_transactiondate
ON [dbo].[bigTransactionHistory]([TransactionDate])
WITH (SORT_IN_TEMPDB = ON, STATISTICS_INCREMENTAL = ON);
GO

The sys.stats system view has a new is_incremental column which indicates if a statistic is incremental or not.

SELECT
       stats_id,
       name AS stat_name,
       is_incremental
FROM sys.stats
WHERE object_id = object_id(‘bigtransactionhistory’)

blog_10_-_0-_incremental_stats_-_sys_stats

The new partition configuration of the bigTransactionHistory table is the following:

SELECT
       p.object_id,
       o.name AS table_name,
       p.partition_number,
       p.rows,
       au.total_pages,
       au.total_pages / 128 AS total_size_mb,
       au.type_desc,
       g.name AS [filegroup_name],
       RVL.value AS left_range_boundary,
       RVR.value AS right_range_boundary
       –PF.[name], RV.boundary_id, RV.[value]
FROM sys.partitions AS p
       JOIN sys.objects AS o
             ON o.object_id = p.object_id
       JOIN sys.indexes i
       ON p.object_id = i.object_id
                    AND p.index_id = i.index_id
       JOIN sys.allocation_units AS au
             ON p.hobt_id = au.container_id
       JOIN sys.filegroups AS g
             ON g.data_space_id = au.data_space_id
       LEFT JOIN sys.partition_schemes AS PS
             ON ps.data_space_id = i.data_space_id
       LEFT JOIN sys.partition_functions AS PF
             ON PF.function_id = ps.function_id            
       LEFT JOIN sys.partition_range_values AS RVL
             ON RVL.function_id = PF.function_id
                    AND RVL.boundary_id + 1 = p.partition_number
       LEFT JOIN sys.partition_range_values AS RVR
             ON RVL.function_id = PF.function_id
                    AND RVR.boundary_id = p.partition_number
WHERE p.object_id = object_id(‘bigtransactionhistory’)
       AND p.index_id = 1;

blog_10_-_1-_incremental_stats_-_config_partitionnement

Now let’s take a look at the statistics by using the DBCC SHOW_STATS command:

DBCC SHOW_STATISTICS(‘bigtransactionhistory’,‘idx_bigtransactionhistory_transactiondate’);

blog_10_-_2-_incremental_stats_-_stat_header

blog_10_-_3-_incremental_stats_-_stat_density

blog_10_-_4-_incremental_stats_-_stat_histogramm

We retrieve the same information as the older versions. In fact we’re trying to read statistic data from the final statistic. You will understand why later in the blog post. Note here that the final statistic can have a maximum of 200 steps. In others words currently new incremental statistics does not change the story: we already have a maximum of 200 steps for the entire table. I hope we will see some improvements maybe in the next version.

Now let’s fill up the table with some records which will have a transaction date greater than 2011-01-01 in order to move them to the last partition.

INSERT dbo.bigTransactionHistory
SELECT TOP 1
       TransactionID + 1, ProductID, dateadd(year, 1, TransactionDate), Quantity + Quantity, ActualCost * 1.2
FROM dbo.bigTransactionHistory
ORDER BY TransactionID DESC

The new row has been moved to the correct partition number 8:

blog_10_-_5_-_incremental_stats_-_config_partitionnement_after_insert_one_record

If we take a look at the statistic histogram of the index idx_bigtransactionhistory_transactiondate we can notice that the new record is not propagated to the existing histogram.

blog_10_-_6_-_incremental_stats_-_stat_histogramm_apres_insertion_data

Of course, an update statistics is mandatory here. Before SQL Server 2014 we had to update statistics from the entire table but as said earlier we can now use a new option: WITH RESAMPLE ON PARTITIONS (). The concerned partition is the number 8 in my case. The word RESAMPLE is important here because all pages in the new statistic tree structure must be aligned with the same sample.

UPDATE STATISTICS dbo.BigTransactionHistory(idx_bigtransactionhistory_transactiondate)
WITH RESAMPLE ON PARTITIONS(8);

blog_10_-_7_-_incremental_stats_-_stat_histogramm_apres_maj_incremental_statsjpg

Good news: the incremental statistics works ! In addition the following table gives us an idea of the performance we could obtain by using the incremental statistic (I used and compared the both methods with full scan and with resample on partitions):

Update statistics option

Elapsed Time (ms)

CPU Time (ms)

WITH FULLSCAN

3085

20935

WITH RESAMPLE ON PARTITIONS(8)

6

0

I guess you can easily imagine the result with this new method on a real production environment …

What about auto update statistic with incremental statistics? As you certainly know SQL Server uses a specific algorithm to update automatically statistics when the table has more than 500 rows. The update operation is triggered when the number of rows reaches 20% of the total existing rows + 500 rows. With tables that have a billions of rows, we can spend much time without any automatic update statistics operation. Besides, adding a new partition that does not modify more than 20% of the total rows will not issue an automatic update statistic operation and no information about it will not be available. Fortunately incremental statistics changes the story here. The update statistic can be triggered per partition when the number of modification reaches the threshold value = (total rows / number of partitions) * 20%.

Let’s try with the following test. In my case the bigTransactionhistory table contains 46895292 rows. According to the above formula the theoretical threshold should be: 46895292 / 8 * 0.2 = 1172383. Thus, I will update 1172892 rows in the partition number 7 to be sure to issue an automatic update statistic operation by using a SELECT statement with a predicate on the TransactionDate column.

UPDATE TOP (1172892) bigTransactionHistory
SET TransactionDate = DATEADD(dd, 1, TransactionDate)
WHERE $PARTITION.BigTransactionRange(TransactionDate) = 7;

I can confirm the number of changes is over the threshold by viewing the rowmodctr column from the legacy system view sysindexes:

SELECT
       indid, o.name AS table_name,
       i.name AS index_name,
   STATS_DATE(o.id,i.indid) AS last_date_updated,
       rowmodctr AS number_of_changes,
       st.is_incremental
FROM sysobjects o
       JOIN sysindexes i
             ON i.id = o.id
       JOIN sys.stats st
             ON st.object_id = o.id
                    AND st.stats_id = i.indid
WHERE xtype = ‘U’ AND i.name IS NOT NULL
       AND o.id = object_id(‘bigtransactionhistory’)
ORDER BY last_date_updated
GO

blog_10_-_8_-_incremental_stats_-_rowmodctr

Then, I perform a SELECT statement with a predicate on the TransactionDate column …

SELECT TOP 10 *
FROM dbo.bigTransactionHistory
WHERE TransactionDate > ‘20110101’;

… and I can conclude the SELECT statement has issued an automatic update statistic by viewing the rowmodctr column value equal to 0 for the idx_bigtransactionhistory_transactiondate column.

blog_10_-_9_-_incremental_stats_-_rowmodctr

The automatic update statistic operation has been issued for only 2.5% of total rows change in the entire table in my case.

Now as promise I will give you more information about statistic pages with incremental statistics. First, I would like to thank you Frédéric Pichaud, Senior Escalor Engineer at Microsoft in France, for giving us the following information. We can use the new internal dynamic management function to figure out how statistic pages are organized with incremental statistics.

SELECT
       s.name,
       s.is_incremental,
       sp.node_id,
       sp.first_child,
       sp.next_sibling,
       sp.last_updated,
       sp.modification_counter,
       sp.rows,
       sp.rows_sampled,
       sp.steps,
       sp.left_boundary,
       sp.right_boundary
FROM sys.stats as s
CROSS APPLY sys.dm_db_stats_properties_internal(s.object_id, s.stats_id) AS sp
WHERE s.object_id = object_id(‘bigtransactionhistory’)
       AND s.name = ‘idx_bigtransactionhistory_transactiondate’
ORDER BY s.name, node_id;

blog_10_-_10_-_incremental_stats_-_dmf_internal

We have to focus on 3 columns here: node_id, first_child and next_cibling columns. node_id is an identifier for a page statistic. The page with the node_id equal to 1 is the final statistic page we’ve seen earlier by using DBCC SHOW_STATISTICS. This page contains the total number of rows in the bigTransactionHistory table (column rows = 46895292). We can notice that each partition, easily identifiable by the left and right boundaries, has its own page statistic (node_id 3 to 9). The first_child and the next sibling columns help us to rebuild the complete tree of statistics objects as shown below:

blog_10_-_12_-_incremental_stats_-_statistic_object_tree

From my understanding, the green squares are the statistic page at the leaf level linked to a partition and red squares are the binary merge pages. For instance pages with node_id = 8 (4790286 rows) and 9 (1 row) are merged to the page with node_id = 10 (4790286 + 1 = 4790287). In turn, the page with node_id = 7 (7672943 rows) and the page with node_id = 10 are merged into the page with node_id = 12 (7672943 + 4790287 = 12463230) and so on… The page with node_id = 1 is the final page statistic that merges all the others pages. The page with node_id = 2 is a reserved page for future needs.

Earlier in the blog post we issued an automatic update statistic and the partition with the number 10 was concerned. The page with node_id = 8 is linked to this partition and we can notice the update statistic propagation throughout the statistic objects tree. Only the page statistic of this partition is updated with all the merge pages.

blog_10_-_13_-_incremental_stats_-_dmf_internal_2

Finally we can use the traceflag 2309 to see the detail of a page in the statistic tree. After enabling this traceflag we can use the DBCC SHOW_STATISTICS command with an additional third parameter node_id

DBCC TRACEON(2309);
GO
 
DBCC SHOW_STATISTICS(‘bigtransactionhistory’,‘idx_bigtransactionhistory_transactiondate’, 8);

blog_10_-_14_-_incremental_stats_-_show_statistics_detail_page

The results is the same as a classic DBCC SHOW_STATISTICS command but the detail is only for the concerned page in the tree. Enabling this traceflag is only for debugging purpose. For instance if we suspect a general update statistic issue we can try to identify which pages are relevant.

Database administrators who faced to statistics problems with partitioned tables will probably have fun to try this new feature. See you soon for the next good news about partitioning ;-)

 

One Comment

Leave a Reply


+ 3 = five

David Barbarin
David Barbarin

Senior Consultant & Microsoft Technology Leader