Thursday, May 23, 2013

Crowdsource momentum attempt: MSSQL stats update per-partition **Updated**

 I see a Connect item opened in June 2009 titled "Update Statistics at the partition level".

If I hadn't found this item, I would have written an item up myself.  Consider this Connect item, and upvote if it seems worthy to you.

As of SQL Server 2012, partitioned indexes can be rebuilt per partition.  That's good.
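For reference, a single-partition rebuild looks like this (the table, index, and partition number are hypothetical):

```sql
-- Rebuild only partition 20 of a partitioned index
-- (dbo.FactSales and ix_FactSales_Date are made-up names)
ALTER INDEX ix_FactSales_Date
ON dbo.FactSales
REBUILD PARTITION = 20;
```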

But, the statistics for a partitioned table can only be updated across all partitions.  That's bad.
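By contrast, UPDATE STATISTICS has no partition clause: the whole table gets sampled or scanned (same hypothetical names):

```sql
-- No PARTITION option exists here; this touches every partition,
-- including the read-only ones whose stats haven't changed
UPDATE STATISTICS dbo.FactSales ix_FactSales_Date
WITH FULLSCAN;
```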

The various posts in the Connect item make a good case for partition-level stats updates.  Why update stats for data from 2001, partitioned on a time-based key, if the file the partition lives in is read-only?  If I eliminate fragmentation and use a high fill factor and page-level compression in an old, read-only partition, the statistics in that partition will be stagnant.  Still important for the query optimizer... but the stats within the partition don't need to be updated.  If only 1 of 20 partitions is being updated, and the other 19 are read-only with current stats, updating stats will be much faster if only the active partition is touched.

There's more.  The read-only partitions only have to be backed up once per quarter (or less often).  The only reason to keep backing them up periodically is to make sure the backup is on good media.  That can take a whole lot of stress off of the backup infrastructure.
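A sketch of what that looks like, assuming a filegroup per partition (database, filegroup, and path names are hypothetical):

```sql
-- Freeze the old partition's filegroup...
ALTER DATABASE SalesDB
MODIFY FILEGROUP FG_2001 READ_ONLY;

-- ...then back up just that filegroup on the quarterly
-- media-verification schedule, not the nightly one
BACKUP DATABASE SalesDB
FILEGROUP = 'FG_2001'
TO DISK = 'X:\backup\SalesDB_FG_2001.bak';
```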

There's more.  The dbcc checkdb interval for data should be based on recovery tolerance and recoverability.  Once a partition is read-only, the resources to perform the consistency check are a known quantity, and the time to recover from backup is a known quantity.  Its dbcc checkdb interval can be far longer than the interval for contents that are not read-only.
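With a filegroup per partition, the consistency check can be scoped the same way (hypothetical names again):

```sql
USE SalesDB;
-- Check only the read-only filegroup, on its own (longer) interval,
-- leaving the full CHECKDB for the active data
DBCC CHECKFILEGROUP (N'FG_2001') WITH NO_INFOMSGS;
```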

 So, statistics update can be made lighter and faster.  Backup can be made lighter and faster.  Consistency checks can be made lighter and faster.

At the current time, this means when I'm asked (I only get asked about tables of over 1 billion rows - and all of the systems I currently work with have similar workflow patterns), I will be recommending pseudo-partitioning for MSSQL, instead of partitioning.  We'll split tables into separate similar tables in separate files and filegroups, and use a view for the union.  That way, we can rebuild indexes per table/pseudo-partition.  We can update statistics per table/pseudo-partition.  We can mark old tables/partitions read-only, and take load off of backup and consistency check procedures.
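A minimal sketch of that pseudo-partitioning layout, assuming one table per year with CHECK constraints on the key so the optimizer can eliminate branches of the view (all names hypothetical):

```sql
-- One table per "partition", each on its own filegroup
CREATE TABLE dbo.FactSales_2001
( SaleDate date  NOT NULL
    CONSTRAINT ck_FactSales_2001
    CHECK (SaleDate >= '20010101' AND SaleDate < '20020101'),
  Amount   money NOT NULL )
ON FG_2001;

CREATE TABLE dbo.FactSales_2002
( SaleDate date  NOT NULL
    CONSTRAINT ck_FactSales_2002
    CHECK (SaleDate >= '20020101' AND SaleDate < '20030101'),
  Amount   money NOT NULL )
ON FG_2002;
GO

-- The union view presents the pieces as one table
CREATE VIEW dbo.v_FactSales
AS
SELECT SaleDate, Amount FROM dbo.FactSales_2001
UNION ALL
SELECT SaleDate, Amount FROM dbo.FactSales_2002;
GO

-- Now stats can be maintained per pseudo-partition:
-- only the active year's table gets scanned
UPDATE STATISTICS dbo.FactSales_2002 WITH FULLSCAN;
```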

Hopefully, when statistics can be updated per-partition, I'll be able to use partition switching to seamlessly transition from a pseudo partitioning to a real partitioning strategy.  I'd rather not move hundreds of millions - or billions - of rows just to accommodate the metadata differences between pseudo partitioning and partitioning.
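If the pseudo-partition tables and the target partitioned table share the same structure, constraints, and filegroup placement, that transition should be a metadata-only operation (hypothetical names):

```sql
-- Switch the 2001 pseudo-partition table into partition 1
-- of a real partitioned table - no rows are physically moved
ALTER TABLE dbo.FactSales_2001
SWITCH TO dbo.FactSales_Partitioned PARTITION 1;
```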

**Update by sql_sasquatch 29 August 2013**
SQL Server 2014 has additional features to allow statistics maintenance per partition.  Awesome!  Adam Machanic is concerned about how the optimizer will use the partitioned statistics (I am too)... guess we'll have to see about that.
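The 2014 mechanics look roughly like this (hypothetical names; statistics object and partition number are placeholders):

```sql
-- Create statistics that are tracked per partition
CREATE STATISTICS st_SaleDate
ON dbo.FactSales (SaleDate)
WITH INCREMENTAL = ON;

-- Update stats for just the active partition
UPDATE STATISTICS dbo.FactSales (st_SaleDate)
WITH RESAMPLE ON PARTITIONS (20);
```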

But the cool thing is that connect items 328093 and 468517 are both listed as active again.

Yay crowdsourcing!
