Sunday, February 17, 2019

SQL Server 2016 SP2 CU4/2017 CU13 Access Violation on Certain Autostats Updates

***** Updated 2019-02-18 This post was originally published on February 17th, 2019. February 18 it was updated with some details re: SQL Server 2016 SP2 CU3 and CU5 - the CUs immediately preceding and following the CU4 with the av. The repro has been updated from repro version 58 to repro version 62. This version does a better job of giving results which can easily be compared across CU3, CU4, and CU5. *****

I spent a lot of time tracking this error down.

When I started investigating, the error was known only as an access violation, preventing some operations related to data cleansing or fact table versioning.

It occurred deep within a series of stored procedures.  The execution environment included cross-database DELETE statements, cross-database synonyms, lots of SELECT statements against system views, scalar UDFs, and lots and lots of dynamic SQL.

And... I don't have access to the four systems where the access violation occurred.

I was able to have some investigation performed on those systems - we learned that by disabling 'auto stats update' for the duration of the sensitive operations, the error was avoided.  We also learned that reverting a system from SQL Server 2016 SP2 CU4 to SP2 CU2 avoided the errors.  On those systems, reverting to SP2 CU2 or temporarily disabling 'auto stats update' were sufficient temporary mitigations.

So I set out to reproduce the errors on my laptop.  That took a while.  A pretty long while. 😱

And once I was able to reproduce the access violation, I generated 16 errors and suddenly couldn't get anymore errors!  (I'd later learn that changing the database recovery model from 'simple' to 'full' was the reason the errors were no longer occurring.)  I opened a premier support ticket with Microsoft describing the issue and attaching the sqldump files, and explaining I wasn't able to continue generating the errors.

Fortunately it was only a few hours later that I was able to start generating errors again by proceeding through my process of reproducing errors and simplifying the context much more carefully than previously.

At the bottom of this post is a minimal reproduction of the access violation error in SQL Server 2016 SP2 CU4.  As is, the statements will create a new database t1 with simple recovery mode and 'auto stats update' enabled.  Those are the system state requirements for the access violations.  If you recreate the error yourself, please be aware a memory dump will be created.

The execution context required for the access violation requires a stats object that is already created but empty.  So a stats object for a clustered index(or primary key clustered index) on a table that was just created, or has been truncated.  The underlying column must be BIGINT data type.  With INT, the error does not occur (or at least doesn't occur until a much higher threshold of activity).

Within the same explicit transaction, at least 1024 rows are inserted into the table using the TABLOCK or TABLOCKX hint (without a hint no error will occur), and then a query requests the stats (resulting in the auto-update stats activity).  Then the access violation occurs.

If you'd like to follow along, here is the repro code.

DECLARE @version NVARCHAR(50) = N'SQL Server ' + CONVERT(NVARCHAR(20), SERVERPROPERTY('ProductVersion')) 
SELECT [version] = @version 
PRINT @version
USE [master]
DECLARE @dpath NVARCHAR(200), @lpath NVARCHAR(200), @dbcmd NVARCHAR(1000);
SELECT @dpath = CONVERT(NVARCHAR(100), SERVERPROPERTY('InstanceDefaultDataPath')) + N't1.mdf';
SELECT @lpath = CONVERT(NVARCHAR(100), SERVERPROPERTY('InstanceDefaultLogPath')) + N't1_log.ldf';

SET @dbcmd = N'CREATE DATABASE [t1] ON PRIMARY (NAME = ''t1'', FILENAME = N''' + @dpath + N''', SIZE = 8192KB , MAXSIZE = 1GB, FILEGROWTH = 65536KB )'
SET @dbcmd = @dbcmd + N' LOG ON ( NAME = N''t1_log'', FILENAME = N''' + @lpath + N''', SIZE = 8192KB , MAXSIZE = 1GB , FILEGROWTH = 65536KB );'
EXEC (@dbcmd)
USE t1
SELECT repro_id = 'x62';
PRINT 'repro_id = x62'
/* both DROP and CREATE TABLE can be replaced with
   access violation takes place for empty stat either way */

CREATE TABLE x  ( col1 BIGINT -- No error if INT rather than BIGINT
                  INDEX ci_x CLUSTERED (col1) -- stats on CI or PKCI for access violation on auto stats update
--                index i_col1(col1)) -- nope, no av if stats accompany non-clustered index
--CREATE STATISTICS s_col1 ON x(col1); -- nope, no error if stats on non-indexed column
                                       -- only auto stats update of stats on CI or PKCI seems to hit AV
--Transaction can include or exclude CREATE TABLE but must include INSERT and request for auto stats update

  ;WITH u0 AS (SELECT num=1 UNION ALL SELECT num=1),             -- 2 rows
        u1 AS (SELECT num=1 FROM u0 AS n1 cross apply u0 AS n2), -- 2 * 2 = 4 rows
        u2 AS (SELECT num=1 FROM u1 AS n1 cross apply u1 AS n2), -- 4 * 4 = 16 rows
        u3 AS (SELECT num=1 FROM u2 AS n1 cross apply u2 AS n2), -- 16 * 16 = 256 rows
        u4 AS (SELECT num=1 FROM u3 AS n1 cross apply u1 AS n2), -- 256 * 4 = 1024 rows
  INSERT INTO x WITH ( TABLOCKX ) -- TABLOCK or TABLOCKX to get av; no hint no error
  SELECT TOP (1024) num -- no av if less than 1024 rows
  FROM nums;

  --SELECT result = COUNT (DISTINCT col1) FROM x -- stats auto update in CU3, CU4 or CU5; but no av
  SELECT result = COUNT ( * ) FROM x -- stats auto update in 2016 SP2 CU4 but not CU3 or CU5 
  /* access violation takes place when auto update stats is attempted in 2016 SP2 CU4 */

USE [master]

Here's what the output of the repro code below looks like on SQL Server 2016 SP2 CU3.  Not terribly exciting.  Note that at the end of execution, the stats for ci_x are still empty.  That makes sense - a count(*) from a table can't really benefit from stats.  It doesn''t have an operator accumulating results over time or anything else requiring a memory grant.  Doesn't have a join operator requiring a row estimate.

Indeed, the results look identical in SQL Server 2016 SP2 CU5.  No auto stats update, no error, stats still empty as the code completes.

But what happens in SQL Server 2016 SP2 CU4?  Just after the rows are inserted - and before the results of the count(*) are returned - the query encounters an access violation.

The exception.log file in the SQL Server error log directory looks nice and orderly from these access violations :-)

And  here's the short stack dump included in the SQLDump****.txt files.

It really looks like SQL Server 2016 SP2 CU4 is a loner in this regard - asking for an auto stats update in a context where the stats are immaterial to the query itself.

We'll see.  If there is a fix forthcoming in SQL Server 2016 SP2 CU6 or beyond, I'll update this post with a reference to it.  Perhaps instead I'll learn that this issue was coincidentally fixed in SQL Server 2016 SP2 CU5.  I've looked through the CU5 kbar a few times and didn't see anything I thought was related.

I tried to document within the repro code potential 'bandaid' workarounds if you run into this access violation - just something to tide you over till can upgrade to a fixed version.  For example, switching to full recovery model or disabling auto stats update.  Unfortunately, innocculating against these AVs at the system level may come at too great of a risk in some cases.  There are workarounds possible in the code as well.  Eliminating hints in the insert might work without unreasonable cost.  Alternatively a manual stats update before the count(*) may be acceptable.

All right - that's it for now! Ciao!! 

***** Addendum 20190220 *****
I decided to check SQL Server 2017 as well.  CU11 - no error.  CU12 - no error.

So be on the lookout for this access violation in SQL Server 2016 SP2 CU5 and SQL Server 2017 CU13.

No comments:

Post a Comment