
Thursday, April 16, 2020

SQL Server 2017 cu17 ColumnStore Workload OOMs Part 1

A support ticket concerning this behavior is open with Microsoft.  SQL Server 2017 CU20 has not been evaluated for this behavior.  Although no published fix in CU20 appears to address it, it's possible that additional effects of the documented fixes address the behavior.  It's also possible an undocumented CU20 fix addresses it.  However, as CU20 has never been brought up in the context of the support ticket, and - as can be seen at the end of this blog post - Microsoft has previously documented fixes to similar behavior, I currently believe this behavior is likely present in SQL Server 2017 CU20 as well as CU17, CU18, and CU19.

Let's start the story with a SQL Server 2017 CU17 service restart on December 20, 2019.

All of the graphs below, unless otherwise noted, are from perfmon, captured in 30-second increments.

It's a VMware VM, with Windows Server 2016 as the operating system.  Global startup trace flags: 4199 (optimizer hotfixes), 4139 (quickstats histogram amendments), and 3226 (no errorlog messages for successful t-log backups).  Nothing too surprising there.
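The global trace flag list can be confirmed on any instance with DBCC TRACESTATUS; a minimal check (not from the original post) looks like this:

```sql
-- List every trace flag enabled globally on the instance.
-- On this system, 4199, 4139, and 3226 should appear with Global = 1.
DBCC TRACESTATUS(-1);
```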

This VM has 56 vCPUs and 620 GB vRAM.

[Max Server Memory] is set to 590000 MB.
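For reference, the setting can be verified with sp_configure; a quick sketch (note the value is expressed in MB):

```sql
-- 'max server memory (MB)' is an advanced option.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)';
```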


12/20/2019 11:03:24,spid8s,Unknown,SQL Server shutdown has been initiated      
12/20/2019 11:03:24,spid8s,Unknown,SQL Trace was stopped due to server shutdown. Trace ID = '1'. This is an informational message only; no user action is required.      
12/20/2019 11:03:25,Server,Unknown,Microsoft SQL Server 2017 (RTM-CU17) (KB4515579) - 14.0.3238.1 (X64) <nl/> Sep 13 2019 15:49:57 <nl/> Copyright (C) 2017 Microsoft Corporation<nl/> Enterprise Edition: Core-based Licensing (64-bit) on Windows Server 2016 Standard 10.0 <X64> (Build 14393: ) (Hypervisor)   
12/20/2019 11:03:25,Server,Unknown,UTC adjustment: -6:00      
12/20/2019 11:03:25,Server,Unknown,(c) Microsoft Corporation.      
12/20/2019 11:03:25,Server,Unknown,All rights reserved.      
12/20/2019 11:03:25,Server,Unknown,Server process ID is 12764.      
12/20/2019 11:03:25,Server,Unknown,System Manufacturer: 'VMware<c/> Inc.'<c/> System Model: 'VMware Virtual Platform'.      
12/20/2019 11:03:25,Server,Unknown,Authentication mode is MIXED.      
12/20/2019 11:03:25,Server,Unknown,Logging SQL Server messages in file 'C:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\MSSQL\Log\ERRORLOG'.      
12/20/2019 11:03:25,Server,Unknown,The service account is '***redacted***'. This is an informational message; no user action is required.      
12/20/2019 11:03:25,Server,Unknown,Registry startup parameters: <nl/>  -d C:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\MSSQL\DATA\master.mdf<nl/>  -e C:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\MSSQL\Log\ERRORLOG<nl/>  -l C:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\MSSQL\DATA\mastlog.ldf<nl/>  -T 4199<nl/>  -T 4139<nl/>  -T 3226
12/20/2019 11:03:25,Server,Unknown,Command Line Startup Parameters:<nl/>  -s "MSSQLSERVER"
12/20/2019 11:03:25,Server,Unknown,SQL Server detected 28 sockets with 2 cores per socket and 2 logical processors per socket<c/> 56 total logical processors; using 56 logical processors based on SQL Server licensing. This is an informational message; no user action is required.

When [Max Server Memory] can be attained by SQL Server, [Target Server Memory] will be equal to [Max Server Memory].  If memory conditions external to SQL Server make [Max Server Memory] unattainable, [Target Server Memory] will be adjusted downward to an attainable value.  As workload is placed on the system, [Total Server Memory] grows toward [Target Server Memory].  That's typical, expected behavior.
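Besides perfmon, the same two values can be watched from inside SQL Server; a sketch against sys.dm_os_performance_counters (counter values are reported in KB):

```sql
-- Compare [Target Server Memory] and [Total Server Memory] for the instance.
SELECT counter_name, cntr_value AS kb
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Memory Manager%'
  AND counter_name IN (N'Target Server Memory (KB)',
                       N'Total Server Memory (KB)');
```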

In this case, the story stays boring until after the following midnight.  There wasn't enough workload to drive much growth of [Total Server Memory] until about 2:15 am, after which [Total Server Memory] grew fairly rapidly.  [Total Server Memory] reached a plateau between 3:00 am and 3:15 am, and then stepped slightly down later.  [Target Server Memory] was never attained.  That's curious.


[Total Server Memory] is the sum of [Database Cache Memory], [Free Memory], and [Stolen Memory] for the instance.  Here's a graph of those three memory categories over the same time period.
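That identity can be spot-checked directly, since the three component counters live in the same Memory Manager object; a hedged sketch:

```sql
-- [Total Server Memory (KB)] should equal the sum of the other three counters.
SELECT counter_name, cntr_value AS kb
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Memory Manager%'
  AND counter_name IN (N'Database Cache Memory (KB)',
                       N'Free Memory (KB)',
                       N'Stolen Server Memory (KB)',
                       N'Total Server Memory (KB)');
```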


It really is quite curious that [Total Server Memory] never reached [Target Server Memory] and in fact eventually backed *farther off* from [Target Server Memory].  If external factors made [Target Server Memory] unattainable, I expect [Target Server Memory] to be adjusted downward.  But that didn't happen.

I'm a very curious guy, but I don't spend days' worth of time creating and staring at graphs of the same short time period simply out of curiosity.  This morning time period encompassed 16 SQL Server out-of-memory errors (OOMs).  I want to be able to diagnose them, and find out when the contributing behavior was introduced, in order to advise organizations that are planning SQL Server upgrades - or that fall prey to the type of OOMs experienced on this instance.

I'm going to tighten the timeframe for the remaining graphs to 2:30 am to 5:00 am, in order for the individual errors to be discernible on the graphs.

Here are the 16 OOMs, each marked with a vertical red line in the graph below.  Interestingly, they occur during the plateau of [Total Server Memory].  But... SQL Server is still a long way off from reaching [Target Server Memory]... why not just grow SQL Server's memory share?

As I mentioned before, if external memory conditions made [Target Server Memory] unattainable, I expect [Target Server Memory] to be lowered.  That didn't happen, hinting that sufficient memory was still present for [Total Server Memory] to grow.

What does perfmon have to say about [Available Memory]?  Does it corroborate my conclusion that SQL Server *could have* continued to grow [Total Server Memory]?  Sure 'nuf.  At the time of *all 16* of the OOMs, there was over 100GB of [Available Memory], nicely divided across the 2 vNUMA nodes.

Those familiar with my memory graphs on this blog and in my twitter activity as @sqL_handLe may note that I'm usually much more preoccupied with [Free & Zero Page List Memory] than with [Available Memory].  Depending on the memory consumers and activity on a VM, there may be considerable memory accounted as [Available Memory] which is not [Free & Zero Page List Memory].  If a memory allocation for a consumer must be zeroed before being handed over, it's gotta come from [Free & Zero Page List Memory] (which I often affectionately call FZPL).

Thankfully in this case, FZPL is nearly all of [Available Memory] and is similarly well-divided between the vNUMA nodes.
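FZPL itself is a Windows perfmon counter, but server-wide available memory can also be spot-checked from inside SQL Server; a sketch:

```sql
-- Windows-level available memory and the memory state as SQLOS sees it.
SELECT available_physical_memory_kb / 1024 AS available_mb,
       system_memory_state_desc
FROM sys.dm_os_sys_memory;
```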



So [Available Memory] and [Free & Zero Page List Memory] are plentiful and balanced between the two vNUMA nodes.  Can we see how much of SQL Server [Total Server Memory] is on each of the SQLOS nodes?  Yup.  Meet my friend [Total Node Memory].  Equally balanced across the two SQLOS nodes.



In fact, the categories of [Database], [Stolen], and [Free] SQLOS memory can be tracked at SQLOS node level, too.  Pretty nice balance on nodes 000 and 001.
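The per-node breakdown graphed here comes from the [SQLServer:Memory Node] counter object; an equivalent DMV sketch:

```sql
-- Database, Stolen, Free, and Total memory per SQLOS node (values in KB).
SELECT instance_name AS sqlos_node, counter_name, cntr_value AS kb
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Memory Node%'
  AND counter_name IN (N'Total Node Memory (KB)',
                       N'Database Node Memory (KB)',
                       N'Stolen Node Memory (KB)',
                       N'Free Node Memory (KB)')
ORDER BY instance_name, counter_name;
```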



Extracted from the SQL Server errorlog, here are the 16 OOMs from that night.


12/21/2019 03:15:27,spid83,Unknown,Failed to allocate BUFs: FAIL_BUFFER_ALLOCATION 1
12/21/2019 03:17:43,spid79,Unknown,Failed to allocate BUFs: FAIL_BUFFER_ALLOCATION 8
12/21/2019 03:24:02,spid51,Unknown,Failed to allocate BUFs: FAIL_BUFFER_ALLOCATION 8
12/21/2019 03:36:37,spid73,Unknown,Failed to allocate BUFs: FAIL_BUFFER_ALLOCATION 1
12/21/2019 03:42:54,spid89,Unknown,Failed to allocate BUFs: FAIL_BUFFER_ALLOCATION 1
12/21/2019 03:46:54,spid55,Unknown,Failed to allocate BUFs: FAIL_BUFFER_ALLOCATION 1
12/21/2019 03:50:36,spid57,Unknown,Failed to allocate BUFs: FAIL_BUFFER_ALLOCATION 1
12/21/2019 03:53:56,spid73,Unknown,Failed to allocate BUFs: FAIL_BUFFER_ALLOCATION 1
12/21/2019 03:55:57,spid73,Unknown,Failed to allocate BUFs: FAIL_BUFFER_ALLOCATION 1
12/21/2019 03:58:28,spid73,Unknown,Failed to allocate BUFs: FAIL_BUFFER_ALLOCATION 1
12/21/2019 04:00:39,spid52,Unknown,Failed to allocate BUFs: FAIL_BUFFER_ALLOCATION 1
12/21/2019 04:03:14,spid70,Unknown,Failed to allocate BUFs: FAIL_BUFFER_ALLOCATION 16
12/21/2019 04:05:34,spid74,Unknown,Failed to allocate BUFs: FAIL_BUFFER_ALLOCATION 1
12/21/2019 04:09:26,spid70,Unknown,Failed to allocate BUFs: FAIL_BUFFER_ALLOCATION 24
12/21/2019 04:15:44,spid107,Unknown,Failed to allocate BUFs: FAIL_BUFFER_ALLOCATION 8
12/21/2019 04:23:14,spid94,Unknown,Failed to allocate BUFs: FAIL_BUFFER_ALLOCATION 1
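Entries like these can be pulled from the errorlog without leaving SSMS; a sketch using xp_readerrorlog (first parameter is the log file number, 0 for the current log; second is 1 for the SQL Server errorlog):

```sql
-- Search the current errorlog for the buffer allocation failures.
EXEC master.dbo.xp_readerrorlog 0, 1, N'FAIL_BUFFER_ALLOCATION';
```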

So here's the summary of the story: 16 OOMs occurred in one night.  The instance hadn't reached [Target Server Memory], and there doesn't seem to be a good reason for [Total Server Memory] to *not* have grown rather than incur an OOM.  There was over 50GB of FZPL per vNUMA node, for goodness sake!  Memory use seems to have been quite evenly divided between the vNUMA nodes.

For the last two OOMs, at 4:15 am and 4:23 am, there was even a large amount of [Free] SQLOS memory within [Total Server Memory].  And there was plenty of [Free] SQLOS memory on either of the two SQLOS nodes!!  The possibility of SQLOS [Total] memory growth at the time of any of the 16 OOMs, and the additional presence of significant SQLOS [Free] memory on either SQLOS node during the last 2 OOMs fully convinces me that this is a bug.  (Well, that and not having seen this type of behavior in any of the SQL Server 2016 or 2019 versions I've worked with.)

When the dust clears from this, I think a bug similar to this one from SQL Server 2017 CU10 will be found.  I am upset with myself that in 5 months of staring down *this* behavior, I haven't been able to marshal the resources to get *this* bug properly identified so it can be corrected.

FIX: Out of memory error occurs even when there are many free pages in SQL Server
https://support.microsoft.com/en-us/help/4347088/fix-out-of-memory-error-when-there-are-free-pages-in-sql-server

The fix referenced above, KB4347088, is found in SQL Server 2017 CU10 and SQL Server 2016 SP2 CU3.  The problem described in this blog post shares some similarities, and has been observed on SQL Server 2017 CU17, CU18, and CU19.  As of April 16, 2020, this bug has not been identified and there is no fix imagined or planned.





Tuesday, November 19, 2019

SQL Server 2019 Scalar UDF inlining - OOM in some cases

*** Update 2020 April 7 ***

The issue described below was corrected in kb4536075, with the fix for the scalar UDF inlining OOMs first available in SQL Server 2019 CU2.

FIX: Scalar UDF Inlining issues in SQL Server 2019
https://support.microsoft.com/en-us/help/4538581/fix-scalar-udf-inlining-issues-in-sql-server-2019
SQL Server 2019 CU2
https://support.microsoft.com/en-us/help/4536075/cumulative-update-2-for-sql-server-2019

*** end update ***



Here's a little something I stumbled across.  A caution about scalar UDF inlining.

Well, ok, maybe Joe Obbish stumbled across it first :-)

Today's adventure is on my laptop.  Because, no lie, this ugly UDF combined with current UDF inlining memory consumption will take down your server no matter *how* big it is.

Here's my laptop SQL Server version and some important database details.


OK, let's create a scalar UDF with a few logic branches.  The function is nonsense, I'm sorry.  But if you try this... you can try making it as sensible as you'd like. :-)

In the function below there is one IF, 25 ELSE IFs, and 1 ELSE.


SET ANSI_NULLS ON
SET QUOTED_IDENTIFIER ON
GO
CREATE OR ALTER  FUNCTION dbo.test__inline_udf
(
  @in_param nvarchar(250)
)
RETURNS nvarchar(250)
AS
BEGIN
  DECLARE @retValue nvarchar(250) = @in_param
  IF @in_param = N'A'
  BEGIN
    IF @retValue NOT IN ( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17)  
   BEGIN SET @retValue = N'1' END
  END
  ELSE IF @in_param = N'B'
  BEGIN
    IF @retValue NOT IN ( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17) 
    BEGIN SET @retValue = N'2' END
  END
  ELSE IF @in_param = N'C'
  BEGIN
    IF @retValue NOT IN ( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17)
       BEGIN SET @retValue = N'3' END
  END
  ELSE IF @in_param = N'D'
  BEGIN
    IF @retValue NOT IN ( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17)
    BEGIN SET @retValue = N'4' END
  END 
  ELSE IF @in_param = N'E'
  BEGIN
    IF @retValue NOT IN ( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17)
       BEGIN SET @retValue = N'5' END
  END
  ELSE IF @in_param = N'F'
  BEGIN
    IF @retValue NOT IN ( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17)
       BEGIN SET @retValue = N'6' END
  END
  ELSE IF @in_param = N'G'
  BEGIN
    IF @retValue NOT IN ( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17)
       BEGIN SET @retValue = N'7' END
  END
  ELSE IF @in_param = N'H'
  BEGIN
    IF @retValue NOT IN ( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17)
       BEGIN SET @retValue = N'8' END
  END
  ELSE IF @in_param = N'I'
  BEGIN
    IF @retValue NOT IN ( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17)
       BEGIN SET @retValue = N'9' END
  END
  ELSE IF @in_param = N'J'
  BEGIN
    IF @retValue NOT IN ( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17)
       BEGIN SET @retValue = N'10' END
  END
  ELSE IF @in_param = N'K'
  BEGIN
    IF @retValue NOT IN ( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17)
       BEGIN SET @retValue = N'11' END
  END
  ELSE IF @in_param = N'L'
  BEGIN
    IF @retValue NOT IN ( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17)
       BEGIN SET @retValue = N'12' END
  END
  ELSE IF @in_param = N'M'
  BEGIN
    IF @retValue NOT IN ( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17)
       BEGIN SET @retValue = N'13' END
  END
  ELSE IF @in_param = N'N'
  BEGIN
    IF @retValue NOT IN ( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17)
       BEGIN SET @retValue = N'14' END
  END
  ELSE IF @in_param = N'O'
  BEGIN
    IF @retValue NOT IN ( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17)
       BEGIN SET @retValue = N'15' END
  END
  ELSE IF @in_param = N'P'
  BEGIN
    IF @retValue NOT IN ( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17)
       BEGIN SET @retValue = N'16' END
  END
  ELSE IF @in_param = N'Q'
  BEGIN
    IF @retValue NOT IN ( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17)
       BEGIN SET @retValue = N'17' END
  END
  ELSE IF @in_param = N'R'
  BEGIN
    IF @retValue NOT IN ( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17)
       BEGIN SET @retValue = N'18' END
  END
  ELSE IF @in_param = N'S'
  BEGIN
    IF @retValue NOT IN ( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17)
       BEGIN SET @retValue = N'19' END
  END
  ELSE IF @in_param = N'T'
  BEGIN
    IF @retValue NOT IN ( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17)
       BEGIN SET @retValue = N'20' END
  END
  ELSE IF @in_param = N'U'
  BEGIN
    IF @retValue NOT IN ( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17)
       BEGIN SET @retValue = N'21' END
  END 
  ELSE IF @in_param = N'V'
  BEGIN
    IF @retValue NOT IN ( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17)
       BEGIN SET @retValue = N'22' END
  END
  ELSE IF @in_param = N'W'
  BEGIN
    IF @retValue NOT IN ( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17)
       BEGIN SET @retValue = N'23' END
  END
  ELSE IF @in_param = N'X'
  BEGIN
    IF @retValue NOT IN ( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17)
       BEGIN SET @retValue = N'24' END
  END
  ELSE IF @in_param = N'Y'
  BEGIN
    IF @retValue NOT IN ( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17)
       BEGIN SET @retValue = N'25' END
  END
  ELSE IF @in_param = N'Z'
  BEGIN
    IF @retValue NOT IN ( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17)
       BEGIN SET @retValue = N'26' END
  END
  ELSE
  BEGIN SET @retValue = N'00' END
  RETURN @retValue
END


On my laptop, the following query repeatedly took 8 to 10 seconds to return its result.



I'll tell you what.  I ran this same function on a VM with 930 GB vRAM, with [Max Server Memory] set to 690 GB.  It ran for over 15 minutes before crashing.  Not just crashing itself.  Crashing any other query on the instance that was trying to allocate memory (e.g. stealing against a query memory grant).  It had amassed over 500 GB of optimizer memory at that point, like this...


Wow.  Now you'll notice that the big ol' server is a slightly different version.  No matter.  As far as I know, this behavior will be found on every version of SQL Server 2019 to date (today is November 19, 2019).

Once the instance reached the total limit for "stealable" memory, the query crashed.  Same thing if an estimated plan was requested - so it's during compilation rather than execution that the aggressive memory consumption occurs.  Once the OOM occurs, the large amount of optimizer memory is freed within a few seconds and the instance is back to normal for all other purposes.
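The optimizer memory in question shows up under the SQLOPTIMIZER memory clerk, so its growth during compilation can be watched with a sketch like this:

```sql
-- Optimizer memory clerk usage, per memory node.
SELECT [type], memory_node_id,
       pages_kb / 1024 AS pages_mb
FROM sys.dm_os_memory_clerks
WHERE [type] = N'MEMORYCLERK_SQLOPTIMIZER'
ORDER BY pages_kb DESC;
```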

Now, if I disable UDF inlining... the following result comes in well under 1 second.
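Scalar UDF inlining can be disabled at more than one scope in SQL Server 2019; both options below are documented syntax (the query shown is just this post's test function):

```sql
-- Disable scalar UDF inlining for the whole database:
ALTER DATABASE SCOPED CONFIGURATION SET TSQL_SCALAR_UDF_INLINING = OFF;

-- Or per query, with a hint:
SELECT dbo.test__inline_udf(N'A')
OPTION (USE HINT('DISABLE_TSQL_SCALAR_UDF_INLINING'));
```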


Here's the final thing I can say about this for now...
If you generate an estimated plan for a query that tries to inline that UDF, it'll also crash due to excessive optimizer memory*.

I'll update this blog post in the future when a fix is available.


* Well, I speculate that there is some amount of memory which may be sufficient to allow this to complete without generating an OOM.  But once it's more than 500 GB, does it really matter?
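Whether the optimizer even considers a given function for inlining can be checked in sys.sql_modules (the is_inlineable column is new in SQL Server 2019); a sketch against the test function above:

```sql
SELECT OBJECT_NAME(object_id) AS function_name,
       is_inlineable, inline_type
FROM sys.sql_modules
WHERE object_id = OBJECT_ID(N'dbo.test__inline_udf');
```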


~~~~~

The scalar UDF used above is really, really bad :-)

So here's a nicer one.


SET ANSI_NULLS ON
SET QUOTED_IDENTIFIER ON
GO
CREATE OR ALTER  FUNCTION dbo.test__inline_udf_2
(
  @in_param INT
)
RETURNS INT
AS
BEGIN
  DECLARE @retValue INT = @in_param
  IF @in_param = 1
  BEGIN
    IF @retValue IN 
  ( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
   17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32)
  SET @retValue = 1
  END
  ELSE IF @in_param = 2
  BEGIN
    IF @retValue IN 
  ( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
   17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32)
   SET @retValue = 2
   END
   ELSE IF @in_param = 3
  BEGIN
    IF @retValue IN 
  ( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
   17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32)
   SET @retValue = 3
   END
   ELSE IF @in_param = 4
   BEGIN
     IF @retValue IN 
  ( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
   17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32)
   SET @retValue = 4
   END
   ELSE IF @in_param = 5
   BEGIN
     IF @retValue IN 
  ( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
   17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32)
   SET @retValue = 5
  END
  ELSE SET @retValue = 0
  RETURN @retValue
END

No more implicit conversions.  Took out some unnecessary BEGIN-END pairs.  It's only one IF, four ELSE IFs, and an ELSE.  Changed the NOT IN clauses to IN clauses.

And when I run it on my monster VM, it also generates an error.


That looks even more severe than the previous error.  But it's the same underlying condition: working on the inlining of the scalar UDF during plan compilation kept gobbling optimizer memory until something gave way.

Msg 701, Level 17, State 123, Line 6
There is insufficient system memory in resource pool 'default' to run this query.
Location:
Expression: false
SPID: 61
Process ID: 4916
Msg 3624, Level 20, State 1, Line 1
A system assertion check has failed.
Msg 596, Level 21, State 1, Line 0
Cannot continue the execution because the session is in the kill state.
Msg 0, Level 20, State 0, Line 0
A severe error occurred on the current command.

Sunday, June 16, 2019

SQL Server Query Store: Usability Soft Spots -- Part 2

Part 2 Summary: the stored procedure was modified to no longer use "SELECT... INTO" with a table from tempdb, and to truncate the results table with each iteration rather than drop the table.  Of the initial 19 plan_id values for the test query, only 2 would successfully force from Query Store.  After trying to force all of the plans (including the surprise plans which first showed up and were used while another plan was forced), from the 34 total plan_id values, the tally stood at 17 plans which would successfully force, and 17 plans which would not.

Others in the series:
SQL Server Query Store: Usability Soft Spots -- Part 1
https://sql-sasquatch.blogspot.com/2019/06/sql-server-query-store-usability-soft.html

~~~~~

Yesterday I started a blog post series.

SQL Server Query Store: Usability Soft Spots -- Part 1 
https://sql-sasquatch.blogspot.com/2019/06/sql-server-query-store-usability-soft.html

One of my goals for problem repros is to eliminate as many extraneous vulnerable points as possible.  For example, when reproducing a problem with Query Store plan forcing, I want to remove as many known or suspected plan forcing issues as possible, other than the issue the repro is meant to draw attention to.

Erin Stellato pointed out that temp tables are associated with some atypical plan forcing behavior.  Dropping of tables, indexes, or statistics can also result in atypical plan forcing behavior.

For these reasons I've reworked the repro which I'll be building on in these blog posts.

In my initial blog post, the test query used "SELECT... INTO" with a temp table.

The new version of the repro contains a CREATE TABLE statement before the test query, and no longer uses a temp table.

I'm still creating the test database the same way.


USE [master]
GO
CREATE DATABASE [surprise_plan]
 ON  PRIMARY 
( NAME = N'surprise_plan', FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL15.CTP_3_0__2019\MSSQL\DATA\surprise_plan.mdf' , SIZE = 139264KB , MAXSIZE = UNLIMITED, FILEGROWTH = 65536KB )
 LOG ON 
( NAME = N'surprise_plan_log', FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL15.CTP_3_0__2019\MSSQL\DATA\surprise_plan_log.ldf' , SIZE = 270336KB , MAXSIZE = 2048GB , FILEGROWTH = 65536KB )
 WITH CATALOG_COLLATION = DATABASE_DEFAULT
GO
ALTER DATABASE [surprise_plan] SET  MULTI_USER 
GO
ALTER DATABASE [surprise_plan] SET QUERY_STORE = ON
GO
ALTER DATABASE [surprise_plan] SET QUERY_STORE (OPERATION_MODE = READ_WRITE, CLEANUP_POLICY = (STALE_QUERY_THRESHOLD_DAYS = 30), DATA_FLUSH_INTERVAL_SECONDS = 60, INTERVAL_LENGTH_MINUTES = 1, MAX_STORAGE_SIZE_MB = 1000, QUERY_CAPTURE_MODE = ALL, SIZE_BASED_CLEANUP_MODE = AUTO)
GO
ALTER DATABASE [surprise_plan] SET  READ_WRITE 
GO

The stored procedure follows the same type of strategy.  Note there are now two parameters.  @testQuery_reps indicates how many instances of the test query will be executed.  In addition to the test query itself, the stored procedure executes some additional queries to increase the rate of change in the Query Store structures.  @filler_reps indicates how many times the 'filler query' section of code will execute.

USE [surprise_plan]
GO
CREATE OR ALTER PROCEDURE [dbo].[surprise_plan2test] @testQuery_reps INT, @filler_reps INT = 2
AS
SET NOCOUNT ON
DROP TABLE IF EXISTS sq__surprise_plan_tests;
CREATE TABLE sq__surprise_plan_tests(
     query_id BIGINT,
     query_hash BINARY(8),
     plan_id BIGINT,
     query_plan_hash BINARY(8),
     runtime_stats_id BIGINT,
     count_executions BIGINT);
 
DECLARE @ij INT = 0, @query_count INT, @sqlcmd2 NVARCHAR(MAX), @sqlcmd NVARCHAR(MAX) =
'INSERT INTO sq__surprise_plan_tests
 SELECT /*surprise2test*/ qsq.query_id, qsq.query_hash, qsp.plan_id,
        qsp.query_plan_hash, qsrs.runtime_stats_id, qsrs.count_executions
 FROM sys.query_store_runtime_stats qsrs
 JOIN sys.query_store_plan qsp ON qsrs.plan_id = qsp.plan_id
 JOIN sys.query_store_query qsq ON qsp.query_id = qsq.query_id
 JOIN sys.query_store_query_text qsqt ON qsq.query_text_id = qsqt.query_text_id
 WHERE qsqt.query_sql_text LIKE ''%surprise2test%''
 ORDER BY qsrs.runtime_stats_id DESC;' 
 
WHILE @testQuery_reps > 0
BEGIN
     TRUNCATE TABLE sq__surprise_plan_tests;
     DBCC FREEPROCCACHE WITH no_infomsgs;
     EXEC (@sqlcmd);
     SET @ij = 0;
     WHILE @ij < @filler_reps
     BEGIN
          SELECT TOP 1 
@sqlcmd2 = 'SELECT TOP (1) plan_id INTO #tmp 
                             FROM SYS.QUERY_STORE_RUNTIME_STATS 
                             WHERE runtime_stats_interval_id = '
                             + CONVERT(NVARCHAR(10), ((query_id - 1) * @testQuery_reps) + @ij) + ';
                             DROP TABLE IF EXISTS #tmp;'
          FROM sys.query_store_query
          WHERE @testQuery_reps = @testQuery_reps AND @ij = @ij
          ORDER BY query_id DESC;
 
EXEC (@sqlcmd2);
          SET @ij = @ij + 1;
     END 
     SELECT @testQuery_reps = @testQuery_reps - 1;
END
  
SELECT * FROM sq__surprise_plan_tests
ORDER BY runtime_stats_id DESC;

Ok.  Let's execute the test query 1600 times, with each iteration of the test query submission accompanied by two iterations of the filler query section of code.


EXEC [dbo].[surprise_plan2test] @testQuery_reps = 1600, @filler_reps = 2;

The result... 19 unique plan_id values for the query from 1600 executions.



So let's take a look in the "Tracked Queries" Query Store activity.


Interesting that the order of plan_id values in the right-hand legend is immediately sensible.  Let's see what happens when I force plan_id 2...



I execute the test query again twice, with one time 'round for the filler queries.  I could execute the test query a single time, but I like it when the new plans show up in the query results the test query returns... due to timing issues, that is reliable when the test query is executed twice, but not when it is executed only once.


EXEC [dbo].[surprise_plan2test] @testQuery_reps = 2, @filler_reps = 1;

Well look at that! At the top of the results list is plan 3220 now.


Let's refresh the "Tracked Queries" activity.

Ohhhh.  I forced plan_id 2 (in the purple box below) but what showed up was plan_id 3220 (in the yellow box below).


Ok, that's weird.  Did a last_force_failure_reason_desc show up for plan_id 2?


SELECT plan_id, query_id, engine_version, compatibility_level,  
       is_forced_plan, last_force_failure_reason_desc
FROM sys.query_store_plan WHERE plan_id = 2;



Hm.  So I forced plan_id 2, but plan_id 3220 showed up instead.  And there is no last_force_failure_reason_desc for plan_id 2.

Kendra Little uses the term “morally equivalent plan” in some excellent blog posts, mentioning that the term likely came from Conor Cunningham via Erland Sommarskog.
I've listed those blog posts below, as well as a related Azure Feedback suggestion.

Forced Plan Confusion: Is_Forced vs Use Plan = True

https://sqlworkbooks.com/2018/03/forced-plan-confusion-is_forced-vs-use-plan-true/

What is a “Morally Equivalent Execution Plan,” and Why Is It Good?
https://sqlworkbooks.com/2018/03/what-is-a-morally-equivalent-execution-plan-and-why-is-it-good/

Indicate “morally equivalent” forced plan where use_plan = true but is_forced = 0
https://feedback.azure.com/forums/908035-sql-server/suggestions/33562136-indicate-morally-equivalent-forced-plan-where-us

So... something funky happened when I tried to force plan_id 2.  What if I try forcing the other plans for query_id 2?
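Forcing and unforcing each plan can be scripted rather than clicked through in the activity; the documented procedures, shown here with this post's query_id 2 and plan_id 2:

```sql
-- Force plan_id 2 for query_id 2...
EXEC sp_query_store_force_plan @query_id = 2, @plan_id = 2;

-- ...and unforce it before trying the next plan_id.
EXEC sp_query_store_unforce_plan @query_id = 2, @plan_id = 2;
```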

Here's a grid of the plan_id values selected for execution after forcing each originally available plan_id for query_id 2.




Of 19 query plans available from the first 1600 executions of the test query, only two plan_id values were able to be successfully forced: 292 and 501.  Seventeen of the original plan_id values were not selected by the optimizer after they were forced from Query Store. That seems like the cards are stacked against me.

But as the plan_id values were forced, new plans appeared.  Plenty of them LoL.  If I add in the new plans and attempt to force them as well...



So, with all 34 of the plans in scope, it's a 50-50 chance.  17 of the original plans wouldn't force, 2 of them would.  Add in 15 surprise plans, and there are 17 plans that won't force and 17 plans that do force successfully. And the "Tracked Queries" activity now looks like...


 
Ok.  So with this example I hope I've shown that, at least through SQL Server 2019 CTP 3.0, encountering a “morally equivalent plan” choice after forcing a plan in Query Store might not be as rare as commonly held.

If you're still with me, I've still got a little ways to go with this one... the next installment won't be until at least tomorrow, though.

Thanks for reading!

Ciao!!

Q

Saturday, June 15, 2019

SQL Server Query Store: Usability Soft Spots -- Part 1

Part 1 Summary: it's pretty easy to get lots of Query Store plan_id values per query_id, and sometimes the sort order of plan_id values in "Tracked Queries" is downright unhelpful.

SQL Server Query Store: Usability Soft Spots -- Part 2
https://sql-sasquatch.blogspot.com/2019/06/sql-server-query-store-usability-soft_16.html
~~~~

For now, I'm working with a Developer Edition SQL Server 2019 CTP 3.0 installation, on my laptop.

No global trace flags enabled.  Pretty vanilla.

 
OK, let's create a database just for this series of blog posts. No tricks up my sleeve, enable Query Store out the gate, change QUERY_CAPTURE_MODE from default AUTO to ALL, DATA_FLUSH_INTERVAL_SECONDS = 60 and INTERVAL_LENGTH_MINUTES = 1 because these will be pretty quick tests.

USE [master]
GO
CREATE DATABASE [surprise_plan]
 ON  PRIMARY 
( NAME = N'surprise_plan', FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL15.CTP_3_0__2019\MSSQL\DATA\surprise_plan.mdf' , SIZE = 139264KB , MAXSIZE = UNLIMITED, FILEGROWTH = 65536KB )
 LOG ON 
( NAME = N'surprise_plan_log', FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL15.CTP_3_0__2019\MSSQL\DATA\surprise_plan_log.ldf' , SIZE = 270336KB , MAXSIZE = 2048GB , FILEGROWTH = 65536KB )
 WITH CATALOG_COLLATION = DATABASE_DEFAULT
GO
ALTER DATABASE [surprise_plan] SET  MULTI_USER 
GO
ALTER DATABASE [surprise_plan] SET QUERY_STORE = ON
GO
ALTER DATABASE [surprise_plan] SET QUERY_STORE (OPERATION_MODE = READ_WRITE, CLEANUP_POLICY = (STALE_QUERY_THRESHOLD_DAYS = 30), DATA_FLUSH_INTERVAL_SECONDS = 60, INTERVAL_LENGTH_MINUTES = 1, MAX_STORAGE_SIZE_MB = 1000, QUERY_CAPTURE_MODE = ALL, SIZE_BASED_CLEANUP_MODE = AUTO)
GO
ALTER DATABASE [surprise_plan] SET  READ_WRITE 
GO

Now let's create stored procedure [surprise_plan1test] for some fun.  Nothing too clever here.  A query is executed against query store as dynamic SQL.  Do it as dynamic SQL so the offset is predictable and it always matches to the same Query Store query_id as long as context_settings_id stays the same.


USE [surprise_plan]
GO
CREATE OR ALTER   PROCEDURE [dbo].[surprise_plan1test] @max_reps INT
AS
SET NOCOUNT ON
DECLARE @ij INT = 0, @query_count INT, @sqlcmd2 NVARCHAR(MAX), @sqlcmd NVARCHAR(MAX) =
'SELECT /*surprise1test*/ qsq.query_id, qsq.query_hash, qsp.plan_id, qsp.query_plan_hash, qsrs.runtime_stats_id,
                      r_last_execution_time = MAX(qsrs.last_execution_time), r_count_executions = SUM(qsrs.count_executions)
 INTO ##results
 FROM sys.query_store_runtime_stats qsrs
 JOIN sys.query_store_plan qsp ON qsrs.plan_id = qsp.plan_id
 JOIN sys.query_store_query qsq ON qsp.query_id = qsq.query_id
 JOIN sys.query_store_query_text qsqt ON qsq.query_text_id = qsqt.query_text_id
 WHERE qsqt.query_sql_text LIKE ''%surprise1test%''
 GROUP BY qsq.query_id, qsq.query_hash, qsp.plan_id, qsp.query_plan_hash, qsrs.runtime_stats_id
 ORDER BY qsrs.runtime_stats_id DESC, r_last_execution_time DESC;'
SELECT @query_count = COUNT(*) FROM sys.query_store_query;
WHILE @ij < @max_reps
BEGIN
DROP TABLE IF EXISTS ##results;
DBCC FREEPROCCACHE WITH no_infomsgs;
EXEC (@sqlcmd);
SET @sqlcmd2 = 'SELECT TOP (1) * 
INTO #tmp 
FROM SYS.QUERY_STORE_RUNTIME_STATS 
WHERE runtime_stats_interval_id = ' + CONVERT(NVARCHAR(10), @query_count + @ij) + ';
DROP TABLE IF EXISTS #tmp;'
EXEC (@sqlcmd2);
EXEC (@sqlcmd2);
SET @ij = @ij +1;
END
SELECT * FROM ##results
ORDER BY runtime_stats_id DESC;



EXEC [dbo].[surprise_plan1test] @max_reps = 1600


All right.  Let's head over to the Query Store activities. Since I know I'm interested in query_id 4, that's the one I'll pull up.


Yeah. That's a lotta plans in the right-hand legend.  22 of them.  In a not very helpful order.  In fact... though I've tried to figure it out, I don't know what ordering is used for that right-hand legend.  It's not chronological.  It's not based on duration, which is the metric displayed by the graph. I dunno.

Usually, after a few more executions of query_id 4 and a few refreshes of this activity, the right-hand legend becomes sorted in descending plan_id from top to bottom. And that's pretty durable.  But sometimes it switches back to this nonsensical (to me) ordering.
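When the legend ordering gets confusing, I find it easier to pull the plan list straight from the catalog views.  A minimal sketch (query_id 4 is assumed here, matching the activity above):

```sql
-- List the plans Query Store has collected for one query,
-- sorted the way I wish the legend were: descending plan_id
SELECT qsp.plan_id,
       qsp.query_plan_hash,
       qsp.last_execution_time
FROM sys.query_store_plan qsp
WHERE qsp.query_id = 4
ORDER BY qsp.plan_id DESC;
```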

Got plenty more to show but I wanted to keep this blog post short as an introduction to what I'd be working with.

In the field I see this quite often: a query_id with over 20 plans showing up in the Tracked Queries activity.  And the unhelpful ordering of plans in the right-hand legend, while a minor and fleeting inconvenience, is usability soft spot number 1.
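To hunt for these many-plan queries without the GUI, a simple aggregate over sys.query_store_plan does the job.  This is just a sketch of the kind of query I use:

```sql
-- query_ids with more than 20 plans in this database's Query Store
SELECT qsp.query_id,
       plan_count = COUNT(*)
FROM sys.query_store_plan qsp
GROUP BY qsp.query_id
HAVING COUNT(*) > 20
ORDER BY plan_count DESC;
```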

For an update to the repro code and the next step in looking at Query Store Usability Soft Spots continue at...
SQL Server Query Store: Usability Soft Spots -- Part 2
https://sql-sasquatch.blogspot.com/2019/06/sql-server-query-store-usability-soft_16.html

Thanks for reading!

Ciao!!

Q

Sunday, February 17, 2019

SQL Server 2016 SP2 CU4/2017 CU13 Access Violation on Certain Autostats Updates

***** Updated 2019-02-18 This post was originally published on February 17, 2019. On February 18 it was updated with some details re: SQL Server 2016 SP2 CU3 and CU5 - the CUs immediately preceding and following the CU4 with the AV. The repro has been updated from repro version 58 to repro version 62. This version does a better job of giving results that can easily be compared across CU3, CU4, and CU5. *****



I spent a lot of time tracking this error down.

When I started investigating, the error was known only as an access violation, preventing some operations related to data cleansing or fact table versioning.

It occurred deep within a series of stored procedures.  The execution environment included cross-database DELETE statements, cross-database synonyms, lots of SELECT statements against system views, scalar UDFs, and lots and lots of dynamic SQL.

And... I don't have access to the four systems where the access violation occurred.

I was able to have some investigation performed on those systems - we learned that by disabling 'auto stats update' for the duration of the sensitive operations, the error was avoided.  We also learned that reverting a system from SQL Server 2016 SP2 CU4 to SP2 CU2 avoided the errors.  On those systems, reverting to SP2 CU2 or temporarily disabling 'auto stats update' were sufficient temporary mitigations.
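For reference, the 'auto stats update' mitigation mentioned above is a database-level setting.  A sketch - [YourDatabase] is a placeholder, and this trades away automatic stats maintenance for the whole database, so weigh the plan-quality risk before using it:

```sql
-- Temporary mitigation sketch: disable automatic stats updates,
-- run the sensitive operations, then re-enable.
ALTER DATABASE [YourDatabase] SET AUTO_UPDATE_STATISTICS OFF;
-- ...run the data cleansing / fact table versioning work here...
ALTER DATABASE [YourDatabase] SET AUTO_UPDATE_STATISTICS ON;
```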

So I set out to reproduce the errors on my laptop.  That took a while.  A pretty long while. 😱

And once I was able to reproduce the access violation, I generated 16 errors and suddenly couldn't get any more errors!  (I'd later learn that changing the database recovery model from 'simple' to 'full' was the reason the errors stopped occurring.)  I opened a premier support ticket with Microsoft describing the issue, attaching the sqldump files, and explaining I wasn't able to continue generating the errors.

Fortunately, only a few hours later I was able to start generating errors again by working through my process of reproducing the errors and simplifying the context much more carefully than before.

At the bottom of this post is a minimal reproduction of the access violation error in SQL Server 2016 SP2 CU4.  As is, the statements will create a new database t1 with the simple recovery model and 'auto stats update' enabled.  Those are the system-state requirements for the access violations.  If you recreate the error yourself, please be aware a memory dump will be created.

The access violation requires a stats object that is already created but empty: a stats object for a clustered index (or primary key clustered index) on a table that was just created or has been truncated.  The underlying column must be of the BIGINT data type.  With INT, the error does not occur (or at least doesn't occur until a much higher threshold of activity).

Within the same explicit transaction, at least 1024 rows are inserted into the table using the TABLOCK or TABLOCKX hint (without a hint no error will occur), and then a query requests the stats (resulting in the auto-update stats activity).  Then the access violation occurs.

If you'd like to follow along, here is the repro code.

DECLARE @version NVARCHAR(50) = N'SQL Server ' + CONVERT(NVARCHAR(20), SERVERPROPERTY('ProductVersion')) 
SELECT [version] = @version 
PRINT @version
USE [master]
GO
DECLARE @dpath NVARCHAR(200), @lpath NVARCHAR(200), @dbcmd NVARCHAR(1000);
SELECT @dpath = CONVERT(NVARCHAR(100), SERVERPROPERTY('InstanceDefaultDataPath')) + N't1.mdf';
SELECT @lpath = CONVERT(NVARCHAR(100), SERVERPROPERTY('InstanceDefaultLogPath')) + N't1_log.ldf';

SET @dbcmd = N'CREATE DATABASE [t1] ON PRIMARY (NAME = ''t1'', FILENAME = N''' + @dpath + N''', SIZE = 8192KB , MAXSIZE = 1GB, FILEGROWTH = 65536KB )'
SET @dbcmd = @dbcmd + N' LOG ON ( NAME = N''t1_log'', FILENAME = N''' + @lpath + N''', SIZE = 8192KB , MAXSIZE = 1GB , FILEGROWTH = 65536KB );'
SET @dbcmd = @dbcmd + N' ALTER DATABASE [t1] SET RECOVERY SIMPLE,  AUTO_UPDATE_STATISTICS ON,      AUTO_UPDATE_STATISTICS_ASYNC OFF;'
/*                                           RECOVERY SIMPLE ONLY, AUTO_UPDATE_STATISTICS ON ONLY, AUTO_UPDATE_STATISTICS_ASYNC ON or OFF   */
EXEC (@dbcmd)
GO
USE t1
GO
SELECT repro_id = 'x62';
PRINT 'repro_id = x62'
/* both DROP and CREATE TABLE can be replaced with
   IF OBJECT_ID('x') IS NOT NULL TRUNCATE TABLE x 
   access violation takes place for empty stat either way */

DROP TABLE IF EXISTS x
CREATE TABLE x  ( col1 BIGINT, -- No error if INT rather than BIGINT
                  INDEX ci_x CLUSTERED (col1) -- stats on CI or PKCI for access violation on auto stats update
--                index i_col1(col1)) -- nope, no av if stats accompany non-clustered index
                )
--CREATE STATISTICS s_col1 ON x(col1); -- nope, no error if stats on non-indexed column
                                       -- only auto stats update of stats on CI or PKCI seems to hit AV
                  
GO
--Transaction can include or exclude CREATE TABLE but must include INSERT and request for auto stats update
BEGIN TRANSACTION

  ;WITH u0 AS (SELECT num=1 UNION ALL SELECT num=1),             -- 2 rows
        u1 AS (SELECT num=1 FROM u0 AS n1 cross apply u0 AS n2), -- 2 * 2 = 4 rows
        u2 AS (SELECT num=1 FROM u1 AS n1 cross apply u1 AS n2), -- 4 * 4 = 16 rows
        u3 AS (SELECT num=1 FROM u2 AS n1 cross apply u2 AS n2), -- 16 * 16 = 256 rows
        u4 AS (SELECT num=1 FROM u3 AS n1 cross apply u1 AS n2), -- 256 * 4 = 1024 rows
        nums AS (SELECT num=ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM u4)
  INSERT INTO x WITH ( TABLOCKX ) -- TABLOCK or TABLOCKX to get av; no hint no error
  SELECT TOP (1024) num -- no av if less than 1024 rows
  FROM nums;

  --SELECT result = COUNT (DISTINCT col1) FROM x -- stats auto update in CU3, CU4 or CU5; but no av
  SELECT result = COUNT ( * ) FROM x -- stats auto update in 2016 SP2 CU4 but not CU3 or CU5 
  /* access violation takes place when auto update stats is attempted in 2016 SP2 CU4 */

  COMMIT TRAN
  DBCC SHOW_STATISTICS (x, ci_x) WITH STAT_HEADER, NO_INFOMSGS;
GO
/*
USE [master]
DROP DATABASE [t1]
*/


Here's what the output of the repro code above looks like on SQL Server 2016 SP2 CU3.  Not terribly exciting.  Note that at the end of execution, the stats for ci_x are still empty.  That makes sense - a count(*) from a table can't really benefit from stats.  It doesn't have an operator accumulating results over time or anything else requiring a memory grant.  It doesn't have a join operator requiring a row estimate.



Indeed, the results look identical in SQL Server 2016 SP2 CU5.  No auto stats update, no error, stats still empty as the code completes.


But what happens in SQL Server 2016 SP2 CU4?  Just after the rows are inserted - and before the results of the count(*) are returned - the query encounters an access violation.



The exception.log file in the SQL Server error log directory looks nice and orderly from these access violations :-)



And here's the short stack dump included in the SQLDump****.txt files.




It really looks like SQL Server 2016 SP2 CU4 is a loner in this regard - asking for an auto stats update in a context where the stats are immaterial to the query itself.

We'll see.  If there is a fix forthcoming in SQL Server 2016 SP2 CU6 or beyond, I'll update this post with a reference to it.  Perhaps instead I'll learn that this issue was coincidentally fixed in SQL Server 2016 SP2 CU5.  I've looked through the CU5 KB article a few times and didn't see anything I thought was related.

I tried to document within the repro code some potential 'bandaid' workarounds if you run into this access violation - just something to tide you over till you can upgrade to a fixed version.  For example, switching to the full recovery model or disabling auto stats update.  Unfortunately, inoculating against these AVs at the system level may come at too great a risk in some cases.  There are workarounds possible in the code as well.  Eliminating hints in the insert might work without unreasonable cost.  Alternatively, a manual stats update before the count(*) may be acceptable.
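A sketch of that last code-level workaround, against the repro's table: updating the stats manually inside the transaction means there is no longer an empty stats object for the count(*) to trip an auto update on.

```sql
-- Code-level workaround sketch: refresh the stats on ci_x manually
-- before the SELECT, so no auto stats update is triggered.
UPDATE STATISTICS x ci_x;
SELECT result = COUNT(*) FROM x;
```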

All right - that's it for now! Ciao!! 


***** Addendum 20190220 *****
I decided to check SQL Server 2017 as well.  CU11 - no error.  CU12 - no error.
CU13...


So be on the lookout for this access violation in SQL Server 2016 SP2 CU4 and SQL Server 2017 CU13.