Thursday, May 23, 2019

SQL Server 2016: Here a NUMA, there a NUMA... Part 2

This post is a continued look at some memory behaviors explored earlier in the following blog post.

SQL Server 2017: here a NUMA, there a NUMA... Part 1

The previous blog post gives detail around a significant performance degradation event experienced on a 4 node perf/scale test system. This blog post instead details observations from a production 4 node system.

4x14 vcpus, 1472 total GB vRAM (368 GB per vNUMA node).

[Max server memory] is set to 1375000 MB.

See if you can spot the unexpected system state below.  🙂

The [db cache] on node 000 is ginormous!!

There is only 368 GB of vRAM on the vNUMA node.  Node 000 is accounting for [db cache] that alone approaches the full amount of vRAM on 3 vNUMA nodes!!

How does that happen if instance wide [total] memory stays constant at [max server memory], and [total] memory stays constant on the other 3 NUMA nodes at [target]?

We've got memory counted twice.  Comparing [stolen] memory from all 4 nodes to the instance-wide measure of [stolen] memory shows that something's up 🙂  On this system, nodes 001, 002, and 003 each have a portion of their memory which they account for as [stolen] but which node 000 accounts for as [db cache].
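If you'd like to check this accounting on another system, the per-node and instance-wide numbers are both exposed in sys.dm_os_performance_counters. A minimal sketch (assuming a default instance, whose counter object names begin with 'SQLServer:'; a named instance uses 'MSSQL$InstanceName:' instead):

```sql
-- Compare the sum of per-node [stolen] memory to the instance-wide number.
SELECT 'sum of nodes' AS source,
       SUM(cntr_value) AS stolen_kb
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Memory Node%'
  AND counter_name = 'Stolen Node Memory (KB)'
UNION ALL
SELECT 'instance-wide', cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Memory Manager%'
  AND counter_name = 'Stolen Server Memory (KB)';
```

When the two rows disagree, some memory is being accounted for on one node as [stolen] and on another as [db cache].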

There were no free list stalls during this period, thankfully.  Somehow the instance is able to eke along, keeping a small amount of free memory throughout.

But there is still a cost to pay for this unusual memory state.

I suspect that the cost of cross-node accounting and double-counting the memory is in part reflected by latch waits.

Wednesday, May 15, 2019

MADPASS 2019 May 15 - perfmon black box & AMA

Nothing to see here yet :-)

This is a place holder for a follow-up blog post to tonight's 5 minute intro to the value of a locally run perfmon black box, followed by the AMA panel with myself, Joe Obbish, and John Eisbrener.

Once there's something of interest here I'll tweet the link to give it a proper introduction.

Windows groups Performance Monitor Users and Performance Log Users.
Error message when you try to access the Performance Monitor (Perfmon.exe) on a remote computer: "Access Is Denied"

How to create a “black box” performance counter data collector
(Clint Huffman, 2014 May 23)

My m.o. is to use a counters file. I typically keep the counters file in the same directory as my perfmon start script(s).

For example this text file lives at C:\perfmon\blackbox__counters.txt.

Note the final two lines - the date and my name.  Because they don't match legitimate counters, these lines are ignored; I use this for commenting in my counters files.

\NUMA Node Memory(*)\Free & Zero Page List MBytes
\Paging File(_Total)\% Usage
\Memory\Pages Input/sec
\Memory\Pages Output/sec
\Network Interface(*)\Bytes Received/sec 
\Network Interface(*)\Bytes Sent/sec
\Processor Information(*,_Total)\% Processor Time
\LogicalDisk(*)\Current Disk Queue Length
\Lonny Niederstadt -- Blackbox

The start script can be run interactively ("Run as administrator"), but I also enjoy setting scripts like this to run on system start with schtasks.  It lives at C:\perfmon\blackbox__start.bat

:: local perfmon blackbox Lonny Niederstadt 20190515
setlocal enableextensions
cd /d "%~dp0"
logman.exe stop blackbox_%COMPUTERNAME%
logman.exe delete blackbox_%COMPUTERNAME%
logman.exe create counter blackbox_%COMPUTERNAME% -f csv -v mmddhhmm -max 500 -cf blackbox__counters.txt -o "C:\perfmon\blackbox_%computername%" -si 00:00:30 -cnf 24:00:00 -rf 672:00:00
logman.exe start blackbox_%COMPUTERNAME%
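For the "run on system start" piece, a schtasks registration along these lines works; the task name here is just an example:

```bat
:: register the blackbox start script to run at boot (task name is an example)
schtasks /create /tn "perfmon_blackbox" /tr "C:\perfmon\blackbox__start.bat" /sc onstart /ru SYSTEM /rl highest /f
```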

The real magic is in the "logman create counter" command.

Monday, May 13, 2019

SQL Server 2017: here a NUMA, there a NUMA... Part 1

aka Performance Bamboozle

This blog post considers a perf/scale testing system.  See a following blog post for similar concerns from a SQL Server 2016 production system.
SQL Server 2016: Here a NUMA, there a NUMA... Part 2

Let's observe a workload on SQL Server 2017 CU13, running on a 4x24 vcpu VM with 900 GB vRAM. [Max server memory] is set to 750 GB.  From 5:00 am to 12:30 pm [target server memory] is equal to [max server memory]; below I will refer to [target] only.

The yellow box marked in the graphs below is unusual.  Although CPU utilization generally trends with both active requests and parallel workers in the [default] workload group, CPU utilization is far lower in the yellow box than predicted by the number of active requests or parallel workers.  The end of the yellow box coincided with an intervention step.  My quick-thinking colleague Joe Obbish issued a [dbcc dropcleanbuffers] and the recovery was nothing short of miraculous.  But certainly that's not an operation to execute frequently on a busy system.  I'll be referring to the yellow box as a "performance bamboozle" for the remainder of this blog post.  I've gotta do something to keep myself amused.  What went wrong - and why did [dbcc dropcleanbuffers] help?

Only two DOP 8 queries ran for the majority of the yellow box.  Let's compare their Query Store details to those of similar queries with the same query_hash values.  All eight of the query executions detailed in the following 2 result sets ran at DOP 8; the particular queries of interest are highlighted in red.

Wow.  All 8 queries returned a single row.  All 8 had similar CPU ms, physical reads, logical reads, tempdb use, query memory use. But those two highlighted in red had elapsed time roughly 20 to 30 times that of the other queries. What happened?

The workers for these queries spent the majority of the yellow box time in sleep_task waits.  sleep_task is often ignored as a benign wait.
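To catch a query in sleep_task waits while it's happening, the waiting-tasks DMV can be joined to active requests. A quick sketch:

```sql
-- Which active requests are currently waiting on SLEEP_TASK, and for how long?
SELECT r.session_id,
       r.command,
       wt.wait_type,
       wt.wait_duration_ms
FROM sys.dm_os_waiting_tasks AS wt
JOIN sys.dm_exec_requests AS r
  ON wt.session_id = r.session_id
WHERE wt.wait_type = N'SLEEP_TASK'
ORDER BY wt.wait_duration_ms DESC;
```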

Read more about sleep_task waits in the
SQLSkills SQL Server Wait Type Library

Though perfmon includes details of several wait types in [\SQLServer:Wait Statistics(*)\*], sleep_task is not among them.

But perfmon offers other evidence of what was happening in those 2 queries.  Check out those free list stalls!!

The free list stalls call to mind the instance-wide perfmon counter [\SQLServer:Memory Manager\Free Memory (KB)].
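Both of those measures can also be sampled from within SQL Server rather than perfmon. Note that 'Free list stalls/sec' is a cumulative counter in the DMV, so computing a rate requires sampling it twice and differencing. A sketch:

```sql
-- Free list stalls (cumulative) alongside current SQLOS free memory.
SELECT RTRIM(counter_name) AS counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE (object_name LIKE '%Buffer Manager%' AND counter_name = 'Free list stalls/sec')
   OR (object_name LIKE '%Memory Manager%' AND counter_name = 'Free Memory (KB)');
```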

SQL Server [total] memory is comprised of three categories: [database cache], [stolen], [free].

Just in case someone out there loves a little perfmon math as much as I do...

  [\SQLServer:Memory Manager\Database Cache Memory (KB)]
+ [\SQLServer:Memory Manager\Stolen Server Memory (KB)]
+ [\SQLServer:Memory Manager\Free Memory (KB)]
= [\SQLServer:Memory Manager\Total Server Memory (KB)]

It seems fairly intuitive that [\SQLServer:Memory Manager\Free Memory (KB)] would be pretty low when free list stalls are occurring.  And indeed that's what is seen in the graph below: the high plateau of free list stalls occurs during a valley of SQLOS free memory.

On this 4 vNUMA node vm, perfmon counters [\SQLServer:Memory Node(*)\Free Node Memory (KB)] will account for the portion of [\SQLServer:Memory Manager\Free Memory (KB)] on each SQLOS node.  The graph below shows that (although every now and then a stray errant value shows up in the individual node values).

The changes in free memory across the SQLOS nodes almost seem to predict something...

OK, let's look at the [\SQLServer:Memory Node(*)\Database Node Memory (KB)] counters.
Interesting that almost no database cache is listed on node 002 during the performance bamboozle.

When looking at [Stolen] memory in the next graph, something doesn't quite add up.  Some memory in the instance is being double-counted by SQLOS: counted on one node as [db cache] and on another as [stolen].  I've blogged about that briefly in the following two blog posts.

SQL Server 2016 Memory Accounting: A Suspicious Surprise 

SQL Server 2016 Memory Accounting Part II: Another Suspicious Example

The massive amount of [stolen] memory on node 002 makes sense given the almost complete lack of [Db Cache] on node 002. It still looks kind of unusual.


Now, when free list stalls occur, one way to replenish the free lists is for clean buffer pool pages to be evicted by the lazy writer thread(s). Page writes can make clean pages out of dirty pages 🙂

So, when free list stalls occur, I expect to see page writes coming out of SQL Server.  But there weren't many page writes during this performance bamboozle.  Lazy writer wrote out a few pages.  Notice that as lazy writer wrote pages, the number of free list stalls slightly decreased. Checkpoint writes occurred up to the start of the performance bamboozle.  But not during.

There were no background writer writes, because the active databases were using automatic checkpoints rather than indirect checkpoints.
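Whether a database uses automatic or indirect checkpoints can be checked (and changed) via target_recovery_time. A sketch; [YourDb] is a placeholder:

```sql
-- target_recovery_time_in_seconds = 0 means automatic checkpoints;
-- a nonzero value means indirect checkpoints (and background writer activity).
SELECT name, target_recovery_time_in_seconds
FROM sys.databases;

-- To switch a database to indirect checkpoints:
-- ALTER DATABASE [YourDb] SET TARGET_RECOVERY_TIME = 60 SECONDS;
```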

The slow pace of page writes may have contributed to the performance bamboozle, or perhaps the two share a common underlying cause.

Note especially the second large bout of free list stalls at approximately 10:15 am.  During that round of free lists stalls, SQL Server page writes and lazy writer writes were plentiful.  That's how I expect the system to respond to free list stalls.

The [dbcc dropcleanbuffers] that Joe initiated to end the performance bamboozle was able to do something that the lazywriter was not able to do.  It broke a logjam, freeing up a bunch of memory within SQL Server [total] memory and allowing the lazywriter threads to get back to normal work.


Many may already be familiar with the exceptional behavior shown in the next graph.  [Target server memory] is not as often enforced for SQL Server [total] memory as it used to be.  Batch mode columnstore queries especially seem prone to make [total] exceed [target].  This behavior is not unique to systems with multiple NUMA nodes; it can also be seen when there is a single SQLOS memory node.

Notice that [total] exceeds [target] almost immediately after the yellow box below, in addition to other times in the graph.
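A quick way to watch for [total] exceeding [target] without perfmon is to sample the same two counters from the DMV. A sketch:

```sql
-- [Total Server Memory] above [Target Server Memory] indicates the exceptional behavior.
SELECT RTRIM(counter_name) AS counter_name,
       cntr_value / 1024 AS value_mb
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Memory Manager%'
  AND counter_name IN ('Total Server Memory (KB)', 'Target Server Memory (KB)');
```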

A description of [total] exceeding [target] can be found in the following KB article, with a screenshot excerpt following.

Memory configuration and sizing considerations in SQL Server 2012 and later versions

For various reasons, I do not consider [total] > [target] to be a contributor to the performance bamboozle.  Rather, the brief time that [total] exceeded [target] after the performance bamboozle was a part of SQL Server recovering to normal behavior.


The graph below shows this instance also experienced persistent and gradually accumulating [foreign] memory in the SQLOS memory nodes.  Fluctuation in the amount of [foreign] memory is somehow related to [total] exceeding [target].  This behavior is unique to systems with multiple NUMA nodes and multiple SQLOS memory nodes.

The persistent and accumulating [foreign] memory, similar to [total] > [target], does not seem to have been a contributor to the performance bamboozle.  Rather, the increase in [total] memory beyond [target] afterward appears to have been part of SQL Server recovering to normal behavior.


At 9:31:07 top level memory blocks were allocated.

Harshdeep Singh discusses the role of top level memory blocks in maintaining free lists in the following blog post.

An In-depth look at memory – SQL Server 2012/2014

Late(r than expected) allocation of top level memory blocks may have contributed to the performance bamboozle.


dbcc MemoryStatus -- IO error??

No answers in this post, just a question I don't want to lose track of.

2019 May 7 I put this question to #sqlhelp:

[dbcc memorystatus] in 2017 cu13 gives information about IO errors. Is there another source for that error count or more details elsewhere? Nothing in the SQL Server error log, Windows event log.

No answer yet.

I set up a task via schtasks to send 'dbcc memorystatus' output to a file every 5 minutes for a memory investigation.  I thought it might shed some light on this information from 'dbcc memorystatus'.  Nope, not yet.
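The capture script itself can be as simple as a sqlcmd one-liner appended to a file; the paths and task name below are examples, not the exact ones I used:

```bat
:: capture_memorystatus.bat - append a timestamped DBCC MEMORYSTATUS snapshot
:: schedule it with something like:
::   schtasks /create /tn "memorystatus_capture" /tr "C:\perfmon\capture_memorystatus.bat" /sc minute /mo 5
echo ----- %date% %time% ----- >> "C:\perfmon\memorystatus_%COMPUTERNAME%.txt"
sqlcmd -S localhost -E -Q "SET NOCOUNT ON; DBCC MEMORYSTATUS;" >> "C:\perfmon\memorystatus_%COMPUTERNAME%.txt"
```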

Here are a few of the results I scraped.  Not sure if 'IO error' is a count of errors, or an error code for the most recent IO error; though there's seemingly no documented error matching some of these numbers.  It may be a count of errors within each checkpoint, or some such.  The two busy databases in this instance are both using automatic rather than indirect checkpoints.
May 11 time        Dirty     IO error
4:00:01          2382710         7685
4:05:01          1510246         2412
4:10:01          1076007        19739
4:15:01           364968         9050
4:20:00            44028        16855
4:25:00            28251         1653
4:30:00            23242         2674
4:35:00            24668        32750
4:40:00            25199       124190
4:50:01            26107         2079
4:55:00            25661        70204
5:00:00           909808        13664

I'll update this post if I learn anything exciting in the future.

Sunday, February 17, 2019

SQL Server 2016 SP2 CU4/2017 CU13 Access Violation on Certain Autostats Updates

***** Updated 2019-02-18 This post was originally published on February 17th, 2019. February 18 it was updated with some details re: SQL Server 2016 SP2 CU3 and CU5 - the CUs immediately preceding and following the CU4 with the av. The repro has been updated from repro version 58 to repro version 62. This version does a better job of giving results which can easily be compared across CU3, CU4, and CU5. *****

I spent a lot of time tracking this error down.

When I started investigating, the error was known only as an access violation, preventing some operations related to data cleansing or fact table versioning.

It occurred deep within a series of stored procedures.  The execution environment included cross-database DELETE statements, cross-database synonyms, lots of SELECT statements against system views, scalar UDFs, and lots and lots of dynamic SQL.

And... I don't have access to the four systems where the access violation occurred.

I was able to have some investigation performed on those systems - we learned that by disabling 'auto stats update' for the duration of the sensitive operations, the error was avoided.  We also learned that reverting a system from SQL Server 2016 SP2 CU4 to SP2 CU2 avoided the errors.  On those systems, reverting to SP2 CU2 or temporarily disabling 'auto stats update' were sufficient temporary mitigations.

So I set out to reproduce the errors on my laptop.  That took a while.  A pretty long while. 😱

And once I was able to reproduce the access violation, I generated 16 errors and suddenly couldn't get any more errors!  (I'd later learn that changing the database recovery model from 'simple' to 'full' was the reason the errors were no longer occurring.)  I opened a premier support ticket with Microsoft describing the issue, attaching the sqldump files, and explaining I wasn't able to continue generating the errors.

Fortunately it was only a few hours later that I was able to start generating errors again by proceeding through my process of reproducing errors and simplifying the context much more carefully than previously.

At the bottom of this post is a minimal reproduction of the access violation error in SQL Server 2016 SP2 CU4.  As is, the statements will create a new database t1 with simple recovery mode and 'auto stats update' enabled.  Those are the system state requirements for the access violations.  If you recreate the error yourself, please be aware a memory dump will be created.

The execution context required for the access violation requires a stats object that is already created but empty.  So: a stats object for a clustered index (or primary key clustered index) on a table that was just created, or has been truncated.  The underlying column must be of the BIGINT data type.  With INT, the error does not occur (or at least doesn't occur until a much higher threshold of activity).

Within the same explicit transaction, at least 1024 rows are inserted into the table using the TABLOCK or TABLOCKX hint (without a hint no error will occur), and then a query requests the stats (resulting in the auto-update stats activity).  Then the access violation occurs.

If you'd like to follow along, here is the repro code.

DECLARE @version NVARCHAR(50) = N'SQL Server ' + CONVERT(NVARCHAR(20), SERVERPROPERTY('ProductVersion')) 
SELECT [version] = @version 
PRINT @version
USE [master]
DECLARE @dpath NVARCHAR(200), @lpath NVARCHAR(200), @dbcmd NVARCHAR(1000);
SELECT @dpath = CONVERT(NVARCHAR(100), SERVERPROPERTY('InstanceDefaultDataPath')) + N't1.mdf';
SELECT @lpath = CONVERT(NVARCHAR(100), SERVERPROPERTY('InstanceDefaultLogPath')) + N't1_log.ldf';

SET @dbcmd = N'CREATE DATABASE [t1] ON PRIMARY (NAME = ''t1'', FILENAME = N''' + @dpath + N''', SIZE = 8192KB , MAXSIZE = 1GB, FILEGROWTH = 65536KB )'
SET @dbcmd = @dbcmd + N' LOG ON ( NAME = N''t1_log'', FILENAME = N''' + @lpath + N''', SIZE = 8192KB , MAXSIZE = 1GB , FILEGROWTH = 65536KB );'
EXEC (@dbcmd)
ALTER DATABASE [t1] SET RECOVERY SIMPLE; -- the av requires the simple recovery model
USE t1
SELECT repro_id = 'x62';
PRINT 'repro_id = x62'
/* both DROP TABLE and CREATE TABLE can be replaced with TRUNCATE TABLE;
   access violation takes place for an empty stat either way */

CREATE TABLE x  ( col1 BIGINT, -- No error if INT rather than BIGINT
                  INDEX ci_x CLUSTERED (col1) ); -- stats on CI or PKCI for access violation on auto stats update
--                INDEX i_col1 (col1) -- nope, no av if stats accompany non-clustered index
--CREATE STATISTICS s_col1 ON x(col1); -- nope, no error if stats on non-indexed column
                                       -- only auto stats update of stats on CI or PKCI seems to hit AV
--Transaction can include or exclude CREATE TABLE but must include INSERT and request for auto stats update
BEGIN TRANSACTION;
  ;WITH u0 AS (SELECT num=1 UNION ALL SELECT num=1),             -- 2 rows
        u1 AS (SELECT num=1 FROM u0 AS n1 CROSS APPLY u0 AS n2), -- 2 * 2 = 4 rows
        u2 AS (SELECT num=1 FROM u1 AS n1 CROSS APPLY u1 AS n2), -- 4 * 4 = 16 rows
        u3 AS (SELECT num=1 FROM u2 AS n1 CROSS APPLY u2 AS n2), -- 16 * 16 = 256 rows
        u4 AS (SELECT num=1 FROM u3 AS n1 CROSS APPLY u1 AS n2)  -- 256 * 4 = 1024 rows
  INSERT INTO x WITH ( TABLOCKX ) -- TABLOCK or TABLOCKX to get av; no hint no error
  SELECT TOP (1024) num -- no av if fewer than 1024 rows
  FROM u4;

  --SELECT result = COUNT (DISTINCT col1) FROM x -- stats auto update in CU3, CU4 or CU5; but no av
  SELECT result = COUNT ( * ) FROM x -- stats auto update in 2016 SP2 CU4 but not CU3 or CU5
  /* access violation takes place when auto update stats is attempted in 2016 SP2 CU4 */
COMMIT TRANSACTION;

USE [master]

Here's what the output of the repro code above looks like on SQL Server 2016 SP2 CU3.  Not terribly exciting.  Note that at the end of execution, the stats for ci_x are still empty.  That makes sense: a count(*) from a table can't really benefit from stats.  It doesn't have an operator accumulating results over time or anything else requiring a memory grant.  It doesn't have a join operator requiring a row estimate.

Indeed, the results look identical in SQL Server 2016 SP2 CU5.  No auto stats update, no error, stats still empty as the code completes.

But what happens in SQL Server 2016 SP2 CU4?  Just after the rows are inserted - and before the results of the count(*) are returned - the query encounters an access violation.

The exception.log file in the SQL Server error log directory looks nice and orderly from these access violations :-)

And here's the short stack dump included in the SQLDump****.txt files.

It really looks like SQL Server 2016 SP2 CU4 is a loner in this regard - asking for an auto stats update in a context where the stats are immaterial to the query itself.

We'll see.  If there is a fix forthcoming in SQL Server 2016 SP2 CU6 or beyond, I'll update this post with a reference to it.  Perhaps instead I'll learn that this issue was coincidentally fixed in SQL Server 2016 SP2 CU5.  I've looked through the CU5 KB article a few times and didn't see anything I thought was related.

I tried to document within the repro code potential band-aid workarounds if you run into this access violation, just something to tide you over until you can upgrade to a fixed version.  For example, switching to the full recovery model or disabling auto stats update.  Unfortunately, inoculating against these AVs at the system level may come at too great a risk in some cases.  There are workarounds possible in the code as well.  Eliminating hints in the insert might work without unreasonable cost.  Alternatively, a manual stats update before the count(*) may be acceptable.
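For illustration, the manual-stats-update workaround would look something like this within the repro; this is a hypothetical sketch, not a guaranteed fix for every context:

```sql
-- Manually updating the (empty) stats before the count(*) means the query
-- no longer triggers the auto stats update that hits the AV.
UPDATE STATISTICS x ci_x;
SELECT result = COUNT(*) FROM x;
```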

All right - that's it for now! Ciao!! 

***** Addendum 20190220 *****
I decided to check SQL Server 2017 as well.  CU11 - no error.  CU12 - no error.

So be on the lookout for this access violation in SQL Server 2016 SP2 CU4 and SQL Server 2017 CU13.

Monday, January 14, 2019

Evaluating SQL Server System Resource Governor Workload Group "Maximum Memory Grant %" for Adjustment

I'm looking at a SQL Server system with 4 NUMA nodes, 18 cores on each NUMA node.  Total RAM: 2 TB.

We can easily verify that the SQL Server default workload group is the main CPU consumer on the server.

I like to trend active requests and parallel threads against CPU utilization.  On the systems I watch, when they are healthy there should be a fairly strong correlation between CPU and requests, or CPU and parallel threads, or both.  Hm. During this time, threads have a stronger correlation than requests.  That's not concerning by itself.  But with this many requests and this many parallel threads against 72 logical processors/schedulers (HT is not enabled), I want higher CPU utilization and a higher rate of work.

Either the work is poorly distributed (a less likely condition) or there is a wait type (or types) that can be identified as suppressing CPU utilization.

Perfmon captures some wait types, and I strike gold by looking at resource_semaphore waits. 

I don't like switching between area and line graphs for resource utilization, but the graph below is easiest to read with CPU as a line graph over stacked area graphs of memory grants (pending and outstanding).  The point of this graph is to show if, at every point in time, the pending memory grants were converted to outstanding (so that the query was executing rather than waiting) there would be enough CPU to allow the queries to run.  When I intervene on a system for scalability or performance, I hope not just to replace one wait type with another 😎.  If, for example, CPU was near 100% utilized while there were pending memory grants, I'd expect simply to replace resource_semaphore waits with SOS_scheduler waits.  But, that's not the case here.
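The grant picture behind counters like [Memory Grants Pending] and [Memory Grants Outstanding] can also be pulled from the resource semaphore DMV. A sketch:

```sql
-- waiter_count maps to pending grants, grantee_count to outstanding grants.
SELECT pool_id,
       resource_semaphore_id,
       waiter_count,
       grantee_count,
       granted_memory_kb,
       available_memory_kb
FROM sys.dm_exec_query_resource_semaphores;
```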

So I have in mind to lower the "Maximum Memory Grant %" for the relevant workload group.  On this system, the default workload group is still using 25%.
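Lowering the cap is a two-statement change; note it only affects grants for requests that start after the reconfigure. A sketch:

```sql
-- Lower the grant cap for the default workload group from 25% to 6%.
ALTER WORKLOAD GROUP [default] WITH (REQUEST_MAX_MEMORY_GRANT_PERCENT = 6);
ALTER RESOURCE GOVERNOR RECONFIGURE;
```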

The graph below has reserved memory in dark blue layered in front of granted memory in lighter blue.  (There really is only one active user workload group on this system - default.  If there were additional workload groups and/or additional resource pools, this type of analysis would still be possible, but information from DMVs would be necessary to augment perfmon.) So the exposed light blue on the top is the amount of used granted query memory.  It's just a fraction of the granted memory, and sometimes quite a small fraction at that.  By reducing the maximum memory grant percentage, there will be a better fit between granted and used query memory (e.g. I expect reserved memory to be lower, as well as granted memory, at the same query concurrency).

But there are lots of queries running on the system.  Some of them are probably already spilling to tempdb.  Those queries will spill more by lowering the max memory grant percent.  And some queries that weren't spilling before but were close will start spilling into tempdb.

Let's look at how much tempdb was already in use.  Perfmon has counters for the data file size in tempdb, and the amount of tempdb used.  In the graph below the used tempdb is in light blue layered in front of the dark blue size of tempdb files.  So the exposed dark blue at the top is the used tempdb.

Let's do the subtraction for used query memory, and used tempdb.  And stack them on top of each other.  The upper border is the upper limit for sort/hash activity on the system at any given time.  (Upper limit because "used tempdb" also includes temp tables, version store, etc).
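The "used tempdb" number can be broken down by consumer with the file-space DMV, which helps separate sort/hash spill from temp tables and version store. A sketch (8 KB pages converted to MB):

```sql
-- Approximate tempdb usage breakdown, in MB.
SELECT SUM(user_object_reserved_page_count)     * 8 / 1024 AS user_objects_mb,
       SUM(internal_object_reserved_page_count) * 8 / 1024 AS internal_objects_mb,
       SUM(version_store_reserved_page_count)   * 8 / 1024 AS version_store_mb,
       SUM(unallocated_extent_page_count)       * 8 / 1024 AS free_mb
FROM tempdb.sys.dm_db_file_space_usage;
```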

Assuming that all of the queries from 10:00 am to 6:00 pm ran with the same plans but with maximum memory grant percent at 6% rather than 25%, none of their sort/hash needs would change.  Assuming the pace of every query remained the same, eliminating only resource_semaphore wait time (that's not realistic but work with me here 😇) the total area under the upper border would stay constant.  Like moving tetris blocks around.  The share of tempdb is likely to increase vs used query memory.  But it looks like bringing the maximum query memory grant percent down from 25% to 6% is going to be a win for this workload.  After the change, I recommend doing the same type of analysis to make sure the behavior is as expected.