Monday, May 13, 2019

SQL Server 2017: here a NUMA, there a NUMA... Part 1

aka Performance Bamboozle

This blog post considers a perf/scale testing system.  See a following blog post for similar concerns from a SQL Server 2016 production system.
SQL Server 2016: Here a NUMA, there a NUMA... Part 2

Let's observe a workload on SQL Server 2017 CU13, running on a 4x24 vcpu VM with 900 GB vRAM. [Max server memory] is set to 750 GB.  From 5:00 am to 12:30 pm [target server memory] is equal to [max server memory]; below I will refer to [target] only.

The yellow box marked in the graphs below is unusual.  Although CPU utilization generally trends with both active requests and parallel workers in the [default] workload group, CPU utilization is far lower in the yellow box than predicted by the number of active requests or parallel workers.  The end of the yellow box coincided with an intervention step.  My quick-thinking colleague Joe Obbish issued a [dbcc dropcleanbuffers] and the recovery was nothing short of miraculous.  But certainly that's not an operation to execute frequently on a busy system.  I'll be referring to the yellow box as a "performance bamboozle" for the remainder of this blog post.  I've gotta do something to keep myself amused.  What went wrong - and why did [dbcc dropcleanbuffers] help?

Only two DOP 8 queries ran for the majority of the yellow box.  Lets compare their Query Store details to those of similar queries with the same query_hash values.  All eight of the query executions detailed in the following 2 result sets ran at DOP 8; the particular queries of interest are highlighted in red.

Wow.  All 8 queries returned a single row.  All 8 had similar CPU ms, physical reads, logical reads, tempdb use, query memory use. But those two highlighted in red had elapsed time roughly 20 to 30 times that of the other queries. What happened?

The workers for these queries spent the majority of the yellow box time in sleep_task wait.  Often sleep_task is often ignored as a benign wait.

Read more about sleep_task waits in the
SQLSkills SQL Server Wait Type Library

Though perfmon includes details of several wait types in [\SQLServer:Wait Statistics(*)\*], sleep_task is not among them.

But perfmon offers other evidence of what was happening in those 2 queries.  Check out those free list stalls!!

The free list stalls call to mind the instance-wide perfmon counter [\SQLServer:Memory Manager\Free Memory (KB)].

SQL Server [total] memory is comprised of three categories: [database cache], [stolen], [free].

Just in case someone out there loves a little perfmon math as much as I do...

  [\SQLServer:Memory Manager\Database Cache Memory (KB)]
+ [\SQLServer:Memory Manager\Stolen Server Memory (KB)]
+ [\SQLServer:Memory Manager\Free Memory (KB)]
= [\SQLServer:Memory Manager\Total Server Memory (KB)]

It seems fairly intuitive that [\SQLServer:Memory Manager\Free Memory (KB)] would be pretty low when free list stalls are occurring.  And indeed that's what is seen in the graph below:the high plateau of free list stalls occurs during a valley of SQLOS free memory.

On this 4 vNUMA node vm, perfmon counters [\SQLServer:Memory Node(*)\Free Node Memory (KB)] will account for the portion of [\SQLServer:Memory Manager\Free Memory (KB)] on each SQLOS node.  The graph below shows that (although every now and then a stray errant value shows up in the individual node values).

The changes in free memory across the SQLOS nodes almost seem to predict something...

OK, let's look at the [\SQLServer:Memory Node(*)\Database Node Memory (KB)] counters.
Interesting that almost no database cache is listed on node 002 during the performance bamboozle.

When looking at [Stolen] memory in the next graph, something doesn't quite add up.  Some memory in the instance is being double-counted by SQLOS: counted on one node as [db cache] and on another as [stolen].  I've blogged about that briefly in the following two blog posts.

SQL Server 2016 Memory Accounting: A Suspicious Surprise 

SQL Server 2016 Memory Accounting Part II: Another Suspicious Example

The massive amount of [stolen] memory on node 002 makes sense given the almost complete lack of [Db Cache] on node 002. It still looks kind of unusual.


Now, when free list stalls occur, one way to replenish the free lists is for clean buffer pool pages to be evicted by the lazy writer thread(s). Page writes can make clean pages out of dirty pages 🙂

So, when free list stalls occur, I expect to see page writes coming out of SQL Server.  But there weren't many page writes during this performance bamboozle.  Lazy writer wrote out a few pages.  Notice that as lazy writer wrote pages, the number of free list stalls slightly decreased. Checkpoint writes occurred up to the start of the performance bamboozle.  But not during.

There were no background writer writes, because the active databases were using automatic checkpoints rather than indirect checkpoints.

The slow pace of page writes contributed to the performance bamboozle. Or perhaps share a common underlying cause.

Note especially the second large bout of free list stalls at approximately 10:15 am.  During that round of free lists stalls, SQL Server page writes and lazy writer writes were plentiful.  That's how I expect the system to respond to free list stalls.

The [dbcc dropcleanbuffers] that Joe initiated to end the performance bamboozle was able to do something that the lazywriter was not able to do.  It broke a logjam, freeing up a bunch of memory within SQL Server [total] memory and allowing the lazywriter threads to get back to normal work.


Many may already be familiar with the exceptional behavior shown in the next graph.  [Target server memory] is not as often enforced for SQL Server [total] memory as it used to be.  Batch mode columnstore queries especially seem prone to make [total] exceed [target].  This behavior is not unique to systems with multiple NUMA nodes; it can also be seen when there is a single SQLOS memory node.

Notice that [total] exceeds [target] almost immediately after the yellow box below, in addition to other times in the graph..

Description of [total] exceeding [target] can be seen in the following KB article, with a screenshot excerpt following.

Memory configuration and sizing considerations in SQL Server 2012 and later versions

For various reasons, I do not consider [total] > [target] to be a contributor to the performance bamboozle.  Rather, the brief time that [total] exceeded [target] after the performance bamboozle was a part of SQL Server recovering to normal behavior.


The graph below shows this instance also experienced persistent and gradually accumulating [foreign] memory in the SQLOS memory nodes.  Fluctuation in the amount of [foreign] memory is somehow related to [total] exceeding [target].  This behavior is unique to systems with multiple NUMA nodes and multiple SQLOS memory nodes.

The persistent and accumulating [foreign] memory, similar to [total] > [target] does not seem to have been a contributor to the performance bamboozle.  Rather, the increase in total 


At 9:31:07 top level memory blocks were allocated.

Harshdeep Singh discusses the role of top level memory blocks in maintaining free lists in the following blog post.

An In-depth look at memory – SQL Server 2012/2014

Late(r than expected) allocation of top level memory blocks may have contributed to the performance bamboozle.


No comments:

Post a Comment