Thursday, October 25, 2018

SQL Server 2016 Memory Accounting Part II: Another Suspicious Example

Originally published 2018 October 25
Updated 2020 December 15
I wanted to update this blog post with a link to a fix that corrects the memory double-counting described here (and the condition that leads to it).

KB4536005 - Improvement: Fix incorrect memory page accounting that causes out-of-memory errors in #SQLServer #SQLServer 2019 CU2 #SQLServer 2017 CU20 #SQLServer 2016 SP2 CU15

tl;dr - No answers yet but perhaps a cleaner picture of the puzzle.

A few months ago I shared a puzzle I've been looking at.  The example i shared is from 4 vNUMA node VM, where the equation [database + stolen + free = total] holds true for the instance but does NOT hold true on one or more SQLOS memory nodes associated with the vNUMA nodes.  Further, although sum(node database cache) is equal to instance-wide database cache, and sum(node free memory) is equal to instance-wide free memory... sum(node stolen memory) may greatly exceed instance-wide stolen memory. Eventually i realized that memory is being double-counted as database cache on one node and as stolen memory on another node.  More details at the link.

SQL Server 2016 Memory Accounting: A Suspicious Surprise

Recently I looked at a 2 vNUMA node VM and saw the same memory conditions.  Even though I don't yet understand what leads to this condition (or what leads out of it), I wanted to share these graphs because the 2 vNUMA node example is maybe a little cleaner. And I think these graphs maybe look a little nicer :-)

Nothing extraordinary when looking at instance-wide measures from memory manager.


And, even though that's a really large share of stolen memory on node 000, [database + stolen + free = total] as expected.

But... whoa!! What's going on on node 001? The [database + stolen + free] goes way over total.

Adding free memory from the two nodes yields the free memory number from memory manager.

Adding database cache from the two nodes yields the database cache number from memory manager.

Omigarrrssshhh!!  Adding stolen memory from the two nodes yields a number sometimes much greater than the stolen memory reported by the memory manager.

So now there are two cases where sums of node values are unexpectedly unexpectedly greater than their corresponding values.

I expect these two equations to hold true, and they are being transgressed:

   "\SQLServer:Memory Node(001)\Stolen Node Memory (KB)"
+ "\SQLServer:Memory Node(000)\Stolen Node Memory (KB)"
= "\SQLServer:Memory Manager\Stolen Server Memory (KB)"

   "\SQLServer:Memory Node(001)\Database Node Memory (KB)"
+ "\SQLServer:Memory Node(001)\Stolen Node Memory (KB)"
+ "\SQLServer:Memory Node(001)\Free Node Memory (KB)"
= "\SQLServer:Memory Node(001)\Total Node Memory (KB)"

Let's compare the differences between the sum of components and the aggregate value (both of these differences are expected to be zero, or near zero due to timing issues).

Huh. The differences from expected values are equal.  Something is causing a varying amount of memory to be counted on node 001 as database cache memory AND double-counted on node 000 as stolen memory.

I've asked about this on twitter at the #sqlhelp tag.  No answers yet - that's not surprising.  i tend to look at SQL Server memory in a rather particular - maybe even peculiar - way 😀 .  If i don't have more of a clue by  PASS Summit, i plan to take this to SQL Clinic to find out what's up.

Why do i care?  Because with tuned CCI workloads, memory-related waits (and CPU time related to memory latency) becomes more important than it was in the past with pure rowstore queries.  And there are sometimes still foreign memory issues like this...

In order for me to tune the system optimally to match the workload (and vice versa), I need a better understanding of memory behavior, potential performance costs, and interventions which can change behavior.

All right - that's all for today!  Ciao!!