Tuesday, October 25, 2016

Harmonic mean of SQLOS Buffer Node PLEs on NUMA servers

By default, on a (v)NUMA server/vm, SQL Server will partition its shared memory resources by using a count of SQLOS memory nodes equal to the visible (v)NUMA nodes.

In that strategy, each SQLOS memory node gets its own IO completion port and its own lazy writer, and in SQL Server 2016 there will be one transaction log writer per node (up to 4, on (v/l)cpus 1-4 as needed).  Each SQLOS node gets a portion of the database cache, stolen memory, and free memory.
When pages read from disk are inserted into the bpool for a worker, they are inserted into the database cache for the SQLOS memory node associated with the compute node for the scheduler running the worker.

Put all that together: each SQLOS memory node may be experiencing a different rate and footprint of pages read, first-writes to new database pages, steals against memory grants (or other steals), and freeing of memory.  I don't know the formula used for PLE... but each of those factors is part of the calculation.  So... each SQLOS node has its own PLE, and these PLE values can vary greatly.


Yet the node level PLE is a metric seldom checked.  Rather, the overall calculated PLE for the instance is the metric usually consulted.  Here's that metric added into the same graph.




I've been puzzling for quite a while over how the overall PLE was calculated.  It's fairly obvious it's not an arithmetic mean of the PLEs - it varies gradually even as individual PLEs vary greatly.  I figured perhaps it was a weighted average, maybe with weight determined by the amount of database cache on the individual nodes.

But that could also be quite volatile since, especially in the case of large hashes, memory can be stolen against a grant very quickly - rapidly shrinking the amount of database cache on an individual SQLOS node.

Eventually this Paul Randal blog post was pointed out to me.  The post is originally from 2011, but it was updated in 2015 after Matt Slocum pointed out that an arithmetic mean didn't fit for deriving the overall PLE.

"The calculation is: add the reciprocals of (1000 x PLE) for each node, divide that into the number of nodes and then divide by 1000."
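For anyone who'd rather not use a spreadsheet, here is a minimal Python sketch of the quoted calculation.  The function name overall_ple and the sample node values are my own placeholders for illustration (they are not the data from the graphs in this post), and this only mirrors the quoted description rather than anything confirmed about SQL Server internals.

```python
def overall_ple(node_ples):
    """Combine per-node PLE values following the quoted description.

    node_ples: iterable of positive per-node PLE values (seconds).
    """
    node_ples = list(node_ples)
    if not node_ples:
        raise ValueError("need at least one node PLE")
    if any(ple <= 0 for ple in node_ples):
        raise ValueError("PLE values must be positive")
    # "add the reciprocals of (1000 x PLE) for each node"
    reciprocal_sum = sum(1.0 / (1000.0 * ple) for ple in node_ples)
    # "divide that into the number of nodes and then divide by 1000"
    return len(node_ples) / reciprocal_sum / 1000.0


# Hypothetical PLEs for four SQLOS nodes (illustrative values only).
sample_node_ples = [100, 200, 400, 800]
harmonic = overall_ple(sample_node_ples)
arithmetic = sum(sample_node_ples) / len(sample_node_ples)
print(f"harmonic mean:   {harmonic:.1f}")
print(f"arithmetic mean: {arithmetic:.1f}")
```

The factors of 1000 cancel out, so the result is just the harmonic mean of the node PLEs.  A harmonic mean is pulled toward the smallest values, so with these made-up numbers it lands well below the arithmetic mean, and a spike on one node moves it much less than it would move a simple average.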

OK... let's plug that formula into Excel along with the data for the 4 SQLOS nodes and the overall PLE to see if it fits.



That's a pretty good fit.  The very first harmonic mean I calculate rounds to 142 rather than the value of 144 that is the overall PLE.  That's not surprising - after seeing how good the fit is otherwise, I suspect that was a timing issue.  Volatility in the 000 and 001 PLEs probably led to a slightly different value between the time that the individual SQLOS node values were reported and a short time later when the overall PLE was reported.

A graph of my calculated harmonic mean and the overall PLE shows just how good the fit is...



That's good enough for me :-)

I'm glad that Paul Randal and Matt Slocum dug into this... one fewer question in my heap.






