Wednesday, May 28, 2014

"\SQLServer:Buffer Manager\Page life expectancy" hides some detail on NUMA machines

Just a few words about "\SQLServer:Buffer Manager\Page life expectancy" vs. "\SQLServer:Buffer Node(*)\Page life expectancy".

On busy servers with multiple NUMA nodes, the overall PLE provided by "\SQLServer:Buffer Manager\Page life expectancy" can obscure some detail that is very helpful in investigations.  Consider the difference between "PLE" and the individual NUMA node PLEs (from \SQLServer:Buffer Manager\Page life expectancy) below.

Most OLTP workloads may never be disturbed by this condition.  Busy, batch-oriented workloads could fall victim to it, though.  That would include certain optimization strategies for SQL Server index maintenance.

If  only "PLE" were considered, it looks like there is a problem with server memory sufficiency (assuming the workload itself is not problematic).  But, when looking at the PLE of the 4 individual NUMA nodes, it becomes apparent that NUMA node 001 is working its memory hard, while the others are loafing.

 The workload on this server is being performed exclusively in the Resource Governor Default Workload Group.  Plotting the node PLEs against the parallel worker count for the Default RG group shows that its during times of high active parallel workers that the memory across the server seems to be used unevenly.

From my perspective its significant that two of the disk IO read spikes above coincide with precipitous drops in the Node 001 PLE.

 I aggregated '\Processor(*)\% Processor Time' aligned with NUMA node boundaries and this is the result.  CPU is being used fairly evenly across all 4 NUMA nodes, but memory in one NUMA node is stressed.

That scenario can happen in large query batches, where queries interested in the same large data sets run nearly concurrently.  First one in wins - the first query to start executing will very likely have all of its workers on a single NUMA node.  When pages are read into the bpool for those workers, they'll be inserted into the bpool region controlled by the "home node" of those workers.  Other queries just a few moments behind on the 000, 002, and 003 node will 'bogart' off  the leader in NUMA node 001 - they'll get many more cache hits - most of them from NUMA node 001's home node memory region.

No comments:

Post a Comment