sql.sasquatch: 40 concurrent SQL Server parallel queries + (2 sockets * 10 cores per socket) = spinlock convoy

Here's a disclaimer: trace flags aren't toys - only deploy them in production environments after exploring and testing their behavior in the context of your critical workloads (including end user activity, backups, integrity checks, index maintenance, stats maintenance). Be especially cautious with startup trace flags since it'll require a SQL Server restart to back them out.

Trace flag 8048 - I don't know of any bad side effects... and I've been looking. Not a reason for you not to test... but I'll update this post and add a new post if I ever learn of trouble from trace flag 8048.

Trace flag 8015 - combined with trace flag 8048, disabling NUMA has been good medicine for the kinda-DSS/kinda data warehouse type workflows on the systems I'm involved with. Its lead to more even CPU utilization across cores, more stable/predictable memory utilization, lower overall physical reads, higher scalability, and shorter execution times for batch workloads. Your mileage WILL vary :) Potential bad side effects to consider: loss of one independent buffer pool on each NUMA node, loss of "affinitize different SQL Server connections to different physical NUMA nodes" strategy, single lazy writer instead of one lazy writer per NUMA node.

Ok... is that good enough for the lawyers? :)

A colleague sent me the perfmon spreadsheet snippet above today. This is from a SQL Server with 2 Intel sockets of ten cores per socket, with hyperthreading enabled. I know its hard to read :) Don't bother trying to read the numbers - read the reds and the greens. And if its 3 digit red - the logical CPU is 100% busy. Report application servers send up to 40 concurrent queries to this SQL Server system.

Its about as good of a picture as you'll ever see of what can go wrong in SQL Server by default on a NUMA server: parallel query worker threads are stacking up on a single NUMA node instead of spreading out, and the worker threads on the busy NUMA node are hitting MAJOR spinlock contention in memory allocation. My colleague just grabbed the per-CPU %busy stats from perfmon, stuffed them into excel, turned on conditional formatting and voila! Its clear to me how much of a benefit trace flags 8048 and 8015 will be to this system. Is it clear to other folks yet?

See - with NUMA enabled by default, all worker threads for parallelized queries tend to stay on the NUMA node which received the connection. Connections are distributed round-robin among the NUMA nodes... but what if by happenstance one node gets all the quick queries, and the other node gets all the long, parallelized queries? Ummm... this happens.

You end up with lots and lots and lots of worker threads executing tasks for parallelized queries - all stacked on the same poor cores, and the other CPU cores just might be loafing.

The other behavior to keep in mind is that, by default, query memory allocation (and some other memory allocations) for all threads on a NUMA node are serialized through a single chokepoint (!!). That's as well-defined bottleneck as you'll find, just waiting for enough concurrent activity on the NUMA node to 'poke the bear'. In fact, when things start to look like they do above, it might not just be CPU overload coupled with spinlock contention. It could even be a 'spinlock convoy', where numerous cores (in this case all CPU cores on at least 1 NUMA node) spend more cycles spinning for the same resource than accomplishing logical work.

So here's the deal - if you see this pattern on your SQL Server... and its a dedicated server so nothing else is chewing up your compute power... Check the waits. Check the spins. Plot logical reads, or some other measure of database work, against CPU utilization. Consider disabling NUMA with trace flag 8015, and don't forget to also apply trace flag 8048 in order to promote query memory allocation to per-core serialization, eliminating the NUMA node serialization bottleneck. Even if disabling NUMA with trace flag 8015 is not a good idea for your workload (maybe you need more than one lazywriter or you are affinitizing connections to a NUMA node?), I haven't yet found a disadvantage to promoting memory serialization to per-core and removing the bottleneck with trace flag 8048.

Oh yeah, I know about SQL Server 2012 kb hotfixes 2819662 and 2845380. Neither one nor the combination will remedy this issue. Nor will they address the stuff I talk about in the following post:
http://sql-sasquatch.blogspot.com/2013/08/sql-server-numa-servers-cpu-hotspot.html

Here's a bibliography if you want a fuller understanding of task distribution with default SQL Server NUMA behavior, memory management with default SQL Server NUMA behavior, memory allocation with default SQL Server serialization, etc. Or maybe you want to double check that I actually know what I'm talking about :) You could also follow the tags to my other posts on NUMA, TF8048, TF8015 :)

Selected SQL Server 2012 NUMA and Memory Management Hotfixes

kb2819662: SQL Server Performance Issues in NUMA Environments

http://bit.ly/Z9C25x

kb2845380: You may experience performance issues in SQL Server 2012

http://bit.ly/13LADVq

Selected Trace Flag 8048 Sources

CSS SQL Server Engineers - SQL Server 2008/2008 R2 on Newer Machines with More Than 8 CPUs Presented per NUMA Node May Need Trace Flag 8048

http://bit.ly/15Wjdmn

CSS SQL Server Engineers - CMemThread and Debugging Them

http://bit.ly/X9BsDW

Selected SQL Server NUMA Sources

CSS SQL Server Engineers - SQL Server (NUMA Local, Foreign and Away Memory Blocks)