I talk a lot about hotspots, and usually it's about storage. Balanced resource utilization typically provides the most predictable performance, and that's what I shoot for by default.
It's important to realize that hotspots are not only a storage resource concept - they can apply to CPU and other resources as well. In fact, for SQL Server on NUMA servers, there is a CPU hotspot liability by default.
Imagine you have a server with 8 NUMA nodes and 6 cores on each NUMA node (think old-school AMD: 12 cores per socket, 2 NUMA nodes per socket, one logical CPU per physical core). Parallel queries start stacking up their workers. Once the 6 cores on NUMA node 0 have enough active worker threads to stay busy, it would seem like a good idea to start giving some threads to the other cores... right? Once those 6 cores are all at or near 100% CPU utilization, any additional active threads just spread the available cycles among a larger number of workers, with each of them accomplishing less than before. (The growing number of context switches also represents a growing management cost, so fewer of the total CPU cycles are even available to divide among the growing number of threads.) And that's exactly what happens in SQL Server by default. Hotspot jambalaya!
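The diminishing returns above can be sketched with a toy model (my own illustration, not any SQL Server internal formula): once a core is saturated, each extra runnable thread just splits the same cycles, and context switching skims some off the top.

```python
# Toy model of CPU oversubscription on one saturated core (illustrative only).
def per_thread_throughput(cycles, threads, switch_cost):
    """Cycles each thread gets per interval, after context-switch overhead.

    switch_cost models the management cost, which grows with thread count.
    """
    usable = cycles - switch_cost * threads
    return max(usable, 0.0) / threads

# One core with 1000 "cycles" per interval and a small per-thread switch cost:
one = per_thread_throughput(1000, 1, 5)            # 995.0 cycles for a lone thread
six = per_thread_throughput(1000, 6, 5)            # ~161.7 cycles each
forty_eight = per_thread_throughput(1000, 48, 5)   # ~15.8 each - more threads, less each

# Note the total accomplished also shrinks as overhead grows:
# 6 * six == 970.0, but 48 * forty_eight == 760.0
```

Adding threads past saturation doesn't just slow each worker down proportionally; the aggregate throughput of the core drops too.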
Here's the deal: client connections are round-robin distributed among NUMA nodes. But when a connection issues a parallel query, there is a strong tendency for all of the query's workers to remain on that connection's NUMA node. Hmmmm.... so on my 8 NUMA node server, if a single query spins up a worker thread per logical CPU (48 of them), each of the 6 cores in one NUMA node gets 8 active worker threads - while the other 42 cores get no work to do for the query. Well.... shoot....
See - this is a huge part of the recommendation to set MAXDOP to the number of cores in the NUMA node... or half of the cores in the NUMA node... or...
But that's a losing game, let me tell you: a query with multiple data paths can easily end up right back where you started. If all queries were simple, with a single stream of processing, setting MAXDOP to the number of logical CPUs in a NUMA node might make sense - each parallel query would result in one thread per logical CPU. But complex, multi-stream queries toss that idea out the window. Run a stored procedure or complex query with 8 separate processing streams and you are right back where you started: 8 active worker threads per logical core on one NUMA node, and no work associated with that query for anyone else.
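The arithmetic is worth spelling out. With the node sizes above (my hypothetical 8x6-core box) and the "MAXDOP = cores per node" advice, a plan with 8 concurrent parallel branches lands the same pile of workers on the connection's home node:

```python
# Hypothetical server from the example above: 8 NUMA nodes, 6 cores each.
cores_per_node = 6
maxdop = cores_per_node      # the "MAXDOP = cores per NUMA node" recommendation
streams = 8                  # separate parallel processing streams in one plan

# Each stream gets its own set of DOP workers, all tending to the home node:
workers_on_home_node = streams * maxdop                     # 48 workers
workers_per_core = workers_on_home_node // cores_per_node   # 8 per core - again
```

Capping MAXDOP changed nothing for this query shape; the hotspot just takes one more multiplication to arrive.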
If that type of activity takes place on a DSS or DW system with high concurrency of batched queries, query throughput can be horrible. Check it out. I'm sorry it's ugly. The numbers in the graph are from a 48 physical core system with 8 NUMA nodes and MAXDOP set to 6. I averaged the CPU busy along NUMA node boundaries; the results below are for Node0 through Node7. While some NUMA nodes are maxed out, others are loafing.
This isn't a secret, although almost no one ever talks about it. There are details about task assignment at the following locations.
How It Works: SQL Server 2005 Connection and Task Assignments
http://blogs.msdn.com/b/psssql/archive/2008/02/12/how-it-works-sql-server-2005-connection-and-task-assignments.aspx
NUMA Connection Affinity and Parallel Queries
So... what to do, what to do?
Well, SQLCAT pursued one avenue. They wrapped queries in a "terminate - reconnect" strategy.
Resolving scheduler contention for concurrent BULK INSERT
http://sqlcat.com/sqlcat/b/technicalnotes/archive/2008/04/09/resolving-scheduler-contention-for-concurrent-bulk-insert.aspx
I'm not very fond of that solution, but it is an acknowledgement of the problem, if nothing else :)
Instead, for the DSS systems and workloads I work with, I recommend disabling NUMA support at the database level with startup trace flag 8015. All schedulers are then treated as a single pool - no preferential treatment for parallel query workers along NUMA node boundaries. (All memory is treated as a single pool as well, which leads to a separate set of benefits for the workloads I deal with. Guess I'll have to return to that another day.)
It's important to note that on these systems we already use trace flag 8048 to eliminate spinlock contention among schedulers in the same group during concurrent memory allocation. Disabling NUMA with TF8015 increases the number of schedulers in the group - and if you haven't also put TF8048 in place, enough concurrent activity could invoke a spinlock convoy. That would NOT be cool.
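For reference, a startup trace flag goes on the SQL Server service as a `-T` startup parameter (via SQL Server Configuration Manager, or appended to the existing `-d`/`-e`/`-l` parameter string on older versions). A sketch of the two entries discussed here, in the order of precaution I'd want them considered:

```
-T8048
-T8015
```

Both require a service restart to take effect, which is part of why I treat this as a deliberate, workload-specific configuration choice rather than a casual tweak.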
Anyway... time for the big finish. Here's another graph that's better than nothing, although I'm still not completely happy with it. It takes forever to explain what is being measured... but at least it makes the difference very evident :)
So, the blue line is the control workload - trace flag 8048 is in place to eliminate spinlock contention, but database NUMA support is enabled. The Y axis counts occurrences of qualifying 15-second interval perfmon samples. The X axis is the difference between the CPU% of the busiest NUMA node and the average of the remaining 7 NUMA nodes. The graph displays the level of balance in CPU utilization across NUMA nodes: the more balanced a workload's CPU use, the more samples land in the lower end of the graph. See - the control workload is not very balanced. The most popular level of difference was 68%, meaning the busiest NUMA node was 68% busier than the average of the remaining 7 nodes in well over 600 samples (more than 2.5 hours).
Contrast that with the same workload on the same system, with trace flag 8015 added to trace flag 8048 to disable database NUMA support. Now the most popular position is about a 4% difference, with over 2.5 hours spent at that level.
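The graph metric, reconstructed as a sketch (the function name and sample values here are my own; the original perfmon captures aren't shown): for each 15-second sample, take the busiest node's CPU% minus the average CPU% of the remaining nodes.

```python
# Imbalance metric from the graphs: busiest NUMA node's CPU% minus the
# mean CPU% of the other nodes, per perfmon sample interval.
def numa_imbalance(node_cpu_pct):
    """Return busiest node CPU% minus the average of the remaining nodes."""
    ordered = sorted(node_cpu_pct, reverse=True)
    busiest, rest = ordered[0], ordered[1:]
    return busiest - sum(rest) / len(rest)

# A control-style sample: one node pegged while the others loaf.
skewed = numa_imbalance([100, 40, 35, 30, 30, 25, 20, 20])  # ~71.4

# A TF8015-style sample: all 8 nodes equally busy.
balanced = numa_imbalance([70, 70, 70, 70, 70, 70, 70, 70])  # 0.0
```

Bucketing that value per sample and counting occurrences per bucket reproduces the shape of the two curves: the control workload piles up far to the right, the TF8015 workload near zero.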