Friday, May 23, 2014
Adventures with SQL Server & perfmon Part 1: Resource Governor & Query Memory
CPU cores are the licensing metric for SQL Server. So, some of my goals are to keep 'em busy, and to make sure they are busy with meaningful work. :-)
I always look at graphs of CPU busy against logical IO, regardless of the database platform, to get an idea of performance and resource utilization. Perfmon supplies most of what I look at for SQL Server, although sometimes I scrape the DM views/tables/functions/whatsits and sometimes I have to resort to extraordinary measures (like monitoring in-memory transaction log buffers).
The system in this post has 48 logical CPUs. Resource governor is enabled, and workload group RG1 is associated with resource pool RG1. There's more to the Resource Governor config... maybe I'll get into the goals and the rest of the config another day.
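The actual config isn't shown here, but for orientation, a minimal pool-plus-group setup along these lines is the usual shape (the names RG1 match this post; every limit and the classifier rule below are placeholders, not this system's real config):

```sql
-- Hypothetical sketch: a resource pool and workload group both named RG1.
-- The limits and classifier logic on the real system are not shown in this post.
CREATE RESOURCE POOL RG1
    WITH (MAX_MEMORY_PERCENT = 50);               -- placeholder limit

CREATE WORKLOAD GROUP RG1
    WITH (REQUEST_MAX_MEMORY_GRANT_PERCENT = 25)  -- placeholder per-query grant cap
    USING RG1;
GO

-- The classifier runs at login and routes each session to a workload group.
CREATE FUNCTION dbo.rg_classifier() RETURNS sysname
WITH SCHEMABINDING
AS
BEGIN
    -- Placeholder rule: route one login to RG1, everything else to default.
    IF SUSER_SNAME() = N'batch_user'
        RETURN N'RG1';
    RETURN N'default';
END;
GO

ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.rg_classifier);
ALTER RESOURCE GOVERNOR RECONFIGURE;
```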
The graph below shows two types of time periods that concern me. In green are time periods of very high CPU utilization by SQL Server, but relatively low logical IOs, or buffer page lookups. I want to investigate those areas for "management overhead" - maybe spinlock or latch contention that can be avoided somehow. On the systems I watch, that can be especially fruitful. That's because in general, query compile/recompile total cost is typically low, and there isn't much XML shredding or other high CPU-cost work within most of the queries. But let's save spinlock and latch contention hunting for another day. :-) Sorry @sqltrooper!!
In orange are time periods of fairly low CPU utilization. I know the workload context - this was all part of a batch workload that should have a fairly constant number of concurrent requests/queries executing at all times. That can be verified, among other ways, by checking activity logs for the client(s) that submit the queries. So if there is low CPU utilization at some point... maybe query concurrency is lower than expected, or maybe the parallelism of the queries is lower than expected (e.g., maybe some of the queries expected to be parallel are forced to serial).
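Concurrency can also be sanity-checked from the server side. A sketch (my query shape, not from the post) that counts currently executing requests per workload group:

```sql
-- Sketch: count currently-executing requests per workload group,
-- to check whether concurrency in RG1 matches expectations.
SELECT wg.name  AS workload_group,
       COUNT(*) AS active_requests
FROM sys.dm_exec_requests AS r
JOIN sys.dm_resource_governor_workload_groups AS wg
  ON r.group_id = wg.group_id
WHERE r.session_id <> @@SPID      -- exclude this monitoring session
GROUP BY wg.name;
```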
SQL Server %CPU = \Process(sqlservr)\% Processor Time
Buffer Page Lookups = \SQLServer:Buffer Manager\Page lookups/sec
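When perfmon itself isn't handy, the SQL Server-side counter can be scraped from a DMV instead (a sketch; note that \Process(sqlservr)\% Processor Time is an OS-level counter and is not exposed in this view):

```sql
-- Sketch: Page lookups/sec as exposed inside SQL Server.
-- The value is a cumulative count; a per-second rate requires two
-- samples and dividing the delta by the elapsed seconds.
SELECT [object_name], counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name = N'Page lookups/sec'
  AND [object_name] LIKE N'%Buffer Manager%';
```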
When checking query concurrency within a Resource Governor group, the graph below has the items I like to check first.
RG1 Grants (Dark Blue) = \SQLServer:Resource Pool Stats(RG1)\Active memory grants count
RG1 Pending (Light Blue) = \SQLServer:Resource Pool Stats(RG1)\Pending memory grants count
RG1 Qx Mem Target (Dark Red) = \SQLServer:Resource Pool Stats(RG1)\Query exec memory target (KB)
RG1 Qx Mem (Light Red) = \SQLServer:Resource Pool Stats(RG1)\Active memory grant amount (KB)
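Those four counters are also visible in sys.dm_os_performance_counters, keyed by pool name in instance_name, if you'd rather poll them from T-SQL (a sketch):

```sql
-- Sketch: pull the four Resource Pool Stats counters for pool RG1.
SELECT counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE [object_name] LIKE N'%Resource Pool Stats%'
  AND instance_name = N'RG1'
  AND counter_name IN (N'Active memory grants count',
                       N'Pending memory grants count',
                       N'Query exec memory target (KB)',
                       N'Active memory grant amount (KB)');
```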
From 9:00 am to 10:00 am there are a high number of grants, but not much aggregate granted query memory.
Between 10:30 am and 11:00 am, though, the problem I want to talk about today crops up (not that there wasn't a problem from 9:00 to 10:30, too). The granted memory has climbed, even though the number of outstanding grants has dropped. The number of potential queries is still roughly the same... but the pending grants have increased while outstanding grants have dropped. Active query concurrency has dropped.
The granted query memory in this pool peaks at about the same time the query exec memory target drops. That's unfortunate.
The same thing happens for a shorter period of time right around noon. Fewer queries with high total granted query memory, and at peak aggregate query memory the target drops. Pending query memory grants are high throughout, and the peak of pending grants is correlated to the peak in granted query memory and the attendant drop in query memory target.
So... identify the queries running during those time periods. They are asking for a lot of query memory - so much that it's limiting query concurrency and keeping the CPUs from doing as much work as they could.
Are these queries suffering from "eyes bigger than stomach"? In such cases, they request far bigger grants than they would EVER use. Additional help can be pulled in by polling sys.dm_exec_query_memory_grants.
In cases like this, I poll for the following and associate with plan_handles:
ideal_memory_kb - how much would it put in a grant request if RAM on the system were absolutely gargantuan?
max_used_memory_kb - what's the most it's used in this execution of the query?
used_memory_kb - how much is it using right now?
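A polling query along these lines (the column names are real; the shape of the query is my sketch) captures those columns and grabs the statement text and plan at the same time:

```sql
-- Sketch: poll outstanding memory grants and pull statement text and plan.
-- A NULL grant_time marks a request still waiting in the pending queue.
SELECT mg.session_id,
       mg.dop,
       mg.requested_memory_kb,
       mg.ideal_memory_kb,      -- grant request size if memory were unlimited
       mg.granted_memory_kb,
       mg.used_memory_kb,       -- in use right now
       mg.max_used_memory_kb,   -- high-water mark for this execution
       mg.plan_handle,
       st.text AS statement_text,
       qp.query_plan
FROM sys.dm_exec_query_memory_grants AS mg
OUTER APPLY sys.dm_exec_sql_text(mg.sql_handle)    AS st
OUTER APPLY sys.dm_exec_query_plan(mg.plan_handle) AS qp
ORDER BY mg.requested_memory_kb DESC;
```

Queries whose ideal_memory_kb and granted amounts dwarf max_used_memory_kb are the "eyes bigger than stomach" candidates.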
Then I can grab the plan from plan cache, and start poking around. Honestly... I'm not much good for tuning individual queries. :-) I'm a systems guy... at least for now. Sometimes I can tell that a missing index will make all the difference in the world. Sometimes I can tell a scalar udf is causing the ideal memory/memory grant request to explode. If it's a stats problem... maybe I can figure out what the problem is.
But, at any rate, this is how I can find problem queries, then hand them over to folks that are wwwaaayyy better than me at query tuning. With data to show the system cost of current behavior. :-)