Thursday, August 4, 2016

Modeling SQL Server Behavior for Performance Intervention: Feed the CPUs

Before recommending any changes on a SQL Server system, I like to collect a number of days worth of perfmon data.  -- I also like to ask if there are particular windows of time or particular queries that are performance concerns.  That can help prevent me from spending lots of time on time windows or queries that are considered low value :-) But that's a story for another day.

When reviewing perfmon data with the intent of finding system optimizations, I like to say:
"It helps to have expected models of behavior, and a theory of performance."

I don't say that just to sound impressive - honest (although if people tell me it sounds impressive I'll probably say it more frequently).  There are a lot of thoughts behind that, I'll try to unpack them over time.

When I work with SQL Server batch-controlled workflows, I use the theory "feed the CPUs".  That's the simplest positive adaptation I could come up with of Kevin Closson's paradigm "Everything is a CPU problem" :-)

What I mean by "Feed the CPUs" is that memory and disk response times are primary factors determining the maximum rate for the CPUs to process the data.  Nuts & bolts of such a model for SQL Server are slightly different than a similar model for Oracle.  SQL Server access to persistent data is always through database cache, while Oracle uses shared access to database cache in SGA and private access to persistent data through direct access in PGA.

Here's where I'll throw in a favorite quote.

"Essentiallyall models are wrongbut some are useful."
Box, George E. P.; Norman R. Draper (1987)
Empirical Model-Building and Response Surfaces, p. 424
Wiley ISBN 0471810339

The workflows I work with most often are quite heavy on disk access.  The direction of change of CPU utilization over time will usually match the direction of change in read bytes per second.  So if you want to improve performance - ie shorten the elapsed time of a batch workload on a given system - Feed the CPUs.  Increase average disk read bytes per second over the time window.

That's a very basic model.  It ignores concepts like compression.  In my case, all tables and indexes should be page compressed.  If not I'll recommend it :)  It kinda sidesteps portions of the workload that may be 100% database cache hit.  (Although I consider that included in the "usually" equivocation.)  It definitely doesn't account for portions of the workload where sorting or other intensive work (like string manipulation) dominates CPU utilization.  Still, a simple model can be very useful.

Here's a 2 hour time window from a server doing some batch ETL processing.  You can kinda see how CPU utilization trends upward as read bytes per second trends upward, right?  I've highlighted two areas with boxes to take a closer look.

Let's look at the 8:15 to 8:30 am window first.  By zooming in on just that window & changing the scale of the CPU utilization axis, the relationship between CPU utilization and read bytes/second is much easier to see.  That relationship - direction of change - is the same as other parts of the 2 hour window where the scale is different.  Depending on the complexity of the work (number of filters, calculations to perform on the data, etc) with the data, the amount of CPU used per unit of data will vary.  But the relationship of directionality* still holds.  I'm probably misusing the term 'directionality' but it feels right in these discussions :-)

All right.  Now what about the earlier time window?  I've zoomed in a bit closer and its 7:02 to 7:07 that's really interesting.

Assuming that SQL Server is the only significant CPU consumer, and assuming that that whole 5 minute time isn't dominated by spinlocks (which at ~40% CPU utilization doesn't seem too likely), the most likely explanation is simple: 5 minutes of 100% cache hit workload, no physical reads needed.  That can be supported by looking at perfmon counters for page lookups/second (roughly logical IO against the database cache) and the difference between granted memory and reserved memory (which is equal to the amount of used sort/hash memory at that time).

OK... well, that's all the time I want to spend justifying the model for now :-)

Here's the original graph again, minus the yellow boxes.  There's plenty of CPU available most of the time.  (No end users to worry about on the server at that time so we can push it to ~90% CPU utilization comfortably.) Plenty of bandwidth available.  If we increase average read bytes/sec we can increase average CPU utilization.  Assuming that the management overhead is insignificant (spinlocks, increased context switching among busy threads, etc), increasing average CPU utilization will result in a corresponding decrease in elapsed time for the work.

All right.  So next we can investigate how we'll increase read bytes per second on this system.

Ciao for now!

No comments:

Post a Comment