Thursday, April 13, 2017

#SQLServer: Waits keeping CPU utilization down? Or Management Cost inflating CPU utilization? Part II

Ed Elliott (@EdDebug on Twitter) today mentioned how expensive disk IO waits - or another common wait - could lead to the sawtooth CPU utilization pattern I was pondering yesterday.

So... before I start looking at some exotic sources of resource utilization/performance issues, I should eliminate some of the more typical wait types as potential causes of depressed CPU utilization during the sawtooth.

Yesterday's blog post:

#SQLServer: Are waits keeping CPU utilization down? Or is Management Cost inflating CPU utilization? Part I
http://sql-sasquatch.blogspot.com/2017/04/sqlserver-are-waits-keeping-cpu.html

And a brief reminder in graph form below (as previously, a 30 second perfmon collection interval). From approximately 1:20 am to 1:35 am, either:
1) there is sufficient thread/query demand to keep CPU utilization high, but CPU utilization is suppressed by some type(s) of wait, or
2) thread/query demand isn't high enough to sustain high CPU utilization, but a management cost is occasionally driving CPU utilization unexpectedly high (e.g. spinlock contention).
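(One rough way to tell those two apart - if you can catch the sawtooth live rather than in perfmon afterwards - is to sample the schedulers.  This is just a sketch against sys.dm_os_schedulers, not anything from the capture above: lots of runnable tasks alongside idle CPU points at option 1; near-empty runnable queues point at option 2.)

-- Sketch only: sample this a few times during the dip.
-- Plenty of runnable_tasks_count with CPU to spare suggests waits/scheduling (option 1);
-- consistently low runnable/current task counts suggest demand just isn't there (option 2).
SELECT scheduler_id,
       current_tasks_count,
       runnable_tasks_count,    -- tasks ready to run, waiting on a CPU
       work_queue_count,        -- queued tasks not yet bound to a worker
       pending_disk_io_count
FROM sys.dm_os_schedulers
WHERE status = N'VISIBLE ONLINE';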



Let's try to eliminate some common wait types.  First up for consideration: resource_semaphore waits, the purgatory of pending query memory grants.  On this type of system I do expect to see pending query memory grants unless the maximum memory grant size has been limited via Resource Governor.

Sure enough - there are pending memory grants, racking up resource_semaphore wait time.  But only after the sawtooth pattern is, for the most part, over.  Still - from 1:40 am to 1:57 am there is idle CPU - converting those pending grants to running queries will improve the performance of this workload by increasing concurrency and using otherwise idle resources.  But... pending query memory grants can't explain the sawtooth.
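(For anyone watching live rather than from perfmon, a quick sketch against sys.dm_exec_query_memory_grants shows the same purgatory - sessions with no grant_time yet are the pending ones.)

-- Sketch: current pending query memory grants (grant_time IS NULL = still queued
-- on the resource semaphore).  wait_time_ms is how long each has been waiting so far.
SELECT session_id,
       requested_memory_kb,
       required_memory_kb,
       queue_id,
       wait_order,
       wait_time_ms
FROM sys.dm_exec_query_memory_grants
WHERE grant_time IS NULL
ORDER BY wait_time_ms DESC;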


Onward!!

This time period is primarily ETL activity.  So... transaction log writes are a potential source of depressing waits (both in the technical and the psychic sense).

Let's take a look.

Transaction log writes have been observed to inject two main wait types into this ETL.  The first is likely the more common one across other systems: log write wait.  This is the wait associated with transaction commit.  The other wait type is log buffer wait.  This is an interesting one.  Each database has 128 in-memory transaction log buffers.  When scheduled to write, they can be written up to 32 at a time.  What if none of the 128 in-memory transaction log buffers are free when a transaction needs to write some txlog records to a buffer in the course of the transaction?  The session in that transaction incurs a log buffer wait - which can end only when one of the in-flight buffer writes to the transaction log completes and the in-memory buffer has been freed to take on new log records.
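(The numbers behind the next couple of graphs presumably come from the SQLServer:Wait Statistics perfmon object.  If you'd rather peek from inside SQL Server, the same counters are exposed in sys.dm_os_performance_counters - a rough sketch below.  Keep in mind these are raw cumulative values; the "per second" and "average" instances need deltas between two snapshots to mean anything.)

-- Sketch: raw log write wait and log buffer wait counters.
-- Take two snapshots and diff them to get rates over an interval.
SELECT RTRIM(object_name)   AS object_name,
       RTRIM(counter_name)  AS counter_name,
       RTRIM(instance_name) AS instance_name,
       cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name LIKE N'%Wait Statistics%'
  AND counter_name IN (N'Log write waits', N'Log buffer waits')
ORDER BY counter_name, instance_name;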

First - the more common log write wait.  Oh yeah - got some log write waits.  But not in the cool yellow box.  Mainly before the box, and a little bit after.


Here's the thing about waits: even if there aren't a huge number of them, if they're humongous they can still drive down CPU utilization.  So let's take a look at average wait time against waits started per second.

I'm glad we took a look.  It's still not the cause of the sawtooth.  But it's interesting to see that in the yellow there is some time charged to log buffer wait, so at least a few of them must have occurred - even though at only a fraction of the rate seen earlier, and at a lower average wait time.


All right.  Let's move on to disk read and write activity against the persistent database files.  Waits on database file reads are a common target for performance optimization, and exorbitant disk read waits can definitely dominate a workload.  There's an important mitigating factor, though, which I rarely see discussed: the fabulous SQL Server readahead.

Here's the thing about page IO latch waits*: there are only two opportunities for them - one a guarantee, the other a maybe.  The guarantee is for a synchronous read - a read without readahead.  Ask for the read right when it's needed by query processing - no choice but to wait.  The maybe is a little trickier: ask for the read ahead of time, then check on it when you need it for query processing.  If it's there when asked for - no wait!  But if it's needed and asked for - and not there yet - gotta wait.  So... that shouldn't happen all the time, and when it does happen the wait time incurred should be somewhat less than the disk read latency for that read.

*this is my own explanation - which could totally be wrong, and I'm open to correction.  If we're going to argue about it, let's argue based on data, though 😂😂😂
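(Speaking of data: a minimal sketch for pulling the cumulative PAGEIOLATCH numbers from sys.dm_os_wait_stats.  Snapshot it twice and diff to get waits started and average wait time over an interval - roughly what the upcoming graphs show from perfmon.)

-- Sketch: cumulative page IO latch waits since instance start.
SELECT wait_type,
       waiting_tasks_count,
       wait_time_ms,
       wait_time_ms * 1.0 / NULLIF(waiting_tasks_count, 0) AS avg_wait_ms
FROM sys.dm_os_wait_stats
WHERE wait_type LIKE N'PAGEIOLATCH%';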

All right.  Funky explanation out of the way, let's look at some graphs.  How can we profile "readahead efficiency"?  Well... here's readahead pages/sec in light blue in front of read pages/sec in dark blue.  So the exposed dark blue is "non-readahead pages" - pages read in synchronous read operations (important to note that both of these perfmon numbers measure 8k pages, not IO operations).  There's guaranteed to be some page IO latch wait time for the read operations which supplied those pages.  And the disk reads that yielded readahead pages?  Well - if they were all blazing fast, or if there were enough runnable threads to make CPU waits "swallow" the potential read wait time for readahead, there'd be no page IO latch wait time.  But if they're not blazing fast and there isn't enough runnable thread pressure to "hide" read wait time, some page IO latch wait time is gonna leak through even for readahead.
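(Those two counters presumably live in the Buffer Manager perfmon object, so a rough "readahead efficiency" check from inside SQL Server might look like the sketch below.  Again, cumulative values - diff two snapshots for rates; readahead pages divided by page reads gives the fraction read ahead of need.)

-- Sketch: the raw Buffer Manager counters behind readahead vs. total page reads.
SELECT RTRIM(counter_name) AS counter_name,
       cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name LIKE N'%Buffer Manager%'
  AND counter_name IN (N'Page reads/sec', N'Readahead pages/sec');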


All right.  Let's look at page IO latch waits started per second, compared to CPU utilization.  Nope - not more waits in the yellow box than outside the yellow box.



But again - it's not only how many waits are started but how long they last.  So let's look...

Well, not only are the wait starts unremarkable in context, but the average wait time is lower inside the yellow box than outside it.



I'm not ready to quit looking at read activity, though.  I made some wacky claims about the relationship of page IO latch wait time to disk read service time.  Let's see if what I said holds water in this context.  In the 2nd graph below, I can see that average read service time topped out around 180ms.  But average page IO latch wait topped out around 120ms.  OK - that's kinda like what I predicted.  Because not all reads will generate waits, and some will generate waits for only part of the read service time, we'd expect average page IO latch wait time to be lower than average read service time when readahead is involved.  But why are there sometimes more page IO latch waits started per second than reads completed per second?  Two reasons: multiple queries interested in the same pages at nearly the same time, and multiple parallel query threads in the same query interested in the same page at nearly the same time :-)
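(The same sanity check can be sketched from DMVs instead of perfmon.  These are averages since instance startup, so they smear out the detail in the graphs, but the relationship should lean the same way - average PAGEIOLATCH wait at or below average read service time.)

-- Sketch: instance-lifetime averages only; not a substitute for interval data.
SELECT
    (SELECT SUM(io_stall_read_ms) * 1.0 / NULLIF(SUM(num_of_reads), 0)
       FROM sys.dm_io_virtual_file_stats(NULL, NULL)) AS avg_read_service_ms,
    (SELECT SUM(wait_time_ms) * 1.0 / NULLIF(SUM(waiting_tasks_count), 0)
       FROM sys.dm_os_wait_stats
      WHERE wait_type LIKE N'PAGEIOLATCH%')           AS avg_pageiolatch_wait_ms;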



Let no graph go to waste.  We've already seen that read times for the persistent database files are almost certainly not to blame for the sawtooth.  But I also like to show that for a workload like this - unless there's a lot of RAM on the server - there's a loose correlation between read bytes/sec and CPU utilization.  Well... except inside that curious yellow box :-)




All right - let's move on from persistent database data file reads, to persistent database file writes.  Just cuz :-)

More writes outside the yellow box, too.  Curiouser and curiouser.


Let's look at the write service time that accompanies those writes...

Yep, this system could use some attention to write latency.  Probably saturated write cache, and likely due to more than just this server on shared storage.  But... the write latency to persistent datafiles definitely doesn't seem to be a suspect for the sawtooth.
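(If you want to chase that write latency down per file rather than per volume, a hedged sketch: cumulative write counts and stall time from sys.dm_io_virtual_file_stats, joined to sys.master_files to keep just the data files.)

-- Sketch: lifetime average write latency per data file, worst first.
SELECT DB_NAME(vfs.database_id)                            AS database_name,
       mf.physical_name,
       vfs.num_of_writes,
       vfs.io_stall_write_ms * 1.0
           / NULLIF(vfs.num_of_writes, 0)                  AS avg_write_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id
 AND mf.file_id     = vfs.file_id
WHERE mf.type_desc = N'ROWS'          -- data files only; transaction logs were covered earlier
ORDER BY avg_write_ms DESC;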



What if it were?  We discussed log buffer waits and log write waits as the waits associated with high write service time or write floods to transaction logs.  But write floods to persistent database files are rarer, and we don't hear very often about the wait conditions associated with them.  The one I see sometimes - depending on write service time to persistent database files and the ability of 'total server memory' to absorb dirty pages before incurring checkpoint or lazy writer activity - is free list stalls.  Yup.  Got some of those.  Lowering write service time and/or increasing RAM will help.  But... as with other stuff we've looked at so far... bupkis in the paint.
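(Free list stalls show up in the Buffer Manager perfmon object; the same cumulative numbers can be pulled from sys.dm_os_performance_counters - sketch below, alongside the neighboring counters that usually get read with them.  As before, the per-second counters need two snapshots and a delta.)

-- Sketch: free list stalls plus related buffer pool pressure counters (raw/cumulative).
SELECT RTRIM(counter_name) AS counter_name,
       cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name LIKE N'%Buffer Manager%'
  AND counter_name IN (N'Free list stalls/sec',
                       N'Lazy writes/sec',
                       N'Checkpoint pages/sec',
                       N'Page life expectancy');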



All right.  We've got resource_semaphore, transaction log, and persistent database file read and write waits off the suspect list for now.  Let's do one more group: tempdb data files.  I separate tempdb data file reads and writes from persistent db datafile reads and writes for a few reasons:
1) if the tempdb activity is only spills, read bytes and write bytes will be one-to-one
2) tempdb spill writes are womper aggressive compared to datafile writes for persistent data files.  Or compared to checkpoint & lazy writes to tempdb.  They're just their own animal :-)
3) tempdb spill writes have their own subset of waits that pertain to them

Maybe you'll remember this graph from yesterday.  Used Tempdb stacked on top of used grant memory.  This graph tells us footprint in tempdb - I use this type of graph a lot.  But it doesn't tell us traffic to tempdb, which is what would indicate potential wait events.

Fortunately on this system, they have the tempdb datafiles isolated to mount points with tempdb in the name :-)  I heartily recommend isolating tempdb datafiles from other stuff at the Windows volume level.  And putting tempdb in the name really helps sorting perfmon. :-)  So... two Windows volumes for tempdb datafiles.  Piece of cake to add the reads/sec and writes/sec for those volumes into a single reads/sec and writes/sec measure.
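(If the tempdb files weren't isolated to their own Windows volumes, a rough fallback is to roll up sys.dm_io_virtual_file_stats for tempdb by the start of each file's path - sketch below.  The LEFT(..., 20) prefix length is just an assumption standing in for the mount point naming; adjust it to whatever distinguishes your volumes.  Cumulative numbers again, so diff two snapshots for per-second rates.)

-- Sketch: tempdb data file IO rolled up by path prefix (stand-in for the Windows volume).
SELECT LEFT(df.physical_name, 20)     AS volume_prefix,  -- assumed prefix length; adjust as needed
       SUM(vfs.num_of_reads)          AS reads,
       SUM(vfs.num_of_bytes_read)     AS read_bytes,
       SUM(vfs.num_of_writes)         AS writes,
       SUM(vfs.num_of_bytes_written)  AS write_bytes
FROM sys.dm_io_virtual_file_stats(DB_ID(N'tempdb'), NULL) AS vfs
JOIN tempdb.sys.database_files AS df
  ON df.file_id = vfs.file_id
WHERE df.type_desc = N'ROWS'          -- tempdb data files only
GROUP BY LEFT(df.physical_name, 20);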

Wow.  Almost no tempdb traffic in the yellow box of wonder.  Now, if I didn't have a hot plate of pasta near me and a family waiting to eat dinner with me, I'd also include read bytes/sec and write bytes/sec to the tempdb Windows volumes.  And sec/read and sec/write.  But I've shown enough to eliminate tempdb from interest for now.  Except for page latch contention - i.e. too few sets of allocation pages in memory to handle the traffic.  I'll handle that last bit maybe tomorrow.


