Monday, February 6, 2017

Performance Problem PLUS: A Case of Absurd SQL Server Disk Service Times

An interesting case of perfmon captures came to me last week.  The original question was "What is causing the intermittent extremely high disk latency?"  The graph immediately below shows read and write service times, measure in the VM guest in 30 second perfmon captures.  The guest is a 4 vcpu VM in ESXi 5.5, running SQL Server 2012.

The blue is the read service time average... in full seconds!  And the red is write service time average - again in full seconds!!

That's some astoundingly high disk latency!!  Let's walk through some typical analysis to see if a potential cause becomes clear.

First I see that a single outlier in read service time results in the difference between the sec/read axis label and the sec/write axis label.  By eliminating that outlier, I can graph both read and write service times on the same scale. With read and write latency on the same scale, its more evident that read and write latency are typically elevated in the same perfmon intervals.  Some performance problem spots can cause read and write service time elevation, some will only trouble writes, and some will only trouble reads.  (It's trickier than seems at first glance to recognize these as it sometimes requires splitting graphs for activity vs latency per WIndows logicalDisk or physicalDisk - or even aggregating those numbers for an individual adapter port/vHBA.)  But in this case, it looks like both read and write performance are disrupted by something (or maybe somethings).  

The writes from this server are going to external storage with write cache.  Read performance problems are almost always accompanied by write performance problems in this sample.  So I'll continue by focusing on the write behavior.

Perhaps write throughput in IOPs or bytes/sec is correlated to the write service time?  Nope, doesn't appear to be.

What about total IOPs or total bytes/sec?  Hmmm.  Those look slightly better... maybe.

Now - what if I had looked just at the busiest times to try to diagnose the issue?  The way that the yellow-shaded areas look below, if I narrowed my scope too early to busy disk times, I may certainly have pursued some type of performance issue related to total throughput.

But, looking back again at the entire timeperiod, total bytes/sec doesn't do much to help understand the green-shaded portion - where disk throughput is very low but disk latency is intermittently intolerable. 

What about disk queuing?  That's one of my favorites to blame! Nope.  Not this time.

Now, to be really thorough, each of those cases should be considered on a Windows volume basis.  I won't drag you through that today, though :-)

And of course, noisy neighbor is always a possibility.  Could be at ESXi host level, or at shared storage due to another tenant there.  Could even be intermittent errors from something in the server or outside the server but in the storage path.  With service times *this* high, some type of error condition and SCSI timeouts would actually be a solid suspicion. But that's not where we're going today. either.

OK.  Let's look at write latency against CPU utilization on the VM.  Hmmm.  Seems I could split that into a very idle period of time until about 2:45 am, a busy time from 2:45 am to about 4:00 am, and another very idle time from 4:00 am on.

I'll actually split those graphs apart.  Here's 12:30 pm on February 1 to 2:44 am on February 2.

Here's 2:45 am to 4:00 am.

And here's 4:00 am onward.  That's an interesting wave in the graph - reminds me of "The Outer Limits" intro.

Let's zero in on a chunk of time from the first period - a little more than 2 hours.  Now that we look more closely.... hey!!!  got an "Outer Limits" wave here, too.  I look at a *lot* of perfmon data. In my professional opinion, that's weird.

OK, zoom in a little more, on those latency Alps.

Let's pick a peak.

Yep, that's totally weird.  Does it make any more sense if we look at individual vcpus?  (Remember - in your VM everything is software.  The scsi adapter is emulated within the VM, the NIC is emulated in the VM, etc) 

Rats.  That was a really good idea, too.  Nothin'.  Or....

(I want to call attention to the change in maximum axis value for CPU utilization.  I nearly *always* use 100% as the maximum value, In these last two graphs I've used a maximum value of 24% because utilization of each of the 4 vcpus is really low and I felt it made the graphs slightly more readable.)

Hey!! Perfmon was collecting 30 second samples.  But two intervals after 2:06:22 is 2:07:52 rather than 2:07:22.  The system lost 30 seconds in the span of a minute.  And two intervals past 2:07:52 is 2:11:22 - the system lost 150 seconds in a minute?

Here's an important maxim: perfmon is only as reliable as the system clock of its accounting.

Its pretty rare these days for a physical server to drift much.  Occasionally time may be corrected after crossing a tolerance threshold.  But this vm seems to have a system clock that is often slowing way down - and then being abruptly corrected.  The abrupt corrections result in lost time.  There are a number of reasons time on a system may drift.  In this case, I suspect that the VM is syncing time with the host rather than syncing time with an NTP source.  Maybe time is drifting even on the ESXi host?

If I find out the exact nature of the system clock problem here, I'll update the post.  Usually its enough to nudge the virtualization admins about the system clock - so I may not hear back once its fixed.

I titled this post "Performance Problem PLUS" intentionally.  This VM still won't have great disk service times once the system clock is fixed - but the extreme outliers should be gone.  It would be flying blind to try to characterize a performance problem from perfmon if the clock can't be trusted.  So... fixing any remaining performance issues will have to be a story for another day.  


Some extra reading in case you happen on one of these cases yourself...

Timekeeping in VMware Virtual Machines
VMware vSphere 5.0, Workstation 8.0, Fusion 4.0