Monday, February 10, 2014

SQL Server: looking for disk errors and timeouts

Error conditions and timeouts can cause confounding performance and resource utilization conditions.  In many cases, utilization of different resources tend to trend upward together (most of the workloads I am concerned with have a strong positive correlation between CPU utilization and disk throughput).  There are also many cases of different metrics for the same set of resources with a predictable relationship to each other: as IOPs increase, disk bytes/second generally increases.  On a database system using HDDs, as read IOPs and read bytes/second increase, read latency tends to increase.

However, disk related errors and timeouts can cause performance conditions that just don't make sense when the normal graphs for bytes/second vs latency are consulted.  Normal relationships can be thrown asunder with extremely high average latency, even at very low total reads or writes per unit time.

On the IBM AIX operating system, I use the iostat utility all the time to make sense of disk IO.  (Typically 'iostat -DlRTV', and I use a little awk script to convert all of the suffixes for kb, mb, etc, and seconds to strictly numeric values).  Fortunately, that utility contains columns for timeouts and errors.  In fact, the timeouts and errors encountered for reads are nicely separated from those for writes.  I rarely have to consult those numbers... but when they are needed boy do they come in handy.

I don't know of an easy way in Windows to get the same information.  Emulex and QLogic fibre channel HBAs both allow setting various levels of activity logging.  Storage manufacturer multipathing software, like EMC PowerPath and Hitachi hdlm may have facilities for tracking errors and timeouts.

Today I was looking at perfmon with a colleague, and on a given system with very low levels of disk activity there were periodic spikes into average read or write latency above 10 seconds.  Wow.

We scrolled through perfmon (15 second collection interval) logical disk reads/sec and writes/sec numbers together with sec/reads and sec/write.  In under a minute I found what I was looking for: instances of single write or read requests to a single drive in the interval (as indicated by the operations per second at roughly .067 per second for that drive) with latency really, really close to 60 seconds.  (Most of these were 59.8 seconds and above).

The first trick to know is that the reads/sec and writes/sec reported by perfmon are 'io_dones' rather than 'io_starts'.  If a read required about 59 seconds to complete and perfmon is collected in 15 second intervals, its completion is counted in the interval it completes and its full latency is used in the calculation of the average latency for that period.

The nearly 60 second latency was caused by disk timeouts.  Maybe it would be more accurate to say that waits LONGER than 60 seconds were spared by the timeouts?  After 60 seconds, the read or write request would be issued again - and if failover is being used for the multipathing algorithm, the other path will be used.

In once case we saw 4 read IOs with an average of 75 seconds/read(.267 reads/sec for the 15 second interval).  How does that happen?

The first read to be issued timed out at 60 seconds... it was resubmitted.  At about the same time as the resubmit, 3 additional reads were made.  Now there are 4 outstanding reads.  All 4 of them timed at at about 60 seconds.  All 4 were resubmitted, and satisfied almost immediately.

But one of them had a total latency of about 120 seconds.  Three of them had total latency of about 60 seconds each.  All four of them completed in the same 15 second interval - an aggregate total of 300 seconds latency for 4 reads = 75 seconds average read latency.

OK.  Sure, anyone can play with numbers :)

How can we help to actually diagnose a problem like that?  Well... engage your storage administrator.  Timeout and error events should be visible within the storage itself.  But, the timeout may not be originating there... it could be a switch in the middle.  (Storage admin will be able to find out the end-to-end topology and switches have timeout and error logging, too.)  The storage administrator will probably also know if the multipathing software can report errors and timeouts.  Either the storage administrator or the WIndows server admin can help find out if logging on the server HBA will be helpful.

 What if you've got to go to those folks with chapter and verse before they will help diagnose and correct a potential problem?  Today I played around with the query below.  Nothing too fancy... just looking for pending IO that has been outstanding for more than 15 seconds.  Well... there is one little trick to it.

Turns out there are some types of IO which can amass massive amounts of io_pending_ms_ticks and its not really a problem. 
Bob Dorr describes that in the following CSS SQL Server Engineers post.
How It Works: sys.dm_io_pending_io_requests

The trick to know is that the "outstanding for a really long time but not a problem" IO will have a filehandle that doesn't exist in  sys.dm_io_virtual_file_stats.  Excellent!  I was going to CROSS APPLY to that old friend anyway, in order to get physical_name.

If you run into a case where you suspect disk timeouts or other recoverable error conditions, you can shoehorn a query like this one to run as a SQL Agent job, maybe every 30 seconds.  I used a 15000 ms threshold just because that's what SQL Server may alert on in the error log.  Use a nice big threshold, send the results to a table... you'll know the file and offset and you'll know you had a delayed IO to that location.  Diagnostics from multipathing, HBA, or the storage array will probably still be needed.  But at least this way you can gather a little information up front.  

   sys.dm_io_pending_io_requests AS pio
   sys.dm_io_virtual_file_stats(NULL,NULL) file_stats
   file_stats.file_handle = pio.io_handle
   pio.io_pending_ms_ticks > 15000

No comments:

Post a Comment