Monday, February 10, 2014

SQL Server: looking for disk errors and timeouts

Error conditions and timeouts can cause confounding performance and resource utilization conditions.  In many cases, utilization of different resources tend to trend upward together (most of the workloads I am concerned with have a strong positive correlation between CPU utilization and disk throughput).  There are also many cases of different metrics for the same set of resources with a predictable relationship to each other: as IOPs increase, disk bytes/second generally increases.  On a database system using HDDs, as read IOPs and read bytes/second increase, read latency tends to increase.

However, disk related errors and timeouts can cause performance conditions that just don't make sense when the normal graphs for bytes/second vs latency are consulted.  Normal relationships can be thrown asunder with extremely high average latency, even at very low total reads or writes per unit time.

On the IBM AIX operating system, I use the iostat utility all the time to make sense of disk IO.  (Typically 'iostat -DlRTV', and I use a little awk script to convert all of the suffixes for kb, mb, etc, and seconds to strictly numeric values).  Fortunately, that utility contains columns for timeouts and errors.  In fact, the timeouts and errors encountered for reads are nicely separated from those for writes.  I rarely have to consult those numbers... but when they are needed boy do they come in handy.

I don't know of an easy way in Windows to get the same information.  Emulex and QLogic fibre channel HBAs both allow setting various levels of activity logging.  Storage manufacturer multipathing software, like EMC PowerPath and Hitachi hdlm may have facilities for tracking errors and timeouts.

Today I was looking at perfmon with a colleague, and on a given system with very low levels of disk activity there were periodic spikes into average read or write latency above 10 seconds.  Wow.

We scrolled through perfmon (15 second collection interval) logical disk reads/sec and writes/sec numbers together with sec/reads and sec/write.  In under a minute I found what I was looking for: instances of single write or read requests to a single drive in the interval (as indicated by the operations per second at roughly .067 per second for that drive) with latency really, really close to 60 seconds.  (Most of these were 59.8 seconds and above).

The first trick to know is that the reads/sec and writes/sec reported by perfmon are 'io_dones' rather than 'io_starts'.  If a read required about 59 seconds to complete and perfmon is collected in 15 second intervals, its completion is counted in the interval it completes and its full latency is used in the calculation of the average latency for that period.

The nearly 60 second latency was caused by disk timeouts.  Maybe it would be more accurate to say that waits LONGER than 60 seconds were spared by the timeouts?  After 60 seconds, the read or write request would be issued again - and if failover is being used for the multipathing algorithm, the other path will be used.

In once case we saw 4 read IOs with an average of 75 seconds/read(.267 reads/sec for the 15 second interval).  How does that happen?

The first read to be issued timed out at 60 seconds... it was resubmitted.  At about the same time as the resubmit, 3 additional reads were made.  Now there are 4 outstanding reads.  All 4 of them timed at at about 60 seconds.  All 4 were resubmitted, and satisfied almost immediately.

But one of them had a total latency of about 120 seconds.  Three of them had total latency of about 60 seconds each.  All four of them completed in the same 15 second interval - an aggregate total of 300 seconds latency for 4 reads = 75 seconds average read latency.

OK.  Sure, anyone can play with numbers :)

How can we help to actually diagnose a problem like that?  Well... engage your storage administrator.  Timeout and error events should be visible within the storage itself.  But, the timeout may not be originating there... it could be a switch in the middle.  (Storage admin will be able to find out the end-to-end topology and switches have timeout and error logging, too.)  The storage administrator will probably also know if the multipathing software can report errors and timeouts.  Either the storage administrator or the WIndows server admin can help find out if logging on the server HBA will be helpful.

 What if you've got to go to those folks with chapter and verse before they will help diagnose and correct a potential problem?  Today I played around with the query below.  Nothing too fancy... just looking for pending IO that has been outstanding for more than 15 seconds.  Well... there is one little trick to it.

Turns out there are some types of IO which can amass massive amounts of io_pending_ms_ticks and its not really a problem. 
Bob Dorr describes that in the following CSS SQL Server Engineers post.
How It Works: sys.dm_io_pending_io_requests

The trick to know is that the "outstanding for a really long time but not a problem" IO will have a filehandle that doesn't exist in  sys.dm_io_virtual_file_stats.  Excellent!  I was going to CROSS APPLY to that old friend anyway, in order to get physical_name.

If you run into a case where you suspect disk timeouts or other recoverable error conditions, you can shoehorn a query like this one to run as a SQL Agent job, maybe every 30 seconds.  I used a 15000 ms threshold just because that's what SQL Server may alert on in the error log.  Use a nice big threshold, send the results to a table... you'll know the file and offset and you'll know you had a delayed IO to that location.  Diagnostics from multipathing, HBA, or the storage array will probably still be needed.  But at least this way you can gather a little information up front.  

   sys.dm_io_pending_io_requests AS pio
   sys.dm_io_virtual_file_stats(NULL,NULL) file_stats
   file_stats.file_handle = pio.io_handle
   pio.io_pending_ms_ticks > 15000

Sunday, February 2, 2014

Asynchronous Replication does NOT mean decoupled from Primary System Perf/Scalability

 A growing number of the organizations that I work with are implementing robust disaster recovery plans for SQL Server and Oracle projects.  For that, I'm glad.  There is a misconception that I want to address: asynchronous replication is NOT without performance and scalability considerations for the primary system.  In the last few weeks, I've worked with database systems using enterprise storage arrays and asynchronous replication from two different vendors.  In both cases there was primary system impact (in one case performance degradation for the primary database host, in the other high system resource utilization degraded performance for many tenants of shared resources).  In both cases there was surprise that asynchronous replication could result in performance or scalability drag on the systems involved.

I'll refer to EMC SRDF/A documentation here... not because this problem is unique to SRDF/A or even more frequently encountered on SRDF/A than other array-level replication.  No - the reason I'll reference SRDF/A here is rather that at 412 pages the document I'll refer to is one of the most complete you'll find on asynchronous replication.  EMC takes SRDF/A and SRDF/S extremely seriously.  Even if you are implementing a different asynchronous replication mechanism, taking a look at this resource can help identify the various design and workload components that should be considered for asynchronous replication to be successful within project goals.   

EMC SRDF/A and SRDF/A Multi-Session Consistency on UNIX and Windows TechBook Version 1.7
Part Number H2554.7; 7 mb pdf (2010)

The Symmetrix Cache Management (Pages 254-256) section includes passages about both logical device and system-wide write pending limits.  This document was written with Enginuity 5671 in mind.  The limits themselves may change with different versions of Enginuity, and methods of cache and bandwidth resource management/partitioning may change over time as well.  The key is: since SRDF/A is cache based asynchronous replication, sustained high write pending on the source OR target system can result in throttled write activity for the primary system host.

The coupling of primary and secondary system performance and scalability during asynchronous replication is not unique to SRDF/A among array based mechanisms.  HUR replication (Hitachi Universal Replicator) asynchronous replication also has a maximum tolerance, as does asynchronous replication in SVC/V7000 configurations.  Database level asynchronous replication has similar limitations, although the system resource considerations are different than for array-based replication.

All asynchronous replication relies on buffering of primary system activity on the primary before transmission to the secondary system, and buffering on the secondary system before applying/destaging write contents.  In some SAN replication facilities such as SRDF/A, the primary system buffering before transmission and the secondary system buffering before application take place in cache.  Other replication facilities, such as SRDF/AR or database level asynchronous replication such as Oracle Data Guard Maximum Performance or SQL Server Always On Asynchronous Commit, use file system or disk based buffering on the primary and/or secondary system.

Whether activity is buffered in cache or on disk, the dilemma remains: once the buffer resource is fully consumed, should replication cease - or should the primary system write activity be throttled in order to allow the recovery mechanism to recover?

Some asynchronous replication mechanisms allow a choice of response to full buffer resource conditions.  Some don't.  

If you implement asynchronous replication, understand the system considerations for the buffer resources on primary and secondary, as well as the consequences for replication and throttling as buffer resources saturate.