Monday, December 9, 2013

'All are Punishèd' part 1: QFULL messages

Here's a reason NOT to ignore the perfmon 'LogicalDisk(*)\Current Disk Queue Length' metric: it is among very few ways of diagnosing a particularly punishing performance condition.  (I've got another one that's a slight variation on this theme, coming in a few days.)

Windows allows up to 256 outstanding disk IO operations per host (or VM guest) LUN.  For fibre channel LUNs, the HBA defines a LUN queue depth within that number.  Typical default is a fibre channel LUN queue depth of 32.

That means that at any given time, there may be up to 32 in-flight IO operations in the host LUN service queue, with up to an additional 224 (for a total of 256) in an OS wait queue.  The IO requests in the OS wait queue will go into the service queue as slots open, and the total service + wait queue depth of 256 will keep additional IO requests at bay if need be.
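To make that arithmetic concrete, here's a minimal sketch in Python.  The function and variable names are made up for illustration; the 256 and 32 are just the numbers from the paragraphs above, and a real HBA may be configured differently.

```python
# Numbers taken from the text above; real systems may differ.
OS_LIMIT = 256       # total outstanding IOs Windows allows per LUN
SERVICE_DEPTH = 32   # typical default fibre channel LUN queue depth

def split_outstanding(outstanding, service_depth=SERVICE_DEPTH, os_limit=OS_LIMIT):
    """Split a count of outstanding IO requests into
    (in service queue, in OS wait queue, held back from the queues)."""
    admitted = min(outstanding, os_limit)      # service + wait queue cap
    in_service = min(admitted, service_depth)  # in flight to the LUN
    waiting = admitted - in_service            # parked in the OS wait queue
    held_back = outstanding - admitted         # kept at bay for now
    return in_service, waiting, held_back

print(split_outstanding(20))    # everything fits in the service queue
print(split_outstanding(256))   # service queue full, wait queue full
print(split_outstanding(300))   # some requests are held back
```

With the defaults, 256 outstanding requests split into 32 in service and 224 waiting, which is the picture described above.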

What if there are 32 or more cores on the server with a data-hungry workload, with many threads interested in the same LUN at the same time? Chances of overflowing the service queue depth of 32 are quite high.  (Actually, that's true for SQL Server as soon as there are 8 or more physical cores - with or without hyperthreading - and a busy enough workload... but I digress.)

OK... well, increasing the LUN service queue depth on the Windows server HBA can be fairly easy.  And more parallel in-flight IO should increase throughput - which should allow increased CPU utilization and higher throughput of logical database work, assuming the same data-hungry workload - right?  Sure, latency would be expected to rise... but as long as the application is on the data-hungry side of the spectrum, instead of the latency-sensitive side, everything should be dandy!

Except when it's not.  And the QFULL message is an example of when it is not.

The QFULL message from a storage array is a means of telling the connected server(s) to hold some horses.  In the old days, when command queues for storage array ports or other elements overflowed, it was possible to crash the array OS.  Perhaps that's still possible - I haven't heard of that particular type of failure in quite some time.  But QFULL messages can still be sent from a front end port on a storage array that has a full command queue - or, more likely, when the command queue for an array LU (the array object corresponding to the server host's LUN) is full and additional commands keep coming in.

That's tricky.  In some cases, an array LU has a documented maximum command queue depth.  If documented, the max may be a specific number, or it may be based on the number of underlying storage devices in the LU.

There are two somewhat common ways that a given system can be set up for this trouble.  The first: the LUN queue depth on the server HBA may be deeper than the command queue for the corresponding LU.  The problem wouldn't necessarily be immediately apparent in such a case, but once the LUN service queue length gets long enough that the LU command queue overflows, the array would return a QFULL message.  Then the host activity is at the mercy of how the host OS/guest OS/fibre channel driver responds to the QFULL condition.

For virtual servers, the other path to trouble is when the LUNs presented to several virtual guests come from the same array LU.  Blah.  I'm sure there's a more clear way to say that :) But, imagine a physical ESXi server with 4 guest Windows VMs, each VM using a Windows Q: drive.  Each of those Q drives in this hypothetical is actually being served from the same underlying LUN on the ESXi server.

Bringing LUN queue depth up to 128 on a VMWare host server is pretty common.  Let's leave the guest Q drive queue depth at 32.  There are 4 guests, so the aggregate queue depth is 128.  Now imagine that the storage array involved has a command queue depth of 64 for the LU that becomes the host LUN in question and eventually each of the Q drives for the 4 VM guests.
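Here's a rough sketch of that worst-case arithmetic in Python.  The function name is hypothetical, and it deliberately assumes no IO coalescing at all between the guests and the array, so it gives an upper bound rather than a prediction.

```python
def worst_case_overflow(guest_depths, host_lun_depth, lu_command_depth):
    """Return how many commands could exceed the array LU command queue
    if every guest fills its queue at once and nothing is coalesced."""
    aggregate = sum(guest_depths)            # all guests at full depth
    sent = min(aggregate, host_lun_depth)    # host LUN queue caps what is in flight
    return max(0, sent - lu_command_depth)   # anything past the LU queue is overflow

# The example from the text: 4 guests at 32, host LUN depth 128, LU depth 64.
print(worst_case_overflow([32, 32, 32, 32], 128, 64))
```

For the example above that's 64 commands more than the LU command queue can hold.  Drop the host LUN depth to 64 or below and the worst case overflow goes to zero.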

If the LUN queue depth reaches 32 simultaneously in each of the 4 guests - will it be a problem?  Maybe not.  There is IO coalescing at the VMWare level - so the 128 separate IO requests at the guest levels may very well coalesce to fewer than 64 IO requests in the VMWare host LUN service queue.  The storage array won't get grumpy in that case.

But... what if the 128 outstanding IOs across the 4 guest VMs' Q drives (32 per guest) are random from the perspective of the guests AND cannot be coalesced by VMWare at all?  128 commands will go into the LUN service queue and get sent to the array.  The array in this example will respond with a QFULL message (cuz it's my example).  And in what I consider the worst case scenario, let's say the response is determined by something like the VMWare adaptive throttling algorithm.

Here's a good summary of the VMWare adaptive throttling algorithm:

In that case, the QFULL message would cut the queue depth in half.  Continued QFULL messages could continue to reduce the LUN queue depth.  When congestion clears, the increase in queue depth is not as quick as the decline - the queue depth doesn't double in each 'good' interval until it reaches its previously configured value.  Rather, each 'good' interval sees the queue depth increase by 1.
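A toy simulation of that halve-on-QFULL, plus-one-on-good-interval behavior might look like the Python below.  It follows only the description in this post, not VMWare's actual implementation; the function names and the floor of 1 on the queue depth are my assumptions.

```python
def next_depth(current, configured, qfull):
    """One interval of the throttle as described above:
    halve on QFULL, otherwise creep back up by 1 toward the configured depth."""
    if qfull:
        return max(1, current // 2)      # floor of 1 is an assumption
    return min(configured, current + 1)  # never exceed the configured depth

def simulate(configured, intervals):
    """Replay a list of intervals (True = QFULL seen) and return the depth after each."""
    depth = configured
    history = []
    for qfull in intervals:
        depth = next_depth(depth, configured, qfull)
        history.append(depth)
    return history

# Two QFULL intervals followed by three good ones, starting from 128.
print(simulate(128, [True, True, False, False, False]))
```

Two bad intervals take a configured depth of 128 down to 32.  At plus one per good interval, it takes 96 good intervals to climb back, which is why this is so punishing.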

Now - if VMWare adaptive queue depth throttling is suspected to be taking place, it'll be best to diagnose at the ESX host level.

But, if it's a physical Windows server, and QFULL messages from the array are suspected, there are two possibilities for diagnosis from the server.  Most HBAs will allow an increased level of error logging, and QFULL messages can typically be logged (along with a plethora of other conditions at various logging levels).  But if the Windows LUN service queue depth is known, the perfmon 'LogicalDisk(*)\Current Disk Queue Length' metric can come in really handy.  Especially if you know the storage array LU command queue depth - if the host LUN queue depth is greater than LU command queue depth and perfmon shows a queue length higher than the array command queue length... you can ask the SAN admin if QFULL messages are being sent and you just might be a hero.  Making sure that host LUN queue length stays lower than array LU command queue depth is an example of 'less is more'.
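That check is simple enough to sketch in Python.  This assumes you've already pulled the 'Current Disk Queue Length' samples for one LUN out of a perfmon log into a plain list of numbers; the function name and the sample values are made up.

```python
def suspect_qfull(samples, host_lun_depth, lu_command_depth):
    """Return (sample index, queue length) pairs worth asking the SAN admin about.

    Only meaningful when the host LUN queue depth is deeper than the array
    LU command queue depth; otherwise the host can't overflow the LU queue.
    """
    if host_lun_depth <= lu_command_depth:
        return []
    return [(i, q) for i, q in enumerate(samples) if q > lu_command_depth]

# Hypothetical samples for one LUN: host LUN depth 128, LU command queue depth 64.
queue_lengths = [10, 70, 64, 130]
print(suspect_qfull(queue_lengths, 128, 64))
```

A non-empty result doesn't prove QFULL messages were sent.  It just marks the samples where the host could have had more commands in flight than the LU command queue holds.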

More reading in case you just can't get enough of this:

This one I'll list individually because it's one of my all-time favorite posts, which includes white-board diagrams of the IO stack :)
