Wednesday, May 14, 2014

Windows Wait Queue & Emulex FC HBA Service Queue Depth

So... either I use ASCII art for a few illustrations or it might be another year before any of this appears on my blog :)

Operating systems and storage device drivers track storage IO requests in wait and service queues, in order to properly acknowledge success (for writes) or return data (for reads).


t0 - there are no outstanding IO requests but 3 process threads are ready to submit disk IOs.

The wait queue and the service queue are empty.



\W   R   W/    <=Threads submitting disk IO requests
 \       /

    | |
    | |
    | |
    | |        <= Wait Queue
    | |
    | |
    | |
    | |

  --------
               <= Service Queue
  --------


t1 - the disk IOs are submitted.  Because there were free slots in the service queue, the requests went right into the service queue.
The application threads don't just forget about their IO requests, even if the read or write was asynchronous.  For a synchronous read or write, thread progress is at a standstill until the request is answered.
For an asynchronous read or write, the thread will poll for the answer.  Various activities will have limits on the number of outstanding async IOs they will tolerate while still allowing additional work to be done by the thread.
As an example, the SQL Server transaction log writer will allow 32 inflight writes per transaction log, or a total of 3840 KB of inflight writes across all transaction logs.  (There's a small sketch of that kind of in-flight cap after the diagram below.)
Anyway... the service queue has a "queue depth" but conceptually I usually think of it as wide.

\w   r   w/    <=Threads submitting disk IO requests
 \       /

    | |
    | |
    | |
    | |        <= Wait Queue
    | |
    | |
    | |
    | |

  --------
       WRW     <= Service Queue
  --------
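
The transaction log writer's in-flight cap mentioned above is a common pattern: issue asynchronous IOs, but never let more than N be outstanding at once.  Here's a minimal Python sketch of that pattern - the limit of 32 echoes the per-log figure above, but the function names and the 1 ms sleep are illustrative stand-ins, not SQL Server's actual mechanism.

import asyncio

MAX_INFLIGHT = 32   # hypothetical cap, echoing the per-log in-flight write limit above

async def issue_write(buf):
    # stand-in for a real asynchronous disk write
    await asyncio.sleep(0.001)

async def write_all(buffers):
    gate = asyncio.Semaphore(MAX_INFLIGHT)   # never more than MAX_INFLIGHT writes outstanding

    async def bounded(buf):
        async with gate:
            await issue_write(buf)

    await asyncio.gather(*(bounded(b) for b in buffers))

asyncio.run(write_all([b"\x00" * 4096 for _ in range(100)]))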

t2 - the writes completed more quickly than the read - as is often the case.  The ack of success has already been returned.  So now there is only 1 outstanding IO request - it's the in-flight read.
But - goodness! Query spill!  Lots of threads are getting ready to submit writes to tempdb!

\WWW r  WW/    <=Threads submitting disk IO requests
 \       /

    | |
    | |
    | |
    | |        <= Wait Queue
    | |
    | |
    | |
    | |

  --------
        R      <= Service Queue
  --------


t3 - All of the outstanding IO requests are in the service queue - the newly submitted writes and the hold-over read.
\www r  ww/    <=Threads submitting disk IO requests
 \       /

    | |
    | |
    | |
    | |        <= Wait Queue
    | |
    | |
    | |
    | |

  --------
    WWWWRW     <= Service Queue
  --------

t4 - Thank goodness!  All of the outstanding IOs have completed.  Back to normal, and all quiet for disk IO.

\         /    <=Threads submitting disk IO requests
 \       /

    | |
    | |
    | |
    | |        <= Wait Queue
    | |
    | |
    | |
    | |

  --------
               <= Service Queue
  --------



But there's always another day!

Let's join this day in progress - due to a few concurrent parallel queries there are plenty of reads!


t0 - the service queue is full of reads - with more in the pipeline ready to be submitted!
\RrR rR Rr/    <=Threads submitting disk IO requests
 \rrrRrrR/

    | |
    | |
    | |
    | |        <= Wait Queue
    | |
    | |
    | |
    | |

  --------
  RRRRRRRR     <= Service Queue
  --------

t1 - an instant later, and the app threads have submitted the read requests.
But the previous reads haven't returned yet, and they've filled the service queue.  So into the wait queue they go.

\rrr rr rr/    <=Threads submitting disk IO requests
 \rrrrrrr/

    | |
    | |
    |R|
    |R|        <= Wait Queue
    |R|
    |R|
    |R|
    |R|

  --------
  RRRRRRRR     <= Service Queue
  --------


t2 - the first 8 Reads have all returned, and the 6 Reads previously in the wait queue are now in the service queue.

\r r  r r /    <=Threads submitting disk IO requests
 \   r  r/

    | |
    | |
    | |
    | |        <= Wait Queue
    | |
    | |
    | |
    | |

  --------
    RRRRRR     <= Service Queue
  --------

t3 - all quiet again, nothing left outstanding in the wait or service queue.  But... let's talk latency.  The last 6 reads to complete in this scenario had some amount of time spent in the wait queue.
In fact, adding their wait queue time (which is determined by how long it took for a service queue slot to open up) and their service time, these reads may have taken twice as long as they would have on an uncongested system.
That's because those reads were waiting for other reads to complete.  Assuming a constant average read latency - which is never actually the case ;-) - these unfortunate reads may have spent as much time in the wait queue as they did in the service queue.
Hmmmm
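
To put hypothetical numbers on that doubling, here's a small Python sketch of the two-queue model.  The 8-deep service queue and the 6 delayed reads match the diagrams above; the constant 15 ms read service time is purely an assumption for illustration.

import heapq
from collections import deque

def latencies(queue_depth, ios):
    """ios: (label, service_ms) pairs, all submitted at t=0 and handled FIFO.
    Returns (label, latency_ms) pairs, modelling a fixed-depth service queue
    fed from a wait queue, with constant service times."""
    waiting = deque(ios)
    slots = []                                   # min-heap of slot-free times
    done = []
    while waiting and len(slots) < queue_depth:  # fill the service queue first
        label, svc = waiting.popleft()
        heapq.heappush(slots, svc)
        done.append((label, svc))
    while waiting:                               # drain the wait queue as slots open up
        label, svc = waiting.popleft()
        start = heapq.heappop(slots)
        heapq.heappush(slots, start + svc)
        done.append((label, start + svc))
    return done

# 14 reads at an assumed 15 ms service time against an 8-deep service queue:
# the first 8 complete in 15 ms, the 6 that sat in the wait queue take 30 ms.
print(latencies(8, [("R", 15)] * 14))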


That brings us to the penultimate scenario I wanted to share today.

In the scenario below, the service queue is full of reads that were just submitted (queues were empty previously).  Writes have also been submitted - they are in the wait queue.  Assuming all of these disk IOs were submitted within microseconds of each other, and assuming a 15 ms average read service time and 1 ms average write service time... what will the resulting average read and write latency be?

\rrrrrrrrw/    <=Threads submitting disk IO requests
 \wwwwwww/

    |W|
    |W|
    |W|
    |W|        <= Wait Queue
    |W|
    |W|
    |W|
    |W|

  --------
  RRRRRRRR     <= Service Queue
  --------

The read latency average will be 15 ms.  None of the reads spent any time in the wait queue.

The write latency is a different story, however.  The average write latency will be 16 ms.  Why?  Each write spent 15 ms - the read service time - in the wait queue before getting into the service queue.  Service time once in the service queue was 1 ms, so each of the 8 writes experienced a latency of 16 ms.
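
The same arithmetic as a tiny sketch, using the numbers from the scenario above:

# 8 reads fill the service queue; 8 writes sit in the wait queue behind them,
# all submitted within microseconds of each other.
read_svc_ms, write_svc_ms = 15, 1
read_latency_ms = read_svc_ms                   # the reads never touched the wait queue
write_latency_ms = read_svc_ms + write_svc_ms   # one full round of reads, then 1 ms of service
print(read_latency_ms, write_latency_ms)        # 15 16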


On a system with write cache, write latency greater than read latency for any sustained period of time is always a sign of a performance concern.  It's generally one of four things:
1. overhead due to storage-level replication
2. write cache saturation
3. SAN CPU saturation
4. IO queuing

In this case it was IO queuing that led to average write latency greater than average read latency.

OK... now for me to explain just how inadequate my ASCII art really is :)
If there is an Emulex fibre channel HBA on the Windows server, it allows setting a per-LUN service queue depth.  (The default LUN queue depth for Emulex is typically 32.)
The Windows wait queue is interesting - rather than being one-size-fits-all, the wait queue is defined such that the sum of service queue depth and wait queue depth is 256.
If you watch "Current Disk Queue Length" numbers in perfmon long enough... you'll believe me even if right now you are skeptical. :-)
That number just won't go above 256.

So what is the absolute worst case latency for a write, assuming constant read service time of 15 ms and constant write service time of 1 ms?

Well, let's assume a service queue depth of 32 - and fill it with reads.  Let's put another 6*32=192 reads in the wait queue, followed by 32 writes in the wait queue, with all 256 of these IOs submitted within microseconds of each other.  That fills not only the service queue, but the wait queue as well.
The writes are the last to finally make it into the service queue.
But they have waited for 7 rounds of reads at 15 ms each to complete - they've spent 105 ms in the wait queue.  Their service time is as expected - 1 ms.  But average latency is 106 ms.  For writes, that's horrible.
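
Here's that worst case worked through in the same style, numbers straight from the setup above:

# 32-deep service queue full of reads, plus 192 reads and then 32 writes in the wait queue.
queue_depth = 32
reads_ahead = 32 + 192                    # reads that must drain before any write gets a slot
read_svc_ms, write_svc_ms = 15, 1
rounds = reads_ahead // queue_depth       # 7 rounds of reads at 15 ms each
write_latency_ms = rounds * read_svc_ms + write_svc_ms
print(write_latency_ms)                   # 106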

OK... here's a kicker: this matters even if you are using SSDs.  If the service queue depth is still 32 and there are 256 total outstanding IOs to the drive... those last IOs in the wait queue have some time to wait before they finally get into the service queue.
I've seen all-flash systems deliver a level of performance that is easily obtained with hard drives, simply because of host-side queuing issues.
Try not to let that happen to you :-)
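
To make the SSD point concrete: the 32-deep queue and 256 outstanding IOs below come from the scenario above, but the 0.5 ms flash read service time is my assumption, not a measured figure.

# Even on flash, the last batch of IOs waits behind 7 earlier rounds of service.
queue_depth = 32
outstanding = 256
flash_read_svc_ms = 0.5                              # assumed flash read service time
rounds_waited = outstanding // queue_depth - 1       # 7 rounds ahead of the final batch
worst_latency_ms = rounds_waited * flash_read_svc_ms + flash_read_svc_ms
print(worst_latency_ms)                              # 4.0 ms - 8x the uncontended service time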

OK... now sometime soon when I'm feeling brave I'll come back to this and take on QLogic and the execution throttle.  It's similar to LUN queue depth, but has a set of considerations all its own.  And I haven't figured out how to represent it in ASCII art yet :)
