Wednesday, March 25, 2015

SQL Server Outstanding IO Part 1: Missing Dimension in Design & Performance Analysis

DBAs get frustrated because discussing their performance concerns with storage administrators often doesn't result in resolution.
Storage administrators get frustrated because storage level changes and upgrades don't always result in the expected performance gains for critical databases.
I understand this.  As a systems engineer, I try to negotiate with both parties - that's also difficult :-)

Here's an important concept in systems performance and scalability investigation/intervention:
Missing dimensions of analysis make progress difficult.

I crafted that phrase carefully.  Missing dimensions of analysis are NOT the only condition that make progress in these cases difficult.  Sometimes, the time interval granularity of behavior capture isn't sufficient for diagnosis.  Sometimes isolating problem behavior from other behavior within the same time interval is difficult.  Other difficult conditions can arise.  But missing dimensions of analysis DO make performance investigation and intervention difficult.  And what if that dimension is largely absent from an *entire* DBA community?

Perhaps you think I'm exaggerating.  If you can find a SQL Server general performance presentation or SQL Server storage performance presentation that discusses outstanding IO count (disk queue length/depth, adapter queue length/depth) as a performance/scalability dimension, I'll concede that I've exaggerated.  I've yet to see such a presentation - although I'm putting one together myself :-)  I am increasingly seeing it addressed in best practice docs and fast track guides and reference architectures.  Thank goodness for that!

When there is a missing dimension of analysis, there are many possible stumbling blocks in investigation and negotiation between the teams responsible for the various layers of the system.

DBA: I see average read latency over 200 ms!
SAN admin: The SAN performance stats show average service time on those LUNs at 10 ms or lower...

DBA: I can't get more than 200mb/sec of throughput!
SAN admin: I don't see any more than that being requested, and the disks in that group have a very low percentage of utilization all day.

DBA1: I see almost 800 mb/sec throughput to the database.
DBA2: But the read latency is over a full second then...

Over the last few years, there has been a lot of discussion about IOPs rating alone being insufficient for predicting performance and scalability.  I fully agree.  What good is it for *system* sizing to know IOPs (which is largely a rating of controller CPU handling rate) without also knowing the IO size, read/write mix, and random vs sequential mix?  Somewhere, total bytes/sec has to fit in.  But is that *all* there is to know?

Consider this situation.  On this server, tempdb is isolated to its own LUN.  A massive LUN holds reporting database data in a single huge file.  Transaction logs - thankfully - are on a separate LUN.

That looks pretty good, right? Approaching 800 mb/sec at peak throughput. But...

Oh. Write latency averages up to 15 seconds!!! Write latency often higher than read latency, when writes should be going to write cache?!?
Ok, ok. Maybe not so good.

Is it a bytes/second throughput problem?  If so, and if the issue is solely caused by activity from this server, we should see a correlation.

Hmmm.  Not that much of a correlation.
"Silly sasquatch.  That's because it was mostly *read* activity then.  So *reads* were experiencing high latency because of a bytes/sec throughput limitation.  I see this *all* the time in SAN design, and engineers just don't get the math. Just leave this stuff to the experts, ok?"

Ok.  But I'd like to look at a few more things.

Oh... wait a minute.  Maybe reads were *not* experiencing high latency at peak bytes/sec.  So read and write latency were both *low* when the system hit bytes/sec peak?  Well - on shared storage there's always the problem of other connected servers causing performance problems...
Hang on.  Maybe this is a full write cache.  That can cause exaggerated read and write latencies.

I guess not - at least not caused by activity from this server alone.
"See!! Shared storage.  SAN engineers just don't get it..."
Hang on... cuz I got one more dimension to evaluate.
Lets look at outstanding disk IO, in order to evaluate whether there is adequate LUN and adapter queue depth for your workload.  After we've examined all dimensions from this server, we can move on to examine potential effects of shared storage by looking at combined traffic.

"The high queue length isn't a cause!  Its an effect!  The shared storage is messing us up, and a bytes/sec bottleneck is causing high queue length."
Maybe.  But that's not what all of the relationships we've examined is implying.  What kind of HBA are you using?  Emulex, Brocade, QLogic...
Ok.  What's the "execution throttle" set to - and how was that planned?

To be continued...