Saturday, March 21, 2015

What about LUN Count and Aggregate Service Queue Depth?





There are a few conversations I have repeatedly.  I hope to diagram and illustrate the concepts behind my part in the discussion - because honestly I would rather spend future time learning something new.

Often when I bring these details up with SQL Server experts, responses include:
"SANs are about more than just disks."  (As if I didn't know that!)
"You're wrong... its simple math."  (Well - actually the math AND industry benchmarks agree with me. Show me your numbers and I'll show you mine.)
"This is just Disk101." (Yes, I agree. So I wish folks would start teaching it and applying it rather than shutting me down when I bring it up because... because why, exactly?)

Anyway, I've finally started diagramming these things in a way that I like.  So, here goes.  Here's a Windows LUN on a SQL Server - either a drive letter or a mount point.  It's not using Storage Spaces.  It's not a striped dynamic disk.  I'll deal with those in later posts.  The LUN has a service queue (the dashed lines) and a wait queue.  The wait queue depth is equal to (256 - service queue depth).  On a physical server, assume a LUN service queue depth of 32.  (On a virtual server, assume a LUN queue depth of 64.)  Whether physical or virtual server, the maximum transfer size by default is 512 KB.  That's important too - we'll talk about maximum transfer size another day.
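To make that queue model concrete, here's a minimal sketch - my own toy illustration under the assumptions above, not anything from Microsoft or a storage vendor - of the best-case average response time when a burst of IOs exceeds the 32-slot service queue, assuming a .25 ms average service time for illustration:

# Minimal sketch of the LUN queue model above - my own toy illustration,
# not vendor code.  Assumes the physical-server defaults described here:
# a service queue depth of 32 and 256 total outstanding IO slots.

SERVICE_QUEUE_DEPTH = 32
TOTAL_SLOTS = 256
WAIT_QUEUE_DEPTH = TOTAL_SLOTS - SERVICE_QUEUE_DEPTH   # 224

def best_case_response_ms(outstanding_io, service_ms):
    """Idealized average response time for a burst of outstanding IOs.

    IOs beyond the service queue depth sit in the wait queue; the average
    waiter has to let about half of the other waiters drain through the 32
    service slots first.  Queue-management overhead is ignored, so real
    response times will be at least this high.
    """
    if outstanding_io <= SERVICE_QUEUE_DEPTH:
        return service_ms
    waiting = min(outstanding_io, TOTAL_SLOTS) - SERVICE_QUEUE_DEPTH
    return service_ms + (waiting / 2) / SERVICE_QUEUE_DEPTH * service_ms

for burst in (32, 64, 128, 256):
    print(f"{burst:3d} outstanding -> {best_case_response_ms(burst, 0.25):.3f} ms avg")
    # 32 -> 0.250 ms, 64 -> 0.375 ms, 128 -> 0.625 ms, 256 -> 1.125 ms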



 *****


Interesting.  32 simultaneous 8 KB reads (the blue ovals on the dashes of the line).  Unusual?  Not for SQL Server, if there are any parallel queries that will do index key lookups :-) Not for Oracle either, under the same circumstances.  Note that in SQL Server the default worker count for 64-bit Enterprise Edition is at least 512, and each worker can issue its own reads.
It takes 128,000 read IOPs to get to 1 gigabyte/s at an 8 KB read size - and with only 32 outstanding IOs, that means a .25 ms average read service time.  That's a lot. Consider the following XtremIO SLOB test results.
https://community.emc.com/docs/DOC-42804?cmp=soc-cor-glbl-us-sprinklr-TWITTER-Flash-EMCXtremIO-155866600&linkId=12903441
118,067 IOPs, with a read latency of 1.1 ms and a write latency of 1.5 ms.  That's quite a bit above the .25 ms read service time in the example above.  That means the service queue length (and accompanying service queue depth) was greater than 32.  Makes sense - it was 8 VMs :-).
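The arithmetic behind those figures is just Little's Law (outstanding IO = IOPs x average service time).  A quick back-of-the-envelope check, assuming 8 KB reads and a LUN service queue depth of 32:

# Back-of-the-envelope check of the figures above; 8 KB reads assumed.
io_size_bytes = 8 * 1024
target_bytes_per_sec = 1024**3                     # 1 gigabyte/s

iops_needed = target_bytes_per_sec / io_size_bytes
print(f"{iops_needed:,.0f} IOPs needed")           # 131,072 - "128,000" in round numbers

# Little's Law: with only 32 IOs in flight, the required average service time is
print(f"{32 / iops_needed * 1000:.2f} ms")         # ~0.24 ms - the .25 ms above

# The XtremIO SLOB result - 118,067 IOPs at 1.1 ms read latency - implies
# roughly this many IOs in flight across the 8 VMs:
print(f"{118067 * 1.1 / 1000:,.0f} outstanding")   # ~130, well above 32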
Look - folks keep trying to convince me that *somehow* there is some magic that can make a single SAN LUN scale to 1 gigabyte/s or more with an arbitrary IO size and LUN service queue depth.  Hey - it's even hard to hit 1 gigabyte/s with a single LUN and a database workload if you crank queue depth all the way to 256.  I know, I've tried.   If you've got numbers for an all-flash array to show me wrong - bring 'em.  I know PCIe flash offers lower latency than an all-flash array - that's not in scope of what I'm addressing. (I'm a big fan of PCIe flash for SQL Server tempdb and with read accelerator tier 0 cache software if appropriate to the workload.)

Yesterday there was a kinda "hardee-har-har let's laugh at the math inability of SAN engineers" post*.  I'm willing to take a little constructive criticism.  Truth be told, this wasn't very harsh criticism - and the author makes a valid point (if the example is changed slightly).

Here's the original post.
http://dataidol.com/tonyrogerson/2015/03/20/the-maths-most-san-engineers-do-not-understand-in-context-of-database-servers/

The post doesn't mention average IO size - or the maximum transfer size of the host.  It just specifies that the servers are reporting database servers and that they move a lot of data.  By default, planning should use an 8 KB IO size (the SQL Server page size and default recommended Oracle block size) unless other numbers are given.  It doesn't mention LUN service queue depth or queue length.  It doesn't mention read or write service times for the setup in the diagram.  All it says is: "Assume data is located on a single LUN across all the SSD's."  By the way - the storage network protocol also isn't specified.  But listing 8 gigabit/s as the server connectivity indicates these are fibre channel connections, since FCoE or iSCSI would most likely be using 10 GbE or 1 gigabit connectors.  I'm pretty certain this isn't FICON, and it's definitely not InfiniBand :-)

My comment - which hasn't made it through moderation yet.

*****
"…what happens when you drive down a 10 lane motorway that get’s reduced to just a single lane?"
"Assume data is located on a single LUN across all the SSD's."
A single LUN on the hosts would most likely present a serious DB server bottleneck before getting to the downstream network bottleneck. Assume a LUN service queue depth of 32 and queue length of the same – since no specification has been made about LUN service queue depth. (32 is the maximum LUN queue depth for Hitachi VSP, XP24000 and related offerings from HP as well as for EMC VNX with virtual provisioning.) With those assumptions, 128,000 IOPs of 8k each at .25 ms avg latency would be needed to hit 1 gigabyte/sec. 384,000 IOPs of 8k each across the 3 servers to hit 3 gigabytes/sec.
At a LUN service queue depth of 256 it becomes a more manageable 2 ms avg latency. But not all storage controllers will allow such a high LUN service queue depth. Other than 32, the most common LUN queue depth I encounter is 64 - this is the default guest LUN queue depth in a Windows VMware guest. Without other specification, 32 is the LUN service queue depth that should be used in planning.
With 8 or more LUNs at queue depth 32 on a database server, I think you could easily be throttled by available downstream bandwidth. But with a single LUN, the most typical case on a database server (with transfer sizes ranging from a single block to the maximum physical transfer size) is that the throughput requirements of the outstanding IO don't come close to saturating the available bandwidth of an 8 gigabit HBA.
Although accommodating outstanding IO potential isn't common in SQL Server configurations, those responsible for system design and performance should know that, for example, a 4 vCPU VM can drive up to 1000 outstanding IOs - and the more of them that are relegated to wait queue status, the more punishment in latency and throughput the VM will experience.
It's very common for flash and SSD devices to provide more IOPs and bytes/sec than storage controllers can serve, and even more common in my opinion for the storage controllers to be underutilized while host/guest disk IO queues on heavily used reporting database servers are saturated.
*****
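The queue depth versus latency trade-off in the comment above falls out of the same arithmetic; a quick check, again assuming 8 KB IOs and 1 gigabyte/sec per server:

# Same Little's Law arithmetic applied to the comment above.
iops_per_server = 1024**3 / (8 * 1024)             # ~128,000 (131,072 exactly)

for depth in (32, 64, 256):
    latency_ms = depth / iops_per_server * 1000
    print(f"service queue depth {depth:3d}: {latency_ms:.2f} ms avg latency required")
    # depth 32 -> ~0.25 ms, depth 64 -> ~0.5 ms, depth 256 -> ~2 ms

print(f"{3 * iops_per_server:,.0f} IOPs across the 3 servers for 3 gigabytes/sec")
# ~393,000 exactly; ~384,000 using the rounder 128,000-per-server figure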

If you look at enough perfmon from a SQL Server system, or enough iostat on an Oracle system, you'll see floods of reads arrive when multiple disk-intensive parallel queries kick off (or a single massively parallel query does).  Read-ahead and async reads are important ingredients in a highly scalable RDBMS.

*****

 *****
The above illustration is actually idealized.  The truth is, there is a queuing penalty: total response time for a waiting IO ends up greater than the sum of the remaining service time before a slot empties and its own service time.  This additional queue-management time may be negligible (it usually is), or it could be significant in measurement.  Regardless - the numbers here are the *best* that can be achieved in the system as defined.
If the LUN queue length exceeds the service queue depth, average read response time will increase and the system won't hit 1 gigabyte/s.  Decreasing service time even further could allow us to reach 1 gigabyte/s.  But .25 ms avg is about as good as you'll ever get out of external storage.
Increasing the average IO size will also help to recover from queuing penalties.
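As a rough sketch of that last point - still my own illustration, assuming the 32-slot service queue - throughput at a fixed queue depth is just depth x IO size / service time, so larger average IOs reach the same throughput at far more forgiving service times:

# Throughput at a fixed service queue depth of 32: depth * io_size / service_time.
def mb_per_sec(depth, io_kb, service_ms):
    return depth * io_kb / 1024 / (service_ms / 1000)

for io_kb in (8, 64, 512):                          # 512 KB = default max transfer size
    for service_ms in (0.25, 2.0):
        print(f"{io_kb:3d} KB IOs at {service_ms} ms -> {mb_per_sec(32, io_kb, service_ms):8,.0f} MB/s")
# 8 KB IOs need the aggressive 0.25 ms to reach ~1,000 MB/s;
# 64 KB IOs reach ~1,000 MB/s even at a 2 ms average service time.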

Here's the thing: the blog post author has a very valid point.  To achieve maximum performance out of a storage system, the disks, the pipes to the controller, and the pipes to the server host all need to be considered (as well as controller cache and controller CPU).  But before that, we've got to look at the pipes coming out of the server host - namely the LUN and adapter queues (at both physical and virtual server layers if a VM).  Otherwise, you'll just add capacity that can't be tapped by the server anyway.  And if you want to negotiate with a SAN engineer, you don't want to burn political capital by asking for changes you can't use yet - ask for those after, or at least at the same time as, the changes that will let you capitalize on the new performance capacity.

If the example were changed to just say "assume no queuing delay at the server LUNs or adapters", or "assume an adequate aggregate LUN service queue", or "assume the data is evenly accessed across 8 LUNs", the example would be valid.  But building the example on a single-LUN assumption works against the author's larger intention - as does the general response of "oh my goodness, you've perfectly captured how silly SAN engineers are".  At Summit I usually hear exhortations to work as a team with storage administrators.  I think that shouldn't just be a ploy to get what you want.  With a bit more learning of storage technology - especially shared storage - I firmly believe there will be more widespread performance success across the SQL Server community.

*I double-checked this because I wanted to make sure I wasn't rushing to judgment.  Re-reading the original post's first line has me, for now, convinced I am not hasty in my appraisal:
"I suppose I write this post to mock some of the things I keep seeing in the industry today."

5 comments:

  1. Unfortunately you've based that maths on an 8KiB page; for a reporting server that just isn't realistic. Looking at a real semi-busy server doing OLTP, the avg disk bytes/write is averaging 15,550 and the read is averaging 13,264; when doing a reporting workload the avg rises to 325,240.

    If I’m running a query that requires an index scan (left side of the join on loop/merge/hash even) on a table that is 200GiB, indexed so that the real data read is just 50GiB then that 50GiB isn’t read in 8KiB IO’s, I’ll be getting between 100KiB and 2MiB depending on how fragmented the data is and what I’m actually doing in the query.

    Do you now see why the 700MiB limiter (8Gbits) - i.e. one lane out of the SAN servicing multiple servers, or even just the one - is just plain silly?

    I think you’ve fallen into one of the traps I see – SAN engineers assume 8KiB fully random IO when looking at database servers.

    Also – as to the single LUN mentioned, that's actually what I see time and time again out in the field – to use multiple LUNs for a SQL Server database you need to use multiple files in the database, which a lot of folk just don't do.

    So in reality, at a read rate of 128KiBytes per IO off the SAN, you can only manage around 5.5K IOps - which really isn't that much!

    Replies
    1. I've added a new post with numbers from a reporting server performance intervention. If you'd like to post similar numbers from a reporting server that achieves 1 gigabyte/sec from SAN with a single LUN at latencies and total queue length that demonstrate queuing is not a bottleneck*, I'd love to see it.

      http://sql-sasquatch.blogspot.com/2015/03/sql-server-on-vmware-lsi-vhba-vs-pvscsi.html

      For example, average read latency and write latency from SAN SSDs should remain below 2 ms to indicate lack of performance penalty from queuing, and the current disk queue length reported by perfmon should remain under the LUN service queue depth.

  2. You'll be interested to see my next post. A comparison of results from a SQL Server VM with 4 guest database LUNs on a single LSI adapter to 4 guest LUNs on a pvscsi adapter, with the adapter queue length increased from 256 to 1024. The results include a capture from the reporting workload on this VM.

    I know quite well how many SQL Server installations have the entirety of their busiest database on a single LUN. I'd like to put a stop to that. If that were to come to pass, there would be fewer installations that feel the need to upgrade to SSD within SAN or an all flash array. Read ahead and prefetch in SQL Server are quite capable, and if peak bytes/sec throughput is maintained throughout a workload, the difference between a 1 ms avg read service time and a 5 ms avg read service time can become insignificant (as long as synchronous reads are minimized).

  3. "depending on how fragmented the data is and what I’m actually doing in the query."
    Well, kinda. Bookmark lookups will still be 8 KB, regardless. They could be pre-fetched, but they still take a slot in the service queue.
    Bookmark lookup could be eliminated if all necessary columns are included in the index. But, again, that would be based on an assumption not stated in your example - which didn't address average IO size, queue depth, queue length, or latency even though it was meant to mock folks that work with those concepts every day. If your post had included an average IO size, I would have worked with that.

    If your negotiations with storage engineers are not more specific than the examples you put together to mock them, they may not be very successful.

  4. I think you are getting bogged down in the detail; my point in the blog was simple - one hell of a lot of SAN engineers don't tend to listen to the guys that are expert in the database product, they just don't get the maths of throughput. A lot of the time the database guy has to overcome an arrogance of "we know better" even though the database guy knows their environment, understands the database engine, and understands the query dynamics of their requirements - unfortunately the business takes the view: oh, they are the storage experts, so they will know best.

    To conclude - if you look at my diagram you see 3 database servers, each with their own 8Gbit port - so each database server can perform a throughput of up to around 700MBytes per second (think distributed database processing in Microsoft APS). But there is only a single 8Gbit port out of the SAN! 3 x 700 does not equal 1 x 700, which is my point.

    Bookmark lookups aren't restricted to 8KiB. Also, what I posted I took straight off a live environment which has two 8Gbit ports out of the SAN, and yep - I got 1.2GBytes per second throughput even though all my data is on a single LUN. I don't understand why that is - not sure of the model number, but it's an expensive EMC VNX something.

    I think to conclude my comment - my original post was from frustration as a consultant who's sick and tired of SAN engineers thinking they know that what I want isn't actually what I want, without looking at perfmon, without going into the detail of requirements, etc. Thankfully we are heading into an era of locally attached storage but distributed data, so this issue of throttling SAN throughput will eventually go away - as well as that single central box!
