
Tuesday, April 21, 2015

3rd Try Charm: Why does pvscsi rather than LSI matter so much for SQL Server on VMware?


I've written two blog posts recently on the performance benefits of the VMware pvscsi vHBA over the LSI vHBA.

March 23, 2015
SQL Server on VMware: LSI vHBA vs pvscsi vHBA
http://sql-sasquatch.blogspot.com/2015/03/sql-server-on-vmware-lsi-vhba-vs-pvscsi.html

April 7, 2015
Another SQL Server VMware LSI vs pvscsi vHBA blog post
http://sql-sasquatch.blogspot.com/2015/04/another-sql-server-vmware-lsi-vs-pvscsi.html

These posts give details from different systems (although running a similar workload), and claim roughly a tenfold improvement in peak throughput and peak disk response times from switching the SQL Server LUNs from the default LSI vHBA to the pvscsi vHBA.  That sounds a little fishy, doesn't it?  Especially if you are familiar with...

Achieving a Million I/O Operations per Second from a Single VMware vSphere® 5.0 Host
http://www.vmware.com/files/pdf/1M-iops-perf-vsphere5.pdf

Page 10 of the performance study above includes the following text.
"… a PVSCSI adapter provides 8% better throughput at 10% lower CPU cost."


That is a much more modest (and probably more believable) claim than a 10x performance benefit.  What gives?

Here are a few details about the mechanics:
The LSI vHBA has an adapter queue depth of 128, which cannot be increased, and a LUN queue depth that cannot be increased from the default of 32.
The pvscsi vHBA has a default adapter queue depth of 256 and a default LUN queue depth of 64.  With Windows registry settings, the adapter queue depth can be increased to 1024 and the LUN queue depth to 256.
http://www.pearsonitcertification.com/articles/article.aspx?p=2240989&seqNum=3
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2053145


And here's a detail about the testing that I just happened to come across in the Longwhiteclouds blog.


"Maximum queue depth supported by LSI driver (in guest) cannot be changed. So to keep the aggregate outstanding I/Os per controller lower than the max queue depth we had to use 16 OIOs per vDisk. To have a fair comparison between LSI and pvSCSI, second test also had 16 OIOs per vDisk for pvSCSI as well. " Chethan Kumar, author of the VMware paper, as quoted on Longwhiteclouds.

So, comparison testing was done within the queue depth constraints of the LSI vHBA.  But, in the case of these Enterprise Edition SQL Server workloads, the number of outstanding IOs would often exceed 600 and microbursts as high as 1000 outstanding IOs occurred.  That's well outside the LSI adapter queue depth, and the queuing penalty can be high.  Even with 4 LSI adapters in a VM, the aggregate adapter queue depth would be only 512.
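
To put those numbers side by side, here's a quick back-of-the-envelope sketch in Python.  The queue depths are the LSI and pvscsi figures listed above, and the 1000 outstanding IO burst is from these workloads; the 8 LUNs per adapter is just a hypothetical layout for illustration.

configs = {
    "LSI, 1 adapter (defaults)":    {"adapters": 1, "adapter_qd": 128,  "lun_qd": 32},
    "LSI, 4 adapters (defaults)":   {"adapters": 4, "adapter_qd": 128,  "lun_qd": 32},
    "pvscsi, 1 adapter (defaults)": {"adapters": 1, "adapter_qd": 256,  "lun_qd": 64},
    "pvscsi, 4 adapters (tuned)":   {"adapters": 4, "adapter_qd": 1024, "lun_qd": 256},
}

LUNS_PER_ADAPTER = 8   # hypothetical layout - adjust for a real VM
BURST = 1000           # peak outstanding IOs seen in these workloads

for name, c in configs.items():
    # An adapter can't hold more in flight than its own queue depth,
    # regardless of how many LUNs sit behind it.
    per_adapter = min(c["adapter_qd"], c["lun_qd"] * LUNS_PER_ADAPTER)
    aggregate = per_adapter * c["adapters"]
    verdict = "fits" if BURST <= aggregate else "queues in the guest"
    print(f"{name:30s} aggregate in-flight capacity {aggregate:5d} -> burst of {BURST}: {verdict}")

Under the LSI limits, even four adapters top out at 512 in-flight IOs, well short of the bursts described above; the tuned pvscsi configuration has room to spare.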

If a SQL Server workload doesn't burst more than 32 outstanding IOs per LUN or more than 128 outstanding IOs per vHBA adapter, the change to pvscsi would most likely bring rather modest performance benefits - along the lines of the 8% better throughput at 10% lower CPU utilization indicated in the whitepaper.  In fact, at that low level of outstanding IO... maybe there would be a slight performance decline.  That's because the LSI vHBA can allow an IO request up to 32 MB in size.  SQL Server won't (yet) perform disk IO that large.  The largest disk IO I've seen from SQL Server has been 4 MB.*  The pvscsi vHBA currently allows a maximum disk IO size of 512 KB.

However - really large disk IOs from SQL Server are in my experience fairly rare, and high aggregate queue length is more common.  For that reason, I heartily recommend using the pvscsi vHBA for SQL Server VMs.  Retaining the LSI vHBA for the boot drive is common, even when pvscsi vHBAs are added for database LUNs.  I've got nothing against that approach.  But it's important to ensure that a SQL Server VM can handle the outstanding IO generated by its workload.  CPUs are hungry - feed them lots of data quickly :-).


*But Niko has shown that columnstore will do read IO up to 8 MB :-)
Clustered Columnstore Indexes – part 50 ("Columnstore IO")
http://www.nikoport.com/2015/04/04/clustered-columnstore-indexes-part-50-columnstore-io/       
 

Thursday, December 19, 2013

VMware vCPU dispatch overload without high perfmon %Processor Time or ESX %RDY...

[Graph: %CPU busy and processor run queue (left axis) with total thread count (right axis), from perfmon.]
The system for the graph above is a 1 vCPU Windows Server 2008 R2 VMware guest.  %CPU busy and the processor run queue are plotted against the black vertical axis on the left; total thread count from perfmon is plotted against the blue axis on the right.

I was watching an ETL process... we'd already optimized the sending system.  Times looked great... at first.  Suddenly the times for the ETL stretched out and became even worse than before optimization!  My first thought was to look at the receiving database system.  Nothing stood out, but there are always things to optimize... maybe reduce IO queuing by increasing max transfer size... I noticed that total throughput on the server seemed capped at 800 MB/second with queued IO and CPUs with plenty of idle time... but those factors were true before.

Finally I started looking at the middleman - a small VM that grabs the extract files from the source system and manages their import into the target system.  I'd actually forgotten it was a VM in this case... I'm a dinosaur, so I always assume a physical server until it becomes apparent it's virtual.

I started staring at disk queuing first out of habit.  Nothing out of the ordinary - in fact that was a little surprising.  I had expected some disk queuing due to the number of sending threads which had increased after optimizing the sending system.  And each sending thread should have been able to increase its send rate, as well.

But the "current disk queue length" perfmon counter was completely unremarkable... even though more than 40 writing threads coming from the source system should at times have overlapped with 30 or more reading threads sending to the target system.

But I guess the processor run queue, which reaches up to 102 threads for the lonely single vCPU, would explain why the level of disk queuing was lower than I expected... and also potentially why performance was slower than before optimizing the sending system.

Not sure yet if a change was introduced to this VM (perhaps someone reduced the vCPU count based on prior utilization?), or if the optimization of the sending system caused a scheduling/interrupt handling overload for the hypervisor or guest.

Interestingly, the virtualization host would not necessarily show this as a problem.  The vCPU might be getting all of its allotted time on a host server hardware thread.  But with that many threads to switch on and off the vCPU, plus the interrupts from the network adapter that the ETL was keeping busy... there certainly was a problem.  If only CPU busy were monitored in perfmon, without the processor run queue... the problem might not be apparent.
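
If you'd rather check for this after the fact than catch it live, here's a minimal sketch (Python) that scans a perfmon log exported to CSV (for example with relog -f csv) and flags samples where the processor queue is deep while %Processor Time looks unremarkable.  The file name, the 10x-per-CPU threshold, and the 90% cutoff are assumptions to adjust for your own collection.

import csv

LOG = "etl_vm_perfmon.csv"                          # hypothetical perfmon CSV export
CPU_COL = r"\Processor(_Total)\% Processor Time"
RUNQ_COL = r"\System\Processor Queue Length"
VCPUS = 1            # the VM in question had a single vCPU
RUNQ_PER_CPU = 10    # flag when the run queue is ~10x the CPU count (assumption)

with open(LOG, newline="") as f:
    reader = csv.DictReader(f)
    # Match columns by suffix so the leading \\MACHINENAME in the header doesn't matter.
    cpu_key = next(k for k in reader.fieldnames if k.endswith(CPU_COL))
    runq_key = next(k for k in reader.fieldnames if k.endswith(RUNQ_COL))
    for row in reader:
        try:
            cpu = float(row[cpu_key])
            runq = float(row[runq_key])
        except ValueError:
            continue  # skip blank/placeholder samples
        # Deep run queue without a pegged CPU: dispatch overload hiding
        # behind an unremarkable %Processor Time.
        if runq >= VCPUS * RUNQ_PER_CPU and cpu < 90.0:
            print(f"{row[reader.fieldnames[0]]}: run queue {runq:.0f}, CPU {cpu:.0f}%")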

Good thing I'm always watching the queues :)

Hopefully cranking up to at least one more vCPU will lead to an improvement.  I hate it when my optimizations end up in a net loss :)


Gotta give you a link, right?  Here's a great one on this topic, which applies to Windows dispatch overload whether the CPUs are physical or virtual.

Measuring Processor Utilization and Queuing Delays in Windows applications

http://blogs.msdn.com/b/ddperf/archive/2010/04/04/measuring-processor-utilization-and-queuing-delays-in-windows-applications.aspx

Monday, December 9, 2013

'All are Punishèd' part 1: QFULL messages

Here's a reason NOT to ignore the perfmon 'LogicalDisk(*)\Current Disk Queue Length' metric: it is among the very few ways of diagnosing a particularly punishing performance condition.  (I've got another one that's a slight variation on this theme, coming in a few days.)

Windows allows up to 256 outstanding disk IO operations per host (or VM guest) LUN.  For fibre channel LUNs, the HBA defines a LUN queue depth within that number.  Typical default is a fibre channel LUN queue depth of 32.

That means that at any given time, there may be up to 32 in-flight IO operations in the host LUN service queue, with up to an additional 224 (for a total of 256) in an OS wait queue.  The IO requests in the OS wait queue will go into the service queue as slots open, and the total service + wait queue depth of 256 will keep additional IO requests at bay if need be.
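
As a tiny worked example of that split (Python, using the default 32 and 256 from above):

def queue_split(outstanding, service_depth=32, os_limit=256):
    # Rough split of outstanding IOs against one LUN: in the HBA service
    # queue, in the OS wait queue, or held back beyond the 256 limit.
    in_service = min(outstanding, service_depth)
    in_wait = min(max(outstanding - service_depth, 0), os_limit - service_depth)
    held_back = max(outstanding - os_limit, 0)
    return in_service, in_wait, held_back

for oio in (16, 64, 256, 400):
    print(oio, queue_split(oio))   # e.g. 400 -> (32, 224, 144)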

What if there are 32 or more cores on the server with a data-hungry workload, with many threads interested in the same LUN at the same time?  Chances of overflowing the service queue depth of 32 are quite high.  (Actually, that's true for SQL Server as soon as there are 8 or more physical cores - with or without hyperthreading - and a busy enough workload... but I digress.)

OK... well, increasing the LUN service queue depth on the Windows server HBA can be fairly easy.  And more parallel in-flight IO should increase throughput - which should allow increased CPU utilization and higher throughput of logical database work, assuming the same data-hungry workload - right?  Sure, latency would be expected to rise... but as long as the application is on the data-hungry side of the spectrum, instead of the latency-sensitive side, everything should be dandy!

Except when it's not.  And the QFULL message is an example of when it is not.

The QFULL message from a storage array is a means of telling the connected server(s) to hold some horses.  In the old days, when command queues for storage array ports or other elements overflowed, it was possible to crash the array OS.  Perhaps that's still possible - I haven't heard of that particular type of failure in quite some time.  But QFULL messages can still be sent from a front end port on a storage array that has a full command queue - or, more likely, when the command queue for an array LU (the array object corresponding to the server host's LUN) is full and additional commands keep coming in.

That's tricky.  In some cases, an array LU has a documented maximum command queue depth.  If documented, the max may be a specific number, or it may be based on the number of underlying storage devices in the LU.

There are two somewhat common ways that a given system can be set up for this trouble.  The first: the LUN queue depth on the server HBA may be deeper than the command queue for the corresponding LU.  The problem wouldn't necessarily be immediately apparent in such a case, but once the LUN service queue length gets long enough that the LU command queue overflows, the array will return a QFULL message.  Then host activity is at the mercy of the host OS/guest OS/fibre channel driver response to the QFULL condition.

For virtual servers, the other path to trouble is when the LUNs presented to several virtual guests come from the same array LU.  Blah.  I'm sure there's a clearer way to say that :)  But imagine a physical ESXi server with 4 guest Windows VMs, each VM using a Windows Q: drive.  Each of those Q: drives in this hypothetical is actually being served from the same underlying LUN on the ESXi server.

Bringing LUN queue depth up to 128 on a VMware host server is pretty common.  Let's leave the guest Q: drive queue depth at 32.  There are 4 guests, so the aggregate queue depth is 128.  Now imagine that the storage array involved has a command queue depth of 64 for the LU that becomes the host LUN in question and, eventually, each of the Q: drives for the 4 VM guests.

If the LUN queue depth reaches 32 simultaneously in each of the 4 guests - will it be a problem?  Maybe not.  There is IO coalescing at the VMware level - so the 128 separate IO requests at the guest level may very well coalesce to fewer than 64 IO requests in the VMware host LUN service queue.  The storage array won't get grumpy in that case.

But... what if the 32 outstanding IOs against each guest VM's Q: drive (128 in aggregate) are random from the perspective of the guests AND cannot be coalesced by VMware at all?  128 commands will go into the LUN service queue and get sent to the array.  The array in this example will respond with a QFULL message (cuz it's my example).  And in what I consider the worst case scenario, let's say the response is determined by something like the VMware adaptive throttling algorithm.

Here's a good summary of the VMware adaptive throttling algorithm:
http://cormachogan.com/2013/01/22/adaptive-queueing-vs-storage-io-control/

In that case, the QFULL message would cut the queue depth in half.  Continued QFULL messages could continue to reduce the LUN queue depth.  When congestion clears, the increase in queue depth is not as quick as the decline - the queue depth doesn't double in each 'good' interval until it reaches its previously configured value.  Rather, each 'good' interval sees the queue depth increase by 1.
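
Here's a toy simulation (Python) of that shape of behavior, using the example numbers from above: a configured LUN queue depth of 128 in front of an array LU with a command queue depth of 64.  The QFULL trigger condition and the workload pattern are simplifications for illustration - this isn't VMware's actual algorithm or its tunables.

configured_depth = 128
lu_command_depth = 64      # array LU command queue depth in this example
depth = configured_depth

# Five congested intervals (far more offered IO than the LU can queue),
# then a long quiet stretch.
workload = [200] * 5 + [20] * 200

recovered_at = None
for i, offered in enumerate(workload):
    sent = min(offered, depth)
    if sent > lu_command_depth:
        depth = max(depth // 2, 1)                  # QFULL: cut the depth in half
        print(f"interval {i:3d}: QFULL, LUN queue depth cut to {depth}")
    else:
        depth = min(depth + 1, configured_depth)    # 'good' interval: +1
        if depth == configured_depth and recovered_at is None:
            recovered_at = i
            print(f"interval {i:3d}: depth finally back at {configured_depth}")

With these numbers the depth collapses within a couple of intervals, then takes on the order of a hundred 'good' intervals to climb back to 128 - which is why a short burst of QFULLs can hurt long after the congestion has cleared.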

Now - if VMware adaptive queue depth throttling is suspected to be taking place, it'll be best to diagnose at the ESX host level.

But, if it's a physical Windows server and QFULL messages from the array are suspected, there are two possibilities for diagnosis from the server.  Most HBAs will allow an increased level of error logging, and QFULL messages can typically be logged (along with a plethora of other conditions at various logging levels).  But if the Windows LUN service queue depth is known, the perfmon 'LogicalDisk(*)\Current Disk Queue Length' metric can come in really handy.  Especially if you know the storage array LU command queue depth - if the host LUN queue depth is greater than the LU command queue depth and perfmon shows a queue length higher than the array LU command queue depth... you can ask the SAN admin if QFULL messages are being sent, and you just might be a hero.  Making sure that the host LUN queue depth stays lower than the array LU command queue depth is an example of 'less is more'.
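
And here's a small sketch (Python) of that perfmon check: scan a CSV export of the 'LogicalDisk(*)\Current Disk Queue Length' counter and flag samples that exceed the array LU command queue depth.  The file name and the depth of 64 are placeholders from the example above; use your array's documented value.

import csv

LOG = "sql_host_perfmon.csv"     # hypothetical perfmon CSV export
LU_COMMAND_DEPTH = 64            # array LU command queue depth from the example above
COUNTER_SUFFIX = r"\Current Disk Queue Length"

with open(LOG, newline="") as f:
    reader = csv.DictReader(f)
    queue_cols = [c for c in reader.fieldnames
                  if c.endswith(COUNTER_SUFFIX) and "_Total" not in c]
    for row in reader:
        for col in queue_cols:
            try:
                qlen = float(row[col])
            except ValueError:
                continue
            # Sustained samples above the LU command queue depth are a hint
            # that the array may be returning QFULL for this LUN.
            if qlen > LU_COMMAND_DEPTH:
                print(f"{row[reader.fieldnames[0]]} {col}: queue length {qlen:.0f}")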

More reading in case you just can't get enough of this:
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1008113
http://kb.vmware.com/selfservice/documentLinkInt.do?micrositeID=&popup=true&languageId=&externalID=1027901
http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&externalId=1030381
https://support.qlogic.com/app/answers/detail/a_id/599/kw/VMware%20queue%20depth/

This one I'll list individually because it's one of my all-time favorite posts, and it includes white-board diagrams of the IO stack :)
http://virtualgeek.typepad.com/virtual_geek/2009/06/vmware-io-queues-micro-bursting-and-multipathing.html