
Thursday, October 6, 2016

Migration to pvscsi from LSI for SQL Server on VMware; It *really* matters

I expected it to make a big difference.  Even so, I'm still pleasantly surprised by how much of a difference it makes.

About the VM:
8 vcpu system.
19 logicaldisk/physicaldisk pairs other than the C: Windows install drive.
About the guest's vdisks:
Each guest physicaldisk is its own datastore; each datastore is on a single ESXi host LUN.

On July 27th, the 19 SQL Server vdisks were distributed among 3 LSI vHBAs (with 1 additional LSI vHBA reserved for the C: install drive).

I finally caught back up with this system.  The LSI vHBA for the C: install drive has been retained, but the remaining 3 LSI vHBAs have been switched out for pvscsi vHBAs.

The nature of the workload is the same on both days, even though the amount of work done is different.  It's a concurrent ETL of many tables, with threads managed in a pool, and the pool size is constant between the two days.

Quite a dramatic change at the system level :-)

Let's first look at read behavior before and after the change.  I start to cringe when read latency for this workload is over 150 ms.  100 ms I *might* be able to tolerate.  After changing to the pvscsi vHBA, it looks very healthy at under 16 ms.


OK, what about write behavior?

Ouch!! The workload can tolerate up to 10ms average write latency for a bit.  5 ms is the performance target.  With several measures above 100 ms write latency on July 28th, the system is at risk of transaction log buffer waits, SQL Server free list stalls, and more painful than usual waits on tempdb.  But after the change to pvscsi, all averages are below 10 ms with the majority of time below 5 ms.  Whew!



Looking at queuing behavior is the most intriguing :-) Maximum device and adapter queue depth is one of the most significant differences between the pvscsi and LSI vHBA adapters.  The pvscsi adapter allows increasing the maximum adapter queue depth from the default of 256 all the way to 1024 (by setting a Windows registry parameter for "ringpages").  It also allows increasing device queue depth from the default of 64 to 256 (although storport will pass no more than 254 at a time to the lower layer).  By contrast, LSI adapter and device queue depths are both lower, and no increase is possible.
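For reference, here's roughly what that queue depth bump looks like from inside the guest.  This is a minimal sketch using Python's winreg module; the key path and the RequestRingPages/MaxQueueDepth value syntax are what VMware documents for the pvscsi driver as far as I know, so check the current VMware KB for your ESXi version before relying on it, and remember the guest needs a reboot for the change to take effect.

```python
# Minimal sketch - assumes the registry location and DriverParameter syntax
# documented by VMware for the pvscsi driver; verify against the current KB.
# Run elevated, inside the Windows guest; a reboot is required afterward.
import winreg

KEY_PATH = r"SYSTEM\CurrentControlSet\services\pvscsi\Parameters\Device"

# 32 request ring pages raises the adapter queue depth to 1024;
# MaxQueueDepth raises the per-device queue depth (storport passes at most 254).
DRIVER_PARAMETER = "RequestRingPages=32,MaxQueueDepth=254"

with winreg.CreateKeyEx(winreg.HKEY_LOCAL_MACHINE, KEY_PATH, 0,
                        winreg.KEY_SET_VALUE) as key:
    winreg.SetValueEx(key, "DriverParameter", 0, winreg.REG_SZ, DRIVER_PARAMETER)

print("pvscsi DriverParameter set; reboot the guest for it to take effect.")
```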

It may seem counter-intuitive until you consider the nature of the measure (instantaneous) and the nature of what's being measured (outstanding disk IO operations at that instant).  But by using the vHBA with higher adapter and device queue depths (thus allowing a higher queue length from the application side), the measured queue length was consistently lower.  A *lot* lower. :-)



Thursday, November 5, 2015

High DiskSpd Activity in VMware VMs - Part 2: More Questions

In my previous blog post I showed some perfmon graphs, with a 1 second collection interval, from some tests I was running with DiskSpd in an 8 vcpu VMware VM, against a guest LUN managed by the pvscsi vHBA.

High DiskSpd Activity in VMware VMs - Part 1: Start of an investigation
http://sql-sasquatch.blogspot.com/2015/11/high-diskspd-activity-in-vmware-vms.html 

In that particular test run, the main concern was the high privileged time on vCPU 0, which is also the location of the SQL Server logwriter.

Although most of my tests showed high CPU utilization on vcpu 0, there was some variability.  Sometimes it would be a different single vcpu bearing the cost of handling the vHBA.  Sometimes the work would be shared among a few vcpus.

Here are some results of a separate test, run mere minutes after the last test I shared, on the same system.  These results show another concern I have about high SQL Server disk IO within VMware VMs.

In this case, the cost of managing disk IO has been balanced across vCPUs 0, 1, 2, and 3.  The remaining 4 vcpus have little involvement in managing the disk IO.

That presents an interesting question - why is the cost dispersed in this case?  It may well be preferable to disperse the cost over 4 vcpus rather than keep a single vcpu (*especially* vcpu 0) nearly fully consumed with privileged time.
But there's another important question - one which may be familiar to lots of other folks running SQL Server in VMware VMs by the time I'm done.  What's going on in the nearly 6 second valley?

If I look at IOPs (reads/sec since this was a read-only test) the valley is evident, too.



Was the test running?  Oh, yeah.  In fact, this test had a target queue length (total outstanding IO against the tested Windows volume) of 16 - and the valley was the only sustained period during which that target was constantly achieved.  (As an aside: I still see folks - performance experts, even!! - proclaim 'current disk queue length' as a metric without value.  They're wrong.)
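For context, a DiskSpd invocation along these lines holds a fixed number of outstanding read I/Os against a single target.  The parameters below are illustrative stand-ins rather than the exact command from this test: 2 threads times 8 outstanding I/Os per thread gives the 16 total outstanding reads mentioned above.

```python
# Illustrative only - not the exact parameters from the test in this post.
# A read-only DiskSpd run holding 16 outstanding I/Os against one target
# (2 threads x 8 outstanding I/Os per thread).
import subprocess

diskspd = r"C:\tools\DiskSpd\diskspd.exe"   # hypothetical install location
target = r"T:\diskspd_test.dat"             # hypothetical test file on the pvscsi-backed volume

cmd = [
    diskspd,
    "-b8K",   # 8 KB I/Os
    "-r",     # random offsets
    "-w0",    # 0% writes, i.e. read-only
    "-t2",    # 2 threads per target
    "-o8",    # 8 outstanding I/Os per thread, per target
    "-d60",   # 60 second run
    "-Sh",    # disable software caching and hardware write caching
    "-L",     # capture latency statistics
    target,
]
print(subprocess.run(cmd, capture_output=True, text=True, check=True).stdout)
```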




But... if there was a constant queue length of 16 reads while IOPs dropped really low...

Yep.  Really high latency. So that valley in activity and utilization is really a performance divot.  (That's one of my favorite terms.  You might end up seeing it a lot if you read my blog regularly.)



Sometimes a chart of the values over time puts performance divots with extreme outlier values into better perspective.



Time      Disk Reads/sec  Current Disk Queue Length  Avg. Disk sec/Read
14:50:18        65389.65                         12             0.00022
14:50:19        58861.78                         16             0.00024
14:50:20           42.96                         16             0.31733
14:50:21           18.98                         16             0.71181
14:50:22            8.96                         16             0.50111
14:50:23          272.43                         16             0.09647
14:50:24            0.00                         16             0.00000
14:50:25           25.35                         16             1.24708
14:50:26        54419.92                         11             0.00039
14:50:27        65439.81                         12             0.00022

So - average read latency of under a millisecond at roughly 60,000 read IOPs, with a divot to under 300 read IOPs at average read latencies from 96 milliseconds to over 1 full second.  And one second, ending at 14:50:24, with NO reads completed at all.
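One way to see that the valley is a latency problem rather than the test backing off: Little's Law (outstanding I/Os are roughly IOPs times average latency) holds approximately across the table whether throughput is high or nearly zero.  Here's a minimal sketch with values copied from the table above; the match is only approximate, since queue length is an instantaneous sample and latency is averaged over whatever completed in that second.

```python
# Rough cross-check of the perfmon samples above using Little's Law:
#   outstanding I/Os ~= IOPS * average latency (in seconds)
samples = [
    # (time,      reads/sec, current queue length, avg sec/read) - from the table above
    ("14:50:19", 58861.78, 16, 0.00024),
    ("14:50:20",    42.96, 16, 0.31733),
    ("14:50:21",    18.98, 16, 0.71181),
]

for t, iops, qlen, latency in samples:
    print(f"{t}: IOPS x latency = {iops * latency:5.1f}  (sampled queue length = {qlen})")
    # ~14 outstanding each time - whether completing ~59,000 reads/sec at ~0.24 ms
    # or ~19 reads/sec at ~712 ms, the test is holding roughly its full target queue.
```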

That raises some serious questions.  I've seen this behavior numerous times on numerous systems.  Hopefully in the next couple of weeks I'll be able to get to the bottom of it, so I can share some answers rather than just more questions.  :-)