
Thursday, October 6, 2016

Migration to pvscsi from LSI for SQL Server on VMware; It *really* matters

I expected it to make a big difference.  Even so, I'm still pleasantly surprised by how much of a difference it makes.

About the VM:
8 vCPU system.
19 logical disks/physical disks other than the C: Windows install drive.
About the guest's vdisks:
Each guest physical disk is its own datastore, and each datastore is on a single ESXi host LUN.

On July 27th, the 19 SQL Server vdisks were distributed among 3 LSI vHBAs (with 1 additional LSI vHBA reserved for the C: install drive).

I finally caught back up with this system.  The LSI vHBA for the C: install drive has been retained, but the remaining 3 LSI vHBAs have been switched out for pvscsi vHBAs.

The nature of the workload is the same on both days, even though the amount of work done is different.  It's a concurrent ETL of many tables, with threads managed in a pool; the pool size is constant between the two days.

Quite a dramatic change at the system level :-)

Let's first look at read behavior before and after the change.  I start to cringe when read latency for this workload is over 150 ms; 100 ms I *might* be able to tolerate.  After changing to the pvscsi vHBA, read latency looks very healthy at under 16 ms.


OK, what about write behavior?

Ouch!! The workload can tolerate up to 10 ms average write latency for a bit; 5 ms is the performance target.  With several measures above 100 ms write latency on July 28th, the system is at risk of transaction log buffer waits, SQL Server free list stalls, and more painful than usual waits on tempdb.  But after the change to pvscsi, all averages are below 10 ms, with the majority of time below 5 ms.  Whew!



Queuing behavior is the most intriguing to look at :-) Maximum device and adapter queue depth is one of the most significant differences between the pvscsi and LSI vHBA adapters. The pvscsi adapter allows increasing the maximum adapter queue depth from the default of 256 all the way to 1024 (by setting a Windows registry parameter for "ringpages"). It also allows increasing device queue depth from the default of 64 to 256 (although storport will pass no more than 254 at a time to the lower layer).  By contrast, LSI adapter and device queue depths are both lower, and no increase is possible.

It may seem counter-intuitive unless you consider the nature of the measure (instantaneous) and the nature of what's being measured (outstanding disk IO operations at that instant).  But with the vHBA that allows higher adapter and device queue depths (and thus higher queue length from the application side), the measured queue length was consistently lower.  A *lot* lower. :-)



Tuesday, April 21, 2015

3rd Try Charm: Why does pvscsi rather than LSI matter so much for SQL Server on VMware?


I've written two blog posts recently on the performance benefits of the VMware pvscsi vHBA over the LSI vHBA.

March 23, 2015
SQL Server on VMware: LSI vHBA vs pvscsi vHBA
http://sql-sasquatch.blogspot.com/2015/03/sql-server-on-vmware-lsi-vhba-vs-pvscsi.html

April 7, 2015
Another SQL Server VMware LSI vs pvscsi vHBA blog post
http://sql-sasquatch.blogspot.com/2015/04/another-sql-server-vmware-lsi-vs-pvscsi.html

These posts give details from different systems (although running a similar workload), and claim a benefit of roughly 10-fold in peak throughput and peak disk response times from switching the SQL Server LUNs from the default LSI vHBA to the pvscsi vHBA.  That sounds a little fishy, doesn't it?  Especially if you are familiar with...

Achieving a Million I/O Operations per Second from a Single VMware vSphere® 5.0 Host
http://www.vmware.com/files/pdf/1M-iops-perf-vsphere5.pdf

Page 10 of the performance study above includes the following text.
"… a PVSCSI adapter provides 8% better throughput at 10% lower CPU cost."


That is a much more modest (and probably more believable) claim than a 10x performance benefit.  What gives?

Here are a few details about the mechanics:
The LSI vHBA has an adapter queue depth of 128, which cannot be increased. LUN queue depth cannot be increased from the default of 32.
The pvscsi vHBA has a default adapter queue depth of 256 and a default LUN queue depth of 64.  Adapter queue depth can be increased to 1024 and LUN queue depth to 256 with Windows registry settings.
http://www.pearsonitcertification.com/articles/article.aspx?p=2240989&seqNum=3
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2053145
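
For quick reference, the registry change from that kb article (dissected in more detail in the December 2013 post further down this page) boils down to a single value under the pvscsi driver key in the Windows guest, followed by a guest reboot:

REG ADD HKLM\SYSTEM\CurrentControlSet\services\pvscsi\Parameters\Device /v "DriverParameter" /t REG_SZ /d RequestRingPages=32,MaxQueueDepth=254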


And here's a detail about the testing that I just happened to come across in the Longwhiteclouds blog.


"Maximum queue depth supported by LSI driver (in guest) cannot be changed. So to keep the aggregate outstanding I/Os per controller lower than the max queue depth we had to use 16 OIOs per vDisk. To have a fair comparison between LSI and pvSCSI, second test also had 16 OIOs per vDisk for pvSCSI as well. " Chethan Kumar, author of the VMware paper, as quoted on Longwhiteclouds.

So, comparison testing was done within the queue depth constraints of the LSI vHBA.  But in the case of these Enterprise Edition SQL Server workloads, the number of outstanding IOs would often exceed 600, and microbursts as high as 1000 outstanding IOs occurred.  That's well outside the LSI adapter queue depth, and the queuing penalty can be high.  Even with 4 LSI adapters in a VM, the aggregate adapter queue depth would be only 512.

If a SQL Server workload doesn't burst more than 32 outstanding IOs per LUN or more than 128 outstanding IOs per vHBA adapter, the change to pvscsi would most likely bring rather modest performance benefits - along the lines of the 8% better throughput at 10% lower CPU utilization indicated in the whitepaper.  In fact, at that low level of outstanding IO... maybe there would be a slight performance decline.  That's because the LSI vHBA can allow an IO request up to 32 MB in size.  SQL Server won't (yet) perform disk IO that large; the largest disk IO I've seen from SQL Server has been 4 MB.*  The pvscsi vHBA adapter currently allows a maximum disk IO size of 512 KB.

However - really large disk IOs from SQL Server are in my experience fairly rare, and high aggregate queue length is more common.  For that reason, I heartily recommend using the pvscsi vHBA for SQL Server VMs.  Retaining the LSI vHBA for the boot drive is common, even when pvscsi vHBAs are added for database LUNs; I've got nothing against that approach.  But it's important to ensure that a SQL Server VM can handle the outstanding IO generated by its workload.  CPUs are hungry - feed them lots of data quickly :-).


*But Niko has shown that columnstore will do read IO up to 8 MB :-)
Clustered Columnstore Indexes – part 50 ("Columnstore IO")
http://www.nikoport.com/2015/04/04/clustered-columnstore-indexes-part-50-columnstore-io/       
 

Tuesday, December 3, 2013

VMware ESX 5.1u1: Increasing pvscsi Adapter Queue Depth... Hey, What on Earth are RequestRingPages?

Figured I'd write about this, because it won't be long before I forget it all again.  A few weeks from now, I'll need this... won't be able to piece it together from memory or from VMware documentation.   Then I'll google... find my own blog post... refresh my memory... rinse... repeat :)

SQL Server performance on virtual platforms is a big deal these days.  It's pretty rare that folks really get to the guts of what may limit the performance of a particular workload on a virtual platform.

I'm lucky that the main SQL Server workflows I am concerned with are all on the same end of the spectrum between latency sensitive and bandwidth hungry.

VMware ESX 5.1 update 1 contains a change that allows increasing the pvscsi vHBA adapter queue depth from the default of 256 to 1024.  That's a good thing, since the default for Windows on a physical server with a QLogic, Brocade, or Emulex FC HBA is 1024 outstanding IOs per adapter port.

In fact, when my data-hungry workflows were tested in-house on VMware vSphere 5.0 and 5.1, it became apparent that the difference in adapter queue depth was a significant factor in the performance gap between SQL Server on a physical server and SQL Server on VMware on the same hardware.  With increased adapter queue depth, and with 4 vHBAs attached to the VM, performance of the target workflows was nearly indistinguishable from the physical server.

The instructions for increasing the vHBA adapter queue depth are here:
Large-scale workloads with intensive I/O patterns might require queue depths significantly greater than PVSCSI default values (2053145)
http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&docTypeID=DT_KB_1_1&externalId=2053145 

You might find the kb article a little cryptic :)

/*
As an aside, let me mention that the VMs I test use dedicated server hardware, dedicated HBAs, and dedicated LUNs.  For that reason, parameters like Disk.SchedNumReqOutstanding (ESXi 4.x, 5.0, 5.1) or "--schednumreqoutstanding | -O" (ESXi 5.5) are not something I typically pay attention to.  If your configuration does share resources such as vSphere host LUNs among VMs, you may want to read these:
Setting the Maximum Outstanding Disk Requests for virtual machines
http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&docTypeID=DT_KB_1_1&externalId=1268
http://pubs.vmware.com/vsphere-55/index.jsp?topic=/com.vmware.vcli.ref.doc/esxcli_storage.html
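
If you do need to adjust it on ESXi 5.5, a minimal sketch with esxcli looks like this (naa.xxx is a stand-in for your own device identifier, and 64 is just an example value):

esxcli storage core device list -d naa.xxx
esxcli storage core device set -d naa.xxx -O 64

The first command reports the device's current settings; the second changes the maximum number of outstanding requests the host scheduler will allow the device when multiple VMs are competing for it.
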
*/

The VMware kb article gives examples of increasing the host LUN queue depth by setting ql2xmaxqdepth (QLogic) and lpfc_lun_queue_depth (Emulex) to 128.

QLogic has a related post that also gives an example of setting the ql2xmaxqdepth LUN queue depth to 128.
https://support.qlogic.com/app/answers/detail/a_id/2189/~/configuring-queue-depth-for-large-workloads-in-esxi5.1x
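
As a sketch of what that looks like in practice with esxcli (qla2xxx is an assumption here - the module name varies by HBA and driver version - and a host reboot is needed afterward):

esxcli system module parameters set -m qla2xxx -p "ql2xmaxqdepth=128"
esxcli system module parameters list -m qla2xxx

The second command simply confirms the value; for an Emulex HBA, substitute the Emulex driver module and lpfc_lun_queue_depth.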

But, apparently, in tests for IBM, ql2xmaxqdepth was set as high as 256.
http://www.vmware.com/a/assets/vmmark/pdf/2012-04-12-IBM-FlexSystemx240.pdf
(page 7)

The most recent Emulex documentation I could find still documents 128 as the maximum value for the lpfc_lun_queue_depth LUN queue depth.
http://www-dl.emulex.com/support/elx/rt960/b12/docs/vmware/vmware_manual_elx.pdf
(page 23)

Dell, for their part, documents setting both ql2xmaxqdepth and lpfc_lun_queue_depth to 255 in Compellent Best practices :)
http://en.community.dell.com/cfs-file.ashx/__key/telligent-evolution-components-attachments/13-4491-00-00-20-43-79-43/Compellent-Best-Practices-with-VMware-ESX-4.x.pdf
(page 10)

Because of some nasty experiences with out-of-bounds parameter values resulting in unwanted effective values (and unwanted performance consequences), we decided to stick with a host-level LUN queue depth of 128.  It worked well enough for what we were doing - there does seem to be coalescing of guest IOs at the ESX level: even though we increased guest LUN queue depth to 254 (with guest LUNs in a one-to-one relationship with host LUNs), we never overflowed the host LUN queue depth of 128.

Why does the kb article mention a new maximum Windows guest LUN queue depth of 256, while the example sets it to 254?  My guess: the difference between actual and effective queue depth settings.  In other VMware documentation, an allowed queue depth of 32 for a LUN results in explicitly setting the queue depth to 30.  It's kinda similar with IBM Power AIX LPARs that use vscsi-attached devices served through VIOS: the vscsi adapter has a maximum queue depth of 512, but 2 are reserved, so it can be set to 510.  LUN queue depth (the sum of which should be less than the vscsi adapter queue depth) should allow for 3 outstanding IOs to be used by the virtualization layer.  I expect something similar here: a LUN queue depth of 256 is allowed for the guest, but 2 need to be reserved for the virtualization layer, so the maximum effective LUN queue depth is 254.

Then there's the question that motivated me to write this post: how does the vhba queue depth actually get increased to 1024?  And what on earth is RequestRingPages?

The kb article includes this somewhat cryptic example of a modification to the Windows registry.
REG ADD HKLM\SYSTEM\CurrentControlSet\services\pvscsi\Parameters\Device /v "DriverParameter" /t REG_SZ /d RequestRingPages=32,MaxQueueDepth=254 
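
To check what a guest currently has configured, the value can simply be read back (and remember the pvscsi driver only picks up a change after a guest reboot):

REG QUERY HKLM\SYSTEM\CurrentControlSet\services\pvscsi\Parameters\Device /v DriverParameter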

Hmmm... decoder ring maybe? :) MaxQueueDepth sounds like it should be the per-LUN queue depth in the Windows guest.  And, if I'm right about 2 IO slots reserved for the virtualization platform, that makes sense.   But what is RequestRingPages, and where do we set the PVSCSI vHBA adapter queue depth to 1024?

RequestRingPages indicates a configuration for the PVSCSI driver - and what if I told you that each page allowed 32 slots for outstanding IO requests on the adapter driver?  That would work out splendidly: it would result in this registry edit giving an effective LUN queue depth of 254 (the most allowed with an actual queue depth limit of 256, reserving 2 for the virtual platform) and a queue depth of 1024 outstanding IO requests for the PVSCSI adapter.
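
Spelled out as arithmetic (32 request slots per ring page is my inference from the numbers lining up, not something the kb article states):

32 RequestRingPages x 32 request slots per page = 1024 outstanding IOs for the pvscsi adapter
254 effective per-LUN queue depth = 256 actual per-LUN limit - 2 reserved for the virtualization layer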