***** Update - if you want to see the goodness that is the pvscsi vHBA for high outstanding disk IO workloads, check out the links at the bottom of this post. *****
As a disclaimer to the post below: I try to establish best practices for system setup. But in the course of performance interventions, negotiation is necessary. Although I prefer 3 or 4 pvscsi vHBA adapters with database LUNs distributed among them (or boot drive isolated on LSI with database LUNs distributed among the 3 pvscsi vHBAs), in negotiation I'm willing to settle for a single vHBA with its queue depth increased from 256 to 1024. Sometimes, ya gotta take what ya can get early on. Prove you know what you are talking about, then you can ask for more.
Here's a system I worked on last year, as an illustration of the difference between vHBAs and an example of the performance gain that can come from properly accommodating the number of outstanding IOs in a SQL Server guest.
Here are some graphs representing the system before intervention. An ETL is occurring until approximately 9:30, at which time a batch reporting workload begins.
But this system wasn't doing much work per unit of time. Before intervention it had 6 vcpus, and they barely ever broke a sweat. In large part, that's because this vm was simply starved for data.
The system never broke 60 MB/sec of disk traffic, and rarely exceeded 1000 IOPs. Even at that low level of activity, read and write response times measured in perfmon were astounding - typically over 1 full second!
The database MDF and NDF files - all 4 in a single filegroup - were each on their own Windows drive letter, so four physical/logical volumes in the guest received the majority of the traffic in this vm. Below is the correlation between the total "current queue length" of these disks and the read and write latency.
The intervention in this system was a negotiation. Server administrators wanted to increase the vcpus from 6 to 8 - even though the system wasn't doing a high rate of work at the time, there was a sense that when tuned, it would likely exceed the capabilities of the 6 vcpus.
When this environment was requested, at least 40 15k rpm (shared) disk spindles were specified. The database LUNs were actually served from 80 shared spindles. After discussing the storage setup with all parties, I learned that the host LUNs were presented to the guest as RDMs. I was glad for that - it eliminated checking for ESX adaptive queuing, SIOC, or an unfavorable relationship among host LUNs, datastores, and vmdks.
The guest LUNs were all associated with a single LSI vHBA in the guest. Suddenly the very high latencies made sense. The LSI vHBA adapter can perform well at low queue lengths. But above a queue length of 64 - and especially above 128 for a single adapter - queuing penalties become rather absurd. So the second intervention was to add a pvscsi vHBA and increase its adapter queue depth from the default of 256 to 1024. (In general, I prefer to max out VMs at 4 vHBAs - with at least 3 of them pvscsi adapters. Sometimes, you take what you can get :-) )
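If memory serves, the queue-depth increase itself is a guest-side registry change described in VMware KB 2053145 (linked at the bottom of this post) - roughly the following, though verify the exact string against the KB before using it:

```
REG ADD HKLM\SYSTEM\CurrentControlSet\services\pvscsi\Parameters\Device /v DriverParameter /t REG_SZ /d "RequestRingPages=32,MaxQueueDepth=254"
```

A guest reboot is needed for the pvscsi driver to pick up the new ring-page count.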
The change in system behavior was remarkable. Look at that! From peaks under 60 MB/sec to peaks over 600 MB/sec. Other than 3 brief peaks in average write latency, all latency averages (15 second captures) remained under 300 ms. That's a considerable gain over the system's previous behavior, where latency peaked above 3 full seconds.
IOPs were previously rarely over 1000 - after intervention the system periodically saw 15,000 IOPs and sometimes more.
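Little's Law gives a quick sanity check on those before-and-after numbers: average outstanding IO is roughly IOPS times response time. A rough Python sketch with round figures from the graphs (the ~100 ms "after" latency is an assumed typical value for illustration, not a measurement):

```python
# Little's Law: average outstanding IOs = throughput (IOPS) x response time (seconds)
def outstanding_io(iops, response_time_sec):
    return iops * response_time_sec

# Before: ~1000 IOPs at ~1 second average latency
# => roughly 1000 IOs in flight or waiting - far past what a single
# LSI vHBA handles gracefully.
print(outstanding_io(1000, 1.0))    # 1000.0

# After: ~15000 IOPs at an assumed ~100 ms typical latency
# => roughly 1500 outstanding IOs, now absorbed by a pvscsi adapter
# queue raised to 1024 (still worth watching).
print(outstanding_io(15000, 0.1))   # 1500.0
```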
The increase in vcpus does seem to have been worthwhile - now that the vm was moving more data, the workflow could engage more compute power.
Although remarkable improvement had been made in this database environment, the target read response time was 100 ms at 600 MB/sec. Target write latency in this type of environment is between 1 and 5 ms - when write latency trends with read latency in a SAN environment, it typically indicates a queuing issue or write cache saturation.
In this case, queuing at the LUN level was still exacting a price. In the original specs for this database, 8 host LUNs were requested, each planned to hold a single database file, with all 8 database files in a single filegroup for the busy reporting database. Tempdb, a separate staging database, and the transaction logs were each to be located on their own separate LUNs. Although tempdb and the transaction logs were not colocated on LUNs with the reporting database data, only 4 LUNs and files were given to the reporting database filegroup.
Re-layout operations of large databases in SQL Server are not easy. I've helped numerous organizations through them, however, to overcome the throughput and scalability limits of insufficient aggregate LUN queue depth. In the course of the next few months I'm fairly certain this particular system will see a re-layout operation, spreading reporting database contents over at least the 8 LUNs initially requested. And they'll see a performance benefit - as long as they've attended to queuing and resource concerns in the addition of the new LUNs.
Rather interesting how much difference there is between the pvscsi vHBA and the LSI vHBA under this intensive workload. Much more powerful than the "8% better throughput at 10% lower CPU cost" indicated on page 10 of the following whitepaper :-) But that will have to be a topic for another day.
"Achieving a Million I/O Operations per Second from a Single VMware vSphere 5.0 Host"
PS... I use this case as support for my "you can't super-scale SQL Server on a single LUN" argument. Why? Because a single logical or physical volume in Windows (whether on a physical or virtual server) can only accommodate 256 outstanding IOs in the sum of its service queue and wait queue. Even in the "after" state with much improved disk IO, the total outstanding queue length was often seen well above 256. If this system were on a single guest LUN, SQL Server workers wanting to submit additional IOs past the 256 limit would have been suspended, and CPU utilization would have been throttled by saturation of the IO queues. I see that too, sometimes.
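That 256-per-volume ceiling is why LUN count matters: aggregate outstanding-IO headroom scales with the number of guest volumes. A trivial sketch, using the 256 per-physical-volume limit mentioned above:

```python
PER_VOLUME_LIMIT = 256  # service queue + wait queue per Windows physical volume

def aggregate_headroom(lun_count, per_volume=PER_VOLUME_LIMIT):
    # Total outstanding IOs the guest's data volumes can hold before
    # SQL Server workers block waiting to submit more.
    return lun_count * per_volume

print(aggregate_headroom(1))  # 256  - single-LUN layout: workers stall early
print(aggregate_headroom(4))  # 1024 - this system's current data LUN count
print(aggregate_headroom(8))  # 2048 - the layout originally requested
```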
And what about flash and SSD storage? Well... queuing don't care how fast the service time is, if the batched inter-arrival time is less than the service time. It's simple math; I'll leave the proof to the reader.
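To make that "simple math" concrete, here's a minimal sketch, assuming a hypothetical 0.2 ms flash service time and a burst of IOs arriving on a single queue at effectively the same instant:

```python
def last_io_wait_ms(batch_size, service_ms):
    # On a single service queue, the last IO in a simultaneous batch
    # waits behind the (batch_size - 1) IOs ahead of it,
    # no matter how fast each individual service is.
    return (batch_size - 1) * service_ms

# 0.2 ms flash service time, 1000 IOs submitted in one burst:
print(last_io_wait_ms(1000, 0.2))  # ~199.8 ms of queuing delay before service
```

Fast media shrinks the multiplier, but a deep enough burst on one queue still produces latency measured in hundreds of milliseconds.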
Retrofit a VM with the VMware Paravirtual SCSI Driver
Technobabble by Klee
Large-scale workloads with intensive I/O patterns might require queue depths significantly greater than Paravirtual SCSI default values (2053145)
All other things being equal, I prefer 4 pvscsi vHBAs at queue depth 256 to a single pvscsi with queue depth at 1024. But I've seen workloads that can demand 4 at 1024 :-)
There are other important optimizations for IOPs, OIO, and bytes/sec for SQL Server on VMware*. But this is an important one that's not getting much press yet.
*I blog slowly, sorry. But I'll try to get to them all :-)