Tuesday, April 7, 2015

Another SQL Server VMware LSI vs pvscsi vHBA blog post

Stop me if you’ve heard this one before. :-)

I keep searching for a clear but complete way to tell this story – if you enough of my posts you’ll probably see various attempts to explain this – looking for something that gives folks an 'aha! ' moment.

Below are right-left comparisons of two 10 minute periods on the same system, a SQL Server VMware VM using fibre-channel SAN storage.  The graphs below are from perfmon LogicalDisk(_Total) counters.  Perfmon is using a 15 second capture interval.
The first time period sees between 200 and 450 mb/second disk throughput, with average read latency remaining below 6 ms.  That’s great.  The second time period, though, doesn’t look healthy.  Less than 60 mb/second of disk throughput with average read latency ranging from 400 ms to just over 2 full seconds.  What’s the deal?

Maybe the second time period is overloaded with IOPs?  Nope – not according to the comparison below.  The healthy first time period remains above 1600 IOPs for the most part, while the second time period never goes above 1600 IOPs.
Weird, huh? The same system averaging about 2000 IOPs and over 300 mb/second with read latency <6ms in the morning is struggling to do 1000 IOPs and 60 mb/second.
The second time period has more write activity – the first time period has almost no write activity.  Maybe the system is penalized for writes?  Synchronous replication could do that.  Insufficient write cache could do that.  Copy-on-write could do that.  Bypassing write cache could do that (especially if RAID5 or RAID6).  But all of that requires insight into the underlying SAN configuration and activity.  Anything else to look at in the VM guest before descending to the physical host and the SAN?  One avenue would be to split out metrics for each Windows guest logical drive – perhaps there are significant differences among them?  I won’t include the graphs here – I did that and although there were differences they were not the most significant issue on this system.
I like to refer to 'outstanding IO' as a missing dimension in SQL Server performance analysis and system design.  I hope examples like this explain why.

LogicalDisk perfmon counters allow us to look at 'Current Queue Length'.  Many folks believe that, since SAN LUNs are comprised of many underlying storage devices, the 'Current Queue Length' metrics are not very meaningful.  I hope to disabuse performance investigators of that notion. :-)

At a total disk queue length under 16, the system seemed to perform great.  But with a queue length above 100 – and up to nearly 800 – the system struggled.
The low queue length of the first time period is expected.  The first time period was SQL Server backup. 
If you stare at perfmon or other monitoring of SQL Server backup and checkdb long enough – some interesting patterns arise.  Basically, both backup and checkdb use a fixed maximum read queue length as they progress.  For the SQL Server backup, the maximum number of outstanding reads is determined by the number of backup buffers.  DBCC CHECKDB uses the number of Windows drives that the database data files occupy to plan its read queue length (unless trace flag 2549 is enabled globally – then it’s the number of files).  Tricky, huh?  Want to speed up backup or checkdb with physical_only?  Often that can be done by increase their read queue length!  Give backup more buffers, or use more files (and potentially more LUNs) for the database contents for checkdb.   I wasn’t even planning to go on that tangent – but I’ll leave it there.  We’ll pick up with more stuff about backup and checkdb performance another time.
The second time period – wow!  Certainly not backup or checkdb.  If you run a system like mine, which typically is engaged by more parallel queries than cores, with each query fairly data-hungry – this can be important: each SQL Server worker thread can perform its own reads.  Each worker thread can be responsible for more than 1 asynchronous read.  4 vcpus?  Maximum outstanding read liability isn’t directly tied to the number of vcpus.  It is more directly tied to the active number of workers.  And SQL Server Enterprise Edition always has at least 512 workers by default.  
At this point, time to examine the guest vHBA adapter configuration.  I’ve mentioned before that the LSI adapter does not handle high queue length well – the pvscsi vHBA is much better at that.  The LSI adapter can submit IOs up to 32mb in size – but here’s a secret that isn’t too well kept: SQL Server won’t do any disk IOs that large.  I’ve heard rumors that SQL Server can perform  8 mb reads – but the largest read I’ve ever seen from SQL Server was 4 mb.  Unfortunately, the pvscsi driver will only allow disk IO up to 512kb.  But, in my workload the larger the IO, the less frequent due to fragmentation and interleave of tables and indexes.  So sacrificing 1, 2, and 4 mb reads still has a potential performance effect – but in every case I’ve worked on with DSS workloads, the tradeoff from LSI adapter to pvscsi was a significant net gain.
In this case, the C: boot drive was left assigned to the LSI adapter, and the database data file and transaction log volumes were assigned to a pvscsi vHBA that was added to the VM.  A Windows registry value was used to increase the pvscsi vHBA adapter queue depth from its default of 256 to 1024(link below adds details for that).
Look at that! The target workflow (had to go earlier in the morning to get a 10 minute sample because it was finishing WAY faster now J) used to have less throughput than the backup.  Now its got more for part of the 10 minute period!  (This implies adding backup buffers during the backup, will probably speed the backup up, too.)
Read latency has decreased incredibly – it now stays comfortably below 100 ms.  This workflow is reasonably tolerant of read latency up to 100 ms, as long as throughput is high.

Wow – look at that jump in IOPs for the target workflow!  Under 1600 previously, IOPs peak over 9000 now.


I’m glad we increased the pvscsi vHBA queue depth from 256 to 1024 – even with the performance improvement, queue length nearly reaches 800.


Below are some of my favorite references when its time to add a pvscsi vHBA to a vm.

Retrofit a VM with the VMware Paravirtual SCSI Driver
Technobabble by Klee
David Klee

Large-scale workloads with intensive I/O patterns might require queue depths significantly greater than Paravirtual SCSI default values (2053145)



No comments:

Post a Comment