Monday, November 2, 2015

High DiskSpd Activity in VMware VMs - Part 1: Start of an investigation

I've started testing VMware VMs (with the pvscsi vHBA) under high levels of simulated SQL Server disk activity, generated with DiskSpd.

You can continue with this story at my next blog post:
High DiskSpd Activity in VMware VMs - Part 2: More Questions
http://sql-sasquatch.blogspot.com/2015/11/high-diskspd-activity-in-vmware-vms_5.htm

I want to find out how much the idleLoopSpinBeforeHalt/idleLoopMinSpinUS settings pair, the vSphere 5.5+ "Latency Sensitivity=High" option, and the corespersocket configuration impact a high-disk-throughput VM when the vCPU:pCPU ratio is NOT oversubscribed.


Here is a VMware KB article about the idleLoopSpinBeforeHalt/idleLoopMinSpinUS settings pair, which I am interested in testing because it is, in some sense, the least invasive of these options.

Workloads perform poorly on ESX SMP virtual machines (1018276)

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1018276
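
These are per-VM advanced settings. Below is a minimal PowerCLI sketch of applying them; the VM name is a hypothetical placeholder, the spin value is illustrative rather than a recommendation, and KB 1018276 above is the authority for exact values. The settings take effect at the next VM power cycle.

# Sketch only: "sqltest01" and the spin value of 100 are placeholders, not recommendations.
# -Force overwrites the setting if it already exists; see KB 1018276 for guidance on values.
$vm = Get-VM -Name "sqltest01"
New-AdvancedSetting -Entity $vm -Name "idleLoopSpinBeforeHalt" -Value "TRUE" -Force -Confirm:$false
New-AdvancedSetting -Entity $vm -Name "idleLoopMinSpinUS" -Value "100" -Force -Confirm:$false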

The following documents approach latency sensitivity more broadly, discussing both the idleLoopSpinBeforeHalt/idleLoopMinSpinUS settings pair and the "Latency Sensitivity=High" option available in vSphere 5.5 and beyond.

Deploying Extremely Latency-Sensitive Applications in VMware vSphere 5.5
http://www.vmware.com/files/pdf/techpaper/latency-sensitive-perf-vsphere55.pdf

Best Practices for Performance Tuning of Latency-Sensitive Workloads in vSphere VMs
https://www.vmware.com/files/pdf/techpaper/VMW-Tuning-Latency-Sensitive-Workloads.pdf 
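
"Latency Sensitivity=High" is normally set in the vSphere Web Client under VM Options, but it can also be driven from PowerCLI through the vSphere 5.5+ API's latencySensitivity property. The sketch below is just an illustration under that assumption; the VM name is hypothetical, and the docs above note that full CPU and memory reservations are expected before the setting delivers its intended behavior.

# Sketch: set Latency Sensitivity to "high" via the vSphere API from PowerCLI.
# "sqltest01" is a hypothetical VM name.
$vm   = Get-VM -Name "sqltest01"
$spec = New-Object VMware.Vim.VirtualMachineConfigSpec
$spec.LatencySensitivity = New-Object VMware.Vim.LatencySensitivity
$spec.LatencySensitivity.Level = "high"   # low / normal / medium / high
$vm.ExtensionData.ReconfigVM($spec)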

The best practices document above contains this statement:
"vNUMA is automatically enabled for VMs configured with more than 8 vCPUs that are wider than the number of cores per physical NUMA node." (p5)

The general performance effect of the corespersocket setting on vNUMA configuration is documented in this blog post.
Does corespersocket Affect Performance?
https://blogs.vmware.com/vsphere/2013/10/does-corespersocket-affect-performance.html
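
Corespersocket can be changed in the client while the VM is powered off, or through the same ConfigSpec route as above. This sketch is only an illustration: the VM name is a placeholder, and the counts show an 8 vCPU VM presented as a single 8-core socket.

# Sketch: present 8 vCPUs as one 8-core virtual socket (vs., e.g., 4 cores x 2 sockets).
# "sqltest01" is a placeholder; the VM must be powered off to change these values.
$vm   = Get-VM -Name "sqltest01"
$spec = New-Object VMware.Vim.VirtualMachineConfigSpec
$spec.NumCPUs           = 8   # total vCPUs
$spec.NumCoresPerSocket = 8   # cores per virtual socket; drives the presented topology
$vm.ExtensionData.ReconfigVM($spec)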

I'm specifically interested in the potential impact on disk IO.

I ran a test with a nominal target queue length of 64 total outstanding disk IOs (100% read) against a single, fairly small (8GB) test file.  Latency was quite good, as expected with data being served almost exclusively from SAN cache.  Interestingly, although the requested queue length was 64, the instantaneous once-per-second measurements from perfmon had considerable range and were generally much lower than 64.
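
For context, the command below is roughly the shape of DiskSpd invocation that produces that profile: 8 threads with 8 outstanding IOs each for 64 total, 100% random reads, against a single 8GB file. The block size, duration, and file path are placeholders, not the exact parameters of this test.

# 8 threads x 8 outstanding IOs per thread = 64 total outstanding IOs, 100% random reads,
# software and hardware caching disabled (-h), per-IO latency captured (-L).
# Block size (-b), duration (-d), and the file path are illustrative placeholders.
.\diskspd.exe -c8G -b8K -r -w0 -t8 -o8 -d120 -h -L C:\diskspdtest\testfile.dat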


How did that low latency translate into IOPs?  Pretty good, pretty good: hovering near 75,000 IOPs.  This was an 8 vCPU VM.  CPU utilization, especially privileged time, is a little worrying, with nearly a full vCPU consumed by privileged time.
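
If you want to grab the same counters from PowerShell instead of the perfmon GUI while a test runs, something like the Get-Counter sketch below works; these are standard perfmon counters, sampled at the same once-per-second interval mentioned above.

# Once-per-second samples of per-vCPU privileged time plus read IOPs, queue length,
# and read latency; run this alongside the DiskSpd test.
Get-Counter -Counter @(
    '\Processor(*)\% Privileged Time',
    '\PhysicalDisk(*)\Disk Reads/sec',
    '\PhysicalDisk(*)\Current Disk Queue Length',
    '\PhysicalDisk(*)\Avg. Disk sec/Read'
) -SampleInterval 1 -MaxSamples 60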



In the graph below we can see that vCPU 0 is bearing almost all of the weight of the test.  Recall that the SQL Server logwriter is on vCPU 0.  From the excellent work by Chris Adkin detailed in the posts below (and in other posts of his), it is clear that transaction log writer performance, and general transaction performance, can be greatly impacted when the logwriter has to fight for time on CPU.
Large Memory Pages, How They Work and The LOGCACHE_ACCESS Spinlock
http://exadat.co.uk/2015/01/20/large-memory-pages-how-they-work-and-the-logcache_access-spinlock/

WRITELOG At Scale: Going Beyond "You need faster disks"
http://exadat.co.uk/2015/06/11/writelog-at-scale-going-beyond-you-need-faster-disks/

Almost all of the vCPU processor time is privileged time.


All of the DPCs are associated with vCPU 0.



That can be a significant performance concern, as indicated in the source below.

Analyzing Interrupt and DPC Activity
https://technet.microsoft.com/en-us/library/cc938646.aspx
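
To see the per-processor spread of DPC and interrupt activity without opening the perfmon GUI, the same Get-Counter approach works; processor instance 0 corresponds to vCPU 0, where the DPCs are landing in this test.

# Per-processor DPC and interrupt counters; instance 0 is vCPU 0.
Get-Counter -Counter @(
    '\Processor(*)\% DPC Time',
    '\Processor(*)\DPCs Queued/sec',
    '\Processor(*)\% Interrupt Time'
) -SampleInterval 1 -MaxSamples 30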

