
Tuesday, April 18, 2017

How many vcpus make the cores on this ESXi server oversubscribed?


How many vcpus can a given physical server host before its physical cores are oversubscribed?


The CPU Scheduler in VMware vSphere® 5.1 
http://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/techpaper/vmware-vsphere-cpu-sched-performance-white-paper.pdf

Best Practices for Oversubscription of CPU, Memory, and Storage in vSphere Virtual Environments
How far can oversubscription be taken safely?
https://communities.vmware.com/servlet/JiveServlet/downloadBody/34283-102-2-46887/Dell%20%20%20Best%20Practices%20for%20Oversubscription%20of%20CPU%20%20Memory%20and%20Storage%20in%20vSphere%20Virtual%20Environments_0%20(1).pdf

The most common answer is that oversubscription begins when vcpu count is greater than core count; as long as the sum of vcpus across all VMs is less than or equal to the core count, the physical server's cores are typically not considered oversubscribed.
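
Just to make the arithmetic concrete, here's a minimal sketch (in Python, with made-up VM counts) of that common rule of thumb:

```python
# Minimal sketch: flag vcpu:core oversubscription for one host.
# vcpu_counts is a hypothetical inventory of per-VM vcpu counts;
# substitute whatever your own inventory reports.
def oversubscription_ratio(vcpu_counts, physical_cores):
    total_vcpus = sum(vcpu_counts)
    return total_vcpus / physical_cores

vcpu_counts = [24, 8, 4, 4]      # example VMs on one host
physical_cores = 24

ratio = oversubscription_ratio(vcpu_counts, physical_cores)
print(f"total vcpus: {sum(vcpu_counts)}, ratio: {ratio:.2f}")
# By the rule of thumb, ratio <= 1.0 is "not oversubscribed";
# the 2012 Dell paper treats up to roughly 3.0 as typically workable.
```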

But the vcpus for a VM are not the only CPU scheduling needs on the ESXi server, or even for the VM itself.  The hypervisor needs some CPU resources - these are among the system worlds on the ESXi server.  In addition, as of ESXi 6 each VM has 4 non-vcpu worlds that must occasionally be scheduled on physical cores.  These worlds take action on behalf of the VM, but their work is not in the guest context of the VM: stuff like handling the IO after it's been handed off by the guest.

Imagine an ESXi server with 24 physical cores and a single 24-vcpu VM.  Let's keep all 24 guest vcpus very busy with a SQL Server workload - a very demanding ETL with data coming in over the network.  The vcpus are bound to the cores most of the time since they've got runnable SQL Server threads.  Those SQL Server threads are handling data coming in from the network, and also issuing reads and writes to the disk subsystem.

The hypervisor sooner or later has stuff it's gotta do: when the hypervisor takes time on any core, that's time denied to a guest VM vcpu.  The non-vcpu worlds for the VM have stuff to do: time they spend on a physical core is time denied to a guest vcpu.  Even with relaxed co-scheduling, it's still possible for the skew between the lead and lag vcpus of the VM to exceed the threshold, throttling the VM's vcpus with co-stop time.

The idea of oversubscribing the cores of a server (and implementing co-scheduling policy, VM fairness policy, etc.) is to drive up utilization.  The question serving as subtitle to the Dell paper above should be a clue that this approach to achieving maximum resource utilization can be antithetical to achieving the highest level of performance.

Setting aside whether the instructions being executed are necessary or efficient (fundamental application- and database-level questions), and how much of the utilization is management overhead rather than meaningful work (typically a database-level evaluation), a remaining gold mine is whether high resource utilization is a higher priority goal than limiting wait time for the resource.  (This is one huge reason that goals are extremely important to any conversation about performance and scalability.)

Even at one busy vcpu per core on the ESXi server, hypervisor and non-vcpu worlds can result in %ready time for the VM's vcpus: time when there is a runnable thread within the guest, dispatched to the vcpu, but the vcpu is waiting for time on a core.  And co-scheduling policy can amplify that under prolonged heavy demand.
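
If you're checking for that %ready time in vCenter rather than esxtop, the ready counter is a summation in milliseconds per sample interval; here's a small sketch of the usual conversion (20 seconds is the real-time chart interval - use the matching interval for rolled-up stats):

```python
# Sketch: convert vCenter's cpu.ready summation (milliseconds of ready time
# accumulated during one sample interval) into a ready percentage.
# The 20-second interval matches vCenter real-time charts; rolled-up
# intervals (300 s, 1800 s, ...) need the matching value.
def ready_percent(ready_ms, interval_seconds=20, vcpu_count=1):
    # Fraction of the interval spent ready-but-not-running, normalized per
    # vcpu so a 24-vcpu VM isn't judged against a 2400% scale.
    return (ready_ms / (interval_seconds * 1000.0)) / vcpu_count * 100.0

# Example: 4000 ms of ready time in a 20 s sample for a 24-vcpu VM
print(f"{ready_percent(4000, 20, 24):.1f}% ready per vcpu")   # ~0.8%
```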

So, typically, no one will raise an eyebrow if the number of vcpus on an ESXi server is equal to the number of cores.  In fact, the numbers in the 2012 whitepaper above - that 1-3 times the number of cores is typically not a problem - are not too uncommon out there.  And in cases where high utilization is the primary goal, that's not bad.

But if the primary goal is squeezing as much as possible out of a certain number of vcpus (think a SQL Server EE VM licensed per vcpu), don't be too surprised if someone like me comes along and starts scrutinizing wait time for CPU at the SQL Server and ESXi levels, and maybe trying to talk someone into lowering the vcpu count to something below the core count, or using a full reservation for that VM's vcpus while making sure there's always enough time for the VM's non-vcpu worlds... or trying to get the VM marked as "latency sensitivity = high" 😀

Wednesday, March 1, 2017

Plotting the CPU in a VM shouldn't look like this...

***** Author's note *****
My half-baked drafts often contain more questions than answers 😀😀😀
When I get more information, I may be able to come back and update.  If I actually answer the questions, hopefully I'll promote to a full baked blog post.  No promises, though.

If this VMware half-baked draft intrigues you, this related half-baked draft might also be of interest.
Windows Guest: VMware Perfmon metrics intermittently missing?
http://sql-sasquatch.blogspot.com/2016/08/windows-guest-vmware-perfmon-metrics.html
**********

I've become accustomed to perfmon graphs that look like this from within a VMware VM.

It's a 4-vcpu system.  When I first worked with VMs, I'd expect that metric to align with 100 * vcpu count - so a max value of 400.  But as you can see below, it approaches 500.  Maybe that's because the non-vcpu worlds consumed physical CPU time on behalf of the VM on pcpus other than the 4 pcpus serving the vcpus.  (So the vcpus themselves could consume up to 400%, with the non-vcpu worlds adding some beyond that.)  It might also be due to calculations based on the rated frequency of the pcpus with speedstep kicking in.  Could even be both?  Maybe by the time I convert this from a half-baked draft to a full-fledged blog post I'll be able to propose a way to know what leads to the overage.
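
One way to poke at the speedstep hypothesis is to compare the VMware-supplied "Effective VM Speed in MHz" and "Host processor speed in MHz" counters over the same collection; a rough sketch follows (the file name is a placeholder, and columns are located by substring since logman embeds the machine name in the headers):

```python
import csv

# Sketch: if effective/host MHz runs above 1.0, the pcpus were running above
# rated frequency, which by itself could push a frequency-based utilization
# calculation past 100% per vcpu.  Counter names are the ones VMware Tools
# exposes to perfmon; verify the instance names against your own log.
def find_col(headers, *fragments):
    return next(i for i, h in enumerate(headers)
                if all(f in h for f in fragments))

with open("perfmon_log.csv", newline="") as f:    # hypothetical file name
    rows = list(csv.reader(f))

hdr    = rows[0]
i_eff  = find_col(hdr, "VM Processor", "Effective VM Speed in MHz")
i_host = find_col(hdr, "VM Processor", "Host processor speed in MHz")

for r in rows[1:]:
    try:
        eff, host = float(r[i_eff]), float(r[i_host])
    except (ValueError, IndexError):
        continue        # blank sample - see the August 2016 post further down
    if host and eff / host > 1.0:
        print(f"{r[0]}  effective/host = {eff / host:.2f}")
```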


Here's the reason for this half-baked blog post: the graph below took me by surprise.

Something weird is going on - probably outside this VM - and it's severely affecting the reported relationship between guest vcpu utilization in 'Processor Info' and 'VM Processor'.

It's important to understand what 'VM Processor(*)\%Processor Time' means in the context of a VM.  It means the time there are runnable instructions already on the vcpu.  But the vcpu might not be bound to a pcpu beneath.  For example - what if the pcpus are oversubscribed, and two busy VMs are fighting over the same pcpus?  The vcpus in the guests could report 100% busy while only being bound to pcpus 50% of the time.  In that scenario, high %rdy time would be expected at the ESXi host level.  Could be a %CSTP situation, too - a co-scheduling requirement of a multi-vcpu VM.  Could be lots of halt & wakeup cycles, resulting in higher reported vcpu utilization in the guest than pcpu utilization at the host.  Lots of migrations of vcpus across NUMA nodes could also lead to higher vcpu utilization reported in the guest than pcpu utilization at the host.

Could also be a memory mismatch.  If the guest is performing operations on vRAM which is not actually backed by physical RAM in the ESXi host, operations may be counted as 100% CPU time within the guest even though the host registers a little bit of CPU time and a lot of wait time for paging/swap space traffic.  In a similar fashion, vMotion means a lot of memory traffic becomes disk traffic, and vcpu utilization in the guest can be exaggerated by sluggish interaction with vRAM.  But two hours of wall clock time is an awful lot of vMotion :-)

But I can't shake the feeling that the nearly perfect 1:00 am to 3:00 am window means something very important here.  Maybe VM backups in the shared datastore that has this VM's vmdk?

I'll be back hopefully to update this one with more detail in the future...

Current suspects include:
- guest vRAM not backed by host physical RAM
- oversubscribed pcpus & excessive %rdy
- excessive %costop
- excessive migrations
- excessive halt/wakeup cycles
- VM backups in a shared datastore
- excessive vMotion activity
- a patch applied to the ESXi host?



Thursday, October 6, 2016

Migration to pvscsi from LSI for SQL Server on VMware; It *really* matters

I expected it to make a big difference.  Even so, I'm still pleasantly surprised how much of a difference it makes.

About the VM:
8 vcpu system.
19 logicaldisks/physicaldisks other than the C: Windows install drive.
About the guest's vdisks:
Each guest physicaldisk is its own datastore, and each datastore is on a single ESXi host LUN.

On July 27th, the 19 SQL Server vdisks were distributed among 3 LSI vHBA (with 1 additional LSI vHBA reserved for the C install drive).

I finally caught back up with this system.  An LSI vHBA for the C install drive has been retained, but the remaining 3 LSI vHBAs have been switched out for pvscsi vHBAs.

The nature of the workload is the same on both days, even though the amount of work done is different.  It's a concurrent ETL of many tables, with threads managed in a pool; the pool size is constant between the two days.

Quite a dramatic change at the system level :-)

Let's first look at read behavior before and after the change.  I start to cringe when read latency for this workload is over 150 ms.  100 ms I *might* be able to tolerate.  After changing to the pvscsi vHBA it looks very healthy at under 16 ms.


OK, what about write behavior?

Ouch!! The workload can tolerate up to 10 ms average write latency for a bit; 5 ms is the performance target.  With several measures above 100 ms write latency on July 28th, the system is at risk of transaction log buffer waits, SQL Server free list stalls, and more painful than usual waits on tempdb.  But after the change to pvscsi, all averages are below 10 ms, with the majority of time below 5 ms.  Whew!
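
For what it's worth, here's a rough sketch of how I'd scan a logman csv for intervals that blow past those latency targets; the file name and thresholds are placeholders to adjust to your own collection:

```python
import csv

# Rough sketch: flag samples past the latency targets above
# (writes: 5 ms target / 10 ms tolerable; reads: ~100 ms is already painful).
def cols(headers, fragment):
    return [i for i, h in enumerate(headers) if fragment in h]

with open("perfmon_log.csv", newline="") as f:     # hypothetical file name
    rows = list(csv.reader(f))

hdr = rows[0]
checks = [(cols(hdr, r"\Avg. Disk sec/Write"), 0.010, "write"),
          (cols(hdr, r"\Avg. Disk sec/Read"),  0.100, "read")]

for r in rows[1:]:
    for col_list, threshold, kind in checks:
        for i in col_list:
            try:
                v = float(r[i])
            except (ValueError, IndexError):
                continue
            if v > threshold:
                print(f"{r[0]}  {hdr[i]}  {v * 1000:.1f} ms {kind}")
```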



Looking at queuing behavior is the most intriguing :-) Maximum device and adapter queue depth is one of the most significant differences between the pvscsi and LSI vHBA adapters.  The pvscsi adapter allows increasing the maximum adapter queue depth from the default of 256 all the way to 1024 (by setting a Windows registry parameter for "ringpages").  It also allows increasing device queue depth from the default of 64 to 256 (although storport will pass no more than 254 at a time to the lower layer).  By contrast, the LSI adapter and device queue depths are both lower, and no increase is possible.
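
For reference, on a Windows guest that pvscsi override lives in a registry value; a small sketch of checking it (the key path and value format follow VMware's published pvscsi queue depth guidance - verify against the current KB before relying on them):

```python
import winreg

# Sketch: read the pvscsi "ringpages"/queue depth override on a Windows guest.
# A configured value typically looks like "RequestRingPages=32,MaxQueueDepth=254".
KEY = r"SYSTEM\CurrentControlSet\Services\pvscsi\Parameters\Device"

try:
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY) as k:
        value, _type = winreg.QueryValueEx(k, "DriverParameter")
        print(f"pvscsi DriverParameter: {value}")
except FileNotFoundError:
    print("No override present; pvscsi is using its defaults.")
```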

It may be counter-intuitive unless you consider the nature of the measure (instantaneous) and the nature of what's being measured (outstanding disk IO operations at that instant).  But by using the vHBA with higher adapter and device queue depths (thus allowing a higher queue length from the application side), the measured queue length was consistently lower.  A *lot* lower. :-)
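
Little's Law makes it less counter-intuitive: average outstanding IO is roughly IOPs times response time, so if latency falls much faster than throughput rises, the in-flight count drops even though deeper queues are now allowed.  A sketch with made-up numbers in the same ballpark as the charts:

```python
# Illustration via Little's Law (outstanding IO ~= IOPs x response time).
# Numbers are hypothetical, not taken from the graphs above.
def avg_outstanding(iops, latency_seconds):
    return iops * latency_seconds

before = avg_outstanding(5_000, 0.150)   # hypothetical: 5000 IOPs at 150 ms
after  = avg_outstanding(8_000, 0.010)   # hypothetical: 8000 IOPs at 10 ms
print(before, after)                     # 750.0 vs 80.0 IOs in flight
```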



Wednesday, August 24, 2016

Windows Guest: VMware Perfmon metrics intermittently missing?

I'm using logman to collect Windows, SQL Server, and VMware metrics in a csv file (30-second intervals).

Not sure what's causing the sawtooth pattern below.  I thought it was a timekeeping problem on the ESXi host; now I'm not so sure.

See this Ryan Ries blog post for what may be a similar issue, caused by clock sync of guest with ESXi host.
https://www.myotherpcisacloud.com/post/Mystery-of-the-Performance-Counters-with-Negative-Denominators!
But that is also somewhat of a reversal of the problem.  In that case, guest metrics were returning -1.  In my case, it's the VMware metrics (passed through from the host) that are missing.
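
A quick way to confirm it's only the pass-through counters going blank is to scan the csv for rows where the "VM Processor" columns are empty while everything else still has values; a rough sketch (file name is a placeholder):

```python
import csv

# Sketch: in a logman csv a missing sample usually shows up as an empty
# field or a lone space; count how many "VM Processor" columns are blank
# in each 30-second row.
with open("perfmon_log.csv", newline="") as f:
    rows = list(csv.reader(f))

hdr = rows[0]
vm_cols = [i for i, h in enumerate(hdr) if r"\VM Processor(" in h]

for r in rows[1:]:
    blank = [hdr[i] for i in vm_cols if i < len(r) and not r[i].strip()]
    if blank:
        print(r[0], len(blank), "VM Processor counters blank")
```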

The same pattern for "Host processor speed in MHz" adds to the mystery.
If only "Effective VM Speed in MHz" were missing, I'd chalk it up to a possible arithmetic issue, with a negative denominator resulting from time skew between guest and host.
But... "Host processor speed in MHz" should be a constant 2600 in this case.  Maybe somehow it's still calculated with a time interval, and time skew can screw it up?

For now, I've got a loop running on this VM logging local time, and using NET TIME to retrieve time from another VM as well.  That's turned up variations, but they seem to be delays of up to 5 seconds in contacting the other VM rather than large time skew between the VMs.
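
The loop itself is nothing fancy - something along these lines (the remote VM name is a placeholder, and NET TIME output is printed as-is since its format varies by locale):

```python
import subprocess
import time
from datetime import datetime

REMOTE = r"\\OTHERVM"          # placeholder for the other VM's name

while True:
    # Log local time alongside the time reported by the other VM.
    local = datetime.now().isoformat(timespec="seconds")
    out = subprocess.run(["net", "time", REMOTE],
                         capture_output=True, text=True).stdout.strip()
    print(local, "|", out.splitlines()[0] if out else "(no response)")
    time.sleep(30)
```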

Guess I'll see what turns up...

*****
So far this is the closest I've found.  Similar problem reported - no resolution.

"Windows VM perfmon counters - VM Processor"
https://communities.vmware.com/thread/522281

Thursday, November 5, 2015

High DiskSpd Activity in VMware VMs - Part 2: More Questions

In my most recent previous blog post I showed some perfmon graphs, with a 1-second collection interval, from some tests I was running with DiskSpd in an 8-vcpu VMware VM, against a guest LUN managed by the pvscsi vHBA.

High DiskSpd Activity in VMware VMs - Part 1: Start of an investigation
http://sql-sasquatch.blogspot.com/2015/11/high-diskspd-activity-in-vmware-vms.html 

In that particular test run, the main concern was the high privileged time on vCPU 0, which is also the location of the SQL Server logwriter.

Although most of my tests showed high CPU utilization on vcpu 0, there was some variability.  Sometimes it would be a different single vcpu bearing the cost of handling the vHBA.  Sometimes the work would be shared among a few vcpus.

Here are some results of a separate test, run mere minutes after the last test I shared, on the same system.  These results show another concern I have about high SQL Server disk IO within VMware VMs.

In this case, the cost of managing disk IO has been balanced across vCPUs 0, 1, 2, and 3.  The remaining 4 vcpus have little involvement in managing the disk IO.

That presents an interesting question - why is the cost dispersed in this case?  It may well be preferable to disperse the cost over 4 vcpus, rather than keep vcpu 0 (*especially* vcpu 0) nearly fully consumed with privileged time.
But there's another important question - and one which may be familiar to lots of other folks running SQL Server in VMware VMs by the time I'm done.  What's going on in the nearly 6 second valley?







If I look at IOPs (reads/sec since this was a read-only test) the valley is evident, too.



Was the test running?  Oh, yeah.  In fact, this test had a target queue length (total outstanding IO against the tested Windows volume) of 16 - and the valley was the only sustained time that the target was constantly achieved.  (As an aside: I still see folks - performance experts, even!! - proclaim 'current disk queue length' to be a metric without value.  They're wrong.)
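
The exact DiskSpd command line isn't shown here, but a hypothetical reconstruction of a read-only run with 16 total outstanding IOs against one volume would look something like this (-o is per thread, so 4 threads x 4 outstanding = 16; the path and durations are made up):

```python
import subprocess

# Hypothetical DiskSpd invocation: -d duration (s), -t threads, -o outstanding
# IOs per thread, -b block size, -r random, -w0 = 0% writes (read-only),
# -L collect latency stats.  Assumes the test file already exists
# (or add -c<size> to create it).
cmd = [
    r"C:\tools\diskspd.exe",
    "-d60", "-t4", "-o4", "-b8K", "-r", "-w0", "-L",
    r"E:\diskspd\testfile.dat",
]
result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stdout)
```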




But... if there was a constant queue length of 16 reads while IOPs dropped really low...

Yep.  Really high latency. So that valley in activity and utilization is really a performance divot.  (That's one of my favorite terms.  You might end up seeing it a lot if you read my blog regularly.)



Sometimes a table of the values over time puts performance divots with extreme outlier values into better perspective.



Time      Disk Reads/sec  Current Disk Queue Length  Avg. Disk sec/Read
14:50:18        65389.65                         12             0.00022
14:50:19        58861.78                         16             0.00024
14:50:20           42.96                         16             0.31733
14:50:21           18.98                         16             0.71181
14:50:22            8.96                         16             0.50111
14:50:23          272.43                         16             0.09647
14:50:24            0.00                         16             0.00000
14:50:25           25.35                         16             1.24708
14:50:26        54419.92                         11             0.00039
14:50:27        65439.81                         12             0.00022

So - average read latency of under a millisecond at roughly 60,000 read IOPs, with a divot to under 300 read IOPs at average read latencies from 96 milliseconds to over 1 full second.  And one second, ending at 14:50:24, with NO reads completed at all.

That raises some serious questions.  I've seen this behavior numerous times on numerous systems.  Hopefully in the next couple of weeks I'll be able to get to the bottom of it, so I can share some answers rather than just more questions.  :-)