Thursday, December 19, 2013
VMware vCPU dispatch overload without high perfmon %Processor Time or ESX %RDY...
The system for the graph above is a 1 vCPU Windows Server 2008 R2 VMware guest. %CPU busy and the processor run queue are plotted against the black vertical axis on the left, and total thread count from perfmon against the blue axis on the right.
I was watching an ETL process... we'd already optimized the sending system. Times looked great... at first. Then suddenly the ETL times stretched out and became even worse than before the optimization! My first thought was to look at the receiving database system. Nothing stood out, but there are always things to optimize... maybe reduce IO queuing by increasing max transfer size... I noticed that total throughput on that server seemed capped at 800 mb/second with queued IO and CPUs with plenty of idle time... but those factors were true before, too.
Finally I started looking at the middleman - a small VM that grabs the extract files from the source system and manages their import into the target system. I'd actually forgotten it was a VM in this case... I'm a dinosaur, so I always assume a physical server until it becomes apparent it's virtual.
I started staring at disk queuing first out of habit. Nothing out of the ordinary - in fact that was a little surprising. I had expected some disk queuing due to the number of sending threads which had increased after optimizing the sending system. And each sending thread should have been able to increase its send rate, as well.
But the "Current Disk Queue Length" perfmon counter was completely unremarkable... even though more than 40 writing threads coming from the source system should at times have overlapped with 30 or more reading threads sending to the target system.
But I guess the processor run queue, which reached up to 102 threads for the lonely single vCPU, would explain why the level of disk queuing was lower than I expected... and also potentially why performance was slower than before the sending system was optimized.
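A rough back-of-the-envelope shows why a run queue that deep hurts. This sketch assumes simple round-robin dispatch at a fixed quantum; the ~180 ms figure is the commonly cited Windows Server default quantum (roughly 12 clock ticks of ~15 ms), and it ignores priority boosts and preemption, so treat the numbers as illustrative only.

```python
# Hedged, illustrative estimate of worst-case dispatch wait on a vCPU.
# QUANTUM_MS is an assumption (typical Windows Server thread quantum);
# real scheduling with priorities and boosts will differ.

QUANTUM_MS = 180.0

def worst_case_wait_ms(runnable_threads: int, vcpus: int = 1) -> float:
    """Time a newly runnable thread could wait if every thread ahead
    of it consumes a full quantum before yielding a vCPU."""
    threads_ahead = max(runnable_threads - vcpus, 0)
    return (threads_ahead / vcpus) * QUANTUM_MS

# 102 runnable threads on 1 vCPU: roughly 18 seconds of queuing delay.
print(round(worst_case_wait_ms(102) / 1000.0, 1))   # 18.2
# Even one more vCPU halves the queue depth per processor.
print(round(worst_case_wait_ms(102, vcpus=2) / 1000.0, 1))   # 9.0
```

With delays on that order, threads spend far more time waiting to run than actually issuing IO - which would also explain why the disk queues stayed shallow.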
Not sure yet whether a change was introduced to this VM (perhaps someone reduced the vCPU count based on prior utilization?), or whether the optimization on the sending system caused a scheduling/interrupt-handling overload for the hypervisor or guest.
Interestingly, the virtualization host would not necessarily show this as a problem. The vCPU might be getting all of its allotted time on a host server hardware thread. But with that many threads switching on and off the vCPU, and the guest trying to manage the interrupts from the network adapter that the ETL was keeping busy... there certainly was a problem. If CPU busy in perfmon were monitored without the processor run queue alongside it... the problem might not be apparent.
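The point can be captured in a tiny check. This is a sketch under assumed thresholds (the queue-per-CPU cutoff of 4 is my own illustrative value, not official guidance); the function name and parameters are hypothetical, standing in for the perfmon counters %Processor Time and System\Processor Queue Length.

```python
# Sketch: dispatch overload is a function of run-queue depth per logical
# CPU, not of %Processor Time alone. Threshold is an assumption.

def dispatch_overload(pct_processor_time: float,
                      processor_queue_length: int,
                      logical_cpus: int,
                      queue_per_cpu_threshold: float = 4.0) -> bool:
    """Flag overload when the run queue is deep relative to CPU count.
    %Processor Time is accepted but deliberately ignored, to make the
    point that a moderate CPU% can coexist with severe queuing."""
    return processor_queue_length / logical_cpus >= queue_per_cpu_threshold

# CPU only 60% busy, but 102 threads queued on 1 vCPU: overloaded.
print(dispatch_overload(60.0, 102, 1))   # True
# CPU 95% busy with a shallow queue: busy, but not dispatch-overloaded.
print(dispatch_overload(95.0, 2, 1))     # False
```

The whole idea is that the second counter carries the signal the first one hides.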
Good thing I'm always watching the queues :)
Hopefully cranking up to at least one more vCPU will lead to an improvement. I hate it when my optimizations end up in a net loss :)
Gotta give you a link, right? Here's a great one on this topic; it applies to Windows dispatch overload whether the CPUs are physical or virtual.
Measuring Processor Utilization and Queuing Delays in Windows applications