Monday, May 11, 2015

SQL Server PLE with 8 NUMA nodes


I posted earlier about PLE as a tricky performance metric, showing how a drop in PLE can in some cases be a sign of a performance improvement rather than a performance problem :-)
http://sql-sasquatch.blogspot.com/2015/04/sql-server-ple-is-very-tricky.html

Here's another quick set of thoughts about PLE - this time about how the PLE on a query's "home NUMA node" is more relevant to that query than the overall PLE or the PLEs of the other NUMA nodes.

Because each NUMA node by default manages its own portion of the buffer pool, the number reported as the overall page life expectancy of SQL Server on an 8 NUMA node server is a composite of the 8 node-local PLEs.  I haven't been able to figure out how the overall PLE is calculated from the node PLEs; I'm typically more interested in the amount of database cache and free memory per node anyway.  But here are some of the more interesting numbers.
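To watch this on a live system, here's a minimal PowerShell sketch, assuming a default instance (the counter objects are named "SQLServer:..."; a named instance uses "MSSQL$InstanceName:..." instead).  It samples the overall PLE alongside each NUMA node's PLE:

# Overall PLE plus one PLE per NUMA node ("Buffer Node" has one instance per node).
$counters = @(
    '\SQLServer:Buffer Manager\Page life expectancy',
    '\SQLServer:Buffer Node(*)\Page life expectancy'
)
Get-Counter -Counter $counters -SampleInterval 15 -MaxSamples 4 |
    ForEach-Object { $_.CounterSamples } |
    Select-Object Path, CookedValue |
    Format-Table -AutoSize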





Note: these perfmon captures are from a SQL Server version where SQLOS still swapped the labels, reporting NUMA node 0 as "Buffer Node 1" and NUMA node 1 as "Buffer Node 0".  SQL Server 2014 no longer does that :-)

Let's zoom in a little.  Here are the 8 NUMA node PLEs; the dotted red line is the overall PLE.


The difference in PLE across the nodes is easy to understand once you notice the variation in CPU busy across the NUMA nodes.  All of the CPU consumption on this server came from SQL Server.


Database page reads insert the page into the buffer pool partition associated with the NUMA node that is home to the worker thread performing the read.  First writes, resulting in a newly allocated page, also go to the NUMA node that is home for that worker.  Busier NUMA nodes (in terms of CPU utilization) can be expected to have higher insert rates.  Insert rate and database cache size are important contributors to the PLE calculation.  But there are other important considerations, too: the rate of stealing and the rate of freeing, since those rates determine the change in size of the database cache.  (So does the rate of growth of the database cache - especially during ramp-up, but also during other periods when other cache allotments such as the procedure cache shrink, resulting in lots of free pages.)
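On SQL Server 2012 and later, the per-node breakdown that matters here - database cache, free, and stolen memory per NUMA node - is exposed through the SQLServer:Memory Node perfmon object.  A minimal sketch, again assuming a default instance:

# Database cache, free, and stolen memory per NUMA node, sampled 4 times.
$counters = @(
    '\SQLServer:Memory Node(*)\Database Node Memory (KB)',
    '\SQLServer:Memory Node(*)\Free Node Memory (KB)',
    '\SQLServer:Memory Node(*)\Stolen Node Memory (KB)'
)
Get-Counter -Counter $counters -SampleInterval 15 -MaxSamples 4 |
    ForEach-Object { $_.CounterSamples } |
    Select-Object Path, CookedValue |
    Format-Table -AutoSize

Trending these alongside the node PLEs makes it much easier to see whether a PLE drop came from a rising insert rate or from a shrinking database cache.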






***** Update *****

Yeah, just wanted to throw in one more close-up :-)

Tuesday, April 21, 2015

3rd Try Charm: Why does pvscsi rather than LSI matter so much for SQL Server on VMware?


I've written two blog posts recently on the performance benefits of the VMware pvscsi vHBA over the LSI vHBA.

March 23, 2015
SQL Server on VMware: LSI vHBA vs pvscsi vHBA
http://sql-sasquatch.blogspot.com/2015/03/sql-server-on-vmware-lsi-vhba-vs-pvscsi.html

April 7, 2015
Another SQL Server VMware LSI vs pvscsi vHBA blog post
http://sql-sasquatch.blogspot.com/2015/04/another-sql-server-vmware-lsi-vs-pvscsi.html

These posts give details from different systems (although running similar workloads), and claim roughly a 10-fold benefit in peak throughput and peak disk response times from switching the SQL Server LUNs from the default LSI vHBA to the pvscsi vHBA.  That sounds a little fishy, doesn't it?  Especially if you are familiar with...

Achieving a Million I/O Operations per Second from a Single VMware vSphere® 5.0 Host
http://www.vmware.com/files/pdf/1M-iops-perf-vsphere5.pdf

Page 10 of the performance study above includes the following text.
"… a PVSCSI adapter provides 8% better throughput at 10% lower CPU cost."


That is a much more modest (and probably more believable) claim than a 10x performance benefit.  What gives?

Here are a few details about the mechanics (a registry sketch follows the links below):
The LSI vHBA has an adapter queue depth of 128, which cannot be increased; the LUN queue depth cannot be increased from its default of 32.
The pvscsi vHBA has a default adapter queue depth of 256 and a default LUN queue depth of 64.  With Windows registry settings, the adapter queue depth can be increased to 1024 and the LUN queue depth to 256.
http://www.pearsonitcertification.com/articles/article.aspx?p=2240989&seqNum=3
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2053145
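For reference, here's a minimal sketch of the in-guest registry change from VMware KB 2053145 (the DriverParameter values below - RequestRingPages=32, MaxQueueDepth=254 - are the ones shown in the KB).  Run it in an elevated PowerShell session inside the Windows guest; a reboot is required before the pvscsi driver picks it up.

$key = 'HKLM:\SYSTEM\CurrentControlSet\services\pvscsi\Parameters\Device'
if (-not (Test-Path $key)) {
    New-Item -Path $key -Force | Out-Null   # create Parameters\Device if it doesn't exist
}
# Larger request ring (adapter queue) and per-device queue depth, per the KB.
Set-ItemProperty -Path $key -Name 'DriverParameter' `
    -Value 'RequestRingPages=32,MaxQueueDepth=254' -Type String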


And here's a detail about the testing that I just happened to come across in the Longwhiteclouds blog.


"Maximum queue depth supported by LSI driver (in guest) cannot be changed. So to keep the aggregate outstanding I/Os per controller lower than the max queue depth we had to use 16 OIOs per vDisk. To have a fair comparison between LSI and pvSCSI, second test also had 16 OIOs per vDisk for pvSCSI as well. " Chethan Kumar, author of the VMware paper, as quoted on Longwhiteclouds.

So, comparison testing was done within the queue depth constraints of the LSI vHBA.  But in the case of these Enterprise Edition SQL Server workloads, the number of outstanding IOs often exceeded 600, with microbursts as high as 1000 outstanding IOs.  That's well beyond the LSI adapter queue depth, and the queuing penalty can be high.  Even with 4 LSI adapters in a VM, the aggregate adapter queue depth would be only 512.
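If you want to check whether your own workload bursts like that, here's a quick sketch that samples Current Disk Queue Length across the physical disks and sums it, so spikes past the LSI adapter queue depth stand out.  (It's an instantaneous counter, so even 1-second samples can understate microbursts.)

Get-Counter -Counter '\PhysicalDisk(*)\Current Disk Queue Length' `
            -SampleInterval 1 -MaxSamples 60 |
    ForEach-Object {
        # Sum outstanding IOs across all disks, skipping the _total instance.
        $total = ($_.CounterSamples |
                  Where-Object { $_.InstanceName -ne '_total' } |
                  Measure-Object -Property CookedValue -Sum).Sum
        '{0}  aggregate outstanding IOs: {1}' -f $_.Timestamp, $total
    }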

If a SQL Server workload doesn't burst beyond 32 outstanding IOs per LUN or 128 outstanding IOs per vHBA adapter, the change to pvscsi would most likely bring rather modest performance benefits - along the lines of the 8% better throughput at 10% lower CPU utilization indicated in the whitepaper.  In fact, at that low level of outstanding IO there might even be a slight performance decline.  That's because the LSI vHBA allows an IO request of up to 32 MB, while the pvscsi vHBA currently allows a maximum disk IO size of 512 KB.  SQL Server won't (yet) perform disk IO anywhere near 32 MB - the largest disk IO I've seen from SQL Server has been 4 MB* - but a 4 MB read that passes through the LSI vHBA intact would be split into eight 512 KB requests on pvscsi.
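To see how close a workload actually comes to that 512 KB per-IO ceiling, Avg. Disk Bytes/Read per LUN is the perfmon counter to watch.  A quick sketch that surfaces the LUNs with the largest average reads:

Get-Counter -Counter '\LogicalDisk(*)\Avg. Disk Bytes/Read' `
            -SampleInterval 5 -MaxSamples 12 |
    ForEach-Object { $_.CounterSamples } |
    Where-Object { $_.CookedValue -gt 0 } |
    Sort-Object CookedValue -Descending |
    Select-Object InstanceName, @{n='AvgReadBytes';e={[int]$_.CookedValue}} -First 5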

However - really large disk IOs from SQL Server are in my experience fairly rare, and high aggregate queue length is much more common.  For that reason, I heartily recommend the pvscsi vHBA for SQL Server VMs.  Retaining the LSI vHBA for the boot drive is common even when pvscsi vHBAs are added for database LUNs, and I've got nothing against that approach.  But it's important to ensure that a SQL Server VM can handle the outstanding IO generated by its workload.  CPUs are hungry - feed them lots of data quickly :-).


*But Niko has shown that columnstore will do read IO up to 8 MB :-)
Clustered Columnstore Indexes – part 50 ("Columnstore IO")
http://www.nikoport.com/2015/04/04/clustered-columnstore-indexes-part-50-columnstore-io/