Wednesday, May 28, 2014

SQL Server backups - HBA maxtransfersize; EMC VNX write-aside; backup target isolation

The SQL Server T-SQL backup command allows setting the maxtransfersize parameter to determine the largest SQL Server host side disk IO.  By default, a 1 mb maxtransfersize is used.

The fibre channel HBA also has a max transfer size - default on Windows is 512kb.  Emulex allows max transfer size to be up to 4 mb on Windows in latest driver releases, QLogic allows up to 2 mb. In general, I recommend increasing the HBA max transfer size to at least 1 mb... why let SQL Server *want* to send larger IOs, only to be denied by the HBA :-)

So, let's say the system is using an Emulex HBA, and the backup source database files and the backup targets are on LUNs served up from an EMC VNX.

I generally recommend two SQL Server backup target LUNs on EMC VNX or CLARiiON SANs, with one target LUN on each storage controller to balance research utilization.  In most cases, you'll want to use backup compression and I generally recommend one or two times the number of data files from the source database as the number of target files, with an equal number on each target LUN.

In most cases, though, I think SQL Server sites are using a single large LUN for a given SQL Server instance's backup targets, even if that LUN contains multiple files used as backup targets.

Now for the fun part.  CLARiiON and VNX have a limited amount of write cache on each controller.  Write cache destaging is controlled by configurable low and high write-pending watermarks.  Exceed the high watermark, and write cache destaging will become more aggressive until the write-pending is back down to the low watermark.  During more aggressive write cache destaging, reads have their priority lowered and read latency increases.

During extremely write intensive workloads, write pending levels can easily rise... then read latencies rise.  Cause forced flushes of a given controller's write cache, and all tenants of the controller will notice the latency increase until the flush has completed.

Backups are write intensive.  Full backup of a 1 TB data footprint SQL Server database, even with backup compression, can easily write out 300 GB or more to the target.

If the backup reads are well tuned (enough LUNs for the source database files, high enough queue depth, high enough BUFFERCOUNT parameter value), the backup writes will put an awful lot of write pressure on the target LUNs.

EMC has thought about that ahead of time.  The CLARiiON and the VNX allow configuring a write-aside value. Well... unless you are using virtual provisioning.  If using virtual provisioning, the write-aside default of 1 mb cannot be overridden.

When the write-aside is in effect, writes larger than that threshold bypass write cache and are written directly to the storage devices.

There's at least one good reason to want to do that: if there are other tenants of the SAN, by keeping the write cache free of the huge writes coming from the backup, there is less chance of dragging down performance for other applications.

Optimally, the backup write targets will be located on a set of physical disks separate from the source databases... and ideally separate from other activity at that time.  That would allow noncached writes to go to the hard disks with very little head movement - very high transfer rate expected.  In fact, I'd recommend this when possible even if the writes are cached... because without any other competition the cached writes can destage to these disks quite quickly.  And... later when its time to pull the backup off of those disks to get it into TSM, or NetBackup or ???, if there's no other activity the read transfer rate should be pretty high, too because of the minimized disk head movement.

So... for optimal SQL Server backups, a few fundamental things to think about before getting to the SQL Server level:  HBA maxtransfersize, array write cache effect of the backup, and physical disk isolation for the backup targets.

    

"\SQLServer:Buffer Manager\Page life expectancy" hides some detail on NUMA machines

Just a few words about "\SQLServer:Buffer Manager\Page life expectancy" vs. "\SQLServer:Buffer Node(*)\Page life expectancy".

On busy servers with multiple NUMA nodes, the overall PLE provided by "\SQLServer:Buffer Manager\Page life expectancy" can obscure some detail that is very helpful in investigations.  Consider the difference between "PLE" and the individual NUMA node PLEs (from \SQLServer:Buffer Manager\Page life expectancy) below.

Most OLTP workloads may never be disturbed by this condition.  Busy, batch-oriented workloads could fall victim to it, though.  That would include certain optimization strategies for SQL Server index maintenance.






If  only "PLE" were considered, it looks like there is a problem with server memory sufficiency (assuming the workload itself is not problematic).  But, when looking at the PLE of the 4 individual NUMA nodes, it becomes apparent that NUMA node 001 is working its memory hard, while the others are loafing.




 The workload on this server is being performed exclusively in the Resource Governor Default Workload Group.  Plotting the node PLEs against the parallel worker count for the Default RG group shows that its during times of high active parallel workers that the memory across the server seems to be used unevenly.



From my perspective its significant that two of the disk IO read spikes above coincide with precipitous drops in the Node 001 PLE.





 I aggregated '\Processor(*)\% Processor Time' aligned with NUMA node boundaries and this is the result.  CPU is being used fairly evenly across all 4 NUMA nodes, but memory in one NUMA node is stressed.

That scenario can happen in large query batches, where queries interested in the same large data sets run nearly concurrently.  First one in wins - the first query to start executing will very likely have all of its workers on a single NUMA node.  When pages are read into the bpool for those workers, they'll be inserted into the bpool region controlled by the "home node" of those workers.  Other queries just a few moments behind on the 000, 002, and 003 node will 'bogart' off  the leader in NUMA node 001 - they'll get many more cache hits - most of them from NUMA node 001's home node memory region.