A friend recently asked what minimum transaction log write latency I see on systems. You know me... never one for easy answers :-)
On the systems I work with, we run a fairly long-running, query-concurrency-controlled ETL each night. Write latency to the transaction logs is a fairly volatile metric - I recommend limiting the volatility by eliminating queuing possibilities in the SQL Server physical host or guest. Putting a single busy transaction log on its own LUN means it has the LUN queue depth all to itself for the up to 32 outstanding writes SQL Server issues against a transaction log. Ensuring that HBA adapter queue depth can sustain all read and write activity for the associated LUNs eliminates queuing at that level.
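One way to watch transaction log write latency from inside SQL Server (rather than from perfmon or the SAN side) is the `sys.dm_io_virtual_file_stats` DMV. A minimal sketch - note these counters are cumulative since instance startup, so for interval latency you'd sample twice and diff:

```sql
-- Average write latency per transaction log file, from cumulative
-- virtual file stats. Counters accumulate since startup; sample twice
-- and difference the columns to get latency over an interval.
SELECT DB_NAME(vfs.database_id)        AS database_name,
       mf.physical_name,
       vfs.num_of_writes,
       vfs.io_stall_write_ms,
       CASE WHEN vfs.num_of_writes = 0 THEN 0
            ELSE vfs.io_stall_write_ms * 1.0 / vfs.num_of_writes
       END                             AS avg_write_latency_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id
 AND mf.file_id     = vfs.file_id
WHERE mf.type_desc = 'LOG';
```

Comparing this number against the LUN-level latency from the host gives a rough idea of how much time is spent below SQL Server versus in host-side queuing.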
But queuing at the storage controller can also affect write latency. (So can flow control between server and storage.) Write cache conditions may also result in throttling - especially if the write cache becomes saturated.
So... here are some graphs :-) On the systems I work with, we want to keep transaction log LUN write latency between 1 ms and 5 ms to keep everything happy. If "write log waits" are minimized - and especially if "log buffer waits" are eliminated - there is little benefit to my workload in continuing to lower transaction log write latency. It'd be different if I were working on high transaction volume OLTP. But this ETL is high volume, low commit. That can be seen in the graphs below: writes to the transaction log LUN quite regularly average 60k - the transaction log buffers are usually flushed because they are full and must be written, rather than being written because transactions are committing.
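The same "full buffer vs. commit-driven flush" distinction can be roughly checked from `sys.dm_io_virtual_file_stats` as well: average bytes per log write near the maximum log buffer size suggests flushes driven by buffer fill, while small averages suggest commit-driven flushes. A sketch, with the same cumulative-counter caveat as above:

```sql
-- Rough check of average bytes per log write. Large averages (near the
-- maximum log buffer size) suggest buffers are flushing because they are
-- full; small averages suggest commit-driven flushes.
SELECT DB_NAME(vfs.database_id)                              AS database_name,
       vfs.num_of_bytes_written / NULLIF(vfs.num_of_writes, 0) AS avg_bytes_per_write
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id
 AND mf.file_id     = vfs.file_id
WHERE mf.type_desc = 'LOG';
```

The perfmon equivalents are the Databases object counters "Log Bytes Flushed/sec" divided by "Log Flushes/sec".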
P.S. I wondered about the peaks in write latency in the time period. Were they due to host queuing? It didn't seem to be - I checked LUN-level queuing and total queuing across all LUNs during the time period and found little correlation. I suspect queuing at the SAN storage controller due to activity coming in from other servers at that time.
Wednesday, February 4, 2015
Friday, January 30, 2015
In the graph below, there is at least some relationship between buffer page lookups/sec (logical IO) and CPU utilization in the shaded area. Before the shaded area begins, though, CPU utilization is extremely high while the logical IO rate is extremely low. Whenever I see that, I always try to rule out spinlock conditions - and if there are known applicable spinlock conditions, I try to mitigate them (this is one of the reasons I try to learn about as many trace flags as I can).
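A quick way to see which spinlocks are hot during a high-CPU, low-logical-IO window is `sys.dm_os_spinlock_stats`. Like the file stats DMV, these counters are cumulative, so a before/after snapshot across the suspect window is more telling than a single read. A sketch:

```sql
-- Top spinlocks by spin count, cumulative since instance startup.
-- Snapshot before and after the suspect window and difference the
-- spins/collisions columns to see which spinlocks are actually hot.
SELECT TOP (10)
       name,
       collisions,
       spins,
       spins_per_collision,
       backoffs
FROM sys.dm_os_spinlock_stats
ORDER BY spins DESC;
```

High spins with rising backoffs on a particular spinlock during the window is the pattern that would point toward a specific mitigation.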
During the shaded time period, notice that log flush bytes/sec is much higher than before, as is buffer lookups/sec.
Log flushes/sec trends with log flush bytes/sec throughout the time range.
Because query concurrency and parallel worker concurrency are much lower during the shaded productive area than during the preceding high-CPU/low-logical-IO time period, I definitely think a spinlock issue can be involved.
I've got two spinlock issues in mind:
1. The global database lock on login, relieved by startup trace flag 1236. Although I think trace flag 1236 will be somewhat helpful, in this particular case I don't think the login rate is high enough to trigger spinlock contention at this extreme level.
2. Transaction log pool global access, relieved by startup trace flag 9024. That seems like a good suspect. This system is a 4 vCPU VM, with SQL Server seeing a single "NUMA node". So trace flag 9024, which promotes log cache spinlock access from global to node level, won't help by itself. But with trace flag 8048 also enabled, log cache spinlock access gets promoted all the way to CPU level. *That* may help. There's not a lot of information about trace flag 9024, the conditions under which relief is expected, or the symptoms that indicate it may be helpful. The KB article doesn't even indicate yet that it is a startup trace flag. But I'm hopeful it'll help in this case. I'll update this blog post with what I learn.
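Since both flags under discussion are startup trace flags, they'd go into the service startup parameters (`-T8048 -T9024`) rather than `DBCC TRACEON`. A quick way to confirm what's active after a restart:

```sql
-- List all trace flags currently enabled on the instance.
-- Startup flags like 8048 and 9024 should show here as enabled
-- globally after the service restarts with -T8048 -T9024 set.
DBCC TRACESTATUS(-1);
```

Checking this after the restart avoids the easy mistake of assuming a startup-only flag took effect when it was only set in a session.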