Friday, April 4, 2014

SQLServer transaction log writes - 64k aligned?

I spend a lot of time thinking about high speed ETL.  I also spend a lot of time thinking about DR solutions and backups.

Below you can read details on how I came to the following question (to which I don't yet know the answer and will update when I do): are SQL Server 60k writes 64k aligned?

Aha!  I don't think I'll have to bust out procmon for this after all.  Just get a Windows striped volume, put my txlog (and only the txlog) on the striped volume, start perfmon monitoring the physical disks/LUNs in the striped volume with perfmon (writes per second, current disk queue depth, write bytes/second) and spin up a workload that pushes the logwriter to as many in-flight 60k writes as possible.

If the average write size on the physical volumes is ~60k and the current queue length is 16 or less - awesome! That would mean striping is keeping each write intact and not spitting it, and that the queue depth on each LUN is lower (so that replication readers, etc have room in the queue depth of 32 to do their stuff without pushing anything into the OS wait queue.)

But if the average write size is ~32k... that would mean that most 60k writes by the log writer are being split into smaller pieces because they are not aligned with the 64k stripes used by the Windows LVM.

I guess even if the writes aren't 64k aligned, Windows striping may still be useful for my scenarios... but would have to stripe 4 LUNs together into a striped volume in order to lower the queue length for burdened log writer activity from 32 (with a single LUN) to 15.


Each SQL Server transaction log can sustain up to 32 concurrent in-flight writes, with each write up to 60k.  To get the fastest ETL, fast transaction log writes at queue length 32 are a necessity.  That means... put such a transaction log on its own Windows drive/mounted partition, since typically the HBA LUN service queue depth is 32.  Put other files on there, too, and the log writer in-flight writes might end up in the OS wait queue.  If writes wait on a full service queue of reads, they'll be ESPECIALLY slow.  There are other ways to make them especially slow - for example to serialize the inflight writes to to a synchronous SAN replication strategy.  Anyhooo...

In massive ETL its not unusual for the transaction log writer to wait on completion of 32 writes, each 60k, and not issue the next write until one of them completes.

Writes are usually to write SAN cache, and should be acked on receipt to write cache.  As such, as long as the write is in the HBA service queue (rather than in OS wait queue), front end port queue depth isn't saturated, front end CPU isn't saturated, and SAN write cache isn't saturated - writes should be doggone fast already. (The overhead of wire time - or 'wait for wire time' - for synchronous SAN replication also shouldn't be overlooked when evaluating write latency.) So what can be done to improve these writes that are already typically pretty fast?

I'm not a fan of using Windows striped volumes for SQL Server data files - there's a fixed 64k stripe size.  That will circumvent large readahead attempts by SQL Server.  But for speeding up transaction log access, striped volumes may be just the thing I need.  (Robert Davis - @sqlsoldier - pointed out that unless there is underlying data protection a Windows striped volume offers no data redundancy or protection. I'm only going down this path because underneath the Windows basic disks, whether in striped Windows volume or not, SAN storage is providing RAID10, RAID5, or RAID-DP protection.)

So... this is where the question of 64k alignment of the 60k writes comes in.

Assume 32 inflight writes at 60k each issued by SQLServer logwriter to a txlog all by itself on a Windows striped volume composed of two equally sized basic disks.  If the writes are not 64k aligned, the sames as the Windows stripes, the write activity passed down through the HBA will break down like the chart below.  Its painful to look at, I know.  Haven't figured out a less confusing way to represent it yet.  Basically, each 64k stripe on either basic disk will contain either a full 60k transaction log write and only 4k of the next transaction log write, or two partial transaction log writes.  All tolled, 32 writes gets broken down into 60 writes!  (By the way, this same idea is why its important to have Windows drives formatted so that their start aligns with expected striping.)
Basic Disk A        Basic Disk B
 1 - 60k 
 2 - 4k              2 - 56k  
 3 - 52k             3 - 8k
 4 - 12k             4 - 48k
 5 - 44k             5 - 16k
 6 - 20k             6 - 40k
 7 - 36k             7 - 24k
 8 - 28k             8 - 32k
 9 - 28k             9 - 32k
10 - 36k            10 - 24k
11 - 20k            11 - 40k
12 - 44k            12 - 16k
13 - 12k            13 - 48k
14 - 52k            14 -  8k
15 - 60k
                    16 - 60k
17 - 60k

18 - 4k             18 - 56k  
19 - 52k            19 - 8k
20 - 12k            20 - 48k
21 - 44k            21 - 16k
22 - 20k            22 - 40k
23 - 36k            23 - 24k
24 - 28k            24 - 32k
25 - 28k            25 - 32k
26 - 36k            26 - 24k
27 - 20k            27 - 40k
28 - 44k            28 - 16k
29 - 12k            29 - 48k
30 - 52k            30 -  8k
31 - 60k

                    32 - 60k

So by striping the txlog, instead of 1 LUN with 32 writes and 1920 write bytes inflight… its 2 LUNs, each with 30 writes & 960 total write bytes outstanding.  50% reduction in write bytes per volume, 6% reduction in concurrent write IOs per LUN (from 32 to 30).

On the other hand, if the writes are 64k aligned, it'd be an even split: 16 writes and 920 write bytes outstanding to each LUN, a 50% reduction in both outstanding writes and outstanding write bytes.

So unless someone knows the answer, I guess we'll be busting out procmon and tracking transaction log write offsets once we crank the workload up to consistently hit 60k writes.  If they are 64k aligned, I'll be happy - I can blog in the near future about Windows striped volumes getting me out of a few jams.  If not... it'll probably be back to the drawing board. 

No comments:

Post a Comment