Wednesday, April 9, 2014

srarw - sequential read after random write (ZFS, WAFL, reFS)

I like good marketing terms as much as the next guy, but what I really crave are terms and phrases that explain system and storage patterns and phenomena that I've become very familiar with but have a hard time explaining to other folks.

Several years ago, I was in exactly that position - having seen a condition that degraded database performance on a shadow paging filesystem* with HDDs, and trying - mostly in vain - to explain the condition and my concern to colleagues.

*OK... a brief aside.  What on earth is a "shadow paging filesystem"?  Most familiar hard drive storage technology is "write-in-place".  Expand a database file by 1 GB, and that 1 GB of file expansion translates through the OS filesystem, OS logical volume manager, SAN layers, etc. to specific disk sectors.  Write database contents into that 1 GB of file expansion, and the contents will be written to those disk sectors.  Update every single row, and the updates will take place via writes to those same disk sectors.  That is "write-in-place".
The alternative is a continual "redirect on write", or a shadow paging filesystem.  The initial contents get written to 1 GB worth of disk sectors A-Z.  Update every single row, and the updated contents don't get written in place, but rather get written to a new/different 1 GB of disk sectors a'-z'.
Once the new disk writes are complete, updates are made to the inode/pointer structure that stitches together the file (or LUN) presented to the host operating system.  The most common example of this type of continual redirect-on-write strategy is WAFL, used by the ONTAP operating system on NetApp storage.*
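To make the contrast concrete, here's a toy sketch (Python) of the two bookkeeping strategies.  It's purely illustrative - the class names and the one-sector-at-a-time allocator are my own invention, not how WAFL or ZFS actually manage blocks - but it shows why an update lands on the same sector under write-in-place and on a brand new sector under redirect-on-write.

class WriteInPlaceFile:
    """Logical block N maps to the same physical sector forever."""
    def __init__(self, start_sector, num_blocks):
        # logical block i lives at physical sector start_sector + i, always
        self.block_map = {i: start_sector + i for i in range(num_blocks)}

    def write(self, logical_block, data):
        # an update overwrites the sector the block was originally given
        return self.block_map[logical_block]

class RedirectOnWriteFile:
    """Every write - including an update - lands on a fresh physical sector;
    the inode/pointer structure is then updated to reference the new home."""
    def __init__(self, start_sector, num_blocks):
        self.block_map = {i: start_sector + i for i in range(num_blocks)}
        self.next_free = start_sector + num_blocks   # naive free-space cursor

    def write(self, logical_block, data):
        new_sector = self.next_free                  # new physical home
        self.next_free += 1
        self.block_map[logical_block] = new_sector   # re-stitch the pointer
        return new_sector

wip = WriteInPlaceFile(0, 10)
rdw = RedirectOnWriteFile(0, 10)
print(wip.write(5, b"update"))   # -> 5   (same sector it started on)
print(rdw.write(5, b"update"))   # -> 10  (a new sector; old sector 5 is now stale)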

The issue was that a database object structure could be written initially completely sequentially - from the standpoint of the database, the database server filesystem/LVM, AND the storage system.  However, later updates could occur to a small, scattered sample of the data within that contiguous range.  After the updates (assuming no page splits/migrated rows due to overflow of database page/block boundaries), the data would still be contiguous from the database and database server filesystem/LVM standpoint.  But the updated blocks would be rewritten together in a new physical location - so the on-disk contents for that logically contiguous range end up scattered, even though the database and the database server filesystem/LVM still see an unbroken sequence.

What's the harm in that?  Consider a full scan of such an object, whether in response to a user query or to fulfill an integrity check.  Before the interior, scattered updates, the 1 GB range may very well be read with a minimal number of maximum-size read operations at the OS and storage levels, and with a minimal amount of disk head movement.  After the scattered internal updates?  The number and size of OS-level read commands won't change (because I've stipulated that none of the updates caused page splits/migrated rows).  However, the number and size of the commands that retrieve the data from the hard drives to the storage controllers in the array would almost certainly have changed.  And the amount of disk head movement needed to retrieve the data could also have changed significantly.  What if 6 months of time and accumulated data had accrued between the original, completely sequential write of the data and the later scattered updates?  That could introduce significant head movement and significant wait into the data retrieval.
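Here's a rough back-of-the-envelope model of that cost, continuing the toy sketch above: count how many physically contiguous runs a full logical scan has to touch, before and after scattering 1% of the blocks redirect-on-write style.  The block size, the 1% update rate, and the "months of other data accrued in the meantime" offset are all made-up numbers, and real arrays, read-ahead, and RAID layouts complicate the picture considerably - the trend is the point.

import random

def physical_extents(block_map, num_blocks):
    """Count the contiguous physical runs a full logical scan touches.
    Each break in contiguity is, roughly, another disk command/seek."""
    extents = 1
    for i in range(1, num_blocks):
        if block_map[i] != block_map[i - 1] + 1:
            extents += 1
    return extents

NUM_BLOCKS = 131072                              # ~1 GB of 8 KB blocks
block_map = {i: i for i in range(NUM_BLOCKS)}    # initial, fully sequential write
print("extents before updates:", physical_extents(block_map, NUM_BLOCKS))   # 1

# Scattered interior updates: 1% of the blocks rewritten redirect-on-write
# style, landing in free space far away because other data accrued meanwhile.
next_free = NUM_BLOCKS + 50_000_000
for blk in random.sample(range(NUM_BLOCKS), NUM_BLOCKS // 100):
    block_map[blk] = next_free
    next_free += 1
print("extents after updates: ", physical_extents(block_map, NUM_BLOCKS))   # thousands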

When I began discussing this phenomenon with various engineers, the most common reply was: "yeah, but aren't you most concerned with OLTP performance, anyway?"  At that time in my life, that was completely true... however...
I also knew that a production system with a true, all-day, pure OLTP workload simply doesn't exist outside of very limited examples.  Integrity checks and backups are the main reason.  Show me a critical production database that operates without backups and integrity checks, and you've shown me a contender for a true all-day, every-day pure OLTP workload.
Otherwise, the degradation of sequential reads after scattered, internal, small updates is a REAL concern for every database operating on a shadow paging filesystem.  That's true whether the shadow paging filesystem is on the database server host (e.g. ZFS, of which I am a big fan) or on the storage subsystem (NetApp, or a system using ZFS).

Here's the kicker... someday I think it'll matter for SQL Server on Windows, too, regardless of the underlying storage.  Although Microsoft ReFS is not a good fit for SQL Server today (I'll come back and add links as to why later), I think future enhancements are likely to bring it into focus for SQL Server.

Finally, I found a name for the performance concern: SRARW.  Decoded: sequential read after random write.  And I found the name in a somewhat unlikely source: a paper written by NetApp engineers.  In truth, I shouldn't have been all that surprised.  NetApp has a lot of brilliant people working for and with them.
Here's the paper that introduced the term SRARW to me:
Improving throughput for small disk requests with proximal I/O
Jiri Schindler, Sandip Shete, Keith A. Smith

Now... if you are running SQL Server or Oracle on NetApp, I encourage you to keep track of the pace of large sequential operations that always execute with the same number of threads.  If you see a significant slowdown in that pace, consider that SRARW and low-level fragmentation may be among the contributors (a rough trend-tracking sketch follows the link below).  NetApp has jobs that can be scheduled periodically to reallocate data - that is, to re-sequence data that has been made "outta order" by small, scattered interior writes.
There is also a NetApp "read reallocate" attribute that should be considered for some workloads and systems.
These items are better described at this location.
http://www.getshifting.com/wiki/reallocate
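And here's the kind of trend-tracking I have in mind, sketched in Python with made-up numbers - in practice you'd feed it your own backup or CHECKDB history (size and elapsed time per run), and the 75% threshold is just a placeholder.

# Trend the throughput of a recurring sequential job so a slowdown stands out.
# The runs below are hypothetical; pull real (size, elapsed) pairs from your
# backup history or job logs.
runs = [
    ("2014-01-05 full backup", 900, 1800),   # (label, GB, elapsed seconds)
    ("2014-02-02 full backup", 930, 1950),
    ("2014-03-02 full backup", 960, 2600),
    ("2014-04-06 full backup", 980, 3900),
]

baseline = None
for label, gb, secs in runs:
    mb_per_sec = gb * 1024 / secs
    if baseline is None:
        baseline = mb_per_sec                # first run sets the baseline
    pct = 100 * mb_per_sec / baseline
    flag = "  <-- look into SRARW / low-level fragmentation" if pct < 75 else ""
    print(f"{label}: {mb_per_sec:6.1f} MB/s ({pct:5.1f}% of baseline){flag}")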

If you are using ZFS and SRARW performance degrades... unfortunately at this time your options are limited.
 
