Sunday, January 12, 2014

Reality TV: SQL Server Storage Migration Part 1 (rough)

Ehhh... its a rough draft... but if I don't post this tonight it'll stay in my head for a really long time.  Someday I'll loop back to this and robust it up.  I'll include a link here when I do. Maybe it'll help someone even in the state its in.

Here's the setup: you are dropped into a new organization, without the previous DBA to ask questions.  And there is an upcoming SQL Server storage migration.  What then, huh?

Everything in this entry is SQL Server host side, and all in the initial data gathering stages, before actually devising the tests for the new rig.  So you can resume web surfing if what you need is outside that scope :)

1. Define performance capacity goals of the new storage
   a. current levels of concurrent activity, or is concurrency growth expected in the service life of new storage?
   b. current levels of quarterly/annual data growth, or growth pace increase expected?
   c. are current read latencies acceptable, or is there a new target?

2. If a test run of production workloads is possible in a test environment, it can be very useful for information that can be too expensive to gain production observation.
   It was extremely useful for us to reproduce production workloads in a test environment using procmon to capture sequence, operation (read/write), offset and length.  Using that information we were able to determine the distribution of read and write sizes and also estimate the number of sequential (or at least highly local) IOs versus nonsequential or random IO. 
   Evaluating this type of data in a database loaded with and without the -E startup parameter really helps evaluate whether it would be meaningful or not.
   The presence or absence of sequential 512k reads also helps to evaluate whether increasing the HBA maximum transfer size would be meaningful or not.
   Analyzing these log files also helped create a profile of read and write sizes for storage stress testing.
   Creating verbose log files can have significant observer overhead, so latency numbers from this type of profiling may not be meaningful.  But being able to determine a distribution of sizes and data locality was priceless.
   Here are some examples of capturing this type of data.

3. Suggested perfmon counters for storage performance capacity profiling
   I typically capture perfmon in tsv format, because its much smaller than the blg binary format and I can open it right away with excel :)
   I've settled on a 15 second interval for most uses.  Occasionally I'll use 30 second intervals, but I tend to stay at 15 or 30 seconds.  This is largely because I work with SQL Server systems and Oracle on various *NIX flavors.  Outside of Windows I stick with 15 or 30 seconds so I can capture data at a smaller interval than the filesystem sync daemon fires.  And by keeping both my Windows perfmon and *NIX collection intervals at 15 or 30 seconds, I can meaningfully compare data from different platforms that happen to be on the same SAN.
   Anyway... here are the perfmon counters I recommend for profiling.
   a. \LogicalDisk(*)\Disk Reads/sec
   b. \LogicalDisk(*)\Disk Writes/sec
   c. \LogicalDisk(*)\Disk Read Bytes/sec
   d. \LogicalDisk(*)\Disk Write Bytes/sec
   e. \LogicalDisk(*)\Avg. Disk sec/Read
   f. \LogicalDisk(*)\Avg. Disk sec/Write
   g. \LogicalDisk(*)\Current Disk Queue Length
   h. optional - \LogicalDisk(*)\Disk Bytes/sec (if you don't wanna just add read & write bytes)
   i. optional - \LogicalDisk(*)\Disk Transfers/sec (if you don't wanna just add reads & writes)

4. Capture perfmon for significant production workloads
   I usually capture perfmon logs for at least two weeks.  That way truly anomalous system activity or behavior might be identified and maybe not lead to wasted time.
   Don't be too quick to label activity or behavior anomalous :)
   Whether perfmon is collected for many full days, or in more targeted, shorter lengths of time, cover as many bases as possible.
   Bases to cover include:
   a. daily peak query read activity (eg a scheduled morning batch of reports)
   b. daily peak edit activity (eg a daily ETL)
   c. daily peak tempdb use (write IOPs and data footprint)
   d. backups
   e. integrity checks
   f. index maintenance
   g. account for significant business cycle variations (eg month end activity)

5. OK, ya get yer perfmon!  Now what?  The numbers mean the most in correlation to each other. I've got lots of graphs that I use frequently - all with clock time on the x axis.
   Some of my favorites:
   a. sec/read vs reads/sec & sec/read vs read bytes/sec
      This works best as two graphs just because scaling techniques are needed to get reads/sec and read bytes/sec onto the same axis.
      Is sec/read, the read latency, more strongly correlated to the number of read IOPs or the read bytes/second?
   b. sec/write vs writes/secs & sec/write vs write bytes/sec
      This works best as two graphs just because scaling techniques are needed to get writes/sec and write bytes/sec onto the same axis.
      Is sec/read, the read latency, more strongly correlated to the number of write IOPs or the write bytes/second?
   c. total disk transfers/sec vs sec/read & sec/write
      If read and write latency are more correlated to total disk transfers/sec than to the rate of the respective operation, storage controller CPU may be overloaded.
      Alternatively, a target front end adapter port queue depth may be getting overrun.
   d. total disk bytes vs sec/read & sec/write
      If read and write latency are more correlated to total disk bytes than other factors, the bandwidth in bytes somewhere may be getting overrun.  When SAN connections to individual disk devices were nearly always FC-AL loops, a flood of traffic for one disk on the loop could starve traffic for other disks in the loop.
   e. current disk queue length vs sec/read & sec/write
      Windows allows a maximum of 256 outstanding IOs per Windows volume.  That includes both the wait queue depth managed by Windows and the service queue depth (typically 32) managed by the multipathing software and/or the HBA driver.  Keep it pegged at 256 and both read AND write latencies will suffer, even with VERY fast disk devices underneath.
      Now a word about the service queue depth: keep it pegged, and write latency especially will suffer.  Cached writes are typically quite fast.  But if writes are stuck in the (usually) FIFO wait queue with the entire service queue filled with reads, write latency may exceed read latency.
      That can be a sign that service queue depth may need to be increased (per best practices of storage vendor), or more Windows host LUNs are needed for the workload, or that an increase in HBA maximum transfer size may be beneficial.
   f. current disk queue length vs reads/sec, writes/sec & transfers/sec
      If current disk queue length seems to remain high, is it because of read IOPs, write IOPs, or transfers/sec?
      Read IOPs causing current disk queue length to climb and remain high could be an indication of insufficient back end spindle count.  (Not conclusively.)
      Write IOPs causing disk queue length to climb and remain high could be an indication of replication overhead, such as copy-on-write retention, or replication to another array. (Again, not conclusively.)

6. Now what? Well... compare the expected host-level configuration of the new system and the current system, to make sure that its an apples to apples comparison.
   a. data layout.  Paging space on Windows install drive, or isolated to its own drive. (I like isolating to its own drive: IMO it is by far the easiest way to monitor paging activity. It also insulates the OS install drive from a potentially crushing amount of activity under heavy paging.) Same number of host LUNs for mdf & ndf data files?  Same number of tempdb files & LUNs? Same plan for transaction log files?
   b. NTFS cluster size - 64k on both systems?
   c. Windows volume offset - same offset, or at least aligned offsets on both systems.
   d. Windows volume options: IO priority, write caching policy, advanced caching policy, NtfsDisableLastAccessUpdate
   e. multipathing setup: Windows mpio, EMC Powerpath, Hitachi hdlm, ?? What is the multi-pathing algortihm?
   e. Data transport protocol: iscsi, FCoE, FC (Infiniband?  NFS?)
   f. HBA/CNA make, model, config
      max transfer size (Windows default 512k)
      LUN queue depth (usually per LUN, per initiator port)
      Execution throttle (queue depth for target port)
      Adapter port queue depth (default 1024... but VMWare pvscsi vHBA default 256)
   g. SQL Server options that can influence IO behavior, eg -E startup parameter (4mb parcels instead of 64k for proportional fill algorithm), trace flag 1118 (no more table or index mixed extents), trace flags 2562 or 2549 (altering dbcc checkdb behavior), or parameters affecting SQL Server backup behavior.
   h. Host side data accelerator/tier 0 cache, such as EMC xtremSW, Intel SSD + CAS, QLogic FabricCache, FusionIO...

So what have you got?  Well, now you can hopefully express the current config and workload in terms of peak sustained IOPs, peak sustained bytes/sec, peak sustained read/write IOPs, peak sustained read/write bytes, and read/write latency at each of those sustained peaks.  You can talk about expected increases in activity during the service life of the new storage.  If you've profiled with procmon, you can even describe the current distribution of read and write sizes.   Some storage admins will want to take all of that information into account.  Some of them only want a few of those metrics, and may be irritated that you've given more than they want.  If that's the case, apologize politely and say its my fault :)

OK... now you're ready for stress testing the new rig, with an understanding of current system performance capacity and behavior and the comparison of the old and new host configuration.

That means devising some tests!  Tune in next time...

No comments:

Post a Comment