Wednesday, January 22, 2014

Transaction Log buffers; How is txlog write latency hurting SQL Server?

**** Update 2015/10/22*****
Its been a while since I looked at 'dbcc dbtable' lc_state values to see how close a database is to experiencing log buffer waits.  I wrote a stored procedure to grab the info at 5 minute periods, investigating the effects of synchronous SAN replication for transaction log LUNs.  But that was back in January 2014, and the instance was SQL Server 2008 R2.

Today I was poking around in SQL Server 2016 and noticed that lc_state values are more friendly! I'll capture all 7 of the new(er) text values and then create a whole new blog post about this.  A colleague that was working with my January 2014 SP to capture this information told me that even in SQL Server 2012 lc_state was assigned a text value rather than numeric 0-6.  Oh well. 

IF OBJECT_ID('tempdb..#table') IS NULL 
   CREATE TABLE #table
      ([ParentObject] VARCHAR(255),
       [Object]       VARCHAR(255),
       [Field]        VARCHAR(255),
       [Value]        VARCHAR(255)
'dbcc dbtable(' + quotename(db_name()) + ') with tableresults, no_infomsgs';
EXEC (@sqlT);

SELECT   value,
         COUNT(value) AS NUM_log_buffers 
FROM     #table 
WHERE    field = 'lc_state' 
GROUP BY value;


Don't pay too much attention :) I'm just trying to work this out.  I've mentioned a high write latency SQL Server instance before, and captured two LUNs for that instance that stay pretty close to pegged at "current disk queue length" of 32.
You can see some of that fun here.

I was able to find a couple of sources on the web that mentioned both the maximum 32 outstanding async writes per txlog, and also mentioned that the total number of in-mem buffers was higher (128 total).

Paul Randall was nice enough to tell me that I could check on free txlog buffers by getting tableresults from 'dbcc dbtable(db_name()) with tableresults' (if you do it in master without specifying db name you'll pull in results from all databases - it'll be extensive).

So I put together a stored procedure to get the table results, count the rows with field=lc_state grouped by value.  Now I've got more questions :)

SQL Server 2008 R2.  Here's what I think is happening.
lc_state=0 is a zeroed, free, ready to become current buffer.  Under pressure there may not be a buffer of state 0 at all.  But, in a database idle since SQL Server startup, I've seen 127 buffers of state 0 and a single state 1(all of the buffers initialized and zeroed at startup).   At startup all buffers are initialized and zeroed... but once they've been written once (lc_state=5) they are marked free (lc_state=6) and only zeroed one at a time on demand.  That would be similar to how Windows and other OS handle RAM which needs to go on a "zeroable" list, and then zeroed, before being allocated to another use.

lc_state=1 is "buffer currently accepting txlog records".  In one of my 5 minute interval captures, there was no buffer with lc_state=1.  In one capture, with the txlog buffers under pressure (at approximately 5:55 am) there were two buffers with lc_state=1.  It would make sense that under pressure there may be no "current buffer" and in a race condition dbcc dbtable may report two current buffers (ie a 'nolock' type result?)

lc_state=2 are full buffers.  At the next trigger, they'll queue to be written and when the log writer can write them it will.

lc_state=5 are currently outstanding writes.  At a trigger (such as commit or checkpoint) the logwriter will call for up to 32 buffers to be written asynchronously.  Due to the total limit of 3840 kb of outstanding writes for the logwriter, there may be less than 32 writes sent even if there are 32 or more ready to be written.  Or, there may be less than 32 buffers ready for write when the logwriter trigger (commit, checkpoint, etc) occurs.

lc_state=6 is write complete, free, ready to be zeroed.  But these are only zeroed on demand - that's why there's never more than 1 lc_state=0  (free, zeroed, ready to become current).

So I don't have a clue yet about lc_state=3 or lc_state=4.

If I've correctly deduced the meaning of the txlog buffers, and the "on demand" zeroing of one buffer at a time from the "zeroable" list of previously written buffers... zeroing only 1 buffer at a time could be a bottleneck during heavy (optimized) ETL or other high change rate activity.

In this case, I think I may also have to monitor the 3840kb total limit for tx log writer async writes.  There's two busy user databases, tempdb is getting some activity... and I think there's one more database generating some activity.  So at some times, the 3840kb total txlog async write limit may also be throttling activity. 

Sunday, January 12, 2014

Reality TV: SQL Server Storage Migration Part 1 (rough)

Ehhh... its a rough draft... but if I don't post this tonight it'll stay in my head for a really long time.  Someday I'll loop back to this and robust it up.  I'll include a link here when I do. Maybe it'll help someone even in the state its in.

Here's the setup: you are dropped into a new organization, without the previous DBA to ask questions.  And there is an upcoming SQL Server storage migration.  What then, huh?

Everything in this entry is SQL Server host side, and all in the initial data gathering stages, before actually devising the tests for the new rig.  So you can resume web surfing if what you need is outside that scope :)

1. Define performance capacity goals of the new storage
   a. current levels of concurrent activity, or is concurrency growth expected in the service life of new storage?
   b. current levels of quarterly/annual data growth, or growth pace increase expected?
   c. are current read latencies acceptable, or is there a new target?

2. If a test run of production workloads is possible in a test environment, it can be very useful for information that can be too expensive to gain production observation.
   It was extremely useful for us to reproduce production workloads in a test environment using procmon to capture sequence, operation (read/write), offset and length.  Using that information we were able to determine the distribution of read and write sizes and also estimate the number of sequential (or at least highly local) IOs versus nonsequential or random IO. 
   Evaluating this type of data in a database loaded with and without the -E startup parameter really helps evaluate whether it would be meaningful or not.
   The presence or absence of sequential 512k reads also helps to evaluate whether increasing the HBA maximum transfer size would be meaningful or not.
   Analyzing these log files also helped create a profile of read and write sizes for storage stress testing.
   Creating verbose log files can have significant observer overhead, so latency numbers from this type of profiling may not be meaningful.  But being able to determine a distribution of sizes and data locality was priceless.
   Here are some examples of capturing this type of data.

3. Suggested perfmon counters for storage performance capacity profiling
   I typically capture perfmon in tsv format, because its much smaller than the blg binary format and I can open it right away with excel :)
   I've settled on a 15 second interval for most uses.  Occasionally I'll use 30 second intervals, but I tend to stay at 15 or 30 seconds.  This is largely because I work with SQL Server systems and Oracle on various *NIX flavors.  Outside of Windows I stick with 15 or 30 seconds so I can capture data at a smaller interval than the filesystem sync daemon fires.  And by keeping both my Windows perfmon and *NIX collection intervals at 15 or 30 seconds, I can meaningfully compare data from different platforms that happen to be on the same SAN.
   Anyway... here are the perfmon counters I recommend for profiling.
   a. \LogicalDisk(*)\Disk Reads/sec
   b. \LogicalDisk(*)\Disk Writes/sec
   c. \LogicalDisk(*)\Disk Read Bytes/sec
   d. \LogicalDisk(*)\Disk Write Bytes/sec
   e. \LogicalDisk(*)\Avg. Disk sec/Read
   f. \LogicalDisk(*)\Avg. Disk sec/Write
   g. \LogicalDisk(*)\Current Disk Queue Length
   h. optional - \LogicalDisk(*)\Disk Bytes/sec (if you don't wanna just add read & write bytes)
   i. optional - \LogicalDisk(*)\Disk Transfers/sec (if you don't wanna just add reads & writes)

4. Capture perfmon for significant production workloads
   I usually capture perfmon logs for at least two weeks.  That way truly anomalous system activity or behavior might be identified and maybe not lead to wasted time.
   Don't be too quick to label activity or behavior anomalous :)
   Whether perfmon is collected for many full days, or in more targeted, shorter lengths of time, cover as many bases as possible.
   Bases to cover include:
   a. daily peak query read activity (eg a scheduled morning batch of reports)
   b. daily peak edit activity (eg a daily ETL)
   c. daily peak tempdb use (write IOPs and data footprint)
   d. backups
   e. integrity checks
   f. index maintenance
   g. account for significant business cycle variations (eg month end activity)

5. OK, ya get yer perfmon!  Now what?  The numbers mean the most in correlation to each other. I've got lots of graphs that I use frequently - all with clock time on the x axis.
   Some of my favorites:
   a. sec/read vs reads/sec & sec/read vs read bytes/sec
      This works best as two graphs just because scaling techniques are needed to get reads/sec and read bytes/sec onto the same axis.
      Is sec/read, the read latency, more strongly correlated to the number of read IOPs or the read bytes/second?
   b. sec/write vs writes/secs & sec/write vs write bytes/sec
      This works best as two graphs just because scaling techniques are needed to get writes/sec and write bytes/sec onto the same axis.
      Is sec/read, the read latency, more strongly correlated to the number of write IOPs or the write bytes/second?
   c. total disk transfers/sec vs sec/read & sec/write
      If read and write latency are more correlated to total disk transfers/sec than to the rate of the respective operation, storage controller CPU may be overloaded.
      Alternatively, a target front end adapter port queue depth may be getting overrun.
   d. total disk bytes vs sec/read & sec/write
      If read and write latency are more correlated to total disk bytes than other factors, the bandwidth in bytes somewhere may be getting overrun.  When SAN connections to individual disk devices were nearly always FC-AL loops, a flood of traffic for one disk on the loop could starve traffic for other disks in the loop.
   e. current disk queue length vs sec/read & sec/write
      Windows allows a maximum of 256 outstanding IOs per Windows volume.  That includes both the wait queue depth managed by Windows and the service queue depth (typically 32) managed by the multipathing software and/or the HBA driver.  Keep it pegged at 256 and both read AND write latencies will suffer, even with VERY fast disk devices underneath.
      Now a word about the service queue depth: keep it pegged, and write latency especially will suffer.  Cached writes are typically quite fast.  But if writes are stuck in the (usually) FIFO wait queue with the entire service queue filled with reads, write latency may exceed read latency.
      That can be a sign that service queue depth may need to be increased (per best practices of storage vendor), or more Windows host LUNs are needed for the workload, or that an increase in HBA maximum transfer size may be beneficial.
   f. current disk queue length vs reads/sec, writes/sec & transfers/sec
      If current disk queue length seems to remain high, is it because of read IOPs, write IOPs, or transfers/sec?
      Read IOPs causing current disk queue length to climb and remain high could be an indication of insufficient back end spindle count.  (Not conclusively.)
      Write IOPs causing disk queue length to climb and remain high could be an indication of replication overhead, such as copy-on-write retention, or replication to another array. (Again, not conclusively.)

6. Now what? Well... compare the expected host-level configuration of the new system and the current system, to make sure that its an apples to apples comparison.
   a. data layout.  Paging space on Windows install drive, or isolated to its own drive. (I like isolating to its own drive: IMO it is by far the easiest way to monitor paging activity. It also insulates the OS install drive from a potentially crushing amount of activity under heavy paging.) Same number of host LUNs for mdf & ndf data files?  Same number of tempdb files & LUNs? Same plan for transaction log files?
   b. NTFS cluster size - 64k on both systems?
   c. Windows volume offset - same offset, or at least aligned offsets on both systems.
   d. Windows volume options: IO priority, write caching policy, advanced caching policy, NtfsDisableLastAccessUpdate
   e. multipathing setup: Windows mpio, EMC Powerpath, Hitachi hdlm, ?? What is the multi-pathing algortihm?
   e. Data transport protocol: iscsi, FCoE, FC (Infiniband?  NFS?)
   f. HBA/CNA make, model, config
      max transfer size (Windows default 512k)
      LUN queue depth (usually per LUN, per initiator port)
      Execution throttle (queue depth for target port)
      Adapter port queue depth (default 1024... but VMWare pvscsi vHBA default 256)
   g. SQL Server options that can influence IO behavior, eg -E startup parameter (4mb parcels instead of 64k for proportional fill algorithm), trace flag 1118 (no more table or index mixed extents), trace flags 2562 or 2549 (altering dbcc checkdb behavior), or parameters affecting SQL Server backup behavior.
   h. Host side data accelerator/tier 0 cache, such as EMC xtremSW, Intel SSD + CAS, QLogic FabricCache, FusionIO...

So what have you got?  Well, now you can hopefully express the current config and workload in terms of peak sustained IOPs, peak sustained bytes/sec, peak sustained read/write IOPs, peak sustained read/write bytes, and read/write latency at each of those sustained peaks.  You can talk about expected increases in activity during the service life of the new storage.  If you've profiled with procmon, you can even describe the current distribution of read and write sizes.   Some storage admins will want to take all of that information into account.  Some of them only want a few of those metrics, and may be irritated that you've given more than they want.  If that's the case, apologize politely and say its my fault :)

OK... now you're ready for stress testing the new rig, with an understanding of current system performance capacity and behavior and the comparison of the old and new host configuration.

That means devising some tests!  Tune in next time...

Monday, January 6, 2014

Complex Risk Assessment: Part II Trace Flag 4199 fixes w/o individual trace flags & SQL Server 2016 Update

*****Update 2017-03-02*****

Hey!!  It's 2017!  Let's talk about SQL Server 2016, k?

The incorporation of pre-2016 T4199 roll-up optimizer fixes by default with 2016 compatibility level(130) is quite well documented.
Using SQL Server 2016 with compatibility level below 130, these fixes are NOT enabled by default. 
Use trace flag 4199 or database scoped configuration QUERY_OPTIMIZER_HOTFIXES to enable these fixes if in a database at compatibility level below 130.

 So... between pre-2016 optimizer fixes enabled by default in SQL Server 2016 with compatibility level 130, and database scoped configuration QUERY_OPTIMIZER_HOTFIXES is it time to say goodbye to trace flag 4199?

Not so fast...

On January 18, 2017 both SQL Server 2016 RTM CU4 and SQL Server 2016 SP1 CU1 were released.
2016 RTM CU4 contains kb3208177 - 2016 SP1 CU1 does not have an equivalent documented fix.
Maybe there will be an equivalent fix in SQL Server 2016 SP1 CU2?

(SQL Server 2016 SP1 is documented as containing the fixes through SQL Server 2016 RTM CU3.

Optimizer fix enabled by trace flag 4199 in SQL Server 2016 RTM CU4
Improvement: Improves the query performance for SQL Server 2016 by changing the use of histograms on UNIQUE columns

*****End Update 2017-03-02  ***** 

**Update 2016-07-21**
Wow!  Its been 2.5 years since I originally put this blog post up!  So much has changed since then!
Especially because... with SQL Server 2016 RTM and the SQL Server 2016 database compatibility level, pre-2016 optimizer fixes rolled up into T4199 are enabled by default!  It gets a little tricky, because in SQL Server 2016 RTM with 2014 compatibility level and below... the optimizer fixes do NOT seem to be enabled by default.  That's my conclusion based on using a query that checks trace flag 4101 behavior, anyway.  But... database scope configuration for optimizer hotfixes makes the entire picture a bit complicated... there will be a future blog post on that, I promise.  Unless Microsoft beats me to it :-)
The main reason for this update: I just found another documented fix that's rolled into trace flag 4199 and I want to keep this post as complete as possible.  
This one is tricky, because the KBa doesn't mention trace flag 4199. But there's a CSS article about KB 222998 also, and *it* does mention trace flag 4199.  (So do the comments, btw.)
That CSS article is here.
And here's the KB article information.
MSSQL Version Desc
Trace flag
SQL Server 2005
Bad legacy CE estimate for antijoin query with multiple columns in WHERE

KBa 2222998 seems to be no more - or at least my link to it dangles.
But here's the related blog post from Jack Li that describes the fix.
Slow query performance because inaccurate cardinality estimate when using anti-semi join following upgrade from SQL Server 2000
**End 2016-07-21 Update**
Original blog post from 2014-01-06
This post is a followup to a previous blog post from 2014:

Steinar Andersen's T4199 blog post below expresses an understanding of the scope of trace flag 4199 and optimizer fixes very similar to the one I held until the last week.  Namely, that fixes to error and wrong results conditions would be enabled by default, while enabling other optimizer fixes may require both a code change and a trace flag (rolled up into T4199, or individually enabled with a separate trace flag).
Steinar Andersen

However, my recent webvestigation has turned up at least one error condition (Kb982441) and one wrong results fix (Kb2892741) enabled by T4199.  Moreover, these fixes cannot be enabled individually with a separate trace flag, as some T4199 rollup fixes can.

So... want to resolve the index rebuild errors of Kb982441, or resolve the wrong results of kb2892741?  Your choices:
1. Modify system and/or code to avoid defect triggers
2. Take everything that comes with T4199

In my opinion, that really changes the risk assessment involved with T4199.

I agree with the strategy of code + flag to enable various classes of fixes, optimizer behavior among them. 

But it doesn't seem true to the initial intent that an error or wrong results condition would require a trace flag for correction.
In my opinion, it also seems like T4199 optimizer fixes should have a timeline to become default behavior.  If database created in SQL Server 2014 includes a new cardinality estimator by default, shouldn't it benefit by default from optimizer fixes made in SQL Server 2005?

My opinions may not benefit SQL Server users much.  But at least the lists of T4199 fixes that I compile can help them :)

So, here you go... through 6 January 2014 these fixes, along with those previously listed in Part I, are the T4199 fixes of which I know.  Note: kb2757097 seems more incidental rather than intentional with respect to T4199.  But, since the KBa references T4199 I included it in the list.

Enjoy!  Hyperlinks behind each KBa number.

MSSQL Version Desc
Trace flag
SQL Server 2008
query w/MIN or MAX & partitioned view
SQL Server 2008
error 666 w/Index rebuild
SQL Server 2008
full-text query w/CONTAINS or FREETEXT & OR
SQL Server 2008 R2
SQL Server 2008 R2
SQL Server 2008
incorrect query results w/partitioned table
SQL Server 2012
spatial index ignored, T4199 may help

 Note: KBa 2667211 also mentions trace flag 4138 as correcting similar symptoms for the same operators even if the plan does not fit the pattern described in the KBa.  Trace flag 4138 seems distinct from the correction enabled by trace flag 4199 after this development is in place.  However I do not see a separate KBa for trace flag 4138.  In fact, KBa 2667211 indicates it was included in SQL Server 2008 R2 SP1 CU7, however the fix list for that CU does not include KBa 2667211.


Friday, January 3, 2014

Trace Flag 4199: Complex Risk Assessment Part I

I may have missed some rolled up trace flags, or some flagless fixes activated by T4199.  Finally, I may find some information about some 'undocumented' fixes included in T4199.   I've added references for those that I've found so far to a followup second post on this topic.

Trace Flag 4199: Complex Risk Assessment: Part II

**End update**

Recently a question came up on Twitter #sqlhelp about kb2904010, a recent fix for sample size in sample stats updates (kb article is linked below).  The original question was about determining the applicability of that fix to a given system.  For a given query, evaluating the appropriateness isn't too bad... but for a change like that you really should evaluate across the system.  And it seemed to me that the following procedure would be appropriate:
1. update all stats on the system with their typical sampling rate
2. save the stats info
3. apply and activate the fix
4. update all stats again with their typical sampling rate
5. compare stats with and without the fix

Here's the reason: if you only evaluate a single query, the effects of the fix on other queries are unknown.  In fact, that's true for any defined set of queries.  Even if you evaluate all current queries, what about the query tomorrow that is executed for the first time against existing data/indexes/stats keys?  So, in this case, comparing the effects across all current stats keys with their current sampling rates, while not necessary  gives information for risk assessment that wouldn't be available if the effect on only a subset of queries is evaluated.

I didn't say all of that on Twitter :) I'm long-winded in real life, but I try to behave when limited to 140.

There was a much shorter answer which took that particular exchange where it needed to go: the risk assessment of the fix in 2904010 is vastly increased by the fact that it is activated by trace flag 4199 (and does not have a separate trace flag which enables the fix individually).  You see, if that organization wasn't already using trace flag 4199, a two state test would not really be sufficient for risk assessment.  They'd have to have a 3 state test: baseline, baseline + T4199, baseline + T4199 + kb2904010. 

Why?  Cuz T4199 pulls a lotta stuff along with it, that's why.  In the days of my youth this is the type of roll-up fix that I would set out testing combinatorically - guessing which fixes would matter most to the system and trying to implement them singly.  Results with single fixes could be compared to baseline and with the full rollup.  Single fixes that showed strong positive returns in testing could be combined, with results compared to baseline and full rollup.

Wanna try?  I did.  But because there are 27 separate trace flags all rolled into trace flag 4199, I got tired out after running 1 test query through all possibilities.  Links below to the 28 kb hot fixes for these trace flags (one trace flag has a separate kba for SQL Server 2005 and SQL Server 2008).

Then... well, then come the 6 fixes which are activated by trace flag 4199 which CANNOT be activated separately.  You want 'em?  Gotta take the whole kaboodle.

That's a pretty complex risk assessment.  Now I will say this: there are considerable gains possible with trace flag 4199.  Just based on the activity of trace flag 4101 I saw the memory grant of some queries shrink from over 10GB to less than 2 mb!  And none of my test queries turned out with plans that were significantly worse on the test data set.  I haven't seen examples of chaos caused by trace flag 4199, and I know that Adam Machanic uses it widely (though he does of course encourage caution in adopting it).  So there are considerable benefits possible... I just wanted to try to put some documentation out there that would be useful for risk assessment.

In the tables below, general kba that mention T4199 appear first.  Then, sorted by trace flag number, the trace flag kbs for trace flags rolled up into T4199.  The MSSQL Version noted there is the earliest major version with the kb hotfix (not necessarily the version with the earliest calendar date availability of the hotfix).

Finally, sorted by kb article number, the 6 hotfixes I found that are activated by T4199 and CANNOT be activated individually.

Enjoy!  Hyperlinks behind each KBa number.


Trace flags rolled into T4199

Documents T4199 w/querytraceon

MSSQL Version Desc
Trace flag
SQL Server 2000
outer joins w/filter criteria
SQL Server 2005
semi joins; mentions T4118??
SQL Server 2005
optimizer timeout; mentions T4122 in err?
SQL Server 2005
multi-column join cardinality
SQL Server 2005
index union cost
SQL Server 2005
multi-join query compile time
SQL Server 2005
range filter w/multiple table partitions
SQL Server 2005
fast forward-only cursor
SQL Server 2005
error w/fast forward-only cursor
SQL Server 2005
query w/user-defined scalar function
SQL Server 2005
replication merge agent
SQL Server 2005
forward-only cursor
SQL Server 2005
join of local and remote table
SQL Server 2005
blocking issue for row updates
SQL Server 2005
LIKE and comparison operator
SQL Server 2005
cursor deadlock
SQL Server 2005
outer joins; requires T4101

4122 mentioned in 946020 in err?
SQL Server 2005
BIGINT statistics
SQL Server 2005
inner join of derived table w/DISTINCT
SQL Server 2005
transactional replication
SQL Server 2005
long compile time
SQL Server 2005
cursor for updates
SQL Server 2008
cursor for updates
SQL Server 2005
GETDATE() in table copies
SQL Server 2005
ON clause of query JOIN w/LIKE
SQL Server 2005
query notifications and error log
SQL Server 2008
error 605/804 on temp table insert

MSSQL Version Desc
Trace flag
SQL Server 2008
parameterized CTE query
SQL Server 2008
SQL Server 2005
multiple self-joins on same column
SQL Server 2008
SQL Server 2012
linked server w/different collation
SQL Server 2008 R2
stats update sample size