Tuesday, March 14, 2017

#TSQL2sday - The Daily WTF - A Tale of 2 Trace Flags


March 14, 2017 is T-SQL Tuesday!  Yahoo!!  Kennie Nybo Pontoppidan is hosting this month.

Here was the announcement...

Announcing T-SQL Tuesday #88: The daily (database-related) WTF
http://www.pontop.dk/single-post/2017/03/07/Announcing-T-SQL-Tuesday-88-%E2%80%93-The-daily-database-related-WTF

And here is my contribution.  A list of performance changes is much easier to come up with than the best order to introduce those changes.

Yep, I remember deploying a trace flag on a production server for the first time.  Trace flag 8048, to mitigate spinlock contention related to memory allocation.  That step followed weeks of observation and research, attempts to recreate the problem on a test system (unsuccessful at the time), and outreach to Microsoft and various experts.
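
For anyone who hasn't flipped a trace flag before, here's a minimal sketch of the mechanics.  DBCC TRACEON with -1 enables a flag globally for a test window; for production we relied on a -T8048 startup parameter instead, so the flag survives service restarts.

-- check whether the flag is already active
DBCC TRACESTATUS(8048);
-- enable globally (-1 means server-wide, not just this session)
DBCC TRACEON(8048, -1);
-- list every globally enabled trace flag
DBCC TRACESTATUS(-1);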

Although it was a nail-biter, it went well.  An important weekly batch of reports dropped from nearly 20 hours of execution time to just over 2 hours.  Even more reassuring: in graphs of CPU utilization vs read bytes/sec and CPU utilization vs buffer page lookups, the relationships were clear again.  As we increased and decreased demand on the server, CPU utilization rose and fell with it.  Good.

Soon after, though, on that system and numerous others, we ran into another significant issue limiting scalability and dragging out execution time of batch reports: very large memory grants and long lists of pending memory grants with excessively long resource_semaphore waits.
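
The kind of check we used to see those pending grants is easy enough to sketch; treat this as illustrative rather than the exact script we ran.

-- pending memory grants: granted_memory_kb stays NULL until the grant is satisfied
SELECT session_id, requested_memory_kb, granted_memory_kb,
       wait_time_ms, queue_id, is_next_candidate
FROM sys.dm_exec_query_memory_grants
WHERE grant_time IS NULL
ORDER BY wait_time_ms DESC;

-- accumulated RESOURCE_SEMAPHORE waits since the last restart or stats clear
SELECT wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type = N'RESOURCE_SEMAPHORE';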

After significant additional research, we learned that trace flag 4199 had numerous positive effects on our workload.  The easiest for us to demonstrate was the fix that could be independently enabled with trace flag 4101 - that fix was also rolled up into the trace flag 4199 set.

Here's a reference for that specific fix.
FIX: Reorder outer joins with filter criteria before non-selective joins and outer joins
https://support.microsoft.com/en-us/help/318530/fix-reorder-outer-joins-with-filter-criteria-before-non-selective-joins-and-outer-joins

And here are the posts where I've tried to keep track of all T4199 fixes.
Trace Flag 4199: Complex Risk Assessment Part I
http://sql-sasquatch.blogspot.com/2014/01/trace-flag-4199-complex-risk-assessment.html
Complex Risk Assessment: Part II Trace Flag 4199 fixes w/o individual trace flags & SQL Server 2016 Update
http://sql-sasquatch.blogspot.com/2014/01/trace-flag-4199-complex-risk-assessment_6.html

Trace flag 4199 was good for our workload.  We tested it fully on SQL Server 2008 R2, and would later replicate the testing on SQL Server 2012 and 2014.

I've long believed that knowing the list of performance recommendations is much easier than knowing the best order in which to adopt them.  When I went on the road with my knowledge of perfmon, trace flag 8048, and trace flag 4199, that belief was strongly reinforced.

I left Wisconsin headed for Illinois with another performance engineer for an on-site observation and performance intervention.

For the first few hours we collected data with perfmon and some stored procedures to grab waits, spinlocks, and memory grant information.  I was excited to see patterns I recognized - spinlock contention clustered around CMEMTHREAD waits (a hint that trace flag 8048 would be applicable) and "eyes too big for stomach" memory grant requests (in queries with many joins, a hint that trace flag 4199 may be helpful to "right-size" the memory grant).
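
The spot checks behind those observations looked roughly like the following - a sketch, not the exact stored procedures we were running on site.

-- CMEMTHREAD: waits on thread-safe memory objects during memory allocation
SELECT wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type = N'CMEMTHREAD';

-- which spinlocks are spending the most time spinning
SELECT TOP (10) name, collisions, spins, spins_per_collision, backoffs
FROM sys.dm_os_spinlock_stats
ORDER BY spins DESC;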

I'd seen systems that implemented both trace flags as startup parameters simultaneously.  I'd helped organizations implement first T8048, then T4199 (based on the timing of my research and testing of the trace flags).  This was the first time there was a desire to implement the trace flags one at a time, and we had the choice of which to implement first.

I hadn't chosen to put T8048 in first previously - that was just the way everything worked out.  If I chose to follow that order, I'd be doing what I'd seen and done before.  But... there was also a reason to choose the reverse order, with T4199 first.  Spinlock issues - especially at that time - were considered more exotic than many of the "plan-shaping" issues trace flag 4199 addressed.  Many administrators were much more familiar with that type of performance work - eliminating significant waits, altering plan shapes, making the logical work of queries more efficient - than with the busy-wait/management overhead of spinlocks.  Sometimes demonstrating an improvement that someone already knows how to evaluate is a plus, helping to gain trust.  I didn't know of a specific reason NOT to put trace flag 4199 in place first, followed by T8048 later.  And in this case it seemed like building up some interpersonal capital might be a good idea.

Trace flag 4199 went in as a service startup parameter, and SQL Server was restarted.  When the heavy workload hit, we watched resource utilization closely.  Things looked good for a few minutes.  Then the server became unresponsive.  CPU was saturated.
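
An aside: when a server is saturated but still limping, one quick check from inside SQL Server is the runnable task count per scheduler.  A rough sketch - the scheduler_id filter is just the usual way to limit the view to visible user schedulers.

-- runnable tasks piling up on every scheduler = threads waiting on CPU,
-- not on I/O or locks
SELECT scheduler_id, current_tasks_count, runnable_tasks_count
FROM sys.dm_os_schedulers
WHERE scheduler_id < 255;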

Trace flag 4199 had let the workload push harder.  Individual memory grants had shrunk, so query concurrency was higher.  The number of concurrent parallel workers was higher.  The opportunities for contention in allocating workspace memory for sorts and hashes increased significantly.  And spinlock contention resulted in CPU saturation.  There's a condition known as a spinlock convoy - many concurrent threads trading a spinlock resource among themselves, with every thread not holding the resource spinning.  I'd read about that, but never seen it make a server unresponsive until that day.

Fortunately the workload could be resubmitted.  I sheepishly explained that the server becoming unresponsive was actually a sign that the trace flag was doing what it was supposed to do... but it also meant I had chosen the wrong order in which to enable the trace flags.  The server was brought down.  When it was brought back up, we switched in trace flag 8048.  The workload was resubmitted, and a modest gain was observed.  The next day, trace flag 4199 was added and a significant performance and scalability gain was realized.

And I learned to consider much more closely the consequences of the order of performance intervention.  That's the only time I can remember trying to explain that the unresponsive server being witnessed by a whole room full of people was actually a good sign.  I'm sure there were at least a couple of people in that room thinking "performance intervention, WTF?"
