Friday, March 31, 2017

#sqlserver tempdb: sort_in_tempdb & -k startup option

This is a very preliminary investigation on my part. I'm not going to move on these results in any production system at least until I've also tested effects on checkpoint for persistent databases.  Probably not until I understand more about the mechanics of this throttle, either.  So I plead that no-one incorporates -k into their own system based on this brief write-up... you'll need some extensive testing for your own context.

Months ago, Jimmy May (@aspiringgeek on Twitter) asked me if I'd used the SQL Server -k startup option.  He's really knowledgeable about interaction of SQL Server with storage (an understatement) and a thorough tester - he's seen some good results with -k.  I'd read about this startup option, but never tested it and hadn't seen it deployed on any systems I'd worked with.  What I'd read to that point had to do with checkpoint throttling. Details on that angle of -k startup option can be found in kba 929240 below.

FIX: I/O requests that are generated by the checkpoint process may cause I/O bottlenecks if the I/O subsystem is not fast enough to sustain the IO requests in SQL Server 2005

Now the systems I work on stress SQL Server in lots of corner-case ways 😁 but an overwhelming checkpoint is something I haven't yet observed.

On the other hand, I do see overwhelming tempdb spills. Tempdb sort/hash spill writes are some of the most aggressive writes to come out of SQL Server.  Systems susceptible to them are advised to consider how to mitigate performance risk to that very system, as well as mitigating the risk of becoming a noisy neighbor if shared storage or shared plumbing (ESXi server, top-of-rack switch, etc) is involved.

The most common performance interventions for tempdb - trace flag 1117 or equivalent, trace flag 1118 or equivalent, increasing data file count to reduce allocation page contention - do not mitigate the risk posed by a tempdb spill write flood.  In fact, since none of the resources for those interventions I am aware of address the underlying Windows volume, vHBA, or ESXi host LUN layout for tempdb there is a chance of actions taken to alleviate allocation page contention increasing the risk posed by tempdb spills.  More on that another day - io weaving is a topic I'll have to prepare some diagrams for 😁

Most disk IO throttles are a poor fit for trying to mitigate this risk also.  VMware provides SIOC and adaptive queue throttling if using vmdks.  Neither work well to tame tempdb write floods without also throttling access to persistent databases.  Many storage arrays provide QoS controls at their front end adapters for rate limiting by IOPs or bytes/sec.  These limits can apply per initiator (host hadapter) or per target LUN depending on the array model.  Per LUN QoS can be ok... but also unwieldy.  What about IO governance in Resource Governor?  It works per volume!!  Yay!  But its share-based - rather than limit-based - and will kick in under contention only. So... nope, not that either (but do keep in mind that RG IO governance works per Windows volume - I'll come back to that someday and how it fits into my recommendation NOT to co-locate data files for tempdb and persistent databases on the same Windows volumeπŸ˜‰).
***Same Day Update***
Thanks to Alex Friedman (@alexf0101 on twitter) for pointing out I was wrong about Resource Governor IO governance.  It is limit-based governance.  I'll have to give it a try πŸ˜€

IO Resource Governance in SQL Server 2014
Use SQL Server 2014 Resource Governor to Throttle Processes for IO Usage

But here's something tantalizing.  A kba about -k startup option initially written for SQL Server 2012.  Hmm.  It mentions throttling tempdb "work files".  Gives an example with checkdb.

Enable the "-k" startup parameter to control the rate that work files can spill to tempdb for SQL Server

Recall that I am using a create index statement with sort_in_tempdb as my proxy for simulating large sort/hash spills. You can see my initial work with that here.

tempdb: "insert into... select" vs "select... into" vs index sort_in_tempdb write behavior

But what does throttling "work files" mean?  Is the throttle selectively restricted to tempdb activity only?  What about temp table usage (that's not my focus for a while - but its an important question). What happens to "create index" with sort_in_tempdb if the -k option is enabled?  I'm glad you asked about "create index"!!

Let's start with the unthrottled create index, on an 8 vcpu VM with all flash storage.  Its SQL Server 2016 SP1.  Maxing out two ports of the adapter πŸ˜€ Check out that write latency on the right!  Whoo boy!  Years ago Paul Randall asked his readers to send in disk service time numbers from their SQL Server instances.  Tempdb write latencies were curiously high.  I know one possible reason πŸ˜€

These graphs are from 1 second perfmon captures.

So... both to keep from saturating the adapter (and punishing any activity sharing the adapter with very high service times) AND to keep from overwhelming storage with writes, might want to throttle that activity.

What happens if we add -k1200 to startup? Interesting.  The entire operation stretched out in elapsed time by about 5 seconds.  Bytes/sec didn't exceed 1500 mb.  But look at that difference in write latency!!  A single point of stress at about 140 ms, and everything else was nice and low.  Remember - at saturation, service times get punished.  Keep a resource from saturation - even just by a little bit - and you can be rewarded with much low service times.

But... maybe the -k didn't really do anything?  After all - it didn;t *really* throttle to 1200 mb/sec did it?  And... this is right around a 10% difference in write bytes/sec to tempdb.  Within variance for some type of tests.  I don't have the patience today to keep repeating this test, though 😜 So, lets try to answer that question with a different throttle value.  Onward to 800!

OK... that's enough for me to say -k is really doing something.  Even lower bytes/sec to tempdb.  Lower peak write latency.  Still a nasty peak, but very brief.

What about throttled to 400? Hey... this one looks really nice from the write service time perspective.  Interesting that tempdb write bytes/sec is as variable as it is - something to do with the mechanics of the throttle.  A lot of this stuff might be more even on a physical server than in a VM.

Although I don't have time to do multiple repeat tests, I always try to make time to not stop at the first result I like.  So let's go to 200!!!

Excellent!  The first result I dislike! Write latency stays super low except for the peak at the end.  Write bytes/sec is reasonable with an interesting spike.  (I hope to learn more about how this throttle actually works - that'll help me plan how to work it into best practices and/or performance interventions.)  But the reason I don't like this result is the really choppy CPU utilization.  That hints that the throttle mechanism itself is under stress.  So for this workload on this system, throttling to 200 is too aggressive a throttle.

So where would I go from here?  When its time for more testing, I'll recommend ten or more runs of unthrottled, 1200, 800, 400.  From the best results, ten or more runs at that level plus 100, and minus 100.  Then we'll see where we're at.  Till then...

*whispers* smash your axioms

No comments:

Post a Comment