Wednesday, March 8, 2017

SQL Server 2016 SP1 EE checkdb with physical_only: Trace flags 2549 & 2562

First things first: I don't think trace flag 2549 does anything in SQL Server 2016 SP1 Enterprise Edition.

Last time I said that, I was wrong 😀. I was tricked by trace flag 2549 needing to be enabled globally in order to change behavior - I was testing with it enabled at session level.  But... after opening a ticket with Microsoft the kb article was updated to note that T2549 must be enabled globally 😀.

*This* time I made sure to enable T2549 globally.  But I didn't see any difference in behavior although all 8 of my database filegroup data files were in directories under mount points of a single drive letter.  Checkdb reads were *really* aggressive either way.  Like - more aggressive than previously when T2549 did make a difference.  My execution times were within 5% of each other.  I'll scrutinize the perfmon and session waits more later.  But for now, I'm assuming that behavior is the same and expected performance the same with or without T2549.

Now on to the good stuff.  As I mentioned, checkdb with physical_only reads are quite aggressive in SQL Server 2016 SP1.  The system I tested on is an 8 vcpu vm with 64 GB RAM.  The storage is from a hybrid SAN array, although all of the disk devices are SSDs.  Aggregate database data footprint - across 8 files in a single filegroup - is 3335GB.  (Sum of fileproperty(<name>,'spaceused') across the 8 files).

Let's see some graphs!!!

Look at that read throughput in a vanilla checkdb with physical_only!  These perfmon captures are using 1 second intervals - nearly saturating bandwidth for a considerable amount of time during its execution.

Hard to see because of the scale, but there's a rise and and fall in memory grant usage corresponding to the "batches" that checkdb is using.  Way less than 10 gb of Tempdb is used at peak, for less than 6 minutes out of the roughly 60 minute execution time.

Let's take a look at CPU utilization.  Ohhh yeahh!  Most of the time there were 16 checkdb parallel threads in use for this 8 vcpu VM.  The only break in fairly heavy CPU utilization came during the processing of that large batch in the middle.

OK.  Recall that trace flag 2562 instructs checkdb to use a single large "batch" for all of its checking. This can lead to a significant speedup - especially on spinning disk.  The cost is more use of tempdb.  The next test - checkdb with physical_only + T2562 - demonstrates both the benefit and cost of this behavior.

Hooo doggie!!  It only took about 48 minutes and 30 seconds now!  Look at that fascinating step function of tempdb use!!  Notice how the read bytes/sec plummet briefly before each new step of tempdb use.  If only I had enough time to dig into the mechanics of that!  Maybe another day...

All right - lets take a look at CPU utilization.  CPU also drops at the beginning of each new step of tempdb use.  How tantalizing!

But there you have it.  Pretty aggressive checkdb with physical_only in SQL Server 2016 SP1.  Even a little bit more aggressive with trace flag 2562 enabled.  (I enabled it globally for these tests.)

When planning for checkdb with physical_only in SQL Server 2016 SP1, you may want to stick it into its own Resource Governor workload group so you can independently control DOP, maybe tinker with its max memory grant, and maybe even use IOPs governance.  I hope to return to those matters sometime soon here.  But first gotta also test T2528 (disable parallel checks) and full checkdb.

I should note that these checkdb tests were run in isolation.  If there is another concurrent workload on the system, not only will it be effected by checkdb, but that workload will impact checkdb performance.  Even moreso if that concurrent workload contains DML (because the internal snapshot for checkdb must be maintained).

*****2017 03 08 addendum *****
I wanted to further detail disk IO activity with graphs.  Since the end-point devices are all SSDs, there is less variability in latency and read bytes/sec than would be expected with 10k or 15k fibre channel HDDs.

That's pretty good read throughput, at pretty low latency.

Pretty spikey read IOPs from second to second.

Now let's look at the checkdb run with T2562 enabled.  First at read bytes/sec against sec/read service time.

Throughput is even higher - from peaks of 1200 mb/sec to peaks of 1600 mb/sec.  That additional throughput comes at a cost though - read service time has risen to about 8 times its previous values.

Lets look at read IOPs vs read service time for the checkdb with T2562.

IOPs have come down even though throughput went up.  So the average IO size is higher.  But the jump in service time is much higher than predicted by the increase in average read IO size alone.

Rather, saturation of bandwidth must account for some of the increase in service time.  In general, at saturation performance behavior is going to deviate from expectation based on non-saturated behavior.

This brings up an important point about decisions in performance engineering and planning for systems.  If the main goal is completing checkdb as quickly as possible and the 10-12 minutes saved execution time is valuable enough - I'd definitely use T2562 on this system.

But... if there was a concurrent query workload that generated 100 mb/s of read load?  I'd strongly consider running checkdb without T2562.  Even though it would take a bit longer, there's room in there for additional resource utilization before hitting saturation.

Anyway... ciao for now!

No comments:

Post a Comment