The system is a 6 vcpu, 48 GB RAM instance of SQL Server 2016 SP1.
I've been running through a lot of checkdb/checktable scenarios lately, been trying to keep up with blogging whenever I find something interesting.
This particular case is interesting because it defies expectations.
Here's my test, including instrumentation. The test subject is that lonely checktable command nestled in there :-) I use tablock because if not tablock or in an explicit snapshot, each time checktable is run a new internal snapshot will be created and won't be able to compare warm vs cold database cache.
So... its pretty verbose, sorry about that. But stuffing @@io_busy, virtual file stats, wait stats, spinlock stats and latch stats into temp tables before and after so they can be compared. I've got the dbcc dropcleanbuffers in there because I ran this scenario many, many, many times to verify I wasn't being thrown off by something. But my results were stable and reproducible.
So I ran the test below with cold database cache. (My database was still; might have to issue a checkpoint before the dropcleanbuffers to ensure the cache really is cold if issuing checktable against a table that's recently been edited.)
OK. Well - here's the output from the checktable and statistics time.
I've got all those temp tables for instrumentation - let's cash the check 😀
All right. Lets test again - this time without dropping clean buffers, so we can compare behavior with warm cache.
Here's the output from checktable and statistics time.
Huh. Almost 11.5 seconds of elapsed time with warm cache, vs just over 6.2 seconds of elapsed time with cold cache. That's unexpected.
Let's see what the temp tables caught - maybe they'll shed some light.
Well. No reads according to virtual file stats - that's what we expected. No latch waits. (That varied a little among test runs, but it was never a significant amount of latch waits.) The spinlocks are uninteresting... I was kind of hoping to find a spinlock-related reason for the warm cache checktable taking longer. No dice.
So what's going on? Well - first of all, take a look at CPU ms between the cold cache and warm cache runs. 34534 for cold cache, 33907 for warm cache. Even though the cold cache run finished substantially faster, it took 627 ms more CPU time to complete. That's less than 2% variance. But its stable in the context of this test. I ran it over and over and although the times shifted around a bit, the cold cache run was always faster but required around 500 more CPU ms to complete.
On the one hand that makes sense: the cold cache run issues 8103 read IOs according to virtual file stats and @@io_busy says 125 milliseconds of CPU time just to issue those IO commands.
But it still finished faster!
As far as I can tell, the secret is in the wait stats - cxpacket in particular. Below are the cold cache wait stats again.
And below are the warm cache wait stats again. The warm cache checktable - which I would have expected to finish faster - somehow incurred almost 3 times as much cxpacket wait time as the cold cache checktable - and over 3 times as much signal wait time for the cxpacket wait type.
The warm cache checktable had less work to do than the cold cache checktable. And, indeed, it did less work from an accumulated CPU time standpoint. But somehow, it didn't divvy up its work as well among the parallel workers. Due to a significant increase in cxpacket wait time and cxpacket signal wait time over the cold cache operation, it finished in almost twice the time!
This is one reason its extremely important to look not just at work, and not just at work + waits... not even at work + waits + management. Appropriate scheduling and distribution of work among resources is also important to keep everyone happy when they want extreme performance.