Showing posts with label checkdb. Show all posts

Thursday, April 27, 2017

SQL Server 2016 checktable - sometimes takes longer with warm cache than cold?!?

First a disclaimer - I've never run these tests on any version prior to SQL Server 2016.  So I am not at all claiming this is a new phenomenon in SQL Server 2016.  It's just an interesting behavior that surfaced while I was testing SQL Server 2016.

The system is a 6 vcpu, 48 GB RAM instance of SQL Server 2016 SP1.

I've been running through a lot of checkdb/checktable scenarios lately, and I've been trying to keep up with blogging whenever I find something interesting.

This particular case is interesting because it defies expectations.

Here's my test, including instrumentation. The test subject is that lonely checktable command nestled in there :-) I use tablock because without it (or an explicit snapshot), each run of checktable creates a new internal snapshot, and I wouldn't be able to compare warm vs cold database cache.

So... it's pretty verbose, sorry about that. But it stuffs @@io_busy, virtual file stats, wait stats, spinlock stats and latch stats into temp tables before and after the checktable so they can be compared. I've got the dbcc dropcleanbuffers in there because I ran this scenario many, many, many times to verify I wasn't being thrown off by something. But my results were stable and reproducible.

So I ran the test below with cold database cache. (My database was quiet; you might have to issue a checkpoint before the dropcleanbuffers to ensure the cache really is cold if issuing checktable against a table that's recently been edited.)
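A minimal sketch of that cold-cache preparation (the database name here is hypothetical; the big script below keeps the dropcleanbuffers commented out so the same script serves both runs):

```sql
-- Sketch: make sure the cache is truly cold before the cold-cache run.
-- CHECKPOINT flushes dirty pages so DBCC DROPCLEANBUFFERS can evict them;
-- dropcleanbuffers only drops *clean* buffers.
USE sasq_db;            -- hypothetical database name
CHECKPOINT;
DBCC DROPCLEANBUFFERS;
```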


/* set nocount on;
   dbcc dropcleanbuffers;
*/
DECLARE @iobusy BIGINT;
DROP TABLE IF EXISTS #wstats_1;
DROP TABLE IF EXISTS #spins_1;
DROP TABLE IF EXISTS #latch_1;
DROP TABLE IF EXISTS #wstats_2;
DROP TABLE IF EXISTS #spins_2;
DROP TABLE IF EXISTS #latch_2;
DROP TABLE IF EXISTS #virtfile_1;
DROP TABLE IF EXISTS #virtfile_2;
DROP TABLE IF EXISTS #io_busy;
SELECT * into #wstats_1
FROM sys.dm_exec_session_wait_stats where session_id = @@spid;
SELECT * into #spins_1
FROM sys.dm_os_spinlock_stats where spins > 0;
SELECT * into #latch_1
FROM sys.dm_os_latch_stats ls where ls.waiting_requests_count > 0;
SELECT [file_id], num_of_reads, num_of_bytes_read, io_stall_read_ms INTO #virtfile_1
FROM sys.dm_io_virtual_file_stats(db_id(),null);
set @iobusy = @@timeticks * @@io_busy;
set statistics time, IO on;
dbcc checktable ('sasq_list') with tablock;
set statistics time, IO off;
SELECT @@timeticks * @@io_busy - @iobusy AS io_busy_microseconds INTO #io_busy;
SELECT [file_id], num_of_reads, num_of_bytes_read, io_stall_read_ms INTO #virtfile_2
FROM sys.dm_io_virtual_file_stats(db_id(),null);
SELECT * into #latch_2
FROM sys.dm_os_latch_stats ls where ls.waiting_requests_count > 0;
SELECT * into #spins_2
FROM sys.dm_os_spinlock_stats where spins > 0;
SELECT * into #wstats_2
FROM sys.dm_exec_session_wait_stats where session_id = @@spid;

OK.  Well - here's the output from the checktable and statistics time.


I've got all those temp tables for instrumentation - let's cash the check 😀


SELECT * FROM #io_busy;

SELECT vf2.[file_id], vf2.num_of_reads - vf1.num_of_reads AS num_of_reads, 
vf2.num_of_bytes_read - vf1.num_of_bytes_read AS num_of_bytes_read, 
vf2.io_stall_read_ms - vf1.io_stall_read_ms AS io_stall_read_ms
FROM #virtfile_1 vf1
JOIN #virtfile_2 vf2 ON vf1.[file_id] = vf2.[file_id]
WHERE vf2.num_of_reads > vf1.num_of_reads;

select ws2.wait_type, ws2.waiting_tasks_count - ws1.waiting_tasks_count as wait_count,
ws2.wait_time_ms - ws1.wait_time_ms wait_ms, ws2.signal_wait_time_ms - ws1.signal_wait_time_ms signal_ms
from #wstats_2 ws2 join #wstats_1 ws1 on ws1.wait_type = ws2.wait_type
where ws2.waiting_tasks_count - ws1.waiting_tasks_count > 160;

select ws2.name, ws2.spins - ws1.spins as spin_count,
ws2.collisions - ws1.collisions collision_count
from #spins_2 ws2 join #spins_1 ws1 on ws1.name = ws2.name
where ws2.collisions - ws1.collisions > 10;

select ws2.latch_class, ws2.waiting_requests_count - ws1.waiting_requests_count as latch_count,
ws2.wait_time_ms - ws1.wait_time_ms latch_ms
from #latch_2 ws2 join #latch_1 ws1 on ws1.latch_class = ws2.latch_class
where ws2.waiting_requests_count - ws1.waiting_requests_count > 0;

The results...


All right.  Let's test again - this time without dropping clean buffers, so we can compare behavior with warm cache.

Here's the output from checktable and statistics time.


Huh.  Almost 11.5 seconds of elapsed time with warm cache, vs just over 6.2 seconds of elapsed time with cold cache.  That's unexpected.

Let's see what the temp tables caught - maybe they'll shed some light.

Well.  No reads according to virtual file stats - that's what we expected.  No latch waits.  (That varied a little among test runs, but it was never a significant amount of latch waits.)  The spinlocks are uninteresting... I was kind of hoping to find a spinlock-related reason for the warm cache checktable taking longer.  No dice.



So what's going on?  Well - first of all, take a look at CPU ms between the cold cache and warm cache runs.  34534 for cold cache, 33907 for warm cache.  Even though the cold cache run finished substantially faster, it took 627 ms more CPU time to complete.  That's less than 2% variance.  But it's stable in the context of this test.  I ran it over and over, and although the times shifted around a bit, the cold cache run was always faster but required around 500 more CPU ms to complete.

On the one hand that makes sense: the cold cache run issues 8103 read IOs according to virtual file stats, and @@io_busy reports 125 milliseconds of CPU time just to issue those IO commands.

 But it still finished faster!

As far as I can tell, the secret is in the wait stats - cxpacket in particular.  Below are the cold cache wait stats again.



And below are the warm cache wait stats again.  The warm cache checktable - which I would have expected to finish faster - somehow incurred almost 3 times as much cxpacket wait time as the cold cache checktable - and over 3 times as much signal wait time for the cxpacket wait type.


The warm cache checktable had less work to do than the cold cache checktable.  And, indeed, it did less work from an accumulated CPU time standpoint.  But somehow, it didn't divvy up its work as well among the parallel workers.  Due to a significant increase in cxpacket wait time and cxpacket signal wait time over the cold cache operation, it finished in almost twice the time!

This is one reason it's extremely important to look not just at work, not just at work + waits, and not even just at work + waits + management.  Appropriate scheduling and distribution of work among resources is also important to keep everyone happy when they want extreme performance.



Tuesday, April 11, 2017

#SQLServer: How much of that free memory came from checkdb completing and releasing its memory grant? And what about tempdb?

Arun Sirpal (@blobeater1 on Twitter) asked a great question today (11 April 2017) about the impact of memory grant release on free memory after checkdb completes.

I love questions!  It means someone was really paying attention!

Here's the relevant blog post about the appearance of a lot of free memory when a checkdb (without tablock) completes.

Where'd all this #SQLServer Free Memory come from??!?
http://sql-sasquatch.blogspot.com/2017/04/whered-all-this-sqlserver-free-memory.html

So - what contribution to free memory on completion of checkdb is made by releasing the memory grant?  YMMV, but for my 3.35 TB database... almost none.  Here's the thing: checkdb doesn't get a single memory grant; rather, it gets memory grants at multiple points throughout its execution.  And with each of those memory grants, like other queries, it may or may not use the entire grant.  So the contribution of releasing the memory grant to free memory depends on how much used memory grant checkdb releases on completion.  That's kind of confusing to describe - maybe some graphs will help!
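Those grants can be watched as they come and go.  Here's a minimal sketch (not from the original test harness) of polling outstanding grants from another session while checkdb runs:

```sql
-- Sketch: poll current memory grants while checkdb runs in another session.
-- used_memory_kb is what has been "stolen" against the grant so far;
-- granted_memory_kb minus used_memory_kb is the still-reserved promise.
SELECT session_id,
       requested_memory_kb,
       granted_memory_kb,
       used_memory_kb,
       max_used_memory_kb
FROM sys.dm_exec_query_memory_grants
WHERE granted_memory_kb > 0;
```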

This is the same system, same checkdb as previous post.  The CPU utilization in blue clearly delineates when checkdb was active, since it was run in isolation.

As checkdb progressed, it requested varying memory grants - granted memory in the dark brown/orange below.  In order for this not to be too long of a blog post, I layered "reserved memory" in a lighter orange in front of granted memory.  So - the exposed dark brown granted memory is the workspace memory at any given time that is actually used.  Recall that when memory is granted it's just a promise - the query "steals" memory against the grant over time as needed.


So if we subtract reserved from granted... we'll be left with query memory used by checkdb.  Here's what that looks like.
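As an aside, the same subtraction can be done from the DMV side of those perfmon counters - a sketch, assuming the instance-level memory manager counters:

```sql
-- Sketch: used query memory = granted workspace memory - reserved memory,
-- computed from the memory manager counters in sys.dm_os_performance_counters.
SELECT MAX(CASE WHEN counter_name = 'Granted Workspace Memory (KB)'
                THEN cntr_value END)
     - MAX(CASE WHEN counter_name = 'Reserved Server Memory (KB)'
                THEN cntr_value END) AS used_query_memory_kb
FROM sys.dm_os_performance_counters
WHERE counter_name IN ('Granted Workspace Memory (KB)',
                       'Reserved Server Memory (KB)');
```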


This is fun!  Let's keep going! I said that over time, as the query needs more of its grant for sort/hash, it "steals" against the grant.  Hmmm... am I implying that used query memory is a subset of "stolen memory", and that only used query memory (excluding reserved memory from the grant) is within "stolen memory"?  Yessss!!  Yes, I am.

Check out used query memory layered in front of stolen memory in the graph below.  Stolen memory is greater than used query memory - but by a relatively fixed amount, and the greater stolen memory is during this isolated checkdb, the greater the share of it that is used query memory.

Interesting, no?  Here's the relevance to the initial question: look how small the used query memory is as checkdb nears completion and CPU utilization drops to near zero.  It's pretty doggone low.  That was the contribution to free memory when the last grant of checkdb was released.  Not much.  But, as I said, YMMV.  Maybe the allocation order of HoBTs in your database is different, and a peak of used query memory like that between 20:50 and 21:50 occurs right before completion!  In that case, releasing the grant would lead to a pretty big contribution to free memory.



OK.  Now for a lagniappe :-)

See that dip in CPU utilization at 16:50, and at 17:50?  Let's tie that in to checkdb work via tempdb.

In perfmon, we've got "SQLServer:Databases(tempdb)\Data File(s) Size (KB)" and "SQLServer:Transactions\Free Space in tempdb (KB)". Let's subtract them to get tempdb used!

And - here's my trick that I think you'll see first from me: let's stack tempdb used on top of query memory used!!  Conceptually, it's easiest to understand when a single session is running in isolation, and when it's a session that isn't explicitly using temp tables, row versioning, etc.  In that case, tempdb use comes from spill: sort/hash has consumed the memory grant, so tempdb is leveraged to complete the remaining sort/hash work.
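The subtraction itself is easy to do from T-SQL as well as from perfmon - a sketch against the same two counters:

```sql
-- Sketch: tempdb used = tempdb data file size - free space in tempdb,
-- using the DMV exposure of the two perfmon counters named in the text.
SELECT MAX(CASE WHEN counter_name = 'Data File(s) Size (KB)'
                 AND instance_name = 'tempdb'
                THEN cntr_value END)
     - MAX(CASE WHEN counter_name = 'Free Space in tempdb (KB)'
                THEN cntr_value END) AS tempdb_used_kb
FROM sys.dm_os_performance_counters
WHERE counter_name IN ('Data File(s) Size (KB)',
                       'Free Space in tempdb (KB)');
```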

Here's what that looks like.  How nice for me; it just happened to work out that there were nice peaks in tempdb use at 16:50 and 17:50 - the same times that CPU utilization dipped.  So tempdb-related waits are prime suspects. :-)


 Ciao for now!!




Monday, April 10, 2017

Where'd all this #SQLServer Free Memory come from??!?

My blog post on memory accounting with perfmon has been pretty popular :-) You can look there for validation of the idea that "database cache + free + stolen = total server memory" for SQL Server.

SQL Server Memory Accounting with Perfmon
http://sql-sasquatch.blogspot.com/2016/10/sql-server-memory-accounting-with.html
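That identity can be spot-checked without perfmon, too - a sketch against the memory manager counters:

```sql
-- Sketch: database cache + free + stolen should equal total server memory.
-- Both columns below should match (within the skew of a single DMV read).
SELECT MAX(CASE WHEN counter_name = 'Database Cache Memory (KB)'
                THEN cntr_value END)
     + MAX(CASE WHEN counter_name = 'Free Memory (KB)'
                THEN cntr_value END)
     + MAX(CASE WHEN counter_name = 'Stolen Server Memory (KB)'
                THEN cntr_value END) AS accounted_kb,
       MAX(CASE WHEN counter_name = 'Total Server Memory (KB)'
                THEN cntr_value END) AS total_server_memory_kb
FROM sys.dm_os_performance_counters
WHERE [object_name] LIKE '%Memory Manager%';
```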

Today I was looking at perfmon from a system and I saw a huge increase in "SQLServer:Memory Manager\Free Memory (KB)" appear in the middle of a key workload.  How'd that happen?

Unfortunately, the perfmon data I was reviewing was missing one of the counters I collect routinely now: "\SQLServer:Databases(_Total)\DBCC Logical Scan Bytes/sec"

So... right now I'm not sure where all the free memory came from on that system.  But I've got a hunch, so I asked to get that counter included in the next perfmon collection I get from them.

Below is a test I ran with a full checkdb on my 3.35TB database.  The "(logical scan bytes +1)/sec Logarithmic" measure that is graphed really is "\SQLServer:Databases(_Total)\DBCC Logical Scan Bytes/sec" + 1, with an Excel logarithmic scale :-)  It's required to be positive and nonzero to preclude errors.  And I just wanted it there as a clear indicator that checkdb was done.  So I started the SQL Server service for the instance and started checkdb.  An internal snapshot was created, and "Total server memory" kept growing to reach "max server memory".  All of the database cache was for the internal snapshot (other than checkdb, only some monitors were running in the instance).

When checkdb was complete - blammo!!  The internal snapshot db was dropped, and all of the database cache associated with the dropped db was invalidated - resulting in a humongous increase in free memory within SQL Server "total server memory".




Maybe that's what happened in that production system.  I should be able to rule that in or out by watching free memory for a similar increase and checking "\SQLServer:Databases(_Total)\DBCC Logical Scan Bytes/sec".

Wednesday, March 8, 2017

SQL Server 2016 SP1 EE checkdb with physical_only: Trace flags 2549 & 2562

First things first: I don't think trace flag 2549 does anything in SQL Server 2016 SP1 Enterprise Edition.

Last time I said that, I was wrong 😀. I was tricked by trace flag 2549 needing to be enabled globally in order to change behavior - I was testing with it enabled at session level.  But... after I opened a ticket with Microsoft, the KB article was updated to note that T2549 must be enabled globally 😀.

*This* time I made sure to enable T2549 globally.  But I didn't see any difference in behavior, even though all 8 of my database filegroup data files were in directories under mount points of a single drive letter.  Checkdb reads were *really* aggressive either way.  Like - more aggressive than previously, when T2549 did make a difference.  My execution times were within 5% of each other.  I'll scrutinize the perfmon and session waits more later.  But for now, I'm assuming that behavior and expected performance are the same with or without T2549.
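For reference, enabling a trace flag globally vs at session scope looks like this (the -1 argument is what makes it global; the same pattern applies to T2562 below):

```sql
-- Sketch: enable T2549 globally (-1), not just for the current session.
DBCC TRACEON (2549, -1);
DBCC TRACESTATUS (-1);   -- confirm which flags are enabled globally
-- ...run checkdb with physical_only...
DBCC TRACEOFF (2549, -1);
```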

Now on to the good stuff.  As I mentioned, checkdb with physical_only reads are quite aggressive in SQL Server 2016 SP1.  The system I tested on is an 8 vcpu vm with 64 GB RAM.  The storage is from a hybrid SAN array, although all of the disk devices are SSDs.  Aggregate database data footprint - across 8 files in a single filegroup - is 3335GB.  (Sum of fileproperty(<name>,'spaceused') across the 8 files).

Let's see some graphs!!!

Look at that read throughput in a vanilla checkdb with physical_only!  These perfmon captures are using 1 second intervals - nearly saturating bandwidth for a considerable amount of time during its execution.



Hard to see because of the scale, but there's a rise and fall in memory grant usage corresponding to the "batches" that checkdb is using.  Way less than 10 GB of tempdb is used at peak, for less than 6 minutes out of the roughly 60 minute execution time.

Let's take a look at CPU utilization.  Ohhh yeahh!  Most of the time there were 16 checkdb parallel threads in use for this 8 vcpu VM.  The only break in fairly heavy CPU utilization came during the processing of that large batch in the middle.


OK.  Recall that trace flag 2562 instructs checkdb to use a single large "batch" for all of its checking. This can lead to a significant speedup - especially on spinning disk.  The cost is more use of tempdb.  The next test - checkdb with physical_only + T2562 - demonstrates both the benefit and cost of this behavior.

Hooo doggie!!  It only took about 48 minutes and 30 seconds now!  Look at that fascinating step function of tempdb use!!  Notice how the read bytes/sec plummet briefly before each new step of tempdb use.  If only I had enough time to dig into the mechanics of that!  Maybe another day...


All right - let's take a look at CPU utilization.  CPU also drops at the beginning of each new step of tempdb use.  How tantalizing!


But there you have it.  Pretty aggressive checkdb with physical_only in SQL Server 2016 SP1.  Even a little bit more aggressive with trace flag 2562 enabled.  (I enabled it globally for these tests.)

When planning for checkdb with physical_only in SQL Server 2016 SP1, you may want to stick it into its own Resource Governor workload group so you can independently control DOP, maybe tinker with its max memory grant, and maybe even use IOPs governance.  I hope to return to those matters sometime soon here.  But first gotta also test T2528 (disable parallel checks) and full checkdb.
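A rough sketch of what that Resource Governor fencing could look like (all object and login names here are hypothetical, and the numbers are placeholders to tune, not recommendations):

```sql
-- Sketch: fence checkdb into its own pool/group so DOP, grant size,
-- and IOPs can be controlled independently of the main workload.
CREATE RESOURCE POOL checkdb_pool
    WITH (MAX_IOPS_PER_VOLUME = 5000);        -- optional IOPs governance
CREATE WORKLOAD GROUP checkdb_wg
    WITH (MAX_DOP = 4,
          REQUEST_MAX_MEMORY_GRANT_PERCENT = 25)
    USING checkdb_pool;
GO
-- Classifier routes a dedicated login (hypothetical) into the group.
CREATE FUNCTION dbo.rg_classifier() RETURNS sysname
WITH SCHEMABINDING
AS
BEGIN
    RETURN (CASE WHEN SUSER_SNAME() = N'checkdb_login'
                 THEN N'checkdb_wg'
                 ELSE N'default' END);
END;
GO
ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.rg_classifier);
ALTER RESOURCE GOVERNOR RECONFIGURE;
```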

I should note that these checkdb tests were run in isolation.  If there is another concurrent workload on the system, not only will it be affected by checkdb, but that workload will impact checkdb performance.  Even more so if that concurrent workload contains DML (because the internal snapshot for checkdb must be maintained).

*****2017 03 08 addendum *****
I wanted to further detail disk IO activity with graphs.  Since the end-point devices are all SSDs, there is less variability in latency and read bytes/sec than would be expected with 10k or 15k fibre channel HDDs.

That's pretty good read throughput, at pretty low latency.


Pretty spikey read IOPs from second to second.


Now let's look at the checkdb run with T2562 enabled.  First at read bytes/sec against sec/read service time.

Throughput is even higher - from peaks of 1200 MB/sec to peaks of 1600 MB/sec.  That additional throughput comes at a cost, though - read service time has risen to about 8 times its previous values.

Let's look at read IOPs vs read service time for the checkdb with T2562.


IOPs have come down even though throughput went up.  So the average IO size is higher.  But the jump in service time is much higher than predicted by the increase in average read IO size alone.

Rather, saturation of bandwidth must account for some of the increase in service time.  In general, at saturation performance behavior is going to deviate from expectation based on non-saturated behavior.

This brings up an important point about decisions in performance engineering and planning for systems.  If the main goal is completing checkdb as quickly as possible and the 10-12 minutes saved execution time is valuable enough - I'd definitely use T2562 on this system.

But... if there was a concurrent query workload that generated 100 mb/s of read load?  I'd strongly consider running checkdb without T2562.  Even though it would take a bit longer, there's room in there for additional resource utilization before hitting saturation.

Anyway... ciao for now!