Thursday, April 27, 2017

SQL Server 2016 checktable - sometimes takes longer with warm cache than cold?!?

First a disclaimer - I've never run these tests on any version prior to SQL Server 2016.  So I am not at all claiming this is a new phenomenon on SQL Server 2016.  Its just an interesting behavior that surfaced while I am testing SQL Server 2016.

The system is a 6 vcpu, 48 GB RAM instance of SQL Server 2016 SP1.

I've been running through a lot of checkdb/checktable scenarios lately, been trying to keep up with blogging whenever I find something interesting.

This particular case is interesting because it defies expectations.

Here's my test, including instrumentation. The test subject is that lonely checktable command nestled in there :-) I use tablock because if not tablock or in an explicit snapshot, each time checktable is run a new internal snapshot will be created and won't be able to compare warm vs cold database cache.

So... its pretty verbose, sorry about that. But stuffing @@io_busy, virtual file stats, wait stats, spinlock stats and latch stats into temp tables before and after so they can be compared. I've got the dbcc dropcleanbuffers in there because I ran this scenario many, many, many times to verify I wasn't being thrown off by something. But my results were stable and reproducible.

So I ran the test below with cold database cache. (My database was still; might have to issue a checkpoint before the dropcleanbuffers to ensure the cache really is cold if issuing checktable against a table that's recently been edited.)


/* set nocount on;
   dbcc dropcleanbuffers;
*/
DECLARE @iobusy BIGINT;
DROP TABLE IF EXISTS #wstats_1;
DROP TABLE IF EXISTS #spins_1;
DROP TABLE IF EXISTS #latch_1;
DROP TABLE IF EXISTS #wstats_2;
DROP TABLE IF EXISTS #spins_2;
DROP TABLE IF EXISTS #latch_2;
DROP TABLE IF EXISTS #virtfile_1;
DROP TABLE IF EXISTS #virtfile_2;
DROP TABLE IF EXISTS #io_busy;
SELECT * into #wstats_1
FROM sys.dm_exec_session_wait_stats where session_id = @@spid;
SELECT * into #spins_1
FROM sys.dm_os_spinlock_stats where spins > 0;
SELECT * into #latch_1
FROM sys.dm_os_latch_stats ls where ls.waiting_requests_count > 0;
SELECT [file_id], num_of_reads, num_of_bytes_read, io_stall_read_ms INTO #virtfile_1
FROM sys.dm_io_virtual_file_stats(db_id(),null);
set @iobusy = @@timeticks * @@io_busy;
set statistics time, IO on;
dbcc checktable ('sasq_list') with tablock;
set statistics time, IO off;
SELECT @@timeticks * @@io_busy - @iobusy AS io_busy_microseconds INTO #io_busy;
SELECT [file_id], num_of_reads, num_of_bytes_read, io_stall_read_ms INTO #virtfile_2
FROM sys.dm_io_virtual_file_stats(db_id(),null);
SELECT * into #latch_2
FROM sys.dm_os_latch_stats ls where ls.waiting_requests_count > 0;
SELECT * into #spins_2
FROM sys.dm_os_spinlock_stats where spins > 0;
SELECT * into #wstats_2
FROM sys.dm_exec_session_wait_stats where session_id = @@spid;

OK.  Well - here's the output from the checktable and statistics time.


I've got all those temp tables for instrumentation - let's cash the check 😀


SELECT * FROM #io_busy;

SELECT vf2.[file_id], vf2.num_of_reads - vf1.num_of_reads AS num_of_reads, 
vf2.num_of_bytes_read - vf1.num_of_bytes_read AS num_of_bytes_read, 
vf2.io_stall_read_ms - vf1.io_stall_read_ms AS io_stall_read_ms
FROM #virtfile_1 vf1
JOIN #virtfile_2 vf2 ON vf1.[file_id] = vf2.[file_id]
WHERE vf2.num_of_reads > vf1.num_of_reads;

select ws2.wait_type, ws2.waiting_tasks_count - ws1.waiting_tasks_count as wait_count,
ws2.wait_time_ms - ws1.wait_time_ms wait_ms, ws2.signal_wait_time_ms - ws1.signal_wait_time_ms signal_ms
from #wstats_2 ws2 join #wstats_1 ws1 on ws1.wait_type = ws2.wait_type
where ws2.waiting_tasks_count - ws1.waiting_tasks_count > 160;

select ws2.name, ws2.spins - ws1.spins as spin_count,
ws2.collisions - ws1.collisions collision_count
from #spins_2 ws2 join #spins_1 ws1 on ws1.name = ws2.name
where ws2.collisions - ws1.collisions > 10;

select ws2.latch_class, ws2.waiting_requests_count - ws1.waiting_requests_count as latch_count,
ws2.wait_time_ms - ws1.wait_time_ms latch_ms
from #latch_2 ws2 join #latch_1 ws1 on ws1.latch_class = ws2.latch_class
where ws2.waiting_requests_count - ws1.waiting_requests_count > 0;

The results...


All right.  Lets test again - this time without dropping clean buffers, so we can compare behavior with warm cache.

Here's the output from checktable and statistics time.


Huh.  Almost 11.5 seconds of elapsed time with warm cache, vs just over 6.2 seconds of elapsed time with cold cache.  That's unexpected.

Let's see what the temp tables caught - maybe they'll shed some light.

Well.  No reads according to virtual file stats - that's what we expected.  No latch waits.  (That varied a little among test runs, but it was never a significant amount of latch waits.)  The spinlocks are uninteresting... I was kind of hoping to find a spinlock-related reason for the warm cache checktable taking longer.  No dice.



So what's going on?  Well - first of all, take a look at CPU ms between the cold cache and warm cache runs.  34534 for cold cache, 33907 for warm cache.  Even though the cold cache run finished substantially faster, it took 627 ms more CPU time to complete.  That's less than 2% variance.  But its stable in the context of this test.  I ran it over and over and although the times shifted around a bit, the cold cache run was always faster but required around 500 more CPU ms to complete.

On the one hand that makes sense: the cold cache run issues 8103 read IOs according to virtual file stats and @@io_busy says 125 milliseconds of CPU time just to issue those IO commands.

 But it still finished faster!

As far as I can tell, the secret is in the wait stats - cxpacket in particular.  Below are the cold cache wait stats again.



And below are the warm cache wait stats again.  The warm cache checktable - which I would have expected to finish faster - somehow incurred almost 3 times as much cxpacket wait time as the cold cache checktable - and over 3 times as much signal wait time for the cxpacket wait type.


The warm cache checktable had less work to do than the cold cache checktable.  And, indeed, it did less work from an accumulated CPU time standpoint.  But somehow, it didn't divvy up its work as well among the parallel workers.  Due to a significant increase in cxpacket wait time and cxpacket signal wait time over the cold cache operation, it finished in almost twice the time!

This is one reason its extremely important to look not just at work, and not just at work + waits... not even at work + waits + management.  Appropriate scheduling and distribution of work among resources is also important to keep everyone happy when they want extreme performance.



Wednesday, April 26, 2017

SQL Server - DBCC Clonedatabase vs Scripting Objects

Although the 'dbcc clonedatabase' command is new, the idea of clone databases as a 'dataless' database but including table & index definitions as well as optimizer statistics is not new.

Here's a blog post from May of 2010 describing the longstanding strategy of using the scripting functionality from SSMS to create a statistics clone database in SQL Server.

How to create a SQL Server Clone Statistics Only Database
https://www.mssqltips.com/sqlservertip/1999/how-to-create-a-sql-server-clone-statistics-only-database/

I wanted to put this scripting option in SQL Server/SSMS 2016 head-to-head against the clonedatabase command to see how everything turns out.  Why not do that in front of a crowd of 250 people?  (Well - not really.  I did run the clonedatabase in front of everybody though.)

Here's what was in the source database.  The scripting option won't let me script out the Query Store, of course.  But clonedatabase command ill include it unless the option to exclude is specified.



So, to make sure it was apples-to-apples comparison between the 'clone' methods, I toggled a few of the advanced scripting options.



Then I kicked it off. Thank goodness it was several hours before my presentation. I bet a dollar with a colleague over how long it would take to complete. He guessed 56 minutes. I thought 1 hour 25 minutes. Thank goodness it was "Price is Right" rules, cuz we were both way off. It took 2 hours and 2 minutes. Now I'm $1.00 richer.😄

The end result of the scripting exercise was this file clocking in at over 500 mb. I haven't loaded it into a target database yet - I'll report how long it too and resulting database size once I have.



Then... during my presentation late in the day, in front of about 250 people, I launched the dbcc clonedatabase command. It took 37 seconds. Yayy!! 😁

The end result? 1096 mb of transaction log, and 896 mb of row files.  (Remember that 666 mb of the data in the row files is Query Store.)



Let's take a look at the inventory of the Query Store. There's a lot of stuff in there.



There are some concerns with clonedatabase. Sensitive data may be included in Query Store query plans/text if sensitive values are used as literals in queries. Sensitive data may be included in optimizer statistics on sensitive columns. Intellectual property can be a concern in the stored procedures or functions included. Intellectual property may even be a concern for schema design.

I will point out that the only new concern (that I'm aware of) with the dbcc clonedatabase command, versus the scripting option is the possibility of sensitive data included in Query Store query plans and query text.

There's a very good post below for security/sensitivity considerations.

Understanding data security in cloned databases created using DBCC CLONEDATABASE
https://blogs.msdn.microsoft.com/sql_server_team/understanding-data-security-in-cloned-databases-created-using-dbcc-clonedatabase/

In spite of the security/sensitivity considerations, I'm really excited about the clonedatabase command.  Not just because of the incredible speed bump compared to scripting.  Also because including Query Store in the clone is a natural fit.

Even if security/sensitivity issues complicate sharing a clone database with Microsoft or other off-site vendors, the value of using a clone database for investigating query shape, cardinality estimates, or even query store investigation decoupled from a production database should be high.

Monday, April 24, 2017

AIX: from process ID, to physical/virtual memory used, to TCP connection (Half-baked draft)

# ps -ef | grep -i java
xlisten8  8388648        1   1   Apr 11      - 24:58 java -cp sasqconfig.jar:sasqjava.jar:sasqjdbc4.jar -Xms256M -Xmx256M -jar JLC.jar
xlisten8  9633838        1   1   Apr 11      - 25:44 java -cp sasqconfig.jar:sasqjava.jar:sasqjdbc4.jar -Xms256M -Xmx256M -jar JLC.jar
xlisten8  9961686        1   1   Apr 11      - 32:53 java -cp sasqconfig.jar:sasqjava.jar:sasqjdbc4.jar -Xms256M -Xmx256M -jar JLC.jar
xlisten8 10682506        1   0   Apr 11      - 15:14 java -cp sasqconfig.jar:sasqjava.jar:sasqjdbc4.jar -Xms256M -Xmx256M -jar JLC.jar
xlisten8 10748116        1   0   Apr 11      - 28:20 java -cp sasqconfig.jar:sasqjava.jar:sasqjdbc4.jar -Xms256M -Xmx256M -jar JLC.jar
xlisten8 11600048        1   1   Apr 11      - 62:18 java -cp sasqconfig.jar:sasqjava.jar:sasqjdbc4.jar -Xms256M -Xmx256M -jar JLC.jar
xlisten8 13041852        1   1   Apr 11      - 29:10 java -cp sasqconfig.jar:sasqjava.jar:sasqjdbc4.jar -Xms256M -Xmx256M -jar JLC.jar
xlisten8 14745762        1   1   Apr 11      - 15:22 java -cp sasqconfig.jar:sasqjava.jar:sasqjdbc4.jar -Xms256M -Xmx256M -jar JLC.jar

# svmon -P -O summary=basic,unit=MB | head -n 4 ; svmon -P -O summary=basic,unit=MB | grep java
Unit: MB

-------------------------------------------------------------------------------
     Pid Command          Inuse      Pin     Pgsp  Virtual
11600048 java           1679.27   256.48   340.14  2041.33
 9961686 java           1651.38   256.41   330.53  2005.05
 9633838 java           1613.63   256.43   341.31  1978.62
13041852 java           1593.01   256.38   369.52  1998.18
 8388648 java           1585.89   256.39   343.54  1966.96
10748116 java           1430.42   256.29   209.10  1684.59
10682506 java            928.71   256.25   157.50  1130.74
14745762 java            928.30   256.25   158.09  1130.82

# netstat -Aan | grep ESTABLISHED | grep -v 127.0.0.1 | grep -v tcp4 | grep -v tcp6
f100050000c393b8 tcp        0      0  10.xxx.xx.xxx.48323   10.xxx.xx.xxx.50321    ESTABLISHED
f100050000f583b8 tcp        0      0  10.xxx.xx.xxx.48316   10.xxx.xx.xxx.50505    ESTABLISHED
f100050001f4e3b8 tcp        0      0  10.xxx.xx.xxx.48311   10.xxx.xx.xxx.50844    ESTABLISHED
f100050001ea6bb8 tcp        0      0  10.xxx.xx.xxx.48314   10.xxx.xx.xxx.56565    ESTABLISHED
f100050001c773b8 tcp        0      0  10.xxx.xx.xxx.48332   10.xxx.xx.xxx.50531    ESTABLISHED
f100050001ea2bb8 tcp        0      0  10.xxx.xx.xxx.48361   10.xxx.xx.xxx.50839    ESTABLISHED
f100050000f6b3b8 tcp        0      0  10.xxx.xx.xxx.50707   10.xxx.xx.xxx.50419    ESTABLISHED
f100050000de4bb8 tcp        0      0  10.xxx.xx.xxx.33387   10.xxx.xx.xxx.50505    ESTABLISHED

# rmsock f100050000c393b8 tcpcb
The socket 0xf100050000c39008 is being held by proccess 9633838 (java).
# rmsock f100050000f583b8 tcpcb
The socket 0xf100050000f58008 is being held by proccess 14745762 (java).
# rmsock f100050001f4e3b8 tcpcb
The socket 0xf100050001f4e008 is being held by proccess 9961686 (java).
# rmsock f100050001ea6bb8 tcpcb
The socket 0xf100050001ea6808 is being held by proccess 10748116 (java).
# rmsock f100050001c773b8 tcpcb
The socket 0xf100050001c77008 is being held by proccess 13041852 (java).
# rmsock f100050001ea2bb8 tcpcb
The socket 0xf100050001ea2808 is being held by proccess 11600048 (java).
# rmsock f100050000f6b3b8 tcpcb
The socket 0xf100050000f6b008 is being held by proccess 10682506 (java).
# rmsock f100050000de4bb8 tcpcb
The socket 0xf100050000de4808 is being held by proccess 8388648 (java).

Oracle 12cR2 vs SQL Server 2016: Histogram types



In Oracle, there are 4 histogram types:
frequency
height-balanced
top-n frequency
hybrid
Frequency histogram: for low numbers of distinct values. If ndv < bucket count limit, then each ndv gets a histogram bucket. If ndv > limit, frequency histogram is disqualified.
Height-balanced histogram: histogram buckets with approximately the same number of rows in each bucket
Top-n Frequency: lowest value, highest value, n-2 most popular values in between as bucket end-points.
Hybrid: Top-n Frequency not qualified because N end-point values do not cross threshold to qualify.

In SQL Server 2016 (as has been the case for several major versions), one histogram type is available.  Of the histogram types available in Oracle 12cR2, it is most similar to the hybrid histogram.  For a nullable column, a NULL histogram bucket will be available.  For non-null values, up to 200 additional buckets are available. The desired number of buckets cannot be specified. For values of particular significance, or in order to accommodate skew in data, filtered statistics can be created.  For filtered statistics to reliably be consulted by the optimizer, a query filter predicate must match the predicate of the filtered statistics. 

Resources

March 2017 Whitepaper (1.1 mb pdf) 
Understanding Optimizer Statistics with Oracle Database 12c Release 2

Three part Oracle Database Histogram series from Jonathan Lewis

Another three part Oracle Database Histogram series from Jonathan Lewis

Two part Oracle Database Histogram series from Mohamed Houri

Friday, April 21, 2017

SQL Server 2016: 1 GB memory grant to create an *empty* CCI?

**Update 22 October 2017**
Joe Obbish reaches the same conclusions in his thoroughly documented Stackexchange treatment.  Check it out - I usually like his stuff more than mine :-)

Why does CCI creation on an empty table request a huge memory grant?
https://dba.stackexchange.com/questions/172434/why-does-cci-creation-on-an-empty-table-request-a-huge-memory-grant

*****

CREATE TABLE dbo.MyFact_heap 
( [Key] bigint NOT NULL, 
[Type] nvarchar(50) NOT NULL, 
Id nvarchar(50) NOT NULL, 
AltKey_1 bigint NOT NULL DEFAULT -1, 
AltKey_2 bigint NOT NULL DEFAULT -1, 
AltKey_3 bigint NOT NULL DEFAULT -1, 
AltKey_4 bigint NOT NULL DEFAULT -1, 
AltKey_5 bigint NOT NULL DEFAULT -1, 
AltKey_6 bigint NOT NULL DEFAULT -1, 
AltKey_7 bigint NOT NULL DEFAULT -1, 
AltKey_8 bigint NOT NULL DEFAULT -1, 
AltKey_9 bigint NOT NULL DEFAULT -1, 
AltKey_10 bigint NOT NULL DEFAULT -1, 
String_1 nvarchar(300) NOT NULL DEFAULT N'*Default', 
Numeric_1 numeric(18,2) NULL, 
Date_1 date NULL,
TimeValue time NULL, 
Int_1 int NULL, 
String_2 nvarchar(300) NOT NULL DEFAULT N'*Default', 
Tiny_1 tinyint NULL DEFAULT 0, 
Tiny_2 tinyint NULL DEFAULT 0, 
Tiny_3 tinyint NULL, 
Tiny_4 tinyint NULL, 
Tiny_5 tinyint NULL, 
Tiny_6 tinyint NOT NULL DEFAULT 1 ) 


CREATE CLUSTERED COLUMNSTORE INDEX CCI__MyFact 
ON dbo.MyFact_heap

Requesting the actual plan for the CREATE, I find this in the plan XML.  1197064 as the SerialDesiredMemory, the RequiredMemory, the RequestedMemory.


That's a big grant for something that operates on 0 rows.  When creating many CCIs concurrently, that's a problem - queries can timeout.

Might have to use Resource Governor creatively to allow high concurrency CCI creation.

Grant request is smaller if only one column in the CCI.  So I guess could create a smaller CCI and add columns.  That's also kinda painful.  I'll ask on #sqlhelp, see if anyone has something clever...

*****
So far the only things to come up on Twitter were the min_grant_percent and max_grant_percent query hints introduced in SQL Server 2012 SP3.  But...

"Note These two new query memory grant options aren't available for index creation or rebuild."

https://support.microsoft.com/en-us/help/3107401/new-query-memory-grant-options-are-available-min-grant-percent-and-max-grant-percent-in-sql-server-2012

Later I found this while looking around.


"Starting with SQL Server 2016, you can create the table as a clustered columnstore index. It is no longer necessary to first create a rowstore table and then convert it to a clustered columnstore index."

Hmmm. Here's a great post from Pedro Lopes (Twitter @sqlpto), which includes some details for using Extended Events in memory grant investigation.




So far, it looks like creating the clustered columnstore inline with table creation, as below, avoids a memory grant.  (That makes sense.)  Whereas creating a clustered columnstore index on an empty rowstore table requires a memory grant - at the same size as up to 1 million rows for the table! (That's a surprise.)

If you'll be creating CCIs at fairly high concurrency, you may also want to consider creating the CCI inline with table create, as below, in order to avoid a potential bottleneck in memory grants.


CREATE TABLE dbo.MyFact 
( [Key] bigint NOT NULL, 
[Type] nvarchar(50) NOT NULL, 
Id nvarchar(50) NOT NULL, 
AltKey_1 bigint NOT NULL DEFAULT -1, 
AltKey_2 bigint NOT NULL DEFAULT -1, 
AltKey_3 bigint NOT NULL DEFAULT -1, 
AltKey_4 bigint NOT NULL DEFAULT -1, 
AltKey_5 bigint NOT NULL DEFAULT -1, 
AltKey_6 bigint NOT NULL DEFAULT -1, 
AltKey_7 bigint NOT NULL DEFAULT -1, 
AltKey_8 bigint NOT NULL DEFAULT -1, 
AltKey_9 bigint NOT NULL DEFAULT -1, 
AltKey_10 bigint NOT NULL DEFAULT -1, 
String_1 nvarchar(300) NOT NULL DEFAULT N'*Default', 
Numeric_1 numeric(18,2) NULL, 
Date_1 date NULL,
TimeValue time NULL, 
Int_1 int NULL, 
String_2 nvarchar(300) NOT NULL DEFAULT N'*Default', 
Tiny_1 tinyint NULL DEFAULT 0, 
Tiny_2 tinyint NULL DEFAULT 0, 
Tiny_3 tinyint NULL, 
Tiny_4 tinyint NULL, 
Tiny_5 tinyint NULL, 
Tiny_6 tinyint NOT NULL DEFAULT 1,
INDEX CCI_MyFact CLUSTERED COLUMNSTORE )