Tuesday, February 23, 2016

SQL Server: Who on Earth would rebuild an index at 0.44% fragmentation?!?

*Warning* I know, I know... columnstore is all the rage.  This post is strictly rowstore stuff, sorry. :-)
 
Within SQL Server, 8 KB database pages are organized into 8-page extents.
 
Here's a good introductory resource.
Understanding Pages and Extents
 
The idea of 'mixed extents' is an important one - especially because future mixed extent allocations in user hobts can be eliminated via trace flag 1118.  (Yes, tempdb avoids mixed extents by default in SQL Server 2016, but it seems that in user databases, if you want to avoid mixed extents, you've still gotta use T1118.) 
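For reference, a quick sketch of the two mechanisms (my addition, not part of the original post; the database name is hypothetical):

```sql
-- Sketch: two ways to avoid mixed extents.
-- Pre-2016: enable trace flag 1118 globally (or via a -T1118 startup parameter).
DBCC TRACEON (1118, -1);

-- SQL Server 2016: a per-database setting covers the same ground.
-- (Assumes a database named MyDatabase.)
ALTER DATABASE MyDatabase SET MIXED_PAGE_ALLOCATION OFF;
```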
 
The idea of fragmentation for databases is somewhat complex - and sometimes downright controversial.  In various contexts some or all of the following concepts may be involved:
1. Contiguity of database free space within data files.  This isn't discussed very often these days, but it comes up if you look back in the Oracle forums - especially in some of the flame-throwing fragmentation discussions of old.
2. Order of index pages compared to order of organizing index key.  This idea is extremely popular for SQL Server.
3. Compactness of contents within data pages.  This idea is starting to gain traction in SQL Server circles, largely due to work by Paul Randal and Jonathan Kehayias showing that low data density in hobt pages wastes database cache and, by extension, CPU. 
4. Contiguity of database pages within database files.  Assume that SQL Server data files are on Windows volumes 1:1 mapped to underlying LUNs (rather than Storage Spaces, dynamic disk striped volume, etc).  Maximum read size will be determined by interaction of contiguous hobt pages, maximum transfer size allowed to disk storage (usually a limit of the adapter driver) and various rules about read requests in the database engine.*
 
On a high IOPs system, or a system where readahead read bursts stress disk IO queue depth, increasing average read size and decreasing the number of necessary read operations can be an important part of tuning.
 
*For example, backup reads don't need to be concerned about boundaries between hobts, but when a query requests reads from table A index B, pages from table C won't be included in those reads - even when including them would lower the total read operation count.
 
**************
 
Let's take a look at an example.
 
I've got table TABLE1, with clustered index CI_TABLE1.  There's also the cleverly named nonclustered index, NC_TABLE1.
 
My database has 8 equally sized data files in the primary filegroup.  Each database file is on its own LUN; that's how I roll.
 
Well - let's look at the stalwart foundation of determining when to defragment an index.
 
SELECT s_tables.name                                               AS table_name,
       s_indexes.name                                              AS index_name,
       pstats.index_type_desc,
       pstats.index_level,
       pstats.index_depth,
       CAST(pstats.avg_fragmentation_in_percent AS NUMERIC(18,2))  AS avg_frag_pct,
       pstats.fragment_count,
       CAST(pstats.avg_fragment_size_in_pages AS NUMERIC(18,2))    AS avg_frag_pages,
       pstats.page_count,
       pstats.compressed_page_count,
       CAST(pstats.avg_page_space_used_in_percent AS NUMERIC(18,2)) AS avg_page_used_pct,
       pstats.record_count
FROM   sys.tables  AS s_tables
JOIN   sys.indexes AS s_indexes
  ON   s_indexes.[object_id] = s_tables.[object_id]
CROSS APPLY
       sys.dm_db_index_physical_stats
       (DB_ID(), s_tables.[object_id], s_indexes.index_id, NULL, 'DETAILED') AS pstats
WHERE  s_tables.name = 'TABLE1';
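For context, here's the conventional threshold logic that query usually feeds - a sketch of the common guidance (my addition, not a rule from the original post): reorganize somewhere above roughly 5% fragmentation, rebuild above roughly 30%.

```sql
-- Sketch of the conventional avg_fragmentation_in_percent thresholds
-- (common guidance, not a hard rule).
SELECT OBJECT_NAME(pstats.[object_id])        AS table_name,
       pstats.index_id,
       pstats.avg_fragmentation_in_percent,
       CASE WHEN pstats.avg_fragmentation_in_percent > 30 THEN 'REBUILD'
            WHEN pstats.avg_fragmentation_in_percent > 5  THEN 'REORGANIZE'
            ELSE 'LEAVE ALONE'
       END                                    AS suggested_action
FROM   sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS pstats
WHERE  pstats.index_level = 0;  -- leaf level only
```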
 
 
That looks pretty good, huh? 
 
Who on earth would rebuild an index with 0.44% fragmentation?
 
I'm your huckleberry.
 
I'm not the guy you want writing queries... but I will write queries in a pinch.  Usually it's because I can't find a query that does what I want.
 
So... I wrote this query.
 
DECLARE @table_name VARCHAR(256);
SET @table_name = 'TABLE1';

SELECT subQ3.contig_extent_run_len,
       SUM(subQ3.contig_page_run_count) AS contig_extent_run_count,
       subQ3.table_name,
       subQ3.index_name
FROM
      (SELECT subQ2.table_name,
              subQ2.index_name,
              subQ2.contig_page_run_len,
              COUNT(*) AS contig_page_run_count,
              CASE WHEN subQ2.contig_page_run_len % 8 = 0
                        THEN subQ2.contig_page_run_len / 8
                   ELSE 1 + (subQ2.contig_page_run_len / 8)
              END      AS contig_extent_run_len
       FROM
            (SELECT subQ1.table_name,
                    subQ1.index_name,
                    subQ1.fileId,
                    MIN(subQ1.page_id)             AS contig_page_run_beg,
                    MAX(subQ1.page_id)             AS contig_page_run_end,
                    1 + MAX(subQ1.page_id)
                      - MIN(subQ1.page_id)         AS contig_page_run_len
             FROM
                  (SELECT s_tables.name            AS table_name,
                          s_indexes.name           AS index_name,
                          allocated_page_file_id   AS fileId,
                          allocated_page_page_id   AS page_id,
                          -- 'gaps & islands': page_id minus its row number is
                          -- constant within each contiguous run of pages
                          allocated_page_page_id
                        - ROW_NUMBER() OVER(PARTITION BY s_indexes.index_id,
                                                         allocated_page_file_id
                                               ORDER BY allocated_page_page_id)
                                                   AS contig_cluster_grouper
                   FROM   sys.tables               AS s_tables
                   JOIN   sys.indexes              AS s_indexes
                     ON   s_indexes.[object_id] = s_tables.[object_id]
                   CROSS APPLY
                          sys.dm_db_database_page_allocations(DB_ID(),
                                                              s_tables.[object_id],
                                                              s_indexes.index_id,
                                                              NULL,
                                                              'LIMITED')
                   WHERE  s_tables.[object_id] = OBJECT_ID(@table_name)
                   ) subQ1
             GROUP BY subQ1.table_name,
                      subQ1.index_name,
                      subQ1.fileId,
                      subQ1.contig_cluster_grouper
             ) subQ2
       GROUP BY subQ2.table_name,
                subQ2.index_name,
                subQ2.contig_page_run_len
      ) subQ3
GROUP BY subQ3.table_name,
         subQ3.index_name,
         subQ3.contig_extent_run_len
ORDER BY subQ3.table_name,
         subQ3.index_name,
         subQ3.contig_extent_run_len;
 
 
 
What does that mean?  In this SQL Server instance, trace flag 1118 is preventing mixed extents.  The query uses a 'gaps & islands' approach to group index pages into runs of contiguous extents within each database file, then counts the number of times that runs of a single extent, two contiguous extents, and so on occur.  The NC_TABLE1 index is composed of 5 runs.  Two of them are single extents.  One run is 4 contiguous extents of the NC_TABLE1 index.  There's also a run of 7 extents, and one run of 23 extents.
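The grouping trick itself is compact enough to show on toy data (my illustration, hypothetical page ids - not part of the original post): subtracting a page's row number from its page_id yields a value that's constant within each contiguous run.

```sql
-- Toy illustration of the 'gaps & islands' grouping (hypothetical page ids).
-- page_id - ROW_NUMBER() is constant within each contiguous run of page ids.
WITH pages(page_id) AS
     (SELECT v FROM (VALUES (10),(11),(12),(20),(21),(30)) AS t(v))
SELECT MIN(page_id)                    AS run_beg,
       MAX(page_id)                    AS run_end,
       1 + MAX(page_id) - MIN(page_id) AS run_len
FROM  (SELECT page_id,
              page_id - ROW_NUMBER() OVER(ORDER BY page_id) AS grouper
       FROM pages) AS g
GROUP BY grouper;
-- Runs: 10-12 (len 3), 20-21 (len 2), 30-30 (len 1).
```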
 
Is that good?  NC_TABLE1 is 36 total extents - 288 eight-KB pages, 2.25 MB.  It can be read in 5 reads - one read for each contiguous run.
CI_TABLE1 is composed of 48 extents - 3 MB.  It can be read in 11 reads - again, one for each contiguous run.
 
The SQL Server instance has the -E startup option enabled.  Without that startup option, the proportional fill algorithm would distribute incoming data to the 8 data files with a potential smallest allocation of a single 64 KB extent before rotating round-robin to the additional files in the filegroup.  With the -E startup option, the smallest possible new allocation by proportional fill is sixty-four 64 KB extents - 4 MB.
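One way to check for -E without remoting to the box is to read the startup parameters from sys.dm_server_registry - a sketch (my addition; assumes permission to query that DMV):

```sql
-- Sketch: list SQL Server startup parameters; look for an SQLArg value of '-E'.
SELECT registry_key, value_name, value_data
FROM   sys.dm_server_registry
WHERE  registry_key LIKE N'%MSSQLServer\Parameters'
       AND value_name LIKE N'SQLArg%';
```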
 
That means if I can rebuild the indexes into new space at the tail of the files, the contiguity should be improved considerably.
 
Yeah, I tried it :-)
 
ALTER INDEX ALL ON TABLE1 REBUILD WITH
(ONLINE = ON,
 FILLFACTOR = 90,
 SORT_IN_TEMPDB = ON,
 MAXDOP = 1,
 DATA_COMPRESSION = PAGE);
 
Now what does sys.dm_db_index_physical_stats look like? Well - I lost a level in the nonclustered index.  Yeah, the fragmentation numbers look a little better - but they were already so good no one would bother with a rebuild.
 
 
 
And what do results to my ugly 'gaps and islands' query look like now? 
 
 
With -E, each pass of the proportional fill algorithm will send up to 64 extents of new allocations to each file.  There was no other DML activity on the server, and I ran the rebuild at DOP 1.  There actually are 3 contiguous runs of CI_TABLE1 at or approaching 64 extents.  NC_TABLE1 now has all of its data pages in a single contiguous run of 21 extents.
 
I've got more research to do, and more work to do on the contiguity query (it doesn't distinguish between reserved-but-unallocated pages and allocated pages, doesn't distinguish between data pages and index pointer pages, etc).
 
But I wanted to finally get this blog post out here - ideas I started working with in my "Hunch about SQL Server Fragmentation" series.  For very disk intensive workloads, there's an aspect of fragmentation missing from the threshold people use to decide when to rebuild/reorg.  And that same aspect of fragmentation is largely missing from the evaluation of effectiveness of rebuild/reorg.

Monday, February 22, 2016

SQL Server start time, wait stats reset

The waiting tasks count and accumulated wait time for each sys.dm_os_wait_stats wait_type continue to grow over time.  The values reset with a SQL Server restart or with an explicit command:
DBCC SQLPERF('sys.dm_os_wait_stats', CLEAR);
When an AlwaysOn Availability Group replica changes status, it also means a shift in the wait stats start time and the wait stats themselves (different SQL Server instances may have very different wait profiles).  
There are various ways to identify the SQL Server instance start time.  The query below retrieves data from sys.dm_os_sys_info and from tempdb - two common estimates of SQL Server instance start.  I've heard that sys.dm_os_sys_info isn't available in Azure - I'm personally too timid to be cloudin' right now :)  I added to the query an estimate I'm fond of - the (minimum) SQL Server transaction log writer start time.

SELECT subQ.currentTime,
       subQ.sys_info__start,
       subQ.Lgwr_start,
       subQ.tempdb_crdate,
       DATEADD(second,
               0 - subQ.sqltrace_wait_ms/1000,
               subQ.currentTime)              AS sqltrace_wait_est_start
FROM   (SELECT GETDATE()                      AS currentTime,
               (SELECT sqlserver_start_time
                FROM sys.dm_os_sys_info)      AS sys_info__start,
               (SELECT create_date
                FROM sys.databases
                WHERE name = 'tempdb')        AS tempdb_crdate,
               (SELECT MIN(start_time)
                FROM sys.dm_exec_requests
                WHERE command = 'LOG WRITER') AS Lgwr_start,
               ws.wait_type,
               ws.wait_time_ms                AS sqltrace_wait_ms
        FROM   sys.dm_os_wait_stats           AS ws
        WHERE  ws.wait_type = 'SQLTRACE_INCREMENTAL_FLUSH_SLEEP'
       ) AS subQ;


There's another estimate in there: sqltrace_wait_est_start.
A clever idea - find a wait type exclusive to a system process that spends almost all of its time sleeping.  The accumulated wait time for that wait type is then approximately the amount of time since wait stats were last reset - whether due to a SQL Server restart or due to an explicit command to clear the wait stats.
This is an idea I owe to the following blog post.
When was sys.dm_os_wait_stats last cleared?
Now - this doesn't directly help to identify a status change in an Availability Group.  But if you're grabbing wait stats at regular intervals, the estimated instance and wait stats start times can be retrieved, too.  Changes in these estimated start times can then be responded to accordingly when reviewing wait stats.
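A minimal sketch of that collection pattern (the table name and shape are my invention, not from the post):

```sql
-- Sketch: snapshot wait stats on a schedule.  A jump in the estimated
-- start time between snapshots flags a restart or an explicit clear.
-- The table wait_stats_history is hypothetical.
CREATE TABLE wait_stats_history
(capture_time        DATETIME     NOT NULL,
 wait_type           NVARCHAR(60) NOT NULL,
 waiting_tasks_count BIGINT       NOT NULL,
 wait_time_ms        BIGINT       NOT NULL);

INSERT INTO wait_stats_history
       (capture_time, wait_type, waiting_tasks_count, wait_time_ms)
SELECT GETDATE(), ws.wait_type, ws.waiting_tasks_count, ws.wait_time_ms
FROM   sys.dm_os_wait_stats AS ws;
```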

Sunday, February 21, 2016

Perfmon SQL Server Resource Pool Stats Read IO (ms) - Careful what you trust...

Though perfmon "\SQLServer:Resource Pool Stats(default)\Avg Disk Read IO (ms)" is a very convenient measure, I don't have much confidence in it right now.
 
Reviewing recent observations, this graph stood out to me.
 
 
I am typically very careful with database layout, to optimize for the high IOPs, high outstanding IO workloads I test.  A graph with no correlation between read IOPs and read service time is intriguing.
 
And I admit I spent too much time investigating exotic theories before I returned to a fundamental question: how confident am I in what is measured?
 
In this VM (the only VM active on the physical server), SQL Server is the sole significant resource consumer.  In this workload, the default resource pool was the sole significant busy pool.
 
So I can compare measurements of the SQL Server Resource Governor pool to server-wide measurements.
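Both sides of that comparison can also be pulled from inside SQL Server.  A sketch (my query, not the perfmon capture used for the graphs) comparing average read latency server-wide from sys.dm_io_virtual_file_stats against the default pool's own Resource Governor counters:

```sql
-- Sketch: two views of read latency.  Both are cumulative-since-start values,
-- so deltas between samples are needed for a rate; this shows the raw counters.
SELECT SUM(vfs.io_stall_read_ms) * 1.0
       / NULLIF(SUM(vfs.num_of_reads), 0)      AS server_wide_avg_read_ms
FROM   sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs;

SELECT rp.name,
       rp.read_io_stall_total_ms * 1.0
       / NULLIF(rp.read_io_completed_total, 0) AS pool_avg_read_ms
FROM   sys.dm_resource_governor_resource_pools AS rp
WHERE  rp.name = 'default';
```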
 
 
That looks pretty tight.  I expect server-wide read IOPs to sometimes exceed SQL Server read IOPs due to perfmon and other small consumers on the server.  SQL Server read IOPs are sometimes reported slightly higher than server-wide read IOPs.  I'm not too troubled by the level of difference I've seen, though.
 
Different story when considering read service times.
 

 
Blimey!  That doesn't square at all.  This workload was active over 8 LUNs for the active user database primary filegroup data files with a separate txlog LUN, and 2 LUNs for tempdb database data files with a separate txlog LUN.  Maybe if I look at the maximum read service time across all of those LUNs I'll make better sense of the situation?
 
Nope.
 
 

Younger me would probably take some time to dig into this.  But I'm old, and jaded.  Just gonna report to Microsoft the same thing I'll tell you:
"\SQLServer:Resource Pool Stats(default)\Avg Disk Read IO (ms)" can't be trusted.
 
I'll update this post someday if I see that change.
 
Ciao!

Monday, February 8, 2016

SQL Server 2016 Query Store - export & import?

Looking around for a good way to export and import Query Store data in SQL Server 2016.
 
I'm performing large scale comparisons of the legacy cardinality estimator and the new cardinality estimator in SQL Server 2016.
 
I've captured over 9000 queries in two flavors.  First execution was in SQL Server 2016 compatibility mode with trace flag 9481 (to force use of the legacy cardinality estimator) & trace flag 4134 (which I'll retire as soon as it's confirmed the wrong-results issue is corrected by default).  The second execution was with no trace flags enabled.
 
 
The database will soon be replaced with another test bed - approximately 1 TB larger.  I'd love to be able to compare behavior against the existing database to behavior on the larger database.
 
If all else fails, I'll set all user index & column stats to NORECOMPUTE, then truncate all user tables.  Then shrink the database and smush it into a single file.  That would preserve all the info I'm interested in - and I could compare stats & Query Store across the smushed database and the new test bed.
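A sketch of that 'smush' sequence (my addition - database, file, and table names are hypothetical, and each step would be repeated per table or per file):

```sql
-- Sketch of the 'smush' approach (names are hypothetical).
-- 1. Pin statistics so later activity can't trigger a recompute.
UPDATE STATISTICS dbo.TABLE1 WITH NORECOMPUTE;  -- repeat per user table

-- 2. Empty the user tables.
TRUNCATE TABLE dbo.TABLE1;                      -- repeat per user table

-- 3. Drain the extra data files into the remaining file, then shrink it.
DBCC SHRINKFILE (MyDb_data2, EMPTYFILE);        -- repeat per extra file
ALTER DATABASE MyDb REMOVE FILE MyDb_data2;     -- repeat per extra file
DBCC SHRINKFILE (MyDb_data1, 1024);             -- target size in MB
```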
 
Although I could export information from each of the query store catalog views, I suspect there might be information in the query store that isn't exposed through the catalog views.  Plus I'd rather not handle over 19,000 XML plans :-)
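For the catalog-view route, a starting point might look like this (my sketch; persisting the output is left out, and - as noted above - this may not capture everything Query Store holds internally):

```sql
-- Sketch: flatten the main Query Store catalog views for export.
SELECT q.query_id,
       qt.query_sql_text,
       p.plan_id,
       p.query_plan,            -- XML showplan, one per plan
       rs.count_executions,
       rs.avg_duration,
       rs.avg_logical_io_reads
FROM   sys.query_store_query         AS q
JOIN   sys.query_store_query_text    AS qt
  ON   qt.query_text_id = q.query_text_id
JOIN   sys.query_store_plan          AS p
  ON   p.query_id = q.query_id
JOIN   sys.query_store_runtime_stats AS rs
  ON   rs.plan_id = p.plan_id;
```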
 
So I'm kinda hoping there is a clever export/import possibility...