Friday, August 28, 2015

I got a hunch about SQL Server Fragmentation Part I

I'm about a year behind in blogging, maybe more.  If I split my ideas into smaller bites, and make a series when appropriate, perhaps I'll get more blogging done.  Seems to work for me on Twitter :-)

If you hunger for more of my musing about SQL Server fragmentation, my next post is at:
I got a hunch about SQL Server fragmentation. Part II

(September 15, 2015) If you've read this post in the past, and it seems kinda different now... you're right!!  After posts 2-5 in the series I came back and re-factored :-) Trying to include html code windows for all of my queries now, simplified the example table structure a little, and so on...   

I had a hunch about achieving maximum possible logical fragmentation in a SQL Server clustered index based on some work I'd done in a different database engine.  Finally decided to put my hunch to the test.

In a brand new database with a single data file, I created my test table.  The table is very simple, constructed so that each row will consume 1 8kb leaf page due primarily to the 4096 character varchar column.  Who'd really create a table like that?  And why wouldn't they page compress it??  Come on, I'm only an enthusiast and this is my sandbox :-)

CREATE TABLE [dbo].[sq_dashE_1]
            ([ID_INT] [int] NOT NULL,
             [VARCHAR_4096] [varchar](4096) NULL
             CONSTRAINT [PK_sq_dashE_1] 
             ) ON [PRIMARY];

I inserted 4084 rows in the table, starting with clustered index key value 4084 and subtracting one along the way, down to one. Why 4084? Cuz this isn't the first time I've done this - or even in the first twelve times, really. :-) I knew it'd come out to 4096 total pages with index pages included.

DECLARE @string VARCHAR(4096) = 'AA';
DECLARE @ij     INT           = 4084, @jk     INT = 12; 
--stuff @string to 4096 characters
WHILE @jk > 1
            SELECT @string = STUFF(@string,2,0,@string);
            SELECT @jk = @jk - 1;
      END ;

WHILE @ij > 0 
            INSERT INTO sq_dashE_1 VALUES (@ij, @string); 
            SELECT @ij=@ij-1;
      END ;

So how does it look? After populating the table, I used the index physical stats function to get fragmentation information.

SELECT object_name(pstats.object_id) AS tname,
FROM   sys.dm_db_index_physical_stats(db_id(),NULL,NULL,NULL,'DETAILED') AS pstats
WHERE  pstats.object_id IN (object_id('sq_dashE_1'))

Aha!  You can see above that there are 4084 leaf pages, and 4084 fragments in the leaf pages.  The average fragment size in pages is 1. It seems maybe populating the ascending key clustered index in reverse order fragmented the table as bad as possible evvaarrr.

Merrill Aldrich made a great tool for inspecting physical page placement of small databases.  You can download version 1.2 of this tool at the following link. It's a wonderful tool for exploration - but not really suited for large databases in its current form.

Using this tool, its easy to spotcheck my hunch: Inserting rows in decreasing key value into an ascending clustered index leads to "perfect" or complete fragmentation as reported by avg_fragmentation_in_percent from the sys.dm_db_index_physical_stats function.  The test table sq_dashE_1 is in orange below.  But probably the first thing you notice is sq_dashE_1 pages and extents are quite contiguous. (And nevermind the high page numbers and the other database contents seen at the periphery which may or may not be from post 5 in this series :-})

In the graphic above, notice that the "previous page" for page ID 208088 is page 208090.  And the
"next page" page 208088 (the lowest page ID allocated to sq_dashE_1) is undefined.  (Page ID 208089 is skipped between leaf pages on either side because its an index page at a higher level - it only points to leaves.)

Let's take a look at details provided from function sys.dm_db_database_page_allocations.

SELECT   extent_page_id,
FROM     sys.dm_db_database_page_allocations(DB_ID(), OBJECT_ID('sq_dashE_1'), 1, NULL, 'DETAILED')
WHERE    page_type_desc<>'IAM_PAGE'
ORDER BY allocated_page_page_id

So in the graphic above, perhaps more nonintuitively than you'd ever imagined, I show the backward but predictable relationship of the logical order of the pages and their physical order.  I'm only showing 2 extents worth of the results, but the pattern continues throughout.

How many 64kb extents does test table sq_dashE_1 span?

SELECT COUNT (DISTINCT extent_page_id) AS number_of_extents_LIMITED
FROM   sys.dm_db_database_page_allocations(DB_ID(), OBJECT_ID('sq_dashE_1'), 1, NULL, 'LIMITED');

SELECT COUNT (DISTINCT extent_page_id) AS number_of_extents_DETAILED
FROM   sys.dm_db_database_page_allocations(DB_ID(), OBJECT_ID('sq_dashE_1'), 1, NULL, 'DETAILED')
WHERE  page_type_desc<>'IAM_PAGE';

Test table sq_dashE_1 spans 513 extents.  The LIMITED returns of function sys.dm_db_database_page_allocations don't bring back values for page_type_desc to exclude the IAM page :-)

OK.  So the 4084 leaf pages are reported by sys.dm_db_index_physical_stats as 4084 fragments of 1 page each.  The leaf pages are spread across 513 extents.  How many physical reads will it take to do an unordered full scan of the table?  Anyone think 4084, one read per fragment?  How about 513 - reading an entire 64k extent at a time? Let's see.

IF OBJECT_ID('tempdb..#blackhole') IS NOT NULL DROP TABLE #blackhole 
SELECT num_of_reads,num_of_bytes_read from sys.dm_io_virtual_file_stats(DB_ID(),1);
SELECT * INTO #blackhole FROM sq_dashE_1 where varchar_4096 like '%b%'
SELECT num_of_reads,num_of_bytes_read from sys.dm_io_virtual_file_stats(DB_ID(),1);

87 reads. 33554432 bytes=4096 eight kb pages(4084 leaf pages and index pages).

And what about the report from STATISTICS IO?

Here's something I mentioned elsewhere - still not gonna support my assertion, just gonna drop it here: disk reads to support database pages by read-ahead aren't reported by STATISTICS IO, only the number of database pages read by read-ahead.  So in this case we've seen 87 disk reads from sys.dm_io_virtual_file_stats and STATISTICS IO is reporting 3.  That leaves 84 reads to cover the 4072 pages read by read-ahead.  The average read size across all 87 reads was over 370kb - somewhere between 5 and 6 extents by average.

And if you read ahead in the series to the 5th post, you'll see a similar table, with a measly reported .0245% fragmentation that requires 338 reads for a similar query!!!
I got a hunch about SQL Server Fragmentation, Part V

So maybe fragmentation isn't all that bad??
Nahh, that's not where I'm going with this at all :-)

But I do plan to show that, depending on goals for index maintenance, index fragmentation evaluation may need to be a lot different than current common methods.

Besides, in post 4 I show a "perfectly defragmented" table of similar structure that accomplishes the same query in 68 reads :-)
I got a hunch about SQL Server Fragmentation, Part IV

OK, done for today.  Well... ok, I got one more thing.

There are 4096 8k pages in sq_dashE_1.  That'd be 512 extents(eight pages each) if perfectly packed.  But we know from sys.dm_db_database_page_allocations the pages for sq_dashE_1 span 513 extents.  So somewhere there's a blank space.

Your move, Taylor.

Wednesday, August 5, 2015

SQL Server Workspace Memory - with a Twist

In my recent blog post 'SQL Server Granted/Reserved/Stolen Workspace Memory in Perfmon' I hinted at a twist in the story of perfmon accounting of workspace memory.

Here's the stolen/free/database cache account of memory from perfmon on 'system B'.  Its a big memory system - a terabyte of RAM.


 At times more than 250 GB of RAM is stolen memory.  Wonder how much of that is used workspace memory?  How about inserting an Excel column for the difference between 'SQLServer:Memory Manager\Granted Workspace Memory (KB)' and 'SQLServer:Memory Manager\Reserved Server Memory (KB)'?

Whoa... reserved memory should be a subset of granted memory?!?

The amount of granted memory is so small I can't even see it on the same scale as the reserved memory. Adjust the scale and I can see that 'SQLServer:Memory Manager\Granted Workspace Memory (KB)' rarely has much memory accounted at all.

Well, I know that there are multiple Resource Governor workload groups on this server.  Here's a graph of their CPU activity during this observation time.

WrkGrp_4 is the main source of activity during the observation period. I wonder how these workgroups are distributed among Resource Governor resource pools?  Each resource pool added to the RG results in a new set of memory brokers added to the system, including a new memory broker to account for memory grants within the pool.

Sure - I could show the RG configuration but I'm a charts-n-graphs kinda guy :-)  Comparing this graph of Resource Pool CPU utilization to the previous graph, its clear each of the 4 workload groups is associated with its own resource pool.

With additional resource pools and their brokers in the mix,  the accounting of  'SQLServer:Memory Manager\Granted Workspace Memory (KB)' and 'SQLServer:Memory Manager\Reserved Server Memory (KB)' is expected to change.

We can see that 'SQLServer:Memory Manager\Granted Workspace Memory (KB)' only reports the granted query memory for the default resource pool.

With that being the case,  consistency would predict that 'SQLServer:Memory Manager\Reserved Server Memory (KB)' only report the unused portion of memory grants within the default resource pool.  If that were the case, 'SQLServer:Memory Manager\Reserved Server Memory (KB)' values would always be less than 'SQLServer:Memory Manager\Granted Workspace Memory (KB)' - we've already seen that's not the case.

Below is a transparent red overlay of  'SQLServer:Memory Manager\Reserved Server Memory (KB)' on top of a stacked graph of 'Active memory grant amount (KB)' for each of the resource pools.
That's more like it! 

So there's a fundamental difference between perfmon values for 'SQLServer:Memory Manager\Granted Workspace Memory (KB)' and 'SQLServer:Memory Manager\Reserved Server Memory (KB)': the granted memory values are ONLY for the default resource pool, while the reserved memory values are an aggregate across all resource pools.  Without that important tidbit, memory accounting can get pretty weird if there are active added resource pools to the Resource Governor configuration.

With that piece of knowledge, I add a column to Excel that sums the active memory grant amount for each of the resource pools, then subtract 'SQLServer:Memory Manager\Reserved Server Memory (KB)'.  The result is the total stolen/used workspace memory across all resource pools.  I wish perfmon had a counter for reserved memory per resource pool to accompany its granted per pool counter.  Alas - if you want that you're gonna hafta reap it from sys.dm_exec_query_memory_grants or sys.dm_exec_query_resource_semaphores.

Anyway - with my new computed column I can see that stolen/used query memory is that largest contributor to stolen memory during the observation, and based on the counters for granted per pool I know that the vast majority of the used/stolen query memory during the observation was against grants for ResPool_4.

There are three big reasons I like to track used query memory, especially in the context of Resource Governor:
1. Often query concurrency is limited by the amount of granted memory - but used query memory may be a small fraction of granted throughout.  If there's CPU to spare, using Resource Governor to limit the maximum query grant size can increase system performance by decreasing resource_semaphore waits and increasing query concurrency.
2. Sorting and hashing activity in workspace memory is not counted in the perfmon 'buffer page lookups' counter.  So when there is massive amounts of sorting, it can bend the relationship of CPU utilization to logical IO on the system.  Important to know when modeling system behavior.
3. If there are multiple resource pools active at the same time (other than the pre-existing default and internal pools), the system is vulnerable to out-of-memory exceptions when granted workspace memory becomes overcommitted and non-workspace memory steal/allocations fail.

But those stories will have to wait for yet another day... 

Once you graph the used query memory on a given system, if you see that its just a fraction of granted memory and are experiencing pending memory grants/resource_semaphore waits - you may want to use Resource Governor workload group max query memory grant to increase query concurrency.  An older blog post where I address that.

Resource Governor to restrict Max Query Memory Grant