Tuesday, January 31, 2023

#SQLServer - USE HINT ('DISABLE_OPTIMIZER_ROWGOAL') - SURPRISE! TOP operator in graphical query plan

 Hello!!!

 Sometimes perhaps the things I care about will really leave people wondering. :-)

In this case, it's a SQL Server graphical query plan oddity.

I don't think this behavior is necessarily related to a performance problem.

But I think this behavior can make it *harder* to troubleshoot SQL Server graphical query plans for performance.  This is a case where a TOP operator shows up in a graphical plan unexpectedly - and seemingly against the expressed intent of the USE HINT ('DISABLE_OPTIMIZER_ROWGOAL') hint.

Here are some sources for additional background about the general idea.

KB2667211 - A query may take a long time to run if the query optimizer uses the Top operator in SQL Server 2008 R2 or in SQL Server 2012
https://support.microsoft.com/en-us/topic/kb2667211-a-query-may-take-a-long-time-to-run-if-the-query-optimizer-uses-the-top-operator-in-sql-server-2008-r2-or-in-sql-server-2012-780ec760-894a-3aac-0d61-e954ea05633f

KB4051361 - Optimizer row goal information in query execution plan added in SQL Server 2014, 2016 and 2017
https://support.microsoft.com/en-us/topic/kb4051361-optimizer-row-goal-information-in-query-execution-plan-added-in-sql-server-2014-2016-and-2017-ec130d21-1adc-3d3d-95a5-cdb075722269

Inside the Optimizer: Row Goals In Depth
Paul White
@sql_kiwi
2010 August 18
https://www.sql.kiwi/2010/08/inside-the-optimiser-row-goals-in-depth.html

All right. Game on.

A little bit of info about the system.


Let's do some setup (I'll append all of the T-SQL at the end of the blog post for those that want to follow along at home).

 


OK, let's see what our folly hath wrought. Good, good. Let the A9 flow...

I know this query seems almost non-sensical.  But I assure you it was lovingly distilled from a much more complex case I ran into in the field.  Sure, the query could be simplified.  But that's not the point of this exercise :-) Just grabbing the estimated plan here, because that's all I care about in this repro.

(If t1 and t0 were actually different heaps this would be about as good SQL Server could do without any indexes.)


OK, it's a funky query... but nothing too unusual yet in the graphical plan.

So we've got a query that probably no *person* would write and the graphical plan isn't really too objectionable.  Why don't we add a hint that no *person* would add to this query, for good measure?  One that would be expected to have no effect? If there isn't an apparent row goal already, what change could come about if we added the DISABLE_OPTIMIZER_ROWGOAL hint?

Oh. That's weird.  A TOP operator magically appeared in the plan.

The attribute EstimateRowsWithoutRowGoal (discussed in the second KB article linked above) doesn't appear in the plan XML.  So... I guess this is "row goal lite" or something?

 

Martin Smith recommended testing the a hint to disable a rule: QUERYRULEOFF GbAggToConstScanOrTop.

The surprise TOP operator still shows up in the estimated plan.

 


And... I didn't realize this until after I initially published this blog post.

The sum of the costs of the operators in the graphical plans with the surprise TOP operator... is 106%.

 

~~~

Here's the T-SQL for those that want to have their own fun :-)


CREATE TABLE #temp (col1 NVARCHAR(300) NOT NULL);

;WITH 
  n1 AS (SELECT n = 1 UNION ALL SELECT 1)
, n2 AS (SELECT n = 1 FROM n1 a1 CROSS JOIN n1 a2)
, n3 AS (SELECT n = 1 FROM n2 a1 CROSS JOIN n2 a2)
, n4 AS (SELECT n = 1 FROM n3 a1 CROSS JOIN n3 a2)
, n5 AS (SELECT n = 1 FROM n4 a1 CROSS JOIN n4 a2)
, n6 AS (SELECT n = 1 FROM n5 a1 CROSS JOIN n3 a2)
, n7 AS (SELECT n = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM n6)
INSERT INTO #temp
SELECT CASE WHEN n%11=0 THEN 'A1'
            WHEN n%10=0 THEN 'A2'
            WHEN n%9=0 THEN 'A3'
            WHEN n%8=0 THEN 'A4'
            WHEN n%7=0 THEN 'A5'
            WHEN n%6=0 THEN 'A6'
            WHEN n%5=0 THEN 'A7'
            WHEN n%4=0 THEN 'A8'
            ELSE 'A9'
       END
FROM n7;

SELECT   col1, how_many = count(*) 
FROM     #temp 
GROUP BY col1
ORDER BY col1 ASC;

SELECT  col1
FROM    #temp t0
WHERE   EXISTS (    SELECT 1
					FROM   #temp t1
					WHERE  t1.col1 = 'A9'
					AND    t0.col1 = t1.col1 )
OPTION(RECOMPILE, USE HINT('FORCE_DEFAULT_CARDINALITY_ESTIMATION');

SELECT  col1
FROM    #temp t0
WHERE   EXISTS (    SELECT 1
					FROM   #temp t1
					WHERE  t1.col1 = 'A9'
					AND    t0.col1 = t1.col1 )
OPTION(RECOMPILE, USE HINT('FORCE_DEFAULT_CARDINALITY_ESTIMATION', 'DISABLE_OPTIMIZER_ROWGOAL'));





Thursday, December 8, 2022

Let's check up on the SQL Server transaction log buffers in SQL Server 2019 CU 16!!!

 I'm looking into a severe LOGFLUSHQ spinlock issue in SQL Server 2019 CU16.

The issue first presented itself afaik in CU 15.

I'll probably blog more thoroughly about all that another day.

Here I just wanted to return to a detail from yesteryear.

Last time I checked was probably SQL Server 2014 - at that time each database had 128 in-memory transaction log buffers, each of which could be up to 60 kb (hence why each transaction log write could be up to 60 kb).

Since the logflushq spinlock contention is in the context of transaction log flush/write queue management, I figured I'd better check if the number of buffers had increased, or if the maximum size of those buffers had increased.

The T-SQL code I used is down below...

These results from SSMS show that SQL Server 2019 is what I am used to so far:

128 transaction log buffers, each with a max size of 60k.

And lc_state for those buffers can be retrieved from DBCC DBTABLE.





DROP TABLE IF EXISTS #table; 
CREATE TABLE #table
([ParentObject] VARCHAR(255),
[Object] VARCHAR(255),
[Field] VARCHAR(255),
[Value] VARCHAR(255)
);

DECLARE @sqlT NVARCHAR(1000)=
'dbcc dbtable(' + quotename(db_name()) + ') with tableresults, no_infomsgs';

INSERT INTO #table
EXEC (@sqlT);

SELECT SQLServer_Version = @@version;

SELECT ParentObject
FROM #table
WHERE LEFT(ParentObject, 
CASE WHEN PATINDEX('% @%', ParentObject) > 2 
THEN PATINDEX('% @%', ParentObject) - 1 ELSE 1 END) = 'LogMgr'
GROUP BY ParentObject;

WITH LogMgr AS 
(SELECT ParentObject
FROM #table
WHERE LEFT(ParentObject, 
CASE WHEN PATINDEX('% @%', ParentObject) > 2 
THEN PATINDEX('% @%', ParentObject) - 1 ELSE 1 END) = 'LogMgr'
GROUP BY ParentObject),
LogMgr_objects AS
(SELECT OBJECT_TYPE = LEFT(t.Object, 
CASE WHEN PATINDEX('% @%', t.Object) > 2 
THEN PATINDEX('% @%', t.Object) - 1 ELSE 1 END),
t.object
FROM #table t
JOIN LogMgr ON t.ParentObject = LogMgr.ParentObject
GROUP BY LEFT(t.Object, 
CASE WHEN PATINDEX('% @%', t.Object) > 2 
THEN PATINDEX('% @%', t.Object) - 1 ELSE 1 END), t.object)
SELECT LO.OBJECT_TYPE, object_count = COUNT(*)
FROM LogMgr_objects LO
GROUP BY LO.OBJECT_TYPE;

SELECT t.Field, t.value, NUM_log_buffers = count(*)
FROM #table t
WHERE t.object LIKE 'LC @0x%'
AND t.field = 'lc_maxDataSize'
GROUP BY t.field, t.value;

SELECT t.Field, t.value, NUM_log_buffers = count(*)
FROM #table t
WHERE t.object LIKE 'LC @0x%'
AND t.field = 'lc_state'
GROUP BY t.field, t.value;

Monday, April 4, 2022

SQL Server - recursive CTE to retrieve dependency chain of Resource Governor Classifier Function

SQL Server Resource Governor classifier functions can include references to functions as well as tables in the master database.  And each referenced function can in turn reference additional functions and tables.

In order to get a thorough understanding of how sessions are classified on a given system, one may need some familiarity with each function or table in the dependency tree.

This recursive function will retrieve all object names in the dependency tree (object names will be retrieved but column names will not be retrieved, due to the referenced_minor_id = 0 qualifier.

;WITH DTree AS 
(	SELECT		root_id		= classifier_function_id
, referencing_id = classifier_function_id , referenced_id = classifier_function_id , NestLevel = 1 FROM sys.dm_resource_governor_configuration UNION ALL SELECT root_id         = rgc.classifier_function_id , d1.referencing_id , d1.referenced_id , NestLevel + 1 FROM sys.sql_expression_dependencies d1 JOIN DTree r ON d1.referencing_id = r.referenced_id CROSS JOIN sys.dm_resource_governor_configuration rgc WHERE d1.referenced_minor_id = 0) SELECT classifier_function = OBJECT_NAME(t1.root_id) , referencing_name = OBJECT_NAME(t1.referencing_id) , referenced_name = OBJECT_NAME(t1.referenced_id) , referenced_type = so.type_desc , t1.NestLevel FROM DTree t1 JOIN sys.objects so ON so.object_id = t1.referenced_id ORDER BY t1.NestLevel, t1.referencing_id;



SQL Server - recursive CTE to retrieve rowstore index IAM chain

 A SQL Server recursive CTE can reach a maximum level of recursion of 32768.

Strting with the first IAM page for a rowstore index, the entire IAM chain can be retrieved, as each IAM page indicates the next page in the chain.

So a recursive CTE could reliably retrieve IAM chains for the rowstore indexes in a filegroup until the filegroup exceeds 128 TB in aggregate file size. (Because each IAM page accounts for database file usage in a ~4GB range of a file, for a single rowstore index.)

This CTE will retrieve an IAM chain. AFAIK, it's performance will beat the brakes off any similar query relying on sys.dm_db_database_page_allocations.




















DECLARE @object_id	INT		= OBJECT_ID('NOTES_FROM_BEYOND');
DECLARE @index_id	SMALLINT	= 1; 
DECLARE @type_desc	VARCHAR(20)	= 'IN_ROW_DATA';
					/*IN_ROW_DATA, ROW_OVERFLOW_DATA, LOB_DATA*/
;WITH start_page AS 
(	SELECT				fp.file_id
	,				fp.page_id
	FROM				sys.system_internals_allocation_units siau
	JOIN				sys.partitions sp 
	ON				siau.container_id	= sp.partition_id
	CROSS APPLY			sys.fn_PhysLocCracker(siau.first_iam_page) fp
	WHERE				sp.object_id		= @object_id
	AND				sp.index_id		= @index_id
	AND				siau.type_desc		= @type_desc	)
, rec_cte AS
(	SELECT				next_page_file_id	= CONVERT(BIGINT, sp.file_id)
	,				next_page_page_id	= CONVERT(BIGINT, sp.page_id)
	,				iam_page_ordinal	= 1
	FROM				start_page sp
	UNION ALL
	SELECT				CONVERT(BIGINT, dpi.next_page_file_id)
	,				CONVERT(BIGINT, dpi.next_page_page_id)
	,				iam_page_ordinal = iam_page_ordinal + 1
	FROM				rec_cte
	CROSS APPLY			sys.dm_db_page_info(DB_ID(), next_page_file_id, next_page_page_id, 'DETAILED') dpi
	WHERE				dpi.next_page_file_id <> 0    ) 
SELECT		rc.iam_page_ordinal, file_id = rc.next_page_file_id, page_id = rc.next_page_page_id
FROM		rec_cte rc
WHERE		rc.next_page_file_id IS NOT NULL
ORDER BY	rc.iam_page_ordinal
OPTION		(maxrecursion 32767);

Monday, February 28, 2022

#SQLServer Trivial Plans for Inserts; Stats Expectations and Reality

OK.  SQL Server trivial plans for rowstore table INSERT. And related optimizer stats interaction.

TL;DR cached trivial plans for INSERT can be surprisingly stubborn. If a query matches to one, it won't perform or queue a stats update even if the stats are stale.  If the stats have been updated and would otherwise warrant a per-index plan - but there is a matching cached trivial plan for a per-row plan... outta luck. Might hafta DBCC FREEPROCCACHE or add OPTION(RECOMPILE) hint to make sure a cached trivial plan doesn't prevent a per-index update for an INSERT when you really want one.

~~~~ 

The blog post referenced below, first published 2015 March 16, is pretty good.

The relevant takeaway from that post is: in a database with AUTO_UPDATE_STATISTICS = ON, a statistics update will NOT invalidate a relevant cached query plan. A subsequent query which qualifies to use a cached plan will continue to use the cached trivial plan. This is Scenario 1 in the blog post. The following text is directly from the blog post.

When a plan is trivial, it’s unnecessary to recompile the query even statistics has been updated.  Optimizer generates trivial plan for very simple queries (usually referencing a single table).  In XML plan, you will see statementOptmLevel="TRIVIAL". In such case, it’s futile and you won't get a better or different plan.

The demo code in the blog post uses SELECT queries only. In my examples in my own blog post, I'm concerned with INSERT queries.  Are INSERT queries eligible for both trivial plans and plans that have been more fully optimized? If INSERT queries can use a trivial plan, is it possible for a cached trivial plan to remain in use when a better and/or different plan could be selected?

Perhaps the reader is familiar with the Paul White blog post referenced below, from 2013 January 26. In that blog post, Paul White discusses wide, per-index INSERT query plans and narrow, per-row INSERT query plans.  How do these plan types relate to trivial plans?

OK. Enough jibber-jabber. Time to do something. Like create a stored procedure.  Which will be used in a database with AUTO_UPDATE_STATISTICS = ON and AUTO_UPDATE_STATISTICS_ASYNC = OFF.

This stored procedure will drop two tables if they already exist, then create those two tables.  The source table has three columns and a single index - the clustered primary key.  The target table also has three columns and a clustered primary key. In addition the target table has two non-clustered indexes.

The @insert_no_rows parameter controls whether or not there will be an INSERT query from the empty source table to the empty target table. That wouldn't move any data - but maybe it will cache a plan?

The INSERT query is dynamic SQL - both in the stored procedure and when I issue it adhoc later.  It isn't necessary for this query to be dynamic SQL - it was just easier that way to make sure the batch SQL text matched and plan re-use happened when it was a possibility. While writing this blog post at one point the query SQL text in the stored procedure has a different number of tabs preceding the formatted T-SQL than appeared in my adhoc query afterward and it took me forever to figure out why plan re-use wasn't occurring. :-) 

CREATE OR ALTER PROCEDURE test__trivial_plan_inserts @insert_no_rows INT = 0, @x INT = 0
AS
BEGIN
	DROP TABLE IF EXISTS trivial_test_source;
	DROP TABLE IF EXISTS trivial_test_target;

	CREATE TABLE trivial_test_source
	(	col1 INT CONSTRAINT pk__trivial_test_source PRIMARY KEY CLUSTERED
	,	col2 INT
	,	col3 INT
	);

	CREATE TABLE trivial_test_target
	(	col1 INT CONSTRAINT pk__trivial_test_target PRIMARY KEY CLUSTERED
	,	col2 INT INDEX nci__trivial_test_target__col2
	,	col3 INT INDEX nci__trivial_test_target__col3
	);

	-- white space in the INSERT dynamic SQL must match exactly
	-- in stored procedure and adhoc EXEC for cached plan reuse
	IF @insert_no_rows = 1
	BEGIN
		DECLARE @sqlT NVARCHAR(1000) = N'INSERT INTO trivial_test_target SELECT * FROM trivial_test_source;'
		EXEC (@sqlT);
	END
	-- white space in the INSERT dynamic SQL must match exactly
	-- in stored procedure and adhoc EXEC for cached plan reuse

	;WITH	n0 AS (	SELECT TOP (32) n = 1 
					FROM master.dbo.spt_values)
	,		n1 AS (	SELECT n = ROW_NUMBER() OVER (ORDER BY(SELECT NULL))
					FROM n0 t1 CROSS JOIN n0 t2)
	INSERT INTO	trivial_test_source
	SELECT		TOP (@x) n, n, n
	FROM		n1;
END

For the first experiment, let's not worry about caching plans. Let's call the stored procedure like this:

EXEC test__trivial_plan_inserts @insert_no_rows = 0, @x = 1023;

So, give me 1023 rows in table trivial_test_source.

After the stored procedure executes, let's grab an actual plan from this simple INSERT query:

DECLARE @sqlT NVARCHAR(100) = N'INSERT INTO trivial_test_target SELECT * FROM trivial_test_source;'
EXEC (@sqlT);
-- white space in the INSERT dynamic SQL must match exactly
-- in stored procedure and adhoc EXEC for cached plan reuse

And here's the actual plan.  That's a narrow, per-row plan. Note the estimate of 1023 rows; dead-on.



The Object properties in the graphical plan specify the target table indexes which will be updated for each row inserted into the clustered index. So far, so good.














































If we look at the plan XML, we can see this is a trivial plan.


OK. Now let's call the stored procedure like this:
EXEC test__trivial_plan_inserts @insert_no_rows = 0, @x = 1024;


That will drop the source and target tables, recreate the tables, and populate the source table with 1024 rows instead of the previous 1023.

Now get an actual plan for the insert again:

DECLARE @sqlT NVARCHAR(100) = N'INSERT INTO trivial_test_target SELECT * FROM trivial_test_source;'
EXEC (@sqlT);
-- white space in the INSERT dynamic SQL must match exactly
-- in stored procedure and adhoc EXEC for cached plan reuse

And... bam!!! Yeah, the graphical plan below certainly looks like a different plan. This is what Paul White refers to as a wide, per-index update.

First, at the right of the blue box below, the source table scan operator feeds rows into the clustered index insert operator. No sort is needed to optimize the insert because the source and target tables have the same primary key clustered index definition. These rows are also fed into a table spool at the left of the blue box below.  The table spool at the left of the blue box is the same table spool at the left of the gold box - just different zones of the same plan.


Table spool populated, it is used as a source for a sort, then insert into a non-clustered index.  This happens once in the upper blue box, then again in the lower gold box.


Demonstrating the benefit of a per-index update over a per-row update for an INSERT of many thousand rows is left as an exercise for the reader.

While I won't investigate at this time, I do want to make note of an interesting switcheroo that takes place. Recall that within the CREATE TABLE, the clustered primary key and the two non-clustered indexes on col2 and col3 were created with inline syntax.







How did the non-clustered index on col3 become index_id 2, while the nonclustered index on col2 became index_id 3?  I don't know - and i don't know of anywhere that could be consequential. Yet.














How did I notice that little switcheroo?  Well, every per-index INSERT graphical plan I've seen has shown indexes in index_id order from top-down.  And when I've observed large per-index updates with sys.dm_tran_locks, the acquisition of locks over time by the session indicates indexes are tended to in index_id order.  I won't at this time be investigating further whether per-index updates always handle indexes in index_id order, or how the little index_id switcheroo occurred. Just an interesting bread crumb.

Before we do the next part I want to make sure the database is set up like I think...
Excellent, exactly what I wanted.










So. This time we drop the tables and recreate them. We issue the INSERT INTO... SELECT while both source and target tables are empty. Then populate the source table with 1024 rows.

EXEC test__trivial_plan_inserts @insert_no_rows = 1, @x = 1024;

And now we are ready to grab an actual plan...

DECLARE @sqlT NVARCHAR(100) = N'INSERT INTO trivial_test_target SELECT * FROM trivial_test_source;'
EXEC (@sqlT);
-- white space in the INSERT dynamic SQL must match exactly
-- in stored procedure and adhoc EXEC for cached plan reuse

So now we have 1024 rows - that number of rows previously got a per-index update plan.  But this time it got a per-row plan. Huh.









Well, what the heck.  If we look at the plan XML we see a clue. RetrievedFromCache.



Now wait just a minute. That source table went from 0 rows to 1024 rows; did the stats get updated?

Wow. I guess not. When the insert matched a cached trivial plan, the stats did not get updated even though they were stale and auto update stats is true in this database.


So - what if I try this again? Drop the tables, recreate the tables, run the insert with zero rows to put a plan in the cache. Put 1024 rows in the source table. Then... I'll explicitly update stats.

EXEC test__trivial_plan_inserts @insert_no_rows = 1, @x = 1024;
UPDATE STATISTICS trivial_test_source;

OK, stats are updated.




So what happens now? Ouch. It doesn't matter.  If the stats are stale and there's a matching cached trivial plan - the stats don't get updated.  If the stale stats ARE updated before the query is executed - the matching trivial plan isn't invalidated. So that matching cached trivial plan will still be used. Even though I don't want it to be used. I just want SQL Server to take a fresh look at what's going on.

And that fresh look can solve the problem.  Clearing the plan cache with DBCC FREEPROCCACHE (or ALTER DATABASE SCOPED CONFIGURATION CLEAR PROCEDURE CACHE in the relevant database) would cause SQL Server to take a fresh look, and decide if the INSERT should get a per-index plan instead of a per-row plan*.  Tacking OPTION(RECOMPILE) hint on to the end of that INSERT would also work in a much more targeted manner. Sometimes, in order to prevent a cached trivial plan from forcing a per-row update when we really want a per-index update, gonna hafta encourage SQL Server to take a fresh look in one of those ways.

*There are a number of SQL Server instance-wide configuration settings changes to which will also clear plan cache, such as maxdop and cost threshold for parallelism. But i don't recommend changing any of them solely to clear plan cache. Microsoft lists the configuration options which clear plan cache on the following BOL page.

Tuesday, December 7, 2021

#SQLServer - Database Cache Suppression

Third blog post in a quick series about potential trouble at the high AND low side of [\SQLServer:Buffer Manager\Target pages].  So far I've primarily talked about trouble on the low side of [Target pages], due to artificial restriction on the size of the database cache instance-wide, and an uneven expectation of [Free Memory] across the SQLOS memory nodes which can result in some SQLOS memory nodes having almost no database cache.  This blog post is another example of the uneven demand for [Free Memory] across SQLOS memory nodes.

Here are the first two blog posts, in order, where I started discussing this topic.

#SQLServer Column Store Object Pool -- the Houdini Memory Broker Clerk AND Perfmon [\SQLServer:Buffer Manager\Target pages]
https://sql-sasquatch.blogspot.com/2021/12/sqlserver-column-store-object-pool.html


#SQLServer Database Cache Memory Magic at SQLOS Node Level -- Chaotic or Even A Little Bit Sinister

The previous two posts have focused on December 1 on a particular system. That system is unusual in that it has 7 vNUMA nodes.  That's a pretty surprising number, and it comes from the vNUMA configuration of that 80 vcpu vm not being aligned with physical resources beneath.  That'll get straightened out sometime soon and it'll live the rest of its life as a 4x20 as soon as it does.

How much of the problem do I expect to go away once that system is a 4x20?  A little. But it won't remove the problem entirely.

Let's look at a different system in this post - a 4x22, 88 vcpu system set up as good as I know how.

Well... yeah, there's something I don't necessarily like to see.  Shortly after 3:15 am the database page read rate increases but database cache size decreases.   Sometimes that happens, due to other memory commitments.  There's only so much memory and if there's another need, database cache might hafta shrink even if it means more reads and more read waits.  But in this case it isn't apparent there is a need for ~256 gb of SQLOS [Free Memory] to be maintained for over an hour.  It looks like maybe something was anticipated, but never happened?


Sure enough - as in the example we've been looking at previously, there's suddenly a huge swing in [Target pages], and it predicts the change in [Free Memory] retention.



Adjust the ratio of Y-axes to 8:1 to account for the left axis measured in KB, and the right axis measured in 8kb pages...

Now [Target pages] is pretty nicely redictive of [Database Cache Memory (KB)].  Until nearly 4:45 am, anyway, when [Target pages] goes back to the stratosphere. 



So for well over an hour, on this SQL Server 2019 instance on a nicely laid out 4x22, 88 vcpus vm SQLOS maintained over 256 gb of [Free Memory].  While also under considerable database page read load.  That very likely lead to much higher page re-reads than would have been necessary otherwise.  Other than that - how fairly distributed among the SQLOS memory nodes was production and maintenance of [Free Memory]?  Ouch. The intensity of page freeing on individual SQLOS nodes is just as intense on this 4 node system as it was on the 7 node system.  Individual nodes still get down to almost no database cache.  But conditions seem to change faster - things stay really bad for less time than on the 7 node system.






I haven't looked at a 2 node system yet.  Or an 8 node sysem.  I definitely predict the 8 node system will look... scary.  I might be able to round up a 2 node system before an 8 node system to evaluate.








#SQLServer Database Cache Memory Magic at SQLOS Node Level -- Chaotic or Even A Little Bit Sinister

 This blog post continues to look at a topic I blogged about yesterday.

Yesterday's first post:

sql.sasquatch: #SQLServer Column Store Object Pool -- the Houdini Memory Broker Clerk AND Perfmon [\SQLServer:Buffer Manager\Target pages] (sql-sasquatch.blogspot.com)

https://sql-sasquatch.blogspot.com/2021/12/sqlserver-column-store-object-pool.html

And here's a post I finished later today, following this one.

sql.sasquatch: #SQLServer - Database Cache Suppression (sql-sasquatch.blogspot.com)

https://sql-sasquatch.blogspot.com/2021/12/sqlserver-database-cache-suppression.html

Hey!! What's up with the lavender boxes in all the graphs in this post!? I'm using the lavender boxes for visual alignment from graph to graph.  Not really necessary in the system-wide graphs since the changes in SQLOS free memory and database cache are so drastic; but they help me to align visually when i look at individual SQLOS nodes where the pattern gets more complicated.

To briefly recap: we looked at a system where SQLOS [Database Cache Memory] and [Free Memory] experienced dramatic, aggressive changes. Those changes were not related to database page read rates, evn though that is the primary inflow to database cache.


We saw that aggressive changes to [Target pages] predicted the aggressive changes to SQLOS database cache and free memory.  The maximum value for [Target pages] in this example is not attainable on the system; it is much larger than the SQL Server [Max Server Memory] setting.


The lower values for [Target pages], however, are very predictive of the [Database Cache Memory (KB)] values, once the left and right Y-axes are aligned in an 8:1 ratio.



Currently the only known drivers for these drastic changes in [Target pages] and the consequential changes in database cache and free memory are the entrance and exit of the column store object pool cache.  If I find more, I'll update these posts.

Today I want to look at the effect this has at the SQLOS memory node level.  By default, SQL Server SQLOS creates a memory node at startup for each detected vNUMA node on the system. Worker threads on the schedulers associated with the memory node tend the SQLOS memory on that node.  The patterns above and in yesterday's post look very orderly, even if not optimal.  Today we'll see if that holds true at the SQLOS node level as well.  This particular system currently has 7 vNUMA nodes. (sometimes vNUMA on monster VMs requires a little extra care to get into expected configuration. Hopefully soon this will be a 4x20 eighty vcpu VM with 4 vNUMA nodes instead of its current non-standard configuration.)

Here's memory graphs for each of the 7 current SQLOS nodes.  After these 7, I'll single out two to look at specifically.









Wow. Well, things definitely don't look as orderly at the SQLOS memory node level as the do at the system-wide level.  Let's focus on two of the nodes - 003 and 006. Now, unless someone has done some fancy rigamarole with affinitization, one won't be able to select a SQLOS scheduler node or memory node for a session's connection to be associated with.  Similarly, without fancy affinity won't be able to reliably predict which scheduler node(s) or memory node(s) will be home for parallel worker activity in the case of parallel queries (the nodes for parallel workers can be different than the scheduler node/memory node for the execution context ID 0 worker thread).

Imagine a DOP 1 query that handles a large table scan.  Because the high estimate of rows is past the tipping point(and return of a column absent from any index), table scan is always chosen rather than index access. The table being scanned is 20 GB in size - but assume no part of the table is in cache at the start of the hypothetical activity.  A session runs that query 6 times in succession, with different filter values (but each time its a full table scan).

This happens once between 7:30 am and 8:30 am, and once between 10:00 am and 11:00 am.

What would you expect to happen if both of those sessions performed their DOP 1 queries on node 003 (the top graph immediately below)? What would be different if instead those sessions performed their queries on node 006 (the lower of the graphs immediately below)?




OK, OK. I've shown the possibility of performance-by-lottery in this case. The hypothetical activity on node 003 might be able to cache the entire 20 gb table on the first scan, with second through sixth queries getting 100% cache hit.  The hypothetical ativity on node 006 has no chance of doing so. Each scan of that 20 gb table will have to read the full 20 gb table - and absorb any associated pageio_latch waits.

But is there any indication something like that actually happened? Yes, actually.

First of all, CPU utilization across the whole VM was cookin' pretty good.


Now let's look at the individual SQLOS memory nodes, with CPU utilization for the associated vNUMA nodes.

vNUMA node 3 has very high CPU utilization throughout.


vNUMA node 6 has spikier CPU, and it rarely is as high as on vNUMA node 3.  That's not entirely surprising to me.  The query activity responsible for CPU activity on vNUMA node 6 is experiencing many more pageio_latch waits than on vNUMA node 3 - because there is almost no database cache.  From 10:00 am - 11:00 am there is another indication of query activity on memory node 006 - other than the CPU activity on vNUMA node 6.  The continued growth of stolen memory on SQLOS memory node 006 is most likely due to growth of memory stolen against query workspace memory grants, for sort/hash activity and the like.



OK. So now I've shown that the amount of database cache and free memory on a system can change drastically due to the influence of [Target pages]. [Target pages] itself can dramatically change due to entrance/exit of the column store object pool memory broker clerk.  And even if these patterns look somewhat orderly at the system level, on large NUMA systems they may look chaotic or downright sinister.