Monday, April 4, 2022

SQL Server - recursive CTE to retrieve dependency chain of Resource Governor Classifier Function

SQL Server Resource Governor classifier functions can include references to functions as well as tables in the master database.  And each referenced function can in turn reference additional functions and tables.

In order to get a thorough understanding of how sessions are classified on a given system, one may nee some familiarity with each function or table in the dependency tree.

This recursive function will retrieve all object names in the dependency tree (object names will be retrieved but column names will not be retrieved, due to the referenced_minor_id = 0 qualifier.

(	SELECT		root_id		= classifier_function_id
, referencing_id = classifier_function_id , referenced_id = classifier_function_id , NestLevel = 1 FROM sys.dm_resource_governor_configuration UNION ALL SELECT root_id         = rgc.classifier_function_id , d1.referencing_id , d1.referenced_id , NestLevel + 1 FROM sys.sql_expression_dependencies d1 JOIN DTree r ON d1.referencing_id = r.referenced_id CROSS JOIN sys.dm_resource_governor_configuration rgc WHERE d1.referenced_minor_id = 0) SELECT classifier_function = OBJECT_NAME(t1.root_id) , referencing_name = OBJECT_NAME(t1.referencing_id) , referenced_name = OBJECT_NAME(t1.referenced_id) , referenced_type = so.type_desc , t1.NestLevel FROM DTree t1 JOIN sys.objects so ON so.object_id = t1.referenced_id ORDER BY t1.NestLevel, t1.referencing_id;

SQL Server - recursive CTE to retrieve rowstore index IAM chain

 A SQL Server recursive CTE can reach a maximum level of recursion of 32768.

Strting with the first IAM page for a rowstore index, the entire IAM chain can be retrieved, as each IAM page indicates the next page in the chain.

So a recursive CTE could reliably retrieve IAM chains for the rowstore indexes in a filegroup until the filegroup exceeds 128 TB in aggregate file size. (Because each IAM page accounts for database file usage in a ~4GB range of a file, for a single rowstore index.)

This CTE will retrieve an IAM chain. AFAIK, it's performance will beat the brakes off any similar query relying on sys.dm_db_database_page_allocations.

DECLARE @index_id	SMALLINT	= 1; 
DECLARE @type_desc	VARCHAR(20)	= 'IN_ROW_DATA';
;WITH start_page AS 
(	SELECT				fp.file_id
	,				fp.page_id
	FROM				sys.system_internals_allocation_units siau
	JOIN				sys.partitions sp 
	ON				siau.container_id	= sp.partition_id
	CROSS APPLY			sys.fn_PhysLocCracker(siau.first_iam_page) fp
	WHERE				sp.object_id		= @object_id
	AND				sp.index_id		= @index_id
	AND				siau.type_desc		= @type_desc	)
, rec_cte AS
(	SELECT				next_page_file_id	= CONVERT(BIGINT, sp.file_id)
	,				next_page_page_id	= CONVERT(BIGINT, sp.page_id)
	,				iam_page_ordinal	= 1
	FROM				start_page sp
	SELECT				CONVERT(BIGINT, dpi.next_page_file_id)
	,				CONVERT(BIGINT, dpi.next_page_page_id)
	,				iam_page_ordinal = iam_page_ordinal + 1
	FROM				rec_cte
	CROSS APPLY			sys.dm_db_page_info(DB_ID(), next_page_file_id, next_page_page_id, 'DETAILED') dpi
	WHERE				dpi.next_page_file_id <> 0    ) 
SELECT		rc.iam_page_ordinal, file_id = rc.next_page_file_id, page_id = rc.next_page_page_id
FROM		rec_cte rc
WHERE		rc.next_page_file_id IS NOT NULL
ORDER BY	rc.iam_page_ordinal
OPTION		(maxrecursion 32767);

Monday, February 28, 2022

#SQLServer Trivial Plans for Inserts; Stats Expectations and Reality

OK.  SQL Server trivial plans for rowstore table INSERT. And related optimizer stats interaction.

TL;DR cached trivial plans for INSERT can be surprisingly stubborn. If a query matches to one, it won't perform or queue a stats update even if the stats are stale.  If the stats have been updated and would otherwise warrant a per-index plan - but there is a matching cached trivial plan for a per-row plan... outta luck. Might hafta DBCC FREEPROCCACHE or add OPTION(RECOMPILE) hint to make sure a cached trivial plan doesn't prevent a per-index update for an INSERT when you really want one.


The blog post referenced below, first published 2015 March 16, is pretty good.

The relevant takeaway from that post is: in a database with AUTO_UPDATE_STATISTICS = ON, a statistics update will NOT invalidate a relevant cached query plan. A subsequent query which qualifies to use a cached plan will continue to use the cached trivial plan. This is Scenario 1 in the blog post. The following text is directly from the blog post.

When a plan is trivial, it’s unnecessary to recompile the query even statistics has been updated.  Optimizer generates trivial plan for very simple queries (usually referencing a single table).  In XML plan, you will see statementOptmLevel="TRIVIAL". In such case, it’s futile and you won't get a better or different plan.

The demo code in the blog post uses SELECT queries only. In my examples in my own blog post, I'm concerned with INSERT queries.  Are INSERT queries eligible for both trivial plans and plans that have been more fully optimized? If INSERT queries can use a trivial plan, is it possible for a cached trivial plan to remain in use when a better and/or different plan could be selected?

Perhaps the reader is familiar with the Paul White blog post referenced below, from 2013 January 26. In that blog post, Paul White discusses wide, per-index INSERT query plans and narrow, per-row INSERT query plans.  How do these plan types relate to trivial plans?

OK. Enough jibber-jabber. Time to do something. Like create a stored procedure.  Which will be used in a database with AUTO_UPDATE_STATISTICS = ON and AUTO_UPDATE_STATISTICS_ASYNC = OFF.

This stored procedure will drop two tables if they already exist, then create those two tables.  The source table has three columns and a single index - the clustered primary key.  The target table also has three columns and a clustered primary key. In addition the target table has two non-clustered indexes.

The @insert_no_rows parameter controls whether or not there will be an INSERT query from the empty source table to the empty target table. That wouldn't move any data - but maybe it will cache a plan?

The INSERT query is dynamic SQL - both in the stored procedure and when I issue it adhoc later.  It isn't necessary for this query to be dynamic SQL - it was just easier that way to make sure the batch SQL text matched and plan re-use happened when it was a possibility. While writing this blog post at one point the query SQL text in the stored procedure has a different number of tabs preceding the formatted T-SQL than appeared in my adhoc query afterward and it took me forever to figure out why plan re-use wasn't occurring. :-) 

CREATE OR ALTER PROCEDURE test__trivial_plan_inserts @insert_no_rows INT = 0, @x INT = 0
	DROP TABLE IF EXISTS trivial_test_source;
	DROP TABLE IF EXISTS trivial_test_target;

	CREATE TABLE trivial_test_source
	(	col1 INT CONSTRAINT pk__trivial_test_source PRIMARY KEY CLUSTERED
	,	col2 INT
	,	col3 INT

	CREATE TABLE trivial_test_target
	(	col1 INT CONSTRAINT pk__trivial_test_target PRIMARY KEY CLUSTERED
	,	col2 INT INDEX nci__trivial_test_target__col2
	,	col3 INT INDEX nci__trivial_test_target__col3

	-- white space in the INSERT dynamic SQL must match exactly
	-- in stored procedure and adhoc EXEC for cached plan reuse
	IF @insert_no_rows = 1
		DECLARE @sqlT NVARCHAR(1000) = N'INSERT INTO trivial_test_target SELECT * FROM trivial_test_source;'
		EXEC (@sqlT);
	-- white space in the INSERT dynamic SQL must match exactly
	-- in stored procedure and adhoc EXEC for cached plan reuse

	;WITH	n0 AS (	SELECT TOP (32) n = 1 
					FROM master.dbo.spt_values)
					FROM n0 t1 CROSS JOIN n0 t2)
	INSERT INTO	trivial_test_source
	SELECT		TOP (@x) n, n, n
	FROM		n1;

For the first experiment, let's not worry about caching plans. Let's call the stored procedure like this:

EXEC test__trivial_plan_inserts @insert_no_rows = 0, @x = 1023;

So, give me 1023 rows in table trivial_test_source.

After the stored procedure executes, let's grab an actual plan from this simple INSERT query:

DECLARE @sqlT NVARCHAR(100) = N'INSERT INTO trivial_test_target SELECT * FROM trivial_test_source;'
EXEC (@sqlT);
-- white space in the INSERT dynamic SQL must match exactly
-- in stored procedure and adhoc EXEC for cached plan reuse

And here's the actual plan.  That's a narrow, per-row plan. Note the estimate of 1023 rows; dead-on.

The Object properties in the graphical plan specify the target table indexes which will be updated for each row inserted into the clustered index. So far, so good.

If we look at the plan XML, we can see this is a trivial plan.

OK. Now let's call the stored procedure like this:
EXEC test__trivial_plan_inserts @insert_no_rows = 0, @x = 1024;

That will drop the source and target tables, recreate the tables, and populate the source table with 1024 rows instead of the previous 1023.

Now get an actual plan for the insert again:

DECLARE @sqlT NVARCHAR(100) = N'INSERT INTO trivial_test_target SELECT * FROM trivial_test_source;'
EXEC (@sqlT);
-- white space in the INSERT dynamic SQL must match exactly
-- in stored procedure and adhoc EXEC for cached plan reuse

And... bam!!! Yeah, the graphical plan below certainly looks like a different plan. This is what Paul White refers to as a wide, per-index update.

First, at the right of the blue box below, the source table scan operator feeds rows into the clustered index insert operator. No sort is needed to optimize the insert because the source and target tables have the same primary key clustered index definition. These rows are also fed into a table spool at the left of the blue box below.  The table spool at the left of the blue box is the same table spool at the left of the gold box - just different zones of the same plan.

Table spool populated, it is used as a source for a sort, then insert into a non-clustered index.  This happens once in the upper blue box, then again in the lower gold box.

Demonstrating the benefit of a per-index update over a per-row update for an INSERT of many thousand rows is left as an exercise for the reader.

While I won't investigate at this time, I do want to make note of an interesting switcheroo that takes place. Recall that within the CREATE TABLE, the clustered primary key and the two non-clustered indexes on col2 and col3 were created with inline syntax.

How did the non-clustered index on col3 become index_id 2, while the nonclustered index on col2 became index_id 3?  I don't know - and i don't know of anywhere that could be consequential. Yet.

How did I notice that little switcheroo?  Well, every per-index INSERT graphical plan I've seen has shown indexes in index_id order from top-down.  And when I've observed large per-index updates with sys.dm_tran_locks, the acquisition of locks over time by the session indicates indexes are tended to in index_id order.  I won't at this time be investigating further whether per-index updates always handle indexes in index_id order, or how the little index_id switcheroo occurred. Just an interesting bread crumb.

Before we do the next part I want to make sure the database is set up like I think...
Excellent, exactly what I wanted.

So. This time we drop the tables and recreate them. We issue the INSERT INTO... SELECT while both source and target tables are empty. Then populate the source table with 1024 rows.

EXEC test__trivial_plan_inserts @insert_no_rows = 1, @x = 1024;

And now we are ready to grab an actual plan...

DECLARE @sqlT NVARCHAR(100) = N'INSERT INTO trivial_test_target SELECT * FROM trivial_test_source;'
EXEC (@sqlT);
-- white space in the INSERT dynamic SQL must match exactly
-- in stored procedure and adhoc EXEC for cached plan reuse

So now we have 1024 rows - that number of rows previously got a per-index update plan.  But this time it got a per-row plan. Huh.

Well, what the heck.  If we look at the plan XML we see a clue. RetrievedFromCache.

Now wait just a minute. That source table went from 0 rows to 1024 rows; did the stats get updated?

Wow. I guess not. When the insert matched a cached trivial plan, the stats did not get updated even though they were stale and auto update stats is true in this database.

So - what if I try this again? Drop the tables, recreate the tables, run the insert with zero rows to put a plan in the cache. Put 1024 rows in the source table. Then... I'll explicitly update stats.

EXEC test__trivial_plan_inserts @insert_no_rows = 1, @x = 1024;
UPDATE STATISTICS trivial_test_source;

OK, stats are updated.

So what happens now? Ouch. It doesn't matter.  If the stats are stale and there's a matching cached trivial plan - the stats don't get updated.  If the stale stats ARE updated before the query is executed - the matching trivial plan isn't invalidated. So that matching cached trivial plan will still be used. Even though I don't want it to be used. I just want SQL Server to take a fresh look at what's going on.

And that fresh look can solve the problem.  Clearing the plan cache with DBCC FREEPROCCACHE (or ALTER DATABASE SCOPED CONFIGURATION CLEAR PROCEDURE CACHE in the relevant database) would cause SQL Server to take a fresh look, and decide if the INSERT should get a per-index plan instead of a per-row plan*.  Tacking OPTION(RECOMPILE) hint on to the end of that INSERT would also work in a much more targeted manner. Sometimes, in order to prevent a cached trivial plan from forcing a per-row update when we really want a per-index update, gonna hafta encourage SQL Server to take a fresh look in one of those ways.

*There are a number of SQL Server instance-wide configuration settings changes to which will also clear plan cache, such as maxdop and cost threshold for parallelism. But i don't recommend changing any of them solely to clear plan cache. Microsoft lists the configuration options which clear plan cache on the following BOL page.

Tuesday, December 7, 2021

#SQLServer - Database Cache Suppression

Third blog post in a quick series about potential trouble at the high AND low side of [\SQLServer:Buffer Manager\Target pages].  So far I've primarily talked about trouble on the low side of [Target pages], due to artificial restriction on the size of the database cache instance-wide, and an uneven expectation of [Free Memory] across the SQLOS memory nodes which can result in some SQLOS memory nodes having almost no database cache.  This blog post is another example of the uneven demand for [Free Memory] across SQLOS memory nodes.

Here are the first two blog posts, in order, where I started discussing this topic.

#SQLServer Column Store Object Pool -- the Houdini Memory Broker Clerk AND Perfmon [\SQLServer:Buffer Manager\Target pages]

#SQLServer Database Cache Memory Magic at SQLOS Node Level -- Chaotic or Even A Little Bit Sinister

The previous two posts have focused on December 1 on a particular system. That system is unusual in that it has 7 vNUMA nodes.  That's a pretty surprising number, and it comes from the vNUMA configuration of that 80 vcpu vm not being aligned with physical resources beneath.  That'll get straightened out sometime soon and it'll live the rest of its life as a 4x20 as soon as it does.

How much of the problem do I expect to go away once that system is a 4x20?  A little. But it won't remove the problem entirely.

Let's look at a different system in this post - a 4x22, 88 vcpu system set up as good as I know how.

Well... yeah, there's something I don't necessarily like to see.  Shortly after 3:15 am the database page read rate increases but database cache size decreases.   Sometimes that happens, due to other memory commitments.  There's only so much memory and if there's another need, database cache might hafta shrink even if it means more reads and more read waits.  But in this case it isn't apparent there is a need for ~256 gb of SQLOS [Free Memory] to be maintained for over an hour.  It looks like maybe something was anticipated, but never happened?

Sure enough - as in the example we've been looking at previously, there's suddenly a huge swing in [Target pages], and it predicts the change in [Free Memory] retention.

Adjust the ratio of Y-axes to 8:1 to account for the left axis measured in KB, and the right axis measured in 8kb pages...

Now [Target pages] is pretty nicely redictive of [Database Cache Memory (KB)].  Until nearly 4:45 am, anyway, when [Target pages] goes back to the stratosphere. 

So for well over an hour, on this SQL Server 2019 instance on a nicely laid out 4x22, 88 vcpus vm SQLOS maintained over 256 gb of [Free Memory].  While also under considerable database page read load.  That very likely lead to much higher page re-reads than would have been necessary otherwise.  Other than that - how fairly distributed among the SQLOS memory nodes was production and maintenance of [Free Memory]?  Ouch. The intensity of page freeing on individual SQLOS nodes is just as intense on this 4 node system as it was on the 7 node system.  Individual nodes still get down to almost no database cache.  But conditions seem to change faster - things stay really bad for less time than on the 7 node system.

I haven't looked at a 2 node system yet.  Or an 8 node sysem.  I definitely predict the 8 node system will look... scary.  I might be able to round up a 2 node system before an 8 node system to evaluate.

#SQLServer Database Cache Memory Magic at SQLOS Node Level -- Chaotic or Even A Little Bit Sinister

 This blog post continues to look at a topic I blogged about yesterday.

Yesterday's first post:

sql.sasquatch: #SQLServer Column Store Object Pool -- the Houdini Memory Broker Clerk AND Perfmon [\SQLServer:Buffer Manager\Target pages] (

And here's a post I finished later today, following this one.

sql.sasquatch: #SQLServer - Database Cache Suppression (

Hey!! What's up with the lavender boxes in all the graphs in this post!? I'm using the lavender boxes for visual alignment from graph to graph.  Not really necessary in the system-wide graphs since the changes in SQLOS free memory and database cache are so drastic; but they help me to align visually when i look at individual SQLOS nodes where the pattern gets more complicated.

To briefly recap: we looked at a system where SQLOS [Database Cache Memory] and [Free Memory] experienced dramatic, aggressive changes. Those changes were not related to database page read rates, evn though that is the primary inflow to database cache.

We saw that aggressive changes to [Target pages] predicted the aggressive changes to SQLOS database cache and free memory.  The maximum value for [Target pages] in this example is not attainable on the system; it is much larger than the SQL Server [Max Server Memory] setting.

The lower values for [Target pages], however, are very predictive of the [Database Cache Memory (KB)] values, once the left and right Y-axes are aligned in an 8:1 ratio.

Currently the only known drivers for these drastic changes in [Target pages] and the consequential changes in database cache and free memory are the entrance and exit of the column store object pool cache.  If I find more, I'll update these posts.

Today I want to look at the effect this has at the SQLOS memory node level.  By default, SQL Server SQLOS creates a memory node at startup for each detected vNUMA node on the system. Worker threads on the schedulers associated with the memory node tend the SQLOS memory on that node.  The patterns above and in yesterday's post look very orderly, even if not optimal.  Today we'll see if that holds true at the SQLOS node level as well.  This particular system currently has 7 vNUMA nodes. (sometimes vNUMA on monster VMs requires a little extra care to get into expected configuration. Hopefully soon this will be a 4x20 eighty vcpu VM with 4 vNUMA nodes instead of its current non-standard configuration.)

Here's memory graphs for each of the 7 current SQLOS nodes.  After these 7, I'll single out two to look at specifically.

Wow. Well, things definitely don't look as orderly at the SQLOS memory node level as the do at the system-wide level.  Let's focus on two of the nodes - 003 and 006. Now, unless someone has done some fancy rigamarole with affinitization, one won't be able to select a SQLOS scheduler node or memory node for a session's connection to be associated with.  Similarly, without fancy affinity won't be able to reliably predict which scheduler node(s) or memory node(s) will be home for parallel worker activity in the case of parallel queries (the nodes for parallel workers can be different than the scheduler node/memory node for the execution context ID 0 worker thread).

Imagine a DOP 1 query that handles a large table scan.  Because the high estimate of rows is past the tipping point(and return of a column absent from any index), table scan is always chosen rather than index access. The table being scanned is 20 GB in size - but assume no part of the table is in cache at the start of the hypothetical activity.  A session runs that query 6 times in succession, with different filter values (but each time its a full table scan).

This happens once between 7:30 am and 8:30 am, and once between 10:00 am and 11:00 am.

What would you expect to happen if both of those sessions performed their DOP 1 queries on node 003 (the top graph immediately below)? What would be different if instead those sessions performed their queries on node 006 (the lower of the graphs immediately below)?

OK, OK. I've shown the possibility of performance-by-lottery in this case. The hypothetical activity on node 003 might be able to cache the entire 20 gb table on the first scan, with second through sixth queries getting 100% cache hit.  The hypothetical ativity on node 006 has no chance of doing so. Each scan of that 20 gb table will have to read the full 20 gb table - and absorb any associated pageio_latch waits.

But is there any indication something like that actually happened? Yes, actually.

First of all, CPU utilization across the whole VM was cookin' pretty good.

Now let's look at the individual SQLOS memory nodes, with CPU utilization for the associated vNUMA nodes.

vNUMA node 3 has very high CPU utilization throughout.

vNUMA node 6 has spikier CPU, and it rarely is as high as on vNUMA node 3.  That's not entirely surprising to me.  The query activity responsible for CPU activity on vNUMA node 6 is experiencing many more pageio_latch waits than on vNUMA node 3 - because there is almost no database cache.  From 10:00 am - 11:00 am there is another indication of query activity on memory node 006 - other than the CPU activity on vNUMA node 6.  The continued growth of stolen memory on SQLOS memory node 006 is most likely due to growth of memory stolen against query workspace memory grants, for sort/hash activity and the like.

OK. So now I've shown that the amount of database cache and free memory on a system can change drastically due to the influence of [Target pages]. [Target pages] itself can dramatically change due to entrance/exit of the column store object pool memory broker clerk.  And even if these patterns look somewhat orderly at the system level, on large NUMA systems they may look chaotic or downright sinister.

Monday, December 6, 2021

#SQLServer Column Store Object Pool -- the Houdini Memory Broker Clerk AND Perfmon [\SQLServer:Buffer Manager\Target pages]

***** Update 2021 December 7 *****

In the DIY section of this post I added an explanation about the up-to-five second delay between creation of the column store object pool memory broker clerk, and the update to [Target pages].  I also added some modified code with built in WAITFOR '00:00:06' between each operation and checking the memory counters :-) That always gives the desired results... but wow waiting ~36 seconds for it to complete seems like it takes forever.

Happy holidays, y'all.

***** End Update December 7 *****


I use perfmon a lot for SQL Server stuff :-)

Have you ever looked at how the Microsoft Docs describe [\SQLServer:Buffer Manager\Target pages]?

SQL Server, Buffer Manager object - SQL Server | Microsoft Docs

There's not much there. As of 2021 December 6:

Target pagesIdeal number of pages in the buffer pool.

That's not much to go on. And I'm a very inquisitive individual.

So much so that I wrote 3 posts (so far) about this issue.

The second post:

sql.sasquatch: #SQLServer Database Cache Memory Magic at SQLOS Node Level -- Chaotic or Even A Little Bit Sinister (

The third post:

sql.sasquatch: #SQLServer - Database Cache Suppression (

When I saw the memory graph below for a SQL Server 2017 system, my interest was piqued.

Lots of database pages were being read into cache as seen in the graph below, and there isn't any seeming correlation between the page read rate and the size of database cache.  Hmmm.

Database page reads are usually the major cause of inflow into database cache, and as such are a primary driver (together with database cache size) of the calculated [\SQLServer:Buffer Manager\Page life expectancy] counter. Considering the graphs immediately above and below, as one would expect, at a given page read rate PLE is higher when database cache is larger.

Given the database page read rate and PLE as seen above, why would SQL Server choose to amass and maintain for long periods of time ~30% of [Total Server Memory] as [Free Memory]? Especially since it has to constrain [Database Cache Memory] to about 50% of [Total Server Memory] to do that?

Sure enough, our friend [\SQLServer:Buffer Manager\Target pages] seems to be involved.  The maximum value of the measure in orange below, against the Y-axis on the right, is 2456027136 pages. With 8 kb pages, that works out to 18738 gb. That's an awful lot since [Max Server Memory] is pretty close to 1024 gb. This system is a vm with 1200 gb vRAM.

What if I change the ratio between the left Y-axis and the right Y-axis? The unit of measure on the left is kb, and on the right it's 8 kb pages. So what happens if I use an 8:1 ratio for the measures? Same maximum of 1280 million for the left Y-axis. But instead of 3000 million as a maximum on the right Y-axis, a maximum of 160 million.

Yeah, that's better! Suddenly [\SQLServer:Buffer Manager\Target pages] at the low end has a much more intuitive relatonship with the database cache size. Now [Target pages] is better predictive of database cache size, when it's NOT equal to 2456027136 pages. :-)

So, what event or condition can lead to such a drastic change in [\SQLServer:Buffer Manager\Target pages], with such significant consequences for database cache size?  For now, I only know of one event that changes [Target pages] from a seemingly too-high-to-be-achieved number, to an achievable number which *may* result in a database cache being artificially constrained in size on a given system: creation and presence of the column store object pool memory broker clerk.

The buffer pool memory broker clerk is ever-present in a SQL Server instance.  But the column store object pool memory broker clerk is a Houdini-kinda thing.  It doesn't get created until buffer pages for a columnstore segment are read and need to be serialized into the object pool.  Writing to or reading from delta stores isn't sufficient to create or sustain the column store object pool. Using an index reorganize on delta store contents only to compress a new columnstore row group won't do it.  Bulk insert of over 102400 rows with sufficient memory grant to immediately compress into a new rowgroup - nope, that won't create or sustain the column store object pool memory broker clerk, either.  Have to read buffer pool pages for a columnstore segment.  And once there are no more buffer pool pages in the buffer pool... poof! That Houdini-like column store object pool memory broker disappears! And [\SQLServer:Buffer Manager\Target pages] returns from a maybe-reasonable, maybe-too-restricted value to an unachievable-on-this-system value.

Maybe there's another event or condition that can cause this Jeckyll-Hyde transformation. I don't know of one yet, but I'll keep searching, and asking.

Maybe you'd like to try this yourself? You can, on SQL Server 2017 or 2019. (Pretty sure this'll have similar results on SQL Server 2016 and 2022, but I've not verified that.)

Here are some screenshots, and all the code is down at the bottom of the post.

Don't do it in a production instance, please :-) Make sure there isn't an existing column store object pool when you try this, and that potential columnstore work from others while you do this doesn't affect your results. An instance where you are the only user, immediatelu after a clean shutdown and restart should work well.  If you're the only user, checkpoint and dropcleanbuffers or freesystemcache for columnstore object pool should get you to a clean memory slate to start.

Here's the T-SQL query I use to get a quick lookie-loo at memory right after I start up the SQL Server instance.  Notice there is no column store object pool memory broker clerk. Not just zero in size - the row doesn't exist in the DMV - hence my LOJ to the DMV on 1=1 just to report the NULL size. Look at the comparison between [Max Server Memory] in kb and [Target pages] in kb!  This is my laptop, so MSM is set to 4 gb. That [Target pages] value! 258998272 kb works out to 247 gb. Not gonna get that on my laptop :-)


After verifying that column store object pool didn't exist, I used the code below to get one.

I actually checked after each step: creating the table, inserting 1000 row into delta store, reading 1000 rows from delta store, reorg to compress the delta store into a rowgroup.  Each of those steps showed I still didn't have a column store object pool in the instance. And [Target pages] remained at an unattainable number.

Finally, I read the table contents from the columnstore compressed rowgroup. And that created the memory broker clerk. And brought [Target pages] down from the stratosphere, to a number that could actually be achieved on this system (and thus could serve as a reasonable target). The target of 3695216 kb is just a hair over 88% of the max server memory value of 4 gb or 4194304 kb.

As I mentioned earlier, there may be other events or conditions that can cause this type of shift in [Target pages]. I don't know of any, yet. But I'll update this post when I learn of some.

I have some concerns on both sides of that major change in [Target pages], actually. The really high value is great at making the database cache aggressive, and keeping free memory to a minimum in the instance.  But - does it make SQL Server reluctant to shrink the database cache when it could in order to bring [Total Server Memory] back down to [Target Server Memory] after SQL Server has "colored outside the lines"?  I think it might.  Also, while allowing the database cache to remain aggressive in using up nearly all free memory is very good for PLE, the pressure on internal memory use within [Total Server Memory] may be causing unnecessary delays or CPU usage.  For example, if there's little-to-no free memory while a sort or hash is executing and memory needs to be stolen against the query's workspace memory grant, it would seem to be better to have some free memory around than to have to scrounge for it.
On the other hand, restricting database cache for what seems to be a fixed amount of free memory in some cases can lead to inefficient use of memory.  I definitely think that was the case in the graphs above, where free memory was over 396 gb at 7:30 am, when the read rate was over 240000 pages/s (1.8 gigabytes/s) and PLE ranged between 4 and 24.  A second concern is how evenly distributed memory freeing activity is across SQLOS nodes (this topic deserves at least a full post on its own, if not a series). 

Anyway... here's the code I used in my little DIY above :-)
	msm(max_server_memory__kb) AS
(	SELECT CONVERT(INT, value_in_use) * 1024 FROM sys.configurations 
	WHERE name			= 'max server memory (MB)')
,	tgt(target_pages__kb) AS
(	SELECT cntr_value * 8 FROM sys.dm_os_performance_counters 
	WHERE counter_name	= 'Target Pages')
,	bpool(bpool__kb) AS
(	SELECT total_kb FROM sys.dm_os_memory_broker_clerks
	WHERE clerk_name	= 'Buffer Pool')
,	cs(csopool__kb) AS
(	SELECT total_kb FROM sys.dm_os_memory_broker_clerks
	WHERE clerk_name	= 'Column store object pool')
SELECT *, vers = @@version

CREATE TABLE lonely_cci (c1 INT, INDEX cci__lonely_cci CLUSTERED COLUMNSTORE);
/*	no change in [Target Pages], no cso pool broker clerk yet */

INSERT INTO lonely_cci
FROM master..spt_values;
/*	only 1000 rows inserted, so into delta store 
	no change in [Target Pages], no cso pool broker clerk yet */

SELECT cnt = COUNT(*) FROM lonely_cci;
/*	reading from delta store
	no change in [Target Pages], no cso pool broker clerk yet */

/*	the reorg compresses the delta store into a columnstore rowgroup
	no change in [Target Pages], no cso pool broker clerk yet */

SELECT cnt = COUNT(*) FROM lonely_cci;
/*	reading from the bpool pages associated with the first CCI rowgroup
	initializes the column store object pool memory broker clerk
	and serializes the memory contents of the bpool pages into the pool
	now [Target Pages] is a value reflective/directive of bpool size
	and the object pool memory broker clerk shows up in dm_os_memory_broker_clerks */

/*	-- any of the following will result in eviction of the rowgroup's bpool pages from bpool
	-- once there are no bpool pages for any columnstore segment in the bpool
	--		[Target Pages] returns to a value much greater 
	--			than [Max Server Memory] or [Target Server Memory]
	--		column store object pool memory broker clerk disappears 
	--			from dm_os_memory_broker_clerks
	DBCC FREESYSTEMCACHE('Column store object pool') WITH no_infomsgs;
	TRUNCATE TABLE lonely_cci;
	DROP TABLE lonely_cci; */

***** Update 2021 December 7 *****
The [Target pages] counter is only updated every ~5 seconds.  I don't know if the operational value is changed faster than that with shared memory for counter retrieval only updated every ~5 seconds? Or maybe the Resource Monitor has a ~5 second schedule for actually updating the operational value for [Target pages]? I *think* the operational value is updated immediately and shared memory only updated every ~5 seconds - that's the pattern I've seen with counters related to Resource Governor Workload Group requests completed/second and similar perfmon counters, too. 

When I used:
WAITFOR DELAY '00:00:05'
after each operation and before checking [Target pages], there were still occasional misses of the updated value (eg there was a non-NULL size reported for column store object pool memory broker clerk but [Target pages] was still the same too-big value as when the broker clerk did not exist).  By waiting 6 seconds after each operation before checking I always saw [Target pages] updated as expected: when column store object pool broker clerk existed, [Target pages] was less than [Max Server Memory] (and less than [Target Server Memory]); when column store broker clerk did not exist, [Target pages] was much larger than [Max Server Memory].

Here's some modified code including the WAITFOR DELAY '00:006' if you want to just run the durn thing in one go :-)  It completes in ~36 seconds but it feels like forever before it finishes :-)

CREATE OR ALTER PROCEDURE #memcheck @step NVARCHAR(30) = 'no comment' AS
WAITFOR DELAY '00:00:06';
	msm(max_server_memory__kb) AS
(	SELECT CONVERT(INT, value_in_use) * 1024 FROM sys.configurations 
	WHERE name			= 'max server memory (MB)')
,	ser(target_server__kb) AS
(	SELECT cntr_value FROM sys.dm_os_performance_counters  
	WHERE counter_name	= 'Target Server Memory (KB)')
,	buf(target_pages__kb) AS
(	SELECT cntr_value * 8 FROM sys.dm_os_performance_counters 
	WHERE counter_name	= 'Target pages')
,	bpool(bpool__kb) AS
(	SELECT total_kb FROM sys.dm_os_memory_broker_clerks
	WHERE clerk_name	= 'Buffer Pool')
,	cs(csopool__kb) AS
(	SELECT total_kb FROM sys.dm_os_memory_broker_clerks
	WHERE clerk_name	= 'Column store object pool')
SELECT capture_tm = SYSDATETIME(), target_pages__diy_operation = @step, *, vers = LEFT(@@version, PATINDEX('%(KB%', @@version) - 2)
DROP TABLE IF EXISTS #diy__target_pages;
CREATE TABLE #diy__target_pages
(	capture_tm					DATETIME2
,	target_pages__diy_operation	NVARCHAR(40)
,	max_server_memory__kb		BIGINT
,	target_server__kb			BIGINT
,	target_pages__kb			BIGINT
,	bpool__kb					BIGINT
,	csopool__kb					BIGINT
,	vers						NVARCHAR(40)
,	INDEX cci__diy__target_pages CLUSTERED (capture_tm)) WITH (DATA_COMPRESSION = ROW);

CREATE TABLE #lonely_cci (c1 INT, INDEX cci__lonely_cci CLUSTERED COLUMNSTORE);
INSERT INTO #diy__target_pages EXEC #memcheck N'cci created';
INSERT INTO #lonely_cci SELECT TOP (1000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM master..spt_values;
INSERT INTO #diy__target_pages EXEC #memcheck N'delta store insert';
IF (SELECT COUNT(*) FROM #lonely_cci) = 1 PRINT 'nope';
INSERT INTO #diy__target_pages EXEC #memcheck N'delta store select';
INSERT INTO #diy__target_pages EXEC #memcheck N'reorg';
IF (SELECT COUNT(*) FROM #lonely_cci) = 1 PRINT 'nope';
INSERT INTO #diy__target_pages EXEC #memcheck N'compressed select';
DROP TABLE #lonely_cci; /* see comment below about ways to get rid of column store object pool memory broker clerk */
INSERT INTO #diy__target_pages EXEC #memcheck N'drop table';
SELECT * FROM #diy__target_pages ORDER BY capture_tm ASC;
/*	-- any of the following will result in eviction of the rowgroup's bpool pages from bpool
	-- once there are no bpool pages for any columnstore segment in the bpool
	--		[Target Pages] returns to a value much greater 
	--			than [Max Server Memory] or [Target Server Memory]
	--		column store object pool memory broker clerk disappears 
	--			from dm_os_memory_broker_clerks
	DBCC FREESYSTEMCACHE('Column store object pool') WITH no_infomsgs;
	TRUNCATE TABLE #lonely_cci;
	DROP TABLE #lonely_cci; */

And this was the result when I ran that code on my lil ol' laptop.

Happy holidays!

***** End Update December 7 *****

Monday, July 5, 2021

#SQLServer: There's Something About SQLOS Memory Node Free Memory on NUMA systems...

In 2018 I wrote two blog posts about memory accounting behavior on NUMA servers I was seeing in SQL Server 2016, 2017, and eventually 2019.

Some amount of memory counted by a SQLOS memory node as [Stolen Memory] was *also* counted by another SQLOS memory node as [Database cache].

SQL Server 2016 Memory Accounting: A Suspicious Surprise
2018 July 2

SQL Server 2016 Memory Accounting Part II: Another Suspicious Example
2018 October 25

At first I thought it likely to be a reporting error only - something that made the numbers not tie out unless it was taken into consideration.

Maybe such a condition could exist without bubbling up as a performance degradation or error situation?

As I continued to observe the issue, however, I took note of the ways this double-counting of memory could result in performance drag or errors.  I won't detail that here, but it is worth noting that KB4536005, which resolved this condition, lists out of memory errors as a potential consequence of this accounting error.

#SQLServer 2019 CU2
#SQLServer 2017 CU20
#SQLServer 2016 SP2 CU15
KB4536005 - FIX: Fix incorrect memory page accounting that causes out-of-memory errors in SQL Server

At about the same time as noticing that some memory was double-counted among the SQLOS nodes, I noticed something else.  Occasionally, one or more SQLOS memory nodes would report an outlandishly high number as the free SQLOS memory on that node.  The number was obvious nonsense as it was higher than the amount of vRAM available to the vm.

I rationalized this as probably the result of a race condition at a time of low free memory - values for two memory measures reported at slightly different times yielding a negative number which came through the performance object as an absurdly high positive number.  Because there were other memory issues to chase after, I was satisfied with the answer I provided myself, and willing to believe the condition could exist without any appreciable effect on performance or errors.

It's a few years later now and... uh... I no longer believe that.

I've seen cases where that ridiculously high SQLOS node free memory values are reported on the same SQLOS node every 30 seconds for two straight hours.  I've seen cases where two of four SQLOS nodes are reporting the absurd high values at the same time.  Nope, doesn't fit the profile of benign race condition anymore.  Maybe I could just let it go if one of the SQL Server instances where this occurs doesn't also have active investigation for frequent buffer latch timeouts and for SQL Server seemingly failing to respond to low memory conditions.

OK, so I have to dig in.

In this post I will stray from my typical, graph heavy presentation. This one works better as tables, because the free memory involved is so doggone small compared to database cache and stolen memory it's really hard to find on the graph.

All of the tables below are from a single system, running SQL Server 2019 cu9. Perfmon is capturing lots of counters every 30 seconds.

First we'll look at instance-wide numbers from [\SQLServer:Memory Manager].

[Total Server Memory (KB)] is in column 2, with a purple header. 
[Database Cache Memory (KB)] is in column 3 with a light blue header. 
[Stolen Server Memory (KB)] is in column 4, also with a light blue header.

Column 5, [Calculated Free Kb] has a light pink header. It is the result of this calculation in each interval:
[Total Server Memory (KB)] - [Database Cache Memory (KB)] - [Stolen Server Memory (KB)]

Column 6 [Free Memory (KB)] has an orange header. Notice below that it always has the same value as the preceding column, my calculated free memory.

I expect to see the same relationship at the SQLOS memory node level, as we look at [\SQLServer:Memory Node(*)] counters.  Indeed in each of the tables below this calculation for the memory node holds true, as the similar calculation holds true for the memory manager:

[\SQLServer:Memory Node(*)\Total Node Memory (KB)]
- [\SQLServer:Memory Node(*)\Database Node Memory (KB)]
- [\SQLServer:Memory Node(*)\Stolen Node Memory (KB)]
= [\SQLServer:Memory Node(*)\Free Node Memory (KB)]

When I showed the tables above to my colleagues, they were onto me right away. "Where is node 002?"

Well... yeah. Node 002 looks a little different.  I've only seen three distinct absurd [Free Node Memory (KB)] values so far: 18446744073709300000, 18446744073709400000, 18446744073709500000.

Considering that each of the 4 vNUMA nodes on this system has 640,000 mb vRAM, maybe the lowest value above of -154608 kb doesn't seem like it could cause that much trouble. That's about -150 mb.  And of course free memory can't really be negative. So there's a memory accounting error of at least 150 mb. How bad could that be?

Well... low memory condition signal is when less than 200 mb available memory, right?
In some cases, a memory accounting error of 150 mb could defintely cause trouble. Even on a system with a massive amount of memory.