Monday, December 6, 2021

#SQLServer Column Store Object Pool -- the Houdini Memory Broker Clerk AND Perfmon [\SQLServer:Buffer Manager\Target pages]

Howdy!

I use perfmon a lot for SQL Server stuff :-)

Have you ever looked at how the Microsoft Docs describe [\SQLServer:Buffer Manager\Target pages]?

SQL Server, Buffer Manager object - SQL Server | Microsoft Docs

There's not much there. As of 2021 December 6:

Target pages: Ideal number of pages in the buffer pool.


That's not much to go on. And I'm a very inquisitive individual.

So when I saw this memory graph for a SQL Server 2017 system, my interest was piqued.

Lots of database pages were being read into cache, and there isn't any apparent correlation between the page read rate and the size of the database cache.  Hmmm.

Database page reads are usually the major cause of inflow into database cache, and as such are a primary driver (together with database cache size) of the calculated [\SQLServer:Buffer Manager\Page life expectancy] counter. Considering the graphs immediately above and below, as one would expect, at a given page read rate PLE is higher when database cache is larger.

Given the database page read rate and PLE as seen above, why would SQL Server choose to amass and maintain ~30% of [Total Server Memory] as [Free Memory] for long periods of time? Especially since it has to constrain [Database Cache Memory] to about 50% of [Total Server Memory] to do that?

Sure enough, our friend [\SQLServer:Buffer Manager\Target pages] seems to be involved.  The maximum value of the measure in orange below, against the Y-axis on the right, is 2456027136 pages. With 8 kb pages, that works out to 18738 gb. That's an awful lot since [Max Server Memory] is pretty close to 1024 gb. This system is a vm with 1200 gb vRAM.
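Just to sanity-check that conversion, here's a quick sketch of the arithmetic in T-SQL, using the observed counter maximum as a literal (the literal is the only input; nothing here queries the instance):

/*	[Target pages] is reported in 8 kb pages: multiply by 8 for kb,
	then divide by 1024 twice for gb */
SELECT	target_pages	= CONVERT(BIGINT, 2456027136)
,	target_kb	= CONVERT(BIGINT, 2456027136) * 8
,	target_gb	= CONVERT(BIGINT, 2456027136) * 8 / 1024 / 1024;
/*	target_gb = 18738 -- far beyond the ~1024 gb [Max Server Memory]
	and the 1200 gb of vRAM on the vm */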




What if I change the ratio between the left Y-axis and the right Y-axis? The unit of measure on the left is kb, and on the right it's 8 kb pages. So what happens if I use an 8:1 ratio for the measures? Same maximum of 1280 million for the left Y-axis. But instead of 3000 million as a maximum on the right Y-axis, a maximum of 160 million.

Yeah, that's better! Suddenly [\SQLServer:Buffer Manager\Target pages] at the low end has a much more intuitive relationship with the database cache size. Now [Target pages] is a better predictor of database cache size, when it's NOT equal to 2456027136 pages. :-)

So, what event or condition can lead to such a drastic change in [\SQLServer:Buffer Manager\Target pages], with such significant consequences for database cache size?  For now, I only know of one event that changes [Target pages] from a seemingly too-high-to-be-achieved number, to an achievable number which *may* result in a database cache being artificially constrained in size on a given system: creation and presence of the column store object pool memory broker clerk.

The buffer pool memory broker clerk is ever-present in a SQL Server instance.  But the column store object pool memory broker clerk is a Houdini-kinda thing.  It doesn't get created until buffer pool pages for a columnstore segment are read and their contents need to be serialized into the object pool.  Writing to or reading from delta stores isn't sufficient to create or sustain the column store object pool. Using an index reorganize on delta store contents only to compress a new columnstore rowgroup won't do it.  Bulk insert of over 102400 rows with sufficient memory grant to immediately compress into a new rowgroup - nope, that won't create or sustain the column store object pool memory broker clerk, either.  You have to read buffer pool pages for a columnstore segment.  And once there are no more columnstore segment pages left in the buffer pool... poof! That Houdini-like column store object pool memory broker disappears! And [\SQLServer:Buffer Manager\Target pages] returns from a maybe-reasonable, maybe-too-restricted value to an unachievable-on-this-system value.

Maybe there's another event or condition that can cause this Jekyll-and-Hyde transformation. I don't know of one yet, but I'll keep searching, and asking.

Maybe you'd like to try this yourself? You can, on SQL Server 2017 or 2019. (Pretty sure this'll have similar results on SQL Server 2016 and 2022, but I've not verified that.)

Here are some screenshots, and all the code is down at the bottom of the post.

Don't do it in a production instance, please :-) Make sure there isn't an existing column store object pool when you try this, and that potential columnstore work from others doesn't affect your results while you do this. An instance where you are the only user, immediately after a clean shutdown and restart, should work well.  If you're the only user, a checkpoint plus dropcleanbuffers, or freesystemcache for the column store object pool, should get you to a clean memory slate to start.

Here's the T-SQL query I use to get a quick lookie-loo at memory right after I start up the SQL Server instance.  Notice there is no column store object pool memory broker clerk. Not just zero in size - the row doesn't exist in the DMV - hence my LOJ to the DMV on 1=1 just to report the NULL size. Look at the comparison between [Max Server Memory] in kb and [Target pages] in kb!  This is my laptop, so MSM is set to 4 gb. That [Target pages] value! 258998272 kb works out to 247 gb. Not gonna get that on my laptop :-)

After verifying that the column store object pool didn't exist, I used the code below to get one.

I actually checked after each step: creating the table, inserting 1000 rows into the delta store, reading 1000 rows from the delta store, reorganizing to compress the delta store into a rowgroup.  Each of those steps showed I still didn't have a column store object pool in the instance. And [Target pages] remained at an unattainable number.

Finally, I read the table contents from the columnstore compressed rowgroup. And that created the memory broker clerk. And brought [Target pages] down from the stratosphere, to a number that could actually be achieved on this system (and thus could serve as a reasonable target). The target of 3695216 kb is just a hair over 88% of the max server memory value of 4 gb or 4194304 kb.

As I mentioned earlier, there may be other events or conditions that can cause this type of shift in [Target pages]. I don't know of any, yet. But I'll update this post when I learn of some.

I have some concerns on both sides of that major change in [Target pages], actually. The really high value is great at making the database cache aggressive, and keeping free memory to a minimum in the instance.  But - does it make SQL Server reluctant to shrink the database cache when it could in order to bring [Total Server Memory] back down to [Target Server Memory] after SQL Server has "colored outside the lines"?  I think it might.  Also, while allowing the database cache to remain aggressive in using up nearly all free memory is very good for PLE, the pressure on internal memory use within [Total Server Memory] may be causing unnecessary delays or CPU usage.  For example, if there's little-to-no free memory while a sort or hash is executing and memory needs to be stolen against the query's workspace memory grant, it would seem to be better to have some free memory around than to have to scrounge for it.
On the other hand, restricting database cache for what seems to be a fixed amount of free memory can in some cases lead to inefficient use of memory.  I definitely think that was the case in the graphs above, where free memory was over 396 gb at 7:30 am, when the read rate was over 240000 pages/s (1.8 gigabytes/s) and PLE ranged between 4 and 24.  A second concern is how evenly distributed memory freeing activity is across SQLOS nodes (this topic deserves at least a full post on its own, if not a series).

Anyway... here's the code I used in my little DIY above :-)
; WITH
	msm(max_server_memory__kb) AS
(	SELECT CONVERT(INT, value_in_use) * 1024 FROM sys.configurations 
	WHERE name			= 'max server memory (MB)')
,	tgt(target_pages__kb) AS
(	SELECT cntr_value * 8 FROM sys.dm_os_performance_counters 
	WHERE counter_name	= 'Target Pages')
,	bpool(bpool__kb) AS
(	SELECT total_kb FROM sys.dm_os_memory_broker_clerks
	WHERE clerk_name	= 'Buffer pool')
,	cs(csopool__kb) AS
(	SELECT total_kb FROM sys.dm_os_memory_broker_clerks
	WHERE clerk_name	= 'Column store object pool')
SELECT *, vers = @@version
FROM msm CROSS JOIN tgt CROSS JOIN bpool LEFT OUTER JOIN cs ON 1 = 1;

CREATE TABLE lonely_cci (c1 INT, INDEX cci__lonely_cci CLUSTERED COLUMNSTORE);
/*	no change in [Target Pages], no cso pool broker clerk yet */

INSERT INTO lonely_cci
SELECT TOP (1000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM master..spt_values;
/*	only 1000 rows inserted, so into delta store 
	no change in [Target Pages], no cso pool broker clerk yet */

SELECT cnt = COUNT(*) FROM lonely_cci;
/*	reading from delta store
	no change in [Target Pages], no cso pool broker clerk yet */

ALTER INDEX cci__lonely_cci ON lonely_cci REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON);
/*	the reorg compresses the delta store into a columnstore rowgroup
	no change in [Target Pages], no cso pool broker clerk yet */

SELECT cnt = COUNT(*) FROM lonely_cci;
/*	reading from the bpool pages associated with the first CCI rowgroup
	initializes the column store object pool memory broker clerk
	and serializes the memory contents of the bpool pages into the pool
	
	now [Target Pages] is a value reflective/directive of bpool size
	and the object pool memory broker clerk shows up in dm_os_memory_broker_clerks */



/*	-- any of the following will result in eviction of the rowgroup's bpool pages from bpool
	-- once there are no bpool pages for any columnstore segment in the bpool
	--		[Target Pages] returns to a value much greater 
	--			than [Max Server Memory] or [Target Server Memory]
	--		column store object pool memory broker clerk disappears 
	--			from dm_os_memory_broker_clerks
	CHECKPOINT; DBCC DROPCLEANBUFFERS WITH no_infomsgs; 
	DBCC FREESYSTEMCACHE('Column store object pool') WITH no_infomsgs;
	TRUNCATE TABLE lonely_cci;
	DROP TABLE lonely_cci; */




Monday, July 5, 2021

#SQLServer: There's Something About SQLOS Memory Node Free Memory on NUMA systems...

In 2018 I wrote two blog posts about memory accounting behavior on NUMA servers I was seeing in SQL Server 2016, 2017, and eventually 2019.

Some amount of memory counted by a SQLOS memory node as [Stolen Memory] was *also* counted by another SQLOS memory node as [Database cache].

SQL Server 2016 Memory Accounting: A Suspicious Surprise
2018 July 2
https://sql-sasquatch.blogspot.com/2018/07/sql-server-2016-memory-accounting.html

SQL Server 2016 Memory Accounting Part II: Another Suspicious Example
https://sql-sasquatch.blogspot.com/2018/10/sql-server-2016-memory-accounting-part.html
2018 October 25

At first I thought it likely to be a reporting error only - something that made the numbers not tie out unless it was taken into consideration.

Maybe such a condition could exist without bubbling up as a performance degradation or error situation?

As I continued to observe the issue, however, I took note of the ways this double-counting of memory could result in performance drag or errors.  I won't detail that here, but it is worth noting that KB4536005, which resolved this condition, lists out of memory errors as a potential consequence of this accounting error.

#SQLServer 2019 CU2
#SQLServer 2017 CU20
#SQLServer 2016 SP2 CU15
KB4536005 - FIX: Fix incorrect memory page accounting that causes out-of-memory errors in SQL Server


At about the same time as noticing that some memory was double-counted among the SQLOS nodes, I noticed something else.  Occasionally, one or more SQLOS memory nodes would report an outlandishly high number as the free SQLOS memory on that node.  The number was obvious nonsense as it was higher than the amount of vRAM available to the vm.

I rationalized this as probably the result of a race condition at a time of low free memory - values for two memory measures reported at slightly different times yielding a negative number which came through the performance object as an absurdly high positive number.  Because there were other memory issues to chase after, I was satisfied with the answer I provided myself, and willing to believe the condition could exist without any appreciable effect on performance or errors.

It's a few years later now and... uh... I no longer believe that.

I've seen cases where those ridiculously high SQLOS node free memory values are reported on the same SQLOS node every 30 seconds for two straight hours.  I've seen cases where two of four SQLOS nodes are reporting the absurd high values at the same time.  Nope, that doesn't fit the profile of a benign race condition anymore.  Maybe I could just let it go if one of the SQL Server instances where this occurs didn't also have an active investigation into frequent buffer latch timeouts and into SQL Server seemingly failing to respond to low memory conditions.

OK, so I have to dig in.

In this post I will stray from my typical, graph-heavy presentation. This one works better as tables, because the free memory involved is so doggone small compared to database cache and stolen memory that it's really hard to find on a graph.

All of the tables below are from a single system, running SQL Server 2019 cu9. Perfmon is capturing lots of counters every 30 seconds.

First we'll look at instance-wide numbers from [\SQLServer:Memory Manager].

[Total Server Memory (KB)] is in column 2, with a purple header. 
[Database Cache Memory (KB)] is in column 3 with a light blue header. 
[Stolen Server Memory (KB)] is in column 4, also with a light blue header.

Column 5, [Calculated Free Kb] has a light pink header. It is the result of this calculation in each interval:
[Total Server Memory (KB)] - [Database Cache Memory (KB)] - [Stolen Server Memory (KB)]

Column 6 [Free Memory (KB)] has an orange header. Notice below that it always has the same value as the preceding column, my calculated free memory.


I expect to see the same relationship at the SQLOS memory node level, as we look at [\SQLServer:Memory Node(*)] counters.  Indeed in each of the tables below this calculation for the memory node holds true, as the similar calculation holds true for the memory manager:

[\SQLServer:Memory Node(*)\Total Node Memory (KB)]
- [\SQLServer:Memory Node(*)\Database Node Memory (KB)]
- [\SQLServer:Memory Node(*)\Stolen Node Memory (KB)]
= [\SQLServer:Memory Node(*)\Free Node Memory (KB)]
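If perfmon isn't handy, here's a sketch that does the same node-level arithmetic from sys.dm_os_performance_counters. The object and counter names are the standard ones; the object_name prefix differs for a named instance, which is why the filter below uses LIKE.

/*	one row per SQLOS memory node; on a healthy instance,
	calculated_free__kb should equal free__kb */
SELECT	node			= pc.instance_name
,	total__kb		= MAX(CASE WHEN pc.counter_name = 'Total Node Memory (KB)'    THEN pc.cntr_value END)
,	db_cache__kb		= MAX(CASE WHEN pc.counter_name = 'Database Node Memory (KB)' THEN pc.cntr_value END)
,	stolen__kb		= MAX(CASE WHEN pc.counter_name = 'Stolen Node Memory (KB)'   THEN pc.cntr_value END)
,	free__kb		= MAX(CASE WHEN pc.counter_name = 'Free Node Memory (KB)'     THEN pc.cntr_value END)
,	calculated_free__kb	= MAX(CASE WHEN pc.counter_name = 'Total Node Memory (KB)'    THEN pc.cntr_value END)
				- MAX(CASE WHEN pc.counter_name = 'Database Node Memory (KB)' THEN pc.cntr_value END)
				- MAX(CASE WHEN pc.counter_name = 'Stolen Node Memory (KB)'   THEN pc.cntr_value END)
FROM	sys.dm_os_performance_counters pc
WHERE	pc.object_name LIKE '%:Memory Node%'
GROUP BY pc.instance_name;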




When I showed the tables above to my colleagues, they were onto me right away. "Where is node 002?"

Well... yeah. Node 002 looks a little different.  I've only seen three distinct absurd [Free Node Memory (KB)] values so far: 18446744073709300000, 18446744073709400000, 18446744073709500000.
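Those values look like small negative numbers that have wrapped around as unsigned 64-bit integers. Here's a sketch of the arithmetic, using the -154608 kb figure from the node 002 table (the literal is just for illustration):

/*	2^64 + (-154608) = 18446744073709397008,
	which perfmon's rounded display shows as roughly 18446744073709400000 */
SELECT	two_to_the_64th		= CAST(18446744073709551616 AS DECIMAL(38, 0))
,	wrapped_negative	= CAST(18446744073709551616 AS DECIMAL(38, 0)) - 154608;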



Considering that each of the 4 vNUMA nodes on this system has 640,000 mb vRAM, maybe the lowest value above of -154608 kb doesn't seem like it could cause that much trouble. That's about -150 mb.  And of course free memory can't really be negative. So there's a memory accounting error of at least 150 mb. How bad could that be?

Well... low memory condition signal is when less than 200 mb available memory, right?
In some cases, a memory accounting error of 150 mb could definitely cause trouble. Even on a system with a massive amount of memory.

Saturday, June 12, 2021

#SQLServer Snapshot Isolation Level - increase query CPU time and elapsed time by a factor of 8!

OK. This is a little test in my SQL Server 2019 CU11 instance.  But the behavior I will demonstrate goes way back... and seems like it just may go way forward, too.

First, let's create a test database, with a nice little 3 column CCI table in it.

USE [master];
GO

CREATE DATABASE [test_SI] 
ON PRIMARY (	NAME = N'test_SI'
,				FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL15.V2019CU11\MSSQL\DATA\test_SI.mdf')
LOG ON (		NAME = N'test_SI_log'
,				FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL15.V2019CU11\MSSQL\DATA\test_SI_log.ldf');

ALTER DATABASE [test_SI] SET ALLOW_SNAPSHOT_ISOLATION ON;
GO
USE [test_SI]
DROP TABLE IF EXISTS dbo.SI_DeletedRows;

CREATE TABLE dbo.SI_DeletedRows
(	x_key		[bigint]	NOT NULL
,	x_value		[int]		NULL
,	x_comment	[varchar](512)	NULL
,	INDEX [CCI__SI_DeletedRows] CLUSTERED COLUMNSTORE
);

Let's take 2 minutes (or less) to insert 104,203,264 rows into the clustered columnstore index we just created.

;WITH sixteen AS
(	SELECT num = 1 UNION ALL SELECT 1
        UNION ALL SELECT 1 UNION ALL SELECT 1
        UNION ALL SELECT 1 UNION ALL SELECT 1
        UNION ALL SELECT 1 UNION ALL SELECT 1
        UNION ALL SELECT 1 UNION ALL SELECT 1
        UNION ALL SELECT 1 UNION ALL SELECT 1
        UNION ALL SELECT 1 UNION ALL SELECT 1
        UNION ALL SELECT 1 UNION ALL SELECT 1)
INSERT INTO     dbo.SI_deletedRows
SELECT ROW_NUMBER() OVER (ORDER BY s1.number DESC), s1.number, s1.name FROM master..spt_values s1
CROSS JOIN	master..spt_values s2	
CROSS JOIN sixteen;
/* 104203264 rows
min - 1:37
max - 1:56 */

All right - now let's check on the rowgroup quality.  Pretty nice; 100 rowgroups. 99 of them at maximum size, and 1 with almost 400,000 rows.

SELECT		csrgps.state_desc, num_groups = count(*), rows_in_group = csrgps.total_rows
,		csrgps.trim_reason_desc, csrgps.transition_to_compressed_state_desc
FROM		sys.dm_db_column_store_row_group_physical_stats csrgps
WHERE		csrgps.object_id = object_id('dbo.SI_DeletedRows')
GROUP BY	csrgps.state_desc, csrgps.total_rows, csrgps.trim_reason_desc
,		csrgps.transition_to_compressed_state_desc;


Let's take a minute and delete 1 row out of every 15 from the CCI.

DELETE FROM dbo.SI_DeletedRows WHERE x_key % 15 = 0;
/* 6946884 rows, 6.67%
0:51 */ 

Here's a quick little query against the CCI, in Read Committed transaction isolation level. The query is run all by itself on the instance, at DOP 1.

SET TRANSACTION ISOLATION LEVEL Read Committed
SET STATISTICS TIME ON
SELECT	COUNT(1) FROM dbo.SI_DeletedRows
WHERE	x_comment IN ('money', 'smallmoney', 'offline', 'Little Vito')
OR	x_comment IS NULL
OPTION (MAXDOP 1);
/*  SQL Server Execution Times:
    CPU time = 312 ms,  elapsed time = 327 ms. */

Let's run the query again at DOP 1. But change the isolation level to Snapshot. No other queries are running on the instance, so there is no competition for resources.  There is nothing in the version store, and there are no other open transactions against the objects in the query.  Snapshot isolation all by itself is responsible for increasing the CPU time and the elapsed time of this query on a CCI with deleted rows by a factor of eight.

SET TRANSACTION ISOLATION LEVEL SNAPSHOT
SET STATISTICS TIME ON
SELECT	COUNT(1) FROM dbo.SI_DeletedRows 
WHERE	x_comment IN ('money', 'smallmoney', 'offline', 'Little Vito')
OR	x_comment IS NULL
OPTION (MAXDOP 1);
/* SQL Server Execution Times:
   CPU time = 2625 ms,  elapsed time = 2638 ms. */

I've seen this behavior on production systems.  But in the field, the effect can be even more pronounced than this example.  I've seen more complicated queries involving CCIs have snapshot isolation increase their runtime from under 1 second to over 60 seconds.

That's too bad, since snapshot isolation is a canonical solution for running read analytics queries against the same database that supports read-write workloads.

If Snapshot Isolation does not meet workload goals, RCSI is often evaluated as an option. In this case, RCSI introduces the same performance overhead as snapshot isolation.

An Availability Group readable secondary also experiences the same performance overhead.

Removing all deleted rows from a CCI allows CCI queries in Read Committed and Snapshot Isolation levels to perform comparably (assuming no competition for resources or locks).  But CCI reorgs will only remove deleted rows from rowgroups with at least 10% of their rows deleted.  And CCI rebuilds may not be appropriate given resource and other constraints.
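If you want to see what a reorg would be willing to clean up, here's a sketch against the demo table from above; the 10% reorg behavior is the rub, and the rebuild at the end is the heavy-handed option that does remove the deleted rows.

/*	deleted rows per compressed rowgroup for the demo table */
SELECT	csrgps.row_group_id, csrgps.state_desc, csrgps.total_rows, csrgps.deleted_rows
,	deleted_pct = CONVERT(DECIMAL(5, 2), 100.0 * csrgps.deleted_rows / NULLIF(csrgps.total_rows, 0))
FROM	sys.dm_db_column_store_row_group_physical_stats csrgps
WHERE	csrgps.object_id = OBJECT_ID('dbo.SI_DeletedRows')
ORDER BY csrgps.row_group_id;

/*	at ~6.67% deleted per rowgroup, REORGANIZE leaves these rowgroups alone;
	a rebuild rewrites them without the deleted rows, at the cost of the rebuild itself */
ALTER INDEX [CCI__SI_DeletedRows] ON dbo.SI_DeletedRows REBUILD;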

Tuesday, April 27, 2021

SQL Server: Perfmon active parallel threads vs active requests vs DOP

This question came across my desk today. And I sent an answer I hope was helpful.  Honestly, the second part of the answer, still forthcoming, might be more what the question was after.  But first I'll try to prevent wrong conclusions/evaluation; later I'll give some ideas on how to evaluate in an actionable way.


~~~~

How does the perfmon active parallel threads counter relate to the DOP settings? I guess I assumed it would be something like active threads x DOP = active parallel threads but that doesn't seem to be right.

~~~~

My lengthy but not (yet) graph-y response 😊

The short answer: there isn't a direct, predictable relationship for active parallel threads to workload group DOP or to active requests at the workload group level (or even for the sum of all workload group active px threads in a resource pool vs active memory grants in the pool, which is tighter but still unpredictable*).  I often add those measures for a workload group (as a stacked graph) and put CPU % for the workload group as a line graph on the other Y axis for trending. 

When a parallel plan is prepared, it may contain multiple zones or branches which can operate independently after they branch off and until they are combined. So when the plan is selected, a reservation for DOP * branches parallel workers is made (the session already has an execution context ID 0 worker, which is compiling the plan and will do all the work for DOP 1 queries).  If the current outstanding reservations, plus all execution context ID 0 workers (whether coordinators for parallel queries, DOP 1 queries, or in the process of plan compilation), *plus* the sought reservation exceeds [Max Worker Threads Count] for the instance, DOP will be adjusted downward until it fits.  If there is a specific memory grant necessary to support the DOP and that memory grant is unavailable, DOP could be downgraded for memory reasons, too. (I've only seen that happen for CCI inserts, but I guess it could happen elsewhere.)  Worker thread pressure can result in parallel queries getting downgraded all the way to DOP 1.  (In sys.query_store_runtime_stats you can see last_dop = 1 and, if you look at the linked sys.query_store_plan, is_parallel_plan = 1.)
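If you want to spot those downgrades after the fact, here's a sketch against Query Store (assuming Query Store is enabled in the database):

/*	parallel plans whose most recent execution ran at DOP 1 --
	one symptom of worker thread pressure forcing downgrades */
SELECT	qsp.query_id, qsp.plan_id, qsrs.last_dop, qsrs.last_execution_time
FROM	sys.query_store_plan qsp
JOIN	sys.query_store_runtime_stats qsrs ON qsrs.plan_id = qsp.plan_id
WHERE	qsp.is_parallel_plan = 1
AND	qsrs.last_dop = 1
ORDER BY qsrs.last_execution_time DESC;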

I’ve seen  a single ***!!REDACTED!!*** query at DOP 8 reserve 200 parallel workers or more!

Now, the trick with parallel workers is that no matter how many branches in the plan and how many total parallel workers, they will *all* go on the same SQL Server schedulers (so all on the same vcpus) and the count of those schedulers/vcpus will be equal to the DOP. (Execution context ID 0 thread can co-locate with parallel threads for the same query, or not.  Even if all px threads for a query are within a given autosoftNUMA node or SQLOS memory node, the execution context ID 0 thread can be elsewhere.)

So DOP doesn’t directly govern how many workers a parallel plan will reserve. But it does determine how many vcpus the workers for that query will occupy.  The parallel workers for a DOP 8 query on a 16 vcpu system cannot get the vm to more than 50% cpu busy no matter how many of them there are. Because they will be on no more than 8 of the 16 vcpus.

OK, final trick with this: the initial parallel worker thread reservation is similar to a memory grant in that it's based on the optimizer's initial understanding and estimates of the plan, while "active parallel worker threads" is determined by current runtime activity.

It’s possible (even likely, really) that a query which reserves 200 parallel worker threads doesn’t actually use them all.  If one branch finishes before another branch starts, those workers might be reused. So the reservation may be higher than “active parallel worker threads” in perfmon ever gets.

All of these details can be seen in sys.dm_exec_query_memory_grants. AFAIK every parallel query will have a memory grant. In that DMV, you can see the reserved parallel threads, the active threads at the time, and the max px threads used by the query up to that point in time.
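Here's a sketch of what that looks like live; reserved_worker_count, used_worker_count, and max_used_worker_count are the columns I mean (present on recent versions, so treat their availability on older builds as an assumption):

/*	per request: DOP, the parallel worker reservation,
	workers active right now, and the high-water mark so far */
SELECT	mg.session_id, mg.dop
,	mg.reserved_worker_count
,	mg.used_worker_count
,	mg.max_used_worker_count
,	mg.granted_memory_kb, mg.used_memory_kb
FROM	sys.dm_exec_query_memory_grants mg
ORDER BY mg.reserved_worker_count DESC;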

I’ll create another post about tuning DOP based on perfmon measures later today.

Some additional details from Pedro Lopes of Microsoft in the 2020 July 7 post linked below.

What is MaxDOP controlling? - Microsoft Tech Community

*The exceptions to the unpredictability:

- if a workload is composed solely of queries whose plans have only batch mode operators, each query will have DOP active parallel workers + 1 execution context ID 0 worker (or occasionally a single active worker, for certain plan operators which may force that)

- if the workload is composed entirely of parallel checkdb/checktable operations. In this case the max active parallel workers will be 2*DOP (and the session still has an execution context ID 0 thread).  Beware that as of SQL Server 2019 CU9, large scale parallel CHECKDB/CHECKTABLE operations spend a considerable, irreducible amount of time effectively operating at DOP 1.

Monday, April 26, 2021

A Very Circuitous Answer to a Question about #SQLServer PLE

Usually if I use PLE at all, rather than using PLE as a first-tier indicator of an issue, I use it to confirm issues I’ve detected or suspected based on other evidence.  Below I give some of the reasons.  And ideas to look at rather than PLE.

~~

If the vm has multiple vNUMA nodes, in addition to the [\NUMA Node Memory(*)\*] counters which describe memory utilization at the vNUMA node level, you’ll see [\SQLServer:Buffer Node(*)\*] and [SQLServer:Memory Node(*)\*] counters to describe activity within the SQLOS memory nodes (unless an instance has trace flag 8015 enabled, which disables #sqlserver NUMA detection).

From the Buffer Nodes counters, the counter I use most frequently is [\SQLServer:Buffer Node(*)\Page life expectancy].

[\SQLServer:Buffer Manager\Page life expectancy] is equal to  [\SQLServer:Buffer Node(000)\Page life expectancy] if there is only 1 vNUMA node (or if NUMA detection disabled with T8015).  Otherwise,  [\SQLServer:Buffer Manager\Page life expectancy] is equal to the harmonic mean of [\SQLServer:Buffer Node(*)\Page life expectancy] values.
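A quick way to check that relationship on a live instance (a sketch; on a single-node instance the two values simply match, and nodes reporting a PLE of 0 are filtered out to avoid dividing by zero):

/*	harmonic mean of per-node PLE vs the instance-wide PLE */
; WITH node_ple AS
(	SELECT	ple = CONVERT(FLOAT, cntr_value)
	FROM	sys.dm_os_performance_counters
	WHERE	object_name LIKE '%:Buffer Node%'
	AND	counter_name = 'Page life expectancy'
	AND	cntr_value > 0)
SELECT	harmonic_mean_of_node_ple	= COUNT(*) / SUM(1.0 / ple)
,	buffer_manager_ple		=
	(	SELECT	cntr_value FROM sys.dm_os_performance_counters
		WHERE	object_name LIKE '%:Buffer Manager%'
		AND	counter_name = 'Page life expectancy')
FROM	node_ple;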

Here’s a blog post from 4 ½ years ago where I try to derive overall PLE as harmonic mean of node PLEs. Plus you get to see what my graphs looked like 4 ½ years ago 😊

Harmonic mean of SQLOS Buffer Node PLEs on NUMA servers 

http://sql-sasquatch.blogspot.com/2016/10/harmonic-mean-of-sqlos-buffer-node-ples.html

The main thing is, cache churn on a single vNUMA node can have a pretty significant impact on instance-wide PLE.

And the formula for determining PLE at the SQLOS memory node/vNUMA node level isn’t public.

But, let’s think through the things that would have to go into a page life expectancy calculation.

First would be size of the database cache.  The larger the database cache with a fixed workload, the longer we’d expect a given page to last in database cache.

Then we'd also need a database page insertion rate.  Database pages can be inserted due to page reads, or due to the first row insert to a previously empty page.  Improving performance of a given workload means increasing the pace of page reads, or increasing the pace of first row inserts, or both.  For a given database cache size, increasing the pace of work decreases PLE.  For a given database cache size, decreasing the pace of work will increase PLE.   That's about as short of an explanation as I can provide about why PLE isn't a direct measure of performance or resource utilization.

Then there's the change in database cache size.  The database cache isn't a fixed size.  It has a minimum size of 2% of [Target Node Memory] on each node, once it has crossed that minimum.  And SQL Server tries to keep it to at least 20% of [Target Server Memory].  But for a given workload, the larger the database cache, the higher the PLE.

Now, the formula for PLE isn’t publicly documented.  So this is speculation on my part. But I believe that not only are observed values for sizes and rates used to calculate PLE.  I believe “velocity” or rate of change is used as well.  I’m pretty sure PLE is not only adaptive but predictive.

OK. Well, if I don’t look at PLE until I look at other stuff, what other stuff would I look at first?

I’d start with a stacked graph of [Database Cache] + [Free Memory] + [Stolen Memory], first. Put in [Target Memory], too, as a line graph.

I’ll use some graphs from some work last year.

Especially if there is no use of paging space, I'd move on to memory grants. (If there is use of paging space, like our colleague noticed, it can be valuable to investigate whether MSM is too high, or whether another memory-consuming app is on the VM.  The memory consumer could even be something sending a SQL Server backup to a backup manager like Data Domain, Commvault, etc., if using a method that doesn't bypass file cache, aka not using unbuffered IO. Let's talk 'bout paging space another day.)

Let’s add in CPU for the resource pool.

The part in the blue box looks a bit suspicious. It’s got fewer active memory grants than other “busy” times.  But CPU utilization is higher.

 

Huh. If we look at overall SQL Server memory and this time bring in PLE as a line graph, that period looks pretty exceptional, too. High CPU, and growing PLE.

 

Oh. A huge amount of backup traffic.  (SQL Server backups can use a lot of CPU if they are compressed or encrypted.)   So backup traffic, some active memory grants, some pending memory grants.

And if I look at granted memory vs the maximum allowed workspace memory, I see the dynamic that is leading to pending grants.


Max Server Memory on this instance is 960 GB.  But memory grants for a given resource pool are not calculated directly against MSM.  Rather, they are calculated against the maximum target for the relevant resource semaphore for the resource pool.  In this SQL Server instance, no resource pools have been added beyond default (which contains all of the user workload sessions) and the internal resource pool.  So the graph above works out and makes sense.  (If there are multiple non-internal resource pools - eg default + ORM_unrealistic - that simultaneously have memory grants, a graph like this might not make sense and info from the DMVs may be needed.  Because [Reserved Memory] in perfmon is global across all resource pools.)
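If the per-pool breakdown is needed, the resource semaphore DMV exposes the targets that grants are actually evaluated against. A sketch (resource_semaphore_id 0 is the regular semaphore, 1 is the small-query semaphore):

/*	workspace memory targets and current grant activity, per resource pool */
SELECT	pool_name	= rp.name
,	rs.resource_semaphore_id
,	rs.target_memory_kb, rs.max_target_memory_kb
,	rs.granted_memory_kb, rs.available_memory_kb
,	rs.grantee_count, rs.waiter_count
FROM	sys.dm_exec_query_resource_semaphores rs
JOIN	sys.dm_resource_governor_resource_pools rp ON rp.pool_id = rs.pool_id;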

Notice how when [Reserved Memory] is layered in front of [Granted Memory], only a small portion of [Granted Memory] is visible at the top.  Looking across all the queries and workload groups, as a whole, the queries are asking for wayyy more workspace memory than they need.  Lots of room for tuning.

Now there’s one more unusual thing about the area in the blue box.  That’s a lot of room between the granted memory and the maximum workspace memory. Why are there still so many pending memory grants for so long?  It has to do with what query is at the front of the waiting line, and how big of a grant it wants.

At least one of the workload groups at this time still had a 25% maximum grant per query.  But that difference between max workspace memory and granted memory is way more than 25%, isn’t it?

Here's a rule about memory grants that matters when grants start getting close to the max: for a grant to succeed, 1.5x the grant must be available for granting.  That way, at least half the size of the most recent grant is still available for grants afterward.  It's a way to make sure a small number of very grant-hungry queries don't permanently squeeze out queries that request smaller grants.
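To put hypothetical numbers on that rule (illustrative only, not taken from the system above): with 600000 mb of maximum workspace memory and a workload group allowing a 25% grant per query, a query asking for its full allowance needs quite a bit more than its own grant to be free.

/*	hypothetical numbers for the 1.5x rule */
SELECT	max_workspace__mb	= 600000
,	requested_grant__mb	= 600000 * 0.25		/* 150000 mb requested */
,	headroom_needed__mb	= 600000 * 0.25 * 1.5;	/* 225000 mb must be available
							   for the grant to be given */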

OK, that’s probably more than anyone needed to see today 😊

Just remember – PLE by itself doesn’t tell me much of what’s going on in SQL Server.

In the box below PLE is constantly rising.  But the workloads suffer from the same performance problem inside the box as outside the box – really long resource_semaphore waits for memory grants.