Tuesday, December 7, 2021

#SQLServer Database Cache Memory Magic at SQLOS Node Level -- Chaotic or Even A Little Bit Sinister

 This blog post continues to look at a topic I blogged about yesterday.

Yesterday's first post:

sql.sasquatch: #SQLServer Column Store Object Pool -- the Houdini Memory Broker Clerk AND Perfmon [\SQLServer:Buffer Manager\Target pages] (sql-sasquatch.blogspot.com)


And here's a post I finished later today, following this one.

sql.sasquatch: #SQLServer - Database Cache Suppression (sql-sasquatch.blogspot.com)


Hey!! What's up with the lavender boxes in all the graphs in this post!? I'm using the lavender boxes for visual alignment from graph to graph.  Not really necessary in the system-wide graphs since the changes in SQLOS free memory and database cache are so drastic; but they help me to align visually when i look at individual SQLOS nodes where the pattern gets more complicated.

To briefly recap: we looked at a system where SQLOS [Database Cache Memory] and [Free Memory] experienced dramatic, aggressive changes. Those changes were not related to database page read rates, evn though that is the primary inflow to database cache.

We saw that aggressive changes to [Target pages] predicted the aggressive changes to SQLOS database cache and free memory.  The maximum value for [Target pages] in this example is not attainable on the system; it is much larger than the SQL Server [Max Server Memory] setting.

The lower values for [Target pages], however, are very predictive of the [Database Cache Memory (KB)] values, once the left and right Y-axes are aligned in an 8:1 ratio.

Currently the only known drivers for these drastic changes in [Target pages] and the consequential changes in database cache and free memory are the entrance and exit of the column store object pool cache.  If I find more, I'll update these posts.

Today I want to look at the effect this has at the SQLOS memory node level.  By default, SQL Server SQLOS creates a memory node at startup for each detected vNUMA node on the system. Worker threads on the schedulers associated with the memory node tend the SQLOS memory on that node.  The patterns above and in yesterday's post look very orderly, even if not optimal.  Today we'll see if that holds true at the SQLOS node level as well.  This particular system currently has 7 vNUMA nodes. (sometimes vNUMA on monster VMs requires a little extra care to get into expected configuration. Hopefully soon this will be a 4x20 eighty vcpu VM with 4 vNUMA nodes instead of its current non-standard configuration.)

Here's memory graphs for each of the 7 current SQLOS nodes.  After these 7, I'll single out two to look at specifically.

Wow. Well, things definitely don't look as orderly at the SQLOS memory node level as the do at the system-wide level.  Let's focus on two of the nodes - 003 and 006. Now, unless someone has done some fancy rigamarole with affinitization, one won't be able to select a SQLOS scheduler node or memory node for a session's connection to be associated with.  Similarly, without fancy affinity won't be able to reliably predict which scheduler node(s) or memory node(s) will be home for parallel worker activity in the case of parallel queries (the nodes for parallel workers can be different than the scheduler node/memory node for the execution context ID 0 worker thread).

Imagine a DOP 1 query that handles a large table scan.  Because the high estimate of rows is past the tipping point(and return of a column absent from any index), table scan is always chosen rather than index access. The table being scanned is 20 GB in size - but assume no part of the table is in cache at the start of the hypothetical activity.  A session runs that query 6 times in succession, with different filter values (but each time its a full table scan).

This happens once between 7:30 am and 8:30 am, and once between 10:00 am and 11:00 am.

What would you expect to happen if both of those sessions performed their DOP 1 queries on node 003 (the top graph immediately below)? What would be different if instead those sessions performed their queries on node 006 (the lower of the graphs immediately below)?

OK, OK. I've shown the possibility of performance-by-lottery in this case. The hypothetical activity on node 003 might be able to cache the entire 20 gb table on the first scan, with second through sixth queries getting 100% cache hit.  The hypothetical ativity on node 006 has no chance of doing so. Each scan of that 20 gb table will have to read the full 20 gb table - and absorb any associated pageio_latch waits.

But is there any indication something like that actually happened? Yes, actually.

First of all, CPU utilization across the whole VM was cookin' pretty good.

Now let's look at the individual SQLOS memory nodes, with CPU utilization for the associated vNUMA nodes.

vNUMA node 3 has very high CPU utilization throughout.

vNUMA node 6 has spikier CPU, and it rarely is as high as on vNUMA node 3.  That's not entirely surprising to me.  The query activity responsible for CPU activity on vNUMA node 6 is experiencing many more pageio_latch waits than on vNUMA node 3 - because there is almost no database cache.  From 10:00 am - 11:00 am there is another indication of query activity on memory node 006 - other than the CPU activity on vNUMA node 6.  The continued growth of stolen memory on SQLOS memory node 006 is most likely due to growth of memory stolen against query workspace memory grants, for sort/hash activity and the like.

OK. So now I've shown that the amount of database cache and free memory on a system can change drastically due to the influence of [Target pages]. [Target pages] itself can dramatically change due to entrance/exit of the column store object pool memory broker clerk.  And even if these patterns look somewhat orderly at the system level, on large NUMA systems they may look chaotic or downright sinister.

1 comment: