Tuesday, December 7, 2021

#SQLServer - Database Cache Suppression

Third blog post in a quick series about potential trouble at the high AND low side of [\SQLServer:Buffer Manager\Target pages].  So far I've primarily talked about trouble on the low side of [Target pages], due to artificial restriction on the size of the database cache instance-wide, and an uneven expectation of [Free Memory] across the SQLOS memory nodes which can result in some SQLOS memory nodes having almost no database cache.  This blog post is another example of the uneven demand for [Free Memory] across SQLOS memory nodes.

Here are the first two blog posts, in order, where I started discussing this topic.

#SQLServer Column Store Object Pool -- the Houdini Memory Broker Clerk AND Perfmon [\SQLServer:Buffer Manager\Target pages]

#SQLServer Database Cache Memory Magic at SQLOS Node Level -- Chaotic or Even A Little Bit Sinister

The previous two posts have focused on December 1 on a particular system. That system is unusual in that it has 7 vNUMA nodes.  That's a pretty surprising number, and it comes from the vNUMA configuration of that 80 vcpu vm not being aligned with physical resources beneath.  That'll get straightened out sometime soon and it'll live the rest of its life as a 4x20 as soon as it does.

How much of the problem do I expect to go away once that system is a 4x20?  A little. But it won't remove the problem entirely.

Let's look at a different system in this post - a 4x22, 88 vcpu system set up as good as I know how.

Well... yeah, there's something I don't necessarily like to see.  Shortly after 3:15 am the database page read rate increases but database cache size decreases.   Sometimes that happens, due to other memory commitments.  There's only so much memory and if there's another need, database cache might hafta shrink even if it means more reads and more read waits.  But in this case it isn't apparent there is a need for ~256 gb of SQLOS [Free Memory] to be maintained for over an hour.  It looks like maybe something was anticipated, but never happened?

Sure enough - as in the example we've been looking at previously, there's suddenly a huge swing in [Target pages], and it predicts the change in [Free Memory] retention.

Adjust the ratio of Y-axes to 8:1 to account for the left axis measured in KB, and the right axis measured in 8kb pages...

Now [Target pages] is pretty nicely redictive of [Database Cache Memory (KB)].  Until nearly 4:45 am, anyway, when [Target pages] goes back to the stratosphere. 

So for well over an hour, on this SQL Server 2019 instance on a nicely laid out 4x22, 88 vcpus vm SQLOS maintained over 256 gb of [Free Memory].  While also under considerable database page read load.  That very likely lead to much higher page re-reads than would have been necessary otherwise.  Other than that - how fairly distributed among the SQLOS memory nodes was production and maintenance of [Free Memory]?  Ouch. The intensity of page freeing on individual SQLOS nodes is just as intense on this 4 node system as it was on the 7 node system.  Individual nodes still get down to almost no database cache.  But conditions seem to change faster - things stay really bad for less time than on the 7 node system.

I haven't looked at a 2 node system yet.  Or an 8 node sysem.  I definitely predict the 8 node system will look... scary.  I might be able to round up a 2 node system before an 8 node system to evaluate.

No comments:

Post a Comment