Recently on #sqlhelp there was a report of a 1 GB Query Store that was nearly full, with the instance slowing to a crawl. It was difficult to even run diagnostic queries; to resolve it, the service was restarted and Query Store was set to read only (so all contents were retained).
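For reference, that mitigation looks roughly like this - a minimal sketch, with the database name as a placeholder rather than anything from the thread:

-- Put Query Store in read only mode: it stops collecting new data but keeps its contents
ALTER DATABASE [YourDatabase] SET QUERY_STORE (OPERATION_MODE = READ_ONLY);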
Hard to say if the problem was work or wait. Certainly both could have contributed. But most diagnostic queries are pretty light - unless CPU is saturated or there are long waits to compile or get a memory grant, they'll usually run, even if a bit sluggishly.
So I suspect this was a work problem - e.g., with the Query Store nearly full, management work became so expensive that user query work was starved for CPU time. That doesn't seem far-fetched, because the number of queries in the 1 GB Query Store was over 100,000 if I recall correctly.
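If the question comes up again, a quick way to see how many queries and plans Query Store is tracking (run in the database of interest):

-- Rough count of what Query Store is currently holding
SELECT
    (SELECT COUNT(*) FROM sys.query_store_query) AS query_count,
    (SELECT COUNT(*) FROM sys.query_store_plan)  AS plan_count;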
There have been issues in the past with the plan cache when it was at or near plan count capacity - it would become a one-in-one-out scenario, with the session that wanted to insert a new plan being responsible for the necessary cleanup. If the hash buckets were really long, walking them became expensive.
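As an aside, those plan cache hash bucket lengths can still be eyeballed today - something along these lines:

-- Bucket counts and lengths for the two main plan cache stores
SELECT ht.type, ht.buckets_count, ht.buckets_in_use_count,
       ht.buckets_avg_length, ht.buckets_max_length
FROM sys.dm_os_memory_cache_hash_tables AS ht
WHERE ht.type IN (N'CACHESTORE_OBJCP', N'CACHESTORE_SQLCP');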
I'm not sure how size-based cleanup is implemented for the Query Store, other than having seen it documented that it kicks in near 80% of the configured limit. If sessions waiting to compile a query take on the responsibility of cleaning stuff out of the Query Store, there could be a situation very similar to the plan cache conundrum of old.
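Checking how close a Query Store is to that threshold is straightforward enough - run in the database in question:

-- Current Query Store size vs. configured limit, plus its state
SELECT actual_state_desc,
       readonly_reason,
       current_storage_size_mb,
       max_storage_size_mb,
       size_based_cleanup_mode_desc
FROM sys.database_query_store_options;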
If that's happening and CPU utilization is near 100% either server-wide or on a subset of cores, then spinlock stats may become more enlightening than wait stats.
I did see a number of new spinlocks added for Query Store...
SELECT name FROM sys.dm_os_spinlock_stats WHERE name LIKE '%QUERY_STORE%' ORDER BY name
QUERY_STORE_ASYNC_PERSIST
QUERY_STORE_CAPTURE_POLICY_INTERVAL
QUERY_STORE_CAPTURE_POLICY_STATS
QUERY_STORE_HT_CACHE
QUERY_STORE_PLAN_COMP_AGG
QUERY_STORE_PLAN_LIST
QUERY_STORE_SELF_AGG
QUERY_STORE_STMT_COMP_AGG
SPL_QUERY_STORE_EXEC_STATS_AGG
SPL_QUERY_STORE_EXEC_STATS_READ
SPL_QUERY_STORE_STATS_COOKIE_CACHE
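If I were chasing this live, I'd sample those counters twice and look at the deltas rather than the cumulative totals - a rough sketch, not something I tested against the problem instance:

-- Snapshot the Query Store spinlock counters, wait, then diff
SELECT name, collisions, spins, backoffs
INTO #spin_before
FROM sys.dm_os_spinlock_stats
WHERE name LIKE '%QUERY_STORE%';

WAITFOR DELAY '00:01:00';

SELECT s.name,
       s.collisions - b.collisions AS collisions_delta,
       s.spins      - b.spins      AS spins_delta,
       s.backoffs   - b.backoffs   AS backoffs_delta
FROM sys.dm_os_spinlock_stats AS s
JOIN #spin_before AS b ON b.name = s.name
WHERE s.name LIKE '%QUERY_STORE%'
ORDER BY spins_delta DESC;

DROP TABLE #spin_before;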