This question came across my desk today. And I sent an answer I hope was helpful. Honestly the second part of the answer, still forthcoming, might be more what the question was after. But first I'll try to prevent wrong conclusions/evaluarion, later I'll give some ideas how to evaluate in an actionable way.
How does the perfmon active parallel threads counter related to the DOP settings? I guess I assumed it would be something like active threads x DOP = active parallel threads but that doesn’t seem to be right.
My lengthy but not (yet) graph-y response 😊
The short answer: there isn’t a direct, predictable relationship for active parallel threads to workload group DOP or active requests at the workload group level(or even for sum of all workload group active px threads in a resource pool vs active memory grants in the pool – which is tighter but still unpredictable*). I often add those measures for a workload group (as a stacked graph) and put CPU % for the workload group as a line graph on the other Y axis for trending.
When a parallel plan is prepared, it may contain multiple zones or branches which can operate independently after they branch off and until they are combined. So when the plan is selected, a reservation for DOP * branches parallel workers is made(the session already has an execution context ID 0 worker which is compiling the plan and will do all the work for DPO 1 queries). If the current outstanding reservations plus all execution context ID 0 workers (whether coordinators for parallel queries, DOP 1 queries, or in the process of plan compilation) *plus* the sought reservation exceeds [Max Worker Threads Count] for the instance, DOP will be adjusted downward until it fits. If there is a specific memory grant necessary to support the DOP and that memory grant is unavailable, DOP could be downgraded for memory reasons, too. (I’ve only seen that happen for CCI inserts but I guess it could happen elsewhere) Worker thread pressure can result in parallel queries getting downgraded all the way to DOP 1. (In sys.query_store_runtime_stats you can see last_dop = 1 and if look at linked sys.query_store_plan is_parallel_plan = 1).
I’ve seen a single
***!!REDACTED!!*** query at DOP 8 reserve 200 parallel workers or more!
Now, the trick with parallel workers is that no matter how many branches in the plan and how many total parallel workers, they will *all* go on the same SQL Server schedulers (so all on the same vcpus) and the count of those schedulers/vcpus will be equal to the DOP. (Execution context ID 0 thread can co-locate with parallel threads for the same query, or not. Even if all px threads for a query are within a given autosoftNUMA node or SQLOS memory node, the execution context ID 0 thread can be elsewhere.)
So DOP doesn’t directly govern how many workers a parallel plan will reserve. But it does determine how many vcpus the workers for that query will occupy. The parallel workers for a DOP 8 query on a 16 vcpu system cannot get the vm to more than 50% cpu busy no matter how many of them there are. Because they will be on no more than 8 of the 16 vcpus.
OK, final trick with this: the initial parallel worker thread reservation is similar to a memory grant in that its based on initial understanding and estimates of the plan by the optimizer, while “active parallel worker threads” are determined by current runtime activity.
It’s possible (even likely, really) that a query which reserves 200 parallel worker threads doesn’t actually use them all. If one branch finishes before another branch starts, those workers might be reused. So the reservation may be higher than “active parallel worker threads” in perfmon ever gets.
All of these details can be seen in sys.dm_exec_query_memory_grants. AFAIK every parallel query will have a memory grant. In that DMV, can see reserved parallel threads, active threads at the time, and max px threads used by the query till that point in time.
I’ll create another post about tuning DOP based on perfmon measures later today.
Some additional details from Pedro Lopes of Microsoft in the 2020 July 7 post linked below.
- if a workload is composed solely of queries for which the plans have solely batch mode operators. Then each query will have DOP active parallel workers + 1 execution context ID 0 worker (or occasionally a single active worker for certain plan operators which may force such)
- if the workload is comprised fully of parallel checkdb/checktable workers. In this case the max active parallel workers will be 2*DOP (and the session still has an execution context ID 0 thread). Beware that as of SQL Server 2019 CU9 large scale parallel CHECKDB/CHECKTABLE operations spend a considerable, irreducible amount of time effectively operating at DOP 1.