Scheduler stories: OLEDB, the external wait that isn’t preemptive
Before I start digging into the waits experienced by checktable, and in particular the OLEDB wait, I want to take a high-level look at the scalability of checktable operations, starting with checktable with physical_only. Below are graphs and additional information showing checktable executions on my 8 vCPU VM (with 64 GB vRAM), at every maxdop from 1 to 8. (Although I actually ran the tests in descending order of maxdop, for reasons unknown even to me 😜 )
Let's talk about the difference between "performance" and "scalability". Performance is a measure of the pace of work that a workload can achieve on a given system - or, alternatively, a measure of the time required to complete a unit of work on that system. Scalability is the capacity of the workload and system to increase the pace of work as additional resources are added.
So we can describe the performance of a checktable operation on a given table and system at a given degree of parallelism (since the degree of parallelism, as well as the number of (v)CPUs on the system, may both limit the compute resources available to the workload). As the degree of parallelism changes from 1 to the number of (v)CPUs on the system, the performance of checktable may change. Describing that change in performance characterizes the scalability of checktable on that system. In the final analysis, a scalability limit may come from the workload (application-level coding or database-level coding), or it may come from the system.
All of the numbers below are from perfmon - collected at a 1 second interval and logged to a csv. The X axis is the number of seconds since the start of the checktable (1-indexed rather than 0-indexed). Don't worry - as I continue I'll be pulling in lots of stuff from DMVs that can't easily be gleaned from perfmon. :-) But any of the info I grab from DMVs will be a summary of the time period from start to end of the operation, rather than in 1 second increments. Otherwise the observer overhead of the DMV queries is too high for my tastes.
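If you want to build the same kind of time series yourself, here's a minimal sketch of turning a perfmon-style csv into a 1-indexed seconds axis plus a counter series. The counter name and sample rows below are made up for illustration; a real perfmon log has one timestamp column followed by one column per collected counter.

```python
import csv
import io

# Hypothetical perfmon-style csv: one row per 1-second sample.
# The counter column name below is invented for this example.
sample = io.StringIO(
    '"Time","\\\\VM\\Processor(_Total)\\% Processor Time"\n'
    '"01/01/2017 10:00:01","12.5"\n'
    '"01/01/2017 10:00:02","12.4"\n'
    '"01/01/2017 10:00:03","12.6"\n'
)

rows = list(csv.DictReader(sample))

# X axis: seconds since the start of the checktable, 1-indexed as in the graphs.
seconds = list(range(1, len(rows) + 1))
cpu = [float(r['\\\\VM\\Processor(_Total)\\% Processor Time']) for r in rows]

print(seconds)  # [1, 2, 3]
print(cpu)      # [12.5, 12.4, 12.6]
```

From there the (seconds, counter) pairs can be handed straight to whatever graphing tool you like.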
This system is running SQL Server 2016 SP1.
Here's the operation at DOP 1. All of the checktable operations are running in a Resource Pool named after me. In this one you'll see something interesting that I can't explain. For approximately 60 seconds at the tail end of the activity, CPU utilization hovered near 25%. My DOP 1 checktable was the only active request in SQL Server, though (confirmed by looking for active requests in the default and internal Resource Pools, the only other Resource Pools in the instance).
Something outside of SQL Server was using 12.5% CPU in the VM. Oh well... it didn't have enough of a footprint in memory or disk to cause an issue, and my checktable was happily running along on its own vCPU. Notice that neither the rate of disk read bytes nor the rate of logical scan read bytes was able to achieve previous maximum sustained levels, and the CPU usage reported for my resource pool remained at a steady 12.5% throughout.
At DOP 2 on the 8 vCPU VM, the checktable could account for up to 25% of VM-wide CPU utilization. But wait time appears to have increased substantially, and CPU utilization reaches 25% only briefly near the end.
At maxdop 3, checktable could consume up to 37.5% CPU. But it seems to be falling farther from its potential maximum.
Maxdop 4 *could* see up to 50% CPU utilization... but it doesn't get too close.
62.5% is the maximum possible at maxdop 5 - but it seems that increases in wait time must be outpacing declines in elapsed time. This checktable does not seem to be scaling very well.
Maxdop 6 would allow for a maximum of 75% CPU utilization.
Maxdop 7 would allow for up to 87.5% CPU utilization. Even the brief spikes don't get too close.
At maxdop 8 on the 8 vCPU VM, it appears that about half of the CPU time is spent in a wait state.
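The ceilings quoted above are just the degree of parallelism divided by the vCPU count: each parallel worker can keep at most one vCPU busy, so a DOP n request on an 8 vCPU VM tops out at n/8 of VM-wide CPU. A trivial sketch of that arithmetic:

```python
VCPUS = 8  # matches the 8 vCPU VM in these tests

def max_cpu_pct(dop, vcpus=VCPUS):
    """Theoretical maximum VM-wide CPU% for a request at the given DOP."""
    return 100.0 * dop / vcpus

# DOP 1 through 8 -> 12.5, 25.0, 37.5, 50.0, 62.5, 75.0, 87.5, 100.0
for dop in range(1, VCPUS + 1):
    print(dop, max_cpu_pct(dop))
```

How close each run gets to its ceiling - and for how long - is what the time series graphs above are showing.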
I think it's very important to look at time series graphs to understand behavior... hope I didn't lose you to boredom by including all eight of the time series. There's interesting stuff in there that I hope to return to later - especially that always-present peak in CPU utilization in the parallel operations about 2/3 of the way in.
But to compare scalability of checktable across tables, and across options such as physical_only, we'll have to look at things from a higher level.
At any given time, each of the schedulers engaged by checktable is either accruing cpu_ms for the checktable (as reported by sys.dm_exec_requests) or it isn't. Since these tests are running in isolation on the VM, for now we can assume that any scheduler/vCPU time not accounted for in cpu_ms is idle_ms.
So we can use this formula when dop, elapsed_ms, and cpu_ms are known:
DOP * elapsed_ms = cpu_ms + idle_ms
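Rearranged, idle_ms is just the leftover after cpu_ms is subtracted from the total scheduler time. A quick sketch, with made-up numbers (not from the tests in this post):

```python
def idle_ms(dop, elapsed_ms, cpu_ms):
    # DOP * elapsed_ms = cpu_ms + idle_ms, solved for idle_ms:
    return dop * elapsed_ms - cpu_ms

# Hypothetical example: a DOP 4 checktable that ran for 100 seconds
# and accrued 150 seconds of CPU across its workers.
print(idle_ms(4, 100_000, 150_000))  # 250000 ms idle across the 4 schedulers
```

A DOP 1 run that's never off-CPU would show idle_ms of 0; the more the workers sit in wait states, the bigger idle_ms grows relative to cpu_ms.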
That allows the 8 checktable operations to be summarized in this graph. From DOP 1 to DOP 8, the cpu_ms of the operation is extremely steady. From DOP 1 to DOP 4 there are significant decreases in elapsed time as DOP increases. After DOP 4, the reduction in elapsed time is slight. Throughout the tested range, idle_ms increased at a nearly linear rate.
I've got another table, TableB, that I've been working with extensively.
Here's how scalability for checktable with physical_only looked on TableB. Remarkably similar to scalability for TableA. Hmmm...