These results are from a 4 vCPU VMware VM. The test is an offline CREATE INDEX at MAXDOP 4. This VM - running SQL Server 2014 with trace flags 1117 and 1118 globally enabled - has two equally sized tempdb data files, each on its own Windows volume in the guest (no other files in those NTFS filesystems). The tempdb Windows volumes, like the other volumes holding persistent database files or transaction logs (including the tempdb transaction log), are distributed among 3 pvscsi vHBAs. But the underlying storage is by no means supercharged here. This system isn't a full performance-at-scale testing rig. More like dev & exploration 😁
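If you want to verify a similar setup, here's a minimal sketch of the checks I'd run - the global trace flag list and the tempdb data file layout. Nothing here is specific to my VM:

-- Confirm globally enabled trace flags (expecting 1117 and 1118 here)
DBCC TRACESTATUS(-1);

-- Confirm tempdb data file count, size, and placement (size is in 8 KB pages)
SELECT name, physical_name, type_desc, size * 8 / 1024 AS size_mb
FROM tempdb.sys.database_files;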
The 4 parallel threads (the dark blue area) are keeping the vCPUs 100% busy (the red dots). The CREATE INDEX is the only active request in the default workload group (the really light blue dots), and there are no other workload groups except the internal group. Perfmon was collecting metrics at 1-second intervals. Notice that it caught a task waiting for a page latch for a few seconds!! (That's the light blue dashed line.) But - that waiter showed up at a time of reported 100% vCPU utilization. Any optimization that removed that wait with no other changes would simply see the time replaced with signal wait or sos_suspend_queue time.
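If you'd rather catch a waiter like that from a DMV than from perfmon, here's a rough sketch of the poll I'd use - the PAGELATCH filter and the tempdb dbid prefix are just my starting point:

-- Page latch waiters against tempdb (database_id 2); for PAGELATCH waits
-- resource_description is in the form dbid:fileid:pageid
SELECT wt.session_id, wt.wait_type, wt.wait_duration_ms, wt.resource_description
FROM sys.dm_os_waiting_tasks AS wt
WHERE wt.wait_type LIKE 'PAGELATCH%'
  AND wt.resource_description LIKE '2:%';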
Well... how about query memory and tempdb usage? The query received and used a memory grant of almost 4 GB, consuming it in a step-function manner as seen in the red dotted line below. The one-second perfmon collection interval really shows the microburst nature of this tempdb activity. Those one-second bursts reach up to 500 MB/sec of writes. That type of write throughput can be really punishing to shared storage.
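The grant itself is easy to watch alongside perfmon. This is roughly the query I'd sample a few times during the build to compare granted vs. used workspace memory:

-- Granted vs. used workspace memory for active requests, in MB
SELECT session_id, dop,
       granted_memory_kb / 1024 AS granted_mb,
       used_memory_kb / 1024 AS used_mb,
       max_used_memory_kb / 1024 AS max_used_mb
FROM sys.dm_exec_query_memory_grants;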
Let's take a look at the tempdb footprint during this CREATE INDEX. The bursts of writes lead to pretty steady tempdb footprint growth throughout the activity.
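My expectation is that the spilled sort runs show up as internal object pages in tempdb. Here's a sketch of how I'd sample the footprint between perfmon intervals to confirm:

USE tempdb;
-- Tempdb space usage by category, in MB (page counts are 8 KB pages)
SELECT SUM(user_object_reserved_page_count) * 8 / 1024 AS user_obj_mb,
       SUM(internal_object_reserved_page_count) * 8 / 1024 AS internal_obj_mb,
       SUM(unallocated_extent_page_count) * 8 / 1024 AS free_mb
FROM sys.dm_db_file_space_usage;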
Let's take a look at writes per second to these two Windows volumes, rather than write bytes. The shape of the area graph is identical - only the unit of measure has changed. Throughout the operation the writes are 64 KB extents. Since this CREATE INDEX is running in isolation, the writes and write bytes are divided almost exactly evenly between the two files. (Although it's kind of hard to tell in these graphs - next time I'll have to grab virtual file stats before and after, too, to show the breakdown.)
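For reference, this is the kind of before-and-after capture I have in mind - a sketch only; the temp table name is a placeholder, and file_id 2 is excluded because that's the tempdb log file in a default layout:

-- Snapshot tempdb data file write stats before the CREATE INDEX...
SELECT file_id, num_of_writes, num_of_bytes_written
INTO #vfs_before
FROM sys.dm_io_virtual_file_stats(2, NULL)
WHERE file_id <> 2;

-- ...then run the CREATE INDEX and diff against a second snapshot
SELECT vfs.file_id,
       vfs.num_of_writes - b.num_of_writes AS writes,
       (vfs.num_of_bytes_written - b.num_of_bytes_written) / 1048576 AS mb_written
FROM sys.dm_io_virtual_file_stats(2, NULL) AS vfs
JOIN #vfs_before AS b ON b.file_id = vfs.file_id
WHERE vfs.file_id <> 2;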
OK, one last graph I'll share from this experiment: vCPU utilization vs write bytes. With more vCPUs and a higher DOP for the operation (or two concurrent operations at DOP 4), I'd expect more work to be done per unit of time, and the write rate to increase accordingly. I kind of cheated: before I ran this test I looked at some of my old 1-second captures from an 8 vCPU system on a monster server :-) That test was with a varied workload of complex, long-running SQL queries at query concurrency 16 (enforced by a Resource Governor active request limit). That system did hit peaks of over 1.2 GB/s of writes to its two tempdb files, though. 😊
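For the record, the active request limit on that system was a Resource Governor workload group setting. A sketch of the kind of configuration I mean - the group name is a placeholder, and the classifier function that routes sessions to the group is omitted:

-- Cap concurrent requests at 16 and DOP at 8 for the workload group
CREATE WORKLOAD GROUP reporting_wg
WITH (GROUP_MAX_REQUESTS = 16, MAX_DOP = 8);

ALTER RESOURCE GOVERNOR RECONFIGURE;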
Now, my working assumption is that sort_in_tempdb has the same potential allocation page bottleneck as sort and hash spill, and ultimately the same allocation page contention that we know and love from systems where session counts many times the (v)CPU count put pressure on lots of temp tables with create/destroy/populate cycles.
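For clarity, the command under test has this general shape - table, index, and column names are placeholders, not the actual test objects:

-- Offline index build at DOP 4 with sort runs written to tempdb
CREATE INDEX ix_test ON dbo.test_table (col1)
WITH (SORT_IN_TEMPDB = ON, ONLINE = OFF, MAXDOP = 4);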
But on these systems the target for query concurrency is typically 2 to 3 times the vCPU count, with a MAXDOP no higher than 8. Of the queries running, only 2 to 4 are expected to spill concurrently at any given time. There is no significant contending concurrent use of #temp or ##globaltemp tables among the concurrent queries.
I believe that type of workload (where spill is the main consideration) has a very different profile from the typical use cases that drive tempdb file count based on (v)CPU count. For a number of reasons, I think on these systems - even up to 32 physical cores - 2 or 4 tempdb data files (each on its own Windows volume, or co-located if onboard SSD/flash is used for tempdb) will serve better than 8 or more. Hopefully I'll articulately explain why later...
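By 2 or 4 data files I mean equally sized, with identical - or disabled - autogrowth so round-robin allocation stays even. A sketch with placeholder names, paths, and sizes:

-- Two equally sized tempdb data files on separate volumes (placeholder paths and sizes)
ALTER DATABASE tempdb MODIFY FILE (NAME = tempdev, SIZE = 16384MB, FILEGROWTH = 0);
ALTER DATABASE tempdb ADD FILE (NAME = tempdev2, FILENAME = 'T:\tempdb2.ndf', SIZE = 16384MB, FILEGROWTH = 0);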
For now it's off to check on resources for a larger scale run of this test - with a single command at high DOP and multiple concurrent commands at lower DOP - to see how the waiters, CPU utilization, and write activity play out.