Tuesday, April 2, 2013

Detecting merry-go-round scans? Rambling about trace flags 8048 and 8015

OK... anyone know of a way to detect/identify merry-go-round scans with SQL Server extended events or another clever way? Its got to be on a loaded system with a performance test workflow that simulates various DSS/analytics batch report loads. So firing multiple queries and counting physical IO in comparison to the size of the scan objects won't work :) I actually want to run the workflow and quantify the merry-go-round scans somehow... then make a change (startup flags 8048 + 8015), repeat the workflow and compare results.

Miscellaneous rambling to follow...

I've spent a lot of time researching, thinking, and arguing about SQL Server NUMA behavior with respect to the ETL and batch report workflows I work with. And arguing about how or why the combination of trace flag 8048 and 8015 would help. (Please, please don't use trace flag 8015 without trace flag 8048 on a busy system. Maybe you'll never hear that warning from anyone other than me. And I'm not gonna give you test results to support my warning. Don't have time. When I draw some pictures for this, maybe they'll be good enough that its clear why 8015 should always be partnered with 8048.)

More than a few folks thought I was tilting windmills: this stuff has been around for 8 years, ol' sasquatch, you really think you found something so simple that NOBODY else seems to care about? After reducing a 20 hour elapsed time to a 4 hour elapsed time with trace flag 8048 and no other changes, I sat in on the phone call where "an expert" wanted to talk about the root cause of poorly tuned queries. Excuse me? EXCUSE ME?!? OK... if we took 16 hours out of the elapsed time with one configuration change, how you gonna wiggle query tuning (which can ONLY reduce the remaining 4 hours of elapsed time) into a "root cause" and wiggle system behavior surrounding NUMA configuration out of that spot?

But I think I'm over that now. Maybe. Those hours of discussion, and the hours I spent researching outside of the discussions, are finally starting to pay off. Some structured in-house performance testing showed repeatable gains with trace flag 8048 + 8015. I'll post more details, and some gangly diagrams of what I think is happening, soon. But in a nutshell what we saw was a ~12% reduction in batch report window elapsed time, and a 25% reduction in total disk read IO during the batch (so there was also a peak and average disk throughput reduction during the report batch window). This was on a 4 socket, 4 NUMA node server. The disk IO reduction was smaller in similar tests run on a 2 socket, 2 NUMA node server - that was exactly what I expected. There are four socket, 8 NUMA node AMD servers out there (due to the glued architecture of at least some AMD processors) - these changes should make their DBAs dance! (Well... if those DBAs work with workflows similar to the ones I work with most of the time, anyway.)

The reduction in IO was expected. The reduction in elapsed time was a little surprising. I've been tracking a particular performance degradation that seems to cohabit with persistent large numbers of "foreign pages" showing up in perfmon for 2 or more NUMA nodes and decreasing database efficiency per CPU percent utilized... but our test systems didn't bring this nasty condition about. So I was pretty much expecting reduced read IO and the same elapsed time with maybe a bit lower CPU utilized, due to less disk IO being managed, and a small amount less MSSQL buffer management. I was pleasantly surprised with the reduction in elapsed time.

But I can't let it rest at "pleasantly surprised". Its part of who I am. Performance degradation AND improvement should both be predictable in my book. The level of benefit may not be predictable due to interrelated factors in the system... but if I make a change and it helps, I can't be completely comfortable until I at least think I understand why.

Enter the merry-go-round scan, also known as advanced scans. Its a feature that is only in editions similar to Enterprise Edition (for example, Developer, Evaluation, Data Center) and absent from Express and Standard. Here's the idea: for a large enough full tablescan, why not let another query that needs to perform the full tablescan glom right on to the current scan wherever its at? One the scan is complete, the johnny-come-lately can complete its work by re-scanning the stuff it missed.

It'll reduce IO if the table is a "buffer-buster". And I think this may be the mechanism by which the elapsed time of the batch report workflow was reduced. But I'd kinda like to know :)


A few notes about these tests. The test database was created from scratch via massive ETLs. After the ETLs, indexes were enabled and built, then defragmented.

After that, there were no modifications to the persistent tables between test runs of the report batches. So, the index rebuilds eliminated fragmentation to the extent they could, and nothing introduced additional fragmentation between the comparative performance runs of the batch report workflows.

Also, none of the indexes rebuilt were "no recompute", so all of the statistics were based on fullscan. As good as they'd get... and again there would have been no statistics changes between the test runs.

Wanted to make those points to prevent the questions (which would be valuable questions) about whether changed statistics could have lead to more efficient plans, resulting in the reduced elapsed time...


  1. You are leaving out some pretty big details. make and model of the server number of CPU sockets. Number of CPU cores. Amount of physical memory installed. Is this server old? Does is support hardware NUMA? Are all memory banks populated evenly per CPU/Numa node? If TF8048 & TF8015 fixed your performance issues those are the questions I'd be asking. Also, what did the wait stats look like before and after? did you see a ton of os_scheduler_yields or spin locks before the trace flags?

    Glad to see that people aren't afraid to test and test to figure out why something isn't working the way it should!

  2. Yes, indeed. Lots of missing details. My intent here was just to provide enough of a background that folks would understand why I want to detect/identify/count merry-go-round scans.
    There's lots more to mine here... I don't want to post more test details before I've finished my diagrams of the behavior I believe is responsible for the performance effects.
    Ultimately, a single system is not my concern. I am establishing best practices for numerous organizations with similar datasets/ETLs/batch report workflows, across a very rich matrix of server and storage configurations.