Wednesday, May 4, 2016

SQL Server: A quick word about large sorts/hashes, lookups/sec, and CPU utilization

Several years ago I began trying to model SQL Server DSS system behavior based on activity rates and resource utilization.  It's an ongoing effort. :-)  This is the motivation for many of my strange-looking perfmon graphs - I'm trying to fit it all together.

One of my first stops was to graph CPU utilization against "page lookups" across many systems.
I quickly learned that spinlock contention could disrupt the correlation between CPU utilization and page lookups or database logical IO.  I suspected as much anyway - I'd seen the same effect on another database platform.
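If you'd rather pull the same numbers from inside SQL Server instead of perfmon, the Buffer Manager counters are exposed through sys.dm_os_performance_counters.  "Page lookups/sec" is a cumulative counter, so you have to sample it twice and divide by the elapsed time.  A rough sketch (the 10-second sample window is just my arbitrary choice):

```sql
-- Page lookups/sec is cumulative: sample twice, divide the delta by elapsed seconds
DECLARE @v1 BIGINT, @v2 BIGINT;

SELECT @v1 = cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Page lookups/sec'
  AND object_name LIKE '%Buffer Manager%';

WAITFOR DELAY '00:00:10';

SELECT @v2 = cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Page lookups/sec'
  AND object_name LIKE '%Buffer Manager%';

SELECT (@v2 - @v1) / 10 AS page_lookups_per_sec;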

It stands to reason that high rates of plan compile, or lots of XML parsing and manipulation, or usage of CPU by CLR, or CPU consumed by something other than SQL Server will likewise disrupt the correlation.  Operations using worktables can also throw things for a loop, whether reviewing perfmon or results from SET STATISTICS IO ON.

But today I want to mention another disrupter of the relationship between CPU utilization and database logical IO: significant query use of large sorts and hashes.

Consider the following perfmon from a DSS system (96 GB of RAM on the system).  Page Lookups/second in blue against the left vertical axis, CPU utilization in orange against the right vertical axis.  I've color-coded this observation period into green, yellow and red portions of time.  During the initial green period of time, there is a strong correlation between CPU utilization and page lookups.  That gives way to the yellow time period, where the relationship between CPU and logical IO comes unglued.  Finally, in the red section the amount of CPU utilized is completely disproportionate to database logical IO.

I don't want to bore you with my "home movie" of the dozens of perfmon graphs I build when I review systems.  :-)  Let's cut to the chase.


The graph above has tempdb footprint (light blue) stacked on top of used query memory (dark blue) against the left vertical axis.  The green period has very limited use of query memory.  During the yellow period, a moderate amount of query memory was used.  During the red period, a large amount of query memory was used and at a number of points operations spilled into tempdb.  As query memory was used more extensively, the CPU:lookups/sec correlation was more disrupted.

Once fully considered, this makes sense: query memory is "stolen" from the database page buffer pool.  References to pages in the page pool are "page lookups", but each time stolen query memory is poked and prodded... well, that's not a page lookup.  But it has CPU cost.
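If you want to catch this in the act, the query memory grants and the instance-wide workspace memory numbers are both queryable.  Something like this sketch shows who is holding query memory right now and how much of the workspace ceiling is granted:

```sql
-- Per-session view: who holds query memory (workspace memory) right now?
SELECT session_id,
       requested_memory_kb,
       granted_memory_kb,
       used_memory_kb,
       max_used_memory_kb
FROM sys.dm_exec_query_memory_grants
ORDER BY granted_memory_kb DESC;

-- Instance-wide view of the same pressure, from the Memory Manager counters
SELECT counter_name, cntr_value AS kb
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Memory Manager%'
  AND counter_name IN ('Granted Workspace Memory (KB)',
                       'Maximum Workspace Memory (KB)');
```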

So if you are reviewing a system that usually has high correlation between CPU utilization and page lookups/second and suddenly the relationship is disrupted: yeah, could be spinlock contention.  Could be lots of other things, too.  Might be large sorts/hashes dominating memory traffic and less reliance on the database cache where "page lookups" are tracked.

OK... I guess that wasn't really a "quick word".

I tried; not sorry. :-)

I got a hunch about SQL Server Fragmentation: Part VI

***Image quality in this post is horrible, I apologize.  I'll replace later.

It's been a while... but I'm feeling a hunch come on again...

I'm working in a database with a single mdf data file.  (That's important context when looking at results of sys.dm_db_database_page_allocations.)

Let's create my favorite kind of two-column demo table - clustered index key on an integer and an uncompressed varchar(4096).  That forces one row per CI page.

Then insert into the table - in reverse numeric order, key values from 4084 to 1 and a varchar just to take up space.
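Reconstructing that setup looks something like the sketch below.  The table and constraint names are my guesses - any names will do - the important parts are the uncompressed varchar(4096) (so only one row fits per 8 KB page) and the descending insert order:

```sql
-- Demo table: the wide varchar forces one row per clustered index page
CREATE TABLE dbo.frag_demo
(
    id  INT           NOT NULL,
    pad VARCHAR(4096) NOT NULL,
    CONSTRAINT pk_frag_demo PRIMARY KEY CLUSTERED (id)
);

-- Insert keys in REVERSE numeric order, 4084 down to 1
DECLARE @i INT = 4084;
WHILE @i >= 1
BEGIN
    INSERT INTO dbo.frag_demo (id, pad)
    VALUES (@i, REPLICATE('x', 4096));
    SET @i -= 1;
END;
```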

How does that look in terms of fragmentation, by the most commonly used measure?  99.9755%, huh?  (See Part I of my musings, "I got a hunch about SQL Server Fragmentation Part I".)
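The "most commonly used measure" here is avg_fragmentation_in_percent from sys.dm_db_index_physical_stats (using my assumed demo table name):

```sql
SELECT index_type_desc,
       page_count,
       avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats
     (DB_ID(), OBJECT_ID('dbo.frag_demo'), NULL, NULL, 'LIMITED');
```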

OK, now let's delete every row with an odd-numbered clustered index value.

How does fragmentation look now?  Pretty much the same, at 99.9511%.  (Are you surprised that the fragmentation measure improved slightly when all that happened was the deletion of every other row?)
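The delete and the re-check, continuing with my assumed table name:

```sql
-- Remove every odd-numbered key; each delete leaves behind an empty page
DELETE dbo.frag_demo
WHERE id % 2 = 1;

-- Re-check the commonly used fragmentation measure
SELECT page_count,
       avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats
     (DB_ID(), OBJECT_ID('dbo.frag_demo'), NULL, NULL, 'LIMITED');
```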

Now, *almost everyone in the world* will tell you that after a certain level of fragmentation you should go with an index rebuild rather than an index reorganization.  Right?  But... I'm the guy that *sometimes* rebuilds indexes at 0.44% fragmentation.  So instead, let's run a reorg on this heavily "fragmented" index.
What do you think the outcome was?
Let's check.
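The reorg itself is a one-liner (again using my assumed names):

```sql
ALTER INDEX pk_frag_demo ON dbo.frag_demo REORGANIZE;
```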

Cool!  From 99.9511% fragmentation to 0.6853% fragmentation!  Not bad for a reorg rather than a rebuild! :-)

As I built the table, rows were inserted into the clustered index with keys decreasing in numeric order.  Traveling the clustered index in increasing key order, the "next data page" after each data page almost certainly has a lower page ID.  That's the cause of the high reported fragmentation percent in the original table, and in the resulting table after delete of all odd-numbered rows.

The index reorg moves many rows into the empty pages (each odd-numbered row delete resulted in an empty page).  But if all the reorg did was smush together the reverse-ordered rows/pages, the resulting reported fragmentation at the end would still be high.

The index reorg ALSO reordered pages as it was able.

Before the index reorg, every data page had a page ID that was HIGHER than its "next page ID", due to the reverse-ordered inserts.  After the reorg, with 2043 data pages in the clustered index, only 3 data pages are left that have page IDs higher than their next page ID (unnatural order, if you will).
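That count of unnaturally ordered pages can be pulled from sys.dm_db_database_page_allocations - the undocumented DMF I mentioned up top, and the reason the single-mdf context matters, since comparing raw page IDs across files wouldn't mean much.  A sketch against my assumed table name:

```sql
-- Count data pages whose page ID is higher than their next page's ID.
-- DETAILED mode is needed to populate next_page_page_id.
SELECT COUNT(*) AS out_of_order_pages
FROM sys.dm_db_database_page_allocations
     (DB_ID(), OBJECT_ID('dbo.frag_demo'), 1, NULL, 'DETAILED')
WHERE page_type_desc = 'DATA_PAGE'
  AND next_page_page_id IS NOT NULL
  AND allocated_page_page_id > next_page_page_id;
```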

So, two takeaways:
  • reorg can re-order data pages into a better logical order in addition to making the index more compact
  • the general guidelines about "reorg in this range of reported fragmentation, rebuild above this level of reported fragmentation, ignore otherwise" don't tell the whole story

 I'm going somewhere with this, y'all, I promise.  It's just gonna take me a while to get there...