Friday, July 29, 2016

SSAS Multidimensional Cube Processing - What Kind of Work is Happening?

In order to evaluate SQL Server systems, I've worked perfmon data over and over until I was satisfied that I could tell a high level story of what work was being done(rows scanned/sec,rows returned/sec, rows written/sec, log flush bytes/sec) and who was doing that work(active requests, active parallel threads).

I'm starting to work through a similar process for SSAS MultiDimensional Cubes - processing first.

Some graphs from perfmon.

Because Analytics Services is on a separate VM from the RDBMS server, I know that all of the incoming data to be worked on is coming over the wire.

So what kind of relationship is there between network bytes received and CPU utilization?


That trends pretty well.  I've put a box around an interesting area. In that area, bytes received per second is about the same as earlier periods of time.  But CPU is quite a bit lower.
What's being done with the data that is coming in?

Well, that makes sense.  Rows are read, converted, and written.  Looks like writing rows is timer-based, or maybe threshold-based?

But the unusual timeperiod is still unusual.  Rows are read, but not converted or written.  Hmmm.

OK, lets take a look at rows read/converted/written against CPU utilization.


OK.  CPU utilization tracks pretty will with rows read/converted outside of the mystery range following 9:30.
But now 3 additional interesting timeperiods are evident, in the red boxes.  CPU utilization is much higher in the red boxes than would be predicted by rows read/converted.

What's the CPU doing at about 8:05, 8:45, 9:30? (Not forgetting the original question about 9:40 to about 10:15.)

Maybe there's other kinds of work I have included in these graphs?

How about index rows/sec?


Aha!  OK.  So index rows/sec is another part of the picture, and helps to explain some of the CPU utilization not explained by rows read/converted/written.  But it still doesn't explain the tail end of the graph.

I also noticed that there is a counter for rows created/sec.


All right. Got some rows created at 8:05 am, and 8:45 am.  None at 9:30.  And none in the final mystery period.

So the work of read/convert/create/write rows and index rows is a pretty good profile of the CPU work done up until about 9:40 am.

What type of threads were doing the work? What about the non I/O threads in the processing pool.


Wow.  Very weak relationship between the number of these threads and CPU utilization.  These processing pool non-I/O threads may be assigned to tasks, and in that sense "busy".  But in this case, it sure seems like they are mostly busy waiting.

There are a couple more thread types to consider.


So the I/O threads match well to the index rows and created rows.  There's only 1 busy query pool thread over the whole time. 

The processing pool non I/O threads must be doing most if not ALL of the read/convert/write work.  But most of these threads must be spending most of their time waiting, since the number of "busy threads" has almost no correlation to CPU utilization.

But I still don't know what was going on in that mystery period.  That'll have to be for another day...










Friday, June 24, 2016

SQL Server trace flag 8721 - auto stats sync & async update info

Aaron Morelli (@sqlcrossjoin on Twitter) has a great collection of trace flags in the pdf linked below.  (Check to see if there's a more recent update, v6 was current as of April 2016.)
https://sqlcrossjoin.files.wordpress.com/2016/04/sqlcrossjoin_traceflagrepository_v6.pdf

Trace flag 8721 is of interest to me when investigating ETL behavior.  A complex ETL could change stats numerous times for lots of stats keys.  Information for auto stats updates in addition to stats updates performed explicitly by ETL code can really help complete the picture of what happens during an investigation.

I found this post from Prince Rastogi where he walks through a demo of trace flag 8721 by creating a new table with a stats key, increasing the rows from 0 to 1, then executing a query that requests the stats.  Here's an important note: use trace flag 3605 (redirect some output to SQL Server log) together with trace flag 8721.
http://www.sqlservergeeks.com/sql-server-trace-flag-8721/

I had two questions, and answered them both by following the same process.
My first question: Does it still work in SQL Server 2016? Yep.
Second question: Does it capture both async & sync auto stats updates? Yep.

Snippet from the SQL Server error log below.

*****
Date,Source,Severity,Message
06/24/2016 12:16:22,spid23s,Unknown,AUTOSTATS: UPDATED Stats: test_Tflag8721..PK__test_Tfl__1B434632D45E0816 Dbid = 7 Indid = 1 Rows: 1 Duration: 0ms
06/24/2016 12:16:22,spid57,Unknown,AUTOSTATS: Tbl: test_Tflag8721 Objid:2089019169 Rows: 1.000000 Threshold: 500 Duration: 0ms **** >>>> ASYNC <<<< ****
06/24/2016 12:16:22,spid57,Unknown,AUTOSTATS: Tbl: test_Tflag8721 Objid:2089019169 Rows: 1.000000 Threshold: 500 Duration: 0ms **** >>>> ASYNC <<<< ****
06/24/2016 12:16:22,spid57,Unknown,AUTOSTATS: Tbl: test_Tflag8721 Objid:2089019169 Rows: 1.000000 Threshold: 500 Duration: 0ms **** >>>> ASYNC <<<< ****
06/24/2016 12:16:11,spid57,Unknown,The error log has been reinitialized. See the previous log for older entries.
06/24/2016 12:16:11,spid57,Unknown,Default collation: SQL_Latin1_General_CP1_CI_AS (us_english 1033)
06/24/2016 12:16:11,spid57,Unknown,The service account is 'NT Service\MSSQLSERVER'. This is an informational message; no user action is required.
06/24/2016 12:16:11,spid57,Unknown,Logging SQL Server messages in file 'M:\MSSQL\Log\ERRORLOG'.
06/24/2016 12:16:11,spid57,Unknown,Authentication mode is MIXED.
06/24/2016 12:16:11,spid57,Unknown,System Manufacturer: 'VMware<c/> Inc.'<c/> System Model: 'VMware Virtual Platform'.
06/24/2016 12:16:11,spid57,Unknown,Server process ID is 1488.
06/24/2016 12:16:11,spid57,Unknown,All rights reserved.
06/24/2016 12:16:11,spid57,Unknown,(c) Microsoft Corporation.
06/24/2016 12:16:11,spid57,Unknown,UTC adjustment: -5:00
06/24/2016 12:16:11,spid57,Unknown,Microsoft SQL Server 2016 (CTP3.0) - 13.0.700.242 (X64) <nl/>    Oct 26 2015 18:46:06 <nl/>    Copyright (c) Microsoft Corporation<nl/>    Enterprise Evaluation Edition (64-bit) on Windows Server 2012 R2 Standard 6.3 <X64> (Build 9600: ) (Hypervisor)
06/24/2016 12:16:11,spid57,Unknown,Attempting to cycle error log. This is an informational message only; no user action is required.
06/24/2016 12:15:52,spid57,Unknown,Setting database option AUTO_UPDATE_STATISTICS_ASYNC to ON for database 'SEASIDE'.
06/24/2016 12:12:56,spid57,Unknown,AUTOSTATS: Tbl: test_Tflag8721 Objid:2057019055 Rows: 1.000000 Threshold: 500 Duration: 2ms
06/24/2016 12:12:56,spid57,Unknown,AUTOSTATS: UPDATED Stats: test_Tflag8721..PK__test_Tfl__1B434632ECB10C7B Dbid = 7 Indid = 1 Rows: 1 Duration: 0ms
06/24/2016 12:12:50,spid57,Unknown,The error log has been reinitialized. See the previous log for older entries.
06/24/2016 12:12:50,spid57,Unknown,Default collation: SQL_Latin1_General_CP1_CI_AS (us_english 1033)
06/24/2016 12:12:50,spid57,Unknown,The service account is 'NT Service\MSSQLSERVER'. This is an informational message; no user action is required.
06/24/2016 12:12:50,spid57,Unknown,Logging SQL Server messages in file 'M:\MSSQL\Log\ERRORLOG'.
06/24/2016 12:12:50,spid57,Unknown,Authentication mode is MIXED.
06/24/2016 12:12:50,spid57,Unknown,System Manufacturer: 'VMware<c/> Inc.'<c/> System Model: 'VMware Virtual Platform'.
06/24/2016 12:12:50,spid57,Unknown,Server process ID is 1488.
06/24/2016 12:12:50,spid57,Unknown,All rights reserved.
06/24/2016 12:12:50,spid57,Unknown,(c) Microsoft Corporation.
06/24/2016 12:12:50,spid57,Unknown,UTC adjustment: -5:00
06/24/2016 12:12:50,spid57,Unknown,Microsoft SQL Server 2016 (CTP3.0) - 13.0.700.242 (X64) <nl/>    Oct 26 2015 18:46:06 <nl/>    Copyright (c) Microsoft Corporation<nl/>    Enterprise Evaluation Edition (64-bit) on Windows Server 2012 R2 Standard 6.3 <X64> (Build 9600: ) (Hypervisor)
*****

Tuesday, June 14, 2016

SSIS "buffers spooled" on a NUMA server


 So this happened today during an ETL.



A quick look through the InterTubes tells me that SSIS should spool buffers to disk only if
  • memory allocation failed
  • low memory resource notification sent by Windows OS

So the server had plenty of free RAM.  But NUMA node 1 was in a pinch.  And SSIS spooled its buffers to disk.  Doggone it.

I guess I'd figured that notifications were sent based on server-wide memory state.  But I guess maybe memory state on each NUMA node can lead to a memory notification?

The target SQL Server instance, a beefy one, was also on this physical server.  There's 1.5 TB of RAM on the server.  :-)

We lowered max server memory by 100GB, hoping to get a reasonable amount of free memory on NUMA node 1 for tomorrow's ETL.  (The server was also rebooted for an unrelated reason.)

I dunno though.  This is SQL Server 2014 SP1 CU2.   

Memory is already being consumed unevenly across the NUMA nodes.  If NUMA node 1 gets short end of the stick by 6:00 am tomorrow, I think for now we may try to affinitize SSIS to NUMA node 2, in hopes that its memory allocations will come from that node and it that memory allocations will continue to allow enough free memory there even while short-changing NUMA node 1.  

*****Update*****
A blog post showing use of Windows Resource Manager to affinitize SSIS.  But... this feature was deprecated in Windows Server 2012 R2. 
:-( Sad trombone.  

Maybe next version of SSIS can support affinity natively, similar to SQL Server database engine?  Or maybe SSIS can be made NUMA aware?  Or...

Well, at any rate, for maximum scalability and flexibility, gotta find some way of making this work...




Monday, June 13, 2016

L'esprit de L'escalier - Whither SQL Server Backup Buffers?

I've got a long, long list of interesting questions.  And I blog frightfully slowly :-)

So... why don't I start blogging more of the questions?  Someone else may very well be able to springboard from one of my questions to a *really* astounding answer!  Maybe.  I hope my questions are that good anyway :-)

Sometimes the questions come to me as a form of L'esprit de L'escalier - not something I wish I'd said but a question I wish I had asked.

So it is with my question: "Whither SQL Server Backup Buffers?"

Yesterday (or maybe really early this morning) I posed the question: "Whence SQL Server Backup Buffers?"

Here's what I noticed: 'Target server memory' 33553968 hadn't been achieved.  'Total server memory' during the backup was 14947794.  Of that 4194496 was allocated to memoryclerk_backup.

When the backup was cancelled, the memoryclerk_backup allocation was simply returned to the OS.

Why not give it to free memory?  It should have been nice, contiguous chunks of memory addresses.  The server was far from reaching target server memory.  The server wasn't under memory pressure.  This is what it looks like right now... and it would have looked nearly the same right after the backup was cancelled.
That's a lot of free memory.  Take away 4 GB from it for the backup buffers... yeah, its still a lot of free memory.

I wonder:
-If 'max server memory' wasn't being overridden by a more reasonable target (because max server memory is 120 GB on a 32GB VM), would the behavior still be the same before reaching target?  I bet it would be.
-Is this behavior specific to not having reached target?  Or when reaching target would backup buffers be allocated, potentially triggering a shrink of the bpool, then returned to the OS afterward requiring the bpool to grow again?
-What other allocations are returned directly to the OS rather than given back to the memory manager to add to free memory?  I bet CLR does this, too.  Really large query plans?  Nah, I bet they go back into the memory manager's free memory.
-Does this make a big deal?  I bet it could.  Especially if a system is prone to develop persistent foreign memory among its NUMA nodes.  In most cases, it probably wouldn't matter.

Maybe I'll get around to testing some of this.  But it probably won't be for a while.

Whence SQL Server Backup Buffers?

Recently I saw a question on Twitter #sqlhelp about backup buffers, asking if they came from the bpool or MTL.

I think the question is trickier than first realized :-)

MTL (or mem-to-leave) is at this point more of an historic term than anything else.  This term was most relevant in the days of 32-bit OS.  You can search old blog posts (not of mine) and old forum posts to see tons of fun related to 32-bit Windows and attempting to ensure that multi-page allocations and Windows direct allocations had enough available memory in the first 4 GB of address space. (Hunt around for "-g" startup option and "/3GB" switch to *really* appreciate 64-bit Windows & SQL Server.)

Next, SQL Server 2012 saw a very significant change in SQL Server memory management.  Previous to SQL Server 2012, there was both a single page allocator (SPA) and a multi-page allocator (MPA) in SQL Server.  While single page allocations were under governance of "max server memory", multiple page allocations were not.  SQL Server 2012 instead saw the introduction of the any page allocator, which took over the jobs of both previous allocators.  On a related note, the multi-page allocations which were previously outside of "max server memory" control were moved under control of "max server memory".  (I'm really glad that happened - in general it made server behavior much more predictable for a given configuration.)

Now about the idea of "coming from the buffer pool."  SQL Server still uses the term "stolen memory" to describe behavior and within its memory accounting.  But in SQL Server 2012 and beyond, the bpool is a consumer of memory from the memory manager in much the same way as others.  So although the bpool may shrink in order that another consumer can grow, that's a process managed by the memory manager.  And the bpool is usually on the "need to shrink" side only because it tends to be one of the largest, if not the largest, consumers.

OK... so now I'll mention what I *think* the question really meant.  I think the question was meant to determine:
"Are backup buffers governed by 'max server memory', or should I plan for backup buffers outside of 'max server memory' like we still do for SQL Server worker threadstacks?"

That's a pretty good question.  If you search around, you may find a number of posts where the conclusion is... "I don't remember if its included in max server memory or not." :-)

Let's see what can be learned with a quick little experiment.

Don't try this at home.  :-)

Only if you have exclusive access to the memory resources on the server AND if a significant read load on the storage won't disrupt anyone.

What will happen if I run a backup to DISK='NUL' on a very large database with buffercount = 1024 and maxtransfersize = 4194304 (4 mb)?  Altogether, that should be 4 GB of memory allocation - should be quite noticeable.

I'm testing on SQL Server 2016 RC2.

Here's how things looked before the backup.

   
Kinda boring.  Memoryclerk_backup has nothing in node 0 or node 64.
'Total Server Memory (KB)'  is just a hair over 10GB, even though 'Target Server Memory (KB)' is just shy of 32 GB.  'Max server memory' is actually 120000 MB.  There hasn't been much happening in this instance for a few days.

Lets fire up a backup.
The test database is several terabytes, but in this case I'm writing the backup to NUL and I'll cancel the backup as soon as I learn what I want to know.

How do the backup memory clerks and performance counters look now?




Memoryclerk_backup on node 0 now has just over 4 GB allocated to it.  Total server memory has increased by... let's see... 14947744 KB - 10753248 KB =  4194496 KB.  That's exactly the size of the allocation to the backup clerk.

What about stolen memory?  It increased by 5457832 KB - 1257344 KB = 4200488 KB.  That's a little bit more than what memoryclerk_backup picked up.  In part because I opened a second query window while testing :-)

Remember - the memory isn't really stolen from the buffer pool necessarily.  Notice that 'Database Cache Memory (KB)' is the same 9467600 KB before and during the backup.  'Free Memory (KB)' has declined by 28304 KB - 22312 KB = 5992 KB.

Doesn't that number look suspicious?

4194496 KB (backup allocation) + 5992 KB (additional stolen memory) = 4200488 KB

Whew! Glad that worked out :-)

So some memory really was stolen.  But it wasn't stolen from the bpool, which remained the same size during the backup (until I canceled it) as it was before the backup.  The additional stolen memory came from the free memory.

Now, because target memory in this SQL Server instance is so much lower than 'max server memory' (my VM got shrunk from 128GB RAM to 32 GB RAM and I never adjusted 'max server memory'), I can't say that I proved that backup buffers will always be from within the allocation governed by 'max server memory'.  Can I say it's left as an exercise for the reader?

It's late and I gotta get some sleep...

:-)

Wednesday, May 4, 2016

SQL Server: A quick word about large sorts/hashes, lookups/sec, and CPU utilization

Several years ago I began trying to model SQL Server DSS system behavior based on activity rates and resource utilization.  Its an ongoing effort :-)  This is the motivation for many of my strange-looking perfmon graphs - I'm trying to fit it all together.

One of my first stops was to graph CPU utilization against "page lookups" across many systems.
I quickly learned that spinlock contention could disrupt the correlation between CPU utilization and page lookups or database logical IO.  I was suspecting that anyway - I'd seen the same effect on another database platform.

It stands to reason that high rates of plan compile, or lots of XML parsing and manipulation, or usage of CPU by CLR, or CPU consumed by something other than SQL Server will likewise disrupt the correlation.  Operations using worktables can also throw things for a loop, whether reviewing perfmon or results from SET STATISTICS IO ON.

But today I want to mention another disrupter of the relationship between CPU utilization and database logical IO: significant query use of large sorts and hashes.

Consider the following perfmon from a DSS system(96 GB RAM on the system).  Page Lookups/second in blue against the left vertical axis, CPU utilization in orange against the right vertical axis.  I've color-coded this observation period into green, yellow and red portions of time.  During the initial green period of time, there is a strong correlation between CPU utilization and page lookups.  That gives way to the yellow time period, where the relationship between CPU and logical IO comes unglued.  Finally, in the red section the amount of CPU utilized is completely disproportionate to database logical IO.



I don't want to bore you with my "home movie" of the dozens of perfmon graphs I build when I review systems.  :-)  Let's cut to the chase.

 

The graph above has tempdb footprint (light blue) stacked on top of used query memory (dark blue) against the left vertical axis.  The green period has very limited use of query memory.  During the yellow period, a moderate amount of query memory was used.  During the red period, a large amount of query memory was used and at a number of points operations spilled into tempdb.  As query memory was used more extensively, the CPU:lookups/sec correlation was more disrupted.

Once fully considered, this makes sense: query memory is "stolen" from the database page buffer pool.  References to pages in the page pool are "page lookups", but each time stolen query memory is poked and prodded... well, that's not a page lookup.  But it has CPU cost.

So if you are reviewing a system that usually has high correlation between CPU utilization and page lookups/second and suddenly the relationship is disrupted: yeah, could be spinlock contention.  Could be lots of other things, too.  Might be large sorts/hashes dominating memory traffic and less reliance on the database cache where "page lookups" are tracked.

OK... I guess that wasn't really a "quick word".

I tried; not sorry. :-)