
Wednesday, October 18, 2017

Oops. When is a #SQLServer maximum not really *the* maximum? (Workspace Memory)

When is a maximum not really the maximum?
When it's a maximum for an explicitly or implicitly modified default.
Whether "the definitive documentation" says so or not.

Yesterday on Twitter #sqlhelp this question came up.

*****
*****

Aha! I thought to myself.  For this I am purposed! To show how Perfmon and DMV data tie out!

So I responded thusly in part I...

*****
*****

And thusly in Part II...

*****
*****

To wit, I included a clever graphic in the erstwhile conclusion of my rejoinder...


Indeed the internal resource pool defies limitation by the Resource Governor; the sum of resource semaphore target_memory_kb values for that pool is the true maximum workspace memory limit for the SQL Server instance.  But is that number necessarily the same as is reported by "SQLServer:Memory Manager\Maximum Workspace Memory (KB)"?

Based on the two sources I cited, that may look to be the case.  From the Microsoft.com link above for the Memory Manager object:


But there are wrinkles.  As it turns out, the number reported in that perfmon counter is the sum of resource semaphore targets for the default pool.  Two factors can cause the default pool resource semaphores' targets to be lower than those of the internal pool: the default pool's own max_memory_percent setting, and the min_memory_percent settings of Resource Governor pools other than the internal pool.

Consider what happens when default pool max_memory_percent is changed from 100 to 50, with min_memory_percent remaining at 0 for all other pools.  The RP_maxmem_100 pool still reports the same sum of target values as the internal pool.  Both of these pools report a total number twice as large as the default pool reports.  And it's the default pool's sum of resource semaphore targets that gets reported by "SQLServer:Memory Manager\Maximum Workspace Memory (KB)".
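For reference, the change itself is just a couple of statements (a sketch; adjust the percent as needed):

/* sketch: lower the default pool's workspace memory ceiling to 50% */
ALTER RESOURCE POOL [default] WITH (max_memory_percent = 50);
ALTER RESOURCE GOVERNOR RECONFIGURE;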


OK, now let's bring default pool max_memory_percent back to 100.  What happens if a pool other than default has a nonzero min_memory_percent defined?  Pool RP_maxmem_100 has min_memory_percent 27 in the example below. Remember, the internal pool is unfazed by design.  No other pool has a minimum memory set, so the sum of pool RP_maxmem_100 resource semaphore targets is the same as for the internal pool.  But the default pool now has a semaphore target total and "Query exec memory target" value of 100% - 27% = 73% => 73% * 192890640 = 140861040 (when rounded down to the nearest multiple of 8k).
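If you want to see that tie-out on your own instance, here's a quick sketch (same DMVs as the query further down) comparing the Memory Manager counter with the default pool's resource semaphore targets:

/* sketch: 'Maximum Workspace Memory (KB)' should equal the default pool's
   sum of resource semaphore target_memory_kb values */
SELECT opc.cntr_value AS [Maximum Workspace Memory (KB)],
       SUM(qrs.target_memory_kb) AS default_pool_target_memory_kb
FROM sys.dm_os_performance_counters opc
CROSS JOIN sys.dm_exec_query_resource_semaphores qrs
JOIN sys.dm_resource_governor_resource_pools rp ON qrs.pool_id = rp.pool_id
WHERE opc.counter_name = 'Maximum Workspace Memory (KB)'
AND   rp.[name] = 'default'
GROUP BY opc.cntr_value;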


As Paul Harvey would say:
Now you know... the rest of the story.

Ciao for now!

perfmon "SQLServer:Resource Pool Stats(*)\Query exec memory target (KB)" ties out well with sys.dm_exec_query_resource_semaphores


SELECT opc.[object_name], opc.instance_name, 
       opc.cntr_value AS [Query exec memory target (KB)], 
       SUM(qrs.target_memory_kb) total__target_memory_kb
FROM sys.dm_os_performance_counters opc
JOIN sys.dm_resource_governor_resource_pools rp ON opc.instance_name = rp.[name]
JOIN sys.dm_exec_query_resource_semaphores qrs ON rp.pool_id = qrs.pool_id
WHERE counter_name = 'Query exec memory target (KB)'
GROUP BY rp.pool_id, opc.[object_name], opc.instance_name, opc.cntr_value
ORDER BY rp.pool_id

And the results...



Tuesday, October 17, 2017

SQL Server 2016 - memory broker clerk DMV ties out well with perfmon

I invest time in trying to fit pieces of memory together, to understand how various types of accounting for database memory work... and work together.

Sometimes it's tricky and takes me a while :-)

But today for the first time I tried to tie out sys.dm_os_memory_broker_clerks with equivalent counters from sys.dm_os_performance_counters - the counters we get in perfmon.  I was surprised how easily sys.dm_os_memory_broker_clerks could be reconciled with perfmon.

First, a disclaimer.  The columnstore object pool changes pretty quickly once no queries are running.  And sys.dm_os_performance_counters lags behind sys.dm_os_memory_broker_clerks.  It seems like values are pushed from the source for sys.dm_os_memory_broker_clerks to the source for sys.dm_os_performance_counters on a regular interval - maybe every 2 seconds.  This is most noticeable when the broker clerk for columnstore disappears entirely from sys.dm_os_memory_broker_clerks but is still present in sys.dm_os_performance_counters for a brief time.

OK... here are the queries I used to compare...


;WITH broker_clerks AS
      (SELECT instance_name, counter_name, cntr_value 
       FROM sys.dm_os_performance_counters 
       WHERE object_name = 'SQLServer:Memory Broker Clerks')
SELECT instance_name clerk_name, 8 * [Memory broker clerk size] AS [total_kb], 
8 * [Simulation size] AS [simulated_kb],
8 * [Periodic evictions (pages)] AS [periodic_freed_kb], 
8 * [Pressure evictions (pages/sec)] AS [internal_freed_kb]
FROM broker_clerks
PIVOT (MAX (cntr_value) 
       FOR counter_name IN ([Memory broker clerk size], [Simulation size], [Periodic evictions (pages)], [Pressure evictions (pages/sec)])
       ) broker_clerks_pvt
ORDER BY instance_name;

SELECT   clerk_name, total_kb, simulated_kb, periodic_freed_kb, internal_freed_kb
FROM     sys.dm_os_memory_broker_clerks
ORDER BY clerk_name;

And this is what the results looked like...


Thursday, April 13, 2017

#SQLServer: Waits keeping CPU utilization down? Or Management Cost inflating CPU utilization? Part II

Ed Elliott (@EdDebug on Twitter) today mentioned how expensive disk IO waits - or another common wait - could lead to the sawtooth CPU utilization pattern I was pondering yesterday.

So... before I start looking at some exotic sources of resource utilization/performance issues, I should eliminate some of the more typical wait types as potential causes of depressed CPU utilization during the sawtooth.

Yesterday's blog post:

#SQLServer: Are waits keeping CPU utilization down? Or is Management Cost inflating CPU utilization? Part I
http://sql-sasquatch.blogspot.com/2017/04/sqlserver-are-waits-keeping-cpu.html

And a brief reminder in graph form below (as previously, 30 second perfmon interval). From approximately 1:20 am to 1:35 am either:
1) there is sufficient thread/query demand to keep CPU utilization high, but CPU utilization is suppressed by some type(s) of wait
2) thread/query demand isn't high enough to sustain high CPU utilization, but a management cost is occasionally driving CPU utilization unexpectedly high (eg spinlock contention, etc).



Let's try to eliminate some common wait types.  First up for consideration: resource_semaphore waits, the purgatory of pending query memory grants.  On this type of system I do expect to see pending query memory grants, unless the maximum memory grant size has been limited via Resource Governor.

Sure enough - there are pending memory grants, racking up resource_semaphore wait time.  But only after the sawtooth pattern is, for the most part, over.  Still - from 1:40 am to 1:57 am there is idle CPU - converting those pending grants to running queries will improve the performance of this workload by increasing concurrency and using otherwise idle resources.  But... pending query memory grants can't explain the sawtooth.
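(If you'd rather check for pending grants directly than infer them from perfmon, a minimal sketch:)

/* sketch: requests currently waiting on a query memory grant */
SELECT COUNT(*) AS pending_grants,
       SUM(requested_memory_kb) AS requested_kb
FROM sys.dm_exec_query_memory_grants
WHERE grant_time IS NULL;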


Onward!!

This time period is primarily ETL activity.  So... transaction log writes are a potential source of depressing waits (both in the technical and psychic sense).

Let's take a look.

Transaction log writes have been observed to inject two main wait types into this ETL.  The first is likely the more common across other systems: log write wait.  This is the wait associated with transaction commit.  The other wait type is log buffer wait.  This is an interesting one.  Each database has 128 in-memory transaction log buffers.  When scheduled to write, they can be written up to 32 at a time.  What if none of the 128 in-memory transaction log buffers are free when a transaction needs to write some txlog records to a buffer in the course of the transaction?  The session in the transaction incurs a log buffer wait - which can end only when one of the buffer writes to the transaction log is complete and the in-memory buffer has been freed to take on new stuff.
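The counters graphed below live under the Wait Statistics perfmon object; a rough sketch for pulling them from the DMV side (counter names as I recall them):

/* sketch: log write wait and log buffer wait counters from the Wait Statistics object */
SELECT counter_name, instance_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE [object_name] LIKE '%Wait Statistics%'
AND   counter_name IN ('Log write waits', 'Log buffer waits')
ORDER BY counter_name, instance_name;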

First - the more common write log wait.  Oh yeah - got some write log waits.  But not in the cool yellow box.  Mainly before the box, and a little bit after.


Here's the thing about waits: even if there aren't a huge number, if they are humongous they can still drive down CPU utilization.  So let's take a look at average wait time against the waits started per second.

I'm glad we took a look.  It's still not the cause of the sawtooth.  But it's interesting to see that in the yellow box there is some time charged to log buffer wait, so at least a few of them must have occurred - though only a fraction of the number occurring earlier, and at lower average wait time.


All right.  Let's move on to disk read and write activity to persistent database files.  Waits for database file reads are a common location for performance optimization, and exorbitant disk read waits can definitely dominate a workload.  There's an important mitigating factor though, which I rarely see discussed: the fabulous SQL Server readahead.

Here's the thing about page IO latch waits*: only two opportunities for them - one a guarantee, the other a maybe.  The guarantee is for a synchronous read - a read without readahead.  Ask for the read right when needed by query processing - no choice but to wait.  The wait maybe is a little trickier: ask for the read ahead of time - check on it when you need it for query processing.  If it's there when asked - no wait!  But if it's needed and asked for - not there yet - gotta wait.  So... that shouldn't happen all the time, and when it does happen the wait time incurred should be somewhat less than the disk read latency for that read.

*this is my own explanation - which could totally be wrong and I'm open to correction.  If we're going to argue about it, let's argue based on data, though 😂😂😂

All right.  Funky explanation out of the way, let's look at some graphs.  How can we profile "readahead efficiency"? Well... here's readahead pages/sec in light blue in front of read pages/sec in dark blue.  So the exposed dark blue are "nonreadahead pages".  Pages read in synchronous read operations (important to note that measure is 8k pages - not IO operations - for both of these perfmon numbers).  Guaranteed to be some page IO latch wait time for the read operations which supplied those pages.  And the disk reads that yielded readahead pages?  Well - if they were all blazing fast or if there were enough runnable threads to make CPU waits "swallow" the potential read wait time for readahead, there'd be no page io latch wait time.  But if not blazing fast and not enough runnable thread pressure to "hide" read wait time, some page IO latch wait time is gonna leak through even for readahead.
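(Both counters in that graph sit under Buffer Manager; a sketch for grabbing them from the DMV - note they're cumulative there, so a rate needs two samples:)

/* sketch: readahead pages vs total page reads - cumulative values in the DMV */
SELECT counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE [object_name] LIKE '%Buffer Manager%'
AND   counter_name IN ('Page reads/sec', 'Readahead pages/sec');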


All right.  Let's look at page IO latch waits started per second, compared to CPU utilization.  Nope - not more waits in the yellow box than outside the yellow box.



But again - it's not only how many waits are started but how long they last.  So let's look...

Well, not only are the wait starts unremarkable in context, the average wait time is lower inside the yellow box than outside the yellow box.



I'm not ready to quit looking at read activity though.  I made some wacky claims about the relationship of page IO latch wait time to disk read service time.  Let's see if what I said holds water in this context.  In the 2nd graph below, I can see that read service time average topped out around 180ms.  But page IO latch wait average topped out around 120ms.  OK - that's kinda like what I predicted.  Because not all reads will generate waits and some will generate waits for only part of the read service time, we'd expect average page IO latch wait time to be lower than average read service time when readahead is involved.  But why are there sometimes more page IO latch waits started per second than reads completed per second?  Two reasons: multiple queries interested in the same pages at nearly the same time, and multiple parallel query threads in the same query interested in the same page at nearly the same time :-)
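(A sketch for eyeballing the same relationship from the wait stats DMV - cumulative since startup, so it won't show the time profile the graphs do:)

/* sketch: cumulative PAGEIOLATCH waits and average wait time */
SELECT wait_type, waiting_tasks_count, wait_time_ms,
       wait_time_ms * 1.0 / NULLIF(waiting_tasks_count, 0) AS avg_wait_ms
FROM sys.dm_os_wait_stats
WHERE wait_type LIKE 'PAGEIOLATCH%';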



Let no graph go to waste.  We've already seen that read times for the persistent database files are almost certainly not to blame for the sawtooth.  But I also like to show that for a workload like this - unless there's a lot of RAM on the server, there's a loose correlation between read bytes/sec and CPU utilization.  Well... except inside that curious yellow box :-)




All right - let's move on from persistent database data file reads, to persistent database file writes.  Just cuz :-)

More writes outside the yellow box, too.  Curiouser and curiouser.


Let's look at the write service time that accompanies those writes...

Yep, this system could use some attention to write latency.  Probably saturated write cache, and likely due to more than just this server on shared storage.  But... the write latency to persistent datafiles definitely doesn't seem to be a suspect for the sawtooth.



What if it was?  We discussed log buffer waits and log write waits as waits associated with high write service time or write floods to transaction logs.  But write floods to persistent database files are more rare, and we don't hear about the waits associated with them very often.  The one I see sometimes - based on write service time to persistent database files and the ability of 'total server memory' to absorb dirty pages before incurring checkpoint or lazy writer activity - is free list stalls.  Yup.  Got some of those.  Lowering write service time and/or increasing RAM will help.  But... as with other stuff we've looked at so far... bupkis in the paint.



All right.  We've got resource_semaphore, transaction log, and persistent database file read and write waits off the suspect list for now.  Let's do one more group: tempdb data files.  I separate tempdb data file reads and writes from persistent db datafile reads and writes for a few reasons:
1) if only spills, read bytes and write bytes will be one-to-one
2) tempdb spill writes are womper aggressive compared to datafile writes for persistent data files.  Or compared to checkpoint & lazy writes to tempdb.  They're just their own animal :-)
3) tempdb spill writes have their own subset of waits that pertain to them

Maybe you'll remember this graph from yesterday.  Used Tempdb stacked on top of used grant memory.  This graph tells us footprint in tempdb - I use this type of graph a lot.  But it doesn't tell us traffic to tempdb, which is what would indicate potential wait events.

Fortunately on this system, they have the tempdb datafiles isolated to mount points with tempdb in the name :-)  I heartily recommend isolating tempdb datafiles from other stuff at the Windows volume level.  And putting tempdb in the name really helps sorting perfmon. :-)  So... two Windows volumes for tempdb datafiles.  Piece of cake to add the reads/sec and writes/sec for those volumes into a single reads/sec and writes/sec measure.

Wow.  Almost no tempdb traffic in the yellow box of wonder.  Now, if I didn't have a hot plate of pasta near me and a family waiting to eat dinner with me, I'd also include read bytes/sec and write bytes/second to the tempdb Windows volumes. And sec/read and sec/write.  But I've shown enough to eliminate tempdb from interest for now.  Except for page latch contention - ie too few sets of allocation pages in memory to handle the traffic.  I'll handle that last bit maybe tomorrow.



Wednesday, April 12, 2017

#SQLServer: Waits keeping CPU utilization down? Or Management Cost inflating CPU utilization? Part I

Two of the first graphs I look at to understand system resource utilization are active parallel threads against CPU utilization, and active requests against CPU utilization.  One or both relationships should stay fairly well correlated - as parallel threads and/or active queries increase, CPU utilization should increase until reaching CPU saturation.  

*Unless* a wait type or a group of wait types starts to dominate the workload, such that time passes with *no* runnable thread available for one or more schedulers.

In the graphs below (30 second perfmon interval), the time periods from 12:45 am to about 1:20 am, and from 1:35 am to 2:00 am, look fairly healthy from that standpoint.

But the yellow box is confounding.  I personally hate CPU sawtooth patterns like that - just about my least favorite pattern to come out of graphing system resource utilization.  Only thing I like less is a straight horizontal line at saturation.




I don't know much about this particular system yet.  But... I can't help myself.  I must know.  I know it's a 2 socket physical server, with 32 cores.

There are no added Resource Governor Resource Pools or Workload Groups - just Default and Internal.

Well - could it be something other than SQL Server that is chewing up CPU?  I guess...

Ooohhh.  Nope, I take that back.  Definitely SQL Server chewing up that CPU.



My money is on 1:20 - 1:35 am being a case of exorbitant management cost. Spinlock contention, probably.  Maybe related to handling a system rowset object, or the plan cache... or it could even be good old spinlock contention surrounding CMEMTHREAD waits.

I'm not sure right now.  But considering the system is at a low point for concurrent queries and parallel threads at that time - it's hard for me to believe waits are keeping CPU utilization lower than the trend.  Rather, it looks more likely that somehow, during a time of low concurrency, something is artificially increasing CPU utilization with management work.

This is particularly intriguing.  About 4 GB of tempdb in use during the CPU sawtooth, but almost no granted memory used.  Hmmmm.


There was memory granted at the time.



This is also quite intriguing.  Plan cache was at its plateau during this unusual period of time.  And when a significant part of plan cache was released shortly after 1:35 am, a more expected relationship between CPU utilization and queries/parallel threads was restored.



Stay tuned... when I figure it out I'll let you know...


(to be continued...)

#SQLServer: Are waits keeping CPU utilization down? Or is Management Cost inflating CPU utilization? Part II
http://sql-sasquatch.blogspot.com/2017/04/sqlserver-are-waits-keeping-cpu_13.html

Tuesday, April 11, 2017

#SQLServer: How much of that free memory came from checkdb completing and releasing its memory grant? And what about tempdb?

Arun Sirpal (@blobeater1 on Twitter) asked a great question today (11 April 2017) about the impact of memory grant release on free memory after checkdb completes.

I love questions!  It means someone was really paying attention!

Here's the relevant blog post about the appearance of a lot of free memory when a checkdb (without tablock) completes.

Where'd all this #SQLServer Free Memory come from??!?
http://sql-sasquatch.blogspot.com/2017/04/whered-all-this-sqlserver-free-memory.html

So - what contribution to free memory on completion of checkdb is made by releasing the memory grant?  YMMV, but for my 3.35 TB database... almost none.  Here's the thing: checkdb doesn't get a single memory grant; rather, it gets memory grants at multiple times throughout its execution.  And with each of those memory grants, like other queries, it may or may not use the entire memory grant.  So, the contribution of releasing the memory grant to free memory depends on how much used memory grant checkdb releases on completion.  That's kind of confusing to describe, maybe some graphs will help!

This is the same system, same checkdb as previous post.  The CPU utilization in blue clearly delineates when checkdb was active, since it was run in isolation.

As checkdb progressed, it requested varying memory grants - granted memory in the dark brown/orange below.  In order for this not to be too long of a blog post, I layered "reserved memory" in a lighter orange in front of granted memory.  So - the exposed dark brown granted memory is the workspace memory at any given time that is actually used.  Recall that when memory is granted it's just a promise - the query "steals" memory against the grant over time as needed.


So if we subtract reserved from granted... we'll be left with query memory used by checkdb.  Here's what that looks like.


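(In counter terms, that subtraction is roughly the sketch below - Memory Manager counter names as I recall them:)

/* sketch: used query memory = granted workspace memory - reserved memory remaining from grants */
SELECT MAX(CASE WHEN counter_name = 'Granted Workspace Memory (KB)' THEN cntr_value END)
     - MAX(CASE WHEN counter_name = 'Reserved Server Memory (KB)'   THEN cntr_value END)
       AS used_query_memory_kb
FROM sys.dm_os_performance_counters
WHERE [object_name] LIKE '%Memory Manager%'
AND   counter_name IN ('Granted Workspace Memory (KB)', 'Reserved Server Memory (KB)');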
This is fun!  Let's keep going! I said that over time, as the query needs more of its grant for sort/hash, it "steals" against the grant.  Hmmm... am I implying that used query memory is a subset of "stolen memory", and that only used query memory (excluding reserved memory from the grant) is within "stolen memory"?  Yessss!!  Yes, I am.

Check out used query memory layered in front of stolen memory in the graph below.  Stolen memory is greater than used query memory - but by a relatively fixed amount, and the greater stolen memory is during this isolated checkdb, the greater the share of it that is used query memory.

Interesting, no?  Here's the relevance to the initial question: look how small the used query memory is as checkdb nears completion and CPU utilization drops to near zero.  It's pretty doggone low.  That was the contribution to free memory when the last grant of checkdb was released.  Not much.  But, as I said, YMMV.  Maybe the allocation order of HoBTs in your database is different, and a peak of used query memory like that between 20:50 and 21:50 occurs right before completion!  In that case, releasing the grant would lead to a pretty big contribution to free memory.



OK.  Now for a lagniappe :-)

See that dip in CPU utilization at 16:50, and at 17:50?  Let's tie that in to checkdb work via tempdb.

In perfmon, we've got "SQLServer:Databases(tempdb)\Data File(s) Size (KB)" and "SQLServer:Transactions\Free Space in tempdb (KB)". Let's subtract them to get tempdb used!
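(A sketch of that subtraction straight from the counter DMV, if you'd rather not do it in Excel:)

/* sketch: tempdb used (KB) = tempdb data file(s) size - free space in tempdb */
SELECT MAX(CASE WHEN counter_name = 'Data File(s) Size (KB)'    THEN cntr_value END)
     - MAX(CASE WHEN counter_name = 'Free Space in tempdb (KB)' THEN cntr_value END)
       AS tempdb_used_kb
FROM sys.dm_os_performance_counters
WHERE (counter_name = 'Data File(s) Size (KB)' AND instance_name = 'tempdb')
   OR  counter_name = 'Free Space in tempdb (KB)';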

And - here's my trick that I think you'll see first from me: let's stack tempdb used on top of query memory used!!  Conceptually, it's easiest to understand when a single session is running in isolation, and when it's a session that isn't explicitly using temp tables, row versioning, etc.  In that case, tempdb use comes from spill: sort/hash has consumed the memory grant, so tempdb is leveraged to complete remaining sort/hash work.

Here's what that looks like.  How nice for me; it just happened to work out that there were nice peaks in tempdb use at 16:50 and 17:50 - the same times that CPU utilization dipped.  So tempdb-related waits are prime suspects. :-)


 Ciao for now!!




Monday, April 10, 2017

Where'd all this #SQLServer Free Memory come from??!?

My blog post on memory accounting with perfmon has been pretty popular :-) You can look there to see validation of the idea that "database cache + free + stolen = total server memory" for SQL Server.

SQL Server Memory Accounting with Perfmon
http://sql-sasquatch.blogspot.com/2016/10/sql-server-memory-accounting-with.html
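As a quick sketch, the four Memory Manager counters in that identity can be pulled straight from the counter DMV:

/* sketch: database cache + free + stolen should land right at total server memory */
SELECT counter_name, cntr_value AS kb
FROM sys.dm_os_performance_counters
WHERE [object_name] LIKE '%Memory Manager%'
AND   counter_name IN ('Database Cache Memory (KB)', 'Free Memory (KB)',
                       'Stolen Server Memory (KB)', 'Total Server Memory (KB)');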

Today I was looking at perfmon from a system and I saw a huge increase in "SQLServer:Memory Manager\Free Memory (KB)" appear in the middle of a key workload.  How'd that happen?

Unfortunately, the perfmon data I was reviewing was missing one of the counters I collect routinely now: "\SQLServer:Databases(_Total)\DBCC Logical Scan Bytes/sec"

So... right now I'm not sure where all the free memory came from on that system.  But I've got a hunch, so I asked to get that counter included in the next perfmon collection I get from them.

Below is a test I ran with a full checkdb on my 3.35TB database.  The "(logical scan bytes +1)/sec Logarithmic" measure that is graphed really is "\SQLServer:Databases(_Total)\DBCC Logical Scan Bytes/sec" + 1, with an Excel logarithmic scale :-)  It's required to be positive and nonzero to preclude errors.  And I just wanted it there as a clear indicator that checkdb was done.  So I started the SQL Server service for the instance and started checkdb.  An internal snapshot was created, and "Total server memory" kept growing to reach "max server memory".  All of the database cache was for the internal snapshot (other than checkdb, only some monitors were running in the instance).

When checkdb was complete - blammo!!  The internal snapshot db was dropped, and all of the database cache associated with the dropped db was invalidated - resulting in a humongous increase in free memory within SQL Server "total server memory".




Maybe that's what happened in that production system.  I should be able to rule that in/out watching free memory for a similar increase, and checking "\SQLServer:Databases(_Total)\DBCC Logical Scan Bytes/sec".

Friday, March 31, 2017

#sqlserver tempdb: sort_in_tempdb & -k startup option

***Disclaimer***
This is a very preliminary investigation on my part. I'm not going to move on these results in any production system at least until I've also tested effects on checkpoint for persistent databases.  Probably not until I understand more about the mechanics of this throttle, either.  So I plead that no-one incorporates -k into their own system based on this brief write-up... you'll need some extensive testing for your own context.
***************

Months ago, Jimmy May (@aspiringgeek on Twitter) asked me if I'd used the SQL Server -k startup option.  He's really knowledgeable about the interaction of SQL Server with storage (an understatement) and a thorough tester - he's seen some good results with -k.  I'd read about this startup option, but never tested it and hadn't seen it deployed on any systems I'd worked with.  What I'd read to that point had to do with checkpoint throttling.  Details on that angle of the -k startup option can be found in KB article 929240 below.

FIX: I/O requests that are generated by the checkpoint process may cause I/O bottlenecks if the I/O subsystem is not fast enough to sustain the IO requests in SQL Server 2005
https://support.microsoft.com/en-us/help/929240

Now the systems I work on stress SQL Server in lots of corner-case ways 😁 but an overwhelming checkpoint is something I haven't yet observed.

On the other hand, I do see overwhelming tempdb spills. Tempdb sort/hash spill writes are some of the most aggressive writes to come out of SQL Server.  Systems susceptible to them are advised to consider how to mitigate performance risk to that very system, as well as mitigating the risk of becoming a noisy neighbor if shared storage or shared plumbing (ESXi server, top-of-rack switch, etc) is involved.

The most common performance interventions for tempdb - trace flag 1117 or equivalent, trace flag 1118 or equivalent, increasing data file count to reduce allocation page contention - do not mitigate the risk posed by a tempdb spill write flood.  In fact, since none of the resources for those interventions I am aware of address the underlying Windows volume, vHBA, or ESXi host LUN layout for tempdb, there is a chance that actions taken to alleviate allocation page contention increase the risk posed by tempdb spills.  More on that another day - io weaving is a topic I'll have to prepare some diagrams for 😁

Most disk IO throttles are a poor fit for trying to mitigate this risk also.  VMware provides SIOC and adaptive queue throttling if using vmdks.  Neither works well to tame tempdb write floods without also throttling access to persistent databases.  Many storage arrays provide QoS controls at their front end adapters for rate limiting by IOPs or bytes/sec.  These limits can apply per initiator (host adapter) or per target LUN depending on the array model.  Per LUN QoS can be ok... but also unwieldy.  What about IO governance in Resource Governor?  It works per volume!!  Yay!  But it's share-based - rather than limit-based - and will kick in under contention only.  So... nope, not that either (but do keep in mind that RG IO governance works per Windows volume - I'll come back to that someday and how it fits into my recommendation NOT to co-locate data files for tempdb and persistent databases on the same Windows volume😉).
***Same Day Update***
Thanks to Alex Friedman (@alexf0101 on twitter) for pointing out I was wrong about Resource Governor IO governance.  It is limit-based governance.  I'll have to give it a try 😀

IO Resource Governance in SQL Server 2014
https://blogs.technet.microsoft.com/dataplatforminsider/2013/11/07/io-resource-governance-in-sql-server-2014/
Use SQL Server 2014 Resource Governor to Throttle Processes for IO Usage
https://www.mssqltips.com/sqlservertip/3370/use-sql-server-2014-resource-governor-to-throttle-processes-for-io-usage/
******
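Given that correction - it's a limit, and it's applied per Windows volume - the knob itself is simple (a sketch; the pool name here is hypothetical):

/* sketch: cap physical IO for a (hypothetical) pool at 500 IOPS per Windows volume */
ALTER RESOURCE POOL [some_pool] WITH (max_iops_per_volume = 500);
ALTER RESOURCE GOVERNOR RECONFIGURE;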

But here's something tantalizing.  A KB article about the -k startup option, initially written for SQL Server 2012.  Hmm.  It mentions throttling tempdb "work files".  Gives an example with checkdb.

Enable the "-k" startup parameter to control the rate that work files can spill to tempdb for SQL Server
https://support.microsoft.com/en-us/help/3133055/enable-the--k-startup-parameter-to-control-the-rate-that-work-files-can-spill-to-tempdb-for-sql-server

Recall that I am using a create index statement with sort_in_tempdb as my proxy for simulating large sort/hash spills. You can see my initial work with that here.

tempdb: "insert into... select" vs "select... into" vs index sort_in_tempdb write behavior
http://sql-sasquatch.blogspot.com/2017/03/tempdb-index-sortintempdb-vs-select.html

But what does throttling "work files" mean?  Is the throttle selectively restricted to tempdb activity only?  What about temp table usage (that's not my focus for a while - but it's an important question)?  What happens to "create index" with sort_in_tempdb if the -k option is enabled?  I'm glad you asked about "create index"!!

Let's start with the unthrottled create index, on an 8 vcpu VM with all flash storage.  It's SQL Server 2016 SP1.  Maxing out two ports of the adapter 😀 Check out that write latency on the right!  Whoo boy!  Years ago Paul Randal asked his readers to send in disk service time numbers from their SQL Server instances.  Tempdb write latencies were curiously high.  I know one possible reason 😀

These graphs are from 1 second perfmon captures.


So... both to keep from saturating the adapter (and punishing any activity sharing the adapter with very high service times) AND to keep from overwhelming storage with writes, we might want to throttle that activity.

What happens if we add -k1200 to startup?  Interesting.  The entire operation stretched out in elapsed time by about 5 seconds.  Write bytes/sec didn't exceed 1500 MB.  But look at that difference in write latency!!  A single point of stress at about 140 ms, and everything else was nice and low.  Remember - at saturation, service times get punished.  Keep a resource from saturation - even just by a little bit - and you can be rewarded with much lower service times.


But... maybe the -k didn't really do anything?  After all - it didn't *really* throttle to 1200 mb/sec did it?  And... this is right around a 10% difference in write bytes/sec to tempdb.  Within variance for some types of tests.  I don't have the patience today to keep repeating this test, though 😜 So, let's try to answer that question with a different throttle value.  Onward to 800!

OK... that's enough for me to say -k is really doing something.  Even lower bytes/sec to tempdb.  Lower peak write latency.  Still a nasty peak, but very brief.


What about throttled to 400? Hey... this one looks really nice from the write service time perspective.  Interesting that tempdb write bytes/sec is as variable as it is - something to do with the mechanics of the throttle.  A lot of this stuff might be more even on a physical server than in a VM.


Although I don't have time to do multiple repeat tests, I always try to make time to not stop at the first result I like.  So let's go to 200!!!

Excellent!  The first result I dislike! Write latency stays super low except for the peak at the end.  Write bytes/sec is reasonable with an interesting spike.  (I hope to learn more about how this throttle actually works - that'll help me plan how to work it into best practices and/or performance interventions.)  But the reason I don't like this result is the really choppy CPU utilization.  That hints that the throttle mechanism itself is under stress.  So for this workload on this system, throttling to 200 is too aggressive a throttle.



So where would I go from here?  When its time for more testing, I'll recommend ten or more runs of unthrottled, 1200, 800, 400.  From the best results, ten or more runs at that level plus 100, and minus 100.  Then we'll see where we're at.  Till then...


*whispers* smash your axioms

Wednesday, March 29, 2017

#sqlserver tempdb: "insert into... select" vs "select... into" vs index sort_in_tempdb write behavior

I believe pretty strongly that a collection of best practices should keep workload in mind.  Form fit to function, if you will.

So when it comes to tempdb best practices, I like to remind folks that not all use of tempdb looks the same from the system side.

On the systems I work with, sort/hash spill is the main component of tempdb use.  On many OLTP systems, rapid create/drop of temp tables is a much more prominent factor in tempdb considerations.  On some systems, row versioning or perhaps use of large local or global temp tables is a significant consideration.

Two concerns arise from the various uses of tempdb:
  • do they all stress allocation pages in the same way?  Tempdb allocation page contention and resulting pagelatch waits are the classic drivers of increasing the tempdb data file count for a given workload.
  • do they all utilize the storage subsystem in the same way? This is another, separate consideration in tempdb design which may influence decisions about which RAID level or disk group to put tempdb on, whether to use flash or HDD, how many Windows volumes to use, etc.
I'll have to return to that first consideration on another day :-)

Today I want to turn to the question of how some tempdb activity engages with storage.

Sort and hash spill is my primary concern, but today I'm going to weasel out by using CREATE INDEX with the sort_in_tempdb option as a proxy for sort spill.  At a later time (once my other tests behave a bit more) I'll come back to this and compare sort and hash spills to sort_in_tempdb activity.

The system I'm using is an 8 vcpu VM, set up with storage from an all flash array. SQL Server 2016 SP1.
Here's the setup for my test. 

/* test setup */
create table test_small (num1 BIGINT,
                         string VARCHAR(4000),
                         num2 BIGINT,
                         constraint pk_test_small PRIMARY KEY CLUSTERED (num1))
WITH (data_compression = page);

DECLARE @line VARCHAR(80) = 
'
12345678901234567890123456789012345678901234567890123456789012345678901234567890';
DECLARE @string VARCHAR(4000) = 
   @line + @line + @line + @line + @line + @line + @line + @line + @line + @line
 + @line + @line + @line + @line + @line + @line + @line + @line + @line + @line
 + @line + @line + @line + @line + @line + @line + @line + @line + @line + @line
 + @line + @line + @line + @line + @line + @line + @line + @line + @line + @line
 + @line + @line + @line + @line + @line + @line + @line + @line + @line + @line;

;WITH L0 AS (SELECT 1 AS c UNION ALL SELECT 1),      --          2
      L1 AS (SELECT 1 AS c FROM L0 A CROSS JOIN L0), --          4
      L2 AS (SELECT 1 AS c FROM L1 A CROSS JOIN L1), --         16
      L3 AS (SELECT 1 AS c FROM L2 A CROSS JOIN L2), --        256
      L4 AS (SELECT 1 AS c FROM L3 A CROSS JOIN L3), --      65536
      L5 AS (SELECT 1 AS c FROM L4 A CROSS JOIN L4), -- 4294967296
      L_FINAL AS (SELECT TOP (16777216) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) N FROM L5 )
INSERT INTO test_small
SELECT N, @string, N
FROM L_FINAL
WHERE N % 3 = 0

Crude, but effective.  You can probably tell that I was initially working with a table 3 times as large and scaled it back :-) The large string in each row forces 1 row per page, giving a fairly predictable table size when the table is not compressed.

For the first test scenario, let's create a temp table, and insert all rows into it.


/* scenario 1 */
create table #test_small_1 (num1 BIGINT,
                            string VARCHAR(4000),
                            num2 BIGINT);

INSERT INTO #test_small_1
SELECT *
FROM test_small;

OK, now let's use a "SELECT... INTO" query.


/* scenario 2 */
SELECT * 
INTO #test_small_2
FROM test_small;

Finally, let's use "CREATE INDEX" with sort_in_tempdb.


/* scenario 3 */
CREATE INDEX nci_small_test ON test_small(Num2)
INCLUDE (String, Num1)
WITH (SORT_IN_TEMPDB = ON, ONLINE = OFF, maxdop = 4);

Now... let's see how that all turned out. 1 second perfmon captures below.

Interesting that the "select... into" took a little longer, isn't it :-)  But the two temp table operations were close to each other in write bytes/sec to tempdb.  Look how aggressive that "create index" was though!  Maxed out adapter bandwidth.


Let's look at writes/sec rather than write bytes/sec.
Whoa.  The "select... into" in the middle issued wwwaaaayyyy more writes, and way more writes/sec to tempdb than either of the other operations.  But... each operation wrote the same amount to tempdb.  So the "select... into" was using some pretty small writes in comparison.  


Based on the number of writes, the fact that the "select... into" showed higher disk queue length than the first temp table option isn't too surprising.  But... maybe it's a little surprising that the "create index" has such high queue length, considering the much lower number of writes/second?


But if we look at sec/write... aha!   The table insert and "select... into" operations have write latency staying well below 64ms.  It's a little surprising to me that the "insert into" had higher peak write latency than the "select into..."  That may certainly be an oddity of that particular test run.  But I don't have a long enough attention span to repeat the test 10 or more times to even it out :-)  Nah, the real surprise is the write latency over 300 ms in the "create index".  On flash storage, too.


So... different uses of tempdb stress storage in different ways.  If I'm correct that "create index" is a good proxy for hash/sort spills, then tempdb spills require special consideration in how to destage aggressive write bytes/sec.

And... a theme I'll come back to in the future... relevant to the high write latency experienced by the "Create index"...
strange things happen to service time at saturation.




Monday, March 13, 2017

perfmon binary blg files - they're doggone big! an example of relog

At SQLSaturday 600 in Chicago this past weekend (March 11, 2017) I showed that perfmon logged to a binary log file can be much larger than alternative formats like csv.

(My presentation can be found here:
http://www.sqlsaturday.com/600/Sessions/Details.aspx?sid=59564)

Here's ~72 minutes of activity logged in 1 second increments, captured in both a binary blg file and a flat file csv format. 553 columns including the timestamp.


That's more than a 6 times difference in size!

For that reason *if* I use a binary log file at all, I put in a limited set of counters.  This is a typical counter set that I may use when capturing perfmon in a blg binary file.


\SQLServer:Access Methods\Full Scans/sec
\SQLServer:Access Methods\Index Searches/sec
\SQLServer:Access Methods\Page Splits/sec
\SQLServer:Access Methods\Range Scans/sec
\SQLServer:Backup Device\*
\SQLServer:Buffer Manager\*
\SQLServer:Buffer Node(*)\*
\SQLServer:Databases(_Total)\Backup/Restore Throughput/sec
\SQLServer:Databases(_Total)\Bulk Copy Rows/sec
\SQLServer:Databases(_Total)\Bulk Copy Throughput/sec
\SQLServer:Databases(_Total)\DBCC Logical Scan Bytes/sec
\SQLServer:Databases(_Total)\Log Bytes Flushed/sec
\SQLServer:Databases(_Total)\Log Flushes/sec
\SQLServer:Databases(tempdb)\Data File(s) Size (KB)
\SQLServer:General Statistics\Logins/sec
\SQLServer:General Statistics\User Connections
\SQLServer:Locks(*)\Lock Requests/sec
\SQLServer:Locks(*)\Lock Waits/sec
\SQLServer:Memory Manager\*
\SQLServer:Memory Node(*)\*
\SQLServer:Plan Cache(*)\Cache Pages
\SQLServer:Resource Pool Stats(*)\*
\SQLServer:SQL Statistics\Batch Requests/sec
\SQLServer:SQL Statistics\SQL Compilations/sec
\SQLServer:SQL Statistics\SQL Re-Compilations/sec
\SQLServer:Transactions\Free Space in tempdb (KB)
\SQLServer:Wait Statistics(*)\*
\SQLServer:Workload Group Stats(*)\*

There's only a few reasons I personally use a binary file for logging perfmon, anyway.  One is a timing issue.  If logman is used to start logging perfmon, and the data collector is started *before* the SQL Server service is started, the only format that can capture the SQL Server-specific counters like the list above is blg.  That can be a pain in the neck when trying to capture perfmon on a Windows cluster, or around a maintenance window where a server restart is expected.

Binary format is kinda nice because it opens directly in a graphing interface.

But with a long counter list... that's just a mess.


I do a lot of work in Excel.  Years ago I used tsv format mainly for logging perfmon.  But then somewhere along the line, Excel stopped recognizing tsv as a native filetype for itself.  I messed around changing associations for a while... eventually giving up and using csv format from then on.

So if I do get a large blg binary format perfmon capture, I relog it. Like this...



relog "C:\Users\lnieders\Desktop\sqlsatCHI\perfmon_blg_03111211.blg" -f csv -o "C:\Users\lnieders\Desktop\sqlsatCHI\perfmon_blg_03111211_relog.csv"

This is the relog in a cmd.exe window while it's in flight.


And when it's complete...



Here's the results of the relog.



So the csv from relogging the blg is still larger than the csv that was captured in csv format.  But way smaller than the original blg itself.

And once it's in csv format, I can make all of my crazy graphs in Excel 😊

Here's a reference for logman to capture perfmon logs.
https://technet.microsoft.com/en-us/library/bb490956.aspx

And a reference for relog if you want to change log format, narrow the time window, pull out a subset of counters, etc.
https://technet.microsoft.com/en-us/library/bb490958.aspx


Wednesday, March 1, 2017

Plotting the CPU in a VM shouldn't look like this...

***** Author's note *****
My half-baked drafts often contain more questions than answers 😀😀😀
When I get more information, I may be able to come back and update.  If I actually answer the questions, hopefully I'll promote to a full baked blog post.  No promises, though.

If this VMware half-baked draft intrigues you, this related half-baked draft might also be of interest.
Windows Guest: VMware Perfmon metrics intermittently missing?
http://sql-sasquatch.blogspot.com/2016/08/windows-guest-vmware-perfmon-metrics.html
**********

I've become accustomed to perfmon graphs that look like this from within a VMware vm.

It's a 4 vcpu system.  When I first worked with VMs, I'd expect that metric to align with 100 * vcpu count - so a max value of 400.  But as you can see below it approaches 500.  Maybe that's because the non-vcpu worlds consumed physical CPU time on behalf of the VM on pcpus other than the 4 pcpus serving the vcpus.  (So the vcpus themselves could consume up to 400% with the non-vcpu worlds adding some beyond that.)  It might also be due to calculations based on rated frequency of the pcpus and speedstep kicking in.  Could even be both?  Maybe by the time I convert this from a half-baked draft to a full-fledged blog post I'll be able to propose a way to know what leads to the overage.


Here's the reason for this half-baked blog post: the graph below took me by surprise.

Something weird is going on - probably outside this VM - and severely affecting the reported relationship between guest vcpu utilization in 'Processor Info' and 'VM Processor'.

It's important to understand what 'VM Processor(*)\%Processor Time' means in the context of a VM.  It means the time there are runnable instructions already on the vcpu.  But the vcpu might not be bound to a pcpu beneath.  For example - what if the pcpus are oversubscribed, and two busy VMs are fighting over the same pcpus?  The vcpus in the guests could report 100% busy while only being bound to pcpus 50% of the time.  In that scenario, high %rdy time would be expected at the ESXi host level.  Could be a %COSTP situation, too - a scheduling requirement of a multi-vcpu VM.  Could be lots of halt & wakeup cycles, resulting in higher reported vcpu utilization in the guest than pcpu utilization in the host.  Lots of migrations of vcpus across NUMA nodes could also lead to higher vcpu utilization reported in the guest than pcpu utilization at the host.

Could also be a memory mismatch.  If the guest is performing operations on vRAM which is actually not backed by physical RAM in the ESXi host, operations may be considered 100% CPU time within the guest even though the host registers a little bit of CPU time and a lot of wait time for paging/swap space traffic.  In a similar fashion, vMotion means a lot of memory traffic becomes disk traffic, and vcpu utilization in the guest can be exaggerated by sluggish interaction with vRAM.  But two hours of wall clock time is an awful lot of vmotion :-)

But I can't shake the feeling that the nearly perfect 1:00 am to 3:00 am window means something very important here.  Maybe VM backups in the shared datastore that has this VM's vmdk?

I'll be back hopefully to update this one with more detail in the future...

*Current suspects include:
·         guest vRAM not backed by host physical RAM
·         oversubscribed pcpus & excessive %rdy
·         excessive %Costop
·         excessive migrations
·         excessive halt/wakeup cycles
·         vm backups in a shared datastore
·         excessive vmotion activity
·         patch apply to ESXi host?