Tuesday, March 14, 2017

#TSQL2sday - The Daily WTF - A Tale of 2 Trace Flags


March 14, 2017 is T-SQL Tuesday!  Yahoo!!  Kennie Nybo Pontoppidan is hosting this month.

Here was the announcement...

Announcing T-SQL Tuesday #88: The daily (database-related) WTF
http://www.pontop.dk/single-post/2017/03/07/Announcing-T-SQL-Tuesday-88-%E2%80%93-The-daily-database-related-WTF

And here is my contribution.  A list of performance changes is much easier to come up with than the best order to introduce those changes.

Yep, I remember deploying a trace flag on a production server for the first time.  Trace flag 8048, to mitigate spinlock contention related to memory allocation.  That step followed weeks of observation and research, attempts to recreate the problem on a test system (unsuccessful at the time), and outreach to Microsoft and various experts.

Although it was a nail-biter, it went well.  An important weekly batch of reports decreased in execution time from nearly 20 hours to just over 2 hours.  Even more reassuringly, looking at graphs of CPU utilization vs read bytes/sec and CPU vs buffer page lookups, the relationships were clear again.  As we increased and decreased demand on the server, we saw increases and decreases in CPU utilization.  Good.

Soon after, though, on that system and numerous others, we ran into another significant issue limiting scalability and dragging out execution time of batch reports: very large memory grants and long lists of pending memory grants with excessively long resource_semaphore waits.

After significant additional research, we learned that trace flag 4199 had numerous positive effects on our workload.  The easiest for us to demonstrate was the fix that could be independently enabled with trace flag 4101; that fix was also rolled up into the trace flag 4199 fixes.

Here's a reference for that specific fix.
FIX: Reorder outer joins with filter criteria before non-selective joins and outer joins
https://support.microsoft.com/en-us/help/318530/fix-reorder-outer-joins-with-filter-criteria-before-non-selective-joins-and-outer-joins

And here are the posts where I've tried to keep track of all T4199 fixes.
Trace Flag 4199: Complex Risk Assessment Part I
http://sql-sasquatch.blogspot.com/2014/01/trace-flag-4199-complex-risk-assessment.html
Complex Risk Assessment: Part II Trace Flag 4199 fixes w/o individual trace flags & SQL Server 2016 Update
http://sql-sasquatch.blogspot.com/2014/01/trace-flag-4199-complex-risk-assessment_6.html

Trace flag 4199 was good for our workload.  We tested it fully on SQL Server 2008R2, and would later replicate the testing on SQL Server 2012 and 2014.

I've long believed that knowing the list of performance recommendations is much easier than knowing the best order to adopt them.  Going on the road with my knowledge of perfmon, trace flag 8048, and trace flag 4199 strongly reinforced that belief.

I left Wisconsin headed for Illinois with another performance engineer for an on-site observation and performance intervention.

For the first few hours we collected data with perfmon and some stored procedures to grab waits, spinlocks, and memory grant information.  I was excited to see patterns I recognized - spinlock contention clustered around CMEMTHREAD waits (a hint that trace flag 8048 would be applicable) and "eyes too big for stomach" memory grant requests (in queries with many joins, a hint that trace flag 4199 may be helpful to "right-size" the memory grant).
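For illustration, here's a minimal sketch of the kind of polling I mean - not the exact stored procedures we used that day - point-in-time snapshots of waits, spinlocks, and memory grants from the DMVs, with successive captures diffed to see the deltas:

-- Minimal sketch, not the exact procedures used that day:
-- snapshot waits, spinlocks, and memory grants.
SELECT GETDATE() AS capture_time, wait_type, waiting_tasks_count,
       wait_time_ms, signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_time_ms > 0;

SELECT GETDATE() AS capture_time, name, collisions, spins, sleep_time, backoffs
FROM sys.dm_os_spinlock_stats
WHERE spins > 0;

SELECT GETDATE() AS capture_time, session_id, requested_memory_kb,
       granted_memory_kb, used_memory_kb, wait_time_ms
FROM sys.dm_exec_query_memory_grants;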

I'd seen systems that implemented both trace flags as startup parameters simultaneously.  I'd helped organizations implement first T8048, then T4199 (based on the timing of my research and testing of the trace flags).  This was the first time that there was a desire to implement the trace flags one-at-a-time and we had the choice of which to implement first.

I hadn't chosen to put T8048 in first previously - that was just the way everything worked out.  If I chose to follow that order - I'd be doing what I'd seen and done before.  But... there was also a reason to choose the reverse order, with T4199 first.  Spinlock issues - especially at that time - were considered more exotic performance issues than many of the "plan-shaping" issues that trace flag 4199 addressed.  Many administrators were much more familiar with that type of performance issue - eliminating significant waits, altering plan shapes, making the logical work of queries more efficient - than with the busy wait/management overhead of spinlocks.  Sometimes demonstrating an improvement that someone is already familiar with evaluating is a plus, helping to gain trust.  I didn't know of a specific reason NOT to put trace flag T4199 in place, followed by T8048 later.  And in this case it seemed like building up some interpersonal capital might be a good idea.

Trace flag 4199 went in as a service startup parameter, and SQL Server was restarted.  When the heavy workload hit, we watched resource utilization closely.  Things looked good for a few minutes.  Then the server became unresponsive.  CPU was saturated.

Trace flag 4199 had made the logical work of queries more efficient, which let the workload try harder.  Individual memory grants had shrunk, so query concurrency was higher.  The number of concurrent parallel workers was higher. The opportunities for contention in allocation of workspace memory for sorts and hashes increased significantly.  And spinlock contention resulted in CPU saturation.  There's a condition known as a spinlock convoy - many concurrent threads just trading a spinlock resource among themselves, with all of those not holding the resource spinning.  I'd read about that, but never seen it make a server unresponsive until that day.

Fortunately the workload was able to be resubmitted.  I sheepishly explained that the server becoming unresponsive was actually a sign that the trace flag was doing what it was supposed to do... but also meant I had chosen the wrong order to enable the trace flags.  The server was brought down.  When it was brought back up, we switched in trace flag 8048.  The workload was resubmitted, and a modest gain was observed.  The next day, trace flag 4199 was added and a significant performance and scalability gain was realized.

And I learned to consider much more closely the consequences of the order of performance intervention.  That's the only time I can remember trying to explain that the unresponsive server being witnessed by a whole room full of people was actually a good sign.  I'm sure there were at least a couple of people in that room thinking "performance intervention, WTF?"

Monday, March 13, 2017

perfmon binary blg files - they're doggone big! an example of relog

At SQLSaturday 600 in Chicago this past weekend (March 11, 2017) I showed that perfmon data logged to a binary (blg) file can be much larger than the same data captured in an alternative format like csv.

(My presentation can be found here:
http://www.sqlsaturday.com/600/Sessions/Details.aspx?sid=59564)

Here's ~72 minutes of activity logged in 1 second increments, captured in both a binary blg file and a flat file csv format. 553 columns including the timestamp.


That's more than a 6 times difference in size!

For that reason *if* I use a binary log file at all, I put in a limited set of counters.  This is a typical counter set that I may use when capturing perfmon in a blg binary file.


\SQLServer:Access Methods\Full Scans/sec
\SQLServer:Access Methods\Index Searches/sec
\SQLServer:Access Methods\Page Splits/sec
\SQLServer:Access Methods\Range Scans/sec
\SQLServer:Backup Device\*
\SQLServer:Buffer Manager\*
\SQLServer:Buffer Node(*)\*
\SQLServer:Databases(_Total)\Backup/Restore Throughput/sec
\SQLServer:Databases(_Total)\Bulk Copy Rows/sec
\SQLServer:Databases(_Total)\Bulk Copy Throughput/sec
\SQLServer:Databases(_Total)\DBCC Logical Scan Bytes/sec
\SQLServer:Databases(_Total)\Log Bytes Flushed/sec
\SQLServer:Databases(_Total)\Log Flushes/sec
\SQLServer:Databases(tempdb)\Data File(s) Size (KB)
\SQLServer:General Statistics\Logins/sec
\SQLServer:General Statistics\User Connections
\SQLServer:Locks(*)\Lock Requests/sec
\SQLServer:Locks(*)\Lock Waits/sec
\SQLServer:Memory Manager\*
\SQLServer:Memory Node(*)\*
\SQLServer:Plan Cache(*)\Cache Pages
\SQLServer:Resource Pool Stats(*)\*
\SQLServer:SQL Statistics\Batch Requests/sec
\SQLServer:SQL Statistics\SQL Compilations/sec
\SQLServer:SQL Statistics\SQL Re-Compilations/sec
\SQLServer:Transactions\Free Space in tempdb (KB)
\SQLServer:Wait Statistics(*)\*
\SQLServer:Workload Group Stats(*)\*

There are only a few reasons I personally use a binary file for logging perfmon, anyway. One is a timing issue. If logman is used to start logging perfmon, and the data collector is started *before* the SQL Server service is started, the only format that can still capture the SQL Server-specific counters like the list above is blg. That can be a pain in the neck when trying to capture perfmon on a Windows cluster, or around a maintenance window where a server restart is expected.
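If it helps, here's roughly what that logman setup looks like - a sketch with hypothetical paths and collector name, and the counter list (like the one above) saved one-per-line in a text file:

:: Create a binary (blg) data collector sampling every 1 second,
:: reading the counter list from a text file. Names and paths are examples.
logman create counter SQLCounters -f bin -si 00:00:01 -cf "C:\perflogs\counters.txt" -o "C:\perflogs\SQLCounters"
logman start SQLCounters

:: ...and later...
logman stop SQLCounters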

Binary format is kinda nice because a blg file opens straight into a graphing interface.

But with a long counter list... that's just a mess.


I do a lot of work in Excel.  Years ago I used tsv format mainly for logging perfmon.  But then somewhere along the line, Excel stopped recognizing tsv as a native filetype for itself.  I messed around changing associations for a while... eventually I gave up and have used csv format ever since.

So if I do get a large blg binary format perfmon capture, I relog it. Like this...



relog "C:\Users\lnieders\Desktop\sqlsatCHI\perfmon_blg_03111211.blg" -f csv -o "C:\Users\lnieders\Desktop\sqlsatCHI\perfmon_blg_03111211_relog.csv"

This is the relog in a cmd.exe while it's in flight.


And when it's complete...



Here's the results of the relog.



So the csv from relogging the blg is still larger than the csv that was captured in csv format.  But way smaller than the original blg itself.

And once it's in csv format, I can make all of my crazy graphs in Excel 😊

Here's a reference for logman to capture perfmon logs.
https://technet.microsoft.com/en-us/library/bb490956.aspx

And a reference for relog if you want to change log format, narrow the time window, pull out a subset of counters, etc.
https://technet.microsoft.com/en-us/library/bb490958.aspx
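As a sketch (paths, counter file, and times are hypothetical), a relog that converts to csv while narrowing the time window and keeping only a subset of counters looks like this - -cf lists the counters to keep, -b and -e bound the time window:

relog "C:\perflogs\capture.blg" -f csv -o "C:\perflogs\capture_subset.csv" -cf "C:\perflogs\counter_subset.txt" -b "3/11/2017 12:11:00" -e "3/11/2017 13:23:00"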


Thursday, March 9, 2017

Some tools I use in blogging...

***** 2017 March 9 *****
I go here all the time to keep SQL Server builds and versions straight.
https://sqlserverbuilds.blogspot.com/

I use hilite.me for HTML code formatting. (I like the SQL language setting with the vs style.)
http://hilite.me/

**********



SQL Server - Who's got a lock and how long've they had it?

SQL Server 2016 SP1 for the adventure below.

A question came up this morning on Twitter #sqlhelp about capturing all activity against a given SQL Server table for a period of time.  The goal was to identify what was taking out a lock on a table and taking its sweet time before releasing it 😀

Extended Events and SQL Trace are options to capture all of this activity.  But the observer overhead is something to consider.

Depending on whether observer overhead is a risk and whether transactions span SQL requests (and thus might prevent identifying the SQL text responsible for the lock request), polling at a given interval such as 1 or 5 minutes with a query like that below might also fit the bill.


Session 1
Create the table, grab table lock.


drop table if exists sqL_handLe;
begin tran
create table sqL_handLe (pk bigint);
waitfor delay '00:00:30';


Session 2
Identify the locks

select so.name as [lock_object_name],
       tl.resource_associated_entity_id as [lock_object_id],
       tl.resource_database_id,
       st.session_id,
       dt.database_transaction_begin_time,
       --dt.database_transaction_log_record_count,
       --dt.database_transaction_log_bytes_used,
       --dt.database_transaction_log_bytes_reserved,
       tl.request_mode,
       tl.request_status,
       tl.resource_type,
       --tl.resource_subtype,
       --tl.resource_description,
       --tl.resource_lock_partition,
       sqlT.text as request_sql_text
from sys.dm_tran_locks tl
join sys.objects so with (nolock)
     on so.object_id = tl.resource_associated_entity_id
join sys.dm_tran_database_transactions dt
     on dt.transaction_id = tl.request_owner_id
     and tl.resource_database_id = dt.database_id
join sys.dm_tran_session_transactions st
     on st.transaction_id = tl.request_owner_id
full outer join sys.dm_exec_requests req
     on req.session_id = st.session_id
outer apply sys.dm_exec_sql_text(req.sql_handle) as sqlT
where so.name = 'sqL_handLe'

Results of the query while the batch which created the table is still executing (during the waitfor).



Results of the query after the batch has completed - no request so no sql_handle => no SQL text returned.  But the lock is still held.



Polling for locks every minute or so and storing the results in a table can be much more lightweight than a trace or an Extended Events session.  However, it's not guaranteed to capture the SQL text which was responsible for creating the lock if the transaction spans batches/requests.  So that's the trade-off.
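A minimal sketch of what that could look like (table name and schedule are up to you - a SQL Agent job running every minute or so would do); it just wraps a variation of the query above in an INSERT:

-- Hypothetical polling table; diffing capture_time shows how long a lock was held.
create table lock_poll_results (
    capture_time                    datetime2(0) not null default sysdatetime(),
    lock_object_name                sysname,
    session_id                      int,
    database_transaction_begin_time datetime,
    request_mode                    nvarchar(60),
    request_status                  nvarchar(60),
    resource_type                   nvarchar(60),
    request_sql_text                nvarchar(max)
);

-- Run on a schedule. left join instead of full outer join: locks held by
-- sessions with no active request still show up, with NULL SQL text.
insert into lock_poll_results
    (lock_object_name, session_id, database_transaction_begin_time,
     request_mode, request_status, resource_type, request_sql_text)
select so.name, st.session_id, dt.database_transaction_begin_time,
       tl.request_mode, tl.request_status, tl.resource_type, sqlT.text
from sys.dm_tran_locks tl
join sys.objects so
     on so.object_id = tl.resource_associated_entity_id
join sys.dm_tran_database_transactions dt
     on dt.transaction_id = tl.request_owner_id
     and tl.resource_database_id = dt.database_id
join sys.dm_tran_session_transactions st
     on st.transaction_id = tl.request_owner_id
left join sys.dm_exec_requests req
     on req.session_id = st.session_id
outer apply sys.dm_exec_sql_text(req.sql_handle) as sqlT
where so.name = 'sqL_handLe';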

A reminder to myself to clean up after myself.


drop table if exists sqL_handLe;
commit tran

Wednesday, March 8, 2017

SQL Server 2016 SP1 EE checkdb with physical_only: Trace flags 2549 & 2562

First things first: I don't think trace flag 2549 does anything in SQL Server 2016 SP1 Enterprise Edition.

Last time I said that, I was wrong 😀. I was tricked because trace flag 2549 needs to be enabled globally in order to change behavior - I was testing with it enabled at the session level.  But... after opening a ticket with Microsoft, the kb article was updated to note that T2549 must be enabled globally 😀.

*This* time I made sure to enable T2549 globally.  But I didn't see any difference in behavior, even though all 8 of my database filegroup data files were in directories under mount points of a single drive letter.  Checkdb reads were *really* aggressive either way - more aggressive than previously, when T2549 did make a difference.  My execution times were within 5% of each other.  I'll scrutinize the perfmon and session waits more later.  But for now, I'm assuming that behavior and expected performance are the same with or without T2549.

Now on to the good stuff.  As I mentioned, checkdb with physical_only reads are quite aggressive in SQL Server 2016 SP1.  The system I tested on is an 8 vcpu vm with 64 GB RAM.  The storage is from a hybrid SAN array, although all of the disk devices are SSDs.  Aggregate database data footprint - across 8 files in a single filegroup - is 3335GB.  (Sum of fileproperty(<name>,'spaceused') across the 8 files).
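(That footprint number came from something along these lines, run in the database being checked:)

-- Sum of used pages across the data files, converted to GB (FILEPROPERTY returns 8KB pages).
SELECT SUM(CONVERT(bigint, FILEPROPERTY(name, 'SpaceUsed'))) * 8 / 1048576.0 AS space_used_gb
FROM sys.database_files
WHERE type_desc = 'ROWS';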

Let's see some graphs!!!

Look at that read throughput in a vanilla checkdb with physical_only!  These perfmon captures are using 1 second intervals - nearly saturating bandwidth for a considerable amount of time during its execution.



Hard to see because of the scale, but there's a rise and fall in memory grant usage corresponding to the "batches" that checkdb is using.  Way less than 10 GB of tempdb is used at peak, for less than 6 minutes out of the roughly 60 minute execution time.

Let's take a look at CPU utilization.  Ohhh yeahh!  Most of the time there were 16 checkdb parallel threads in use for this 8 vcpu VM.  The only break in fairly heavy CPU utilization came during the processing of that large batch in the middle.


OK.  Recall that trace flag 2562 instructs checkdb to use a single large "batch" for all of its checking. This can lead to a significant speedup - especially on spinning disk.  The cost is more use of tempdb.  The next test - checkdb with physical_only + T2562 - demonstrates both the benefit and cost of this behavior.
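The two runs were essentially this shape (database name is a stand-in; I enabled the trace flag globally for the second run):

-- Baseline run: plain physical_only check.
DBCC CHECKDB (N'MyBigDatabase') WITH PHYSICAL_ONLY;

-- Second run: T2562 enabled globally, so checkdb uses one large internal "batch",
-- at the cost of more tempdb use.
DBCC TRACEON (2562, -1);
DBCC CHECKDB (N'MyBigDatabase') WITH PHYSICAL_ONLY;
DBCC TRACEOFF (2562, -1);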

Hooo doggie!!  It only took about 48 minutes and 30 seconds now!  Look at that fascinating step function of tempdb use!!  Notice how the read bytes/sec plummet briefly before each new step of tempdb use.  If only I had enough time to dig into the mechanics of that!  Maybe another day...


All right - let's take a look at CPU utilization.  CPU also drops at the beginning of each new step of tempdb use.  How tantalizing!


But there you have it.  Pretty aggressive checkdb with physical_only in SQL Server 2016 SP1.  Even a little bit more aggressive with trace flag 2562 enabled.  (I enabled it globally for these tests.)

When planning for checkdb with physical_only in SQL Server 2016 SP1, you may want to stick it into its own Resource Governor workload group so you can independently control DOP, maybe tinker with its max memory grant, and maybe even use IOPs governance.  I hope to return to those matters sometime soon here.  But first gotta also test T2528 (disable parallel checks) and full checkdb.
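As a rough sketch of that Resource Governor idea (pool, group, limits, and the login used for classification are all hypothetical - tune to taste):

-- Hypothetical Resource Governor setup for a dedicated checkdb login. Run in master.
CREATE RESOURCE POOL checkdb_pool WITH
    (MAX_CPU_PERCENT = 50,
     MAX_IOPS_PER_VOLUME = 10000);   -- IOPs governance, if wanted (SQL Server 2014+)

CREATE WORKLOAD GROUP checkdb_group WITH
    (MAX_DOP = 4,
     REQUEST_MAX_MEMORY_GRANT_PERCENT = 15)
USING checkdb_pool;
GO

-- Classifier routes the hypothetical checkdb login into its own group.
CREATE FUNCTION dbo.rg_classifier() RETURNS sysname
WITH SCHEMABINDING
AS
BEGIN
    RETURN CASE WHEN SUSER_SNAME() = N'checkdb_login'
                THEN N'checkdb_group' ELSE N'default' END;
END;
GO

ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.rg_classifier);
ALTER RESOURCE GOVERNOR RECONFIGURE;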

I should note that these checkdb tests were run in isolation.  If there is another concurrent workload on the system, not only will it be affected by checkdb, but that workload will impact checkdb performance.  Even more so if that concurrent workload contains DML (because the internal snapshot for checkdb must be maintained).

*****2017 03 08 addendum *****
I wanted to further detail disk IO activity with graphs.  Since the end-point devices are all SSDs, there is less variability in latency and read bytes/sec than would be expected with 10k or 15k fibre channel HDDs.

That's pretty good read throughput, at pretty low latency.


Pretty spikey read IOPs from second to second.


Now let's look at the checkdb run with T2562 enabled.  First at read bytes/sec against sec/read service time.

Throughput is even higher - from peaks of 1200 MB/sec to peaks of 1600 MB/sec.  That additional throughput comes at a cost, though - read service time has risen to about 8 times its previous values.

Let's look at read IOPs vs read service time for the checkdb with T2562.


IOPs have come down even though throughput went up.  So the average IO size is higher.  But the jump in service time is much higher than predicted by the increase in average read IO size alone.

Rather, saturation of bandwidth must account for some of the increase in service time.  In general, behavior at saturation is going to deviate from what you'd expect based on non-saturated behavior.

This brings up an important point about decisions in performance engineering and planning for systems.  If the main goal is completing checkdb as quickly as possible, and the 10-12 minutes of saved execution time is valuable enough, I'd definitely use T2562 on this system.

But... if there were a concurrent query workload that generated 100 MB/sec of read load?  I'd strongly consider running checkdb without T2562.  Even though it would take a bit longer, there's room in there for additional resource utilization before hitting saturation.

Anyway... ciao for now!




Friday, March 3, 2017

SQL Server 2016 Spinlocks: Query Store

Recently on #sqlhelp there was a case of a 1 GB Query Store that was nearly full, with the instance slowing to a crawl.  It was difficult to even run diagnostic queries; to resolve the situation the service was restarted and Query Store was set to read only (so all contents were retained).
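(For reference, flipping Query Store to read only - keeping its contents but stopping new captures - is a one-liner; database name hypothetical:)

ALTER DATABASE [YourDatabase] SET QUERY_STORE (OPERATION_MODE = READ_ONLY);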

Hard to say if the problem was work or wait. Certainly both could have contributed.  But most diagnostic queries are pretty light - unless CPU is saturated or there are long waits to compile or get a memory grant, they'll usually run even if a bit sluggish.

So I suspect that this was a work problem.  That is, with the Query Store nearly full, management work became so expensive that user query work was starved for CPU time.  That doesn't seem far-fetched, because the number of queries in the 1 GB Query Store was over 100,000, if I recall correctly.

There have been issues in the past with the plan cache when it was at or near plan count capacity - it would become a one-in-one-out scenario, with the session that wants to insert a new plan being responsible for the necessary cleanup.  If the hash buckets were really long it became expensive to walk them.

I'm not sure how size-based cleanup is implemented for the Query Store, other than having seen it documented that it kicks in near 80% of the configured limit.  If sessions waiting to compile a query take on responsibility for cleaning stuff out of the Query Store, there could be a situation very similar to the plan cache conundrum of old.
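For what it's worth, sys.database_query_store_options shows how close a Query Store is to its limit, whether it has flipped to read only (and why), and the size-based cleanup mode - something like:

SELECT actual_state_desc,
       readonly_reason,
       current_storage_size_mb,
       max_storage_size_mb,
       size_based_cleanup_mode_desc
FROM sys.database_query_store_options;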

If that's happening and CPU utilization is near 100% either server-wide or on a subset of cores, then spinlock stats may become more enlightening than wait stats.

I did see a number of new spinlocks added for Query Store...

SELECT name FROM sys.dm_os_spinlock_stats WHERE name LIKE '%QUERY_STORE%' ORDER BY name

QUERY_STORE_ASYNC_PERSIST
QUERY_STORE_CAPTURE_POLICY_INTERVAL
QUERY_STORE_CAPTURE_POLICY_STATS
QUERY_STORE_HT_CACHE
QUERY_STORE_PLAN_COMP_AGG
QUERY_STORE_PLAN_LIST
QUERY_STORE_SELF_AGG
QUERY_STORE_STMT_COMP_AGG
SPL_QUERY_STORE_EXEC_STATS_AGG
SPL_QUERY_STORE_EXEC_STATS_READ
SPL_QUERY_STORE_STATS_COOKIE_CACHE

CREATE INDEX sort_in_tempdb - test model for sort/hash spill stress?

These results are from a 4 vcpu VMware VM.  The test is an offline CREATE INDEX, at MAXDOP 4.  This vm - running SQL Server 2014 with trace flags 1117 and 1118 globally enabled - has two equally sized tempdb files, each on its own Windows volume in the guest (no other files in those NTFS filesystems).  The tempdb Windows volumes, like the other volumes holding persistent database files or transaction logs (including the tempdb transaction log), are distributed among 3 pvscsi vHBAs.  But the underlying storage is by no means supercharged here.  This system isn't a full performance-at-scale testing rig. More like dev & exploration 😁
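The test statement was along these lines (table, column, and index names here are stand-ins, not the actual ones):

-- Offline index build at MAXDOP 4, with the sort work pushed to tempdb.
CREATE INDEX ix_bigtable_col1
    ON dbo.bigtable (col1)
    WITH (SORT_IN_TEMPDB = ON, MAXDOP = 4, ONLINE = OFF);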
 
The 4 parallel threads (the dark blue area) are keeping the vcpus 100% busy (the red dots).  The CREATE INDEX is the only active request in the default workload group (the really light blue dots), and there are no other workload groups except the internal group.  Perfmon was collecting metrics in 1 second intervals.  Notice that it caught a task waiting for a page latch for a few seconds!! (That's the light blue dashed line.)  But - that waiter was at a time of reported 100% vcpu utilization.  Any optimization that removed that wait with no other changes would simply see the time replaced with signal wait or sos_suspend_queue time.


Well... how about query memory and tempdb usage?  The query received a memory grant of almost 4 GB, which it used in a step-function manner, as seen in the red dotted line below.  The one second perfmon collection interval really shows the microburst nature of this tempdb activity.  Those one-second bursts are up to 500 MB/sec.  That type of write throughput can be really punishing to shared storage.


Let's take a look at the tempdb footprint during this CREATE INDEX.  The bursts of writes lead to pretty steady tempdb footprint growth throughout the activity.



Let's take a look at writes per second to these two Windows volumes, rather than write bytes. The shape of the area graph is identical - only the unit of measure has changed.  Throughout the operation the writes are 64kb extents.  Since this CREATE INDEX is running in isolation, the writes and write bytes are divided almost evenly between the two files. (Although it's kind of hard to tell in these graphs - next time I'll have to grab virtual file stats before and after, too, to show the breakdown.)
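(Grabbing those virtual file stats would look something like this - snapshot before and after the CREATE INDEX and diff the two captures to get the per-file breakdown:)

SELECT GETDATE() AS capture_time,
       mf.name AS file_name,
       vfs.num_of_writes,
       vfs.num_of_bytes_written,
       vfs.num_of_reads,
       vfs.num_of_bytes_read
FROM sys.dm_io_virtual_file_stats(DB_ID('tempdb'), NULL) AS vfs
JOIN sys.master_files AS mf
     ON mf.database_id = vfs.database_id
     AND mf.file_id = vfs.file_id;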



OK, one last graph I'll share from this experiment: vCPU utilization vs write bytes.  With more vcpus and a higher DOP for the operation (or two concurrent operations at DOP 4) I'd expect more work to be done per unit of time, and the write rate to increase accordingly.  I kind of cheated: before I ran this test I looked at some of my old 1 second captures from an 8 vcpu system on a monster server :-)  That test was with a varied workload of complex, long-running SQL queries at query concurrency 16 (enforced by a Resource Governor active request limit).  That system did get peaks of over 1.2 GB/s of writes to its two tempdb files, though. 😊



Now, my working assumption is that sort_in_tempdb is subject to the same potential allocation page bottleneck as sort and hash spill - ultimately the same allocation page contention we know and love from systems where session counts many times the (v)CPU count put pressure on lots of temp tables with create/destroy/populate cycles.

But on these systems the target for query concurrency is typically 2 to 3 times the vcpu count, with a MAXDOP no higher than 8.  Of the queries running, only 2 to 4 are expected to spill concurrently at any given time.  And there is no significant contention from concurrent use of #temp or ##globaltemp tables among the queries.

I believe that type of workload (where spill is the main consideration) has a very different profile from the typical use cases that drive tempdb data file count based on (v)CPU count.  For a number of reasons, I think on these systems - even at up to 32 physical cores - 2 or 4 tempdb data files (each on its own Windows volume, or co-located if onboard SSD/flash is used for tempdb) will be better than 8 or more.  Hopefully I'll articulately explain why later...

For now it's off to check on resources for a larger scale run of this test - with a single command at high DOP and multiple concurrent commands at lower DOP - to see how the waiters, CPU utilization, and write activity play out.