Tuesday, July 15, 2014

Hitachi VSP and Universal Replicator: Asynchronous replication is not free

Occasionally I'll hear the idea that asynchronous array-level replication does not effect primary system performance.  That's just not true.  It doesn't effect performance in the same way as synchronous replication does. But that's different from not affecting primary system performance at all*.

But why take my word from it?  EMC and Hitachi are both really good at array level replication.  I've already quoted from EMC documentation on my blog today, so now I'll switch to Hitachi documentation :-)

Hitachi Virtual Storage Platform
Hitachi Universal Replicator User Guide


Performance is affected because of journal-to-journal copying. Usage rates are also affected.
Page 1-3 


*It is possible that a database with asynchronous writes and a light enough level of writes may see the effects of asynchronous - even synchronous - array replication without noticeably impeding query completion speed.  But I'm talking about pedal-to-the-metal data warehouse loads and the like :)

EMC VMAX Virtual Provisioning: Striped Metas now generally recommended

For some reason, I find it very easy to come across REV A12 of the EMC® Symmetrix®Virtual Provisioning™ documentation referenced below compared to REV A14.  Putting up this blog post more for myself than others... perhaps this way I'll easily be able to track down REV A14.  For me REV A14 is important because its when the general recommendation switched from concatenated metas to striped metas.  No doubt the performance benefits were the motivation, and EMC waited until enough operations on striped metas were possible to ease administrative burden of recommending striped metas.

I find the recommendation of striped metas especially important for Oracle redo logs and SQL Server transaction logs when SRDF synchronous replication is used.  This is because the serialization required for replication has less of a performance drag with striped metas than concatenated metas. Striped metas may also help when asynchronous replication is used, because of the write cache policies on the VMAX.

Best Practices for Fast, Simple Capacity Allocation with EMC® Symmetrix®Virtual Provisioning™

In most cases, EMC recommends using concatenated rather than striped metadevices with Virtual Provisioning.
Page 12

Best Practices for Fast, Simple Capacity Allocation with EMC® Symmetrix®Virtual Provisioning™
Document 7383 300-006-718 REV A14  (link requires EMC Account login)

In most cases, EMC recommends using striped rather than concatenated metadevices with Virtual Provisioning because they offer the following advantages:

  • With Synchronous SRDF®, Enginuity allows one outstanding write per thin device per path. With concatenated metadevices, this could cause a performance problem by limiting the concurrency of writes. 
    This limit will not affect striped metadevices in the same way because of the small size of the metavolume stripe (1 cylinder or 1920 blocks).
  • In Enginuity releases prior to 5875, the code allows eight read requests per path per thin device. This may limit the number of read requests that can be passed through to the thin pool regardless of the number of data devices it contains. This can cause slower performance in environments with a high read miss rate. A striped metadevice offers improved throughput, because each metamember is allowed to queue requests to the backend in parallel. Even with the limit removed in Enginuity 5875, there are still potential performance benefits to using striped thin metas over concatenated thin metas.
  • Symmetrix Enginuity has a logical volume write pending limit to prevent one volume from monopolizing writeable cache. Because each metamember gets a small percentage of cache, a striped meta is likely to offer more writeable cache to the metavolume.
However, concatenated thin metadevices do offer two important operational advantages over striped thin metadevices:

  • Non-metadevices can be converted to concatenated metadevices without destroying the existing data. Beginning with Enginuity 5875 and Solutions Enabler 7.2, it’s possible to convert a thin device into a concatenated meta without unmapping the device first.
  • Concatenated thin metadevices can be expanded without destroying existing data by adding metamembers to an existing metadevice. Beginning with Enginuity 5875, a striped metadevice can also be expanded while preserving existing data. 
    Pages 12-13

Tuesday, July 8, 2014

Before I drop all those SQL Server auto-generated column stats....

A number of systems I work with have been hit particularly hard by kbA 2904010 (the fix requires relevant code and T4199 to be activated, by the way).

This defect causes the actual number of rows sampled for statistics to be far short of the specified target.  For example, many tables of millions of rows had auto stats updates with a default sample target size of 20%, but the actual sample sizes for these updates was as small as .2%!  (Used 'dbcc show_statistics(table_name,stats_name) with stat_header' to see rows_sampled and rows as of last stats update.)  Some horrible plans resulted.  Over the next few weeks I hope to write a number of blog entries about this particular defect, what I've learned about stats as a result, and what I've done or encouraged others to do for mitigation.

Mapping out a fix plan on paper wasn't too bad... find the most hurt stats that were most used, and those are candidates for 'update statistics... WITH FULLSCAN,NORECOMPUTE'.

On one table in particular with 6GB of data, over 6 million rows, no indexes other than the clustered index... updating all statistics took almost 90 minutes on a test system, with an 'ideal memory' indicated by the query memory grant DMV ranging up to 18GB!  Updating statistics for the clustered index alone took only 6 minutes and didn't require a memory grant at all.  In this particular case, the clustered index key was complex - 6 columns in total.  Of the 7 auto-generated column statistics for the table, 5 of them were the non-initial columns in the clustered index.  There were 4 remaining columns in the table - 2 of which had column statistics generated.  Now - I've got a feeling that redesigning this clustered index could be helpful all around, as well as adding some nonclustered index.  But that'll have to be a project for another day. 

Deciding to update the index statistics only as a first pass is pretty easy - but what is the liability of not EVER updating the column statistics? What would be the liability of dropping them? 

A while back I ran across an intriguing suggestion to simply drop all auto-created columns statistics in a database, by Ami Levin.  Read the blog post here:
I do some things that folks consider risky... but I really am serious about risk management.  I wouldn't risk drop all column statistics without some way to measure which ones are being used and which ones are not being used.

A colleague asked me about evaluating the statistics in use on a running system.  At first all I remembered was trace flags 9204 & 9292 (see trace flag 2363 for SQL Server 2014 and beyond).  Paul White blogged about these here:

That allows investigating an individual query (an activity I always encourage for developers and report writers)... but not helpful for my context.

But in the comments to Paul's blog post Fabiano Neves Amorim showed up!  I read this blog post about a year ago... but without an immediate need it just kind of filed away in my mind somewhere near the plots and plot holes of old James Bond films I watched.  This time I paid attention :-)

So, trace flag 8666 seems to be the key for what I want to do.  It allows peeking into 'InternalInfo' in the plan XML, in order to get table names (wszTable) and stat names (wszStatName) that were used to compile plans.  Cool!  So here's the disclaimer:

Trace flag 8666 is NOT an officially documented SQL Server trace flag.  I do not recommend using this trace flag without a tight leash!  In fact, any querying of the plan cache can become quite intensive - plans can be big and pulling lots of them into RAM can flush other database contents out while other folks are trying to get work done!  I know cuz I've caused trouble for folks - so do better than me!  Also, there's absolutely no guarantee that trace flag 8666 will continue to allow this insight in the future.

But, here's a new toy of mine, for profiling stats use on performance test servers when a full batch load of queries is loaded up.

There's no doubt in my mind that some folks reading this are WAY better at handling xquery or XML in general than I am.  Just to get the query below working at all, then to get rid of the cartesian product, then to improve its performance to just better than awful took an amazing amount of stumbling around for someone completely sober :-)  So if there are suggestions to improve the performance of this thing, I'm all ears...

But... here's my toy... in an early incarnation, anyways...
Note that in order to somewhat limit scope, only plans in the database of execution are considered based on sys.dm_exec_plan_attributes.


--Only for sandbox use unless behavior & potential memory load are understood!!
   XMLNAMESPACES ('http://schemas.microsoft.com/sqlserver/2004/07/showplan' as p),
   handles (plan_handle,usecounts)
       sys.dm_exec_cached_plans AS cp
       sys.dm_exec_plan_attributes(plan_handle) AS epa
       cp.cacheobjtype = 'Compiled Plan'
   ,TableUsed.XMLCol3.value('@FieldValue','NVarChar(500)') AS TableName
   ,StatsUsed.XMLCol5.value('@FieldValue','NVarchar(500)') AS StatsName
   handles AS handles
   sys.dm_exec_query_plan(handles.plan_handl ) AS qp
   qp.query_plan.nodes('//p:InternalInfo') AS InternalInfo(XMLCol1)
   InternalInfo.XMLCol1.nodes('p:EnvColl/p:Recompile') AS Recomp(XMLCol2)
   Recomp.XMLCol2.nodes('p:Field[@FieldName="wszTable"]') AS TableUsed(XMLCol3)
   Recomp.XMLCol2.nodes('p:ModTrackingInfo') AS Tinfo(XMLCol4)
   Tinfo.XMLCol4.nodes('p:Field[@FieldName="wszStatName"]') AS StatsUsed(XMLCol5)
   qp.query_plan.exist('//p:InternalInfo') = 1
   InternalInfo.XMLCol1.exist('p:EnvColl/p:Recompile') = 1
   Recomp.XMLCol2.exist('p:Field[@FieldName="wszTable"]') = 1
   Recomp.XMLCol2.exist('p:ModTrackingInfo') = 1
   Tinfo.XMLCol4.exist('p:Field[@FieldName="wszStatName"]') = 1
   TableUsed.XMLCol3.value('@FieldValue','NVarChar(500)') NOT LIKE 'sys%'
   handles.usecounts DESC ;

Wednesday, June 18, 2014

Adventures with SQL Server & perfmon Part 3: Less is more for Me & TokenAndPermUserStore

 This is a continuation of my adventures with SQL Server and perfmon.
Adventures with SQL Server & perfmon Part 1: Resource Governor & Query Memory
Adventures with SQL Server & perfmon Part 2: TokenAndPermUserStore & spinlock suspects

In Part 2 I explained that I often evaluate the amount of logical work (in the form of buffer pool lookups/sec) performed in a database against the CPU utilized.  If I see something suspicious, I dig in.  What I saw on that system made me suspicious of some type of spinlock condition at the very beginning of queries in the batch.

My previous post has some graphs that explain the situation a bit.  Here's a brief recap.  RG1 is a resource governor workload group, attached to its own resource governor resource pool.  And here are two graphs showing the disparity between CPU utilization and logical reads, as well as the prevalence of CMEMTHREAD wait starts in the database instance.

Bpool Lookups (light blue above) = "\SQLServer:Buffer Manager\Page lookups/sec"
%CPU (dark red above)= "\SQLServer:Resource Pool Stats(RG1)\CPU usage %"
CMEMTHREAD Wait Starts (red dots above) =  "\SQLServer:Wait Statistics(Waits started per second)\Thread-safe memory objects waits"

So I hit the books.  Or the internet, or whatever.  And I learned about TokenAndPermUserStore.

Here's the stuff I read through.

TokenAndPermUserStore cache performance issue
Explicitly parameterize adhoc queries
Use stored procedures
Force parameterization
Periodic use of:

Trace flag 4618 for 1024 "access check cache quota" entries

Add trace flag 4610 to 4618 for 8192 "access check cache quota" entries entries

Trace flag 4621 + Windows Registry setting 'TokenPermQuota' to customize size

Blog Post about trace flag 4621

The blog post below gives a query example from sys.dm_os_memory_clerks that can be used to evaluate this condition. The sys.dm_os_memory_clerks query is lightweight; the blog post also includes a sys.dm_os_memory_cache_entries query that should not be used during peak workload due to its overhead.

Check this out. This blog post uses a sys.dm_os_memory_clerks query together with perfmon "\SQLServer\User Settable" to track this condition. Maybe its a good thing I didn't see this post until I was almost done with my own post - maybe it would have scared me off :-)

So I evaluated the system.  Then I evaluated system behavior based on what I know about the workflow.  Its a batch-based workflow, with almost all queries executed by a single service account.  The security/access cache doesn't necessarily deliver a lot of value.
So I opted to set the cache elements on the 64 bit system to the 32 bit defaults.

--default sizes below if 'access check cache bucket'=0 & 'access check cache quota'=0
--x86 32 bit default sizes 'access check cache bucket'=256,  'access check cache quota'=1024
--x64 / ia64 default sizes 'access check cache bucket'=2048, 'access check cache quota'=8192
--check current values
EXEC sp_configure @configname = 'access check cache bucket' ;
EXEC sp_configure @configname = 'access check cache quota' ;

EXEC sp_configure @configname = 'access check cache bucket' , @configvalue = 256 ;
EXEC sp_configure @configname = 'access check cache quota' , @configvalue = 1024 ;

--verify the new values
EXEC sp_configure @configname = 'access check cache bucket' ;
EXEC sp_configure @configname = 'access check cache quota' ;

--reset 'show advanced options' to 0
EXEC sp_configure @configname = 'show advanced option', @configvalue = 0 ;

That turned out to be a pretty good strategy.

From 7:30 to 8:00 am the peak CMEMTHREAD wait starts per second dropped from almost 60,000 to under 160!  Even better, CPU utilization became "unstuck" from its previous plateau of about 75%... and the relationship between CPU utilized and database logical IO was restored.

Here's a before and after, which will make the drop in CPU and increase in logical IO rate quite evident.

Pretty good for not tuning a single query, eh? :)

Again - its not that I think query tuning is unimportant.  But first things first: make sure the system is delivering expected performance capacity... then tune queries to leverage the system performance capacity.

Gives an idea what I'm moving on to now :)  Its time to push this system a little bit harder, and then make sure that the logical work performed at the instruction of the queries and query plans is efficient - so that the system is working hard and smart.