Thursday, May 23, 2013

Crowdsource momemtum attempt: MSSQL stats update per-partition **Updated**

 I see a Connect item opened in June 2009 titled "Update Statistics at the partition level".

If I hadn't found this item, I would have written an item up myself.  Consider this Connect item, and upvote if it seems worthy to you.

As of SQL Server 2012, partitioned indexes can be rebuilt per partition.  That's good.

But, the statistics for a partitioned table can only be updated across all partitions.  That's bad.

The various posts in the Connect item make a good case for partition level stats updates.  Why update stats for data from 2001 that is partitioned on a time-based key, if the file the partition is in is read-only?  If I eliminate fragmentation, use a high fill-factor and page level compression in an old partition that is read only - the statistics in the partition will be stagnant.  Still important for the query optimizer... but the stats within the partition don't need to be updated.  If there's only 1 out of 20 partitions that is being updated, and the other 19 have updated stats and are read-only, updating stats will be much faster if only the active partition is updated.

There's more.  The read only partitions only have to be backed up once per quarter (or less).  The only reason to continue to back them up periodically is to make sure the backup is on good media.  That can take a whole lot of stress off of the backup infrastructure.

There's more.  The dbcc checkdb interval for data should be based on recovery tolerance, and recoverability.  Once a partition is read-only, the resources to perform the consistency check are a known quantity, and the time to recover from backup is a known quantity.  The dbcc checkdb interval can be far less than the dbcc checkdb interval for contents that are not read-only,

 So, statistics update can be made lighter and faster.  Backup can be made lighter and faster.  Consistency checks can be made lighter and faster.

At the current time, this means when I'm asked (I only get asked about tables of over 1 billion rows - and all of the systems I currently work with have similar workflows and workflow patterns), I will be recommending pseudo-partitioning for MSSQL, instead of partitioning.  We'll split tables into separate similar tables in separate files and filegroups, and use a view for the union.  That way, we can rebuild indexes per table/pseudo partition.  We can update statistics per table/pseudo partition.  We can mark old tables/partitions read only, and take load off of backup and consistency check procedures.

Hopefully, when statistics can be updated per-partition, I'll be able to use partition switching to seamlessly transition from a pseudo partitioning to a real partitioning strategy.  I'd rather not move hundreds of millions - or billions - of rows just to accommodate the metadata differences between pseudo partitioning and partitioning.

**Update by sql_sasquatch 29 August 2013**
SQL Server 2014 has additional features to allow statistics maintenance per partition.  Awesome!  Adam Machanic is concerned about how the optimizer will use the partitioned statistics (I am too)... guess we'll have to see about that.

But the cool thing is that connect items 328093 and 468517 are both listed as active again.

Yay crowdsourcing!

Monday, May 20, 2013

IBM Power AIX and HDS USP-V/HP XP 24000 - Which FEDs and ports for my Oracle hdisks?

The HDS USP-V is the same underlying storage hardware as the HP XP24000.  I've done quite a bit of work on systems using this storage.  I've never been the captain, always the boatswain. :)

A well-designed system using one of these arrays can provide a high level of predictable performance.  Front end director design makes these arrays more susceptible to front end congestion than many other types of storage, so high performance/high traffic databases require thoughtful planning of which LUNs will share front end directors and ports on the directors.  Its not just a matter of data throughput, IOPs count, or port queue depth.  As the front end director microprocessors become increasingly CPU busy, they introduce increasing latency and front end congestion.  This is often overlooked in performance investigations, in large part because its not easy to correlate microprocessor CPU busy to the added latency for any given workload.

If you are observing a system and see high read AND write latency from the server host, and the array is reporting a low level of write cache utilization and good response time from front end director through to disk media, bottleneck on the front end port queue waiting to get onto the microprocessor is a reasonable suspect.  (Its important to note that write cache utilization level is important to evaluate here - there are various cache thresholds in these arrays that govern behavior with respect to read and write priority as well as a switch that determines whether write cache utilization level has cache partition or global effect on the array.  But, in general, cache utilization performance effects are easier to evaluate on these arrays than the effects of front end director microprocessor CPU busy.)

So, when I evaluate an Oracle on AIX system with HDS USP-V/HP XP24000 storage that I am seeing for the very first time... if I see high latency in iostat results, and ESPECIALLY if I see QFULL conditions, I want to retrieve information as quickly as possible about which LUNs from this server are co-resident on FEDs, and on FED ports.

I ran across this over the weekend -HP documentation suggests that "lscfg -vl hdiskX" will indicate the port and controller for the hdisk device.

In fact, I think that "lscfg -l hdiskX" will indicate the port and controller. 
[not-yeti@sasquatch] $ lscfg -l hdisk33
  hdisk33           U756D.001.DQD56YK-P1-C2-T2-W50060E801530A650-L21000000000000  XP MPIO Disk XP24000 (Fibre)

I think this means that hdisk33 is connected to port1 of controller 2.  I assume each controller is an FED?

As I mentioned, I'm never the captain - always the boatswain.  I'll have to search for confirmation.

Something tells me that Mark Duszyk or a member of his crew will be able to confirm or deny.  I had thought I saw a blog post on this very topic at his excellent AIX and Linux blog  
But when I looked... I couldn't find it.

****Updated 5/20/2013 by sql-sasquatch****

The Ldev and port number are specified in Z1 of "lscfg -vl hdiskX".  Ldev 2006, port 6A in the example below.

[not-yeti@sasquatch] $ lscfg -vl hdisk33
  hdisk33          U756D.001.DQD56YK-P1-C2-T1-W50060E80
1530A650-L21000000000000  XP MPIO Disk XP24000 (Fibre)

        Machine Type and Model......OPEN-V
        Part Number.................
        ROS Level and ID............36303038
        Serial Number...............50 130A6
        EC Level....................
        FRU Number..................
        Device Specific.(Z0)........00000332CF000002
        Device Specific.(Z1)........2006 6A ....
        Device Specific.(Z2).........
        Device Specific.(Z3).........
        Device Specific.(Z4)..........A.
        Device Specific.(Z5)........
        Device Specific.(Z6)........

Also should be able to decode it from the WWN, but it seems like sometimes the WWN and Z1 fields don't quite line up.  Port CL6-A on the system example occurs with WWN W50060E801530A650 (I'd expect those final two characters based on the translation tables at the locations below).  CL6-A also shows up with WWN  W50060E801530A640 on the same system, which should decode to CL5-A.

The other odd thing is that I'd expect the 01530A6 to be 05130A6 to translate to 05:130A6, 05 for XP24000 and 130A6 as the serial number.


****End update****

Friday, May 17, 2013

Should cpu_scale_memp be increased for Power7+ and Power7?

**update from sql_sasquatch 20130906**
I've decided not to tinker with cpu_scale_memp and leave it at its default value of 8.  Yes, the cores are more capable, so I'd expect lrud rate of scan/steal to increase from the past.  But the systems I work with also have supercharged the amount of ram per LPAR in comparison to the past.  Doubling cpu_scale_memp from 8 to 16 as I was proposing cuts the number of memory pools (thus the number of lrud threads) in half.  That's fine - but on average it will result in twice as large of frame lists for those lrud threads to work.  I want to avoid that.  When lrud does have a lot of work to do, I want it working with shorter (hopefully) more efficient lists.
***end update from sql_sasquatch 20130906**

A little history before meandering and ending up at my plans to test Oracle on AIX Power 7  AIX 7.1 with cpu_scale_memp increased from 8 to 16.

Year  OS Level     Processor
2010  AIX V7.1     Power7
2007  AIX V6.1     Power6
2004  AIX 5L 5.3   Power5
2002  AIX 5L 5.2
2001  AIX 5L 5.1   Power4
1999  AIX 4.3.3
1998  AIX 4.3.2
1998  AIX 4.3.1
1997  AIX 4.3

Previous to 2001 AIX was developed and deployed on single core processors.  There was one logical and one physical CPU per socket on these systems.

AIX 5L 5.1 was released in 2001.  The Power4 processors introduced that year were the first dual core processors in the family.  These were the first systems in the IBM Power family with 2 logical and 2 physical CPUs per socket.

AIX 5L 5.3 was released in 2004.  The dual core Power5 processors introduced that year were the first in the family to support simultaneous multithreading (SMT).  Enabling this feature presented two logical CPUs from each core to the OS for scheduling.  These were the first Power systems with 4 logical CPUs and 2 physical cores per socket.

AIX 6.1 was released in 2007, as were the dual core Power6 processors.  With SMT enabled, these systems also presented 4 logical CPUs from the 2 physical cores per socket.

AIX 7.1 was introduced in 2010, as were Power7 processors.  Power7 processors were available with 4, 6, or 8 active cores per socket.  These processors supported SMT, and added the SMT4 feature to present 4 logical CPUs from each physical core.  So these systems could present 4, 6, 8, 12, 16, 24, or 32 logical CPUs per socket from the matrix of no-SMT, SMT, or SMT4 and 4, 6, or 8 active cores per socket..

With this much history, its easy for stuff to get lost.  Maybe cpu_scale_memp is an example.  Look around on the Intertubes - you won't see too much about this AIX kernel parameter.  I love reading Jacqui Lynch's stuff.  She is one of my "goto" sources for AIX performance tuning.  Interesting that one of the only tuning references for cpu_scale_memp is one of her presentations, where vmo tuning of the mempools parameter is discouraged in favor of tuning cpu_scale_memp (page 6 of the pdf below).  I don't think the mempools parameter is around anymore on Power7 AIX 7.1 systems.

The cpu_scale_memp parameter gives the maximum ratio of logical processors to memory pools in an LPAR.  The default value is 8.  So for each 8 logical processors, there will be at least one memory pool.  (The memory pools come from larger structures called vmpools or memory domains.  In turn, the memory pools are divided into framesets.)  Each memory pool has its own lrud (least recently used daemon, the page stealer).  Each lrud considers minfree and maxfree, the high and low water marks for free 4k memory frames, within its own memory pool.  Cross the low water mark, and the page stealer works until maxfree pages are free in the memory pool.

Nobody really talks about how or why to tune cpu_scale_memp.  But the introduction of the cpu_scale_memp parameter gives some insight.

AIX 5.2
Need strict_maxclient and cpu_scale_memp tuning parameters 
"Customer sees too much paging to paging space with maxclient set high and paging of client pages before the system runs out of free pages with maxclient set low."

However, without any additional information about the issue, or how strict_maxclient and/or cpu_scale_memp address the issue, there's not a lot to go on.

Memory and process(user thread) affinity both concern stickiness to the physical cores.  For memory affinity, try to get memory that is optimized to the current physical core.  For process affinity, try to make sure that a given user thread will be affinitized to the first core it executed on (or a specified set of cores with similar desired properties).

There are lots of knobs and switches to turn to govern memory/process affinity and the VMM in AIX, but cpu_scale_memp is one of the basic determinants.

Now that there are more threads per core, and more cores per socket - why not increase the ratio of logical processors to memory pools?  If the memory pool is larger, it may balance memory usage patterns more easily.  The larger pools may absorb temporary fluctuations more easily.  If memory_affinity is disabled, then the memory pools will be evenly sized regardless of the physical placement of cores and memory that make up the LPAR.  Midstream in AIX 5.3 the list-based LRU page_steal_algorithm was introduced, increasing the efficiency of lrud.  So why not allow the system fewer larger memory pools?

IBM has further enhanced performance with memory and scheduling affinity management by the Active System Optimizer (ASO) daemon (included with AIX), and the Dynamic System Optimizer (an additional paid license module) on top of that.

But it appears to me that returning to tune some of the fundamentals may lead to more predictable behavior and performance than simply adding more stuff on top.  Perhaps we'll see.  Hopefully in a round of IBM Oracle testing in the near future, I'll be able to circle back to this and include some comparative tests at cpu_scale_memp set to default and later to 16 on a Power7 or Power7+ system. In addition to monitoring memory with tons of vmstat, we'll use kdb to monitor memp and frs... perfpmr style.  :)

Thursday, May 16, 2013

Oracle on AIX Best Practices: AIXTHREAD_SCOPE=S

When researching best practices, a recommendation may turn up for which details are absent. There may be vague descriptions, but important details such as versions of introduction, compatibility, and specifics of behavior (in order to model interaction with other components) may be missing.

One such "historyless" best practice for Oracle on IBM Power AIX is setting environment variable AIXTHREAD_SCOPE=S in the Oracle user profile, or /etc/environment.

There are descriptions of the CPU scheduling benefits of the 1:1 kernel to user thread model, and very vague descriptions of a lower per-process memory footprint for Oracle with system contention scope instead of process contention scope. It was the mention of lower memory footprint that drew me to this item; I was observing an AIX 6.1 Oracle 11gR2 system that was paging even though vmstat reported free memory (I'm fairly certain that on this particular LPAR, memory_affinity must be disabled to eliminate paging, but that's a topic for another day). I looked at /etc/environment, and at the Oracle user profile. Environment variable AIXTHREAD_SCOPE wasn't set. Anything I could do to reduce the computational page footprint without shrinking the SGA (and thus the database cache) would help this system. Had I found something that would lower the computational footprint just enough to eliminate the page outs (which honestly shouldn't have been happening anyway)? In a word... no.

What's the deal? The setting is mentioned as recently as March 2013 in the "IBM POWER7 AIX and Oracle Database performance considerations" from the IBM Oracle International Competency Center (although without any reference to the scheduling benefit or the memory footprint benefit).

Regardless, I'm glad I didn't recommend incorporating this environment variable value on the system to change memory utilization. I don't think it would have. A thorough review of the documentation (which took hours between rounding up and reading), revealed the source and details below. This becomes an excellent example for the value of including version level interaction with a particular setting. This system was on AIX 6.1 - which already has a 1:1 thread model! The M:N thread model was introduced in AIX 4.3.1, and it seems the IBM Oracle recommendation AIXTHREAD_SCOPE=S likely originated from testing of the 1:1 model against M:N on AIX 4.3.2. It seems that AIX 5 versions and AIX 7.1 have M:N as default, while versions before AIX 5 and AIX 6.1 had 1:1 thread model as default. (In some cases, the M:N model may be favorable. At the bottom of this post is a documentation link for the threading model environment variables. AIXTHREAD_MNRATIO has a default value of 8:1)

So, if I had promised a change in memory utilization on the AIX 6.1 system in question... I would have been trying to sell them a benefit they had already realized through the default value. I can't afford to expend political capital lobbying for disruptive changes on production systems that have no net effect.

I will, however, ask them to incorporate setting this value in the Oracle user .profile at some convenient point in the near future. Once we've tuned the system well on AIX 6.1, I do not want system behavior to unexpectedly change with an upgrade to AIX 7.1, where the default thread model appears to again be M:N. Best practices should encourage predictable behavior, otherwise they are just practices :)


"The 1:1 model indicates that each user thread will have exactly one kernel thread mapped to it. This is the default model on AIX® 4.1, AIX 4.2, AIX 4.3, and AIX 6.1."

"The M:N model was implemented in AIX 4.3.1 and is also now the default model."

"Depending on the type of application, the administrator can choose to use a different thread model. Tests on AIX 4.3.2 have shown that certain applications can perform much better with the 1:1 model. The default thread model was changed back to 1:1 from M:N in AIX 6.1, but for AIX 4.3.1 to AIX 7.1, by simply setting the environment variable AIXTHREAD_SCOPE=S for that process, we can set the thread model to 1:1 and then compare the performance to its previous performance when the thread model was M:N."

Individual thread variables, including AIXTHREAD_SCOPE and AIXTHREAD_MNRATIO are documented in a very near location.

Thursday, May 2, 2013

O Registry, Registry! But wherefore did you so much tempt the heavens?

O Registry, Registry!  But wherefore did you so much tempt the heavens?
Whither WinsockListenBacklog?
Whither EnableDynamicBacklog?
Whither DynamicBacklogGrowthDelta, MaximumDynamicBacklog?
Whither MinimumDynamicBacklog away so fast?

Apologies from the Sasquatch to the Bard.  Its the coffee talkin'.

Under stress of concurrent TCP connection attempts, various errors can be reported by SQL Server as connections are refused.  Below is a good description of one of the connection failure modes.  Login authentication failures (regardless of credentials supplied) can also be reported under concurrent connection stress.
SQL Server logs 17832 with multiple TCP\IP connection requests

The following support document is a classic. Contains of the best descriptions of SQL Server Windows registry setting WinsockListenBacklog.

Description of TCP/IP settings that you may have to adjust when SQL Server connection pooling is disabled
KB328476 - Description of TCP/IP settings that you may have to adjust when SQL Server connection pooling is disabled ( 

Check the relevant products in the "applies to section".

SQL Server 2005 editions are listed.  Does that mean that registry setting WinsockListenBacklog should be created for SQL Server 2005 and later?  Could that resolve some of the TCP connection failures under stress for SQL Server 2008 or 2008 R2?


The support article "applies" to SQL Server 2005 because it provides the following information:
"Starting in SQL Server 2005, the network library passes a value of SOMAXCONN as the backlog setting to the listen API. SOMAXCONN allows the Winsock provider to set a maximum reasonable value for this setting. Therefore, the WinsockListenBacklog registry key is no longer used or needed in SQL Server 2005."

So, no more WinsockListenBacklog for SQL Server 2005 and beyond.

The clever reader will remember the dynamicbacklog family of registry values.  Full description in this support article:
Internet server unavailable because of malicious SYN attacks
[update 2020 December 21 - it's a dead link now and I don't have a replacement yet :-( ]

So... should those be set for SQL Server 2005 and beyond?  Not if the OS is Windows Server 2008 or beyond.  They won't have any effect.  This is explained here.
Where have those AFD driver related registry (DynamicBacklogGrowthDelta / EnableDynamicBacklog ...) keys gone?
[update 2020 December 21 - it's a dead link now and I don't have a replacement yet :-( ]

OK.  So... if there are connection failures under concurrent TCP connection stress with SQL Server 2008 or later, on Windows 2008 or later... whatfor to do?  Whither away the salve to soothe?

I'll have to pick that up later :)  As far as I can tell right now, the correctives on the most recent versions of SQL Server and Windows Server OS have to be about servicing the connection requests faster (clear the way for the nonpaged memory pool, make sure there are enough TCP connection memory blocks, take care of any other memory pressure conditions, consider binding the NIC to cores, utilize RSS scaling and potentially increase the number of cores servicing the NIC, etc).

More details as I learn, especially if I find specific correctives or diagnostics... wanted to get this out there so folks don't waste time with registry changes once valued but now merely "lodged with me useless".  My apologies to Milton.