Friday, August 30, 2013

Oracle RAC on IBM Power AIX 6.1/7.1? Look out for UDP bug!

This post, like many, might be longer than you want to read :)  If you don't want to read much, here's what you want to check, and install, on your IBM Power AIX servers running Oracle RAC.  If you are running Oracle RAC on a system exposed to the UDP defect, I recommend installing a fix when it fits into your maintenance schedule (as opposed to waiting to hit this kind of trouble).

To review documentation for a specific APAR, append the APAR ID to this URL:
http://www-01.ibm.com/support/docview.wss?uid=isg1

IV33210 6.1
IV31961 6.1 TL8
IV31917 7.1 TL1?
IV31962 7.1 TL2
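
Want a quick exposure check?  AIX will tell you your technology level and whether a given APAR's fix is installed.  A minimal sketch - substitute the APAR IDs relevant to your level from the list above:

# report technology level / service pack, e.g. 7100-02-01-1245
oslevel -s
# check whether the fixes for the listed APARs are installed
instfix -ik "IV33210 IV31961 IV31917 IV31962"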

****

Now, to make a short story long...

Last night I did some late night reading of IBM AIX 6.1 and 7.1 APAR bugfixes.  I do that when I can't sleep.  In fact, sometimes it's directly related to WHY I can't sleep.

So I wanted to put a warning out there for folks running Oracle RAC on IBM Power AIX.  IBM classifies fixes to high impact, pervasive defects as "HIPER".  The APARs listed in this post document a HIPER defect in UDP.  I can't think of anything that relies more heavily on UDP than Oracle RAC.  I often recommend that admins evaluate their IBM Power systems for exposure to HIPER defects, and plan fix installs if exposed.  If you are running Oracle RAC on a system with the defect... Oracle RAC is exposed.

Defects in communication protocols, or in transmission (like a defect in packet checksum handling that causes erroneous packet rejection), can be maddening and sometimes take forever to resolve.  I was involved last year in an Oracle RAC problem that was escalated to the blade server vendor, to Oracle database support, and finally to my group.  Another member of my team put in more hours than I did (he knows WAY more Oracle than I may ever know), but I put at least 80 hours into diagnostics.  The symptoms showed up unpredictably in ETL: many table loads would complete but one would languish.  Almost no thread or CPU utilization footprint.  Almost no disk IO footprint.  Almost no NOTHIN'!  After hours of hardly doing anything, it would finally "wake up" and finish its work.

What was happening?  The query in question was waiting on inter-node RAC communication to complete.  It was fetching a 32kb block from another database node.  Eventually, using wireshark, we could see that the first 3 UDP packets were received, but the final packet was not.  So after timing out, the thread would re-request the database block from the other node... and the same thing would happen.  (Please, no jokes about UDP as an unreliable protocol... unless you think it's one I haven't thought of myself.)  Why wasn't the final packet ever coming through?  Why did the query finally complete hours later - did the umpteen millionth request finally receive all UDP packets for the 32k database block?  We never actually answered that question before fixing the problem - but I suspect that eventually the block was retired from the other node's SGA database cache, and the query thread finally requested the database block from database storage rather than from the other node's cache.  (Best explanation I can come up with for the eventual success, anyway.)
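
(For what it's worth, if I were capturing for this today, I'd grab the UDP traffic plus IP fragments explicitly - fragments after the first carry no UDP header, so a plain 'udp' capture filter misses them.  A sketch only; the interface en0 and the host name node2-priv are placeholders for your RAC interconnect:)

tcpdump -i en0 -s 0 -w rac_udp.pcap '(udp and host node2-priv) or ((ip[6:2] & 0x3fff) != 0)'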

Early on I threw out the idea of checksum bugs: was it possible that something before wireshark was discarding a perfectly good UDP packet because the checksum it generated was different from the incoming packet checksum?  I'd never actually seen that happen with UDP, but I've seen similar problems where security hashes generated on Windows weren't matched by the hash generated by a SPARC Solaris compiled database executable linking an OpenSSH package.  (That also took forever to figure out, by the way... eventually I was able to correct the issue by recompiling the executable with different compiler options.)  That experience, and knowing that there have been similar issues in some Linux x86 builds, made me think it was a reasonable suspicion.

After many person-hours and several weeks of elapsed time, a Juniper switch firmware upgrade resolved the issue.  We had long post-mortems - what questions could we have asked earlier - what logs could we have reviewed - what monitoring could we have put in place to diagnose and correct the issue faster?

We didn't come up with any great answers other than remembering the experience and keeping that type of failure in mind for future RAC investigations.

And honestly, because I'm an outsider and only get to talk to the organization staff I am introduced to or brought in with... that is about as far as I can get without having a complete topology of a given system, including all relevant components of the communication and storage network.  And I'm not really too much of a Linux guy... I don't read through Linux bug reports and fix docs like I do for AIX.  Not yet anyway :)

But if I'm looking at an IBM Power AIX system running RAC from now on... you'd better believe that I'll check for this fix :)  Not gonna wait for the problem to show up... it might take me too long to realize that the problem is corruption of UDP packets on the receiving end.

So... please check your system, too.  Google searches for "Oracle RAC AIX APAR" and any of the APAR IDs below come up empty (at least before my blog post they did :) ).  So maybe no-one has experienced this.  But again... I can't think of anything that uses UDP more critically than Oracle RAC.

I've got a question mark below for AIX 7.1 TL1 because APAR IV31917 cryptically says that the problem is not present in TL1, only TL2.  No idea what that REALLY means.  Maybe there is no TL1 fix, and APAR IV31917 is for AIX 7.1 TL0?  Maybe IV31917 is a 7.1 TL1 APAR, but it corrects code unused in TL1?  If you're running RAC on AIX 7.1 TL1, I'd ask IBM for clarification - no sense in changing maintenance plans to install a no-op fix :)

To review documentation for a specific APAR, append the APAR ID to this URL:
http://www-01.ibm.com/support/docview.wss?uid=isg1

IV33210 6.1
IV31961 6.1 TL8
IV31917 7.1 TL1?
IV31962 7.1 TL2

Be well!

IBM fixes "minor Enhanced Affinity Bug"

While doing some late night reading, I came across the "minor Enhanced Affinity Bug" described below.  This defect is one of what I expect are many that result in paging space writes even when there seems to be sufficient free server memory to avoid them.

My recommendation to IBM: go ahead and expose the memory pool/frameset numbers in vmstat, or some other utility that does not require root privileges to return its information.  With a good tool and a clear explanation, there are lots and lots of sharp AIX administrators that could help correct these issues much earlier. 

One alternative is to fix these issues one at a time and slowly, in part due to the difficulty in obtaining the stats that would speed up full diagnosis.  Another alternative is to continue trying to address these issues with Active System Optimizer, Dynamic System Optimizer (the separately licensed extension module for ASO), Dynamic Platform Optimizer, and the many kernel parameters and environment variables which each slightly tune a specific facet of memory management. 

I can't recommend that critical systems enter a cycle of an unknown number of iterations of defect trigger, diagnosis, intervention, and evaluation.  Especially since the intervention may require installing a fix, or changing a kernel parameter that requires a bosboot and reboot to take effect.

I also can't recommend addressing an issue that arises from a fairly complex set of factors by adding something to the mix which is itself complex and fairly young. Consider that the 7 APARs listed at the end of this post are all contained in one service pack - AIX 6.1 TL8 SP2.  That's a lotta fixes for one daemon.

If I've got Big Blue's attention, I'll share a secret: in a comparison test, Oracle's engineered systems will mop the floor with Oracle running on a Power system that is experiencing a lot of paging space traffic.

It's not unusual for a recently configured system to choke severely under moderate use of paging space disk.  QFULLs may be seen on the paging space pvol(s), and additional degradation can result from psbuf waits and even free frame waits.

I've got nothing against engineered systems in general or Oracle's engineered systems: for a given workload, let the best system strut its stuff.  But if the Power system is choking on paging space, it won't even be a fair fight.  Let's help folks make their IBM Power memory management predictable, and I bet real world performance will be quite grateful.


****

Append APAR ID to the following for documentation URL
http://www-01.ibm.com/support/docview.wss?uid=isg1

****

Fix Minor Enhanced Affinity Bug

sql_sasquatch defect description:
The ra_attach system call will, when able, allocate physical memory from the domain local to the thread.  If necessary, memory from a "near" domain can be used for an ra_attach allocation.  However, due to a bug, the first domain (0) is not considered "near" any other domain.  This may result in paging space writes when domains other than 0 are under pressure - even if domain 0 is near the stressed domain and has sufficient free memory for the allocation.

6100-06 IV28494
6100-07 IV28320
6100-08 IV27739
7100-00 IV28830
7100-01 IV29045
7100-02 IV27797

****

The following AIX 6100-08 APARs address core dumps or other significant defects in the ASO (Active System Optimizer) daemon.  Because the Dynamic System Optimizer module extends ASO, I assume DSO is also compromised by these defects.

IV26296 Incorrect assert in SystemUtility_is_sid_in_pidspace
IV26301 ASO coredump at gmap_delete
IV26807 ASO core at AsyncWorker_add_work
IV26808 ASO core dump in StrategyBag_getLength
IV26810 ASO loops in large page job creation
IV27163 ASO core in large page job creation
IV35517 ASO core dump in pstatus_update_hotsegs when retrying MPSS op
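
Before losing sleep over these, check whether ASO is even active on your LPAR.  A sketch - I believe ASO runs as an SRC subsystem named aso, and the asoo command reports its tunables, but verify against your AIX level:

lssrc -s aso     # is the ASO daemon running?
asoo -a          # aso_active=1 means ASO is enabled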


*****
I'm gonna throw this one on at the end because otherwise I keep losing track of it.



IV10657: SRAD LOAD BALANCING ISSUES ON SHARED LPARS (applies to AIX 6100-08)

If tasks are distributed unevenly and it's a problem... new memory allocations are probably also unevenly distributed, resulting in another problem :)
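
(If you want to eyeball SRAD balance yourself, lssrad shows the memory and logical CPUs attached to each SRAD - lopsided MEM or CPU columns across SRADs are the hint:)

# memory (MB) and logical CPUs per Scheduler Resource Allocation Domain
lssrad -av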
 

Thursday, August 29, 2013

Crowdsourcing: expose IBM AIX Memory Pool total and free frame count in vmstat

Why do page outs to disk paging space occur in AIX, even when vmstat and other tools indicate there is plenty of available server memory?

There are profiles, rules and conditions that determine which server resources end up in an LPAR.  If LPAR resources are provisioned dynamically rather than statically, then there's the matter of which resources are added or removed over time.  There are lots and lots of rules, and plenty of kernel tunables and environment variables that influence AIX memory allocation and management from LPAR resources.  Then there's the Active System Optimizer daemon which might move stuff around within LPAR resources, the Dynamic System Optimizer (an additional licensed module on top of ASO) which might move stuff around within LPAR resources, and the Dynamic Platform Optimizer which might move LPAR resources around within the server.  Put that all together with varying application activity and stress, and understanding the tide, ripple, and break of memory can be extremely complicated.

But paging out to paging space is actually remarkably simple... the magic numbers have just been HIDDEN from you.  And I am hoping that by crowdsourcing some support... grass roots... I can persuade Big Blue to just SHOW us the doggone numbers that make page out operations plain as day.

Here's the deal:  There is one lrud daemon thread for each LPAR memory pool.  The lrud daemon wakes up periodically, and if the number of free frames in the pool is below the low water threshold, page stealing starts.  By default lrud walks its list of memory frames.  Clean filesystem buffer pages can simply be expired and added to the free list.  Computational pages which were paged out once before... paged back in for a READ operation... and still have the same contents in paging space can simply be retired.  Computational pages that have never been paged out before have to be written to paging space.  Once the free frame count in the memory pool is above the high water mark, lrud work in that pool stops.  (***I'll come back later and fill in the names of relevant tunables.***)
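
(Until I get to that list, the two I'd check first are minfree and maxfree - as I understand it, those are the per-memory-pool low and high water marks lrud works against.  vmo will show the current values, defaults, and ranges:)

vmo -L minfree
vmo -L maxfree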

That's remarkably simple.  The crazy thing is that there is NO easy way to see the size of the LPAR memory pools, or to see how many free frames there are in each memory pool.  And the only documented way is problematic for two reasons: it requires root privileges, and it uses the kernel debugger.

As someone that often evaluates performance and behavior on other administrators' systems, those are two heavy strikes. I don't want to inconvenience the admins by asking them to run monitors I cobble together... and I don't like asking them to run monitors as root.  And asking them to run a monitor which calls into the kernel debugger?  Doesn't that sound risky???!??

Okay... so maybe this sounds crazy to you.  After all... you got your vmstat... you got your svmon... you got your topas... you got your nmon.  Why is sql_sasquatch asking for something else?  Maybe he just doesn't UNDERSTAND how this really works.

Trust me... I've got a year's worth of scars from figuring this out.

Have you ever worked with IBM's perfpmr package?  When you open a performance PMR with IBM, often you'll be asked to download the script package and take some monitor samples.  I won't post IBM's code here.  But if you've downloaded it, poke around in the scripts.  In the memory monitors you'll see commands somewhat similar to these:

echo "memp *" | kdb
echo "frs *" | kdb

Those use aliases for the mempool and frameset commands to retrieve information including size and free frames.  I like the frameset command because it returns free memory in the frameset in human-readable form.  My hex ain't that good :)
The lru alias for the lrustate command also comes in handy when evaluating page stealing behavior.
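
If you do have root and are willing to call into kdb (see my caveats above), a minimal sampling sketch looks like this - adjust the interval to taste:

# root required; kdb reads kernel memory, so use sparingly on production
while true; do
    date '+%Y-%m-%d %H:%M:%S'
    echo "memp *" | kdb
    echo "frs *" | kdb
    sleep 60
done >> mempool_samples.txt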

Lots more about kdb can be found in this 358-page PDF file.
http://pic.dhe.ibm.com/infocenter/aix/v7r1/topic/com.ibm.aix.kdb/doc/kdb/kdb_pdf.pdf

In the next week I'll post some numbers from one of my investigations that shows how kdb can illuminate paging space usage mysteries.

As much as Big Blue hesitates to agree, sometimes the best thing is to balance the memory pools in size and the number of affinitized logical processors.  That can be accomplished by provisioning server resources to the LPAR in a balanced manner... or by disabling memory_affinity (which has fallen out of favor recently but may still be the best option for large database server LPARs).
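
(For reference, disabling memory affinity is a restricted tunable change that only takes effect at the next boot - a sketch, to be done only after testing and in a maintenance window:)

vmo -r -o memory_affinity=0    # restricted tunable; deferred to next boot
bosboot -a                     # rebuild the boot image
# then reboot for the change to take effect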

To wrap it all up... these numbers... the size of your memory pools and the amount of free memory in the pools should be consulted when paging space is being utilized even when there seems to be sufficient server memory to avoid paging space writes.  Sure, you can do it with root access and calling in to kdb.  But somehow lrud itself is getting the free frame count for the pools - I don't think it's using kdb.  At least I hope it's not :).  Regardless, I really, really, REALLY want to see these numbers in a standard monitoring tool that can be used without root privileges.  The vmstat utility seems like the perfect place to put it.  Please?  Pretty please?  I'll be a good sasquatch...

:)

My blog doesn't get a lot of traffic - it's not really meant to.  But if I can get some supportive comments from AIX administrators, developers, or other folks that might be persuasive... maybe we can all benefit.

Tuesday, August 27, 2013

How I Learned to Stop Worrying and Trust the Public Cloud

Some interesting and valuable insights... which I'll pick a fight with :)

http://davidchappellopinari.blogspot.com/2013/07/the-worlds-most-dangerous-technology.html
"The world's most dangerous technology is Windows Update."

"While many organizations test updates to make sure they don't break anything, how many check to make sure that an update doesn't steal sensitive data off your disk? Pretty much nobody."


"The level of trust that enterprises already have in Microsoft far exceeds what Windows Azure or another cloud platform asks of them."

I agree that there is a huge amount of trust in the Microsoft update practice of most organizations, and this yields a considerable vulnerability.

Here's my beef: trusting the secure/confidential status of data in the cloud is a very different matter from trusting the service availability yielded by the cloud.  David Chappell begins by acknowledging the testing (or at least pre-production deployment) of updates by many organizations.  He even states why this is done: "to make sure they don't break anything."  Then he moves to matters of data security/confidentiality: "how many check to make sure that an update doesn't steal sensitive data off your disk?"

With this line of argument, he concludes that the data security/confidentiality issues of the public cloud will eventually subside, and trust of the public cloud will result.

I won't argue that public cloud data security/confidentiality concerns will live forever; actually, David has formulated a remarkably convincing brief argument that they will subside.

But with Windows Update, service availability is something an organization can validate before deployment; there is no comparable method for an individual cloud subscriber to validate cloud code or hardware upgrades prior to deployment.  And while data security/confidentiality has not been breached in the public cloud by the public cloud (at least, such breaches are not publicly known), service availability should be a significant consideration before trust is granted to a public cloud platform.  Consider the few links below regarding Amazon EC2 service.

Finally, there are data concerns beyond security/confidentiality.  For medical and financial data, as well as data that outlines intellectual property such as designs and code itself, there are huge concerns around data ownership: when the cloud service shuts down, what becomes of the data?

So, while I congratulate David Chappell for briefly articulating a compelling argument for the eventual irrelevance of data security/confidentiality concerns as a barrier to transition to public cloud platforms, the argument here is not sufficient for me to agree with the conclusion: "We're not there yet, but the day when trusting a public cloud platform is as unremarkable as trusting Windows Update is coming."

I don't think so:
1. the service availability vulnerability to individual organizations accompanying cloud code and hardware changes must be mitigated somehow.
2. data ownership issues other than security/confidentiality must be confidently resolved.


Already I wish I could communicate as briefly as Mr. Chappell does :)

At any rate, sql.sasquatch has not yet learned to stop worrying and trust the public cloud.    


*****  


22 October 2012
"The [Amazon EC2] outage was the fifth significant downtime in the last 18 months for the US-East-1 region, which is Amazon’s oldest availability zone. The US-East-1 region had outages in April 2011, March 2012, and June 15 and June 30 of 2012. Amazon’s U.S East region also was hit by a series of four outages in a single week in 2010."
Amazon Cloud Outage KOs Reddit, Foursquare & Others
http://www.datacenterknowledge.com/archives/2012/10/22/amazon-cloud-outage-affecting-many-sites/

30 June 2012
http://www.datacenterknowledge.com/archives/2012/07/03/multiple-generator-failures-caused-amazon-outage/

15 June 2012
http://www.datacenterknowledge.com/archives/2012/06/15/power-outage-affects-amazon-customers/
 
15 March 2012
http://www.datacenterknowledge.com/archives/2012/03/15/amazon-ec2-recovers-after-brief-downtime/

21 April 2011
http://www.datacenterknowledge.com/archives/2011/04/21/major-amazon-outage-ripples-across-web/

10 May 2010
Amazon Addresses EC2 Power Outages
http://www.datacenterknowledge.com/archives/2010/05/10/amazon-addresses-ec2-power-outages/

Tuesday, August 13, 2013

40 concurrent SQL Server parallel queries + (2 sockets * 10 cores per socket) = spinlock convoy

Here's a disclaimer: trace flags aren't toys - only deploy them in production environments after exploring and testing their behavior in the context of your critical workloads (including end user activity, backups, integrity checks, index maintenance, stats maintenance).  Be especially cautious with startup trace flags since it'll require a SQL Server restart to back them out.

Trace flag 8048 - I don't know of any bad side effects... and I've been looking.  Not a reason for you not to test... but I'll update this post and add a new post if I ever learn of trouble from trace flag 8048.

Trace flag 8015 - combined with trace flag 8048, disabling NUMA has been good medicine for the kinda-DSS/kinda data warehouse type workflows on the systems I'm involved with.  It's led to more even CPU utilization across cores, more stable/predictable memory utilization, lower overall physical reads, higher scalability, and shorter execution times for batch workloads.  Your mileage WILL vary :)  Potential bad side effects to consider: loss of one independent buffer pool on each NUMA node, loss of the "affinitize different SQL Server connections to different physical NUMA nodes" strategy, and a single lazy writer instead of one lazy writer per NUMA node.
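
One quick sanity check before and after testing: DBCC TRACESTATUS reports whether the flags are active globally.  A sketch with a placeholder server name (for startup trace flags, add -T8048 and -T8015 to the startup parameters via SQL Server Configuration Manager):

sqlcmd -S MYSERVER -E -Q "DBCC TRACESTATUS(8048, 8015, -1);"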

Ok... is that good enough for the lawyers?  :)



A colleague sent me the perfmon spreadsheet snippet above today.  This is from a SQL Server with 2 Intel sockets of ten cores per socket, with hyperthreading enabled.  I know it's hard to read :)  Don't bother trying to read the numbers - read the reds and the greens.  And if it's 3-digit red - the logical CPU is 100% busy.  Report application servers send up to 40 concurrent queries to this SQL Server system.

It's about as good a picture as you'll ever see of what can go wrong in SQL Server by default on a NUMA server: parallel query worker threads are stacking up on a single NUMA node instead of spreading out, and the worker threads on the busy NUMA node are hitting MAJOR spinlock contention in memory allocation.  My colleague just grabbed the per-CPU %busy stats from perfmon, stuffed them into excel, turned on conditional formatting and voila!  It's clear to me how much of a benefit trace flags 8048 and 8015 will be to this system.  Is it clear to other folks yet?
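
If you don't have perfmon handy, a rough equivalent from inside SQL Server: sys.dm_os_schedulers can show tasks stacking up on one NUMA node.  A sketch, again with a placeholder server name:

sqlcmd -S MYSERVER -E -Q "SELECT parent_node_id, SUM(current_tasks_count) AS current_tasks, SUM(runnable_tasks_count) AS runnable_tasks FROM sys.dm_os_schedulers WHERE scheduler_id < 255 GROUP BY parent_node_id ORDER BY parent_node_id;"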

See - with NUMA enabled by default, all worker threads for parallelized queries tend to stay on the NUMA node which received the connection.  Connections are distributed round-robin among the NUMA nodes... but what if by happenstance one node gets all the quick queries, and the other node gets all the long, parallelized queries?  Ummm... this happens.

You end up with lots and lots and lots of worker threads executing tasks for parallelized queries - all stacked on the same poor cores, and the other CPU cores just might be loafing.

The other behavior to keep in mind is that, by default, query memory allocation (and some other memory allocations) for all threads on a NUMA node are serialized through a single chokepoint (!!).  That's as well-defined a bottleneck as you'll find, just waiting for enough concurrent activity on the NUMA node to 'poke the bear'.  In fact, when things start to look like they do above, it might not just be CPU overload coupled with spinlock contention.  It could even be a 'spinlock convoy', where numerous cores (in this case all CPU cores on at least 1 NUMA node) spend more cycles spinning for the same resource than accomplishing logical work.

So here's the deal - if you see this pattern on your SQL Server... and it's a dedicated server so nothing else is chewing up your compute power... check the waits.  Check the spins.  Plot logical reads, or some other measure of database work, against CPU utilization.  Consider disabling NUMA with trace flag 8015, and don't forget to also apply trace flag 8048 in order to promote query memory allocation to per-core serialization, eliminating the NUMA node serialization bottleneck.  Even if disabling NUMA with trace flag 8015 is not a good idea for your workload (maybe you need more than one lazywriter, or you are affinitizing connections to a NUMA node?), I haven't yet found a disadvantage to promoting memory serialization to per-core and removing the bottleneck with trace flag 8048.
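
For checking the waits and spins, something like this is a starting point - CMEMTHREAD waits and a hot spinlock climbing the list are the signatures I'd look for.  (sys.dm_os_spinlock_stats is undocumented, so treat this as a sketch:)

sqlcmd -S MYSERVER -E -Q "SELECT TOP 10 name, collisions, spins, backoffs FROM sys.dm_os_spinlock_stats ORDER BY spins DESC;"
sqlcmd -S MYSERVER -E -Q "SELECT wait_type, waiting_tasks_count, wait_time_ms FROM sys.dm_os_wait_stats WHERE wait_type = 'CMEMTHREAD';"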

Oh yeah, I know about SQL Server 2012 kb hotfixes 2819662 and 2845380.  Neither one nor the combination will remedy this issue.  Nor will they address the stuff I talk about in the following post:
http://sql-sasquatch.blogspot.com/2013/08/sql-server-numa-servers-cpu-hotspot.html

Here's a bibliography if you want a fuller understanding of task distribution with default SQL Server NUMA behavior, memory management with default SQL Server NUMA behavior, memory allocation with default SQL Server serialization, etc.  Or maybe you want to double check that I actually know what I'm talking about :) You could also follow the tags to my other posts on NUMA, TF8048, TF8015 :)


Selected SQL Server 2012 NUMA and Memory Management Hotfixes
kb2819662: SQL Server Performance Issues in NUMA Environments
kb2845380: You may experience performance issues in SQL Server 2012

Selected Trace Flag 8048 Sources 
CSS SQL Server Engineers - SQL Server 2008/2008 R2 on Newer Machines with More Than 8 CPUs Presented per NUMA Node May Need Trace Flag 8048
CSS SQL Server Engineers - CMemThread and Debugging Them


Selected SQL Server NUMA Sources
CSS SQL Server Engineers - SQL Server (NUMA Local, Foreign and Away Memory Blocks)
CSS SQL Server Engineers - SQL Server 2008 NUMA and Foreign Pages
CSS SQL Server Engineers - SQL Server 2005 Connection and Task Assignments 
CSS SQL Server Engineers - NUMA Connection Affinity and Parallel Queries
CSS SQL Server Engineers - Soft NUMA, I/O Completion Thread, Lazy Writer Workers and Memory Nodes
MSDN - Growing and Shrinking the Buffer Pool Under NUMA 
qdpma.com - NUMA Systems and SQL Server 

Friday, August 9, 2013

SQL Server: NUMA servers CPU hotspot liability by default

This post is long overdue.  I still don't like the graphs I have... although they kinda tell the story, they still don't quite capture the risk/reward in an intuitive way.  But I don't want to hold off anymore on getting something out in the wild.

I talk a lot about hotspots, and usually it's about storage.  Balanced resource utilization typically provides the most predictable performance, and that's what I shoot for by default.

It's important to realize that hotspots are not only a storage resource concept - they can apply to CPU or other resources as well.  In fact, for SQL Server on NUMA servers, by default there is a CPU hotspot liability.

Imagine you have a server with 8 NUMA nodes, 6 cores on each NUMA node (think old school AMD: 12 cores per socket, 2 NUMA nodes per socket, one logical CPU per physical core).  Parallel queries start stacking up their workers.  Once the 6 cores on NUMA node 0 have enough active worker threads to stay busy - it would seem like a good idea to start giving some threads to the other cores... right?  In fact - once those 6 cores are all at or near 100% CPU utilized - if any more active threads are added, the available cycles on those CPUs are just going to be spread among a larger number of active workers, with each of them accomplishing less than previously.  (In fact, the growing number of context switches also represents a growing management cost, so less of the CPU total cycles will be available for dividing among the growing number of threads.)  And in fact, that's exactly what happens in SQL Server by default.  Hotspot jambalaya!

Here's the deal: client connections are round-robin distributed among NUMA nodes.  But when a connection requests a parallel query, there is a strong tendency for all of the query workers to remain on the NUMA node.  Hmmmm.... so on my 8 NUMA node server, if a single query spins up a worker thread per CPU, each of the cores in the NUMA node would have 8 active worker threads.  While the other 42 cores get no work to do for the query.  Well.... shoot....

See - this is a huge part of the recommendation to set maxdop to the number of cores in the NUMA node... or half of the cores in the NUMA node... or...

But that's a losing game, let me tell you.  Because if a query starts with multiple data paths, you can easily end up right back where you started. If all queries were simple and had a single stream of processing, setting maxdop to the number of logical CPUs in a NUMA node might make sense.  Each parallel query would result in one thread per logical CPU.  But complex, multi-stream queries toss that idea out the window.  Use a stored procedure or complex query with 8 separate processing streams - you are right back where you started: 8 active worker threads per logical core on one NUMA node, and no work associated with that query for anyone else.

If that type of activity takes place on a DSS or DW system with a high concurrency of batched queries, query throughput can be horrible.  Check it out.  I'm sorry it's ugly.  The numbers in the graph are from a 48 physical core system with 8 NUMA nodes, and MAXDOP set to 6.  I averaged the CPU busy along NUMA node boundaries, and those are the results below for Node0 to Node7.  While some NUMA nodes are maxed out, others are loafing.
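
(If you want to collect the same raw data yourself, typeperf can log per-CPU utilization at 15 second intervals - a sketch; average the columns along NUMA node boundaries afterward.  The sample count of 960 is 4 hours' worth:)

typeperf "\Processor(*)\% Processor Time" -si 15 -sc 960 -o cpu_by_core.csv -f CSV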
  
This isn't a secret, although almost no-one ever talks about it.  There are details about task assignment at the following locations.

How It Works: SQL Server 2005 Connection and Task Assignments
http://blogs.msdn.com/b/psssql/archive/2008/02/12/how-it-works-sql-server-2005-connection-and-task-assignments.aspx

NUMA Connection Affinity and Parallel Queries
http://blogs.msdn.com/b/psssql/archive/2007/06/28/numa-connection-affinity-and-parallel-queries.aspx

So... what to do, what to do?

Well, SQLCAT pursued one avenue.  They wrapped queries in a "terminate - reconnect" strategy.

Resolving scheduler contention for concurrent BULK INSERT
http://sqlcat.com/sqlcat/b/technicalnotes/archive/2008/04/09/resolving-scheduler-contention-for-concurrent-bulk-insert.aspx

I'm not very fond of that solution, but it is an acknowledgement of the problem, if nothing else :)

Instead, for the DSS systems and workloads I work with, I recommend disabling NUMA support at the database level with startup trace flag 8015.   All schedulers are treated as a single pool, then - no preferential treatment for parallel query workers along NUMA node boundaries.  (All memory is treated as a single pool as well, which leads to a separate set of benefits for the workloads I deal with.  Guess I'll have to return to that another day.)

It's important to note that on these systems we already use trace flag 8048 to eliminate spinlock contention among schedulers in the same group during concurrent memory allocation.  Increase the number of schedulers in the group by disabling NUMA with TF8015, and if you haven't also put TF8048 in place, you could invoke a spinlock convoy with enough concurrent activity.  That would NOT be cool.

Anyway... time for the big finish.  Here's another graph that's better than nothing, although I'm still not completely happy with it.  It takes forever to explain what is being measured... but at least it makes the difference very evident :)

So, the blue line is the control workflow - trace flag 8048 is in place to eliminate spinlock contention, but the database NUMA support is enabled.  The Y axis counts occurrences of qualifying 15 second interval perfmon samples.  The X axis is the difference between the CPU% for the busiest NUMA node and the average of the remaining 7 NUMA nodes.  The graph displays the level of balance in CPU utilization across NUMA nodes.  The more balanced the CPU use of a workload is, the more samples there will be in the lower end of the graph.  See -- the control workflow is not very balanced.  The most popular level of difference was 68% - so the busiest NUMA node was 68% busier than the average of the remaining 7 nodes well over 600 times (more than 2.5 hours).  

Contrast that with the same workload on the same system, with trace flag 8015 added to trace flag 8048, to disable database NUMA support.  Now the most popular position is about 4% difference, with over 2.5 hours spent at that level.





Thursday, August 8, 2013

AIX hotspot hunting and potential benefit of tier 0 cache: lvmstat overtime data

Recently I posted a method for tacking on a timestamp to lvmstat output with awk.  Maybe you think that's not a big deal.  In my world, it is :)
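
For reference, here's the shape of it - the awk version is in that earlier post; this is a shell-only equivalent, a sketch with a placeholder logical volume name.  Note that lvmstat statistics have to be enabled for the LV first:

lvmstat -l datalv01 -e      # enable stats collection for the LV
# top 32 logical partitions by iocnt, every 60 seconds, 360 samples
lvmstat -l datalv01 -c 32 60 360 | while read -r line; do
    echo "$(date '+%Y-%m-%d %H:%M:%S') $line"
done >> datalv01_lvmstat.txt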

The graph above is a heat map of a logical volume with a JFS2 filesystem on top.  It's cio mounted, and has tons of database files in it.  With this particular database engine, all reads and writes to a cio filesystem are 8k (unlike Oracle and SQL Server, which will coalesce reads and writes when possible).  Also, all database reads go into database cache, unlike Oracle, which can elect to perform direct path reads into PGA rather than into the SGA database cache.  Those considerations contribute to the nice graph above.

The X axis is the logical partition number - the position from beginning to end in the logical volume.
The left primary axis represents the number of minutes, out of 360 elapsed minutes, that the logical partition was ranked in the top 32 for the logical volume by iocnt.  (Easy to get with the -c parameter for the lvmstat command.)

So... what use is such a graph?  For one thing, together with data from lslv, you can map hot logical partitions back to their physical volumes (LUNs)... that is invaluable when trying to eliminate QFULL occurrences at the physical volume level.  For another thing, together with database cache analysis (insertion rates at all insertion points, expire rates, calculated cache hold time, etc.), heat maps like these can help to estimate the value of added database cache.  They can also help to estimate the value of a tier 0 flash automated tiering cache - whether the flash is within the storage array, or in an onboard PCIe flash device.  I've worked a bit with EMC XtremSW for x86 Windows... looking forward to working with it for IBM Power and the PCIe flash form factor.  Hoping to test with the QLogic Mt Rainier/FabricCache soon, too.  If the overall IO pattern looks like the graph above, as long as the flash capacity together with the caching algorithm results in good caching for the logical partitions on the upward swing of the hockey stick, you can expect good utilization of the tier 0 cache.  I personally prefer onboard cache, because I like as little traffic getting out of the server as possible.  In part for lower latency... in part to eliminate queuing concerns... but mostly to be a good neighbor to other tenants of shared storage.
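
(The logical-partition-to-physical-volume mapping mentioned above comes from lslv - a sketch, again with a placeholder LV name:)

# map each logical partition to its physical partition(s) and physical volume
lslv -m datalv01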

So... in reality, the system I'm staring at today isn't as simple as a single logical volume heat map.  There are more than 6 logical volumes that contain persistent database files.  All of those logical volumes have hockey stick shaped graphs as above.  It's easy to count the number of upswing logical partitions across all of those LVs, and find out how much data I really want to see in the flash cache.  Now... if the flash cache transfer size is different from the logical partition size, that should be considered.  If my logical partition size is 64mb, and the flash tiering always promotes contiguous 1 GB chunks from each LUN, that could lead to requiring a lot more flash capacity to contain all of my hot data.  On the other hand, if the transfer size into tier 0 cache is 1 mb, and the heat is very uneven among 1 mb chunks within each hot LP... the total flash cache size for huge benefit might be a lot smaller than the aggregate size of all hot LPs.  Something to think about.  But I can't give away all of my secrets.  Keeping at least some secrets is key to sasquatch survival :)

Friday, August 2, 2013

∃ x ∈ HA: x ∈ DR

It's been a long time since my advanced calc class.

∃ x ∈ HA: x ∈ DR
What on earth does that mean?

It means there exists an element x of HA such that x is an element of DR.  That's how I make sense of HA ≠ DR, which is oft-quoted and in many cases applied in ways that are simply not accurate.

DR technology and HA technology are sets rather than mutually exclusive attributes of an individual technology.  I'm aware of no definitions for the DR and HA technology sets which preclude their intersection.  So can a particular technology be a member of both sets?  Yes.  In fact, there are many such examples.

Name the database platform, and there is quite likely a continuous asynchronous replication method at the database level, based on send and apply of database logs to a secondary system.  Very likely also an asynchronous batched or delta set replication.  Maybe even a manner of synchronous transaction replication.  Oracle has Data Guard.  SQL Server has AlwaysOn and log shipping.  InterSystems Caché has shadowing and mirroring.

I like database replication for databases, for reasons I'll detail some other day.  But it bears mentioning that storage subsystem components have similar offerings.  EMC VMAX has synchronous and asynchronous data replication.  Hitachi has similar TrueCopy options.  IBM storage offers Global Mirror and Metro Mirror.  And the list goes on.

All of these technologies can be used to provide local data availability (and thus be part of a local high availability design, with or without automatic takeover).  These technologies can also provide replication to a secondary site.  With enough geographic separation to protect from natural disasters within design scope (e.g. hurricanes, tornadoes, earthquakes, flooding), any method of getting your data to the secondary site... even sneakernet of a tape backup... provides a measure of disaster recovery for the data.

Beyond that, it's about thresholds, objectives, and the faults/disasters that are within scope.  Thresholds?  I think that more and more folks should be talking about MPOD - maximum period of disruption*.  Objectives: 1) RTO - recovery time objective (a goal for service time lost to the fault/disaster) 2) RPO - recovery point objective (a goal for how much data, measured in time leading up to the disaster/failure, can be lost on recovery).

Scope is very, very important to specify for DR and local recovery measure design: protecting from a hurricane is very different from protecting from logical corruption.  Protecting from logical corruption discovered immediately (or at least in the same day) is very different from protecting from corruption of data quality discovered a full year after introduction.  When a particular fault or disaster event falls outside the scope of your DR or local availability design, it doesn't mean that you don't have high availability or don't have disaster recovery.  It's just a realization of the limited scope of the design.

In my opinion, designing for recovery from logical corruption or data quality problems is the most involved part of recovery planning and design.  How much reachback time is enough?  Database server crashes usually don't go long before being discovered.  Data quality issues, even fairly pervasive ones, can be introduced months before they are discovered.

Consider this example: weekly full backups for 1 month, and all tx log backups for that month, are retained locally.  Asynch database replication to the secondary site.  1 monthly full backup and all tx log backups from that month retained at the secondary site.  The compliance and litigation risk department requires that monthly full backups be retained at the secondary site for each previous month of the year, and 1 yearly full backup for 7 years back (someone said it's needed for SOX compliance :) ).

Hurricane or tornado at the primary datacenter?  Thank goodness for database replication to the secondary site, and disaster recovery!  Same if flooding hits, or if a SAN-level admin error formats all SAN disks.  (SAN replication would likely transmit the SAN admin error, but database replication in this case would not.)

A database admin error could take out the primary system, and with continuous asynch database replication would likely be transmitted to the secondary site.  So additional recovery would likely be needed.  If backups are available at the primary and secondary sites with all tx logs at both sites, recovery on the primary is usually preferable.  Let's play with that scenario a little.  What if there are no local database backups and no secondary site backups, and a database admin error or errant table truncation takes place?  Neither the primary nor the secondary site could provide recovery of that table.

Does the existence or realization of such a failure mean that there is NO disaster recovery technology or measure in place?  No, not at all.  I mean... I'd never recommend the absence of database and log backups locally, and I always strongly recommend at least the capability at the secondary site.  But that strategy provides for disaster recovery from huge natural disasters, while providing no recoverability for logical corruption/admin error/data quality problems.

So... is this sasquatch preaching some strange new gospel?  I don't think so - I still believe HA ≠ DR.  I also believe they can and do intersect.  I won't downplay the importance of backups - they are critical for survival.  But as business continuity and risk management in the database world matures... I want to make sure that at some level SQL Server folks, Oracle folks, and storage folks are all using at least SOME of the same important vocabulary in the same way.  Especially when it comes to critical service delivery characteristics and potentially big ticket capital and operational expenses.

Consider what EMC says about SRDF, Hitachi says about TrueCopy, IBM says about Metro/Global Mirror.

"Built for the industry-leading high-end VMAX hardware architecture, the SRDF family of solutions is trusted for disaster recovery and business continuity."

"Provides a continuous, nondisruptive, host-independent remote data replication solution for data protection, disaster recovery or data migration purpose."

"The Metro/Global Mirror function has a number of supported automated management offerings for disaster recovery solutions."

Consider what Oracle says about Data Guard.
"Oracle Data Guard ensures high availability, data protection, and disaster recovery for enterprise data."

Finally, consider the following command for SQL Server 2012 AlwaysOn:
ALTER DATABASE <> SET HADR 



--As an aside - although I personally believe recovery from logical corruption and admin errors can qualify as disaster recovery - many folks will say that local recovery from those events is not disaster recovery if a secondary site is available and was not utilized in recovery.  Recoveries on the primary/local system fall outside of many definitions of disaster recovery.
--When I control the vocabulary, I find it's much more effective to talk about 'local availability', 'local recovery', and 'remote recovery' design sets.



*Pretty late update from sql.sasquatch 20131027.  My memory is not very reliable.  If you look around for MPOD related to business continuity or disaster recovery planning, you won't find much.  Look instead for MTPOD - maximum tolerable period of disruption, or one of these synonyms:
MAO - maximum allowable outage
MAO - maximum acceptable outage
MTD - maximum tolerable downtime