Friday, August 30, 2013

Oracle RAC on IBM Power AIX 6.1/7.1? Look out for a UDP bug!

This post, like many, might be longer than you want to read :)  If you don't want to read much, here are the APARs you want to check for on your IBM Power AIX servers running Oracle RAC (there's a quick check sketched after the list).  If you are running Oracle RAC on a system exposed to the UDP defect, I recommend installing a fix when it fits into your maintenance schedule (as opposed to waiting to hit this kind of trouble).

To review documentation for a specific APAR, append the APAR ID to this URL:
http://www-01.ibm.com/support/docview.wss?uid=isg1

IV33210 6.1
IV31961 6.1 TL8
IV31917 7.1 TL1?
IV31962 7.1 TL2
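
If you want a quick way to check each RAC node, here's a minimal sketch.  It's an assumption-laden convenience, not an official tool: it assumes you can run commands on the box, that "oslevel -s" reports the TL/SP you're on, and that "instfix -ik <APAR>" reports whether a given APAR's filesets are installed.  The APAR IDs are just the ones listed above.

# Hedged sketch: report the oslevel and whether each APAR from this post is installed.
import subprocess

APARS = ["IV33210", "IV31961", "IV31917", "IV31962"]

# oslevel -s shows the TL/SP, e.g. "7100-02-02-1316", so you know which APAR applies
print(subprocess.run(["oslevel", "-s"], capture_output=True, text=True).stdout.strip())

for apar in APARS:
    result = subprocess.run(["instfix", "-ik", apar], capture_output=True, text=True)
    # instfix prints something like "All filesets for IV31961 were found." when installed
    print(result.stdout.strip() or result.stderr.strip())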

****

Now, to make a short story long...

Last night I did some late night reading of IBM AIX 6.1 and 7.1 APAR bugfixes.  I do that when I can't sleep.  In fact, sometimes it's directly related to WHY I can't sleep.

So I wanted to put a warning out there for folks running Oracle RAC on IBM Power AIX.  IBM classifies fixes to high-impact, pervasive defects as "HIPER".  The APARs above describe a HIPER defect in UDP.  I can't think of anything that relies more significantly on UDP than Oracle RAC, since the cluster interconnect traffic rides on it.  I often recommend that admins evaluate their IBM Power systems for exposure to HIPER defects, and plan fix installs if exposed.  If you are running Oracle RAC on a system with the defect... Oracle RAC is exposed.

Defects in a communication protocol implementation, or in transmission (like a defect in packet checksum handling that causes erroneous packet rejection), can be maddening and sometimes take forever to resolve.  Last year I was involved in an Oracle RAC problem that was escalated to the blade server vendor, to Oracle database support, and finally to my group.  Another member of my team put in more hours than I did (he knows WAY more Oracle than I may ever know), but I put at least 80 hours into diagnostic attempts.  The symptoms showed up unpredictably in ETL: many table loads would complete, but one would languish.  Almost no thread or CPU utilization footprint.  Almost no disk IO footprint.  Almost no NOTHIN'!  After hours of hardly doing anything, it would finally "wake up" and finish its work.

What was happening?  The query in question was waiting on inter-node RAC communication completion.  It was fetching a 32 KB block from another database node.  Eventually, using wireshark, we could see that the first 3 UDP packets were received, but the final packet was not.  So after timing out, the thread would re-request the database block from the other node... and the same thing would happen.  (Please no jokes about UDP as an unreliable protocol... unless you think it's one I haven't thought of myself.)  Why wasn't the final packet ever coming through?  Why did the query finally complete hours later - did the umpteen millionth request finally receive all UDP packets for the 32 KB database block?  We never actually answered that question before fixing the problem - but I suspect that eventually the block was retired from the other node's SGA database cache, and the query thread finally requested the database block from database storage rather than from the other node's cache.  (Best explanation I can come up with for the eventual success, anyway.)
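
If I were chasing that same signature today, here's roughly how I'd confirm it from a capture.  This is a hedged sketch, not what we actually ran: it assumes scapy is installed and that "interconnect.pcap" is a tcpdump/wireshark capture taken on the private interconnect (the filename is a placeholder).

# Rough sketch: flag UDP datagrams whose final IP fragment never showed up in the capture.
from collections import defaultdict
from scapy.all import rdpcap, IP

frags = defaultdict(list)
for pkt in rdpcap("interconnect.pcap"):      # placeholder capture file
    if not pkt.haslayer(IP):
        continue
    ip = pkt[IP]
    more_fragments = bool(ip.flags & 1)      # MF bit is set on all but the last fragment
    if more_fragments or ip.frag > 0:        # i.e. this packet is part of a fragmented datagram
        frags[(ip.src, ip.dst, ip.id)].append(more_fragments)

for (src, dst, ip_id), pieces in frags.items():
    # If no fragment with MF clear was ever seen, the tail of the datagram is missing --
    # the same signature we eventually spotted with wireshark.
    if all(pieces):
        print(f"incomplete datagram: {src} -> {dst} id={ip_id}, saw {len(pieces)} fragment(s)")

Any datagram endpoints that show up in that report over and over are a pretty strong hint that you're losing the tail of a fragmented block transfer.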

Early on I threw out the idea of checksum bugs: was it possible that something before wireshark was discarding a perfectly good UDP packet because the checksum it generated was different from the incoming packet checksum?  I'd never actually seen that happen with UDP, but I've seen similar problems where security hashes from Windows weren't matched by the hash generated by a database executable compiled on SPARC Solaris and linked against an OpenSSH package.  (That also took forever to figure out, by the way... eventually I was able to correct the issue by recompiling the executable with different compiler options.)  That experience, and knowing that there have been similar issues in some Linux x86 builds, made me think it was a reasonable suspicion.
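
For what it's worth, the checksum theory is also easy to test against a capture.  Another hedged sketch under the same assumptions (scapy, the same hypothetical "interconnect.pcap").  Keep in mind that a capture taken on the sending host with checksum offload enabled will show bogus checksums, so point this at a capture from the receiving side or a span port.

# Rough sketch: recompute UDP checksums from the capture and compare with what arrived.
from scapy.all import rdpcap, defragment, IP, UDP

for pkt in defragment(rdpcap("interconnect.pcap")):   # reassemble IP fragments first
    if not (pkt.haslayer(IP) and pkt.haslayer(UDP)):
        continue
    captured = pkt[UDP].chksum
    rebuilt = IP(bytes(pkt[IP]))
    del rebuilt[UDP].chksum                            # clear so scapy recomputes on build
    del rebuilt.chksum
    rebuilt = IP(bytes(rebuilt))                       # rebuild with fresh checksums
    if captured not in (0, rebuilt[UDP].chksum):       # 0 means "no checksum" for UDP over IPv4
        print(f"checksum mismatch: {pkt[IP].src} -> {pkt[IP].dst} "
              f"got {captured:#06x} expected {rebuilt[UDP].chksum:#06x}")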

After many person-hours and several weeks of elapsed time, a Juniper switch firmware upgrade resolved the issue.  We had long post-mortems - what questions could we have asked earlier - what logs could we have reviewed - what monitoring could we have put in place to diagnose and correct the issue faster?

We didn't come up with any great answers other than remembering the experience and keeping that type of failure in mind for future RAC investigations.
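
If I get to pick the monitoring next time, the cheapest thing I can think of is a small poll that squawks when a session has been stuck on a cluster-class wait for a long time.  A rough sketch only, with plenty of assumptions: cx_Oracle installed, an account that can read gv$session, and connection details and the 600-second threshold as placeholders.

# Hedged sketch: report sessions stuck on cluster ("gc ...") waits for a long time.
import cx_Oracle

conn = cx_Oracle.connect("monitor_user", "monitor_password", "racdb_service")  # placeholders
cur = conn.cursor()
cur.execute("""
    select inst_id, sid, event, seconds_in_wait
      from gv$session
     where wait_class = 'Cluster'
       and state = 'WAITING'
       and seconds_in_wait > :threshold""", threshold=600)
for inst_id, sid, event, secs in cur:
    print(f"inst {inst_id} sid {sid} waiting {secs}s on {event}")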

And honestly, because I'm an outsider and only get to talk to the organization staff I am introduced to or brought in with... that is about as far as I can get without having a complete topology of a given system, including all relevant components of the communication and storage network.  And I'm not really too much of a Linux guy... I don't read through Linux bug reports and fix docs like I do for AIX.  Not yet anyway :)

But if I'm looking at an IBM Power AIX system running RAC from now on... you'd better believe that I'll check for this fix :)  Not gonna wait for the problem to show up... it might take me too long to realize that the problem is corruption of UDP packets on the receiving end.

So... please check your system, too.  Google searches for "Oracle RAC AIX APAR" and any of the APAR IDs below come up empty (at least they did before this blog post :) ).  So maybe no one has experienced this.  But again... I can't think of anything that uses UDP more critically than Oracle RAC.

I've got a question mark below for AIX 7.1 TL1 because APAR IV31917 cryptically says that the problem is not present in TL1, only TL2.  No idea what that REALLY means.  Maybe there is no TL1 fix, and APAR IV31917 is for AIX 7.1 TL0?  Maybe IV31917 is a 7.1 TL1 APAR, but it corrects code unused in TL1?  If you're running RAC on AIX 7.1 TL1, I'd ask IBM for clarification - no sense in changing maintenance plans to install a no-op fix :)

To review documentation for a specific APAR, append the APAR ID to this URL:
http://www-01.ibm.com/support/docview.wss?uid=isg1

IV33210 6.1
IV31961 6.1 TL8
IV31917 7.1 TL1?
IV31962 7.1 TL2

Be well!
