Wednesday, November 20, 2013

Things Fall Apart: Again my Turn to Win Some or Learn Some (AIX)

Chinua Achebe and Jason Mraz.  I'm an eclectic sasquatch, if ever there was one. 

Things Fall Apart
 A recent performance intervention did NOT go as planned.

Recently I described a fairly common performance bottleneck scenario.  CPU utilization was low and the workload was running long: low fibre channel throughput on the IBM Power LPAR was strangling CPUs that wanted to work harder.  Throughput was low due to lots of small reads and almost continual overruns of the LUN queue_depth, which was set to 16.  In addition to the small average read size and low queue_depth, there was a third component in the stew as well: over time, the logical volume "maximum" policy for inter-physical-volume partition use became ineffective.  After the initial, sufficient number of LUNs was nearly full, one or two LUNs were added to the VG to satisfy capacity needs, without concern for the spreading of physical partitions over multiple LUNs and disk service queues that had previously taken place.  The 16 MB physical partition size, which had served well over the old number of LUNs in rotation, gave way to hundreds of GBs of logical volumes made up of consecutive 16 MB physical partitions from the same LUN, using the same disk service queue.  Hotspot magnet.

So, the intervention plan was pretty simple:
1. Increase queue_depth from 16 to 32, which I knew the underlying SAN could sustain.
2. reorgvg to redistribute the physical partitions (and their future IO requests) more equitably among the LUNs - performed after increasing queue_depth, in order to take advantage of the higher queue depth.
3. Further investigation to determine how to increase the Oracle read size - whether by addressing filesystem fragmentation, database object fragmentation, or some other factor (the hdisk* and fcs* maximum transfer sizes were both already sufficient to allow the 1 MB product of the 128 mbrc and the 8k database block size we want to see whenever possible).
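For reference, the first two steps of the plan look roughly like this on AIX. This is a sketch only: the hdisk names and the volume group name (datavg) are hypothetical, and since the disks were in use, the queue_depth change was deferred with -P and a reboot, as described below.

```shell
# Step 1: raise the per-LUN service queue depth from 16 to 32.
# -P defers the change to the next boot, since the disks are in use.
# (hdisk4/hdisk5 and datavg are placeholder names, not from the post.)
for d in hdisk4 hdisk5; do
    chdev -l "$d" -a queue_depth=32 -P
done
# ... reboot so the new queue_depth takes effect ...

# Step 2: redistribute the physical partitions (and their future IO)
# across the LUNs in the volume group, per the LVs' inter-physical-volume
# "maximum" policy.
reorgvg datavg
```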

Yeah, that was the plan.  You can look at the "before" picture of the system here:

So, what actually happened?  Well... 'I almost had a heart attack' is what happened.  A sasquatch under a defibrillator is not a pretty sight :)

The queue_depth was changed on all hdisks, from 16 to 32 as recommended.  The system was rebooted to make the change effective.  The smallest vg was selected for reorgvg first, and the reorgvg completed.  Then, a few days later - everything fell spectacularly apart.  How bad?  Think about average write service times displayed in iostat up to 40 seconds.  Not 40 milliseconds. 40 full seconds.

What happened?  Well, I still don't have a conclusive diagnosis.  But when IBM was engaged via a PMR, they recommended resetting the fcs1 and fcs3 devices back to their default AIX 6.1 num_cmd_elems (200) and max_xfer_size (0x100000, or 1 MB).

Interesting.  The num_cmd_elems and max_xfer_size attributes hadn't been changed in a lllooonnnggg time.  The num_cmd_elems attribute was 2048 for both devices.  A lot higher than the default of 200... but I know that CLARiiON devices used to come with that recommendation for IBM Power AIX hosts.  And it's not unusual to see IBM storage accompanied by num_cmd_elems that high or even higher, since individual LUNs on many types of IBM storage can have a queue_depth of up to 256.

The max_xfer_size was a bit more unusual.  It was 0x1000000 - 16 MB.  At first I wondered whether that was intentional, because lately I've only seen the default 1 MB value or the recommended 2 MB value (on those AIX versions where 2 MB results in a 128 MB DMA area, versus 16 MB with the default max_xfer_size).  I was told that the 16 MB value had been recommended by the storage vendor.  And I don't doubt it - I found that recommendation here for the SVC (although this config didn't use an SVC):
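The adapter attributes in question are easy to check with lsattr; the output shown in the comments is illustrative of the configuration described above, not captured from the system:

```shell
# Inspect the FC adapter tunables on the two adapters named in the post.
lsattr -El fcs1 -a num_cmd_elems -a max_xfer_size
lsattr -El fcs3 -a num_cmd_elems -a max_xfer_size
# Illustrative output for the configuration described:
#   num_cmd_elems 2048      Maximum number of COMMANDS to queue to the adapter True
#   max_xfer_size 0x1000000 Maximum Transfer Size                              True
```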

Interesting - increase the maximum transfer size to 16 MB, even if the hdisk maximum transfer size won't allow that large a transfer.  And that recommendation is meant to improve performance.  I bet it's for the same reason that 2 MB later became a fairly common recommendation: increasing the DMA allocation from the 16 MB default to 128 MB.

Anyway - somehow the doubling of the queue_depth interacted with num_cmd_elems=2048 and max_xfer_size=0x1000000 in such a way that writes just plain couldn't get through in a reasonable time.  Each minute captured in 'iostat -DlRTV' showed average write latency upwards of a full second and as high as 40 seconds.  Read latency was sometimes averaging in the seconds, but not always.  It looked like the reads were only hurting if they were stacked behind a full or partially full service queue of writes.  Eventually I asked if anything was showing up in errpt.  Oh, yeah.  FCP_ERR4, coming so fast that the error log was churning and could no longer tell when the errors had started.
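The observations above came from standard AIX tooling; a diagnostic sweep along these lines (sketch only - the error identifier is a placeholder for whatever errpt reports):

```shell
# Per-disk extended stats with timestamps, 60-second intervals: watch the
# average write service times and the service-queue-full counts climb.
iostat -DlRTV 60

# Check the error log for FC protocol errors such as the FCP_ERR4 seen here.
errpt | head
# Full detail for one entry, given its identifier from the summary listing:
errpt -a -j <identifier>
```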

Fortunately, an administrator had previously made changes to fibre channel devices one at a time in order to keep the system running on the other fibre channel adapter - he did this again to correct this issue.  The fibre channel adapters were returned to the defaults of 200 and 0x100000.  Performance improved to normal levels.  The fcstat command showed that num_cmd_elems was not sufficient to avoid queuing at the adapter, but performance was roughly the same as it was before the change to queue_depth.  Since I know that the AIX 7.1 default for fcs devices is num_cmd_elems=500, the attribute has just recently been increased, and we'll watch it real close.
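The one-adapter-at-a-time reset looks roughly like this; a sketch under the assumption that MPIO keeps I/O flowing on the other path while one adapter's child devices are unconfigured:

```shell
# Sketch only: reset one adapter while I/O continues on the other path.
# In practice the child devices of fcs1 must be unconfigured first
# (rmdev -R), or the change deferred with -P and made live at next boot.
chdev -l fcs1 -a num_cmd_elems=200 -a max_xfer_size=0x100000 -P

# Afterwards, watch for adapter-level queuing: a climbing
# "No Command Resource Count" in fcstat means num_cmd_elems was
# exhausted and commands had to wait at the adapter.
fcstat fcs1 | grep -i "No Command Resource"
```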

Again my Turn to Win Some or Learn Some

As I said, no conclusive diagnosis from IBM yet.  But... you know, I've read this article so doggone many times... and I never scrutinized this portion before:


Note that changing max_xfer_size uses memory in the PCI Host Bridge chips attached to the PCI slots. The salesmanual, regarding the dual port 4 Gbps PCI-X FC adapter states that "If placed in a PCI-X slot rated as SDR compatible and/or has the slot speed of 133 MHz, the AIX value of the max_xfer_size must be kept at the default setting of 0x100000 (1 megabyte) when both ports are in use. The architecture of the DMA buffer for these slots does not accommodate larger max_xfer_size settings"

If there are too many FC adapters and too many LUNs attached to the adapter, this will lead to issues configuring the LUNs.

Hmmm... so now I'm thinking that the maximum transfer size on the fibre channel adapters may have been lying in wait for the fateful day when aggregate LUN queue depth was increased enough... and BLAM! Performance chaos.

If so, it'll be good to know for sure.  Especially if that means increasing num_cmd_elems to cover the aggregate queue_depth is safe, as long as we leave the max_xfer_size at 1 MB (or maybe, hopefully, eke it up to 2 MB).
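The arithmetic behind "covering the aggregate queue_depth" is simple enough to sketch. The LUN count below is hypothetical; only the queue_depth of 32 and the AIX 7.1 default num_cmd_elems of 500 come from the discussion above:

```shell
# Rule of thumb: an adapter's num_cmd_elems should cover the sum of
# queue_depth across the LUNs it serves, or I/O queues at the adapter
# instead of the LUNs. LUN count is a hypothetical illustration.
luns_per_adapter=20
queue_depth=32
num_cmd_elems=500
aggregate=$((luns_per_adapter * queue_depth))
if [ "$aggregate" -gt "$num_cmd_elems" ]; then
    echo "aggregate queue depth $aggregate exceeds num_cmd_elems $num_cmd_elems"
else
    echo "num_cmd_elems $num_cmd_elems covers aggregate queue depth $aggregate"
fi
```

With these illustrative numbers, 20 LUNs at queue_depth 32 is an aggregate of 640 outstanding commands, already past the 500 default - which is exactly the kind of gap fcstat's resource counters would reveal.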

But the situation Dan Braden describes in the article doesn't quite fit what we saw: two fibre channel ports were in use - but only one port from each of two cards on this LPAR.  And Dan mentions 'too many FC adapters' and 'too many LUNs' - not quite what happened here, where the queue_depth attribute of existing LUNs was doubled to a completely reasonable (in my mind, anyway) value of 32.

We'll see.
