Showing posts with label AIX. Show all posts

Monday, December 30, 2013

How much IBM AIX JFS2 Filesystem RAM Overhead? j2_inodeCacheSize & j2_metadataCacheSize



The biggest takeaways from the stuff below:
AIX 6.1 defaults can result in ~14% of server memory eventually used for the JFS2 inode and metadata caches.
AIX 7.1 defaults can result in ~7% of server memory eventually used for the JFS2 inode and metadata caches.


*****
These details come from the location above.

j2_inodeCacheSize
range = 1:1000
previous default = 400 (~10% server memory)
AIX 7.1 default = 200 (~5% server memory)
maximum value of 1000 = ~25% server memory

j2_metadataCacheSize
range = 1:1000
previous default = 400 (~4% server memory)
AIX 7.1 default = 200 (~2% server memory)
maximum value of 1000 = ~10% server memory
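As a quick sanity check on those figures, here's a rough calculation (my own sketch, assuming the memory fraction scales linearly with the tunable value, which the numbers above suggest - 400 maps to ~10%/~4% and 1000 to ~25%/~10%):

```python
# Rough estimate of the JFS2 cache RAM ceiling implied by the ioo tunables.
# Assumption (mine): percentage of server memory scales linearly with the
# tunable value, per the figures above.

def jfs2_cache_pct(j2_inode_cache_size, j2_metadata_cache_size):
    """Return (inode %, metadata %, combined %) of server memory."""
    inode_pct = j2_inode_cache_size * 25.0 / 1000    # 1000 -> ~25%
    meta_pct = j2_metadata_cache_size * 10.0 / 1000  # 1000 -> ~10%
    return inode_pct, meta_pct, inode_pct + meta_pct

# AIX 6.1 defaults (400/400) -> ~14% of RAM; AIX 7.1 defaults (200/200) -> ~7%
print(jfs2_cache_pct(400, 400))
print(jfs2_cache_pct(200, 200))
```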

The metadata and inode caches never shrink.  Once they reach their configured size, contents begin to be retired so new contents can be cached.

If these caches are sized too large for a given server, they will continue to grow slowly until reaching their maximum size.  The gradual increase in kernel memory used could put pressure on other memory consumers if not accounted for.

On the other hand, if the inode or metadata cache is too small for the activity on a given server, the cache churn could result in a performance penalty. 


*****
Below is an excerpted question and answer from the URL immediately above, which seems to indicate that many folks have asked about the way that kernel memory on large memory servers grows over time.

Q5: What if the Inuse column in the "Used in AIX kernel & extension segments" row grows over time?



A5: Growth of the Inuse column in the "Used in AIX kernel & extension segments" row may seem like a memory leak.



Take, for example, the graph (to the right) of the Inuse column in the "Used in AIX kernel & extension segments" row from the same AIX LPAR over a period of 26 days.  AIX was rebooted on January 29, about four days before the first data point was collected.  Memory used in AIX kernel & extension segments grew from 3386.81 MB on Feb 3 to 4997.13 MB on Mar 1.  It does appear that there is a memory leak, but the following command (captured on March 1):



# cat /proc/sys/fs/jfs2/memory_usage
metadata cache: 849981440
inode cache: 2129920000
total: 2979901440
#



suggests that the growth in memory used by the AIX kernel & extensions can be addressed by tuning the AIX JFS2 i-node and metadata caches.
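The output format above is simple enough to sanity-check with a few lines of code.  This is my own sketch, not an IBM tool; it just confirms the total line is the sum of the two caches:

```python
# Parse the /proc/sys/fs/jfs2/memory_usage text shown above and verify
# that total = metadata cache + inode cache (values in bytes).

def parse_jfs2_memory_usage(text):
    """Return {'metadata cache': bytes, 'inode cache': bytes, 'total': bytes}."""
    out = {}
    for line in text.splitlines():
        if ':' in line:
            key, val = line.split(':', 1)
            out[key.strip()] = int(val.strip())
    return out

sample = """metadata cache: 849981440
inode cache: 2129920000
total: 2979901440"""

usage = parse_jfs2_memory_usage(sample)
assert usage['metadata cache'] + usage['inode cache'] == usage['total']
print(usage['total'] / (1024 ** 2))  # roughly 2842 MB of kernel memory in JFS2 caches
```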



**Beware: The cat /proc/sys/fs/jfs2/memory_usage command above can crash AIX V6.1 TL0 and TL1 if the fix is not installed for APAR IZ06360: SYSTEM CRASH WHEN READING /PROC/SYS/FS/JFS2/MEMORY_USAGE (on TL0) or APAR IZ06954: SYSTEM CRASH WHEN READING /PROC/SYS/FS/JFS2/MEMORY_USAGE (on TL1).**

Wednesday, November 20, 2013

Things Fall Apart: Again my Turn to Win Some or Learn Some (AIX)

Chinua Achebe and Jason Mraz.  I'm an eclectic sasquatch, if ever there was one. 

Things Fall Apart
 A recent performance intervention did NOT go as planned.

Recently I described a fairly common performance bottleneck scenario.  CPU utilization was low and the workload was running long: low fibre channel throughput on the IBM Power LPAR was strangling CPUs that wanted to work harder.  Throughput was low due to lots of small reads and almost continual overruns of the LUN queue_depth, which was set to 16.  In addition to the small average read size and low queue_depth, there was a third ingredient in the stew: over time, the logical volume "maximum" policy for inter-physical-volume partition use had become ineffective.  After the initial, sufficient number of LUNs was nearly full, one or two LUNs were added to the VG to satisfy capacity needs, without concern for the spreading of physical partitions over multiple LUNs and disk service queues that had previously taken place.  The 16 MB physical partition size, which had served well over the old number of LUNs in rotation, gave way to hundreds of GB of logical volumes made up of consecutive 16 MB physical partitions from the same LUN, all using the same disk service queue.  Hotspot magnet.

So, the intervention plan was pretty simple:
1. Increase queue_depth from 16 to 32, which I knew the underlying SAN could sustain.
2. reorgvg to redistribute the physical partitions (and their future IO requests) more equitably among the LUNs - reorgvg after increasing queue_depth, in order to take advantage of the higher queue depth.
3. Further investigation to determine how to increase the Oracle read size - whether by addressing filesystem fragmentation, database object fragmentation, or some other factor (hdisk* and fcs* maximum transfer sizes were both already sufficient to allow the 1 MB product of a 128 multiblock read count and the 8k database block size we want to see whenever possible).
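A dry-run sketch of the commands that plan implies follows.  The hdisk names and VG name are hypothetical placeholders; on the real LPAR the queue_depth change was deferred with -P and made effective at reboot, then reorgvg was run one VG at a time.

```python
# Dry-run sketch: build (but do not run) the AIX command lines for the plan.
# hdisk names and 'datavg' are made-up examples; queue_depth=32 is from the plan.

def plan_commands(hdisks, vg, queue_depth=32):
    """Return the chdev/reorgvg command lines the plan implies."""
    cmds = [f"chdev -l {d} -a queue_depth={queue_depth} -P" for d in hdisks]
    cmds.append(f"reorgvg {vg}")
    return cmds

for c in plan_commands(['hdisk4', 'hdisk5'], 'datavg'):
    print(c)
```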

Yeah, that was the plan.  You can look at the "before" picture of the system here:
http://sql-sasquatch.blogspot.com/2013/11/low-cpu-busy-low-throughput-wasted.html

So, what actually happened?  Well... 'I almost had a heart attack' is what happened.  A sasquatch under a defibrillator is not a pretty sight :)

The queue_depth was changed on all hdisks, from 16 to 32 as recommended.  The system was rebooted to make the change effective.  The smallest vg was selected for reorgvg first, and the reorgvg completed.  Then, a few days later - everything fell spectacularly apart.  How bad?  Think average write service times displayed in iostat of up to 40 seconds.  Not 40 milliseconds. 40 full seconds.

What happened?  Well, I still don't have a conclusive diagnosis. But when IBM was engaged via a PMR, they recommended resetting the fcs1 and fcs3 devices back to their default AIX 6.1 num_cmd_elems (200) and max_xfer_size (0x100000 or 1 mb).

Interesting.  The num_cmd_elems and max_xfer_size attributes hadn't been changed in a lllooonnnggg time.  The num_cmd_elems attribute was 2048 for both devices.  A lot higher than the default of 200... but I know that CLARiiON devices used to recommend this setting on IBM Power AIX hosts.  And it's not unusual to see IBM storage accompanied by num_cmd_elems that high or even higher, since individual LUNs on many types of IBM storage can have a queue_depth of up to 256.

The max_xfer_size was a bit more unusual.  It was 0x1000000 - 16 mb.  At first I wondered if that was intentional, because lately I've only seen the default 1 mb value, or the recommended 2 mb value for those AIX versions where 2 mb results in a 128 mb DMA allocation versus 16 mb with the default max_xfer_size.  I was told that the 16 mb value had been recommended by the storage vendor.  And I don't doubt it - I found that recommendation here for the SVC (although this config didn't use an SVC):
http://pic.dhe.ibm.com/infocenter/svc/ic/index.jsp?topic=%2Fcom.ibm.storage.svc.console.doc%2Fsvc_aixprob_1dcv18.html

Interesting - increase the maximum transfer size to 16 mb, even if the hdisk maximum transfer size won't allow that large a transfer.  And that recommendation is meant to improve performance.  I bet it's the same reason that 2 mb later became a fairly common recommendation: increase the DMA allocation from the 16 mb default to 128 mb.

Anyway - somehow the doubling of the queue_depth interacted with num_cmd_elems=2048 and max_xfer_size=0x1000000 in such a way that writes just plain couldn't get through in a reasonable time.  Each minute captured in 'iostat -DlRTV' showed average write latency upwards of a full second and as high as 40 seconds.  Read latency was sometimes averaging in the seconds, but not always.  It looked like the reads were only hurting if they were stacked behind a full or partially full service queue of writes.  Eventually I asked if anything was showing up in errpt.  Oh, yeah.  FCP_ERR4, coming so fast the error log was churning and could no longer tell when the errors had started.

Fortunately, an administrator had previously made changes to fibre channel devices one at a time in order to keep the system running on the other fibre channel adapter - he did this again to correct this issue.  The fibre channel adapters were returned to the defaults of 200 and 0x100000.  Performance improved to normal levels.  The fcstat command showed that num_cmd_elems was not sufficient to avoid queuing, but performance was roughly the same as it was before the change to queue_depth.  Because I know that the AIX 7.1 default for fcs devices is num_cmd_elems=500, the attribute has since been increased, and we'll watch it real close.
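For the record, the recovery amounted to something like this - a sketch that just builds the command lines rather than running them; the adapter names and the default values (200, 0x100000) are from the account above, and -P is what lets the change be applied one adapter at a time:

```python
# Sketch: the chdev lines that return each FC adapter to the AIX 6.1 defaults.
# fcs1/fcs3 and the default values are from the story above; nothing is executed.

def fcs_default_commands(adapters):
    """Command lines to reset num_cmd_elems and max_xfer_size to defaults."""
    return [f"chdev -l {a} -a num_cmd_elems=200 -a max_xfer_size=0x100000 -P"
            for a in adapters]

for c in fcs_default_commands(['fcs1', 'fcs3']):
    print(c)
```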

Again my Turn to Win Some or Learn Some

As I said, no conclusive diagnosis from IBM yet.  But... you know, I've read this article so doggone many times... and I never scrutinized this portion before:



 ****

Note that changing max_xfer_size uses memory in the PCI Host Bridge chips attached to the PCI slots.  The sales manual, regarding the dual port 4 Gbps PCI-X FC adapter, states that "If placed in a PCI-X slot rated as SDR compatible and/or has the slot speed of 133 MHz, the AIX value of the max_xfer_size must be kept at the default setting of 0x100000 (1 megabyte) when both ports are in use. The architecture of the DMA buffer for these slots does not accommodate larger max_xfer_size settings"

If there are too many FC adapters and too many LUNs attached to the adapter, this will lead to issues configuring the LUNs.
****

Hmmm... so now I'm thinking that the maximum transfer size on the fibre channel adapters may have been lying in wait for the fateful day when aggregate LUN queue depth was increased enough... and BLAM! Performance chaos.

If so, it'll be good to know for sure.  Especially if that means increasing num_cmd_elems to cover the aggregate queue_depth is safe, as long as we leave the max_xfer_size at 1 mb (or maybe, hopefully, eke it up to 2 mb).
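The arithmetic behind "cover the aggregate queue_depth" is simple enough to write down.  The LUN count below is a made-up example; queue_depth=32 is the value from this story:

```python
# Back-of-envelope check: num_cmd_elems on an FC port should cover the
# aggregate queue_depth of the LUNs served through that port, since that is
# the number of in-flight commands the port must track if every LUN queue
# fills.  The LUN count is a hypothetical example.

def required_cmd_elems(lun_count, queue_depth):
    """Aggregate in-flight commands if every LUN service queue is full."""
    return lun_count * queue_depth

print(required_cmd_elems(30, 32))  # 30 LUNs at queue_depth=32
```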

But the situation Dan Braden describes in the article doesn't quite fit what we saw: two fibre channel adapter ports were in use - but only one from each of two cards was in use on this LPAR.  And Dan mentions 'too many FC adapters' and 'too many LUNs' - not quite what happened here when the queue_depth attribute of existing LUNs was doubled to a completely reasonable (in my mind, anyway) value of 32.

We'll see.





Monday, November 18, 2013

Pssstt... Wanna See Something Ugly? (AIX)


Sometimes a really ugly-looking system is the best place to learn some lessons!  I often tell colleagues it's easier to recommend performance optimization changes than it is to recommend the best order to implement the changes.

I cringe a little inside when I am introduced as a performance expert.  I don't consider myself an expert at much of anything.  But I learned something very valuable in Russia in 1993: when you see a long line - at least find out what it's for if you aren't going to join!

Finding the long lines is a big part of troubleshooting performance and resource utilization issues.  Then, of course, you've got to decide what to do about the lines.  And - trickiest - what if there are several long lines for different resources at the same time?  What do you tackle first?

The system that generated the stats below is a survivor!  Every server has CPU, memory, and disk - but how many of them have long blocking queues for each of them, at the same time, and live to tell the tale?

Here's what I can tell you about the system: AIX 7.1 on a Power7 micropartitioned LPAR.  Entitlement 2.  vCPUs 6, with smt4 enabled (so 24 logical CPUs).  16 gb of RAM.  All LUNs presented from IBM VIOS through vscsi, with default queue_depth of 3.

The workload: basically, thousands of threads from an outside server will connect via TCP in order to send contents which will become flatfiles in filesystems on this server.  Hundreds of threads will start on this server to transfer completed flatfiles via TCP to a different external server.





The numbers for the graph come from vmstat and iostat.  The CPU runnable task queue comes from the 'r' column of 'vmstat -It' and is plotted against the left hand axis.  The sqfull number, a rate per second, is the aggregate of sqfulls across all LUNs on the system, which I pulled out of 'iostat -DlRTV' and is also plotted against the left hand axis.  The free frame waits are exposed by 'vmstat -vs' and plotted against the right hand axis.  The 'free frame wait' number reported is an event count; in the graph it is converted to a rate per second so that it is consistent with the sqfull reporting.  All of these were collected in 30 second intervals, in order to isolate activity of syncd when needed.
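Since 'vmstat -vs' reports free frame waits as a cumulative counter, turning the samples into a per-second rate takes a little bookkeeping.  A minimal sketch of that conversion, with made-up sample values:

```python
# Convert cumulative event counts (e.g. free frame waits from 'vmstat -vs')
# sampled every 30 seconds into per-second rates, so they can be plotted
# alongside the sqfull rate from iostat.  Sample values are made up.

def counts_to_rates(samples, interval_secs=30):
    """samples: cumulative counter values in collection order -> rates/sec."""
    return [(later - earlier) / interval_secs
            for earlier, later in zip(samples, samples[1:])]

print(counts_to_rates([1000, 4000, 4000, 13000]))  # [100.0, 0.0, 300.0]
```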

This server was seriously overworked.  I got called in to troubleshoot intermittent connectivity errors.  It took me a while to figure out what was happening :) .  Setting up logging of  'netstat -ano' showed me lots of good stuff.  The qlimit for each of the listener ports was 50 - the maximum TCP backlog for pending connections.  And sometimes qlen - the  backlog queue - overflowed qlimit.  Sometimes q0len was nonzero, too.  I learned about q0len and the limit of 1.5 times qlimit here.
https://www.ibm.com/developerworks/community/forums/html/topic?id=77777777-0000-0000-0000-000014127310

Anyway... I could see that having a larger number passed into the port open command (maybe even SOMAXCONN) would allow increased tolerance - a longer TCP backlog queue before risk of rejecting an incoming connection.
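For reference, the backlog in question is the second argument to listen() - qlimit in 'netstat -ano' terms.  A minimal, portable illustration (not the actual application code on that server):

```python
# Minimal illustration of the TCP backlog knob: listen(backlog) sets how many
# pending connections the kernel will queue before refusing new ones.
# The kernel caps the value at SOMAXCONN.
import socket

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(('127.0.0.1', 0))   # port 0: let the OS pick a free port
srv.listen(128)              # a deeper backlog than the 50 discussed above

cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect(srv.getsockname())  # completes into the backlog before accept()
conn, _ = srv.accept()
conn.close(); cli.close(); srv.close()
print('ok')
```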

But... I now agree with most voices out there: if the TCP backlog queue isn't long enough to prevent errors, find out if something is slowing down the rate of incoming connection request handling BEFORE just increasing the backlog queue depth.

That's how I got to the point of making the graph above.   Honestly, once I saw the sustained queuing for CPU, disk AND memory... I was surprised there weren't a lot MORE refused incoming connections.

What to do, what to do...

First, I noticed that the LUNs all had the default queue_depth of 3.  I did some checking, and found that increasing the vscsi device queue_depth to 32 would be AOK on this system.  (Recent VIOS allows setting vscsi queue depth up to 256, but don't set it higher than the underlying VIO server device queue depth.)  That's an easy change to recommend :)

What about the free frame waits?  Those had to be killing the server.  I had to fight the urge to just ask for more RAM.  Let's see - all of the filesystems were mounted rw.  OK... if they were instead mounted rbrw, filesystem readahead would still be there to prefetch reads of the flatfiles, and there would still be IO coalescing of writes while in filesystem cache.  That's also a quick change, so I recommended it.

Now... the run queue.  That was a really, really long run queue.  Peaking at almost 1200 - that's 50 runnable tasks per logical CPU.  And the CPUs only had an entitlement of 2!  In micropartitioned LPARs, the dispatch window is entitlement/vCPUs * 10 ms.  So on this LPAR each of the 6 vCPUs had 3.3 ms of dispatch time on a physical core per 10 ms window.  Once a vCPU came off of the core because its dispatch window was over, all of its threads came with it.  Those threads may not get back on a physical core for up to half of a second!

This one took some negotiation.  The LPAR was upgraded to an entitlement of 6 with 12 vCPUs. 
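The dispatch-window arithmetic above can be written down in a couple of lines, shown here for both the before and after configurations:

```python
# Per-vCPU dispatch time in a micropartitioned LPAR: each vCPU is guaranteed
# entitlement/vCPUs of every 10 ms dispatch window on a physical core.

def dispatch_ms(entitlement, vcpus, window_ms=10):
    """Guaranteed core time per vCPU per dispatch window, in milliseconds."""
    return entitlement / vcpus * window_ms

print(round(dispatch_ms(2, 6), 1))   # before: entitlement 2, 6 vCPUs
print(round(dispatch_ms(6, 12), 1))  # after: entitlement 6, 12 vCPUs
```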

So there it was: increasing queue_depth from 3 to 32 to bring down the sqfulls, mounting filesystems rbrw to tame the free frame waits, and increasing entitlement and vcpus to bring down the runnable queue for each vcpu.

How'd it do?




The rbrw mount option was the miracle among those changes.  It got rid of free frame waits completely!  Increasing the queue_depth did a great job of bringing the sqfull rate way, way down.  And the runnable queue isn't nearly as scary.  At this point, it's no longer a systems issue - I've got ideas to improve performance at the application level - I'm pretty sure that, with a little work, I can help this workflow achieve higher total throughput with a smaller number of threads.  And if we want to get rid of the rest of the sqfulls, it can be done by using a smaller physical partition size and more equitably distributing the partitions backing each logical volume among the LUNs.

I'm not sure if I'll be able to take another pass at this to further optimize the activity or not.  See - the reason I was called in initially - the intermittent dropped connections?  Well, the TCP backlog queue of 50 is sufficient now.  The q0len and qlen for the listeners are not a problem, because with lower levels of waiting for CPU, disk, and memory... the TCP connections are always accepted before the backlog queue overflows.

I'm sure I'll have another learning opportunity soon :)