Monday, November 18, 2013

Pssstt... Wanna See Something Ugly? (AIX)


Sometimes a really ugly-looking system is the best place to learn some lessons!  I often tell colleagues it's easier to recommend performance optimization changes than it is to recommend the best order to implement them.

I cringe a little inside when I am introduced as a performance expert.  I don't consider myself an expert at much of anything.  But I learned something very valuable in Russia in 1993: when you see a long line - at least find out what it's for if you aren't going to join!

Finding the long lines is a big part of troubleshooting performance and resource utilization issues.  Then, of course, you've got to decide what to do about the lines.  And - trickiest - what if there are several long lines for different resources at the same time?  What do you tackle first?

The system that generated the stats below is a survivor!  Every server has CPU, memory, and disk - but how many of them have long blocking queues for each of them, at the same time, and live to tell the tale?

Here's what I can tell you about the system: AIX 7.1 on a POWER7 micropartitioned LPAR.  Entitlement of 2, 6 vCPUs with SMT4 enabled (so 24 logical CPUs), and 16 GB of RAM.  All LUNs presented from IBM VIOS through vscsi, with the default queue_depth of 3.
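For reference, a quick sketch of how that kind of configuration can be confirmed from the LPAR itself (hdisk4 below is just an example device name):

# entitlement, online virtual CPUs, memory
lparstat -i
# SMT mode
smtctl
# the vscsi disks, and the queue_depth on one of them
lsdev -Cc disk
lsattr -El hdisk4 -a queue_depth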

The workload: basically, thousands of threads from an outside server connect via TCP to send content that becomes flatfiles in filesystems on this server.  Hundreds of threads start on this server to transfer completed flatfiles via TCP to a different external server.





The numbers for the graph come from vmstat and iostat.  The CPU runnable task queue comes from the 'r' column of 'vmstat -It' and is plotted against the left-hand axis.  The sqfull number, a rate per second, is the aggregate of sqfulls across all LUNs on the system, pulled out of 'iostat -DlRTV', and is also plotted against the left-hand axis.  The free frame waits are exposed by 'vmstat -vs' and plotted against the right-hand axis.  The free frame wait number reported is a cumulative event count; in the graph it is converted to a rate per second so that it is consistent with the sqfull reporting.  All of these were collected at 30 second intervals, in order to isolate the activity of syncd when needed.
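For anyone who wants to collect the same raw numbers, here's a minimal sketch of the collection (output file names and sample counts are just placeholders; since the free frame wait counter is cumulative, the per-interval deltas get computed afterward):

# roughly an hour of 30 second samples
vmstat -It 30 120 > vmstat_It.out &
iostat -DlRTV 30 120 > iostat_DlRTV.out &
# free frame waits is a cumulative counter - snapshot it every 30 seconds
while true
do
  date >> vmstat_vs.out
  vmstat -vs >> vmstat_vs.out
  sleep 30
done &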

This server was seriously overworked.  I got called in to troubleshoot intermittent connectivity errors.  It took me a while to figure out what was happening :).  Setting up logging of 'netstat -ano' showed me lots of good stuff.  The qlimit for each of the listener ports was 50 - the maximum TCP backlog for pending connections.  And sometimes qlen - the current length of the backlog queue - overflowed qlimit.  Sometimes q0len was nonzero, too.  I learned about q0len and the limit of 1.5 times qlimit here:
https://www.ibm.com/developerworks/community/forums/html/topic?id=77777777-0000-0000-0000-000014127310
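The logging itself was nothing fancy - something along these lines, with the output file name just a placeholder (filter for your own listener ports however you like):

# snapshot the socket queues (q0len, qlen, qlimit) every 30 seconds
while true
do
  date >> netstat_ano.out
  netstat -ano >> netstat_ano.out
  sleep 30
done &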

Anyway... I could see that having a larger backlog passed in when the listener port is opened (maybe even SOMAXCONN) would allow increased tolerance - a longer TCP backlog queue before risk of rejecting an incoming connection.
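Worth remembering: whatever backlog the application asks for is also capped by the somaxconn network option, so anyone going down that road should check both the application side and the tunable (the value below is just an example):

# current ceiling on the listen() backlog
no -o somaxconn
# raise it (example value); -p also records it for the next boot
no -p -o somaxconn=4096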

But... I now agree with most voices out there: if the TCP backlog queue isn't long enough to prevent errors, find out if something is slowing down the rate of incoming connection request handling BEFORE just increasing the backlog queue depth.

That's how I got to the point of making the graph above.   Honestly, once I saw the sustained queuing for CPU, disk AND memory... I was surprised there weren't a lot MORE refused incoming connections.

What to do, what to do...

First, I noticed that the LUNs all had the default queue_depth of 3.  I did some checking and found that increasing the vscsi device queue_depth to 32 would be AOK on this system.  (Recent VIOS levels allow setting the vscsi device queue_depth as high as 256, but don't set it higher than the queue_depth of the underlying device on the VIO server.)  That's an easy change to recommend :)
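The mechanics are one chdev per hdisk - a sketch like this, with the hdisk name as an example, and the -P flag to defer the change since the disks were busy:

# check the current value
lsattr -El hdisk4 -a queue_depth
# change it, deferred until the device can be reconfigured (reboot, or varyoff/varyon)
chdev -l hdisk4 -a queue_depth=32 -P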

What about the free frame waits?  Those had to be killing the server.  I had to fight the urge to just ask for more RAM.  Let's see - all of the filesystems were mounted rw.  OK... if they were instead mounted rbrw (release-behind for both reads and writes), filesystem readahead would still be there to prefetch reads of the flatfiles, and there would still be IO coalescing of writes while in filesystem cache - but the cached pages would be released once the IO completed, instead of piling up and driving free frame waits.  That's also a quick change, so I recommended it.
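On JFS2 that's just a mount option - roughly like this sketch, with /flatfiles being a made-up filesystem name, and an unmount/remount needed for it to take effect:

# record release-behind read/write in /etc/filesystems
chfs -a options=rbrw /flatfiles
# remount so the option takes effect
umount /flatfiles
mount /flatfiles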

Now... the run queue.  That was a really, really long run queue.  Peaking at almost 1200 - that's 50 runnable tasks per logical CPU.  And the LPAR only had an entitlement of 2!  In micropartitioned LPARs, each vCPU's share of the 10 ms dispatch window is entitlement/vCPUs * 10 ms.  So on this LPAR each of the 6 vCPUs had about 3.3 ms of dispatch time on a physical core per 10 ms window.  Once a vCPU came off the core because its dispatch window was over, all of its threads came with it.  Those threads might not get back on a physical core for up to half a second!

This one took some negotiation.  The LPAR was upgraded to an entitlement of 6 with 12 vCPUs. 
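From the HMC side that's a DLPAR add - something like the sketch below, where the managed system and partition names are placeholders and the partition profile has to allow the new maximums:

# add 4 processing units and 6 virtual processors to the running partition
chhwres -r proc -m managed_system -o a -p lpar_name --procunits 4 --procs 6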

So there it was: increasing queue_depth from 3 to 32 to bring down the sqfulls, mounting the filesystems rbrw to tame the free frame waits, and increasing entitlement and vCPUs to bring down the runnable queue for each vCPU.

How'd it do?




The rbrw mount option was the miracle among those changes.  It got rid of free frame waits completely!  Increasing the queue_depth did a great job of bringing the sqfull rate way, way down.  And the runnable queue isn't nearly as scary.  At this point, it's no longer a systems issue - I've got ideas to improve performance at the application level.  With a little work, I'm pretty sure I can help this workflow achieve higher total throughput with a smaller number of threads.  And if we want to get rid of the rest of the sqfulls, that can be done by using a smaller physical partition size and distributing the partitions backing each logical volume more evenly among the LUNs.
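If we ever do chase those remaining sqfulls, checking the spread is straightforward - a sketch, with the lv and vg names as placeholders (the physical partition size itself can only be set when the volume group is created):

# how many partitions of the LV sit on each hdisk
lslv -l flatfile_lv
# set the inter-physical volume allocation policy to maximum (spread across disks)
chlv -e x flatfile_lv
# redistribute the existing partitions according to the policy
reorgvg flatfile_vg flatfile_lv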

I'm not sure if I'll be able to take another pass at this to further optimize the activity or not.  See - the reason I was called in initially - the intermittent dropped connections?  Well, the TCP backlog queue of 50 is sufficient now.  The q0len and qlen for the listeners are not a problem, because with lower levels of waiting for CPU, disk, and memory... the TCP connections are always being accepted before the backlog queue overflows.

I'm sure I'll have another learning opportunity soon :)
