Friday, July 19, 2013

Emulex LUN/FibreChannel Adapter and LUN queue full events in Linux

IBM AIX makes it pretty easy to monitor fibre channel LUNs and fibre channel adapters for queue full conditions.  I used to think it would be just as easy to monitor on Windows and on Linux.  Sigh.

In AIX, I use "iostat -DlRTV" all the time, and my first stop is "serv qfull" (second to last, right next to the timestamp). If I see that number increasing during my time of interest, I have a pretty good idea of what to do.  (Ideally there will  be no LUN or adapter queue fulls at all on the system.)

I also use AIX fcstat all the time.  The output is in list format, and each adapter has to be specified in the command.  I posted an awk command I use to flatten the output into csv format when I run fcstat unattended.  But... fcstat makes it doggone easy to see when the adapter queue depth is overrun. Its right there in "No Adapter Elements Count". If I see that number increasing during my time of interest, I have a pretty good idea of what to do.

Then we get to Windows.  I don't even KNOW what I'll do there.  Criminey!  It took hours to get the right utilities to increase the max transfer size on Windows... I'm too scared to even THINK about monitoring for LUN or adapter queue fulls right now.

OK... now Linux.  Oracle is pushing Linux HARD, what with Exadata and their own Unbreakable/Oracle Enterprise Linux distro.  So I expected Linux to be pretty mature in the area of disk IO queuing transparency.  I'm disappointed.  Really disappointed.

The Linux implementation of iostat will give an average queue length per LUN.  That's not what I want. Full disk IO queues can hide real easily in average queue depth, even in 15 second intervals.

There isn't a standard replacement for fcstat in Linux.  Dang it!  Okay.  That leaves HBA level and multipather level.  But its the HBA that I'm really worried about throttling Linux disk IO.  So that's where I'll go.

Start with Emulex.  Move on to QLogic when I recover.

Emulex Driver for Linux
FC and FCoE version 8.2.0.63
NIC version 2.101.374.0
iSCSI version 2.101.374.0
User Manual
http://www-dl.emulex.com/support/linux/uldk/10014/linux.pdf

Here's the thing - even when you grab "statistics" from Emulex, it seems that queue full occurrences aren't deemed important enough to include in statistics.  Really?  OK.  On to error and even logging.

elx_mes0318: Failed to allocate IOTAG. last IOTAG is <last_allocated_iotag>
That's what I am talking about.  Running out of command tags for disk IO is important enough that the driver always logs it to /var/log/messages.

What about other queue full conditions?  Like these:
elx_mes0707: Driver's buffer pool is empty, IO busied
LOG_FCP verbose
elx_mes0708: Allocation request of <num_to_alloc> command buffers did not succeed. Allocated <num_allocated> buffers.
LOG_FCP verbose
elx_mes0711: Detected queue full - lun queue depth adjusted to %d
LOG_FCP verbose

The "LOG_FCP verbose" classification means that unless the hex bitmask value for lpfc_log_verbose includes the value for LOG_FCP, you won't see the messages.  Guess what?  The guide doesn't tell you what the LOG_FCP value is, so it can be included in lpfc_log_verbose.

Gotta look around.  Eventually in a couple of HP support documents I found what I was looking for.  So, when I do targeted performance investigation on Linux, I'll be including LOG_FCP = 0x40 in lpfc_log_verbose, and I'll take it back out when I'm done.  Because I think this stuff is important.  And I think there are probably lots of systems getting hit by queuing issues... and not necessarily aware of it.

#lpfc_log_verbose to capture Emulex queue full info per http://h10025.www1.hp.com/ewfrf/wc/document?cc=us&lc=en&dlc=en&docname=c02071793
#LOG_FCP=0x40 per
#http://h30507.www3.hp.com/t5/Technical-Support-Services-Blog/Enable-verbose-debugging-with-Emulex-and-Qlogic-on-Linux/ba-p/89957#.Uem9fY2W-So
echo 0x40 > /sys/class/scsi_host/host[X]/lpfc_log_verbose

That should allow me to look for the following messages in /var/log/messages:
elx_mes0318: Failed to allocate IOTAG. last IOTAG is <last_allocated_iotag>
Always
elx_mes0707: Driver's buffer pool is empty, IO busied
LOG_FCP verbose
elx_mes0708: Allocation request of <num_to_alloc> command buffers did not succeed. Allocated <num_allocated> buffers.
LOG_FCP verbose
elx_mes0711: Detected queue full - lun queue depth adjusted to %d
LOG_FCP verbose

We'll see how that goes.

*********

Ohhh...

http://h30507.www3.hp.com/t5/Technical-Support-Services-Blog/Enable-verbose-debugging-with-Emulex-and-Qlogic-on-Linux/ba-p/89957#.Uem9fY2W-So

WARNING:
Depending on the LPFC driver, the logging level definitions may differ... Always check ./lpfc_attr.c for latest table

No comments:

Post a Comment