Monday, December 30, 2013

How much IBM AIX JFS2 Filesystem RAM Overhead? j2_inodeCacheSize & j2_metadataCacheSize

The biggest takeaways from the stuff below:
AIX 6.1 defaults can result in ~14% of server memory eventually used for JFS2 inode and metadata cache.
AIX 7.1 defaults can result in ~7% of server memory eventually used for JFS2 inode and metadata cache.

These details come from the location above.

j2_inodeCacheSize
range = 1:1000
previous default = 400 (~10% server memory)
AIX 7.1 default = 200 (~5% server memory)
maximum value of 1000 = ~25% server memory

j2_metadataCacheSize
range = 1:1000
previous default = 400 (~4% server memory)
AIX 7.1 default = 200 (~2% server memory)
maximum value of 1000 = ~10% server memory
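The eventual footprint appears to scale linearly with the tunable value, so the numbers above can be turned into a quick back-of-the-envelope estimate. This is my own arithmetic based on the proportions listed, not an IBM-published formula:

```python
# Rough estimate of eventual JFS2 cache footprint, from the proportions above:
# j2_inodeCacheSize=1000 ~ 25% of RAM, j2_metadataCacheSize=1000 ~ 10% of RAM,
# scaling linearly with the tunable value (range 1..1000).

def jfs2_cache_estimate_gb(ram_gb, inode_setting=200, metadata_setting=200):
    """Approximate eventual inode + metadata cache size in GB.

    Defaults reflect AIX 7.1 (200/200); AIX 6.1 used 400/400.
    Back-of-the-envelope arithmetic, not an IBM formula.
    """
    inode_gb = ram_gb * 0.25 * (inode_setting / 1000)
    metadata_gb = ram_gb * 0.10 * (metadata_setting / 1000)
    return inode_gb + metadata_gb

# A 256 GB LPAR at AIX 6.1 defaults (400/400): ~25.6 + ~10.2 = ~35.8 GB (~14%)
print(jfs2_cache_estimate_gb(256, 400, 400))
# The same LPAR at AIX 7.1 defaults (200/200): ~17.9 GB (~7%)
print(jfs2_cache_estimate_gb(256))
```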

The metadata and inode caches never shrink.  Once a cache reaches its configured size, existing entries are retired so new entries can be cached.

If these cache portions are larger than a given server needs, the caches will continue to grow slowly until reaching their maximum size.  The gradual increase in kernel memory use can put pressure on other memory consumers if not accounted for.

On the other hand, if the inode or metadata cache is too small for the activity on a given server, cache churn can result in a performance penalty.

Below is an excerpted question and answer from the URL immediately above, which seems to indicate that many folks have asked about the way that kernel memory on large memory servers grows over time.

Q5: What if the Inuse column in the "Used in AIX kernel & extension segments" row grows over time?

A5: Growth of the Inuse column in the "Used in AIX kernel & extension segments" row may seem like a memory leak.

Take, for example, the graph (to the right) of the Inuse column in the "Used in AIX kernel & extension segments" row from the same AIX LPAR over a period of 26 days. AIX was rebooted on January 29, about four days before the first data point was collected. Memory used in AIX kernel & extension segments grew from 3386.81 MB on Feb 3 to 4997.13 MB on Mar 1. It does appear that there is a memory leak, but the following command (captured on March 1):

# cat /proc/sys/fs/jfs2/memory_usage

metadata cache: 849981440

inode cache: 2129920000

total: 2979901440


suggests that the growth in memory used by the AIX kernel & extensions can be addressed by tuning the AIX JFS2 i-node and metadata caches.

**Beware: The cat /proc/sys/fs/jfs2/memory_usage command above can crash AIX V6.1 TL0 and TL1 if the fix is not installed for APAR IZ06360: SYSTEM CRASH WHEN READING /PROC/SYS/FS/JFS2/MEMORY_USAGE (on TL0) or APAR IZ06954: SYSTEM CRASH WHEN READING /PROC/SYS/FS/JFS2/MEMORY_USAGE (on TL1).**
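The memory_usage output above is simple enough to parse and sanity-check if you're collecting it over time. A hypothetical helper (my own sketch, keyed to the field names in the sample output above):

```python
# Hypothetical parser for /proc/sys/fs/jfs2/memory_usage output, as sampled
# in the Q&A above. Field names ("metadata cache", "inode cache", "total")
# are taken from that sample.

def parse_jfs2_memory_usage(text):
    """Return a dict mapping cache name -> bytes from memory_usage output."""
    usage = {}
    for line in text.splitlines():
        if ":" in line:
            name, _, value = line.partition(":")
            usage[name.strip()] = int(value.strip())
    return usage

sample = """metadata cache: 849981440
inode cache: 2129920000
total: 2979901440"""

usage = parse_jfs2_memory_usage(sample)
# Sanity check: the reported total is the sum of the two caches
assert usage["total"] == usage["metadata cache"] + usage["inode cache"]
print(usage["total"] / 1024**2)  # ~2842 MB of kernel memory in the JFS2 caches
```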

Thursday, December 26, 2013

Perfmon "Current Disk Queue Length" to LUN w/single Tx Log; writes bound by 32 per txlog limit

I was going to ask this question on #sqlhelp: Are #SQLServer outstanding txlog writes limited by the number of log buffers per txlog?  If so, is the number of buffers per txlog fixed or configurable?

But by chance I found the answer:
Slide 6 of this deck from Microsoft's Ewan Fairweather addressed the question.
Slide 6 says that the transaction log has 127 linked buffers and allows 32 outstanding IOs.
**Update - I've since learned that there are 128 txlog buffers per SQL Server txlog, with a maximum of 32 of them allowed for async inflight writes to the transaction log. End update** 
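The limits described above can be modeled with a toy sketch (this is my own illustration, not SQL Server code): filled log buffers queue up, but no more than 32 async writes are ever in flight at once.

```python
# Toy model (not SQL Server internals) of the txlog write limits above:
# up to 128 in-memory log buffers per txlog, at most 32 async writes in flight.

from collections import deque

TOTAL_LOG_BUFFERS = 128     # buffers per transaction log
MAX_INFLIGHT_WRITES = 32    # cap on concurrent async writes to the log file

def issue_writes(pending_buffers, inflight):
    """Move buffers from pending to in-flight without exceeding the cap."""
    issued = []
    while pending_buffers and len(inflight) < MAX_INFLIGHT_WRITES:
        issued.append(pending_buffers.popleft())
        inflight.append(issued[-1])
    return issued

pending = deque(range(100))  # 100 filled buffers waiting to be hardened
inflight = []
issue_writes(pending, inflight)
# Outstanding writes to the log LUN can never exceed 32 in this model
print(len(inflight))  # 32
```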

Ewan's presentation seems to be focused on SQL Server 2008R2/Windows 2008 R2 - but that's fine because so am I at the moment :)

Paul Randal also mentions the 32 outstanding write IOs per log file in his post "Trimming More Transaction Log Fat".

In the graph below, I want to drive up CPU utilization.  I've already made sure that the current CPU utilization is almost all SQL Server with very little other CPU consumed, and that there is a strong correlation between CPU utilized and logical reads on the system.  So to drive up the rate of database work, I've got to drive up CPU utilization - don't want to waste SQL Server licenses :) .

There are two main databases on this SQL Server instance, and each has data files spread over multiple volumes/LUNs.  Tempdb unfortunately has multiple data files (since the server has 48 logical CPUs) all residing on the same LUN.  So tempdb suffers from occasionally stressing its IO queue, as is evident below.  Windows will allow a total of 256 IO requests per LUN, either in the HBA service queue for the LUN or in a Windows wait queue for the LUN.  Tempdb gets close to that limit :)  It actually used to hit that limit for a good chunk of time... so the HBA maximum transfer size was increased from the Windows default of 512 KB to 2 MB.  That eased some queuing pressure on tempdb and on other LUNs as well.

Current Disk Queue Length and %CPU Busy

The somewhat surprising element to me was that, with the exception of a single operation which occurs at 9:35 am (I think this is a transaction log expansion, though it could be the wrap-around), activity to the LUNs here artfully named "Logs Vol1" and "Logs Vol2" has a maximum value of 32 for "Current Disk Queue Length".  **update 12/30/2013** I previously thought that the peak in activity for one of the LUNs was due to transaction log activity like expansion/wraparound/shrink.  Probably not in this case.  It's more likely that the LUN in question hosted more than just the single transaction log, and some of those outstanding IOs were for other LUN contents. **end update**

Current Disk Queue Length and %CPU Busy

That might be easier to see if I shrink the time period a bit and shrink the scale of the axis for "Current Disk Queue Length".  Of course, my choice of colors might make it unreadable instead - I'm never sure.

In this case, 32 happens to be the service queue depth for these LUNs.  (It's the default service queue depth in most cases.)  But I suspect that SQL Server is not basing its activity on the service queue depth for the LUNs - rather, I think something like the number of uncommitted transaction log buffers allowed is actually the limit.  Why wrestle to understand the difference?  Well... if I restricted the service queue depth to 16 instead of 32, and SQL Server is limiting its behavior based on outstanding transaction log buffers, the "Current Disk Queue Length" would continue looking the same, even though write latency measured by perfmon and overall performance would likely degrade.  On the other hand, if SQL Server really is taking the LUN service queue depth into account... if this were a SQL Server instance on top of IBM XIV storage, for example, I could probably crank the queue depth WAY up.  The XIV allows lots of outstanding IOs per LUN... and although a really long service queue to an XIV may increase average service time, in my case it's all about throughput.  So increasing throughput is acceptable even at the expense of average latency.
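The reason the counter can't distinguish the two limits is worth spelling out: perfmon "Current Disk Queue Length" counts IOs in the HBA service queue plus IOs waiting in the Windows queue for the LUN. A minimal sketch of that reasoning (my own model, not SQL Server or Windows internals):

```python
# Sketch of the reasoning above: perfmon "Current Disk Queue Length" counts
# IOs in the HBA service queue PLUS IOs waiting in the Windows queue,
# so it can't distinguish the writer's cap from the LUN's queue depth.

def current_disk_queue_length(outstanding_writes, service_queue_depth):
    in_service = min(outstanding_writes, service_queue_depth)
    waiting = outstanding_writes - in_service
    return in_service + waiting  # perfmon sees both

# SQL Server capping itself at 32 outstanding txlog writes, depth 32:
print(current_disk_queue_length(32, 32))  # 32
# Drop the LUN service queue depth to 16: perfmon still shows 32,
# even though only 16 IOs are actually in the HBA service queue.
print(current_disk_queue_length(32, 16))  # 32
```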

But, according to Ewan's presentation (and Paul Randal's post)... 32 outstanding writes per txlog is the limit, so I've got to find a way to bring down the write latency with that in mind.

Thursday, December 19, 2013

VMWare vCPU dispatch overload without high perfmon %Processor Time or ESX %RDY...

The system for the graph above is a 1 vCPU Windows Server 2008 R2 VMWare guest. %CPU busy and the processor run queue are plotted against the black vertical axis on the left, and total thread count from perfmon against the blue axis on the right.

I was watching an ETL process... we'd already optimized the sending system.  Times looked great... at first.  Suddenly the times for the ETL stretched out and became even worse than before optimization!  My first thought was to look at the receiving database system.  Nothing stood out, but there are always things to optimize... maybe reduce IO queuing by increasing max transfer size... I noticed that total throughput on the server seemed capped at 800 MB/second with queued IO and CPUs with plenty of idle time... but those factors were true before.

Finally I started looking at the middleman - a small VM that grabs the extract files from the source system and manages their import into the target system.  I'd actually forgotten it was a VM in this case... I'm a dinosaur, so I always assume a physical server until it becomes apparent it's virtual.

I started staring at disk queuing first out of habit.  Nothing out of the ordinary - in fact that was a little surprising.  I had expected some disk queuing due to the number of sending threads which had increased after optimizing the sending system.  And each sending thread should have been able to increase its send rate, as well.

But the "current disk queue length" perfmon counter was completely unremarkable... even though more than 40 writing threads coming from the source system should at times have overlapped with 30 or more reading threads sending to the target system.

But I guess the processor run queue, which reaches up to 102 threads for the lonely single vCPU, would explain why the level of disk queuing was lower than I expected... and also potentially why performance was slower than before optimizing the sending system.

Not sure yet if a change was introduced to this VM (perhaps someone reduced vCPU based on prior utilization?), or maybe the optimization on the sending system caused a scheduling/interrupt handling overload for the hypervisor or guest.

Interestingly, the virtualization host would not necessarily show this as a problem.  The vCPU might be getting all of its allotted time on a host server hardware thread.  But with that many threads to switch on and off the vCPU, and trying to manage the interrupts from the network adapter that the ETL was keeping busy... there certainly was a problem.  If only CPU busy in perfmon was monitored without also the processor run queue... the problem might not be apparent. 

Good thing I'm always watching the queues :)

Hopefully cranking up to at least one more vCPU will lead to an improvement.  I hate it when my optimizations end up in a net loss :)

Gotta give you a link, right? Here's a great one on this topic, which applies to Windows dispatch overload whether the CPUs are physical or virtual.

Measuring Processor Utilization and Queuing Delays in Windows applications