Thursday, November 14, 2013

Low CPU Busy + Low Throughput + Wait queued IO = Wasted Oracle/SQL Server Licenses

Whether the system is running Oracle or SQL Server, chances are that database licensing costs are among the greatest system expenses, if not the greatest.  Oracle and SQL Server are both licensed per core now - so if the server is not making the CPUs work, licensing dollars might be going down the drain.

I work primarily with workloads that are very data-hungry - DSS and data warehouse style loads.

Let me share some graphs from an Oracle system on IBMPower AIX, in hopes of putting to bed this idea that read service time is the primary storage performance metric for all database systems whether OLTP or not.

The following graphs come from iostat (-DlRTV, my favorite) and vmstat (us+sy for CPUBusy).
OK... average read latency in ms and bytes read per second across all LUNs on the system.  This graph makes sense, right?  As more data is read, there is slightly increased read latency - because the storage subsystem is working harder.  That's normal.  And the system looks pretty doggone healthy - the read service times stay under 20 ms and are usually under 15 ms.  Great performance!

How about another graph that also tells us how good the system performance is?  Here we see read operations per second against average read service time in ms.

Sure.  Just like the other graph - making the system work harder by doing more read IOs, so its natural for the latency to rise.  But, at least the service time stays low.

Since licenses are expensive - how does CPU busy look on the system?

Huh. There are brief spikes about once an hour for AWR snapshot generation - those are easy to pick out later in the morning but a bit obscured before 3:30 am.  Take those out... and for a DSS system the Oracle licenses aren't really being given much of a workout.  What gives?  Maybe there isn't much work to do?  Or maybe the work is being throttled?

Nope.  On DSS or data warehouse systems, lets look at this assertion that "its all about service time."

What is the relationship between the service time and the CPU utilization?

That's interesting, I guess.  What do you think it means? Does it tell us what is necessary to push the CPUs harder and get more value out of the Oracle licenses?  Not really.  Not unless you think that somehow _increasing_ read service times is going to increase CPU utilization and increase the rate of database work. :)  I haven't yet seen anyone advance THAT argument.

We did, however, see a strong correlation between read service times and throughput, and that really seemed to make sense.  What about the relationship between throughput and CPU utilization?

Aha!  Ok, can definitely see that CPU busy and read bytes per second trend together.

So the co-trending of CPU and throughput is still evident in this graph.  The variable Oracle database read size makes the range of reads per second much tighter than the range of read bytes per second.  This happens in SQL Server as well.  I generally tune Oracle systems for a maximum multiblock read of 1 mb, and SQL Server systems for a readahead read of 2mb.  SQL Server is a little tricky in this regard, because far fewer Windows than UNIX administrators know about tuning the HBA maximum transfer size to allow large single read requests.

Anyway, now we are at a place that makes sense.  If more reads per second and/or more read bytes/second occur, there's a good chance that more CPU will be utilized and more value realized from the Oracle licenses.  But... maybe that's all the read requests that could be made given the work at the time?

That's where my good friend sqfull comes in.  Service queue full.  Each LUN has a queue_depth attribute, the maximum number of in-flight read and write requests for the LUN.  When the service queue is full, the next read or write request goes into the wait queue, and an sqfull event is counted.  The iostat utility displays an sqfull/second rate, in the same way that it displays reads/writes/transfers per second.  In the olden days the sqfull output was an event counter, it was changed for the sake of consistency.

OK... well are many IO requests hitting the LUNs when the service queue is full?

¡Ay, caramba! Nearly a 1 to 1 relationship between reads on the system and IO requests that hit full service queues.  That could be a significant throttle to throughput.

In fact... even though the service times looked good in the first few graphs... I guess it is kinda strange that the peak read throughput was 250 mb/second, and that the average was far below that.

Now, most references out there won't go much beyond advising to increase queue_depth in cases like this.  I never shy away from increasing queue_depth when appropriate... but I've got other tricks up my sleeve, too.  I'll have to share them another day.  If I've convinced you that some database systems have a LOT more to worry about than low read service times, maybe you'll come back :)

No comments:

Post a Comment