Wednesday, March 19, 2014

Here's something crazy: sight-unseen Oracle storage and parallelism recommendations ;-)

****A small update to clarify... although I do some work with SSD drives and flash both in storage arrays and onboard at the database server, the model I describe below assumes fibre channel SAN storage with an HDD config****

****Another update with a performance consideration.  Below I recommend spreading the data in chunks across the available LUNs.  With ASM, I recommend the 4 mb au size.  For other LVMs, use a ppsize/extent size NO LARGER than 16mb, to prevent queue depth overruns by many reads from parallel queries in a dense contiguous data area.  16mb in a logical volume manager rotating among the appropriate number of LUNs is workable.  Smaller is better... 8mb or 4mb.  I don't recommend going below 4mb.  Remember that the ppsize/extent size may have consequences for maximum logical volume size and/or maximum volume group size.****

I work with a crazy workload, so don't take recommendations from me without some significant testing.  The workload I tune for is 4 queries (at least) per x86 core (8 queries per Power core) with parallel_max_servers at 16 per x86 core (32 per Power core).  I also tune for SMP, not RAC.  But tuning for RAC this way can also be done...  assuming complete data level isolation among the nodes it would be the same.  RAC nodes fetching data from each other lowers the total read liability that I eventually mention below.  So my formula isn't completely linear if applied to RAC because hopefully no RAC workload has nodes that are completely data independent of each other :)

On Linux, I recommend ASM only, at least until the 3.13 kernel and the benefits of the multi-queue block layer.

****Updated to include AIX and Sparc Solaris details.  Silly of me to assume that quad core in the twitter thread meant x86 :-)****

On AIX I recommend ASM because its a portable knowledge base, and I also believe its a better bet for long-term performance capacity as databases age and experience transitions in administrators.  But JFS2 filesystems and filesystemio_options=SETALL can achieve similar level of performance.  Same LUN count as below in a volume group, use a physical partition size (ppsize) 16mb or less maximum "inter-physical volume allocation policy".

On Sparc Solaris I recommend ASM most of all.  Veritas lvm and fs can give great performance and I love the fact that its a portable skill set (HP-UX and AIX).  UFS is a solid, reliable filesystem that can perform very, very well but the number of folks that can tune UFS for maximum performance grows smaller every year.  I lover host-side ZFS for many purposes, including lightweight proof-of-concept databases, training databases, and dev/test databases outside of scalability/performance context.  If your workload is like mine, however, ZFS can only outperform UFS if you load up with so much host-side onboard flash that honestly I'd rather put to other use.  And it might not even outperform UFS then.  Whether Veritas or UFS, use wide spreading in the logical volumes across LUNs/physical volumes.  16 mb at most.  Smaller is better, down to 4 mb when possible.


For my workloads, where bandwidth and throughput are king, I scale the number of ASM disks for Oracle tablespaces based on the core count.  Eight ASM disks for tablespaces for x86 core count mod 12, minimum of 8 ASM disks for tablespaces (mod 6 for Power7 cores).  12 x86 cores? 8 ASM disks for tablespaces.  16 x86 cores? Also 8 disks.  8 x86 cores?  Also 8 disks.  60 cores (one of these new 4 socket, 15 cores per socket servers)=40 ASM disks for tablespaces.

Two ASM disks for redo logs.  Although my workload can do wonders at full fibre channel bandwidth and up to 100 ms read latency (maybe even a smidgeon more as long as bytes/sec throughput is high), I still need low write latency to the redo logs and fast log backups that don't interfere with the redo log write latency.

Now... LUN queue depth.  I assume a LUN queue depth of 32, because its the "minimum of the allowed maximums" for the various storage types I work with.  EMC VNX virtual provisioning gives maximum LUN queue depth of 32 and Hitachi USP-V/VSP, etc have maximum LUN queue depth of 32.  Some types of storage allow cranking that way up to 256!  That's great - for me that is value-add though, not a planning specification.  If I move a given database from one storage technology to another I don't want to re-work the data layout because I originally assumed and depended on a really high queue depth.

Multi-pathing?  Yes, please :)  I assume 2 paths, round robin.  Maybe even tweaking down the IO interval for switching between paths from its default.

Fibre channel adapter queue depth tends to be good on Linux - I want at least 1024 allowed outstanding IOs per HBA port.

Maximum transfer size of at least 1 mb at the LUN and adapter level.  For Oracle 11gR2 and beyond, no specified dbfmbrc so that it defaults to 128*8=1mb.

OK.  An ASM allocation unit (AU) of 4mb or smaller.  Oracle just recently began documenting the general recommendation of 4 mb au.  The idea behind it is with a 4mb ASM au, the chances for a 1mb or near 1 mb multiblock read are greater than if the au is 1mb.

OK.  So where does that leave us?  We've got a framework for host-side configuration that allows a pretty high level of throughput whether checking IOPs, bytes/second, or snapshot count of in-flight IOs.

The twitter thread that prompted this post is based on at least 80 EMC disks available to the host.  I'll assume they are 15k fibre channel disks.  For my workloads, I can get 2.4gigabytes/sec or more with about 24,000 IOPs sustained from 80 disks on a VNX using virtual provisioning and RAID5 4+1s.  I'd be tempted to push it even higher by using traditional LUNs... but the flexibility of virtual provisioning is REALLY handy.

What's gonna drive that throughput on the host side?  The number of concurrent serial queries and their in-flight IO plus the parallel queries and their in-flight IO.  The maximum in-flight read liability of parallel queries is where parallel_max_servers comes in.  I don't even remember exactly WHY but I use as assumption of 4 maximum in-flight reads per parallel server.  And I want to accommodate at least half the in-flight read liability of the parallel queries in the aggregate LUN queue depth.

So, if I've got 16 ASM LUNs for the tablespaces with 32 queue depth per LUN, aggregate queue depth is 512.  At 4 reads per parallel server, that's enough for 256 parallel servers (assuming accommodating half the liability in the aggregate queue depth).   

Wait... what?  Planning query parallelism based on the storage rather than CPU?  Yup.  But the storage configuration is informed by the CPU count.  And the CPU count was planned carefully because the CPU count determine license fees, and licenses are doggone expensive. :-)

Here's the way I think of IO bound workloads: unless the CPUs are fed fast enough by the storage, they won't be able to stretch their legs.  Unused CPU (below the 70%/80% threshold anyway) is money that's not working for me :)

Take it with a grain of salt and a lot of testing.  This type of storage and system planning works for my throughput/bandwidth sensitive workloads.  YMMV.  But if your workload is truly IO bound - rather than simply not yet tuned ;-) thinking about it this way may be valuable.

I try to model stuff this way because although tuning parallelism isn't easy, its much faster and typically lower risk to make parallelism changes (with before and after metrics captured to evaluate) than making storage changes like converting from an OS native logical volume manager to ASM or increasing the LUN count and rebalancing data.

OK... enough rambling for one morning.  Hope its helpful for someone :-)

No comments:

Post a Comment