Sunday, February 2, 2014

Asynchronous Replication does NOT mean decoupled from Primary System Perf/Scalability


A growing number of the organizations that I work with are implementing robust disaster recovery plans for SQL Server and Oracle projects.  For that, I'm glad.  There is a misconception that I want to address, though: asynchronous replication is NOT without performance and scalability considerations for the primary system.  In the last few weeks, I've worked with database systems using enterprise storage arrays and asynchronous replication from two different vendors.  In both cases there was primary system impact: in one case, performance degraded for the primary database host; in the other, high system resource utilization degraded performance for many tenants of shared resources.  In both cases there was surprise that asynchronous replication could result in performance or scalability drag on the systems involved.

I'll refer to EMC SRDF/A documentation here... not because this problem is unique to SRDF/A, or even more frequently encountered on SRDF/A than on other array-level replication.  No - the reason I'll reference SRDF/A is that, at 412 pages, the document I'll refer to is one of the most complete you'll find on asynchronous replication.  EMC takes SRDF/A and SRDF/S extremely seriously.  Even if you are implementing a different asynchronous replication mechanism, this resource can help identify the design and workload components that must be considered for asynchronous replication to succeed within project goals.

EMC SRDF/A and SRDF/A Multi-Session Consistency on UNIX and Windows TechBook Version 1.7
Part Number H2554.7; 7 MB PDF (2010)

The Symmetrix Cache Management section (pages 254-256) includes passages about both logical device and system-wide write pending limits.  This document was written with Enginuity 5671 in mind.  The limits themselves may change with different versions of Enginuity, and methods of cache and bandwidth resource management/partitioning may change over time as well.  The key is: since SRDF/A is cache-based asynchronous replication, sustained high write pending on the source OR target system can result in throttled write activity for the primary system host.

The coupling of primary and secondary system performance and scalability during asynchronous replication is not unique to SRDF/A among array-based mechanisms.  Hitachi Universal Replicator (HUR) asynchronous replication also has a maximum tolerance, as does asynchronous replication in SVC/V7000 configurations.  Database-level asynchronous replication has similar limitations, although the system resource considerations are different from those for array-based replication.

All asynchronous replication relies on buffering primary system activity on the primary before transmission to the secondary system, and on buffering on the secondary system before applying/destaging write contents.  In some SAN replication facilities, such as SRDF/A, both the primary system buffering before transmission and the secondary system buffering before application take place in cache.  Other replication facilities, such as SRDF/AR, or database-level asynchronous replication such as Oracle Data Guard Maximum Performance or SQL Server Always On Asynchronous Commit, use file system or disk-based buffering on the primary and/or secondary system.
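
As a back-of-the-envelope sketch of why buffering only defers the problem rather than removing it: if the primary generates writes faster than they can be transmitted or destaged, the buffer fills in a predictable amount of time.  The function name, the constant-rate model, and the numbers below are all hypothetical illustrations, not taken from any vendor's documentation.

```python
def seconds_until_buffer_full(buffer_mb, write_mb_per_s, drain_mb_per_s):
    """Return seconds until a replication buffer saturates, or None if it never does.

    Hypothetical model: writes arrive at a constant rate and are drained
    (transmitted or destaged) at a constant rate. Real workloads are bursty.
    """
    surplus = write_mb_per_s - drain_mb_per_s
    if surplus <= 0:
        return None  # drain keeps up; the buffer never fills in this model
    return buffer_mb / surplus


# Illustrative numbers only: 8 GB of buffer, 300 MB/s of writes, 200 MB/s of drain.
print(seconds_until_buffer_full(8192, 300, 200))  # 81.92
```

In this toy case a sustained 100 MB/s surplus exhausts 8 GB of buffer in under a minute and a half, whether that buffer lives in cache or on disk.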

Whether activity is buffered in cache or on disk, the dilemma remains: once the buffer resource is fully consumed, should replication cease - or should the primary system write activity be throttled in order to allow the replication mechanism to catch up?

Some asynchronous replication mechanisms allow a choice of response to full buffer resource conditions.  Some don't.  
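
The two possible responses to a full buffer can be sketched with a toy simulation.  This is a minimal model under stated assumptions: the function `simulate`, the policy names "throttle" and "suspend", and the unit-less tick model are all hypothetical and do not represent any vendor's actual algorithm.

```python
def simulate(policy, offered_writes, buffer_cap, drain_per_tick):
    """Toy model of a bounded replication buffer (all names are hypothetical).

    policy: "throttle" holds back host writes when the buffer is full;
            "suspend" accepts all host writes but drops replication when full.
    Returns (write units accepted from the host, whether replication is still active).
    """
    buffered = 0
    accepted = 0
    replicating = True
    for offered in offered_writes:
        if not replicating:
            accepted += offered  # no replication, so no buffer constraint
            continue
        # Drain first: ship up to drain_per_tick units to the secondary.
        buffered = max(0, buffered - drain_per_tick)
        room = buffer_cap - buffered
        if offered <= room:
            buffered += offered
            accepted += offered
        elif policy == "throttle":
            # The host only gets as much write throughput as the buffer has room for.
            buffered += room
            accepted += room
        else:
            # Buffer exhausted: stop replicating and let the host run freely.
            replicating = False
            buffered = 0
            accepted += offered
    return accepted, replicating


burst = [10] * 10  # host offers 10 write units per tick for 10 ticks
print(simulate("throttle", burst, buffer_cap=20, drain_per_tick=5))  # (65, True)
print(simulate("suspend", burst, buffer_cap=20, drain_per_tick=5))   # (100, False)
```

With throttling, the host completes only 65 of the 100 offered write units in the window but the secondary stays protected; with suspension, the host completes all 100 but replication has stopped and must be re-established later.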

If you implement asynchronous replication, understand the system considerations for the buffer resources on primary and secondary, as well as the consequences for replication and throttling as buffer resources saturate.
