Friday, August 2, 2013

∃ x ∈ HA: x ∈ DR

It's been a long time since my advanced calc class.

∃ x ∈ HA: x ∈ DR
What on earth does that mean?

It means there exists an element x of HA such that x is also an element of DR.  That's how I make sense of HA ≠ DR, which is oft-quoted and in many cases applied in ways that are simply not accurate.

DR technology and HA technology are sets, rather than mutually exclusive attributes of an individual technology.  I'm aware of no definitions for the DR and HA technology sets which preclude their intersection.  So can a particular technology be a member of both sets?  Yes.  In fact, there are many such examples.

Name the database platform, and there is quite likely a continuous asynchronous replication method at the database level, based on send and apply of database logs to a secondary system.  Very likely also an asynchronous batched or delta set replication.  Maybe even a manner of synchronous transaction replication.  Oracle has Data Guard.  SQL Server has AlwaysOn and log shipping.  InterSystems Caché has shadowing and mirroring.

I like database replication for databases, for reasons I'll detail some other day.  But it bears mentioning that storage subsystem components have similar offerings.  EMC VMAX has synchronous and asynchronous data replication.  Hitachi has similar TrueCopy options.  IBM storage offers Global Mirror and Metro Mirror.  And the list goes on.

All of these technologies can be used to provide local data availability (and thus be part of a local high availability design, with or without automatic takeover).  These technologies can also provide replication to a secondary site.  With enough geographic separation to protect from the natural disasters within design scope (e.g. hurricanes, tornadoes, earthquakes, flooding), any method of getting your data to the secondary site... even sneakernet of a tape backup... provides a measure of disaster recovery for the data.
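The overlap can be sketched with plain Python sets.  The membership lists here are illustrative assumptions only (they depend entirely on how each technology is deployed), not a definitive classification of any product.

```python
# Illustrative membership only: whether a technology lands in HA, DR, or both
# depends on how it is deployed. These lists are assumptions for the sketch.
HA = {"Data Guard", "AlwaysOn", "Cache mirroring", "SRDF", "Metro Mirror"}
DR = {"Data Guard", "AlwaysOn", "SRDF", "Global Mirror", "sneakernet tape backup"}

# The intersection: technologies that can serve both purposes.
both = HA & DR

# The statement in the post title: there exists x in HA such that x is in DR.
exists_x = any(x in DR for x in HA)

# HA != DR still holds: the sets overlap, but they are not the same set.
sets_differ = HA != DR

print(sorted(both))
print(exists_x, sets_differ)
```

Both statements hold at once: the sets are not equal, and their intersection is not empty.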

Beyond that, it's about thresholds, objectives, and the faults/disasters that are within scope.  Thresholds?  I think that more and more folks should be talking about MPOD - maximum period of disruption*.  Objectives: 1) RTO - recovery time objective (the goal for how much service time can be lost to a fault/disaster) 2) RPO - recovery point objective (the goal for how much data, measured in time leading up to the disaster/failure, can be lost on recovery).
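As a minimal sketch of how the two objectives are measured against a single event: the function name, parameter names, and timestamps below are all hypothetical, chosen only to show the arithmetic.

```python
from datetime import datetime, timedelta

def achieved_vs_objectives(last_recoverable, failure, restored, rpo, rto):
    """Compare what one recovery actually achieved against RPO and RTO.

    last_recoverable: most recent point in time the data can be recovered to
    failure:          when the fault/disaster hit
    restored:         when service was available again
    """
    data_loss = failure - last_recoverable   # measured against RPO
    downtime = restored - failure            # measured against RTO
    return {
        "data_loss": data_loss,
        "downtime": downtime,
        "rpo_met": data_loss <= rpo,
        "rto_met": downtime <= rto,
    }

# Hypothetical event: replication was 2 minutes behind, service was back in 90 minutes.
result = achieved_vs_objectives(
    last_recoverable=datetime(2013, 8, 2, 11, 58),
    failure=datetime(2013, 8, 2, 12, 0),
    restored=datetime(2013, 8, 2, 13, 30),
    rpo=timedelta(minutes=5),
    rto=timedelta(hours=1),
)
print(result["rpo_met"], result["rto_met"])
```

In this made-up event the RPO is met and the RTO is missed, which is a reminder that the two objectives are independent of each other.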

Scope is very, very important to specify for DR and local recovery measure design: protecting from a hurricane is very different from protecting from logical corruption.  Protecting from logical corruption discovered immediately (or at least in the same day) is very different from protecting from corruption of data quality discovered a full year after introduction.  When a particular fault or disaster event falls outside of the scope of your DR or local availability design, it doesn't mean that you don't have high availability or don't have disaster recovery.  It's just a realization of the limited scope of the design.

In my opinion, designing for recovery from logical corruption or data quality problems is the most involved part of recovery planning and design.  How much reachback time is enough?  Database server crashes usually don't go long before being discovered.  Data quality issues, even fairly pervasive ones, can be introduced months before they are discovered.

Consider this example: weekly full backups for 1 month, and all tx log backups for that month, are retained locally.  Asynch database replication to a secondary site.  1 monthly full backup and all tx log backups from that month are retained at the secondary site.  The compliance and litigation risk department requires that monthly full backups be retained at the secondary site for each previous month of the year, and 1 yearly full backup for 7 years back (someone said it's needed for SOX compliance :) ).
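The reachback this example schedule buys can be sketched as a small lookup.  This is a rough sketch under loud assumptions: a month is approximated as 30 days and a year as 365, and the function name and return strings are hypothetical.

```python
def reachback_option(days_ago):
    """Best restore granularity for a problem introduced `days_ago` days back,
    under the example retention schedule.

    Assumptions for the sketch: 1 month ~ 30 days, 1 year ~ 365 days.
    """
    if days_ago < 0:
        raise ValueError("days_ago must be non-negative")
    if days_ago <= 30:
        # Full backups plus every tx log backup for the month are retained.
        return "point-in-time restore (full backup + tx logs)"
    if days_ago <= 365:
        # Only the monthly full backups remain for earlier months of the year.
        return "nearest monthly full backup"
    if days_ago <= 7 * 365:
        # Only one yearly full backup per year remains, 7 years back.
        return "nearest yearly full backup"
    return "no backup reaches back that far"

for d in (3, 90, 1000, 3000):
    print(d, "->", reachback_option(d))
```

The granularity drops off fast: a data quality problem found three months after it was introduced can only be compared against monthly snapshots, not rolled back to an exact point in time.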

Hurricane or tornado at primary datacenter?  Thank goodness for database replication to secondary site, and disaster recovery!  Same if flooding, or if SAN level admin error formats all SAN disks.  (SAN replication would likely transmit the SAN admin error, but database replication in this case would not.)

A database admin error could take out the primary system, and would likely be transmitted to the secondary site if continuous asynch database replication is in use.  So additional recovery would likely be needed.  If backups are available at both the primary and secondary sites, with all tx logs at both - recovery on primary is usually preferable.  Let's play with that scenario a little.  What if there are no local database backups and no secondary site backups, and a database admin error or errant table truncation takes place?  Neither primary nor secondary site could provide recovery of that table.

Does the existence or realization of such a failure mean that there is NO disaster recovery technology or measure in place?  No, not at all.  I mean... I'd never recommend the absence of database and log backups locally, and I always strongly recommend at least the capability at the secondary site.  But that strategy provides for disaster recovery from huge natural disasters, while providing no recoverability for logical corruption/admin error/data quality problems.
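The scenarios above boil down to a scope table: which protections can recover from which kind of event.  A minimal sketch follows; the fault and protection names are hypothetical labels, and treating remote backups as usable after a SAN admin error is an assumption rather than something spelled out in the scenarios.

```python
# Which protections can recover from which fault, per the scenarios above.
# Labels are hypothetical. "remote_backups" under "san_admin_error" is an
# assumption (backups at the secondary site are not on the formatted SAN).
USABLE = {
    "site_disaster": {"db_replication", "remote_backups"},
    "san_admin_error": {"db_replication", "remote_backups"},
    # Continuous database replication would likely transmit the error itself.
    "db_admin_error": {"local_backups", "remote_backups"},
}

def recovery_paths(fault, protections):
    """Return the protections in place that could recover from the fault."""
    return sorted(USABLE[fault] & set(protections))

replication_only = {"db_replication"}
print(recovery_paths("site_disaster", replication_only))   # replication covers it
print(recovery_paths("db_admin_error", replication_only))  # nothing covers it
```

An empty list for one fault does not empty the list for another: the replication-only design still has disaster recovery for the hurricane, just no recoverability for the truncated table.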

So... is this sasquatch preaching some strange new gospel?  I don't think so - I still believe HA ≠ DR.  I also believe they can and do intersect.  I won't downplay the importance of backups - they are critical for survival.  But as business continuity and risk management in the database world mature... I want to make sure that at some level SQL Server folks, Oracle folks, and storage folks are all using at least SOME of the same important vocabulary in the same way.  Especially when it comes to critical service delivery characteristics and potentially big ticket capital and operational expenses.

Consider what EMC says about SRDF, Hitachi says about TrueCopy, IBM says about Metro/Global Mirror.

"Built for the industry-leading high-end VMAX hardware architecture, the SRDF family of solutions is trusted for disaster recovery and business continuity."

"Provides a continuous, nondisruptive, host-independent remote data replication solution for data protection, disaster recovery or data migration purpose."

"The Metro/Global Mirror function has a number of supported automated management offerings for disaster recovery solutions."

Consider what Oracle says about Data Guard.
"Oracle Data Guard ensures high availability, data protection, and disaster recovery for enterprise data."

Finally, consider the following command for SQL Server 2012 AlwaysOn:
ALTER DATABASE <> SET HADR 



--As an aside - although I personally believe recovery from logical corruption and admin errors can qualify as disaster recovery - many folks will say that local recovery from those events is not disaster recovery if a secondary site was available and was not utilized in the recovery.  Recoveries on the primary/local system fall outside of many definitions of disaster recovery.
--When I control the vocabulary, I find it's much more effective to talk about 'local availability', 'local recovery', and 'remote recovery' design sets.



*Pretty late update from sql.sasquatch 20131027.  My memory is not very reliable.  If you look around for MPOD related to business continuity or disaster recovery planning, you won't find much.  Look instead for MTPOD - maximum tolerable period of disruption, or one of these synonyms:
MAO - maximum allowable outage
MAO - maximum acceptable outage
MTD - maximum tolerable downtime
