Thoughts on Disaster Recovery

I’m often amazed by the amount of synchronicity in the world.  Most recently I’ve been flooded from all sides with discussions about disaster recovery and of those, almost all wanted to achieve a zero recovery point without sacrificing performance.  Since this requirement violates the laws of physics I’ve had the opportunity to write up my response several times and refine it a bit.  I’m posting it here for posterity.

Let me pause here and explain recovery point objective.  Wikipedia says it is “deceptively difficult to explain” but I think the IBM iSeries Infocenter does a good job:

Recovery point objective (RPO) is the point in time relative to the failure to which you need preservation of data. Data changes preceding the failure or disaster by at least this time period are preserved by recovery processing. Zero is a valid value and is equivalent to a “zero data loss” requirement.

This post arose from discussions asking how to set up WMQ such that the DR site never loses a message or transaction.  Generally this is done with disk replication so that any updates to the QMgr files are synchronized to a duplicate filesystem at the remote site.  However, there is a latency between writing to the disk locally and that update being safely recorded at the remote site.  To obtain a zero recovery point, the filesystem must block on the write until the remote site acknowledges the update.  This is known as synchronous replication.  It works but, as you may imagine, it is not very performant.
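To put a rough number on "not very performant", here is a minimal sketch of the arithmetic for writes that must complete one at a time.  The function name and the latency figures are assumptions chosen for illustration, not measurements from any real WMQ system.

```python
def serialized_writes_per_second(local_write_s: float, round_trip_s: float = 0.0) -> float:
    """Upper bound on dependent (one-at-a-time) writes per second.

    Each write must finish locally and, when replication is synchronous,
    also wait one full round trip for the remote site's acknowledgement.
    """
    return 1.0 / (local_write_s + round_trip_s)


# Hypothetical figures: 0.5 ms local log write, 10 ms round trip to the DR site.
local_only = serialized_writes_per_second(0.0005)
synchronous = serialized_writes_per_second(0.0005, 0.010)

print(f"local only:  {local_only:,.0f} writes/s")
print(f"synchronous: {synchronous:,.0f} writes/s")
```

Under these assumed figures the round trip dominates the cost of every write, which is why the penalty grows with the distance between sites.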

The other mode of replication is asynchronous.  In this mode, the local filesystem returns control immediately after the write and transmits the updates to the remote site in the background.  Since the local writes can outpace the network, the system has the ability to queue the updates locally.  As long as the network bandwidth is greater than the average volume of updates, the system can absorb transient traffic spikes.  However, there is a period of time during which the local and remote site are not in synchronization.  If connectivity between the sites is lost while the two are out of sync, the delta between the two represents the recovery point.  Although it is typically very small, it is not zero.
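A toy simulation makes the point concrete.  This is only a sketch under simple assumptions: time is divided into equal ticks, the link drains a fixed number of updates per tick, and the function name and traffic figures are invented for illustration.

```python
def backlog_at_disconnect(writes_per_tick, link_capacity_per_tick, disconnect_tick):
    """Count updates written locally but not yet replicated when the link fails.

    For every tick before the failure, the local writes are queued and the
    link then drains as many queued updates as its capacity allows.
    """
    backlog = 0
    for tick, written in enumerate(writes_per_tick):
        if tick == disconnect_tick:
            break
        backlog += written
        backlog -= min(backlog, link_capacity_per_tick)
    return backlog


# Hypothetical traffic: steady 50 updates per tick with one spike of 400.
# The average (100 per tick) is below the link capacity of 150.
writes = [50, 50, 400, 50, 50, 50, 50]

before_spike = backlog_at_disconnect(writes, 150, disconnect_tick=2)
after_spike = backlog_at_disconnect(writes, 150, disconnect_tick=3)
caught_up = backlog_at_disconnect(writes, 150, disconnect_tick=6)

print(before_spike, after_spike, caught_up)
```

The link absorbs the spike within a few ticks, but a failure that lands just after the spike strands whatever is still queued.  That stranded backlog is the recovery point.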

Does this sound familiar?  It should.  Disk replication is essentially another form of asynchronous, store-and-forward communication.  This is much like WebSphere MQ except that MQ moves data around in atomic units called messages whereas disk replication moves them around in atomic units related to disk block writes.  Where MQ has units of work that can comprise multiple messages, replication has consistency groups that can comprise multiple disk updates.

That’s how they differ.  Where they are alike is that they both are asynchronous messaging.  In other words, the design being described is to use low level async messaging to provide recovery of a different async messaging layer a little higher up the stack.  This doesn’t eliminate the fundamental problem of asynchronicity we are trying to solve, but it can greatly reduce the window of opportunity for lost messages and relieve us of having to code replication logic at the messaging layer.  Both of these are Good Things but they don’t result in a zero recovery point.

Here’s the issue, described in a physics context.  Stripped to bare essentials, the data must physically exist somewhere.  To achieve a zero recovery point, the data must exist in at least two places and these places must be geographically distant for purposes of disaster recovery.  The absolute floor for latency between these locations is defined by the speed of light (186,000 miles per second is not just a good idea, it’s the law!) and goes up from there depending on network characteristics between the endpoints.  Any specific datum always originates in one physical location so the application must either block and wait for replication to complete or it must accept that the two sites will be at least slightly out of sync during the replication.  That’s just laws of physics.  You can’t break these laws, you can only break yourself against these laws.
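The floor is easy to work out.  The sketch below uses the 186,000 miles per second figure quoted above; the 1,000 mile distance and the velocity factor of 0.67 (light in optical fibre travels at roughly two thirds of its vacuum speed) are assumptions for illustration, and real links add routing and equipment delay on top.

```python
SPEED_OF_LIGHT_MILES_PER_S = 186_000  # vacuum figure quoted in the text


def min_round_trip_ms(distance_miles: float, velocity_factor: float = 1.0) -> float:
    """Lowest possible round-trip time in milliseconds between two sites.

    velocity_factor scales the vacuum speed of light to the speed in the
    transmission medium (1.0 means vacuum).
    """
    one_way_s = distance_miles / (SPEED_OF_LIGHT_MILES_PER_S * velocity_factor)
    return 2 * one_way_s * 1000


vacuum_ms = min_round_trip_ms(1000)       # theoretical floor
fibre_ms = min_round_trip_ms(1000, 0.67)  # assumed fibre velocity factor

print(f"vacuum: {vacuum_ms:.2f} ms, fibre: {fibre_ms:.2f} ms")
```

No amount of money spent on bandwidth lowers this number.  Only moving the sites closer together does, and that works against the purpose of a DR site.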

Unfortunately, what I see most often are systems that use async replication but are treated as if the replication were synchronous.  In a controlled DR test, someone in an operations center walks the participants through a pre-planned procedure in which the Production network is disabled and failover proceeds in an orderly fashion.  This always results in a perfectly synched DR system due to the time it takes humans to coordinate and execute the failover.  Very few teams I’ve worked with perform the DR test by pulling the network cable while the Production components are under heavy load.  However, that is what is required to get an idea of how close to a zero recovery point the system is capable of getting you.

Throw enough money at your network provider and you can get REALLY fast links between data centers but the cost for each incremental bandwidth improvement tends to curve sharply up at some point.  After that you still have a non-zero recovery point and diminishing returns for greater sums of cash.  It’s Zeno’s Dichotomy paradox applied to latency.  That’s why redesigning an existing application (even a big one) can be a LOT cheaper than really fast replication that doesn’t get you the zero recovery point you wanted.  You either pay through the nose for the bandwidth or pay in outage impact because a system assumed a zero recovery point that didn’t exist.  The best possible case is to eliminate the requirement for a zero recovery point.  At my former employer we redesigned existing apps to reconcile between themselves on reconnect.  It was a really big project, lots of people, close to a year.  Cost was recovered very quickly through reduced bandwidth and replication expense, and those savings continue to accrue year after year.

That’s not to say that disk replication isn’t useful.  It is extremely good for many use cases, particularly those which don’t expect it to overcome relativity.  Also, replication as a concept is very useful.  For example, both InfoSphere Queue Replication (a.k.a. QRep, a database replication solution) and file replication using FTE work extremely well because they are transactional.  QRep replicates whole units of work or none at all.  FTE replicates whole files or none at all.  Both of these run over MQ and use non-persistent messages because they reconcile their state on reconnect.
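To illustrate the general idea of reconciling on reconnect, here is a minimal sketch using sequence numbers.  It is not the actual protocol used by QRep or FTE; the `Sender` and `Receiver` classes and their methods are hypothetical and exist only to show why lost non-persistent messages need not mean lost data.

```python
class Receiver:
    """Applies units of work strictly in sequence order."""

    def __init__(self):
        self.last_applied = 0
        self.applied = []

    def apply(self, seq, payload):
        # Duplicates and out-of-order arrivals (gaps) are ignored.
        if seq == self.last_applied + 1:
            self.applied.append(payload)
            self.last_applied = seq


class Sender:
    """Keeps every unit of work until the receiver confirms it."""

    def __init__(self):
        self.next_seq = 1
        self.unconfirmed = {}

    def send(self, payload, receiver, link_up=True):
        seq = self.next_seq
        self.next_seq += 1
        self.unconfirmed[seq] = payload
        if link_up:
            receiver.apply(seq, payload)  # otherwise the message is simply lost

    def reconcile(self, receiver):
        # On reconnect, resend everything past the receiver's high-water mark,
        # then forget whatever the receiver has now applied.
        for seq in sorted(self.unconfirmed):
            if seq > receiver.last_applied:
                receiver.apply(seq, self.unconfirmed[seq])
        self.unconfirmed = {
            s: p for s, p in self.unconfirmed.items() if s > receiver.last_applied
        }


sender, receiver = Sender(), Receiver()
sender.send("a", receiver)
sender.send("b", receiver, link_up=False)  # lost in transit
sender.send("c", receiver)                 # arrives, but ignored because of the gap
sender.reconcile(receiver)
print(receiver.applied)
```

The design choice is that the two endpoints, not the transport, own the guarantee: the transport can lose messages because the endpoints compare state and repair the difference when they reconnect.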

And I don’t want to give the impression that a zero recovery point isn’t practical.  Traditional hardware clusters, the Multi-Instance Queue Manager, z/OS Parallel Sysplex and other solutions do a great job of providing zero recovery points on locally shared storage.  Synchronous disk replication does a great job of meeting a zero recovery point objective if the application can tolerate the reduced throughput.

The problem, as I see it, isn’t one of replication or DR design.  The root cause here, in my humble opinion, is application designs which overlook non-functional requirements or act as if certain constraints (e.g. message affinity, ambiguous outcomes, relativity) do not exist.  There’s a common project management mindset that starts with a very simple single-instance prototype, iterates through cycles of improvement to get it to production readiness, then hands it off to the network and ops folks to “turn on security” and “implement DR”.  Vendors who tell their customers “look, this is harder than you think it is” don’t get rewarded in the market, but those who provide security and DR that you can bolt on after the fact do.  That creates a huge market differential between “sell you what you want” versus “sell you what you need” which in turn feeds a vicious cycle by encouraging rather than discouraging application architects to ignore these sets of requirements.

My advice? If you have a DR architecture based on a zero recovery point, make sure it will actually deliver that.  Next DR test, put it under heavy load and yank the WAN network cable out.  The DR architecture is supposed to handle a violent disconnection so test it that way, and do so multiple times.  But if the application is new, work out ahead of time what recovery point objective is going to be attainable at a reasonable cost.  You may need to factor in some sort of manual or automated reconciliation but the additional cost in application design and implementation will in the long run be a sound investment.


6 Responses to Thoughts on Disaster Recovery

  1. Maximilian Locher says:

    In your ‘Thoughts on Disaster Recovery’, you state, regarding Qrep and FTE that “Both of these run over MQ and use non-persistent messages because they reconcile their state on reconnect.”

    Serge Bourbonnais, in his presentation at StlDUGDB2 Technical Conference 2011, stated that staging is possible at the send queue if the network is down and that staging is possible at the receive queue if DB2 or Q Apply is stopped.

    page 19 of https://www-950.ibm.com/events/wwe/grp/grp009.nsf/vLookupPDFs/3-3%20Bourbonnais-Active%20Active%20Q%20Rep/$file/3-3%20Bourbonnais-Active%20Active%20Q%20Rep.pdf

    Can you help me to reconcile these two statements?

    Thank you,

    Max

    • T.Rob says:

      I’m not sure that there is anything to reconcile here. The diagram in the presentation shows a control flow from the receiving side back to the sending side through the admin queue. The two sides perform error detection and correction over this control flow. The fact that messages can stack up in queues on the sending or receiving side doesn’t conflict with those messages being non-persistent or that QCapture and QApply have a control protocol. Help me understand where you see a disconnect because this all appears consistent to me at this point.

  2. Maximilian Locher says:

    T-Rob,

    Great Blog! Thanks for sharing your thoughts. In your ‘Thoughts on Disaster Recovery’, you refer to ‘ambiguous outcomes’. Can you provide a quick discussion and example?

    Thanks,
    Max

    • T.Rob says:

      Thanks for the comments, Max. I’ve written up issues with the ambiguous outcomes on Stack Overflow a few times. A good example is posted here. Let me know if this needs a separate blog post to clarify.

  3. Jon Levell says:

    Thanks for that posting (and the shorter version on the MQ listserv), it explains the trade-offs very well!

  4. Erik Zollinger says:

    This is an area of frustration for me. I’ve been working with MQ over the last 10 years; now I’m working with InfoSphere Information Server – DataStage and have a new requirement of zero downtime: no maintenance window, plus a DR solution. In addition, all the components must be secured. Thanks for all you do!
