In the last month alone, I’ve worked with two companies that had IT disruptions but didn’t use their IT disaster recovery (DR) plans because they weren’t sure if they could fail back home (aka return to normal). In both cases, these concerns were a surprise to the executive management team.
It’s a theme I’ve heard many times before – the IT disaster recovery solution was built without considering how the organization would return to the primary data center from the disaster recovery location. This perspective highlights some key issues to consider regarding the use of the IT disaster recovery strategy.
Isn’t This Just Another Failover?
It seems that IT managers often don’t consider the return-home side of IT disaster recovery planning, or just assume it will work the same way as the failover itself. It’s true that most DR solutions would return home the same way they failed over, but this approach has a few potential roadblocks to keep in mind:
Long Recovery Times Can Sink Fail-back
If your systems are designed for a recovery time of three days or more, in all likelihood, you’re using tape backups and restoring to shared systems at a recovery service provider or using a service provider’s infrastructure at your alternate location. In this situation, failing back requires stopping the recovered applications, taking a full backup while the systems are quiet, transporting that backup to your production data center, and then performing the restore operation all over again. The result is several more days of downtime for your applications while you return to the production environment. Many organizations cannot afford to be down for 3+ days during a disaster followed by 3+ additional days of downtime to fail back. This is a common reason organizations decide not to use their DR solutions: the time required to fail back.
Compromising on Recovery Times Puts Pressure on Fail-back
Even applications with less than a three-day recovery time are susceptible to fail-back concerns, typically when the organization has compromised on recovery times to reduce the cost of DR. As an example, let’s say an organization has an online transaction processing system that processes $2M in orders every day. It’s critical. But, because of its complexity and the volume of its data, the system has a 24-hour recovery time using test hardware at a second data center. If the system needed to be failed over, the failover would take about 24 hours, which is bad, but the fail-back would take another 24 hours, which doubles the cost of the downtime and provides a powerful disincentive to use the DR solution. If the application had a four-hour recovery time objective (which was originally proposed), the fail-back could have been a much simpler activity.
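To make the arithmetic in this example concrete, here is a rough sketch of the order value exposed by a round trip. It assumes orders arrive evenly across the day, and it assumes a fail-back under the four-hour design would also take about four hours; neither assumption comes from a real system, and the function name is purely illustrative.

```python
# Hypothetical estimate of order value exposed during a fail-over plus fail-back.
# Assumption: orders arrive evenly across the day, so exposure scales with hours down.
DAILY_ORDER_VALUE = 2_000_000  # dollars per day, taken from the example above


def downtime_exposure(failover_hours, failback_hours, daily_value=DAILY_ORDER_VALUE):
    """Return the order value exposed while the system is down in both directions."""
    total_hours = failover_hours + failback_hours
    return daily_value * total_hours / 24


# 24-hour recovery time in each direction, as in the example.
current = downtime_exposure(24, 24)
# Assumed: with a four-hour objective, fail-back also takes about four hours.
proposed = downtime_exposure(4, 4)

print(f"24-hour design: ${current:,.0f}")
print(f"4-hour design:  ${proposed:,.0f}")
```

Under these assumptions, the 24-hour design exposes about $4M in orders for a full round trip, compared with roughly $667K for the four-hour design. Real losses depend on how many orders are deferred rather than lost, so treat this as a conversation starter rather than a forecast.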
Lack of Resources for Fail-back
In addition to the timing considerations for failing back, there are often resources needed at the recovery location that may not have been considered to enable a return-to-normal. Three common missing resources are:
- Backup System: Once applications are running at your recovery site, is there a system in place to back up data in the DR environment to allow for the transition back?
- Disk: The organization may have enough disk to restore the systems but lack adequate disk space to take snapshots and replicate those snapshots back to the original site for recovery.
- Network Bandwidth: Many organizations have just enough bandwidth to stream changes from production to the recovery site. When failing back, it is often necessary to ‘re-sync’ all the data (copy all the data from the recovery site back to the production site). This can often take days and significantly delay the fail-back process.
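The bandwidth point is easy to check with a back-of-the-envelope calculation. The sketch below estimates how long a full re-sync would take over a given link. The data volume, link speed, and 70% efficiency factor are hypothetical placeholders, not figures from any particular environment.

```python
# Hypothetical re-sync estimate: how long to copy all data back to production.
# Assumption: only a fraction of the link (efficiency) is usable for the copy.


def resync_hours(data_tb, link_mbps, efficiency=0.7):
    """Estimate hours needed to copy data_tb terabytes over a link_mbps link."""
    bits_to_copy = data_tb * 1e12 * 8            # decimal terabytes to bits
    usable_bits_per_second = link_mbps * 1e6 * efficiency
    return bits_to_copy / usable_bits_per_second / 3600


# Placeholder figures: 20 TB of data over a 200 Mbps replication link.
hours = resync_hours(data_tb=20, link_mbps=200)
print(f"Estimated re-sync: {hours:.0f} hours ({hours / 24:.1f} days)")
```

With these placeholder numbers the full copy takes roughly 13 days, which illustrates why a link sized only for streaming daily changes can stall a fail-back. Plugging in your own data volume and link speed during the design phase shows whether extra bandwidth or a physical data transfer needs to be part of the plan.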
Lack of Communication Exacerbates the Problem
It is critical that senior executives understand what their DR solutions are capable of. I’ve seen too many executives surprised by the limitations of the systems they had chosen to fund (for example, performance limitations, a lack of interfaces and/or incomplete capability). While there is a clear responsibility on the part of executives to know what they buy, business continuity and disaster recovery professionals also have a responsibility to clearly and repeatedly communicate the status of the program – not just in terms of who has plans, but also the scenarios the organization can recover from and how the recovery process would actually work. Tabletop exercises are also a great way to raise awareness regarding potential limitations of the recovery solution.
Avoid the Trap
The design phase of a recovery strategy is the time to avoid these issues. Executives should expect a clear articulation of how a proposed recovery strategy would be used during both failover and fail-back. It is the business continuity professional’s responsibility to build a capability that can fail back, or to be very clear with senior executives about the solution’s limitations.