Disaster Ahead

I have been involved with assessing and developing Business Continuity and Disaster Recovery (BCDR) plans for over 25 years. It always seems that DR planning is an afterthought that starts with system backups. Typically, companies build out their IT infrastructure based on the business requirements, and only when it’s finished does someone ask, “How do we recover this if something bad happens?” That’s not 100% true, but most companies don’t really plan well for a major disaster. The proliferation of networks compensated for this somewhat: networks were not highly reliable, so companies planned for one being down and built redundant networks, and then realized diverse network routes would be wise. As we move into the virtual world and cloud services, it’s now possible to architect operational solutions with DR integrated into the design. In fact, I would venture to say that BCDR is the operations plan. I’ll attempt to explain.

Even with the advancements in technology, we often go into small, medium, and even large companies and find they still approach DR with the same old “after the fact” concepts. They buy new equipment without considering how to better utilize it to keep the company running through a major disaster. I’ve become a strong believer in the approach that if we’re going to do all the discovery work and build out a roadmap for disaster recovery, we should integrate it into the long-term IT architectural plan. If you’re migrating servers from physical to virtual (P2V) and purchasing new or additional SAN storage, consider how to deploy that new equipment in your current environment so that it better supports BCDR, as I have outlined below.

  1. Splitting production systems across two or more physical locations, such that no single disaster can affect all of them.
    1. Put half of the production environment in Location A and half in Location B.
      1. If you have Application A and Application B, each with two servers, place Application A in Location A and Application B in Location B.
    2. If you have development and test systems, put them in the opposite location from their production systems.
      1. This allows them to be used as DR capacity (see the sketch after this list).
      2. If you do NOT have dev and test systems, you may need to have production servers fail over to other production servers at the opposite site.
  2. Splitting your SAN across two systems in the two locations.
    1. Now you can replicate between the locations to achieve recovery points that are shorter and easier to recover from.
  3. Subnetting your network so you can flip subnets between locations during an outage at the primary location.
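
To make the split concrete, here is a minimal sketch, in Python, of how such a two-site layout might be recorded and queried to answer “what fails over where?” when one location is lost. The location names, applications, and RTO/RPO figures are purely illustrative assumptions, not a real inventory or tool.

```python
# Minimal sketch of the two-site split described above. Site names,
# applications, and RTO/RPO values are illustrative assumptions.
from dataclasses import dataclass

SITE_A, SITE_B = "Location A", "Location B"

@dataclass
class App:
    name: str
    prod_site: str      # where the production servers run
    standby_site: str   # where dev/test capacity (the DR target) sits
    rto_hours: int      # recovery time objective
    rpo_minutes: int    # recovery point objective

# Half of production in each location; dev/test sits opposite its production.
apps = [
    App("Application A", prod_site=SITE_A, standby_site=SITE_B, rto_hours=4, rpo_minutes=15),
    App("Application B", prod_site=SITE_B, standby_site=SITE_A, rto_hours=4, rpo_minutes=15),
    App("Reporting", prod_site=SITE_B, standby_site=SITE_A, rto_hours=24, rpo_minutes=60),
]

def failover_plan(failed_site: str) -> list[str]:
    """List which applications must move, and where, if one site is lost."""
    plan = []
    for app in apps:
        if app.prod_site == failed_site:
            plan.append(
                f"{app.name}: bring up on dev/test capacity in {app.standby_site} "
                f"(RTO {app.rto_hours}h, RPO {app.rpo_minutes}m)"
            )
    return plan

if __name__ == "__main__":
    # Only the applications whose production half sat in Location A are
    # affected; everything already running in Location B keeps running.
    for step in failover_plan(SITE_A):
        print(step)
```

The point of the sketch is simply that when production is already split, losing one site only ever strands half of the application portfolio, and the DR target for each stranded application is already known.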

These steps help you reduce the overall loss of systems and applications in case of a disaster and provide the following benefits.

  • You need fewer resources, because only half of the production systems have to be recovered.
  • Only half of your production systems should ever be down at one time.
  • The cost of recovering from a major disaster is reduced by approximately half.
  • RTOs and RPOs are reduced for all production systems.

Dollar rescue

The other problem we see in the DR space is the DR plan itself. It tends to grow into the monolithic three-ring binder that no one can find quickly, and when they do find it, it’s out of date. The problem is that the DR plan is a “living document” that must accurately reflect the current configuration of your IT infrastructure to be effective, yet most company IT systems are continually changing. Thus maintaining the DR plan is a daunting and seemingly endless task.

I’ve come to the conclusion that diagramming each top application, along with every application that touches it, has huge benefits. I color code each server and application based on its RTO and RPO. This gives the top IT person (CIO, CTO, VP of IT, Director of IT) a visual aid for discussing what IT DR involves with the other executives, including the Board of Directors. It also gives the IT team a visual to quickly understand everything that has to be done to recover an application and which applications to focus on first. If incorporated into the plan correctly, this dramatically reduces the level of effort to maintain the DR plan. (A minimal sketch of this color-coding idea appears at the end of this post.)

The Disaster Recovery Institute has it correct: we need to modify the team’s (the company’s employees’) thinking to incorporate BCDR into our daily work environment and operating procedures. But we need to do that before we start the BCDR process, at least for the people creating the BCDR plan. To make this truly work, the failover or recovery plan should actually be exercised quarterly, and in production. Thus, BCDR is the operational plan. It’s actually how we run operations. When we truly build it out correctly, we should fail over between sites and run production on the backup site.

Another way to look at the BCDR planning cycle is that BCDR is the exit strategy. You’re either going to plan at the beginning and make the entire environment support recovery, and thus exit the disaster faster and with less pain and loss, or you’re simply going to exit the game.

And yes, doing this may not be feasible for every company and every application. I realize that. However, with some planning even small companies can have images made of their virtual servers that can be spun up at another hosting location.

If you’re an IT person in charge of BC or DR in your company, connect with me and we can discuss it even more. I’m always interested in learning what others have learned.
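
Returning to the color-coded diagram idea above, here is a minimal sketch of how such a diagram might be generated from the same data that drives the DR plan, so it stays current instead of being redrawn by hand. The application names, dependencies, and RTO values are purely illustrative assumptions; the script only emits Graphviz DOT text, with each application colored by its RTO tier.

```python
# Minimal sketch: emit a Graphviz DOT diagram of applications and their
# dependencies, colored by RTO tier. All names and RTO values below are
# hypothetical examples, not a real inventory.

# RTO in hours for each application (illustrative values).
rto_hours = {
    "ERP": 4,
    "Database": 4,
    "Reporting": 24,
    "File Server": 48,
}

# Which applications touch which (illustrative dependencies).
depends_on = {
    "ERP": ["Database", "File Server"],
    "Reporting": ["Database"],
}

def color_for(rto: int) -> str:
    """Map an RTO tier to a fill color: the tighter the RTO, the hotter the color."""
    if rto <= 4:
        return "red"      # recover first
    if rto <= 24:
        return "orange"
    return "green"        # can wait

def to_dot() -> str:
    lines = ["digraph DR {"]
    for app, rto in rto_hours.items():
        lines.append(
            f'  "{app}" [style=filled, fillcolor={color_for(rto)}, label="{app}\\nRTO {rto}h"];'
        )
    for app, deps in depends_on.items():
        for dep in deps:
            lines.append(f'  "{app}" -> "{dep}";')
    lines.append("}")
    return "\n".join(lines)

if __name__ == "__main__":
    print(to_dot())  # pipe the output into `dot -Tpng` to render the visual aid
```

Because the diagram is regenerated from inventory data rather than maintained as a separate drawing, keeping it current becomes part of normal change management instead of a separate documentation chore.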