I spend a lot of time with customers contemplating “the good, the bad, and the ugly.” What I mean by that is that I help customers understand what can go right and what can go wrong, and in some cases very, very wrong, within their networks. The art and science of Business Continuity and Disaster Recovery Planning can be as complicated or as simple as you allow it to be. We often get stuck in a “reactive” administrative mindset and, as a result, find ourselves chasing after events, allowing them to manage us. What I ask clients to consider with regard to BCP/DRP is this: what would have to change today, now, BEFORE the events that make these plans part of our reality, in order to move from the “reactive” to the “proactive” mindset, the one that allows us to anticipate activities and plan outcomes based on three very important elements:
- Common Sense
- Risk Analysis and Business Impact Analysis
- P.I.E. = Probability, Impact and Exposure
The transition to these tools allows you to become an administrator who interacts with your network, as opposed to reacting to it.
Below is a set of statements, or “situations,” that may or may not ring true for your DR plans today. The goal is to have you think through them and use them as a guide to a little self-examination of the state of your DRP solution today. At the end of each section, I have placed a “Question to ask yourself” statement that summarizes the key takeaway from the preceding discussion and gives you an actionable way to evaluate the current state of your DR plan.
Take some time and work through these discussions, and then go back and look “under the hood” at what you have going on. You may be surprised by what you find.
WE ARE NOT ABLE TO MEET OUR RTO/RPOS FOR OUR MISSION-CRITICAL APPLICATIONS.
Maybe you passed your last annual DR test, or maybe you did not. Even if you did, a test is only a predictor of whether you can actually meet the Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO) dictated by your business requirements. What many IT leaders do not consider is that DR tests are conducted under managed conditions and can take months to plan. Most causes of outages (power failure, human error, hardware failure) give you no notice. The single most important factor in determining whether your recovery management plan will succeed is your ability to mirror day-to-day change management tasks so that your recovery environment stays perfectly in sync with production. Today’s mission-critical applications have many dependencies that change frequently. Without ongoing tests, a recovery plan that worked before might now fail to restore availability to a vital business application.
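The RTO and RPO measurements above boil down to two simple comparisons: how stale is your last good copy of the data, and how long did it take to restore service? A minimal sketch (the timestamps and objectives below are hypothetical, not from any real test):

```python
from datetime import datetime, timedelta

def meets_rpo(last_backup: datetime, now: datetime, rpo: timedelta) -> bool:
    """The potential data-loss window is the time since the last good backup."""
    return (now - last_backup) <= rpo

def meets_rto(failure: datetime, service_restored: datetime, rto: timedelta) -> bool:
    """Downtime is the time from the failure to restored service."""
    return (service_restored - failure) <= rto

# Hypothetical results for one application: 4-hour RPO, 2-hour RTO
now = datetime(2015, 9, 1, 12, 0)
ok_rpo = meets_rpo(datetime(2015, 9, 1, 9, 30), now, timedelta(hours=4))
ok_rto = meets_rto(now, datetime(2015, 9, 1, 14, 15), timedelta(hours=2))
print(ok_rpo, ok_rto)  # → True False: backups are current, but recovery took too long
```

Note that a test can pass one objective and fail the other, which is exactly why both need to be measured for every mission-critical application, not assumed.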
Question to ask yourself: When was the last time we successfully tested all mission-critical applications against our RPO and RTO measurements?
OUR DR PLAN JUST SCRATCHES THE SURFACE.
You need to consider your recovery management capabilities in the context of the impact they have on staff and the long-term availability of your datacenter. Determining how long you can support an outage at your recovery center should shape your DR plan approach. It is also important to understand how the secondary site will be managed. Most likely, you will need to send staff to the secondary site to work on the recovery and maintain the temporary production environment, but this may not be easy in the event of a natural disaster. You cannot assume that the right people can get to the right places for each identified disaster. Provided you have the capabilities to recover, you must ensure that your organization is well informed with regard to procedures and chains of command.
Question to ask yourself: What would we do in a major disaster if we lost power for days or weeks, lost buildings, or lost communications links?
WE KNOW HOW TO FAILOVER TO A RECOVERY SITE, BUT WE LACK THE EXPERIENCE AND CAPABILITIES TO KNOW HOW TO FAILBACK.
Failover and failback are critical to executing a DR plan successfully. Failback can often be the most disruptive element to DR execution. With failback, most processes must be reversed. When a failover occurs, the secondary backup site must be a duplicate of your primary site. It must be able to support your production environment and offer the same protections needed to function as the primary site for a period of time. Failback means that your organization is looking to reinstate the production environment. Recovering back to the primary environment works the same way as a failover except in the opposite direction. Testing this scenario should also be performed, documented, and controlled. Not documenting and testing this component of your DR plan could force you to rely on your secondary site for extended periods of time, adding significant cost to the business.
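The symmetry described above, failback as a failover run in the opposite direction, can be sketched as a single role swap. This toy model (the site names are hypothetical) shows why the same procedure must be tested both ways:

```python
def fail_over(roles: dict) -> dict:
    """Swap primary and secondary roles; replication direction reverses with them."""
    roles["primary"], roles["secondary"] = roles["secondary"], roles["primary"]
    return roles

sites = {"primary": "datacenter-A", "secondary": "datacenter-B"}
fail_over(sites)  # disaster: B now serves production, replicating back to A
fail_over(sites)  # failback: the same operation, in the opposite direction
print(sites)      # → {'primary': 'datacenter-A', 'secondary': 'datacenter-B'}
```

The operation is symmetric on paper, but in practice each direction involves its own data resynchronization, cutover window, and validation steps, which is why the failback leg deserves its own documented, tested runbook entry.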
Question to ask yourself: Do we test our capabilities to failback during our scheduled recovery tests?
OUR RUNBOOKS ARE PROBABLY UNUSABLE.
Your runbooks should contain all of the information you and your staff need to perform day-to-day operations and to respond to emergency situations, including resource information about the primary datacenter and its hardware and software. Step-by-step recovery procedures for operational processes are also a critical component. If the procedures are not frequently updated, or not thoroughly vetted with key stakeholders, your recovery process will be significantly slowed, if not outright halted. And remember, the more time it takes to recover, the more expensive it gets. The Aberdeen Group estimated that downtime cost the average company US $160,000 per hour in 2012.
Question to ask yourself: How often do we evaluate and update our DR plan?
WE HAVE NOT CHANGED WHEN IT COMES TO CHANGE MANAGEMENT.
With today’s highly dynamic production environments, change is constant. Next generation datacenter technologies such as virtualization make it easier to create and deploy applications, allocate and provision storage, and set up new systems. However, the ease and frequency at which these changes occur can prevent your team from properly recording them at your recovery site. Without properly performing change management, secondary and backup environments can quickly get out of step with your production environment, causing recovery failures.
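One practical way to catch the drift described above is to regularly diff an inventory of the production environment against the recovery site. A minimal sketch (the component names and versions are hypothetical, and a real inventory would come from your CMDB or configuration management tooling):

```python
def config_drift(production: dict, recovery: dict) -> dict:
    """Report components that are missing or mismatched at the recovery site."""
    drift = {}
    for name, version in production.items():
        if name not in recovery:
            drift[name] = ("missing", version)
        elif recovery[name] != version:
            drift[name] = ("version mismatch", version, recovery[name])
    return drift

# Hypothetical inventories: component name -> deployed version
prod = {"app-server": "4.2", "db": "11.1", "payment-svc": "1.9"}
dr   = {"app-server": "4.2", "db": "10.8"}
drift = config_drift(prod, dr)
print(drift)
# → {'db': ('version mismatch', '11.1', '10.8'), 'payment-svc': ('missing', '1.9')}
```

Anything this report surfaces is a change that reached production without reaching the recovery site, which is precisely the class of gap that causes recovery failures.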
Question to ask yourself: What are we doing to ensure that our testing environment reflects our live production environment?
WE CAN PASS AN AUDIT, BUT THAT DOES NOT MEAN WE ARE RECOVERABLE.
Passing an audit means you have a plan that meets a specific list of requirements. It does not mean that your plan will provide recoverability. Most auditors do not focus on the variables of your DR plan, and do not look at the effectiveness of the plan for each and every disaster scenario. They only ensure that you have met the static requirements established in the audit itself. The fact is, you can pass an audit, but still fail to recover from an actual event.
Question to ask yourself: When was the last time we tested the DR plans?
OUR IT ENVIRONMENT IS GETTING TOO COMPLEX.
Your business environment is becoming more dynamic and dependent upon an increasing number of applications. Restoration of full services will require recovery of all of these elements. As a result, you will need to tier your applications accordingly, which may require adjustments to your tiered environment to ensure you are addressing all interdependencies. A complex infrastructure will make the tiering, and therefore the recovery, that much more difficult.
Question to ask yourself: How have we tiered our applications to aid in recovery?
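In its simplest form, tiering gives you a recovery order: tier 1 applications come back first, tier 3 last. A toy sketch (the application names and tier assignments are hypothetical; a real plan would also encode the interdependencies mentioned above):

```python
# Hypothetical inventory: (application, tier), where tier 1 recovers first
apps = [("reporting", 3), ("order-entry", 1), ("email", 2),
        ("payments", 1), ("intranet", 3)]

recovery_order = sorted(apps, key=lambda app: app[1])
for name, tier in recovery_order:
    print(f"Tier {tier}: recover {name}")
```

Even this trivial ordering forces a useful conversation: if "reporting" depends on "db" and "db" is not in tier 1, the tiers are wrong, and that is exactly the kind of interdependency a complex infrastructure hides.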
BACKING UP DOES NOT MOVE US FORWARD.
Backing up is not, by itself, a DR solution; however, it is a critical component of a successful recovery management plan. Whether you are replicating data to disk, tape, or a combination of both, moving data between storage mediums is slow. If it takes an unacceptable amount of time to move and restore data, then testing is probably out of the question. Time-to-restore concerns have also led companies to forego a regular test restoration process, which can lead to lost data.
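The "moving data is slow" point above is worth putting into numbers, because restore time is just data volume divided by sustained throughput. A back-of-the-envelope sketch (the data size and throughput figures are hypothetical, and real restores also pay for verification, catalog lookups, and application restart):

```python
def restore_hours(data_gb: float, throughput_mb_s: float) -> float:
    """Hours to move data back at a sustained throughput (transfer time only)."""
    seconds = (data_gb * 1024) / throughput_mb_s
    return seconds / 3600

# Hypothetical: 20 TB restored from tape at a sustained 120 MB/s
hours = restore_hours(20 * 1024, 120)
print(f"{hours:.1f} hours")  # → 48.5 hours
```

If that figure already exceeds your RTO before you account for anything but raw transfer, no amount of runbook polish will save the plan; the data management strategy itself has to change.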
Question to ask yourself: How have we integrated data management into our recovery management program and testing?
WE ARE NEITHER TESTING ENOUGH NOR DO WE HAVE THE TIME OR PEOPLE TO DO IT RIGHT.
Only 20-30 percent of BC/DR plans are tested, and many of those fail. Even if you have a DR plan and can arguably restore most of your mission-critical applications through testing, you must also measure the total restore against the frequency of testing and map out the resources needed to conduct and validate a test. You might have the plan in place, but without the resources available, or without conducting an actual test, you cannot validate success. Testing the recovery procedures of individual applications is very different from recreating a datacenter from scratch, and a 72-hour testing window is not adequate; it is just enough time to corral the right employees and beg them to participate in a test that is not part of their core function. Companies will often work with whatever resources they have.
Question to ask yourself: Do we have the bandwidth and expertise in-house to be fully recoverable?
WE ARE NOT COMFORTABLE WITH THE IDEA OF SOMEONE ELSE DOING OUR RECOVERY.
Partnering with a recovery service provider can actually complement the skill sets of the internal IT team, allowing them to focus on strategic projects rather than operational tasks while improving overall recoverability for the business.
Question to ask yourself: Do we have the recovery expertise in place to ensure that our recovery will be successful?