This is not an esoteric question. Market research about the disappointing state of backup and data protection shows that recovery failures occur all the time. Anecdotal evidence also points to recovery failures across enterprises of all sizes and industries. We hear from many prospects looking for a new backup and recovery solution that their impetus for replacing their existing tools was an unforeseen recovery failure.
A failed recovery is not a random event. In our 30 years of focusing on backup and cloud DR, we’ve learned to recognize the warning signs: known software conditions and configurations that, if not handled properly, will make a recovery unlikely. A lack of good process in a few key areas can also increase the likelihood that a recovery will fail.
The good news is that if you know where and what to look for, you can potentially deal with these issues before they become a crisis. To help you assess your risk, Unitrends has created a tool that contains 10 questions to score your risk of experiencing a failed recovery.
Here are two examples of the conditions that can greatly increase your odds of a failed recovery:
Introducing VSS Errors
Volume Shadow Copy Service, or VSS, is a Microsoft technology that creates manual or automatic point-in-time snapshots of data, even while applications are still writing to it. Backup software then copies these snapshots to backup storage. Many Windows applications ship with their own VSS components, so several can exist on your server at the same time. If more than one VSS process runs in the same environment, they can conflict, resulting in the failure of one or more VSS writers. VSS issues can have multiple causes:
- Using multiple backup and recovery solutions – Perhaps you inherited a legacy backup solution when you took the job, or maybe your company acquired another organization with a different backup technology. Ironically, using multiple backup solutions puts your ability to recover at risk, as each may want to use a different VSS provider.
- Gaps in coverage – With the interplay between applications and data, it is rare to have a clean division between which backup solutions protect which assets. Most commonly, you have gaps in your coverage, overlaps, or both. Each scenario poses risks to a successful recovery. Gaps mean that some data is not protected at all, and overlapping backups raise the question of which recovery tool to use.
- Differing backup schedules and processes – There are different methods of doing backups, including full, incremental, snapshot-based, synthetic, and others. Using multiple solutions means, at a minimum, that you are spending twice the time, budget, and energy to manage your data protection, and you may also be introducing software compatibility issues.
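One practical way to spot VSS trouble early is to look at the state each VSS writer reports. The sketch below is a minimal illustration, not a Unitrends tool: it assumes you have captured the English-language text output of the Windows command `vssadmin list writers` (run from an elevated prompt) and flags any writer whose state is not "Stable". The function name `unstable_writers` and the sample text are hypothetical, and the writer IDs are deliberately elided.

```python
from typing import List, Tuple


def unstable_writers(vssadmin_output: str) -> List[Tuple[str, str]]:
    """Return (writer name, state) pairs for writers not reporting 'Stable'.

    Assumes the English output layout of `vssadmin list writers`, where each
    entry has a "Writer name: '...'" line followed by a "State: [n] Label" line.
    """
    problems = []
    current = None
    for raw in vssadmin_output.splitlines():
        line = raw.strip()
        if line.startswith("Writer name:"):
            # Keep the text after the colon and drop the surrounding quotes.
            current = line.split(":", 1)[1].strip().strip("'")
        elif line.startswith("State:") and current is not None:
            # "State: [8] Failed" -> "Failed" (drop the bracketed numeric code).
            label = line.split(":", 1)[1].split("]", 1)[-1].strip()
            if label.lower() != "stable":
                problems.append((current, label))
            current = None
    return problems


# Illustrative sample only; real output contains more writers and real IDs.
SAMPLE_OUTPUT = """\
Writer name: 'System Writer'
   Writer Id: {...}
   State: [1] Stable
   Last error: No error

Writer name: 'SqlServerWriter'
   Writer Id: {...}
   State: [8] Failed
   Last error: Non-retryable error
"""

if __name__ == "__main__":
    for name, state in unstable_writers(SAMPLE_OUTPUT):
        print(f"{name}: {state}")
```

Running a check like this before each backup window, and after installing or removing any backup product, can surface a failed writer while there is still time to fix it.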
When and how do you test recovery?
There is nothing you can do to better ensure successful recoveries than regular and automated testing. However, most organizations don’t test frequently or thoroughly enough. To lower the risk of a failed recovery, best-in-class organizations:
- Test after new servers are introduced or major changes are made – After making changes to the primary infrastructure, there needs to be a process to ensure those changes are reflected in the recovery process. If new servers or storage arrays are not added to the restore process, and in the right order, they may not be fully functional after a disaster recovery.
- Test full application recovery – There are many ways to test application recovery. Some, such as database mounting, screenshot verification, and database verification, do not ensure that recovered applications have full functionality. To learn more about the different forms of testing, see Disaster Recovery Testing and How to Win.
- Test at least monthly – Most organizations (55%) test only annually or not at all. A lot can change in that amount of time, and the time to discover that your recoveries will not work is not when the organization is down and users are hounding you to ask when their applications will be restored. Frequent, application-level testing also means that, in the event of a ransomware attack, you can restore with confidence from an uninfected backup.
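The two timing rules above (test after major changes, and test at least monthly) are simple enough to automate as a reminder. The following is a minimal sketch under stated assumptions: the function name `recovery_test_due`, its parameters, and the 30-day default standing in for "monthly" are all hypothetical choices for illustration, and you would need to supply the dates from your own change and test records.

```python
from datetime import date
from typing import List, Optional


def recovery_test_due(
    last_test: Optional[date],
    last_change: Optional[date],
    today: date,
    max_age_days: int = 30,  # assumption: "monthly" approximated as 30 days
) -> List[str]:
    """Return the reasons a recovery test is due; an empty list means none."""
    if last_test is None:
        return ["no recovery test on record"]
    reasons = []
    # Rule 1: any infrastructure change after the last test calls for a retest.
    if last_change is not None and last_change > last_test:
        reasons.append("infrastructure changed since the last test")
    # Rule 2: the last test must be recent enough.
    if (today - last_test).days > max_age_days:
        reasons.append(f"last test is older than {max_age_days} days")
    return reasons


if __name__ == "__main__":
    print(recovery_test_due(date(2024, 1, 10), date(2024, 1, 20), date(2024, 1, 25)))
```

A check like this only tells you when to test. What you test still matters: it should be a full application recovery, not just a confirmation that the backup job completed.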
These are just two of the categories of issues that can cause recovery failures. To help you assess your risk of experiencing a recovery failure, please see our Recovery Assessment Tool and answer 10 questions. Your final score will be a good indicator of potential problems. Based on your risk level, we also suggest steps you can take to reduce the chances your next emergency recovery will fail.
Downtime can be a very stressful and chaotic time. It is not the time or climate in which you want to do forensic analysis on why your recovery has failed. Recovery is a good example of the idiom “an ounce of prevention is worth a pound of cure.”