Saturday, August 28, 2010

Don’t Simply Believe, Test


When it comes to your environment, you can't really use Alice's disaster recovery plan from Dilbert or just sit around singing "I Believe We Can" from Phineas and Ferb. You actually need to test your plan regularly to make sure that you are truly covered. This doesn't need to be a weekly or monthly burden on you or your staff, but there is a list of things you should be working through on a scheduled basis, at least one item each quarter.

Audit Backup Coverage

Take a look at the production systems in your environment and make sure that the file folders, databases, etc. are actually making it to offsite media at their prescribed intervals. Systems change over time and it is very easy to complete an upgrade or other systems change and forget to update the backup job to cover any additional folders or servers. In larger environments, you may need to involve the application owners to validate what is needed for restoring their applications.
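As a rough illustration of the audit described above, here is a minimal Python sketch that compares a list of production paths against the folders a backup job covers and reports anything left out. The function name and the example paths are hypothetical; in practice you would export both lists from your own inventory and backup software, and this check says nothing about whether the data actually reached offsite media.

```python
from pathlib import PurePosixPath


def uncovered_paths(production_paths, backup_roots):
    """Return the production paths that are not inside any backed-up root.

    Both arguments are lists of absolute POSIX-style path strings.
    (Assumption: the backup job recurses into every folder it lists.)
    """
    roots = [PurePosixPath(r) for r in backup_roots]
    missing = []
    for p in production_paths:
        path = PurePosixPath(p)
        # Covered if the path is a backup root itself or sits below one.
        covered = any(path == root or root in path.parents for root in roots)
        if not covered:
            missing.append(p)
    return missing


# Hypothetical example data: an upgrade added /srv/app/uploads,
# but nobody updated the backup job.
production = ["/srv/app/data", "/srv/app/uploads", "/var/lib/db"]
backup_job = ["/srv/app/data", "/var/lib"]
print(uncovered_paths(production, backup_job))  # ['/srv/app/uploads']
```

An empty result only means every listed path falls under a backup root; application owners still need to confirm that the production list itself is complete.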

Test Restore of the Backup System Itself

In many environments, it is not that important to test the restore process of individual files, folders, or databases, as the user community's requests typically take care of that task for you. What many organizations fail to do, though, is test restoring the backup system itself. This is frequently a nontrivial task, as these systems often have a "catalog" of backups that needs to be restored first, before any production data can be restored. Make sure that your DR kit includes these instructions, as well as a plan for how to address replacing the primary media device used by your backups.

Test Fail-Over Systems

Many organizations, even in the Small and Medium Business space, are now using redundancy technologies such as clustering, load-balancing, secondary sites, etc. The viability of these secondary systems should not simply be assumed; these systems also need to be tested. The techniques involved do not need to be complicated or even executed during primary business hours. A test can be as simple as rolling your cluster to its secondary node, shutting down one of the servers in your NLB group, or temporarily changing firewall rules to prevent access to a primary system. When doing these tests, don't forget to test the physical infrastructure pieces as well, such as backup generators, backup cooling plans, physical space access, etc.
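One way to confirm that a service survived a fail-over test is to check that it still accepts connections afterward. The sketch below uses only Python's standard socket module; the function names are hypothetical, and a successful TCP connection only shows that something is listening, not that the application behind it is healthy.

```python
import socket


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_services(endpoints, timeout=3.0):
    """Map each (host, port) pair to True/False depending on reachability."""
    return {(host, port): is_reachable(host, port, timeout)
            for host, port in endpoints}
```

Run `check_services` with your own list of (host, port) pairs before the test to record a baseline, then again after rolling the cluster or pulling a node from the load-balanced group, and compare the two results.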

Test Communications and Staffing Plans

In the "lessons learned" stage of nearly every disaster recovery event I have ever been a part of, whether it was a test or an actual emergency, this is the one portion of the plan that usually got the most scrutiny. First responders were often confused about who needed to be contacted and when, how operations updates should be provided, and who the secondary contacts were for the various areas. This confusion was often the single biggest factor contributing to slower recovery times and, as a result, lost productivity for the organization as a whole. Contact sheet updates are often overlooked as staffing changes occur. Additionally, many organizations' plans for updating the staff at large rely heavily on internal electronic systems that may or may not be available in times of emergency.

Test DR Kit Access

With many organizations relying on media couriers such as Iron Mountain for the offsite storage of critical DR materials, such as their DR kits and backup media, it is prudent to test these vendors' ability to bring you the needed materials in the timeframe that they have promised.
For those organizations not using a courier service, but rather relying on either bank safety deposit boxes or employees taking materials home, it is prudent to test and confirm that the "first responder" staff know whom to contact and how to gain access to these critical DR materials.

Summary

While none of what I have posted is particularly revolutionary, these are critical items that are often overlooked by many organizations in the hustle and bustle of their daily operations. Working with your DR plan is NEVER exciting; it is stressful. In fact, it is like your first aid kit: something you know you need to have and hope to never actually use.

1 comment:

  1. While the comic references can be a bit obscure (they depend on the reader being "in the know" or at least aware of the relatively popular comic strips), the actual content of your article is extremely well targeted. I would suggest that you consider replacing "DR" with "Disaster Recovery" for added impact, although the DR stands out more (double CAPITALS)... very good article, on target to the decision maker/responsible party of a Small to Medium Business Information Technology installation and a few of us techs that need reminding of the basics from time to time to stay the course of our support efforts.
