The Hazy Shade of Backups. Boot Camp: Week 3. Mapping Backups to Key Metrics: Recovery Point Objective (RPO)
Welcome back to Boot Camp!
This week, we get more technical and goal-oriented. Recovery Point Objective (RPO) and Recovery Time Objective (RTO) are the two main technical metrics that people use to express acceptable, business-driven performance targets for backup and recovery. In this post, we tackle the first metric, RPO, in detail.
RPO, measured as time, corresponds to the maximum potential data loss associated with an outage, corruption, or other issue that requires a recovery effort. There are many possible ways to assess potential data loss, but the simplest and most straightforward to communicate within an organization is based on backup frequency. The time between successful backups is your window for potential data loss. All content and metadata changes and additions during that window represent potential loss. For example, if you take a backup every 24 hours, your best attainable RPO is 24 hours. That is overly simplistic, but it gives us a starting point for discussion.
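As a rough illustration of the backup-frequency view, the sketch below takes a list of successful backup timestamps and reports the largest gap between them. The function name `worst_case_loss` and the sample schedule are made up for this example; real timestamps would come from your backup tool's logs.

```python
from datetime import datetime, timedelta


def worst_case_loss(backup_times: list[datetime]) -> timedelta:
    """Return the largest gap between consecutive successful backups."""
    ordered = sorted(backup_times)
    if len(ordered) < 2:
        raise ValueError("need at least two backups to measure a gap")
    return max(later - earlier for earlier, later in zip(ordered, ordered[1:]))


# Hypothetical schedule: one successful backup per day at 01:00 for a week.
daily = [datetime(2024, 1, day, 1, 0) for day in range(1, 8)]
print(worst_case_loss(daily))  # 1 day, 0:00:00
```

With a backup every 24 hours, the largest gap, and therefore the best attainable RPO under this simple model, is 24 hours.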
As we factor in more real-world backup scenarios involving full, incremental, differential, and log file backups, the simple RPO discussion gets far more complex. The worst-case potential data loss should still drive RPO selection, as RPO is the agreed-upon maximum acceptable risk. That worst case usually occurs around full backup windows. If a full backup takes 12 hours to run, 12 hours is the best attainable RPO, even if you take fast incremental backups in between full backups.
Fortunately, your risk profile changes. You reduce the risk profile for your organization significantly by implementing more sophisticated backup schemes. If an event were to occur between full backup windows, the data risk at that moment in time may be far smaller, as the time between incremental or log backups may be minutes or hours. Events often happen during less than worst case timing.
However, further complicating the picture are unsuccessful backups. Backups fail for many reasons – processes terminating prematurely, backups that roll into the next window without completing, tape and hard drives running out of space, and hardware failure. Research statistics are all over the map:
- According to Microsoft, 42% of attempted recoveries from tape backups failed. Ben Matheson, group product manager for Microsoft Data Protection Manager, was quoted saying more than 50% of customers surveyed said their current backup solutions do not fill their needs.
- Storage Magazine reported, “Over 34% of companies do not test their backups and of those that tested, 77% found their tape backups failed to restore.” (storagemagazine.techtarget.com)
- Research I conducted found more than 1 in 2 companies reported periodic backup failures and related recovery failures spanning all backup technologies.
One thing is clear: backups fail, failure is frequently undetected, and downstream recovery efforts suffer, as recovery often requires backtracking to find the last good backup. The impact on RPO is that actual data loss can easily exceed expected levels. Backup testing, a future Boot Camp topic, mitigates some of the risk by spotting trouble before a recovery effort is needed and helps improve processes and backups in general.
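To make the effect of undetected failures concrete, here is a minimal sketch that backtracks to the last good backup before an incident and measures the resulting loss window. The `BackupRun` record and the `actual_loss` helper are hypothetical names for illustration, and the sample history is invented.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class BackupRun(NamedTuple):
    finished_at: datetime
    succeeded: bool


def actual_loss(runs: list[BackupRun], incident_at: datetime) -> timedelta:
    """Time between the last good backup and the incident."""
    good = [r.finished_at for r in runs
            if r.succeeded and r.finished_at <= incident_at]
    if not good:
        raise ValueError("no usable backup before the incident")
    return incident_at - max(good)


# Hypothetical nightly backups at 01:00; the last two runs failed unnoticed.
history = [
    BackupRun(datetime(2024, 1, 1, 1, 0), True),
    BackupRun(datetime(2024, 1, 2, 1, 0), True),
    BackupRun(datetime(2024, 1, 3, 1, 0), False),
    BackupRun(datetime(2024, 1, 4, 1, 0), False),
]
incident = datetime(2024, 1, 4, 13, 0)
print(actual_loss(history, incident))  # 2 days, 12:00:00
```

In this made-up case the expected exposure was at most 24 hours, but two silent failures stretch the actual loss to 60 hours.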
The complexity of backup strategies and options, combined with untested and/or failed backups, results in a very hazy view of true risk. What’s the best way to create a reasonable RPO, deliver service that can realistically achieve that RPO, and minimize risk? Work through a series of steps.
Identify Your RPO
- Work with business counterparts to define what level of loss under different situations, measured in time, is acceptable. This is your target business RPO.
- Assess the backup window required to perform a full backup.
- Determine if there are gaps between backups of any type (log, incremental, etc.) that exceed the duration of your full backup.
- Select the longer of your full backup window or largest backup gap. This is your technical RPO capacity.
- Compare the business RPO with the technical RPO. If the business RPO is longer than or equal to the technical RPO, the technical RPO becomes your stated RPO. If not, you will need to either negotiate with the business to reset RPO expectations and secure business buy-in, or negotiate for budget to implement additional backup capacity that lets you reduce the technical RPO to the maximum business RPO. Sometimes faster servers will suffice; other times you may need a different approach, such as externalizing your BLOB content to shrink the content to be backed up.
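The selection and comparison steps above can be sketched in a few lines. This is only an illustration of the arithmetic; the function names `technical_rpo` and `stated_rpo` and the sample durations are assumptions, not part of any backup product.

```python
from datetime import timedelta
from typing import Optional


def technical_rpo(full_backup_window: timedelta,
                  backup_gaps: list[timedelta]) -> timedelta:
    """Longer of the full backup window or the largest gap between backups."""
    return max([full_backup_window, *backup_gaps])


def stated_rpo(business_rpo: timedelta,
               tech_rpo: timedelta) -> Optional[timedelta]:
    """Return the RPO to publish, or None if negotiation is needed."""
    if business_rpo >= tech_rpo:
        return tech_rpo
    return None  # reset expectations or fund more backup capacity


# Hypothetical numbers: a 12-hour full backup, incrementals 1 to 4 hours apart.
tech = technical_rpo(timedelta(hours=12),
                     [timedelta(hours=1), timedelta(hours=4)])
print(tech)                                  # 12:00:00
print(stated_rpo(timedelta(hours=24), tech))  # 12:00:00
print(stated_rpo(timedelta(hours=8), tech))   # None
```

A `None` result corresponds to the negotiation branch: the business wants a tighter RPO than the current backup setup can deliver.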
Minimize Overall Risk
- Implement a strategy that leverages multiple types of backups, not just full backups.
- Test your backups to ensure they are good.
- Monitor changes to your content and their impact on backup times. As content increases, so does the time to back up. This can push you out of compliance with your RPO directly, or it can cause processes and hardware to break that previously worked flawlessly.
- Factor backup impact assessments into any infrastructure change, no matter how small, to identify unintended system stress.
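For the monitoring point, a back-of-the-envelope projection can show when content growth is likely to push the full backup window past the RPO. The sketch below assumes compound monthly growth and a constant backup throughput, both simplifications; the function name and all figures are hypothetical.

```python
from typing import Optional


def months_until_breach(size_gb: float,
                        monthly_growth: float,
                        throughput_gb_per_hour: float,
                        rpo_hours: float,
                        horizon_months: int = 120) -> Optional[int]:
    """First month in which the projected full backup exceeds the RPO.

    Assumes compound growth and constant throughput (simplifications).
    Returns None if no breach occurs within the horizon.
    """
    for month in range(horizon_months + 1):
        projected_size = size_gb * (1 + monthly_growth) ** month
        if projected_size / throughput_gb_per_hour > rpo_hours:
            return month
    return None


# Hypothetical farm: 2,000 GB growing 5% a month, backed up at 200 GB/hour,
# against a 12-hour RPO.
print(months_until_breach(2000, 0.05, 200, 12))  # 4
```

Measured backup durations from your own environment are a better input than an assumed throughput, since real backup speed rarely stays constant as content and hardware change.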
RPOs and backups can make your head swim as you dig into the complexities and interactions, but these are important factors in your administrative tool kit. No one should graduate Boot Camp without basic training in RPO and risk management.