Does an application have to meet its established recovery time objective when returning to the primary data center?
- One opinion is the application should have to meet its Recovery Time Objective in case there is a problem in the alternate data center and the application has to return quickly.
- The other opinion is the application should not have to meet its Recovery Time Objective when returning to the primary data center, because we are recovering from a disaster and want to restore services to the primary data center in a slow, methodical manner.
Feedback from BCP Builder Community on LinkedIn:
Is migration necessary?
- If we assume the application and data moved to the alternate site successfully, then the first Recovery Time Objective is probably met because there was no downtime. The next question is whether to come back to the primary center at all. If there is no service degradation, a migration back may be unwarranted. A third question is whether, when the “fail back” occurs, the restart and re-establishment of links has to meet the recovery time objective. If failing back involves breaking the links to the business or the data set, then it certainly has to meet the recovery time objective, or you have downtime. In the current world, where finding the physical location of the application server and the data set is getting harder, fail back is becoming a more complex question.
- Returning to the primary data center can be a “planned” event (i.e. it doesn’t have to be a reactive response to the critical incident impacting production services to users). The potentially disruptive ‘return to normal’ can be scheduled for an agreed time frame that minimizes the user impact.
Recovery Times never breached
- If the fail-over to the alternate data center was within the ‘agreed’ recovery time objective then this time was never breached. IT should choose the next best window to fail back to the primary data center with minimal or no user impact. This should only happen when an appropriate fix is in place.
- The recovery time objective is the time requirement to get an application available again, regardless of whether it is running out of the primary or other data center. The recovery and restoration of the primary data center can occur in parallel, to an appropriate time frame, as the processes requiring the specified application can still operate.
- Fail-back only occurs when the production environment is stable, and it is not subject to the established recovery time objectives. The timing of the fail-back is always top of mind so that production impacts are minimized.
- Recovery time objectives may be exceeded to a reasonable extent, based on a management decision.
Testing
- Backup and recovery plans need to be tested to ensure fail-overs can be executed as quickly as possible, especially for mission-critical systems. If a critical system (e.g., email, payroll, HR) is remotely hosted, you should be able to achieve minimal to no impact on recovery time objectives. Locally hosted systems are at greater risk of a slow return to normal operation (e.g., because of staff availability or network performance issues). This is a good reason to periodically test the recoverability of such systems. A phased recovery to achieve the desired recovery time objective makes sense, e.g., initial fail-over/recovery at 80% of normal, 90% in an hour, and 100% in two hours.
- Testing helps you to confirm inter-dependencies between all components.
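The phased recovery mentioned above can be written down as a simple schedule that a test can check against. The following is a minimal sketch only: the 80%, 90% and 100% figures come from the example above, while the names `PHASES`, `capacity_at` and `minutes_until` are hypothetical and the schedule is assumed to start when the fail-over completes.

```python
from bisect import bisect_right

# Hypothetical phased-recovery schedule using the figures from the text:
# (minutes after fail-over completes, percent of normal capacity).
PHASES = [(0, 80), (60, 90), (120, 100)]


def capacity_at(minutes_since_failover, phases=PHASES):
    """Return the percent of normal capacity expected at the given time."""
    if minutes_since_failover < phases[0][0]:
        return 0  # fail-over has not completed yet
    starts = [start for start, _ in phases]
    # Find the last phase whose start time is at or before the given time.
    index = bisect_right(starts, minutes_since_failover) - 1
    return phases[index][1]


def minutes_until(target_percent, phases=PHASES):
    """Return minutes until capacity first reaches target_percent, or None."""
    for start, percent in phases:
        if percent >= target_percent:
            return start
    return None  # the schedule never reaches the target


if __name__ == "__main__":
    for minute in (0, 30, 60, 90, 120):
        print(f"{minute:>3} min: {capacity_at(minute)}% of normal")
```

During a recovery test, the measured capacity at each checkpoint can be compared with `capacity_at` to see whether the plan is on schedule.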
Consequences
- Contractual, legal or regulatory requirements also have a say in this. They will influence the disaster recovery solutions you’ll have to put in place in order to meet the recovery time objectives, regardless of which data center the application is running in. In some cases the alternate data center will become the primary data center and the same rules will apply there.
- If you are running an IT Department that is funded by your customer base (internal or external), there are consequences of not meeting the required time frames. An internal provider’s job depends on meeting the agreement (even if the restoration issues are outside their department). Externally, there are performance items in the contract, as well as escape or non-renewal clauses. These are the consequences of not meeting a recovery time objective.
Investment
- If someone is unhappy with their application recovery time objective, and wants a shorter time frame, they will need to fund the improvement.
- There should be an agreement on the recovery time objective, even if both parties are not happy with it because the customer wants xx minutes and infrastructure can only provide, with current configurations, yy hours or days. You need to work toward full agreement on the recovery time objective. If a customer needs a set time frame, they need to fund improvements to the environment (hardware, locations, software, etc.).
Example
- There is a primary location for the application and a secondary (alternate) location. Data center 1 has Applications A, C and E as primary and B, D and F as secondary. Data center 2 has Applications B, D and F as primary and A, C and E as secondary. A loss of a single data center means half the applications are still running and only the other half need to be restored. It doesn’t make any difference which data center they’re running from.
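The split described in this example can be modelled in a few lines. This is only an illustrative sketch: the application letters and their placement come from the example above, while the labels `DC1` and `DC2` and the names `PLACEMENT` and `after_loss` are hypothetical.

```python
# Placement from the example: application -> (primary, secondary) data center.
PLACEMENT = {
    "A": ("DC1", "DC2"),
    "B": ("DC2", "DC1"),
    "C": ("DC1", "DC2"),
    "D": ("DC2", "DC1"),
    "E": ("DC1", "DC2"),
    "F": ("DC2", "DC1"),
}


def after_loss(lost_dc, placement=PLACEMENT):
    """Split applications into those still running and those to restore.

    Returns (still_running, to_restore), where to_restore pairs each
    affected application with the data center it should be restored in.
    """
    still_running = []
    to_restore = []
    for app, (primary, secondary) in sorted(placement.items()):
        if primary == lost_dc:
            # The application's primary is gone: restore it at its secondary.
            to_restore.append((app, secondary))
        else:
            # The application's primary survived, so it keeps running.
            still_running.append(app)
    return still_running, to_restore


if __name__ == "__main__":
    running, restore = after_loss("DC1")
    print("Still running:", running)
    print("To restore:", restore)
```

Losing either data center leaves three applications running and three to restore, which is the point of the example: only the applications that need restoring are measured against their recovery time objective.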
If you want to increase your Organizational Resilience, start with preparing a Business Continuity Plan and check out BCP Builder’s Business Continuity Planning Templates.