Disaster recovery can mean different things to different businesses. For example, one business might prioritize returning to normal operations quickly, even at the cost of some data loss. Conversely, another business might prioritize minimizing data loss over the time it takes to restore normal operations.
Most businesses hope that recovery from a service-interrupting event requires as little downtime as possible, potentially measured in single-digit minutes, with no data loss up to the second of the failure. The two metrics often used to define this for a workload or organization are recovery point objective (RPO) and recovery time objective (RTO). Please see this blog post for information on RPO and RTO. In summary:
- Recovery Time Objective (RTO) is the maximum acceptable delay between the interruption of service and restoration of service. This objective determines what is considered an acceptable time window when service is unavailable and is defined by the organization. How long can I be down?
- Recovery Point Objective (RPO) is the maximum acceptable amount of time since the last data recovery point. This objective determines what is considered an acceptable loss of data between the last recovery point and the interruption of service and is defined by the organization. How much data can be lost because the newest available recovery point may have been taken hours before the event?
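To make the two definitions concrete, here is a minimal sketch that compares a single incident (or recovery drill) against an organization's objectives. The function name `evaluate_recovery`, its parameters, and the sample timestamps are all hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta


def evaluate_recovery(last_recovery_point, outage_start, service_restored, rto, rpo):
    """Compare one incident or drill against the RTO and RPO (illustrative only)."""
    # Downtime: from the interruption of service to the restoration of service.
    downtime = service_restored - outage_start
    # Data loss window: from the last recovery point to the interruption of service.
    data_loss_window = outage_start - last_recovery_point
    return {
        "downtime": downtime,
        "data_loss_window": data_loss_window,
        "rto_met": downtime <= rto,
        "rpo_met": data_loss_window <= rpo,
    }


if __name__ == "__main__":
    # Hypothetical incident: nightly backup at 02:00, outage at 09:30, restored at 11:00.
    result = evaluate_recovery(
        last_recovery_point=datetime(2024, 1, 15, 2, 0),
        outage_start=datetime(2024, 1, 15, 9, 30),
        service_restored=datetime(2024, 1, 15, 11, 0),
        rto=timedelta(hours=4),
        rpo=timedelta(hours=1),
    )
    print(result)
```

In this made-up incident the service was down for 1.5 hours, which fits a 4-hour RTO, but the newest recovery point was 7.5 hours old, which misses a 1-hour RPO.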
If you are operating on-premises and the goal is to get your system back up as soon as possible, but an RTO of a day or two is acceptable, then you might be able to get away with purchasing replacement hardware overnight or, increasingly less likely, locally. If your RTO is measured in hours, managing disaster recovery on-premises often requires the purchase of spare hardware. When you operate in the cloud, you can provision resources when necessary, or even run an active-active environment in another geographical region to which you can automatically route traffic in the event of a failure. Operating in the cloud gives a business the ability to provision new hardware and solutions easily and at far lower cost.
Regarding RPO, if 24-48 hours of data loss is acceptable, then backup to removable media such as tape might be just fine (provided you have enough time to restore your system within your RTO). You can use tape or snapshots to archive data on-premises more frequently and restore those backups if you have a failure, but you need to provision adequate storage and, importantly, enough time and system availability to perform these more frequent backups. In the cloud you pay for the storage you use, and there are often options for real-time data replication that would be prohibitively expensive for a single organization to build. Many of the resources and options for recovery are only available to you in the cloud because providers can take advantage of economies of scale; reliable network equipment, for example, is expensive.
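As a rough illustration of how backup frequency relates to RPO, the sketch below estimates the worst-case data loss for a periodic backup schedule. It rests on a simplifying assumption that is not from the discussion above: each backup captures the state of the system at the moment it starts and only becomes usable once it finishes. The function names are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval, backup_duration):
    """Worst-case age of the newest usable backup (simplified model)."""
    # Assumption: a failure just before a running backup completes leaves only
    # the previous backup, which started one interval plus one duration ago.
    return backup_interval + backup_duration


def schedule_meets_rpo(backup_interval, backup_duration, rpo):
    """Return True if the schedule stays within the RPO under this model."""
    return worst_case_data_loss(backup_interval, backup_duration) <= rpo


if __name__ == "__main__":
    nightly = timedelta(hours=24)
    duration = timedelta(hours=2)
    print(schedule_meets_rpo(nightly, duration, rpo=timedelta(hours=24)))
    print(schedule_meets_rpo(nightly, duration, rpo=timedelta(hours=48)))
```

Under this model, a nightly backup that takes two hours does not quite satisfy a 24-hour RPO, but comfortably satisfies a 48-hour one. Real backup tools and replication services behave differently, so treat the numbers as a starting point for discussion rather than a guarantee.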
The shorter the RTO and the smaller the acceptable data loss for the RPO, the higher the cost and complexity of your solution will be.
From a DR perspective, when you are implementing effective data backup solutions from on-premises to the cloud, your options are largely determined by the particular RTO and RPO for the environment you are designing. How are you going to get your data in and out of the cloud, for instance? How quickly do you need to get your data back? The time it takes to restore or retrieve data counts against your RTO, so you need to understand how much data you need to import and export and how much bandwidth you have available to move it.
You will want to consider the durability and availability of your backups as well. Most importantly, you need to consider security and any compliance and governance controls you are required to abide by. By default, you should assume the need to encrypt data in transit as well as at rest.
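To see how data volume and bandwidth interact, here is a back-of-the-envelope sketch that estimates how long a transfer would take. The function name `transfer_hours` and the `efficiency` factor (the assumed fraction of the nominal link speed you actually achieve) are hypothetical; real throughput depends on many factors this ignores.

```python
def transfer_hours(data_gb, link_mbps, efficiency=0.8):
    """Estimate hours needed to move data_gb gigabytes over a link_mbps link."""
    if data_gb < 0 or link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("expected data_gb >= 0, link_mbps > 0, 0 < efficiency <= 1")
    # Decimal units: 1 GB = 8 * 10^9 bits, 1 Mbps = 10^6 bits per second.
    total_bits = data_gb * 8 * 1000**3
    # Assumed usable throughput after protocol overhead and contention.
    usable_bits_per_second = link_mbps * 1_000_000 * efficiency
    return total_bits / usable_bits_per_second / 3600


if __name__ == "__main__":
    # Hypothetical restore: 5 TB over a 1 Gbps link at 80% efficiency.
    print(f"{transfer_hours(5000, 1000):.1f} hours")
```

With these assumed numbers, restoring 5 TB takes roughly 14 hours, which already rules out an RTO measured in single-digit hours unless the data is replicated ahead of time or moved another way.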
You always define your RPO and RTO based on the organization's needs and budget. But if you don't test your recovery, you have no idea whether you can meet those requirements. If you practice, you know what to look for when responding to an event; if you don't, you will find yourself in trouble when an incident occurs. Operating in the cloud gives you the means to test and document your recovery procedures, because you can provision the resources you need and take them down when you are done without disrupting those accessing your application or workload.
Obviously this is a huge conversation. Contact us today to discuss your situation, your requirements and solutions we believe may work for you.