There is an increasing reliance on using high availability to protect Managed File Transfer (MFT) systems, and indeed most MFT vendors provide a robust solution, often offering both Active-Active and Active-Passive configurations. There are, however, many circumstances where using high availability is simply not an option, whether due to cost, infrastructure or some other reason. In this case it is necessary to revert to the ‘old school’ way of doing things, using some form of backup-restore mechanism to offer disaster recovery in managed file transfer.
Just to be clear to those who have a different understanding to me regarding the difference between high availability and disaster recovery, this article is based upon the following understanding:
In high availability there is little or no disruption of service; after part of the infrastructure fails or is removed, the rest of the infrastructure continues as before.
In disaster recovery, the service is recovered to a new, cold or standby instance, either by automatic or manual actions. This includes VM snapshots and restoring to a non-production environment.
It goes without saying that disaster recovery in Managed File Transfer isn’t something that you can easily achieve on the fly; it’s important to have detailed plans and practice them regularly until recovery practices always complete flawlessly. When you start planning for disaster recovery the very first question should be “What should my recovery environment look like?”
This might sound like a strange way to start, but just take a moment to consider why you have a Managed File Transfer system in the first place. Do you need the data stored in it to be available following the recovery or just the folder structure? It’s a best practice policy not to leave data in the MFT system for long periods of time – it should contain transient data, with an authoritative source secured elsewhere. If you continue with this train of thought, think about how valid the content of any backup would be if (for example) it is only taken once per day. Potentially that could mean 23 hours and 59 minutes since the previous backup; a lot can change in that time.
Similarly, consider that you may have another system sending data into MFT on a frequent basis; if that system also needs to be recovered (due perhaps to a site outage), then you will need to find a common point in time that you are able to recover both systems to, or risk duplicate files being sent following recovery activities (see RPO below).
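As a rough illustration of finding that common point, the sketch below takes the backup timestamps of the MFT system and of one system that feeds it, and picks the pair of backups that lie closest together in time. The function name and the sample timestamps are hypothetical; real backups rarely line up exactly, so whatever gap remains is the window in which duplicate or missing files still need to be checked by hand.

```python
from datetime import datetime
from itertools import product


def closest_backup_pair(mft_backups, source_backups):
    """Return the (mft_backup, source_backup) pair with the smallest time gap.

    Both arguments are lists of datetime objects, one per available backup.
    This is an illustrative helper, not part of any MFT product.
    """
    if not mft_backups or not source_backups:
        raise ValueError("each system needs at least one backup")
    # Compare every MFT backup with every source backup and keep the
    # pair whose timestamps are nearest to each other.
    return min(
        product(mft_backups, source_backups),
        key=lambda pair: abs(pair[0] - pair[1]),
    )


# Hypothetical backup schedules for the two systems.
mft = [datetime(2024, 1, 1, 1, 0), datetime(2024, 1, 1, 13, 0)]
source = [datetime(2024, 1, 1, 6, 0), datetime(2024, 1, 1, 12, 30)]

mft_point, source_point = closest_backup_pair(mft, source)
print("Restore MFT to", mft_point, "and the source system to", source_point)
print("Remaining gap to reconcile manually:", abs(mft_point - source_point))
```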
Should your recovery environment be similarly sized to the production environment? Ideally, the answer is always going to be yes, but what if your production system is underused, or sized to take into account periodic peaks in activity? In that case, a smaller, less powerful environment may be used.
RTO and RPO
Recovery Time Objective (RTO) and Recovery Point Objective (RPO) are the two most critical points of any disaster recovery plan. RTO is the length of time within which you aim to recover your MFT system; RPO is the point in time that you will set your MFT system back to, which is frequently the last successful backup. As already mentioned, you may need to synchronise the RPO with other independent systems. Once you have decided upon the RPO, you need to plan how you will handle transfers which may have occurred since that time; will they be resent, or do you need to determine which files must not be resent? Will you need to request inbound files to be sent again?
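To make the question of transfers since the RPO more concrete, here is a minimal sketch. It assumes that a transfer log survived the outage (for example because it was shipped to a separate log server), which may not be true in your environment. The `Transfer` record and its fields are hypothetical and not taken from any particular MFT product. The function simply separates the transfers completed after the RPO into inbound files, which you may need to request again, and outbound files, which need a decision on whether they may be resent.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Transfer:
    """One line of a (hypothetical) transfer log kept outside the MFT system."""
    filename: str
    direction: str          # "inbound" or "outbound"
    completed_at: datetime


def transfers_after_rpo(transfers, rpo):
    """Return (inbound, outbound) transfers that completed after the RPO."""
    affected = [t for t in transfers if t.completed_at > rpo]
    inbound = [t for t in affected if t.direction == "inbound"]
    outbound = [t for t in affected if t.direction == "outbound"]
    return inbound, outbound


# Hypothetical log entries and an RPO equal to the last successful backup.
log = [
    Transfer("orders_0900.csv", "inbound", datetime(2024, 1, 1, 9, 0)),
    Transfer("invoices_1400.csv", "outbound", datetime(2024, 1, 1, 14, 0)),
    Transfer("orders_1500.csv", "inbound", datetime(2024, 1, 1, 15, 0)),
]
rpo = datetime(2024, 1, 1, 13, 0)

inbound, outbound = transfers_after_rpo(log, rpo)
print("Ask partners to resend:", [t.filename for t in inbound])
print("Check before resending:", [t.filename for t in outbound])
```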
You can only settle on a realistic RTO by executing a recovery test. This will enable you to accurately gauge the amount of time the restore process takes; remember that some activities may be executed in parallel, assuming the resources are available.
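The effect of parallel activities on the measured recovery time can be sketched as follows. The model is an assumption made for illustration: recovery is split into stages that run one after another, and the steps within a stage run in parallel, so each stage takes as long as its slowest step. The step names and durations are invented; in practice they would come from timings recorded during your own recovery test.

```python
from datetime import timedelta


def measured_recovery_time(stages):
    """Sum the slowest step of each stage.

    `stages` is a list of dicts mapping a step name to its measured
    duration. Stages run sequentially; steps inside a stage run in
    parallel. Empty stages are ignored.
    """
    return sum(
        (max(stage.values()) for stage in stages if stage),
        timedelta(),
    )


# Hypothetical timings recorded during a recovery test.
stages = [
    {"restore VM": timedelta(minutes=40), "restore database": timedelta(minutes=25)},
    {"restore MFT configuration": timedelta(minutes=10)},
    {"smoke-test transfers": timedelta(minutes=15)},
]

print("Measured recovery time:", measured_recovery_time(stages))
```

If the two restores in the first stage could not run in parallel, for example because they compete for the same storage, that stage would take their combined time rather than the longer of the two.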
Hosting of MFT systems on Virtual Machine (VM) farms has changed the way that we consider RPO and RTO somewhat. In general, virtualisation allows us several possibilities for recovery, including:
- A system snapshot taken periodically and shipped to the recovery site
- Replication of the volume containing the VM (and generally several other VMs)
- Hyper-V shared cluster environment
Of these, Hyper-V probably comes closest to being a high availability alternative; however, it should be remembered that it uses asynchronous replication of the VM, which means that some data or transactions may be lost, albeit a small amount.