Disaster Recovery in Managed File Transfer
There is an increasing reliance on high availability to protect Managed File Transfer (MFT) systems, and indeed most MFT vendors provide a robust solution, often offering both Active-Active and Active-Passive configurations. There are, however, many circumstances where high availability is simply not an option, whether due to cost, infrastructure or some other reason. In these cases it is necessary to revert to the ‘old school’ way of doing things, using some form of backup-restore mechanism to offer disaster recovery in managed file transfer.
Just to be clear for anyone whose understanding of the difference between high availability and disaster recovery differs from mine, this article is based upon the following definitions:
In high availability there is little or no disruption of service; after part of the infrastructure fails or is removed, the rest of the infrastructure continues as before.
In disaster recovery, the service is recovered to a new, cold or standby instance, either by automatic or manual actions. This includes VM snapshots and restoring to a non-production environment.
It goes without saying that disaster recovery in Managed File Transfer isn’t something that you can easily achieve on the fly; it’s important to have detailed plans and to practise them regularly until the recovery completes flawlessly every time. When you start planning for disaster recovery the very first question should be “What should my recovery environment look like?”
This might sound like a strange way to start, but just take a moment to consider why you have a Managed File Transfer system in the first place. Do you need the data stored in it to be available following the recovery or just the folder structure? It’s a best practice policy not to leave data in the MFT system for long periods of time – it should contain transient data, with an authoritative source secured elsewhere. If you continue with this train of thought, think about how valid the content of any backup would be if (for example) it is only taken once per day. Potentially that could mean 23 hours and 59 minutes since the previous backup; a lot can change in that time.
Similarly, consider that you may have another system sending data into MFT on a frequent basis; if that system also needs to be recovered (due perhaps to a site outage), then you will need to find a common point in time that both systems can be recovered to, or risk duplicate files being sent following recovery activities (see RPO below).
Should your recovery environment be similarly sized to the production environment? Ideally, the answer is always going to be yes, but what if your production system is underused, or sized to take into account periodic peaks in activity? In that case, a smaller, less powerful environment may be sufficient.
RTO and RPO
Recovery Time Objective (RTO) and Recovery Point Objective (RPO) are the two most critical points of any disaster recovery plan. RTO is the length of time within which your MFT system must be recovered; RPO is the point in time that you will set your MFT system back to – frequently this is the last successful backup. As already mentioned, you may need to synchronise the RPO with other independent systems. Once you have decided upon the RPO, you need to plan how you will handle transfers which may have occurred since that time; will they be resent, or do you need to determine which files must not be resent? Will you need to request inbound files to be sent again?
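As a rough illustration of the questions above, the sketch below splits a transfer log into the inbound and outbound transfers that completed after the recovery point. The `Transfer` record, the file names and the timestamps are all invented for the example; a real MFT product will have its own log or audit export format.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Transfer:
    """One row from a hypothetical transfer log export (not a vendor format)."""
    filename: str
    direction: str          # "inbound" or "outbound"
    completed_at: datetime


def transfers_after_rpo(log, rpo):
    """Return (inbound, outbound) transfers that completed after the recovery point."""
    after = [t for t in log if t.completed_at > rpo]
    inbound = [t for t in after if t.direction == "inbound"]
    outbound = [t for t in after if t.direction == "outbound"]
    return inbound, outbound


# Illustrative data: the last good backup was taken at 02:00 UTC.
rpo = datetime(2024, 1, 10, 2, 0, tzinfo=timezone.utc)
log = [
    Transfer("orders_0109.csv", "inbound", datetime(2024, 1, 9, 23, 30, tzinfo=timezone.utc)),
    Transfer("orders_0110.csv", "inbound", datetime(2024, 1, 10, 9, 15, tzinfo=timezone.utc)),
    Transfer("payments_0110.xml", "outbound", datetime(2024, 1, 10, 11, 0, tzinfo=timezone.utc)),
]

inbound, outbound = transfers_after_rpo(log, rpo)
print("Ask senders to resend:", [t.filename for t in inbound])
print("Check before resending:", [t.filename for t in outbound])
```

This only works if the transfer log itself survives the outage, which is one more reason to keep a copy of it away from the MFT server.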
You can only settle on a realistic RTO by executing a recovery test. This will enable you to gauge accurately how long the restore process takes; remember that some activities may be executed in parallel if resources allow.
Hosting of MFT systems on Virtual Machine (VM) farms has changed the way that we consider RPO and RTO somewhat. In general, virtualisation gives us several possibilities for recovery, including:
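To show how parallel activities affect the figure, here is a minimal sketch that assumes the recovery is broken into stages, where the steps inside a stage run in parallel and the stages themselves run one after another. The step names and durations are invented; the real numbers can only come from your own recovery test.

```python
from datetime import timedelta

# Each stage maps a step name to its measured duration in minutes.
# Steps within a stage are assumed to run in parallel; stages run in order.
stages = [
    {"restore VM snapshot": 40},
    {"import MFT configuration": 15, "update DNS": 10, "apply firewall rules": 20},
    {"smoke-test transfers": 25},
]


def estimated_recovery_time(stages):
    """Sum the longest step of each stage (parallel within, serial between)."""
    return timedelta(minutes=sum(max(stage.values()) for stage in stages))


def sequential_recovery_time(stages):
    """Sum every step, as if one person did everything in order."""
    return timedelta(minutes=sum(sum(stage.values()) for stage in stages))


print("With parallel steps:", estimated_recovery_time(stages))
print("Fully sequential:  ", sequential_recovery_time(stages))
```

The gap between the two figures is the time that depends on having enough people and resources available on the day.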
- A system snapshot taken periodically and shipped to the recovery site
- Replication of the volume containing the VM (and generally several other VMs)
- Hyper-V shared cluster environment
Of these, Hyper-V probably comes closest to being a high availability alternative; however, it should be remembered that it uses asynchronous replication of the VM, which means that there is some loss of data or transactions, albeit a small one.
Let’s assume that you’ve decided to avoid the RPO question by presenting an ‘empty’ system in your recovery site. This means that you will need to periodically export your production configuration and ship it to the recovery site. Ideally, you would want to do this at least daily, but possibly more frequently if you have a lot of changes. Some MFT systems allow you to export and ship the configuration using the MFT product itself – this is a neat, self-contained method that should be used if it’s available. In this way you are more likely to have the very latest copy of the configuration at the point the server becomes unavailable.
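Where the product cannot do this itself, a scheduled script is one alternative. The sketch below assumes that the configuration has already been exported to a directory by whatever means your MFT product provides; it bundles that directory into a timestamped archive and writes a SHA-256 checksum beside it so the recovery site can verify the copy. The directory names, archive name and `users.xml` file are all hypothetical, and shipping the archive is left to your usual transfer mechanism.

```python
import hashlib
import tarfile
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def package_export(export_dir, out_dir):
    """Bundle an exported configuration directory into a timestamped archive.

    Returns (archive_path, sha256_hex). The export itself is assumed to have
    been produced already by the MFT product; this only packages it.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    archive = out_dir / f"mft-config-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(export_dir, arcname="mft-config")
    digest = hashlib.sha256(archive.read_bytes()).hexdigest()
    # Write the digest next to the archive so the far end can verify it.
    Path(str(archive) + ".sha256").write_text(f"{digest}  {archive.name}\n")
    return archive, digest


# Demonstration with a throwaway directory standing in for a real export.
with tempfile.TemporaryDirectory() as tmp:
    export_dir = Path(tmp) / "export"
    export_dir.mkdir()
    (export_dir / "users.xml").write_text("<users/>")
    archive, digest = package_export(export_dir, Path(tmp) / "outbox")
    verified = hashlib.sha256(archive.read_bytes()).hexdigest() == digest
    print(archive.name, "verified" if verified else "CORRUPT")
```

Keeping the timestamp in the file name means that older exports are not overwritten, which is useful if the most recent export turns out to contain the change that caused the problem.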
The actual MFT software may or may not be installed in advance, depending upon your licence agreement (some vendors permit this, others not – be sure to check as part of your planning). In any event, it is best to keep the product installation executable(s) on the server in case they are required.
So next on the list of things to think about is: what else do you need to complete the recovery? Unfortunately, the answer can be quite long:
- Network
Can the new server have the same IP address as the old? Do you need to add it? The new server may well be on a completely different subnet.
If you are using DNS CNAME records to reach the server, where are they updated?
Is there a load balancer to be updated?
Does the recovery server have the same firewall rules as the production server?
Are you using a forward proxy to send traffic out of the network, and if so will it present the same source IP address?
If you have multiple sites defined in your MFT system, does each have a unique IP address?
- Keys and Certificates
Are these included as part of the system configuration or do they have to be handled separately?
Are PGP key-rings held in the home directory of the account that the MFT system runs under?
- User Accounts
Does your configuration export include locally defined users? Do you make use of local groups on the server which may not be present on the recovery server?
Will LDAP/LDAPS queries work equally well from this machine?
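Several of the questions above (DNS, firewall rules, LDAP reachability) can be checked from the recovery server before you need it in anger. The following is a minimal sketch using only the Python standard library; the function names are my own, and the demonstration connects to a throwaway local listener so that it runs anywhere. In practice you would list your own partner endpoints, LDAP servers and so on.

```python
import socket


def resolves(hostname):
    """Return the sorted list of addresses a name resolves to (empty if none)."""
    try:
        return sorted({info[4][0] for info in socket.getaddrinfo(hostname, None)})
    except socket.gaierror:
        return []


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def preflight(checks):
    """checks is a list of (label, host, port); returns {label: reachable}."""
    return {label: port_open(host, port) for label, host, port in checks}


# Demonstration against a temporary local listener; replace with real endpoints,
# for example ("LDAPS", "ldap.example.internal", 636), which is a made-up name.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
port = listener.getsockname()[1]
results = preflight([("local listener", "127.0.0.1", port)])
listener.close()

for label, ok in results.items():
    print(f"{label}: {'reachable' if ok else 'BLOCKED'}")
```

A successful TCP connection only proves that routing and firewall rules allow the traffic; it says nothing about whether the remote party will accept your source IP address, keys or credentials, so a real test transfer is still needed.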
Returning to Normal Operations
Sooner or later you will need to switch back operations to the normal production environment. Unfortunately, this isn’t always as straightforward as you could wish for.
When disaster struck and you initiated your disaster recovery plan, you were forced into it by circumstances. It was safe to assume that data was lost and the important thing was to get the system back again. Now, however, your recovery environment may have been running for days and it will probably have seen a number of transfers. At this point you need to ‘drain’ your system of active transfers and potentially identify files which have been uploaded into your MFT system but have not yet been downloaded.
Some MFT systems keep track of which files have been transferred (to avoid double sending); if your MFT system is one of these, then you will need to ensure that the production system knows which files the recovery system has already handled. Regardless of this, you will need to ship your configuration back to the production server in order to accommodate any changes that have occurred – for example, users being created or deleted, or even simply changing their password.
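The reconciliation described above boils down to comparing sets. The sketch below assumes that each system can give you a list of the files it has delivered, identified here by a (file name, checksum) pair, plus a list of files still waiting on the recovery system; how you obtain those lists depends entirely on your MFT product, and the data shown is invented.

```python
def reconcile(production_sent, recovery_sent, pending_on_recovery):
    """Work out what production needs to know before it takes over again.

    All three arguments are sets of (filename, checksum) pairs.
    Returns (mark_as_sent, still_to_deliver).
    """
    # Delivered by the recovery system only: production must not send them again.
    mark_as_sent = recovery_sent - production_sent
    # Uploaded during the outage but not yet delivered by either system.
    still_to_deliver = pending_on_recovery - recovery_sent - production_sent
    return mark_as_sent, still_to_deliver


# Illustrative data only.
production_sent = {("inv_001.csv", "a1"), ("inv_002.csv", "b2")}
recovery_sent = {("inv_003.csv", "c3"), ("inv_004.csv", "d4")}
pending_on_recovery = {("inv_005.csv", "e5")}

mark_as_sent, still_to_deliver = reconcile(
    production_sent, recovery_sent, pending_on_recovery
)
print("Mark as already sent on production:", sorted(mark_as_sent))
print("Move to production for delivery:   ", sorted(still_to_deliver))
```

Including a checksum in the identifier guards against a file name being reused for different content while the recovery system was live.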
Synchronise the return with other applications that send data through the MFT system in order to prevent data bottlenecks from forming during the move; remember that any DNS changes you need to make at this time may take some time to be replicated through the network.
Keeping the Recovery System Up To Date
Of course, any time you make an update to your production environment, it could easily invalidate your recovery environment. An example might be something as simple as resizing a disk, or adding a new IP address – both of these activities should hopefully be covered by change management practices, but of course we all know that replication of changes into the recovery environment doesn’t always happen. This is why it’s so important to perform regular disaster recovery exercises every six months or so, so that you can identify and resolve these discrepancies before a disaster occurs. When considering what changes need to be replicated, look again at the areas you need to consider when first setting up the recovery environment.
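Between exercises, a simple comparison of the two environments can catch some of these discrepancies early. The sketch below assumes that you can collect a flat inventory of settings from each server (by script, from a CMDB, or by hand); the keys and values are invented for illustration, and settings that are meant to differ between sites, such as IP addresses, would need to be left out of the comparison.

```python
def find_drift(production, recovery):
    """Compare two flat inventories; return {key: (production_value, recovery_value)}."""
    drift = {}
    for key in sorted(production.keys() | recovery.keys()):
        prod_value = production.get(key, "<missing>")
        dr_value = recovery.get(key, "<missing>")
        if prod_value != dr_value:
            drift[key] = (prod_value, dr_value)
    return drift


# Hypothetical inventories: production has had a disk resized and a listener added.
production = {"data_disk_gb": 500, "listener_count": 2, "mft_version": "7.2"}
recovery = {"data_disk_gb": 250, "listener_count": 1, "mft_version": "7.2"}

drift = find_drift(production, recovery)
for key, (prod_value, dr_value) in drift.items():
    print(f"{key}: production={prod_value} recovery={dr_value}")
```

A check like this complements the recovery exercise rather than replacing it, since it can only find differences in the settings you thought to record.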
Ramifications of Not Being Disaster Ready
For many organisations, their MFT system is far more important than people realise. An unplanned outage will prevent goods from being shipped, orders from being placed and payments from being sent, which obviously has a negative impact not only financially, but also in other, less tangible ways that aren’t so easily quantified. How likely are customers to use your services in the future if they can’t rely on their availability now?
There’s also the certification aspect to consider. If you are considering ISO 27001 certification, you need to have a realistic plan in place, and to test and maintain it – neglecting this will result in an audit failure and the potential loss of certification if it has already been awarded.
Finally, the most important thing to do is document EVERYTHING. It should be possible for someone without specific knowledge of the system to follow every step. Every change should be recorded and every test detailed, regardless of success or failure.