Which resiliency option is best for your Managed File Transfer solution?
As a Managed File Transfer (MFT) solution becomes more business critical, an organisation needs to consider how to make their system more resilient to failure.
There are different ways of achieving this, depending on your chosen solution but before you think about adding resiliency there are a few things you will need to consider.
High Availability Vs Disaster Recovery
Highly Available (HA) Managed File Transfer systems typically rely on two or more nodes, which can each handle requests at the same time. When a single node fails, the other nodes carry on and can pick up the extra load from the failed node. In many cases MFT systems tend to use either a native clustering technology such as Windows Server Clustering or will have proprietary heartbeats which keep nodes aware of each other. When one node fails, only connections and transfers that are actively passing though that node are lost, while other transfers passing through other nodes are unaffected and keep being processed. The goal for a HA system is nearly zero downtime in the event of a failure.
Disaster Recovery (DR) systems are designed to provide resilience in the event of a more significant failure of the Managed File Transfer system. DR systems are typically based in a different location to the standard system and only accept connections and transfers when the main system is unavailable. Network routing, storage and database replication and other infrastructure changes may be needed to be completed before the DR system can be activated and this can lead to a period of service downtime while this takes place.
Active:Active or Active:Passive
When it comes to designing a HA system, most MFT solutions offer either an Active:Active or an Active:Passive configuration.
In an Active:Active configuration, two or more nodes are running and sharing resources such as storage and database links.
With Active:Passive HA configurations, all the load is passed through a single node and, when that fails, another node in the system will detect it and start all the services required to run the system. This will mean that all connections are lost in the event of failure but typically, service is resumed very quickly in the event of failure.
Some MFT solutions are now able to offer a hybrid HA/DR type configuration, where nodes can be sited in different data centres and able to accept connections independently of each other. They will typically write to storage which is either replicated in near real time or will deal with synchronising data between nodes independently of any infrastructure. This is becoming more popular as it reduces the footprint of the overall MFT solution while providing the benefits of HA and DR architectures. The downside to this is that not every MFT solution can support this architecture and that the network infrastructure required is more complicated.
Before deciding on a resilient architecture you need to consider some or all of the following factors.
HA and DR solutions need more architecture and are significantly more complicated to configure and manage. Additional servers, management interfaces and network resources all need to be factored in and available.
Global and Local load balancers are needed to route the traffic to the correct nodes. These are usually configured in either a “round robin” type of configuration where connections are passed to each node in turn, or for some more advanced load balancers connections are tracked and passed to the least busy node. In addition, networking routes and firewalls need to be configured for all nodes including passive or DR nodes. Specifically for DR, trading partners and end users may need to be able to send to alternative nodes.
3. Storage Replication
Typically HA solutions will share storage but for DR, solutions may need storage replication between data centres. If this is only happening on a schedule, then data may not be available when the DR system comes up, or in some cases, the system may not be able to come up until the data is replicated.
4. Database Clustering
In a similar way to data replication, databases for HA systems are typically run on replicated technology. If the same database is to be used between the data centres, synchronous database clustering would need to be used (rather than asynchronous approaches like mirroring or log shipping). This should be near instantaneous to avoid similar issues to Storage replication.
Forward and reverse proxies can be very helpful to shield any node failures from incoming connections, as they typically “front” the systems and should remain unaffected by any node failure. This can be a big advantage for using these gateways but care needs to be taken if in a DR configuration that these must be failed over too in the event of a server failure.
Monitoring tools should be in place checking a variety of factors such as services running, and node statuses. If these detect a failure, alerts should trigger and if possible automated failover scripts should be executed to bring up any additional nodes or resources.
Using virtualisation tool to achieve DR
In order to simplify the whole HA/DR architecture, many organisations now leverage virtualisation tools such as VMware Site Recovery Manager (SRM) or vSphere vMotion. These can handle the moving of the virtual server instance to another virtual environment with all associated resources.
If servers are physical, some solutions offer DR tools that can automatically trigger remote instances of the services using simple monitoring.
In mission critical B2B workflows achieving maximum uptime is of paramount importance. The type of resiliency your business requires may even determine your choice of solution, however this may also be driven by infrastructure limitations or simply the way in which the vendor charges for the additional nodes required to deliver.