FAULT-TOLERANT SYSTEMS

What is fault-tolerance?

Fault-tolerance describes a superior level of availability characterized by five-nines uptime (99.999%) or better. Fault-tolerant systems can deliver these levels of availability because they can “tolerate” or withstand both hardware and software “faults” or failures. They typically do this by proactively monitoring critical systems and preventing them from failing in the first place, or by completely mitigating the risk of a catastrophic component or system failure.
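
To put the five-nines figure in concrete terms, the short Python sketch below converts an availability percentage into the downtime it allows per year. The function name and the use of an average 365.25-day year are illustrative assumptions; at 99.999% the budget works out to roughly 5.26 minutes per year.

```python
# Assumption: an average year of 365.25 days (leap days included).
MINUTES_PER_YEAR = 365.25 * 24 * 60


def annual_downtime_minutes(availability_percent: float) -> float:
    """Return the downtime per year permitted by an availability target."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


# Compare three-, four-, and five-nines targets.
for target in (99.9, 99.99, 99.999):
    minutes = annual_downtime_minutes(target)
    print(f"{target}% uptime allows {minutes:.2f} minutes of downtime per year")
```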

Software-based vs. hardware-based fault-tolerance

Fault-tolerance can be achieved using both software-based and hardware-based approaches.

In a software-based approach, all data committed to disk is mirrored across redundant systems. More sophisticated software-based methods also replicate uncommitted data, or data in memory, to a redundant system. In the event of a primary system failure, a secondary backup system resumes operation, taking over from the moment the primary system fails so that no transactions or data are duplicated or lost.
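
The mirroring and takeover behavior described above can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not a description of any product's implementation: the class names (Node, MirroredPair) and the in-memory lists standing in for disks are hypothetical.

```python
class Node:
    """A server whose 'disk' is modeled as an ordered list of records."""

    def __init__(self, name):
        self.name = name
        self.log = []


class MirroredPair:
    """Toy model: every committed write lands on both nodes before it is acknowledged."""

    def __init__(self):
        self.primary = Node("primary")
        self.secondary = Node("secondary")

    def commit(self, record):
        # Write to both copies, then acknowledge with the record's position.
        self.primary.log.append(record)
        self.secondary.log.append(record)
        return len(self.primary.log)

    def fail_over(self):
        # The secondary already holds every acknowledged record, so it can
        # resume at the next position with nothing lost or duplicated.
        self.primary = self.secondary
        # A replacement node is resynchronized from the new primary.
        self.secondary = Node("replacement")
        self.secondary.log = list(self.primary.log)


pair = MirroredPair()
pair.commit("txn-1")
pair.commit("txn-2")
pair.fail_over()          # simulate loss of the original primary
pair.commit("txn-3")
print(pair.primary.name, pair.primary.log)
```

Because a write is acknowledged only after both copies hold it, the surviving node's log is always a complete record of acknowledged transactions.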

In a hardware-based approach, redundant systems run simultaneously. Parallel servers perform identical tasks, so if one server fails, the other continues to process transactions or deliver services. This approach relies on the probability of both systems failing at the same time being extremely low. Only one server is needed to deliver applications, but having two servers helps ensure that at least one will always be running.
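
The statistical argument can be made concrete with a small calculation. The sketch below assumes that server failures are independent, which real deployments only approximate (shared power or a common software defect can take down both servers), and the 99.9% per-server figure is a hypothetical input rather than a measured value.

```python
def parallel_availability(single_availability: float, replicas: int = 2) -> float:
    """Availability of a group that is up as long as at least one replica is up.

    Assumes independent failures: the group is down only when every replica is down.
    """
    unavailability = 1.0 - single_availability
    return 1.0 - unavailability ** replicas


# Hypothetical example: two servers that are each available 99.9% of the time.
single = 0.999
paired = parallel_availability(single, replicas=2)
print(f"single server: {single:.4%}, redundant pair: {paired:.4%}")
```

Under these assumptions, pairing two three-nines servers yields roughly six nines, because the pair is unavailable only when both servers are down at once.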

 

How everRun® Enterprise and ztC™ Edge deliver fault-tolerant workloads

Stratus everRun Enterprise software and Stratus ztC Edge computing platforms use software-based approaches to deliver fault-tolerant applications and protect data.

The main challenge with software-based approaches is efficiently replicating data while minimizing system overhead. Don’t replicate enough, and your recovery times increase. Replicate too often, and you use too much of your system resources to ensure availability.

everRun Enterprise and Stratus Redundant Linux, the operating platform that powers Stratus’ ztC Edge solution, replicate all data written to disk (for highly available workloads) and use a unique checkpointing engine to continuously copy data in memory and CPU states (for fault-tolerant workloads). All I/O operations are queued until checkpoints are completed and verified. Proprietary algorithms dynamically adjust checkpointing frequency based on the type and amount of data changes and on I/O throughput. If a node fails, a two-second pause prevents split-brain scenarios, and the resulting recovery time of less than five seconds is below the TCP/IP threshold for queueing and resubmitting requests.
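
The general idea of holding I/O until a checkpoint is verified can be sketched in a few lines of Python. This is a generic illustration of checkpoint-based replication, not the proprietary Stratus engine: the class CheckpointedVM, its method names, and the dictionary standing in for memory and CPU state are all hypothetical, and the adaptive checkpoint frequency is not modeled.

```python
import copy


class CheckpointedVM:
    """Toy model of checkpoint replication with output held until a checkpoint commits."""

    def __init__(self):
        self.state = {}          # stands in for memory and CPU state
        self.pending_io = []     # output queued since the last checkpoint
        self.released_io = []    # output the outside world has seen
        self.standby_state = {}  # last verified checkpoint on the standby node

    def execute(self, key, value):
        # Work changes local state, but its output is not yet visible externally.
        self.state[key] = value
        self.pending_io.append(f"ack {key}={value}")

    def checkpoint(self):
        snapshot = copy.deepcopy(self.state)
        standby_copy = copy.deepcopy(snapshot)  # stands in for the network transfer
        if standby_copy != snapshot:            # stands in for verification
            raise RuntimeError("checkpoint verification failed")
        self.standby_state = standby_copy
        # Only now is the queued output released.
        self.released_io.extend(self.pending_io)
        self.pending_io.clear()

    def fail_over(self):
        # The standby resumes from its last verified checkpoint. Unreleased
        # output is discarded, so nothing external depends on lost state.
        self.state = copy.deepcopy(self.standby_state)
        self.pending_io.clear()


vm = CheckpointedVM()
vm.execute("a", 1)
vm.checkpoint()
vm.execute("b", 2)   # not yet checkpointed
vm.fail_over()       # primary fails before the next checkpoint
print(vm.state, vm.released_io)
```

Queueing output until the checkpoint commits is what keeps the standby consistent with everything the outside world has observed. It also shows the trade-off: checkpointing more often shortens the time output waits in the queue but spends more resources on copying state.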

In addition to their unique, highly efficient checkpointing engine, Stratus solutions are differentiated by their operational simplicity. No application or guest operating system modifications are required to make them cluster-aware. No additional failover scripts are needed to ensure application availability and data integrity. All that’s required to make applications fault tolerant is to install them on a virtual machine and launch them.

 

How ftServer® delivers fault-tolerant workloads

Stratus ftServer uses a hardware-based approach to deliver fault-tolerant applications and data.

The main challenge with hardware-based approaches is precisely synchronizing processes and threads, so that the same things happen simultaneously on both nodes of a redundant system.

Stratus ftServer uses proprietary field-programmable gate arrays (FPGAs) to ensure lock-step processing across two identical halves of an ftServer system. The two equivalent customer-replaceable units (CRUs) run in parallel. Each acts as the primary or secondary server as needed. Each executes the same process at the same time. With ftServer, there is no recovery time when there’s a failure in a single component or CRU. The available CRU takes over as the primary server until the unavailable CRU is replaced. For organizations that cannot tolerate even a second of unplanned downtime, Stratus ftServer is a viable option.
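
Lock-step execution is implemented in hardware, but the principle can be sketched in software. The model below is a conceptual illustration only, with hypothetical names (Replica, lockstep_step): both replicas apply the same operation at each step, their results are compared, and if one replica is unavailable the other simply carries on with no recovery phase.

```python
class Replica:
    """One half of a redundant pair; keeps a running total as its 'state'."""

    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.total = 0

    def step(self, value):
        self.total += value
        return self.total


def lockstep_step(replicas, value):
    """Apply the same operation on every healthy replica and compare results."""
    results = [r.step(value) for r in replicas if r.healthy]
    if not results:
        raise RuntimeError("no healthy replica available")
    if len(set(results)) > 1:
        raise RuntimeError("replicas diverged")
    return results[0]


pair = [Replica("cru-a"), Replica("cru-b")]
lockstep_step(pair, 1)
lockstep_step(pair, 2)
pair[0].healthy = False            # simulate a failed unit
result = lockstep_step(pair, 3)    # the survivor continues from identical state
print(result)
```

Because both replicas already hold identical state at every step, losing one requires no state transfer or restart, which is why this style of redundancy has no recovery time.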

In addition to using FPGAs and a lock-step approach, Stratus ftServer is differentiated by its operational simplicity. Applications, virtualization platforms, and guest operating systems installed on ftServer do not require special modification or configuration to make them fault tolerant.