Fault Tolerance

Fault tolerance is the ability of a system to operate correctly in the presence of faults. People have become dependent enough on some systems that they are described as life-critical: if the system fails, someone's life is in danger. A failure is when a system ceases to provide one or more of its primary services. A primary service for a spaceship might be something like "Provide a certain volume of breathable air under all conditions." It is plain to see that if this objective is not taken seriously, someone is going to die.


To build a life-critical system it is important to understand the conditions under which it must operate.

Valid State

When a system is operating normally it is said to be in a valid state. A system may have many valid states and normally moves from one to another without problems. The system must be designed to tolerate perturbations in its components and in its environment.

Faults

When the system or one of its components cannot tolerate a condition, it leaves its valid state and transitions to a fault state. Faults are bad, but they are not failures. The job of a fault-tolerant system is to correct faults before the system fails.

Fault Detection

Fault detection occurs when a system or one of its components senses that something is not valid. Much of system design is dedicated not to how the system operates under normal conditions but to how it responds to possible but unlikely conditions. Part of detecting faults is logging them so that they can be inspected after the fact. A flight data recorder is just such a device for logging faults.
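Detection and logging can be sketched in a few lines. This is only an illustration: the cabin-pressure sensor, the valid range, and the name check_reading are all assumptions made up for the example, not part of any real system.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("fault_log")

# Hypothetical valid range for cabin pressure, in kilopascals.
VALID_KPA = (95.0, 105.0)

def check_reading(kpa, fault_log):
    """Return True if the reading is valid; otherwise record a fault."""
    low, high = VALID_KPA
    if low <= kpa <= high:
        return True
    # Keep a record so the fault can be inspected after the fact.
    fault_log.append({"sensor": "cabin_pressure", "value": kpa})
    log.warning("cabin pressure out of range: %.1f kPa", kpa)
    return False

faults = []
check_reading(101.3, faults)   # valid, nothing recorded
check_reading(80.0, faults)    # fault detected and recorded
```

The in-memory list stands in for whatever durable record a real system would keep.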

Damage Containment

Once a fault is detected, the system must assess the extent of the damage and contain the fault to the smallest possible safe scope. A space station should be set up so that a breach in any part of its hull causes the airlocks to seal off the faulty chamber automatically, or at least alerts the crew to where the fault is and recommends which locks should be secured.
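The space station example amounts to a lookup: given the breached chamber, which doors border it? Here is a rough sketch, assuming a made-up station layout. The chamber names, door names, and the function doors_to_seal are hypothetical.

```python
# Hypothetical station layout: each door connects two chambers.
DOORS = {
    "door_a": ("lab", "corridor"),
    "door_b": ("corridor", "quarters"),
    "door_c": ("corridor", "dock"),
}

def doors_to_seal(breached, doors=DOORS):
    """Return the doors that border the breached chamber, sorted by name."""
    return sorted(name for name, ends in doors.items() if breached in ends)

print(doors_to_seal("lab"))       # ['door_a']
print(doors_to_seal("corridor"))  # ['door_a', 'door_b', 'door_c']
```

A breach in the lab costs one door and one chamber. A breach in the corridor costs three doors, which is the kind of thing a designer wants to know before the hull is built.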

Restoral

Once the damage has been confined, the system may be able to find an alternative means of providing the service, such as a backup subsystem or a way to continue operating in an attenuated or degraded fashion. Restoral places the system back into a valid state. The restoral phase of fault tolerance is what drives engineers to include backup capabilities in their systems. The clock starts when a fault is detected and runs until restoral is complete. If restoral is not quick enough, the system may fail even though it is in the process of recovering. The restored system can continue to operate until such time as it can be repaired.
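The race between restoral and failure can be sketched as a loop over backups with a deadline. This is a simplified illustration; restore_service, its parameters, and the five-second deadline are assumptions for the example.

```python
import time

def restore_service(backups, deadline_s, clock=time.monotonic):
    """Try each backup in order; give up once the deadline has passed.

    `backups` is a list of zero-argument callables that return True on
    success. Returns the index of the backup that worked, or None if
    the service could not be restored in time.
    """
    start = clock()  # the clock starts when the fault is detected
    for index, start_backup in enumerate(backups):
        if clock() - start > deadline_s:
            return None  # too slow: the fault has become a failure
        if start_backup():
            return index
    return None

# The primary has faulted; the first backup is dead, the second works.
result = restore_service([lambda: False, lambda: True], deadline_s=5.0)
print(result)  # 1
```

Passing the clock in as a parameter makes the deadline behavior easy to exercise in a test without actually waiting.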

Repair

Usually manual intervention is required to repair or replace failed components. That’s why they have space docks in the future.

Fault Tolerance Observed

Here are some things I have learned about fault tolerance. Back in the early 80s I wrote an embedded application for a microcontroller installed in Telco central offices. The box had 23 printed circuit boards consisting of memory, CPU, serial ports and modems in a wire wrapped backplane. Main memory and the CPU were on different boards. The embedded software was about 18,000 lines of assembly code. One of the functions of the application was to collect plant messages from voice switches of any type. It would scan the messages for alarms and upon detecting a major alarm, would dial out and forward all received messages to a Network Operations Center (NOC). It was important that this application be fault tolerant.

In the lab we could not make it break. But after installing this box in several offices for one independent phone company, the quality analyst informed me that the box was resetting every 3 days. This would cause it to lose all its collected data and cold start. The offices were unmanned, and the fact that the box reset was good (as opposed to rolling over on its back), but it might lose an hour's performance data and an indeterminate amount of plant data. There seemed to be some electrical condition that occurred every couple of days that we could not identify.

I was asked to make the box "Fault Tolerant". After some study, I made a few changes to the task monitor. One change was to detect stack overflows of individual tasks. In fault tolerance jargon this is called fault detection and damage confinement. Having detected an overflow, we soft reset only the affected task, without destroying any data unnecessarily. This is called restoral. There was no possibility of repair since there was no permanent damage. Another change was to checksum the configuration parameters and reset them only if their checksums were wrong. This was another damage confinement measure. One of the configurable parameters was the phone number of the NOC. Previously, on a hard reset the phone number would revert to our product support line, and the technical assistance center would have to call the box and reset the phone numbers and other parameters.
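The checksum idea is easy to show in miniature. The original was assembly code on a microcontroller; what follows is only a Python sketch of the same idea, using CRC-32 as a stand-in for whatever checksum the box actually used. The phone numbers and the names seal and params_after_reset are invented for the example.

```python
import zlib

# Hypothetical factory defaults (standing in for the support line number).
DEFAULTS = {"noc_phone": "555-0100"}

def seal(params):
    """Compute a CRC-32 over the parameters in a stable order."""
    blob = repr(sorted(params.items())).encode("utf-8")
    return zlib.crc32(blob)

def params_after_reset(params, stored_crc):
    """Keep the stored parameters if their checksum still matches."""
    if seal(params) == stored_crc:
        return params          # intact: no need to throw them away
    return dict(DEFAULTS)      # corrupted: fall back to defaults

config = {"noc_phone": "555-0142"}
crc = seal(config)
print(params_after_reset(config, crc))                     # config survives
print(params_after_reset({"noc_phone": "555-9999"}, crc))  # reverts to defaults
```

The point is that a reset no longer implies lost configuration. Only parameters that fail their checksum are thrown away.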

Two things happened when these changes were tested. I have witnesses who can verify these statements. One day while we were performing a load test, I pulled the CPU card out. Everything stopped. I waited a few moments then pushed the CPU card back in. The machine did not crash, it did not reset, it kept going from where it left off. My lab tech and I could not believe it. I have never seen a machine with one CPU do this before or since that project. It had been dumping reports on multiple ports and resumed where it left off and no data was lost. Zero, zip, nada.

I thought about it for a long time to figure out how it could possibly do this. The CPU registers decay within a few milliseconds of losing power, and there were no batteries on this CPU. I suppose the power interrupt caused the registers to be pushed onto the stack on loss of power and subsequently restored when power returned. The statistics piled up. This little box went from an MTBF of 3 days to 3.5 years across all its hundreds of installations. We made the box tolerate just a few abnormal conditions and its stability improved roughly 400-fold (3.5 years is about 1,280 days, against 3).




Copyright Spidel Tech Solutions, Inc. 2004 All Rights Reserved. Updated: 9/27/2008
