Fault tolerance is the ability of a system to continue to carry out its intended function despite faults. In a broad sense, fault tolerance is associated with reliability, with successful operation, and with the absence of breakdowns. The ultimate goal of fault tolerance is to develop a reliable system. As society becomes increasingly dependent on computer systems, the reliability of these systems becomes a critical question. In airplanes, chemical plants, heart pacemakers, and other safety-critical applications, a system failure can cost lives or cause an environmental disaster.

There are different approaches to achieving fault tolerance. Common to all of them is a certain amount of redundancy. This can be a replicated hardware component, an additional check bit attached to a string of digital data, or a few lines of program code verifying the correctness of the program's results.

In this course, we will study fault tolerance in both hardware and software. The rapid development of real-time computing applications that started around the mid-1990s, especially the demand for software-embedded intelligent devices, made software fault tolerance a pressing issue.

The following is a tentative list of topics to be covered:
- Introduction
  - Definition of fault tolerance
  - Redundancy
  - Applications of fault tolerance
- Basic principles of reliability
  - Attributes: reliability, availability, safety
  - Impairments: faults, errors, and failures
  - Means: fault prevention, removal, and forecasting
- Reliability evaluation
  - Common measures: failure rate, mean time to failure, mean time to repair, etc.
  - Reliability block diagrams
  - Markov processes
- Hardware redundancy
  - Redundancy schemes
  - Evaluation, comparison, and applications
- Information redundancy
  - Codes: linear, Hamming, cyclic, unordered, arithmetic, etc.
  - Encoding and decoding techniques and applications
- Time redundancy
- Software fault tolerance
  - Specific features
  - Software fault tolerance techniques: N-version programming, recovery blocks, self-checking software, etc.
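
To make the three forms of redundancy named in the opening paragraph concrete, the following is a minimal Python sketch, not course material. It shows a majority vote over replicated outputs, an even-parity check bit, and a few lines of code that verify a result. All function names (`majority_vote`, `add_parity`, `parity_ok`, `checked_sqrt`) and the tolerance value are assumptions chosen for illustration.

```python
import math
from collections import Counter


def majority_vote(outputs):
    """Replicated components: return the value that a strict majority agrees on."""
    value, count = Counter(outputs).most_common(1)[0]
    if count <= len(outputs) // 2:
        raise ValueError("no majority among replicas")
    return value


def add_parity(bits):
    """Check bit: append an even-parity bit to a list of 0/1 data bits."""
    return bits + [sum(bits) % 2]


def parity_ok(word):
    """Return True if the word (data bits plus check bit) has even parity."""
    return sum(word) % 2 == 0


def checked_sqrt(x, compute=math.sqrt):
    """Result verification: accept compute(x) only if its square is close to x."""
    result = compute(x)
    # Illustrative relative tolerance; a real acceptance test would justify it.
    if abs(result * result - x) > 1e-9 * max(1.0, x):
        raise ArithmeticError("acceptance test failed")
    return result
```

With three replicas, `majority_vote([42, 42, 7])` masks the single faulty output. A single parity bit detects any single-bit error in the word but cannot locate or correct it, which is one motivation for the richer codes listed under information redundancy.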