Fault-Tolerant Software Architectures


"Doppelt genäht hält besser!" ("Double-stitched holds better!")

The lectures present systematic techniques for deriving both deterministic statements about which predefined fault classes a redundant strategy can tolerate and probabilistic measures of dependability attributes (reliability, availability) for predefined architectures and programs. The theoretical treatment of these topics is complemented by reports on real-world industrial development and licensing practice.

1. Introduction
2. Classical Reliability Theory
3. System Structures
4. Common Failure of Diverse Software
5. Back-to-Back-Testing of Diverse Software
6. Forced Diversity
7. Checkpointing
8. Guidelines for the development of fault-tolerant software

“If I had two necks ...”
Christina of Denmark turning down a proposal by Henry VIII

Extended Contents

1. Introduction
Overview, Terminology: mistake, fault, error, failure
dependability: reliability, availability, safety, security
persistence: permanent and temporary faults, permanent and intermittent failures
failing behaviour: omission failures, crash failures, fail-silent behaviour
failure consistency: interactive consistency ("Byzantine generals")
fault handling measures: avoidance, detection, diagnosis, removal, prediction
fault tolerance measures: exclusion, masking, recovery
homogeneous and diverse redundancy
redundancy types: structural, functional, information and time redundancy
activation of redundancy: static and dynamic
cold and hot reserve, hybrid redundancy
N-Version Programming, Recovery Block Programming
Triple Modular Redundancy (TMR)
relative test and absolute test
voting strategies: matching comparison, median, average
voting granularity, level of detail of the acceptance test
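
As an illustration of the three voting strategies listed above, here is a minimal Python sketch. It assumes that "matching comparison" means an exact-match strict majority over the version outputs; the function names and sample values are hypothetical and not taken from the lecture material.

```python
from collections import Counter
from statistics import mean, median


def majority_vote(outputs):
    """Exact-match voting: return the value a strict majority agrees on, else None."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None


def median_vote(outputs):
    """Median voting: a single outlier among three versions is masked."""
    return median(outputs)


def average_vote(outputs):
    """Average voting: every version, including a faulty one, shifts the result."""
    return mean(outputs)


# Hypothetical outputs of three diverse versions; the third one is faulty.
results = [10.0, 10.0, 97.5]
voted = (majority_vote(results), median_vote(results), average_vote(results))
```

With these sample values, majority and median voting both mask the faulty third output, whereas the average is pulled towards it.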

2. Classical Reliability Theory
terminology, failure rate, constant and non-constant failure rate
dependability measures for systems without and with repair
lifetime, reliability, availability, mean time to failure (MTTF),
mean time to repair (MTTR), mean time between failures (MTBF)
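
The measures named above can be related by a few standard textbook formulas. The sketch below assumes a constant failure rate (exponentially distributed lifetime) and uses the common convention MTBF = MTTF + MTTR; the function names and numeric values are illustrative only.

```python
import math


def reliability(t, failure_rate):
    """R(t) = exp(-lambda * t): probability of surviving until t at constant failure rate."""
    return math.exp(-failure_rate * t)


def mttf(failure_rate):
    """Mean time to failure for a constant failure rate: MTTF = 1 / lambda."""
    return 1.0 / failure_rate


def mtbf(mttf_hours, mttr_hours):
    """Mean time between failures, under the convention MTBF = MTTF + MTTR."""
    return mttf_hours + mttr_hours


def steady_state_availability(mttf_hours, mttr_hours):
    """Long-run fraction of time a repairable system is up: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical component: fails on average every 990 h, repair takes 10 h.
availability = steady_state_availability(990.0, 10.0)
```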

3. System Structures
definition, examples, parts count method
decomposition lemma, multi-linear form, monotonic structure functions, properties
reliability graph, examples, comparison with physical connection diagram
redundancy at system level and at component level, combinatorial models
examples: error-correcting codes, aircraft control
comparison TMR vs. single system
duplex structure, "dual-dual", "pair & spare"
voter reliability and fault coverage
Markov processes: homogeneity, stationarity
examples: maintenance teams, Markov model for Triple Modular Redundancy
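
The comparison of TMR with a single system can be sketched with the standard combinatorial model: with three independent, identical modules of reliability R and a 2-out-of-3 voter, the system works if at least two modules work. The code below assumes independent module failures; the names and the voter reliability parameter are illustrative, not taken from the lecture.

```python
def tmr_reliability(module_reliability, voter_reliability=1.0):
    """2-out-of-3 system of independent modules: R_v * (3 R^2 - 2 R^3)."""
    r = module_reliability
    return voter_reliability * (3 * r**2 - 2 * r**3)


def tmr_beats_single(module_reliability, voter_reliability=1.0):
    """True if TMR is more reliable than one module on its own."""
    return tmr_reliability(module_reliability, voter_reliability) > module_reliability


# Hypothetical module reliabilities for the comparison.
good_modules = tmr_reliability(0.9)   # TMR improves on a single module
poor_modules = tmr_reliability(0.4)   # TMR is worse than a single module
```

Under these assumptions and with a perfect voter, TMR only pays off when the module reliability exceeds 0.5; below that threshold the redundancy makes the system less reliable than a single module.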

4. Common Failure of Diverse Software
theoretical approach: model by Eckhardt & Lee
experimental approach by Knight & Leveson: Launch Interceptor Problem
test of failure independence hypothesis
test of fault independence hypothesis (Brilliant, Knight, Leveson)
further example: Package Shipment System
estimation of maximal inaccuracy due to independence assumption

5. Back-to-Back-Testing of Diverse Software
failure sets of two-fold diverse software
pros and cons of back-to-back-testing
statistical sampling theory
theoretical analysis of back-to-back-testing
experimental analysis of back-to-back-testing (Knight / Leveson), overlap ratio
examples: Automatic Landing Problem, Project on Diverse Software
Combat Simulation Problem, Communication Protocol

6. Forced Diversity
fault causes, concept of forced diversity
graphical representation of diversity aspects
fault classification by origin
functional diversity, data diversity and time diversity, diversity of software environment
quantitative influence of forced diversity
extension of Eckhardt & Lee's theory by Littlewood & Miller
example: Airport Scheduler
forced diversity and voter granularity
example: Project on Diverse Software (PODS)

7. Checkpointing
location of checkpoints, information reduction, failure masking
information reduction and failure dependence
example: binary disjunction
dynamic checkpointing

8. Guidelines for the development of fault-tolerant software
overview: procedure
Phase I: analysis of inherent application properties
Phase II: identification of goals to be achieved
Phase III: determination of degrees of freedom
Phase IV: decision on the application of diversity
Phase V: diverse development
overview: phases and decision steps