
Example: RAID disk configurations

(Redundant Array of Independent Disks)

8.33. 2. Fault tolerance for transient hardware faults

• Basic approach: Software supported fault tolerance

• Repeated execution will avoid transient faults

• The handling of fault effects is important

• Transient faults are handled by setting a fault-free state and continuing the execution from that state (potentially with repeated execution)

• Four phases of operation:

1. Error detection

2. Damage assessment

3. Recovery from the erroneous state

4. Fault treatment and continuing service

8.34. The four phases of operation 1/4

1. Error detection:

• Hardware supported techniques:

• E.g., detecting illegal instructions at CPU level

• E.g., detecting violation of memory access restrictions

• Application dependent techniques:

• Acceptance checking

• Timing related checking

• Cross-checking

• Structure checking

• Diagnostic checking

• ...

8.35. The four phases of operation 2/4

2. Damage assessment:

• Motivation: Errors can propagate among the components between the occurrence and detection of errors

• Limiting error propagation: Checking interactions

• Input acceptance checking (to detect external errors)

• Output credibility checking (to provide "fail-silent" operation)

• Checking and logging resource accesses and communication

• Estimation of components affected by a detected error

• Analysis of interactions (during the latency of error detection)

8.36. The four phases of operation 3/4

3. Recovery from an erroneous state

• Backward recovery:

• Restoring a prior error-free state (saved earlier)

• Independent of the detected error and estimated damage

• State shall be saved and restored for each component

• Compensation:

• The error can be handled by using redundant information

8.37. Types of recovery

• State space of the system (example): Error detection

8.38. Types of recovery

• State space of the system: Forward recovery

8.39. Types of recovery

• State space of the system: Backward recovery

8.40. Types of recovery

• State space of the system: Compensation

8.42. Backward recovery

• Based on saved state

• Checkpoint: The saved state

• Checkpoint operations:

• Saving the state: periodically, after messages; into stable storage

• Recovery: restoring the state from the stable storage to memory

• Discarding: after having more recent saved state(s)

• Analogy: "autosave"

• Based on operation logs

• Error to be handled: unintended operation

• Recovery is performed by the withdrawal of operations

• Analogy: "undo"

• It is possible to combine the two mechanisms

8.43. Scenarios of backward recovery

8.44. Checkpoint intervals

Aspects of optimizing checkpoint intervals:

• Stable storage is slow (execution overhead) and has limited capacity

• Computation is lost after the last checkpoint

• Long error detection latency increases the chance of damaged checkpoints

8.45. Example: User configured checkpointing

Challenges

• Parameters: Length of checkpoint intervals, number of checkpoints, reliability of the checkpoint storage

8.46. Example: User configured checkpointing

• Declaration of a checkpoint vector:

• Declaration of a checkpoint file:

8.47. Example: User configured checkpointing

• Program data to be saved (example):

• Initialization of the checkpoint vector:

• Restoring the checkpoint:

• Similarly, using the fread() library call

8.49. Example: Saving the state of the CPU

• The setjmp() and longjmp() library calls can be used

• Generic "go to" by saving and restoring the CPU state

• After longjmp(), execution continues as if the corresponding setjmp() call had just returned

• The return value is the second parameter of the longjmp() call

• Saving and restoring checkpoints can be distinguished:

• Successful saving of checkpoint: 0 return value

• Restoring checkpoint: the second parameter as the return value

• Similar calls that also save the signal mask:

• sigsetjmp(), siglongjmp()

8.50. Example: Saving the state of the CPU

• Recording the CPU state:

• Saving the recorded CPU state:

• Reading the saved CPU state:

• Restoring the saved state:

8.51. Rollback recovery in distributed systems

Messages influence the consistency of the system-level state formed by the local checkpoints:

In-transit message:

In the saved state the message

• is already sent, and

• is not processed yet

The message shall be saved!

Inconsistent message:

In the saved state the message

• is not sent yet, but

• is already processed

This situation shall be avoided!

8.52. Coordinated checkpointing in distributed systems

8.53. Coordinated checkpointing in distributed systems

8.54. Coordinated checkpointing in distributed systems

8.55. The four phases of operation 4/4

4. Fault treatment and continuing service

• Transient faults:

• Handled by the forward or backward recovery

• Permanent faults: Recovery becomes unsuccessful (the error is detected again); the faulty component shall be localized and handled:

• Diagnostic checks to localize the fault

• Reconfiguration

• Fault tolerance: Replacing the faulty component using redundancy

• Degraded operation: Continuing only the safety related services

• Repair and substitution

8.56. 4. Fault tolerance for software faults

• Repeated execution is not effective for design faults

8.57. N-version programming

• Active redundancy: Each variant is executed (in parallel)

• The same inputs are used

• Majority voting is performed on the output

• Acceptable range of difference shall be specified

• The voter is a single point of failure

8.58. Recovery blocks

• Passive redundancy: Activation only in case of faults

• The primary variant is executed first

• Acceptance checking performed on the output of the variants

• In case of a detected error another variant is executed


8.61. Comparison of the techniques

8.62. Example: Airbus A-320, self-checking blocks

• Pair-wise self-checking execution

• Primary pair is active, switch-over in case of a fault

• Permanent hardware fault: The pair with the repeatedly detected fault switches off

8.63. Summary

8.64. Software architecture design in standards

• IEC 61508: Functional safety in electrical / electronic / programmable electronic safety-related systems

• Software architecture design

8.65. Redundancy in space (resources) and time

8.66. Costs of redundancy and operation (faults)

8.67. Summary: Types of redundancy

1. Hardware redundancy

• Replicas are used to tolerate permanent faults

2. Software redundancy

• Variants (NVP, RB) are used to tolerate design faults

• Software is used to tolerate transient hardware faults:

• Forward recovery

• Backward recovery

3. Information redundancy

8.68. Summary: Techniques of fault tolerance

1. Hardware design faults

• Diverse redundant components are used

2. Hardware permanent operational faults

• Replicated components are used

3. Hardware transient operational faults

• Software techniques for fault tolerance:

1. Error detection

2. Damage assessment

3. Forward or backward recovery (or compensation)

4. Fault treatment

• Information redundancy: Error correcting codes

• Time redundancy: Repeated execution

4. Software design faults

• Variants as diverse redundant components (NVP, RB)

9. Testing: Test design and testing process

9.1. Testing: Test design and testing process

9.2. Overview

• Testing basics

• Goals and definitions

• Test design

• Specification based (functional, black-box) testing

• Structure based (white-box) testing

• Testing process

• Module testing

• Integration testing

• System testing

• Validation testing

9.3. Testing and test design in the V-model

9.4. Goals of testing

Testing:

• Running the program in order to detect faults

Exhaustive testing:

• Running the program in all possible ways (inputs)

• Hard to implement in practice

Observations:

• Dijkstra: Testing is able to show the presence of faults, but not the absence of faults.

• Hoare: Testing can be considered as part of an inductive proof: if the program runs correctly for a given input, then it will run similarly correctly for similar inputs.

9.5. Test environment: System testing

9.6. Test environment: Module testing

9.7. Tests and faults

• Typical software faults

• Algorithm faults: addressed by (formal) verification

• Coding faults: addressed by automated code generation; typical examples: initialization, boundaries, control flow, ...

• Performance / timing faults

• Deficiencies of the user interface, ...

• Basic concepts

• Fault model: Set of faults to be detected by the test suite

• Test for a fault: The output differs from the expected one

• Detectable fault: There is at least one test for it

• Fault coverage: The ratio of the number of faults detected by a given test suite and the number of all faults

9.8. Practical aspects of testing

• Testing costs may reach 50% of the development costs!

• Testing embedded systems:

• Cross-development (different platforms)

• Platform related faults shall be considered (integration)

• Performance and timing related testing are relevant

• Testing safety-critical systems:

• Functional/black box testing (D3):

9.10. Testing in the standards (here: EN 50128)

• Performance testing (D6):

Test design

How can test data be selected?

9.11. Test approaches

1. Specification based (functional) testing

• The system is considered as a "black box"

• Only the external behaviour (functionality) is known (the internal behaviour is not)

• Test goals: checking the existence of the specified functions and the absence of extra functions

2. Structure based (white-box) testing

• The system is considered as a "white box"

• The internal structure (source) is known

• Test goals: coverage of the internal behaviour (e.g., program graph)

9.12. I. Specification based (functional) testing

Goals:

• Based on the functional specification,

• find representative inputs (test data)for testing the functionality.

Overview of techniques:

1. Equivalence partitioning

2. Boundary value analysis

3. Cause-effect analysis

4. Combinatorial techniques

9.13. 1. Equivalence partitioning

Input and output equivalence classes:

• Testing one representative input from each class; correct behaviour for the remaining inputs follows from the principle of induction

Test data selection is a heuristic procedure:

• Input data triggering the same service

• Valid and invalid input data: valid and invalid equivalence classes

• Invalid data: Robustness testing

9.14. Equivalence classes (partitions)

• Classic example: Triangle characterization program

• Inputs: Lengths of the sides (here 3 integers)

• Outputs: Equilateral, isosceles, scalene

• Test data for equivalence classes

• Equilateral: 3,3,3

• Isosceles: 5,5,2

• Similarly for the other sides

• Scalene: 5,6,7

• Not a triangle: 1,2,5

• Similarly for the other sides

• Just not a triangle: 1,2,3

• Invalid inputs

• Zero value: 0,1,1

• Negative value: -3,-5,-3

• Not an integer: 2,2, 'a'

• Combinations of equivalence classes shall be covered systematically

• Weak and strong equivalence classes:

9.16. 2. Boundary value analysis

• Examining the boundaries of data partitions

• Focusing on the boundaries of equivalence classes

• Input and output partitions are also examined

• Typical faults to be detected: Faulty relational operators, conditions in cycles, size of data structures, ...

• Typical test data:

• A boundary requires 3 tests:

• A partition requires 5-7 tests:

9.17. 3. Cause-effect analysis

• Causes: input equivalence classes

• Effects: output equivalence classes

• Boole-graph: relations of causes and effects

• AND, OR relations

• Invalid combinations

• Decision table: Covering the Boole-graph

• Truth table based representation

• Columns represent test data

9.18. Cause-effect analysis

9.19. 4. Combinatorial techniques

• Several input parameters

• Failures are caused by (specific) combinations

• Exhaustive testing of all combinations requires too many test cases

• Rare combinations may also cause failures

• Basic idea: N-wise testing

• For each n parameters, testing all possible combinations of their potential values

• Special case (n = 2): pairwise testing

9.20. Example: pair-wise testing

9.21. Additional techniques

• Finite automaton based testing

• The specification is given as a finite automaton

• Typical test goals: to cover each state, each transition, invalid transitions, ...

• Use case based testing

• The specification is given as a set of use cases

• Each use case shall be covered by the test suite

• Random testing

• Easy to generate (but evaluation may be more difficult)

• Low efficiency

9.22. II. Structure based testing

• Internal structure is known:

• It has to be covered by the test suite

• Goals: There shall remain no

• statement,

• decision,

• execution path

• in the program that was not executed during testing

9.24. The internal structure

• Well-specified representation:

• Model-based: state machine, activity diagram

• Source code based: control flow graph (program graph)

9.25. Test coverage metrics

Characterizing the quality of the test suite: which portion of the testable elements was tested?

9.26. 1. Statement coverage

Definition:

Does not take into account branches without statements

Statement coverage: 80%

Statement coverage: 100%

9.27. 2. Decision coverage

Definition:

Does not take into account all combinations of conditions!

Decision coverage: 50%

Decision coverage: 100%

9.28. 3. Multiple condition coverage

Definition:

Strong, but complex:

For n conditions, 2^n test cases may be necessary!

9.29. Other coverage criteria

MC/DC: Modified Condition/Decision Coverage

• It is used in the standard DO-178B to ensure that Level A (Catastrophic) software is tested adequately

• It is a form of exhaustive testing, in that during testing all of the below must be true at least once:

• Each decision tries every possible outcome

• Each condition in a decision takes on every possible outcome

100% path coverage implies:

• 100% statement coverage, 100% decision coverage

• 100% multiple condition coverage is not implied

Path coverage: 80%

Statement coverage: 100%

9.31. A structure based testing technique

• Goal: Covering independent paths

• Independent paths from the point of view of testing: there is a statement or decision in each path that is not included in any other path

• The maximal number of independent paths:

• CK, the cyclomatic complexity: CK = E - N + 2 (E edges, N nodes of the program graph)

9.33. Generating structure based test sequences

• Algorithm:

• Selecting max. CK independent paths

• Generating inputs to traverse the paths, each after the other

• Problems:

• Not all paths can be traversed (see conditions)

• Is it possible to generate a proper input sequence?

• Is it possible to set the internal variables in a proper way to traverse the selected path?

• Cycles: Traversal shall be limited (minimized)

• There are no fully automated tools to generate test sequences for path coverage

9.34. Data flow based test criteria

• Basic idea: To check the

• definition (value assignment) and

• use of the variables

• Labeling the program graph:

• def(v): definition of variable v (by assigning a value)

• use(v): using variable v

• p-use(v): using v in a predicate (for a decision)

9.36. Execution of test cases

Execution order of the test cases:

First the more efficient tests (i.e., those that offer higher fault coverage than the others)

• Covering longer paths,

• Covering more difficult decisions

Testing process

What are the typical phases of testing?

How to test complex systems?

9.37. Relation to the development process

1. Module testing

2. Integration testing

3. System testing

4. Validation testing

• Testing user requirements

• Environment simulation

9.38. 1. Module testing

• Modules:

• Logically separated units

• Well-defined interfaces

• OO paradigm: Classes (packages, components)

• Module call hierarchy (in ideal case):

9.39. Module testing

• Lowest level testing

• Integration phase is more efficient if the modules are already tested

• Modules can be tested separately

• Handling complexity

• Debugging is easier

• Testing can be parallel for the modules

• Complementary techniques

• Test executor and test stubs are required

• Integration is not supported

9.41. Regression testing

Repeated execution of test cases:

• In case when the module is changed

• Iterative software development,

• Modified specification,

• Corrections, ...

• In case when the environment changes

• Changing of the caller/called modules,

• Changing of platform services, ...

• Goals:

• Repeatable, automated test execution

• Identification of functions to be re-tested

9.42. 2. Integration testing

Testing the interactions of modules

• Incremental testing: stepwise integration of modules

9.43. "Big bang" testing

• Integration of all modules and testing using the external interfaces of the integrated system

• External test executor

• Based on the functional specification of the system

• To be applied only in case of small systems

9.44. Top-down integration testing

• Modules are tested from the caller modules

• Stubs replace the lower-level modules that are called

• Requirement-oriented testing

• Module modification: modifies the testing of lower levels

9.45. Bottom-up integration testing

• Modules use already tested modules

• Test executor is needed

• Testing is performed in parallel with integration

• Module modification: modifies the testing of upper levels

9.46. Integration with the runtime environment

• Motivation: It is hard to construct stubs for the runtime environment

• Platform services, RT-OS, task scheduler, ...

• Strategy:

1. Top-down integration of the application modules down to the level of the runtime environment

2. Bottom-up testing of the runtime environment

• Isolation testing of functions (if necessary)

9.47. 3. System testing

• Testing aspects:

• Data integrity

• User profile (workload)

• Checking application conditions of the system (resource usage, saturation)

• Testing fault handling

9.48. Types of system tests

9.49. 4. Validation testing

• Goal: Testing in real environment

• User requirements are taken into account

• Non-specified expectations come to light

• Reaction to unexpected inputs/conditions is checked

• Events of low probability may appear

• Timing aspects

9.50. Summary

• Testing techniques

• Specification based (functional, black-box) testing

• Equivalence partitioning

• Boundary value analysis

• Cause-effect analysis

• Structure based (white-box) testing

• Coverage metrics and criteria

• Testing process

• Module testing

• Integration testing

• Top-down integration testing

• Bottom-up integration testing

• System testing

• Validation testing

10. Hazard analysis

10.1. Hazard analysis

• Frequency of occurrence

• Severity of consequences

• Hazard catalogue

• Risk matrix

• These results form the basis for risk reduction

10.3. Categorization of the techniques

• On the basis of the development phase (tasks):

• Design phase: Identification and analysis of hazards

• Delivery phase: Demonstration of safety

• Operation phase: Checking the modifications

• On the basis of the analysis approach:

• Cause-consequence view:

• Forward (inductive): Analysis of the effects of fault/events

• Backward (deductive): Analysis of the causes of hazards

• System hierarchy view:

• Bottom-up: From the components (subsystems) to system level

• Top-down: From the system level towards the components

10.4. Techniques of hazard analysis

1. Checklists

2. Fault Tree Analysis

3. Event Tree Analysis

4. Cause-Consequence Analysis

5. Failure Modes and Effects Analysis (FMEA)

10.5. 1. Checklists

• Basic approach

• Collection of experiences about typical faults and hazards

• Used as guidelines and as "rule of thumb"

• Advantages

• Known sources of hazards are included

• Well-proven ideas and solutions can be applied

• Disadvantages

• Completeness is hard to achieve (checklist is incomplete)

• False confidence about safety

• Applicability in different domains than the original domain of the checklist is questionable

10.6. Example: Checklist to examine a specification

• Completeness

• Complete list of functions, references, tools

• Consistency

• Internal and external consistency

• Traceability of requirements

• Realizability

• Resources are available

• Usability is considered

• Maintainability is considered

• Risks: cost, technical, environmental

10.7. Motivations to check the specification

• Experience: Hazards are often caused by incomplete or inconsistent specification

• Example: Statistics of failures detected during the software testing of the Voyager and Galileo spacecraft: 78% (149/192) were specification related failures, of which

• 23% stuck in dangerous state (without exit)

• 16% lack of timing constraints

• 12% lack of reaction to input event

• 10% lack of checking input values

• Potential solutions to avoid problems

• Using a strong specification language

• Applying correct design patterns

• Checking the specification

10.8. Example: Checklist for state machine specifications

Completeness and consistency:

• State definition

• Inputs (trigger events)

10.9. Example: Checklist for state machine specifications

10.10. Example: Checklist for state machine specifications

10.11. Example: Checklist for state machine specifications

10.12. Example: Checklist for state machine specifications

10.13. Example: Checklist for state machine specifications

10.14. Example: Checklist for state machine specifications

10.15. Example: Static checking of the source code

• Goal: Finding dangerous constructs

• Basis: Language subset (allowed constructs)

• Tool support

• Finding typical faults (e.g., Lint for C)

• Data related faults: Lack of initialization, ...

• Control related faults: Unreachable statements, ...

• Interface related faults: Improper type, lack of return value, ...

• Memory related faults: Lack of releasing unused memory, ...

• Semantic analysis (e.g., PolySpace tool)

• Analysis of the function call hierarchy

• Checking data flow (relations among variables)

• Checking the ranges of variables

• Checking coding rules (e.g., code complexity metrics)

10.16. Example: Output of the analysis in PolySpace

• Static analysis and code colouring: Identification of dangerous constructs

10.17. 2. Fault tree analysis

Analysis of the causes of system level hazards

• Top-down analysis

• Identifying the component level combinations of faults/events that may lead to the hazard

Construction of the fault tree:

1. Identification of the foreseen system level hazard: on the basis of environment risks, standards, etc.

2. Identification of intermediate events (pseudo-events): Boolean (AND, OR) combinations of lower level events that may cause upper level events

3. Identification of primary (basic) events: no further refinement is needed/possible

10.18. Set of elements in a fault tree

10.19. Fault tree example: Elevator

10.20. Fault tree example: Elevator

10.21. Fault tree example: Elevator

10.22. Fault tree example: Software analysis

• Outputs of the analysis of the reduced fault tree:

• Single point of failure (SPOF)

• Critical events that appear in several cuts

10.24. Original fault tree of the elevator example

10.25. Reduced fault tree of the elevator example

10.26. Quantitative analysis of the fault tree

The probability of the top-level hazard is computed from the probabilities of the primary event combinations:

• AND gate: product (if the events are independent)

• Exact calculation: P = P1 · P2 · ... · Pn

• OR gate: sum (worst case estimation)

• Exactly: P = 1 - (1 - P1)(1 - P2) ... (1 - Pn)

• Typical problems:

• Correlated faults (not independent)

• Handling of fault sequences

10.27. Fault tree of the elevator with probabilities

10.28. 3. Event tree analysis

• Forward (inductive) analysis: Investigates the effects of an initial event

• Initial event: component level fault/event

• Related events: faults/events of other components

• Ordering: causality, timing

• Branches: depend on the occurrence of events

• Investigation of hazard occurrence "scenarios"

10.30. Event tree example: Reactor cooling

10.31. Event tree example: Reactor cooling

10.32. Event tree example: Reactor cooling

10.33. Event tree example: Recovery blocks (RB)

• Complexity

10.35. Cause-consequence analysis example

10.36. Cause-consequence analysis example

10.37. Cause-consequence analysis example

10.38. 5. Failure modes and effects analysis (FMEA)

• Systematic investigation of component failure modes and their effects

• Advantages:

• Known faults of components are included

• Criticalities of effects can also be estimated (FMECA)

10.39. Example: Analysis of a computer system

10.40. Analysis of operator faults

• Qualitative techniques:

• Operation - hazards - effects - causes - mitigations

• Analysis of physical and mental demands

• Fault causes: human-machine interface problems

• Frequency of occurrence of hazards: frequent, probable, occasional, remote, improbable, incredible

• Identification of risks

• Output of the severity/frequency analysis:

• Risk matrix

• Protection level: Identifies the risks to be handled

10.42. Example: Risk matrix (railway control systems)

10.43. Examples of risk reduction requirements

• In case of catastrophic consequence:
