
Dependability metrics: Mean values

• Partitioning the state of the system: s(t)

• Correct (U, up) and incorrect (D, down) state partitions

• Mean values:

• Mean Time to First Failure: MTFF = E{u1}

• Mean Up Time: MUT = MTTF = E{ui} (Mean Time To Failure)

• Mean Down Time: MDT = MTTR = E{di} (Mean Time To Repair)

• Reliability r(t): probability of no failure until t

• Asymptotic availability (regular repairs): K = A = MTTF / (MTTF + MTTR)

7.37. Availability related requirements

Availability of a system built up from components,

where the availability of a component is 95%:

• Availability of a system built from 2 components:

• Availability of a system built from 5 components:

• Availability of a system built from 10 components:

7.38. Attributes of components

• Fault rate:

• Probability that the component will fail at time point t given that it has been correct until t

• In other way (on the basis of the definition of reliability):
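The definition above is the hazard (fault) rate λ(t); for a constant rate the familiar exponential reliability follows:

```latex
\lambda(t) = \lim_{\Delta t \to 0}
  \frac{P\{\text{failure in } (t, t+\Delta t] \mid \text{correct until } t\}}{\Delta t}
  = \frac{f(t)}{R(t)} = -\frac{R'(t)}{R(t)},
\qquad
\lambda(t) \equiv \lambda \;\Rightarrow\; R(t) = e^{-\lambda t},\quad \mathrm{MTTF} = \frac{1}{\lambda}.
```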

7.39. Case study: development of a DMI

7.40. Case study: DMI requirements

• Safety:

• Safety Integrity Level: SIL 2

• Tolerable Hazard Rate (THR): hazardous failures per hour

• CENELEC standards: EN 50129 and EN 50128

• Reliability:

• Mean Time To Failure: MTTF = 5000 hours (about 7 months)

• Availability:

• A = MTTF / (MTTF + MTTR)

• Faulty state shall be less than 42 hours per year

• MTTR ≤ 24 hours if MTTF = 5000 hours
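These figures are consistent, as a quick check shows:

```latex
A = \frac{\mathrm{MTTF}}{\mathrm{MTTF}+\mathrm{MTTR}} = \frac{5000}{5000+24} \approx 0.99522,
\qquad
(1-A)\cdot 8760\ \mathrm{h/year} \approx 41.9\ \mathrm{h/year} < 42\ \mathrm{h/year}.
```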

7.41. Threats to dependability

7.42. The characteristics of faults

Software fault:

• Permanent design fault (systematic)

• Activation of the fault depends on the operational profile (inputs)

7.43. Means to improve dependability

• Fault prevention:

• Physical faults: Good components, shielding, ...

• Design faults: Good design methodology

• Fault removal:

• Design phase: Verification and corrections

• Prototype phase: Testing, diagnostics, repair

7.45. Overall safety lifecycle model: Goals

• Technical framework for the activities necessary for ensuring functional safety

• Covers all lifecycle activities

• Initial concept

• Hazard analysis and risk assessment

• Specification, design, implementation

• Operation and maintenance

• Modification

• Final decommissioning and/or disposal

7.46. Hardware and software development

• PE system architecture (partitioning of functions) determines software requirements

7.47. Software safety lifecycle

• Safety req. spec. has two parts:

• Software safety functions

• Software safety integrity levels

• Validation planning is required

• Integration with PE hardware is required

• Final step: Software safety validation

7.48. Example software lifecycle (V-model)

7.49. Maintenance activities

7.50. Techniques and measures: Basic approach

• Goal: Preventing the introduction of systematic faults and controlling the residual faults

• SIL determines the set of techniques to be applied, classified as:

• M: Mandatory

• HR: Highly recommended (rationale behind not using it should be detailed and agreed with the assessor)

• R: Recommended

• -: No recommendation for or against being used

• NR: Not recommended

• Combinations of techniques are allowed

• E.g., alternate or equivalent techniques are marked

• Hierarchy of methods is formed (references to tables)

7.51. Example: Guide to selection of techniques

• Software safety requirements specification:

• Techniques 2a and 2b are alternatives

• Referred table: Semi-formal methods (B.7)

• A.1: Software safety requirements specification

• A.2: Software architecture design

• A.3: Support tools and programming languages

• A.4: Detailed design

• A.5: Module testing and integration

• A.6: PE integration

• A.7: Software safety validation

• A.8: Modification

• A.9: Software verification

• A.10: Functional safety assessment

7.53. Hierarchy of design methods

7.54. Hierarchy of design methods

7.55. Hierarchy of design methods

7.57. Hierarchy of V and V methods

7.58. Hierarchy of V and V methods

7.59. Hierarchy of V and V methods

7.60. Application of tools in the lifecycle

• Fault prevention:

• Program translation from high-level programming languages

• MBD, CASE tools: High level modeling and code/configuration generators

• Fault removal:

• Analysis, testing and diagnosis

• Correction (code modification)

7.61. Safety concerns of tools

• Types of tools

• Tools potentially introducing faults

• Modeling and programming tools

• Program translation tools

• Tools potentially failing to detect faults

• Analysis and testing tools

• Project management tools

• Requirements

• Use certified or widely adopted tools

• "Increased confidence from use" (no evidence of improper results yet)

• Use the well-tested parts without altering the usage

• Check the output of tools (analysis/diversity)

• Control access and versions

7.62. Safety of programming languages

• Expertise available in the design team

• Coding standards (subsets of languages) are defined

• "Dangerous" constructs are excluded (e.g., function pointers)

• Static checking can be used to verify the subset

• Specific (certified) compilers are available

• Compiler verification kit for third-party compilers

7.63. Safety of programming languages

7.64. Language comparison

Wild jumps: Jump to arbitrary address in memory

Overwrites: Overwriting arbitrary address in memory

Model of math: Well-defined data types

Separate compilation: Type checking across modules

7.65. Coding standards for C and C++

• MISRA C (Motor Industry Software Reliability Association)

• Safe subset of C (2004): 141 rules (121 required, 20 advisory)

• Examples:

• Rule 33 (Required): The right hand side of a "&&" or "||" operator shall not contain side effects.

• "Joint Strike Fighter Air Vehicle C++ Coding Standard"

7.66. Safety-critical OS: Required properties

• Partitioning in space

• Memory protection

• Guaranteed resource availability

• Partitioning in time

• Deterministic scheduling

• Guaranteed resource availability in time

• Mandatory access control for critical objects

• Not (only) discretionary

• Bounded execution time

• Also for system functions

• Support for fault tolerance and high availability

• Fault detection and recovery / failover

• Redundancy control

7.67. Example: Safety and RTOS

• Fewer scheduling risks

• High maintenance risks

• Example: Tornado for Safety Critical Systems

• Integrated software solution uses Wind River's securely partitioned VxWorks AE653 RTOS

• ARINC 653: Time and space partitioning (guaranteed isolation)

• RTCA/DO-178B: Level A certification

• POSIX, Ada, C support

7.68. Principles for documentation

• Type of documentation

• Comprehensive (overall lifecycle)

• E.g., Software Verification Plan

• Specific (for a given lifecycle phase)

• E.g., Software Source Code Verification Report

• Document Cross Reference Table

• Determines documentation for a lifecycle phase

• Determines relations among documents

• Traceability of documents is required

• Relationships between documents are specified (input, output)

• Terminology, references, abbreviations are consistent

• Merging documents is allowed

• If the responsible persons (authors) are not required to be independent

7.69. Document cross reference table (EN50128)

Legend:

• Creation of a document

• Use of a document in a given phase (read vertically)

7.70. Human factors

• In contrast to computers

• Humans often fail in:

• reacting in time

• following a predefined set of instructions

• Humans are good in:

• handling unanticipated problems

• Human errors

• Not all kinds of human errors are equally likely

• Hazard analysis (FMECA) is possible in a given context

• Results shall be integrated into system safety analysis

• Reducing the errors of developers

• Safe languages, tools, environments

• Training, experience and redundancy (independence)

• Reducing operator errors:

• Designing ergonomic HMI (patterns are available)

• Designing to aid the operator rather than take over

7.71. Organization

• Safety management

• Quality assurance

• Safety Organization

• Competence shall be demonstrated

• Training, experience and qualifications

• Independence of roles:

• DES: Designer (analyst, architect, coder, unit tester)

• VER: Verifier

• VAL: Validator

• ASS: Assessor

• MAN: Project manager

• QUA: Quality assurance personnel

7.72. Independence of personnel

7.73. Overall safety lifecycle (overview)

7.74. Specific activities (overview)

7.75. Specific activities (overview)

7.76. Specific activities (overview)

7.77. Specific activities (overview)

7.78. Specific activities (overview)

7.80. Specific activities (overview)

7.81. Summary

• Safety-critical systems

• Hazard, risk

• THR and Safety Integrity Level

• Dependability

• Attributes of dependability

• Fault Error Failure chain

• Means to improve dependability

• Development process

• Lifecycle activities

8. 8 Design of the architecture of safety-critical systems

8.1. Design of the architecture of safety-critical systems

8.2. Objectives

• Stopping (switch-off) is a safe state

• In case of a detected error the system has to be stopped

• Detecting errors is a critical task

• Stopping (switch-off) is not a safe state

• Service is needed even in case of a detected error

• full service

• degraded (but safe) service

• Fault tolerance is required

8.3. Architectural solutions (overview)

• The effects of detected errors can be handled (compensated)

• All failure modes are safe

• "Inherently safe" system

8.4. Typical architectures for fail-stop operation

8.5. 1. Single channel architecture with built-in self-test

• Single processing flow

• Scheduled hardware self-tests

• After switch-on: Detailed self-test to detect permanent faults

• In run-time: On-line tests to detect latent permanent faults

• Scheduled software self-tests

• Typically application dependent techniques

• Checking the control flow, data acceptance rules, timeliness properties

• Disadvantages:

8.6. Implementation of on-line error detection

• Application dependent (ad-hoc) techniques

• Acceptance checking (e.g., for ranges of values)

• Timing related checking (e.g., too early, too late)

• Cross-checking (e.g., using inverse function)

• Structure checking (e.g., in linked list structure)

• Application independent mechanisms

• Hardware supported on-line checking

• CPU level: Invalid instruction, user/supervisor modes etc.

• MMU level: Protection of memory ranges

• Generic architectural solutions

Stuck-at 0/1 faults: The cell always returns 0 (or 1), regardless of the value written

Transition fault: The cell cannot change its state (e.g., from 0 to 1)

State transitions to check stuck faults: Both 0 and 1 shall be written and read back in each cell

"March" algorithms: Sequences of read and write operations applied to each cell, in ascending and then descending address order
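A march pass can be sketched as follows; this is a simplified March C- style test (illustrative only, not a certified memory test):

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified "march" memory test (March C- style):
   each cell is read (expected value checked) and rewritten with the
   complement, in ascending and then descending address order.
   Returns 0 if the memory behaved correctly, -1 on the first mismatch.
   Detects stuck-at-0/1 faults and many transition/coupling faults. */
int march_test(volatile uint8_t *mem, size_t n) {
    size_t i;
    for (i = 0; i < n; i++) mem[i] = 0x00;          /* up(w0)      */
    for (i = 0; i < n; i++) {                       /* up(r0,w1)   */
        if (mem[i] != 0x00) return -1;
        mem[i] = 0xFF;
    }
    for (i = 0; i < n; i++) {                       /* up(r1,w0)   */
        if (mem[i] != 0xFF) return -1;
        mem[i] = 0x00;
    }
    for (i = n; i-- > 0; ) {                        /* down(r0,w1) */
        if (mem[i] != 0x00) return -1;
        mem[i] = 0xFF;
    }
    for (i = n; i-- > 0; ) {                        /* down(r1,w0) */
        if (mem[i] != 0xFF) return -1;
        mem[i] = 0x00;
    }
    for (i = 0; i < n; i++)                         /* up(r0)      */
        if (mem[i] != 0x00) return -1;
    return 0;
}
```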

8.8. Example: Software self-test

• Checking the correctness of execution paths

• On the basis of the program control flow graph

8.9. Example: Software self-test

• Checking the correctness of execution paths

• On the basis of the program control flow graph

• Actual run: Checked on the basis of assigned signatures
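The signature-based check can be sketched as follows; the signature values and function names here are our own illustrative choices:

```c
#include <stdint.h>

/* Sketch of signature-based control-flow checking: each basic block
   XORs its assigned signature into a running value; at the end of a
   legal path the accumulated value must match a precomputed path
   signature, otherwise the control flow is assumed corrupted. */
static uint32_t g_sig;

static void enter_block(uint32_t block_sig) { g_sig ^= block_sig; }

/* Hypothetical control flow graph: A -> (B or C) -> D. */
#define SIG_A 0x1u
#define SIG_B 0x2u
#define SIG_C 0x4u
#define SIG_D 0x8u

/* Returns 1 if the executed path carries a legal signature. */
int run_path(int take_b) {
    g_sig = 0;
    enter_block(SIG_A);
    if (take_b) enter_block(SIG_B); else enter_block(SIG_C);
    enter_block(SIG_D);
    return (g_sig == (SIG_A ^ SIG_B ^ SIG_D)) ||
           (g_sig == (SIG_A ^ SIG_C ^ SIG_D));
}
```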

8.11. Example: The SAFEDMI hardware architecture

• Single electronic structure based on reactive fail-safety

• Generic (off-the-shelf) hardware components are used

• Most of the safety mechanisms are based on software(error detection and error handling)

8.12. Example: The SAFEDMI hardware architecture

Hardware components:

8.13. Example: The SAFEDMI fault handling

• Operational modes:

• Startup, Normal, Configuration and Safe modes

• Suspect state to implement controlled restart/stop after an error

8.14. Example: The SAFEDMI error detection techniques

• Startup: Detection of permanent hardware faults

• CPU testing with external watchdog circuit

• Memory testing with marching algorithms

• EPROM integrity checking with error detection codes

8.15. 2. Two-channel architecture with comparison

• Shared input

• Comparison of outputs

• Stopping in case of deviation

• High error detection coverage

• The comparator is a critical component (but simple)

• Special way of comparison:

• Performed by the operator

• Disadvantages:

• Common mode faults

• Long detection latency

8.16. Example: TI Hercules Safety Microcontrollers

8.18. Example: SCADA system architecture

• Two channels

• Alternating bitmap visualization from the two channels: Comparison by the operator

• Synchronization: Detection of internal errors before the effects reach the outputs

8.19. Example: SCADA deployment options

• Two channels on the same server

• Statically linked software modules

• Independent execution in memory, disk and time

• Diverse data representation

• Binary data (signals): Inverse representation (original/negated)

• Diverse indexing in the technology database

• Two channels on two servers

• Synchronization on dedicated network

8.20. Example: SCADA error detection techniques

For random hardware faults during operation:

• Comparison of channels: Operator and I/O circuits

• Heartbeat: Blinking RGB-BGR symbols indicate the regular update of the bitmap on the screen

• Watchdog process

• Checking the availability of the other processes

• Regular comparison of the content of the technology database

• Detecting latent errors

For unintended control by the operator:

• Three-phased control of outputs:

• Preparation (but locking the outputs using diverse software modules)

• Read-back using independent software modules

• Acknowledgement by the operator (diverse GUI operations)

8.21. Example: SCADA three phases of control

8.22. 3. Two-channel architecture with safety bag

• Independent second channel

• "Safety bag": only safety checking

• Diverse implementation

• Checking the output of the primary channel

• Example:

• Elektra railway interlocking system

• Rules are implemented to check the primary channel

8.23. Example: Alcatel (Thales) Elektra

Two channels:

• Logic channel: CHILL (CCITT High Level Language) procedure-oriented programming language

• Safety channel: PAMELA (Pattern Matching Expert System Language) rule-based language

8.24. Typical architectures for fault-tolerant systems

8.26. Fault tolerant systems

• Fault tolerance: Providing (safe) service in case of faults

• Autonomous error handling during operation (instead of stopping)

• Intervening into the fault failure chain

• Basic condition: Redundancy. Extra resources to replace (the service of) faulty components

• Types of redundancy

• Cold: The redundant component is inactive in fault-free case

• Warm: The redundant component has reduced load in fault-free case

• Hot: The redundant component is active in fault-free case

8.27. Forms of redundancy

1. Hardware redundancy

• Extra hardware components

• Inherent in distributed systems

• Planned for fault tolerance

2. Software redundancy

• Extra software modules

3. Information redundancy

8.28. Example: Error detecting and correcting codes

• Limited error correction capability

• Information storage: Over a long time, more errors can accumulate than the applied code is able to correct

• Basic idea: Periodic reading, correcting and writing back the information
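The scrubbing idea can be sketched with a Hamming(7,4) single-error-correcting code: each stored word is periodically decoded, at most one flipped bit is corrected, and the clean codeword is written back before a second error accumulates. (The bit layout below is the standard positional one; function names are ours.)

```c
#include <stddef.h>
#include <stdint.h>

/* Encode 4 data bits (d3..d0) into a 7-bit codeword; bit i = position i+1,
   parity bits at positions 1, 2 and 4. */
uint8_t ham74_encode(uint8_t d) {
    uint8_t d1 = d & 1, d2 = (d >> 1) & 1, d3 = (d >> 2) & 1, d4 = (d >> 3) & 1;
    uint8_t p1 = d1 ^ d2 ^ d4;     /* covers positions 3, 5, 7 */
    uint8_t p2 = d1 ^ d3 ^ d4;     /* covers positions 3, 6, 7 */
    uint8_t p4 = d2 ^ d3 ^ d4;     /* covers positions 5, 6, 7 */
    return (uint8_t)(p1 | (p2 << 1) | (d1 << 2) | (p4 << 3) |
                     (d2 << 4) | (d3 << 5) | (d4 << 6));
}

/* Decode, correcting a single flipped bit; returns the 4 data bits.
   The syndrome directly names the position of the faulty bit. */
uint8_t ham74_decode(uint8_t c) {
    uint8_t b[8];
    for (int i = 1; i <= 7; i++) b[i] = (c >> (i - 1)) & 1;
    uint8_t s1 = b[1] ^ b[3] ^ b[5] ^ b[7];
    uint8_t s2 = b[2] ^ b[3] ^ b[6] ^ b[7];
    uint8_t s4 = b[4] ^ b[5] ^ b[6] ^ b[7];
    uint8_t syn = (uint8_t)(s1 | (s2 << 1) | (s4 << 2));
    if (syn) b[syn] ^= 1;          /* correct the flipped bit */
    return (uint8_t)(b[3] | (b[5] << 1) | (b[6] << 2) | (b[7] << 3));
}

/* Periodic scrubbing pass: re-encode the corrected data in place. */
void scrub(uint8_t *words, size_t n) {
    for (size_t i = 0; i < n; i++)
        words[i] = ham74_encode(ham74_decode(words[i]));
}
```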

8.29. How to use the redundancy?

• Hardware design faults:

• Hardware redundancy, with design diversity

• Often are neglected (wide-spread components are used)

• Hardware permanent operational faults:

• Hardware redundancy (e.g., redundant processor)

• Hardware transient operational faults:

• Time redundancy (e.g., instruction retry)

• Information redundancy (e.g., error correcting codes)

• Software redundancy (e.g., checkpointing and recovery)

• Software design faults:

• Software redundancy, with design diversity

8.30. 1. Fault tolerance for hardware permanent faults

• TMR: Triple Modular Redundancy

• Masking the failure by majority voting

• Voter is a critical component (but simple)

• NMR: N-modular redundancy

• Masking the failure by majority voting

• Goal: Surviving a mission time with high probability

• Airborne and space systems: 4MR, 5MR
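The voter's simplicity (noted above as a critical but simple component) shows in a bitwise 2-of-3 majority function, a minimal sketch:

```c
#include <stdint.h>

/* Bitwise majority voter for TMR: each output bit takes the value
   present in at least two of the three replica outputs, masking the
   failure of any single replica. */
uint32_t tmr_vote(uint32_t a, uint32_t b, uint32_t c) {
    return (a & b) | (a & c) | (b & c);
}
```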

8.31. Implementation of the replication

• Equipment/server level:

• Servers: High availability server clusters

• E.g., Linux HA Clustering, Windows Server Failover Clustering

• Software support: Failover and failback

• Component level:

• Replication of components: TMR

• Self-checking circuits (processing encoded information)

8.32. Example: RAID disk configurations

(Redundant Array of Independent Disks)

8.33. 2. Fault tolerance for transient hardware faults

• Basic approach: Software supported fault tolerance

• Repeated execution will avoid transient faults

• The handling of fault effects is important

• Transient faults are handled by setting a fault-free state and continuing the execution from that state (potentially with repeated execution)

• Four phases of operation:

1. Error detection:

• E.g., detecting illegal instructions at CPU level

• E.g., detecting violation of memory access restrictions

• Application dependent techniques:

• Acceptance checking

• Timing related checking

• Cross-checking

• Structure checking

• Diagnostic checking

• ...

8.35. The four phases of operation 2/4

2. Damage assessment:

• Motivation: Errors can propagate among the components between the occurrence and detection of errors

• Limiting error propagation: Checking interactions

• Input acceptance checking (to detect external errors)

• Output credibility checking (to provide "fail-silent" operation)

• Checking and logging resource accesses and communication

• Estimation of components affected by a detected error

• Analysis of interactions (during the latency of error detection)

8.36. The four phases of operation 3/4

3. Recovery from an erroneous state

• Backward recovery:

• Restoring a prior error-free state (saved earlier)

• Independent of the detected error and estimated damage

• State shall be saved and restored for each component

• Compensation:

• The error can be handled by using redundant information

8.37. Types of recovery

• State space of the system (example): Error detection

8.38. Types of recovery

• State space of the system: Forward recovery

8.39. Types of recovery

• State space of the system: Backward recovery

8.40. Types of recovery

• State space of the system: Compensation

8.42. Backward recovery

• Based on saved state

• Checkpoint: The saved state

• Checkpoint operations:

• Saving the state: periodically, after messages; into stable storage

• Recovery: restoring the state from the stable storage to memory

• Discarding: after having more recent saved state(s)

• Analogy: "autosave"

• Based on operation logs

• Error to be handled: unintended operation

• Recovery is performed by the withdrawal of operations

• Analogy: "undo"

• It is possible to combine the two mechanisms

8.43. Scenarios of backward recovery

8.44. Checkpoint intervals

Aspects of optimizing checkpoint intervals:

• Stable storage is slow (overhead) and has limited capacity

• Computation is lost after the last checkpoint

• Long error detection latency increases the chance of damaged checkpoints
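A classical first-order guideline for this trade-off (not stated in the slides) is Young's approximation, where C is the time to save one checkpoint and M is the mean time between failures:

```latex
T_{\mathrm{opt}} \approx \sqrt{2\,C\,M}
```

For example, with C = 2 s and M = 10{,}000 s the optimal interval is about \(\sqrt{2 \cdot 2 \cdot 10{,}000} = 200\) s.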

8.45. Example: User configured checkpointing

Challenges

• Parameters: Length of checkpoint intervals, number of checkpoints, reliability of the checkpoint storage

8.46. Example: User configured checkpointing

• Declaration of a checkpoint vector:

• Declaration of a checkpoint file:

8.47. Example: User configured checkpointing

• Program data to be saved (example):

• Initialization of the checkpoint vector:

• Restoring the checkpoint:

• Similarly, using the fread() library call

8.49. Example: Saving the state of the CPU

• The setjmp() and longjmp() library calls can be used

• Generic "go to" by saving and restoring CPU state

• After longjmp() the execution is continued in the same way as if after the return from the last corresponding setjmp() system call

• The return value is the second parameter of the longjmp() call

• Saving and restoring checkpoints can be distinguished:

• Successful saving of checkpoint: 0 return value

• Restoring checkpoint: the second parameter as the return value

• Similar calls that also save the signal mask:

• sigsetjmp(), siglongjmp()
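The rollback mechanism described above can be sketched in a few lines; the function and the retry policy are our illustrative choices:

```c
#include <setjmp.h>

/* Minimal sketch of checkpoint/rollback with setjmp()/longjmp():
   setjmp() returns 0 when the checkpoint is taken, and returns the
   second argument of longjmp() when execution rolls back to it. */
static jmp_buf checkpoint;

int run_with_rollback(int fail_first_try) {
    static int attempts;   /* static: survives the rollback (auto
                              variables changed after setjmp() are
                              indeterminate after longjmp()) */
    attempts = 0;

    if (setjmp(checkpoint) == 0) {
        /* Checkpoint taken: normal execution continues here. */
    }
    attempts++;
    if (fail_first_try && attempts == 1) {
        /* Detected error: roll back to the checkpoint; setjmp()
           now returns 1 and the computation is retried. */
        longjmp(checkpoint, 1);
    }
    return attempts;       /* 1 without rollback, 2 with one retry */
}
```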

8.50. Example: Saving the state of the CPU

• Recording the CPU state:

• Saving the recorded CPU state:

• Reading the saved CPU state:

• Restoring the saved state:

8.51. Rollback recovery in distributed systems

Messages influence the consistency of the system-level state formed by the local checkpoints:

In-transit message:

In the saved state the message

• is already sent, and

• is not processed yet

The message shall be saved!

Inconsistent message:

In the saved state the message

• is not sent yet, but

• is already processed

This situation shall be avoided!

8.52. Coordinated checkpointing in distributed systems

8.53. Coordinated checkpointing in distributed systems

8.54. Coordinated checkpointing in distributed systems

8.55. The four phases of operation 4/4

4. Fault treatment and continuing service

• Transient faults:

• Handled by the forward or backward recovery

• Permanent faults: Recovery becomes unsuccessful (the error is detected again). The faulty component shall be localized and handled:

• Diagnostic checks to localize the fault

• Reconfiguration: Fault tolerance (replacing the faulty component using redundancy) or degraded operation (continuing only the safety related services)

• Repair and substitution

8.56. 4. Fault tolerance for software faults

• Repeated execution is not effective for design faults

8.57. N-version programming

• Active redundancy: Each variant is executed (in parallel)

• The same inputs are used

• Majority voting is performed on the output

• Acceptable range of difference shall be specified

• The voter is a single point of failure
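Since an "acceptable range of difference shall be specified", the N-version voter is typically an inexact 2-out-of-3 comparison; a minimal sketch (names and adjudication policy are ours):

```c
#include <math.h>

/* 2-out-of-3 voting on numeric outputs of three independently
   developed variants: results within 'eps' of each other count as
   agreeing. Returns 0 and the voted value via *out, or -1 if no
   majority exists. */
int nvp_vote(double r1, double r2, double r3, double eps, double *out) {
    if (fabs(r1 - r2) <= eps) { *out = (r1 + r2) / 2.0; return 0; }
    if (fabs(r1 - r3) <= eps) { *out = (r1 + r3) / 2.0; return 0; }
    if (fabs(r2 - r3) <= eps) { *out = (r2 + r3) / 2.0; return 0; }
    return -1;   /* voter failure: no two variants agree */
}
```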

8.58. Recovery blocks

• Passive redundancy: Activation only in case of faults

• The primary variant is executed first

• Acceptance checking performed on the output of the variants

• In case of a detected error another variant is executed
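The scheme above can be sketched as follows; the variant and acceptance functions are illustrative placeholders:

```c
/* Sketch of a recovery block: the primary variant runs first; if its
   result fails the acceptance test, the next variant is tried from
   the same input state. Returns 0 on success, -1 if all variants fail. */
typedef int (*variant_fn)(int input, int *result);
typedef int (*acceptance_fn)(int result);    /* nonzero = acceptable */

int recovery_block(variant_fn variants[], int n_variants,
                   acceptance_fn accept, int input, int *result) {
    for (int i = 0; i < n_variants; i++) {
        int r = 0;
        /* Each variant restarts from the same saved input state. */
        if (variants[i](input, &r) == 0 && accept(r)) {
            *result = r;
            return 0;                        /* acceptable result found */
        }
    }
    return -1;                               /* all variants rejected */
}

/* Illustrative variants: the primary produces an unacceptable result,
   the alternate passes the acceptance test. */
static int variant_primary(int x, int *r)   { (void)x; *r = -1; return 0; }
static int variant_alternate(int x, int *r) { *r = 2 * x; return 0; }
static int accept_nonneg(int r)             { return r >= 0; }
```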

8.59. Recovery blocks


8.60. Recovery blocks


8.61. Comparison of the techniques

8.62. Example: Airbus A-320, self-checking blocks

• Pair-wise self-checking execution

• Primary pair is active, switch-over in case of a fault

• Permanent hardware fault: The pair with repeatedly detected fault will switch off

8.63. Summary

8.64. Software architecture design in standards

• IEC 61508: Functional safety in electrical / electronic / programmable electronic safety-related systems

• Software architecture design

8.65. Redundancy in space (resources) and time

8.66. Costs of redundancy and operation (faults)

8.67. Summary: Types of redundancy

1. Hardware redundancy

• Replicas are used to tolerate permanent faults

2. Software redundancy

• Variants (NVP, RB) are used to tolerate design faults

• Software is used to tolerate transient hardware faults:

• Forward recovery

• Backward recovery

3. Information redundancy

8.68. Summary: Techniques of fault tolerance

1. Hardware design faults

• Diverse redundant components are used

2. Hardware permanent operational faults

• Replicated components are used

3. Hardware transient operational faults

• Software techniques for fault tolerance:

1. Error detection

2. Damage assessment

3. Forward or backward recovery (or compensation)

4. Fault treatment

• Information redundancy: Error correcting codes

• Time redundancy: Repeated execution

4. Software design faults

• Variants as diverse redundant components (NVP, RB)

9. 9 Testing: Test design and testing process

9.1. Testing: Test design and testing process

9.2. Overview

• Testing basics

• Goals and definitions

• Test design

• Specification based (functional, black-box) testing

• Structure based (white-box) testing

• Testing process

• Module testing

• Integration testing

• System testing

• Validation testing

9.3. Testing and test design in the V-model

9.4. Goals of testing

Testing:

• Running the program in order to detect faults

Exhaustive testing:

• Running the program in all possible ways (inputs)

• Hard to implement in practice

Observations:

• Dijkstra: Testing is able to show the presence of faults, but not able to show the absence of faults.

• Hoare: Testing can be considered as part of an inductive proof: If the program runs correctly for a given input then it will run similarly correctly in case of similar inputs.

9.5. Test environment: System testing

9.6. Test environment: Module testing

9.7. Tests and faults

• Typical software faults

• Algorithm faults → (formal) verification

• Coding faults → automated code generation (initializing, boundaries, control flow, ...)

• Performance / timing faults

• Deficiencies of the user interface, ...

• Basic concepts

• Fault model: Set of faults to be detected by the test suite

• Test for a fault: The output differs from the expected one

• Detectable fault: There is at least one test for it

• Fault coverage: The ratio of the number of faults detected by a given test suite to the number of all faults

9.8. Practical aspects of testing

• Testing costs may reach 50% of the development costs!

• Testing embedded systems:

• Cross-development (different platforms)

• Platform related faults shall be considered (integration)

• Performance and timing related testing are relevant

• Testing safety-critical systems:

• Functional/black box testing (D3):

9.10. Testing in the standards (here: EN 50128)

• Performance testing (D6):

Test design

How can test data be selected?

9.11. Test approaches

1. Specification based (functional) testing

• The system is considered as a "black box"

• Only the external behaviour (functionality) is known (the internal behaviour is not)

• Test goals: checking the existence of the specified functions and absence of extra functions

2. Structure based (white-box) testing

• The system is considered as a "white box"

• The internal structure (source) is known

• Test goals: coverage of the internal behaviour (e.g., program graph)

9.12. I. Specification based (functional) testing

Goals:

• Based on the functional specification,

• find representative inputs (test data)for testing the functionality.

Overview of techniques:

1. Equivalence partitioning

2. Boundary value analysis

3. Cause-effect analysis

4. Combinatorial techniques

9.13. 1. Equivalence partitioning

Input and output equivalence classes:

Testing with one input per class: correct behavior for the remaining inputs follows from the principle of induction

Test data selection is a heuristic procedure:

• Input data triggering the same service
