
Frame coding in dynamic segment

DTS: Dynamic Trailing Sequence

Indicates the exact point in time of the transmitter's minislot action point

Prevents premature Channel idle detection

DTS is sent after FES (Frame End Sequence).

• the CRC is appended to the bit stream (as extended bytes)

• the FES follows at the end

• the DTS follows, if transmitting in the dynamic segment

6.22. Symbol coding

The coding of these two symbols is the same:

CAS (Collision Avoidance Symbol) and

MTS (Media Access Test Symbol)

Wakeup Symbol (WUS): note that the non-Low phase is not High, it is Idle!

6.23. Sampling and majority voting

• every node samples RxD many times within a bit duration

• majority voting within a sliding window (5-sample-wide voting window); a minimal sketch follows this list

• suppresses glitches (spikes) caused by transients or disturbances (assuming their duration is short compared to the bit duration)

• this determines the timing of the start of a bit, not the clock for oversampling RxD!

• Receiver counts the received samples (oversampled) within bit duration

• If VotedVal changes High → Low, the sample counter is set to 2 at the next sample

• synchronization is accomplished only if enabled
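
A minimal sketch of the sliding-window majority voting in C; the window width of 5 comes from the bullets above, while the sample values and the idle-High starting state are illustrative assumptions, not FlexRay reference code.

#include <stdio.h>

#define WINDOW 5

/* Returns the voted value for the newest sample, given the last 5 raw samples. */
static int majority_vote(const int window[WINDOW])
{
    int ones = 0;
    for (int i = 0; i < WINDOW; i++)
        ones += window[i];          /* samples are 0 (Low) or 1 (High) */
    return (ones * 2 > WINDOW);     /* 3 or more High samples -> High */
}

int main(void)
{
    /* A short glitch (single Low sample) inside a High bit is suppressed. */
    int samples[] = {1, 1, 1, 1, 0, 1, 1, 1, 1, 1};
    int window[WINDOW] = {1, 1, 1, 1, 1};   /* assume idle High before */

    for (int i = 0; i < 10; i++) {
        for (int j = 0; j < WINDOW - 1; j++) /* shift the voting window */
            window[j] = window[j + 1];
        window[WINDOW - 1] = samples[i];
        printf("sample %d: raw=%d voted=%d\n", i, samples[i], majority_vote(window));
    }
    return 0;
}

With these samples, the single Low sample never produces a Low voted value, which is exactly the glitch suppression described above.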

6.25. Clock synchronization

Every node has a local clock

Assumptions:

• every node within the distributed system has approximately the same view of time

• difference between any two nodes must not exceed a tolerance

Solution: distributed clock synchronization mechanism

6.26. Timing hierarchy

• microtick: derived from the CC's internal timer (given prescaler, compare-match etc.); the implementation might vary, controller specific

• macrotick: synchronized on a cluster-wide basis

Error model:

• offset (phase)

• rate (frequency)

Both are corrected.

Measurement: every node measures the time difference between the true and expected sync edge in microticks.

The last N measurements are collected in a table.

6.28. Offset correction

• Measurement

• Selection of stored deviation values (only the current communication cycle is used)

• If no sync frame error

• Calculation of the fault-tolerant midpoint

• Within limits? If not, crop

• An optional external correction (supplied by host)

6.29. Fault-tolerant midpoint algorithm

• determination of k, the number of extreme values to discard (based on the number of measured values)

• omitting the k smallest and k largest differences

• from the remaining values: the arithmetic mean of the smallest and the largest, truncated towards zero (even if it is negative); see the sketch below
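
A minimal sketch of the fault-tolerant midpoint calculation in C; the mapping from the number of values to k (1–2 values → 0, 3–7 → 1, more → 2) follows the usual FlexRay convention and, together with the example deviation values, is an assumption for illustration.

#include <stdio.h>
#include <stdlib.h>

static int cmp_int(const void *a, const void *b)
{
    return *(const int *)a - *(const int *)b;
}

/* dev: measured sync-frame deviations in microticks, n: number of values */
static int ft_midpoint(int *dev, int n)
{
    int k = (n <= 2) ? 0 : (n <= 7) ? 1 : 2;
    qsort(dev, n, sizeof dev[0], cmp_int);   /* sort ascending */
    int lo = dev[k];                         /* smallest remaining value */
    int hi = dev[n - 1 - k];                 /* largest remaining value */
    return (lo + hi) / 2;                    /* C division truncates toward zero */
}

int main(void)
{
    int dev[] = {-7, -2, 1, 3, 40};          /* 40 is an outlier, dropped by k = 1 */
    printf("correction = %d microticks\n", ft_midpoint(dev, 5));
    return 0;
}

With the example deviations, the outlier 40 is discarded and the correction becomes the truncated mean of −2 and 3, i.e., 0 microticks.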

6.30. Rate correction

Principle:

• Measurement: the time difference between two succeeding cycles is measured

• The last N time period measurements are stored in a table

• If no sync frame error

• Calculation of fault-tolerant midpoint

• Damping value is applied

• smaller than the damping value? no correction

• larger than the damping value? correction = correction − damping value

• Within the upper/lower limits? If not, crop (see the sketch after this list)

• An optional external correction (supplied by host)
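
A minimal C sketch of the damping and limiting steps listed above; the variable names and the example values are illustrative assumptions.

#include <stdio.h>
#include <stdlib.h>

static int apply_damping_and_limit(int rate_corr, int damping, int limit)
{
    if (abs(rate_corr) <= damping)
        rate_corr = 0;                                      /* below the damping value: no correction */
    else
        rate_corr += (rate_corr > 0) ? -damping : damping;  /* subtract the damping value */

    if (rate_corr > limit)  rate_corr = limit;              /* crop to the upper limit */
    if (rate_corr < -limit) rate_corr = -limit;             /* crop to the lower limit */
    return rate_corr;
}

int main(void)
{
    printf("%d\n", apply_damping_and_limit(  3, 2, 10));    /* prints  1  */
    printf("%d\n", apply_damping_and_limit(  1, 2, 10));    /* prints  0  */
    printf("%d\n", apply_damping_and_limit(-25, 2, 10));    /* prints -10 */
    return 0;
}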

6.31. External clock synchronization

Even if every node is in sync within its cluster, different clusters can drift apart from each other

(clock sync assures consistency only within the cluster)

Solution: host-determined external offset and rate correction, applied to both clusters at the same time

Clock correction: adjustment of the number of microticks within a macrotick

6.32. Wakeup

• Every bus driver needs to have stable power supply

The whole cluster needs to be woken up (wakeup)

• startup on both channels (synchronously)

• startup initiation: coldstart can be initiated only by a coldstart node

Coldstart:

• Collision Avoidance Symbol (CAS)

• in the first 4 cycles only this node can send messages

• first the other coldstart nodes integrate

• later the other (non coldstart) nodes integrate

6.34. Startup cont.

Number of coldstart nodes:

• 3, if the number of nodes ≥ 3

• 2, if the number of nodes < 3

A startup frame is also a sync frame → a coldstart node is also a sync node

At least two fault-free coldstart nodes are necessary for cluster startup

leading coldstart node: the node that actively starts the coldstart procedure

following coldstart node: a coldstart node that integrates

Startup performed by coldstart nodes:

• coldstart startup initiation

• ends if there is stable communication between the coldstart nodes

Integration:

• at least 2 startup frames from different nodes

• integration can start during the coldstart nodes' startup, but cannot finish before at least 2 coldstart nodes have completed their startup

6.35. Node Local Bus Guardian

Transmission is possible only if both the CC and the BG enable it

If failure: fail-silent operation

The failure is not propagated

(the sender is disconnected from the bus)

BG: does not send, only supervises timing

6.36. Central Bus Guardian

Verified failures:

• short-circuit and other bus errors

• sending at inappropriate time

CBG: needs to store static timing information

6.37. Central Bus Guardian cont.

• capable of decoding frames

• NOT capable of coding frames

• has own clock synchronization mechanism (same as in CC)

• distinguishes between various frame types (startup, sync etc.)

• can be cascaded

7. Safety-critical systems

7.1. Safety-critical systems: Basic definitions

7.2. Introduction

• Safety-critical systems

• Informal definition: Malfunction may cause injury to people

• Safety-critical computer-based systems

• E/E/PE: Electrical, electronic, programmable electronic systems

• Control, protection, or monitoring

• EUC: Equipment under control

Railway signaling, x-by-wire, interlocking, emergency stopping, engine control, ...

7.3. Specialities of safety critical systems

• Special solutions to achieve safe operation

• Design: Requirements, architecture, tools, ...

• Verification, validation, and independent assessment

• Certification (by safety authorities)

• Basis of certification: Standards

• IEC 61508: Generic standard (for electrical, electronic or programmable electronic systems)

• DO-178B: Software in airborne systems and equipment

• EN 50129: Railway (control systems)

• EN 50128: Railway (software)

• ISO 26262: Automotive

• Other sector-specific standards: Medical, process control, etc.

7.4. Definition of safety

• Central concepts: Hazard, risk and safety


7.10. Accident examples

• Intelligent braking controlled by shock absorber + wheel rotation → delayed braking → hitting the embankment

• Is the control system "too intelligent"?

• Correct functioning but not safe behaviour!

7.11. Accident examples

• Toyota car accident in San Diego, August 2009

• Hazard: Stuck accelerator (full power)

• Floor mat problem

• Hazard control: What about ...

• Braking?

• Shutting off the engine?

• Putting the vehicle into neutral? (gearbox: D, P, N)

7.12. Experiences

• Harm is typically a result of a complex scenario

• (Temporal) combination of failure(s) and/or normal event(s)

• Hazards may not result in accidents

• Hazard ≠ failure

• Undetected (and unhandled) error is a typical cause of hazards

• Hazard may also be caused by (unexpected) combination of normal events

• Central problems in safety-critical systems:

• Analysis of hazards

• Assignment of functions to avoid the hazards → accidents → harms chain

7.13. Hazard control

• Containment: Reduce the consequence of a hazard

7.14. Safety-related system

• Safety function:

• Function which is intended to achieve or maintain a safe state for the EUC

• Safety-related system:

• Implements the required safety functions necessary to achieve or maintain a safe state for the EUC,

• and is intended to achieve the necessary safety integrity for the required safety functions

• Requirements for a safety-related system:

• What is the safety function: Safety function requirements

• What is the likelihood of the correct operation of the safety function: Safety integrity requirements

7.15. Safety integrity

• Safety integrity:

• Probability of a safety-related system satisfactorily performing the required safety functions (i.e., without failure)

• under all stated conditions

• within a stated period of time

• Types of safety integrity:

• Random (hardware): Related to random hardware failures

• Occur at a random time due to degradation mechanisms

• Systematic: Related to systematic failures

• Failures related in a deterministic way to faults that can only be eliminated by modification of the design / manufacturing process / operation procedure / documentation / other relevant factors

• Safety integrity level (SIL):

• Discrete level for specifying safety integrity requirements of the safety functions (i.e., probabilities of failures)

7.16. Example: Safety function

• Hazard analysis: Avoiding injury of the operator when cleaning the blade

• If the cover is lifted more than 5 mm then the motor should be stopped

• The motor should be stopped in less than 1 sec

• Safety function: Interlocking

• When the cover is lifted to 4 mm, the motor is stopped and braked in 0.8 s

• Safety integrity:

• The probability of failure of the interlocking (safety function) shall be less than 10⁻⁴ per operation (one failure in 10,000 operations)

• A failure of the interlocking does not necessarily result in an injury, since the operator may be careful

7.17. Safety and dependability

• Safety vs. reliability:

• Fail-safe state: safe, but 0 reliability

• Railway signaling, red state: Safety ≠ reliability

• Airplane control: Safety = reliability

• Safety vs. availability:

• Fail-stop state: safe, but 0 availability (and reliability)

• High availability may result in (short) unsafe states

• Safety vs. dependability:

• Safety function requirements:

• Derived from hazard identification

• Safety integrity requirements:

• Related to target failure measure of the safety function

• Derived from risk estimation: Acceptable risk

• Safety standards: Risk based approach for determining target failure measure

• Tolerable risk: Risk which is accepted in a given context based on the current values of society

• It is the result of risk analysis

• Performed typically by the customer

• Considering the environment, scenarios, mode of operation, ...

7.19. Risk based approach

• EN 50129: Railway applications

• THR: Tolerable hazard rate (continuous operation)

7.20. Risk analysis

• EN50129 (railway applications)

7.21. Mode of operation

• Way in which a safety-related system is to be used:

• Low demand mode: Frequency of demands for operation is

• no greater than one per year and

• no greater than twice the proof-test frequency

• High demand (or continuous) mode: Frequency of demands for operation is

• greater than one per year or

• greater than twice the proof-test frequency

• Target failure measure:

• Low demand mode: Average probability of failure to perform the desired function on demand

• High demand mode: Probability of a dangerous failure per hour

• Acceptable risk → Tolerable hazard rate (THR)

7.22. Safety integrity requirements

7.23. Determining SIL: Overview

• Hazard identification and risk analysis → Target failure measure

7.24. Structure of requirements

7.25. Challenges in achieving functional safety

• Preventing/controlling dangerous failures resulting from

• Incorrect specification (system, HW, SW)

• Omissions in safety requirement specification

• Hardware failure mechanisms: Random or systematic

• Software failure mechanisms: Systematic

• Common cause failures

• Human (operator) errors

• Environmental influences (e.g., temperature, EM, mechanical)

• Supply system disturbances (e.g., power supply)

• ...

7.26. Demonstrating SIL requirements

• Approaches:

• Random failure integrity:

• Quantitative approach: Based on statistics, experiments

• Systematic failure integrity:

• Qualitative approach: Rigor in the engineering

• Development life cycle

• Techniques and measures

• Documentation

• Independence of persons

• Safety case:

• Documented demonstration that the product complies with the specified safety requirements

• Systematic demonstration

7.27. Summary of the basic concepts

System safety

• emphasizes building in safety, not adding it to a completed design

• deals with systems as a whole rather than with subsystems or components

• takes a larger view of hazards than just failures

• emphasizes analysis rather than past experience and standards

7.29. Characterizing the system services

• Typical characteristics of services:

• Reliability, availability, integrity, ...

• These depend on the failures during the use of the services (the good quality of the production process is not enough)

• Composite characteristic: Dependability

• Definition: Ability to provide service in which reliance can justifiably be placed

• Justifiably: based on analysis, evaluation, measurements

• Reliance: the service satisfies the needs

• Basic question: How to avoid or handle the faults affecting the services?

7.31. Fault effects


7.34. Dependability and security

• Basic attributes of dependability:

• Availability: Probability of correct service (considering repairs and maintenance)

• Reliability: Probability of continuous correct service (until the first failure)

• Safety: Freedom from unacceptable risk of harm

• Integrity: Avoidance of erroneous changes or alterations

• Maintainability: Possibility of repairs and improvements

• Attributes of security:

• Availability

• Integrity

• Confidentiality: absence of unauthorized disclosure of information

7.35. Dependability metrics: Mean values

• Partitioning the state of the system: s(t)

• Correct (U, up) and incorrect (D, down) state partitions

• Mean values:

• Mean Time to First Failure: MTFF = E{u1}

• Mean Up Time: MUT = MTTF = E{ui} (Mean Time To Failure)

• Mean Down Time: MDT = MTTR = E{di} (Mean Time To Repair)

• Reliability: r(t) = probability of no failure until t

• Asymptotic availability (with regular repairs): A = MTTF / (MTTF + MTTR)

7.37. Availability related requirements

Availability of a system built up from components, where the availability of each component is 95% (the values are worked out in the sketch after this list):

• Availability of a system built from 2 components:

• Availability of a system built from 5 components:

• Availability of a system built from 10 components:
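
Worked numbers for the three bullets above, assuming a series structure (the system is up only if every component is up); the assumption and the small C program are illustrative.

#include <stdio.h>
#include <math.h>

int main(void)
{
    int counts[] = {2, 5, 10};
    for (int i = 0; i < 3; i++)
        printf("n = %2d components: A = 0.95^%d = %.4f\n",
               counts[i], counts[i], pow(0.95, counts[i]));
    /* prints approximately 0.9025, 0.7738 and 0.5987 */
    return 0;
}

The availability of a series system thus degrades quickly with the number of components, which motivates the redundancy schemes discussed later.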

7.38. Attributes of components

• Fault rate:

• Conditional probability (per unit time) that the component fails at time t, given that it has been correct until t

• In other way (on the basis of the definition of reliability):
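
Stated as a formula (the standard hazard-rate definition; r(t) denotes the reliability introduced above):

\lambda(t) = -\frac{1}{r(t)} \cdot \frac{dr(t)}{dt}, \qquad r(t) = e^{-\lambda t} \ \text{for a constant fault rate } \lambda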

7.39. Case study: development of a DMI

7.40. Case study: DMI requirements

• Safety:

• Safety Integrity Level: SIL 2

• Tolerable Hazard Rate: THR hazardous failures per hour

• CENELEC standards: EN 50129 and EN 50128

• Reliability:

• Mean Time To Failure: MTTF ≥ 5000 hours (5000 hours ≈ 7 months)

• Availability:

• A = MTTF / (MTTF + MTTR); the faulty state shall be less than 42 hours per year → MTTR ≤ 24 hours if MTTF = 5000 hours (a short derivation follows)
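
A short check of the 24-hour figure, assuming 8760 hours per year:

\frac{\mathrm{MTTR}}{\mathrm{MTTF}+\mathrm{MTTR}} = \frac{42}{8760} \;\Rightarrow\; \mathrm{MTTR} = 5000 \cdot \frac{42}{8760 - 42} \approx 24 \ \text{hours}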

7.41. Threats to dependability

7.42. The characteristics of faults

Software fault:

• Permanent design fault (systematic)

• Activation of the fault depends on the operational profile (inputs)

7.43. Means to improve dependability

• Fault prevention:

• Physical faults: Good components, shielding, ...

• Design faults: Good design methodology

• Fault removal:

• Design phase: Verification and corrections

• Prototype phase: Testing, diagnostics, repair

7.45. Overall safety lifecycle model: Goals

• Technical framework for the activities necessary for ensuring functional safety

• Covers all lifecycle activities

• Initial concept

• Hazard analysis and risk assessment

• Specification, design, implementation

• Operation and maintenance

• Modification

• Final decommissioning and/or disposal

7.46. Hardware and software development

• PE system architecture (partitioning of functions) determines software requirements

7.47. Software safety lifecycle

• Safety req. spec. has two parts:

• Software safety functions

• Software safety integrity levels

• Validation planning is required

• Integration with PE hardware is required

• Final step: Software safety validation

7.48. Example software lifecycle (V-model)

7.49. Maintenance activities

7.50. Techniques and measures: Basic approach

• Goal: Preventing the introduction of systematic faults and controlling the residual faults

• SIL determines the set of techniques to be applied as

• M: Mandatory

• HR: Highly recommended (rationale behind not using it should be detailed and agreed with the assessor)

• R: Recommended

• -: No recommendation for or against being used

• NR: Not recommended

• Combinations of techniques are allowed

• E.g., alternate or equivalent techniques are marked

• Hierarchy of methods is formed (references to tables)

7.51. Example: Guide to selection of techniques

• Software safety requirements specification:

• Techniques 2a and 2b are alternatives

• Referred table: Semi-formal methods (B.7)

• A.1: Software safety requirements specification

• A.2: Software architecture design

• A.3: Support tools and programming languages

• A.4: Detailed design

• A.5: Module testing and integration

• A.6: PE integration

• A.7: Software safety validation

• A.8: Modification

• A.9: Software verification

• A.10: Functional safety assessment

7.53. Hierarchy of design methods


7.57. Hierarchy of V and V methods


7.60. Application of tools in the lifecycle

• Fault prevention:

• Program translation from high-level programming languages

• MBD, CASE tools: High level modeling and code/configuration generators

• Fault removal:

• Analysis, testing and diagnosis

• Correction (code modification)

7.61. Safety concerns of tools

• Types of tools

• Tools potentially introducing faults

• Modeling and programming tools

• Program translation tools

• Tools potentially failing to detect faults

• Analysis and testing tools

• Project management tools

• Requirements

• Use certified or widely adopted tools

• "Increased confidence from use" (no evidence of improper results yet)

• Use the well-tested parts without altering the usage

• Check the output of tools (analysis/diversity)

• Control access and versions

7.62. Safety of programming languages

• Expertise available in the design team

• Coding standards (subsets of languages) are defined

• "Dangerous" constructs are excluded (e.g., function pointers)

• Static checking can be used to verify the subset

• Specific (certified) compilers are available

• Compiler verification kit for third-party compilers

7.63. Safety of programming languages

7.64. Language comparison

Wild jumps: Jump to arbitrary address in memory

Overwrites: Overwriting arbitrary address in memory

Model of math: Well-defined data types

Separate compilation: Type checking across modules

7.65. Coding standards for C and C++

• MISRA C (Motor Industry Software Reliability Association)

• Safe subset of C (2004): 141 rules (121 required, 20 advisory)

• Examples:

• Rule 33 (Required): The right hand side of a "&&" or "||" operator shall not contain side effects (illustrated in the sketch after this list).

• "Joint Strike Fighter Air Vehicle C++ Coding Standard"

7.66. Safety-critical OS: Required properties

• Partitioning in space

• Memory protection

• Guaranteed resource availability

• Partitioning in time

• Deterministic scheduling

• Guaranteed resource availability in time

• Mandatory access control for critical objects

• Not (only) discretionary

• Bounded execution time

• Also for system functions

• Support for fault tolerance and high availability

• Fault detection and recovery / failover

• Redundancy control

7.67. Example: Safety and RTOS

• Less scheduling risks

• High maintenance risks

• Example: Tornado for Safety Critical Systems

• Integrated software solution uses Wind River's securely partitioned VxWorks AE653 RTOS

• ARINC 653: Time and space partitioning(guaranteed isolation)

• RTCA/DO-178B: Level A certification

• POSIX, Ada, C support

7.68. Principles for documentation

• Type of documentation

• Comprehensive (overall lifecycle)

• E.g., Software Verification Plan

• Specific (for a given lifecycle phase)

• E.g., Software Source Code Verification Report

• Document Cross Reference Table

• Determines documentation for a lifecycle phase

• Determines relations among documents

• Traceability of documents is required

• Relationships between documents are specified (input, output)

• Terminology, references, abbreviations are consistent

• Merging documents is allowed

• If the responsible persons (authors) are not required to be independent

7.69. Document cross reference table (EN50128)

Table legend: creation of a document; document used in a given phase (read vertically)

7.70. Human factors

• In contrast to computers

• Humans often fail in:

• reacting in time

• following a predefined set of instructions

• Humans are good in:

• handling unanticipated problems

• Human errors

• Not all kinds of human errors are equally likely

• Hazard analysis (FMECA) is possible in a given context

• Results shall be integrated into system safety analysis

• Reducing the errors of developers

• Safe languages, tools, environments

• Training, experience and redundancy (independence)

• Reducing operator errors:

• Designing ergonomic HMI (patterns are available)

• Designing to aid the operator rather than take over

7.71. Organization

• Safety management

• Quality assurance

• Safety Organization

• Competence shall be demonstrated

• Training, experience and qualifications

• Independence of roles:

• DES: Designer (analyst, architect, coder, unit tester)

• VER: Verifier

• VAL: Validator

• ASS: Assessor

• MAN: Project manager

• QUA: Quality assurance personnel

7.72. Independence of personnel

7.73. Overall safety lifecycle (overview)

7.74. Specific activities (overview)


7.81. Summary

• Safety-critical systems

• Hazard, risk

• THR and Safety Integrity Level

• Dependability

• Attributes of dependability

• Fault → Error → Failure chain

• Means to improve dependability

• Development process

• Lifecycle activities

8. Design of the architecture of safety-critical systems

8.1. Design of the architecture of safety-critical systems

8.2. Objectives

• Stopping (switch-off) is a safe state

• In case of a detected error the system has to be stopped

• Detecting errors is a critical task

• Stopping (switch-off) is not a safe state

• Service is needed even in case of a detected error

• full service

• degraded (but safe) service

• Fault tolerance is required

8.3. Architectural solutions (overview)

• The effects of detected errors can be handled (compensated)

• All failure modes are safe

• "Inherent safe" system

8.4. Typical architectures for fail-stop operation

8.5. 1. Single channel architecture with built-in self-test

• Single processing flow

• Scheduled hardware self-tests

• After switch-on: Detailed self-test to detect permanent faults

• At run-time: On-line tests to detect latent permanent faults

• Scheduled software self-tests

• Typically application dependent techniques

• Checking the control flow, data acceptance rules, timeliness properties

• Disadvantages:

8.6. Implementation of on-line error detection

• Application dependent (ad-hoc) techniques

• Acceptance checking (e.g., for ranges of values)

• Timing related checking (e.g., too early, too late)

• Cross-checking (e.g., using an inverse function); see the sketch after this list

• Structure checking (e.g., in linked list structure)

• Application independent mechanisms

• Hardware supported on-line checking

• CPU level: Invalid instruction, user/supervisor modes etc.

• MMU level: Protection of memory ranges

• Generic architectural solutions
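
A minimal C sketch of two of the ad-hoc techniques listed above, acceptance (range) checking and cross-checking with an inverse function; the temperature range and the sqrt/square pair are illustrative assumptions.

#include <math.h>
#include <stdio.h>

/* Acceptance check: reject values outside the plausible sensor range. */
static int range_check(double temp_celsius)
{
    return temp_celsius >= -40.0 && temp_celsius <= 125.0;
}

/* Cross-check: recompute the input from the result with the inverse function. */
static int cross_check_sqrt(double x)
{
    double y = sqrt(x);                          /* function under check */
    return fabs(y * y - x) < 1e-9 * (x + 1.0);   /* inverse re-check within tolerance */
}

int main(void)
{
    printf("range_check(200.0)   = %d\n", range_check(200.0));    /* 0: rejected */
    printf("cross_check_sqrt(2.0) = %d\n", cross_check_sqrt(2.0)); /* 1: accepted */
    return 0;
}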

Stuck-at 0/1 faults:

Transition fault:

State transitions to check stuck faults:

"March" algorithms:

8.8. Example: Software self-test

• Checking the correctness of execution paths

• On the basis of the program control flow graph

8.9. Example: Software self-test

• Checking the correctness of execution paths

• On the basis of the program control flow graph

• Actual run: checked on the basis of the assigned signatures (see the sketch below)
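
A minimal C sketch of signature-based control-flow checking as described above; the XOR update rule, the block signatures and the fail-stop reaction are illustrative assumptions.

#include <stdio.h>
#include <stdlib.h>

static unsigned run_sig = 0;

static void enter_block(unsigned block_sig)
{
    run_sig ^= block_sig;                 /* update the run-time signature */
}

static void check_point(unsigned expected)
{
    if (run_sig != expected) {            /* illegal path through the control flow graph */
        fprintf(stderr, "control-flow error detected\n");
        exit(EXIT_FAILURE);               /* fail-stop reaction */
    }
}

int main(void)
{
    int x = 5;

    enter_block(0x1);                     /* block A: entry */
    if (x > 0)
        enter_block(0x2);                 /* block B: then-branch */
    else
        enter_block(0x3);                 /* block C: else-branch */

    /* Expected signature for the legal paths A->B or A->C. */
    check_point((x > 0) ? (0x1 ^ 0x2) : (0x1 ^ 0x3));
    printf("execution path accepted\n");
    return 0;
}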

8.11. Example: The SAFEDMI hardware architecture

• Single electronic structure based on reactive fail-safety

• Generic (off-the-shelf) hardware components are used

• Most of the safety mechanisms are based on software (error detection and error handling)

8.12. Example: The SAFEDMI hardware architecture

Hardware components:

8.13. Example: The SAFEDMI fault handling

• Operational modes:

• Startup, Normal, Configuration and Safe modes

• Suspect state to implement controlled restart/stop after an error

8.14. Example: The SAFEDMI error detection techniques

• Startup: Detection of permanent hardware faults

• CPU testing with external watchdog circuit

• Memory testing with marching algorithms

• EPROM integrity checking with error detection codes

• Shared input

• Comparison of outputs

• Stopping in case of deviation

• High error detection coverage

• The comparator is a critical component (but simple)

• Special way of comparison:

• Performed by the operator

• Disadvantages:

• Common mode faults

• Long detection latency

8.16. Example: TI Hercules Safety Microcontrollers

8.18. Example: SCADA system architecture

• Two channels

• Alternating bitmap visualization from the two channels: comparison by the operator

• Synchronization: detection of internal errors before the effects reach the outputs

8.19. Example: SCADA deployment options

• Two channels on the same server

• Statically linked software modules

• Independent execution in memory, disk and time

• Diverse data representation

• Binary data (signals): Inverse representation (original/negated)

• Diverse indexing in the technology database

• Two channels on two servers

• Synchronization on dedicated network

8.20. Example: SCADA error detection techniques

For random hardware faults during operation:

• Comparison of channels: Operator and I/O circuits

• Heartbeat: Blinking RGB-BGR symbols indicate the regular update of the bitmap on the screen

• Watchdog process

• Checking the availability of the other processes

• Regular comparison of the content of the technology database

• Detecting latent errors

For unintended control by the operator:

• Three-phased control of outputs: