Byzantine type error: a node reports different times to two different nodes for the same query
Necessary condition to be able to synchronize:
N ≥ 3k + 1 clocks in the system,
where k is the No. of nodes with Byzantine type errors
Fault tolerant average:
The message arrival time varies because of variance in media access time, timing of SW etc.
• if synchronized from application SW:
• if synchronized from OS kernel:
• if synchronized from HW of comm. controller:
Can be proved: if in a set of N nodes there is a latency jitter of ε in the communication time, the consistency of the clock system cannot be better than ε(1 - 1/N), even if all clocks are perfect.
5.13. Time standards
In distributed, real-time systems:
• International Atomic Time (Temps Atomique International, TAI)
• Based on an atomic clock (derived from the transition frequency of the Cesium-133 atom: 1 sec = 9 192 631 770 periods of the radiation)
• Universal Time Coordinated, UTC
• Derived from the movement of Earth and Sun (astronomy)
• Replaced GMT in 1972, with the second derived from TAI
• The movement of Earth is irregular, so extra leap seconds are sometimes inserted.
Time format: Network Time Protocol (NTP) is the most widespread
• 8 bytes: 4 bytes of UTC seconds and 4 bytes of fraction of a second (fraction: 232 psec resolution)
• time since January 1, 1900, 00:00:00
• the format is good until 2036 (136 years is the turnover cycle)
A new "Year 2000" problem!
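The numbers above can be checked with a short sketch. It assumes the 64-bit layout described in the list (32-bit seconds field, 32-bit fraction field, counted from 1900-01-01 00:00:00 UTC); the function name `ntp_to_datetime` is made up for this illustration and is not part of any NTP library.

```python
from datetime import datetime, timedelta, timezone

# Assumed layout: 32-bit seconds + 32-bit fraction, counted from 1900-01-01 UTC.
NTP_EPOCH = datetime(1900, 1, 1, tzinfo=timezone.utc)
FIELD_BITS = 32


def ntp_to_datetime(seconds: int, fraction: int) -> datetime:
    """Convert the two 32-bit fields of an NTP timestamp to a UTC datetime."""
    if not (0 <= seconds < 2**FIELD_BITS and 0 <= fraction < 2**FIELD_BITS):
        raise ValueError("both fields must fit into 32 bits")
    return (NTP_EPOCH
            + timedelta(seconds=seconds)
            + timedelta(seconds=fraction / 2**FIELD_BITS))


# One step of the fraction field, in picoseconds (about 232.8 ps).
resolution_ps = 1e12 / 2**FIELD_BITS

# Largest value of the seconds field: the last second before the counter wraps.
last_second = ntp_to_datetime(2**FIELD_BITS - 1, 0)

# Turnover cycle of the seconds field in years (about 136).
turnover_years = 2**FIELD_BITS / (365.25 * 24 * 3600)
```

The seconds field wraps in February 2036, which is where the "good until 2036" statement comes from.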
5.14. Duration, precedence
Assume that the clocks are internally synchronized
global time: weakened version of reference time
(more coarse resolution, macro tick)
Example 3 clocks: global clock
Why is it worth having a coarse grid?
We can assign the same time to an event if the synchronization error is smaller than the resolution:
Measurement of time difference:
If time difference is measured based on two different clocks:
• macro ticks have to be well specified
• clock inaccuracy needs to be taken into account!
Concept: message only in certain zones
6. Real-time systems 2.
6.1. Real-time systems
• Characteristics of real-time systems
• SW issues
• Real-time operating systems
• Scheduling, real-time performance analysis
• Memory management
• Clock synchronization
• Communication in real-time systems
6.2. Real-time communication
• requirements,
• synchronization,
• flow control,
• media access protocols
6.3. Requirements in real-time systems
Delay/jitter caused by the protocol
• should be small between the communication network interfaces (CNI) of sender and receiver
• jitter should be small and predictable,
• in distributed applications the message should appear in every node's CNI within a short and known time.
Composability
• separation of host and communication network interface,
A real-time communication system has to support changes in configuration.
E.g. communication system of a car should not be influenced by the existence or lack of an extra (optional) feature
Error detection
• communication errors: the communication system needs to be predictable and fault tolerant. Errors need to be detected and corrected.
• complete acknowledgement: end-to-end protocol
Never rely on transducers, always check their status!
Famous example: Three Mile Island nuclear accident, 1979
One of the valves failed to close, but the monitoring system showed it as closed, since the "close command" message had arrived correctly.
Physical structure
Multicast: bus or ring topology.
If the fault tolerance is accomplished through active redundancy:
devices need to be physically separated
E.g. steer-by-wire:
different parts are mounted in different places of the car
Synchronization of communication
• handshaking (data valid, data accepted), slowest receiver determines the speed of communication
6.4. Flow control
Goal: receiver has to keep up with the sender
Explicit flow control (event triggered): the receiver explicitly acknowledges to the sender that
• the message arrived correctly
• ready to receive a new message.
Example: Positive Acknowledgement or Retransmission Protocol (PAR)
ET protocol, given a sender and a receiver, communication media, time-out and a retransmission counter
Receiver: has the same message already arrived?
yes: send acknowledgement
no: send acknowledgement, notify client
Properties:
• communication is initiated by the sender,
• receiver is allowed to delay the sender (through a duplex communication channel),
• communication error is detected by the sender, the receiver does not receive information about the error,
• error correction through time redundancy,
• congestion: throughput decreases (nonlinearly) as the communication load increases
Example: bus, no global time, communication is token controlled
token round trip time: 10 msec
time of message sending: 1 msec
PAR protocol, No. of retransmissions: 2
What time-out is required? 22 msec (2 x message + 2 x token)
minimum latency: 1 msec
maximum latency: 2 x unsuccessful + 1 x successful = 2 x 22 msec + 11 msec = 55 msec
error detection time: 3 x unsuccessful = 66 msec
jitter: 55 msec - 1 msec = 54 msec
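The arithmetic of the PAR example can be retraced with a minimal sketch. The function name `par_timing` and the returned field names are chosen only for this illustration; the model simply follows the example (one attempt costs at most one token round trip plus one message time, and the time-out covers the message and its acknowledgement).

```python
def par_timing(token_round_trip_ms: float, message_ms: float, retransmissions: int) -> dict:
    """Worst-case timing figures of the PAR example, in msec."""
    # Worst case for one transfer: wait a full token rotation, then send.
    attempt = token_round_trip_ms + message_ms
    # The time-out has to cover the message and the returning acknowledgement.
    timeout = 2 * attempt
    # Best case: the sender holds the token and the first attempt succeeds.
    min_latency = message_ms
    # Worst successful case: every retransmission is used up, the last attempt succeeds.
    max_latency = retransmissions * timeout + attempt
    # The sender learns about a permanent failure only after the last time-out.
    error_detection = (retransmissions + 1) * timeout
    return {
        "timeout": timeout,
        "min_latency": min_latency,
        "max_latency": max_latency,
        "error_detection": error_detection,
        "jitter": max_latency - min_latency,
    }


example = par_timing(token_round_trip_ms=10, message_ms=1, retransmissions=2)
```

With the values of the example this gives a 22 msec time-out, 55 msec maximum latency, 66 msec error detection time and 54 msec jitter.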
6.6. Implicit flow control (time triggered):
• global time base required,
• no addressing is required, since every node knows the source and destination of each message from a table,
• receiver cannot influence the speed.
Media access control communication protocols
Properties of the channel:
bandwidth: 10 kbit/sec..1 Mbit/sec (wired)
1 Gbit/sec (optical)
propagation speed/delay: 300 000 km/sec
in wire 2/3 of that: e.g. 5 µsec/km
channel bit length: No. of bits that can traverse the channel
within one propagation delay
E.g. 100 Mbit/sec bandwidth, 200 m long cable,
propagation delay is 1 µsec, so the channel bit length is 100 bits.
A message has to be held on the channel until it has
propagated to and arrived at every node.
Protocol efficiency:
message length/(message length + channel bit length)
E.g. in previous example with 40 bits long message,
protocol efficiency=40/(40+100)=4/14=29%
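The same calculation as a small sketch, assuming the 5 µsec/km propagation delay quoted above for a wire; the function names are chosen for this illustration only.

```python
PROPAGATION_DELAY_S_PER_KM = 5e-6  # assumed: wire, about 2/3 of the speed of light


def channel_bit_length(bandwidth_bps: float, cable_km: float) -> int:
    """Number of bits that are on the channel within one propagation delay."""
    delay_s = cable_km * PROPAGATION_DELAY_S_PER_KM
    return round(bandwidth_bps * delay_s)


def protocol_efficiency(message_bits: int, channel_bits: int) -> float:
    """message length / (message length + channel bit length)."""
    return message_bits / (message_bits + channel_bits)


bits = channel_bit_length(bandwidth_bps=100e6, cable_km=0.2)  # 100 Mbit/sec, 200 m
efficiency = protocol_efficiency(message_bits=40, channel_bits=bits)
```

For the example this yields a channel bit length of 100 bits and an efficiency of about 29%; longer messages on the same channel raise the efficiency.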
6.7. Physical layer
asynchronous: synchronization only at the beginning of message,
poor quality clock at the receiver (drift given in sec/sec),
typically short frames because of clock drift.
synchronous: synchronization also on the fly,
(e.g. there are several edges/state changes, not necessarily separate clock)
longer messages can be transmitted.
Return-to-zero (RZ) coding
1: positive pulse, 0: negative pulse
• intermediate signal level required
• it has DC component
• synchronizing code, but a larger bandwidth is required per bit cell
Manchester coding (Ethernet, RFID etc.)
1: rising edge at half clock cycle,
0: falling edge at half clock cycle
At clock position: decision is made
based on next bit, if there is a
state change
• synchronizing code, but a larger bandwidth is required per bit cell
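A minimal sketch of Manchester coding with the convention given above (1: rising edge in the middle of the bit cell, 0: falling edge). Each bit is represented by two half-cell signal levels; the function names are chosen for this illustration.

```python
def manchester_encode(bits):
    """Return two half-cell levels per bit: 1 -> low, high; 0 -> high, low."""
    levels = []
    for bit in bits:
        levels.extend((0, 1) if bit else (1, 0))
    return levels


def manchester_decode(levels):
    """Recover the bits from the direction of the mid-cell edge."""
    if len(levels) % 2:
        raise ValueError("expected two half-cell levels per bit")
    bits = []
    for first, second in zip(levels[0::2], levels[1::2]):
        if first == second:
            raise ValueError("missing mid-cell edge: not a valid Manchester cell")
        bits.append(1 if second > first else 0)
    return bits


encoded = manchester_encode([1, 0, 1, 1])
decoded = manchester_decode(encoded)
```

Every bit cell contains an edge, which is what makes the code self-synchronizing; the price is two signal levels per bit, hence the larger bandwidth.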
Frequency modulation (FM) coding
C: clock position, D: data position
1: signal change in C and D,
0: signal change only in C
• synchronizing code, but double bandwidth is required
Modified Frequency Modulation Coding (MFM)
(floppy disks)
1: signal change only in D,
0 (after 0): signal change in C
0 (after 1): no signal change
• synchronizing code, single bandwidth enough
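The three MFM rules can be written down as a short sketch. Each bit cell is represented by a pair (change at C, change at D); the value assumed for the bit preceding the first one (`previous`) is a parameter of this illustration, not something the notes specify.

```python
def mfm_encode(bits, previous=0):
    """Return one (change_at_C, change_at_D) pair per bit cell."""
    cells = []
    for bit in bits:
        if bit == 1:
            cells.append((0, 1))      # 1: signal change only in D
        elif previous == 0:
            cells.append((1, 0))      # 0 after 0: signal change in C
        else:
            cells.append((0, 0))      # 0 after 1: no signal change
        previous = bit
    return cells


cells = mfm_encode([1, 0, 0, 1], previous=0)
# Flatten to the sequence of half-cell positions C, D, C, D, ...
flat = [change for cell in cells for change in cell]
```

In this representation two signal changes never occur at adjacent positions, which is why a single bandwidth is enough.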
6.8. Real-time systems
• Characteristics of real-time systems
• SW issues
• Real-time operating systems
• Scheduling, real-time performance analysis
• Memory management
• Clock synchronization
• Communication in real-time systems
• FlexRay
Pictures from the standard [1],[2]
[1] FlexRay Communications System, Protocol Specification, Version 2.1, Revision A
[2] FlexRay Communications System, Electrical Physical Layer Specification, Version 2.1, Revision B
Major properties:
• high communication speed (10 Mbps)
• time-triggered
• redundant, fault tolerant, safety critical
First car in mass production:
BMW X5 (2007)
• controls stabilizers and dampers (Adaptive Drive)
BMW X6 (2008) - full utilization of FlexRay
• transmission
• variable steering transmission ratio (active steering)
• HRT
6.9. Bus topology
Passive bus (passive star can be considered a bus)
Active star
Single channel cascaded star
Hybrid topology
6.10. Node architecture
6.11. Communication controller - bus driver interface
TxEN: Transmit Enable Not
Can send data to the bus only under certain conditions
(Time Triggered)
6.12. Physical layer specification
Link: wire - UTP or STP (unshielded/shielded twisted pair)
optical
If wired:
= 1.8 .. 3.2 V
= 0.6 .. 2.0 V
6.13. Media Access Control
Periodic communication cycles
• static time division protocol (Time Division Multiple Access)
• dynamic "mini-slotting"
action point: sending a message can start only at given macro tick
arbitration: sender can start the message only at predefined time
Static segment and network idle time are always present; dynamic segment and symbol window are optional
6.14. Static segment
• every slot has the same length within this segment (configurable)
• every frame has the same length within this segment (configurable)
6.15. Timing of static slot
6.16. Dynamic segment
• arbitration is based on minislots
The No. of minislots within a cluster is fixed
Action point: sending of message starts here
within a minislot
6.17. Symbol window
One symbol can be sent within a symbol window
The protocol doesn't contain arbitration
If arbitration is required, it has to be handled by a higher-level protocol
Symbols:
• Collision Avoidance Symbol (CAS)
• Media Access Test Symbol (MTS)
• Wakeup Symbol (WUS)
6.18. Frame format
• Payload preamble indicator: is an optional vector contained in payload?
• network management vector (static segment),
• message ID (dynamic segment), CAN-like content filtering
• Payload length: size of the payload segment ( byte)
• Cycle count: transmitting node's view of the value of the cycle counter
6.19. Frame coding
Between Communication Controller and Bus Driver
TxD, RxD, TxEN: binary code, NRZ coding
TSS: Transmission Start Sequence: Low for a given time
(configurable)
FSS: Frame Start Sequence: for compensation of possible
quantization errors at BSS
BSS: Byte Start Sequence: synchronization edge for the receiver
(every data is sent as extended byte)
FES: Frame End Sequence: end of the last byte
6.20. Frame coding in dynamic segment
DTS: Dynamic Trailing Sequence
To indicate the exact point in time of
transmitter's minislot action point
Prevents premature Channel idle detection
DTS is sent after FES (Frame End Sequence).
• extended byte from CRC, appending to the bit stream
• FES to the end
• DTS to the end, if dynamic segment
6.22. Symbol coding
The coding of two symbols is the same:
CAS (Collision Avoidance Symbol) and
MTS (Media Access Test Symbol)
Wakeup Symbol (WUS)
This is not High, it is Idle!
6.23. Sampling and majority voting
• every node samples RxD many times within a bit duration
• majority voting within a sliding window (5 width voting window)
• suppresses glitches (spikes) coming from transients or disturbances (assuming that their duration is short compared to the bit duration)
• this is the timing of start of bit, not the clock for oversampling RxD!
• Receiver counts the received samples (oversampled) within bit duration
• If VotedVal changes High → Low, sample counter is set to 2 at the next sample
• synchronization is accomplished only if enabled
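The majority voting over a sliding window of 5 samples can be sketched as follows. The function name and the choice to pre-fill the window with the idle level (High) are assumptions of this illustration.

```python
from collections import deque


def majority_vote(samples, window=5, initial=1):
    """Return the voted value after each RxD sample (sliding-window majority)."""
    # Assumption: the channel was idle (High) before the first sample.
    recent = deque([initial] * window, maxlen=window)
    voted = []
    for sample in samples:
        recent.append(sample)                     # the oldest sample drops out
        voted.append(1 if sum(recent) > window // 2 else 0)
    return voted


glitch = majority_vote([1, 1, 0, 1, 1])      # single-sample spike
real_edge = majority_vote([0, 0, 0, 0, 0])   # genuine High -> Low change
```

A spike of one or two samples never reaches the majority and is suppressed, while a real level change appears in the voted value with a delay of two samples.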
6.25. Clock synchronization
Every node has a local clock
Assumptions:
• every node within the distributed system has approximately the same view of time
• difference between any two nodes cannot exceed a tolerance
Distributed clock synchronization mechanism
6.26. Timing hierarchy
• microtick: derived from CC's internal timer (given prescaler, compare-match etc.) implementation might vary, controller specific
• macrotick: synchronized on a cluster-wide basis
Error model:
• offset (phase)
• rate (frequency)
Both are corrected
Measurement: every node measures the time difference between the actual and the expected sync edge in microticks.
The last N measurements are collected in a table.
6.28. Offset correction
• Measurement
• Selection of stored deviation values(only current communication cycle is used)
• If no sync frame error
• Calculation of fault-tolerant midpoint
• Within limit? If not, crop
• An optional external correction (supplied by host)
6.29. Fault-tolerant midpoint algorithm
• determination of k (the No. of values to be discarded at each end)
• omitting the k smallest and k largest differences
• from the remaining: arithmetic mean of the smallest and largest, truncated towards zero (even if it is negative)
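A minimal sketch of the fault-tolerant midpoint steps listed above. The thresholds used for the number of discarded values (none for 1-2 values, one at each end for 3-7, two at each end for more than 7) are stated here as an assumption and should be checked against the protocol specification [1]; returning 0 for an empty list is also an assumption of this illustration.

```python
def fault_tolerant_midpoint(deviations):
    """Fault-tolerant midpoint of measured deviations (in microticks)."""
    values = sorted(deviations)
    n = len(values)
    if n == 0:
        return 0  # assumption: nothing measured, nothing to correct
    # Assumed thresholds for the number of values dropped at each end.
    if n <= 2:
        k = 0
    elif n <= 7:
        k = 1
    else:
        k = 2
    remaining = values[k:n - k]
    # Mean of the smallest and largest remaining value; int() truncates toward zero.
    return int((remaining[0] + remaining[-1]) / 2)


with_outlier = fault_tolerant_midpoint([-5, -3, 0, 2, 100])
negative = fault_tolerant_midpoint([-7, -4, -3, 1, 20])
two_values = fault_tolerant_midpoint([6, 11])
```

In the first call the faulty measurement (100) is discarded before the midpoint is formed, so a single faulty node cannot pull the correction value arbitrarily far.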
6.30. Rate correction
Principle:
• Measurement: the time difference between two succeeding cycles is measured
• Last N time period measurements in a table
• If no sync frame error
• Calculation of fault-tolerant midpoint
• Damping value is applied
• smaller than a damping limit? no correction
• larger than a damping limit? correction = correction-limit
• Within upper/lower limits? If not, crop
• An optional external correction (supplied by host)
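The damping and cropping steps can be sketched as below. The symmetric handling of negative values, the treatment of a value exactly equal to the damping limit as "no correction", and the parameter names are assumptions of this illustration; the optional external correction is left out.

```python
def damped_rate_correction(midpoint, damping, limit):
    """Apply damping, then crop the result to the range [-limit, +limit]."""
    if abs(midpoint) <= damping:
        correction = 0                       # small deviation: no correction
    elif midpoint > 0:
        correction = midpoint - damping      # correction = correction - limit
    else:
        correction = midpoint + damping      # assumed mirror case for negative values
    # Crop to the allowed upper/lower limits.
    return max(-limit, min(limit, correction))


small = damped_rate_correction(3, damping=5, limit=100)
positive = damped_rate_correction(12, damping=5, limit=100)
negative = damped_rate_correction(-12, damping=5, limit=100)
cropped = damped_rate_correction(500, damping=5, limit=100)
```

The damping keeps the nodes from chasing small measurement noise from cycle to cycle.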
6.31. External clock synchronization
If every node is in sync. within the cluster, different clusters can drift apart from each other
(clock sync. assures only consistency within the cluster)
host-determined external offset and rate correction for both clusters at the same time
Clock correction: adjustment of microticks within a macrotick
6.32. Wakeup
• Every bus driver needs to have stable power supply
The whole cluster needs to be awakened (wakeup)
• startup on both channels (synchronously)
• startup initiation: coldstart can be initiated only by a coldstart node
Coldstart:
• Collision Avoidance Symbol (CAS)
• in the first 4 cycles only this node can send messages
• first the other coldstart nodes integrate
• later the other (non coldstart) nodes integrate
6.34. Startup cont.
No. of coldstart nodes is
3, if No. of nodes ≥ 3
2, if No. of nodes < 3
Startup frame is also a sync frame ⇒ coldstart node is also a sync node
At least two fault-free coldstart nodes are necessary for cluster startup
leading coldstart node: who actively starts the coldstart proc.
following coldstart node: coldstart node who integrates
Startup performed by coldstart nodes:
• coldstart startup initiation
• ends if there is stable communication between coldstart nodes
Integration:
• at least 2 startup frames from different nodes
• integration can start during coldstart node startup, but might not finish until the startup of at least 2 coldstart nodes finishes
6.35. Node Local Bus Guardian
A node can send only if both CC and BG enable it
If failure: fail-silent operation
The failure is not propagated
(sender is disconnected from the bus)
BG: doesn't send,
only supervises timing
6.36. Central Bus Guardian
Verified failures:
• short-circuit and other bus errors
• sending at inappropriate time
CBG: needs to store static timing information
6.37. Central Bus Guardian cont.
• capable of decoding frames
• NOT capable of coding frames
• has own clock synchronization mechanism (same as in CC)
• distinguishes between various frame types(startup, sync etc.)
• can be cascaded
7. Safety-critical systems
7.1. Safety-critical systems: Basic definitions
7.2. Introduction
• Safety-critical systems
• Informal definition: Malfunction may cause injury of people
• Safety-critical computer-based systems
• E/E/PE: Electrical, electronic, programmable electronic systems
• Control, protection, or monitoring
• EUC: Equipment under control
Railway signaling, x-by-wire, interlocking, emergency stopping, engine control, ...
7.3. Specialities of safety critical systems
• Special solutions to achieve safe operation
• Design: Requirements, architecture, tools, ...
• Verification, validation, and independent assessment
• Certification (by safety authorities)
• Basis of certification: Standards
• IEC 61508: Generic standard (for electrical, electronic or programmable electronic systems)
• DO-178B: Software in airborne systems and equipment
• EN 50129: Railway (control systems)
• EN 50128: Railway (software)
• ISO 26262: Automotive
• Other sector-specific standards: Medical, process control, etc.
7.4. Definition of safety
• Central concepts: Hazard, risk and safety
7.10. Accident examples
• Intelligent braking is controlled by shock absorber + wheel rotation → delayed braking → hitting the embankment
• Is the control system "too intelligent"?
• Correct functioning but not safe behaviour!
7.11. Accident examples
• Toyota car accident in San Diego, August 2009
• Hazard: Stuck accelerator (full power)
• Floor mat problem
• Hazard control: What about ...
• Braking?
• Shutting off the engine?
• Putting the vehicle into neutral? (gearbox: D, P, N)
7.12. Experiences
• Harm is typically a result of a complex scenario
• (Temporal) combination of failure(s) and/or normal event(s)
• Hazards may not result in accidents
• Hazard ≠ failure
• Undetected (and unhandled) error is a typical cause of hazards
• Hazard may also be caused by (unexpected) combination of normal events
• Central problems in safety-critical systems:
• Analysis of hazards
• Assignment of functions to avoid hazards → accidents → harms
7.13. Hazard control
• Containment: Reduce the consequence of a hazard
7.14. Safety-related system
• Safety function:
• Function which is intended to achieve or maintain a safe state for the EUC
• Safety-related system:
• Implements the required safety functions necessary to achieve or maintain a safe state for the EUC,
• and is intended to achieve the necessary safety integrity for the required safety functions
• Requirements for a safety-related system:
• What is the safety function: Safety function requirements
• What is the likelihood of the correct operation of the safety function: Safety integrity requirements
7.15. Safety integrity
• Safety integrity:
• Probability of a safety-related system satisfactorily performing the required safety functions (i.e., without failure)
• under all stated conditions
• within a stated period of time
• Types of safety integrity:
• Random (hardware): Related to random hardware failures
• Occur at a random time due to degradation mechanisms
• Systematic: Related to systematic failures
• Failures related in a deterministic way to faults that can only be eliminated by modification of the design / manufacturing process / operation procedure / documentation / other relevant factors
• Safety integrity level (SIL):
• Discrete level for specifying safety integrity requirements of the safety functions (i.e., probabilities of failures)
7.16. Example: Safety function
• Hazard analysis: Avoiding injury of the operator when cleaning the blade
• If the cover is lifted more than 5 mm then the motor should be stopped
• The motor should be stopped in less than 1 sec
• Safety function: Interlocking
• When the cover is lifted to 4 mm, the motor is stopped and braked in 0.8 s
• Safety integrity:
• The probability of failure of the interlocking (safety function) shall be less than 10^-4 (one failure in 10,000 operations)
• Failure of the interlocking does not necessarily result in an injury, since the operator may be careful
7.17. Safety and dependability
• Safety vs. reliability:
• Fail-safe state: safe, but 0 reliability
• Railway signaling, red state: Safety ≠ reliability
• Airplane control: Safety = reliability
• Safety vs. availability:
• Fail-stop state: safe, but 0 availability (and reliability)
• High availability may result in (short) unsafe states
• Safety vs. dependability:
• Safety function requirements:
• Derived from hazard identification
• Safety integrity requirements:
• Related to target failure measure of the safety function
• Derived from risk estimation: Acceptable risk
• Safety standards: Risk based approach for determining target failure measure
• Tolerable risk: Risk which is accepted in a given context based on the current values of society
• It is the result of risk analysis
• Performed typically by the customer
• Considering the environment, scenarios, mode of operation, ...
7.19. Risk based approach
• EN 50129: Railway applications
• THR: Tolerable hazard rate (continuous operation)
7.20. Risk analysis
• EN 50129 (railway applications)
7.21. Mode of operation
• Way in which a safety-related system is to be used:
• Low demand mode: Frequency of demands for operation is
• no greater than one per year and
• no greater than twice the proof-test frequency
• High demand (or continuous) mode: Frequency of demands for operation is
• greater than one per year or
• greater than twice the proof-test frequency
• Target failure measure:
• Low demand mode: Average probability of failure to perform the desired function on demand
• High demand mode: Probability of a dangerous failure per hour
• Acceptable risk → Tolerable hazard rate (THR)
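The two demand-mode conditions can be expressed as a small sketch. The function name, the per-year units and the returned strings are chosen for this illustration only.

```python
def mode_of_operation(demands_per_year: float, proof_tests_per_year: float) -> str:
    """Classify a safety-related system by the frequency of demands."""
    # Low demand mode needs BOTH conditions from the list above.
    low_demand = (demands_per_year <= 1
                  and demands_per_year <= 2 * proof_tests_per_year)
    return "low demand" if low_demand else "high demand or continuous"


def target_failure_measure(mode: str) -> str:
    """Which target failure measure applies to the given mode."""
    if mode == "low demand":
        return "average probability of failure on demand"
    return "probability of a dangerous failure per hour"


rare = mode_of_operation(demands_per_year=0.5, proof_tests_per_year=1)
frequent = mode_of_operation(demands_per_year=10, proof_tests_per_year=1)
rarely_tested = mode_of_operation(demands_per_year=1, proof_tests_per_year=0.25)
```

The third call shows that a system demanded only once a year still counts as high demand if it is proof-tested too rarely.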
7.22. Safety integrity requirements
7.23. Determining SIL: Overview
• Hazard identification and risk analysis → Target failure measure
7.24. Structure of requirements
7.25. Challenges in achieving functional safety
• Preventing/controlling dangerous failures resulting from
• Incorrect specification (system, HW, SW)
• Omissions in safety requirement specification
• Hardware failure mechanisms: Random or systematic
• Software failure mechanisms: Systematic
• Common cause failures
• Human (operator) errors
• Environmental influences (e.g., temperature, EM, mechanical)
• Supply system disturbances (e.g., power supply)
• ...
7.26. Demonstrating SIL requirements
• Approaches:
• Random failure integrity:
• Quantitative approach: Based on statistics, experiments
• Systematic failure integrity:
• Qualitative approach: Rigor in the engineering
- Development life cycle
- Techniques and measures
- Documentation
- Independence of persons
• Safety case:
• Documented demonstration that the product complies with the specified safety requirements
• Systematic demonstration
7.27. Summary of the basic concepts
System safety
• emphasizes building in safety, not adding it to a completed design
• deals with systems as a whole rather than with subsystems or components
• takes a larger view of hazards than just failures
• emphasizes analysis rather than past experience and standards
7.29. Characterizing the system services
• Typical characteristics of services:
• Reliability, availability, integrity, ...
• These depend on the failures during the use of the services (the good quality of the production process is not enough)
• Composite characteristic: Dependability
• Definition: Ability to provide service in which reliance can justifiably be placed
• Justifiably: based on analysis, evaluation, measurements
• Reliance: the service satisfies the needs
• Basic question: How to avoid or handle the faults affecting the services?
7.31. Fault effects
7.34. Dependability and security
• Basic attributes of dependability:
• Availability: Probability of correct service (considering repairs and maintenance)
• Reliability: Probability of continuous correct service (until the first failure)
• Safety: Freedom from unacceptable risk of harm
• Integrity: Avoidance of erroneous changes or alterations
• Maintainability: Possibility of repairs and improvements
• Attributes of security:
• Availability
• Integrity
• Confidentiality: absence of unauthorized disclosure of information
7.35. Dependability metrics: Mean values
• Partitioning the state of the system: s(t)
• Correct (U, up) and incorrect (D, down) state partitions
• Mean values:
• Mean Time to First Failure: MTFF = E{u1}
• Mean Up Time: MUT = MTTF = E{ui} (Mean Time To Failure)
• Mean Down Time: MDT = MTTR = E{di} (Mean Time To Repair)
• Reliability: r(t) = P{s(t') in U for all t' < t} (no failure until t)
• Asymptotic availability: K = lim a(t) as t → ∞, where a(t) = P{s(t) in U} (regular repairs)
In other way: K = A = MTTF / (MTTF+MTTR)
7.37. Availability related requirements
Availability of a system built up from components (assuming every component is needed for the service),
where the availability of a component is 95%:
• Availability of a system built from 2 components: 0.95^2 ≈ 90%
• Availability of a system built from 5 components: 0.95^5 ≈ 77%
• Availability of a system built from 10 components: 0.95^10 ≈ 60%
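A short sketch of how the system availability falls with the number of components. It assumes that the components fail independently and that the system needs all of them (a series structure); both are assumptions of this illustration.

```python
def series_availability(component_availability: float, components: int) -> float:
    """Availability of a system that needs all of its independent components."""
    return component_availability ** components


for n in (2, 5, 10):
    print(f"{n:2d} components: {series_availability(0.95, n):.1%}")
```

Under these assumptions the availability drops to roughly 90%, 77% and 60% for 2, 5 and 10 components, which is why the component-level requirement has to be much stricter than the system-level one.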
7.38. Attributes of components
• Fault rate: λ(t)
• Probability that the component will fail at time point t given that it has been correct until t
• In other way (on the basis of the definition of reliability): λ(t) = -(1/r(t)) · dr(t)/dt
7.39. Case study: development of a DMI
7.40. Case study: DMI requirements
• Safety:
• Safety Integrity Level: SIL 2
• Tolerable Hazard Rate: 10^-7 <= THR < 10^-6 hazardous failures per hour
• CENELEC standards: EN 50129 and EN 50128
• Reliability:
• Mean Time To Failure: MTTF > 5000 hours (5000 hours: approx. 7 months)
• Availability:
• A = MTTF / (MTTF+MTTR), A > 0.9952
• Faulty state: shall be less than 42 hours per year
• MTTR < 24 hours
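The availability figures of the case study can be cross-checked with a minimal sketch. It takes MTTF = 5000 hours and MTTR = 24 hours as the boundary values and a year of 8760 hours; the function names are chosen for this illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760


def availability(mttf_h: float, mttr_h: float) -> float:
    """A = MTTF / (MTTF + MTTR)."""
    return mttf_h / (mttf_h + mttr_h)


def downtime_per_year_h(a: float) -> float:
    """Expected time spent in the faulty state per year, in hours."""
    return (1 - a) * HOURS_PER_YEAR


a = availability(mttf_h=5000, mttr_h=24)
downtime = downtime_per_year_h(a)
```

With these boundary values A is about 0.9952 and the expected faulty time is just under 42 hours per year, so the MTTF, MTTR and availability requirements are consistent with each other.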