Communication models - Real-time and Safety-critical Embedded Systems

• unicast or multicast

• unicast: point-to-point connection

• multicast: from one node to any number of nodes

• broadcast: from one node to every node

• symmetric or asymmetric link

• symmetric: "A" can send a message to "B" if and only if "B" can send a message to "A"

• asymmetric: link might be unidirectional (e.g. different transmission powers)

• implicit or explicit synchronization

• implicit: together with other communication (piggy-back)

• explicit: message only for sync. purposes

• internal or external synchronization

• internal: reference is within the network

• external: reference is out of the network (e.g. GPS)

• continuous or on-demand

• on-demand: pre- or post-facto

• every node or just a subset

• end: message is sent (needs to wait until the channel is free)

• time of signal propagation

• reception time: from arrival of the signal until its reception by the application

Types of clock systems

(a) Central clock system

• Accurate and precise clock provides the time for whole system,

• standby redundancy for fault tolerance,

• high cost (accurate clock is expensive),

• communication demand is low (one message per sync.)

Example: GPS, satellites transmitting time signals, can be used for sync. with ns accuracy

(b) Centrally controlled clock systems

• one master clock (assumed to be accurate) polls the time of slaves

• measures the difference and specifies a correction for slaves,

• if master gets out of order, one of the slaves takes over the duty,

• transmission time and delay need to be estimated,

• communication demand is higher than in the central clock system.

Master-slave algorithm in more detail

Assumptions:

error of master clock is smaller than that of slave ( small)

goal is consistency within the network, not with global time

Master-slave communication and calculations summarized:

1. Master sends its own time to Slave

2. Slave calculates the time difference at the moment of reception

3. Slave sends its own time and the previous difference to Master

4. Master calculates the time difference at the moment of reception

5. Master calculates the required correction

6. Master sends the required correction to Slave

7. Slave corrects its clock with the received correction

8. Repeat 1-7 for every Slave

The correction term:

: average error

: difference of quantization error of clocks

: difference in duration of communications

The last two terms are random and can be decreased by averaging.
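The master-slave exchange can be illustrated numerically. The following is a minimal sketch, assuming the common two-way scheme in which the correction is half the difference of the two measured differences. The names `d1`, `d2`, `offset`, `delay_ms`, `delay_sm` and `turnaround` are illustrative assumptions, not symbols taken from these notes.

```python
def master_slave_correction(d1, d2):
    """Correction the slave should add to its clock (assumed two-way scheme).

    d1: difference measured by the slave  (slave rx time - master tx time)
    d2: difference measured by the master (master rx time - slave tx time)
    If slave = master + offset and both directions take the same time:
        d1 = offset + delay,  d2 = -offset + delay
    so the slave is ahead by (d1 - d2) / 2 and must subtract that amount.
    """
    return -(d1 - d2) / 2.0


def simulate_round(offset, delay_ms, delay_sm, master_t0=1000.0, turnaround=5.0):
    """Simulate steps 1-5 for one slave; all times in arbitrary units."""
    t_m_tx = master_t0                         # 1. master sends its time
    t_s_rx = t_m_tx + delay_ms + offset        # 2. slave clock at reception
    d1 = t_s_rx - t_m_tx
    t_s_tx = t_s_rx + turnaround               # 3. slave replies with its time and d1
    t_m_rx = (t_s_tx - offset) + delay_sm      # 4. master clock at reception
    d2 = t_m_rx - t_s_tx
    return master_slave_correction(d1, d2)     # 5. required correction


# Symmetric delays: the offset of 3.0 is removed completely.
print(simulate_round(offset=3.0, delay_ms=2.0, delay_sm=2.0))
# Asymmetric delays: half of the delay difference remains as error.
print(simulate_round(offset=3.0, delay_ms=2.0, delay_sm=4.0))
```

In the asymmetric case the residual error equals half the difference of the two communication durations, which is why that term can only be reduced by averaging over several rounds.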

After the synchronization there is still remaining error

It increases with the time because of clock drift:

After time the max. deviation from the master:

Max. difference of clocks of two slaves after time:

Example: in the case of the mitmót hardware, which error can be neglected?

Every clock is aware of its own error (accuracy interval):

Terms of :

base error: remaining error at reset time (sync.)

delay between reading the clock and updating clock

drift causes error because of delay

Communication time + error caused by the drift is taken into account by incrementing :

5.8. Minimization of maximal error

If request is received from :

, Send to

At least once every time:

5.9. Intersection interval algorithm

If request is received from :

5.10. Visualization of Intersection interval algorithm

If the intervals do not intersect each other no sync. is possible
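The interval intersection idea can be sketched in a few lines. This is an illustrative sketch only: each clock is assumed to report a value together with its maximal error, and setting the new clock to the midpoint of the intersection (with half the width as the new error) is one common choice, not necessarily the rule used in these notes.

```python
def intersect_intervals(readings):
    """readings: list of (clock_value, max_error) pairs.

    Every reading claims the true time lies in [value - error, value + error].
    Returns the common interval (low, high), or None if the intervals
    do not all intersect (no synchronization possible).
    """
    low = max(value - error for value, error in readings)
    high = min(value + error for value, error in readings)
    if low > high:
        return None
    return low, high


def resync(readings):
    """Assumed update rule: new clock = midpoint, new error = half width."""
    common = intersect_intervals(readings)
    if common is None:
        return None
    low, high = common
    return (low + high) / 2.0, (high - low) / 2.0


print(intersect_intervals([(100, 5), (103, 4), (98, 6)]))
print(resync([(100, 5), (103, 4), (98, 6)]))
print(intersect_intervals([(100, 1), (110, 1)]))
```

The intersection is never wider than the narrowest input interval, so the accuracy interval shrinks whenever another clock offers tighter bounds.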

5.11. Clock sync. in the case of Byzantine type errors

Byzantine type error: a node reports different times to two different nodes for the same query

Necessary condition to be able to synchronize:

$N \geq 3k + 1$ clocks in the system,

where $k$ is the No. of nodes with Byzantine type errors

Fault tolerant average:
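A minimal sketch of a fault-tolerant average, assuming the usual formulation: the measured clock differences are sorted, the `k` largest and `k` smallest values are discarded, and the rest are averaged. The function name and the parameter `k` (number of tolerated Byzantine clocks) are illustrative assumptions.

```python
def fault_tolerant_average(differences, k):
    """Average of the clock differences after dropping k extremes on each side.

    Requires at least 3k + 1 values, matching the necessary condition
    for tolerating k Byzantine clocks.
    """
    n = len(differences)
    if n < 3 * k + 1:
        raise ValueError("need at least 3k + 1 clock readings")
    ordered = sorted(differences)
    kept = ordered[k:n - k]          # discard k smallest and k largest
    return sum(kept) / len(kept)


# Two faulty clocks report wildly wrong differences (500 and -400).
print(fault_tolerant_average([-2, 1, 0, 3, 500, -400, 2], k=2))
```

Because a faulty value is either discarded or bracketed by values from correct clocks, it cannot pull the average outside the range spanned by the correct clocks.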

The message arrives with varying time because of variance in time of media access, timing of SW etc.

• if synchronized from application SW:

• if synchronized from OS kernel:

• if synchronized from HW of comm. controller:

It can be proved: if in a set of $N$ nodes there is a latency jitter of $\varepsilon$ in the communication time, the consistency of the clock system cannot be better than $\varepsilon (1 - 1/N)$, even if all clocks are perfect.

5.13. Time standards

In distributed, real-time systems:

• International Atomic Time (Temps Atomique International, TAI)

• Based on an atomic clock (derived from the transition frequency of the Cesium-133 atom: 1 sec = 9 192 631 770 periods)

• Coordinated Universal Time (UTC)

• Derived from the movement of Earth and Sun (astronomy)

• Replaced GMT in 1972, with 1 sec derived from TAI

• The movement of Earth is irregular, so extra leap seconds are sometimes inserted.

Time format: Network Time Protocol (NTP) is the most widespread

• 8 bytes, of which 4 bytes are UTC seconds and 4 bytes are the fraction of a second (fraction: 232 psec resolution)

• time since January 1, 1900, 00:00:00

• the format is good until 2036 (136 years is the turnover cycle)

A new "Year 2000" problem!
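The NTP time format can be made concrete with a short sketch. It assumes the 64-bit layout described above (32-bit seconds since 1900-01-01, 32-bit binary fraction); the helper names `to_ntp` and `ntp_rollover` are illustrative and not part of any NTP library.

```python
from datetime import datetime, timedelta, timezone

NTP_EPOCH = datetime(1900, 1, 1, tzinfo=timezone.utc)


def to_ntp(dt):
    """Convert an aware datetime to (seconds, fraction), both 32-bit fields."""
    delta = dt - NTP_EPOCH
    total_seconds = delta.days * 86400 + delta.seconds
    seconds = total_seconds % 2**32                  # wraps after 2**32 seconds
    fraction = (delta.microseconds << 32) // 1_000_000
    return seconds, fraction


def ntp_rollover():
    """First instant at which the 32-bit seconds field wraps back to zero."""
    return NTP_EPOCH + timedelta(seconds=2**32)


print(to_ntp(datetime(1970, 1, 1, tzinfo=timezone.utc)))
print(ntp_rollover())
print(1e12 / 2**32, "psec per fraction step")
```

The seconds field counts 2^32 seconds, about 136 years, which places the wrap-around in February 2036.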

5.14. Duration, precedence

Assume that the clocks are internally synchronized

global time: weakened version of reference time

(more coarse resolution, macro tick)

Example 3 clocks: global clock

Why is it worth having a coarse grid?

We can assign the same time to an event if sync. time is smaller than the resolution:

Measurement of time difference:

If time difference is measured based on two different clocks:

• macro ticks have to be well specified

• clock inaccuracy needs to be taken into account!

Concept: message only in certain zones

6. Real-time systems 2.

6.1. Real-time systems

• Characteristics of real-time systems

• SW issues

• Real-time operating systems

• Scheduling, real-time performance analysis

• Memory management

• Clock synchronization

• Communication in real-time systems

6.2. Real-time communication

• requirements,

• synchronization,

• flow control,

• media access protocols

6.3. Requirements in real-time systems

Delay/jitter caused by the protocol

• should be small between the communication network interfaces (CNI) of sender and receiver

• jitter should be small and predictable,

• in distributed applications the message should appear in every node's CNI within a short and known time.

Composability

• separation of host and communication network interface,

A real-time communication system has to support changes in configuration.

E.g. communication system of a car should not be influenced by the existence or lack of an extra (optional) feature

Error detection

• communication errors: the communication system needs to be predictable and fault tolerant. Errors need to be detected and corrected.

• complete acknowledgement: end-to-end protocol

Never rely on transducers, always check their status!

Famous example: Three Mile Island nuclear accident, 1979

One of the valves failed to close, but the monitoring system showed it as closed, since the "close" command message had arrived correctly.

Physical structure

Multicast: bus or ring topology.

If the fault tolerance is accomplished through active redundancy:

devices need to be physically separated

E.g. steer-by-wire:

different parts are mounted in different places of the car

Synchronization of communication

• handshaking (data valid, data accepted), slowest receiver determines the speed of communication

6.4. Flow control

Goal: receiver has to keep up with the sender

Explicit flow control (event triggered): the receiver explicitly acknowledges to the sender that

• the message arrived correctly

• ready to receive a new message.

Example: Positive Acknowledgement or Retransmission Protocol (PAR)

ET protocol, given a sender and a receiver, communication media, time-out and a retransmission counter

Receiver: has the same message already arrived?

yes: send acknowledgement

no: send acknowledgement, notify client

Properties:

• communication is initiated by the sender,

• receiver is allowed to delay the sender (through a duplex communication channel),

• communication error is detected by the sender, the receiver does not receive information about the error,

• error correction through time redundancy,

• congestion: throughput decreases (nonlinearly) with the increase of comm. load

Example: bus, no global time, communication is token controlled

token round trip time: 10 msec

time of message sending: 1 msec

PAR protocol, No. of retransmission: 2

What time-out is required? 22 msec (2 x message + 2 x token)

min. latency: 1 msec

max. latency: 2 x unsuccessful + 1 x successful = 2 x 22 msec + 11 msec = 55 msec

error detection time: 3 x unsuccessful = 66 msec

jitter: 55 msec - 1 msec = 54 msec
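The arithmetic of the PAR example can be reproduced with a small helper. This is a sketch under the assumptions of the example (time-out = 2 x message + 2 x token, a successful attempt takes one token round plus one message time); the function and key names are illustrative.

```python
def par_timing(token_round_trip, message_time, retransmissions):
    """Timing figures of a PAR protocol on a token bus (all values in msec)."""
    timeout = 2 * message_time + 2 * token_round_trip
    successful_attempt = token_round_trip + message_time
    d_min = message_time                       # token available immediately
    d_max = retransmissions * timeout + successful_attempt
    error_detection = (retransmissions + 1) * timeout
    return {
        "timeout": timeout,
        "d_min": d_min,
        "d_max": d_max,
        "error_detection": error_detection,
        "jitter": d_max - d_min,
    }


print(par_timing(token_round_trip=10, message_time=1, retransmissions=2))
```

The jitter of 54 msec is more than fifty times the message transmission time, which illustrates why explicit flow control is hard to use when tight timing is required.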

6.6. Implicit flow control (time triggered):

• global time base required,

• no addressing is required, since every node knows the source and destination of each message from a table,

• receiver cannot influence the speed.

Media access control communication protocols

Properties of the channel:

bandwidth: 10 kbit/sec..1 Mbit/sec (wired)

1 Gbit/sec (optical)

propagation speed/delay: 300 000 km/sec

in wire 2/3 of that, i.e. a delay of about 5 µsec/km

channel bit length: No. of bits that can traverse the channel

within one propagation delay

E.g. 100 Mbit/sec bandwidth, 200 m long cable:

propagation delay is 1 µsec, so the channel bit length is 100 bits.

A message must be held on the channel until it has propagated to and arrived at every node.

Protocol efficiency:

message length/(message length + channel bit length)

E.g. in previous example with 40 bits long message,

protocol efficiency=40/(40+100)=4/14=29%
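The two formulas can be checked with a short calculation. This is a minimal sketch; the default propagation delay of 5 µsec/km is the wire value from the example, and the function names are illustrative.

```python
def channel_bit_length(bandwidth_bps, cable_length_m, delay_s_per_km=5e-6):
    """Number of bits that fit on the channel during one propagation delay."""
    propagation_delay = (cable_length_m / 1000.0) * delay_s_per_km
    return bandwidth_bps * propagation_delay


def protocol_efficiency(message_bits, channel_bits):
    """Upper bound: message length / (message length + channel bit length)."""
    return message_bits / (message_bits + channel_bits)


bits = channel_bit_length(100e6, 200)
print(round(bits))                                        # channel bit length
print(round(100 * protocol_efficiency(40, bits)), "%")    # efficiency for a 40-bit message
```

Longer messages or shorter cables improve the efficiency, because the fixed cost of one channel bit length is then spread over more payload bits.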

6.7. Physical layer

asynchronous: synchronization only at the beginning of message,

poor-quality clock at the receiver (e.g. sec/sec drift),

typically short frames because of clock drift.

synchronous: synchronization also on the fly,

(e.g. there are several edges/state changes, not necessarily separate clock)

longer messages can be transmitted.

Return-to-zero (RZ) coding

1: positive pulse, 0: negative pulse

• intermediate signal level required

• it has DC component

• synchronizing code, but a larger bandwidth is required per bit cell

Manchester coding (Ethernet, RFID etc.)

1: rising edge at half clock cycle,

0: falling edge at half clock cycle

At clock position: whether there is a state change is decided based on the next bit

• synchronizing code, but a larger bandwidth is required per bit cell
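Manchester coding can be sketched by representing every bit cell as two half-cell signal levels. The convention used here follows the rule above (1 = rising edge at mid-cell, 0 = falling edge); the function names and the 0/1 level representation are illustrative assumptions.

```python
def manchester_encode(bits):
    """Each bit becomes two half-cell levels: 1 -> (0, 1), 0 -> (1, 0)."""
    levels = []
    for bit in bits:
        levels.extend((0, 1) if bit else (1, 0))
    return levels


def manchester_decode(levels):
    """Recover the bits from the direction of the mid-cell edge."""
    if len(levels) % 2 != 0:
        raise ValueError("expected two half-cell levels per bit")
    bits = []
    for i in range(0, len(levels), 2):
        pair = (levels[i], levels[i + 1])
        if pair == (0, 1):
            bits.append(1)
        elif pair == (1, 0):
            bits.append(0)
        else:
            raise ValueError("missing mid-cell transition")
    return bits


encoded = manchester_encode([1, 0, 1, 1])
print(encoded)
print(manchester_decode(encoded))
```

Every bit cell contains an edge, so the receiver can resynchronize on each bit; the price is that the signal may change twice per bit, which is the larger bandwidth mentioned above.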

Frequency modulation coding

C: clock position, D: data position

1: signal change in C and D,

0: signal change in C only

• synchronizing code, but double bandwidth is required per bit cell

Modified Frequency Modulation Coding (MFM)

(floppy disks)

1: signal change only in D,

0 (after 0): signal change in C

0 (after 1): no signal change

• synchronizing code, single bandwidth enough
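The MFM rules can be written down directly as a table of signal changes per bit cell. In this sketch each cell is a pair (change at C, change at D); the assumption that the bit before the sequence was 0, the initial signal level 0, and the function names are all illustrative.

```python
def mfm_transitions(bits, previous_bit=0):
    """Return one (change_at_C, change_at_D) pair per bit cell."""
    cells = []
    for bit in bits:
        if bit == 1:
            cells.append((0, 1))        # 1: signal change only in D
        elif previous_bit == 0:
            cells.append((1, 0))        # 0 after 0: signal change in C
        else:
            cells.append((0, 0))        # 0 after 1: no signal change
        previous_bit = bit
    return cells


def transitions_to_levels(cells, level=0):
    """Turn the transition pairs into two half-cell signal levels per bit."""
    levels = []
    for change_c, change_d in cells:
        if change_c:
            level ^= 1
        levels.append(level)            # first half of the bit cell
        if change_d:
            level ^= 1
        levels.append(level)            # second half of the bit cell
    return levels


cells = mfm_transitions([1, 0, 0, 1, 1, 0])
print(cells)
print(transitions_to_levels(cells))
```

No cell ever has a change at both C and D, so there is at most one signal change per bit cell, which is why a single bandwidth is enough.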

6.8. Real-time systems

• Characteristics of real-time systems

• SW issues

• Real-time operating systems

• Scheduling, real-time performance analysis

• Memory management

• Clock synchronization

• Communication in real-time systems

• FlexRay

Pictures from the standard [1],[2]

[1] FlexRay Communications System, Protocol Specification, Version 2.1, Revision A

[2] FlexRay Communications System, Electrical Physical Layer Specification, Version 2.1, Revision B

Major properties:

• high communication speed (10 Mbit/s)

• time-triggered

• redundant, fault tolerant, safety critical

First car in mass production:

BMW X5 (2007)

• controls stabilizers and dampers (Adaptive Drive)

BMW X6 (2008) - full utilization of FlexRay

• transmission

• variable steering transmission ratio (active steering)

• HRT

6.9. Bus topology

Passive bus (passive star can be considered a bus)

Active star

Single channel cascaded star

Hybrid topology

6.10. Node architecture

6.11. Communication controller - bus driver interface

TxEN: Transmit Enable Not

Can send data to the bus only under certain conditions

(Time Triggered)

6.12. Physical layer specification

Link: wire - UTP or STP (unshielded/shielded twisted pair)

optical

If wired:

= 1.8 .. 3.2 V

= 0.6 .. 2.0 V

6.13. Media Access Control

Periodic communication cycles

• static time division protocol (Time Division Multiple Access)

• dynamic "mini-slotting"

action point: sending a message can start only at given macro tick

arbitration: sender can start the message only at predefined time

Static segment and idle time are always present, dynamic segment and symbol window are optional

6.14. Static segment

• every slot has the same length within this segment (configurable)

• every frame has the same length within this segment (configurable)

6.15. Timing of static slot

6.16. Dynamic segment

• arbitration is based on minislots

The No. of minislots within a cluster is fixed

Action point: sending of message starts here

within a minislot

6.17. Symbol window

One symbol can be sent within a symbol window

The protocol doesn't contain arbitration

If arbitration is required, it has to be handled by a higher-level protocol

Symbols:

• Collision Avoidance Symbol (CAS)

• Media Access Test Symbol (MTS)

• Wakeup Symbol (WUS)

6.18. Frame format

• Payload preamble indicator: is an optional vector contained in payload?

• network management vector (static segment),

• message ID (dynamic segment), CAN-like content filtering

• Payload length: size of the payload segment ( byte)

• Cycle count: transmitting node's view of the value of the cycle counter

6.19. Frame coding

Between Communication Controller and Bus Driver

TxD, RxD, TxEN: binary code, NRZ coding

TSS: Transmission Start Sequence: Low for a given time (configurable)

FSS: Frame Start Sequence: for compensation of possible quantization errors at BSS

BSS: Byte Start Sequence: synchronization edge for the receiver (every data byte is sent as an extended byte)

FES: Frame End Sequence: end of the last byte

6.20. Frame coding in dynamic segment

DTS: Dynamic Trailing Sequence

To indicate the exact point in time of

transmitter's minislot action point

Prevents premature Channel idle detection

DTS is sent after FES (Frame End Sequence).

• extended byte from CRC, appending to the bit stream

• FES to the end

• DTS to the end, if dynamic segment

6.22. Symbol coding

The coding of two symbols is the same:

CAS (Collision Avoidance Symbol) and

MTS (Media Access Test Symbol)

Wakeup Symbol (WUS)

This is not High, it is Idle!

6.23. Sampling and majority voting

• every node samples RxD many times within a bit duration

• majority voting within a sliding window (voting window of 5 samples)

• suppresses glitches (spikes) coming from transients or disturbances (assuming that their duration is short compared to the bit duration)

• this is the timing of start of bit, not the clock for oversampling RxD!

• Receiver counts the received samples (oversampled) within bit duration

• If VotedVal changes from High to Low, the sample counter is set to 2 at the next sample

• synchronization is accomplished only if enabled
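The glitch suppression by majority voting can be sketched as follows. This is an illustrative model only: the window is assumed to start filled with the High level, a strict majority of the last 5 samples decides the voted value, and the function name `majority_vote` is hypothetical.

```python
def majority_vote(samples, window=5, initial_level=1):
    """Return the voted value after each new sample of RxD.

    samples: iterable of 0/1 samples taken several times per bit.
    The voted value is 1 if more than half of the last `window`
    samples are 1, otherwise 0.
    """
    history = [initial_level] * window
    voted = []
    for sample in samples:
        history = history[1:] + [sample]       # slide the window by one sample
        voted.append(1 if sum(history) > window // 2 else 0)
    return voted


# A single-sample glitch is suppressed completely.
print(majority_vote([1, 1, 1, 0, 1, 1, 1]))
# A real High -> Low edge passes through, delayed by two samples.
print(majority_vote([1, 1, 0, 0, 0, 0, 0]))
```

The fixed delay introduced by the voting window is the same in every node, so it does not disturb the detection of the High to Low edge used for bit synchronization.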

6.25. Clock synchronization

Every node has a local clock

Assumptions:

• every node within the distributed system has approximately the same view of time

• difference between any two nodes cannot exceed a tolerance

Distributed clock synchronization mechanism

6.26. Timing hierarchy

• microtick: derived from CC's internal timer (given prescaler, compare-match etc.) implementation might vary, controller specific

• macrotick: synchronized on a cluster-wide basis

Error model:

• offset (phase)

• rate (frequency)

Both are corrected

Measurement: every node measures the time difference between the actual and expected sync edge in microticks.

The last N values are collected in a table.

6.28. Offset correction

• Measurement

• Selection of stored deviation values (only the current communication cycle is used)

• If no sync frame error

• Calculation of fault-tolerant midpoint

• Within limit? If not, clamp

• An optional external correction (supplied by host)

6.29. Fault-tolerant midpoint algorithm

• determination of $k$ (the No. of values to discard at each end)

• omitting the $k$ smallest and $k$ largest differences

• from the remaining: arithmetic mean of the smallest and largest, truncated towards zero (even if it is negative)
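The fault-tolerant midpoint can be sketched as below. The thresholds for the number of discarded values (0 for up to 2 values, 1 for 3 to 7, 2 for more than 7) are stated here from memory of the FlexRay protocol specification and should be treated as an assumption to be checked against [1]; the function names are illustrative.

```python
def ftm_discard_count(number_of_values):
    """Assumed rule: how many values to drop at each end of the sorted list."""
    if number_of_values <= 2:
        return 0
    if number_of_values <= 7:
        return 1
    return 2


def fault_tolerant_midpoint(deviations):
    """Midpoint of the remaining extremes, truncated towards zero (microticks)."""
    if not deviations:
        raise ValueError("no deviation values measured")
    k = ftm_discard_count(len(deviations))
    ordered = sorted(deviations)
    kept = ordered[k:len(ordered) - k]
    total = kept[0] + kept[-1]
    half = abs(total) // 2                 # integer halving of the magnitude
    return half if total >= 0 else -half   # sign restored: truncation towards zero


print(fault_tolerant_midpoint([-5, 2, 4, 6, 13]))   # outliers -5 and 13 dropped
print(fault_tolerant_midpoint([-3, 0]))             # -1.5 truncated towards zero
```

Halving the magnitude and restoring the sign is used instead of Python's `//` on the signed sum, because `//` rounds towards negative infinity rather than towards zero.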

6.30. Rate correction

Principle:

• Measurement: the time difference between two succeeding cycles is measured

• Last N time period measurements in a table

• If no sync frame error

• Calculation of fault-tolerant midpoint

• Damping value is applied

• smaller than the damping limit? no correction

• larger than the damping limit? correction = correction - limit

• Within upper/lower limits? If not, clamp

• An optional external correction (supplied by host)
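The damping and limiting steps of the rate correction can be sketched as follows. The symmetric handling of negative corrections and the parameter names `damping` and `limit` are assumptions made for illustration; the optional external correction supplied by the host is left out.

```python
def damp_and_limit(correction, damping, limit):
    """Apply the damping value, then clamp the result to [-limit, +limit].

    correction: fault-tolerant midpoint of the rate measurements (microticks)
    damping:    corrections of at most this magnitude are suppressed
    limit:      largest correction magnitude that may be applied
    """
    if abs(correction) <= damping:
        damped = 0                          # too small: no correction
    elif correction > 0:
        damped = correction - damping       # reduce magnitude by the damping value
    else:
        damped = correction + damping       # assumed symmetric for negative values
    return max(-limit, min(limit, damped))


for value in (3, 12, -12, 500):
    print(value, "->", damp_and_limit(value, damping=5, limit=100))
```

Damping keeps the cluster from chasing small measurement noise, while the limit bounds the influence any single correction step can have on the local clock.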

6.31. External clock synchronization

Even if every node is in sync. within the cluster, different clusters can drift apart from each other

(clock sync. assures only consistency within the cluster)

host-determined external offset and rate correction for both clusters at the same time

Clock correction: adjustment of microticks within a macrotick

6.32. Wakeup

• Every bus driver needs to have stable power supply

The whole cluster needs to be woken up (wakeup)

• startup on both channels (synchronously)

• startup initiation: coldstart can be initiated only by a coldstart node

Coldstart:

• Collision Avoidance Symbol (CAS)

• in the first 4 cycles only this node can send messages

• first the other coldstart nodes integrate

• later the other (non coldstart) nodes integrate

6.34. Startup cont.

No. of coldstart nodes is

3, if No. of nodes ≥ 3

2, if No. of nodes < 3

Startup frame is also a sync frame ⇒ coldstart node is also a sync node

At least two fault-free coldstart nodes are necessary for cluster startup

leading coldstart node: who actively starts the coldstart proc.

following coldstart node: coldstart node who integrates

Startup performed by coldstart nodes:

• coldstart startup initiation

• ends if there is stable communication between coldstart nodes Integration:

• at least 2 startup frames from different nodes

• integration can start during the coldstart nodes' startup, but might not finish until the startup of at least 2 coldstart nodes has finished

6.35. Node Local Bus Guardian

A node can transmit only if both the CC and the BG enable it

If failure: fail-silent operation

The failure is not propagated

(sender is disconnected from the bus)

BG: doesn't send,

only supervises timing

6.36. Central Bus Guardian

Verified failures:

• short-circuit and other bus errors

• sending at inappropriate time

CBG: needs to store static timing information

6.37. Central Bus Guardian cont.

• capable of decoding frames

• NOT capable of coding frames

• has own clock synchronization mechanism (same as in CC)

• distinguishes between various frame types (startup, sync etc.)

• can be cascaded

7. Safety-critical systems

7.1. Safety-critical systems: Basic definitions

7.2. Introduction

• Safety-critical systems

• Informal definition: Malfunction may cause injury to people

• Safety-critical computer-based systems

• E/E/PE: Electrical, electronic, programmable electronic systems

• Control, protection, or monitoring

• EUC: Equipment under control

Railway signaling, x-by-wire, interlocking, emergency stopping, engine control, ...

7.3. Specialities of safety critical systems

• Special solutions to achieve safe operation

• Design: Requirements, architecture, tools, ...

• Verification, validation, and independent assessment

• Certification (by safety authorities)

• Basis of certification: Standards

• IEC 61508: Generic standard (for electrical, electronic or programmable electronic systems)

• DO-178B: Software in airborne systems and equipment

• EN 50129: Railway (control systems)

• EN 50128: Railway (software)

• ISO 26262: Automotive

• Other sector-specific standards: Medical, process control, etc.