Real-time and Safety-critical Embedded Systems

Dabóczi, Tamás

Majzik, István

1.8. Fault tolerant behavior

1.9. Byzantine failures

1.10. Scheduling

2. 2 Exercises using mitmót rapid prototyping system

2.1. Modifying the wireless comm. API Packet handling with interrupt

2.2. Modifying the wireless comm. API Packet handling with interrupt

2.3. Modifying the wireless comm. API Dependable message transmission

2.4. Modifying the wireless comm. API Dependable message transmission

2.5. Modifying the wireless comm. API Sending with CSMA/CA

2.6. Modifying the wireless comm. API Sending with CSMA/CA

2.7. Distributed clock synchronization Fault-tolerant sync.

2.10. Distributed clock synchronization Interval-intersection based sync.

2.12. Visualization of interval-intersection

2.14. Distributed clock synchronization Minimization of max. error

2.17. Knight Rider on sensor network

2.18. Knight Rider on sensor network

2.19. Visualized clock synchronization master-slave approach

2.20. Visualized clock synchronization master-slave approach

2.21. Visualized clock synchronization distributed approach

2.22. Visualized clock synchronization distributed approach

2.23. Visualized clock synchronization distributed approach 2

2.24. Real-time systems

2.25. SW issues of embedded systems

2.26. Properties for qualification

2.27. Classification of SW architectures

2.28. Periodic SW architecture

2.29. Simple periodic software architecture

2.30. Properties:

2.31. Weighted periodic software architecture

2.33. Cyclic executive software architecture

2.34. Properties

2.35. Time triggered software architecture

2.37. Periodic software arch. extended with IT

2.39. Function queue

2.42. SW architecture with real-time operating system (RTOS)

2.50. Mailbox

2.51. Queue

3. 3 Real-Time Operating Systems

3.2. The story of uC-OS (Micro-Controller Operating System)

3.3. Summary (properties)

3.4. Scaling File : OS_CFG.H

3.5. Operation system calls

3.6. Synchronization mechanism

3.7. Communication between tasks

3.8. Semaphores

3.9. Example for usage of Semaphore Writing to and reading from buffer on a way that the data is not lost

3.10. Usage of Mailbox

3.11. Example for usage of Mailbox

3.12. Example for usage of Queue

3.13. uC-OS Event Control Block (ECB) semaphore, mailbox, queue

4. 4 Exercises

4.1. Exercises

4.2. Deadline Monotonic Algorithm

4.3. Solution:

4.4. Solution:

4.5. Solution:

4.6. Solution:

4.8. Solution:

4.9. Solution:

4.10. Solution:

4.11. Solution:

4.13. Solution:

4.14. Solution:

4.15. Solution:

4.16. Solution:

4.17. Scheduling with Earliest Deadline First

5. 5 Real-time systems

5.2. Memory management

5.4. Handling time, Clock systems

5.5. Time measurement, time providers

5.6. Communication models

5.7. Uncertainty of delay

5.8. Minimization of maximal error

5.9. Intersection interval algorithm

5.10. Visualization of Intersection interval algorithm

5.11. Clock sync. in the case of Byzantine type errors

5.12. Jitter of synchronization message

5.13. Time standards

5.14. Duration, precedence

6. 6 Real-time systems 2.

6.16. Dynamic segment

6.17. Symbol window

6.18. Frame format

6.19. Frame coding

6.20. Frame coding in dynamic segment

6.21. Construction of Frame bit stream

6.22. Symbol coding

6.23. Sampling and majority voting

6.24. Bit clock alignment

6.25. Clock synchronization

6.26. Timing hierarchy

6.27. Clock synchronization

6.28. Offset correction

6.29. Fault-tolerant midpoint algorithm

6.30. Rate correction

6.31. External clock synchronization

6.32. Wakeup

6.33. Startup

6.34. Startup cont.

6.35. Node Local Bus Guardian

6.36. Central Bus Guardian

6.37. Central Bus Guardian cont.

7. 7 Safety-critical systems

7.1. Safety-critical systems: Basic definitions

7.2. Introduction

7.3. Specialities of safety critical systems

7.4. Definition of safety

7.10. Accident examples

7.11. Accident examples

7.12. Experiences

7.13. Hazard control

7.14. Safety-related system

7.15. Safety integrity

7.16. Example: Safety function

7.17. Safety and dependability

7.18. Safety requirements

7.19. Risk based approach

7.20. Risk analysis

7.21. Mode of operation

7.22. Safety integrity requirements

7.23. Determining SIL: Overview

7.24. Structure of requirements

7.32. Fault effects

7.33. Fault effects

7.34. Dependability and security

7.35. Dependability metrics: Mean values

7.36. Dependability metrics: Probability functions

7.37. Availability related requirements

7.38. Attributes of components

7.39. Case study: development of a DMI

7.40. Case study: DMI requirements

7.41. Threats to dependability

7.42. The characteristics of faults

7.43. Means to improve dependability

7.44. Overview of the development of safety-critical systems

7.45. Overall safety lifecycle model: Goals

7.46. Hardware and software development

7.47. Software safety lifecycle

7.48. Example software lifecycle (V-model)

7.49. Maintenance activities

7.50. Techniques and measures: Basic approach

7.51. Example: Guide to selection of techniques

7.52. Structure of the tables (IEC 61508)

7.53. Hierarchy of design methods

7.56. Hierarchy of V and V methods

7.60. Application of tools in the lifecycle

7.61. Safety concerns of tools

7.62. Safety of programming languages

7.63. Safety of programming languages

7.64. Language comparison

7.65. Coding standards for C and C++

7.66. Safety-critical OS: Required properties

7.67. Example: Safety and RTOS

7.68. Principles for documentation

7.69. Document cross reference table (EN50128)

7.70. Human factors

7.71. Organization

7.72. Independence of personnel

7.73. Overall safety lifecycle (overview)

7.74. Specific activities (overview)

7.81. Summary

8. 8 Design of the architecture of safety-critical systems

8.16. Example: TI Hercules Safety Microcontrollers

8.17. Example: SCADA system

8.18. Example: SCADA system architecture

8.19. Example: SCADA deployment options

8.20. Example: SCADA error detection techniques

8.21. Example: SCADA three phases of control

8.22. 3. Two-channels architecture with safety bag

8.23. Example: Alcatel (Thales) Elektra

8.24. Typical architectures for fault-tolerant systems

8.25. Objectives for fault tolerant behaviour

8.26. Fault tolerant systems

8.27. Forms of redundancy

8.28. Example: Error detecting and correcting codes

8.29. How to use the redundancy?

8.30. 1. Fault tolerance for hardware permanent faults

8.31. Implementation of the replication

8.32. Example: RAID disk configurations

8.33. 2. Fault tolerance for transient hardware faults

8.34. The four phases of operation 1/4

8.37. Types of recovery

8.42. Backward recovery

8.43. Scenarios of backward recovery

8.44. Checkpoint intervals

8.45. Example: User configured checkpointing

8.49. Example: Saving the state of the CPU

8.50. Example: Saving the state of the CPU

8.51. Rollback recovery in distributed systems

8.52. Coordinated checkpointing in distributed systems

8.56. 4. Fault tolerance for software faults

8.57. N-version programming

8.58. Recovery blocks

8.61. Comparison of the techniques

8.62. Example: Airbus A-320, self-checking blocks

9.1. Testing: Test design and testing process

9.2. Overview

9.3. Testing and test design in the V-model

9.4. Goals of testing

9.5. Test environment: System testing

9.6. Test environment: Module testing

9.7. Tests and faults

9.8. Practical aspects of testing

9.9. Testing in the standards (here: EN 50128)

9.10. Testing in the standards (here: EN 50128)

9.11. Test approaches

9.12. I. Specification based (functional) testing

9.13. 1. Equivalence partitioning

9.14. Equivalence classes (partitions)

9.15. Valid/invalid equivalence classes

9.16. 2. Boundary value analysis

9.17. 3. Cause-effect analysis

9.18. Cause-effects analysis

9.19. 4. Combinatorial techniques

9.20. Example: pair-wise testing

9.21. Additional techniques

9.22. II. Structure based testing

9.23. The internal structure

9.24. The internal structure

9.25. Test coverage metrics

9.26. 1. Statement coverage

9.27. 2. Decision coverage

9.28. 3. Multiple condition coverage

9.29. Other coverage criteria

9.30. 4. Path coverage

9.31. A structure based testing technique

9.32. A structure based testing technique

9.33. Generating structure based test sequences

9.34. Data flow based test criteria

9.35. Data flow based test criteria

9.36. Execution of test cases

9.37. Relation to the development process

9.38. 1. Module testing

9.39. Module testing

9.40. Isolated testing of modules

9.41. Regression testing

9.42. 2. Integration testing

9.43. "Big bang" testing

9.44. Top-down integration testing

9.45. Bottom-up integration testing

9.46. Integration with the runtime environment

9.47. 3. System testing

9.48. Types of system tests

9.49. 4. Validation testing

9.50. Summary

10. 10 Hazard analysis

10.16. Example: Output of the analysis in PolySpace

10.17. 2. Fault tree analysis

10.18. Set of elements in a fault tree

10.19. Fault tree example: Elevator

10.22. Fault tree example: Software analysis

10.23. Qualitative analysis of the fault tree

10.24. Original fault tree of the elevator example

10.25. Reduced fault tree of the elevator example

10.26. Quantitative analysis of the fault tree

10.27. Fault tree of the elevator with probabilities

10.28. 3. Event tree analysis

10.29. Event tree example: Reactor cooling

10.33. Event tree example: Recovery blocks (RB)

10.34. 4. Cause-consequence analysis

10.35. Cause-consequence analysis example

10.38. 5. Failure modes and effects analysis (FMEA)

10.39. Example: Analysis of a computer system

10.40. Analysis of operator faults

10.41. Catalogue of hazards

10.42. Example: Risk matrix (railway control systems)

10.43. Examples of risk reduction requirements

10.44. Risk reduction techniques

10.45. Basic idea for risk reduction

10.46. Risk reduction principles (overview)

10.47. 1. Hazard elimination

10.51. 2. Hazard reduction

10.54. 3. Hazard control

10.55. 4. Damage minimization

10.56. Summary

11. 11 Safety cases

11.1. Safety cases

11.2. The safety case

11.3. Standard structure of a safety case

11.4. Quality related parts of the safety case

11.5. Technical parts of the safety case

11.13. Safety case patterns

11.14. Example of a GSN pattern

11.15. The Fault Tree pattern

11.16. The ALARP pattern

11.17. Modular safety cases

11.18. Advantages and disadvantages of GSN

11.19. Examples: Supporting tools

11.20. Summary

12. 12 Dependability and safety analysis

12.1. Dependability and safety analysis

12.2. Overview

12.3. Dependability metrics (see Basic Definitions)

12.4. Dependability metrics (see Basic Definitions)

12.5. Attributes of components (see Basic Definitions)

12.6. Goals of the analysis

12.7. Boole models for calculating dependability

12.8. Reliability block diagram

12.9. Reliability block diagram examples

12.10. Typical system configurations (overview)

12.11. Serial system

12.12. Serial system

12.13. Parallel system

12.14. Parallel system

12.15. Complex canonical system

12.16. N out-of M faulty components

12.17. N out-of M faulty components: TMR (NMR)

12.18. TMR/simplex system

12.19. Cold redundant system

12.20. Summary

12.21. The problem

12.22. Architecture of the system

12.23. Board level: List of components

12.24. Reliability data of electronic components

12.25. Example tool: MTBF Calculator

12.26. Life expectancy

12.27. Board level computations: Serial system

12.28. Supervisory subsystem: RBD model

12.29. Example tools

13. 13 Safety and Dependability Analysis: A Case Study

13.1. Safety and Dependability Analysis: A Case Study

13.2. The SAFEDMI Project

13.3. System Overview

13.4. Requirements

13.5. Overview of the Evaluation Tasks

13.6. Evaluation Techniques

13.7. Evaluation Techniques

13.8. Interplay of Evaluation Techniques

13.9. Evaluation of the DMI Architecture

13.10. Evaluation of the DMI Architecture

13.11. Evaluation of Wireless Communication Protocols

13.12. Evaluation of Detection Codes and Residual Errors

13.28. Periodic tests

13.29. Availability of the DMI with varying MTTR

14. 14 Formal modelling and verification

14.1. Formal modelling and verification

14.2. Example software lifecycle (V-model)

14.3. Techniques and measures in standards

14.4. Goals of formal modeling and verification

14.5. Modeling with timed automata

14.7. Automata and variables

14.8. Extensions using clock variables

14.9. Timed automata (in the UPPAAL tool)

14.10. Role of state invariants and guards

14.11. Extensions for modeling distributed systems

14.12. Example: Using clock variables and synchronization

14.13. Further extensions (rarely used)

14.14. The UPPAAL tool set

14.15. Automaton model

14.16. Simulator

14.17. Formalizing requirements with temporal logics

14.19. What are the formalized properties?

14.20. State based properties

14.21. Safety properties

14.22. Liveness properties

14.23. Language to formalize reachability properties

14.24. Temporal logics

14.25. The computational tree

14.26. Quantifying paths and characterizing states

14.27. The Computational Tree Logic (CTL)

14.28. Summary of temporal operators in UPPAAL

14.29. Composite operators for all paths

14.30. Composite operators for an existing path

14.31. Conditional reachability

14.32. Examples: formalizing properties using temporal logic

14.33. Model checking

14.34. The UPPAAL model checker

14.35. The UPPAAL model checker

14.36. Counter-example in the simulator

14.37. A case study

14.38. A solution for the mutual exclusion problem

14.39. Properties to be verified

14.40. How can these properties be verified?

14.41. The model in UPPAAL

14.42. Formalizing properties in UPPAAL

14.43. Verifying the properties in UPPAAL

14.44. Correction of the algorithm

14.52. Automated application code synthesis

14.53. Mapping the model semantics to source code

14.54. Model representation

14.55. Implementation of the code synthesis

14.56. Source code generation in the Eclipse environment

14.57. Run-time monitoring and verification

14.58. Control flow checking

14.59. Instrumentation for control flow monitoring

14.60. Hierarchical monitoring of temporal properties

14.61. Time overhead of monitoring

14.62. Code size overhead of monitoring

14.63. Summary of model based design and verification

• Characteristics of real-time systems

• SW issues

• Real-time operating systems

• Scheduling, real-time performance analysis

• Memory management

• Clock synchronization

• Communication in real-time systems

1.2. Real-time systems

• SW issues

1.3. Real-time systems

Classified by how they react to external events:

• Hard real-time (HRT) system: missing the deadline has critical consequences

HRT does not necessarily mean fast, only guaranteed (e.g. year 2000 problem)

• Soft real-time (SRT) system: missing the deadline is not critical, only the service level decreases. Typically transaction-like operations

(e.g. an ATM responds more slowly, so the customer needs to be more patient)

1.4. Examples for Real-time systems

HRT:

• Romeo and Juliet: Friar John is sent to deliver Friar Laurence's letter to Romeo,

• balancing robot falls if the command has too large delay,

• anti-lock braking system (ABS), ESP, airbag in a car.

SRT:

• ATM machine, money withdrawal

1.5. Romeo and Juliet

Story (in 14 seconds):

Romeo (R) and Juliet (J) are two young star-crossed lovers from the House of Montague and the House of Capulet, which are enemies.

• safety-critical behavior

1.6. Mistakes of Friar Laurence

• did not realize that communication is a real-time task (messenger to Romeo)

• did not ensure that the message arrives within a bounded time (in HRT manner!)

• did not use safety-critical communication (supervising that the message has arrived)

• did not supervise the actuator after sending the command (did not supervise Romeo's reaction to the message)

1.7. Properties of HRT vs. SRT

1.8. Fault tolerant behavior

Acknowledgment of messages

Historical example: Andrew (A) and Bill (B) can win the battle against Eric (E), who stands between them, only if they join their troops.

Negotiation: they send a messenger, but through a channel with possible failures

A: "Let's attack at 4!"

• Has the message reached the destination? → acknowledgement of the message

• The other party does not know whether the acknowledgement reached the source → acknowledgement of the acknowledgement

It cannot be guaranteed that the loss of a message is detected within a bounded time.

1.9. Byzantine failures

Historical example: Byzantine Generals' Problem (traitors among them)

One of the generals tells "A" to one messenger, and "B" to another.

In practice: a node provides different information about the same event at different time instants, or when it is asked several times

(see later clock synchronization).

More general:

if a component's failure is not stopping or crashing,

but processes requests incorrectly,

corrupts its state,

produces incorrect output.

1.10. Scheduling

Safety critical systems: critical area should be 0 (deadline is always kept)

Typical problem: when a catastrophe is imminent, too many trigger events are generated, so processing is overloaded and the reaction is delayed

Short description:

The current Application Programming Interface (API) of the ISM band wireless communication module handles packet reception by polling.

After calling the receiving_packet function, the program waits in an endless loop until a packet is received.

Task: modify the API to handle reception with interrupts (IT)

2.2. Modifying the wireless comm. API Packet handling with interrupt

Detailed description:

• study the current communication API, with special emphasis on receiving packets

• study the operation of the communication chip based on the datasheet

• develop a method to receive packets with interrupts (HW modification is also required, ask your supervisor)

• develop a new receiving_packet_with_IT function in eCos OS

• messages need to be stored automatically

• messages are read in the order of reception

• reading from an empty message queue needs to block the process

• document the program

• develop an environment to demonstrate the usage
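The storage requirements listed above (automatic storing, reading in order of reception, blocking on an empty queue) can be sketched with a small FIFO ring buffer. This is only an illustration under stated assumptions: the names rx_fifo_t, rx_fifo_push_from_isr and rx_fifo_try_pop, the queue depth and the packet size are hypothetical and are not part of the existing communication API or of eCos.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RX_SLOTS   8u    /* assumed queue depth */
#define RX_MAX_LEN 32u   /* assumed maximum payload size in bytes */

typedef struct {
    uint8_t data[RX_MAX_LEN];
    size_t  len;
} rx_packet_t;

typedef struct {
    rx_packet_t slot[RX_SLOTS];
    volatile size_t head;   /* next slot written by the interrupt handler */
    volatile size_t tail;   /* next slot read by the application */
    volatile size_t count;  /* number of stored packets */
} rx_fifo_t;

void rx_fifo_init(rx_fifo_t *q)
{
    q->head = 0;
    q->tail = 0;
    q->count = 0;
}

/* Called from the (hypothetical) radio interrupt handler: stores the packet
 * automatically. Returns false if the queue is full and the packet is dropped. */
bool rx_fifo_push_from_isr(rx_fifo_t *q, const uint8_t *buf, size_t len)
{
    assert(len <= RX_MAX_LEN);
    if (q->count == RX_SLOTS) {
        return false;
    }
    memcpy(q->slot[q->head].data, buf, len);
    q->slot[q->head].len = len;
    q->head = (q->head + 1u) % RX_SLOTS;
    q->count++;
    return true;
}

/* Non-blocking core of the read: returns false when the queue is empty.
 * A blocking receive would wait on an OS primitive (for example a semaphore
 * posted by the interrupt handler) instead of returning false. */
bool rx_fifo_try_pop(rx_fifo_t *q, rx_packet_t *out)
{
    if (q->count == 0) {
        return false;
    }
    *out = q->slot[q->tail];
    q->tail = (q->tail + 1u) % RX_SLOTS;
    q->count--;   /* on real hardware this update must be protected against the ISR */
    return true;
}
```

In a real port the push function would be called from the interrupt service routine of the radio chip, and the blocking behaviour would come from the operating system rather than from polling the return value.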

2.3. Modifying the wireless comm. API Dependable message transmission

Short description:

The current Application Programming Interface (API) of the ISM band wireless communication module sends the message immediately after the send_packet function is called.

There is no information about whether the reception succeeded or failed because of weak reception or a collision.

Task: modify the API to handle acknowledgement based on the Positive Acknowledgement with Retransmission (PAR) protocol

• develop a demo to demonstrate collisions

• develop new receiving_packet_PAR function with the following parameters:

• max. No. of retransmissions

• min. delay between two transmissions

• message ID: parameter of the sending function. ID generation is the duty of a higher-level SW

• provide a broadcast possibility; in this case there is no ACK

• sending blocks until the message is successfully sent or the max. No. of retransmissions is exceeded

• in case of unsuccessful sending, return an error code to the calling SW
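A minimal sketch of the PAR sending loop described by the parameters above. Everything named here is an assumption made for illustration: the par_radio_t driver hooks, the PAR_BROADCAST address and the error codes do not come from the existing API, and the sim_* functions only simulate a channel that loses the first few frames.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAR_BROADCAST 0xFFu   /* assumed broadcast address: no ACK is expected */

typedef enum { PAR_OK = 0, PAR_ERR_NO_ACK = -1 } par_status_t;

/* Hypothetical driver hooks; the real module API will differ. */
typedef struct {
    void (*send_frame)(uint8_t dest, uint8_t msg_id, const uint8_t *data, size_t len);
    bool (*wait_ack)(uint8_t msg_id, uint32_t timeout_ms);  /* true if the ACK arrived */
    void (*delay_ms)(uint32_t ms);
} par_radio_t;

/* Blocks until the message is acknowledged or max_retx retransmissions failed. */
par_status_t par_send(const par_radio_t *radio, uint8_t dest, uint8_t msg_id,
                      const uint8_t *data, size_t len,
                      unsigned max_retx, uint32_t min_delay_ms)
{
    assert(radio && radio->send_frame && radio->wait_ack && radio->delay_ms);
    for (unsigned attempt = 0; attempt <= max_retx; attempt++) {
        if (attempt > 0) {
            radio->delay_ms(min_delay_ms);      /* min. delay between two transmissions */
        }
        radio->send_frame(dest, msg_id, data, len);
        if (dest == PAR_BROADCAST) {
            return PAR_OK;                      /* broadcast: nobody acknowledges */
        }
        if (radio->wait_ack(msg_id, min_delay_ms)) {
            return PAR_OK;
        }
    }
    return PAR_ERR_NO_ACK;                      /* error code for the calling SW */
}

/* Simulated channel for demonstration: the first sim_lost frames get no ACK. */
unsigned sim_lost, sim_sent;
void sim_send(uint8_t d, uint8_t id, const uint8_t *p, size_t n)
{
    (void)d; (void)id; (void)p; (void)n;
    sim_sent++;
}
bool sim_wait_ack(uint8_t id, uint32_t t) { (void)id; (void)t; return sim_sent > sim_lost; }
void sim_delay(uint32_t ms) { (void)ms; }
const par_radio_t sim_radio = { sim_send, sim_wait_ack, sim_delay };
```

With sim_lost set to 2 and max_retx set to 3, the call succeeds on the third transmission; with a channel that never acknowledges, four frames are sent and PAR_ERR_NO_ACK is returned.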

2.5. Modifying the wireless comm. API Sending with CSMA/CA

Short description:

The current Application Programming Interface (API) of the ISM band wireless communication module sends the message immediately after the send_packet function is called.

There is no information about traffic on the communication channel, so the probability of collision is high.

Task: modify the API so that it tries to avoid collisions

2.6. Modifying the wireless comm. API Sending with CSMA/CA

• study the operation of the communication chip based on the datasheet, with emphasis on Data Quality Detect (DQD) status information

• develop a demo to demonstrate collisions

• develop new receiving_packet_CA function:

• post operation: application indicates intention of sending message to MAC layer

• after post MAC layer waits for a random time ( )

A characteristic of distributed clock synchronization algorithms is that every node runs the same algorithm. If one of the nodes fails in such a way that it provides random incorrect information, the faulty node can significantly distort the synchronization.

Task: develop and implement fault-tolerant clock synchronization algorithm

2.8. Distributed clock synchronization Fault-tolerant sync.

Assumption: there are at most k Byzantine faulty nodes

Basic idea:

• Every node asks the local time of all other nodes

• calculates the differences with its own local time

• orders the differences in ascending order

• discards the k smallest and k largest

• computes the simple mean of the remaining differences → new local time
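The steps above can be sketched as a single correction function. The sketch assumes that the bound on faulty nodes is called k, that the differences are measured in timer ticks, and that the k smallest and k largest values are discarded; the function name fta_correction is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Three-way comparison for qsort without overflow. */
static int cmp_i32(const void *a, const void *b)
{
    int32_t x = *(const int32_t *)a;
    int32_t y = *(const int32_t *)b;
    return (x > y) - (x < y);
}

/* diffs[i] = remote_time_i - local_time, in clock ticks (sorted in place).
 * Discards the k smallest and k largest differences and returns the mean of
 * the rest, i.e. the correction to add to the local clock. Requires n > 2*k. */
int32_t fta_correction(int32_t *diffs, size_t n, size_t k)
{
    assert(n > 2 * k);
    qsort(diffs, n, sizeof diffs[0], cmp_i32);
    int64_t sum = 0;
    for (size_t i = k; i < n - k; i++) {
        sum += diffs[i];
    }
    return (int32_t)(sum / (int64_t)(n - 2 * k));
}
```

For the differences {5, -1000, 3, 4, 900} and k = 1, the two outliers are dropped and the correction is the mean of 3, 4 and 5, that is 4 ticks.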

2.9. Distributed clock synchronization Fault-tolerant sync.

• Study the current communication API, with special emphasis on sending and receiving packets

• Implement the clock sync. algorithm

• Unique ID of nodes should be configured with DIP switches

• Be prepared for lost frames

• Max. No. of nodes is a constant known a priori

• measure the accuracy of synchronization

• modify the API to reduce delay and jitter (insert time into frame on the lowest possible level)

• measure the accuracy of sync. with the modified API

Distributed systems cooperate on a common task, thus their local clocks need to be in synchrony. A characteristic of distributed clock synchronization algorithms is that every node runs the same algorithm.

Interval-intersection based clock sync. tries to adjust the common time to the middle of an uncertainty interval that is contained in the uncertainty intervals of all nodes.

Task: develop and implement interval-based clock synchronization algorithm

2.11. Distributed clock synchronization Interval-intersection based sync.

Assumption: every node keeps track of its local time and the accuracy of that time. The accuracy bound (uncertainty) grows with time (drift).

Basic idea:

• Every node asks the local time and accuracy of all other nodes

• consistency check

• update the uncertainties with travel time and drift

• calculate the interval intersection

• local time will be the middle of the common interval

• uncertainty will be the width of the common interval
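A minimal sketch of the intersection step, assuming each reading is given as a time value with a symmetric accuracy bound (time ± acc) that has already been updated with travel time and drift. The type and function names are hypothetical. The sketch stores half of the width of the common interval as the new accuracy, so that the result keeps the time ± acc form; if the full width is meant, the last assignment changes accordingly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    double time;  /* reported clock value */
    double acc;   /* accuracy bound: true time lies in [time - acc, time + acc] */
} clock_reading_t;

/* Intersects the uncertainty intervals of all readings. Returns false if the
 * intersection is empty (the consistency check fails); otherwise stores the
 * midpoint and the half-width of the common interval in *out. */
bool interval_intersection(const clock_reading_t *r, size_t n, clock_reading_t *out)
{
    assert(r != NULL && out != NULL && n > 0);
    double lo = r[0].time - r[0].acc;
    double hi = r[0].time + r[0].acc;
    for (size_t i = 1; i < n; i++) {
        double l = r[i].time - r[i].acc;
        double h = r[i].time + r[i].acc;
        if (l > lo) { lo = l; }   /* largest lower bound */
        if (h < hi) { hi = h; }   /* smallest upper bound */
    }
    if (lo > hi) {
        return false;             /* inconsistent readings */
    }
    out->time = (lo + hi) / 2.0;
    out->acc  = (hi - lo) / 2.0;
    return true;
}
```

For the readings 100 ± 4, 102 ± 3 and 98 ± 5 the common interval is [99, 103], so the new local time is 101 with a half-width of 2.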

2.12. Visualization of interval-intersection

2.13. Distributed clock synchronization Interval-intersection based sync.

2.14. Distributed clock synchronization Minimization of max. error

Short description:

Distributed systems cooperate on a common task, thus their local clocks need to be in synchrony. A characteristic of distributed clock synchronization algorithms is that every node runs the same algorithm.

Assumption: every node keeps track of its local time and the accuracy of that time. The accuracy bound (uncertainty) grows with time (drift).

Basic idea:

• Every node asks the local time and accuracy of all other nodes

• consistency check

• update the uncertainties with travel time and drift

• if the remote uncertainty (increased by the drift during message travel) is smaller than the local one: replace the local time with that of the more accurate node and adopt its uncertainty (take drift into account!)
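The decision rule above can be sketched as follows. The way the remote uncertainty is enlarged (by the uncertainty of the travel time estimate plus the drift accumulated during travel) is an assumption made for this illustration, as are all names.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    double time;  /* clock value */
    double acc;   /* accuracy bound: true time lies within time +/- acc */
} node_clock_t;

/* remote:        reading received from another node
 * travel:        estimated message travel time
 * travel_uncert: uncertainty of that estimate
 * drift_rate:    maximum drift per unit of time
 * Returns true if the local clock was replaced by the remote one. */
bool adopt_more_accurate(node_clock_t *local, node_clock_t remote,
                         double travel, double travel_uncert, double drift_rate)
{
    assert(local != 0);
    assert(travel >= 0.0 && travel_uncert >= 0.0 && drift_rate >= 0.0);
    double remote_time = remote.time + travel;
    double remote_acc  = remote.acc + travel_uncert + drift_rate * travel;
    if (remote_acc >= local->acc) {
        return false;             /* the local clock is at least as accurate */
    }
    local->time = remote_time;    /* take over the more accurate time */
    local->acc  = remote_acc;     /* and its (enlarged) uncertainty */
    return true;
}
```

Because every node only ever adopts a reading with a smaller bound than its own, the largest error bound in the network cannot grow as a result of an exchange.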

2.16. Distributed clock synchronization Minimization of max. error

2.17. Knight Rider on sensor network

Short description:

Every mitmót has several LEDs and a wireless communication capability.

This allows us to design visual effects by switching the LEDs in synchrony over the wireless network.

Task: develop and implement a Knight Rider demo

• Modify the previous task by adding the following extra function:

• connect the root node (1st node) to the PC with Serial link

• the order of flashing should be configured from a terminal program on a PC at the beginning

2.19. Visualized clock synchronization master-slave approach

Short description:

It allows us to visualize the clock synchronization procedure by flashing the LEDs according to the status of local time.

Task: develop and implement master controlled clock synchronization and visualize it

2.20. Visualized clock synchronization master-slave approach

• At start every node needs to flash its LEDs with a random frequency between 0.5 Hz and 2 Hz.

• Nodes should adjust their clocks to the master (rate and offset corrections):

• Every node runs its local time with the internal timer in Output Compare mode, and switches the LED status at specific values

• The master node sends a synchronization message every 3 seconds over the wireless network

• The time instant of the sync. message corresponds to LED ON of the master node (a specific phase of the internal timer, which is known by all nodes)

• Slave nodes latch their local timer when they receive the sync. message

• The Compare register of the timer is adjusted based on the difference (either by a fixed step at every sync. message, or proportionally to the difference). The clock locks to the master timer like a PLL (Phase-Locked Loop); offset and rate correction are accomplished at the same time
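One possible form of the proportional correction is sketched below. It assumes a 16-bit timer that counts from 0 up to the compare value, a slave that latches its timer when the sync. message arrives, and a known master phase sync_phase; the function names, the gain divider and the clamping limits are hypothetical tuning choices, not values from the exercise.

```c
#include <assert.h>
#include <stdint.h>

/* Phase error of the slave in timer ticks, wrapped to about half a period
 * in both directions. Positive means that the slave is ahead of the master. */
int32_t phase_error(uint16_t latched, uint16_t sync_phase, uint16_t period)
{
    assert(period > 0 && latched < period && sync_phase < period);
    int32_t e = (int32_t)latched - (int32_t)sync_phase;
    if (e >= (int32_t)(period / 2)) {
        e -= period;
    }
    if (e < -(int32_t)(period / 2)) {
        e += period;
    }
    return e;
}

/* Proportional adjustment of the compare register: a slave that is ahead gets
 * a longer period (it slows down), a slave that is behind gets a shorter one.
 * The result is clamped to [min_cmp, max_cmp]. */
uint16_t adjust_compare(uint16_t compare, int32_t error, int32_t gain_div,
                        uint16_t min_cmp, uint16_t max_cmp)
{
    assert(gain_div > 0 && min_cmp <= max_cmp);
    int32_t c = (int32_t)compare + error / gain_div;
    if (c < (int32_t)min_cmp) { c = min_cmp; }
    if (c > (int32_t)max_cmp) { c = max_cmp; }
    return (uint16_t)c;
}
```

The fixed-step variant mentioned in the list would replace error / gain_div with a constant step whose sign follows the sign of the error.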

2.21. Visualized clock synchronization distributed approach

Short description:

• Nodes should adjust their clocks to each other (only offset corrections):

• Every node runs the same code, there is no special node

• Every node asks all other nodes for their timer value over the wireless network at regular intervals

• The node simply averages the timer states → new local time

• be prepared for lost messages (collision)

2.23. Visualized clock synchronization distributed approach 2

Same as before, but with both offset and rate correction

• Nodes should adjust their clocks to each other (offset and rate corrections):

• Every node runs the same code, there is no special node

• Every node asks all other nodes for their timer value and period length (Output Compare reg.) over the wireless network at regular intervals

• The node simply averages the timer states → new local time

• The node simply averages the time periods → new period length
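The averaging step can be sketched as below. The sketch assumes that the node includes its own values in the mean and that replies lost on the network are simply skipped; both are assumptions, as are the names. Like the description, it uses a plain arithmetic mean and therefore ignores timer wrap-around.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    bool     valid;   /* false if the reply was lost (e.g. collision) */
    uint16_t timer;   /* reported timer state */
    uint16_t period;  /* reported period length (Output Compare register) */
} node_report_t;

/* Replaces *timer and *period with the simple mean of the node's own values
 * and all valid reports (offset and rate correction in one step). */
void average_with_peers(const node_report_t *rep, size_t n,
                        uint16_t *timer, uint16_t *period)
{
    assert(timer != NULL && period != NULL);
    uint32_t sum_t = *timer;
    uint32_t sum_p = *period;
    uint32_t cnt = 1;                 /* the node's own values count as well */
    for (size_t i = 0; i < n; i++) {
        if (!rep[i].valid) {
            continue;                 /* lost message: skip this node */
        }
        sum_t += rep[i].timer;
        sum_p += rep[i].period;
        cnt++;
    }
    *timer  = (uint16_t)(sum_t / cnt);
    *period = (uint16_t)(sum_p / cnt);
}
```

Dropping the period from the report and from the mean gives the offset-only variant of the previous exercise.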

2.24. Real-time systems

• SW issues

• reaction time for external asynchronous event,

• protection (memory),

• support for recursive functions, reentrant functions,

• processor utilization.

2.26. Properties for qualification

• maximal response time,

• handling hardware,

• communication between tasks,

• ease of development,

• area of usage

2.27. Classification of SW architectures

Practical implementation:

• Periodic

• Periodic extended with interrupt

• Function queue

• Real-time operating system

2.28. Periodic SW architecture

• simple periodic

• weighted periodic

• cyclic executive

while (TRUE) {
    /* poll every device in turn and serve it if needed */
    if (DeviceA_Needs_Service()) { Service_A(); }
    if (DeviceB_Needs_Service()) { Service_B(); }
    if (DeviceC_Needs_Service()) { Service_C(); }
    ...
}

2.30. Properties:

• max. response time:

• handling HW: polling

• communication between tasks: through shared variables (not preemptive, thus no problem!)

• maintainability: bad

• HRT behavior: slow (e.g. printing task) (it can be HRT even if slow)

• proc. usage: 100% (NOT good!)

• area of usage: where the time constant of the system is greater than the cycle time (rapid and rare events)

2.31. Weighted periodic software architecture

void main() {
    while (TRUE) {
        if (DeviceB_Needs_Service()) { Service_B(); }
        ...
    }
}

2.32. Properties:

• max. response time: smaller for frequent tasks

• handling hardware: polling

• proc. usage: still 100%

• other properties: priority like behavior, but NOT preemptive
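A minimal sketch of one weighted polling cycle, assuming the weighting is expressed as an order table in which a frequently served device simply appears more often. The device_t type, the order A, B, C, B and the demo stubs are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    bool (*needs_service)(void);   /* polling function of the device */
    void (*service)(void);         /* service routine of the device */
} device_t;

/* One pass over the polling order. A device listed several times in `order`
 * is polled several times per cycle, so its response time is shorter. */
void weighted_cycle(const device_t *dev, const size_t *order, size_t order_len)
{
    assert(dev != NULL && order != NULL);
    for (size_t i = 0; i < order_len; i++) {
        const device_t *d = &dev[order[i]];
        if (d->needs_service()) {
            d->service();          /* runs to completion: not preemptive */
        }
    }
}

/* Demo stubs: every device always needs service; counters record the calls. */
unsigned served[3];
bool always_ready(void) { return true; }
void serve_a(void) { served[0]++; }
void serve_b(void) { served[1]++; }
void serve_c(void) { served[2]++; }

const device_t demo_dev[3] = {
    { always_ready, serve_a }, { always_ready, serve_b }, { always_ready, serve_c }
};
const size_t demo_order[4] = { 0, 1, 2, 1 };   /* A, B, C, B */
```

The main program would call weighted_cycle in an endless loop; in the demo order, device B is served twice per cycle and the others once.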

2.33. Cyclic executive software architecture

• Cycle boundaries are scheduled by a timer (only the boundaries!)

• The cycle is executed once or several times per timer interrupt (IT)

• Within a cycle it might be weighted periodic

2.34. Properties

• max. response time: time period of cycle

• proc. usage: standby
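A sketch of the idea under stated assumptions: a timer interrupt only marks the cycle boundary, and the main loop runs one cycle per tick and may otherwise put the processor into standby. The names timer_isr, cyclic_executive_poll and demo_cycle are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Set by the timer interrupt, cleared by the main loop. */
static volatile bool tick_pending;

/* Hypothetical timer interrupt handler: marks a cycle boundary only. */
void timer_isr(void)
{
    tick_pending = true;
}

/* Runs one cycle if a timer tick is pending. Returns false when there is
 * nothing to do, which is the point where the CPU could enter standby. */
bool cyclic_executive_poll(void (*run_cycle)(void))
{
    assert(run_cycle != 0);
    if (!tick_pending) {
        return false;
    }
    tick_pending = false;
    run_cycle();        /* e.g. a simple or weighted periodic polling pass */
    return true;
}

/* Demo cycle body: counts how many cycles were executed. */
unsigned cycles_run;
void demo_cycle(void) { cycles_run++; }
```

The main program would call cyclic_executive_poll in a loop and execute a sleep instruction whenever it returns false.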

2.35. Time triggered software architecture

• micro kernel supervises the time, and starts the tasks

2.36. Properties:

• max. response time: the scheduled recurrence period of the given task

• communication between tasks: through shared variables

• proc. usage: standby

• HRT behavior: OK

typical in safety critical systems
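The role of the micro kernel can be sketched as a static schedule table that is evaluated at every timer tick. The table layout (period and offset in ticks) and all names are assumptions made for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t period;      /* task is started every `period` ticks (must be > 0) */
    uint32_t offset;      /* first start, in ticks */
    void   (*run)(void);  /* task body, runs to completion */
} tt_task_t;

/* Called once per timer tick: starts every task whose slot has arrived. */
void tt_dispatch(const tt_task_t *tab, size_t n, uint32_t tick)
{
    assert(tab != NULL);
    for (size_t i = 0; i < n; i++) {
        assert(tab[i].period > 0);
        if (tick >= tab[i].offset && (tick - tab[i].offset) % tab[i].period == 0) {
            tab[i].run();
        }
    }
}

/* Demo tasks and schedule: A every 2 ticks from tick 0, B every 3 ticks from tick 1. */
unsigned tt_runs_a, tt_runs_b;
void tt_task_a(void) { tt_runs_a++; }
void tt_task_b(void) { tt_runs_b++; }
const tt_task_t tt_demo_table[2] = {
    { 2u, 0u, tt_task_a },
    { 3u, 1u, tt_task_b }
};
```

Because the start times are fixed in advance, the worst-case response time of each task follows directly from the table.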

2.37. Periodic software arch. extended with IT

• maintainability: good from the point of view of IT, but timing relations change when a new task is added

• area of usage: execution times of the tasks are approximately the same. Most widespread

2.39. Function queue

2.40. Properties:

• max. response time: exec. time of longest task

• handling hardware: with interrupts

• communication between tasks: through shared variables

• maintainability: good

• picking up from queue:

• FIFO
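A minimal sketch of a function queue with FIFO pick-up, assuming that interrupt handlers post function pointers and the main loop executes them one by one. The queue size and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define FQ_SIZE 8u                 /* assumed queue capacity */

typedef void (*task_fn_t)(void);

static task_fn_t fq_slots[FQ_SIZE];
static volatile size_t fq_head, fq_tail, fq_count;

/* Called from an interrupt handler: queues the function that will process
 * the event. Returns false if the queue is full. */
bool fq_post(task_fn_t fn)
{
    assert(fn != NULL);
    if (fq_count == FQ_SIZE) {
        return false;
    }
    fq_slots[fq_head] = fn;
    fq_head = (fq_head + 1u) % FQ_SIZE;
    fq_count++;
    return true;
}

/* Called from the main loop: runs the oldest queued function (FIFO).
 * Returns false if the queue is empty. */
bool fq_run_next(void)
{
    if (fq_count == 0) {
        return false;
    }
    task_fn_t fn = fq_slots[fq_tail];
    fq_tail = (fq_tail + 1u) % FQ_SIZE;
    fq_count--;          /* on real hardware, protect against the ISR here */
    fn();                /* runs to completion, so the longest task bounds the response time */
    return true;
}

/* Demo jobs that log the order of execution. */
char fq_log[8];
size_t fq_log_len;
void job_a(void) { fq_log[fq_log_len++] = 'a'; }
void job_b(void) { fq_log[fq_log_len++] = 'b'; }
```

Other pick-up policies would change only the way fq_run_next selects the entry to execute.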

• SW issues

• Communication in real-time systems

Tamás Dabóczi

Budapest University of Technology and Economics

Dept. of Measurement and Information Systems

2.42. SW architecture with real-time operating system (RTOS)

Control in more detail:

2.43. Properties:

• max. response time: an OS-specific value (in the usec range) for the highest-priority task; for lower-priority tasks, add the execution times of all higher-priority tasks

• handling hardware: with interrupts

• communication between tasks: through RTOS communication functions. This serves also as synchronization

• maintainability: very good

• HRT behavior: good

• proc. usage: standby (When? Only if the processor is put to sleep during idle!)

• area of usage: universal

• drawback: the OS costs extra time and code

Nomenclature

embedded OS: requires limited resources (runs even on uC)

real-time OS: bounded and deterministic response time to external events

(E.g. Unix is already embedded, but not RT. RTlinux is also RT. We deal only with RT and embedded OS.)

task: chain of activities which are logically related to each other

job: subactivity of a task

process: schedulable entity with its own memory (implementation of tasks)

thread: schedulable entity without its own memory

kernel: core of the OS

scalability: services of the OS can be switched on/off at compile time

availability as source code


• handling interrupts,

• timing,

• handling memory

With scalability as extra:

• handling peripheries, system programs (API)

• handling communication channels

• management of virtual memory, file systems etc.

2.44. States of tasks

2.45. Handling Tasks

Task Control Block (TCB)


2.46. Priorities

• reset

• power supply

• timer IT

• highest priority HW IT

• ...

• lowest priority HW IT

• scheduler

• highest priority task

• ...

• lowest priority task


2.47. Comparison of general-purpose OS and RTOS

2.48. Synchronization mechanisms between tasks


• signaling (binary) events

2.49. Semaphore

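Operations: Create, Pend, Post, (Accept)

Pend: if the counter > 0, the counter is decremented and the task keeps on running

if the counter = 0, the task is blocked, scheduling

Post: the counter is incremented; scheduling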
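The counting behavior of Pend and Post can be sketched without a scheduler. This is a simplified model under stated assumptions: the type and function names are hypothetical, and instead of blocking the caller, `ksem_pend` only reports whether the task would keep running. A real kernel would also hand the semaphore directly to a waiting task on Post.

```c
/* Simplified counting semaphore model (hypothetical names). */
typedef struct {
    unsigned count; /* number of available units */
} ksem_t;

void ksem_create(ksem_t *s, unsigned initial)
{
    s->count = initial;
}

/* Pend: returns 1 if the caller keeps running (a unit was taken),
   0 if the caller would be blocked and the scheduler would run. */
int ksem_pend(ksem_t *s)
{
    if (s->count > 0) {
        s->count--;
        return 1;
    }
    return 0;
}

/* Post: releases one unit; a real kernel would reschedule here. */
void ksem_post(ksem_t *s)
{
    s->count++;
}
```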

administration:

2.50. Mailbox

Arbitrary data structure can be handed over.

The message is overwritten if it is not taken from the mailbox.

Administration:

2.51. Queue

Can be thought of as a linked list of mailboxes.


3. 3 Real-Time Operating Systems

3.1. Real-time systems

• SW issues

uC-OS Operating System

(Micro-Controller Operating System)

3.2. The story of uC-OS (Micro-Controller Operating System)

• Developer: Jean J. Labrosse

• motivation: he needed a RT kernel for an application

• "A" kernel: reliable, but too expensive

• "B" kernel: cheaper, so the company bought this one. They spent 2 months getting simple tasks to run. It turned out that the kernel was not well tested; they were one of the first customers.

After accumulating a large delay in the product release, they bought kernel "A" (expensive but reliable).

After 3 months they caught a bug, which the vendor couldn't fix for 6 months!

The product reached the market with a large delay.


• ROMable,

• scalable,

• preemptive,

• multi-tasking,

• deterministic run-time of OS,

• allows different stack size for every task,

• services: mailbox, queue, semaphore, fixed-sized memory partitions, time related functions, etc.

• interrupt management (IT can be nested up to 255 levels deep),

• robust and reliable.

3.4. Scaling

File: OS_CFG.H

The source code consists of conditional compilation based on these directives:

• Structure of a task:

• infinite loop,

• Needs to contain a blocking command.

• Priorities:

• Every task needs to have a unique priority,

• Priority is not modified by the OS, but the user may change,

• Scheduling: always the highest priority task runs, the priority is static.

• Stack:

• Every task has own stack,

• Size of stacks of different tasks may be different.

• Identification of tasks (suspend, resume, del, changeprio):

• By priority, since it is unique.

Remark:

With a preemptive scheduler, after an IT it is not necessarily the interrupted task that receives the right to run, but the highest priority task in the ready to run list (the IT can change the set of "ready to run" tasks).

3.6. Synchronization mechanism

• Semaphores

• The counter of the semaphore needs to be initialized at the creation,

• Pend can wait forever, or for a limited time (timeout in no. of ticks).

Mutual Exclusion Semaphores

Event Flags


Message Queues

• Message can be appended also to the beginning of the Queue, thus it can also behave as LIFO

Memory Management

• fixed-sized memory blocks can be allocated within a mem. partition

• after use, the mem. block needs to be returned through the OS with OSMemPut.
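The idea of a partition of fixed-sized blocks can be sketched with a free list. This is not the OS's implementation: `pool_init`, `pool_get`, `pool_put` and the sizes are hypothetical names and values chosen for the example. The point of the design is that both operations take constant time and the partition cannot fragment.

```c
#include <stddef.h>

#define BLOCK_SIZE  32u
#define BLOCK_COUNT 4u

/* A free block stores the link to the next free block in its own memory. */
typedef union block {
    union block *next;
    unsigned char data[BLOCK_SIZE];
} block_t;

static block_t partition[BLOCK_COUNT];
static block_t *free_list;

/* Chains all blocks of the partition into the free list. */
void pool_init(void)
{
    free_list = NULL;
    for (size_t i = 0; i < BLOCK_COUNT; i++) {
        partition[i].next = free_list;
        free_list = &partition[i];
    }
}

/* Takes one block in constant time; returns NULL if none is free. */
void *pool_get(void)
{
    block_t *b = free_list;
    if (b != NULL)
        free_list = b->next;
    return b;
}

/* Returns a block to the partition in constant time. */
void pool_put(void *p)
{
    block_t *b = p;
    b->next = free_list;
    free_list = b;
}
```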

3.7. Communication between tasks

• ECB: event control block (semaphore, mailbox, queue etc.)

• IT routine is not allowed to block!


3.8. Semaphores

Notation: key, if it is used for protecting a shared resource,

flag, if it is used for signaling an event

N: how many resources need to be protected (counting semaphore)

N=1: binary semaphore

3.9. Example for usage of Semaphore: writing to and reading from a buffer in such a way that the data is not lost

3.10. Usage of Mailbox

Message can be any pointer type data.

3.11. Example for usage of Mailbox


Usage of Queue

• Message can be arbitrary pointer type data.

• The queue is a circular buffer of pointers.

3.12. Example for usage of Queue


uC-OS Task control block

Scheduler of uC-OS


3.13. uC-OS Event Control Block (ECB): semaphore, mailbox, queue

• Deadline Monotonic Algorithm (DMA)

• Earliest Deadline First (EDF)

4.2. Deadline Monotonic Algorithm

Exercise 1: Calculate the worst case response time of Task 1..4 with the iterative procedure (DMA)!

• The priorities of tasks decrease with the numbers (IT is highest, then Task 1..4).

• The computation time of OS will be neglected.

• Tasks do not block each other (simple tasks).

• Provide the expression of DMA, and name the variables!

• Calculate the response time (even if the deadline cannot be met)!

• Can the deadline be met?
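The iterative procedure can be sketched as follows, assuming the usual response-time recurrence $R_i^{(n+1)} = C_i + \sum_{j \in hp(i)} \lceil R_i^{(n)} / T_j \rceil C_j$, where $C$ is the worst-case computation time, $T$ the period, and $hp(i)$ the set of tasks with higher priority than task $i$. The struct, the function name and the task values in the usage note are hypothetical; they are not the data of the exercise.

```c
#include <stddef.h>

typedef struct {
    unsigned c; /* worst-case computation time */
    unsigned t; /* period */
    unsigned d; /* relative deadline */
} task_t;

/* tasks[] is ordered by decreasing priority (index 0 is the highest).
   Returns the worst-case response time of task i, or 0 if the
   iteration grows beyond `limit` without converging. */
unsigned response_time(const task_t *tasks, size_t i, unsigned limit)
{
    unsigned r = tasks[i].c;
    for (;;) {
        unsigned next = tasks[i].c;
        for (size_t j = 0; j < i; j++) {
            /* ceil(r / T_j) preemptions by higher-priority task j */
            next += ((r + tasks[j].t - 1u) / tasks[j].t) * tasks[j].c;
        }
        if (next == r)
            return r;   /* fixed point reached */
        if (next > limit)
            return 0;   /* no convergence within the limit */
        r = next;
    }
}
```

For the hypothetical set (C, T, D) = (1, 4, 4), (2, 6, 6), (3, 12, 12) the iteration for the lowest-priority task runs 3, 6, 7, 9, 10, 10, so its response time is 10, which meets the deadline of 12. The deadline check itself is a separate comparison of the result with `d`, so the response time is obtained even if the deadline cannot be met.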

4.3. Solution:


4.4. Solution:

4.5. Solution:


4.6. Solution:


4.7. Deadline Monotonic Algorithm

Exercise 2: Calculate the worst case response time of Task 1..4 with the iterative procedure (DMA)!

4.8. Solution:


4.9. Solution:


4.10. Solution:

4.11. Solution:


4.12. Deadline Monotonic Algorithm

• Exercise 3: Calculate the worst case response time of Task 1..4 with the iterative procedure (DMA)!

4.13. Solution:


4.14. Solution:


4.15. Solution:

4.16. Solution:


4.17. Scheduling with Earliest Deadline First

Exercise:

• Schedule the following tasks with the Earliest Deadline First algorithm!

• Every task is ready to run at

• the scheduler runs every 1 ms

• computation time of the scheduler can be neglected

• What is the shortest time interval for which the scheduling needs to be calculated in order to be able to decide whether all tasks are schedulable?

• Calculate the utilization index!

Schedule the tasks!

• How long does scheduling need to be calculated?

• Utilization index?
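The questions above can be explored with a small simulation. It is a sketch under assumptions: every task is released at time 0, deadlines equal periods, the scheduler runs once per tick (1 ms), and the task set in the usage note is hypothetical because the exercise's own table is not reproduced here. Under these assumptions the schedule repeats after the least common multiple of the periods, so one such interval is enough to decide schedulability.

```c
#include <stddef.h>

#define EDF_MAX_TASKS 8u

static unsigned long gcd_ul(unsigned long a, unsigned long b)
{
    while (b != 0) {
        unsigned long r = a % b;
        a = b;
        b = r;
    }
    return a;
}

/* Least common multiple of all periods. */
unsigned long hyperperiod(const unsigned long *period, size_t n)
{
    unsigned long h = 1;
    for (size_t i = 0; i < n; i++)
        h = h / gcd_ul(h, period[i]) * period[i];
    return h;
}

/* Processor time demanded within h; utilization = demand / h. */
unsigned long demand(const unsigned long *c, const unsigned long *period,
                     size_t n, unsigned long h)
{
    unsigned long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += c[i] * (h / period[i]);
    return sum;
}

/* Tick-by-tick EDF over [0, h). Returns 1 if no deadline is missed. */
int edf_simulate(const unsigned long *c, const unsigned long *period,
                 size_t n, unsigned long h)
{
    unsigned long left[EDF_MAX_TASKS] = {0};
    unsigned long deadline[EDF_MAX_TASKS] = {0};
    if (n > EDF_MAX_TASKS)
        return 0;
    for (unsigned long now = 0; now < h; now++) {
        size_t pick = n; /* n means "no ready task" */
        for (size_t i = 0; i < n; i++) {
            if (now % period[i] == 0) {
                if (left[i] > 0)
                    return 0; /* previous job missed its deadline */
                left[i] = c[i];
                deadline[i] = now + period[i];
            }
            if (left[i] > 0 && (pick == n || deadline[i] < deadline[pick]))
                pick = i; /* earliest deadline so far */
        }
        if (pick < n)
            left[pick]--;
    }
    for (size_t i = 0; i < n; i++)
        if (left[i] > 0)
            return 0; /* unfinished job at the end of the hyperperiod */
    return 1;
}
```

For computation times 1, 2, 3 ms and periods 4, 6, 12 ms the interval to calculate is 12 ms and the demand is 10 ms, i.e. a utilization of 10/12.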

5. 5 Real-time systems


5.2. Memory management

Static:

• very safe in RT systems

• but does not allow reentrancy of functions.

Stack:

• supports reentrancy,

• What stack size is required? Only an estimate can be given.

Watermark: the stack is filled with a fixed pattern.

Inspecting the memory at any time shows where the undisturbed pattern ends, i.e. the largest stack depth used so far.
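A minimal sketch of the watermark technique, assuming a stack that grows downward from the highest address of its buffer, so the untouched pattern remains at the low addresses. The fill value and the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STACK_FILL 0xA5u

/* Fills the whole stack area with the pattern before the task starts. */
void stack_fill(uint8_t *stack, size_t size)
{
    memset(stack, STACK_FILL, size);
}

/* Counts the bytes at the low end that still hold the pattern. */
size_t stack_unused(const uint8_t *stack, size_t size)
{
    size_t n = 0;
    while (n < size && stack[n] == STACK_FILL)
        n++;
    return n;
}

/* Largest stack depth observed so far. */
size_t stack_used(const uint8_t *stack, size_t size)
{
    return size - stack_unused(stack, size);
}
```

The result is only an estimate from the runs observed so far: a deeper call path that has not yet occurred is not visible, and a task that happens to write the fill value at the boundary makes the stack look slightly less used.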

Dynamic mem. allocation:

• flexible,

• but mem. can become fragmented, chance for holes (bubbles),

• time of mem. allocation cannot be estimated,

• time of garbage collection cannot be estimated, it is nondeterministic,

• estimation of WCET is unrealistic,

• careless programming can lose pointers (memory leak): no "free" after "malloc"

Dynamic allocation of mem. is forbidden in RT systems.


5.4. Handling time, Clock systems

• representation of time,

• clock synchronization

5.5. Time measurement, time providers

Clocks as the sources of true time with given accuracy

clock $C$ shows a function $C(t)$ of the true time $t$

Reference clock: the absolute accurate clock,

Correct clock: correct at $t_0$, if $C(t_0) = t_0$

Accurate clock: accurate at $t_0$, if $dC(t)/dt = 1$ at $t = t_0$

(only offset error)

physical clock: oscillator + counter

resolution: No. of microticks per sec (g)

rate of clock:

where n is the No. of tick of reference clock,

is the time period of reference clock

clock drift: sec/sec


Precision:

consider clocks, the precision at the i. moment:

This increases with the drift, so synchronization is needed

internal synchronization: improves the consistency of clocks

accuracy: offset relative to the reference clock

external synchronization: adjusting to the reference clock

If all clocks in a set are externally synchronized with "A" accuracy, the clock set is also internally synchronized with "2A" accuracy.

If the clocks are internally synchronized, we cannot make any statement about the external synchronization!
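The "2A" statement can be checked numerically with a sketch that measures precision as the largest difference between any two clock readings taken at the same instant. The function name and the readings are hypothetical.

```c
#include <stddef.h>

/* Largest difference between any two readings (n >= 1). */
long clock_spread(const long *reading, size_t n)
{
    long lo = reading[0];
    long hi = reading[0];
    for (size_t i = 1; i < n; i++) {
        if (reading[i] < lo) lo = reading[i];
        if (reading[i] > hi) hi = reading[i];
    }
    return hi - lo;
}
```

If the reference shows 1000 and every clock is within A = 3 of it, the extreme readings are 997 and 1003, so the spread cannot exceed 2A = 6. The converse does not hold: readings of 5000, 5001 and 5002 have a spread of 2 but are far from the reference.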

5.6. Communication models

• unicast or multicast

• unicast: point-point connection

• multicast: from one node to any No. of nodes

• broadcast: from one node to every node

• symmetric or asymmetric link

• symmetric: "A" can send message to "B" if and only if "B" can send message to "A"

• asymmetric: link might be unidirectional (e.g. different transmission powers)

• implicit or explicit synchronization

• implicit: together with other communication (piggy-back)

• explicit: message only for sync. purposes

• internal or external synchronization

• internal: reference is within the network

• external: reference is out of the network (e.g. GPS)

• continuous or on-demand

• on-demand: pre- or post-facto

• every node or just a subset


• end: message is sent (needs to wait until the channel is free)

• time of signal propagation

• reception time: from arrival of the signal until its reception by the application

Types of clock systems

(a) Central clock system

• Accurate and precise clock provides the time for whole system,

• standby redundancy for fault tolerance,

• high cost (accurate clock is expensive),

• communication demand is low (one message per sync.)

Example: GPS; satellites transmitting time signals can be used for sync. with ns accuracy

(b) Centrally controlled clock systems

• one master clock (assumed to be accurate) polls the time of slaves

• measures the difference and specifies a correction for slaves,

• if master gets out of order, one of the slaves takes over the duty,

• transmission time and delay need to be estimated,

• communication demand is higher than in a central clock sys.

Master-slave algorithm in more detail

Assumptions:

error of master clock is smaller than that of slave ( small)

goal is consistency within the network, not with global time


Master-slave communication and calculations summarized:

1. Master sends its own time to the Slave

2. Slave calculates the time difference at the moment of reception

3. Slave sends its own time and the prev. difference to the Master

4. Master calculates the time difference at the moment of reception

5. Master calculates the required correction

6. Master sends the required correction to the Slave

7. Slave corrects its clock with the received correction

8. Repeat 1-7 for every Slave

The correction term:

: average error

: difference of quantization error of clocks

: difference in duration of communications

The last two terms are random, so they can be decreased by averaging.
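The two difference measurements can be combined as sketched below. This assumes the usual round-trip estimate with equal communication delay in both directions; the exact expression used in the lecture may differ, and the function and parameter names are hypothetical.

```c
typedef long ticks_t;

/* master_tx: master clock when the master sends its time (step 1)
   slave_rx:  slave clock when that message is received (step 2)
   slave_tx:  slave clock when the slave replies (step 3)
   master_rx: master clock when the reply is received (step 4)

   d1 = slave_rx - master_tx = offset + delay
   d2 = master_rx - slave_tx = -offset + delay
   The delay cancels in d2 - d1, so the value to add to the slave
   clock is (d2 - d1) / 2 = -offset. */
ticks_t slave_correction(ticks_t master_tx, ticks_t slave_rx,
                         ticks_t slave_tx, ticks_t master_rx)
{
    ticks_t d1 = slave_rx - master_tx;
    ticks_t d2 = master_rx - slave_tx;
    return (d2 - d1) / 2;
}
```

Example: the slave is 50 ticks ahead and each message takes 7 ticks. The master sends at 1000, the slave reads 1057 on arrival, replies at its own 1060, and the master receives the reply at 1017. Then d1 = 57, d2 = -43, and the correction is -50. Unequal delays in the two directions show up directly as an error of the correction, which is why averaging over several rounds helps.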

After the synchronization there is still remaining error

It increases with the time because of clock drift:

After time the max. deviation from the master:

Max. difference of clocks of two slaves after time:

Example: in the case of mitmót hw which error can be neglected?


Every clock is aware of its own error (accuracy interval):

Terms of :

base error: remaining error at reset time (sync.)

delay between reading the clock and updating clock

drift causes error because of delay

Communication time + error caused by the drift is taken into account by incrementing :

5.8. Minimization of maximal error

If request is received from :

, Send to

At least once every time:

5.9. Intersection interval algorithm

If request is received from :


5.10. Visualization of Intersection interval algorithm

If the intervals do not intersect each other, no sync. is possible
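The interval intersection can be sketched as follows, assuming every clock reports a reading and its own maximal error, all readings refer to the same instant, and the error growth due to communication time and drift has already been added to the reported errors. The struct and function names are hypothetical.

```c
#include <stddef.h>

typedef struct {
    long clock; /* clock reading */
    long error; /* maximal error: true time is within clock +/- error */
} interval_clock_t;

/* Intersects the intervals of n clocks (n >= 1). On success stores the
   midpoint and half width of the common interval in *out and returns 1;
   returns 0 if the intervals have no common point. */
int intersect_sync(const interval_clock_t *c, size_t n, interval_clock_t *out)
{
    long lo = c[0].clock - c[0].error;
    long hi = c[0].clock + c[0].error;
    for (size_t i = 1; i < n; i++) {
        long l = c[i].clock - c[i].error;
        long h = c[i].clock + c[i].error;
        if (l > lo) lo = l; /* largest lower bound */
        if (h < hi) hi = h; /* smallest upper bound */
    }
    if (lo > hi)
        return 0;
    out->clock = (lo + hi) / 2;
    out->error = (hi - lo + 1) / 2; /* half width, rounded up */
    return 1;
}
```

Two clocks reporting 100 ± 5 and 103 ± 4 give the intervals [95, 105] and [99, 107]; the intersection [99, 105] yields the new reading 102 with error 3, smaller than either original error.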

5.11. Clock sync. in the case of Byzantine type errors

Byzantine type error: a node reports a different time to two different nodes for the same query

Necessary condition to be able to synchronize:

at least $3k + 1$ clocks in the system,

where $k$ is the No. of nodes with Byzantine type errors

Fault tolerant average:
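A common form of the fault tolerant average sorts the collected clock values, discards the $k$ smallest and the $k$ largest, and averages the rest. The sketch below assumes this form; the function name, the array limit and the sample values are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

#define FTA_MAX 16u

static int cmp_long(const void *a, const void *b)
{
    long x = *(const long *)a;
    long y = *(const long *)b;
    return (x > y) - (x < y);
}

/* Averages n clock values after discarding the k smallest and the k
   largest. Returns 0 if n < 3k + 1 or n exceeds FTA_MAX, otherwise
   stores the (integer) average in *avg and returns 1. */
int fault_tolerant_average(const long *value, size_t n, size_t k, long *avg)
{
    long sorted[FTA_MAX];
    long sum = 0;
    if (n > FTA_MAX || n < 3u * k + 1u)
        return 0;
    for (size_t i = 0; i < n; i++)
        sorted[i] = value[i];
    qsort(sorted, n, sizeof sorted[0], cmp_long);
    for (size_t i = k; i < n - k; i++)
        sum += sorted[i];
    *avg = sum / (long)(n - 2u * k);
    return 1;
}
```

With the values 10, 12, 11, 1000, 9, 13, -500 and k = 1, the outliers -500 and 1000 are dropped and the average of 9..13 is 11, so a single faulty clock cannot pull the result outside the range of the correct clocks.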


The message arrives with varying time because of variance in time of media access, timing of SW etc.

• if synchronized from application SW:

• if synchronized from OS kernel:

• if synchronized from HW of comm. controller:

Can be proved: if in a set of $N$ nodes there is a latency jitter of $\varepsilon$ in the communication time, the consistency of the clock system cannot be better than $\varepsilon (1 - 1/N)$, even if all clocks are perfect.

5.13. Time standards

In distributed, real-time systems:

• International Atomic Time (Temps Atomique International, TAI)

• Based on an atomic clock (derived from the transition frequency of the Cesium-133 atom: 1 sec = 9 192 631 770 periods)

• Universal Time Coordinated, UTC

• Derived from the movement of Earth and Sun (astronomy)

• Replaced GMT in 1972, with the 1 sec unit derived from TAI

• The movement of Earth is irregular, so extra leap secs are sometimes inserted.

Time format: Network Time Protocol (NTP) is the most widespread

• 8 bytes: 4 bytes of UTC sec and 4 bytes of fraction of sec. (fraction: about 232 psec resolution)

• time since January 1, 1900, 00:00:00

• the format is good until 2036 (136 years is the turnover cycle)

A new "Year 2000" problem!
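The two numbers above follow from the field widths: one unit of the 32-bit fraction is $2^{-32}$ sec (about 232.8 psec), and $2^{32}$ sec is a little more than 136 years. A small sketch, with a hypothetical struct and function names, using a 365-day year for the estimate:

```c
#include <stdint.h>

typedef struct {
    uint32_t seconds;  /* whole seconds since January 1, 1900 */
    uint32_t fraction; /* fraction of a second in units of 2^-32 sec */
} ntp_time_t;

/* Converts the 32-bit fraction to nanoseconds (rounded down). */
uint32_t ntp_fraction_to_ns(uint32_t fraction)
{
    return (uint32_t)(((uint64_t)fraction * 1000000000u) >> 32);
}

/* Whole 365-day years until the 32-bit second counter wraps around. */
uint32_t ntp_rollover_years(void)
{
    return (uint32_t)((UINT64_C(1) << 32) / (365u * 24u * 3600u));
}
```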

5.14. Duration, precedence

Assume that the clocks are internally synchronized

global time: weakened version of reference time

(more coarse resolution, macro tick)

Example 3 clocks: global clock


Why is it worth having a coarse grid?

We can assign the same time to an event, if the sync. time is smaller than the resolution:

Measurement of time difference:

If time difference is measured based on two different clocks:

• macro ticks have to be well specified

• clock inaccuracy needs to be taken into account!

Concept: message only in certain zones


6. 6 Real-time systems 2.

6.1. Real-time systems

• SW issues

6.2. Real-time communication

• requirements,

• synchronization,

• flow control,

• media access protocols

6.3. Requirements in real-time systems

Delay/jitter caused by the protocol

• should be small between the communication network interfaces (CNI) of sender and receiver

• jitter should be small and predictable,

• in distributed applications the message should appear in every node's CNI within a short and known time.

Composability

• separation of host and communication network interface,


A real-time communication system has to support changes in configuration.

E.g. communication system of a car should not be influenced by the existence or lack of an extra (optional) feature

Error detection

• communication errors: the communication system needs to be predictable and fault tolerant. The errors need to be detected and corrected.

• complete acknowledgement: end-to-end protocol

Never rely on transducers, always check their status!

Famous example: Three Mile Island, nuclear catastrophe, 1979

One of the valves failed to close, but the monitoring system showed it as closed, since the "close" command message had arrived correctly.

Physical structure

Multicast: bus or ring topology.

If the fault tolerance is accomplished through active redundancy:

devices need to be physically separated

E.g. steer-by-wire:

different parts are mounted in different places of the car

Synchronization of communication

• handshaking (data valid, data accepted), slowest receiver determines the speed of communication

6.4. Flow control

Goal: receiver has to keep up with the sender

Explicit flow control (event triggered): the receiver explicitly acknowledges to the sender that

• the message arrived correctly

• ready to receive a new message.

Example: Positive Acknowledgement or Retransmission Protocol (PAR)

ET protocol, given a sender and a receiver, communication media, time-out and a retransmission counter


Receiver: has the same message already arrived?

yes: send acknowledgement

no: send acknowledgement, notify client

Properties:

• communication is initiated by the sender,

• receiver is allowed to delay the sender (through a duplex communication channel),

• communication error is detected by the sender, the receiver does not receive information about the error,

• error correction through time redundancy,

• congestion: throughput decreases (nonlinearly) with the increase of comm. load

Example: bus, no global time, communication is token controlled

token round trip time: 10 msec

time of message sending: 1 msec

PAR protocol, No. of retransmissions: 2

What time-out is req.? 22 msec (2 x message + 2 x token)

minimum delay: 1 msec

maximum delay: 2 x unsuccessful + 1 x successful = 2 x 22 msec + 11 msec = 55 msec

error detection time: 3 x unsuccessful = 66 msec

jitter: maximum delay - minimum delay = 55 msec - 1 msec = 54 msec
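The arithmetic of the example can be parameterized as below. The struct and function names are hypothetical; the formulas simply restate the example: the time-out covers one message and one acknowledgement, each waiting a full token round, and the last, successful attempt costs one token round plus one message time.

```c
typedef struct {
    unsigned timeout_ms; /* time-out of one attempt */
    unsigned d_min_ms;   /* minimum delay */
    unsigned d_max_ms;   /* maximum delay of a delivered message */
    unsigned detect_ms;  /* error detection time at the sender */
    unsigned jitter_ms;  /* d_max - d_min */
} par_timing_t;

par_timing_t par_timing(unsigned token_ms, unsigned msg_ms,
                        unsigned retransmissions)
{
    par_timing_t p;
    p.timeout_ms = 2u * (token_ms + msg_ms);
    p.d_min_ms   = msg_ms;
    p.d_max_ms   = retransmissions * p.timeout_ms + token_ms + msg_ms;
    p.detect_ms  = (retransmissions + 1u) * p.timeout_ms;
    p.jitter_ms  = p.d_max_ms - p.d_min_ms;
    return p;
}
```

Calling `par_timing(10, 1, 2)` reproduces the values of the example: 22, 1, 55, 66 and 54 msec. The jitter is dominated by the token round trip time, not by the message length.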

6.6. Implicit flow control (time triggered):

• global time base required,


• no addressing is required, since everyone knows from a table the source and destination of the message,

• receiver cannot influence the speed.

Media access control communication protocols

Properties of the channel:

bandwidth: 10 kbit/sec..1 Mbit/sec (wired)

1 Gbit/sec (optical)

propagation speed/delay: 300 000 km/sec

in wire 2/3 of that: e.g. 5 usec/km

channel bit length: No. of bits that can traverse the channel within one propagation delay

E.g. 100 Mbit/sec bandwidth, 200 m long cable:

the propagation delay is 1 usec, so the channel bit length is 100 bits.

The message should be held on the channel until it has propagated to and arrived at every node.

Protocol efficiency:

message length/(message length + channel bit length)

E.g. in previous example with 40 bits long message,

protocol efficiency=40/(40+100)=4/14=29%
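The two quantities can be computed as sketched below; the function names and the choice of units (bit/sec, m, nsec/km) are assumptions made for the example.

```c
#include <stdint.h>

/* Number of bits "in flight" on the cable during one propagation delay. */
uint64_t channel_bit_length(uint64_t bandwidth_bps, uint64_t cable_m,
                            uint64_t delay_ns_per_km)
{
    uint64_t delay_ns = cable_m * delay_ns_per_km / 1000u;
    return bandwidth_bps * delay_ns / 1000000000u;
}

/* message length / (message length + channel bit length) */
double protocol_efficiency(uint64_t message_bits, uint64_t channel_bits)
{
    return (double)message_bits / (double)(message_bits + channel_bits);
}
```

With 100 Mbit/sec, 200 m and 5000 nsec/km the channel bit length is 100 bits, and a 40-bit message gives 40/140, i.e. about 29%. Longer messages or a lower bit rate improve the efficiency, because the channel bit length then becomes small relative to the message.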

6.7. Physical layer

asynchronous: synchronization only at the beginning of message,

poor quality clock at the receiver (e.g. sec/sec drift),

typically short frames because of clock drift.

synchronous: synchronization also on the fly,

(e.g. there are several edges/state changes, not necessarily separate clock)

longer messages can be transmitted.


Return-to-zero (RZ) coding

1: positive pulse, 0: negative pulse

• intermediate signal level required

• it has DC component

• synchronizing code, but larger bandwidth is required

Manchester coding (Ethernet, RFID etc.)

1: rising edge at half clock cycle,

0: falling edge at half clock cycle

At clock position: whether there is a state change is decided by the next bit

• synchronizing code, but larger bandwidth is required
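A minimal encoder sketch for the convention given above (1: rising edge at the middle of the bit cell, 0: falling edge). Each bit is represented by two half-cell signal levels; the function name and the array representation are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* bit[i] is 0 or 1; level[] receives 2 * n half-cell levels (0 or 1). */
void manchester_encode(const uint8_t *bit, size_t n, uint8_t *level)
{
    for (size_t i = 0; i < n; i++) {
        /* 1: low then high (rising edge), 0: high then low (falling edge) */
        level[2 * i]     = bit[i] ? 0u : 1u;
        level[2 * i + 1] = bit[i] ? 1u : 0u;
    }
}
```

For the bits 1, 0, 1, 1 the levels are 0,1, 1,0, 0,1, 0,1. There is a level change in the middle of every cell, which is what the receiver synchronizes to; at a cell boundary the level changes only when two equal bits follow each other.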


Frequency modulation coding

C: clock position, D: data position

1: signal change in C and D,

0 (after 1): no signal change

0 (after 0): signal change in C

• synchronizing code, but double bandwidth is required

Modified Frequency Modulation Coding (MFM)

(floppy disks)

1: signal change only in D,

0 (after 0): signal change in C

0 (after 1): no signal change

• synchronizing code, single bandwidth enough
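The three MFM rules can be turned into a small encoder that marks, for every bit cell, whether the signal changes at the clock position C and at the data position D. The function name, the output representation and the explicit `prev` parameter (the bit preceding the first one) are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* bit[i] is 0 or 1. change[2*i] is 1 if the signal changes at the clock
   position C of cell i, change[2*i + 1] is 1 if it changes at the data
   position D. prev is the bit that preceded bit[0]. */
void mfm_encode(const uint8_t *bit, size_t n, uint8_t prev, uint8_t *change)
{
    for (size_t i = 0; i < n; i++) {
        change[2 * i]     = (!bit[i] && !prev) ? 1u : 0u; /* 0 after 0 */
        change[2 * i + 1] = bit[i] ? 1u : 0u;             /* 1 */
        prev = bit[i];
    }
}
```

For the bits 1, 0, 0, 1, 1 with `prev` = 0 the result is 0,1, 0,0, 1,0, 0,1, 0,1. Two signal changes are never closer than one full bit cell, which is why a single bandwidth is enough.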


6.8. Real-time systems

• SW issues

• FlexRay

Pictures from the standard [1],[2]

[1] FlexRay Communications System, Protocol Specification, Version 2.1, Revision A

[2] FlexRay Communications System, Electrical Physical Layer Specification, Version 2.1, Revision B

Major properties:

• high communication speed (10 Mbps)

• time-triggered

• redundant, fault tolerant, safety critical

First car in mass production:

BMW X5 (2007)

• controls stabilizers and dampers (Adaptive Drive)

BMW X6 (2008) - full utilization of FlexRay

• transmission

• variable steering transmission ratio (active steering)

• HRT

6.9. Bus topology

Passive bus (passive star can be considered a bus)

Active star

Single channel cascaded star


Hybrid topology


6.10. Node architecture

6.11. Communication controller - bus driver interface

TxEN: Transmit Enable Not

Can send data to the bus only under certain conditions

(Time Triggered)


6.12. Physical layer specification

Link: wire - UTP or STP (unshielded/shielded twisted pair)

optical

If wired:

= 1.8 .. 3.2 V

= 0.6 .. 2.0 V

6.13. Media Access Control

Periodic communication cycles

• static time division protocol (Time Division Multiple Access)

• dynamic "mini-slotting"


action point: sending a message can start only at given macro tick

arbitration: sender can start the message only at predefined time

The static segment and the idle time are always present; the dynamic segment and the symbol window are optional

6.14. Static segment

• every slot has the same length within this segment (configurable)

• every frame has the same length within this segment (configurable)