
2.2 Framework of Reinforcement Learning

2.2.4 Deep Q-learning

Deep Learning (DL) is a form of machine learning that transforms inputs into outputs using Artificial Neural Networks (ANNs). It is commonly applied in fields such as image processing, speech recognition, and natural language processing. DL methods can be classified into supervised, semi-supervised, and unsupervised learning (LeCun et al., 2015). Architectures such as deep neural networks, deep reinforcement learning, and convolutional neural networks have developed rapidly. An application of a smart mask filter based on DL (He et al., 2021a) is described in Appendix B.

The Deep Q-network (DQN) method is developed from Q-learning and DL: it uses a multi-layer neural network to approximate the Q-function instead of the tabular Q-function of Q-learning, as shown in Fig. 2.5(b). Q-learning can generally solve problems whose state and action spaces are quite small. Otherwise, a "curse of dimensionality" arises as the state and action spaces expand, and it becomes impossible to build a Q-value table large enough to store and search the corresponding data. DQN eases this problem when the state and action spaces are too large. Specifically, the action-value function Q(s, a) is parameterized as Q(s, a; θ_t) using a deep convolutional neural network, where θ_t is the weight of the neural network at iteration t. Instead of the single Q-function in (2.13), DQN uses two separate neural networks, a target network and an online network, to update the Q-function; they compute the target value and the estimated value, respectively. The loss function is then updated as follows:

L_t(\theta_t) = \mathbb{E}_{s_t, a_t, r_t}\big[\big(y_t - Q(s_t, a_t; \theta_t)\big)^2\big]
             = \mathbb{E}_{s_t, a_t, r_t, s_{t+1}}\big[\big(r_t + \gamma \max_{a_{t+1}} \hat{Q}(s_{t+1}, a_{t+1}; \hat{\theta}_t) - Q(s_t, a_t; \theta_t)\big)^2\big]   (2.15)

where y_t = r_t + γ max_{a_{t+1}} Q̂(s_{t+1}, a_{t+1}; θ̂_t) is the target value, which depends on the weights of the neural network, γ is the discount factor mentioned before, Q̂(s_{t+1}, a_{t+1}; θ̂_t) is the target Q-network, and θ̂_t are the parameters used to compute the target Q-network at iteration t. The target network parameters θ̂_t are only updated with the Q-network parameters θ_t every X steps and are fixed between individual updates. The loss function is minimized to update the Q-network using the stochastic gradient descent method.
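To make the loss in (2.15) concrete, the following sketch shows how the target value and the squared TD error could be computed with two separate networks. It assumes PyTorch and a small fully connected Q-network; the architecture, the `done` flag for terminal states, and all hyper-parameters are illustrative choices, not taken from this thesis.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Small fully connected Q-network: maps a state to one Q-value per action."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s)

def dqn_loss(online_net: QNetwork, target_net: QNetwork, batch, gamma: float = 0.99):
    """Squared TD error of (2.15): (r_t + gamma * max_a' Q_hat(s', a') - Q(s, a))^2."""
    s, a, r, s_next, done = batch                                # tensors from a sampled mini-batch
    q_sa = online_net(s).gather(1, a.unsqueeze(1)).squeeze(1)    # Q(s_t, a_t; theta_t)
    with torch.no_grad():                                        # the target network is not trained directly
        q_next = target_net(s_next).max(dim=1).values            # max_{a_{t+1}} Q_hat(s_{t+1}, a_{t+1}; theta_hat_t)
        y = r + gamma * (1.0 - done) * q_next                    # target value y_t (done masks terminal states)
    return ((y - q_sa) ** 2).mean()                              # minimized by stochastic gradient descent
```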

Moreover, a replay buffer that stores past experiences is used to update the online network in DQN. Specifically, mini-batches of samples are drawn randomly and uniformly from the buffer of stored experiences and used to update the Q-network with the loss function in (2.15). This differs from the Q-function update in Q-learning, which only uses the most recently obtained experience. The target network is updated periodically rather than at every single step, which stabilizes and speeds up the training process. Replay alleviates the high correlation between consecutive experiences and counteracts the neural network's "forgetfulness" problem during learning. Besides, the experiences stored in the replay buffer can be relearned multiple times, making data utilization more efficient.

Chapter 3

Traffic Light Control with Game Theory-based Strategies

As mentioned in Section 1.2.1, integrating artificial intelligence like game theory into traffic light control provides an intelligent and efficient way to regulate traffic flow. In a game-theoretical framework, incoming links are regarded as players, and the status of the signal light (green or red) can be considered the players' decisions. The traffic light control system aims to find an optimal solution that reduces traffic congestion and maximizes the number of vehicles that travel through the intersection in the urban transportation system.

This chapter develops a novel game-theoretical traffic light control system with decision-making combinations. Unlike the popular methods that adjust the time intervals of the green light, this model focuses on the optimal vehicle route plan. Four control methods are implemented and compared in the traffic light control model: (a) Constant control strategy. The time intervals of the traffic lights are fixed and periodic. (b) Global optimal strategy. Centralized control can be considered a cooperative game, where the incoming links are regarded as players that communicate with each other to generate the best strategy for the whole game under a global cost function. (c) Nash equilibrium strategy. In decentralized control with a non-cooperative game structure, each player strives for an outcome that provides him with the lowest possible cost. (d) Stackelberg equilibrium strategy. Similar to (c), but the concepts of leader and follower are introduced: the leader has a higher priority to enforce his strategy on the followers, and the followers respond to the decisions made by the leader.

3.1 Model Formulation

Fig. 3.1 shows the general structure of a cross intersection with four incoming links (i.e., w_1, ..., w_4), on which vehicles arrive from outside and queue, and four outgoing links (i.e., z_1, ..., z_4), through which vehicles leave the intersection. The notation (w_i, z_j) denotes the path along which vehicles move from w_i to z_j. In this model, only the red and green states of a traffic light are considered, to simplify the evaluation of the algorithm's effect. We formulate these states in (3.1):

g =
\begin{cases}
0, & \text{red} \quad (3.1a) \\
1, & \text{green} \quad (3.1b)
\end{cases}

We assume a variable t_e, the time interval during which all the waiting vehicles on the path and the incoming vehicles arriving from outside will be cleared; the cleared vehicles are those that pass through the intersection. Based on this, the key equation of the model can be written as:

L_{w_i z_j}(k) + I_{w_i} R_{w_i z_j} t_e(k) = O_{w_i z_j} t_e(k)   (3.2)

where i = 1, ..., 4 is the index of the incoming link and j = 1, ..., 4 is the index of the outgoing link. The variable k is the index of the time slice, and L_{w_i z_j} is the queue length, i.e., the number of waiting vehicles on w_i that are to leave towards z_j. The incoming traffic flow is given by I_{w_i} (vehicles/second), which indicates the number of vehicles entering the incoming link w_i per second. To distribute the incoming vehicles from w_i over the different paths to the outgoing links z_j, a parameter R_{w_i z_j} is introduced, which assigns a share of the incoming traffic flow I_{w_i} to each path (w_i, z_j). The left-hand side of the equation is the sum of waiting vehicles and incoming vehicles on the incoming link w_i, and the right-hand side represents the outgoing vehicles determined by the outgoing traffic flow O_{w_i z_j} (vehicles/second).

Solving for t_e from (3.2), we obtain:

t_e(k) = \frac{L_{w_i z_j}(k)}{O_{w_i z_j} - I_{w_i} R_{w_i z_j}}   (3.3)
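As a quick illustration of (3.3), the helper below (with our own variable names, not the thesis notation) computes t_e from the current queue length and the net service rate O − I·R; values outside (0, T_s] indicate that the queue cannot be cleared within one time slice, which is handled by the cases discussed next.

```python
def clearing_time(L: float, I: float, R: float, O: float) -> float:
    """t_e from (3.3): time needed to clear queue L plus arrivals I*R at outgoing rate O.

    L : current queue length on path (w_i, z_j)   [vehicles]
    I : incoming traffic flow on link w_i         [vehicles/second]
    R : share of I assigned to path (w_i, z_j)
    O : outgoing traffic flow on path (w_i, z_j)  [vehicles/second]
    """
    if O == I * R:                       # arrivals exactly match the outgoing flow:
        return float("inf")              # the queue can never be cleared (treated like Case 2)
    return L / (O - I * R)               # negative or larger than T_s when arrivals outpace O
```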

Our goal here is to update the queues based on the old queues and the value of t_e, which can be discussed in three different cases as follows:

Fig. 3.1: General structure of the intersection.

Case 1: 0 < t_e(k) ≤ T_s. The updated queue length L_{w_i z_j}(k+1) is:

L_{w_i z_j}(k+1) = L_{w_i z_j}(k) - g_{w_i z_j}(k)\, O_{w_i z_j} t_e(k) - g_{w_i z_j}(k)\, I_{w_i} R_{w_i z_j} \big(T_s - t_e(k)\big) + I_{w_i} R_{w_i z_j} T_s   (3.4)

where T_s is the length of the time slice and k is the index of the time slice mentioned above. The status of the light signal is given by g_{w_i z_j}(k) as defined in (3.1), which determines whether vehicles may leave the incoming link w_i for the outgoing link z_j (0: red, stop; 1: green, pass). When t_e(k) is greater than zero and at most T_s under the condition g_{w_i z_j}(k) = 1, all the vehicles, both the waiting vehicles at the traffic light and the incoming vehicles outside the queues, are first cleared during t_e(k). Then, after the waiting vehicles are cleared, the vehicles that keep arriving during T_s − t_e(k) are removed as well because the green light is still active. Finally, all vehicles arriving during the period T_s are added to the queue, regardless of whether the traffic light is red or green. Otherwise, when the traffic light is red during T_s, the two subtracted terms −g_{w_i z_j}(k) O_{w_i z_j} t_e(k) and −g_{w_i z_j}(k) I_{w_i} R_{w_i z_j}(T_s − t_e(k)) are zero because g_{w_i z_j}(k) = 0, so only the old queue L_{w_i z_j}(k) and the added term I_{w_i} R_{w_i z_j} T_s remain.

Case 2: t_e(k) < 0 or t_e(k) > T_s. The updated queue length L_{w_i z_j}(k+1) is:

L_{w_i z_j}(k+1) = L_{w_i z_j}(k) - g_{w_i z_j}(k)\, O_{w_i z_j} T_s + I_{w_i} R_{w_i z_j} T_s   (3.5)

Similarly, when t_e(k) is greater than T_s or less than zero under the condition g_{w_i z_j}(k) = 1, the vehicles cannot be cleared completely within one time slice T_s. That is, the system cannot remove all the waiting and incoming vehicles in T_s with the limited speed of the outgoing traffic flow, which directly causes the queues to accumulate; this is the main difference between Case 1 and Case 2. In (3.5), the subtracted term −g_{w_i z_j}(k) O_{w_i z_j} T_s follows the speed of the outgoing traffic flow even though some vehicles remain on the path. The added term I_{w_i} R_{w_i z_j} T_s and the condition of the red light are the same as in the previous case.

Case 3: t_e(k) = 0. The updated queue length L_{w_i z_j}(k+1) is:

L_{w_i z_j}(k+1) =
\begin{cases}
- g_{w_i z_j}(k)\, I_{w_i} R_{w_i z_j} T_s + I_{w_i} R_{w_i z_j} T_s \quad (3.6a) \\
- g_{w_i z_j}(k)\, O_{w_i z_j} T_s + I_{w_i} R_{w_i z_j} T_s \quad (3.6b)
\end{cases}

An initial queue length L_{w_i z_j}(k) = 0 results in t_e(k) = 0 in (3.3). The difference between the speed of the outgoing traffic flow O_{w_i z_j} and the speed of the incoming traffic flow I_{w_i} R_{w_i z_j} is defined as DIO_{w_i z_j}. If DIO_{w_i z_j} > 0, then (3.6a) applies and the queue remains empty while vehicles pass through at the speed I_{w_i} R_{w_i z_j}. Otherwise, (3.6b) applies and the queue keeps increasing, with only a small part of the vehicle stream leaving at the speed O_{w_i z_j}.
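Putting the three cases together, one update step of a single path's queue could look like the sketch below; clearing_time is the helper sketched after (3.3), and the variable names are again ours rather than the thesis notation.

```python
def update_queue(L: float, g: int, I: float, R: float, O: float, Ts: float) -> float:
    """One time-slice update of the queue on path (w_i, z_j), following (3.4)-(3.6).

    g is the signal state from (3.1): 0 = red, 1 = green.
    """
    te = clearing_time(L, I, R, O)

    if te == 0.0:
        # Case 3 (empty queue): (3.6a) if the outgoing rate exceeds the arrival rate, else (3.6b).
        if O - I * R > 0:
            return -g * I * R * Ts + I * R * Ts
        return -g * O * Ts + I * R * Ts
    if 0.0 < te <= Ts:
        # Case 1: queue and early arrivals cleared within te, later arrivals also pass on green,
        # and all arrivals during Ts are added back regardless of the signal.
        return L - g * O * te - g * I * R * (Ts - te) + I * R * Ts
    # Case 2 (te < 0 or te > Ts): vehicles cannot be fully cleared within the slice.
    return L - g * O * Ts + I * R * Ts
```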

Generally, a collision will occur if more than one vehicle stream crosses the intersection simultaneously. Thus, the crossing rate Y[(w_i^1, z_j^1), (w_i^2, z_j^2)] is defined to determine to what degree the speed of the traffic flow on the first bent track (w_i^1, z_j^1) is affected by the traffic flow on the other bent track (w_i^2, z_j^2). The variable O_{w_i^1 z_j^1, 0} is the original speed of the vehicle stream on the first bent track (w_i^1, z_j^1) without any interference from other bent tracks. So we have:

O_{w_i^1 z_j^1,\, w_i^2 z_j^2} = O_{w_i^1 z_j^1,\, 0} \prod_{(w_i^2, z_j^2)} Y[(w_i^1, z_j^1), (w_i^2, z_j^2)]   (3.7)
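A direct reading of (3.7) is a product over all interfering bent tracks. The sketch below scales the interference-free speed by the crossing rate of every conflicting stream; the container types and the convention that a missing pair means no interference (factor 1.0) are our own assumptions.

```python
from typing import Dict, List, Tuple

Path = Tuple[str, str]   # a bent track (w_i, z_j), e.g. ("w1", "z3")

def effective_outgoing_speed(track: Path,
                             base_speed: float,
                             crossing_rate: Dict[Tuple[Path, Path], float],
                             active_tracks: List[Path]) -> float:
    """Apply (3.7): scale O_{track,0} by Y[track, other] for every other active bent track."""
    speed = base_speed                                        # O_{w^1 z^1, 0}: speed without interference
    for other in active_tracks:
        if other == track:
            continue
        speed *= crossing_rate.get((track, other), 1.0)       # Y[(w^1,z^1),(w^2,z^2)]; 1.0 = no conflict
    return speed
```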