
2.2 Framework of Reinforcement Learning

2.2.4 Deep Q-learning

Deep Learning (DL) is a form of machine learning that transforms inputs into outputs using Artificial Neural Networks (ANNs). It is commonly applied in fields such as image processing, speech recognition, and natural language processing. DL methods can be classified into supervised, semi-supervised, and unsupervised learning (LeCun et al., 2015). Architectures such as deep neural networks, deep reinforcement learning, and convolutional neural networks have developed rapidly. An application of a smart mask filter based on DL (He et al., 2021a) is described in Appendix B.

The Deep Q-network (DQN) method is developed from Q-learning and DL: it uses a multi-layer neural network to approximate the Q-function instead of the tabular Q-function of Q-learning, as shown in Fig. 2.5(b). Q-learning can generally solve problems whose state and action spaces are quite small. Otherwise, a "curse of dimensionality" arises as the state and action spaces expand, and it becomes impossible to build a Q-value table large enough to store and search the corresponding data. DQN eases this problem when the state and action spaces are too large. Specifically, the action-value function Q(s, a) is parameterized as Q(s, a; θ_t) using a deep convolutional neural network, where θ_t is the weight of the neural network at iteration t. Instead of the single Q-function in (2.13), DQN uses two separate neural networks, a target network and an online network, to update the Q-function; they compute the target value and the estimated value, respectively. The loss function is then updated as follows:

L_t(\theta_t) = \mathbb{E}_{s_t, a_t, r_t}\big[\big(y_t - Q(s_t, a_t; \theta_t)\big)^2\big]
             = \mathbb{E}_{s_t, a_t, r_t, s_{t+1}}\big[\big(r_t + \gamma \max_{a_{t+1}} \hat{Q}(s_{t+1}, a_{t+1}; \hat{\theta}_t) - Q(s_t, a_t; \theta_t)\big)^2\big]   (2.15)

where y_t = r_t + γ max_{a_{t+1}} Q̂(s_{t+1}, a_{t+1}; θ̂_t) is the target value, which depends on the weights of the neural network, γ is the discount factor mentioned before, Q̂(s_{t+1}, a_{t+1}; θ̂_t) is the target Q-network, and θ̂_t are the parameters used to compute the target Q-network at iteration t. The target network parameters θ̂_t are only updated with the Q-network parameters θ_t every X steps and are fixed between individual updates. The loss function is minimized to update the Q-network using the stochastic gradient descent method.
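To make the loss in (2.15) concrete, the following sketch shows how the target value and the squared TD error could be computed with two separate networks. It assumes PyTorch and a small fully connected Q-network; the architecture, the `done` flag for terminal states, and all hyper-parameters are illustrative choices, not taken from this thesis.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Small fully connected Q-network: maps a state to one Q-value per action."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s)

def dqn_loss(online_net: QNetwork, target_net: QNetwork, batch, gamma: float = 0.99):
    """Squared TD error of (2.15): (r_t + gamma * max_a' Q_hat(s', a') - Q(s, a))^2."""
    s, a, r, s_next, done = batch                                # tensors from a sampled mini-batch
    q_sa = online_net(s).gather(1, a.unsqueeze(1)).squeeze(1)    # Q(s_t, a_t; theta_t)
    with torch.no_grad():                                        # the target network is not trained directly
        q_next = target_net(s_next).max(dim=1).values            # max_{a_{t+1}} Q_hat(s_{t+1}, a_{t+1}; theta_hat_t)
        y = r + gamma * (1.0 - done) * q_next                    # target value y_t (done masks terminal states)
    return ((y - q_sa) ** 2).mean()                              # minimized by stochastic gradient descent
```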

Moreover, a replay buffer that stores past experiences is used to update the online network in DQN. Specifically, mini-batches of samples are drawn randomly and uniformly from the buffer of stored experiences and used to update the Q-network with the loss function in (2.15). This differs from the Q-function update in Q-learning, which only uses the most recently obtained experience. The target network is updated periodically rather than at every single step, which stabilizes and speeds up the training process. Replay alleviates the high correlation between consecutive experiences and counteracts the neural network's "forgetfulness" problem during learning. Besides, the experiences stored in the replay buffer can be relearned multiple times, making data utilization more efficient.

Chapter 3

Traffic Light Control with Game Theory-based Strategies

As mentioned in Section 1.2.1, integrating artificial intelligence like game theory into traffic light control provides an intelligent and efficient way to regulate traffic flow. In a game-theoretical framework, incoming links are regarded as players, and the status of the signal light (green or red) can be considered the players' decisions. The traffic light control system aims to find an optimal solution that reduces traffic congestion and maximizes the number of vehicles that travel through the intersection in the urban transportation system.

This chapter develops a novel game-theoretical traffic light control system with decision-making combinations. Unlike the popular methods that adjust the time intervals of the green light, this model focuses on the optimal vehicle route plan. Four control methods are implemented and compared in the traffic light control model: (a) Constant control strategy. The time intervals of the traffic lights are fixed and periodic. (b) Global optimal strategy. Centralized control can be considered a cooperative game, where the incoming links are regarded as players that communicate with each other to generate the best strategy for the whole game under a global cost function. (c) Nash equilibrium strategy. In decentralized control with a non-cooperative game structure, each player strives for an outcome that provides him with the lowest possible cost. (d) Stackelberg equilibrium strategy. Similar to (c), but the concepts of leader and follower are introduced: the leader has a higher priority to enforce his strategy on the followers, and the followers respond to the decisions made by the leader.

3.1 Model Formulation

Fig. 3.1 shows the general structure of a cross intersection with four incoming links (i.e., w_1, ..., w_4), on which vehicles arrive from outside and queue, and four outgoing links (i.e., z_1, ..., z_4), through which vehicles leave the intersection. The notation (w_i, z_j) denotes the path along which vehicles move from w_i to z_j. In this model, only the red and green states of a traffic light are considered, to simplify the evaluation of the algorithm's effect. We formulate these states in (3.1):

g =
\begin{cases}
0, & \text{red} \quad (3.1a) \\
1, & \text{green} \quad (3.1b)
\end{cases}

We assume a variable t_e, the time interval during which all the waiting vehicles on the path and the incoming vehicles arriving from outside will be cleared; the cleared vehicles are those that pass through the intersection. Based on this, the key equation of the model can be written as:

L_{w_i z_j}(k) + I_{w_i} R_{w_i z_j} t_e(k) = O_{w_i z_j} t_e(k)   (3.2)

where i = 1, ..., 4 is the index of the incoming link and j = 1, ..., 4 is the index of the outgoing link. The variable k is the index of the time slice, and L_{w_i z_j} is the queue length, i.e., the number of waiting vehicles on w_i that are to leave towards z_j. The incoming traffic flow is given by I_{w_i} (vehicles/second), which indicates the number of vehicles entering the incoming link w_i per second. To distribute the incoming vehicles from w_i over the different paths to the outgoing links z_j, a parameter R_{w_i z_j} is introduced, which assigns a share of the incoming traffic flow I_{w_i} to each path (w_i, z_j). The left-hand side of the equation is the sum of waiting vehicles and incoming vehicles on the incoming link w_i, and the right-hand side represents the outgoing vehicles determined by the outgoing traffic flow O_{w_i z_j} (vehicles/second).

Solving for t_e from (3.2), we obtain:

t_e(k) = \frac{L_{w_i z_j}(k)}{O_{w_i z_j} - I_{w_i} R_{w_i z_j}}   (3.3)
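As a quick illustration of (3.3), the helper below (with our own variable names, not the thesis notation) computes t_e from the current queue length and the net service rate O − I·R; values outside (0, T_s] indicate that the queue cannot be cleared within one time slice, which is handled by the cases discussed next.

```python
def clearing_time(L: float, I: float, R: float, O: float) -> float:
    """t_e from (3.3): time needed to clear queue L plus arrivals I*R at outgoing rate O.

    L : current queue length on path (w_i, z_j)   [vehicles]
    I : incoming traffic flow on link w_i         [vehicles/second]
    R : share of I assigned to path (w_i, z_j)
    O : outgoing traffic flow on path (w_i, z_j)  [vehicles/second]
    """
    if O == I * R:                       # arrivals exactly match the outgoing flow:
        return float("inf")              # the queue can never be cleared (treated like Case 2)
    return L / (O - I * R)               # negative or larger than T_s when arrivals outpace O
```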

Our goal here is to update the queues based on the old queues and the value of t_e, which can be discussed in three different cases as follows:

Fig. 3.1: General structure of the intersection.

Case 1: 0 < t_e(k) ≤ T_s. The updated queue length L_{w_i z_j}(k+1) is:

L_{w_i z_j}(k+1) = L_{w_i z_j}(k) - g_{w_i z_j}(k)\, O_{w_i z_j} t_e(k) - g_{w_i z_j}(k)\, I_{w_i} R_{w_i z_j} \big(T_s - t_e(k)\big) + I_{w_i} R_{w_i z_j} T_s   (3.4)

where T_s is the length of the time slice and k is the index of the time slice mentioned above. The status of the light signal is given by g_{w_i z_j}(k) as defined in (3.1), which determines whether vehicles may leave the incoming link w_i for the outgoing link z_j (0: red, stop; 1: green, pass). When t_e(k) is greater than zero and at most T_s under the condition g_{w_i z_j}(k) = 1, all the vehicles, both the waiting vehicles at the traffic light and the incoming vehicles outside the queues, are first cleared during t_e(k). Then, after the waiting vehicles are cleared, the vehicles that keep arriving during T_s − t_e(k) are removed as well because the green light is still active. Finally, all vehicles arriving during the period T_s are added to the queue, regardless of whether the traffic light is red or green. Otherwise, when the traffic light is red during T_s, the two subtracted terms −g_{w_i z_j}(k) O_{w_i z_j} t_e(k) and −g_{w_i z_j}(k) I_{w_i} R_{w_i z_j}(T_s − t_e(k)) are zero because g_{w_i z_j}(k) = 0, so only the old queue L_{w_i z_j}(k) and the added term I_{w_i} R_{w_i z_j} T_s remain.

Case 2: t_e(k) < 0 or t_e(k) > T_s. The updated queue length L_{w_i z_j}(k+1) is:

L_{w_i z_j}(k+1) = L_{w_i z_j}(k) - g_{w_i z_j}(k)\, O_{w_i z_j} T_s + I_{w_i} R_{w_i z_j} T_s   (3.5)

Similarly, when t_e(k) is greater than T_s or less than zero under the condition g_{w_i z_j}(k) = 1, the vehicles cannot be cleared completely within one time slice T_s. That is, the system cannot remove all the waiting and incoming vehicles in T_s with the limited speed of the outgoing traffic flow, which directly causes the queues to accumulate; this is the main difference between Case 1 and Case 2. In (3.5), the subtracted term −g_{w_i z_j}(k) O_{w_i z_j} T_s follows the speed of the outgoing traffic flow even though some vehicles remain on the path. The added term I_{w_i} R_{w_i z_j} T_s and the condition of the red light are the same as in the previous case.

Case 3: t_e(k) = 0. The updated queue length L_{w_i z_j}(k+1) is:

L_{w_i z_j}(k+1) =
\begin{cases}
- g_{w_i z_j}(k)\, I_{w_i} R_{w_i z_j} T_s + I_{w_i} R_{w_i z_j} T_s \quad (3.6a) \\
- g_{w_i z_j}(k)\, O_{w_i z_j} T_s + I_{w_i} R_{w_i z_j} T_s \quad (3.6b)
\end{cases}

An initial queue length L_{w_i z_j}(k) = 0 results in t_e(k) = 0 in (3.3). The difference between the speed of the outgoing traffic flow O_{w_i z_j} and the speed of the incoming traffic flow I_{w_i} R_{w_i z_j} is defined as DIO_{w_i z_j}. If DIO_{w_i z_j} > 0, then (3.6a) applies and the queue remains empty while vehicles pass through at the speed I_{w_i} R_{w_i z_j}. Otherwise, (3.6b) applies and the queue keeps increasing, with only a small part of the vehicle stream leaving at the speed O_{w_i z_j}.
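Putting the three cases together, one update step of a single path's queue could look like the sketch below; clearing_time is the helper sketched after (3.3), and the variable names are again ours rather than the thesis notation.

```python
def update_queue(L: float, g: int, I: float, R: float, O: float, Ts: float) -> float:
    """One time-slice update of the queue on path (w_i, z_j), following (3.4)-(3.6).

    g is the signal state from (3.1): 0 = red, 1 = green.
    """
    te = clearing_time(L, I, R, O)

    if te == 0.0:
        # Case 3 (empty queue): (3.6a) if the outgoing rate exceeds the arrival rate, else (3.6b).
        if O - I * R > 0:
            return -g * I * R * Ts + I * R * Ts
        return -g * O * Ts + I * R * Ts
    if 0.0 < te <= Ts:
        # Case 1: queue and early arrivals cleared within te, later arrivals also pass on green,
        # and all arrivals during Ts are added back regardless of the signal.
        return L - g * O * te - g * I * R * (Ts - te) + I * R * Ts
    # Case 2 (te < 0 or te > Ts): vehicles cannot be fully cleared within the slice.
    return L - g * O * Ts + I * R * Ts
```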

Generally, a collision will occur if more than one vehicle stream crosses the intersection simultaneously. Thus, the crossing rate Y[(w_i^1, z_j^1), (w_i^2, z_j^2)] is defined to determine to what degree the speed of the traffic flow on the first bent track (w_i^1, z_j^1) is affected by the traffic flow on the other bent track (w_i^2, z_j^2). The variable O_{w_i^1 z_j^1, 0} is the original speed of the vehicle stream on the first bent track (w_i^1, z_j^1) without any interference from other bent tracks. So we have:

O_{w_i^1 z_j^1,\, w_i^2 z_j^2} = O_{w_i^1 z_j^1,\, 0} \prod_{(w_i^2, z_j^2)} Y[(w_i^1, z_j^1), (w_i^2, z_j^2)]   (3.7)
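A direct reading of (3.7) is a product over all interfering bent tracks. The sketch below scales the interference-free speed by the crossing rate of every conflicting stream; the container types and the convention that a missing pair means no interference (factor 1.0) are our own assumptions.

```python
from typing import Dict, List, Tuple

Path = Tuple[str, str]   # a bent track (w_i, z_j), e.g. ("w1", "z3")

def effective_outgoing_speed(track: Path,
                             base_speed: float,
                             crossing_rate: Dict[Tuple[Path, Path], float],
                             active_tracks: List[Path]) -> float:
    """Apply (3.7): scale O_{track,0} by Y[track, other] for every other active bent track."""
    speed = base_speed                                        # O_{w^1 z^1, 0}: speed without interference
    for other in active_tracks:
        if other == track:
            continue
        speed *= crossing_rate.get((track, other), 1.0)       # Y[(w^1,z^1),(w^2,z^2)]; 1.0 = no conflict
    return speed
```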