
Algorithm 4.1 Q-learning with a single agent

Initialization: Initialize Q-function Q_0, state s_0, number of time steps T and attenuation rate β;
while time step t < T do
  ε = e^{-βt};
  Choose a random number μ ∈ [0, 1];
  if μ ∈ [ε, 1] then
    a_t = arg max_a Q(s_t, a);
  else
    a_t = RandomAction;
  end
  Observe reward r_t and the next state s_{t+1};
  Update Q_{t+1}(s_t, a_t) by (2.13);
  t = t + 1;
end
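For concreteness, the loop of Algorithm 4.1 can be sketched in Python as a tabular ε-greedy learner. The environment interface (reset_env, step_env), the table sizes and the hyperparameter values are illustrative assumptions, and the update line stands in for the standard tabular rule referred to as (2.13):

```python
import numpy as np

def q_learning_single_agent(step_env, reset_env, num_states, num_actions,
                            T=10000, beta=1e-3, alpha=0.1, gamma=0.95, seed=0):
    """Tabular epsilon-greedy Q-learning with exponentially decaying exploration."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((num_states, num_actions))   # Q_0 initialized to zero
    s = reset_env()                            # initial state s_0
    for t in range(T):
        eps = np.exp(-beta * t)                # epsilon = e^{-beta t}
        if rng.random() >= eps:                # mu in [eps, 1]: exploit
            a = int(np.argmax(Q[s]))
        else:                                  # otherwise explore
            a = int(rng.integers(num_actions))
        r, s_next = step_env(s, a)             # observe reward and next state
        # standard tabular update, assumed to correspond to (2.13)
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next
    return Q
```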

r_{w_i z_j, t} =
\begin{cases}
  \dfrac{C_{w_i z_j}}{L_{w_i z_j, t}}, & g_{w_i z_j} = 1 & \text{(4.4a)} \\
  0, & g_{w_i z_j} = 0 & \text{(4.4b)}
\end{cases}

If centralized control with a single agent is used, the cumulative reward function at time step t sums up all the reward values in (4.4) and is used as the feedback for the state s_t and the corresponding action a_t:

r_t(s_t, a_t) = \sum_{i=1}^{4} \sum_{j=1}^{4} r_{w_i z_j, t} \qquad \text{(4.5)}

The RL algorithm is shown in Algorithm 4.1. Simulations of Single-agent Reinforcement Learning (SARL) with a centralized agent are also conducted in Section 4.3.
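A minimal sketch of how the centralized reward (4.5) could be assembled from the per-movement rewards (4.4) is given below; the 4×4 array layout for C_{w_i z_j}, L_{w_i z_j,t} and the signal indicators g_{w_i z_j} is an assumption for illustration:

```python
import numpy as np

def centralized_reward(C, L, g, eps=1e-6):
    """Sum of per-movement rewards (4.4) over all i, j as in (4.5).

    C, L, g are 4x4 arrays indexed by (incoming link w_i, outgoing link z_j):
    C   -- constant term C_{w_i z_j} (assumed given),
    L   -- current queue/cost term L_{w_i z_j, t},
    g   -- 1 if the movement has a green signal, 0 otherwise.
    """
    per_movement = np.where(g == 1, C / np.maximum(L, eps), 0.0)  # (4.4a)/(4.4b)
    return float(per_movement.sum())                              # (4.5)
```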

Algorithm 4.2 Q-learning with multiple agents

Initialization: Initialize an arbitrary Q-function Q_{i,0}(S, a_1, ..., a_n) for every agent i, state S_0, number of time steps T and attenuation rate β;
while time step t < T do
  ε = e^{-βt};
  Choose a random number μ ∈ [0, 1];
  if μ ∈ [ε, 1] then
    Choose the joint actions a_{i,t}, i = 1, 2, ..., 4 by using the Nash/Stackelberg equilibrium with cooperative behaviour in (4.7);
  else
    Choose random joint actions a_{i,t}, i = 1, 2, ..., 4;
  end
  Observe r_{1,t}, r_{2,t}, ..., r_{4,t} and the next state S_{t+1};
  for every agent i, i = 1, 2, ..., 4 do
    Update Q_{i,t+1}(S_t, a_{1,t}, ..., a_{n,t}) by (4.6);
  end
  t = t + 1;
end

Q_{i,t+1}(S_t, a_{1,t}, \ldots, a_{n,t}) = (1 - \alpha)\, Q_{i,t}(S_t, a_{1,t}, \ldots, a_{n,t}) + \alpha \left[ r_{i,t} + \gamma \, \mathrm{Nash/Stack}\, Q_{i,t}(S_{t+1}, a_1, \ldots, a_n) \right] \qquad \text{(4.6)}

where Q_{i,t}(S_t, a_{1,t}, ..., a_{n,t}) is the Q-value of the ith agent when all the agents take the joint action a_{1,t}, ..., a_{n,t} in the global state S_t = (s_{1,t}, s_{2,t}, ..., s_{n,t}); similarly, the unique Nash/Stackelberg solution for the Q-values of the ith agent, i = 1, ..., 4, in state S_{t+1} is represented by Nash/Stack Q_{i,t}(S_{t+1}, a_1, ..., a_n). Besides, r_{i,t} is the reward received by the ith agent in state S_t. The multi-agent Q-function differs from the single-agent Q-function in (2.13) in that each agent must observe not only its own reward but also the rewards of the other agents to obtain the feedback of the equilibrium solutions. Moreover, it does not update all entries of the Q-function, only the entry corresponding to the current state and the actions chosen by the agents, which is called asynchronous updating (Hu and Wellman, 2003).
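The asynchronous update (4.6) of a single Q-table entry can be sketched as follows; the dictionary layout of the Q-tables and the equilibrium_value helper (standing in for Nash Q_{i,t} or Stack Q_{i,t} of (4.10) and (4.12)) are illustrative assumptions:

```python
def multi_agent_q_update(Q, i, S_t, joint_action, r_i, S_next,
                         equilibrium_value, alpha=0.1, gamma=0.95):
    """Asynchronous update (4.6) of agent i's Q-table entry.

    Q                 -- dict: Q[i][(state, joint_action)] -> value (assumed layout)
    equilibrium_value -- callable returning the Nash/Stackelberg Q-value of agent i
                         at S_next (hypothetical helper; see (4.10) and (4.12))
    Only the entry for (S_t, joint_action) is touched (asynchronous updating).
    """
    key = (S_t, joint_action)
    old = Q[i].get(key, 0.0)
    eq_val = equilibrium_value(Q, i, S_next)
    Q[i][key] = (1 - alpha) * old + alpha * (r_i + gamma * eq_val)
    return Q[i][key]
```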

The SNASBQ algorithm is shown in Algorithm 4.2.

Nash equilibrium selection

In a game-theoretical framework, the incoming traffic links are commonly treated as players who make decisions according to the distribution of vehicles at the crossroads. The status of the signal light (green or red) can be considered as the decision made by these players, which constructs the game-theoretical model of the traffic intersection. The goal of these players is to move toward the balanced point where each player obtains the maximum benefit (the lowest cost value in this case). The game-theoretical framework can be referred to in Sections 2.1 and 3.2.

The situation where more than one Nash equilibrium exists should also be considered. In this case, the non-cooperative game is informationally non-unique, which makes the decision-selection process harder. Therefore, after computing the combinations of Nash equilibrium solutions, the players strive for a common goal to select the unique Nash equilibrium solution with cooperative behaviour, as in (4.7):

k^* = \arg\min_{k} \left( \omega_1 \sum_{i=1}^{n} \left| J_{w_i}^{*,k} - \bar{J}^{k} \right| + \omega_2 \bar{J}^{k} \right) \qquad \text{(4.7)}

where k is the index of the Nash equilibrium solutions, k^* is the index of the unique optimal Nash equilibrium solution, n is the number of players and J_{w_i}^{*,k} is the cost of player i in the kth Nash equilibrium. Also, \bar{J}^{k} = \frac{1}{n}\sum_{i=1}^{n} J_{w_i}^{*,k} is the average cost of the players in the kth Nash equilibrium, and ω_1 and ω_2 are weight factors. The unique Nash equilibrium is the one that minimizes, over the Nash equilibrium combinations k, the weighted sum of the absolute deviations of the players' costs from their average cost plus the weighted average cost itself. Finally, the unique Nash equilibrium combination of decisions for the players is (d_{w_1}^{*,k^*}, d_{w_2}^{*,k^*}, d_{w_3}^{*,k^*}, d_{w_4}^{*,k^*}).
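The selection rule (4.7) reduces to a small scoring loop over the candidate equilibria. The sketch below assumes the cost vectors J_{w_i}^{*,k} of the candidates have already been computed; the weights ω_1 and ω_2 are placeholders:

```python
import numpy as np

def select_unique_equilibrium(costs_per_equilibrium, w1=1.0, w2=1.0):
    """Pick the index k* that minimizes (4.7).

    costs_per_equilibrium -- list of length-n sequences; element k holds the
                             costs J_{w_i}^{*,k} of the n players in the k-th
                             Nash equilibrium (assumed precomputed).
    Returns the index k* of the unique cooperative Nash equilibrium.
    """
    scores = []
    for J in costs_per_equilibrium:
        J = np.asarray(J, dtype=float)
        J_bar = J.mean()                                  # average cost in equilibrium k
        scores.append(w1 * np.abs(J - J_bar).sum() + w2 * J_bar)
    return int(np.argmin(scores))
```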

Nash Q-Values

Nash Q_{i,t}(S_{t+1}, a_1, ..., a_n) can be derived from the Nash equilibrium equation (A.2). However, here it is used for finding the maximum Q-values of each agent instead of finding the minimum cost of each agent. The maximum Q-values of the agents are selected by the Nash equilibrium for updating in S_{t+1}, which is the combination of Nash equilibrium solutions corresponding to the optimal joint actions of all the agents:

Q_{1,t}(S_{t+1}, a_1^*, a_2^*, a_3^*, a_4^*) \geq Q_{1,t}(S_{t+1}, a_1, a_2^*, a_3^*, a_4^*) \qquad \text{(4.8a)}
Q_{2,t}(S_{t+1}, a_1^*, a_2^*, a_3^*, a_4^*) \geq Q_{2,t}(S_{t+1}, a_1^*, a_2, a_3^*, a_4^*) \qquad \text{(4.8b)}
Q_{3,t}(S_{t+1}, a_1^*, a_2^*, a_3^*, a_4^*) \geq Q_{3,t}(S_{t+1}, a_1^*, a_2^*, a_3, a_4^*) \qquad \text{(4.8c)}
Q_{4,t}(S_{t+1}, a_1^*, a_2^*, a_3^*, a_4^*) \geq Q_{4,t}(S_{t+1}, a_1^*, a_2^*, a_3^*, a_4) \qquad \text{(4.8d)}

where (a_1^*, a_2^*, a_3^*, a_4^*) is the optimal joint action of the Nash equilibrium solutions. In contrast to the cost-minimization formulation, the sign "≤" is replaced by "≥" since this is a maximization.
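In a tabular setting, condition (4.8) can be checked by testing every joint action against all unilateral deviations. The following sketch assumes the Q-values at S_{t+1} are available as dense arrays indexed by the joint action; it is a brute-force illustration rather than the equilibrium solver used in the thesis:

```python
import itertools
import numpy as np

def find_nash_joint_actions(Q_next):
    """Enumerate joint actions satisfying (4.8) at a fixed next state.

    Q_next -- list of n arrays; Q_next[i][a_1, ..., a_n] is Q_{i,t}(S_{t+1}, a)
              (assumed already sliced at S_{t+1}).
    Returns all joint actions from which no agent can unilaterally improve.
    """
    n = len(Q_next)
    action_sizes = Q_next[0].shape
    equilibria = []
    for joint in itertools.product(*[range(m) for m in action_sizes]):
        is_nash = True
        for i in range(n):
            for dev in range(action_sizes[i]):          # unilateral deviation of agent i
                alt = list(joint)
                alt[i] = dev
                if Q_next[i][tuple(alt)] > Q_next[i][joint]:
                    is_nash = False
                    break
            if not is_nash:
                break
        if is_nash:
            equilibria.append(joint)
    return equilibria
```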

In this case, more than one Nash equilibrium frequently exists in such a large space of Q-values. The cooperative operation among the agents from (4.7) should therefore be applied here as well, so that the index of the unique optimal joint actions can be obtained:

k^* = \arg\min_{k} \left( \omega_1 \sum_{i=1}^{n} \left| Q_{i,t}^{*,k} - \bar{Q}^{k} \right| + \omega_2 \bar{Q}^{k} \right) \qquad \text{(4.9)}

where k, k^*, n, ω_1 and ω_2 are defined as in (4.7), Q_{i,t}^{*,k} is the Q-value of agent i in the kth Nash equilibrium, and \bar{Q}^{k} = \frac{1}{n}\sum_{i=1}^{n} Q_{i,t}^{*,k} is the average Q-value of the agents in the kth Nash equilibrium at time step t. Thus, the unique Nash equilibrium of Q-values corresponding to the optimal joint actions (a_1^{*,k^*}, a_2^{*,k^*}, a_3^{*,k^*}, a_4^{*,k^*}) is Q_{i,t}^{*,k^*}, which is selected as Nash Q_{i,t}(S_{t+1}, a_1^*, a_2^*, a_3^*, a_4^*), i = 1, ..., 4, for updating the Q-function. It can be expressed as follows:

\mathrm{Nash}\, Q_{i,t}(S_{t+1}, a_1^*, a_2^*, a_3^*, a_4^*) = Q_{i,t}^{*,k^*}(S_{t+1}, a_1^{*,k^*}, a_2^{*,k^*}, a_3^{*,k^*}, a_4^{*,k^*}) \qquad \text{(4.10)}
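Putting (4.8)-(4.10) together, one possible way to obtain the Nash Q-values for the update (4.6) is sketched below; it reuses the hypothetical helpers from the previous two sketches:

```python
def nash_q_values(Q_next, w1=1.0, w2=1.0):
    """Combine (4.8)-(4.10): enumerate Nash joint actions, pick the unique one
    via (4.9), and return each agent's Nash Q-value at S_{t+1}.

    Reuses the hypothetical helpers find_nash_joint_actions and
    select_unique_equilibrium sketched above.
    """
    candidates = find_nash_joint_actions(Q_next)
    if not candidates:
        raise ValueError("no Nash equilibrium found in the Q-table slice")
    # Q-values of each agent under every candidate equilibrium; (4.9) applies
    # the same scoring rule as (4.7), now to Q-values instead of costs
    values_per_eq = [[Q_next[i][joint] for i in range(len(Q_next))]
                     for joint in candidates]
    k_star = select_unique_equilibrium(values_per_eq, w1, w2)
    joint_star = candidates[k_star]
    return joint_star, [Q_next[i][joint_star] for i in range(len(Q_next))]
```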

Stackelberg Q-Value

Unlike the updating of Nash Q-values, Stackelberg Q-values are updated over multiple hierarchical levels. Assume that agent 1 is the leader and the others are followers. Firstly, SSBQ uses the same idea as SNAQ to select the suboptimal actions of the followers, because the followers are on the same level. Therefore, the expected joint actions of the followers (a_2, a_3, a_4) can be chosen by the Nash equilibrium as in (4.8). Similarly, the unique Nash equilibrium solution (a_2^{*,k^*}, a_3^{*,k^*}, a_4^{*,k^*}) can be obtained from (4.9), which is represented by R(a_1) (i.e., the followers' response to the leader, agent 1).

Secondly, the optimal action of the leader is chosen with the maximum operator over the Q-value of the leader:

a_1^* = \arg\max_{a_1} Q_{1,t}(S_{t+1}, a_1, R(a_1)) \qquad \text{(4.11)}

Finally, the optimal joint actions of the followers corresponding to the optimal action of the leader are selected from (a_2^{*,k^*}, a_3^{*,k^*}, a_4^{*,k^*}), i.e., R(a_1^*). Thus, Stack Q_{i,t} can be updated with the selected joint actions of the followers and the leader together:

\mathrm{Stack}\, Q_{i,t}(S_{t+1}, a_1^*, a_2^*, a_3^*, a_4^*) = Q_{i,t}(S_{t+1}, a_1^*, R(a_1^*)) \qquad \text{(4.12)}
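A corresponding sketch of (4.11)-(4.12) with agent 1 (index 0) as the leader is shown below; the followers' response R(a_1) is approximated by reusing the Nash selection helper on the followers' Q-slices, which is one possible reading of the procedure rather than the exact implementation:

```python
import numpy as np

def stackelberg_q_values(Q_next, w1=1.0, w2=1.0):
    """Sketch of (4.11)-(4.12) with agent 0 as the leader.

    Q_next -- list of 4 arrays, Q_next[i][a_1, a_2, a_3, a_4] at fixed S_{t+1}.
    For every leader action a_1, the followers' response R(a_1) is taken as the
    unique Nash equilibrium of the followers' Q-slices (via the helpers above);
    the leader then maximizes its own Q-value over a_1 as in (4.11).
    """
    n_leader_actions = Q_next[0].shape[0]
    best_a1, best_val, best_resp = None, -np.inf, None
    for a1 in range(n_leader_actions):
        follower_slices = [Q_next[i][a1] for i in range(1, 4)]   # followers' sub-game
        resp, _ = nash_q_values(follower_slices, w1, w2)          # R(a_1)
        val = Q_next[0][(a1,) + tuple(resp)]                      # leader's Q-value
        if val > best_val:
            best_a1, best_val, best_resp = a1, val, resp
    joint_star = (best_a1,) + tuple(best_resp)
    return joint_star, [Q_next[i][joint_star] for i in range(4)]  # (4.12)
```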

There are some implementation distinctions between SNAQ and SSBQ in certain applications due to the difference in updating the Q-values. Instead of calculating the Nash equilibrium solutions to find the optimal joint actions over the action space of all four agents as in (4.8), the equilibrium computation for SSBQ is significantly smaller, since it only involves the action space of three agents. Thus, SSBQ can be utilized in systems with limited hardware resources. Another point is that if the agents are in an extremely imbalanced condition, e.g., one incoming link has many more vehicles than the other incoming links of the intersection, then SSBQ should perform better thanks to its multiple hierarchical levels.

Complexity Analysis

The multi-agent approach maintains n Q-functions, one for each of the n agents. Each agent maintains one Q-function (Q_1, ..., Q_n), in which the agents can observe the actions and rewards of the others. Each Q-function Q_i, i = 1, ..., 4, stores Q_i(s_1, ..., s_n, a_1, ..., a_n) over the whole state and action space. The number of states of agent i is denoted by |s_i|, so the size of the state space is |S| = |s_1||s_2|...|s_n|; the number of actions of agent i is denoted by |a_i|, so the size of the joint action space is |A| = |a_1||a_2|...|a_n|. Since there are n agents, and hence n Q-functions, the space complexity O(n) of this system is linear in the number of agents and polynomial in the state and action spaces:

O(n) = n|S||A| = n\,|s_1||s_2|\cdots|s_n|\,|a_1||a_2|\cdots|a_n| \qquad \text{(4.13)}

where n = 4, since there are four agents in this system. The number of states of each agent, |s_i|, which can be found in Table 4.1, is the same for all agents, i.e., |s_1| = |s_2| = ... = |s_n|, and |S| = 16^n = 65536. The original action space of the agents is given in (4.3), and the action space is the same for each agent, i.e., |a_1| = |a_2| = ... = |a_n|. However, to compress the size of the action space, only the permissible actions are selected as the real action space based on Table 3.2, as mentioned before. The computational time complexity is mainly determined by calculating and updating the Q-functions, which is hard to quantify since the computational time of finding an equilibrium solution is unknown.
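As a quick arithmetic check of (4.13), the sketch below counts the Q-table entries; |s_i| = 16 follows Table 4.1, while the per-agent action count is a placeholder since only the permissible actions of Table 3.2 are kept:

```python
def q_table_entries(n_agents=4, states_per_agent=16, actions_per_agent=2):
    """Total number of Q-table entries n*|S|*|A| from (4.13).

    states_per_agent = 16 follows |s_i| = 16 (Table 4.1, so |S| = 16^n = 65536);
    actions_per_agent is illustrative, since only the permissible actions of
    Table 3.2 form the real action space.
    """
    S = states_per_agent ** n_agents      # |S| = |s_1|...|s_n|
    A = actions_per_agent ** n_agents     # |A| = |a_1|...|a_n|
    return n_agents * S * A

# e.g. q_table_entries() -> 4 * 65536 * 16 = 4194304 entries
```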

Convergence

After the learning process, each agent i should converge to an equilibrium Q-function Q_i^*, and the joint strategies of all agents determine the Q-values. The convergence theorem is stated as follows:

Theorem 4.1 [(Hu and Wellman, 2003), Nash-Q convergence theorem]. Let (Q_1, Q_2, ..., Q_n) denote the Q-functions of all the agents (1, 2, ..., n). The Q-functions (Q_1, Q_2, ..., Q_n) converge to (Q_1^*, Q_2^*, ..., Q_n^*) for any agent i ∈ Ω_N = {1, 2, ..., n} iff the following Assumptions 4.1 and 4.2 and Corollary 4.1 are satisfied:

Assumption 4.1. Every state s ∈ S and action a ∈ A of each agent are visited infinitely often.

Assumption 4.2. The learning rate α_t satisfies the following conditions for all s_1, ..., s_n and a_1, ..., a_n:

1. 0 ≤ α_t < 1, \sum_{t=0}^{\infty} \alpha_t = \infty, \sum_{t=0}^{\infty} \alpha_t^2 < \infty,

2. α_t(s_1, ..., s_n, a_1, ..., a_n) = 0 if (s_1, ..., s_n, a_1, ..., a_n) ≠ (s_{1,t}, ..., s_{n,t}, a_{1,t}, ..., a_{n,t}).
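A common learning-rate schedule consistent with Assumption 4.2 (a standard choice, not prescribed by the thesis) sets α_t = 1/(1 + N(s, a)) for the visited state-action entry and 0 for every other entry:

```python
from collections import defaultdict

def make_learning_rate():
    """Schedule consistent with Assumption 4.2: alpha_t = 1 / (1 + N(s, a)) for
    the visited entry, 0 elsewhere (condition 2).

    Along the visits of each entry, 1/(1+N) gives a divergent sum and a
    convergent sum of squares (harmonic vs. squared-harmonic series).
    """
    visits = defaultdict(int)

    def alpha(state, joint_action, visited_state, visited_action):
        if (state, joint_action) != (visited_state, visited_action):
            return 0.0                      # untouched entries get zero step size
        visits[(state, joint_action)] += 1
        return 1.0 / (1.0 + visits[(state, joint_action)])

    return alpha
```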

Corollary 4.1 [(Szepesvári and Littman, 1999), Corollary 5]. Let P_t : \mathcal{Q} \rightarrow \mathcal{Q} be a pseudo-contraction operator, and let there exist a number 0 < γ < 1 and a sequence λ_t ≥ 0 converging to 0 with probability 1. Then the following Q-function converges to Q^* with probability 1:

Q_{t+1} = (1 - \alpha_t) Q_t + \alpha_t [P_t Q_t] \qquad \text{(4.14)}

iff α_t satisfies Assumption 4.2 and the following condition holds:

\| P_t Q - P_t Q^* \| \leq \gamma \| Q - Q^* \| + \lambda_t, \quad \forall Q \in \mathcal{Q} \text{ and } Q^* = E[P_t Q^*] \qquad \text{(4.15)}

Here P_t is the pseudo-contraction operator which maps two points closer to each other in the space, i.e., \| P_t Q - P_t \hat{Q} \| \leq \gamma \| Q - \hat{Q} \|, ∀ Q, \hat{Q} ∈ \mathcal{Q}. The proof of convergence of the general Nash Q-learning process is given in (Hu and Wellman, 2003).