
Algorithm 4.1 Q-learning with a single agent

Initialization: Initialize Q-function Q_0, state s_0, number of time steps T and attenuation rate β;
while time step t < T do
  ε = e^{-βt};
  Choose a random number μ ∈ [0, 1];
  if μ ∈ [ε, 1] then
    a_t = arg max_a Q(s_t, a);
  else
    a_t = RandomAction;
  end
  Observe reward r_t and the next state s_{t+1};
  Update Q_{t+1}(s_t, a_t) by (2.13);
  t = t + 1;
end
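For concreteness, the loop of Algorithm 4.1 can be sketched in Python as a tabular ε-greedy learner. The environment interface (reset_env, step_env), the table sizes and the hyperparameter values are illustrative assumptions, and the update line stands in for the standard tabular rule referred to as (2.13):

```python
import numpy as np

def q_learning_single_agent(step_env, reset_env, num_states, num_actions,
                            T=10000, beta=1e-3, alpha=0.1, gamma=0.95, seed=0):
    """Tabular epsilon-greedy Q-learning with exponentially decaying exploration."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((num_states, num_actions))   # Q_0 initialized to zero
    s = reset_env()                            # initial state s_0
    for t in range(T):
        eps = np.exp(-beta * t)                # epsilon = e^{-beta t}
        if rng.random() >= eps:                # mu in [eps, 1]: exploit
            a = int(np.argmax(Q[s]))
        else:                                  # otherwise explore
            a = int(rng.integers(num_actions))
        r, s_next = step_env(s, a)             # observe reward and next state
        # standard tabular update, assumed to correspond to (2.13)
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next
    return Q
```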

r_{w_i z_j, t} =
\begin{cases}
  \dfrac{C_{w_i z_j}}{L_{w_i z_j, t}}, & g_{w_i z_j} = 1 & \text{(4.4a)} \\
  0, & g_{w_i z_j} = 0 & \text{(4.4b)}
\end{cases}

If centralized control with a single agent is used, the cumulative reward function at time step t sums up all the reward values in (4.4) and is used as the feedback for the state s_t and the corresponding action a_t:

r_t(s_t, a_t) = \sum_{i=1}^{4} \sum_{j=1}^{4} r_{w_i z_j, t} \qquad \text{(4.5)}

The RL algorithm is shown in Algorithm 4.1. Simulations of Single-agent Reinforcement Learning (SARL) with a centralized agent are also conducted in Section 4.3.
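A minimal sketch of how the centralized reward (4.5) could be assembled from the per-movement rewards (4.4) is given below; the 4×4 array layout for C_{w_i z_j}, L_{w_i z_j,t} and the signal indicators g_{w_i z_j} is an assumption for illustration:

```python
import numpy as np

def centralized_reward(C, L, g, eps=1e-6):
    """Sum of per-movement rewards (4.4) over all i, j as in (4.5).

    C, L, g are 4x4 arrays indexed by (incoming link w_i, outgoing link z_j):
    C   -- constant term C_{w_i z_j} (assumed given),
    L   -- current queue/cost term L_{w_i z_j, t},
    g   -- 1 if the movement has a green signal, 0 otherwise.
    """
    per_movement = np.where(g == 1, C / np.maximum(L, eps), 0.0)  # (4.4a)/(4.4b)
    return float(per_movement.sum())                              # (4.5)
```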

Algorithm 4.2 Q-learning with multiple agents

Initialization: Initialize an arbitrary Q-function Q_{i,0}(S, a_1, ..., a_n) for every agent i, state S_0, number of time steps T and attenuation rate β;
while time step t < T do
  ε = e^{-βt};
  Choose a random number μ ∈ [0, 1];
  if μ ∈ [ε, 1] then
    Choose the joint actions a_{i,t}, i = 1, 2, ..., 4 by using the Nash/Stackelberg equilibrium with cooperative behaviour in (4.7);
  else
    Choose random joint actions a_{i,t}, i = 1, 2, ..., 4;
  end
  Observe r_{1,t}, r_{2,t}, ..., r_{4,t} and the next state S_{t+1};
  for every agent i, i = 1, 2, ..., 4 do
    Update Q_{i,t+1}(S_t, a_{1,t}, ..., a_{n,t}) by (4.6);
  end
  t = t + 1;
end

Q_{i,t+1}(S_t, a_{1,t}, \ldots, a_{n,t}) = (1 - \alpha)\, Q_{i,t}(S_t, a_{1,t}, \ldots, a_{n,t}) + \alpha \left[ r_{i,t} + \gamma \, \mathrm{Nash/Stack}\, Q_{i,t}(S_{t+1}, a_1, \ldots, a_n) \right] \qquad \text{(4.6)}

where Q_{i,t}(S_t, a_{1,t}, ..., a_{n,t}) is the Q-value of the ith agent when all the agents take the joint action a_{1,t}, ..., a_{n,t} in the global state S_t = (s_{1,t}, s_{2,t}, ..., s_{n,t}); similarly, the unique Nash/Stackelberg solution for the Q-values of the ith agent, i = 1, ..., 4, in state S_{t+1} is represented by Nash/Stack Q_{i,t}(S_{t+1}, a_1, ..., a_n). Besides, r_{i,t} is the reward received by the ith agent in state S_t. The multi-agent Q-function differs from the single-agent Q-function in (2.13) in that each agent must observe not only its own reward but also the rewards of the other agents to obtain the feedback of the equilibrium solutions. Moreover, it does not update all entries of the Q-function, only the entry corresponding to the current state and the actions chosen by the agents, which is called asynchronous updating (Hu and Wellman, 2003).
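The asynchronous update (4.6) of a single Q-table entry can be sketched as follows; the dictionary layout of the Q-tables and the equilibrium_value helper (standing in for Nash Q_{i,t} or Stack Q_{i,t} of (4.10) and (4.12)) are illustrative assumptions:

```python
def multi_agent_q_update(Q, i, S_t, joint_action, r_i, S_next,
                         equilibrium_value, alpha=0.1, gamma=0.95):
    """Asynchronous update (4.6) of agent i's Q-table entry.

    Q                 -- dict: Q[i][(state, joint_action)] -> value (assumed layout)
    equilibrium_value -- callable returning the Nash/Stackelberg Q-value of agent i
                         at S_next (hypothetical helper; see (4.10) and (4.12))
    Only the entry for (S_t, joint_action) is touched (asynchronous updating).
    """
    key = (S_t, joint_action)
    old = Q[i].get(key, 0.0)
    eq_val = equilibrium_value(Q, i, S_next)
    Q[i][key] = (1 - alpha) * old + alpha * (r_i + gamma * eq_val)
    return Q[i][key]
```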

The SNASBQ algorithm is shown in Algorithm 4.2.

Nash equilibrium selection

In a game-theoretical framework, the incoming traffic links are commonly treated as players who make decisions according to the distribution of vehicles at the crossroads. The status of the signal light (green or red) can be considered as the decision made by these players, which constructs the game-theoretical model of the traffic intersection. The goal of these players is to move toward the balanced point where each player obtains the maximum benefit (the lowest cost value in this case). The game-theoretical framework can be referred to in Sections 2.1 and 3.2.

The situation where more than one Nash equilibrium exists should also be considered. In this case, the non-cooperative game is informationally non-unique, which makes the decision-selection process harder. Therefore, after computing the combinations of Nash equilibrium solutions, the players strive for a common goal to select the unique Nash equilibrium solution with cooperative behaviour, as in (4.7):

k^* = \arg\min_{k} \left( \omega_1 \sum_{i=1}^{n} \left| J_{w_i}^{*,k} - \bar{J}^{k} \right| + \omega_2 \bar{J}^{k} \right) \qquad \text{(4.7)}

where k is the index of the Nash equilibrium solutions, k^* is the index of the unique optimal Nash equilibrium solution, n is the number of players and J_{w_i}^{*,k} is the cost of player i in the kth Nash equilibrium. Also, \bar{J}^{k} = \frac{1}{n}\sum_{i=1}^{n} J_{w_i}^{*,k} is the average cost of the players in the kth Nash equilibrium, and ω_1 and ω_2 are weight factors. The unique Nash equilibrium is the one that minimizes, over the Nash equilibrium combinations k, the weighted sum of the absolute deviations of the players' costs from their average cost plus the weighted average cost itself. Finally, the unique Nash equilibrium combination of decisions for the players is (d_{w_1}^{*,k^*}, d_{w_2}^{*,k^*}, d_{w_3}^{*,k^*}, d_{w_4}^{*,k^*}).
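The selection rule (4.7) reduces to a small scoring loop over the candidate equilibria. The sketch below assumes the cost vectors J_{w_i}^{*,k} of the candidates have already been computed; the weights ω_1 and ω_2 are placeholders:

```python
import numpy as np

def select_unique_equilibrium(costs_per_equilibrium, w1=1.0, w2=1.0):
    """Pick the index k* that minimizes (4.7).

    costs_per_equilibrium -- list of length-n sequences; element k holds the
                             costs J_{w_i}^{*,k} of the n players in the k-th
                             Nash equilibrium (assumed precomputed).
    Returns the index k* of the unique cooperative Nash equilibrium.
    """
    scores = []
    for J in costs_per_equilibrium:
        J = np.asarray(J, dtype=float)
        J_bar = J.mean()                                  # average cost in equilibrium k
        scores.append(w1 * np.abs(J - J_bar).sum() + w2 * J_bar)
    return int(np.argmin(scores))
```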

Nash Q-Values

Nash Q_{i,t}(S_{t+1}, a_1, ..., a_n) can be derived from the Nash equilibrium equation (A.2). However, here it is used for finding the maximum Q-values of each agent instead of finding the minimum cost of each agent. The maximum Q-values of the agents are selected by the Nash equilibrium for updating in S_{t+1}, which is the combination of Nash equilibrium solutions corresponding to the optimal joint actions of all the agents:

Q_{1,t}(S_{t+1}, a_1^*, a_2^*, a_3^*, a_4^*) \geq Q_{1,t}(S_{t+1}, a_1, a_2^*, a_3^*, a_4^*) \qquad \text{(4.8a)}
Q_{2,t}(S_{t+1}, a_1^*, a_2^*, a_3^*, a_4^*) \geq Q_{2,t}(S_{t+1}, a_1^*, a_2, a_3^*, a_4^*) \qquad \text{(4.8b)}
Q_{3,t}(S_{t+1}, a_1^*, a_2^*, a_3^*, a_4^*) \geq Q_{3,t}(S_{t+1}, a_1^*, a_2^*, a_3, a_4^*) \qquad \text{(4.8c)}
Q_{4,t}(S_{t+1}, a_1^*, a_2^*, a_3^*, a_4^*) \geq Q_{4,t}(S_{t+1}, a_1^*, a_2^*, a_3^*, a_4) \qquad \text{(4.8d)}

where (a_1^*, a_2^*, a_3^*, a_4^*) is the optimal joint action of the Nash equilibrium solutions. In contrast to the cost-minimization formulation, the sign "≤" is replaced by "≥" since this is a maximization.
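In a tabular setting, condition (4.8) can be checked by testing every joint action against all unilateral deviations. The following sketch assumes the Q-values at S_{t+1} are available as dense arrays indexed by the joint action; it is a brute-force illustration rather than the equilibrium solver used in the thesis:

```python
import itertools
import numpy as np

def find_nash_joint_actions(Q_next):
    """Enumerate joint actions satisfying (4.8) at a fixed next state.

    Q_next -- list of n arrays; Q_next[i][a_1, ..., a_n] is Q_{i,t}(S_{t+1}, a)
              (assumed already sliced at S_{t+1}).
    Returns all joint actions from which no agent can unilaterally improve.
    """
    n = len(Q_next)
    action_sizes = Q_next[0].shape
    equilibria = []
    for joint in itertools.product(*[range(m) for m in action_sizes]):
        is_nash = True
        for i in range(n):
            for dev in range(action_sizes[i]):          # unilateral deviation of agent i
                alt = list(joint)
                alt[i] = dev
                if Q_next[i][tuple(alt)] > Q_next[i][joint]:
                    is_nash = False
                    break
            if not is_nash:
                break
        if is_nash:
            equilibria.append(joint)
    return equilibria
```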

In this case, more than one Nash equilibrium frequently exists in such a large space of Q-values. The cooperative operation among the agents from (4.7) should therefore be applied here as well, so that the index of the unique optimal joint actions can be obtained:

k^* = \arg\min_{k} \left( \omega_1 \sum_{i=1}^{n} \left| Q_{i,t}^{*,k} - \bar{Q}^{k} \right| + \omega_2 \bar{Q}^{k} \right) \qquad \text{(4.9)}

where k, k^*, n, ω_1 and ω_2 are defined as in (4.7), Q_{i,t}^{*,k} is the Q-value of agent i in the kth Nash equilibrium, and \bar{Q}^{k} = \frac{1}{n}\sum_{i=1}^{n} Q_{i,t}^{*,k} is the average Q-value of the agents in the kth Nash equilibrium at time step t. Thus, the unique Nash equilibrium of Q-values corresponding to the optimal joint actions (a_1^{*,k^*}, a_2^{*,k^*}, a_3^{*,k^*}, a_4^{*,k^*}) is Q_{i,t}^{*,k^*}, which is selected as Nash Q_{i,t}(S_{t+1}, a_1^*, a_2^*, a_3^*, a_4^*), i = 1, ..., 4, for updating the Q-function. It can be expressed as follows:

\mathrm{Nash}\, Q_{i,t}(S_{t+1}, a_1^*, a_2^*, a_3^*, a_4^*) = Q_{i,t}^{*,k^*}(S_{t+1}, a_1^{*,k^*}, a_2^{*,k^*}, a_3^{*,k^*}, a_4^{*,k^*}) \qquad \text{(4.10)}
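Putting (4.8)-(4.10) together, one possible way to obtain the Nash Q-values for the update (4.6) is sketched below; it reuses the hypothetical helpers from the previous two sketches:

```python
def nash_q_values(Q_next, w1=1.0, w2=1.0):
    """Combine (4.8)-(4.10): enumerate Nash joint actions, pick the unique one
    via (4.9), and return each agent's Nash Q-value at S_{t+1}.

    Reuses the hypothetical helpers find_nash_joint_actions and
    select_unique_equilibrium sketched above.
    """
    candidates = find_nash_joint_actions(Q_next)
    if not candidates:
        raise ValueError("no Nash equilibrium found in the Q-table slice")
    # Q-values of each agent under every candidate equilibrium; (4.9) applies
    # the same scoring rule as (4.7), now to Q-values instead of costs
    values_per_eq = [[Q_next[i][joint] for i in range(len(Q_next))]
                     for joint in candidates]
    k_star = select_unique_equilibrium(values_per_eq, w1, w2)
    joint_star = candidates[k_star]
    return joint_star, [Q_next[i][joint_star] for i in range(len(Q_next))]
```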

Stackelberg Q-Value

Unlike the updating of Nash Q-values, Stackelberg Q-values are updated over multiple hierarchical levels. Assume that agent 1 is the leader and the others are followers. Firstly, SSBQ uses the same idea as SNAQ to select the suboptimal actions of the followers, because the followers are on the same level. Therefore, the expected joint actions of the followers (a_2, a_3, a_4) can be chosen by the Nash equilibrium as in (4.8). Similarly, the unique Nash equilibrium solution (a_2^{*,k^*}, a_3^{*,k^*}, a_4^{*,k^*}) can be obtained from (4.9), which is represented by R(a_1) (i.e., the followers' response to the leader, agent 1).

Secondly, the optimal action of the leader is chosen with the maximum operator over the Q-value of the leader:

a_1^* = \arg\max_{a_1} Q_{1,t}(S_{t+1}, a_1, R(a_1)) \qquad \text{(4.11)}

Finally, the optimal joint actions of the followers corresponding to the optimal action of the leader are selected from (a_2^{*,k^*}, a_3^{*,k^*}, a_4^{*,k^*}), i.e., R(a_1^*). Thus, Stack Q_{i,t} can be updated with the selected joint actions of the followers and the leader together:

\mathrm{Stack}\, Q_{i,t}(S_{t+1}, a_1^*, a_2^*, a_3^*, a_4^*) = Q_{i,t}(S_{t+1}, a_1^*, R(a_1^*)) \qquad \text{(4.12)}
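A corresponding sketch of (4.11)-(4.12) with agent 1 (index 0) as the leader is shown below; the followers' response R(a_1) is approximated by reusing the Nash selection helper on the followers' Q-slices, which is one possible reading of the procedure rather than the exact implementation:

```python
import numpy as np

def stackelberg_q_values(Q_next, w1=1.0, w2=1.0):
    """Sketch of (4.11)-(4.12) with agent 0 as the leader.

    Q_next -- list of 4 arrays, Q_next[i][a_1, a_2, a_3, a_4] at fixed S_{t+1}.
    For every leader action a_1, the followers' response R(a_1) is taken as the
    unique Nash equilibrium of the followers' Q-slices (via the helpers above);
    the leader then maximizes its own Q-value over a_1 as in (4.11).
    """
    n_leader_actions = Q_next[0].shape[0]
    best_a1, best_val, best_resp = None, -np.inf, None
    for a1 in range(n_leader_actions):
        follower_slices = [Q_next[i][a1] for i in range(1, 4)]   # followers' sub-game
        resp, _ = nash_q_values(follower_slices, w1, w2)          # R(a_1)
        val = Q_next[0][(a1,) + tuple(resp)]                      # leader's Q-value
        if val > best_val:
            best_a1, best_val, best_resp = a1, val, resp
    joint_star = (best_a1,) + tuple(best_resp)
    return joint_star, [Q_next[i][joint_star] for i in range(4)]  # (4.12)
```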

There are some implementation distinctions between SNAQ and SSBQ in certain applications due to the difference in updating the Q-values. Instead of calculating the Nash equilibrium solutions to find the optimal joint actions over the action space of all four agents as in (4.8), the equilibrium computation for SSBQ is significantly smaller, since it only involves the action space of three agents. Thus, SSBQ can be utilized in systems with limited hardware resources. Another point is that if the agents are in an extremely imbalanced condition, e.g., one incoming link has many more vehicles than the other incoming links of the intersection, then SSBQ should perform better thanks to its multiple hierarchical levels.

Complexity Analysis

The multi-agent approach maintains n Q-functions, one for each of the n agents. Each agent maintains one Q-function (Q_1, ..., Q_n), in which the agents can observe the actions and rewards of the others. Each Q-function Q_i, i = 1, ..., 4, stores Q_i(s_1, ..., s_n, a_1, ..., a_n) over the whole state and action space. The number of states of agent i is denoted by |s_i|, so the size of the state space is |S| = |s_1||s_2|...|s_n|; the number of actions of agent i is denoted by |a_i|, so the size of the joint action space is |A| = |a_1||a_2|...|a_n|. Since there are n agents, and hence n Q-functions, the space complexity O(n) of this system is linear in the number of agents and polynomial in the state and action spaces:

O(n) = n|S||A| = n\,|s_1||s_2|\cdots|s_n|\,|a_1||a_2|\cdots|a_n| \qquad \text{(4.13)}

where n = 4, since there are four agents in this system. The number of states of each agent, |s_i|, which can be found in Table 4.1, is the same for all agents, i.e., |s_1| = |s_2| = ... = |s_n|, and |S| = 16^n = 65536. The original action space of the agents is given in (4.3), and the action space is the same for each agent, i.e., |a_1| = |a_2| = ... = |a_n|. However, to compress the size of the action space, only the permissible actions are selected as the real action space based on Table 3.2, as mentioned before. The computational time complexity is mainly determined by calculating and updating the Q-functions, which is hard to quantify since the computational time of finding an equilibrium solution is unknown.
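As a quick arithmetic check of (4.13), the sketch below counts the Q-table entries; |s_i| = 16 follows Table 4.1, while the per-agent action count is a placeholder since only the permissible actions of Table 3.2 are kept:

```python
def q_table_entries(n_agents=4, states_per_agent=16, actions_per_agent=2):
    """Total number of Q-table entries n*|S|*|A| from (4.13).

    states_per_agent = 16 follows |s_i| = 16 (Table 4.1, so |S| = 16^n = 65536);
    actions_per_agent is illustrative, since only the permissible actions of
    Table 3.2 form the real action space.
    """
    S = states_per_agent ** n_agents      # |S| = |s_1|...|s_n|
    A = actions_per_agent ** n_agents     # |A| = |a_1|...|a_n|
    return n_agents * S * A

# e.g. q_table_entries() -> 4 * 65536 * 16 = 4194304 entries
```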

Convergence

After the learning process, each agent i should converge to an equilibrium Q-function Q_i^*, and the joint strategies of all agents determine the Q-values. The convergence theorem is stated as follows:

Theorem 4.1 [(Hu and Wellman, 2003), Nash-Q convergence theorem]. Let (Q_1, Q_2, ..., Q_n) denote the Q-functions of all the agents (1, 2, ..., n). The Q-functions (Q_1, Q_2, ..., Q_n) converge to (Q_1^*, Q_2^*, ..., Q_n^*) for any agent i ∈ Ω_N = {1, 2, ..., n} iff the following Assumptions 4.1 and 4.2 and Corollary 4.1 are satisfied:

Assumption 4.1. Every state s ∈ S and action a ∈ A of each agent are visited infinitely often.

Assumption 4.2. The learning rate α_t satisfies the following conditions for all s_1, ..., s_n and a_1, ..., a_n:

1. 0 ≤ α_t < 1, \sum_{t=0}^{\infty} \alpha_t = \infty, \sum_{t=0}^{\infty} \alpha_t^2 < \infty,

2. α_t(s_1, ..., s_n, a_1, ..., a_n) = 0 if (s_1, ..., s_n, a_1, ..., a_n) ≠ (s_{1,t}, ..., s_{n,t}, a_{1,t}, ..., a_{n,t}).
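A common learning-rate schedule consistent with Assumption 4.2 (a standard choice, not prescribed by the thesis) sets α_t = 1/(1 + N(s, a)) for the visited state-action entry and 0 for every other entry:

```python
from collections import defaultdict

def make_learning_rate():
    """Schedule consistent with Assumption 4.2: alpha_t = 1 / (1 + N(s, a)) for
    the visited entry, 0 elsewhere (condition 2).

    Along the visits of each entry, 1/(1+N) gives a divergent sum and a
    convergent sum of squares (harmonic vs. squared-harmonic series).
    """
    visits = defaultdict(int)

    def alpha(state, joint_action, visited_state, visited_action):
        if (state, joint_action) != (visited_state, visited_action):
            return 0.0                      # untouched entries get zero step size
        visits[(state, joint_action)] += 1
        return 1.0 / (1.0 + visits[(state, joint_action)])

    return alpha
```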

Corollary 4.1 [(Szepesvári and Littman, 1999), Corollary 5]. Let P_t : \mathcal{Q} \rightarrow \mathcal{Q} be a pseudo-contraction operator, and let there exist a number 0 < γ < 1 and a sequence λ_t ≥ 0 converging to 0 with probability 1. Then the following Q-function converges to Q^* with probability 1:

Q_{t+1} = (1 - \alpha_t) Q_t + \alpha_t [P_t Q_t] \qquad \text{(4.14)}

iff α_t satisfies Assumption 4.2 and the following condition holds:

\| P_t Q - P_t Q^* \| \leq \gamma \| Q - Q^* \| + \lambda_t, \quad \forall Q \in \mathcal{Q} \text{ and } Q^* = E[P_t Q^*] \qquad \text{(4.15)}

Here P_t is the pseudo-contraction operator which maps two points closer to each other in the space, i.e., \| P_t Q - P_t \hat{Q} \| \leq \gamma \| Q - \hat{Q} \|, ∀ Q, \hat{Q} ∈ \mathcal{Q}. The proof of convergence of the general Nash Q-learning process is given in (Hu and Wellman, 2003).