Distributed Clustering of Linear Bandits in Peer to Peer Networks

Nathan Korda NATHAN@ROBOTS.OX.AC.UK

MLRG, University of Oxford

Balázs Szörényi SZORENYI.BALAZS@GMAIL.COM

EE, Technion & MTA-SZTE Research Group on Artificial Intelligence

Shuai Li SHUAILI.SLI@GMAIL.COM

DiSTA, University of Insubria

Abstract

We provide two distributed confidence ball algorithms for solving linear bandit problems in peer to peer networks with limited communication capabilities. For the first, we assume that all the peers are solving the same linear bandit problem, and prove that our algorithm achieves the optimal asymptotic regret rate of any centralised algorithm that can instantly communicate information between the peers. For the second, we assume that there are clusters of peers solving the same bandit problem within each cluster, and we prove that our algorithm discovers these clusters, while achieving the optimal asymptotic regret rate within each one. Through experiments on several real-world datasets, we demonstrate the performance of the proposed algorithms compared to the state-of-the-art.

1. Introduction

Bandits are a class of classic optimisation problems that are fundamental to several important application areas. The most prominent of these is recommendation systems, and such problems also arise more generally in networks (see, e.g., (Li et al., 2013; Hao et al., 2015)).

We consider settings where a network of agents are trying to solve collaborative linear bandit problems. Sharing experience can improve the performance of both the whole network and each agent simultaneously, while also increasing robustness. However, we want to avoid putting too much strain on communication channels; communicating every piece of information would simply overload them. The solution we propose is a gossip-based information sharing protocol which allows information to diffuse across the network at a small cost, while also providing robustness.

Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s).

Such a set-up would benefit, for example, a small start-up that provides some recommendation system service but has limited resources. Using an architecture that enables the agents (the clients' devices) to exchange data directly with each other and to do all the corresponding computations themselves could significantly decrease the infrastructural costs for the company. At the same time, without a central server, communicating all information instantly between agents would demand a lot of bandwidth.

Multi-Agent Linear Bandits In the simplest setting we consider, all the agents are trying to solve the same underlying linear bandit problem. In particular, we have a set of nodes $V$, indexed by $i$ and representing a finite set of agents. At each time $t$:

• a set of *actions* (equivalently, the *contexts*) arrives for each agent $i$, $D_t^i \subset D$, and we assume the set $D$ is a subset of the unit ball in $\mathbb{R}^d$;

• each agent $i$ chooses an action (context) $x_t^i \in D_t^i$, and receives a *reward*
$$r_t^i = (x_t^i)^T \theta + \xi_t^i,$$
where $\theta$ is some unknown *coefficient vector*, and $\xi_t^i$ is some zero-mean, $R$-sub-Gaussian noise;

• last, the agents can share information according to some protocol across a communication channel.
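For concreteness, the reward model above can be sketched in a few lines of NumPy. The dimensions, noise scale, and the Gaussian noise choice are illustrative only (any zero-mean $R$-sub-Gaussian noise fits the assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5  # dimension of the context space

# Unknown coefficient vector theta (normalised here for convenience).
theta = rng.normal(size=d)
theta /= np.linalg.norm(theta)

def draw_action_set(n_actions=10):
    """Sample a finite action (context) set D_t^i inside the unit ball."""
    x = rng.normal(size=(n_actions, d))
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, 1.0)  # shrink only vectors with norm > 1

def reward(x, noise_scale=0.1):
    """r_t^i = (x_t^i)^T theta + xi_t^i with zero-mean Gaussian noise."""
    return x @ theta + rng.normal(scale=noise_scale)

D = draw_action_set()
x = D[0]                 # the agent picks some x_t^i from D_t^i
r = reward(x)
# Instantaneous regret: expected reward of the best action minus chosen one.
rho = np.max(D @ theta) - x @ theta
```
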

We define the *instantaneous regret* at each node $i$, and, respectively, the *cumulative regret* over the whole network to be:
$$\rho_t^i := (x_t^{i,*})^T \theta - \mathbb{E}\,r_t^i, \quad \text{and} \quad R_t := \sum_{k=1}^{t} \sum_{i=1}^{|V|} \rho_k^i,$$
where $x_t^{i,*} := \arg\max_{x \in D_t^i} x^T \theta$. The aim of the agents is to minimise the rate of increase of cumulative regret. We also wish them to use a sharing protocol that does not impose much strain on the information-sharing communication channel.

Gossip protocol In a gossip protocol (see, e.g., (Kempe et al., 2003; Xiao et al., 2007; Jelasity et al., 2005; 2007)), in each round, an overlay protocol assigns to every agent another agent, with which it can share information. After sharing, the agents aggregate the information and, based on that, they make their corresponding decisions in the next round. In many areas of distributed learning and computation, gossip protocols have offered a good compromise between low communication costs and algorithm performance. Using such a protocol in the multi-agent bandit setting, one faces two major challenges.

First, information sharing is not perfect, since each agent acquires information from only one other (randomly chosen) agent per round. This introduces a bias through the unavoidable doubling of data points. The solution is to mitigate this by using a delay (typically of $O(\log t)$) on the time at which information gathered is used. After this delay, the information is sufficiently mixed among the agents, and the bias vanishes.
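The mixing effect can be illustrated with a small simulation; the agent count and number of rounds below are arbitrary. Each round, agents are paired uniformly at random and average their weights, so the weight a single data point carries at each agent concentrates around 1:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 16  # number of agents |V| (even, so everyone is paired each round)

# Weight of a single data point at each agent: initially only agent 0
# holds it (with total weight n); pairwise averaging conserves the total.
w = np.zeros(n)
w[0] = float(n)
initial_spread = w.var()

for _ in range(8):  # a few rounds of uniformly random pairing
    perm = rng.permutation(n)
    for a, b in zip(perm[::2], perm[1::2]):
        w[a] = w[b] = (w[a] + w[b]) / 2.0
```

After the loop, the total weight is still $|V|$ but it is spread much more evenly across the agents, which is exactly why a delayed use of buffered data removes the bias.
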

Second, in order to realise this delay, it is necessary to store information in a buffer and only use it to make decisions after the delay has passed. In (Szörényi et al., 2013) this was achieved by introducing an epoch structure into their algorithm, and emptying the buffers at the end of each epoch.

The Distributed Confidence Ball Algorithm (DCB) We use a gossip-based information sharing protocol to produce a distributed variant of the generic Confidence Ball (CB) algorithm (Abbasi-Yadkori et al., 2011; Dani et al., 2008; Li et al., 2010). Our approach is similar to (Szörényi et al., 2013), where the authors produced a distributed ε-greedy algorithm for the simpler multi-armed bandit problem. However, their results do not generalise easily, and thus significant new analysis is needed. One reason is that the linear setting introduces serious complications in the analysis of the delay effect mentioned in the previous paragraphs. Additionally, their algorithm is epoch-based, whereas we use a more natural and simpler algorithmic structure. The downside is that the size of the buffers of our algorithm grows with time. However, our analyses transfer easily to the epoch approach too. As the rate of growth is logarithmic, our algorithm is still efficient over a very long time-scale.

The simplifying assumption so far is that all agents are solving the same underlying bandit problem, i.e. finding the same unknown $\theta$-vector. This, however, is often unrealistic, and so we relax it in our next setup. While it may have uses in special cases, DCB and its analysis can be considered as a base for providing an algorithm in this more realistic setup, where some variation in $\theta$ is allowed across the network.

Clustered Linear Bandits Proposed in (Gentile et al., 2014; Li et al., 2016a;b), this has recently proved to be a very successful model for recommendation problems with massive numbers of users. It comprises a multi-agent linear bandit model in which the agents' $\theta$-vectors are allowed to vary across a clustering. This clustering presents an additional challenge: the groups of agents sharing the same underlying bandit problem must be found before information sharing can accelerate the learning process. Formally, let $\{U^k\}_{k=1,\dots,M}$ be a clustering of $V$, assume some coefficient vector $\theta^k$ for each $k$, and, for agent $i \in U^k$, let the reward of action $x_t^i$ be given by
$$r_t^i = (x_t^i)^T \theta^k + \xi_t^i.$$

Both clusters and coefficient vectors are assumed to be initially unknown, and so need to be learnt on the fly.

The Distributed Clustering Confidence Ball Algorithm (DCCB) The paper (Gentile et al., 2014) proposes the initial centralised approach to the problem of clustering linear bandits. Their approach is to begin with a single cluster, and then incrementally prune edges when the available information suggests that two agents belong to different clusters. We show how to use a gossip-based protocol to give a distributed variant of this algorithm, which we call DCCB.

Our main contributions In Theorems 1 and 6 we show that our algorithms DCB and DCCB achieve, in the multi-agent and clustered settings, respectively, near-optimal improvements in the regret rates. In particular, they are of order almost $\sqrt{|V|}$ better than applying CB without information sharing, while still keeping communication costs low. Our findings are demonstrated by experiments on real-world benchmark data.

2. Linear Bandits and the DCB Algorithm

The generic Confidence Ball (CB) algorithm is designed for a single-agent linear bandit problem (i.e. $|V| = 1$). The algorithm maintains a confidence ball $C_t \subset \mathbb{R}^d$ within which it believes the true parameter $\theta$ lies with high probability. This confidence ball is computed from the observation pairs $(x_k, r_k)_{k=1,\dots,t}$ (for the sake of simplicity, we drop the agent index $i$). Typically, the covariance matrix $A_t = \sum_{k=1}^{t} x_k x_k^T$ and the b-vector $b_t = \sum_{k=1}^{t} r_k x_k$ are sufficient statistics to characterise this confidence ball. Then, given its current action set $D_t$, the agent selects the optimistic action, assuming that the true parameter sits in $C_t$, i.e. $(x_t, \tilde\theta) = \arg\max_{(x,\theta') \in D_t \times C_t} x^T \theta'$. Pseudo-code for CB is given in Appendix A.1.
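As a concrete sketch, the selection rule can be written with the standard closed form for the inner maximisation over an ellipsoidal confidence ball. The radius `beta` and the toy numbers are illustrative, not the paper's exact confidence radius:

```python
import numpy as np

def cb_select(A, b, D, beta):
    """Pick the optimistic action from a confidence ball.

    A: d x d matrix I + sum_k x_k x_k^T; b: sum_k r_k x_k; D: (n, d) array
    of available actions; beta: radius of the confidence ball. Maximising
    x^T theta' over theta' in the ball centred at theta_hat = A^{-1} b has
    the closed form x^T theta_hat + beta * ||x||_{A^{-1}}.
    """
    A_inv = np.linalg.inv(A)
    theta_hat = A_inv @ b
    widths = np.sqrt(np.einsum('nd,de,ne->n', D, A_inv, D))  # ||x||_{A^-1}
    return int(np.argmax(D @ theta_hat + beta * widths))

def cb_update(A, b, x, r):
    """Rank-one update of the sufficient statistics after observing (x, r)."""
    return A + np.outer(x, x), b + r * x

# Toy usage: start from A_0 = I, b_0 = 0 and play one round.
d = 3
A, b = np.eye(d), np.zeros(d)
D = np.eye(d)          # three unit actions along the coordinate axes
i = cb_select(A, b, D, beta=1.0)
A, b = cb_update(A, b, D[i], r=0.5)
```
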


Gossip Sharing Protocol for DCB We assume that the agents are sharing across a peer to peer network, i.e. every agent can share information with every other agent, but every agent can communicate with only one other agent per round. In our algorithms, each agent $i$ needs to maintain

(1) a *buffer* (an ordered set) $\mathcal{A}_t^i$ of covariance matrices and an *active* covariance matrix $\tilde A_t^i$,

(2) a *buffer* $\mathcal{B}_t^i$ of b-vectors and an *active* b-vector $\tilde b_t^i$.

Initially, we set, for all $i \in V$, $\tilde A_0^i = I$ and $\tilde b_0^i = 0$. These active objects are used by the algorithm as sufficient statistics from which to calculate confidence balls, and summarise only information gathered before or during time $\tau(t)$, where $\tau$ is an arbitrary monotonically increasing function satisfying $\tau(t) < t$. The buffers are initially set to $\mathcal{A}_0^i = \emptyset$ and $\mathcal{B}_0^i = \emptyset$. For each $t > 1$, each agent $i$ shares and updates its buffers as follows:

(1) a random permutation $\sigma$ of the numbers $1, \dots, |V|$ is chosen uniformly at random in a decentralised manner among the agents;¹

(2) the buffers of $i$ are then updated by averaging its buffers with those of $\sigma(i)$, and then extending them using the current observations:²
$$\mathcal{A}_{t+1}^i = \left(\tfrac12\bigl(\mathcal{A}_t^i + \mathcal{A}_t^{\sigma(i)}\bigr)\right) \oplus \left(x_{t+1}^i (x_{t+1}^i)^T\right), \quad \mathcal{B}_{t+1}^i = \left(\tfrac12\bigl(\mathcal{B}_t^i + \mathcal{B}_t^{\sigma(i)}\bigr)\right) \oplus \left(r_{t+1}^i x_{t+1}^i\right),$$
$$\tilde A_{t+1}^i = \tfrac12\bigl(\tilde A_t^i + \tilde A_t^{\sigma(i)}\bigr), \quad \text{and} \quad \tilde b_{t+1}^i = \tfrac12\bigl(\tilde b_t^i + \tilde b_t^{\sigma(i)}\bigr);$$

(3) if the length $|\mathcal{A}_{t+1}^i|$ exceeds $t - \tau(t)$, the first element of $\mathcal{A}_{t+1}^i$ is added to $\tilde A_{t+1}^i$ and deleted from $\mathcal{A}_{t+1}^i$; $\mathcal{B}_{t+1}^i$ and $\tilde b_{t+1}^i$ are treated similarly.
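A minimal sketch of one such sharing step follows. `Agent` and `gossip_round` are illustrative names, the pairing of agents is taken as given (in DCB it comes from the decentralised random permutation), and the averaging of the active statistics is assumed to mirror that of the buffers:

```python
import numpy as np
from collections import deque

class Agent:
    """Per-agent DCB state: buffers of delayed statistics plus the active
    statistics actually used to build confidence balls (a sketch)."""
    def __init__(self, d):
        self.A_buf = deque()      # buffer of covariance matrices
        self.B_buf = deque()      # buffer of b-vectors
        self.A_act = np.eye(d)    # active covariance matrix  (A tilde)
        self.b_act = np.zeros(d)  # active b-vector           (b tilde)

def gossip_round(a, b, obs_a, obs_b, buf_len):
    """One pairwise step for matched agents a and b: average buffers and
    active statistics entry-wise, append each agent's fresh observation
    (x, r), and spill the oldest buffer entry into the active statistics
    once the buffer length exceeds buf_len = t - tau(t)."""
    A_avg = [(p + q) / 2.0 for p, q in zip(a.A_buf, b.A_buf)]
    B_avg = [(p + q) / 2.0 for p, q in zip(a.B_buf, b.B_buf)]
    A_act = (a.A_act + b.A_act) / 2.0
    b_act = (a.b_act + b.b_act) / 2.0
    for agent, (x, r) in ((a, obs_a), (b, obs_b)):
        agent.A_buf = deque(A_avg + [np.outer(x, x)])
        agent.B_buf = deque(B_avg + [r * x])
        agent.A_act, agent.b_act = A_act.copy(), b_act.copy()
        if len(agent.A_buf) > buf_len:            # the delay has expired:
            agent.A_act += agent.A_buf.popleft()  # fold the oldest entry
            agent.b_act += agent.B_buf.popleft()  # into the active stats
```
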

In this way, each buffer remains of size at most $t - \tau(t)$, and contains only information gathered after time $\tau(t)$. The result is that, after $t$ rounds of sharing, the current covariance matrices and b-vectors used by the algorithm to make decisions have the form:
$$\tilde A_t^i := I + \sum_{t'=1}^{\tau(t)} \sum_{i'=1}^{|V|} w_{i',t'}^{i,t}\, x_{t'}^{i'} (x_{t'}^{i'})^T, \quad \text{and} \quad \tilde b_t^i := \sum_{t'=1}^{\tau(t)} \sum_{i'=1}^{|V|} w_{i',t'}^{i,t}\, r_{t'}^{i'} x_{t'}^{i'},$$
where the weights $w_{i',t'}^{i,t}$ are random variables which are unknown to the algorithm. Importantly for our analysis, as a result of the overlay protocol's uniformly random choice of $\sigma$, they are identically distributed (i.d.) for each fixed pair $(t, t')$, and $\sum_{i' \in V} w_{i',t'}^{i,t} = |V|$. If information sharing was perfect at each time step, then the current covariance matrix could be computed using all the information gathered by all the agents, and would be:
$$A_t := I + \sum_{i'=1}^{|V|} \sum_{t'=1}^{t} x_{t'}^{i'} (x_{t'}^{i'})^T. \quad (1)$$

¹This can be achieved in a variety of ways.

²The symbol $\oplus$ denotes the concatenation operation on two ordered sets: if $x = (a, b, c)$ and $y = (d, e, f)$, then $x \oplus y = (a, b, c, d, e, f)$, and $y \oplus x = (d, e, f, a, b, c)$.

DCB algorithm The OFUL algorithm (Abbasi-Yadkori et al., 2011) is an improvement of the confidence ball algorithm from (Dani et al., 2008), which assumes that the confidence balls $C_t$ can be characterised by $A_t$ and $b_t$. In the DCB algorithm, each agent $i \in V$ maintains a confidence ball $C_t^i$ for the unknown parameter $\theta$ as in the OFUL algorithm, but calculated from $\tilde A_t^i$ and $\tilde b_t^i$. It then chooses its action $x_t^i$ to satisfy $(x_t^i, \bar\theta_t^i) = \arg\max_{(x,\theta') \in D_t^i \times C_t^i} x^T \theta'$, and receives a reward $r_t^i$. Finally, it shares its information buffer according to the sharing protocol above. Pseudo-code for DCB is given in Appendix A.1, and in Algorithm 1.

2.1. Results for DCB

Theorem 1. Let $\tau(\cdot) : t \to t - 4\log(|V|^{3/2}t)$. Then, with probability $1 - \delta$, the regret of DCB is bounded by
$$R_t \le \left(N(\delta)|V| + \nu(|V|, d, t)\right)\|\theta\|_2 + 4e^2(\beta(t) + 4R)\sqrt{|V|\, t \ln\left((1 + |V|t/d)^d\right)},$$
where $\nu(|V|, d, t) := (d+1)d^2(4|V|\ln(|V|^{3/2}t))^3$, $N(\delta) := \sqrt{3}/\bigl((1 - 2^{-\frac14})\sqrt{\delta}\bigr)$, and
$$\beta(t) := R\sqrt{2\ln\left((1 + |V|t/d)^d/\delta\right)} + \|\theta\|_2. \quad (2)$$

The term $\nu(|V|, d, t)$ describes the loss compared to the centralised algorithm due to the delay in using information, while $N(\delta)|V|$ describes the loss due to the incomplete mixing of the data across the network.

If the agents implement CB independently and do not share any information, which we call CB-NoSharing, then it follows from the results in (Abbasi-Yadkori et al., 2011) that the equivalent regret bound would be
$$R_t \le |V|\, \beta(t) \sqrt{t \ln\left((1 + t/d)^d\right)}. \quad (3)$$
Comparing Theorem 1 with (3) tells us that, after an initial "burn in" period, the gain in regret performance of DCB over CB-NoSharing is of order almost $\sqrt{|V|}$.


Corollary 2. We can recover a bound in expectation from Theorem 1, by using the value $\delta = 1/\sqrt{|V|t}$:
$$\mathbb{E}[R_t] \le O(t^{\frac14}) + \sqrt{|V|t}\,\|\theta\|_2 + 4e^2\left(R\sqrt{\ln\left((1 + |V|t/d)^d \sqrt{|V|t}\right)} + \|\theta\|_2 + 4R\right) \times \sqrt{|V|\, t \ln\left((1 + |V|t/d)^d\right)}.$$

This shows that DCB exhibits asymptotically optimal regret performance, up to log factors, in comparison with any algorithm that can share its information perfectly between agents at each round.

COMMUNICATION COMPLEXITY

If the agents communicate their information to each other at each round without a central server, then every agent would need to communicate their chosen action and reward to every other agent at each round, giving a communication cost of order $d|V|^2$ per round. We call such an algorithm CB-InstSharing. Under the gossip protocol we propose, each agent requires at most $O(\log_2(|V|t)\,d^2|V|)$ bits to be communicated per round. Therefore, a significant communication cost reduction is gained when $\log(|V|t)\,d \ll |V|$.

Using an epoch-based approach, as in (Szörényi et al., 2013), the per-round communication cost of the gossip protocol becomes $O(d^2|V|)$. This improves efficiency over any horizon, requiring only that $d \ll |V|$, and the proofs of the regret performance are simple modifications of those for DCB. However, in comparison with growing buffers this is only an issue after $O(\exp(|V|))$ rounds, and typically $|V|$ is large.

While DCB has a clear communication advantage over CB-InstSharing, there are other potential approaches to this problem. For example, instead of randomised neighbour sharing one can use a deterministic protocol such as Round-Robin (RR), which can have the same low communication costs as DCB. However, the regret bound for RR suffers from a naturally larger delay in the network than DCB.

Moreover, attempting to track potential doubling of data points when using a gossip protocol, instead of employing a delay, leads back to a communication cost of order $|V|^2$ per round. More detail is included in Appendix A.2.
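To get a feel for the orders involved, a back-of-the-envelope comparison follows; constants are dropped and the function names and example sizes are illustrative:

```python
import math

# Per-round communication costs, up to constants, for the three schemes
# discussed above (n = |V| agents, d = context dimension, t = round).

def cost_inst_sharing(n, d):
    """CB-InstSharing: every agent sends (x, r) to every other agent."""
    return d * n ** 2

def cost_dcb(n, d, t):
    """DCB gossip with growing buffers: O(log(n t) d^2 n) per round."""
    return math.log2(n * t) * d ** 2 * n

def cost_epoch_gossip(n, d):
    """Epoch-based gossip (Szorenyi et al., 2013): O(d^2 n) per round."""
    return d ** 2 * n

# The gossip protocol wins whenever log(n t) * d is much smaller than n:
n, d, t = 10_000, 20, 100_000
```
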

PROOF OF THEOREM 1

In the analysis we show that the bias introduced by imperfect information sharing is mitigated by delaying the inclusion of the data in the estimation of the parameter $\theta$. The proof builds on the analysis in (Abbasi-Yadkori et al., 2011). The emphasis here is to show how to handle the extra difficulty stemming from imperfect information sharing, which results in the influence of the various rewards at the various peers being unbalanced and appearing with a random delay. Proofs of Lemmas 3 and 4, and of Proposition 1, are crucial, but technical, and are deferred to Appendix A.3.

Step 1: Define modified confidence ellipsoids. First we need a version of the confidence ellipsoid theorem given in (Abbasi-Yadkori et al., 2011) that incorporates the bias introduced by the random weights:

Proposition 1. Let $\delta > 0$, $\tilde\theta_t^i := (\tilde A_t^i)^{-1} \tilde b_t^i$, $W(\tau) := \max\{w_{i',t'}^{i,t} : t, t' \le \tau,\ i, i' \in V\}$, and let
$$C_t^i := \left\{x \in \mathbb{R}^d : \|\tilde\theta_t^i - x\|_{\tilde A_t^i} \le \|\theta\|_2 + W(\tau(t))\, R\sqrt{2\log\left(\det(\tilde A_t^i)^{\frac12}/\delta\right)}\right\}. \quad (4)$$
Then with probability $1 - \delta$, $\theta \in C_t^i$.

In the rest of the proof we assume that $\theta \in C_t^i$.

Step 2: Instantaneous regret decomposition. Denote $(x_t^i, \bar\theta_t^i) = \arg\max_{x \in D_t^i, y \in C_t^i} x^T y$. Then we can decompose the instantaneous regret, following a classic argument (see the proof of Theorem 3 in (Abbasi-Yadkori et al., 2011)):
$$\rho_t^i = (x_t^{i,*})^T \theta - (x_t^i)^T \theta \le (x_t^i)^T \bar\theta_t^i - (x_t^i)^T \theta = (x_t^i)^T\left[\left(\bar\theta_t^i - \tilde\theta_t^i\right) + \left(\tilde\theta_t^i - \theta\right)\right]$$
$$\le \|x_t^i\|_{(\tilde A_t^i)^{-1}}\left(\left\|\bar\theta_t^i - \tilde\theta_t^i\right\|_{\tilde A_t^i} + \left\|\tilde\theta_t^i - \theta\right\|_{\tilde A_t^i}\right). \quad (5)$$

Step 3: Control the bias. The norm differences inside the brackets of the regret decomposition are bounded through (4) in terms of the matrices $\tilde A_t^i$. We would like, instead, to have the regret decomposition in terms of the matrix $A_t$ (which is defined in (1)). To this end, we give some lemmas showing that using the matrices $\tilde A_t^i$ is almost the same as using $A_t$. These lemmas involve elementary matrix analysis, but are crucial for understanding the impact of imperfect information sharing on the final regret bounds.

Step 3a: Control the bias coming from the weight imbalance.

Lemma 3 (Bound on the influence of general weights). For all $i \in V$ and $t > 0$,
$$\|x_t^i\|^2_{(\tilde A_t^i)^{-1}} \le e^{\sum_{t'=1}^{\tau(t)} \sum_{i'=1}^{|V|} \left|w_{i',t'}^{i,t} - 1\right|}\, \|x_t^i\|^2_{(A_{\tau(t)})^{-1}}, \quad \text{and} \quad \det\left(\tilde A_t^i\right) \le e^{\sum_{t'=1}^{\tau(t)} \sum_{i'=1}^{|V|} \left|w_{i',t'}^{i,t} - 1\right|}\, \det\left(A_{\tau(t)}\right).$$

Using Lemma 4 in (Szörényi et al., 2013), by exploiting that the random weights are identically distributed (i.d.) for each fixed pair $(t, t')$ and that $\sum_{i' \in V} w_{i',t'}^{i,t} = |V|$ under our gossip protocol, we can control the random exponential constant in Lemma 3, and the upper bound $W(T)$, using the Chernoff-Hoeffding bound:

Lemma 4 (Bound on the influence of weights under our sharing protocol). Fix some constants $0 < \delta_{t'} < 1$. Then with probability $1 - \sum_{t'=1}^{\tau(t)} \delta_{t'}$,
$$\sum_{i'=1}^{|V|} \sum_{t'=1}^{\tau(t)} \left|w_{i',t'}^{i,t} - 1\right| \le |V|^{\frac32} \sum_{t'=1}^{\tau(t)} \left(2^{-(t-t')}/\delta_{t'}\right)^{\frac12}, \quad \text{and} \quad W(T) \le 1 + \max_{1 \le t' \le \tau(t)} |V|^{\frac32}\left(2^{-(t-t')}/\delta_{t'}\right)^{\frac12}.$$
In particular, for any $\delta \in (0,1)$, choosing $\delta_{t'} = \delta\, 2^{(t'-t)/2}$, with probability $1 - \delta/(|V|^3 t^2 (1 - 2^{-1/2}))$ we have
$$\sum_{i'=1}^{|V|} \sum_{t'=1}^{\tau(t)} \left|w_{i',t'}^{i,t} - 1\right| \le \frac{1}{(1 - 2^{-\frac14})\, t\sqrt{\delta}}, \quad \text{and} \quad W(\tau(t)) \le 1 + \frac{|V|^{\frac32}}{t\sqrt{\delta}}. \quad (6)$$

Thus Lemmas 3 and 4 give us control over the bias introduced by the imperfect information sharing. Combining them with Equations (4) and (5), we find that with probability $1 - \delta/(|V|^3 t^2 (1 - 2^{-1/2}))$:
$$\rho_t^i \le 2e^{C(t)}\, \|x_t^i\|_{(A_{\tau(t)})^{-1}}\, (1 + C(t)) \left[R\sqrt{2\log\left(e^{C(t)} \det\left(A_{\tau(t)}\right)^{\frac12}/\delta\right)} + \|\theta\|_2\right], \quad (7)$$
where $C(t) := 1/\bigl((1 - 2^{-\frac14})\, t\sqrt{\delta}\bigr)$.

Step 3b: Control the bias coming from the delay. Next, we need to control the bias introduced by leaving out the last $4\log(|V|^{3/2}t)$ time steps from the confidence ball estimation calculation:

Proposition 2. There can be at most
$$\nu(k) := (4|V|\log(|V|^{3/2}k))^3 (d+1)\, d\, (\operatorname{tr}(A_0) + 1) \quad (8)$$
pairs $(i, k) \in \{1, \dots, |V|\} \times \{1, \dots, t\}$ for which one of
$$\|x_k^i\|^2_{(A_{\tau(k)})^{-1}} \ge e\, \|x_k^i\|^2_{\left(A_{k-1} + \sum_{j=1}^{i-1} x_k^j (x_k^j)^T\right)^{-1}}, \quad \text{or} \quad \det\left(A_{\tau(k)}\right) \le e^{-1} \det\left(A_{k-1} + \sum_{j=1}^{i-1} x_k^j (x_k^j)^T\right)$$
holds.

Step 4: Choose constants and sum the simple regret. Defining a constant
$$N(\delta) := \frac{1}{(1 - 2^{-\frac14})\sqrt{\delta}},$$
we have, for all $k \ge N(\delta)$, $C(k) \le 1$, and so, by (7), with probability $1 - (|V|k)^{-2}\delta/(1 - 2^{-1/2})$,
$$\rho_k^i \le 2e\, \|x_k^i\|_{(A_{\tau(k)})^{-1}} \left[2R\sqrt{2\log\left(e\, \det\left(A_{\tau(k)}\right)^{\frac12}/\delta\right)} + \|\theta\|_2\right]. \quad (9)$$
Now, first applying Cauchy-Schwarz, then step 3b from above together with (9), and finally Lemma 11 from (Abbasi-Yadkori et al., 2011), yields that, with probability $1 - \left(1 + \sum_{t=1}^{\infty} (|V|t)^{-2}/(1 - 2^{-1/2})\right)\delta \ge 1 - 3\delta$,
$$R_t \le N(\delta)|V|\,\|\theta\|_2 + \left[|V|\, t \sum_{t'=N(\delta)}^{t} \sum_{i=1}^{|V|} \left(\rho_{t'}^i\right)^2\right]^{\frac12}$$
$$\le \left(N(\delta)|V| + \nu(|V|, d, t)\right)\|\theta\|_2 + 4e^2(\beta(t) + 2R)\left[|V|\, t \sum_{t'=1}^{t} \sum_{i=1}^{|V|} \|x_{t'}^i\|^2_{(A_t)^{-1}}\right]^{\frac12}$$
$$\le \left(N(\delta)|V| + \nu(|V|, d, t)\right)\|\theta\|_2 + 4e^2(\beta(t) + 2R)\sqrt{|V|\, t\, (2\log\left(\det\left(A_t\right)\right))},$$
where $\beta(\cdot)$ is as defined in (2). Replacing $\delta$ with $\delta/3$ finishes the proof.

PROOF OF PROPOSITION 2

This proof forms the major innovation in the proof of Theorem 1. Let $(y_k)_{k \ge 1}$ be any sequence of vectors such that $\|y_k\|_2 \le 1$ for all $k$, and let $B_n := B_0 + \sum_{k=1}^{n} y_k y_k^T$, where $B_0$ is some positive definite matrix.

Lemma 5. For all $t > 0$, and for any $c \in (0,1)$, we have
$$\left|\left\{k \in \{1, 2, \dots\} : \|y_k\|^2_{B_{k-1}^{-1}} > c\right\}\right| \le (d+c)\, d\, (\operatorname{tr}(B_0^{-1}) - c)/c^2.$$

Proof. We begin by showing that, for any $c \in (0,1)$,
$$\|y_k\|^2_{B_{k-1}^{-1}} > c \quad (10)$$
can be true for only finitely many $k$. Indeed, let us suppose that (10) is true for some $k$. Let $(e_i^{(k-1)})_{1 \le i \le d}$ be the orthonormal eigenbasis for $B_{k-1}$, and, therefore, also for $B_{k-1}^{-1}$, and write $y_k = \sum_{i=1}^{d} \alpha_i e_i^{(k-1)}$. Let, also, $(\lambda_i^{(k-1)})$ be the eigenvalues of $B_{k-1}$. Then,
$$c < y_k^T B_{k-1}^{-1} y_k = \sum_{i=1}^{d} \frac{\alpha_i^2}{\lambda_i^{(k-1)}} \le \operatorname{tr}(B_{k-1}^{-1}), \quad \Longrightarrow \quad \exists j \in \{1, \dots, d\} : \frac{\alpha_j^2}{\lambda_j^{(k-1)}} > \frac{c}{d},\ \lambda_j^{(k-1)} < \frac{d}{c},$$
where we have used that $\alpha_i^2 \le 1$ for all $i$, since $\|y_k\|_2 \le 1$. Now,
$$\operatorname{tr}(B_{k-1}^{-1}) - \operatorname{tr}(B_k^{-1}) = \operatorname{tr}(B_{k-1}^{-1}) - \operatorname{tr}\bigl((B_{k-1} + y_k y_k^T)^{-1}\bigr) \ge \operatorname{tr}(B_{k-1}^{-1}) - \operatorname{tr}\bigl((B_{k-1} + \alpha_j^2 e_j e_j^T)^{-1}\bigr)$$
$$= \frac{1}{\lambda_j^{(k-1)}} - \frac{1}{\lambda_j^{(k-1)} + \alpha_j^2} = \frac{\alpha_j^2}{\lambda_j^{(k-1)}\bigl(\lambda_j^{(k-1)} + \alpha_j^2\bigr)} \ge \bigl(d^2 c^{-2} + d c^{-1}\bigr)^{-1} = \frac{c^2}{d(d+c)}.$$
So we have shown that (10) implies that
$$\operatorname{tr}(B_{k-1}^{-1}) > c \quad \text{and} \quad \operatorname{tr}(B_{k-1}^{-1}) - \operatorname{tr}(B_k^{-1}) > \frac{c^2}{d(d+c)}.$$
Since $\operatorname{tr}(B_0^{-1}) \ge \operatorname{tr}(B_{k-1}^{-1}) \ge \operatorname{tr}(B_k^{-1}) \ge 0$ for all $k$, it follows that (10) can be true for at most $(d+c)\, d\, (\operatorname{tr}(B_0^{-1}) - c)\, c^{-2}$ different $k$.

Now, using an argument similar to the proof of Lemma 3, for all $k < t$,
$$\|y_{k+1}\|_{B_{\tau(k)}^{-1}} \le e^{\sum_{s=\tau(k)+1}^{k} \|y_{s+1}\|^2_{B_s^{-1}}}\, \|y_{k+1}\|_{B_k^{-1}}, \quad \text{and} \quad \det\left(B_{\tau(t)}\right) \ge e^{-\sum_{k=\tau(t)+1}^{t} \|y_k\|^2_{B_k^{-1}}}\, \det(B_t).$$
Therefore,
$$\|y_{k+1}\|_{B_{\tau(k)}^{-1}} \ge c\, \|y_{k+1}\|_{B_k^{-1}} \quad \text{or} \quad \det(B_{\tau(k)}) \le c^{-1} \det(B_k) \quad \Longrightarrow \quad \sum_{s=\tau(k)}^{k-1} \|y_{s+1}\|^2_{B_s^{-1}} \ge \ln(c).$$
However, according to Lemma 5, there can be at most
$$\nu(t) := \left(d + \frac{\ln(c)}{\Delta(t)}\right) d \left(\operatorname{tr}\left(B_0^{-1}\right) - \frac{\ln(c)}{\Delta(t)}\right) \left(\frac{\Delta(t)}{\ln(c)}\right)^2$$
times $s \in \{1, \dots, t\}$ such that $\|y_{s+1}\|^2_{B_s^{-1}} \ge \ln(c)/\Delta(t)$, where $\Delta(t) := \max_{1 \le k \le t}\{k - \tau(k)\}$. Hence $\sum_{s=\tau(k)}^{k-1} \|y_{s+1}\|^2_{B_s^{-1}} \ge \ln(c)$ is true for at most $\Delta(t)\,\nu(|V|, d, t)$ indices $k \in \{1, \dots, t\}$.

Finally, we finish by setting $(y_k)_{k \ge 1} = \left((x_t^i)_{i=1}^{|V|}\right)_{t \ge 1}$.

3. Clustering and the DCCB Algorithm

We now incorporate distributed clustering into the DCB algorithm. The analysis of DCB forms the backbone of the analysis of DCCB.

DCCB Pruning Protocol In order to run DCCB, each agent $i$ must maintain some local information buffers in addition to those used for DCB. These are:

(1) a local covariance matrix $A^i_{\text{local}} = A^i_{\text{local},t}$ and a local b-vector $b^i_{\text{local}} = b^i_{\text{local},t}$,

(2) and a local neighbour set $V_t^i$.

The local covariance matrix and b-vector are updated as if the agent was applying the generic (single-agent) confidence ball algorithm: $A^i_{\text{local},0} = A_0$, $b^i_{\text{local},0} = 0$,
$$A^i_{\text{local},t} = x_t^i (x_t^i)^T + A^i_{\text{local},t-1}, \quad \text{and} \quad b^i_{\text{local},t} = r_t^i x_t^i + b^i_{\text{local},t-1}.$$

Algorithm 1 Distributed Clustering Confidence Ball
Input: size of network $|V|$, $\tau : t \to t - 4\log_2 t$, $\alpha$, $\delta$.
Initialization: $\forall i \in V$, set $\tilde A_0^i = I_d$, $\tilde b_0^i = 0$, $\mathcal{A}_0^i = \mathcal{B}_0^i = \emptyset$, and $V_0^i = V$.
for $t = 0, \dots, \infty$ do
  Draw a random permutation $\sigma$ of $\{1, \dots, |V|\}$ respecting the current local clusters.
  for $i = 1, \dots, |V|$ do
    Receive action set $D_t^i$ and construct the confidence ball $C_t^i$ using $\tilde A_t^i$ and $\tilde b_t^i$.
    Choose action and receive reward: find $(x_{t+1}^i, \ast) = \arg\max_{(x,\theta') \in D_t^i \times C_t^i} x^T \theta'$, and get reward $r_{t+1}^i$ from context $x_{t+1}^i$.
    Share and update information buffers:
    if $\|\hat\theta^i_{\text{local}} - \hat\theta^{\sigma(i)}_{\text{local}}\| > c_{\text{thresh}}(t)$ then
      Update local cluster: $V_{t+1}^i = V_t^i \setminus \{\sigma(i)\}$, $V_{t+1}^{\sigma(i)} = V_t^{\sigma(i)} \setminus \{i\}$, and reset according to (13).
    else if $V_t^i = V_t^{\sigma(i)}$ then
      Set $\mathcal{A}_{t+1}^i = \left(\tfrac12(\mathcal{A}_t^i + \mathcal{A}_t^{\sigma(i)})\right) \oplus (x_{t+1}^i (x_{t+1}^i)^T)$ and $\mathcal{B}_{t+1}^i = \left(\tfrac12(\mathcal{B}_t^i + \mathcal{B}_t^{\sigma(i)})\right) \oplus (r_{t+1}^i x_{t+1}^i)$.
    else
      Update: set $\mathcal{A}_{t+1}^i = \mathcal{A}_t^i \oplus (x_{t+1}^i (x_{t+1}^i)^T)$ and $\mathcal{B}_{t+1}^i = \mathcal{B}_t^i \oplus (r_{t+1}^i x_{t+1}^i)$.
    end if
    Update local estimator: $A^i_{\text{local},t+1} = A^i_{\text{local},t} + x_{t+1}^i (x_{t+1}^i)^T$, $b^i_{\text{local},t+1} = b^i_{\text{local},t} + r_{t+1}^i x_{t+1}^i$, and $\hat\theta_{\text{local},t+1} = \left(A^i_{\text{local},t+1}\right)^{-1} b^i_{\text{local},t+1}$.
    if $|\mathcal{A}_{t+1}^i| > t - \tau(t)$ then set $\tilde A_{t+1}^i = \tilde A_t^i + \mathcal{A}_{t+1}^i(1)$ and $\mathcal{A}_{t+1}^i = \mathcal{A}_{t+1}^i \setminus \mathcal{A}_{t+1}^i(1)$; similarly for $\mathcal{B}_{t+1}^i$.
  end for
end for
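The local statistics update is the plain single-agent recursion; a minimal sketch follows (the two noiseless observations in the usage example are purely illustrative):

```python
import numpy as np

def local_update(A_local, b_local, x, r):
    """Rank-one update of the local statistics kept for the clustering
    test, separate from the gossiped (shared) statistics."""
    A_local = A_local + np.outer(x, x)
    b_local = b_local + r * x
    theta_hat = np.linalg.solve(A_local, b_local)  # local estimate of theta
    return A_local, b_local, theta_hat

d = 2
A, b = np.eye(d), np.zeros(d)   # A_local,0 = A_0 = I, b_local,0 = 0
x = np.array([1.0, 0.0])
A, b, th = local_update(A, b, x, r=1.0)
A, b, th = local_update(A, b, x, r=1.0)
# After two noiseless looks along e_1, the ridge estimate is (2/3, 0).
```
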

DCCB Algorithm Each agent's local neighbour set $V_t^i$ is initially set to $V$. At each time step $t$, agent $i$ contacts one other agent, $j$, at random from $V_t^i$, and both decide whether or not they belong to the same cluster. To do this they share local estimates, $\hat\theta_t^i = (A^i_{\text{local},t})^{-1} b^i_{\text{local},t}$ and $\hat\theta_t^j = (A^j_{\text{local},t})^{-1} b^j_{\text{local},t}$, of the unknown parameter of the bandit problem they are solving, and see if they are further apart than a threshold function $c = c_{\text{thresh}}(t)$, so that if
$$\|\hat\theta_t^i - \hat\theta_t^j\|_2 \ge c_{\text{thresh}}(t), \quad (11)$$
then $V_{t+1}^i = V_t^i \setminus \{j\}$ and $V_{t+1}^j = V_t^j \setminus \{i\}$. Here $\lambda$ is a parameter of an extra assumption that is needed, as in (Gentile et al., 2014), about the process generating the context sets $D_t^i$:

(A) Each context set $D_t^i = \{x_k\}_k$ is finite and contains i.i.d. random vectors such that, for all $k$, $\|x_k\| \le 1$ and $\mathbb{E}(x_k x_k^T)$ is full rank, with minimal eigenvalue $\lambda > 0$.

We define $c_{\text{thresh}}(t)$, as in (Gentile et al., 2014), by
$$c_{\text{thresh}}(t) := \frac{R\sqrt{2d\log(t) + 2\log(2/\delta) + 1}}{\sqrt{1 + \max\{A_\lambda(t, \delta/(4d)), 0\}}}, \quad (12)$$
where $A_\lambda(t, \delta) := \lambda t - 8\log((t+3)/\delta) - 2\sqrt{t\log((t+3)/\delta)}$.

The DCCB algorithm is essentially the same as the DCB algorithm, except that it also applies the pruning protocol described. In particular, each agent $i$, when sharing its information with another agent $j$, has three possible actions:

(1) if (11) is not satisfied and $V_t^i = V_t^j$, then the agents share simply as in the DCB algorithm;

(2) if (11) is satisfied, then both agents remove each other from their neighbour sets and reset their buffers and active matrices so that
$$\mathcal{A}^i = (0, 0, \dots, A^i_{\text{local}}),\ \mathcal{B}^i = (0, 0, \dots, b^i_{\text{local}}),\ \text{and}\ \tilde A^i = A^i_{\text{local}},\ \tilde b^i = b^i_{\text{local}}, \quad (13)$$
and similarly for agent $j$;

(3) if (11) is not satisfied but $V_t^i \ne V_t^j$, then no sharing or pruning occurs.

It is proved in the theorem below that, under this sharing and pruning mechanism, with high probability, after some finite time each agent $i$ finds its true cluster, i.e. $V_t^i = U^k$. Moreover, since the algorithm resets to its local information each time a pruning occurs, once the true clusters have been identified each cluster shares only information gathered within that cluster, thus avoiding the bias that would be introduced by sharing information gathered from outside the cluster before the clustering has been identified. Full pseudo-code for the DCCB algorithm is given in Algorithm 1, and the differences with the DCB algorithm are highlighted in blue.
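The pruning decision can be sketched as below. The constants inside `cthresh` follow the shape of Eq. (12) but should be treated as illustrative, and `prune` assumes neighbour sets are stored as Python sets:

```python
import numpy as np

def cthresh(t, d, R, lam, delta):
    """Separation threshold in the spirit of Eq. (12); lam is the minimal
    eigenvalue from assumption (A). The exact constants are illustrative."""
    log_term = np.log((t + 3) / delta)
    a_lam = lam * t - 8 * log_term - 2 * np.sqrt(t * log_term)
    num = R * np.sqrt(2 * d * np.log(t) + 2 * np.log(2 / delta) + 1)
    return num / np.sqrt(1 + max(a_lam, 0.0))

def prune(theta_i, theta_j, V_i, V_j, i, j, t, R, lam, delta):
    """If the local estimates are farther apart than the threshold, agents
    i and j remove each other from their neighbour sets."""
    d = len(theta_i)
    if np.linalg.norm(theta_i - theta_j) >= cthresh(t, d, R, lam, delta):
        V_i.discard(j)
        V_j.discard(i)
    return V_i, V_j

# Two agents with clearly different local estimates get separated:
V0, V1 = {0, 1}, {0, 1}
V0, V1 = prune(np.array([10.0, 0.0]), np.array([0.0, 0.0]),
               V0, V1, i=0, j=1, t=100, R=0.1, lam=0.5, delta=0.05)
```
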


[Figure 1: three plots (Delicious, LastFM, and MovieLens datasets), each showing the ratio of cumulative rewards of DCCB, CLUB, CB-NoSharing, and CB-InstSharing to the cumulative rewards of the random algorithm, as a function of the number of rounds.]

Figure 1. Here we plot the performance of DCCB in comparison to CLUB, CB-NoSharing and CB-InstSharing. The plots show the ratio of cumulative rewards achieved by the algorithms to the cumulative rewards achieved by the random algorithm.


3.1. Results for DCCB

Theorem 6. Assume that (A) holds, and let denote the smallest distance between the bandit parameters✓k. Then there exists a constantC=C( ,|V|, , ), such that with probability1 the total cumulative regret of cluster k when the agents employ DCCB is bounded by

Rt

maxnp

2N( ), C+ 4 log2(|V|32C)o

|Uk| +⌫(|Uk|, d, t) k✓k2

+ 4e( (t) + 3R) r

|Uk|tln⇣

(1 +|Uk|t/d)d⌘ ,

whereN and⌫are as defined in Theorem 1, and (t) :=

R r

2 ln⇣

(1 +|Uk|t/d)d

+k✓k2.

The constant $C(\gamma, |V|, \delta, \lambda)$ is the time one must wait for the true clustering to have been identified. The analysis follows this scheme: once the true clusters have been correctly identified by all nodes, within each cluster the algorithm, and thus the analysis, reduces to the case of Section 2.1. We adapt results from (Gentile et al., 2014) to show how long it will be before the true clusters are identified, with high probability. The proof is deferred to Appendices A.4 and A.5.

4. Experiments and Discussion

Experiments We closely implemented the experimental setting and dataset construction principles used in (Li et al., 2016a;b); for a detailed description we refer the reader to (Li et al., 2016a). We evaluated DCCB on three real-world datasets against its centralised counterpart CLUB, and against the benchmarks used therein, CB-NoSharing and CB-InstSharing. The LastFM dataset comprises 91 users, each of whom appears at least 95 times. The Delicious dataset has 87 users, each of whom appears at least 95 times. The MovieLens dataset contains 100 users, each of whom appears at least 250 times. The performance was measured using the ratio of cumulative reward of each algorithm to that of the predictor which chooses a random action at each time step. This is plotted in Figure 1.
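The evaluation metric is simply the running ratio of cumulative rewards; the toy reward sequences below are illustrative:

```python
import numpy as np

def reward_ratio(alg_rewards, random_rewards):
    """Per-round ratio of an algorithm's cumulative reward to that of the
    uniformly random policy, as plotted in Figure 1."""
    return np.cumsum(alg_rewards) / np.cumsum(random_rewards)

alg = np.array([1.0, 0.5, 1.0, 1.0])
rnd = np.array([0.5, 0.5, 0.5, 0.5])
ratios = reward_ratio(alg, rnd)  # a value above 1 beats random selection
```
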

From the experimental results it is clear that DCCB per- forms comparably to CLUB in practice, and both outper- form CB-NoSharing, and CB-InstSharing.

Relationship to existing literature There are several strands of research that are relevant and complementary to this work. First, there is a large literature on single-agent linear bandits, and on other more, or less, complicated bandit problem settings. There is already work on distributed approaches to multi-agent, multi-armed bandits, not least (Szörényi et al., 2013), which examines ε-greedy strategies over a peer to peer network and provided an initial inspiration for this current work. The paper (Kalathil et al., 2014) examines the extreme case when there is no communication channel across which the agents can communicate, and all communication must be performed through observation of action choices alone. Another approach to the multi-armed bandit case, (Nayyar et al., 2015), directly incorporates the communication cost into the regret.

Second, there are several recent advances regarding state-of-the-art methods for clustering of bandits. The work (Li et al., 2016a) is a faster variant of (Gentile et al., 2014) which adopts a boosted training stage. In (Li et al., 2016b) the authors not only cluster the users, but also cluster the items in the collaborative filtering setting, with a sharp regret analysis.

Finally, the paper (Tekin & van der Schaar, 2013) treats a setting similar to ours, in which agents attempt to solve contextual bandit problems in a distributed setting. They present two algorithms, one of which is a distributed version of the approach taken in (Slivkins, 2014), and show that the distributed approach achieves at least as good asymptotic regret performance as the centralised algorithm. However, rather than sharing information across a limited communication channel, they allow each agent only to ask another agent to choose their action for them. This difference in settings is reflected in worse regret bounds, which are of order O(T^{2/3}) at best.

Discussion Our analysis is tailored to adapt proofs from (Abbasi-Yadkori et al., 2011) about generic confidence ball algorithms to a distributed setting. However, many of the elements of these proofs, including Propositions 1 and 2, could be reused to provide similar asymptotic regret guarantees for distributed versions of other bandit algorithms, e.g., the Thompson sampling algorithms of (Agrawal & Goyal, 2013; Kaufmann et al., 2012; Russo & Van Roy, 2014).

Both DCB and DCCB are synchronous algorithms. The work on distributed computation through gossip algorithms in (Boyd et al., 2006) could alleviate this issue. The current pruning algorithm for DCCB guarantees that techniques from (Szörényi et al., 2013) can be applied to our algorithms. However, the results in (Boyd et al., 2006) are more powerful, and could be used even when the agents only identify a sub-network of the true clustering.
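To illustrate the asynchronous primitive that (Boyd et al., 2006) studies, the following is a minimal sketch of pairwise randomized gossip averaging; the function name and parameters are illustrative, not part of our algorithms:

```python
import random

def gossip_average(values, num_rounds, rng=None):
    """Pairwise randomized gossip in the style of (Boyd et al., 2006):
    at each step a uniformly random pair of nodes replaces both of
    their values by their average. The node values converge to the
    global average using only local, pairwise communication."""
    rng = rng or random.Random(0)
    x = list(values)
    n = len(x)
    for _ in range(num_rounds):
        i, j = rng.sample(range(n), 2)  # pick a random pair of peers
        avg = (x[i] + x[j]) / 2.0
        x[i] = x[j] = avg  # both peers adopt the pairwise average
    return x

# After many rounds, every node holds (approximately) the network average.
x = gossip_average(list(range(10)), num_rounds=2000)
```

Note that each pairwise exchange conserves the sum of the values, so the common limit is the true network average.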

Furthermore, there are other interesting existing algorithms for performing clustering of bandits for recommender systems, such as COFIBA in (Li et al., 2016b). It would be interesting to understand how general the techniques applied here to CLUB are.


Acknowledgments

We would like to thank the anonymous reviewers for their helpful comments. We would also like to thank Gergely Neu for very useful discussions. The first author thanks the support of the EPSRC Autonomous Intelligent Systems project EP/I011587. The third author thanks the support of MIUR, QCRI-HBKU, and an Amazon Research Grant.

The research leading to these results has received funding from the European Research Council under the European Union’s Seventh Framework Programme (FP/2007-2013) / ERC Grant Agreement n. 306638.

References

Abbasi-Yadkori, Yasin, Pál, Dávid, and Szepesvári, Csaba. Improved algorithms for linear stochastic bandits. In NIPS, pp. 2312–2320, 2011.

Agrawal, Shipra and Goyal, Navin. Thompson sampling for contextual bandits with linear payoffs. In ICML, 2013.

Boyd, Stephen, Ghosh, Arpita, Prabhakar, Balaji, and Shah, Devavrat. Randomized gossip algorithms. IEEE/ACM Transactions on Networking (TON), 14(SI):2508–2530, 2006.

Dani, Varsha, Hayes, Thomas P, and Kakade, Sham M. Stochastic linear optimization under bandit feedback. In COLT, pp. 355–366, 2008.

Gentile, Claudio, Li, Shuai, and Zappella, Giovanni. Online clustering of bandits. In ICML, 2014.

Hao, Fei, Li, Shuai, Min, Geyong, Kim, Hee-Cheol, Yau, Stephen S, and Yang, Laurence T. An efficient approach to generating location-sensitive recommendations in ad-hoc social network environments. IEEE Transactions on Services Computing, 2015.

Jelasity, M., Montresor, A., and Babaoglu, O. Gossip-based aggregation in large dynamic networks. ACM Transactions on Computer Systems, 23(3):219–252, August 2005.

Jelasity, M., Voulgaris, S., Guerraoui, R., Kermarrec, A.-M., and van Steen, M. Gossip-based peer sampling. ACM Transactions on Computer Systems, 25(3):8, 2007.

Kalathil, Dileep, Nayyar, Naumaan, and Jain, Rahul. Decentralized learning for multiplayer multiarmed bandits. IEEE Transactions on Information Theory, 60(4):2331–2345, 2014.

Kaufmann, Emilie, Korda, Nathaniel, and Munos, Rémi. Thompson sampling: An asymptotically optimal finite-time analysis. In Algorithmic Learning Theory, pp. 199–213. Springer, 2012.

Kempe, D., Dobra, A., and Gehrke, J. Gossip-based computation of aggregate information. In Proc. 44th Annual IEEE Symposium on Foundations of Computer Science (FOCS'03), pp. 482–491. IEEE Computer Society, 2003.

Li, Lihong, Chu, Wei, Langford, John, and Schapire, Robert E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pp. 661–670. ACM, 2010.

Li, Shuai, Hao, Fei, Li, Mei, and Kim, Hee-Cheol. Medicine rating prediction and recommendation in mobile social networks. In Proceedings of the International Conference on Grid and Pervasive Computing, 2013.

Li, Shuai, Gentile, Claudio, and Karatzoglou, Alexandros. Graph clustering bandits for recommendation. CoRR:1605.00596, 2016a.

Li, Shuai, Karatzoglou, Alexandros, and Gentile, Claudio. Collaborative filtering bandits. In Proceedings of the 39th International ACM SIGIR Conference on Information Retrieval (SIGIR'16), 2016b.

Nayyar, Naumaan, Kalathil, Dileep, and Jain, Rahul. On regret-optimal learning in decentralized multi-player multi-armed bandits. CoRR:1505.00553, 2015.

Russo, Daniel and Van Roy, Benjamin. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014.

Slivkins, Aleksandrs. Contextual bandits with similarity information. JMLR, 2014.

Szörényi, Balázs, Busa-Fekete, Róbert, Hegedűs, István, Ormándi, Róbert, Jelasity, Márk, and Kégl, Balázs. Gossip-based distributed stochastic bandit algorithms. In ICML, pp. 19–27, 2013.

Tekin, Cem and van der Schaar, Mihaela. Distributed online learning via cooperative contextual bandits. IEEE Trans. Signal Processing, 2013.

Xiao, L., Boyd, S., and Kim, S.-J. Distributed average consensus with least-mean-square deviation. Journal of Parallel and Distributed Computing, 67(1):33–46, January 2007.
