Distributed Clustering of Linear Bandits in Peer to Peer Networks

Nathan Korda NATHAN@ROBOTS.OX.AC.UK

MLRG, University of Oxford

Balázs Szörényi SZORENYI.BALAZS@GMAIL.COM

EE, Technion & MTA-SZTE Research Group on Artificial Intelligence

Shuai Li SHUAILI.SLI@GMAIL.COM

DiSTA, University of Insubria

Abstract

We provide two distributed confidence ball algorithms for solving linear bandit problems in peer to peer networks with limited communication capabilities. For the first, we assume that all the peers are solving the same linear bandit problem, and prove that our algorithm achieves the optimal asymptotic regret rate of any centralised algorithm that can instantly communicate information between the peers. For the second, we assume that there are clusters of peers solving the same bandit problem within each cluster, and we prove that our algorithm discovers these clusters, while achieving the optimal asymptotic regret rate within each one. Through experiments on several real-world datasets, we demonstrate the performance of the proposed algorithms compared to the state-of-the-art.

1. Introduction

Bandits are a class of classic optimisation problems that are fundamental to several important application areas. The most prominent of these is recommendation systems, and such problems also arise more generally in networks (see, e.g., (Li et al., 2013; Hao et al., 2015)).

We consider settings where a network of agents are trying to solve collaborative linear bandit problems. Sharing experience can improve the performance of both the whole network and each agent simultaneously, while also increasing robustness. However, we want to avoid putting too much strain on communication channels; communicating every piece of information would simply overload them. The solution we propose is a gossip-based information sharing protocol which allows information to diffuse across the network at a small cost, while also providing robustness.

Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s).

Such a set-up would benefit, for example, a small start-up that provides some recommendation system service but has limited resources. Using an architecture that enables the agents (the clients' devices) to exchange data directly with each other and to do all the corresponding computations themselves could significantly decrease the infrastructural costs for the company. At the same time, without a central server, communicating all information instantly between agents would demand a lot of bandwidth.

Multi-Agent Linear Bandits In the simplest setting we consider, all the agents are trying to solve the same underlying linear bandit problem. In particular, we have a set of nodes $V$, indexed by $i$ and representing a finite set of agents. At each time $t$:

• a set of *actions* (equivalently, the *contexts*) arrives for each agent $i$, $D_t^i \subset D$, and we assume the set $D$ is a subset of the unit ball in $\mathbb{R}^d$;

• each agent $i$ chooses an action (context) $x_t^i \in D_t^i$, and receives a *reward*
$$r_t^i = (x_t^i)^T \theta + \xi_t^i,$$
where $\theta$ is some unknown *coefficient vector*, and $\xi_t^i$ is some zero-mean, $R$-sub-Gaussian noise;

• last, the agents can share information according to some protocol across a communication channel.
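For concreteness, the reward model above can be sketched in a few lines of NumPy. The dimensions, noise scale, and the Gaussian noise choice are illustrative only (any zero-mean $R$-sub-Gaussian noise fits the assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5  # dimension of the context space

# Unknown coefficient vector theta (normalised here for convenience).
theta = rng.normal(size=d)
theta /= np.linalg.norm(theta)

def draw_action_set(n_actions=10):
    """Sample a finite action (context) set D_t^i inside the unit ball."""
    x = rng.normal(size=(n_actions, d))
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, 1.0)  # shrink only vectors with norm > 1

def reward(x, noise_scale=0.1):
    """r_t^i = (x_t^i)^T theta + xi_t^i with zero-mean Gaussian noise."""
    return x @ theta + rng.normal(scale=noise_scale)

D = draw_action_set()
x = D[0]                 # the agent picks some x_t^i from D_t^i
r = reward(x)
# Instantaneous regret: expected reward of the best action minus chosen one.
rho = np.max(D @ theta) - x @ theta
```
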

We define the *instantaneous regret* at each node $i$, and, respectively, the *cumulative regret* over the whole network to be:
$$\rho_t^i := (x_t^{i,*})^T \theta - \mathbb{E}\,r_t^i, \quad \text{and} \quad R_t := \sum_{k=1}^{t} \sum_{i=1}^{|V|} \rho_k^i,$$
where $x_t^{i,*} := \arg\max_{x \in D_t^i} x^T \theta$. The aim of the agents is to minimise the rate of increase of cumulative regret. We also wish them to use a sharing protocol that does not impose much strain on the information-sharing communication channel.

Gossip protocol In a gossip protocol (see, e.g., (Kempe et al., 2003; Xiao et al., 2007; Jelasity et al., 2005; 2007)), in each round, an overlay protocol assigns to every agent another agent, with which it can share information. After sharing, the agents aggregate the information and, based on that, they make their corresponding decisions in the next round. In many areas of distributed learning and computation, gossip protocols have offered a good compromise between low communication costs and algorithm performance. Using such a protocol in the multi-agent bandit setting, one faces two major challenges.

First, information sharing is not perfect, since each agent acquires information from only one other (randomly chosen) agent per round. This introduces a bias through the unavoidable doubling of data points. The solution is to mitigate this by using a delay (typically of $O(\log t)$) on the time at which information gathered is used. After this delay, the information is sufficiently mixed among the agents, and the bias vanishes.
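The mixing effect can be illustrated with a small simulation; the agent count and number of rounds below are arbitrary. Each round, agents are paired uniformly at random and average their weights, so the weight a single data point carries at each agent concentrates around 1:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 16  # number of agents |V| (even, so everyone is paired each round)

# Weight of a single data point at each agent: initially only agent 0
# holds it (with total weight n); pairwise averaging conserves the total.
w = np.zeros(n)
w[0] = float(n)
initial_spread = w.var()

for _ in range(8):  # a few rounds of uniformly random pairing
    perm = rng.permutation(n)
    for a, b in zip(perm[::2], perm[1::2]):
        w[a] = w[b] = (w[a] + w[b]) / 2.0
```

After the loop, the total weight is still $|V|$ but it is spread much more evenly across the agents, which is exactly why a delayed use of buffered data removes the bias.
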

Second, in order to realise this delay, it is necessary to store information in a buffer and only use it to make decisions after the delay has passed. In (Szörényi et al., 2013) this was achieved by introducing an epoch structure into their algorithm, and emptying the buffers at the end of each epoch.

The Distributed Confidence Ball Algorithm (DCB) We use a gossip-based information sharing protocol to produce a distributed variant of the generic Confidence Ball (CB) algorithm (Abbasi-Yadkori et al., 2011; Dani et al., 2008; Li et al., 2010). Our approach is similar to (Szörényi et al., 2013), where the authors produced a distributed ε-greedy algorithm for the simpler multi-armed bandit problem. However, their results do not generalise easily, and thus significant new analysis is needed. One reason is that the linear setting introduces serious complications in the analysis of the delay effect mentioned in the previous paragraphs. Additionally, their algorithm is epoch-based, whereas we use a more natural and simpler algorithmic structure. The downside is that the size of the buffers of our algorithm grows with time. However, our analyses transfer easily to the epoch approach too. As the rate of growth is logarithmic, our algorithm is still efficient over a very long time-scale.

The simplifying assumption so far is that all agents are solving the same underlying bandit problem, i.e. finding the same unknown $\theta$-vector. This, however, is often unrealistic, and so we relax it in our next setup. While it may have uses in special cases, DCB and its analysis can be considered as a base for providing an algorithm in this more realistic setup, where some variation in $\theta$ is allowed across the network.

Clustered Linear Bandits Proposed in (Gentile et al., 2014; Li et al., 2016a;b), this has recently proved to be a very successful model for recommendation problems with massive numbers of users. It comprises a multi-agent linear bandit model in which the agents' $\theta$-vectors are allowed to vary across a clustering. This clustering presents an additional challenge: the groups of agents sharing the same underlying bandit problem must be found before information sharing can accelerate the learning process. Formally, let $\{U^k\}_{k=1,\dots,M}$ be a clustering of $V$, assume some coefficient vector $\theta^k$ for each $k$, and, for agent $i \in U^k$, let the reward of action $x_t^i$ be given by
$$r_t^i = (x_t^i)^T \theta^k + \xi_t^i.$$

Both clusters and coefficient vectors are assumed to be initially unknown, and so need to be learnt on the fly.

The Distributed Clustering Confidence Ball Algorithm (DCCB) The paper (Gentile et al., 2014) proposes the initial centralised approach to the problem of clustering linear bandits. Their approach is to begin with a single cluster, and then incrementally prune edges when the available information suggests that two agents belong to different clusters. We show how to use a gossip-based protocol to give a distributed variant of this algorithm, which we call DCCB.

Our main contributions In Theorems 1 and 6 we show that our algorithms DCB and DCCB achieve, in the multi-agent and clustered settings, respectively, near-optimal improvements in the regret rates. In particular, they are of order almost $\sqrt{|V|}$ better than applying CB without information sharing, while still keeping communication costs low. Our findings are demonstrated by experiments on real-world benchmark data.

2. Linear Bandits and the DCB Algorithm

The generic Confidence Ball (CB) algorithm is designed for a single-agent linear bandit problem (i.e. $|V| = 1$). The algorithm maintains a confidence ball $C_t \subset \mathbb{R}^d$ within which it believes the true parameter $\theta$ lies with high probability. This confidence ball is computed from the observation pairs $(x_k, r_k)_{k=1,\dots,t}$ (for the sake of simplicity, we drop the agent index $i$). Typically, the covariance matrix $A_t = \sum_{k=1}^{t} x_k x_k^T$ and the b-vector $b_t = \sum_{k=1}^{t} r_k x_k$ are sufficient statistics to characterise this confidence ball. Then, given its current action set $D_t$, the agent selects the optimistic action, assuming that the true parameter sits in $C_t$, i.e. $(x_t, \tilde\theta) = \arg\max_{(x,\theta') \in D_t \times C_t} x^T \theta'$. Pseudo-code for CB is given in Appendix A.1.
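As a concrete sketch, the selection rule can be written with the standard closed form for the inner maximisation over an ellipsoidal confidence ball. The radius `beta` and the toy numbers are illustrative, not the paper's exact confidence radius:

```python
import numpy as np

def cb_select(A, b, D, beta):
    """Pick the optimistic action from a confidence ball.

    A: d x d matrix I + sum_k x_k x_k^T; b: sum_k r_k x_k; D: (n, d) array
    of available actions; beta: radius of the confidence ball. Maximising
    x^T theta' over theta' in the ball centred at theta_hat = A^{-1} b has
    the closed form x^T theta_hat + beta * ||x||_{A^{-1}}.
    """
    A_inv = np.linalg.inv(A)
    theta_hat = A_inv @ b
    widths = np.sqrt(np.einsum('nd,de,ne->n', D, A_inv, D))  # ||x||_{A^-1}
    return int(np.argmax(D @ theta_hat + beta * widths))

def cb_update(A, b, x, r):
    """Rank-one update of the sufficient statistics after observing (x, r)."""
    return A + np.outer(x, x), b + r * x

# Toy usage: start from A_0 = I, b_0 = 0 and play one round.
d = 3
A, b = np.eye(d), np.zeros(d)
D = np.eye(d)          # three unit actions along the coordinate axes
i = cb_select(A, b, D, beta=1.0)
A, b = cb_update(A, b, D[i], r=0.5)
```
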


Gossip Sharing Protocol for DCB We assume that the agents are sharing across a peer to peer network, i.e. every agent can share information with every other agent, but every agent can communicate with only one other agent per round. In our algorithms, each agent $i$ needs to maintain

(1) a *buffer* (an ordered set) $\mathcal{A}_t^i$ of covariance matrices and an *active* covariance matrix $\tilde A_t^i$,

(2) a *buffer* $\mathcal{B}_t^i$ of b-vectors and an *active* b-vector $\tilde b_t^i$.

Initially, we set, for all $i \in V$, $\tilde A_0^i = I$ and $\tilde b_0^i = 0$. These active objects are used by the algorithm as sufficient statistics from which to calculate confidence balls, and summarise only information gathered before or during time $\tau(t)$, where $\tau$ is an arbitrary monotonically increasing function satisfying $\tau(t) < t$. The buffers are initially set to $\mathcal{A}_0^i = \emptyset$ and $\mathcal{B}_0^i = \emptyset$. For each $t > 1$, each agent $i$ shares and updates its buffers as follows:

(1) a random permutation $\sigma$ of the numbers $1, \dots, |V|$ is chosen uniformly at random in a decentralised manner among the agents;¹

(2) the buffers of $i$ are then updated by averaging its buffers with those of $\sigma(i)$, and then extending them using the current observations:²
$$\mathcal{A}_{t+1}^i = \left(\tfrac12\bigl(\mathcal{A}_t^i + \mathcal{A}_t^{\sigma(i)}\bigr)\right) \oplus \left(x_{t+1}^i (x_{t+1}^i)^T\right), \quad \mathcal{B}_{t+1}^i = \left(\tfrac12\bigl(\mathcal{B}_t^i + \mathcal{B}_t^{\sigma(i)}\bigr)\right) \oplus \left(r_{t+1}^i x_{t+1}^i\right),$$
$$\tilde A_{t+1}^i = \tfrac12\bigl(\tilde A_t^i + \tilde A_t^{\sigma(i)}\bigr), \quad \text{and} \quad \tilde b_{t+1}^i = \tfrac12\bigl(\tilde b_t^i + \tilde b_t^{\sigma(i)}\bigr);$$

(3) if the length $|\mathcal{A}_{t+1}^i|$ exceeds $t - \tau(t)$, the first element of $\mathcal{A}_{t+1}^i$ is added to $\tilde A_{t+1}^i$ and deleted from $\mathcal{A}_{t+1}^i$; $\mathcal{B}_{t+1}^i$ and $\tilde b_{t+1}^i$ are treated similarly.
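A minimal sketch of one such sharing step follows. `Agent` and `gossip_round` are illustrative names, the pairing of agents is taken as given (in DCB it comes from the decentralised random permutation), and the averaging of the active statistics is assumed to mirror that of the buffers:

```python
import numpy as np
from collections import deque

class Agent:
    """Per-agent DCB state: buffers of delayed statistics plus the active
    statistics actually used to build confidence balls (a sketch)."""
    def __init__(self, d):
        self.A_buf = deque()      # buffer of covariance matrices
        self.B_buf = deque()      # buffer of b-vectors
        self.A_act = np.eye(d)    # active covariance matrix  (A tilde)
        self.b_act = np.zeros(d)  # active b-vector           (b tilde)

def gossip_round(a, b, obs_a, obs_b, buf_len):
    """One pairwise step for matched agents a and b: average buffers and
    active statistics entry-wise, append each agent's fresh observation
    (x, r), and spill the oldest buffer entry into the active statistics
    once the buffer length exceeds buf_len = t - tau(t)."""
    A_avg = [(p + q) / 2.0 for p, q in zip(a.A_buf, b.A_buf)]
    B_avg = [(p + q) / 2.0 for p, q in zip(a.B_buf, b.B_buf)]
    A_act = (a.A_act + b.A_act) / 2.0
    b_act = (a.b_act + b.b_act) / 2.0
    for agent, (x, r) in ((a, obs_a), (b, obs_b)):
        agent.A_buf = deque(A_avg + [np.outer(x, x)])
        agent.B_buf = deque(B_avg + [r * x])
        agent.A_act, agent.b_act = A_act.copy(), b_act.copy()
        if len(agent.A_buf) > buf_len:            # the delay has expired:
            agent.A_act += agent.A_buf.popleft()  # fold the oldest entry
            agent.b_act += agent.B_buf.popleft()  # into the active stats
```
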

In this way, each buffer remains of size at most $t - \tau(t)$, and contains only information gathered after time $\tau(t)$. The result is that, after $t$ rounds of sharing, the current covariance matrices and b-vectors used by the algorithm to make decisions have the form:
$$\tilde A_t^i := I + \sum_{t'=1}^{\tau(t)} \sum_{i'=1}^{|V|} w_{i',t'}^{i,t}\, x_{t'}^{i'} (x_{t'}^{i'})^T, \quad \text{and} \quad \tilde b_t^i := \sum_{t'=1}^{\tau(t)} \sum_{i'=1}^{|V|} w_{i',t'}^{i,t}\, r_{t'}^{i'} x_{t'}^{i'},$$
where the weights $w_{i',t'}^{i,t}$ are random variables which are unknown to the algorithm. Importantly for our analysis, as a result of the overlay protocol's uniformly random choice of $\sigma$, they are identically distributed (i.d.) for each fixed pair $(t, t')$, and $\sum_{i' \in V} w_{i',t'}^{i,t} = |V|$. If information sharing was perfect at each time step, then the current covariance matrix could be computed using all the information gathered by all the agents, and would be:
$$A_t := I + \sum_{i'=1}^{|V|} \sum_{t'=1}^{t} x_{t'}^{i'} (x_{t'}^{i'})^T. \quad (1)$$

¹This can be achieved in a variety of ways.

²The symbol $\oplus$ denotes the concatenation operation on two ordered sets: if $x = (a, b, c)$ and $y = (d, e, f)$, then $x \oplus y = (a, b, c, d, e, f)$, and $y \oplus x = (d, e, f, a, b, c)$.

DCB algorithm The OFUL algorithm (Abbasi-Yadkori et al., 2011) is an improvement of the confidence ball algorithm from (Dani et al., 2008), which assumes that the confidence balls $C_t$ can be characterised by $A_t$ and $b_t$. In the DCB algorithm, each agent $i \in V$ maintains a confidence ball $C_t^i$ for the unknown parameter $\theta$ as in the OFUL algorithm, but calculated from $\tilde A_t^i$ and $\tilde b_t^i$. It then chooses its action $x_t^i$ to satisfy $(x_t^i, \bar\theta_t^i) = \arg\max_{(x,\theta') \in D_t^i \times C_t^i} x^T \theta'$, and receives a reward $r_t^i$. Finally, it shares its information buffer according to the sharing protocol above. Pseudo-code for DCB is given in Appendix A.1, and in Algorithm 1.

2.1. Results for DCB

Theorem 1. Let $\tau(\cdot) : t \to t - 4\log(|V|^{3/2}t)$. Then, with probability $1 - \delta$, the regret of DCB is bounded by
$$R_t \le \left(N(\delta)|V| + \nu(|V|, d, t)\right)\|\theta\|_2 + 4e^2(\beta(t) + 4R)\sqrt{|V|\, t \ln\left((1 + |V|t/d)^d\right)},$$
where $\nu(|V|, d, t) := (d+1)d^2(4|V|\ln(|V|^{3/2}t))^3$, $N(\delta) := \sqrt{3}/\bigl((1 - 2^{-\frac14})\sqrt{\delta}\bigr)$, and
$$\beta(t) := R\sqrt{2\ln\left((1 + |V|t/d)^d/\delta\right)} + \|\theta\|_2. \quad (2)$$

The term $\nu(|V|, d, t)$ describes the loss compared to the centralised algorithm due to the delay in using information, while $N(\delta)|V|$ describes the loss due to the incomplete mixing of the data across the network.

If the agents implement CB independently and do not share any information, which we call CB-NoSharing, then it follows from the results in (Abbasi-Yadkori et al., 2011) that the equivalent regret bound would be
$$R_t \le |V|\, \beta(t) \sqrt{t \ln\left((1 + t/d)^d\right)}. \quad (3)$$
Comparing Theorem 1 with (3) tells us that, after an initial "burn in" period, the gain in regret performance of DCB over CB-NoSharing is of order almost $\sqrt{|V|}$.


Corollary 2. We can recover a bound in expectation from Theorem 1, by using the value $\delta = 1/\sqrt{|V|t}$:
$$\mathbb{E}[R_t] \le O(t^{\frac14}) + \sqrt{|V|t}\,\|\theta\|_2 + 4e^2\left(R\sqrt{\ln\left((1 + |V|t/d)^d \sqrt{|V|t}\right)} + \|\theta\|_2 + 4R\right) \times \sqrt{|V|\, t \ln\left((1 + |V|t/d)^d\right)}.$$

This shows that DCB exhibits asymptotically optimal regret performance, up to log factors, in comparison with any algorithm that can share its information perfectly between agents at each round.

COMMUNICATION COMPLEXITY

If the agents communicate their information to each other at each round without a central server, then every agent would need to communicate their chosen action and reward to every other agent at each round, giving a communication cost of order $d|V|^2$ per round. We call such an algorithm CB-InstSharing. Under the gossip protocol we propose, each agent requires at most $O(\log_2(|V|t)\,d^2|V|)$ bits to be communicated per round. Therefore, a significant communication cost reduction is gained when $\log(|V|t)\,d \ll |V|$.

Using an epoch-based approach, as in (Szörényi et al., 2013), the per-round communication cost of the gossip protocol becomes $O(d^2|V|)$. This improves efficiency over any horizon, requiring only that $d \ll |V|$, and the proofs of the regret performance are simple modifications of those for DCB. However, in comparison with growing buffers this is only an issue after $O(\exp(|V|))$ rounds, and typically $|V|$ is large.

While DCB has a clear communication advantage over CB-InstSharing, there are other potential approaches to this problem. For example, instead of randomised neighbour sharing one can use a deterministic protocol such as Round-Robin (RR), which can have the same low communication costs as DCB. However, the regret bound for RR suffers from a naturally larger delay in the network than DCB.

Moreover, attempting to track potential doubling of data points when using a gossip protocol, instead of employing a delay, leads back to a communication cost of order $|V|^2$ per round. More detail is included in Appendix A.2.
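To get a feel for the orders involved, a back-of-the-envelope comparison follows; constants are dropped and the function names and example sizes are illustrative:

```python
import math

# Per-round communication costs, up to constants, for the three schemes
# discussed above (n = |V| agents, d = context dimension, t = round).

def cost_inst_sharing(n, d):
    """CB-InstSharing: every agent sends (x, r) to every other agent."""
    return d * n ** 2

def cost_dcb(n, d, t):
    """DCB gossip with growing buffers: O(log(n t) d^2 n) per round."""
    return math.log2(n * t) * d ** 2 * n

def cost_epoch_gossip(n, d):
    """Epoch-based gossip (Szorenyi et al., 2013): O(d^2 n) per round."""
    return d ** 2 * n

# The gossip protocol wins whenever log(n t) * d is much smaller than n:
n, d, t = 10_000, 20, 100_000
```
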

PROOF OF THEOREM 1

In the analysis we show that the bias introduced by imperfect information sharing is mitigated by delaying the inclusion of the data in the estimation of the parameter $\theta$. The proof builds on the analysis in (Abbasi-Yadkori et al., 2011). The emphasis here is to show how to handle the extra difficulty stemming from imperfect information sharing, which results in the influence of the various rewards at the various peers being unbalanced and appearing with a random delay. Proofs of Lemmas 3 and 4, and of Proposition 1, are crucial, but technical, and are deferred to Appendix A.3.

Step 1: Define modified confidence ellipsoids. First we need a version of the confidence ellipsoid theorem given in (Abbasi-Yadkori et al., 2011) that incorporates the bias introduced by the random weights:

Proposition 1. Let $\delta > 0$, $\tilde\theta_t^i := (\tilde A_t^i)^{-1} \tilde b_t^i$, $W(\tau) := \max\{w_{i',t'}^{i,t} : t, t' \le \tau,\ i, i' \in V\}$, and let
$$C_t^i := \left\{x \in \mathbb{R}^d : \|\tilde\theta_t^i - x\|_{\tilde A_t^i} \le \|\theta\|_2 + W(\tau(t))\, R\sqrt{2\log\left(\det(\tilde A_t^i)^{\frac12}/\delta\right)}\right\}. \quad (4)$$
Then with probability $1 - \delta$, $\theta \in C_t^i$.

In the rest of the proof we assume that $\theta \in C_t^i$.

Step 2: Instantaneous regret decomposition. Denote $(x_t^i, \bar\theta_t^i) = \arg\max_{x \in D_t^i, y \in C_t^i} x^T y$. Then we can decompose the instantaneous regret, following a classic argument (see the proof of Theorem 3 in (Abbasi-Yadkori et al., 2011)):
$$\rho_t^i = (x_t^{i,*})^T \theta - (x_t^i)^T \theta \le (x_t^i)^T \bar\theta_t^i - (x_t^i)^T \theta = (x_t^i)^T\left[\left(\bar\theta_t^i - \tilde\theta_t^i\right) + \left(\tilde\theta_t^i - \theta\right)\right]$$
$$\le \|x_t^i\|_{(\tilde A_t^i)^{-1}}\left(\left\|\bar\theta_t^i - \tilde\theta_t^i\right\|_{\tilde A_t^i} + \left\|\tilde\theta_t^i - \theta\right\|_{\tilde A_t^i}\right). \quad (5)$$

Step 3: Control the bias. The norm differences inside the brackets of the regret decomposition are bounded through (4) in terms of the matrices $\tilde A_t^i$. We would like, instead, to have the regret decomposition in terms of the matrix $A_t$ (which is defined in (1)). To this end, we give some lemmas showing that using the matrices $\tilde A_t^i$ is almost the same as using $A_t$. These lemmas involve elementary matrix analysis, but are crucial for understanding the impact of imperfect information sharing on the final regret bounds.

Step 3a: Control the bias coming from the weight imbalance.

Lemma 3 (Bound on the influence of general weights). For all $i \in V$ and $t > 0$,
$$\|x_t^i\|^2_{(\tilde A_t^i)^{-1}} \le e^{\sum_{t'=1}^{\tau(t)} \sum_{i'=1}^{|V|} \left|w_{i',t'}^{i,t} - 1\right|}\, \|x_t^i\|^2_{(A_{\tau(t)})^{-1}}, \quad \text{and} \quad \det\left(\tilde A_t^i\right) \le e^{\sum_{t'=1}^{\tau(t)} \sum_{i'=1}^{|V|} \left|w_{i',t'}^{i,t} - 1\right|}\, \det\left(A_{\tau(t)}\right).$$

Using Lemma 4 in (Szörényi et al., 2013), by exploiting that the random weights are identically distributed (i.d.) for each fixed pair $(t, t')$ and that $\sum_{i' \in V} w_{i',t'}^{i,t} = |V|$ under our gossip protocol, we can control the random exponential constant in Lemma 3, and the upper bound $W(T)$, using the Chernoff-Hoeffding bound:

Lemma 4 (Bound on the influence of weights under our sharing protocol). Fix some constants $0 < \delta_{t'} < 1$. Then with probability $1 - \sum_{t'=1}^{\tau(t)} \delta_{t'}$,
$$\sum_{i'=1}^{|V|} \sum_{t'=1}^{\tau(t)} \left|w_{i',t'}^{i,t} - 1\right| \le |V|^{\frac32} \sum_{t'=1}^{\tau(t)} \left(2^{-(t-t')}/\delta_{t'}\right)^{\frac12}, \quad \text{and} \quad W(T) \le 1 + \max_{1 \le t' \le \tau(t)} |V|^{\frac32}\left(2^{-(t-t')}/\delta_{t'}\right)^{\frac12}.$$
In particular, for any $\delta \in (0,1)$, choosing $\delta_{t'} = \delta\, 2^{(t'-t)/2}$, with probability $1 - \delta/(|V|^3 t^2 (1 - 2^{-1/2}))$ we have
$$\sum_{i'=1}^{|V|} \sum_{t'=1}^{\tau(t)} \left|w_{i',t'}^{i,t} - 1\right| \le \frac{1}{(1 - 2^{-\frac14})\, t\sqrt{\delta}}, \quad \text{and} \quad W(\tau(t)) \le 1 + \frac{|V|^{\frac32}}{t\sqrt{\delta}}. \quad (6)$$

Thus Lemmas 3 and 4 give us control over the bias introduced by the imperfect information sharing. Combining them with Equations (4) and (5), we find that with probability $1 - \delta/(|V|^3 t^2 (1 - 2^{-1/2}))$:
$$\rho_t^i \le 2e^{C(t)}\, \|x_t^i\|_{(A_{\tau(t)})^{-1}}\, (1 + C(t)) \left[R\sqrt{2\log\left(e^{C(t)} \det\left(A_{\tau(t)}\right)^{\frac12}/\delta\right)} + \|\theta\|_2\right], \quad (7)$$
where $C(t) := 1/\bigl((1 - 2^{-\frac14})\, t\sqrt{\delta}\bigr)$.

Step 3b: Control the bias coming from the delay. Next, we need to control the bias introduced by leaving out the last $4\log(|V|^{3/2}t)$ time steps from the confidence ball estimation calculation:

Proposition 2. There can be at most
$$\nu(k) := (4|V|\log(|V|^{3/2}k))^3 (d+1)\, d\, (\operatorname{tr}(A_0) + 1) \quad (8)$$
pairs $(i, k) \in \{1, \dots, |V|\} \times \{1, \dots, t\}$ for which one of
$$\|x_k^i\|^2_{(A_{\tau(k)})^{-1}} \ge e\, \|x_k^i\|^2_{\left(A_{k-1} + \sum_{j=1}^{i-1} x_k^j (x_k^j)^T\right)^{-1}}, \quad \text{or} \quad \det\left(A_{\tau(k)}\right) \le e^{-1} \det\left(A_{k-1} + \sum_{j=1}^{i-1} x_k^j (x_k^j)^T\right)$$
holds.

Step 4: Choose constants and sum the simple regret. Defining a constant
$$N(\delta) := \frac{1}{(1 - 2^{-\frac14})\sqrt{\delta}},$$
we have, for all $k \ge N(\delta)$, $C(k) \le 1$, and so, by (7), with probability $1 - (|V|k)^{-2}\delta/(1 - 2^{-1/2})$,
$$\rho_k^i \le 2e\, \|x_k^i\|_{(A_{\tau(k)})^{-1}} \left[2R\sqrt{2\log\left(e\, \det\left(A_{\tau(k)}\right)^{\frac12}/\delta\right)} + \|\theta\|_2\right]. \quad (9)$$
Now, first applying Cauchy-Schwarz, then step 3b from above together with (9), and finally Lemma 11 from (Abbasi-Yadkori et al., 2011), yields that, with probability $1 - \left(1 + \sum_{t=1}^{\infty} (|V|t)^{-2}/(1 - 2^{-1/2})\right)\delta \ge 1 - 3\delta$,
$$R_t \le N(\delta)|V|\,\|\theta\|_2 + \left[|V|\, t \sum_{t'=N(\delta)}^{t} \sum_{i=1}^{|V|} \left(\rho_{t'}^i\right)^2\right]^{\frac12}$$
$$\le \left(N(\delta)|V| + \nu(|V|, d, t)\right)\|\theta\|_2 + 4e^2(\beta(t) + 2R)\left[|V|\, t \sum_{t'=1}^{t} \sum_{i=1}^{|V|} \|x_{t'}^i\|^2_{(A_t)^{-1}}\right]^{\frac12}$$
$$\le \left(N(\delta)|V| + \nu(|V|, d, t)\right)\|\theta\|_2 + 4e^2(\beta(t) + 2R)\sqrt{|V|\, t\, (2\log\left(\det\left(A_t\right)\right))},$$
where $\beta(\cdot)$ is as defined in (2). Replacing $\delta$ with $\delta/3$ finishes the proof.

PROOF OF PROPOSITION 2

This proof forms the major innovation in the proof of Theorem 1. Let $(y_k)_{k \ge 1}$ be any sequence of vectors such that $\|y_k\|_2 \le 1$ for all $k$, and let $B_n := B_0 + \sum_{k=1}^{n} y_k y_k^T$, where $B_0$ is some positive definite matrix.

Lemma 5. For all $t > 0$, and for any $c \in (0,1)$, we have
$$\left|\left\{k \in \{1, 2, \dots\} : \|y_k\|^2_{B_{k-1}^{-1}} > c\right\}\right| \le (d+c)\, d\, (\operatorname{tr}(B_0^{-1}) - c)/c^2.$$

Proof. We begin by showing that, for any $c \in (0,1)$,
$$\|y_k\|^2_{B_{k-1}^{-1}} > c \quad (10)$$
can be true for only finitely many $k$. Indeed, let us suppose that (10) is true for some $k$. Let $(e_i^{(k-1)})_{1 \le i \le d}$ be the orthonormal eigenbasis for $B_{k-1}$, and, therefore, also for $B_{k-1}^{-1}$, and write $y_k = \sum_{i=1}^{d} \alpha_i e_i^{(k-1)}$. Let, also, $(\lambda_i^{(k-1)})$ be the eigenvalues of $B_{k-1}$. Then,
$$c < y_k^T B_{k-1}^{-1} y_k = \sum_{i=1}^{d} \frac{\alpha_i^2}{\lambda_i^{(k-1)}} \le \operatorname{tr}(B_{k-1}^{-1}), \quad \Longrightarrow \quad \exists j \in \{1, \dots, d\} : \frac{\alpha_j^2}{\lambda_j^{(k-1)}} > \frac{c}{d},\ \lambda_j^{(k-1)} < \frac{d}{c},$$
where we have used that $\alpha_i^2 \le 1$ for all $i$, since $\|y_k\|_2 \le 1$. Now,
$$\operatorname{tr}(B_{k-1}^{-1}) - \operatorname{tr}(B_k^{-1}) = \operatorname{tr}(B_{k-1}^{-1}) - \operatorname{tr}\bigl((B_{k-1} + y_k y_k^T)^{-1}\bigr) \ge \operatorname{tr}(B_{k-1}^{-1}) - \operatorname{tr}\bigl((B_{k-1} + \alpha_j^2 e_j e_j^T)^{-1}\bigr)$$
$$= \frac{1}{\lambda_j^{(k-1)}} - \frac{1}{\lambda_j^{(k-1)} + \alpha_j^2} = \frac{\alpha_j^2}{\lambda_j^{(k-1)}\bigl(\lambda_j^{(k-1)} + \alpha_j^2\bigr)} \ge \bigl(d^2 c^{-2} + d c^{-1}\bigr)^{-1} = \frac{c^2}{d(d+c)}.$$
So we have shown that (10) implies that
$$\operatorname{tr}(B_{k-1}^{-1}) > c \quad \text{and} \quad \operatorname{tr}(B_{k-1}^{-1}) - \operatorname{tr}(B_k^{-1}) > \frac{c^2}{d(d+c)}.$$
Since $\operatorname{tr}(B_0^{-1}) \ge \operatorname{tr}(B_{k-1}^{-1}) \ge \operatorname{tr}(B_k^{-1}) \ge 0$ for all $k$, it follows that (10) can be true for at most $(d+c)\, d\, (\operatorname{tr}(B_0^{-1}) - c)\, c^{-2}$ different $k$.

Now, using an argument similar to the proof of Lemma 3, for all $k < t$,
$$\|y_{k+1}\|_{B_{\tau(k)}^{-1}} \le e^{\sum_{s=\tau(k)+1}^{k} \|y_{s+1}\|^2_{B_s^{-1}}}\, \|y_{k+1}\|_{B_k^{-1}}, \quad \text{and} \quad \det\left(B_{\tau(t)}\right) \ge e^{-\sum_{k=\tau(t)+1}^{t} \|y_k\|^2_{B_k^{-1}}}\, \det(B_t).$$
Therefore,
$$\|y_{k+1}\|_{B_{\tau(k)}^{-1}} \ge c\, \|y_{k+1}\|_{B_k^{-1}} \quad \text{or} \quad \det(B_{\tau(k)}) \le c^{-1} \det(B_k) \quad \Longrightarrow \quad \sum_{s=\tau(k)}^{k-1} \|y_{s+1}\|^2_{B_s^{-1}} \ge \ln(c).$$
However, according to Lemma 5, there can be at most
$$\nu(t) := \left(d + \frac{\ln(c)}{\Delta(t)}\right) d \left(\operatorname{tr}\left(B_0^{-1}\right) - \frac{\ln(c)}{\Delta(t)}\right) \left(\frac{\Delta(t)}{\ln(c)}\right)^2$$
times $s \in \{1, \dots, t\}$ such that $\|y_{s+1}\|^2_{B_s^{-1}} \ge \ln(c)/\Delta(t)$, where $\Delta(t) := \max_{1 \le k \le t}\{k - \tau(k)\}$. Hence $\sum_{s=\tau(k)}^{k-1} \|y_{s+1}\|^2_{B_s^{-1}} \ge \ln(c)$ is true for at most $\Delta(t)\,\nu(|V|, d, t)$ indices $k \in \{1, \dots, t\}$.

Finally, we finish by setting $(y_k)_{k \ge 1} = \left((x_t^i)_{i=1}^{|V|}\right)_{t \ge 1}$.

3. Clustering and the DCCB Algorithm

We now incorporate distributed clustering into the DCB algorithm. The analysis of DCB forms the backbone of the analysis of DCCB.

DCCB Pruning Protocol In order to run DCCB, each agent $i$ must maintain some local information buffers in addition to those used for DCB. These are:

(1) a local covariance matrix $A^i_{\text{local}} = A^i_{\text{local},t}$ and a local b-vector $b^i_{\text{local}} = b^i_{\text{local},t}$,

(2) and a local neighbour set $V_t^i$.

The local covariance matrix and b-vector are updated as if the agent was applying the generic (single-agent) confidence ball algorithm: $A^i_{\text{local},0} = A_0$, $b^i_{\text{local},0} = 0$,
$$A^i_{\text{local},t} = x_t^i (x_t^i)^T + A^i_{\text{local},t-1}, \quad \text{and} \quad b^i_{\text{local},t} = r_t^i x_t^i + b^i_{\text{local},t-1}.$$

Algorithm 1 Distributed Clustering Confidence Ball
Input: size of network $|V|$, $\tau : t \to t - 4\log_2 t$, $\alpha$, $\delta$.
Initialization: $\forall i \in V$, set $\tilde A_0^i = I_d$, $\tilde b_0^i = 0$, $\mathcal{A}_0^i = \mathcal{B}_0^i = \emptyset$, and $V_0^i = V$.
for $t = 0, \dots, \infty$ do
  Draw a random permutation $\sigma$ of $\{1, \dots, |V|\}$ respecting the current local clusters.
  for $i = 1, \dots, |V|$ do
    Receive action set $D_t^i$ and construct the confidence ball $C_t^i$ using $\tilde A_t^i$ and $\tilde b_t^i$.
    Choose action and receive reward: find $(x_{t+1}^i, \ast) = \arg\max_{(x,\theta') \in D_t^i \times C_t^i} x^T \theta'$, and get reward $r_{t+1}^i$ from context $x_{t+1}^i$.
    Share and update information buffers:
    if $\|\hat\theta^i_{\text{local}} - \hat\theta^{\sigma(i)}_{\text{local}}\| > c_{\text{thresh}}(t)$ then
      Update local cluster: $V_{t+1}^i = V_t^i \setminus \{\sigma(i)\}$, $V_{t+1}^{\sigma(i)} = V_t^{\sigma(i)} \setminus \{i\}$, and reset according to (13).
    else if $V_t^i = V_t^{\sigma(i)}$ then
      Set $\mathcal{A}_{t+1}^i = \left(\tfrac12(\mathcal{A}_t^i + \mathcal{A}_t^{\sigma(i)})\right) \oplus (x_{t+1}^i (x_{t+1}^i)^T)$ and $\mathcal{B}_{t+1}^i = \left(\tfrac12(\mathcal{B}_t^i + \mathcal{B}_t^{\sigma(i)})\right) \oplus (r_{t+1}^i x_{t+1}^i)$.
    else
      Update: set $\mathcal{A}_{t+1}^i = \mathcal{A}_t^i \oplus (x_{t+1}^i (x_{t+1}^i)^T)$ and $\mathcal{B}_{t+1}^i = \mathcal{B}_t^i \oplus (r_{t+1}^i x_{t+1}^i)$.
    end if
    Update local estimator: $A^i_{\text{local},t+1} = A^i_{\text{local},t} + x_{t+1}^i (x_{t+1}^i)^T$, $b^i_{\text{local},t+1} = b^i_{\text{local},t} + r_{t+1}^i x_{t+1}^i$, and $\hat\theta_{\text{local},t+1} = \left(A^i_{\text{local},t+1}\right)^{-1} b^i_{\text{local},t+1}$.
    if $|\mathcal{A}_{t+1}^i| > t - \tau(t)$ then set $\tilde A_{t+1}^i = \tilde A_t^i + \mathcal{A}_{t+1}^i(1)$ and $\mathcal{A}_{t+1}^i = \mathcal{A}_{t+1}^i \setminus \mathcal{A}_{t+1}^i(1)$; similarly for $\mathcal{B}_{t+1}^i$.
  end for
end for
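The local statistics update is the plain single-agent recursion; a minimal sketch follows (the two noiseless observations in the usage example are purely illustrative):

```python
import numpy as np

def local_update(A_local, b_local, x, r):
    """Rank-one update of the local statistics kept for the clustering
    test, separate from the gossiped (shared) statistics."""
    A_local = A_local + np.outer(x, x)
    b_local = b_local + r * x
    theta_hat = np.linalg.solve(A_local, b_local)  # local estimate of theta
    return A_local, b_local, theta_hat

d = 2
A, b = np.eye(d), np.zeros(d)   # A_local,0 = A_0 = I, b_local,0 = 0
x = np.array([1.0, 0.0])
A, b, th = local_update(A, b, x, r=1.0)
A, b, th = local_update(A, b, x, r=1.0)
# After two noiseless looks along e_1, the ridge estimate is (2/3, 0).
```
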

DCCB Algorithm Each agent's local neighbour set $V_t^i$ is initially set to $V$. At each time step $t$, agent $i$ contacts one other agent, $j$, at random from $V_t^i$, and both decide whether or not they belong to the same cluster. To do this they share local estimates, $\hat\theta_t^i = (A^i_{\text{local},t})^{-1} b^i_{\text{local},t}$ and $\hat\theta_t^j = (A^j_{\text{local},t})^{-1} b^j_{\text{local},t}$, of the unknown parameter of the bandit problem they are solving, and see if they are further apart than a threshold function $c = c_{\text{thresh}}(t)$, so that if
$$\|\hat\theta_t^i - \hat\theta_t^j\|_2 \ge c_{\text{thresh}}(t), \quad (11)$$
then $V_{t+1}^i = V_t^i \setminus \{j\}$ and $V_{t+1}^j = V_t^j \setminus \{i\}$. Here $\lambda$ is a parameter of an extra assumption that is needed, as in (Gentile et al., 2014), about the process generating the context sets $D_t^i$:

(A) Each context set $D_t^i = \{x_k\}_k$ is finite and contains i.i.d. random vectors such that, for all $k$, $\|x_k\| \le 1$ and $\mathbb{E}(x_k x_k^T)$ is full rank, with minimal eigenvalue $\lambda > 0$.

We define $c_{\text{thresh}}(t)$, as in (Gentile et al., 2014), by
$$c_{\text{thresh}}(t) := \frac{R\sqrt{2d\log(t) + 2\log(2/\delta) + 1}}{\sqrt{1 + \max\{A_\lambda(t, \delta/(4d)), 0\}}}, \quad (12)$$
where $A_\lambda(t, \delta) := \lambda t - 8\log((t+3)/\delta) - 2\sqrt{t\log((t+3)/\delta)}$.

The DCCB algorithm is essentially the same as the DCB algorithm, except that it also applies the pruning protocol described. In particular, each agent $i$, when sharing its information with another agent $j$, has three possible actions:

(1) if (11) is not satisfied and $V_t^i = V_t^j$, then the agents share simply as in the DCB algorithm;

(2) if (11) is satisfied, then both agents remove each other from their neighbour sets and reset their buffers and active matrices so that
$$\mathcal{A}^i = (0, 0, \dots, A^i_{\text{local}}),\ \mathcal{B}^i = (0, 0, \dots, b^i_{\text{local}}),\ \text{and}\ \tilde A^i = A^i_{\text{local}},\ \tilde b^i = b^i_{\text{local}}, \quad (13)$$
and similarly for agent $j$;

(3) if (11) is not satisfied but $V_t^i \ne V_t^j$, then no sharing or pruning occurs.

It is proved in the theorem below that, under this sharing and pruning mechanism, with high probability, after some finite time each agent $i$ finds its true cluster, i.e. $V_t^i = U^k$. Moreover, since the algorithm resets to its local information each time a pruning occurs, once the true clusters have been identified each cluster shares only information gathered within that cluster, thus avoiding the bias that would be introduced by sharing information gathered from outside the cluster before the clustering has been identified. Full pseudo-code for the DCCB algorithm is given in Algorithm 1, and the differences with the DCB algorithm are highlighted in blue.
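The pruning decision can be sketched as below. The constants inside `cthresh` follow the shape of Eq. (12) but should be treated as illustrative, and `prune` assumes neighbour sets are stored as Python sets:

```python
import numpy as np

def cthresh(t, d, R, lam, delta):
    """Separation threshold in the spirit of Eq. (12); lam is the minimal
    eigenvalue from assumption (A). The exact constants are illustrative."""
    log_term = np.log((t + 3) / delta)
    a_lam = lam * t - 8 * log_term - 2 * np.sqrt(t * log_term)
    num = R * np.sqrt(2 * d * np.log(t) + 2 * np.log(2 / delta) + 1)
    return num / np.sqrt(1 + max(a_lam, 0.0))

def prune(theta_i, theta_j, V_i, V_j, i, j, t, R, lam, delta):
    """If the local estimates are farther apart than the threshold, agents
    i and j remove each other from their neighbour sets."""
    d = len(theta_i)
    if np.linalg.norm(theta_i - theta_j) >= cthresh(t, d, R, lam, delta):
        V_i.discard(j)
        V_j.discard(i)
    return V_i, V_j

# Two agents with clearly different local estimates get separated:
V0, V1 = {0, 1}, {0, 1}
V0, V1 = prune(np.array([10.0, 0.0]), np.array([0.0, 0.0]),
               V0, V1, i=0, j=1, t=100, R=0.1, lam=0.5, delta=0.05)
```
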


[Figure 1: three plots (Delicious, LastFM, and MovieLens datasets), each showing the ratio of cumulative rewards of DCCB, CLUB, CB-NoSharing, and CB-InstSharing to the cumulative rewards of the random algorithm, as a function of the number of rounds.]

Figure 1. Here we plot the performance of DCCB in comparison to CLUB, CB-NoSharing and CB-InstSharing. The plots show the ratio of cumulative rewards achieved by the algorithms to the cumulative rewards achieved by the random algorithm.


3.1. Results for DCCB

Theorem 6. Assume that (A) holds, and let denote the smallest distance between the bandit parameters✓k. Then there exists a constantC=C( ,|V|, , ), such that with probability1 the total cumulative regret of cluster k when the agents employ DCCB is bounded by

Rt

maxnp

2N( ), C+ 4 log2(|V|32C)o

|Uk| +⌫(|Uk|, d, t) k✓k2

+ 4e( (t) + 3R) r

|Uk|tln⇣

(1 +|Uk|t/d)d⌘ ,

whereN and⌫are as defined in Theorem 1, and (t) :=

R r

2 ln⇣

(1 +|Uk|t/d)d

+k✓k2.

The constant $C(\gamma, |V|, \delta, \lambda)$ is the time one must wait for the true clustering to have been identified. The analysis follows this scheme: once the true clusters have been correctly identified by all nodes, within each cluster the algorithm, and thus the analysis, reduces to the case of Section 2.1. We adapt results from (Gentile et al., 2014) to show how long it will be before the true clusters are identified, with high probability. The proof is deferred to Appendices A.4 and A.5.

4. Experiments and Discussion

Experiments We closely implemented the experimental setting and dataset construction principles used in (Li et al., 2016a;b); for a detailed description we refer the reader to (Li et al., 2016a). We evaluated DCCB on three real-world datasets against its centralised counterpart CLUB, and against the benchmarks used therein, CB-NoSharing and CB-InstSharing. The LastFM dataset comprises 91 users, each of whom appears at least 95 times. The Delicious dataset has 87 users, each of whom appears at least 95 times. The MovieLens dataset contains 100 users, each of whom appears at least 250 times. The performance was measured using the ratio of cumulative reward of each algorithm to that of the predictor which chooses a random action at each time step. This is plotted in Figure 1.
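The evaluation metric is simply the running ratio of cumulative rewards; the toy reward sequences below are illustrative:

```python
import numpy as np

def reward_ratio(alg_rewards, random_rewards):
    """Per-round ratio of an algorithm's cumulative reward to that of the
    uniformly random policy, as plotted in Figure 1."""
    return np.cumsum(alg_rewards) / np.cumsum(random_rewards)

alg = np.array([1.0, 0.5, 1.0, 1.0])
rnd = np.array([0.5, 0.5, 0.5, 0.5])
ratios = reward_ratio(alg, rnd)  # a value above 1 beats random selection
```
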

From the experimental results it is clear that DCCB per- forms comparably to CLUB in practice, and both outper- form CB-NoSharing, and CB-InstSharing.

Relationship to existing literature There are several strands of research that are relevant and complementary to this work. First, there is a large literature on single-agent linear bandits, and on other more, or less, complicated bandit problem settings. There is already work on distributed approaches to multi-agent, multi-armed bandits, not least (Szörényi et al., 2013), which examines ε-greedy strategies over a peer to peer network and provided an initial inspiration for this current work. The paper (Kalathil et al., 2014) examines the extreme case when there is no communication channel across which the agents can communicate, and all communication must be performed through observation of action choices alone. Another approach to the multi-armed bandit case, (Nayyar et al., 2015), directly incorporates the communication cost into the regret.

Second, there are several recent advances regarding state-of-the-art methods for clustering of bandits. The work (Li et al., 2016a) is a faster variant of (Gentile et al., 2014) which adopts a boosted training stage. In (Li et al., 2016b) the authors not only cluster the users, but also cluster the items in the collaborative filtering setting, with a sharp regret analysis.

Finally, the paper (Tekin & van der Schaar, 2013) treats a setting similar to ours, in which agents attempt to solve contextual bandit problems in a distributed setting. They present two algorithms, one of which is a distributed version of the approach taken in (Slivkins, 2014), and show that the distributed approach achieves at least as good asymptotic regret performance as the centralised algorithm. However, rather than sharing information across a limited communication channel, they allow each agent only to ask another agent to choose their action for them. This difference in settings is reflected in worse regret bounds, which are of order O(T^{2/3}) at best.

Discussion Our analysis is tailored to adapt proofs from (Abbasi-Yadkori et al., 2011) about generic confidence ball algorithms to a distributed setting. However, many of the elements of these proofs, including Propositions 1 and 2, could be reused to provide similar asymptotic regret guarantees for distributed versions of other bandit algorithms, e.g., the Thompson sampling algorithms of (Agrawal & Goyal, 2013; Kaufmann et al., 2012; Russo & Van Roy, 2014).

Both DCB and DCCB are synchronous algorithms. The work on distributed computation through gossip algorithms in (Boyd et al., 2006) could alleviate this issue. The current pruning algorithm for DCCB guarantees that techniques from (Szörényi et al., 2013) can be applied to our algorithms. However, the results in (Boyd et al., 2006) are more powerful, and could be used even when the agents only identify a sub-network of the true clustering.
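To illustrate the asynchronous primitive that (Boyd et al., 2006) studies, the following is a minimal sketch of pairwise randomized gossip averaging; the function name and parameters are illustrative, not part of our algorithms:

```python
import random

def gossip_average(values, num_rounds, rng=None):
    """Pairwise randomized gossip in the style of (Boyd et al., 2006):
    at each step a uniformly random pair of nodes replaces both of
    their values by their average. The node values converge to the
    global average using only local, pairwise communication."""
    rng = rng or random.Random(0)
    x = list(values)
    n = len(x)
    for _ in range(num_rounds):
        i, j = rng.sample(range(n), 2)  # pick a random pair of peers
        avg = (x[i] + x[j]) / 2.0
        x[i] = x[j] = avg  # both peers adopt the pairwise average
    return x

# After many rounds, every node holds (approximately) the network average.
x = gossip_average(list(range(10)), num_rounds=2000)
```

Note that each pairwise exchange conserves the sum of the values, so the common limit is the true network average.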

Furthermore, there are other interesting existing algorithms for performing clustering of bandits for recommender systems, such as COFIBA in (Li et al., 2016b). It would be interesting to understand how general the techniques applied here to CLUB are.


Acknowledgments

We would like to thank the anonymous reviewers for their helpful comments. We would also like to thank Gergely Neu for very useful discussions. The first author thanks the support of the EPSRC Autonomous Intelligent Systems project EP/I011587. The third author thanks the support of MIUR, QCRI-HBKU, and an Amazon Research Grant.

The research leading to these results has received funding from the European Research Council under the European Union’s Seventh Framework Programme (FP/2007-2013) / ERC Grant Agreement n. 306638.

References

Abbasi-Yadkori, Yasin, Pál, Dávid, and Szepesvári, Csaba. Improved algorithms for linear stochastic bandits. In NIPS, pp. 2312–2320, 2011.

Agrawal, Shipra and Goyal, Navin. Thompson sampling for contextual bandits with linear payoffs. In ICML, 2013.

Boyd, Stephen, Ghosh, Arpita, Prabhakar, Balaji, and Shah, Devavrat. Randomized gossip algorithms. IEEE/ACM Transactions on Networking (TON), 14(SI):2508–2530, 2006.

Dani, Varsha, Hayes, Thomas P, and Kakade, Sham M. Stochastic linear optimization under bandit feedback. In COLT, pp. 355–366, 2008.

Gentile, Claudio, Li, Shuai, and Zappella, Giovanni. Online clustering of bandits. In ICML, 2014.

Hao, Fei, Li, Shuai, Min, Geyong, Kim, Hee-Cheol, Yau, Stephen S, and Yang, Laurence T. An efficient approach to generating location-sensitive recommendations in ad-hoc social network environments. IEEE Transactions on Services Computing, 2015.

Jelasity, M., Montresor, A., and Babaoglu, O. Gossip-based aggregation in large dynamic networks. ACM Transactions on Computer Systems, 23(3):219–252, August 2005.

Jelasity, M., Voulgaris, S., Guerraoui, R., Kermarrec, A.-M., and van Steen, M. Gossip-based peer sampling. ACM Transactions on Computer Systems, 25(3):8, 2007.

Kalathil, Dileep, Nayyar, Naumaan, and Jain, Rahul. Decentralized learning for multiplayer multiarmed bandits. IEEE Transactions on Information Theory, 60(4):2331–2345, 2014.

Kaufmann, Emilie, Korda, Nathaniel, and Munos, Rémi. Thompson sampling: An asymptotically optimal finite-time analysis. In Algorithmic Learning Theory, pp. 199–213. Springer, 2012.

Kempe, D., Dobra, A., and Gehrke, J. Gossip-based computation of aggregate information. In Proc. 44th Annual IEEE Symposium on Foundations of Computer Science (FOCS'03), pp. 482–491. IEEE Computer Society, 2003.

Li, Lihong, Chu, Wei, Langford, John, and Schapire, Robert E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pp. 661–670. ACM, 2010.

Li, Shuai, Hao, Fei, Li, Mei, and Kim, Hee-Cheol. Medicine rating prediction and recommendation in mobile social networks. In Proceedings of the International Conference on Grid and Pervasive Computing, 2013.

Li, Shuai, Gentile, Claudio, and Karatzoglou, Alexandros. Graph clustering bandits for recommendation. CoRR:1605.00596, 2016a.

Li, Shuai, Karatzoglou, Alexandros, and Gentile, Claudio. Collaborative filtering bandits. In Proceedings of the 39th International ACM SIGIR Conference on Information Retrieval (SIGIR'16), 2016b.

Nayyar, Naumaan, Kalathil, Dileep, and Jain, Rahul. On regret-optimal learning in decentralized multi-player multi-armed bandits. CoRR:1505.00553, 2015.

Russo, Daniel and Van Roy, Benjamin. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014.

Slivkins, Aleksandrs. Contextual bandits with similarity information. JMLR, 2014.

Szörényi, Balázs, Busa-Fekete, Róbert, Hegedűs, István, Ormándi, Róbert, Jelasity, Márk, and Kégl, Balázs. Gossip-based distributed stochastic bandit algorithms. In ICML, pp. 19–27, 2013.

Tekin, Cem and van der Schaar, Mihaela. Distributed online learning via cooperative contextual bandits. IEEE Trans. Signal Processing, 2013.

Xiao, L., Boyd, S., and Kim, S.-J. Distributed average consensus with least-mean-square deviation. Journal of Parallel and Distributed Computing, 67(1):33–46, January 2007.
