“Broadening the knowledge base and supporting the long term professional sustainability of the Research University Centre of Excellence at the University of Szeged by ensuring the rising generation of excellent scientists.”

Doctoral School of Mathematics and Computer Science
Stochastic Days in Szeged, 27.07.2012

Online-to-Confidence-Set Conversions and Application to Sparse Stochastic Bandits
Csaba Szepesvári (University of Alberta)

TÁMOP-4.2.2/B-10/1-2010-0012 project
Online-to-Confidence-Set Conversions and Application to Sparse Stochastic Bandits

Csaba Szepesvári
Department of Computing Science, University of Alberta
csaba.szepesvari@ualberta.ca

July 27, 2013
Stochastic Days – honoring András Krámli's 70th birthday

Joint work with Yasin Abbasi-Yadkori and Dávid Pál
Contents

▶ Linear prediction and (honest) confidence sets
    ▶ Definition
    ▶ Some previous results
    ▶ Sparsity
▶ Online-to-confidence-set conversion
    ▶ Online linear prediction
    ▶ The conversion
    ▶ Why does it work?
▶ Application to linear bandits
    ▶ Sparse linear bandits
    ▶ Results
Linear Prediction and (Honest) Confidence Sets

Getting directions from András: MDPs, ILT, Dregely
The Data

▶ $X_1, \dots, X_n \in \mathbb{R}^d$,  $Y_1, \dots, Y_n \in \mathbb{R}$
▶ $\exists\, \theta_* \in \mathbb{R}^d$ such that
    $Y_t = \langle X_t, \theta_* \rangle + \eta_t, \qquad t = 1, \dots, n$
▶ The "noise" $\eta_t$ is conditionally $R$-sub-Gaussian with some $R > 0$, i.e.,
    $\forall \lambda \in \mathbb{R}:\ \mathbb{E}\big[e^{\lambda \eta_t} \mid X_1, \dots, X_t, \eta_1, \dots, \eta_{t-1}\big] \le \exp\Big(\frac{\lambda^2 R^2}{2}\Big)$.
▶ Often $X_t$ is chosen based on $(X_1, \dots, X_{t-1})$ and $(Y_1, \dots, Y_{t-1})$.

Estimation problems:
▶ Estimate $\theta_*$ based on $((X_1, Y_1), \dots, (X_n, Y_n))$!
▶ Construct a confidence set that contains $\theta_*$ w.h.p.!
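To make the setup concrete, here is a minimal NumPy sketch (ours, not from the talk) that simulates one run of this data model; the dimension, horizon, and noise scale are arbitrary illustrative choices, and uniform noise on $[-R, R]$ is used because a zero-mean variable bounded in an interval of length $2R$ is $R$-sub-Gaussian.

```python
# Simulate Y_t = <X_t, theta*> + eta_t with conditionally R-sub-Gaussian noise.
import numpy as np

rng = np.random.default_rng(0)
n, d, R = 1000, 5, 0.5
theta_star = rng.normal(size=d)

X = rng.normal(size=(n, d))          # here X_t is exogenous; in bandits it may depend on the past
eta = rng.uniform(-R, R, size=n)     # bounded in [-R, R]  =>  R-sub-Gaussian (Hoeffding's lemma)
Y = X @ theta_star + eta             # observed responses

# A first answer to "estimate theta*": ordinary least squares on the batch.
theta_ls, *_ = np.linalg.lstsq(X, Y, rcond=None)
print("||theta_ls - theta*|| =", np.linalg.norm(theta_ls - theta_star))
```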
Sub-Gaussianity

Definition
A random variable $Z$ is $R$-sub-Gaussian for some $R \ge 0$ if
    $\forall \gamma \in \mathbb{R}:\ \mathbb{E}[e^{\gamma Z}] \le \exp\Big(\frac{\gamma^2 R^2}{2}\Big)$.

The condition implies that
▶ $\mathbb{E}[Z] = 0$
▶ $\mathrm{Var}[Z] \le R^2$

Examples:
▶ Zero-mean, bounded in an interval of length $2R$ (Hoeffding–Azuma)
▶ Zero-mean Gaussian with variance $\le R^2$
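To see why the two implications hold (a standard argument, sketched here; it is not spelled out on the slide), compare Taylor expansions of both sides of the defining inequality around $\gamma = 0$:

```latex
\mathbb{E}\big[e^{\gamma Z}\big]
  = 1 + \gamma\,\mathbb{E}[Z] + \tfrac{\gamma^2}{2}\,\mathbb{E}[Z^2] + o(\gamma^2)
  \;\le\;
  \exp\!\Big(\tfrac{\gamma^2 R^2}{2}\Big)
  = 1 + \tfrac{\gamma^2 R^2}{2} + o(\gamma^2).
```

Letting $\gamma \to 0^+$ and $\gamma \to 0^-$ forces $\mathbb{E}[Z] = 0$ (otherwise the linear term would violate the inequality for one sign of $\gamma$); comparing the quadratic terms then gives $\mathbb{E}[Z^2] \le R^2$, i.e., $\mathrm{Var}[Z] \le R^2$.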
(Honest) Confidence Sets

Given the data $((X_1, Y_1), \dots, (X_n, Y_n))$ and $0 \le \delta \le 1$, construct $C_n \subset \mathbb{R}^d$ such that
    $\Pr(\theta_* \in C_n) \ge 1 - \delta$.
Confidence Sets based on Ridge Regression

▶ Data $(X_1, Y_1), \dots, (X_n, Y_n)$ such that $Y_t \approx \langle X_t, \theta_* \rangle$
▶ Stack them into matrices: $X_{1:n}$ is $n \times d$ and $Y_{1:n}$ is $n \times 1$
▶ Ridge regression estimator:
    $\hat\theta_n = (X_{1:n}^\top X_{1:n} + \lambda I)^{-1} X_{1:n}^\top Y_{1:n}$
▶ Let $V_n = X_{1:n}^\top X_{1:n} + \lambda I$

Theorem ([AYPS11])
If $\|\theta_*\|_2 \le S$, then with probability at least $1-\delta$, for all $t$, $\theta_*$ lies in
    $C_t = \Big\{\theta :\ \|\hat\theta_t - \theta\|_{V_t} \le R\,\sqrt{2 \ln \frac{\det(V_t)^{1/2}}{\delta \det(\lambda I)^{1/2}}} + S\sqrt{\lambda}\Big\}$,
where $\|v\|_A = \sqrt{v^\top A v}$ is the matrix $A$-norm.

Proof technique: [RS70, dLS09]. Extends to separable Hilbert spaces.
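A minimal NumPy sketch (ours; the function name and default parameters are illustrative, not from the paper) of the [AYPS11] construction: it computes $\hat\theta_n$, $V_n$, and the radius of $C_n$, then checks whether a candidate $\theta$ lies inside.

```python
# Sketch of the ridge-regression confidence set of the [AYPS11] theorem.
# Assumes ||theta*||_2 <= S and R-sub-Gaussian noise, as on the slide.
import numpy as np

def ridge_confidence_set(X, Y, lam=1.0, R=1.0, S=1.0, delta=0.05):
    """Return (theta_hat, V, beta): the set is {theta : ||theta_hat - theta||_V <= beta}."""
    n, d = X.shape
    V = X.T @ X + lam * np.eye(d)                 # V_n = X_{1:n}^T X_{1:n} + lambda I
    theta_hat = np.linalg.solve(V, X.T @ Y)       # ridge estimator
    _, logdet_V = np.linalg.slogdet(V)
    logdet_lam = d * np.log(lam)                  # log det(lambda I)
    # beta = R * sqrt(2 ln( det(V)^{1/2} / (delta det(lambda I)^{1/2}) )) + S * sqrt(lambda)
    beta = R * np.sqrt(2 * (0.5 * logdet_V - 0.5 * logdet_lam - np.log(delta))) + S * np.sqrt(lam)
    return theta_hat, V, beta

# Quick check on simulated data:
rng = np.random.default_rng(0)
n, d, R = 500, 5, 0.1
theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)          # so that S = 1 holds
X = rng.normal(size=(n, d))
Y = X @ theta_star + R * rng.uniform(-1, 1, size=n)   # bounded => R-sub-Gaussian noise
theta_hat, V, beta = ridge_confidence_set(X, Y, R=R)
dist = np.sqrt((theta_hat - theta_star) @ V @ (theta_hat - theta_star))
print(f"||theta_hat - theta*||_V = {dist:.3f}   radius beta = {beta:.3f}")
```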
Comparison with Previous Confidence Sets

▶ Bound of [AYPS11]:
    $\|\hat\theta_t - \theta_*\|_{V_t} \le R\,\sqrt{2 \ln \frac{\det(V_t)^{1/2}}{\delta \det(\lambda I)^{1/2}}} + S\sqrt{\lambda}$
▶ [DHK08]: If $\|\theta_*\|_2, \|X_t\|_2 \le 1$, then for a specific $\lambda$,
    $\|\hat\theta_t - \theta_*\|_{V_t} \le R \max\Big\{\sqrt{128\, d \ln(t)\, \ln(t^2/\delta)},\ \tfrac{8}{3}\ln(t^2/\delta)\Big\}$
▶ [RT10]: If $\|X_t\|_2 \le 1$,
    $\|\hat\theta_t - \theta_*\|_{V_t} \le 2 R \kappa \sqrt{\ln t}\,\sqrt{d \ln t + \ln(t^2/\delta)} + S\sqrt{\lambda}$,
    where $\kappa = 3 + 2\ln\big((1 + \lambda d)/\lambda\big)$.

The bound of [AYPS11] doesn't depend on $t$.
Questions

▶ Are there other ways to construct confidence sets?
▶ Can we get tighter confidence sets when some special conditions are met?
▶ SPARSITY: only $p$ coordinates of $\theta_*$ are nonzero.
▶ Can we construct tighter confidence sets based on the knowledge of $p$?
▶ Least-squares (or ridge) estimators are not a good idea!
Online-to-Confidence-Set Conversion

▶ Idea: Create a confidence set based on how well an online linear prediction algorithm performs.
▶ This is a reduction!
▶ If a new prediction algorithm is discovered, or a better performance bound for an existing algorithm becomes available, we get tighter confidence sets.
▶ Hopefully it will work for the sparse case.

Encouragement: Working on my thesis
Online Linear Prediction

For $t = 1, 2, \dots$:
▶ Receive $X_t \in \mathbb{R}^d$
▶ Predict $\hat Y_t \in \mathbb{R}$
▶ Receive the correct label $Y_t \in \mathbb{R}$
▶ Suffer loss $(Y_t - \hat Y_t)^2$

Goal: Compete with the best linear predictor in hindsight.
No assumptions whatsoever on $(X_1, Y_1), (X_2, Y_2), \dots$!

There are heaps of algorithms for this problem:
▶ online gradient descent [Zin03]
▶ online least-squares [AW01, Vov01]
▶ exponentiated gradient algorithm [KW97]
▶ online LASSO (??)
▶ SeqSEW [Ger11, DT07]
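As an illustration of the protocol, here is a minimal sketch (ours) of one of the listed algorithms, online gradient descent [Zin03], applied to the squared loss; the step-size schedule and the projection radius $S$ are illustrative choices, not prescribed by the talk.

```python
# Online gradient descent for the squared loss, with projection onto {||theta||_2 <= S}.
import numpy as np

def online_gradient_descent(stream, d, S=1.0):
    """stream yields (x_t, y_t); returns the predictions and the iterates theta_t."""
    theta = np.zeros(d)
    preds, thetas = [], []
    for t, (x, y) in enumerate(stream, start=1):
        y_hat = float(theta @ x)              # predict before seeing y_t
        preds.append(y_hat)
        thetas.append(theta.copy())
        grad = 2.0 * (y_hat - y) * x          # gradient of (theta . x - y)^2 in theta
        theta = theta - grad / np.sqrt(t)     # step size ~ 1/sqrt(t)
        norm = np.linalg.norm(theta)
        if norm > S:                          # project back onto the L2 ball of radius S
            theta *= S / norm
    return np.array(preds), thetas
```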
Online Linear Prediction, cnt'd

▶ Regret with respect to a linear predictor $\theta \in \mathbb{R}^d$:
    $\rho_n(\theta) = \sum_{t=1}^n (Y_t - \hat Y_t)^2 - \sum_{t=1}^n (Y_t - \langle X_t, \theta \rangle)^2$
▶ Prediction algorithms come with "regret bounds" $B_n$:
    $\forall n:\ \rho_n(\theta) \le B_n$
▶ $B_n$ depends on $n$, $d$, $\theta$ and possibly $X_1, X_2, \dots, X_n$ and $Y_1, Y_2, \dots, Y_n$.
▶ Typically, $B_n = O(\sqrt{n})$ or $B_n = O(\log n)$.
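The regret $\rho_n(\theta)$ is directly computable from the data; the snippet below (illustrative; the predictor used in the example is a hypothetical one that always outputs 0) simply instantiates the formula above.

```python
# rho_n(theta) = sum_t (Y_t - Yhat_t)^2 - sum_t (Y_t - <X_t, theta>)^2
import numpy as np

def regret(Y, Y_hat, X, theta):
    Y, Y_hat, X = np.asarray(Y), np.asarray(Y_hat), np.asarray(X)
    return float(np.sum((Y - Y_hat) ** 2) - np.sum((Y - X @ theta) ** 2))

rng = np.random.default_rng(1)
n, d = 200, 3
theta = rng.normal(size=d)
X = rng.normal(size=(n, d))
Y = X @ theta + 0.1 * rng.normal(size=n)
print("rho_n(theta) of the all-zeros predictor:", regret(Y, np.zeros(n), X, theta))
```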
Good Regret Implies Small Risk

▶ Data: $\{(X_t, Y_t)\}_{t=1}^n$ is i.i.d., $Y_t = \langle X_t, \theta_* \rangle + \eta_t$, with $\eta_t$ $R$-sub-Gaussian.
▶ Online learning algorithm: $\mathcal{A}$ produces $\{\theta_t\}_{t=1}^n$ and predicts $\hat Y_t = \langle X_t, \theta_t \rangle$.
▶ Regret bound: $\forall n:\ \rho_n(\theta_*) \le B_n$.
▶ Risk of a vector $\theta$: $R(\theta) = \mathbb{E}[(Y_1 - \langle X_1, \theta \rangle)^2]$.

Theorem ([CBG08])
Let $\bar\theta_n = \frac{1}{n}\sum_{t=1}^n \theta_t$. Then, w.p. $1-\delta$,
    $R(\bar\theta_n) \le \frac{B_n}{n} + \frac{36}{n}\ln\frac{B_n + 3}{\delta} + \sqrt{\frac{B_n}{n^2}\ln\frac{B_n + 3}{\delta}}$.
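A throwaway helper (ours; the function name and the example regret bound are illustrative) that evaluates the [CBG08] expression; it shows numerically how a logarithmic regret bound $B_n = O(\log n)$ translates into a risk bound of order $\log(n)/n$ for the averaged iterate.

```python
# Evaluate  B_n/n + (36/n) ln((B_n+3)/delta) + sqrt((B_n/n^2) ln((B_n+3)/delta)).
import numpy as np

def risk_bound(B_n, n, delta=0.05):
    log_term = np.log((B_n + 3) / delta)
    return B_n / n + 36.0 / n * log_term + np.sqrt(B_n / n**2 * log_term)

for n in (10**2, 10**4, 10**6):
    print(n, risk_bound(10 * np.log(n), n))   # hypothetical B_n = 10 log n
```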
Online-to-Confidence-Set Conversion

▶ Data $(X_1, Y_1), \dots, (X_n, Y_n)$ where $Y_t = \langle X_t, \theta_* \rangle + \eta_t$ and $\eta_t$ is conditionally $R$-sub-Gaussian.
▶ Predictions $\hat Y_1, \hat Y_2, \dots, \hat Y_n$
▶ Regret bound $\rho_n(\theta_*) \le B_n$

Theorem (Conversion, [AYPS12])
With probability at least $1-\delta$, for all $n$, $\theta_*$ lies in
    $C_n = \Big\{\theta \in \mathbb{R}^d :\ \sum_{t=1}^n (\hat Y_t - \langle X_t, \theta \rangle)^2 \le 1 + 2 B_n + 32 R^2 \ln\frac{R\sqrt{8} + \sqrt{1 + B_n}}{\delta}\Big\}$.
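A minimal sketch (ours, not the paper's code) of what the conversion yields in practice: given the online algorithm's predictions, its regret bound $B_n$, and $R$, $\delta$, the set $C_n$ is a single quadratic constraint in $\theta$ (so an ellipsoid-shaped, possibly degenerate, set), which any candidate $\theta$ can be tested against directly.

```python
# Membership test for the converted confidence set C_n of the [AYPS12] theorem.
import numpy as np

def conversion_radius(B_n, R=1.0, delta=0.05):
    """Right-hand side: 1 + 2 B_n + 32 R^2 ln((R sqrt(8) + sqrt(1+B_n)) / delta)."""
    return 1.0 + 2.0 * B_n + 32.0 * R**2 * np.log((R * np.sqrt(8) + np.sqrt(1.0 + B_n)) / delta)

def in_confidence_set(theta, X, Y_hat, B_n, R=1.0, delta=0.05):
    """Check  sum_t (Yhat_t - <X_t, theta>)^2 <= conversion_radius(...)."""
    lhs = np.sum((np.asarray(Y_hat) - np.asarray(X) @ theta) ** 2)
    return lhs <= conversion_radius(B_n, R, delta)
```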
Proof Sketch

Algebra: With probability 1, due to the regret bound $B_n$,
    $\sum_{t=1}^n (\hat Y_t - \langle X_t, \theta_* \rangle)^2 \le B_n + 2\underbrace{\sum_{t=1}^n \eta_t (\hat Y_t - \langle X_t, \theta_* \rangle)}_{M_n}$.   (1)

$(M_n)_{n=1}^{\infty}$ is a martingale. Using the same argument as in [AYPS11], we get that w.p. $1-\delta$, for all $n \ge 0$,
    $|M_n| \le R\,\sqrt{2\Big(1 + \sum_{t=1}^n (\hat Y_t - \langle X_t, \theta_* \rangle)^2\Big)\,\ln\frac{\sqrt{1 + \sum_{t=1}^n (\hat Y_t - \langle X_t, \theta_* \rangle)^2}}{\delta}}$.

Combine with (1) and solve the inequality.
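For the last step, here is a schematic (ours; in the actual proof the factor $c$ also contains the logarithmic term, which itself depends on the left-hand side and is absorbed into the theorem's constants) of how such a self-bounding inequality is resolved.

```latex
% Abbreviate z = \sum_{t=1}^n (\hat Y_t - \langle X_t,\theta_*\rangle)^2 and suppose,
% schematically, that  z \le B_n + c\sqrt{1+z}  for some constant c > 0.
u = \sqrt{1+z} \;\Rightarrow\; u^2 - c\,u - (B_n+1) \le 0
  \;\Rightarrow\; u \le \tfrac{1}{2}\Big(c + \sqrt{c^2 + 4(B_n+1)}\Big) \le c + \sqrt{B_n+1},
\qquad\text{hence}\qquad
z \le B_n + c\,u \le B_n + c^2 + c\sqrt{B_n+1}.
```

The right-hand side now depends only on $B_n$ and the constants hidden in $c$ (i.e., $R$ and $\delta$), which is exactly the form of the bound defining $C_n$.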
Application to Sparse Linear Prediction

Theorem ([Ger11])
For any $\theta$ such that $\|\theta\|_\infty \le 1$ and $\|\theta\|_0 \le p$, the regret of SeqSEW is bounded by
    $\rho_n(\theta) \le B_n = O(p \log(nd))$.

Corollary
$\exists\, A > 0$ s.t. with probability at least $1-\delta$, for all $n$, $\theta_*$ lies in
    $C_n = \Big\{\theta \in \mathbb{R}^d :\ \sum_{t=1}^n (\hat Y_t - \langle X_t, \theta \rangle)^2 \le 1 + 2 A p \log(nd) + 32 R^2 \ln\frac{R\sqrt{8} + \sqrt{1 + A p \log(nd)}}{\delta}\Big\}$.
Application to Linear Bandits

Encouragement: Gittins' mistake. Stamina: András' theory of hiking.