(1)

“Broadening the knowledge base and supporting the long-term professional sustainability of the Research University Centre of Excellence at the University of Szeged by ensuring the rising generation of excellent scientists.”

Doctoral School of Mathematics and Computer Science

Stochastic Days in Szeged 27.07.2012.

Online-to-Confidence-Set Conversions and Application to Sparse Stochastic Bandits

Csaba Szepesvári

(University of Alberta)

TÁMOP‐4.2.2/B‐10/1‐2010‐0012 project

(2)

Online-to-Confidence-Set Conversions and Application to Sparse Stochastic Bandits

Csaba Szepesvári

Department of Computing Science, University of Alberta
csaba.szepesvari@ualberta.ca

July 27, 2013

Stochastic Days – honoring András Krámli's 70th birthday

joint work with Yasin Abbasi-Yadkori and Dávid Pál

1 / 31


(4)

Contents

- Linear prediction and (honest) confidence sets
  - Definition
  - Some previous results
  - Sparsity
- Online-to-confidence-set conversion
  - Online linear prediction
  - The conversion
  - Why does it work?
- Application to linear bandits
  - Sparse linear bandits
  - Results

3 / 31


(15)

Linear Prediction and (Honest) Confidence Sets

Getting directions from András: MDPs, ILT, Dregely

4 / 31

(16)

The Data

- $X_1, \dots, X_n \in \mathbb{R}^d$, $Y_1, \dots, Y_n \in \mathbb{R}$
- There exists $\theta_* \in \mathbb{R}^d$ such that
  $$Y_t = \langle X_t, \theta_* \rangle + \eta_t, \qquad t = 1, \dots, n$$
- The "noise" $\eta_t$ is conditionally $R$-sub-Gaussian for some $R > 0$, i.e.,
  $$\forall \lambda \in \mathbb{R}: \quad \mathbb{E}\!\left[ e^{\lambda \eta_t} \,\middle|\, X_1, \dots, X_t, \eta_1, \dots, \eta_{t-1} \right] \le \exp\!\left( \frac{\lambda^2 R^2}{2} \right)$$
- Often $X_t$ is chosen based on $(X_1, \dots, X_{t-1})$ and $(Y_1, \dots, Y_{t-1})$

Estimation problems:

- Estimate $\theta_*$ based on $((X_1, Y_1), \dots, (X_n, Y_n))$!
- Construct a confidence set that contains $\theta_*$ w.h.p.!

5 / 31
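
To make the setup concrete, here is a minimal simulation sketch of this data-generating process (an illustration added here, not part of the slides); uniform noise on $[-R, R]$ is just one convenient choice of conditionally $R$-sub-Gaussian noise, and all variable names are mine.

```python
# Sketch of the data-generating process: Y_t = <X_t, theta_*> + eta_t (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
d, n, R = 5, 200, 0.5
theta_star = rng.normal(size=d)        # the unknown parameter theta_*

X = rng.normal(size=(n, d))            # here X_t is i.i.d.; in general it may depend on the past
eta = rng.uniform(-R, R, size=n)       # zero mean, bounded in [-R, R], hence R-sub-Gaussian
Y = X @ theta_star + eta               # the observed responses
```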


(23)

Sub-Gaussianity

Definition

A random variable $Z$ is $R$-sub-Gaussian for some $R \ge 0$ if
$$\forall \gamma \in \mathbb{R}: \quad \mathbb{E}[e^{\gamma Z}] \le \exp\!\left( \frac{\gamma^2 R^2}{2} \right).$$

The condition implies that

- $\mathbb{E}[Z] = 0$
- $\mathrm{Var}[Z] \le R^2$

Examples:

- Zero-mean, bounded in an interval of length $2R$ (Hoeffding-Azuma)
- Zero-mean Gaussian with variance $\le R^2$

6 / 31
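
As a quick numerical sanity check of the definition (again an added illustration, not from the slides): for $Z = R\,\varepsilon$ with $\varepsilon$ a uniform random sign, $Z$ is zero-mean and bounded in an interval of length $2R$, and its moment generating function $\cosh(\gamma R)$ indeed stays below $\exp(\gamma^2 R^2 / 2)$.

```python
# MGF check for a scaled Rademacher variable: cosh(gamma*R) <= exp(gamma^2 R^2 / 2).
import numpy as np

R = 0.7
for gamma in (-3.0, -1.0, 0.5, 2.0):
    mgf = np.cosh(gamma * R)                     # exact MGF of Z = R * (uniform random sign)
    bound = np.exp(gamma ** 2 * R ** 2 / 2.0)
    assert mgf <= bound, (gamma, mgf, bound)     # holds for every real gamma
```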


(26)

(Honest) Confidence Sets

Given the data $((X_1, Y_1), \dots, (X_n, Y_n))$ and $0 \le \delta \le 1$, construct $C_n \subseteq \mathbb{R}^d$ such that
$$\Pr(\theta_* \in C_n) \ge 1 - \delta.$$

7 / 31


(29)

Confidence Sets based on Ridge-regression

- Data $(X_1, Y_1), \dots, (X_n, Y_n)$ such that $Y_t \approx \langle X_t, \theta_* \rangle$
- Stack them into matrices: $X_{1:n}$ is $n \times d$ and $Y_{1:n}$ is $n \times 1$
- Ridge regression estimator:
  $$\hat\theta_n = (X_{1:n}^\top X_{1:n} + \lambda I)^{-1} X_{1:n}^\top Y_{1:n}$$
- Let $V_n = X_{1:n}^\top X_{1:n} + \lambda I$

Theorem ([AYPS11])
If $\|\theta_*\|_2 \le S$, then with probability at least $1 - \delta$, for all $t$, $\theta_*$ lies in
$$C_t = \left\{ \theta :\ \|\hat\theta_t - \theta\|_{V_t} \le R \sqrt{ 2 \ln \frac{\det(V_t)^{1/2}}{\delta \det(\lambda I)^{1/2}} } + S \sqrt{\lambda} \right\},$$
where $\|v\|_A = \sqrt{v^\top A v}$ is the matrix $A$-norm.

Proof technique: [RS70, dLS09]. Extends to separable Hilbert spaces.

8 / 31
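
A sketch of how the estimator and the [AYPS11] confidence radius above can be computed; the function names and the use of `slogdet` for numerical stability are my choices, not part of the slides.

```python
# Ridge estimate and the [AYPS11] confidence radius (sketch).
import numpy as np

def ridge_confidence_set(X, Y, lam, R, S, delta):
    """Return (theta_hat, V, beta); the set is {theta : ||theta_hat - theta||_V <= beta}."""
    d = X.shape[1]
    V = X.T @ X + lam * np.eye(d)              # V_n = X_{1:n}^T X_{1:n} + lambda I
    theta_hat = np.linalg.solve(V, X.T @ Y)    # ridge regression estimate
    logdet_V = np.linalg.slogdet(V)[1]         # log det(V_n)
    # 2 * ln( det(V)^{1/2} / (delta * det(lambda I)^{1/2}) ), with det(lambda I) = lambda^d
    log_term = logdet_V - d * np.log(lam) - 2.0 * np.log(delta)
    beta = R * np.sqrt(log_term) + S * np.sqrt(lam)
    return theta_hat, V, beta

def contains(theta, theta_hat, V, beta):
    diff = theta_hat - theta
    return float(diff @ V @ diff) <= beta ** 2  # i.e. ||theta_hat - theta||_V <= beta
```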


(36)

Comparison with Previous Confidence Sets

- Bound of [AYPS11]:
  $$\|\hat\theta_t - \theta_*\|_{V_t} \le R \sqrt{ 2 \ln \frac{\det(V_t)^{1/2}}{\delta \det(\lambda I)^{1/2}} } + S \sqrt{\lambda}$$
- [DHK08]: If $\|\theta_*\|_2, \|X_t\|_2 \le 1$, then for a specific $\lambda$,
  $$\|\hat\theta_t - \theta_*\|_{V_t} \le R \max\!\left( \sqrt{128\, d \ln(t) \ln(t^2/\delta)},\ \tfrac{8}{3} \ln(t^2/\delta) \right)$$
- [RT10]: If $\|X_t\|_2 \le 1$,
  $$\|\hat\theta_t - \theta_*\|_{V_t} \le 2 R \kappa \sqrt{\ln t}\, \sqrt{d \ln t + \ln(t^2/\delta)} + S \sqrt{\lambda},$$
  where $\kappa = 3 + 2 \ln((1 + \lambda d)/\lambda)$.

The bound of [AYPS11] does not depend on $t$.

9 / 31


(40)

Questions

- Are there other ways to construct confidence sets?
- Can we get tighter confidence sets when some special conditions are met?
- SPARSITY: only $p$ coordinates of $\theta_*$ are nonzero.
- Can we construct tighter confidence sets based on the knowledge of $p$?
- Least-squares (or ridge) estimators are not a good idea!

10 / 31


(45)

Online-to-Confidence-Set Conversion

- Idea: create a confidence set based on how well an online linear prediction algorithm works.
- This is a reduction!
- If a new prediction algorithm is discovered, or a better performance bound for an existing algorithm becomes available, we get tighter confidence sets.
- Hopefully it will work for the sparse case.

Encouragement: Working on my thesis

11 / 31


(51)

Online Linear Prediction

For $t = 1, 2, \dots$:

- Receive $X_t \in \mathbb{R}^d$
- Predict $\hat Y_t \in \mathbb{R}$
- Receive the correct label $Y_t \in \mathbb{R}$
- Suffer loss $(Y_t - \hat Y_t)^2$

Goal: compete with the best linear predictor in hindsight.
No assumptions whatsoever on $(X_1, Y_1), (X_2, Y_2), \dots$!

There are heaps of algorithms for this problem:

- online gradient descent [Zin03]
- online least-squares [AW01, Vov01]
- exponentiated gradient algorithm [KW97]
- online LASSO (??)
- SeqSEW [Ger11, DT07]

12 / 31
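
As one concrete instance of such an algorithm, here is a minimal sketch of projected online gradient descent on the squared loss in the spirit of [Zin03]; the step-size schedule and projection radius are illustrative choices, not taken from the slides.

```python
# Projected online gradient descent for online linear prediction (sketch).
import numpy as np

def ogd_predictions(X, Y, radius=1.0, step=0.5):
    """Run OGD over rounds t = 1..n and return the online predictions Y_hat."""
    n, d = X.shape
    theta = np.zeros(d)
    Y_hat = np.empty(n)
    for t in range(n):
        Y_hat[t] = X[t] @ theta                        # predict before Y_t is revealed
        grad = 2.0 * (Y_hat[t] - Y[t]) * X[t]          # gradient of (y_hat - y)^2 w.r.t. theta
        theta = theta - step / np.sqrt(t + 1) * grad   # decaying step size
        nrm = np.linalg.norm(theta)
        if nrm > radius:                               # project back onto {||theta||_2 <= radius}
            theta *= radius / nrm
    return Y_hat
```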


(63)

Online Linear Prediction, cnt’d

- Regret with respect to a linear predictor $\theta \in \mathbb{R}^d$:
  $$\rho_n(\theta) = \sum_{t=1}^{n} (Y_t - \hat Y_t)^2 - \sum_{t=1}^{n} (Y_t - \langle X_t, \theta \rangle)^2$$
- Prediction algorithms come with "regret bounds" $B_n$:
  $$\forall n \quad \rho_n(\theta) \le B_n$$
- $B_n$ depends on $n$, $d$, $\theta$, and possibly on $X_1, X_2, \dots, X_n$ and $Y_1, Y_2, \dots, Y_n$
- Typically, $B_n = O(\sqrt{n})$ or $B_n = O(\log n)$

13 / 31
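
For concreteness, the regret against a fixed comparator is directly computable from the recorded predictions; this small helper (names are mine) evaluates $\rho_n(\theta)$, e.g. for the OGD predictions sketched earlier.

```python
# Regret of a sequence of online predictions against a fixed comparator theta.
import numpy as np

def regret(X, Y, Y_hat, theta):
    """rho_n(theta) = sum_t (Y_t - Y_hat_t)^2 - sum_t (Y_t - <X_t, theta>)^2."""
    return np.sum((Y - Y_hat) ** 2) - np.sum((Y - X @ theta) ** 2)

# e.g. rho = regret(X, Y, ogd_predictions(X, Y), theta_star)
```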


(67)

Good Regret Implies Small Risk

- Data: $\{(X_t, Y_t)\}_{t=1}^{n}$ is i.i.d., $Y_t = \langle X_t, \theta_* \rangle + \eta_t$, with $\eta_t$ $R$-sub-Gaussian
- Online learning algorithm: $\mathcal{A}$ produces $\{\theta_t\}_{t=1}^{n}$ and predicts $\hat Y_t = \langle X_t, \theta_t \rangle$
- Regret bound: $\forall n:\ \rho_n(\theta_*) \le B_n$
- Risk of a vector $\theta$: $R(\theta) = \mathbb{E}[(Y_1 - \langle X_1, \theta \rangle)^2]$

Theorem ([CBG08])
Let $\bar\theta_n = \frac{1}{n} \sum_{t=1}^{n} \theta_t$. Then, w.p. $1 - \delta$,
$$R(\bar\theta_n) \le \frac{B_n}{n} + \frac{36}{n} \ln\frac{B_n + 3}{\delta} + \sqrt{ \frac{B_n}{n^2} \ln\frac{B_n + 3}{\delta} }.$$

14 / 31
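
The bound is easy to evaluate numerically; the following illustration (values are arbitrary examples, not from the slides) shows how the guaranteed risk of the averaged iterate shrinks with $n$ when $B_n$ grows only logarithmically.

```python
# Evaluate the [CBG08]-style risk bound for the averaged iterate (illustrative values).
import numpy as np

def risk_bound(B_n, n, delta):
    """B_n/n + (36/n) ln((B_n+3)/delta) + sqrt((B_n/n^2) ln((B_n+3)/delta))."""
    log_term = np.log((B_n + 3.0) / delta)
    return B_n / n + 36.0 / n * log_term + np.sqrt(B_n / n ** 2 * log_term)

for n in (100, 1_000, 10_000):
    print(n, risk_bound(B_n=10.0 * np.log(n), n=n, delta=0.05))
```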


(72)

Online-to-Confidence-Set Conversion

- Data $(X_1, Y_1), \dots, (X_n, Y_n)$, where $Y_t = \langle X_t, \theta_* \rangle + \eta_t$ and $\eta_t$ is conditionally $R$-sub-Gaussian
- Predictions $\hat Y_1, \hat Y_2, \dots, \hat Y_n$
- Regret bound $\rho_n(\theta_*) \le B_n$

Theorem (Conversion, [AYPS12])
With probability at least $1 - \delta$, for all $n$, $\theta_*$ lies in
$$C_n = \left\{ \theta \in \mathbb{R}^d :\ \sum_{t=1}^{n} (\hat Y_t - \langle X_t, \theta \rangle)^2 \le 1 + 2 B_n + 32 R^2 \ln\!\left( \frac{R\sqrt{8} + \sqrt{1 + B_n}}{\delta} \right) \right\}.$$

15 / 31
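
The conversion is cheap to carry out: given the regret bound $B_n$, the right-hand side above is a single scalar, and testing whether a candidate $\theta$ belongs to $C_n$ only requires the stored predictions. A sketch (function names are mine):

```python
# Online-to-confidence-set conversion: radius of C_n and a membership test (sketch).
import numpy as np

def conversion_radius(B_n, R, delta):
    """Right-hand side of the constraint defining C_n in the conversion theorem."""
    return 1.0 + 2.0 * B_n + 32.0 * R ** 2 * np.log((R * np.sqrt(8.0) + np.sqrt(1.0 + B_n)) / delta)

def in_confidence_set(theta, X, Y_hat, B_n, R, delta):
    """Check sum_t (Y_hat_t - <X_t, theta>)^2 <= conversion_radius(B_n, R, delta)."""
    return np.sum((Y_hat - X @ theta) ** 2) <= conversion_radius(B_n, R, delta)
```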


(76)

Proof Sketch

Algebra: with probability 1, due to the regret bound $B_n$,
$$\sum_{t=1}^{n} (\hat Y_t - \langle X_t, \theta_* \rangle)^2 \le B_n + 2 \underbrace{\sum_{t=1}^{n} \eta_t (\hat Y_t - \langle X_t, \theta_* \rangle)}_{M_n}. \qquad (1)$$

$(M_n)_{n=1}^{\infty}$ is a martingale. Using the same argument as in [AYPS11], we get that w.p. $1 - \delta$, for all $n \ge 0$,
$$|M_n| \le R \sqrt{ 2 \left( 1 + \sum_{t=1}^{n} (\hat Y_t - \langle X_t, \theta_* \rangle)^2 \right) \ln\!\left( \frac{\sqrt{1 + \sum_{t=1}^{n} (\hat Y_t - \langle X_t, \theta_* \rangle)^2}}{\delta} \right) }.$$

Combine with (1) and solve the inequality.

16 / 31


(79)

Application to Sparse Linear Prediction

Theorem ([Ger11])
For any $\theta$ such that $\|\theta\| \le 1$ and $\|\theta\|_0 \le p$, the regret of SeqSEW is bounded by
$$\rho_n(\theta) \le B_n = O(p \log(nd)).$$

Corollary
There exists $A > 0$ such that, with probability at least $1 - \delta$, for all $n$, $\theta_*$ lies in
$$C_n = \left\{ \theta \in \mathbb{R}^d :\ \sum_{t=1}^{n} (\hat Y_t - \langle X_t, \theta \rangle)^2 \le 1 + 2 A p \log(nd) + 32 R^2 \ln\!\left( \frac{R\sqrt{8} + \sqrt{1 + A p \log(nd)}}{\delta} \right) \right\}.$$

17 / 31
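
Plugging the SeqSEW regret bound into the conversion amounts to substituting $B_n = A\,p \log(nd)$; a tiny illustration, reusing the `conversion_radius` sketch given after the conversion theorem (the value of $A$ below is a placeholder, since the corollary only asserts that such a constant exists).

```python
# Sparse case: substitute B_n = A * p * log(n * d) into the conversion radius (illustrative).
import numpy as np

A = 1.0                                        # placeholder constant from the corollary
n, d, p, R, delta = 1000, 100, 5, 0.5, 0.05    # example problem sizes
B_n = A * p * np.log(n * d)
radius = conversion_radius(B_n, R, delta)      # scales with p*log(n*d) rather than with d
```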


(81)

Application to Linear Bandits

Encouragement: Gittins' mistake. Stamina: András' theory of hiking.

18 / 31
