“Broadening the knowledge base and supporting the long term professional sustainability of the Research University Centre of Excellence at the University of Szeged by ensuring the rising generation of excellent scientists.”

Doctoral School of Mathematics and Computer Science
Stochastic Days in Szeged, 27.07.2012

Online-to-Confidence-Set Conversions and Application to Sparse Stochastic Bandits
Csaba Szepesvári (University of Alberta)

TÁMOP-4.2.2/B-10/1-2010-0012 project
Online-to-Confidence-Set Conversions and Application to Sparse Stochastic Bandits

Csaba Szepesvári
Department of Computing Science, University of Alberta
csaba.szepesvari@ualberta.ca

July 27, 2013
Stochastic Days – honoring András Krámli's 70th birthday

Joint work with Yasin Abbasi-Yadkori and Dávid Pál
Contents

▶ Linear prediction and (honest) confidence sets
    ▶ Definition
    ▶ Some previous results
    ▶ Sparsity
▶ Online-to-confidence-set conversion
    ▶ Online linear prediction
    ▶ The conversion
    ▶ Why does it work?
▶ Application to linear bandits
    ▶ Sparse linear bandits
    ▶ Results
Linear Prediction and (Honest) Confidence Sets

Getting directions from András: MDPs, ILT, Dregely
The Data

▶ $X_1, \dots, X_n \in \mathbb{R}^d$,  $Y_1, \dots, Y_n \in \mathbb{R}$
▶ $\exists\, \theta_* \in \mathbb{R}^d$ such that
    $Y_t = \langle X_t, \theta_* \rangle + \eta_t, \qquad t = 1, \dots, n$
▶ The "noise" $\eta_t$ is conditionally $R$-sub-Gaussian with some $R > 0$, i.e.,
    $\forall \lambda \in \mathbb{R}:\ \mathbb{E}\big[e^{\lambda \eta_t} \mid X_1, \dots, X_t, \eta_1, \dots, \eta_{t-1}\big] \le \exp\Big(\frac{\lambda^2 R^2}{2}\Big)$.
▶ Often $X_t$ is chosen based on $(X_1, \dots, X_{t-1})$ and $(Y_1, \dots, Y_{t-1})$.

Estimation problems:
▶ Estimate $\theta_*$ based on $((X_1, Y_1), \dots, (X_n, Y_n))$!
▶ Construct a confidence set that contains $\theta_*$ w.h.p.!
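To make the setup concrete, here is a minimal NumPy sketch (ours, not from the talk) that simulates one run of this data model; the dimension, horizon, and noise scale are arbitrary illustrative choices, and uniform noise on $[-R, R]$ is used because a zero-mean variable bounded in an interval of length $2R$ is $R$-sub-Gaussian.

```python
# Simulate Y_t = <X_t, theta*> + eta_t with conditionally R-sub-Gaussian noise.
import numpy as np

rng = np.random.default_rng(0)
n, d, R = 1000, 5, 0.5
theta_star = rng.normal(size=d)

X = rng.normal(size=(n, d))          # here X_t is exogenous; in bandits it may depend on the past
eta = rng.uniform(-R, R, size=n)     # bounded in [-R, R]  =>  R-sub-Gaussian (Hoeffding's lemma)
Y = X @ theta_star + eta             # observed responses

# A first answer to "estimate theta*": ordinary least squares on the batch.
theta_ls, *_ = np.linalg.lstsq(X, Y, rcond=None)
print("||theta_ls - theta*|| =", np.linalg.norm(theta_ls - theta_star))
```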
Sub-Gaussianity

Definition
A random variable $Z$ is $R$-sub-Gaussian for some $R \ge 0$ if
    $\forall \gamma \in \mathbb{R}:\ \mathbb{E}[e^{\gamma Z}] \le \exp\Big(\frac{\gamma^2 R^2}{2}\Big)$.

The condition implies that
▶ $\mathbb{E}[Z] = 0$
▶ $\mathrm{Var}[Z] \le R^2$

Examples:
▶ Zero-mean, bounded in an interval of length $2R$ (Hoeffding–Azuma)
▶ Zero-mean Gaussian with variance $\le R^2$
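To see why the two implications hold (a standard argument, sketched here; it is not spelled out on the slide), compare Taylor expansions of both sides of the defining inequality around $\gamma = 0$:

```latex
\mathbb{E}\big[e^{\gamma Z}\big]
  = 1 + \gamma\,\mathbb{E}[Z] + \tfrac{\gamma^2}{2}\,\mathbb{E}[Z^2] + o(\gamma^2)
  \;\le\;
  \exp\!\Big(\tfrac{\gamma^2 R^2}{2}\Big)
  = 1 + \tfrac{\gamma^2 R^2}{2} + o(\gamma^2).
```

Letting $\gamma \to 0^+$ and $\gamma \to 0^-$ forces $\mathbb{E}[Z] = 0$ (otherwise the linear term would violate the inequality for one sign of $\gamma$); comparing the quadratic terms then gives $\mathbb{E}[Z^2] \le R^2$, i.e., $\mathrm{Var}[Z] \le R^2$.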
(Honest) Confidence Sets

Given the data $((X_1, Y_1), \dots, (X_n, Y_n))$ and $0 \le \delta \le 1$, construct $C_n \subset \mathbb{R}^d$ such that
    $\Pr(\theta_* \in C_n) \ge 1 - \delta$.
Confidence Sets based on Ridge Regression

▶ Data $(X_1, Y_1), \dots, (X_n, Y_n)$ such that $Y_t \approx \langle X_t, \theta_* \rangle$
▶ Stack them into matrices: $X_{1:n}$ is $n \times d$ and $Y_{1:n}$ is $n \times 1$
▶ Ridge regression estimator:
    $\hat\theta_n = (X_{1:n}^\top X_{1:n} + \lambda I)^{-1} X_{1:n}^\top Y_{1:n}$
▶ Let $V_n = X_{1:n}^\top X_{1:n} + \lambda I$

Theorem ([AYPS11])
If $\|\theta_*\|_2 \le S$, then with probability at least $1-\delta$, for all $t$, $\theta_*$ lies in
    $C_t = \Big\{\theta :\ \|\hat\theta_t - \theta\|_{V_t} \le R\,\sqrt{2 \ln \frac{\det(V_t)^{1/2}}{\delta \det(\lambda I)^{1/2}}} + S\sqrt{\lambda}\Big\}$,
where $\|v\|_A = \sqrt{v^\top A v}$ is the matrix $A$-norm.

Proof technique: [RS70, dLS09]. Extends to separable Hilbert spaces.
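A minimal NumPy sketch (ours; the function name and default parameters are illustrative, not from the paper) of the [AYPS11] construction: it computes $\hat\theta_n$, $V_n$, and the radius of $C_n$, then checks whether a candidate $\theta$ lies inside.

```python
# Sketch of the ridge-regression confidence set of the [AYPS11] theorem.
# Assumes ||theta*||_2 <= S and R-sub-Gaussian noise, as on the slide.
import numpy as np

def ridge_confidence_set(X, Y, lam=1.0, R=1.0, S=1.0, delta=0.05):
    """Return (theta_hat, V, beta): the set is {theta : ||theta_hat - theta||_V <= beta}."""
    n, d = X.shape
    V = X.T @ X + lam * np.eye(d)                 # V_n = X_{1:n}^T X_{1:n} + lambda I
    theta_hat = np.linalg.solve(V, X.T @ Y)       # ridge estimator
    _, logdet_V = np.linalg.slogdet(V)
    logdet_lam = d * np.log(lam)                  # log det(lambda I)
    # beta = R * sqrt(2 ln( det(V)^{1/2} / (delta det(lambda I)^{1/2}) )) + S * sqrt(lambda)
    beta = R * np.sqrt(2 * (0.5 * logdet_V - 0.5 * logdet_lam - np.log(delta))) + S * np.sqrt(lam)
    return theta_hat, V, beta

# Quick check on simulated data:
rng = np.random.default_rng(0)
n, d, R = 500, 5, 0.1
theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)          # so that S = 1 holds
X = rng.normal(size=(n, d))
Y = X @ theta_star + R * rng.uniform(-1, 1, size=n)   # bounded => R-sub-Gaussian noise
theta_hat, V, beta = ridge_confidence_set(X, Y, R=R)
dist = np.sqrt((theta_hat - theta_star) @ V @ (theta_hat - theta_star))
print(f"||theta_hat - theta*||_V = {dist:.3f}   radius beta = {beta:.3f}")
```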
Comparison with Previous Confidence Sets

▶ Bound of [AYPS11]:
    $\|\hat\theta_t - \theta_*\|_{V_t} \le R\,\sqrt{2 \ln \frac{\det(V_t)^{1/2}}{\delta \det(\lambda I)^{1/2}}} + S\sqrt{\lambda}$
▶ [DHK08]: If $\|\theta_*\|_2, \|X_t\|_2 \le 1$, then for a specific $\lambda$,
    $\|\hat\theta_t - \theta_*\|_{V_t} \le R \max\Big\{\sqrt{128\, d \ln(t)\, \ln(t^2/\delta)},\ \tfrac{8}{3}\ln(t^2/\delta)\Big\}$
▶ [RT10]: If $\|X_t\|_2 \le 1$,
    $\|\hat\theta_t - \theta_*\|_{V_t} \le 2 R \kappa \sqrt{\ln t}\,\sqrt{d \ln t + \ln(t^2/\delta)} + S\sqrt{\lambda}$,
    where $\kappa = 3 + 2\ln\big((1 + \lambda d)/\lambda\big)$.

The bound of [AYPS11] doesn't depend on $t$.
Questions

▶ Are there other ways to construct confidence sets?
▶ Can we get tighter confidence sets when some special conditions are met?
▶ SPARSITY: only $p$ coordinates of $\theta_*$ are nonzero.
▶ Can we construct tighter confidence sets based on the knowledge of $p$?
▶ Least-squares (or ridge) estimators are not a good idea!
Online-to-Confidence-Set Conversion

▶ Idea: Create a confidence set based on how well an online linear prediction algorithm performs.
▶ This is a reduction!
▶ If a new prediction algorithm is discovered, or a better performance bound for an existing algorithm becomes available, we get tighter confidence sets.
▶ Hopefully it will work for the sparse case.

Encouragement: Working on my thesis
Online Linear Prediction

For $t = 1, 2, \dots$:
▶ Receive $X_t \in \mathbb{R}^d$
▶ Predict $\hat Y_t \in \mathbb{R}$
▶ Receive the correct label $Y_t \in \mathbb{R}$
▶ Suffer loss $(Y_t - \hat Y_t)^2$

Goal: Compete with the best linear predictor in hindsight.
No assumptions whatsoever on $(X_1, Y_1), (X_2, Y_2), \dots$!

There are heaps of algorithms for this problem:
▶ online gradient descent [Zin03]
▶ online least-squares [AW01, Vov01]
▶ exponentiated gradient algorithm [KW97]
▶ online LASSO (??)
▶ SeqSEW [Ger11, DT07]
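As an illustration of the protocol, here is a minimal sketch (ours) of one of the listed algorithms, online gradient descent [Zin03], applied to the squared loss; the step-size schedule and the projection radius $S$ are illustrative choices, not prescribed by the talk.

```python
# Online gradient descent for the squared loss, with projection onto {||theta||_2 <= S}.
import numpy as np

def online_gradient_descent(stream, d, S=1.0):
    """stream yields (x_t, y_t); returns the predictions and the iterates theta_t."""
    theta = np.zeros(d)
    preds, thetas = [], []
    for t, (x, y) in enumerate(stream, start=1):
        y_hat = float(theta @ x)              # predict before seeing y_t
        preds.append(y_hat)
        thetas.append(theta.copy())
        grad = 2.0 * (y_hat - y) * x          # gradient of (theta . x - y)^2 in theta
        theta = theta - grad / np.sqrt(t)     # step size ~ 1/sqrt(t)
        norm = np.linalg.norm(theta)
        if norm > S:                          # project back onto the L2 ball of radius S
            theta *= S / norm
    return np.array(preds), thetas
```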
Online Linear Prediction, cnt'd

▶ Regret with respect to a linear predictor $\theta \in \mathbb{R}^d$:
    $\rho_n(\theta) = \sum_{t=1}^n (Y_t - \hat Y_t)^2 - \sum_{t=1}^n (Y_t - \langle X_t, \theta \rangle)^2$
▶ Prediction algorithms come with "regret bounds" $B_n$:
    $\forall n:\ \rho_n(\theta) \le B_n$
▶ $B_n$ depends on $n$, $d$, $\theta$ and possibly $X_1, X_2, \dots, X_n$ and $Y_1, Y_2, \dots, Y_n$.
▶ Typically, $B_n = O(\sqrt{n})$ or $B_n = O(\log n)$.
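The regret $\rho_n(\theta)$ is directly computable from the data; the snippet below (illustrative; the predictor used in the example is a hypothetical one that always outputs 0) simply instantiates the formula above.

```python
# rho_n(theta) = sum_t (Y_t - Yhat_t)^2 - sum_t (Y_t - <X_t, theta>)^2
import numpy as np

def regret(Y, Y_hat, X, theta):
    Y, Y_hat, X = np.asarray(Y), np.asarray(Y_hat), np.asarray(X)
    return float(np.sum((Y - Y_hat) ** 2) - np.sum((Y - X @ theta) ** 2))

rng = np.random.default_rng(1)
n, d = 200, 3
theta = rng.normal(size=d)
X = rng.normal(size=(n, d))
Y = X @ theta + 0.1 * rng.normal(size=n)
print("rho_n(theta) of the all-zeros predictor:", regret(Y, np.zeros(n), X, theta))
```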
Good Regret Implies Small Risk

▶ Data: $\{(X_t, Y_t)\}_{t=1}^n$ is i.i.d., $Y_t = \langle X_t, \theta_* \rangle + \eta_t$, with $\eta_t$ $R$-sub-Gaussian.
▶ Online learning algorithm: $\mathcal{A}$ produces $\{\theta_t\}_{t=1}^n$ and predicts $\hat Y_t = \langle X_t, \theta_t \rangle$.
▶ Regret bound: $\forall n:\ \rho_n(\theta_*) \le B_n$.
▶ Risk of a vector $\theta$: $R(\theta) = \mathbb{E}[(Y_1 - \langle X_1, \theta \rangle)^2]$.

Theorem ([CBG08])
Let $\bar\theta_n = \frac{1}{n}\sum_{t=1}^n \theta_t$. Then, w.p. $1-\delta$,
    $R(\bar\theta_n) \le \frac{B_n}{n} + \frac{36}{n}\ln\frac{B_n + 3}{\delta} + \sqrt{\frac{B_n}{n^2}\ln\frac{B_n + 3}{\delta}}$.
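A throwaway helper (ours; the function name and the example regret bound are illustrative) that evaluates the [CBG08] expression; it shows numerically how a logarithmic regret bound $B_n = O(\log n)$ translates into a risk bound of order $\log(n)/n$ for the averaged iterate.

```python
# Evaluate  B_n/n + (36/n) ln((B_n+3)/delta) + sqrt((B_n/n^2) ln((B_n+3)/delta)).
import numpy as np

def risk_bound(B_n, n, delta=0.05):
    log_term = np.log((B_n + 3) / delta)
    return B_n / n + 36.0 / n * log_term + np.sqrt(B_n / n**2 * log_term)

for n in (10**2, 10**4, 10**6):
    print(n, risk_bound(10 * np.log(n), n))   # hypothetical B_n = 10 log n
```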
Online-to-Confidence-Set Conversion

▶ Data $(X_1, Y_1), \dots, (X_n, Y_n)$ where $Y_t = \langle X_t, \theta_* \rangle + \eta_t$ and $\eta_t$ is conditionally $R$-sub-Gaussian.
▶ Predictions $\hat Y_1, \hat Y_2, \dots, \hat Y_n$
▶ Regret bound $\rho_n(\theta_*) \le B_n$

Theorem (Conversion, [AYPS12])
With probability at least $1-\delta$, for all $n$, $\theta_*$ lies in
    $C_n = \Big\{\theta \in \mathbb{R}^d :\ \sum_{t=1}^n (\hat Y_t - \langle X_t, \theta \rangle)^2 \le 1 + 2 B_n + 32 R^2 \ln\frac{R\sqrt{8} + \sqrt{1 + B_n}}{\delta}\Big\}$.
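A minimal sketch (ours, not the paper's code) of what the conversion yields in practice: given the online algorithm's predictions, its regret bound $B_n$, and $R$, $\delta$, the set $C_n$ is a single quadratic constraint in $\theta$ (so an ellipsoid-shaped, possibly degenerate, set), which any candidate $\theta$ can be tested against directly.

```python
# Membership test for the converted confidence set C_n of the [AYPS12] theorem.
import numpy as np

def conversion_radius(B_n, R=1.0, delta=0.05):
    """Right-hand side: 1 + 2 B_n + 32 R^2 ln((R sqrt(8) + sqrt(1+B_n)) / delta)."""
    return 1.0 + 2.0 * B_n + 32.0 * R**2 * np.log((R * np.sqrt(8) + np.sqrt(1.0 + B_n)) / delta)

def in_confidence_set(theta, X, Y_hat, B_n, R=1.0, delta=0.05):
    """Check  sum_t (Yhat_t - <X_t, theta>)^2 <= conversion_radius(...)."""
    lhs = np.sum((np.asarray(Y_hat) - np.asarray(X) @ theta) ** 2)
    return lhs <= conversion_radius(B_n, R, delta)
```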
Proof Sketch

Algebra: With probability 1, due to the regret bound $B_n$,
    $\sum_{t=1}^n (\hat Y_t - \langle X_t, \theta_* \rangle)^2 \le B_n + 2\underbrace{\sum_{t=1}^n \eta_t (\hat Y_t - \langle X_t, \theta_* \rangle)}_{M_n}$.   (1)

$(M_n)_{n=1}^{\infty}$ is a martingale. Using the same argument as in [AYPS11], we get that w.p. $1-\delta$, for all $n \ge 0$,
    $|M_n| \le R\,\sqrt{2\Big(1 + \sum_{t=1}^n (\hat Y_t - \langle X_t, \theta_* \rangle)^2\Big)\,\ln\frac{\sqrt{1 + \sum_{t=1}^n (\hat Y_t - \langle X_t, \theta_* \rangle)^2}}{\delta}}$.

Combine with (1) and solve the inequality.
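For the last step, here is a schematic (ours; in the actual proof the factor $c$ also contains the logarithmic term, which itself depends on the left-hand side and is absorbed into the theorem's constants) of how such a self-bounding inequality is resolved.

```latex
% Abbreviate z = \sum_{t=1}^n (\hat Y_t - \langle X_t,\theta_*\rangle)^2 and suppose,
% schematically, that  z \le B_n + c\sqrt{1+z}  for some constant c > 0.
u = \sqrt{1+z} \;\Rightarrow\; u^2 - c\,u - (B_n+1) \le 0
  \;\Rightarrow\; u \le \tfrac{1}{2}\Big(c + \sqrt{c^2 + 4(B_n+1)}\Big) \le c + \sqrt{B_n+1},
\qquad\text{hence}\qquad
z \le B_n + c\,u \le B_n + c^2 + c\sqrt{B_n+1}.
```

The right-hand side now depends only on $B_n$ and the constants hidden in $c$ (i.e., $R$ and $\delta$), which is exactly the form of the bound defining $C_n$.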
Application to Sparse Linear Prediction

Theorem ([Ger11])
For any $\theta$ such that $\|\theta\|_\infty \le 1$ and $\|\theta\|_0 \le p$, the regret of SeqSEW is bounded by
    $\rho_n(\theta) \le B_n = O(p \log(nd))$.

Corollary
$\exists\, A > 0$ s.t. with probability at least $1-\delta$, for all $n$, $\theta_*$ lies in
    $C_n = \Big\{\theta \in \mathbb{R}^d :\ \sum_{t=1}^n (\hat Y_t - \langle X_t, \theta \rangle)^2 \le 1 + 2 A p \log(nd) + 32 R^2 \ln\frac{R\sqrt{8} + \sqrt{1 + A p \log(nd)}}{\delta}\Big\}$.
Application to Linear Bandits

Encouragement: Gittins' mistake. Stamina: András' theory of hiking.