
For a set S ⊂ E define m_S : E → R, the Minkowski gauge of S, by

m_S(x) = inf{λ ≥ 0 : x ∈ λS},

where λS = {λs : s ∈ S}. For f : E → [0, +∞) positively homogeneous, let B_f = {x : f(x) ≤ 1} be the “ball” generated by f. The following proposition follows immediately from the definitions:

Proposition 12.32 (Minkowski Gauges). f : E → R is positively homogeneous with f(0) = 0 if and only if there exists a set S ⊂ E such that 0 ∈ S and f is the Minkowski gauge of S. In particular, S = B_f = {x : f(x) ≤ 1}, i.e., f = m_{B_f}.
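To make the gauge concrete, here is a small numerical sketch (my own illustration, not from the text): it approximates m_S(x) by bisection over λ, using that for a closed convex S containing 0, membership of x in λS is monotone in λ. With S the unit ℓ1-ball, the gauge recovers the ℓ1-norm.

```python
import numpy as np

def minkowski_gauge(x, in_S, hi=1e6, tol=1e-9):
    """Approximate m_S(x) = inf{lam >= 0 : x in lam*S} by bisection.
    Assumes S is closed, convex, and contains 0, so membership of x
    in lam*S is monotone in lam (and that m_S(x) <= hi)."""
    lo = 0.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        # For mid > 0, x is in mid*S iff x/mid is in S.
        if in_S(x / mid):
            hi = mid
        else:
            lo = mid
    return hi

# Example: S is the unit l1-ball, whose gauge is the l1-norm.
in_l1_ball = lambda x: np.abs(x).sum() <= 1.0
x = np.array([0.3, -1.2, 0.5])
print(minkowski_gauge(x, in_l1_ball), np.abs(x).sum())  # both ~2.0
```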

12.6 Degrees of Convexity

In this section we introduce various strengthened forms of convexity. Their significance is that, as we will see, strengthened forms of convexity give us more control in both optimization and learning.

We already mentioned that if the inequality in (12.1) holds strictly for all distinct x, y ∈ E and α ∈ (0,1), the function f is called strictly convex. Strictly convex functions have at most one minimizer:

Proposition 12.33 (Strictly Convex Functions Have Unique Minimizers). Let f : E → R be proper and strictly convex. Then f has at most one minimizer.

For C ⊂ E compact with dom(f) ∩ C ≠ ∅, lower semicontinuity of f is sufficient to guarantee the existence of a minimizer of f over C. This is also known as Weierstrass’ Theorem. Can we relax the condition that C is compact? For t ∈ R, define the t-lower level set of a function f : E → R as the set lev_{≤t}(f) = {x ∈ E : f(x) ≤ t}. When C is closed and convex, a sufficient condition for the existence of a minimizer of f over C is that for some t ∈ R, C ∩ lev_{≤t}(f) is nonempty and bounded. We see that the key condition is that lev_{≤t}(f) should be bounded for some t ∈ R, motivating the next result:

Proposition 12.34. Suppose f : E → R is proper, closed, convex. Then lim_{‖x‖→+∞} f(x) = +∞ if and only if there exists t ∈ R such that lev_{≤t}(f) is nonempty and bounded.

If lim_{‖x‖→+∞} f(x) = +∞ then we say that f is coercive. We have already met supercoercive functions (cf. Theorem 12.24), i.e., functions that satisfy lim_{‖x‖→+∞} f(x)/‖x‖ = +∞. From Proposition 12.34 it follows immediately that for a proper, closed, convex function f and a closed convex set C such that C ∩ dom f ≠ ∅, if either f is coercive or C is bounded then f has a minimizer over C.

Suppose that we minimize a function f : E → R over C ⊂ E with C ∩ dom(f) ≠ ∅ and find a point u. We call f(u) − inf_{x∈C} f(x) the suboptimality gap of u (w.r.t. f and C). In numerical optimization there are two dual questions of equal importance:

(a) Knowing how large the suboptimality gap is, how far can u be from the set of minimizers Argmin_{x∈C} f(x)?


(b) Knowing how far u is from the set of minimizers Argmin_{x∈C} f(x), how big can the suboptimality gap be?

First, consider (a). Suppose f : E → R is a proper, convex function. Let K ⊂ E be a nonempty convex set and consider a point x ∈ Argmin_{v∈K} f(v). For any u ∈ K ∩ dom(f), by the definition of subdifferentials,

f(u) − f(x) = { f(u) − f(x) − sup_{θ∈∂f(x)} ⟨θ, u−x⟩ } + sup_{θ∈∂f(x)} ⟨θ, u−x⟩.

When f and K meet the conditions of Theorem 12.11, since x is a minimizer of f over K, sup_{θ∈∂f(x)} ⟨θ, u−x⟩ ≥ 0. Hence,

f(u) − f(x) ≥ f(u) − f(x) − sup_{θ∈∂f(x)} ⟨θ, u−x⟩.

The right-hand side measures how small the gap between the value of f at u and its affine minorants based at x can be. If this gap were lower bounded by ψ(‖u−x‖) for some strictly increasing function ψ : [0,+∞) → [0,+∞), then by inverting ψ we could get a bound on ‖u−x‖. This motivates the following definitions:

Definition 12.35 (Bregman Divergence). Let f : E → R. The function D_f : E × E → [0,+∞] defined by

D_f(u, v) = f(u) − f(v) − sup_{θ∈∂f(v)} ⟨θ, u−v⟩

is called the Bregman divergence of f, and D_f(u, v) is called the divergence of u from v generated by f.
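As a quick illustration (my example, not the book's): for the nondifferentiable f(x) = |x| on E = R we have ∂f(0) = [−1, 1], so D_f(u, 0) = |u| − sup_{θ∈[−1,1]} θu = 0 for every u; without strict convexity the divergence can vanish off the diagonal. A minimal sketch of the sup-based definition, where the subdifferential is supplied by hand and the sup over a segment is taken at its endpoints (θ ↦ ⟨θ, u−v⟩ is linear):

```python
import numpy as np

def bregman_div(f, subdiff, u, v):
    """D_f(u, v) = f(u) - f(v) - sup_{theta in subdiff(v)} theta*(u - v).
    `subdiff(v)` returns the endpoints of the (segment-valued)
    subdifferential at v; the sup of a linear function is attained there."""
    return f(u) - f(v) - max(th * (u - v) for th in subdiff(v))

f = abs
subdiff = lambda v: [-1.0, 1.0] if v == 0 else [np.sign(v)]
print(bregman_div(f, subdiff, 3.0, 0.0))   # 0.0: vanishes off the diagonal
print(bregman_div(f, subdiff, -1.0, 2.0))  # 2.0 = |-1| - |2| - 1*(-1 - 2)
```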

Definition 12.36 (Modulus of Convexity, Strong Convexity). Let ‖·‖ be a norm on E and f : E → R. The function ψ : [0,+∞) → [0,+∞] is called a modulus of convexity of f at v w.r.t. ‖·‖ if ψ vanishes only at zero and

D_f(u, v) ≥ ψ(‖u−v‖) for all u ∈ E.

In this case we also say that f is (ψ, ‖·‖)-convex at v. If ψ(t) = (σ/2) t² with some σ > 0, the function f is called σ-strongly convex at v w.r.t. ‖·‖. Finally, if f is σ-strongly convex at all points v ∈ E, then it is said to be (globally) σ-strongly convex.

Note that in the definition of Bregman divergences we required f to be neither convex nor differentiable (as is usually done). The advantage is increased generality at no cost: our generalization is a natural extension of the classical definition, as the usual properties of Bregman divergences still hold:

Proposition 12.37. D_f is well-defined (i.e., nonnegative valued), dom D_f = dom f × dom ∂f, and D_f(·, v) is convex for any v ∈ E.

The proof is left as an exercise (cf. Exercise 12.43).

Let us summarize our findings concerning Question (a):


Proposition 12.38. Suppose f : E → R is proper. Let K ⊂ E be a nonempty convex set such that ri(dom(f)) ∩ ri(K) ≠ ∅. Let x ∈ Argmin_{v∈K} f(v), and assume that ψ is a modulus of convexity of f w.r.t. the norm ‖·‖ at x. Then, for any u ∈ K ∩ dom(f),

f(u) − f(x) ≥ ψ(‖u−x‖),

and thus ‖u−x‖ ≤ sup{t ≥ 0 : ψ(t) ≤ f(u) − f(x)}. In particular, if f is σ-strongly convex w.r.t. the norm ‖·‖ at x,

f(u) − f(x) ≥ (σ/2) ‖u−x‖²  and  ‖u−x‖ ≤ √(2(f(u) − f(x))/σ).
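A quick numerical sanity check of the strongly convex case (a sketch under my own setup: K = E and f a positive definite quadratic, whose strong convexity constant w.r.t. ‖·‖₂ is the smallest eigenvalue of its Hessian):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + np.eye(5)               # positive definite Hessian
f = lambda x: 0.5 * x @ A @ x         # minimizer over E is x = 0, f(x) = 0
sigma = np.linalg.eigvalsh(A).min()   # strong convexity constant w.r.t. ||.||_2

for _ in range(1000):
    u = rng.standard_normal(5)
    gap = f(u) - 0.0                  # suboptimality gap of u
    assert np.linalg.norm(u) <= np.sqrt(2 * gap / sigma) + 1e-12
print("||u - x|| <= sqrt(2*(f(u) - f(x))/sigma) verified")
```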

Strong convexity is “stronger” than strict convexity:

Proposition 12.39 (Strong Convexity Implies Strict Convexity). Assume that for all x ∈ E, f is (ψ_x, ‖·‖)-convex at x with some ψ_x : [0,+∞) → [0,+∞]. Then f is strictly convex.

Thus, the global existence of moduli of convexity is a stronger condition than strict convexity, which we have already seen to be useful for establishing the existence and uniqueness of minimizers. The proof relies on the fact that, in the definition of a modulus of convexity, ψ(t) = 0 implies t = 0 (cf. Exercise 12.44).


Proposition 12.40. Let f be proper, closed, and strongly convex. Then f is supercoercive and it has exactly one minimizer over E.

In general, the degree of convexity of a function f can be measured by the gap between the right-hand side and the left-hand side of the convexity inequality (12.1), i.e., by

α f(x) + (1−α) f(y) − f(αx + (1−α)y).

When taking this difference, the gap is expected to close as α approaches either endpoint of (0,1). If f is continuously differentiable, the gap can be seen to equal

G(α; x, y) := α f(x) + (1−α) f(y) − f(αx + (1−α)y)
           = α{f(x) − f(z)} + (1−α){f(y) − f(z)}              (defining z = αx + (1−α)y)
           = α ⟨∇f(z_1), x−z⟩ + (1−α) ⟨∇f(z_2), y−z⟩          (for suitable z_1 ∈ [x, z], z_2 ∈ [y, z])
           = α(1−α) ⟨∇f(z_1), x−y⟩ − (1−α)α ⟨∇f(z_2), x−y⟩
           = α(1−α) ⟨∇f(z_1) − ∇f(z_2), x−y⟩.

For x and y close to each other, the difference scales with α(1−α). This motivates the following definition:


Definition 12.41 (Uniform Convexity). Fix a norm ‖·‖ on E and let f : E → R be a proper, convex function. We say that f is uniformly convex w.r.t. the norm ‖·‖ with modulus φ : [0,+∞) → [0,+∞] if φ is increasing, vanishes only at zero, and for any x, y ∈ dom(f) and α ∈ (0,1),

α(1−α) φ(‖x−y‖) ≤ α f(x) + (1−α) f(y) − f(αx + (1−α)y). (12.10)

When this holds we say that φ is a modulus of convexity of f.

Another refinement is to replace the single modulus φ(·) with a family of moduli (φ_{x,y})_{(x,y)∈E×E}:

α(1−α) φ_{x,y}(‖x−y‖) ≤ α f(x) + (1−α) f(y) − f(αx + (1−α)y),

in which case we say that f has a local modulus of convexity φ_{x,y} at x, y ∈ E.

When f is linear, G(α; x, y) = 0. When f is quadratic, i.e., f(x) = ⟨x−b, P(x−b)⟩ + c, a simple calculation gives

G(α; x, y) = α(1−α) ‖x−y‖²_P , (12.11)

where ‖u‖²_P = ⟨u, Pu⟩.
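Identity (12.11) is easy to verify numerically; a small sketch of my own:

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
P = M @ M.T                                   # symmetric positive semidefinite
b, c = rng.standard_normal(4), 0.7
f = lambda x: (x - b) @ P @ (x - b) + c       # quadratic as in the text

x, y, alpha = rng.standard_normal(4), rng.standard_normal(4), 0.3
G = alpha * f(x) + (1 - alpha) * f(y) - f(alpha * x + (1 - alpha) * y)
print(np.isclose(G, alpha * (1 - alpha) * (x - y) @ P @ (x - y)))  # True
```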

Definition 12.42 (Strongly Convex Functions). A proper function f : E → R is strongly convex w.r.t. the norm ‖·‖ with constant β if (12.10) holds with φ(t) = (β/2) t² for some β > 0.

The next proposition gives a useful equivalent definition for strong convexity.

Proposition 12.43. Let f : E → R be proper and β > 0. Then f is strongly convex w.r.t. ‖·‖₂ with constant β if and only if f − (β/2)‖·‖₂² is convex.

Proof. This follows directly from (12.11) with P = I (the identity matrix): writing g = f − (β/2)‖·‖₂², the convexity gap of f decomposes as the convexity gap of g plus α(1−α)(β/2)‖x−y‖₂², so f satisfies (12.10) with φ(t) = (β/2)t² exactly when the gap of g is nonnegative, i.e., when g is convex.

[Notes for this section, to be expanded: the hierarchy of degrees of convexity (convex, strictly convex, uniformly convex, strongly convex); the modulus of convexity; the duality between strong convexity and smoothness; Bregman divergences; a strictly convex function has a differentiable conjugate, while a strongly convex function has a conjugate that is differentiable with a Lipschitz gradient; stronger notions of differentiability (Lipschitz/Hölder gradients and their implications), cf. Theorem 18.15 of Bauschke and Combettes.]

Definition 12.44 (Strong Convexity). Let A ⊂ R^d and let F : A → R be differentiable. We say that F is σ-strongly convex with respect to a norm ‖·‖ for σ > 0 if for any u, v ∈ A,

F(u) − F(v) ≥ ⟨∇F(v), u−v⟩ + (σ/2) ‖u−v‖².

We say F is strongly convex if it is 1-strongly convex.

This definition depends on the norm used. A function F can be strongly convex with respect to one norm but not another. Note that strong convexity (with respect to any norm) implies strict convexity. Convexity means that the function F(·) can be lower bounded by the linear function F(v) + ⟨∇F(v), · − v⟩. In contrast, strong convexity means that the quadratic function F(v) + ⟨∇F(v), · − v⟩ + ½‖· − v‖² lower bounds F(·). The coefficient ½ in front of ‖u−v‖² is an arbitrary choice, made for mathematical convenience.
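For instance (an illustration of my own): F(x) = ½‖x‖₂² is 1-strongly convex w.r.t. ‖·‖₂, but w.r.t. ‖·‖₁ its strong convexity constant degrades to 1/d on R^d, since ‖w‖₁² can be as large as d‖w‖₂²:

```python
import numpy as np

d = 100
F = lambda x: 0.5 * x @ x
grad_F = lambda x: x

v = np.zeros(d)
u = np.ones(d) / d   # worst case: mass spread evenly over all coordinates
lhs = F(u) - F(v) - grad_F(v) @ (u - v)       # equals 0.5*||u - v||_2^2 = 1/(2d)
print(lhs >= 0.5 * np.linalg.norm(u - v, 2) ** 2 - 1e-15)  # True: sigma = 1 for l2
print(lhs >= 0.5 * np.linalg.norm(u - v, 1) ** 2 - 1e-15)  # False: l1 needs sigma = 1/d
```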

Definition 12.45 (Bregman Divergence). Let F : A → R be a Legendre function. The Bregman divergence corresponding to F is the function D_F : A × A → R defined by the formula

D_F(u, v) = F(u) − F(v) − ⟨∇F(v), u−v⟩.

The Bregman divergence is the difference between the function value F(u) and its approximation by the first-order Taylor expansion F(v) + ⟨∇F(v), u−v⟩ around v. The first-order Taylor expansion is a linear function tangent to F at the point v. Since F is convex, this linear function lies below F, and therefore D_F is nonnegative. Furthermore, D_F(u, v) = 0 implies u = v, because the strict convexity of F implies that the only point where the linear approximation meets F is u = v.
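Two standard instances, in a short sketch of my own (both facts are classical): for F(x) = ½‖x‖₂² the divergence is the squared Euclidean distance ½‖u−v‖₂², and for the negative entropy F(x) = Σ_i x_i ln x_i on the positive orthant it is the unnormalized relative entropy (the KL divergence when u, v are probability vectors):

```python
import numpy as np

def bregman(F, grad_F, u, v):
    # D_F(u, v) = F(u) - F(v) - <grad F(v), u - v>
    return F(u) - F(v) - grad_F(v) @ (u - v)

u = np.array([0.2, 0.5, 0.3])
v = np.array([0.1, 0.6, 0.3])

# Squared Euclidean norm: D_F(u, v) = 0.5*||u - v||_2^2.
F2, g2 = lambda x: 0.5 * x @ x, lambda x: x
print(np.isclose(bregman(F2, g2, u, v), 0.5 * np.sum((u - v) ** 2)))

# Negative entropy: D_F(u, v) = sum_i u_i ln(u_i/v_i) - sum_i u_i + sum_i v_i.
Fe, ge = lambda x: np.sum(x * np.log(x)), lambda x: np.log(x) + 1.0
kl = np.sum(u * np.log(u / v)) - u.sum() + v.sum()
print(np.isclose(bregman(Fe, ge, u, v), kl))
```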

Consider now a Legendre function R : E → R and let D_R(y, x) = R(y) − R(x) − ⟨∇R(x), y−x⟩ be the Bregman divergence underlying R. Then simple algebra shows that D_R satisfies the so-called Pythagorean identity: for any x, y, u ∈ E,

D_R(x, y) = D_R(x, u) + D_R(u, y) + ⟨∇R(y) − ∇R(u), u−x⟩. (12.12)

Note that ⟨∇R(y) − ∇R(u), u−x⟩ can be positive, hence D_R does not satisfy the triangle inequality; this should not be surprising given that D_R is closer to a squared distance than to a distance. In fact, when R is half the squared Euclidean norm, R(x) = ½‖x‖₂², we have ∇R(x) = x and D_R(x, y) = ½‖x−y‖₂². In this case we recover the standard Pythagorean identity when y−u and x−u are perpendicular (i.e., ⟨∇R(y) − ∇R(u), u−x⟩ = 0).

Let C ⊂ E be convex, x ∈ C, and let u be obtained by projecting y onto C, i.e., u = argmin_{v∈C} D_R(v, y). Then

D_R(x, y) ≥ D_R(x, u) + D_R(u, y).

This inequality is called the Pythagorean inequality. The proof is immediate: since ∇D_R(·, y) = ∇R(·) − ∇R(y), the optimality criterion gives that for any v ∈ C ∩ dom(R), ⟨∇R(u) − ∇R(y), v−u⟩ ≥ 0. Using v = x and (12.12) gives the result.
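As a concrete check (my sketch, not from the text): with R the negative entropy, the Bregman projection of a positive vector y onto the probability simplex is simply y/Σ_i y_i, and the Pythagorean inequality can be verified numerically:

```python
import numpy as np

rng = np.random.default_rng(2)
D = lambda a, b: np.sum(a * np.log(a / b)) - a.sum() + b.sum()  # neg-entropy divergence

y = rng.uniform(0.1, 2.0, size=6)   # positive point to be projected
u = y / y.sum()                      # its Bregman (KL) projection onto the simplex

for _ in range(1000):
    x = rng.dirichlet(np.ones(6))    # random point of the simplex C
    assert D(x, y) >= D(x, u) + D(u, y) - 1e-12
print("Pythagorean inequality verified for the KL geometry")
```

In fact, for this projection the inner-product term of (12.12) vanishes for every x in the simplex (here ∇R(y) − ∇R(u) = ln(Σ_i y_i)·1 and Σ_i u_i = Σ_i x_i = 1), so the inequality holds with equality.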

12.7 A Characterization of the Set of Convex, Differentiable Functions