
**6.4 Linear Discriminant Analysis**

**Linear discriminant analysis** (LDA) is a technique which also makes
use of the category the data points belong to when performing
dimensionality reduction. More precisely, for LDA we are going to assume
that our data points **x**_i are accompanied by a categorical class
label y_i ∈ Y that characterizes them. For the sake of simplicity,
we can assume |Y| = 2, that is, every point belongs to either the
positive or the negative class. Having access to the class labels of the
data points makes two different objectives equally sensible for
reducing the dimensionality of our data points.

On the one hand, it can be argued that points belonging to
different classes should be as separable from each other as possible
after dimensionality reduction. What we want, in other words, is that
points labeled differently mix to the least possible extent. From this
perspective, our goal is to find a transformation characterized by **w**
which maximizes the distance between the transformed data points
belonging to the different classes. This goal can be equivalently
expressed and formalized by relying on the means of the points
belonging to the different classes, i.e., **µ**_1 and **µ**_2. This is due
to the fact that applying the same transformation **w** to all the points
also affects their mean accordingly, i.e., the transformed means are going
to be **w**^⊤**µ**_1 and **w**^⊤**µ**_2. The first criterion hence can be
expressed as

max_**w** ‖**w**^⊤**µ**_1 − **w**^⊤**µ**_2‖² = max_**w** **w**^⊤ S_B **w**, (6.20)

where S_B is a rank-1 matrix responsible for characterizing the
between-class scatter of the data points according to their original
representation, and which can be conveniently calculated in the binary
(|Y| = 2) case as

S_B = (**µ**_1 − **µ**_2)(**µ**_1 − **µ**_2)^⊤.
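As a quick numerical sanity check, the rank-1 structure of S_B can be reproduced in a few lines. The sketch below (not part of the original text) uses the class means of Example 6.8 and verifies that the outer product of the mean difference has a vanishing determinant:

```python
# Class means taken from the synthetic example (Example 6.8)
mu1 = [4.0, 4.0]
mu2 = [4.0, -5.0]

# Difference of the class means
d = [mu1[0] - mu2[0], mu1[1] - mu2[1]]  # [0.0, 9.0]

# S_B = (mu1 - mu2)(mu1 - mu2)^T, the outer product of d with itself
S_B = [[d[r] * d[c] for c in range(2)] for r in range(2)]
print(S_B)  # [[0.0, 0.0], [0.0, 81.0]]

# Rank 1: every 2x2 minor (here, the determinant) vanishes
det = S_B[0][0] * S_B[1][1] - S_B[0][1] * S_B[1][0]
print(det)  # 0.0
```

Since S_B is an outer product of a single vector with itself, its columns are all scalar multiples of **µ**_1 − **µ**_2, which is exactly what rank 1 means.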

In case we have more than two classes (|Y| > 2), the between-class
scatter matrix is generalized as

S_B = ∑_{c=1}^{|Y|} n_c (**µ**_c − **µ**)(**µ**_c − **µ**)^⊤,

with n_c referring to the number of data points falling into class c,
**µ**_c being the mean data point calculated from the n_c observations
of class c, and **µ** denoting the mean vector calculated from all the
data points irrespective of their class labels.
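A minimal sketch of the multi-class formula, using three made-up class means in 2-D (illustrative values only, not from the text):

```python
# Hypothetical three-class setup: class means and class sizes n_c
class_means = {0: [0.0, 0.0], 1: [2.0, 0.0], 2: [4.0, 0.0]}
class_sizes = {0: 2, 1: 2, 2: 2}

n = sum(class_sizes.values())
# Global mean as the size-weighted average of the class means
mu = [sum(class_sizes[c] * class_means[c][j] for c in class_means) / n
      for j in range(2)]

# S_B = sum_c n_c (mu_c - mu)(mu_c - mu)^T
S_B = [[0.0, 0.0], [0.0, 0.0]]
for c, mu_c in class_means.items():
    dev = [mu_c[j] - mu[j] for j in range(2)]
    for r in range(2):
        for s in range(2):
            S_B[r][s] += class_sizes[c] * dev[r] * dev[s]

print(mu)   # [2.0, 0.0]
print(S_B)  # [[16.0, 0.0], [0.0, 0.0]]
```

Note that each summand is again a rank-1 outer product, so the generalized S_B has rank at most |Y| − 1 (here the three collinear means even yield rank 1).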

On the other hand, someone might argue – along the lines of
“birds of a feather flock together” – that those points which share the same class label are supposed to be clustered densely after dimensionality reduction is performed. To put it differently, the average distance between the images of the original points within the same category should be minimized. This can be formally quantified with the help of the within-class scatter score between data points.

The within-class scatter for data points belonging to class c under a
particular projection given by **w** can be expressed as

s̃²_c = ∑_{(**x**_i, y_i) : y_i = c} (**w**^⊤**x**_i − **w**^⊤**µ**_c)² = ∑_{(**x**_i, y_i) : y_i = c} **w**^⊤(**x**_i − **µ**_c)(**x**_i − **µ**_c)^⊤**w** = **w**^⊤ S_c **w**,

with S_c denoting the scatter matrix calculated over the data points belonging to class c, i.e.

S_c = ∑_{(**x**_i, y_i) : y_i = c} (**x**_i − **µ**_c)(**x**_i − **µ**_c)^⊤.

For notational convenience, we shall refer to the sum of the within-class scatter matrices for classes c = 0 and c = 1 as the aggregated within-class scatter matrix, that is

S_W = S_0 + S_1,

giving us overall information on how the data points differ on average from the mean of the class they belong to.
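The chain of equalities defining the within-class scatter can be checked numerically. The sketch below uses three made-up 2-D points for a single class (illustrative values, not from the text) and confirms that the quadratic form **w**^⊤S_c **w** coincides with the scatter of the 1-D projections:

```python
# Toy points of one class c (assumed values for illustration)
X = [[1.0, 2.0], [3.0, 2.0], [2.0, 5.0]]

# Class mean mu_c
mu_c = [sum(x[j] for x in X) / len(X) for j in range(2)]  # [2.0, 3.0]

# S_c = sum_i (x_i - mu_c)(x_i - mu_c)^T
S_c = [[0.0, 0.0], [0.0, 0.0]]
for x in X:
    dev = [x[j] - mu_c[j] for j in range(2)]
    for r in range(2):
        for s in range(2):
            S_c[r][s] += dev[r] * dev[s]

w = [1.0, 0.5]  # an arbitrary projection direction

# Left side: the quadratic form w^T S_c w
quad = sum(w[r] * S_c[r][s] * w[s] for r in range(2) for s in range(2))

# Right side: scatter of the projections w^T x_i around w^T mu_c
proj_mu = sum(w[j] * mu_c[j] for j in range(2))
scatter = sum((sum(w[j] * x[j] for j in range(2)) - proj_mu) ** 2 for x in X)

print(quad, scatter)  # 3.5 3.5
```

The two quantities agree for any choice of **w**, which is exactly why the within-class criterion can be written as a quadratic form in a fixed matrix S_c.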

It turns out that these two requirements often act against each other, and the best one can do is to find a trade-off between them instead of performing optimally with respect to both of them at the same time. In order to give both of our goals a share in the objective function, the expression that we wish to maximize in the case of LDA is going to be a fraction. Maximizing a fraction is a good idea in this case, as it can be achieved by a large numerator and a small denominator. Hence the expression we aim at optimizing is

max_**w** (**w**^⊤ S_B **w**) / (**w**^⊤ S_W **w**), (6.21)

with S_B and S_W denoting the between-class and within-class scatter matrices, respectively.
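The objective in Eq. (6.21) is a ratio of two quadratic forms (a generalized Rayleigh quotient), so it depends only on the direction of **w**, not on its length. A small sketch with assumed scatter matrices (illustrative values, not from the text) demonstrates this scale invariance:

```python
# Assumed scatter matrices for illustration
S_B = [[0.0, 0.0], [0.0, 81.0]]   # between-class scatter (rank 1)
S_W = [[0.6, 0.0], [0.0, 6.0]]    # within-class scatter

def quad(M, w):
    """Quadratic form w^T M w for a 2x2 matrix M."""
    return sum(w[r] * M[r][s] * w[s] for r in range(2) for s in range(2))

def J(w):
    """The fractional LDA objective of Eq. (6.21)."""
    return quad(S_B, w) / quad(S_W, w)

w = [0.0, 1.0]
print(J(w))                     # 13.5
print(J([3.0 * c for c in w]))  # 13.5 -- only the direction of w matters
```

This invariance is why the optimum is determined only up to scale, which is consistent with the eigenvector characterization derived next.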

Eq. (6.21) can be maximized if

∇_**w** (**w**^⊤ S_B **w**) / (**w**^⊤ S_W **w**) = **0** ⇔ (**w**^⊤ S_B **w**) ∇_**w** **w**^⊤ S_W **w** = (**w**^⊤ S_W **w**) ∇_**w** **w**^⊤ S_B **w** (6.22)

is satisfied, which can be simplified as

S_B **w** = λ S_W **w**. (6.23)

Upon transitioning from Eq. (6.22) to Eq. (6.23) we made use of the
fact that ∇_**x** **x**^⊤A**x** = A^⊤**x** + A**x** = (A + A^⊤)**x** for any vector **x** and
matrix A. In the special case when matrix A is symmetric – which is exactly
what scatter matrices are – ∇_**x** **x**^⊤A**x** = 2A**x** also holds. Eq. (6.23)
closely resembles the standard eigenvalue problem, except
for the fact that there is an extra matrix multiplication on the right
hand side of the equation as well. These kinds of problems are called
**generalized eigenvalue problems**. There are more convoluted and
effective approaches to solve such problems, but we can also solve
them by simply left-multiplying both sides with S_W^{−1}, yielding

S_W^{−1} S_B **w** = λ**w**,

which we can regard as a regular eigenproblem. We shall add that in our special case, with only two class labels, we can also obtain the optimal solution in a simpler form, i.e.

**w**^∗ = S_W^{−1}(**µ**_1 − **µ**_2).
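The closed-form solution for the binary case can be verified on a small example. The sketch below uses the class means of Example 6.8 together with an assumed (diagonal, invertible) S_W, computes **w**^∗ = S_W^{−1}(**µ**_1 − **µ**_2), and checks that it satisfies the generalized eigenvalue equation S_B **w** = λ S_W **w**:

```python
# Class means from Example 6.8; S_W is an assumed illustrative matrix
mu1, mu2 = [4.0, 4.0], [4.0, -5.0]
S_W = [[0.6, 0.0], [0.0, 6.0]]          # within-class scatter (invertible)

d = [mu1[0] - mu2[0], mu1[1] - mu2[1]]  # mu1 - mu2
S_B = [[d[r] * d[c] for c in range(2)] for r in range(2)]  # rank-1 S_B

# Invert the 2x2 matrix S_W explicitly
det = S_W[0][0] * S_W[1][1] - S_W[0][1] * S_W[1][0]
S_W_inv = [[ S_W[1][1] / det, -S_W[0][1] / det],
           [-S_W[1][0] / det,  S_W[0][0] / det]]

# Closed-form solution w* = S_W^{-1} (mu1 - mu2)
w = [sum(S_W_inv[r][c] * d[c] for c in range(2)) for r in range(2)]
print(w)  # approximately [0.0, 1.5]

# Verify S_B w = lambda S_W w with lambda = (w^T S_B w) / (w^T S_W w)
SBw = [sum(S_B[r][c] * w[c] for c in range(2)) for r in range(2)]
SWw = [sum(S_W[r][c] * w[c] for c in range(2)) for r in range(2)]
lam = (sum(w[r] * SBw[r] for r in range(2))
       / sum(w[r] * SWw[r] for r in range(2)))
print(all(abs(SBw[r] - lam * SWw[r]) < 1e-9 for r in range(2)))  # True
```

Intuitively, since S_B **w** is always a multiple of **µ**_1 − **µ**_2 in the binary case, the generalized eigenproblem collapses to this single direction, and λ equals the optimal objective value of Eq. (6.21).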

**Example 6.8.** In order to geometrically illustrate the different solutions one
would get if the objective was either just the numerator or the denominator
of the joint objective function (Eq. (6.21)), let us consider the following
synthetic example problem.

Assume that the data points belonging to the positive class are
generated by N([4, 4], [0.3 0; 0 3]), that is, a bivariate Gaussian distribution
with mean vector [4, 4] and covariance matrix [0.3 0; 0 3]. Likewise, let us
assume that the negative class can be described as N([4, −5], [0.3 0; 0 3]),
i.e., another bivariate Gaussian which only differs from the previous one in
its mean being shifted by 9 units along the second coordinate.

Figure 6.24: An illustration of the effect of optimizing the joint fractional objective of LDA (b) and its numerator (c) and denominator (d) separately; panel (a) shows 10 points sampled from each of the two bivariate normal distributions with means [4, 4] and [4, −5].

Figure 6.24(a) includes the 10 points sampled from each of the positive and
negative classes. Figures 6.24(b)–(d) contain the optimal projections of this
sample dataset when we consider the entire objective of LDA (b), only the
term in the numerator (c), and only the term in the denominator (d).

Figure 6.24(c) nicely illustrates that when optimizing for the numerator of
the LDA objective, we are purely focusing on finding a hyperplane which
maximizes the separation between the data points belonging to the
different classes.

Figure 6.24(d), on the other hand, demonstrates that by exclusively focusing
on optimizing the denominator of the LDA objective, we obtain a
hyperplane which minimizes the scatter of the data points belonging to the
same class. At the same time, this approach does not pay any attention to
the separation of the data points belonging to the different classes.

The solution seen in Figure 6.24(b), however, does an equally good job at
separating points belonging to distinct categories and minimizing the
cumulative within-class scatter.

Table 6.5 contains the distinct parts of the objective function when
optimizing for certain parts of the objective in a tabular format. We can see that
– quite unsurprisingly – we indeed get the best objective value for Eq. (6.21)
when determining **w**^∗ according to the approach of LDA.

Alternative solutions – listed in the penultimate and the last rows of
Table 6.5 – are capable of obtaining better scores for certain parts (either the
numerator or the denominator) of the LDA objective, but they fail to do so
for the joint, i.e., fractional objective.

Try calculating the solution with Octave by solving the generalized eigenproblem defined in Eq. (6.23).

# sample 10 examples from the two Gaussian populations
X1 = mvnrnd([4 4], [0.3 0; 0 3], 10);
X2 = mvnrnd([4 -5], [0.3 0; 0 3], 10);

mu1 = mean(X1);
mu2 = mean(X2);
mean_diff = mu1 - mu2;

# aggregated within-class scatter matrix S_W = S_0 + S_1
Sw = (X1 - mu1)' * (X1 - mu1) + (X2 - mu2)' * (X2 - mu2);

# w* = S_W^{-1} (mu1 - mu2); note that mean() returns a row
# vector, so the mean difference has to be transposed
w = Sw \ mean_diff';

Figure 6.25: Code snippet demonstrating the procedure of LDA.

| Objective | **w**^∗ | **w**^⊤S_B**w** / **w**^⊤S_W**w** | **w**^⊤S_B**w** | **w**^⊤S_W**w** |
|---|---|---|---|---|
| max_**w** **w**^⊤S_B**w** / **w**^⊤S_W**w** | [−0.23, −0.97] | **2.32** | 85.22 | 36.80 |
| max_**w** **w**^⊤S_B**w** | [−0.01, 1.00] | 2.29 | **90.37** | 39.52 |
| min_**w** **w**^⊤S_W**w** | [−1.00, −0.07] | 0.05 | 0.38 | **8.13** |

Table 6.5: The values of the different components of the LDA objective (along the columns), assuming that we are optimizing towards certain parts of the objective (indicated at the beginning of the rows). Best values along each column are marked bold.