6.4 Linear Discriminant Analysis
Linear discriminant analysis (LDA) is a technique which also makes use of the class memberships of the data points when performing dimensionality reduction. More precisely, for LDA we are going to assume that our data points x_i are accompanied by a categorical class label y_i ∈ Y that characterizes them. For the sake of simplicity, we can assume |Y| = 2, that is, every point belongs to either the positive or the negative class. Having access to the class labels of the data points makes two different objectives equally sensible for reducing the dimensionality of our data points.
On the one hand, it can be argued that points belonging to different classes should be as separable from each other as possible after dimensionality reduction. What we want, in other words, is that points labeled differently mix to the least possible extent. From this perspective, our goal is to find a transformation characterized by w which maximizes the distance between the transformed data points belonging to the different classes. This goal can be equivalently expressed and formalized by relying on the means of the points belonging to the different classes, i.e., µ_1 and µ_2. This is due to the fact that applying the same transformation w to all the points will also affect their means accordingly, i.e., the transformed means are going to be w^⊤µ_1 and w^⊤µ_2. The first criterion hence can be expressed as
max_w w^⊤ S_B w,  (6.20)
where S_B is a rank-1 matrix responsible for characterizing the between-class scatter of the data points according to their original representation, and which can be conveniently calculated in the binary (|Y| = 2) case as

S_B = (µ_1 − µ_2)(µ_1 − µ_2)^⊤.
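To make the rank-1 claim concrete, here is a small NumPy sketch (the sample points are made up purely for illustration) that builds S_B from the two class means and checks its rank:

```python
import numpy as np

# Hypothetical 2-D sample points, three per class (values chosen arbitrarily).
X1 = np.array([[4.0, 4.0], [4.2, 5.1], [3.8, 2.9]])     # positive class
X2 = np.array([[3.5, -5.0], [3.6, -4.1], [3.4, -6.0]])  # negative class

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
d = (mu1 - mu2).reshape(-1, 1)

# Binary between-class scatter: the outer product of the mean difference,
# hence a rank-1 matrix.
S_B = d @ d.T
print(np.linalg.matrix_rank(S_B))  # 1
```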
In case we have more than two classes (|Y| > 2), the between-class scatter matrix is generalized as

S_B = ∑_{c∈Y} n_c (µ_c − µ)(µ_c − µ)^⊤,

with n_c referring to the number of data points falling into class c, µ_c being the mean data point calculated from the n_c observations and µ denoting the mean vector calculated from all the data points irrespective of their class labels.
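The multi-class formula can be sketched the same way; the three point clouds below are hypothetical, and the resulting S_B is symmetric and positive semidefinite by construction:

```python
import numpy as np

rng = np.random.default_rng(0)
# Three hypothetical classes of 2-D points with different sizes n_c.
classes = [rng.normal([0, 0], 1, (5, 2)),
           rng.normal([5, 0], 1, (7, 2)),
           rng.normal([0, 5], 1, (4, 2))]

mu = np.vstack(classes).mean(axis=0)  # overall mean of all points
# S_B = sum_c n_c (mu_c - mu)(mu_c - mu)^T
S_B = sum(len(Xc) * np.outer(Xc.mean(axis=0) - mu, Xc.mean(axis=0) - mu)
          for Xc in classes)
print(S_B.shape)  # (2, 2)
```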
On the other hand, someone might argue – along the lines of “birds of a feather flock together” – that those points which share the same class label are supposed to be clustered densely after dimensionality reduction is performed. To put it differently, the average distance between the images of the original points within the same category should be minimized. This can be formally quantified with the help of the within-class scatter score between data points.
The within-class scatter for data points belonging to class c for a particular projection given by w can be expressed as

∑_{i : y_i = c} w^⊤ (x_i − µ_c)(x_i − µ_c)^⊤ w = w^⊤ S_c w,

with S_c denoting the scatter matrix calculated over the data points belonging to class c, i.e.

S_c = ∑_{i : y_i = c} (x_i − µ_c)(x_i − µ_c)^⊤.
For notational convenience, we shall refer to the sum of the within-class scatter matrices for class c = 0 and c = 1 as the aggregated within-class scatter matrix, that is

S_W = S_0 + S_1,

giving us overall information on how the data points differ on average from the mean of the class they belong to.
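Assembling S_W for the binary case can likewise be sketched in a few lines of NumPy (the data is sampled, so the concrete numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
X1 = rng.multivariate_normal([4, 4], [[0.3, 0], [0, 3]], 10)   # class 0
X2 = rng.multivariate_normal([4, -5], [[0.3, 0], [0, 3]], 10)  # class 1

def scatter(X):
    # S_c = sum_i (x_i - mu_c)(x_i - mu_c)^T, written as one matrix product
    D = X - X.mean(axis=0)
    return D.T @ D

S_W = scatter(X1) + scatter(X2)  # aggregated within-class scatter
print(S_W.shape)  # (2, 2)
```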
It turns out that these two requirements often act against each other, and the best one can do is to find a trade-off between them instead of performing optimally with respect to both of them at the same time. In order to give both of our goals a share in the objective function, the expression that we wish to maximize in the case of LDA is going to be a fraction. Maximizing a fraction is a good idea in this case, as it can be achieved by a large numerator and a small denominator. Hence the expression we aim at optimizing is

max_w (w^⊤ S_B w) / (w^⊤ S_W w),  (6.21)
with S_B and S_W denoting the between-class and within-class scatter matrices, respectively.
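The fraction in Eq. (6.21) is a generalized Rayleigh quotient; a tiny helper (with made-up diagonal scatter matrices, purely for illustration) shows that it only depends on the direction of w, not on its length:

```python
import numpy as np

def lda_objective(w, S_B, S_W):
    # Eq. (6.21): w^T S_B w / w^T S_W w
    return (w @ S_B @ w) / (w @ S_W @ w)

# Toy diagonal scatter matrices (hypothetical values).
S_B = np.diag([0.0, 81.0])
S_W = np.diag([5.0, 30.0])

w = np.array([0.0, 1.0])
print(lda_objective(w, S_B, S_W))      # 2.7
print(lda_objective(3 * w, S_B, S_W))  # 2.7 -- rescaling w changes nothing
```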
Eq. (6.21) can be maximized if

∇_w (w^⊤ S_B w)/(w^⊤ S_W w) = 0 ⇔ (w^⊤ S_B w) ∇_w w^⊤ S_W w = (w^⊤ S_W w) ∇_w w^⊤ S_B w  (6.22)

is satisfied, which can be simplified as

S_B w = λ S_W w,  where λ = (w^⊤ S_B w)/(w^⊤ S_W w).  (6.23)
Upon transitioning from Eq. (6.22) to Eq. (6.23) we made use of the fact that ∇_x x^⊤Ax = A^⊤x + Ax = (A + A^⊤)x for any vector x and matrix A. In the special case when matrix A is symmetric – exactly what scatter matrices are – ∇_x x^⊤Ax = 2Ax also holds. Eq. (6.23) very much resembles the standard eigenvalue problem, except for the fact that there is an extra matrix multiplication on the right-hand side of the equation as well. Problems of this kind are called generalized eigenvalue problems. There are more convoluted and efficient approaches for solving them, but we can also solve ours by simply left-multiplying both sides by S_W^{-1}, yielding

S_W^{-1} S_B w = λ w,
which we can regard as a regular eigenproblem. We shall add that in our special case, with only two class labels, we can also obtain the optimal solution in a simpler closed form, i.e.

w* ∝ S_W^{-1} (µ_1 − µ_2).
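Both routes can be checked against each other numerically. The NumPy sketch below (using data sampled from distributions like those of Example 6.8) solves the eigenproblem for S_W^{-1} S_B and confirms that its leading eigenvector points in the same direction as the closed-form binary solution:

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.multivariate_normal([4, 4], [[0.3, 0], [0, 3]], 10)
X2 = rng.multivariate_normal([4, -5], [[0.3, 0], [0, 3]], 10)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
S_W = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)
S_B = np.outer(mu1 - mu2, mu1 - mu2)

# Route 1: eigenvector of S_W^{-1} S_B with the largest eigenvalue.
vals, vecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
w_eig = vecs[:, np.argmax(np.abs(vals))].real

# Route 2: the closed-form binary solution S_W^{-1} (mu1 - mu2).
w_closed = np.linalg.solve(S_W, mu1 - mu2)

# The two directions coincide up to scale (absolute cosine similarity 1).
cos = abs(w_eig @ w_closed) / (np.linalg.norm(w_eig) * np.linalg.norm(w_closed))
print(round(cos, 6))  # 1.0
```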
Example 6.8. In order to geometrically illustrate the different solutions one would get if the objective was just the numerator or just the denominator of the joint objective function (Eq. (6.21)), let us consider the following synthetic example problem.
Assume that those data points belonging to the positive class are generated by N([4, 4], [0.3 0; 0 3]), that is, a bivariate Gaussian distribution with mean vector [4, 4] and covariance matrix [0.3 0; 0 3]. Likewise, let us assume that the negative class can be described as N([4, −5], [0.3 0; 0 3]), i.e., another bivariate Gaussian which only differs from the previous one in its mean being shifted by 9 units along the second coordinate.
Figure 6.24: An illustration of the effect of optimizing the joint fractional objective of LDA (b) and its numerator (c) and denominator (d) separately. Panel (a) shows 10 points sampled from each of the two bivariate normal distributions with means [4, 4] and [4, −5].
Figure 6.24 (a) includes 10 points sampled from each of the positive and negative classes. Figures 6.24 (b)–(d) contain the optimal projections of this sample dataset when we consider the entire objective of LDA (b), only the term in the numerator (c) and only the term in the denominator (d).
Figure 6.24 (c) nicely illustrates that when optimizing for the numerator of the LDA objective, we are purely focusing on finding a hyperplane such that the separation between the data points belonging to the different classes gets maximized.
Figure 6.24 (d), on the other hand, demonstrates that by exclusively focusing on the optimization of the denominator of the LDA objective, we obtain a hyperplane which minimizes the scatter of the data points belonging to the same class. At the same time, this approach does not pay any attention to the separation of the data points belonging to the different classes.
The solution seen in Figure 6.24 (b), however, does an equally good job of separating points belonging to distinct categories and minimizing the cumulative within-class scatter.
Table 6.5 contains the distinct parts of the objective function when optimizing for certain parts of the objective. We can see that – quite unsurprisingly – we indeed get the best objective value for Eq. (6.21) when determining w* according to the approach of LDA.
Alternative solutions – listed in the penultimate and the last row of Table 6.5 – are capable of obtaining better scores for certain parts (either the numerator or the denominator) of the LDA objective, but they fail to do so for the joint, i.e., fractional objective.
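The comparison behind Table 6.5 can be re-created on a fresh sample. The exact numbers will differ from those in the table (they depend on the random draw), but the LDA direction always attains the best value of the fractional objective. A possible NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
X1 = rng.multivariate_normal([4, 4], [[0.3, 0], [0, 3]], 10)
X2 = rng.multivariate_normal([4, -5], [[0.3, 0], [0, 3]], 10)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
S_W = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)
S_B = np.outer(mu1 - mu2, mu1 - mu2)

candidates = {
    "LDA (max ratio)": np.linalg.solve(S_W, mu1 - mu2),
    "max numerator":   np.linalg.eigh(S_B)[1][:, -1],  # top eigenvector of S_B
    "min denominator": np.linalg.eigh(S_W)[1][:, 0],   # bottom eigenvector of S_W
}
ratios = {}
for name, w in candidates.items():
    w = w / np.linalg.norm(w)
    num, den = w @ S_B @ w, w @ S_W @ w
    ratios[name] = num / den
    print(f"{name}: ratio={num/den:.2f}  numerator={num:.2f}  denominator={den:.2f}")
```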
Try calculating the solution with Octave by solving the generalized eigenproblem defined in Eq. (6.23).
# sample 10 examples from the two Gaussian populations
X1 = mvnrnd([4 4], [0.3 0; 0 3], 10);
X2 = mvnrnd([4 -5], [0.3 0; 0 3], 10);
# class means (as row vectors) and their difference
mu1 = mean(X1);
mu2 = mean(X2);
mean_diff = mu1 - mu2;
# aggregated within-class scatter matrix
Sw = (X1 - mu1)' * (X1 - mu1) + (X2 - mu2)' * (X2 - mu2);
# closed-form solution: w is proportional to Sw^{-1}(mu1 - mu2);
# note the transpose turning mean_diff into a column vector
w = Sw \ mean_diff';
Figure 6.25: Code snippet demonstrating the procedure of LDA.
Objective                   | w*             | w^⊤S_Bw / w^⊤S_Ww | w^⊤S_Bw | w^⊤S_Ww
max w^⊤S_Bw / w^⊤S_Ww       | [−0.23, −0.97] | 2.32              | 85.22   | 36.80
max w^⊤S_Bw                 | [−0.01, 1.00]  | 2.29              | 90.37   | 39.52
min w^⊤S_Ww                 | [−1.00, −0.07] | 0.05              | 0.38    | 8.13
Table 6.5: The values of the different components of the LDA objective (along the columns) assuming that we are optimizing towards certain parts of the objective (indicated at the beginning of the rows). Best values along each column are marked bold.