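pkg load statistics  # provides mvnrnd in Octave
# sample 10 examples from each of the two Gaussian populations
X1 = mvnrnd([4 4], [0.3 0; 0 3], 10);
X2 = mvnrnd([4 -5], [0.3 0; 0 3], 10);
# class means (row vectors)
mu1 = mean(X1);
mu2 = mean(X2);
mean_diff = mu1 - mu2;
# within-class scatter matrix
Sw = (X1 - mu1)' * (X1 - mu1) + (X2 - mu2)' * (X2 - mu2);
# LDA direction: solve Sw * w = mean_diff' (a column vector) rather than forming inv(Sw)
w = Sw \ mean_diff';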
Figure 6.25: Code snippet demonstrating the procedure of LDA.
Objective                    w*                (w^T S_B w)/(w^T S_W w)   w^T S_B w   w^T S_W w
max (w^T S_B w)/(w^T S_W w)  [-0.23, -0.97]    2.32*                     85.22       36.80
max w^T S_B w                [-0.01, 1.00]     2.29                      90.37*      39.52
min w^T S_W w                [-1.00, -0.07]    0.05                      0.38        8.13*
Table 6.5: The values of the different components of the LDA objective (along the columns) when only certain parts of the objective are optimized (indicated at the beginning of the rows). The best value in each column is marked with an asterisk.
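The quantities compared in Table 6.5 can be computed for any direction w. The following is a minimal sketch rather than the code behind the table: the toy data is hypothetical, and it assumes the two-class between-class scatter S_B = (mu1 - mu2)^T (mu1 - mu2), whose definition is not repeated in this part of the text.

```octave
# Hypothetical toy data: two classes of 2-D points, one row per example.
X1 = [4 4; 5 6; 3 2; 4 5];
X2 = [4 -5; 5 -3; 3 -6; 4 -4];

mu1 = mean(X1);
mu2 = mean(X2);
d = (mu1 - mu2)';                  # mean difference as a column vector

# Within-class scatter, as in the snippet of Figure 6.25.
Sw = (X1 - mu1)' * (X1 - mu1) + (X2 - mu2)' * (X2 - mu2);
# Between-class scatter (assumed rank-one form for two classes).
Sb = d * d';

# LDA direction, scaled to unit length so that the values are comparable.
w = Sw \ d;
w = w / norm(w);

between = w' * Sb * w;             # w^T S_B w
within = w' * Sw * w;              # w^T S_W w
ratio = between / within;          # the full LDA objective
fprintf('w = [%.2f, %.2f]: S_B term %.2f, S_W term %.2f, ratio %.2f\n', ...
        w(1), w(2), between, within, ratio);
```

Replacing w by any other unit vector (for instance a coordinate axis) and recomputing the three values reproduces the kind of comparison shown in the rows of the table.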
6.5 Further reading
There is a wide range of further dimensionality reduction approaches that are outside the scope of this document. Canonical correlation analysis (CCA) (Hotelling, 1936) operates over two distinct representations (often called views) of the dataset and seeks one transformation per view, mapping the views into a common space such that the correlation between the different views of the same data points is maximized. Similar to other approaches discussed in this chapter, the problem can be solved as an eigenproblem of a special matrix. Hardoon et al. provide a detailed overview of the optimization problem and its applications. For the connection between CCA and PCA, refer to the tutorial by Borga (1999).
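To make the "eigenproblem of a special matrix" concrete, the sketch below uses one standard formulation of CCA, in which the squared canonical correlations are the eigenvalues of Cxx^{-1} Cxy Cyy^{-1} Cyx. The toy data and all variable names are hypothetical, and no regularization is applied, so this is an illustration under those assumptions rather than a robust implementation.

```octave
# Hypothetical toy data: two views of the same 8 data points (one row each).
t = (1:8)';
noise = [1; -1; 1; -1; 1; -1; 1; -1];
X = [t, t.^2];                     # view 1
Y = [2 * t + noise, sin(t)];       # view 2

n = size(X, 1);
Xc = X - mean(X);                  # center each view
Yc = Y - mean(Y);
Cxx = Xc' * Xc / (n - 1);          # within-view covariance matrices
Cyy = Yc' * Yc / (n - 1);
Cxy = Xc' * Yc / (n - 1);          # cross-covariance between the views

# Eigenvalues of M are the squared canonical correlations.
M = (Cxx \ Cxy) * (Cyy \ Cxy');
[V, L] = eig(M);
[rho2, best] = max(real(diag(L)));
a = real(V(:, best));              # projection direction for view 1
b = Cyy \ (Cxy' * a);              # matching direction for view 2 (up to scale)
rho = sqrt(rho2);                  # first canonical correlation

# Sanity check: the projected views should be correlated with strength rho.
R = corrcoef(Xc * a, Yc * b);
fprintf('canonical correlation %.4f, empirical check %.4f\n', rho, abs(R(1, 2)));
```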
With the advent of deep learning, deep canonical correlation analysis (DCCA) (Andrew et al., 2013) has also been proposed. DCCA learns a transformation expressed by a multi-layer neural network with non-linear activation functions. This allows DCCA to learn more complex transformations than standard CCA.
Contrary to most dimensionality reduction approaches, Locally Linear Embeddings (LLE) (Roweis and Saul, 2000) provide a non-linear model for dimensionality reduction. LLE focuses on the local geometry of the data points; however, it also implicitly preserves the global geometry of the observations, as illustrated by Figure 6.26. LLE relies on the assumption that even though the global geometry of our data might fail to be linear, individual data points can still be modelled linearly "microscopically", i.e., based on their nearest neighbors.
[Figure panels: Original data; PCA projection; LLE projection]
Figure 6.26: Illustration of the effectiveness of locally linear embedding when applied to a non-linear dataset originating from an S-shaped manifold.
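The "microscopically linear" assumption of LLE can be illustrated by reconstructing a single point from its nearest neighbors. The sketch below shows only this local step, following the usual description of LLE (weights that sum to one, obtained from the local Gram matrix); the data, the neighborhood size k and the regularization constant are hypothetical choices, and the subsequent embedding step is omitted.

```octave
# Hypothetical data: 9 points on a half circle, a curved 1-D manifold in 2-D.
t = linspace(0, pi, 9)';
X = [cos(t), sin(t)];

i = 5;                             # the point to be reconstructed
k = 2;                             # number of neighbors (assumed)

# Find the k nearest neighbors of point i, excluding the point itself.
dist2 = sum((X - X(i, :)).^2, 2);
dist2(i) = Inf;
[~, idx] = sort(dist2);
nbrs = idx(1:k);

# Local Gram matrix of the neighbor offsets, lightly regularized.
Z = X(nbrs, :) - X(i, :);
G = Z * Z';
G = G + 1e-3 * trace(G) * eye(k);

# Reconstruction weights: solve G w = 1 and rescale so that they sum to one.
w = G \ ones(k, 1);
w = w / sum(w);

recon = w' * X(nbrs, :);           # linear reconstruction from the neighbors
err = norm(X(i, :) - recon);
fprintf('weights: %s, reconstruction error: %.4f\n', mat2str(w', 3), err);
```

The reconstruction error is small relative to the extent of the data although the half circle as a whole cannot be described by a single linear model.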
Multidimensional scaling (MDS) (Kruskal and Wish, 1978) tries to give a low-dimensional (often 2-dimensional, for visualization purposes) representation of the data such that the pairwise distances between the data points, which are assumed to be given as input, are distorted as little as possible. Borg et al. (2012) provide a thorough, application-oriented overview of MDS.
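As an illustration of recovering coordinates from pairwise distances alone, the sketch below implements classical (Torgerson) MDS, which is one particular variant and not necessarily the one the cited works focus on. The input points are hypothetical and are used only to produce the distance matrix D.

```octave
# Hypothetical points, used only to generate the pairwise distance matrix.
P = [0 0; 3 0; 3 4; 0 4];
n = size(P, 1);
D = zeros(n);
for a = 1:n
  for b = 1:n
    D(a, b) = norm(P(a, :) - P(b, :));
  end
end

# Classical MDS: double-center the squared distances ...
J = eye(n) - ones(n) / n;
B = -0.5 * J * (D.^2) * J;
B = (B + B') / 2;                  # enforce exact symmetry

# ... and embed using the top eigenvectors scaled by sqrt(eigenvalue).
[V, L] = eig(B);
[vals, order] = sort(diag(L), 'descend');
Y = V(:, order(1:2)) * diag(sqrt(max(vals(1:2), 0)));

# Measure how much the pairwise distances are distorted by the embedding.
Dy = zeros(n);
for a = 1:n
  for b = 1:n
    Dy(a, b) = norm(Y(a, :) - Y(b, :));
  end
end
max_distortion = max(abs(D(:) - Dy(:)));
fprintf('largest change in a pairwise distance: %.2e\n', max_distortion);
```

Because the toy distances come from points that already lie in a plane, a 2-dimensional embedding reproduces them up to rounding; for genuinely high-dimensional data some distortion is unavoidable.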
t-distributed Stochastic Neighbor Embedding (t-SNE) (van der Maaten and Hinton, 2008) aims to find a low-dimensional mapping of typically high-dimensional data points. The algorithm determines the mapping from the pairwise similarities between data points, and it can be successfully employed for visualization purposes in 2 or 3 dimensions. t-SNE operates by trying to minimize the Kullback-Leibler divergence between the probability distribution defined over pairs of data points in the original high-dimensional space and the corresponding distribution over their low-dimensional images. t-SNE is known for its sensitivity to the choice of hyperparameters. This aspect of t-SNE is thoroughly analyzed by Wattenberg et al. (2016), with ample recommendations and practical considerations for applying t-SNE.
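The objective that t-SNE minimizes can be evaluated directly for a given embedding. The sketch below is deliberately simplified: it uses a single fixed Gaussian bandwidth instead of the per-point, perplexity-based bandwidths of the actual algorithm, and it only compares two hand-made 1-dimensional embeddings rather than optimizing one. All data and helper names are hypothetical.

```octave
# Hypothetical data: two tight pairs of points that are far from each other.
X = [0 0 0; 0 0 1; 5 5 5; 5 5 6];
Y_good = [0; 1; 10; 11];           # keeps the two pairs together
Y_bad = [0; 10; 1; 11];            # tears the pairs apart

# Helpers: squared distances, zeroed diagonal, normalization to sum one.
sqdist = @(A) sum(A.^2, 2) + sum(A.^2, 2)' - 2 * (A * A');
offdiag = @(M) M .* (1 - eye(size(M, 1)));
to_prob = @(M) M / sum(M(:));

# Gaussian similarities in the original space (fixed bandwidth: simplification).
P = to_prob(offdiag(exp(-sqdist(X) / 2)));
# Student-t similarities in the low-dimensional space.
q_of = @(Y) to_prob(offdiag(1 ./ (1 + sqdist(Y))));

# Kullback-Leibler divergence KL(P || Q), summed over pairs with P > 0.
kl = @(P, Q) sum(P(P > 0) .* log(P(P > 0) ./ Q(P > 0)));

kl_good = kl(P, q_of(Y_good));
kl_bad = kl(P, q_of(Y_bad));
fprintf('KL for the good embedding: %.3f, for the bad one: %.3f\n', kl_good, kl_bad);
```

The embedding that keeps similar points close receives the lower divergence, which is the behavior the optimization exploits.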
Most recently, Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP) (McInnes and Healy, 2018) has been introduced, which offers increased robustness over t-SNE by assuming that the dataset to be visualized is uniformly distributed on a Riemannian manifold, that the Riemannian metric is locally constant, and that the manifold is locally connected. For an efficient implementation by the authors of the paper, see https://github.com/lmcinnes/umap.
6.6 Summary of the chapter
In this chapter we familiarized ourselves with some of the most prominent techniques for dimensionality reduction. We derived principal component analysis (PCA) and the closely related singular value decomposition (SVD), and discussed their applications. The chapter also introduced CUR decomposition, which aims to remedy some of the shortcomings of SVD, and linear discriminant analysis (LDA), which differs substantially from the other approaches in that it also takes into account the class labels of the data points. At the end of the chapter, we provided references to a series of alternative algorithms; readers should be able to differentiate between them and argue for their strengths and weaknesses in a particular application.
Market basket analysis, i.e., analyzing which products customers frequently purchase together, has enormous business potential. Supermarkets with access to such information can design their promotion campaigns with it in mind, or use it to decide on their product placement strategy within the shops.
We define a transaction as the act of purchasing multiple items in a supermarket or a web shop. The number of transactions per day can range from a few hundred to several million. You can easily convince yourself of the latter if you think of the rush every year during Black Friday, for instance.
The problem of finding item sets which co-occur frequently is called frequent pattern mining, and our primary focus in this chapter is to familiarize the reader with the design and implementation of efficient algorithms that can be used to tackle the problem.
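To make the task concrete, the sketch below counts for every pair of items the number of baskets containing both, and reports the pairs that reach a threshold. It is a brute-force illustration on hypothetical data, not one of the efficient algorithms of this chapter; the names B, C and min_count are introduced here only for the example.

```octave
# Hypothetical transactions: rows are baskets, columns are items (1 = purchased).
items = {'bread', 'milk', 'beer', 'diapers'};
B = [1 1 0 0;
     1 0 1 1;
     0 1 1 1;
     1 1 1 1;
     1 1 0 1];

# C(i, j) is the number of baskets that contain both item i and item j.
C = B' * B;

# Report the item pairs that co-occur in at least min_count baskets.
min_count = 3;                     # assumed threshold
[r, c] = find(triu(C, 1) >= min_count);
for k = 1:numel(r)
  fprintf('%s and %s: %d baskets\n', items{r(k)}, items{c(k)}, C(r(k), c(k)));
end
```

The matrix product touches every pair of items, which is exactly what becomes infeasible when the number of items is large and motivates more careful algorithms.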
We should also note that frequent pattern mining need not be interpreted in its literal, physical sense, i.e., it is possible (and sometimes necessary) to think outside the box and treat the products and baskets we work with abstractly. This implies that the problem we discuss in this chapter has even larger practical implications than we might think at first glance.
Can you think of further non-trivial use cases where frequent pattern mining can be applied?
Example 7.1. A less trivial problem that can be tackled with the help of frequent pattern mining is that of plagiarism detection. In that case, one would look for document pairs ('item pairs') which share a substantial number of overlapping text fragments.
What kind of information would we find if in the plagiarism detection example we exchanged the roles of documents (items) and text fragments (baskets)?
In this setting, whenever a pair of documents uses the same phrase in their body, we treat them as a pair of items that co-occur in the same 'market basket'. Market baskets can hence be identified with, and labeled by, the phrases and text fragments included in documents.
Whenever we find a pair of documents in the same basket, it means that they share a piece of wording. If the number of their co-occurrences exceeds some threshold, it is reasonable to assume that this textual overlap is not purely due to chance.
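The setting of Example 7.1 can be sketched as follows. The example assumes that text fragments are word bigrams and that a document pair becomes suspicious once it shares at least two of them; both choices, as well as the toy documents, are hypothetical.

```octave
# Hypothetical documents playing the role of items.
docs = {'the quick brown fox jumps', ...
        'a quick brown fox sleeps', ...
        'completely different text here'};
n = numel(docs);

# Collect the distinct word bigrams of every document (the text fragments).
fragments = cell(n, 1);
for i = 1:n
  words = strsplit(docs{i}, ' ');
  grams = cell(1, numel(words) - 1);
  for j = 1:numel(words) - 1
    grams{j} = [words{j} ' ' words{j + 1}];
  end
  fragments{i} = unique(grams);
end

# shared(a, b): number of fragments (baskets) in which documents a and b co-occur.
shared = zeros(n);
for a = 1:n
  for b = a + 1:n
    shared(a, b) = numel(intersect(fragments{a}, fragments{b}));
  end
end

threshold = 2;                     # assumed co-occurrence threshold
[da, db] = find(shared >= threshold);
for k = 1:numel(da)
  fprintf('documents %d and %d share %d fragments\n', da(k), db(k), shared(da(k), db(k)));
end
```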
• Learn the concepts related to frequent item set mining
• Association rule mining
• Apriori principle
• Park-Chen-Yu algorithm
• FP-Growth and FP trees
Exercise 7.1. Suppose you have a collection of recipes including a list of