

```matlab
% sample 10 examples from each of the two Gaussian populations
X1 = mvnrnd([4 4], [0.3 0; 0 3], 10);
X2 = mvnrnd([4 -5], [0.3 0; 0 3], 10);

% class means and their difference
mu1 = mean(X1);
mu2 = mean(X2);
mean_diff = mu1 - mu2;

% within-class scatter matrix
Sw = (X1 - mu1)' * (X1 - mu1) + (X2 - mu2)' * (X2 - mu2);

% projection direction (the transpose turns mean_diff into a column vector)
w = Sw \ mean_diff';
```

Figure 6.25: Code snippet demonstrating the procedure of LDA.

The LDA objective (the Fisher criterion) is

$$\mathbf{w}^{*} = \arg\max_{\mathbf{w}} \frac{\mathbf{w}^{\top} S_B \mathbf{w}}{\mathbf{w}^{\top} S_W \mathbf{w}}.$$

| Optimization target | $\mathbf{w}$ | $\frac{\mathbf{w}^{\top} S_B \mathbf{w}}{\mathbf{w}^{\top} S_W \mathbf{w}}$ | $\mathbf{w}^{\top} S_B \mathbf{w}$ | $\mathbf{w}^{\top} S_W \mathbf{w}$ |
| --- | --- | --- | --- | --- |
| $\max_{\mathbf{w}} \frac{\mathbf{w}^{\top} S_B \mathbf{w}}{\mathbf{w}^{\top} S_W \mathbf{w}}$ | $[-0.23, -0.97]$ | **2.32** | 85.22 | 36.80 |
| $\max_{\mathbf{w}} \mathbf{w}^{\top} S_B \mathbf{w}$ | $[-0.01, 1.00]$ | 2.29 | **90.37** | 39.52 |
| $\min_{\mathbf{w}} \mathbf{w}^{\top} S_W \mathbf{w}$ | $[-1.00, -0.07]$ | 0.05 | 0.38 | **8.13** |

Table 6.5: The values of the different components of the LDA objective (along the columns) assuming that we are optimizing towards certain parts of the objective (indicated at the beginning of the rows). Best values along each column are marked bold.
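The maximizer of the Fisher criterion can equivalently be obtained from the generalized eigenproblem $S_B \mathbf{w} = \lambda S_W \mathbf{w}$. The sketch below, using toy scatter matrices (illustrative values, not the ones behind Table 6.5), checks numerically that the leading generalized eigenvector dominates random directions:

```python
import numpy as np

rng = np.random.default_rng(1)

# toy symmetric positive definite scatter matrices (illustrative values)
S_B = np.array([[2.0, 0.5], [0.5, 1.0]])
S_W = np.array([[1.5, 0.2], [0.2, 0.8]])

def fisher(w):
    """Fisher criterion w^T S_B w / w^T S_W w."""
    return (w @ S_B @ w) / (w @ S_W @ w)

# the maximizer solves S_B w = lambda * S_W w, i.e. an ordinary
# eigenproblem of S_W^{-1} S_B
M = np.linalg.solve(S_W, S_B)
eigvals, eigvecs = np.linalg.eig(M)
best = np.argmax(eigvals.real)
w_star = eigvecs[:, best].real

# the criterion value at w_star equals the largest generalized eigenvalue,
# and no random direction attains a larger value
for _ in range(100):
    assert fisher(w_star) >= fisher(rng.standard_normal(2)) - 1e-9
```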

*6.5* Further reading

There is a wide range of further dimensionality reduction approaches that are outside the scope of this document. Canonical correlation analysis (CCA) [Hotelling 1936] operates over two distinct representations (also often called views) of the dataset and tries to find transformations, one for each view, that map the distinct views into a common space such that the correlation between the different views of the same data points is maximized. Similar to other approaches discussed in this chapter, the problem can be solved as an eigenproblem of a special matrix. Hardoon et al. [2004] provide a detailed overview of the optimization problem and its applications. For a read on the connection of CCA to PCA, refer to the tutorial by Borga [1999].
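The eigenproblem formulation of CCA can be sketched in a few lines of NumPy: whiten each view's covariance and take an SVD of the whitened cross-covariance, whose singular values are the canonical correlations. The two synthetic views below share a one-dimensional latent signal and are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# two views of the same 500 data points sharing a 1-d latent signal z
n = 500
z = rng.standard_normal(n)
X = np.c_[z + 0.1 * rng.standard_normal(n), rng.standard_normal(n)]
Y = np.c_[rng.standard_normal(n), -z + 0.1 * rng.standard_normal(n)]

# center both views
X = X - X.mean(axis=0)
Y = Y - Y.mean(axis=0)

# covariance blocks
Cxx, Cyy, Cxy = X.T @ X / n, Y.T @ Y / n, X.T @ Y / n

def inv_sqrt(C):
    """Inverse matrix square root of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(C)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

# whiten, then the SVD yields the canonical correlations as singular values
K = inv_sqrt(Cxx) @ Cxy @ inv_sqrt(Cyy)
U, s, Vt = np.linalg.svd(K)
rho = s[0]                      # first canonical correlation
a = inv_sqrt(Cxx) @ U[:, 0]     # transformation for view 1
b = inv_sqrt(Cyy) @ Vt[0]       # transformation for view 2
```

Projecting the two views with `a` and `b` produces maximally correlated one-dimensional representations; with the strong shared signal above, `rho` comes out close to 1.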

With the advent of deep learning, deep canonical correlation analysis (DCCA) [Andrew et al. 2013] has also been proposed. DCCA learns a transformation expressed by a multi-layer neural network which involves non-linear activation functions. This property of DCCA offers the possibility to learn transformations of a more complex nature compared to standard CCA.

Contrary to most dimensionality reduction approaches, Locally Linear Embedding (LLE) [Roweis and Saul 2000] provides a non-linear model for dimensionality reduction. LLE focuses on the local geometry of the data points; however, it also preserves the global geometry of the observations in an implicit manner, as illustrated by Figure 6.26. LLE makes use of the assumption that even though the global geometry of our data might fail to be linear, individual data points could still be modelled linearly "microscopically", i.e., based on their nearest neighbors.
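This "microscopically linear" assumption can be made concrete: the first step of LLE reconstructs each point as an affine combination of its nearest neighbors by solving a small constrained least-squares problem. A sketch for a single point (the regularization constant and toy coordinates are illustrative choices):

```python
import numpy as np

def lle_weights(x, neighbors, reg=1e-3):
    """Reconstruction weights for one point: minimize ||x - sum_j w_j n_j||^2
    subject to sum_j w_j = 1, via the local Gram matrix."""
    Z = neighbors - x                          # shift neighbors to the query point
    G = Z @ Z.T                                # local Gram matrix (k x k)
    G = G + reg * np.trace(G) * np.eye(len(G)) # regularize for numerical stability
    w = np.linalg.solve(G, np.ones(len(G)))
    return w / w.sum()                         # enforce the sum-to-one constraint

# a point lying midway between its two neighbors on a line is
# reconstructed exactly with equal weights
x = np.array([1.0, 1.0])
neighbors = np.array([[0.0, 0.0], [2.0, 2.0]])
w = lle_weights(x, neighbors)   # close to [0.5, 0.5]
```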

Figure 6.26: Illustration of the effectiveness of locally linear embedding when applied on a non-linear dataset originating from an S-shaped manifold (panels: original data, PCA projection, LLE projection).

Multidimensional scaling (MDS) [Kruskal and Wish 1978] tries to give a low-dimensional (often 2-dimensional for visualization purposes) representation of the data, such that the pairwise distances between the data points – assumed to be given as an input – get distorted to the minimal extent. Borg et al. [2012] provide a thorough application-oriented overview of MDS.
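When the input dissimilarities happen to be Euclidean distances, the classical (Torgerson) variant of MDS even has a closed form: double-center the squared distance matrix and take the top eigenvectors. A sketch, with randomly generated planar points as illustrative input:

```python
import numpy as np

def classical_mds(D, dim=2):
    """Embed points from a pairwise distance matrix D via double centering."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J           # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)        # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:dim]    # keep the largest ones
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

# distances computed from known planar points are reproduced exactly
# (up to rotation and reflection of the embedding)
rng = np.random.default_rng(0)
P = rng.standard_normal((20, 2))
D = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)
Y = classical_mds(D, dim=2)
D_hat = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
```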

t-distributed Stochastic Neighbor Embedding (t-SNE) [van der Maaten and Hinton 2008] aims to find a low-dimensional mapping of typically high-dimensional data points. The algorithm builds upon the pairwise similarity between data points for determining their mapping, which can be successfully employed for visualization purposes in 2 or 3 dimensions. t-SNE operates by trying to minimize the Kullback–Leibler divergence between the probability distribution defined over the data points in the original high-dimensional space and their low-dimensional image. t-SNE is known for its sensitivity to the choice of hyperparameters. This aspect of t-SNE is thoroughly analyzed by Wattenberg et al. [2016], with ample recommendations and practical considerations for applying t-SNE.
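The objective that t-SNE minimizes can be written down compactly: Gaussian similarities P in the input space, heavy-tailed Student-t similarities Q in the embedding, and KL(P‖Q) as the loss. A simplified sketch that merely evaluates the loss for a candidate embedding (a single shared bandwidth is assumed here, whereas real t-SNE tunes one per point via the perplexity; the actual algorithm minimizes this loss by gradient descent over Y):

```python
import numpy as np

def sq_dists(Z):
    """Matrix of squared pairwise Euclidean distances."""
    return ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)

def tsne_kl(X, Y, sigma=1.0):
    """KL(P || Q) between Gaussian input similarities and Student-t
    embedding similarities (simplified: one global bandwidth)."""
    n = X.shape[0]
    mask = ~np.eye(n, dtype=bool)           # drop self-similarities
    P = np.exp(-sq_dists(X) / (2 * sigma ** 2))[mask]
    Q = (1.0 / (1.0 + sq_dists(Y)))[mask]   # heavy-tailed Student-t kernel
    P, Q = P / P.sum(), Q / Q.sum()         # normalize to distributions
    return np.sum(P * np.log(P / Q))

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 10))
kl = tsne_kl(X, X[:, :2])   # first two coordinates as a stand-in embedding
```

By Gibbs' inequality the loss is always non-negative, and it reaches zero only when the embedding similarities match the input similarities exactly.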

Most recently, Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP) [McInnes and Healy 2018] has been introduced, which offers increased robustness over t-SNE by assuming that the dataset to be visualized is uniformly distributed on a Riemannian manifold, that the Riemannian metric is locally constant, and that the manifold is locally connected. For an efficient implementation by the authors of the paper, see https://github.com/lmcinnes/umap.

*6.6* Summary of the chapter

In this chapter we familiarized ourselves with some of the most prominent dimensionality reduction techniques. We derived principal component analysis (PCA) and the closely related algorithm of singular value decomposition (SVD), together with their applications. This chapter also introduced the CUR decomposition, which aims to remedy some of the shortcomings of SVD. The chapter also discussed linear discriminant analysis (LDA), which substantially differs from the other approaches in that it also takes into account the class labels our particular data points belong to. At the end of the chapter, we provided additional references to a series of alternative algorithms, so that readers should be able to differentiate between them and argue for their strengths and weaknesses for a particular application.

Market basket analysis, i.e., analyzing which products customers frequently purchase together, has enormous business potential. Supermarkets with access to such information can set up their promotion campaigns with this valuable extra information in mind, or they can decide on their product placement strategy within the shops.

We define a transaction as the act of purchasing multiple items in a supermarket or a web shop. The number of transactions per day can range from a few hundred to several millions. You can easily convince yourself about the latter if you think of all the rush going on every year during Black Friday, for instance.

The problem of finding item sets which co-occur frequently is called **frequent pattern mining**, and our primary focus in this chapter is to make the reader familiar with the design and implementation of efficient algorithms that can be used to tackle the problem.

We should also note that frequent pattern mining need not be interpreted in its very physical sense, i.e., it is possible – and sometimes necessary – to think out-of-the-box and abstractly about the products and baskets we work with. This implies that the problem we discuss in this chapter has even larger practical implications than we might think at first glance.
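Before turning to the dedicated algorithms of this chapter, the problem can be illustrated naively: count the co-occurrences of every item pair across all transactions and keep the pairs above a support threshold. The transactions and the threshold below are toy values for illustration:

```python
from itertools import combinations
from collections import Counter

# toy transactions; items are product names (illustrative data)
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "diapers"},
    {"bread", "milk", "beer"},
    {"butter", "milk"},
]

min_support = 3  # minimum number of transactions a pair must appear in

pair_counts = Counter()
for basket in transactions:
    # count every unordered pair of items in the basket
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

frequent_pairs = {p: c for p, c in pair_counts.items() if c >= min_support}
# -> {('bread', 'milk'): 3}
```

This brute-force enumeration is quadratic in the basket size and enumerates every candidate pair; the algorithms later in the chapter exist precisely to avoid this blow-up.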

Can you think of further non-trivial use cases where frequent pattern mining can be applied?

**Example 7.1.** A less trivial problem that can be tackled with the help of frequent pattern mining is that of plagiarism detection. In that case, one would look for document pairs ('item pairs') which use a substantial amount of overlapping text fragments.

What kind of information would we find if, in the plagiarism detection example, we exchanged the roles of documents (items) and text fragments (baskets)?


In this setting, whenever a pair of documents uses the same phrase in their body, we treat them as a pair of items that co-occur in the same 'market basket'. Market baskets could hence be identified as and labeled by phrases and text fragments included in documents.

Whenever we find a pair of documents being present in the same basket, it means that they are using the same vocabulary. If their co-occurrence exceeds some threshold, it is reasonable to assume that this textual overlap is not purely due to chance.
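This role reversal can be sketched directly: treat each shared phrase as a basket containing the documents it occurs in, then count how often document pairs land in the same basket. The phrase extraction below is a naive word-trigram split over toy documents; a real system would use shingling and hashing:

```python
from itertools import combinations
from collections import Counter, defaultdict

# toy documents (illustrative data)
docs = {
    "doc_a": "the quick brown fox jumps over the lazy dog",
    "doc_b": "a quick brown fox jumps over a sleeping cat",
    "doc_c": "completely unrelated text about frequent pattern mining",
}

def trigrams(text):
    """Set of overlapping 3-word phrases of a document."""
    words = text.split()
    return {" ".join(words[i:i + 3]) for i in range(len(words) - 2)}

# baskets: phrase -> set of documents containing it
baskets = defaultdict(set)
for name, text in docs.items():
    for phrase in trigrams(text):
        baskets[phrase].add(name)

# count document pairs co-occurring in the same basket
pair_counts = Counter()
for members in baskets.values():
    for pair in combinations(sorted(members), 2):
        pair_counts[pair] += 1
# doc_a and doc_b share three trigrams; doc_c shares none
```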

**Learning Objectives:**

• Learn the concepts related to frequent item set mining

• Association rule mining

• Apriori principle

• Park-Chen-Yu algorithm

• FP-Growth and FP trees


**Exercise 7.1.** Suppose you have a collection of recipes including a list of