
Further reading


There is a wide range of further dimensionality reduction approaches that are outside the scope of this document. Canonical correlation analysis (CCA) [7] operates over two distinct representations (also often called views) of the data set and tries to find such transformations, one for each view, which map the distinct views into a common space such that the correlation between the different views of the same data points gets maximized. Similar to other approaches discussed in this chapter, the problem can be solved as an eigenproblem of a special matrix. Hardoon et al. [8] contains a detailed overview of the optimization problem and its applications. For a read on the connection of CCA to PCA, refer to the tutorial [9].

[7] Harold Hotelling. Relations between two sets of variates. Biometrika, 28(3/4):321–377, 1936. ISSN 0006-3444. doi: 10.2307/2333955. URL http://dx.doi.org/10.2307/2333955

[8] David R. Hardoon, Sandor R. Szedmak, and John R. Shawe-Taylor. Canonical correlation analysis: An overview with application to learning methods. Neural Comput., 16(12):2639–2664, December 2004. ISSN 0899-7667. doi: 10.1162/0899766042321814. URL http://dx.doi.org/10.1162/0899766042321814

[9] Magnus Borga. Canonical correlation: a tutorial, 1999.
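To make the above concrete, here is a minimal sketch that fits CCA on two synthetic views sharing a common latent signal and then measures how correlated the views become in the learned common space. The synthetic data, the parameter values, and the use of scikit-learn's CCA estimator are illustrative assumptions rather than part of the original discussion.

    # Minimal CCA sketch on two synthetic views of the same latent signal.
    import numpy as np
    from sklearn.cross_decomposition import CCA

    rng = np.random.RandomState(0)
    latent = rng.normal(size=(500, 2))    # shared signal behind both views
    view1 = latent @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(500, 6))
    view2 = latent @ rng.normal(size=(2, 8)) + 0.1 * rng.normal(size=(500, 8))

    cca = CCA(n_components=2)
    X_c, Y_c = cca.fit(view1, view2).transform(view1, view2)   # one transformation per view

    # Correlation between the two transformed views, per component
    for i in range(2):
        print(np.corrcoef(X_c[:, i], Y_c[:, i])[0, 1])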

With the advent of deep learning, deep canonical correlation analysis (DCCA) [10] has also been proposed. DCCA learns a transformation expressed by a multi-layer neural network which involves non-linear activation functions. This property of DCCA offers the possibility to learn transformations of a more complex nature compared to standard CCA.

[10] Galen Andrew, Raman Arora, Jeff Bilmes, and Karen Livescu. Deep canonical correlation analysis. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 1247–1255, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR. URL http://proceedings.mlr.press/v28/andrew13.html

Contrary to most dimensionality reduction approaches, Locally Linear Embedding (LLE) [11] provides a non-linear model for dimensionality reduction. LLE focuses on the local geometry of the data points; however, it also preserves the global geometry of the observations in an implicit manner, as illustrated by Figure 6.26. LLE makes use of the assumption that even though the global geometry of our data might fail to be linear, individual data points could still be modelled linearly “microscopically”, i.e., based on their nearest neighbors.

[11] Sam T. Roweis and Lawrence K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, 2000.

Figure 6.26: Illustration of the effectiveness of locally linear embedding when applied to a non-linear data set originating from an S-shaped manifold (panels: original data, PCA projection, LLE projection).
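In the spirit of Figure 6.26, the following sketch embeds a synthetic S-shaped manifold with both PCA and LLE using scikit-learn; the data set and parameter values are illustrative assumptions.

    # Minimal sketch: PCA vs. LLE on an S-shaped manifold.
    from sklearn.datasets import make_s_curve
    from sklearn.decomposition import PCA
    from sklearn.manifold import LocallyLinearEmbedding

    X, color = make_s_curve(n_samples=1000, random_state=0)

    X_pca = PCA(n_components=2).fit_transform(X)
    X_lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2).fit_transform(X)

    # Plotting both embeddings colored by `color` shows that LLE "unrolls"
    # the manifold, whereas PCA merely projects it linearly.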

Multidimensional scaling (MDS) [12] tries to give a low-dimensional (often 2-dimensional for visualization purposes) representation of the data, such that the pairwise distances between the data points – assumed to be given as an input – get distorted to the minimal extent. Borg et al. [13] provide a thorough application-oriented overview of MDS.

[12] J. B. Kruskal and M. Wish. Multidimensional Scaling. Sage Publications, 1978.

[13] Ingwer Borg, Patrick J. F. Groenen, and Patrick Mair. Applied Multidimensional Scaling. Springer Publishing Company, Incorporated, 2012. ISBN 3642318479, 9783642318474.
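As a small illustration of the fact that MDS only needs pairwise distances as input, the sketch below runs scikit-learn's MDS on a precomputed distance matrix; the random data and parameter values are illustrative assumptions.

    # Minimal MDS sketch on a precomputed pairwise distance matrix.
    import numpy as np
    from sklearn.manifold import MDS
    from sklearn.metrics import pairwise_distances

    X = np.random.RandomState(0).normal(size=(100, 10))
    D = pairwise_distances(X)     # any valid distance matrix can serve as input

    mds = MDS(n_components=2, dissimilarity='precomputed', random_state=0)
    X_2d = mds.fit_transform(D)   # 2-D layout distorting the given distances minimally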

t-distributed Stochastic Neighbor Embedding (t-SNE) [14] aims to find a low-dimensional mapping of typically high-dimensional data points. The algorithm builds upon the pairwise similarity between data points for determining their mapping, which can be successfully employed for visualization purposes in 2 or 3 dimensions. t-SNE operates by trying to minimize the Kullback–Leibler divergence between the probability distribution defined over the data points in the original high-dimensional space and their low-dimensional image. t-SNE is known for its sensitivity to the choice of hyperparameters. This aspect of t-SNE is thoroughly analyzed by Wattenberg et al. [15], with ample recommendations and practical considerations for applying t-SNE.

[14] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008. URL http://www.jmlr.org/papers/v9/vandermaaten08a.html

[15] Martin Wattenberg, Fernanda Viégas, and Ian Johnson. How to use t-SNE effectively. Distill, 2016. doi: 10.23915/distill.00002. URL http://distill.pub/2016/misread-tsne
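A minimal sketch of this sensitivity is given below: the same data set is embedded with several perplexity values, which typically yields visibly different maps. The digits data set and the chosen perplexity values are illustrative assumptions.

    # Minimal t-SNE sketch: the same data under different perplexity values.
    from sklearn.datasets import load_digits
    from sklearn.manifold import TSNE

    X, y = load_digits(return_X_y=True)

    for perplexity in (5, 30, 100):
        X_2d = TSNE(n_components=2, perplexity=perplexity, random_state=0).fit_transform(X)
        # Each embedding may place and shape the clusters differently,
        # which is why several settings should be inspected in practice.
        print(perplexity, X_2d.shape)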

Most recently, Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP) [16] has been introduced, which offers increased robustness over t-SNE by assuming that the data set to be visualized is uniformly distributed on a Riemannian manifold, that the Riemannian metric is locally constant, and that the manifold is locally connected. For an efficient implementation by the authors of the paper, see https://github.com/lmcinnes/umap.

[16] L. McInnes and J. Healy. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. ArXiv e-prints, February 2018.
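A minimal usage sketch of the aforementioned implementation (installable as umap-learn) is shown below; the data set and parameter values are illustrative assumptions.

    # Minimal UMAP sketch using the authors' umap-learn package.
    import umap
    from sklearn.datasets import load_digits

    X, y = load_digits(return_X_y=True)

    reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, random_state=42)
    X_2d = reducer.fit_transform(X)   # 2-D embedding suitable for visualization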


6.6 Summary of the chapter

In this chapter we familiarized ourselves with some of the most prominent dimensionality reduction techniques. We derived principal component analysis (PCA) and the closely related algorithm of singular value decomposition (SVD), together with their applications. This chapter also introduced the CUR decomposition, which aims to remedy some of the shortcomings of SVD. The chapter also discussed linear discriminant analysis (LDA), which substantially differs from the other approaches in that it also takes into account the class labels our particular data points belong to. At the end of the chapter, we provided additional references to a series of alternative algorithms; the reader should be able to differentiate between them and argue for their strengths and weaknesses in a particular application.

Market basket analysis, i.e., analyzing which products customers frequently purchase together, has enormous business potential. Supermarkets having access to such information can set up their promotion campaigns with this valuable extra information in mind, or they can also decide on their product placement strategy within their shops.

We define a transaction as the act of purchasing multiple items in a supermarket or a web shop. The number of transactions per day can range from a few hundred to several million. You can easily convince yourself about the latter if you think of all the rush going on every year during Black Friday, for instance.

The problem of finding item sets which co-occur frequently is called frequent pattern mining, and our primary focus in this chapter is to make the reader familiar with the design and implementation of efficient algorithms that can be used to tackle the problem.
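Before turning to those efficient algorithms, the brute-force sketch below counts co-occurring item pairs in a handful of toy transactions; the transactions and the support threshold are illustrative assumptions, and the algorithms discussed later in this chapter scale this basic idea to realistic data sizes.

    # Brute-force frequent pair counting over toy market baskets.
    from itertools import combinations
    from collections import Counter

    transactions = [
        {'bread', 'milk', 'butter'},
        {'bread', 'milk'},
        {'beer', 'diapers', 'bread'},
        {'bread', 'milk', 'diapers'},
    ]

    pair_counts = Counter()
    for basket in transactions:
        for pair in combinations(sorted(basket), 2):
            pair_counts[pair] += 1

    min_support = 2   # keep pairs present in at least this many transactions
    frequent_pairs = {p: c for p, c in pair_counts.items() if c >= min_support}
    print(frequent_pairs)   # {('bread', 'milk'): 3, ('bread', 'diapers'): 2}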

We should also note that frequent pattern mining need not be interpreted in its very physical sense, i.e., it is possible – and sometimes necessary – to think out-of-the-box and abstractly about the products and baskets we work with. This implies that the problem we discuss in this chapter has even larger practical implications than we might think at first glance.

Can you think of further non-trivial use cases where frequent pattern mining can be applied?

Example 7.1. A less trivial problem that can be tackled with the help of frequent pattern mining is that of plagiarism detection. In that case, one would look for document pairs (‘item pairs’) which use a substantial amount of overlapping text fragments.

What kind of information would we find if, in the plagiarism detection example, we exchanged the roles of documents (items) and text fragments (baskets)?

In this setting, whenever a pair of documents uses the same phrase in their body, we treat them as a pair of items that co-occur in the same ‘market basket’. Market baskets could hence be identified with and labeled by the phrases and text fragments included in documents.

Whenever we find a pair of documents being present in the same basket, it means that they are using the same vocabulary. If their co-occurrence count exceeds some threshold, it is reasonable to assume that this textual overlap is not purely due to chance.

Learning Objectives:

• Learn the concepts related to frequent item set mining

• Association rule mining

• Apriori principle

• Park-Chen-Yu algorithm

• FP-Growth and FP trees


Exercise7.1. Suppose you have a collection of recipes including a list of
