A Method to Process Images Data and Prediction Models for some MapReduce Applications

Ph.D. Dissertation

by

YangYuan Li

Supervisor

Prof. Van Tien Do

Budapest University of Technology and Economics

Budapest, Hungary 2020


I, the undersigned, hereby declare that I prepared this Ph.D. dissertation myself, and I only used the sources given at the end. Every part that was quoted word for word, or was taken over with the same content, I marked explicitly by giving the reference of the source.

The reviews of the dissertation and the report of the thesis discussion are available at the Dean's Office of the Faculty of Electrical Engineering and Informatics of the Budapest University of Technology and Economics.

Budapest, February 19, 2020

YangYuan Li

Abstract

Big data analytics is widely applied in diverse production fields for retrieving useful information and making decisions. Dimensionality reduction in high-dimensional data processing and energy consumption estimation for a specific computing job are both challenging problems that precede big data analytics. This thesis has two purposes: (1) to develop an efficient dimensionality reduction algorithm for 2D unlabeled image data, and (2) to characterize MapReduce applications for big data analytics and, further, to establish models for predicting resource dependency patterns and future usage parameters. In the dissertation, an analytical methodology is adopted to reach the first aim and a practical methodology is applied to reach the second one.

The dissertation is composed of six chapters, each dealing with a different aspect of this study.

Chapter 1 is introductory and presents the motivation, problems, objectives, and research methodology of the dissertation.

Chapter 2 proposes a novel unsupervised 2-dimensional dimensionality reduction method with graph embedding, which incorporates similarity matrix learning and global discriminative information into the procedure of dimensionality reduction. Furthermore, an efficient optimization algorithm is developed to realize the proposed method, and the convergence property of this algorithm is proven. The proposed method is compared with several 2-dimensional unsupervised dimensionality reduction methods, and the clustering performance is evaluated by K-means on several benchmark data sets. The obtained results show that the proposed method outperforms the state-of-the-art methods.

Chapter 3 characterizes several MapReduce applications by investigating and identifying the significant correlation coefficients of resource usage parameters and their lagged variables. The inter-dependency and inner-dependency of resource usage parameters are analyzed. The experimental results show that the identified signatures can be used to categorize MapReduce applications.

Chapter 4 concentrates on establishing multiple linear regression models for several MapReduce applications from the usage parameters perspective and on identifying the minimal data sampling time for stable modeling. The analytical results show that resource-intensive characteristics play an important role both in forecasting the resource dependency pattern and in determining the minimal number of data samples for stable modeling.

Chapter 5 establishes and compares resource usage parameter forecasting models for four benchmark MapReduce applications using multiple linear regression (MLR) and long short-term memory (LSTM), respectively. Simultaneously, the impact of data sample size on the prediction performance of long short-term memory models is investigated. The results show that LSTM models with a sufficient sample size exhibit higher prediction accuracy than multiple linear regression models, and that the resource-intensive characteristics are closely related to prediction accuracy. Additionally, in order to alleviate the underfitting problem, a two-phase modeling approach is proposed and validated on a read/write-intensive application. Moreover, extensive prediction based on a local LSTM model is proposed for a write-intensive application. The experimental results show that the prediction value obtained from the local model, multiplied by the standard disk I/O rate ratio, might be used to predict usage parameters on a heterogeneous machine for a write-intensive application.

The main conclusions and the applicability of the results are drawn in Chapter 6.

Contents

Abstract i

List of Figures viii

List of Tables ix

Acknowledgements xi

1 Introduction 1

1.1 Motivations . . . 1

1.1.1 High dimensionality reduction . . . 1

1.1.2 Characterization of MapReduce application . . . 2

1.1.3 Resource dependency pattern modeling . . . 2

1.1.4 Usage parameters prediction . . . 2

1.2 Objectives . . . 3

1.3 Thesis structure . . . 3

2 Discriminative Unsupervised 2D Dimensionality Reduction with Graph Embedding 5

2.1 Introduction . . . 6

2.2 Related works . . . 6

2.2.1 Unsupervised dimensionality reduction . . . 7

2.2.2 Spectral clustering . . . 7

2.3 The proposed method . . . 8

2.3.1 The DUGE algorithm . . . 11

2.3.2 The solution of parameter γ and matrix P . . . 12

2.3.3 Convergence analysis of algorithm . . . 14

2.4 Experimental analysis . . . 16

2.4.1 Data sets . . . 16

2.4.2 Evaluation metrics . . . 17

2.4.3 Comparison methods . . . 17


2.4.4 Experiment settings . . . 18

2.4.5 Experimental results . . . 18

2.4.6 Convergence analysis . . . 19

2.5 Conclusion . . . 20

3 Characterization of MapReduce Applications 21

3.1 Introduction . . . 22

3.2 Technical overview . . . 23

3.2.1 Apache Hadoop . . . 23

3.2.2 MapReduce programming model . . . 23

3.2.3 MapReduce application catalog . . . 24

3.2.4 Time series data . . . 24

3.3 Experimental environment and data collection . . . 25

3.3.1 Experimental environment . . . 25

3.3.2 Data collection . . . 25

3.4 Evaluation . . . 26

3.4.1 Explore non-randomness . . . 26

3.4.2 Non-randomness identification . . . 26

3.4.3 Correlated characteristic analysis . . . 29

3.4.4 Correlation matrix . . . 29

3.4.5 Evaluation and discussions . . . 31

3.5 Conclusion . . . 35

4 Multiple Linear Regression Models for MapReduce Applications 37

4.1 Introduction . . . 38

4.2 Technical Overview . . . 39

4.2.1 Multiple linear regression methods . . . 39

4.2.2 Collinearity problem . . . 39

4.2.3 K-fold cross-validation . . . 39

4.2.4 Subset selection approach . . . 40

4.3 Data collection . . . 40

4.4 Models . . . 41

4.4.1 Multiple linear regression model . . . 41

4.4.2 Data autocorrelation pattern . . . 42

4.4.3 Linear relationship investigation . . . 43

4.4.4 Feasibility test . . . 44

4.4.5 Modeling implementation . . . 45


4.4.6 Validation of autocorrelation of error terms . . . 48

4.5 The minimal sampling time for stable modeling . . . 49

4.5.1 Experimental design . . . 49

4.5.2 The minimal sampling time of estimated coefficient . . . 49

4.5.3 The minimal sampling time of statistical metrics . . . 50

4.6 Evaluation and discussion . . . 51

4.6.1 Estimated coefficients and statistical metrics . . . 51

4.6.2 The minimal sampling time for stable modeling . . . 53

4.7 Conclusion . . . 54

5 LSTM Models to Forecast Usage Parameters of MapReduce Applications 57

5.1 Introduction . . . 58

5.2 Related works . . . 59

5.3 Models . . . 60

5.3.1 Multiple linear regression model . . . 60

5.3.1.1 Identifying significant autoregressive term . . . 60

5.3.1.2 Multiple linear regression methods . . . 62

5.3.2 Multivariate LSTM model . . . 63

5.3.2.1 Hyperparameters learning algorithm . . . 64

5.3.2.2 Prediction algorithm . . . 67

5.4 Results and Discussion . . . 67

5.4.1 Environment and data collection . . . 68

5.4.2 CPU usage prediction comparison . . . 68

5.4.3 Prediction accuracy and impact of sample size . . . 70

5.4.3.1 Performance baseline model . . . 70

5.4.3.2 Prediction accuracy comparison . . . 70

5.4.3.3 Impact of sample size . . . 72

5.4.3.4 Overfitting and underfitting evaluation . . . 74

5.4.3.5 Two-phase modeling approach . . . 76

5.4.4 Characteristics of LSTM . . . 78

5.5 Conclusions . . . 81

6 Summary 83

Own Publication 84

Bibliography 86

List of Figures

2.1 Clustering ACC and NMI of DUGE on 4 data sets . . . 19

2.2 Convergence curve of DUGE on Coil20 dataset . . . 20

3.1 Autocorrelation plot and autocovariance plot for Pi application . . . 27

3.2 Autocorrelation plot and autocovariance plot for Terasort application . . . . 28

3.3 Correlation coefficient of CPU-intensive application . . . 32

3.4 Correlation coefficient of read-intensive application . . . 33

3.5 Correlation coefficient of write-intensive application . . . 33

3.6 Correlation coefficient of read/write-intensive application . . . 34

4.1 ACF and PACF plot of resource usage of Terasort application . . . 43

4.2 Residual ACF plot of regression model of Terasort . . . 48

4.3 Estimated coefficients distribution of read rate model of Terasort application . . . 50

4.4 Statistical metrics distribution of read rate model of Terasort application . . . 51

4.5 Estimated coefficients distribution of regression models . . . 52

4.6 RSE and R2 distribution of regression models . . . 53

4.7 The minimal sampling time of MapReduce applications . . . 54

4.8 The minimal sampling time of statistical metrics . . . 54

5.1 ACF and PACF plot of resource usage of Wordcount application . . . 61

5.2 Illustration of long short-term memory networks model . . . 63

5.3 Common module for calculating the mean of validation RMSE for each forecasting model . . . 65

5.4 Hyperparameters learning algorithm . . . 66

5.5 Dropout rate learning algorithm . . . 67

5.6 Time series graphs of real value vs prediction on CPU usage . . . 69

5.7 NRMSE of forecasting models for CPU usage . . . 71

5.8 NRMSE of forecasting models for predicting memory usage . . . 71

5.9 NRMSE of forecasting models for predicting read rate . . . 72

5.10 NRMSE of forecasting models for predicting write rate . . . 72


5.11 Sensitivity comparison of CPU usage forecasting models . . . 73

5.12 Sensitivity comparison of memory usage forecasting models . . . 74

5.13 Sensitivity comparison of read rate forecasting models . . . 74

5.14 Sensitivity comparison of write rate forecasting models . . . 75

5.15 Overfitting vs Underfitting . . . 75

5.16 Map phase vs Reduce phase . . . 76

5.17 Overall modeling vs separated phase modeling . . . 77

5.18 Real usage parameters comparison of Teragen between two machines . . . . 78

5.19 LSTM model using scale factor for CPU usage prediction of Teragen . . . . 80

5.20 LSTM model using scale factor for write rate prediction of Teragen . . . 81

List of Tables

2.1 Notation summary . . . 8

2.2 Description of data sets . . . 16

2.3 Clustering result in terms of accuracy . . . 18

2.4 Clustering result in terms of normalized mutual information . . . 19

3.1 Correlation matrix of Pi application . . . 29

3.2 Correlation matrix of Wordcount application . . . 30

3.3 Correlation matrix of Teragen application . . . 30

3.4 Correlation matrix of Terasort application . . . 30

3.5 Categorized threshold of correlation coefficient . . . 31

4.1 The largest partial autocorrelations and the corresponding lag numbers of applications . . . 44

4.2 Correlation matrix of Terasort application . . . 44

4.3 The increasing percentage of base error rate . . . 45

4.4 Multiple linear regression models . . . 47

5.1 Abbreviation form of mentioned regression models . . . 62

5.2 The abbreviation expressions of LSTM models . . . 68

5.3 Standard specification of two machines . . . 68

5.4 Abbreviation expressions of persistence models . . . 70

5.5 Notations of some variables . . . 79

5.6 Usage parameters mean on heterogeneous machines . . . 80

Acknowledgements

I would like to thank all the people who provided valuable assistance during my studies towards the Ph.D. degree.

I would like to express my sincere gratitude to Prof. Dr. Do Van Tien for his intensive supervision. Prof. Dr. Do Van Tien guided the direction of my research in the early period and taught me a lot about the vital ability to approach and solve problems. Without his continuous supervision and straight criticism, I could not have grown to accomplish this study and achieve the Ph.D. degree. I deeply thank Ms. Tran Thi Xuan, one of my colleagues in the Analysis, Design and Development of ICT Systems Laboratory at our department, for her cooperation and enthusiastic support throughout my research. All the other members of the Analysis, Design and Development of ICT Systems Laboratory, my colleagues, and the university staff are acknowledged.

Finally, I offer my heartfelt thanks to my parents, wife, and daughter for their love and constant encouragement. I am also grateful to all family members and friends who have supported me throughout.


Chapter 1

Introduction

1.1 Motivations

With the rapid development of digital technology and information systems, huge and complex repositories with terabytes (even petabytes) of data are being generated explosively. Usually, this kind of repository is called big data and has the characteristics of huge volume, high velocity, and enormous variety [88]. Consequently, such large and complex datasets challenge traditional database management and data processing tools.

As a consensus, images and image sequences (videos) make up about 80 percent of all corporate and public unstructured big data [83]. An inherent property of image data is its high dimensionality, which tends to make traditional statistical analysis approaches inapplicable. Therefore, dimensionality reduction methods play a crucial role in alleviating the curse of dimensionality, in advance of data analytics, in many fields such as multimedia event detection, image partitioning, video category recognition, gene expression, and time series prediction.

Furthermore, numerous tools are available for processing big data, such as Apache Hadoop, a collection of open-source software utilities that uses computing clusters to process massive amounts of data and computation. MapReduce is a key component of Hadoop for parallel data processing; fault-tolerant storage and high-throughput data processing are its highlight characteristics [1]. Following the MapReduce paradigm, numerous MapReduce applications have been developed for big data analytics jobs. However, the uncertainty of the resource dependency pattern and the demands of a specific computing job often lead to low efficiency and wasted energy on a specific distributed computing platform. Therefore, the characterization of MapReduce applications for categorization, resource dependency pattern modeling, and accurate usage parameter prediction are in great need, and they play a crucial role in resource allocation and scheduling strategies from the perspective of a cluster/cloud operator.

The following four subsections present the problem statements of the dissertation.

1.1.1 High dimensionality reduction

High-dimensional data processing is a crucial challenge in many fields such as multimedia event detection, image partitioning, video category recognition, gene expression, and time series prediction [115, 133]. The typical solution for alleviating this problem is to implement dimensionality reduction methods in advance of data analytics. In past decades, a number of dimensionality reduction methods have been proposed [4, 50, 18, 128]. The conventional clustering methods often rely on representations of the relationships among data points. Clustering is then accomplished by spectral or other graph-theoretic optimization procedures [76, 96]. However, most of the spectral-based clustering methods only focus on the local structures and ignore the global discriminative information of the data, which may lead to overfitting and degrade the clustering performance [121, 125]. Thus, a discriminative 2D unsupervised dimensionality reduction method is needed to solve these problems.

1.1.2 Characterization of MapReduce application

The most established software platform for big data analytics is Apache Hadoop, which has been widely applied in cloud computing. Originally, Hadoop was designed for computer clusters built from commodity hardware. It provides the MapReduce programming model for parallel data processing and the Hadoop Distributed File System (HDFS) for data storage [1].

Based on the MapReduce programming model, various MapReduce applications with different resource demands have been developed. Unfortunately, these resource demands differ greatly according to the computing goal, yet cloud operators and the relevant consumers cannot know these demands in advance, which may result in inappropriate resource allocation or reservation. Therefore, identifying the characteristics of workloads on a data analytics platform prior to execution could help a cluster owner or operator actively control computing resources, which can be beneficial for power saving and service performance improvement.

1.1.3 Resource dependency pattern modeling

MapReduce applications have been widely applied to big data analytics. The resource allocation of a computing job has a huge influence on the efficiency of a computing cluster. However, the unknown characteristics of resource dependency often lead to lower efficiency and a waste of the system's computing resources. To avoid these problems, modeling the dependency pattern among the resource usage parameters of MapReduce applications is crucially needed from the viewpoint of cloud operators. Many studies have used multiple linear regression (MLR) to predict performance metrics of MapReduce applications [61, 130, 123, 34]. However, the performance models constructed in these works mainly focus on predicting the execution time of Hadoop jobs, and none of them can be used for effective resource utilization by both users/consumers and service providers in the cloud.

1.1.4 Usage parameters prediction

The prediction of usage parameters constitutes one of the particularly significant tasks in the operation of computing clusters. MapReduce applications are widely used for processing huge amounts of data in both public and private clouds. Cloud providers can manage their resource usage by deriving future usage demand from the current and past usage patterns of resources. Therefore, accurately forecasting usage parameters is of great importance for the dynamic scaling of cloud resources, to achieve efficiency in terms of cost and energy consumption while keeping the quality of service. Although multiple linear regression (MLR) [61, 85, 123, 34] and long short-term memory (LSTM) [28, 106, 129, 13, 122] are widely applied to time series prediction in many fields, few studies have applied them to forecast the resource usage parameters of MapReduce applications.


1.2 Objectives

The main objectives of the dissertation include:

• the development of an efficient 2D unsupervised dimensionality reduction method for 2D unlabeled image data to alleviate overfitting and improve clustering performance,

• the characterization of MapReduce applications,

• the development of models to predict usage parameters of several MapReduce ap- plications.

In the dissertation, I applied mathematical analysis and statistical methods to investigate these problems.

1.3 Thesis structure

In Chapter 2, a novel unsupervised 2-dimensional dimensionality reduction method, which incorporates similarity matrix learning and global discriminative information into the procedure of dimensionality reduction, is proposed [J1]. This discriminative graph-embedding 2D unsupervised dimensionality reduction method learns projection matrices that are useful for clustering. Instead of using a predetermined similarity matrix to characterize the local structures of the original 2D data, the proposed approach involves the similarity matrix learning in the procedure of dimensionality reduction. Inspired by the observation that isolated local structures may incur overfitting and degrade the clustering performance, we integrated the global discriminative information into the proposed method. An iterative optimization algorithm is then derived to solve the proposed minimization problem. The convergence of the proposed algorithm is analyzed in theory and in experiments. Both the theoretical analysis and the experimental performance indicate the effectiveness and superiority of the proposed method. We compare the proposed method with several 2-dimensional unsupervised dimensionality reduction methods and evaluate the clustering performance by K-means [38] on several benchmark data sets. The obtained results show that the proposed method outperforms the state-of-the-art methods.

Chapter 3 characterizes MapReduce applications by analyzing and extracting the inter-dependency and inner-dependency relationships among resource usage parameters and their significant historic usage [J4]. The extracted characteristics might further be used to categorize applications based on the correlation coefficients of resource usage parameters. Simultaneously, the non-randomness of each usage parameter was investigated by calculating the relevant autocorrelation and autocovariance. As a result, the significant lagged variable of each usage parameter is identified based on the observations of non-randomness. By calculating Pearson correlation coefficients, the inter-dependency of resource usage parameters, including the autocorrelation of each usage parameter and the correlations among resource usage parameters, was investigated and analyzed. Based on the analysis, the specific groups of correlation coefficients for categorizing MapReduce applications were identified.

This work can be of much use for the efficient scheduling of MapReduce applications on commercial computing clouds, and it also helps cloud-service providers to predict the resource usage of their systems.
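As a rough illustration of this characterization workflow, the following Python sketch computes the lag-1 autocorrelation of each usage parameter and the Pearson correlation matrix between the parameters and their lagged copies; the data and column names are synthetic stand-ins, not the thesis's measurement schema.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for sampled usage parameters of one MapReduce run;
# the column names are illustrative assumptions only.
usage = pd.DataFrame(rng.random((600, 4)),
                     columns=["cpu", "memory", "read_rate", "write_rate"])

# Non-randomness check: lag-1 autocorrelation of each usage parameter.
# A significant value suggests keeping the lagged variable as a signature.
print(usage.apply(lambda s: s.autocorr(lag=1)))

# Pearson correlation matrix over the parameters and their lag-1 copies,
# covering both inner-dependency (a parameter vs. its own past) and
# inter-dependency (between different parameters).
lagged = usage.shift(1).add_suffix("_lag1")
print(pd.concat([usage, lagged], axis=1).corr(method="pearson").round(2))
```

On real measurements, the resulting coefficient groups would then be compared against categorization thresholds, as done in Chapter 3.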

In Chapter 4, resource usage dependency models are established for 7 benchmark MapReduce applications using multiple linear regression methods, and the minimal number of data samples for stable modeling is identified [J2]. Due to the significant autocorrelation, the associated autoregressive term is included after analyzing the non-randomness of resource usage. Based on intuitive observation of the correlation matrix and precise calculation of the base error rate improvement, the feasibility of linear regression modeling was ensured. Then, multiple linear regression models for the resource dependency pattern are established. The estimated coefficients of each model come from the ordinary least squares approach. The measurements of R2 and the residual standard error (RSE) are used to evaluate model fit quality. In addition, the minimal sampling time for stable modeling was investigated and identified by observing whether the change rates of the estimated coefficients and statistical metrics converged below a threshold (0.1). The numerical results showed that resource-intensive characteristics play an important role in forecasting the resource dependency pattern as well as in the minimal data sampling time for stable modeling.
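The following sketch shows, on assumed synthetic data, how such a dependency model can be fitted with ordinary least squares, including a lag-1 autoregressive term and the R2/RSE fit metrics named above; the column names and model specification are illustrative, not the exact models of Chapter 4.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)

# Synthetic stand-in for sampled resource usage of one MapReduce
# application; column names are illustrative assumptions.
df = pd.DataFrame(rng.random((500, 4)),
                  columns=["cpu", "memory", "read_rate", "write_rate"])

# Autoregressive term: the significant lag-1 value of the target itself,
# mirroring the chapter's use of the non-randomness analysis.
df["write_rate_lag1"] = df["write_rate"].shift(1)
df = df.dropna()

# Ordinary least squares fit of the resource dependency pattern.
X = sm.add_constant(df[["cpu", "memory", "read_rate", "write_rate_lag1"]])
fit = sm.OLS(df["write_rate"], X).fit()

print(fit.params)                       # estimated coefficients
print("R^2 :", fit.rsquared)            # fit-quality metric
print("RSE :", np.sqrt(fit.mse_resid))  # residual standard error
```

Refitting on growing prefixes of the samples and watching when the change rate of these outputs stays below the 0.1 threshold would reproduce the minimal-sampling-time test described above.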

Chapter 5 establishes forecasting models (multivariate long short-term memory and multiple linear regression) for the future resource usage of four MapReduce applications with distinct resource-intensive characteristics [C2, C1][J3]. To effectively evaluate the prediction accuracy and feasibility of both models, the NRMSE (normalized root mean squared error) and a performance baseline model were used. Meanwhile, the impact of sample size on prediction accuracy was also investigated. Moreover, a two-phase modeling approach was proposed for a read/write-intensive application (Terasort) to alleviate serious underfitting issues. Based on the scaled up/down characteristics of usage parameters on heterogeneous machines, an extensive prediction approach based on a local model is proposed for a write-intensive application. The results show that models using long short-term memory with a sufficient sample size exhibit higher accuracy than those using multiple linear regression, and that the resource-intensive characteristics are closely related to the prediction accuracy of the forecasting models.
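As a rough illustration of the evaluation protocol, the sketch below trains a small univariate Keras LSTM (the thesis's models are multivariate) and compares its NRMSE against a persistence baseline; the architecture, hyperparameters, and data are assumptions for demonstration only.

```python
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
series = rng.random(1000).astype("float32")   # stand-in usage time series

# Windowed supervised samples: predict the next value from the last `lag`.
lag = 10
X = np.stack([series[i:i + lag] for i in range(len(series) - lag)])[..., None]
y = series[lag:]

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(lag, 1)),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dropout(0.2),   # dropout rate is a tuned hyperparameter
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, batch_size=32, verbose=0)

# NRMSE: RMSE normalized by the observed range; the persistence model
# (next value = current value) serves as the performance baseline.
pred = model.predict(X, verbose=0).ravel()
span = y.max() - y.min()
nrmse = np.sqrt(np.mean((pred - y) ** 2)) / span
baseline = np.sqrt(np.mean((series[lag - 1:-1] - y) ** 2)) / span
print(nrmse, baseline)
```

A model is only useful when its NRMSE beats the persistence baseline, which is the sense in which the baseline establishes feasibility above.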

Finally, Chapter 6 summarizes the conclusions of the dissertation and describes the applicability of the new results.


Chapter 2

Discriminative Unsupervised 2D Dimensionality Reduction with Graph Embedding


2.1 Introduction

High-dimensional data processing is applied in many fields such as multimedia event detection, multimedia event recounting, video category recognition, gene expression, and time series prediction [115, 133]. Dimensionality reduction is a typical method to increase the data processing rate. In recent years, many one-dimensional dimensionality reduction methods have been proposed. However, the process of transforming 2-dimensional data matrices into one-dimensional vectors may destroy the structure of the data. Thus, 2-dimensional (2D) dimensionality reduction methods are considered a helpful substitute.

Since it is difficult to obtain label information, 2D unsupervised dimensionality reduction methods attract increasing attention. In the conventional 2D unsupervised dimensionality reduction methods, the similarity matrix plays an important role because of its efficiency and comprehensibility [4, 50, 18, 128]. Usually, similarity learning and dimensionality reduction are conducted in two separate steps, so the performance of dimensionality reduction highly depends on the quality of similarity learning. However, due to noise in the collected data, the learned similarity matrix may not be the optimal one [10, 59, 118]. Therefore, efforts have been devoted to finding a similarity matrix that captures the structures of the data.

For example, Nie et al. [76] proposed to learn the data similarity matrix by assigning adaptive nearest neighbors to each data point. Du and Shen [24] considered both the global and local structures of data and performed adaptive structure learning. Kodirov et al. [27] integrated graph learning into the 1-norm graph regularized optimization problem for robust subspace clustering.

Another crucial aspect of unsupervised dimensionality reduction is spectral clustering, which is closely related to the similarity matrix. The conventional clustering methods often rely on representations of the relationships among data points. Clustering is then accomplished by spectral or other graph-theoretic optimization procedures [76, 96]. However, most of the spectral-based clustering methods only focus on the local structures and ignore the global discriminative information of the data, which may lead to overfitting and degrade the clustering performance [121, 125].

In this chapter, we propose a discriminative 2D unsupervised dimensionality reduction method, named Discriminative Unsupervised 2D Dimensionality Reduction with Graph Embedding (DUGE). DUGE mitigates the negative impact of a predetermined similarity matrix. After transforming the 2-dimensional data matrices into the corresponding 1-dimensional vectors, the proposed method incorporates the global discriminative information of the data distribution. Extensive experiments are conducted on several real-world benchmark data sets to validate the proposed method. The experimental results show that our method outperforms the state-of-the-art methods.

The structure of this chapter is as follows. Section 2.2 gives an overview of the state of the art in similarity matrix based unsupervised dimensionality reduction methods and spectral clustering. In section 2.3, we propose Discriminative Unsupervised 2D Dimensionality Reduction with Graph Embedding (DUGE). Section 2.4 presents extensive experimental results. Conclusions are drawn in section 2.5.

2.2 Related works

Our work is mainly related to unsupervised dimensionality reduction. Particularly, we focus on 2D unsupervised dimensionality reduction with similarity matrix learning and spectral clustering learning. In this section, we first introduce some state-of-the-art similarity matrix based unsupervised dimensionality reduction methods, and then we extend the discussion to spectral clustering.

2.2.1 Unsupervised dimensionality reduction

Principal Component Analysis (PCA) is widely applied for its efficiency and comprehensibility. Most of the PCA-based 1D dimensionality reduction methods first transform each data sample into a vector and then construct a covariance matrix to extract features [16, 42, 51, 101, 66, 132]. However, when the dimensionality of the samples is high, it is difficult to calculate the covariance matrix [59, 69]. Thus, to cope with this limitation of conventional PCA, Yang et al. [117] proposed a 2-dimensional principal component analysis (2DPCA) method that computes the covariance matrix on the original 2D image matrices. Because the image covariance matrix is smaller than the original covariance matrix, the time needed to extract image features is small. Nevertheless, 2DPCA only works in the row direction, so it cannot fully reduce the dimensionality of the feature space. Therefore, Zhang et al. developed a 2-dimensional PCA model that simultaneously considers the row and column directions [21].

Although great progress has been achieved, the PCA-based 2D dimensionality reduction methods might fail to obtain a desirable subspace representation when the distance between two clusters is shorter than the intra-cluster distance [120]. In unsupervised dimensionality reduction methods, the local structures of the data distribution attract more and more attention; for example, locality preserving projections (LPP) describe the adjacency relationships of data points by constructing a similarity matrix [40, 116, 118]. Later, Hu et al. [42] extended the conventional LPP model to its 2D version by predetermining the adjacency relations in the original 2D image space. Once the similarity graph is determined, it stays fixed in the subsequent procedures, and the performance of the dimensionality reduction is mainly determined by the similarity matrix [4, 119, 118]. Owing to the existence of noise in the original data, the LPP-based methods might fail to construct an ideal similarity graph [56, 76, 95]. Nie et al. proposed a 1-dimensional adaptive-neighbors clustering algorithm called Clustering and Projected Clustering with Adaptive Neighbors (PCAN), which learns the similarity matrix and the clustering structure simultaneously [76]. On the basis of PCAN, Wang et al. proposed a Discriminative Unsupervised Dimensionality Reduction (DUDR) method, which also constructs the similarity matrix in the procedure of dimensionality reduction [108]. Zhao et al. proposed an Unsupervised 2D Dimensionality Reduction with Adaptive Structure Learning (DRASL) method, which constructs the similarity matrix by learning the local structures of the 2D image space in the dimensionality reduction process [131].

2.2.2 Spectral clustering

Spectral clustering is used to learn local geometric structures. The general formula of spectral clustering can be written as

$$\min_{F^T F = I} \mathrm{Tr}(F^T L F), \tag{2.1}$$

where $L$ is the Laplacian matrix [15]. Most of the spectral-based clustering algorithms construct the similarity matrix ahead of dimensionality reduction. To cope with noise in the data, Nie et al. proposed the Clustering and Projected Clustering with Adaptive Neighbors method, which incorporates spectral clustering learning [76]. Local Learning-based Clustering (LLC) utilizes a kernel regression model for label prediction, based on the assumption that the class label of a data point can be determined by its neighbors [114, 113]. However, these methods only focus on the learning of local structures, which may induce overfitting under certain conditions. To deal with this problem, Yang et al. proposed a Nonnegative Spectral Clustering with Discriminative Regularization (NSDR) method, which takes both the local structures and the global structures into account [125]. The objective function of NSDR can be formulated as

$$\min_{F^T F = I} \mathrm{Tr}(F^T L F) + \xi\,\Omega(F), \tag{2.2}$$

where $\xi \ge 0$ is a regularization parameter, $\mathrm{Tr}[F^T L F]$ is associated with the local structures, and $\Omega(F)$ contains the global discriminative information.

Although (2.2) is easily applied to obtain the relaxed cluster indicator, the corresponding similarity matrix is still constructed on the original data. Therefore, noise may cause problems for the conventional spectral-based clustering algorithms.
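To make (2.1) concrete, the following minimal numpy sketch computes the relaxed cluster indicator $F$ as the $c$ eigenvectors of the graph Laplacian belonging to the smallest eigenvalues; the similarity matrix here is a random stand-in rather than a learned one.

```python
import numpy as np

rng = np.random.default_rng(0)
N, c = 100, 3

# Stand-in nonnegative similarity matrix, symmetrized; in DUGE this
# matrix is learned rather than fixed in advance.
P = rng.random((N, N))
W = (P + P.T) / 2.0

# Unnormalized graph Laplacian L = D - W.
L = np.diag(W.sum(axis=1)) - W

# min_{F^T F = I} Tr(F^T L F) is solved by the c eigenvectors of L
# belonging to the c smallest eigenvalues (Ky Fan's theorem).
eigvals, eigvecs = np.linalg.eigh(L)
F = eigvecs[:, :c]
print(F.shape, eigvals[:c])
```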

2.3 The proposed method

Let $X_1, X_2, \dots, X_N$ denote a 2D image data set, where the $N$ data points are sampled from $c$ clusters. $P_{ij}$ represents the similarity between data points $X_i$ and $X_j$ ($X_i, X_j \in \mathbb{R}^{m \times n}$, $i, j = 1, 2, \dots, N$). Let $U \in \mathbb{R}^{m \times u}$ and $V \in \mathbb{R}^{n \times v}$ be the row-directional and column-directional projection matrices, respectively.

Notation : Description

$N$ : the number of data points
$c$ : cluster number
$L$ : Laplacian matrix
$P$ : similarity matrix
$P_i$ : the $i$-th column of matrix $P$ $(i = 1, 2, \dots, N)$
$I$ : identity matrix
$\mathbf{1} \in \mathbb{R}^N$ : a column vector with all elements equal to 1
$\mathrm{Tr}(\cdot)$ : trace operator
$r(\cdot)$ : rank operator
$t$ : the iteration step in the DUGE algorithm
$\gamma$ : a regularization parameter of the penalty term
$f_i$ : the row vector of the $i$-th point in the cluster indicator matrix
$F$ : the cluster indicator matrix
$\Omega(F)$ : the global discriminative information
$\lambda$ : a value large enough to keep the $c$ smallest eigenvalues of $L$ equal to zero
$\xi$ : a regularization parameter for the trade-off of the global discriminative information
$S_b$ : between-cluster scatter matrix
$S_t$ : total scatter matrix
$\hat{X}_i$ : the vector form of the 2D image data
$\mu$ : the mean of all data points in $\hat{X}$

Table 2.1. Notation summary


The dimensionality reduction problem is formulated [68, 25] as

$$\min_{P,U,V} \sum_{i,j=1}^{N} \left( \|U^T X_i V - U^T X_j V\|_F^2 \, P_{ij} + \gamma P_{ij}^2 \right)$$
$$\text{s.t.}\quad U^T U = I,\; V^T V = I,\; P\mathbf{1} = \mathbf{1},\; 0 \le P_{ij} \le 1,\; 1 \le i, j \le N, \tag{2.3}$$

where $P = [P_{ij}] \in \mathbb{R}^{N \times N}$. In order to avoid a trivial solution, the penalty term $\gamma \sum_{i,j=1}^{N} P_{ij}^2$ is imposed in (2.3), where $\gamma$ is a regularization parameter [76].

Assume that each data point is given a function value $f_i \in \mathbb{R}^{1 \times c}$. From [31], we have

$$\sum_{i,j=1}^{N} \|f_i - f_j\|_2^2 \, P_{ij} = 2\,\mathrm{Tr}(F^T L F), \tag{2.4}$$

where $F = [f_1^T, f_2^T, \dots, f_N^T]^T \in \mathbb{R}^{N \times c}$, $L = D - \frac{P^T + P}{2}$ is the Laplacian matrix, and $D$ is a diagonal matrix with $D_{ii} = \sum_j (P_{ij} + P_{ji})/2$.

According to [17, 70], the number of connected components in the graph associated with the similarity matrix $P$ is equal to the multiplicity $c$ of the eigenvalue $0$ when $P$ is nonnegative. If $r(L) = N - c$, the data points can be assigned to $c$ clusters, so we can add this rank constraint to (2.3). Therefore, the problem can be written as

$$\min_{P,U,V} \sum_{i,j=1}^{N} \left( \|U^T X_i V - U^T X_j V\|_F^2 \, P_{ij} + \gamma P_{ij}^2 \right)$$
$$\text{s.t.}\quad P\mathbf{1} = \mathbf{1},\; 0 \le P_{ij} \le 1,\; 1 \le i, j \le N,\; U^T U = I,\; r(L) = N - c. \tag{2.5}$$

Let $\sigma_i \ge 0$ denote the $i$-th smallest eigenvalue of $L$. If $\sum_{i=1}^{c} \sigma_i = 0$, then $r(L) = N - c$ holds. Moreover, according to Ky Fan's theorem [31], we have

$$\sum_{i=1}^{c} \sigma_i = \min_{F \in \mathbb{R}^{N \times c},\, F^T F = I} \mathrm{Tr}(F^T L F). \tag{2.6}$$

Therefore, (2.5) is equivalent to [131]

$$\min_{P,U,V,F} \sum_{i,j=1}^{N} \left( \|U^T X_i V - U^T X_j V\|_F^2 \, P_{ij} + \gamma P_{ij}^2 + \lambda \|f_i - f_j\|_2^2 \, P_{ij} \right)$$
$$\text{s.t.}\quad U^T U = I,\; V^T V = I,\; P\mathbf{1} = \mathbf{1},\; 0 \le P_{ij} \le 1,\; 1 \le i, j \le N,\; F^T F = I, \tag{2.7}$$

where $\lambda$ is a value large enough to keep the $c$ smallest eigenvalues of $L$ equal to zero.

Suppose $\hat{X} = [\hat{X}_1, \hat{X}_2, \dots, \hat{X}_N]$, where $\hat{X}_i \in \mathbb{R}^M$ $(i = 1, \dots, N)$ is the vector form of the 2D data matrix $X_i$ and $M = m \times n$. $H = I - \frac{1}{N}\mathbf{1}\mathbf{1}^T$ is the centering matrix. The between-cluster scatter matrix and the total scatter matrix can be obtained as [125]:

$$S_b = \sum_{i=1}^{c} N_i (\mu_i - \mu)(\mu_i - \mu)^T = \tilde{X} F F^T \tilde{X}^T, \tag{2.8}$$

$$S_t = \sum_{i=1}^{N} (\hat{X}_i - \mu)(\hat{X}_i - \mu)^T = \tilde{X} \tilde{X}^T, \tag{2.9}$$

where $\mu_i$ is the mean of the data points in the $i$-th cluster, $\mu$ is the mean of all data points in $\hat{X}$, $N_i$ is the number of data points belonging to the $i$-th cluster, and $\tilde{X} = \hat{X} H$. To minimize the within-cluster scatter and maximize the between-cluster scatter, the following formulation is proposed:

$$\max_{F^T F = I} \mathrm{Tr}\left[(S_t + \mu I)^{-1} S_b\right], \tag{2.10}$$

where $\mu I$ is used to make $S_t + \mu I$ invertible. (2.10) is actually the learning of the global discriminative information. Combining (2.8) and (2.9), (2.10) can be reformulated as:

$$\max_{F^T F = I} \mathrm{Tr}\left[F^T \tilde{X}^T (\tilde{X}\tilde{X}^T + \mu I)^{-1} \tilde{X} F\right]. \tag{2.11}$$

Since

$$\mathrm{Tr}[F^T H F] = \mathrm{Tr}\left[F^T \left(I - \tfrac{1}{N}\mathbf{1}\mathbf{1}^T\right) F\right] = c - 1, \tag{2.12}$$

(2.10) is equivalent to

$$\min_{F^T F = I} \mathrm{Tr}[F^T H F] - \mathrm{Tr}\left[(S_t + \mu I)^{-1} S_b\right]. \tag{2.13}$$

Let us recall that our aim is to find the two projection matrices. Thus, combining (2.7), (2.11), (2.12) and (2.13), we arrive at

$$\min_{P,U,V,F} \sum_{i,j=1}^{N} \left( \|U^T X_i V - U^T X_j V\|_F^2 \, P_{ij} + \gamma P_{ij}^2 + \lambda \|f_i - f_j\|_2^2 \, P_{ij} \right) + \xi\,\Omega(F)$$
$$\text{s.t.}\quad U^T U = I,\; V^T V = I,\; P\mathbf{1} = \mathbf{1},\; 0 \le P_{ij} \le 1,\; 1 \le i, j \le N,\; F^T F = I, \tag{2.14}$$

where $\Omega(F) = \mathrm{Tr}\left[F^T H F - F^T \tilde{X}^T (\tilde{X}\tilde{X}^T + \mu I)^{-1} \tilde{X} F\right]$ and $\xi \ge 0$ is a regularization parameter [125].

2.3.1 The DUGE algorithm

In this section, we propose an algorithm for solving (2.14). As a result, the projection matrices $U$ and $V$ that map the original data points from the high-dimensional space to a lower-dimensional space can be obtained.

When $U$, $V$ and $P$ are fixed, $F$ can be obtained by solving the following problem:

$$\min_{F \in \mathbb{R}^{N \times c},\, F^T F = I} \mathrm{Tr}\left[F^T (2\lambda L + \xi R) F\right], \tag{2.15}$$

where $R = H - \tilde{X}^T (\tilde{X}\tilde{X}^T + \mu I)^{-1} \tilde{X}$. By setting the derivative with respect to $F$ to $0$, $F$ can be solved.

With $P$ and $F$ fixed, (2.14) is transformed into

$$\min_{U^T U = I,\, V^T V = I} \sum_{i,j=1}^{N} \|U^T X_i V - U^T X_j V\|_F^2 \, P_{ij}. \tag{2.16}$$

Note that

$$G(U,V) = \sum_{i,j=1}^{N} \|U^T X_i V - U^T X_j V\|_F^2 \, P_{ij}; \tag{2.17}$$

$$W_v = \sum_{i,j=1}^{N} P_{ij} (X_i - X_j) V V^T (X_i - X_j)^T; \tag{2.18}$$

and

$$W_u = \sum_{i,j=1}^{N} P_{ij} (X_i - X_j)^T U U^T (X_i - X_j). \tag{2.19}$$

Combining (2.17), (2.18) and (2.19), $G(U,V)$ can be written as

$$G(U,V) = \mathrm{Tr}(U^T W_v U) = \mathrm{Tr}(V^T W_u V). \tag{2.20}$$

With $V$ fixed, (2.16) can be written as:

$$\min_{U^T U = I} \mathrm{Tr}(U^T W_v U). \tag{2.21}$$

The solution for $U$ is given by the orthogonal generalized eigenvectors of $W_v$ corresponding to the $u$ smallest generalized eigenvalues.

In a similar way, with $U$ fixed, optimization problem (2.16) can be written as:

$$\min_{V^T V = I} \mathrm{Tr}(V^T W_u V). \tag{2.22}$$

The solution for $V$ to problem (2.16) is formed by the $v$ eigenvectors corresponding to the $v$ smallest eigenvalues of $W_u$.

With $U$, $V$ and $F$ fixed, we can obtain $P$ by tackling the following problem:

$$\min_{P} \sum_{i,j=1}^{N} \left( \|U^T X_i V - U^T X_j V\|_F^2 \, P_{ij} + \gamma P_{ij}^2 + \lambda \|f_i - f_j\|_2^2 \, P_{ij} \right)$$
$$\text{s.t.}\quad P\mathbf{1} = \mathbf{1},\; 0 \le P_{ij} \le 1,\; 1 \le i, j \le N. \tag{2.23}$$

s.t. P1=1,0≤Pij 1,1≤i, j≤N (2.23) Let us introduce d1ij =∥UTXiV −UTXjV∥2F, d2ij =∥fi−fj22 and dij =d1ij +d2ij and an optimization problem

min

PiT1=1,0Pij1,1jN

Pi+ 1 2γdi2

2, (2.24)

(27)

where Pi,(i= 1,2..., N), is thei-th column of similarity matrix P.

This optimization problem turns into a conventional Euclidean projection problem on the simplex, and it can be solved efficiently using the methods proposed in [76].

We propose a procedure for solving (2.14), which is depicted in Algorithm 1.

Algorithm 1: The optimization algorithm of DUGE

Data: data points $X_1, X_2, \dots, X_N$; parameters $k$, $c$, $u$, $v$, $\xi$ and $\lambda$.
Result: projection matrices $U \in \mathbb{R}^{m \times u}$ and $V \in \mathbb{R}^{n \times v}$.

1. Initialize column $i$ of $P$, $i = 1, \dots, N$, by solving the optimization problem
$$\min_{P\mathbf{1} = \mathbf{1},\; 0 \le P_{ij} \le 1} \sum_{j=1}^{N} \left( \|X_i - X_j\|_F^2 \, P_{ij} + \gamma P_{ij}^2 \right);$$
2. Initialize $V$ and $U$ as arbitrary column-orthogonal matrices; set $t = 0$;
3. repeat
   (a) update $L^t = D^t - \frac{(P^t)^T + P^t}{2}$, where $D^t \in \mathbb{R}^{N \times N}$ is a diagonal matrix whose $i$-th diagonal element is $\sum_j (P_{ij}^t + P_{ji}^t)/2$;
   (b) update $F^t$, whose columns are the $c$ eigenvectors of $(2\lambda L^t + \xi R)$ corresponding to its $c$ smallest eigenvalues;
   (c) update $U^t$, whose columns are the $u$ eigenvectors of $W_v^t$ corresponding to the $u$ smallest eigenvalues in (2.21);
   (d) update $V^t$, whose columns are the $v$ eigenvectors of $W_u^t$ corresponding to the $v$ smallest eigenvalues in (2.22);
   (e) update the $i$-th column of $P^t$, $i = 1, \dots, N$, by solving (2.24), where $d_i \in \mathbb{R}^{N \times 1}$ is the vector whose $j$-th element is $d_{ij} = d_{ij}^1 + d_{ij}^2$;
   (f) $t = t + 1$;
   until convergence;
4. Return the projection matrices $U$ and $V$.
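For illustration, the following numpy sketch implements one pass of the alternating updates of Algorithm 1 under simplifying assumptions (dense eigendecompositions; the $P$-update of (2.24) is omitted here and sketched after (2.31)); it is not the reference implementation.

```python
import numpy as np

def duge_iteration(X, P, V, R, lam, xi, c, u, v):
    """One DUGE-style alternating update (illustrative sketch).
    X: (N, m, n) array of 2D data matrices; R as defined after (2.15)."""
    # Laplacian L = D - (P^T + P)/2 of the current similarity graph.
    W = (P + P.T) / 2.0
    L = np.diag(W.sum(axis=1)) - W

    # F: the c eigenvectors of (2*lam*L + xi*R) with smallest eigenvalues.
    F = np.linalg.eigh(2.0 * lam * L + xi * R)[1][:, :c]

    # All pairwise differences X_i - X_j, shape (N, N, m, n).
    diffs = X[:, None, :, :] - X[None, :, :, :]

    # U from (2.18)/(2.21): u smallest eigenvectors of W_v (V fixed).
    DV = diffs @ V                                   # (X_i - X_j) V
    Wv = np.einsum("ij,ijab,ijcb->ac", P, DV, DV)
    U = np.linalg.eigh(Wv)[1][:, :u]

    # V from (2.19)/(2.22): v smallest eigenvectors of W_u (new U fixed).
    DU = np.einsum("ijab,ac->ijcb", diffs, U)        # U^T (X_i - X_j)
    Wu = np.einsum("ij,ijab,ijac->bc", P, DU, DU)
    V = np.linalg.eigh(Wu)[1][:, :v]
    return F, U, V
```

The O(N^2) pairwise-difference tensor keeps the sketch short; a practical implementation would accumulate $W_v$ and $W_u$ without materializing it.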

2.3.2 The solution of parameter γ and matrix P

In our proposed method, $\gamma$ is an important parameter connected with the construction of the similarity matrix $P$. In order to ensure that each data point has only $k$ neighbors, the following method is adopted to determine the value of $\gamma$. The Lagrangian function of (2.24) can be formulated as

$$\frac{1}{2} \left\| P_i + \frac{d_i}{2\gamma_i} \right\|_2^2 - \alpha (P_i^T \mathbf{1} - 1) - \beta_i^T P_i. \tag{2.25}$$

Under the KKT conditions [7], the solution $P_{ij}$ can be obtained as

$$P_{ij} = \left( -\frac{d_{ij}}{2\gamma_i} + \alpha \right)_{+}. \tag{2.26}$$


In order to ensure that $P_i$ has only $k$ non-zero elements, we must have

$$\begin{cases} -\dfrac{Q_{it}}{2\gamma_i} + \alpha > 0, & t = 1, \dots, k; \\[4pt] -\dfrac{Q_{it}}{2\gamma_i} + \alpha \le 0, & t = k+1, \dots, N; \end{cases} \tag{2.27}$$

where $Q$ is obtained by sorting each row of $D$ in ascending order, and $D$ is the matrix whose $ij$-th element is $d_{ij}$. Additionally, imposing the constraint $\sum_{j=1}^{N} P_{ij} = 1$ on (2.25), $\alpha$ can be obtained as

$$\alpha = \frac{1}{k} + \frac{1}{2k\gamma_i} \sum_{t=1}^{k} Q_{it}. \tag{2.28}$$

Note that $\gamma_i$ satisfies

$$\frac{k}{2} Q_{ik} - \frac{1}{2} \sum_{t=1}^{k} Q_{it} < \gamma_i \le \frac{k}{2} Q_{i,k+1} - \frac{1}{2} \sum_{t=1}^{k} Q_{it}. \tag{2.29}$$

Without loss of generality, we set

$$\gamma_i = \frac{k}{2} Q_{i,k+1} - \frac{1}{2} \sum_{t=1}^{k} Q_{it}, \tag{2.30}$$

and $\gamma$ can be obtained as

$$\gamma = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{k}{2} Q_{i,k+1} - \frac{1}{2} \sum_{t=1}^{k} Q_{it} \right). \tag{2.31}$$
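A minimal numpy sketch of the closed-form neighbor assignment in (2.26)–(2.31) follows; the distance matrix is a synthetic stand-in for $d_{ij} = d_{ij}^1 + d_{ij}^2$, and distinct distance values are assumed.

```python
import numpy as np

rng = np.random.default_rng(0)
N, k = 100, 5
d = rng.random((N, N))      # stand-in for d_ij = d^1_ij + d^2_ij

Q = np.sort(d, axis=1)      # each row of D sorted in ascending order

# Per-row gamma_i from (2.30) and the global gamma from (2.31).
gamma_i = 0.5 * k * Q[:, k] - 0.5 * Q[:, :k].sum(axis=1)
gamma = gamma_i.mean()

# alpha from (2.28), then the k-sparse rows of P from (2.26).
alpha = 1.0 / k + Q[:, :k].sum(axis=1) / (2.0 * k * gamma_i)
P = np.maximum(-d / (2.0 * gamma_i[:, None]) + alpha[:, None], 0.0)

print(gamma)
print(P.sum(axis=1)[:5])    # each row has k non-zeros and sums to 1
```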

2.3.3 Convergence analysis of algorithm

According to the following theorem, the objective function value decreases monotonically during the iteration process.

Theorem 1. The inequality

$$\sum_{i,j=1}^{N} \left( \|(U^{t+1})^T X_i V^{t+1} - (U^{t+1})^T X_j V^{t+1}\|_F^2 \, P_{ij}^{t+1} + \gamma (P_{ij}^{t+1})^2 + \lambda \|f_i^{t+1} - f_j^{t+1}\|_2^2 \, P_{ij}^{t+1} \right) + \xi\,\Omega(F^{t+1})$$
$$\le \sum_{i,j=1}^{N} \left( \|(U^{t})^T X_i V^{t} - (U^{t})^T X_j V^{t}\|_F^2 \, P_{ij}^{t} + \gamma (P_{ij}^{t})^2 + \lambda \|f_i^{t} - f_j^{t}\|_2^2 \, P_{ij}^{t} \right) + \xi\,\Omega(F^{t})$$

holds in the iterative process, where $t$ is the iteration step.

Proof:

After the $t$-th iteration of Algorithm 1, the updated $U$, $V$, $P$ and $F$ are denoted as $U^t$, $V^t$, $P^t$ and $F^t$, respectively. Similarly, they are denoted as $U = U^{t+1}$, $V = V^{t+1}$, $P = P^{t+1}$ and $F = F^{t+1}$ in the next iteration.


If we fix $P^t$, $V^t$ and $F^t$, the following inequality is obtained:

$$\sum_{i,j=1}^{N} \left( \|(U^{t+1})^T X_i V^{t} - (U^{t+1})^T X_j V^{t}\|_F^2 \, P_{ij}^{t} + \gamma (P_{ij}^{t})^2 + \lambda \|f_i^{t} - f_j^{t}\|_2^2 \, P_{ij}^{t} \right) + \xi\,\Omega(F^{t})$$
$$\le \sum_{i,j=1}^{N} \left( \|(U^{t})^T X_i V^{t} - (U^{t})^T X_j V^{t}\|_F^2 \, P_{ij}^{t} + \gamma (P_{ij}^{t})^2 + \lambda \|f_i^{t} - f_j^{t}\|_2^2 \, P_{ij}^{t} \right) + \xi\,\Omega(F^{t}). \tag{2.32}$$

In the same way, if $P^t$, $U^t$ and $F^t$ are fixed, we have

$$\sum_{i,j=1}^{N} \left( \|(U^{t})^T X_i V^{t+1} - (U^{t})^T X_j V^{t+1}\|_F^2 \, P_{ij}^{t} + \gamma (P_{ij}^{t})^2 + \lambda \|f_i^{t} - f_j^{t}\|_2^2 \, P_{ij}^{t} \right) + \xi\,\Omega(F^{t})$$
$$\le \sum_{i,j=1}^{N} \left( \|(U^{t})^T X_i V^{t} - (U^{t})^T X_j V^{t}\|_F^2 \, P_{ij}^{t} + \gamma (P_{ij}^{t})^2 + \lambda \|f_i^{t} - f_j^{t}\|_2^2 \, P_{ij}^{t} \right) + \xi\,\Omega(F^{t}). \tag{2.33}$$

When $U^t$, $V^t$ and $F^t$ are fixed,

$$\sum_{i,j=1}^{N} \left( \|(U^{t})^T X_i V^{t} - (U^{t})^T X_j V^{t}\|_F^2 \, P_{ij}^{t+1} + \gamma (P_{ij}^{t+1})^2 + \lambda \|f_i^{t} - f_j^{t}\|_2^2 \, P_{ij}^{t+1} \right) + \xi\,\Omega(F^{t})$$
$$\le \sum_{i,j=1}^{N} \left( \|(U^{t})^T X_i V^{t} - (U^{t})^T X_j V^{t}\|_F^2 \, P_{ij}^{t} + \gamma (P_{ij}^{t})^2 + \lambda \|f_i^{t} - f_j^{t}\|_2^2 \, P_{ij}^{t} \right) + \xi\,\Omega(F^{t}). \tag{2.34}$$

With $P^t$, $V^t$ and $U^t$ fixed, we obtain

$$\sum_{i,j=1}^{N} \left( \|(U^{t})^T X_i V^{t} - (U^{t})^T X_j V^{t}\|_F^2 \, P_{ij}^{t} + \gamma (P_{ij}^{t})^2 + \lambda \|f_i^{t+1} - f_j^{t+1}\|_2^2 \, P_{ij}^{t} \right) + \xi\,\Omega(F^{t+1})$$
$$\le \sum_{i,j=1}^{N} \left( \|(U^{t})^T X_i V^{t} - (U^{t})^T X_j V^{t}\|_F^2 \, P_{ij}^{t} + \gamma (P_{ij}^{t})^2 + \lambda \|f_i^{t} - f_j^{t}\|_2^2 \, P_{ij}^{t} \right) + \xi\,\Omega(F^{t}). \tag{2.35}$$

Additionally, consider the following two inequalities:

$$\sum_{i,j=1}^{N} \|(U^{t+1})^T X_i V^{t+1} - (U^{t+1})^T X_j V^{t+1}\|_F^2 \, P_{ij}^{t+1} \le \sum_{i,j=1}^{N} \|(U^{t})^T X_i V^{t} - (U^{t})^T X_j V^{t}\|_F^2 \, P_{ij}^{t}, \tag{2.36}$$

and

$$\sum_{i,j=1}^{N} \lambda \|f_i^{t+1} - f_j^{t+1}\|_2^2 \, P_{ij}^{t+1} + \xi\,\Omega(F^{t+1}) \le \sum_{i,j=1}^{N} \lambda \|f_i^{t} - f_j^{t}\|_2^2 \, P_{ij}^{t} + \xi\,\Omega(F^{t}). \tag{2.37}$$

Combining the above inequalities yields the inequality of Theorem 1.

