
Dimension reduction procedures are found in numerous applications in the fields of object classification [78, 77, 76], clustering [79], data exploration [80], feature selection [81] or feature extraction [82]. Classical methods, such as Principal Component Analysis (PCA) [83], Independent Component Analysis (ICA) [84] and Linear Discriminant Analysis (LDA) [85] are most commonly employed. Unsupervised methods, such as the Fuzzy k-Means (FKM), have also shown great results in the context of text retrieval [86].

In this particular application, however, it is more appropriate to describe instances of objects as unordered sets or graphs of feature vectors. In the thesis, these classes are referred to as Structured Composite Classes (SCC), while the individual feature vectors of an instance are referred to as nodes of the object or class. While it is possible to apply classic dimension reduction methods to this problem, these algorithms do not produce optimal results when applied to SCCs.

Structured Composite Classes also appear in a wide range of computer vision applications. It is common to describe the visual appearance of an object using local image features, such as SIFT [87]. Some models, such as Bag of Visual Words [88], describe objects as sets of visual features, while others, such as part-based or constellation models [89], treat them as graphs. The local feature approach also appears in 3D shape recognition [90]. Another occurrence of SCCs is in Schnabel, Wessel, Wahl, and Klein [53], who, similarly to the proposed method, describe 3D shapes as graphs of primitive shapes. It is worth noting that by embedding graph nodes, graphs of vectors can be reduced to sets of vectors [56].

The method discussed in this chapter involves spotting or localizing subgraphs or subsets corresponding to a certain object class (segmentation by recognition). Our strategy is to recognize individual nodes and use the node labels to infer the presence of an object from a given class. This method requires discrimination not only between classes in general but also between the individual nodes of the objects. The ability to tell different nodes apart also provides the opportunity to weight them by their relative importance.

This extra criterion is also extremely useful in cases where object registration or pose estimation is required. If it is easy to discriminate between nodes, then it is possible to learn a class model consisting of multiple nodes (including the relative positions of the individual nodes). Then, by matching nodes between the observation and the model, pose estimation can be performed. However, if the nodes are too similar after the dimension reduction, the matching will be ambiguous. It is worth noting that in most of these applications only class labels are available, that is, individual nodes are not labeled based on their similarity.

One of the most well-known dimension reduction methods is Principal Component Analysis (PCA) [83], which determines the principal components of the original dataset. Principal components form an orthogonal basis in the variable space, with the first and last components showing the directions of the largest and smallest variance, respectively. All other principal components follow in descending order of variance.

By transforming the dataset into the new basis and discarding the variables with small variances, PCA is able to compute the optimal compression for a dataset with normal distribution. Since PCA does not require class labels, it is widely used in unsupervised cases. However, for classification problems PCA might not select the dimensions that are most useful for telling different classes apart [91].
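As an illustration, a minimal NumPy sketch of this projection is given below; the function name and arguments are illustrative, and library implementations such as sklearn.decomposition.PCA perform the same steps with additional safeguards.

```python
import numpy as np

def pca_reduce(X, k):
    """Project the rows of X (n x d) onto the k principal components
    with the largest variance. Minimal sketch of the procedure above."""
    mu = X.mean(axis=0)
    Xc = X - mu                              # center the data
    cov = Xc.T @ Xc / (X.shape[0] - 1)       # sample covariance matrix
    eigval, eigvec = np.linalg.eigh(cov)     # eigenvalues in ascending order
    order = np.argsort(eigval)[::-1]         # sort by descending variance
    W = eigvec[:, order[:k]]                 # top-k principal directions
    return Xc @ W, W
```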


3.1.1 Discriminant Analysis

Discriminant Analysis (DA) techniques are supervised procedures that use class labels to find the directions in the parameter space that are most suited to separate the different classes. The simplest of such methods employs the Wilks' lambda [77] statistic, which is computed as follows:

\lambda_W = \frac{S_{wc}}{S_{total}}, \quad (3.1)

where S_{wc} is the sample variance within class and S_{total} is the total sample variance along a given dimension. The statistic is computed for each dimension independently, and the dimensions where it is close to 0 are kept. An obvious disadvantage of this method is that it evaluates all variables independently, therefore it will fail if the data is only separable along linear combinations of the variables. In this case it is beneficial to apply a PCA transformation (without discarding any variables) to the data before computing \lambda_W.
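A minimal sketch of this per-dimension statistic, assuming a data matrix X and a class label vector y, might look as follows (names are illustrative):

```python
import numpy as np

def wilks_lambda(X, y):
    """Per-dimension Wilks' lambda: within-class variance divided by total
    variance (Eq. 3.1). Values close to 0 indicate discriminative dimensions."""
    total_var = X.var(axis=0)
    within = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        within += Xc.shape[0] * Xc.var(axis=0)   # pooled within-class variance
    within /= X.shape[0]
    return within / total_var

# keep the k dimensions with the smallest lambda values
# keep = np.argsort(wilks_lambda(X, y))[:k]
```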

Linear Discriminant Analysis [85] is one of the most widely used dimension reduction methods. Its basic assumption is that the classes are normally distributed with different means but the same covariance matrix. Similarly to PCA, LDA computes optimal orthogonal linear combinations of the original dimensions. However, these new basis vectors maximize the separability of the classes instead of the variance of the entire dataset.

LDA is formulated as an optimization problem:

\max_w \frac{w^T S_b w}{w^T S_w w}, \quad (3.2)

S_b = \sum_{i=1}^{C} (\mu_i - \mu)(\mu_i - \mu)^T, \quad (3.3)

S_w = \sum_{i=1}^{C} \sum_{j=1}^{n_i} (\mu_i - x_{i,j})(\mu_i - x_{i,j})^T, \quad (3.4)

where S_b is the between-class scatter matrix, S_w is the within-class scatter matrix, \mu is the mean of the dataset, \mu_i is the mean of the i-th class, x_{i,j} is the j-th vector of the i-th class, n_i is the number of vectors in the i-th class and C is the number of classes. It is worth noting that S_w may be replaced with the total scatter matrix S_t because of the following equation:

S_t = S_b + S_w, \quad (3.5)

where

S_t = \sum_{i=1}^{n} (x_i - \mu)(x_i - \mu)^T, \quad (3.6)

where n is the total number of vectors. This optimization criterion leads to a generalized eigenvalue-eigenvector problem, which in the case of an invertible S_w or S_t can be solved by performing spectral decomposition on S_t^{-1} S_b or S_w^{-1} S_b and taking the eigenvector belonging to the largest eigenvalue. More discriminant dimensions may be extracted by taking the eigenvectors corresponding to the second, third, etc. largest eigenvalues. Note that this is only an approximate solution, still the result is usually close to the true optimum [92].
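A minimal sketch of this spectral-decomposition solution, following Eqs. (3.3)-(3.4) and assuming an invertible S_w, is given below; the function name is illustrative.

```python
import numpy as np

def lda_directions(X, y, k):
    """Return the k most discriminant directions by eigendecomposition of
    inv(S_w) @ S_b (Eqs. 3.2-3.4). Assumes S_w is invertible; a pseudoinverse
    could be substituted otherwise."""
    mu = X.mean(axis=0)
    d = X.shape[1]
    Sb = np.zeros((d, d))
    Sw = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sb += np.outer(mc - mu, mc - mu)        # between-class scatter
        Sw += (Xc - mc).T @ (Xc - mc)           # within-class scatter
    eigval, eigvec = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
    order = np.argsort(eigval.real)[::-1]       # largest eigenvalues first
    return eigvec[:, order[:k]].real
```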

Numerous variations of LDA have been proposed, such as Penalized Discriminant Analysis (PDA) [93], which is a weighted version of LDA. Weights allow the algorithm to penalize unstable features, thus improving the robustness of the method.

Another variant is Nonparametric Discriminant Analysis (NDA) [94], which uses a nearest-neighbors approach to define the between-class scatter matrix in order to relax the normality assumption of LDA. Górecki and Łuczak [95] used the Moore-Penrose pseudoinverse to generalize LDA to problems with few observations.

Another significant problem with LDA is that it does not take the local geometry of the dataset into account [96]. To overcome this problem, Locality Sensitive Discriminant Analysis (LSDA) [96] has been proposed, which discovers the local manifold structure of the dataset, and uses this information to find the optimal discriminating projection. Methods such as Structured Semi-supervised Discriminant Analysis (SSDA) [91] employ similar strategies, while using the local manifold information to make use of unlabeled data as well.

Notably, the kernel trick can also be used to extend LDA to nonlinear cases, such as the Generalized Discriminant Analysis method [97]. In Harandi, Sanderson, Shirazi, and Lovell [98] a Grassmannian graph embedding framework is employed to implement kernel-based DA. However, none of these methods are applicable to the case of structured composite classes, since they all assume at least partly labeled data.


3.1.2 Subclass and Mixture Methods

In this section, particular variations of LDA are discussed that make the assumption that the data of each class is generated by several different normal distributions, or in other words, they assume a Mixture of Gaussians (MoG) model.

The reason for discussing these variants separately is that they are relatively easy to extend to the problem of structured composite classes, since the only important difference is that with structured composite classes, instances from different subclasses might be present at the same time.

Some of the mixture methods [100, 99] use the Expectation Maximization (EM) [101] algorithm to estimate the underlying distributions of the classes, then they use classic LDA to find the optimal discriminating projection. A drawback of these methods is that they are not applicable if the number of data points is too low [102].
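One possible realization of this strategy (a sketch, not the exact procedure of [99, 100]) is to fit a Gaussian mixture to each class with EM, relabel the samples with the resulting (class, component) pairs, and run classic LDA on the refined labels. The function name and parameters below are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def mixture_lda(X, y, n_components=2, k=2):
    """Fit a per-class Gaussian mixture via EM, then apply LDA to the
    subclass labels. Illustrative sketch of the mixture approach."""
    sublabels = np.empty(len(y), dtype=int)
    offset = 0
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        gmm = GaussianMixture(n_components=n_components).fit(X[idx])
        sublabels[idx] = offset + gmm.predict(X[idx])  # unique subclass ids
        offset += n_components
    lda = LinearDiscriminantAnalysis(n_components=k).fit(X, sublabels)
    return lda.transform(X)
```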

A different approach, Subclass Discriminant Analysis (SDA) [102] uses clustering to estimate the means of the underlying normal distributions of the subclasses. With the help of subclass means they define the between subclass scatter matrix as follows:

S_{bs} = \sum_{i=1}^{C} \sum_{j=1}^{H} (\mu_{i,j} - \mu)(\mu_{i,j} - \mu)^T, \quad (3.7)

where C is the number of classes, H is the number of subclasses, and \mu_{i,j} is the mean of the j-th subclass of the i-th class. SDA replaces the between-class scatter S_b with the between-subclass scatter S_{bs}. The procedure also determines the number of clusters using a brute-force iteration from one to a user-defined maximum and selecting the number that maximizes classification accuracy.
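A sketch of how the between-subclass scatter of Eq. (3.7) can be assembled is given below; k-means is used here as one possible clustering choice, the function name is illustrative, and the brute-force search over H is omitted.

```python
import numpy as np
from sklearn.cluster import KMeans

def sda_between_subclass_scatter(X, y, H):
    """Between-subclass scatter (Eq. 3.7): cluster each class into H
    subclasses and scatter the subclass means around the global mean."""
    mu = X.mean(axis=0)
    d = X.shape[1]
    Sbs = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        labels = KMeans(n_clusters=H, n_init=10).fit_predict(Xc)
        for j in range(H):
            mcj = Xc[labels == j].mean(axis=0)   # subclass mean
            Sbs += np.outer(mcj - mu, mcj - mu)
    return Sbs
```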

A modified version of SDA called Mixture Subclass Discriminant Analysis (MSDA) [103] computes the scatter matrix only between the subclasses of different classes in order to prevent the algorithm from preferring directions that can separate subclasses of the same class. The between-subclass scatter matrix is computed as follows:

S_{bsb} = \sum_{i=1}^{C-1} \sum_{j=1}^{H_i} \sum_{k=i+1}^{C} \sum_{l=1}^{H_k} (\mu_{i,j} - \mu_{k,l})(\mu_{i,j} - \mu_{k,l})^T, \quad (3.8)

where H_i and H_k are the numbers of subclasses in the i-th and k-th classes. It is important to point out that SDA assumes that all classes have the same number of subclasses, since it uses the same H for all classes during clustering. This assumption is false in many cases, and it may cause the algorithm to fail to separate subclasses if their number is underestimated for a given class. If the number of subclasses is overestimated, the method may keep variables that should have been discarded, decreasing its efficiency.
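For completeness, a sketch of Eq. (3.8) is given below, assuming the subclass means have already been estimated (e.g. by a clustering step as above); the data layout and function name are illustrative.

```python
import numpy as np

def msda_between_subclass_scatter(subclass_means):
    """Eq. 3.8: scatter accumulated only between subclass means belonging to
    different classes. subclass_means is a list indexed by class, each entry
    an (H_i x d) array holding the subclass means of that class."""
    C = len(subclass_means)
    d = subclass_means[0].shape[1]
    Sbsb = np.zeros((d, d))
    for i in range(C - 1):
        for k in range(i + 1, C):           # only pairs of different classes
            for mi in subclass_means[i]:
                for mk in subclass_means[k]:
                    diff = mi - mk
                    Sbsb += np.outer(diff, diff)
    return Sbsb
```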