
Biclustering algorithms for Data mining in high-dimensional data

4.1.1 Literature review

Mining frequent itemsets or patterns is a fundamental problem in many data mining applications, such as the discovery of association rules, correlations, multi-dimensional patterns, sequential rules, and episodes [153]. The basic problem can be expressed as follows: find the frequent patterns in a given large dataset, i.e. the itemsets, subsequences, submatrices or substructures that appear in the dataset with a frequency no less than a user-specified threshold.

The problem with this approach is that the mining process often generates a huge number of substructures satisfying the threshold, because all sub-patterns of a frequent pattern are also frequent. To overcome this problem, the mining of frequent closed itemsets was proposed by Pasquier et al. [124], which searches for frequent patterns that have no proper super-pattern with the same support. The main benefit of this approach is that the set of closed frequent patterns contains the complete information about the corresponding frequent patterns.
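To make the difference between frequent and closed frequent itemsets concrete, the following minimal Python sketch enumerates both by brute force on a toy transaction set. The transactions, the function names and the threshold are hypothetical and chosen only for illustration; the sketch is neither the Apriori-based algorithm of [124] nor any of the algorithms discussed later, and it is only practical for a handful of items.

```python
from itertools import combinations

def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def frequent_itemsets(transactions, min_support):
    """Brute force: all non-empty itemsets with support >= min_support."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            count = support(candidate, transactions)
            if count >= min_support:
                result[candidate] = count
    return result

def closed_itemsets(frequent):
    """Keep the itemsets that have no proper superset with the same support.

    A superset with equal support is itself frequent, so it is enough
    to compare against the other frequent itemsets.
    """
    return {
        itemset: count
        for itemset, count in frequent.items()
        if not any(itemset < other and count == other_count
                   for other, other_count in frequent.items())
    }

# Hypothetical toy dataset: four transactions over the items a, b, c.
transactions = [
    frozenset("abc"),
    frozenset("abc"),
    frozenset("ab"),
    frozenset("c"),
]

frequent = frequent_itemsets(transactions, min_support=2)
closed = closed_itemsets(frequent)

# frequent holds 7 itemsets: a, b, c, ab, ac, bc, abc
# closed holds 3 itemsets: {c}: 3, {a, b}: 3, {a, b, c}: 2
```

On this toy input seven itemsets are frequent but only three are closed. The support of any other frequent itemset equals the largest support among its closed supersets (for example, {a} has support 3, the support of {a, b}), which is the sense in which the closed set keeps the complete information.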

Biclustering is a widely used technique in bioinformatics for mining gene expression data, where so-called biclusters are searched for [51, 167, 105]. In biological data, gene subsets are typically co-expressed only under a subset of samples or sample condition groups. In principle, biclustering provides a solution to this problem, as it sets no a priori constraints on the organization of the biclusters, meaning that any gene can belong to multiple or none of the resulting clusters. Thus biclustering is potentially able to identify gene groups that have similar expression patterns over only a subset of samples or sample condition groups. A bicluster corresponds to a subset of rows and a subset of columns with a high similarity score, where similarity is not treated as a function of pairs of rows or pairs of columns; instead, it is a measure of the coherence of rows and columns in the bicluster. It will be proved in the next section that frequent closed itemset mining and biclustering can produce the same result when appropriate constraints and a strict similarity measure are applied. Because frequent itemset mining is well studied in numerous articles, the rest of this section focuses on biclustering, using gene expression data as the initial data.

Owing to the recognition of this potential, several biclustering algorithms have been proposed for the identification of gene expression patterns during the last decade. The first attempt at finding biclusters with constant values was made by Hartigan [78], who introduced a partition-based algorithm, known as Block Clustering, which splits the data matrix into sub-matrices and uses variance to evaluate the quality of these sub-matrices. Cheng and Church were the first to use the term "biclustering" [51] in gene expression data analysis. In [65], the authors identified biclusters with constant rows or columns using a coupled two-way clustering (CTWC) approach. A similar procedure was presented in [156]. The authors of [47, 166] provided a greedy iterative search algorithm, and FLOC (FLexible Overlapped biClustering) [166, 167] and the algorithm in [92] also addressed the problem of finding biclusters with coherent values. The Iterative Signature Algorithm (ISA) [38] and the plaid model [98] also attempt to discover one bicluster at a time in an iterative process. In [98, 37] the authors used statistics, while in [115] an algorithm was introduced to find xMOTIFs, i.e. biclusters with coherent evolutions on their rows. Tanay et al. [154] introduced an exhaustive bicluster enumeration method, which uses probabilistic modeling of the data and graph-theoretic techniques to find the most significant bicluster in the data matrix.

Further comprehensive reviews of earlier biclustering algorithms can be found in [155, 105, 45]. A very recent in-press publication [163] deals with biclustering of gene expression data using hypergraphs.

As the previous paragraphs show, some biclustering methods work on real-valued data and some on discretized data, and in practice most of the latter work on binarized data. Methods working on real values are computationally very intensive and usually require significant prefiltering of the data to limit the size of the input.

Another common limitation of these methods is that the user has to specify the number of biclusters in advance, before running the tools, e.g. [46].

Most discretized methods avoid these problems, and even though discretization decreases the information content of the data, reducing the data complexity can sometimes be beneficial. As will be shown in the next sections, even the most powerful of the cited methods are either computationally expensive or not accurate enough (i.e. they cannot discover all the interesting subsets), or both.

The literature on closed frequent itemset mining is at least as extensive as that on biclustering. Only the most important contributions are listed here.

The mining of frequent closed itemsets was proposed by Pasquier et al. in 1999 [124], where an Apriori-based algorithm was presented. Other algorithms for closed frequent itemset mining followed, including CLOSET [125], CHARM [171], FPClose [69], AFOPT [100] and CLOSET+ [162]. The main challenge in closed pattern mining is to check whether a candidate itemset is closed. CHARM uses a hashing technique on its TID (Transaction IDentifier) values, while AFOPT, FPClose, CLOSET and CLOSET+ maintain the detected patterns in an FP-tree-like pattern tree. Further reading on closed itemset mining can be found in [76]. These algorithms can produce exactly the same results as biclustering techniques, which will be proven mathematically (section 4.2) and by practical examples (section 4.3.5).

In the following sections, the mathematical formulation of the problem is given (section 4.2), and in the main part of the manuscript (section 4.3) a novel algorithm is proposed for frequent closed itemset mining, which uses a recursive procedure and solves the biclustering problem much faster than previous solutions. Furthermore, the novel algorithm is capable of discovering patterns in {−1,0,1} data, including oppositely changed patterns, which also have to be discovered. Because of their biological importance, a general technique for handling {−1,0,1} data with previous algorithms is also proposed, based on a special transformation into binary values. A novel technique for handling errors in clusters and a novel visualization method are also presented. Section 4.3.5 contains results and comparisons with previous approaches, followed by conclusions and future opportunities.

4.2 Problem formulation

4.2.1 Biclustering

Biclustering was introduced to complement and expand the capabilities of standard clustering methods by allowing objects to belong to multiple or none of the resulting clusters, purely on the basis of their similarities. This property makes biclustering a powerful approach, especially when it is applied to data with a large number of objects. In recent years, many biclustering algorithms have been developed, especially for the analysis of gene expression data. With biclustering, genes with similar expression profiles can be identified not only over the whole data set but also across subsets of experimental conditions, by allowing genes to belong to several expression patterns simultaneously. Therefore, biclustering is able to identify gene groups that have similar expression patterns even over only a subset of samples or sample condition groups. A schematic representation of the problem is depicted in Fig. 4.2. For comprehensive reviews, see [45, 105, 154].

Figure 4.2. Schematic representation of the biclustering problem. It is important to note that the objects within one bicluster can be located either very close to each other (as in B1) or further apart (as in B2, B3 and B4).

We follow the formulation given in [133] to define the problem of mining biclusters in gene expression data. In line with common practice in the field, bicluster mining is restricted to a binary matrix, i.e. gene expression values are transformed to 1 (expressed) or 0 (not expressed) using an expression cutoff. Let $E \in \{0,1\}^{n \times m}$ be an expression matrix, where $E$ represents the set of $m$ experiments for $n$ genes. A cell $e_{ij}$ contains 1 whenever gene $i$ is expressed in condition $j$ and 0 otherwise. A bicluster $(G, C)$ corresponds to a subset of genes $G \subseteq \{1, \dots, n\}$ that jointly respond across a subset of samples $C \subseteq \{1, \dots, m\}$. Therefore, the bicluster $(G, C)$ is a submatrix of $E$ in which all elements are equal to 1. Under this definition, every cell $e_{ij}$ with value 1 represents a bicluster by itself. However, such patterns are usually redundant, as they are entirely contained in other patterns. Thus, the definition of the inclusion-maximal bicluster (IMB) was introduced to discover all biclusters not entirely contained in any other bicluster [133]: the pair $(G, C) \in 2^{\{1,\dots,n\}} \times 2^{\{1,\dots,m\}}$ is an IMB if and only if $\forall i \in G, j \in C : e_{ij} = 1$ and $\nexists (G', C') \in 2^{\{1,\dots,n\}} \times 2^{\{1,\dots,m\}}$ such that $\forall i' \in G', j' \in C' : e_{i'j'} = 1$ and $G \subseteq G' \wedge C \subseteq C' \wedge (G', C') \neq (G, C)$.
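The IMB definition can be illustrated with a small brute-force sketch in Python. It is an illustration under stated assumptions, not the algorithm of [133] or the one proposed later in this chapter: the expression values, the cutoff of 1.0, the strict "greater than" comparison and the function names are hypothetical, indices are 0-based (unlike the 1-based sets above), and only biclusters with non-empty gene and sample sets are reported. For every non-empty set of conditions, the sketch collects all genes expressed in all of them and then extends the condition set to every condition shared by those genes; each pair obtained in this way is inclusion-maximal.

```python
from itertools import combinations

def binarize(expression, cutoff):
    """Map each value to 1 (expressed) if it exceeds the cutoff, else 0."""
    return [[1 if value > cutoff else 0 for value in row] for row in expression]

def inclusion_maximal_biclusters(E):
    """Brute-force enumeration of the non-empty IMBs of a small binary matrix.

    Exponential in the number of columns, so only suitable for toy inputs.
    """
    n, m = len(E), len(E[0])
    found = set()
    for size in range(1, m + 1):
        for cols in combinations(range(m), size):
            # G: every gene that is expressed in all chosen conditions
            genes = tuple(i for i in range(n) if all(E[i][j] for j in cols))
            if not genes:
                continue
            # C: every condition in which all genes of G are expressed
            conds = tuple(j for j in range(m) if all(E[i][j] for i in genes))
            found.add((genes, conds))
    return found

# Hypothetical expression values: 3 genes (rows) x 3 conditions (columns).
expression = [
    [2.1, 3.0, 0.2],
    [1.8, 2.5, 1.9],
    [0.1, 2.2, 2.4],
]

E = binarize(expression, cutoff=1.0)
# E == [[1, 1, 0],
#       [1, 1, 1],
#       [0, 1, 1]]

imbs = inclusion_maximal_biclusters(E)
# imbs contains four pairs (G, C):
#   ((0, 1), (0, 1)), ((0, 1, 2), (1,)), ((1, 2), (1, 2)), ((1,), (0, 1, 2))
```

Note that the single cell in row 0, column 0 is an all-ones submatrix but is not reported, because it is contained in the larger bicluster with genes (0, 1) and conditions (0, 1).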

By default, an IMB can contain any number of genes and samples. Additionally, so-called minimum support thresholds can be used to specify the minimum number of genes and samples required in a bicluster.
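As a minimal sketch of how such thresholds act, the following Python fragment filters a list of biclusters by size. The list of (G, C) pairs, the parameter names `min_genes` and `min_samples`, and the function name are hypothetical; in an actual mining algorithm the thresholds would more likely be used to prune the search than to filter the results afterwards.

```python
def filter_by_min_support(biclusters, min_genes, min_samples):
    """Keep the biclusters (G, C) with at least min_genes genes
    and at least min_samples samples."""
    return [
        (genes, samples)
        for genes, samples in biclusters
        if len(genes) >= min_genes and len(samples) >= min_samples
    ]

# Hypothetical biclusters, each given as (gene indices, sample indices).
biclusters = [
    ((0, 1), (0, 1)),
    ((0, 1, 2), (1,)),
    ((1, 2), (1, 2)),
    ((1,), (0, 1, 2)),
]

kept = filter_by_min_support(biclusters, min_genes=2, min_samples=2)
# kept == [((0, 1), (0, 1)), ((1, 2), (1, 2))]
```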