
Biclustering algorithms for Data mining in high-dimensional data

4.3 Efficient methods for bicluster mining

4.3.5 Experimental results

In this section we compare our proposed closed pattern mining method (see section 4.3.1) with a biclustering-based method (BiMAX [133]) and a frequent closed itemset mining-based method (DCI_Closed [103]), both of which are able to discover all frequent closed patterns in binary data. These algorithms have served as highly recognized reference methods in their application fields [133]. Note that all methods developed for frequent closed itemset mining produce the same patterns as DCI_Closed. Using several synthetic and real biological data sets, we show that 1) all three methods discover the same closed patterns in binary data, experimentally confirming our claim that biclustering and frequent closed itemset mining methods discover the same patterns; 2) our pattern discovery method outperforms the other methods; and 3) it is the only method that is able to discover previously hidden and potentially biologically relevant closed patterns by using the extended {−1,0,1} data.
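To make the notion of a closed pattern in binary data concrete, the following is a minimal brute-force sketch in Python. It is not FCPMiner, BiMAX, or DCI_Closed; the function name `closed_patterns` is an assumption made for illustration, and `min_rows` and `min_cols` simply mirror the sr and sc thresholds used in the tables. For a small binary matrix it enumerates every inclusion-maximal all-ones submatrix, which is the object that both the biclustering view and the closed-itemset view describe.

```python
from itertools import combinations


def closed_patterns(matrix, min_rows=1, min_cols=1):
    """Enumerate closed patterns (maximal all-ones submatrices) by brute force.

    Illustrative only: the cost is exponential in the number of rows.
    """
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if matrix else 0
    found = set()
    for size in range(1, n_rows + 1):
        for rows in combinations(range(n_rows), size):
            # Columns that contain a 1 in every selected row.
            cols = frozenset(
                j for j in range(n_cols) if all(matrix[i][j] == 1 for i in rows)
            )
            if not cols:
                continue
            # Closure step: every row that contains a 1 in all of those columns.
            closed_rows = frozenset(
                i for i in range(n_rows) if all(matrix[i][j] == 1 for j in cols)
            )
            if len(closed_rows) >= min_rows and len(cols) >= min_cols:
                found.add((closed_rows, cols))
    return found


# Hypothetical 3 x 3 example matrix.
example = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
]
patterns = closed_patterns(example, min_rows=2, min_cols=2)
for rows, cols in sorted(patterns, key=lambda p: sorted(p[0])):
    print(sorted(rows), sorted(cols))
# [0, 1] [0, 1]
# [1, 2] [1, 2]
```

Because each reported pair is closed on both the row side and the column side, the same set of pairs is obtained whether rows or columns are treated as the items, which is the equivalence the comparison relies on.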

Comparison and computational efficiency of the closed pattern mining methods

To compare the three mining methods and demonstrate their computational efficiency, we applied them to several real and generated synthetic data sets.

Real data come from various biological studies previously used as reference data in biclustering research [75, 99, 15]. For the comparison of computational efficiency, all biological data sets were binarized. For both the fold-change data (stem cell data sets) and the absolute expression data (Leukemia, Compendium, Yeast-80), a fold-change cut-off of 2 was used. Synthetic data were generated with both our own generator and the IBM Quest Synthetic Data Generator tool [128]. Results are shown in Table 4.1 (synthetic data) and Table 4.2 (real data). All three methods were able to discover all closed patterns for all synthetic and real data sets. The tables also show that FCPMiner is faster than BiMAX wherever BiMAX produced a result and faster than DCI_Closed on most data sets, especially when the number of rows and columns is higher. The exceptions in Table 4.1 are S3, S6 and S9 (50% density, up to 300 rows), where DCI_Closed is faster, and S4, where the two running times are equal.

Biological relevance of closed pattern mining on {−1,0,1} data

Here we illustrate the potential of our closed pattern mining method when applied to {−1,0,1} data. The real data set used in this section comes from the study of the effects of Tet1-knockdown on gene expression in mouse embryonic stem cell and trophoblast stem cell conditions. The data have been previously analyzed using our standard analysis pipeline and the results

Table 4.1. Computational results using synthetic data sets.

r: number of rows
c: number of columns
d: density (proportion of ones) [%]
sc: minimum support count during the search (min_cols in pattern mining)
sr: minimum row count during pattern mining (min_rows)
cf: number of identified closed patterns
c (under DCI_Closed): number of closed patterns after filtering
b: number of patterns found by the corresponding algorithm
t: running time [s]

| Data | r | c | d | sc | sr | BiMAX b | BiMAX t | DCI_Closed cf | DCI_Closed c | DCI_Closed t | FCPMiner b | FCPMiner t |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S1 | 50 | 50 | 10 | 2 | 2 | 78 | 1 | 119 | 78 | 0.016 | 78 | 0.001 |
| S2 | 50 | 50 | 20 | 4 | 2 | 140 | 1 | 189 | 140 | 0.024 | 140 | 0.016 |
| S3 | 50 | 50 | 50 | 15 | 2 | 238 | 1 | 288 | 238 | 0.033 | 238 | 0.438 |
| S4 | 100 | 100 | 10 | 3 | 2 | 337 | 2 | 436 | 337 | 0.041 | 337 | 0.041 |
| S5 | 100 | 100 | 20 | 7 | 2 | 488 | 2 | 588 | 488 | 0.028 | 488 | 0.015 |
| S6 | 100 | 100 | 50 | 30 | 2 | 694 | 3 | 794 | 694 | 0.034 | 694 | 0.488 |
| S7 | 300 | 300 | 10 | 8 | 2 | 437 | 5 | 737 | 437 | 0.041 | 437 | 0.031 |
| S8 | 300 | 300 | 20 | 22 | 2 | 156 | 52 | 456 | 156 | 0.085 | 156 | 0.047 |
| S9 | 300 | 300 | 50 | 90 | 2 | 1038 | >600 | 1338 | 1038 | 0.241 | 1038 | 0.318 |
| S10 | 700 | 700 | 10 | 15 | 2 | 1318 | 195 | 2018 | 1318 | 0.365 | 1318 | 0.266 |
| S11 | 700 | 700 | 20 | 45 | 2 | 375 | >300 | 1075 | 375 | 0.720 | 375 | 0.499 |
| S12 | 700 | 700 | 50 | 210 | 2 | 283 | >300 | 983 | 283 | 2.631 | 283 | 1.857 |
| S13 | 1000 | 1000 | 10 | 20 | 2 | 1496 | >600 | 2496 | 1496 | 0.916 | 1496 | 0.671 |
| S14 | 1000 | 1000 | 20 | 60 | 2 | 714 | >600 | 1714 | 714 | 2.182 | 714 | 1.451 |
| S15 | 1000 | 1000 | 50 | 290 | 2 | 1030 | >600 | 2030 | 1030 | 8.110 | 1030 | 6.238 |
| IBM1 | 100 | 100 | 9.04 | 4 | 4 | 6 | 1 | 452 | 6 | 0.070 | 6 | 0.004 |
| IBM2 | 1000 | 100 | 9.32 | 4 | 6 | 15 | 1 | 19974 | 15 | 0.142 | 15 | 0.061 |
| IBM3 | 10000 | 100 | 8.94 | 4 | 10 | NA | NA | 426508 | 7 | 1.517 | 7 | 1.099 |
| IBM4 | 100000 | 100 | 8.99 | 4 | 12 | NA | NA | 8572510 | 16 | 38.909 | 16 | 24.147 |
| IBM5 | 100 | 100 | 7.78 | 6 | 6 | 101 | 0.8 | 350 | 101 | 0.015 | 101 | 0.001 |
| IBM6 | 100 | 1000 | 7.14 | 12 | 20 | 216 | 26 | 1649889 | 216 | 25.648 | 216 | 20.668 |

Table 4.2. Comparison to DCI_Closed.

r: number of rows
c: number of columns
d: density (proportion of ones) [%]
sc: minimum support count during the search (min_cols in pattern mining)
sr: minimum row count during pattern mining (min_rows)
cf: number of identified closed patterns
c (under DCI_Closed): number of closed patterns after filtering
b: number of patterns found by the corresponding algorithm
t: running time [s]

| Problem | r | c | d | sc | sr | DCI_Closed cf | DCI_Closed c | DCI_Closed t | FCPMiner b | FCPMiner t |
|---|---|---|---|---|---|---|---|---|---|---|
| Compendium | 6316 | 300 | 1.2 | 50 | 2 | 2715 | 2594 | 0.157 | 2594 | 0.124 |
| StemCell-27 | 45276 | 27 | 5.8 | 200 | 2 | 7999 | 7972 | 0.521 | 7972 | 0.325 |
| Leukemia | 12533 | 72 | 19.3 | 400 | 2 | 3715 | 3643 | 0.823 | 3643 | 0.787 |
| StemCell-9 | 1840 | 9 | 15.5 | 2 | 2 | 186 | 177 | 0.032 | 177 | 0.001 |
| Yeast-80 | 6221 | 80 | 6.8 | 80 | 2 | 3348 | 3285 | 0.094 | 3285 | 0.055 |

closed pattern mining was created based on the differentially expressed genes between different biological sample groups. Therefore, the expression values were discretized to 1 (up-regulation), −1 (down-regulation) and 0 (no change). For more information on preparing the input data for the mining, see [15]. Here it is important to note that methods developed only for binary data do not take the direction of gene regulation into account and therefore transform the discretized values to 1, denoting both up- and down-regulation, and to 0, denoting no change.
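The two encodings can be sketched as follows. This is a minimal Python illustration, not the preprocessing described in [15]: the function names, the use of log2 fold-change values as input, and the threshold of 1.0 (a two-fold change) are all assumptions made for the example.

```python
def discretize(log2_fold_changes, threshold=1.0):
    """Map each value to 1 (up), -1 (down) or 0 (no change).

    The threshold is an assumed cut-off, not a value taken from the study.
    """
    return [
        [1 if v >= threshold else -1 if v <= -threshold else 0 for v in row]
        for row in log2_fold_changes
    ]


def binarize(ternary):
    """Collapse {-1, 0, 1} to {0, 1}; the direction of regulation is lost."""
    return [[abs(v) for v in row] for row in ternary]


# Hypothetical values for two genes across three comparisons.
log2fc = [
    [1.8, -2.1, 0.2],   # gene A
    [-1.5, 2.4, 0.1],   # gene B, regulated in the opposite direction to A
]
ternary = discretize(log2fc)
binary = binarize(ternary)
print(ternary)  # [[1, -1, 0], [-1, 1, 0]]
print(binary)   # [[1, 1, 0], [1, 1, 0]]
```

In this toy input, genes A and B respond in opposite directions, yet their rows are identical after binarization, so a binary-only method cannot tell them apart.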

While FCPMiner identified all 115 valid frequent closed patterns, BiMAX and DCI_Closed, which were developed only for binary data, found 128 patterns. On closer inspection of these patterns, we find that 70% of them are invalid, i.e. they contain erroneous genes with uncorrelated regulation profiles due to the binarization. Examples are shown in Fig. 4.8.
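One way to picture such an invalid pattern is to check a pattern found in the binarized data against the original {−1,0,1} values. The sketch below assumes, purely for illustration, that a pattern counts as valid when all of its genes share the same sign profile across the pattern's columns; the criterion actually used by FCPMiner is the one defined in section 4.3.1 and may differ. The function name `is_sign_consistent` and the data are hypothetical.

```python
def is_sign_consistent(ternary, rows, cols):
    """Return True if every selected row has the same nonzero sign profile.

    ternary: list of rows with values in {-1, 0, 1}
    rows, cols: row and column indices of a pattern found in binarized data
    Assumed validity criterion, for illustration only.
    """
    rows = sorted(rows)
    cols = sorted(cols)
    if not rows or not cols:
        return False
    reference = [ternary[rows[0]][j] for j in cols]
    if 0 in reference:
        return False  # a zero entry means this is not a pattern in binary data either
    return all([ternary[i][j] for j in cols] == reference for i in rows)


# Hypothetical discretized data: three genes, three comparisons.
data = [
    [1, -1, 0],   # gene A
    [1, -1, 1],   # gene B, same direction as A in columns 0 and 1
    [-1, 1, 0],   # gene C, opposite direction
]
print(is_sign_consistent(data, rows=[0, 1], cols=[0, 1]))     # True
print(is_sign_consistent(data, rows=[0, 1, 2], cols=[0, 1]))  # False
```

After binarization, rows 0 to 2 and columns 0 to 1 form an all-ones block, so a binary miner would report all three genes together even though gene C moves in the opposite direction.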

Figure 4.8. Examples of patterns discovered by FCPMiner and binary FCP mining methods.

A common way to compare different biclustering methods is to run functional enrichment analysis on the resulting gene regulation patterns. This approach takes advantage of databases that group genes into pathways and functional categories according to known biological associations. An overrepresentation analysis can then be carried out to detect patterns containing more genes within specific functional categories than expected by chance alone, thus giving insight into the underlying biological mechanisms within the studied experimental setup. Therefore, the different pattern mining methods can be compared by looking at the patterns detected at certain enrichment significance levels for each method. Here the discovered patterns were analyzed with respect to the enrichment of functional GO categories [77] and KEGG pathways [90] using overrepresentation analysis, applying a hypergeometric test [135] to calculate an enrichment p-value for each category and pathway.
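The enrichment p-value of such an overrepresentation analysis is commonly computed as the upper tail of a hypergeometric distribution. The following Python function is a minimal sketch using only the standard library; it is not the implementation of [135], and the function name, parameter names, and example numbers are assumptions made for illustration.

```python
from math import comb


def enrichment_p_value(population, annotated, pattern_size, overlap):
    """P(X >= overlap) for a hypergeometric random variable X.

    population:   number of genes in the background set
    annotated:    background genes that belong to the category or pathway
    pattern_size: number of genes in the pattern
    overlap:      pattern genes that belong to the category or pathway
    """
    upper = min(annotated, pattern_size)
    total = comb(population, pattern_size)
    # Sum the probabilities of observing `overlap` or more annotated genes.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, pattern_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20 background genes, 5 in the category,
# a pattern of 4 genes of which 3 are in the category.
p = enrichment_p_value(population=20, annotated=5, pattern_size=4, overlap=3)
print(round(p, 3))  # 0.032
```

A small p-value indicates that the pattern contains more genes from the category than would be expected by chance; in practice one value is computed per category or pathway, as described above.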

After examining the results, we identified several closed patterns that were discovered only with FCPMiner. For example, the first panel on the left side of Fig. 4.8 shows an FCP reported as significant by FCPMiner within a GO category at a p-value level of 5E-12 but missed at this significance level by the other methods due to binarization and the resulting inclusion of erroneous genes. The remaining panels show patterns for KEGG that were discovered by FCPMiner and missed by the other methods at a p-value significance level of 5E-6. The patterns, together with the calculated GO categories and KEGG pathways and the corresponding p-values, are given in the supplementary data.