4.5 Biological validation of discovered patterns

4.5.1 Comparison of biclustering methods

The processed biological data (StemCell_9) was used to compare our method with three previously published biclustering methods, namely BiMAX, QUBIC and BiBit. We analyzed both the DEG data and the full sample data (27 samples) with all of them. For the DEG data, the minimum number of conditions was set to 3 for each method. While our method discovered all 115 valid biclusters, BiMAX, BiBit and QUBIC discovered 128, 127 and 68, respectively. The 68 biclusters found by QUBIC are entirely included in our 115; because of the greedy nature of QUBIC, the remaining 47 stayed hidden from it. BiMAX and BiBit found more biclusters because of their stringent binarization. However, on closer inspection we find that 70% of these clusters are invalid, i.e. they contain erroneous genes with uncorrelated regulation profiles.
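The inclusion statement above (every QUBIC bicluster is contained in one of ours) can be checked mechanically. The following is only a minimal sketch: it assumes a bicluster is represented as a pair of gene and condition sets, and it reads "included" as "is a sub-bicluster of some reference bicluster". The helper names (`bic`, `is_covered`, `split_by_coverage`) and the toy data are hypothetical and are not taken from the compared tools.

```python
from typing import FrozenSet, Iterable, List, Tuple

# Assumed representation: a bicluster is (set of gene ids, set of condition ids).
Bicluster = Tuple[FrozenSet[str], FrozenSet[str]]


def bic(genes: Iterable[str], conditions: Iterable[str]) -> Bicluster:
    """Build a bicluster from any iterables of gene and condition ids."""
    return (frozenset(genes), frozenset(conditions))


def is_covered(b: Bicluster, reference: List[Bicluster]) -> bool:
    """True if some reference bicluster contains all genes and conditions of b."""
    genes, conds = b
    return any(genes <= ref_genes and conds <= ref_conds
               for ref_genes, ref_conds in reference)


def split_by_coverage(found: List[Bicluster], reference: List[Bicluster]):
    """Split the biclusters in `found` into those covered by `reference` and the rest."""
    covered = [b for b in found if is_covered(b, reference)]
    uncovered = [b for b in found if not is_covered(b, reference)]
    return covered, uncovered


# Toy data (hypothetical): two reference biclusters and two found by another method.
ours = [bic(["g1", "g2", "g3"], ["c1", "c2", "c3"]),
        bic(["g4", "g5"], ["c2", "c3", "c4"])]
other = [bic(["g1", "g2"], ["c1", "c2", "c3"]),   # sub-bicluster of the first
         bic(["g4", "g6"], ["c2", "c3", "c4"])]   # g6 is in no reference bicluster

covered, uncovered = split_by_coverage(other, ours)
print(len(covered), len(uncovered))
```

Swapping the two arguments gives the reference biclusters that the other method did not reach, which corresponds to the 47 hidden biclusters in the comparison above.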

A common way to compare different methods is to run functional enrichment analysis on the resulting biclusters and then calculate the percentage of biclusters detected at certain significance levels by each method. Here the discovered biclusters were analyzed with respect to the enrichment of functional GO categories and KEGG pathways using overrepresentation analysis, applying a hypergeometric test [134] to calculate an enrichment p-value for each category and pathway. Table 4.6 and Table 4.7 show the numbers of biclusters with significant enrichment of any GO category or KEGG pathway below certain p-value thresholds.

Table 4.6. Number of biclusters showing significant enrichment in GO categories for our method and the three compared biclustering algorithms.

Max. p-value   Our method   BiMAX   QUBIC   BiBit
5E-12          2            1       2       1
5E-11          2            2       2       2
5E-10          3            3       3       3
5E-09          5            4       5       4
5E-08          8            5       8       5
5E-07          23           22      17      22
5E-06          56           59      36      59
5E-05          74           86      49      85
5E-04          108          122     66      121
0.005          115          128     68      127
0.05           115          128     68      127
0.5            115          128     68      127

Table 4.7. Number of biclusters showing significant enrichment in KEGG pathways for our method and other biclustering algorithms.

Max. p-value   Our method   BiMAX   QUBIC   BiBit
5E-06          4            2       4       2
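The overrepresentation test and the threshold counts reported in the tables can be illustrated as follows. This is a sketch under stated assumptions, not the pipeline used for the reported numbers: the p-value is the upper tail of the hypergeometric distribution, each bicluster is summarized by its smallest category p-value, a threshold counts as met when the p-value is less than or equal to it, and no multiple-testing correction is applied. The function names and the example numbers are illustrative only.

```python
from math import comb
from typing import Dict, Iterable, List


def hypergeom_pvalue(N: int, K: int, n: int, k: int) -> float:
    """Upper-tail probability P(X >= k) for X ~ Hypergeometric(N, K, n).

    N: number of background genes
    K: background genes annotated with the category or pathway
    n: genes in the bicluster
    k: bicluster genes annotated with the category or pathway
    """
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


def count_enriched(best_pvalues: Iterable[float],
                   thresholds: List[float]) -> Dict[float, int]:
    """For each threshold, count the biclusters whose best p-value is at most it."""
    best = list(best_pvalues)
    return {t: sum(p <= t for p in best) for t in thresholds}


# Toy example: 20 background genes, 5 of them annotated; a bicluster of
# 4 genes contains 2 annotated ones.
p = hypergeom_pvalue(N=20, K=5, n=4, k=2)
print(round(p, 4))

# Toy best p-values of three biclusters, counted against two thresholds.
counts = count_enriched([1e-7, 3e-6, 0.2], [5e-6, 0.5])
print(counts)
```

Because the counts are cumulative, each column of such a table can only grow as the threshold is relaxed, and it saturates at the total number of biclusters found by the method.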

These data show that, for properly preprocessed data, all four methods are able to discover the main biclusters and capture the major functional categories and pathways related to cell differentiation processes (e.g. GO:0048863: stem cell differentiation, GO:0048864: stem cell development), even though there are slight variations in the p-values. This is somewhat in contrast to the findings of some previous studies, which reported clear advantages of one method over others and even strong disagreements between methods. Likely explanations for the earlier reported disagreement are improper preprocessing of the data, which may affect certain methods more strongly than others, and the unsuitability of the metrics used for comparing the performance of different methods. The high agreement of the enrichment results in our comparison is, however, not surprising: the majority of the genes in the biclusters found by the different methods are the same, and thus the general functional trends are not strongly affected. However, when the focus is on individual genes and gene groups, as is the case when biological researchers examine results related to real experiments, having valid clusters without erroneous genes becomes more important. In addition, despite the overall consistency of the different approaches, methods using binarization completely missed some biclusters at lower significance levels due to erroneous genes in the clusters.

A representative example is depicted in Fig. 4.11. When analyzing gene expression data such as "Yeast-80", we can plot the gene expression values over the different conditions, and biologists can extract information from the bicluster, such as which genes move in the same or in exactly opposite directions.

Figure 4.11. Representative example: expression values of the genes over the conditions of the "Yeast-80" data in the identified bicluster no. 3282.
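The notion of genes moving the same or exactly the opposite way can be made concrete on discretized {−1,0,1} data. The sketch below assumes that a bicluster is valid when every gene profile over the selected conditions equals a zero-free reference profile or its exact negation. This validity rule, the function name `coherent_or_opposite` and the toy matrix are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Sequence


def coherent_or_opposite(matrix: Dict[str, List[int]],
                         genes: Sequence[str],
                         conditions: Sequence[int]) -> Optional[Dict[str, int]]:
    """Label each gene +1 (same direction as the first gene) or -1 (opposite).

    Returns None if any gene deviates from both patterns, or if the
    reference profile contains a 0 (no change), i.e. the bicluster is
    not valid under the assumed rule.
    """
    reference = [matrix[genes[0]][c] for c in conditions]
    if 0 in reference:
        return None
    negated = [-v for v in reference]
    labels: Dict[str, int] = {}
    for g in genes:
        profile = [matrix[g][c] for c in conditions]
        if profile == reference:
            labels[g] = 1
        elif profile == negated:
            labels[g] = -1
        else:
            return None  # erroneous gene: neither coherent nor exactly opposite
    return labels


# Toy discretized expression matrix (rows: genes, columns: conditions).
matrix = {
    "gA": [1, -1, 1, 0],
    "gB": [-1, 1, -1, 1],
    "gC": [1, -1, 0, 0],
}

valid = coherent_or_opposite(matrix, ["gA", "gB"], [0, 1, 2])
invalid = coherent_or_opposite(matrix, ["gA", "gB", "gC"], [0, 1, 2])
print(valid, invalid)
```

In the toy data, gB mirrors gA over the first three conditions, while gC breaks the pattern at the third condition, which is the kind of erroneous gene that invalidates a cluster.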

4.6 Conclusions

Closed frequent itemset mining and bicluster mining algorithms have been developed separately in the literature. However, as demonstrated in this chapter, when the parameters of existing algorithms are chosen appropriately, the two techniques yield exactly the same result set. In section 4.2 the equivalence of closed frequent itemset mining and biclustering under appropriately chosen parameters was proved, and it was confirmed on small examples from the literature by applying both types of algorithms to the same dataset. Since most existing biclustering algorithms are not accurate enough, or scale poorly and therefore have long running times, or both, a novel recursive biclustering technique was developed to handle {−1,0,1} data (see section 4.3.1), while an easily interpretable bit-table-based method was discussed in section 4.4. A detailed comparative computational analysis of both novel methods was given in sections 4.3.5 and 4.4.2 to illustrate the applicability of the algorithms. On several test problems, the novel methods proved more powerful than existing solutions at discovering constant-valued biclusters. Furthermore, our first algorithm can also find oppositely changing patterns, which are of great importance in the field of cell biology. Because the most accurate biclustering algorithms (e.g. BiMAX, which can be considered a reference) can only deal with binary data, a novel general [...] transforms {−1,0,1} data into binary format and ensures consistency with the original data. To widen the applicability of the resulting biclusters, a novel merging technique and a visualization method were also presented in sections 4.3.4 and 4.3.3, with which bigger but less consistent biclusters can be constructed and visualized. Since the most important application area of biclustering is the field of cell biology, section 4.5 presented a detailed analysis of the results and fair comparisons with previous methods using biological tests.
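The equivalence between closed frequent itemsets and biclusters can be illustrated on a small binary example. The brute-force sketch below is not the recursive or bit-table algorithm of this chapter. It assumes that rows play the role of transactions and columns the role of items, that the minimum support equals the minimum number of rows, and that the minimum itemset size equals the minimum number of columns, both at least 1. The function names and the toy data are illustrative.

```python
from itertools import combinations
from typing import FrozenSet, List, Set, Tuple

Result = Set[Tuple[FrozenSet[int], FrozenSet[str]]]


def closed_itemsets(rows: List[Set[str]], min_support: int, min_items: int) -> Result:
    """Enumerate closed frequent itemsets as (supporting row indices, itemset)."""
    items = sorted(set().union(*rows))
    result: Result = set()
    for size in range(min_items, len(items) + 1):
        for cand in map(frozenset, combinations(items, size)):
            support = frozenset(i for i, r in enumerate(rows) if cand <= r)
            if len(support) < min_support:
                continue
            # Closure: the items shared by every supporting row.
            closure = frozenset.intersection(*(frozenset(rows[i]) for i in support))
            if closure == cand:
                result.add((support, cand))
    return result


def maximal_biclusters(rows: List[Set[str]], min_rows: int, min_cols: int) -> Result:
    """Enumerate inclusion-maximal all-ones biclusters as (row indices, columns)."""
    result: Result = set()
    indices = range(len(rows))
    for size in range(min_rows, len(rows) + 1):
        for subset in combinations(indices, size):
            cols = frozenset.intersection(*(frozenset(rows[i]) for i in subset))
            if len(cols) < min_cols:
                continue
            # Maximal only if no further row also contains all these columns.
            all_rows = frozenset(i for i in indices if cols <= rows[i])
            if all_rows == frozenset(subset):
                result.add((all_rows, cols))
    return result


# Toy binary data: each row lists the columns in which it has a 1.
rows = [{"c1", "c2", "c3"}, {"c1", "c2"}, {"c2", "c3"}, {"c1", "c2", "c3"}]

itemsets = closed_itemsets(rows, min_support=2, min_items=2)
biclusters = maximal_biclusters(rows, min_rows=2, min_cols=2)
print(itemsets == biclusters, len(itemsets))
```

Both enumerations are exponential and only suitable for toy inputs. They serve to show that, with matching thresholds, the two problem formulations return the same (rows, columns) pairs.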
