A comparison and evaluation of five biclustering algorithms by quantifying goodness of biclusters for gene expression data

BioData Mining - Tập 5 - Trang 1-10 - 2012
Li Li1,2,3, Yang Guo4,3, Wenwu Wu1,4,3, Youyi Shi1,2,3, Jian Cheng1,4,3, Shiheng Tao1,4,3
1State Key Laboratory of Crop Stress Biology in Arid Areas and College of Sciences, Northwest A&F University, Yangling, China
2Institute of Applied Mathematics, Northwest A&F University, Yangling, China
3Bioinformatics Center, Northwest A&F University, Yangling, China
4College of Life Sciences, Northwest A&F University, Yangling, China

Tóm tắt

Several biclustering algorithms have been proposed to identify biclusters, in which genes share similar expression patterns across a number of conditions. However, different algorithms would yield different biclusters and further lead to distinct conclusions. Therefore, some testing and comparisons between these algorithms are strongly required. In this study, five biclustering algorithms (i.e. BIMAX, FABIA, ISA, QUBIC and SAMBA) were compared with each other in the cases where they were used to handle two expression datasets (GDS1620 and pathway) with different dimensions in Arabidopsis thaliana (A. thaliana) GO (gene ontology) annotation and PPI (protein-protein interaction) network were used to verify the corresponding biological significance of biclusters from the five algorithms. To compare the algorithms’ performance and evaluate quality of identified biclusters, two scoring methods, namely weighted enrichment (WE) scoring and PPI scoring, were proposed in our study. For each dataset, after combining the scores of all biclusters into one unified ranking, we could evaluate the performance and behavior of the five biclustering algorithms in a better way. Both WE and PPI scoring methods has been proved effective to validate biological significance of the biclusters, and a significantly positive correlation between the two sets of scores has been tested to demonstrate the consistence of these two methods. A comparative study of the above five algorithms has revealed that: (1) ISA is the most effective one among the five algorithms on the dataset of GDS1620 and BIMAX outperforms the other algorithms on the dataset of pathway. (2) Both ISA and BIMAX are data-dependent. The former one does not work well on the datasets with few genes, while the latter one holds well for the datasets with more conditions. (3) FABIA and QUBIC perform poorly in this study and they may be suitable to large datasets with more genes and more conditions. (4) SAMBA is also data-independent as it performs well on two given datasets. The comparison results provide useful information for researchers to choose a suitable algorithm for each given dataset.

Tài liệu tham khảo

Sokal RR, Michener CD: A statistical method for evaluating systematic relationships. University of Kansas Science Bulletin. 1958, 38: 1409-1438. Cheng Y, Church GM: Biclustering of Expression Data. Book Biclustering of Expression Data. 2000, 93-103. Ihmels J, Friedlander G, Bergmann S, Sarig O, Ziv Y, Barkai N: Revealing modular organization in the yeast transcriptional network. Nat Genet. 2002, 31: 370-377. Tanay A, Sharan R, Shamir R: Discovering statistically significant biclusters in gene expression data. Bioinformatics. 2002, 18 (Suppl 1): S136-144. Prelic A, Bleuler S, Zimmermann P, Wille A, Buhlmann P, Gruissem W, Hennig L, Thiele L, Zitzler E: A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics. 2006, 22: 1122-1129. Gupta N, Aggarwal S: MIB: Using mutual information for biclustering gene expression data. Pattern Recognition. 2010, 43: 2692-2697. Gan XC, Liew AWC, Yan H: Discovering biclusters in gene expression data based on high-dimensional linear geometries. BMC Bioinforma. 2008, 9: 9- Zhang YJ, Wang H, Hu ZY: A Novel Clustering and Verification Based Microarray Data Bi-clustering Method. Advances in Swarm Intelligence, Pt 2, Proceedings. Volume 6146. Edited by: Tan Y, Shi YH, Tan KC. 2010, 611-618. Lecture Notes in Computer Science Madeira SC, Oliveira AL: Biclustering algorithms for biological data analysis: A survey. IEEE/ACM Trans Comput Biol Bioinform. 2004, 1: 24-45. Allison DB, Cui XQ, Page GP, Sabripour M: Microarray data analysis: from disarray to consolidation and consensus. Nat Rev Genet. 2006, 7: 55-65. Al-Akwaa FM, Ali MH, Kadah YM: BicAT_Plus: An Automatic Comparative Tool For Bi/Clustering of Gene Expression Data Obtained Using Microarrays. Nrsc: 2009 National Radio Science Conference: Nrsc 2009. 2009, 1 and 2: 964-971. Ayadi W, Elloumi M, Hao J-K: A biclustering algorithm based on a bicluster enumeration tree: application to DNA microarray data. BioData mining. 2009, 2: 9- Hochreiter S, Bodenhofer U, Heusel M, Mayr A, Mitterecker A, Kasim A, Khamiakova T, Van Sanden S, Lin D, Talloen W: FABIA: factor analysis for bicluster acquisition. Bioinformatics. 2010, 26: 1520-1527. Li GJ, Ma Q, Tang HB, Paterson AH, Xu Y, QUBIC: QUBIC: a qualitative biclustering algorithm for analyses ofgene expression data. Nucleic Acids Res 2009, 37. Shlomi T, Cabili MN, Herrgard MJ, Palsson BO, Ruppin E: Network-based prediction of human tissue-specific metabolism. Nat Biotechnol. 2008, 26: 1003-1010. Barrett T, Troup DB, Wilhite SE, Ledoux P, Rudnev D, Evangelista C, Kim IF, Soboleva A, Tomashevsky M, Marshall KA: NCBI GEO: archive for high-throughput functional genomic data. Nucleic Acids Res. 2009, 37: D885-D890. Barkow S, Bleuler S, Prelic A, Zimmermann P, Zitzler E: BicAT: a biclustering analysis toolbox. Bioinformatics. 2006, 22: 1282-1283. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge YC, Gentry J: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 2004, 5: 119-134. R Development Core Team: R: A Language and Environment for Statistical Computing.R Foundation for Statistical Computing. 2011, [http://www.R-project.org/] Maere S, Heymans K, Kuiper M: BiNGO: a Cytoscape plugin to assess overrepresentation of Gene Ontology categories in Biological Networks. Bioinformatics. 2005, 21: 3448-3449. Khatri P, Draghici S: Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics. 2005, 21: 3587-3595. Castillo-Davis CI, Hartl DL: GeneMerge - post-genomic analysis, data mining, and hypothesis testing. Bioinformatics. 2003, 19: 891-892. Liang H, Li WH: MicroRNA regulation of human protein-protein interaction network. Rna-a Publication of the Rna Society. 2007, 13: 1402-1408. Szklarczyk D, Franceschini A, Kuhn M, Simonovic M, Roth A, Minguez P, Doerks T, Stark M, Muller J, Bork P: The STRING database in 2011: functional interaction networks of proteins, globally integrated and scored. Nucleic Acids Res. 2011, 39: D561-D568. von Mering C, Jensen LJ, Snel B, Hooper SD, Krupp M, Foglierini M, Jouffre N, Huynen MA, Bork P: STRING: known and predicted protein-protein associations, integrated and transferred across organisms. Nucleic Acids Res. 2005, 33: D433-D437. Jensen LJ, Kuhn M, Stark M, Chaffron S, Creevey C, Muller J, Doerks T, Julien P, Roth A, Simonovic M: STRING 8-a global view on proteins and their functional interactions in 630 organisms. Nucleic Acids Res. 2009, 37: D412-D416. Kaiser S, Santamaria R, Sill M, Theron R: biclust: BiCluster Algorithms.R package version 101. 2011, [http://CRAN.R-project.org/package=biclust] Kaiser S, Leisch F: A Toolbox for Bicluster Analysis in R.Compstat 2008-Proceedings in Computational Statistics. 2008, [http://www.stat.uni-muenchen.de] Csardi G, Kutalik Z, Bergmann S: Modular analysis of gene expression data with R. Bioinformatics. 2010, 26: 1376-1377. Shamir R, Maron-Katz A, Tanay A, Linhart C, Steinfeld I, Sharan R, Shiloh Y, Elkon R: EXPANDER - An integrative program suite for microarray data analysis. BMC Bioinformatic. 2005, 6: 232-240. Kendall M: A New Measure of Rank Correlation. Biometrika. 1938, 30: 81-89. Richards AL, Holmans P, O'Donovan MC, Owen MJ, Jones L: A comparison of four clustering methods for brain expression microarray data. BMC Bioinforma. 2008, 9: 490-506. Chia BKH, Karuturi RKM: Differential co-expression framework to quantify goodness of biclusters and comparebiclustering algorithms. Algorithms for Molecular Biology 2010, 5.