A robust approach based on Weibull distribution for clustering gene expression data
Abstract
Clustering is a widely used technique for analyzing gene expression data. Most clustering methods group genes by distance, while few group genes according to the similarity of the distributions of their expression levels. Furthermore, as biological annotation resources have accumulated, an increasing number of genes have been annotated into functional categories. As a result, evaluating clustering methods in terms of the functional consistency of the resulting clusters is of great interest. In this paper, we propose the Weibull Distribution-based Clustering Method (WDCM), a robust approach for clustering gene expression data in which the expression levels of each gene are treated as a random variable following its own Weibull distribution. WDCM rests on the idea that genes with similar expression profiles have similar distribution parameters, so genes are clustered by their Weibull distribution parameters. We used WDCM to cluster three cancer gene expression data sets, from lung cancer, B-cell follicular lymphoma, and bladder carcinoma, and obtained well-clustered results. We compared the performance of WDCM with that of k-means and the Self-Organizing Map (SOM) using functional annotation information from the Gene Ontology (GO). The functional annotation ratios of WDCM were higher than those of the other methods. We also used an external measure, the Adjusted Rand Index, to validate the performance of WDCM. The comparison shows that WDCM provides better clustering performance than the k-means and SOM algorithms. A merit of WDCM is that it can cluster incomplete gene expression data without imputing the missing values. The robustness of WDCM was also evaluated on incomplete data sets.
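To make the core idea concrete, the following is a minimal Python sketch of the per-gene fitting step: each expression profile is reduced to a (shape, scale) pair of a two-parameter Weibull distribution, and missing entries are skipped rather than imputed. The estimator used here (median-rank regression on the Weibull probability plot) is an assumption chosen for simplicity; the abstract does not say how the parameters are estimated. The function name `fit_weibull` and the synthetic gene names are hypothetical.

```python
import math
import random


def fit_weibull(values):
    """Estimate (shape, scale) of a two-parameter Weibull distribution.

    Assumption: median-rank regression is used as a simple stand-in
    estimator. Entries that are None or NaN are treated as missing and
    skipped; all observed values must be strictly positive.
    """
    xs = sorted(v for v in values if v is not None and not math.isnan(v))
    n = len(xs)
    if n < 3:
        raise ValueError("need at least three observed values")
    if xs[0] <= 0:
        raise ValueError("observed values must be strictly positive")

    # Bernard's approximation to the median rank of the i-th order statistic.
    ranks = [(i - 0.3) / (n + 0.4) for i in range(1, n + 1)]

    # Linearised Weibull CDF: ln(-ln(1 - F)) = shape * ln(x) - shape * ln(scale)
    u = [math.log(x) for x in xs]
    w = [math.log(-math.log(1.0 - f)) for f in ranks]

    mean_u = sum(u) / n
    mean_w = sum(w) / n
    sxx = sum((a - mean_u) ** 2 for a in u)
    if sxx == 0:
        raise ValueError("observed values are all identical")
    sxy = sum((a - mean_u) * (b - mean_w) for a, b in zip(u, w))

    shape = sxy / sxx                          # slope of the fitted line
    scale = math.exp(mean_u - mean_w / shape)  # recovered from the intercept
    return shape, scale


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical genes; the (shape, scale) pairs only drive the simulation.
    truth = {"gene_a": (1.5, 3.0), "gene_b": (1.5, 3.1), "gene_c": (4.0, 8.0)}
    for name, (shape, scale) in truth.items():
        # random.weibullvariate takes (scale, shape), in that order.
        profile = [rng.weibullvariate(scale, shape) for _ in range(60)]
        profile[5] = None  # simulate a missing measurement
        k, lam = fit_weibull(profile)
        print(f"{name}: shape={k:.2f} scale={lam:.2f}")
```

The resulting (shape, scale) pairs can then be passed as two-dimensional feature vectors to a clustering routine. Which routine WDCM applies to these parameters is not stated in the abstract, so that step is left out of the sketch.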
The results demonstrate that WDCM produces clusters with more consistent functional annotations than the other methods. WDCM is also shown to be robust and capable of clustering gene expression data that contain a small number of missing values.
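The Adjusted Rand Index mentioned above compares a clustering against a reference partition and corrects the raw pair-agreement count for chance. The sketch below implements the standard Hubert-Arabie form of the index using only the Python standard library. The function name and the convention of returning 1.0 for degenerate inputs are choices made for this illustration, not details taken from the paper.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    n = len(labels_true)
    if n < 2:
        raise ValueError("need at least two items")

    # Contingency table cells and its row and column totals.
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    # Number of item pairs placed together in both, in the first,
    # and in the second partition.
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        # Degenerate case (for example, both partitions have one cluster).
        return 1.0
    return (sum_cells - expected) / (max_index - expected)
```

A value of 1 means the two partitions are identical up to a renaming of the cluster labels, values near 0 indicate agreement no better than chance, and negative values indicate agreement worse than chance.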