A robust approach based on Weibull distribution for clustering gene expression data

Springer Science and Business Media LLC - Tập 6 - Trang 1-9 - 2011

Huakun Wang^1,2, Zhenzhen Wang¹, Xia Li¹, Binsheng Gong¹, Lixin Feng², Ying Zhou²

¹College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, PR China

²School of Mathematical sciences, Heilongjiang University, Harbin, PR China

Tóm tắt

Clustering is a widely used technique for analysis of gene expression data. Most clustering methods group genes based on the distances, while few methods group genes according to the similarities of the distributions of the gene expression levels. Furthermore, as the biological annotation resources accumulated, an increasing number of genes have been annotated into functional categories. As a result, evaluating the performance of clustering methods in terms of the functional consistency of the resulting clusters is of great interest. In this paper, we proposed the WDCM (Weibull Distribution-based Clustering Method), a robust approach for clustering gene expression data, in which the gene expressions of individual genes are considered as the random variables following unique Weibull distributions. Our WDCM is based on the concept that the genes with similar expression profiles have similar distribution parameters, and thus the genes are clustered via the Weibull distribution parameters. We used the WDCM to cluster three cancer gene expression data sets from the lung cancer, B-cell follicular lymphoma and bladder carcinoma and obtained well-clustered results. We compared the performance of WDCM with k-means and Self Organizing Map (SOM) using functional annotation information given by the Gene Ontology (GO). The results showed that the functional annotation ratios of WDCM are higher than those of the other methods. We also utilized the external measure Adjusted Rand Index to validate the performance of the WDCM. The comparative results demonstrate that the WDCM provides the better clustering performance compared to k-means and SOM algorithms. The merit of the proposed WDCM is that it can be applied to cluster incomplete gene expression data without imputing the missing values. Moreover, the robustness of WDCM is also evaluated on the incomplete data sets. The results demonstrate that our WDCM produces clusters with more consistent functional annotations than the other methods. The WDCM is also verified to be robust and is capable of clustering gene expression data containing a small quantity of missing values.

Tài liệu tham khảo

Ross DT, Scherf U, Eisen MB, Perou CM, Rees C, Spellman P, Iyer V, Jeffrey SS, Van de Rijn M, Waltham M, Pergamenschikov A, Lee JC, Lashkari D, Shalon D, Myers TG, Weinstein JN, Botstein D, Brown PO: Systematic variation in gene expression patterns in human cancer cell lines. Nat Genet. 2000, 24: 227-235. 10.1038/73432 Schlom J, Tsang KY, Kantor JA, Abrams SI, Zaremba S, Greiner J, Hodge JW: Cancer vaccine development. Expert Opin Investig Drugs. 1998, 7: 1439-1452. 10.1517/13543784.7.9.1439 Zhang L, Zhou W, Velculescu VE, Kern SE, Hruban RH, Hamilton SR, Vogelstein B, Kinzler KW: Gene expression profiles in normal and cancer cells. Science. 1997, 276: 1268-1272. 10.1126/science.276.5316.1268 Khademhosseini A: Chips to Hits: microarray and microfluidic technologies for high-throughput analysis and drug discovery. September 12-15, 2005, MA, USA. Expert Rev Mol Diagn. 2005, 5: 843-846. 10.1586/14737159.5.6.843 Khan J, Bittner ML, Chen Y, Meltzer PS, Trent JM: DNA microarray technology: the anticipated impact on the study of human disease. Biochim Biophys Acta. 1999, 1423: M17-28. Watson A, Mazumder A, Stewart M, Balasubramanian S: Technology for microarray analysis of gene expression. Curr Opin Biotechnol. 1998, 9: 609-614. 10.1016/S0958-1669(98)80138-9 Ben-Dor A, Shamir R, Yakhini Z: Clustering gene expression patterns. J Comput Biol. 1999, 6: 281-297. 10.1089/106652799318274 Guess MJ, Wilson SB: Introduction to hierarchical clustering. J Clin Neurophysiol. 2002, 19: 144-151. 10.1097/00004691-200203000-00005 Rahnenfuhrer J: Clustering algorithms and other exploratory methods for microarray data analysis. Methods Inf Med. 2005, 44: 444-448. Boutros PC, Okey AB: Unsupervised pattern recognition: an introduction to the whys and wherefores of clustering microarray data. Brief Bioinform. 2005, 6: 331-343. 10.1093/bib/6.4.331 Sierra A, Corbacho F: Reclassification as supervised clustering. Neural Comput. 2000, 12: 2537-2546. 10.1162/089976600300014836 MacQueen JB: Some Methods for classification and Analysis of Multivariate Observations. the 5th Berkeley Symposium on Mathematical Statistics and Probability. 1967, 281-297. University of California Press Gourevitch B, Le Bouquin-Jeannes R: K-means clustering method for auditory evoked potentials selection. Med Biol Eng Comput. 2003, 41: 397-402. 10.1007/BF02348081 Cottrell M, Ibbou S, Letremy P: SOM-based algorithms for qualitative variables. Neural Netw. 2004, 17: 1149-1167. 10.1016/j.neunet.2004.07.010 Lee BH, Scholz M: Application of the self-organizing map (SOM) to assess the heavy metal removal performance in experimental constructed wetlands. Water Res. 2006, 40: 3367-3374. 10.1016/j.watres.2006.07.027 Weibull W: A statistical distribution function of wide applicability. J Appl Mech-Trans ASME. 1951, 18: 293-297. Turnbull BW: The empirical distribution function with arbitrarily grouped, censored and truncated data. Journal of the Royal Statistical Society Series B. 1976, 38: 290-295. Frank J, Massey J: The Kolmogorov-Smirnov Test for Goodness of Fit. Journal of the American Statistical Association. 1951, 46: 68-78. 10.2307/2280095 Huang S, Yeo AA, Li SD: Modification of Kolmogorov-Smirnov test for DNA content data analysis through distribution alignment. Assay Drug Dev Technol. 2007, 5: 663-671. 10.1089/adt.2007.071 Ong LD, LeClare PC: The Kolmogorov-Smirnov test for the log-normality of sample cumulative frequency distributions. Health Phys. 1968, 14: 376- Clason R: Finding Clusters: An application of the Distance Concept. The Mathematics Teacher. 1990 Blake JA, Harris MA: The Gene Ontology (GO) project: structured vocabularies for molecular biology and their application to genome and expression analysis. Curr Protoc Bioinformatics. 2008, 7: Unit 7 2 Huang da W, Sherman BT, Lempicki RA: Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc. 2009, 4: 44-57. Yeung KY, Haynor DR, Ruzzo WL: Validating clustering for gene expression data. Bioinformatics. 2001, 17: 309-318. 10.1093/bioinformatics/17.4.309 R Giancarlo DS, Utro F: Statistical Indexes for Computational and Data Driven Class Discovery in Microarray Data. In Biological Data Mining. 2009, Chapman and Hall Mosca E, Bertoli G, Piscitelli E, Vilardo L, Reinbold RA, Zucchi I, Milanesi L: Identification of functionally related genes using data mining and data integration: a breast cancer case study. BMC Bioinformatics. 2009, 10 (Suppl 12): S8- 10.1186/1471-2105-10-S12-S8 Bhattacharjee A, Richards WG, Staunton J, Li C, Monti S, Vasa P, Ladd C, Beheshti J, Bueno R, Gillette M, Loda M, Weber G, Mark EJ, Lander ES, Wong W, Johnson BE, Golub TR, Sugarbaker DJ, Meyerson M: Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc Natl Acad Sci USA. 2001, 98: 13790-13795. 10.1073/pnas.191502998 Shipp MA, Ross KN, Tamayo P, Weng AP, Kutok JL, Aguiar RC, Gaasenbeek M, Angelo M, Reich M, Pinkus GS, Ray TS, Koval MA, Last KW, Norton A, Lister TA, Mesirov J, Neuberg DS, Lander ES, Aster JC, Golub TR: Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nat Med. 2002, 8: 68-74. 10.1038/nm0102-68 Dyrskjot L, Thykjaer T, Kruhoffer M, Jensen JL, Marcussen N, Hamilton-Dutoit S, Wolf H, Orntoft TF: Identifying distinct classes of bladder carcinoma using microarrays. Nat Genet. 2003, 33: 90-96. 10.1038/ng1061

Scholar Hub - Công cụ hỗ trợ trích dẫn và phân tích khoa học Việt Nam

Về chúng tôi

Scholar Hub là công cụ hỗ trợ trích dẫn và phân tích các bài báo, công bố khoa học Việt Nam. Công cụ trợ giúp người nghiên cứu, tạp chí, đơn vị nghiên cứu tra cứu, phân tích và thống kê dữ liệu nghiên cứu khoa học tại Việt Nam và quốc tế.
ScholarHub KHÔNG đăng thông tin tổng hợp, KHÔNG đăng lại nội dung từ các trang báo chí Việt Nam hoặc trang thông tin điện tử khác tại Việt Nam.

Thông tin, cập nhật

Đăng ký Tạp chí tham gia vào Scholar Hub

Phản hồi ý kiến về Scholar Hub

Bài viết, nội dung cập nhật

Chủ đề khoa học

Website liên kết

Hệ thống CSDL Khoa học & Công nghệ

Phần mềm kiểm tra trùng lặp Kiểm Tra Tài Liệu

Phần mềm xuất bản tạp chí điện tử VOJS

Nền tảng trắc nghiệm và đề thi đa lĩnh vực LetQA