An integrated method for cancer classification and rule extraction from microarray data

Journal of Biomedical Science - Volume 16 - Pages 1-10 - 2009
Liang-Tsung Huang
Department of Computer Science and Information Engineering, Mingdao University, Changhua, Taiwan

Abstract

Microarray techniques have recently been used successfully to investigate information useful for cancer diagnosis at the gene expression level, owing to their ability to measure thousands of gene expression levels in a massively parallel way. One important issue is improving classification performance on microarray data. Ideally, however, influential genes and even interpretable rules would be identified at the same time to offer biological insight. Drawing on concepts of system design from software engineering, this paper presents an integrated and effective method (named X-AI) for accurate cancer classification and knowledge acquisition from DNA microarray data. The method includes a feature selector that systematically extracts the relatively important genes, reducing the dimension while retaining as much of the class-discriminatory information as possible. Next, diagonal quadratic discriminant analysis (DQDA) is used to classify tumors, and generalized rule induction (GRI) is integrated to establish association rules that give an understanding of the relationships between cancer classes and related genes. Two non-redundant datasets of acute leukemia were used to validate the proposed X-AI, which showed high accuracy in discriminating the different classes. In addition, I demonstrate the ability of X-AI to extract relevant genes and to develop interpretable rules. Further, a web server has been established for cancer classification and is freely available at http://bioinformatics.myweb.hinet.net/xai.htm .
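To make the first two stages of the pipeline concrete, the following is a minimal sketch of gene selection followed by DQDA classification, written in plain Python. It is an illustration under stated assumptions, not the X-AI implementation: the abstract does not specify the feature selector, so a between-class to within-class sum-of-squares ratio is used here as a stand-in, and the function names (`rank_genes`, `fit_dqda`, `predict_dqda`) and the toy expression values are hypothetical. DQDA itself is taken in its usual form: each class gets its own mean and its own variance per gene, with all off-diagonal covariances set to zero.

```python
import math
from statistics import mean, variance


def rank_genes(X, y, k):
    """Return indices of the k genes with the largest between/within
    sum-of-squares ratio (an assumed stand-in for the paper's selector)."""
    labels = sorted(set(y))
    scores = []
    for j, col in enumerate(zip(*X)):  # one column per gene
        overall = mean(col)
        bss = wss = 0.0
        for label in labels:
            vals = [v for v, t in zip(col, y) if t == label]
            m = mean(vals)
            bss += len(vals) * (m - overall) ** 2
            wss += sum((v - m) ** 2 for v in vals)
        scores.append((bss / (wss + 1e-12), j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]


def fit_dqda(X, y, eps=1e-6):
    """Estimate per-class gene means, per-class gene variances, and log priors."""
    model = {}
    for label in sorted(set(y)):
        rows = [x for x, t in zip(X, y) if t == label]
        cols = list(zip(*rows))
        model[label] = {
            "mean": [mean(c) for c in cols],
            "var": [variance(c) + eps for c in cols],  # eps guards zero variance
            "log_prior": math.log(len(rows) / len(X)),
        }
    return model


def dqda_score(x, params):
    """Diagonal quadratic discriminant score; the smallest score wins."""
    score = -2.0 * params["log_prior"]
    for xj, m, v in zip(x, params["mean"], params["var"]):
        score += (xj - m) ** 2 / v + math.log(v)
    return score


def predict_dqda(model, x):
    """Assign x to the class with the minimum discriminant score."""
    return min(model, key=lambda label: dqda_score(x, model[label]))


# Toy expression matrix (rows = samples, columns = genes); values are invented.
X = [
    [1.0, 5.0, 0.2], [1.2, 4.8, 0.9], [0.9, 5.3, 0.4], [1.1, 5.1, 0.7],
    [3.0, 5.0, 0.5], [3.3, 4.9, 0.3], [2.8, 5.2, 0.8], [3.1, 5.0, 0.6],
]
y = ["ALL"] * 4 + ["AML"] * 4

selected = rank_genes(X, y, k=2)                       # reduce the dimension
X_reduced = [[row[j] for j in selected] for row in X]
model = fit_dqda(X_reduced, y)

new_sample = [3.2, 5.0, 0.5]
print(predict_dqda(model, [new_sample[j] for j in selected]))
```

Selecting genes before fitting keeps the number of estimated parameters small relative to the number of samples, which is the usual motivation for a diagonal covariance model on microarray data. The rule-induction stage (GRI) is not sketched here.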
