Splitting random forest (SRF) for determining compact sets of genes that distinguish between cancer subtypes

ClinTransMed, AB - Volume 2 - Pages 1-12 - 2012

Xiaowei Guan¹,², Mark R Chance¹,², Jill S Barnholtz-Sloan¹,²

¹Case Comprehensive Cancer Center, Cleveland, USA

²Center for Proteomics and Bioinformatics, Cleveland, USA

Abstract

The identification of very small subsets of predictive variables is an important topic that has not often been considered in the literature. To discover highly predictive yet compact gene set classifiers from whole-genome expression data, a non-parametric, iterative algorithm, Splitting Random Forest (SRF), was developed to robustly identify genes that distinguish between molecular subtypes. The goal is to improve prediction accuracy while taking sparsity into account. The optimal 50-run SRF (SRF50) gene classifiers for glioblastoma (GB), breast cancer (BC) and ovarian cancer (OC) subtypes had overall prediction rates comparable to those from published datasets upon validation (80.1%-91.7%). The SRF50 sets outperformed other methods in compactness, needing 10- to 200-fold fewer genes than ANOVA or published gene sets to distinguish between the tested cancer subtypes. The SRF50 sets also achieved superior and robust overall and subtype prediction accuracies compared with single random forest (RF) and the Top 50 ANOVA results (SRF50 vs single RF: 80.1% vs 77.8% for GB, 84.0% vs 74.1% for BC, 89.8% vs 88.9% for OC; SRF50 vs Top 50 ANOVA: 80.1% vs 77.2% for GB, 84.0% vs 82.7% for BC, 89.8% vs 87.0% for OC). There was significant overlap between the SRF50 and published gene sets, showing that SRF identifies the relevant subsets of important gene lists. In Ingenuity Pathway Analysis (IPA), the "hub" genes shared by the SRF50 and published gene sets were RB1, PIK3R1, PDGFBB and ERK1/2 for GB; ESR1, MYC, NFkB and ERK1/2 for BC; and Akt, FN1, NFkB, PDGFBB and ERK1/2 for OC. The SRF approach is an effective driver of biomarker discovery research: it reduces the number of genes needed for robust classification, dissects complex, high-dimensional "omic" data, and provides novel insights into the cellular mechanisms that define cancer subtypes.
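The abstract reports both overall and per-subtype prediction accuracies. As a minimal illustration of how those two quantities differ, the sketch below computes them from paired true and predicted subtype labels. The function name `prediction_rates` and the toy labels are hypothetical and are not taken from the paper; the paper's own validation procedure may differ.

```python
from collections import Counter


def prediction_rates(true_labels, predicted_labels):
    """Return (overall accuracy, {subtype: accuracy within that subtype})."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("at least one sample is required")

    # Number of samples that truly belong to each subtype.
    totals = Counter(true_labels)
    # Number of correctly classified samples, keyed by true subtype.
    correct = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)

    overall = sum(correct.values()) / len(true_labels)
    # Counter returns 0 for subtypes with no correct calls.
    per_subtype = {subtype: correct[subtype] / n for subtype, n in totals.items()}
    return overall, per_subtype


# Hypothetical toy data: six samples across three made-up subtypes.
truth = ["A", "A", "A", "B", "B", "C"]
calls = ["A", "A", "B", "B", "B", "A"]
overall, per_subtype = prediction_rates(truth, calls)
print(f"overall: {overall:.3f}")
for subtype, rate in sorted(per_subtype.items()):
    print(f"{subtype}: {rate:.3f}")
```

A classifier can have a reasonable overall rate while failing completely on a small subtype (subtype "C" in the toy data), which is one reason to report subtype accuracies alongside the overall figure.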

