Splitting random forest (SRF) for determining compact sets of genes that distinguish between cancer subtypes
Abstract
The identification of very small subsets of predictive variables is an important topic that has not often been considered in the literature. To discover highly predictive yet compact gene set classifiers from whole genome expression data, a non-parametric, iterative algorithm, Splitting Random Forest (SRF), was developed to robustly identify genes that distinguish between molecular subtypes. The goal is to improve prediction accuracy while accounting for sparsity. The optimal SRF 50 run (SRF50) gene classifiers for glioblastoma (GB), breast cancer (BC) and ovarian cancer (OC) subtypes had overall prediction rates upon validation (80.1%-91.7%) comparable to those from published datasets. The SRF50 sets outperformed other methods by identifying compact gene sets sufficient for distinguishing between the tested cancer subtypes (10–200 fold fewer genes than ANOVA or published gene sets). The SRF50 sets achieved superior and robust overall and subtype prediction accuracies compared with a single random forest (RF) (80.1% vs 77.8% for GB; 84.0% vs 74.1% for BC; 89.8% vs 88.9% for OC) and with the Top 50 ANOVA results (80.1% vs 77.2% for GB; 84.0% vs 82.7% for BC; 89.8% vs 87.0% for OC). There was significant overlap between the SRF50 and published gene sets, showing that SRF identifies the relevant subsets of important gene lists. In Ingenuity Pathway Analysis (IPA), the “hub” genes shared between the SRF50 and published gene sets were RB1, PIK3R1, PDGFBB and ERK1/2 for GB; ESR1, MYC, NFkB and ERK1/2 for BC; and Akt, FN1, NFkB, PDGFBB and ERK1/2 for OC. The SRF approach is an effective driver of biomarker discovery research: it reduces the number of genes needed for robust classification, dissects complex, high-dimensional “omic” data and provides novel insights into the cellular mechanisms that define cancer subtypes.
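The abstract reports both overall and per-subtype prediction accuracies. As a minimal sketch of how such figures are commonly computed from validation predictions, the Python snippet below uses a hypothetical helper, `prediction_rates`, and made-up toy labels ("A", "B", "C"). Neither the function nor the data comes from the study, and the authors' actual evaluation procedure may differ.

```python
from collections import defaultdict


def prediction_rates(true_labels, predicted_labels):
    """Return (overall accuracy, {subtype: accuracy}) for paired label lists.

    Hypothetical helper for illustration only; not the authors' code.
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("at least one sample is required")

    correct = defaultdict(int)  # correctly predicted samples per true subtype
    total = defaultdict(int)    # all samples per true subtype
    for truth, guess in zip(true_labels, predicted_labels):
        total[truth] += 1
        if truth == guess:
            correct[truth] += 1

    # Overall rate: fraction of all samples assigned to their true subtype.
    overall = sum(correct.values()) / len(true_labels)
    # Subtype rate: fraction of each true subtype's samples predicted correctly.
    per_subtype = {subtype: correct[subtype] / total[subtype] for subtype in total}
    return overall, per_subtype


# Toy validation labels (assumed, not study data): 10 samples, 3 subtypes.
truth = ["A", "A", "A", "A", "B", "B", "B", "C", "C", "C"]
guess = ["A", "A", "A", "B", "B", "B", "C", "C", "C", "C"]
overall, per_subtype = prediction_rates(truth, guess)
```

Reporting the per-subtype rates alongside the overall rate matters because a classifier can score well overall while performing poorly on a small subtype.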