mSRFR: a machine learning model using microalgal signature features for ncRNA classification

BioData Mining - Volume 15 - Pages 1-11 - 2022
Songtham Anuntakarun1,2, Supatcha Lertampaiporn3, Teeraphan Laomettachit1, Warin Wattanapornprom4, Marasri Ruengjitchatchawalya1,5,6
1Bioinformatics and Systems Biology Program, School of Bioresources and Technology, King Mongkut’s University of Technology Thonburi (KMUTT), Bangkok, Thailand
2School of Information Technology, Bangkok, Thailand
3Biochemical Engineering and Systems Biology Research Group, National Center for Genetic Engineering and Biotechnology (BIOTEC), National Science and Technology Development Agency at King Mongkut’s University of Technology Thonburi, Bangkok, Thailand
4Department of Mathematics, Faculty of Science, KMUTT, Bangkok, Thailand
5Biotechnology program, School of Bioresources and Technology, KMUTT, Bangkok, Thailand
6Algal Biotechnology Research Group, Pilot Plant Development and Training Institute (PDTI), KMUTT, Bangkok, Thailand

Abstract

This work presents mSRFR (microalgae SMOTE Random Forest Relief model), a tool for classifying noncoding RNAs (ncRNAs) in microalgae, including green algae, diatoms, golden algae, and cyanobacteria. First, the SMOTE technique was applied to address the data imbalance caused by the differing numbers of microalgal ncRNAs available for different species in the EBI RNAcentral database. Next, the Relief feature selection method was used to select the top 20 of 106 features, which comprised sequence-based, secondary structure, base-pair, and triplet sequence-structure features. Ten-fold cross-validation was then used to choose the best-performing classifier among Support Vector Machine, Random Forest, Decision Tree, Naïve Bayes, K-nearest Neighbor, and Neural Network, based on the area under the receiver operating characteristic (ROC) curve. Random Forest achieved the highest ROC area, 0.992, and was therefore selected and compared with other tools, including RNAcon, CPC, CPC2, CNCI, and CPPred. On the microalgae test dataset, our model achieved an accuracy of about 97% and a false-positive rate of about 2%. Furthermore, the top features from Relief revealed that the %GA dinucleotide is a signature feature of microalgal ncRNAs when compared to Escherichia coli, Saccharomyces cerevisiae, Arabidopsis thaliana, and Homo sapiens.
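
To make the %GA signature feature concrete, the following is a minimal sketch of how a dinucleotide percentage could be computed from a sequence. It is not the authors' implementation: the abstract does not define the feature precisely, so the use of overlapping dinucleotide windows, the conversion of T to U, and the function names `dinucleotide_percentages` and `percent_ga` are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict


def dinucleotide_percentages(seq: str) -> Dict[str, float]:
    """Percentage of each overlapping dinucleotide in a sequence.

    Assumption: windows overlap (step of 1), so a sequence of length n
    yields n - 1 dinucleotides. The paper may define the feature differently.
    """
    seq = seq.upper().replace("T", "U")  # assumption: treat DNA input as RNA
    windows = [seq[i:i + 2] for i in range(len(seq) - 1)]
    if not windows:
        return {}  # sequences shorter than 2 nt have no dinucleotides
    counts = Counter(windows)
    total = len(windows)
    return {pair: 100.0 * n / total for pair, n in counts.items()}


def percent_ga(seq: str) -> float:
    """Hypothetical helper: percentage of 'GA' dinucleotides (0.0 if absent)."""
    return dinucleotide_percentages(seq).get("GA", 0.0)
```

For example, `percent_ga("GAGAU")` counts the windows GA, AG, GA, and AU, so two of four dinucleotides are GA. A feature of this kind would be one column among the candidate features that Relief ranks before classification.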
