A Support Vector Machine based method to distinguish long non-coding RNAs from protein coding transcripts

Springer Science and Business Media LLC - Volume 18 - Pages 1-14 - 2017
Hugo W. Schneider1, Taina Raiol2, Marcelo M. Brigido3, Maria Emilia M. T. Walter1, Peter F. Stadler4
1Department of Computer Science, University of Brasilia, ICC Central, Instituto de Ciências Exatas, Campus Universitario Darcy Ribeiro, Brasilia, Brazil
2Gerência Regional de Brasilia (GEREB), Oswaldo Cruz Foundation (Fiocruz), Av. L3 Norte, Campus Universitário Darcy Ribeiro, Brasília, Brazil
3Laboratory of Molecular Biology, University of Brasilia, Instituto de Ciencias Biologicas, Campus Universitario Darcy Ribeiro, Brasilia, Brazil
4Bioinformatics Group, Department of Computer Science; and Interdisciplinary Center for Bioinformatics, University of Leipzig, Leipzig, Germany

Abstract

In recent years, thousands of sequencing projects around the world have generated a rapidly increasing number of RNA transcripts, creating enormous volumes of transcript data to be analyzed. An important problem in analyzing these data is distinguishing long non-coding RNAs (lncRNAs) from protein coding transcripts (PCTs). We therefore present a Support Vector Machine (SVM) based method to distinguish lncRNAs from PCTs, using features based on the frequencies of nucleotide patterns and on ORF lengths in transcripts. The proposed method uses as features the relative length of the first ORF and the frequencies of nucleotide patterns selected by PCA. FASTA files were used as input to calculate all possible features. These features were divided into two sets: (i) 336 frequencies of nucleotide patterns; and (ii) 4 features derived from ORFs. PCA was applied to the first set to identify 6 groups of frequencies that could contribute most to the distinction. Twenty-four experiments using the 6 groups from the first set and the features from the second set were built to create the best model to distinguish lncRNAs from PCTs. The method was trained and tested with human (Homo sapiens), mouse (Mus musculus) and zebrafish (Danio rerio) data, achieving 98.21%, 98.03% and 96.09% accuracy, respectively. It was compared to other tools available in the literature (CPAT, CPC, iSeeRNA, lncRNApred, lncRScan-SVM and FEELnc) and showed an improvement in accuracy of ≈3.00%. In addition, to validate our model, the mouse data were classified with the human model, and vice versa, achieving ≈97.80% accuracy in both cases, which indicates that the model is not overfit. The SVM models were also validated with data from rat (Rattus norvegicus), pig (Sus scrofa) and fruit fly (Drosophila melanogaster), obtaining more than 84.00% accuracy in all of these organisms.
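The count of 336 nucleotide pattern frequencies is consistent with counting every di-, tri- and tetranucleotide over the alphabet {A, C, G, T} (16 + 64 + 256 = 336). The abstract does not state this decomposition, so the Python sketch below adopts it as an assumption, together with overlapping sliding windows and normalization by the number of valid windows for each k. The function name `kmer_frequencies` is hypothetical and is not taken from the authors' software.

```python
from itertools import product

ALPHABET = "ACGT"


def kmer_frequencies(seq, ks=(2, 3, 4)):
    """Return a dict mapping every k-mer (k in ks) to its relative frequency.

    Assumptions (not confirmed by the abstract): overlapping windows,
    U treated as T, and windows with ambiguous symbols (e.g. N) skipped.
    """
    seq = seq.upper().replace("U", "T")
    features = {}
    for k in ks:
        # Enumerate all 4**k possible words so absent k-mers get frequency 0.
        kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
        counts = dict.fromkeys(kmers, 0)
        windows = 0
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            if word in counts:
                counts[word] += 1
                windows += 1
        for kmer in kmers:
            features[kmer] = counts[kmer] / windows if windows else 0.0
    return features
```

With the default `ks=(2, 3, 4)` the returned dictionary always has 336 entries, one per possible pattern, so transcripts of different lengths map to feature vectors of the same size.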
Our results also showed that 81.2% of human pseudogenes and 91.7% of mouse pseudogenes were classified as non-coding. Moreover, our method re-annotated two uncharacterized sequences from the Swiss-Prot database as having a high probability of being lncRNAs. Finally, to assess the use of the method for annotating transcripts derived from RNA-seq, previously identified lncRNAs of human, gorilla (Gorilla gorilla) and rhesus macaque (Macaca mulatta) were analyzed, of which 98.62%, 80.8% and 91.9%, respectively, were successfully classified. The SVM method proposed in this work shows high performance in distinguishing lncRNAs from PCTs. To build the model, besides using ORF features known from the literature, we used PCA to identify, in reference data sets, the nucleotide pattern frequencies that contribute most to distinguishing lncRNAs from PCTs. Interestingly, models created with two evolutionarily distant species could distinguish lncRNAs of even more distant species.
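The relative length of the first ORF can be illustrated with a short Python sketch. The abstract does not define "first ORF", so the code assumes it is the first ATG, scanning 5' to 3' on the given strand, that is followed by an in-frame stop codon, and that the relative length is the ORF length (stop codon included) divided by the transcript length. The function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_orf(seq):
    """Return (start, end) of the first complete ORF, or None if there is none.

    Assumption: an ORF starts at ATG and ends at the first in-frame stop
    codon; 'end' is exclusive and includes the stop codon.
    """
    seq = seq.upper().replace("U", "T")
    start = seq.find("ATG")
    while start != -1:
        # Walk codon by codon from the codon after the start.
        for i in range(start + 3, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                return start, i + 3
        # No in-frame stop for this ATG: try the next ATG downstream.
        start = seq.find("ATG", start + 1)
    return None


def first_orf_relative_length(seq):
    """ORF length divided by transcript length; 0.0 if no complete ORF."""
    orf = first_orf(seq)
    if orf is None:
        return 0.0
    start, end = orf
    return (end - start) / len(seq)
```

Under these assumptions a protein coding transcript is expected to yield a value closer to 1 than a typical lncRNA, which is why a feature of this kind can help separate the two classes.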
