ROP: dumpster diving in RNA-sequencing to find the source of 1 trillion reads across diverse adult human tissues

Genome Biology - Tập 19 - Trang 1-12 - 2018
Serghei Mangul1,2, Harry Taegyun Yang1, Nicolas Strauli3, Franziska Gruhl4,5, Hagit T. Porath6, Kevin Hsieh1, Linus Chen7, Timothy Daley8, Stephanie Christenson9, Agata Wesolowska-Andersen10, Roberto Spreafico2, Cydney Rios10, Celeste Eng11, Andrew D. Smith8, Ryan D. Hernandez12,13,14, Roel A. Ophoff15,16,17, Jose Rodriguez Santana18, Erez Y. Levanon19, Prescott G. Woodruff9, Esteban Burchard20, Max A. Seibold21,22, Sagiv Shifman23, Eleazar Eskin1,16, Noah Zaitlen9
1Department of Computer Science, University of California, Los Angeles, USA
2Institute for Quantitative and Computational Biosciences, University of California, Los Angeles, USA
3Biomedical Sciences Graduate Program, University of California, San Francisco, USA
4Center for Integrative Genomics, University of Lausanne, Lausanne, Switzerland
5SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
6The Mina and Everard Goodman Faculty of Life Sciences, Bar-Ilan University, Ramat Gan, Israel
7Department of Bioengineering, University of California, Los Angeles, USA
8Molecular and Computational Biology, Department of Biological Sciences, University of Southern California, Los Angeles, USA
9Division of Pulmonary, Critical Care, Sleep and Allergy, Department of Medicine, and Cardiovascular Research Institute, University of California, San Francisco, USA
10Center for Genes, Environment, and Health, National Jewish Health, Denver, USA
11Department of Medicine, University of California, San Francisco, USA
12Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, USA
13Institute for Quantitative Biosciences, University of California, San Francisco, USA
14Institute for Human Genetics, University of California San Francisco, San Francisco, USA
15Center for Neurobehavioral Genetics, Semel Institute for Neuroscience and Human Behavior, University California, Los Angeles, USA
16Department of Human Genetics, University of California, Los Angeles, USA
17Department of Psychiatry, Brain Center Rudolf Magnus, University Medical Center Utrecht, Utrecht, The Netherlands
18Centro de Neumología Pediátrica, San Juan, Puerto Rico
19The Mina and Everard Goodman Faculty of Life Sciences, Bar-Ilan University, Ramat-Gan, Israel
20Schools of Pharmacy and Medicine, Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, USA
21Department of Pediatrics, National Jewish Health, Denver, USA
22University of Colorado School of Medicine, Denver, USA
23Department of Genetics, The Institute of Life Sciences, The Hebrew University of Jerusalem, Jerusalem, Israel

Tóm tắt

High-throughput RNA-sequencing (RNA-seq) technologies provide an unprecedented opportunity to explore the individual transcriptome. Unmapped reads are a large and often overlooked output of standard RNA-seq analyses. Here, we present Read Origin Protocol (ROP), a tool for discovering the source of all reads originating from complex RNA molecules. We apply ROP to samples across 2630 individuals from 54 diverse human tissues. Our approach can account for 99.9% of 1 trillion reads of various read length. Additionally, we use ROP to investigate the functional mechanisms underlying connections between the immune system, microbiome, and disease. ROP is freely available at https://github.com/smangul1/rop/wiki .

Tài liệu tham khảo

Sultan M, Schulz MH, Richard H, Magen A, Klingenhoff A, Scherf M, et al. A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science. 2008;321:956–60. Cloonan N, Forrest AR, Kolle G, Gardiner BB, Faulkner GJ, Brown MK, et al. Stem cell transcriptome profiling via massive-scale mRNA sequencing. Nat Methods. 2008;5:613–9. Tang F, Barbacioru C, Wang Y, Nordman E, Lee C, Xu N, et al. mRNA-Seq whole-transcriptome analysis of a single cell. Nat Methods. 2009;6:377–82. Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009;10:57–63. Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010;28:511–5. Nicolae M, Mangul S, Mandoiu II, Zelikovsky A. Estimation of alternative splicing isoform frequencies from RNA-Seq data. Algorithms Mol Biol. 2011;6:9. Pertea M, Pertea GM, Antonescu CM, Chang TC, Mendell JT, Salzberg SL. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat Biotechnol. 2015;33:290–5. Ardlie KG, Deluca DS, Segre AV, Sullivan TJ, Young TR, Gelfand ET, et al. The Genotype-Tissue Expression (GTEx) pilot analysis: Multitissue gene regulation in humans. Science. 2015;348:648–60. Li S, Tighe SW, Nicolet CM, Grove D, Levy S, Farmerie W, et al. Multi-platform assessment of transcriptome profiling using RNA-seq in the ABRF next-generation sequencing study. Nat Biotechnol. 2014;32:915–25. Seqc/Maqc-Iii Consortium. A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium. Nat Biotechnol. 2014;32:903–14. Ozsolak F, Milos PM. RNA sequencing: advances, challenges and opportunities. Nat Rev Genet. 2011;12:87–98. Baruzzo G, Hayer KE, Kim EJ, Di Camillo B, FitzGerald GA, Grant GR. Simulation-based comprehensive benchmarking of RNA-seq aligners. Nat Methods. 2017;14:135. Siragusa E, Weese D, Reinert K. Fast and accurate read mapping with approximate seeds and multiple backtracking. Nucleic Acids Res. 2013;41:e78. Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol. 2011;29:644–52. Blachly JS, Ruppert AS, Zhao W, Long S, Flynn J, Flinn I, et al. Immunoglobulin transcript sequence and somatic hypermutation computation from unselected RNA-seq reads in chronic lymphocytic leukemia. Proc Natl Acad Sci. 2015;112:4322–7. Strauli NB, Hernandez RD. Statistical inference of a convergent antibody repertoire response to influenza vaccine. Genome Med. 2016;8:1. Porath HT, Carmi S, Levanon EY. A genome-wide map of hyper-edited RNA reveals numerous new sites. Nat Commun. 2014;5:4726. Wu C-S, Yu CY, Chuang CY, Hsiao M, Kao CF, Kuo HC, et al. Integrative transcriptome sequencing identifies trans-splicing events with important roles in human embryonic stem cell pluripotency. Genome Res. 2014;24:25–36. Wang X-S, Prensner JR, Chen G, Cao Q, Han B, Dhanasekaran SM, et al. An integrative approach to reveal driver gene fusions from paired-end sequencing data in cancer. Nat Biotechnol. 2009;27:1005–11. Jeck WR, Sharpless NE. Detecting and characterizing circular RNAs. Nat Biotechnol. 2014;32:453–61. Kostic AD, Ojesina AI, Pedamallu CS, Jung J, Verhaak RG, Getz G, et al. PathSeq: software to identify or discover microbes by deep sequencing of human tissue. Nat Biotechnol. 2011;29:393–6. Chuang T-J, Wu CS, Chen Y, Hung LY, Chiang TW, Yang MY. NCLscan: accurate identification of non-co-linear transcripts (fusion, trans-splicing and circular RNA) with a good balance between sensitivity and precision. Nucleic Acids Res. 2016;44:e29. Brown SD, Raeburn LA, Holt RA. Profiling tissue-resident T cell repertoires by RNA sequencing. Genome Med. 2015;7:1–8. Strauli N, Hernandez R. Statistical inference of a convergent antibody repertoire response to influenza vaccine. Genome Med. 2016;8:60. Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 2013;14:R36. Andrews S. FastQC: a quality control tool for high throughput sequence data. 2010. Available online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, et al. BLAST+: architecture and applications. BMC Bioinformatics. 2009;10:421. Ye J, Ma N, Madden TL, Ostell JM. IgBLAST: an immunoglobulin variable domain sequence analysis tool. Nucleic Acids Res. 2013;41:W34–40. Truong DT, Franzosa EA, Tickle TL, Scholz M, Weingart G, Pasolli E, et al. MetaPhlAn2 for enhanced metagenomic taxonomic profiling. Nat Methods. 2015;12:902–3. Salter SJ, Cox MJ, Turek EM, Calus ST, Cookson WO, Moffatt MF, et al. Reagent and laboratory contamination can critically impact sequence-based microbiome analyses. BMC Biol. 2014;12:87. Tausch SH, Renard BY, Nitsche A, Dabrowski PW. RAMBO-K: rapid and sensitive removal of background sequences from next generation sequencing data. PLoS One. 2015;10:e0137896. Li B, Li T, Pignon JC, Wang B, Wang J, Shukla SA, et al. Landscape of tumor-infiltrating T cell repertoire of human cancers. Nat Genet. 2016;48:725–32. GTEx Consortium. The Genotype-Tissue Expression (GTEx) pilot analysis: Multitissue gene regulation in humans. Science. 2015;348:648–60. Criscione SW, Zhang Y, Thompson W, Sedivy JM, Neretti N. Transcriptional landscape of repetitive elements in normal and cancer human cells. BMC Genomics. 2014;15:583. Bazak L, Haviv A, Barak M, Jacob-Hirsch J, Deng P, Zhang R, et al. A-to-I RNA editing occurs at over a hundred million genomic sites, located in a majority of human genes. Genome Res. 2014;24:365–76. Kim D, Salzberg SL. TopHat-Fusion: an algorithm for discovery of novel fusion transcripts. Genome Biol. 2011;12:R72. Zhang XO, Dong R, Zhang Y, Zhang JL, Luo Z, Zhang J, et al. Diverse alternative back-splicing and alternative splicing landscape of circular RNAs. Genome Res. 2016;26:1277–87. https://doi.org/10.1101/gr.202895.115. Poole A, Urbanek C, Eng C, Schageman J, Jacobson S, O’Connor BP, et al. Dissecting childhood asthma with nasal transcriptomics distinguishes subphenotypes of disease. J Allergy Clin Immunol. 2014;133:670–8. Yan M, Pamp SJ, Fukuyama J, Hwang PH, Cho DY, Holmes S, et al. Nasal microenvironments and interspecific interactions influence nasal microbiota complexity and S. aureus carriage. Cell Host Microbe. 2013;14:631–40. Beck JM, Young VB, Huffnagle GB. The microbiome of the lung. Transl Res. 2012;160:258–66. Strong MJ, Xu G, Morici L, Splinter Bon-Durant S, Baddoo M, Lin Z, et al. Microbial contamination in next generation sequencing: implications for sequence-based analysis of clinical samples. PLoS Pathog. 2014;10:e1004437. Westermann AJ, Gorski SA, Vogel J. Dual RNA-seq of pathogen and host. Nat Rev Microbiol. 2012;10:618–30. Spreafico R, Rossetti M, van Loosdregt J, Wallace CA, Massa M, Magni-Manzoni S, et al. A circulating reservoir of pathogenic-like CD4+ T cells shares a genetic and phenotypic signature with the inflamed synovial micro-environment. Ann Rheum Dis. 2016;75:459–65. Jin Y, Tam OH, Paniagua E, Hammell M. TEtranscripts: a package for including transposable elements in differential expression analysis of RNA-seq datasets. Bioinformatics. 2015;31:3593–9. Melé M, Ferreira PG, Reverter F, DeLuca DS, Monlong J, Sammeth M, et al. The human transcriptome across tissues and individuals. Science. 2015;348:660–5. Anders S, Pyl PT, Huber W. HTSeq--A Python framework to work with high-throughput sequencing data. Bioinformatics. 2014;31:166–9. Tarailo-Graovac M, Chen N. Using RepeatMasker to identify repetitive elements in genomic sequences. Curr Protoc Bioinformatics. 2009;Chapter 4:Unit 4. Mangul S, Yang HT, Strauli N, Gruhl F, Porath HT, Hsieh K, et al. ROP: Dumpster Diving in RNA-sequencing to find the source of 1 trillion reads across diverse adult human tissues. Gene Expression Omnibus. 2018; https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE109313. Mangul S, Yang HT, Strauli N, Gruhl F, Porath HT, Hsieh K, et al. ROP: Dumpster Diving in RNA-sequencing to find the source of 1 trillion reads across diverse adult human tissues. Gene Expression Omnibus. 2018; https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE109484.