Unifying the analysis of high-throughput sequencing datasets: characterizing RNA-seq, 16S rRNA gene sequencing and selective growth experiments by compositional data analysis
Abstract
Experimental designs that use high-throughput sequencing to generate datasets include RNA sequencing (RNA-seq), chromatin immunoprecipitation sequencing (ChIP-seq), sequencing of 16S rRNA gene fragments, metagenomic analysis and selective growth experiments. In each case the underlying data are similar: counts of sequencing reads mapped to a large number of features in each sample. Despite this underlying similarity, the data analysis methods used for these experimental designs all differ and do not translate across experiments. Alternative methods have been developed in the physical and geological sciences that treat similar data as compositions. Compositional data analysis methods transform the data to relative abundances, with the result that the analyses are more robust and reproducible.
Data from an in vitro selective growth experiment, an RNA-seq experiment and the Human Microbiome Project 16S rRNA gene abundance dataset were examined with ALDEx2, a compositional data analysis tool that uses Bayesian methods to infer technical and statistical error. The ALDEx2 approach is shown to be suitable for all three types of data: it correctly identifies both the direction and the differential abundance of features in the differential growth experiment; it identifies a set of differentially expressed genes in the RNA-seq dataset that is substantially similar to the set identified by the leading tools; and it identifies as differential the taxa that distinguish the tongue dorsum and buccal mucosa in the Human Microbiome Project dataset. The design of ALDEx2 reduces the number of false positive identifications that result from datasets composed of many features in few samples.
Statistical analysis of high-throughput sequencing datasets composed of per feature counts showed that the ALDEx2 R package is a simple and robust tool, which can be applied to RNA-seq, 16S rRNA gene sequencing and differential growth datasets, and by extension to other techniques that use a similar approach.
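The compositional idea summarized above can be illustrated with a minimal Python sketch. It is not the ALDEx2 implementation (ALDEx2 is an R package); it only assumes the general approach of drawing Monte Carlo instances of the per-feature proportions from a Dirichlet distribution and applying a centered log-ratio (clr) transform to each instance. The function names, the example counts, the prior of 0.5 and the 128 instances are illustrative assumptions.

```python
import math
import random


def clr(parts):
    """Centered log-ratio: log of each part minus the mean log of all parts.

    All parts must be strictly positive. The result is unchanged if every
    part is multiplied by the same constant, so counts and proportions
    give the same clr values.
    """
    logs = [math.log(p) for p in parts]
    mean_log = sum(logs) / len(logs)
    return [x - mean_log for x in logs]


def dirichlet_sample(counts, prior, rng):
    """Draw one vector of proportions from Dirichlet(counts + prior).

    Uses the standard construction: independent Gamma draws, normalized
    to sum to 1. The prior keeps zero counts away from log(0).
    """
    draws = [rng.gammavariate(c + prior, 1.0) for c in counts]
    total = sum(draws)
    return [d / total for d in draws]


def clr_instances(counts, n_instances=128, prior=0.5, seed=0):
    """Return n_instances clr-transformed Monte Carlo instances for one sample."""
    rng = random.Random(seed)
    return [clr(dirichlet_sample(counts, prior, rng)) for _ in range(n_instances)]


# Hypothetical read counts for five features in a single sample.
counts = [0, 12, 340, 25, 1900]
instances = clr_instances(counts)

# Per-feature mean clr value across the Monte Carlo instances.
mean_clr = [sum(col) / len(col) for col in zip(*instances)]
```

The spread of each feature's clr values across the instances gives a rough picture of the uncertainty that comes from sampling a finite number of reads. Low-count features vary widely between instances, while high-count features barely move.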