Differential expression analysis for sequence count data
Abstract
High-throughput sequencing assays such as RNA-Seq, ChIP-Seq or barcode counting provide quantitative readouts in the form of count data. Inferring differential signal in such data correctly and with good statistical power requires an estimate of data variability throughout the dynamic range and a suitable error model. We propose a method based on the negative binomial distribution, with variance and mean linked by local regression, and present an implementation, DESeq, as an R/Bioconductor package.
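To make the idea of linking variance and mean by local regression concrete, the following is a minimal Python sketch, not the DESeq implementation. It assumes replicate counts that are already comparable across samples (no library-size normalization), uses a simple tricube-weighted local linear fit on the log-log scale as a stand-in for the local regression, and converts a fitted variance into a negative binomial dispersion through the standard relation var = mu + alpha * mu^2. The function names, the bandwidth, and the toy counts are illustrative assumptions.

```python
import math
from statistics import mean, variance


def tricube(u):
    """Tricube kernel weight for a scaled distance u (zero outside |u| < 1)."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def local_linear(xs, ys, x0, bandwidth):
    """Predict y at x0 by weighted least squares on the points near x0."""
    w = [tricube((x - x0) / bandwidth) for x in xs]
    sw = sum(w)
    if sw == 0.0:
        raise ValueError("no points within the bandwidth of x0")
    swx = sum(wi * x for wi, x in zip(w, xs))
    swy = sum(wi * y for wi, y in zip(w, ys))
    swxx = sum(wi * x * x for wi, x in zip(w, xs))
    swxy = sum(wi * x * y for wi, x, y in zip(w, xs, ys))
    denom = sw * swxx - swx ** 2
    if abs(denom) < 1e-12:
        return swy / sw  # degenerate neighbourhood: fall back to weighted mean
    slope = (sw * swxy - swx * swy) / denom
    intercept = (swy - slope * swx) / sw
    return intercept + slope * x0


def fitted_variance(counts, mu0, bandwidth=1.0):
    """Smoothed variance at mean mu0, fitted on the log-log scale (assumption)."""
    pairs = [(mean(row), variance(row)) for row in counts]
    pairs = [(m, v) for m, v in pairs if m > 0 and v > 0]
    xs = [math.log(m) for m, _ in pairs]
    ys = [math.log(v) for _, v in pairs]
    return math.exp(local_linear(xs, ys, math.log(mu0), bandwidth))


def nb_dispersion(mu, var):
    """Method-of-moments dispersion alpha from var = mu + alpha * mu**2."""
    return max((var - mu) / mu ** 2, 0.0)


# Toy data (hypothetical): one row per gene, one column per replicate.
toy_counts = [[8, 12], [18, 22], [45, 55]]
v20 = fitted_variance(toy_counts, mu0=20.0)
alpha20 = nb_dispersion(20.0, v20)
```

Pooling information across genes with similar means in this way is what allows a variance estimate even when each gene has only a few replicates; a production analysis would use the DESeq package itself rather than this sketch.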