DRIMSeq: a Dirichlet-multinomial framework for multivariate count outcomes in genomics

F1000Research - Volume 5 - Page 1356
Małgorzata Nowicka1,2, Mark D. Robinson1,2
1Institute for Molecular Life Sciences, University of Zurich, Zurich, 8057
2SIB Swiss Institute of Bioinformatics, University of Zurich, Zurich, 8057

Abstract

Many genomics data analyses involve measurements on a multivariate response. For example, alternative splicing can lead to multiple expressed isoforms from the same primary transcript. Differences (e.g., between normal and disease states) in the relative ratios of expressed isoforms may have significant phenotypic consequences or prognostic value. Similarly, knowledge of single nucleotide polymorphisms (SNPs) that affect splicing, so-called splicing quantitative trait loci (sQTL), will help to characterize the effects of genetic variation on gene expression. RNA sequencing (RNA-seq) has provided an attractive toolbox to carefully unravel alternative splicing outcomes, and fast and accurate methods for transcript quantification have recently become available. We propose a statistical framework based on the Dirichlet-multinomial distribution that uses these quantifications to discover changes in isoform usage between conditions and SNPs that affect the relative expression of transcripts. The Dirichlet-multinomial model naturally accounts for differential gene expression without losing information about overall gene abundance and, by jointly modeling isoform expression, can account for the correlated nature of isoforms. The main challenge in this approach is obtaining robust estimates of model parameters with limited numbers of replicates. We address this by sharing information and show that our method improves on existing approaches in terms of standard statistical performance metrics. The framework is applicable to other multivariate scenarios, such as Poly-A-seq, or to settings where beta-binomial models have been applied (e.g., differential DNA methylation). Our method is available as a Bioconductor R package called DRIMSeq.
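
To make the modeling idea concrete, the following is a minimal sketch of the Dirichlet-multinomial log-probability for the isoform counts of one gene in one sample. It is an illustration only, not DRIMSeq's implementation or API: the function name `dm_logpmf`, the parameterization by isoform proportions and a single precision value, and the example counts are all assumptions introduced here.

```python
from math import exp, lgamma


def dm_logpmf(counts, proportions, precision):
    """Log-probability of isoform counts under a Dirichlet-multinomial.

    Assumed parameterization (hypothetical, for illustration):
    Dirichlet parameters alpha_j = proportions[j] * precision, where a
    larger precision means less overdispersion.
    """
    n = sum(counts)  # total gene-level count for this sample
    alphas = [p * precision for p in proportions]
    # log of the multinomial coefficient n! / prod(y_j!)
    log_coef = lgamma(n + 1) - sum(lgamma(y + 1) for y in counts)
    # log of Gamma(precision) / Gamma(n + precision)
    log_norm = lgamma(precision) - lgamma(n + precision)
    # log of prod(Gamma(y_j + alpha_j) / Gamma(alpha_j))
    log_terms = sum(lgamma(y + a) - lgamma(a) for y, a in zip(counts, alphas))
    return log_coef + log_norm + log_terms


# Two isoforms with alpha = (1, 1): every split of 3 reads is equally
# likely, so the probability of (1, 2) is approximately 0.25.
print(exp(dm_logpmf([1, 2], [0.5, 0.5], 2.0)))

# With a very large precision the model approaches a plain multinomial,
# whose probability for (1, 2) is 3 * 0.5**3 = 0.375.
print(exp(dm_logpmf([1, 2], [0.5, 0.5], 1e6)))
```

Under these assumptions, the total count per gene enters only through `n`, which is one way to see how a model of relative isoform usage can condition on overall gene abundance rather than discard it. With two isoforms the expression reduces to a beta-binomial, the case mentioned above for differential DNA methylation.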
