Classification and clustering of sequencing data using a Poisson model

Annals of Applied Statistics - Volume 5, Issue 4 - 2011
Daniela Witten
University of Washington

References

Marioni, J. C., Mason, C. E., Mane, S. M., Stephens, M. and Gilad, Y. (2008). RNA-seq: An assessment of technical reproducibility and comparison with gene expression arrays. <i>Genome Res.</i> <b>18</b> 1509–1517.

Robinson, M. D., McCarthy, D. J. and Smyth, G. K. (2010). edgeR: A Bioconductor package for differential expression analysis of digital gene expression data. <i>Bioinformatics</i> <b>26</b> 139–140.

Anders, S. and Huber, W. (2010). Differential expression analysis for sequence count data. <i>Genome Biol.</i> <b>11</b> R106.

Lee, S., Huang, J. Z. and Hu, J. (2010). Sparse logistic principal components analysis for binary data. <i>Ann. Appl. Stat.</i> <b>4</b> 1579–1601.

Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. <i>J. Amer. Statist. Assoc.</i> <b>66</b> 846–850.

Wang, Z., Gerstein, M. and Snyder, M. (2009). RNA-Seq: A revolutionary tool for transcriptomics. <i>Nat. Rev. Genet.</i> <b>10</b> 57–63.

Tibshirani, R., Hastie, T., Narasimhan, B. and Chu, G. (2002). Diagnosis of multiple cancer types by shrunken centroids of gene expression. <i>Proc. Natl. Acad. Sci. USA</i> <b>99</b> 6567–6572.

Anscombe, F. J. (1948). The transformation of Poisson, binomial and negative-binomial data. <i>Biometrika</i> <b>35</b> 246–254.

Johnson, D. S., Mortazavi, A., Myers, R. M. and Wold, B. (2007). Genome-wide mapping of in vivo protein-DNA interactions. <i>Science</i> <b>316</b> 1497–1502.

Robinson, M. D. and Oshlack, A. (2010). A scaling normalization method for differential expression analysis of RNA-seq data. <i>Genome Biol.</i> <b>11</b> R25.

Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L. and Wold, B. (2008). Mapping and quantifying mammalian transcriptomes by RNA-seq. <i>Nature Methods</i> <b>5</b> 621–628.

Auer, P. L. and Doerge, R. W. (2010). Statistical design and analysis of RNA sequencing data. <i>Genetics</i> <b>185</b> 405–416.

Barrett, T., Suzek, T. O., Troup, D. B., Wilhite, S. E., Ngau, W.-C., Ledoux, P., Rudnev, D., Lash, A. E., Fujibuchi, W. and Edgar, R. (2005). NCBI GEO: Mining millions of expression profiles–database and tools. <i>Nucleic Acids Res.</i> <b>33</b> D562–D566.

Berninger, P., Gaidatzis, D., van Nimwegen, E. and Zavolan, M. (2008). Computational analysis of small RNA cloning data. <i>Methods</i> <b>44</b> 13–21.

Bickel, P. J. and Levina, E. (2004). Some theory of Fisher’s linear discriminant function, ‘naive Bayes’, and some alternatives when there are many more variables than observations. <i>Bernoulli</i> <b>10</b> 989–1010.

Brown, P. and Botstein, D. (1999). Exploring the new world of the genome with DNA microarrays. <i>Nature Genetics</i> <b>21</b> 33–37.

Bullard, J. H., Purdom, E., Hansen, K. D. and Dudoit, S. (2010). Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. <i>BMC Bioinformatics</i> <b>11</b> 94.

Cai, L., Huang, H., Blackshaw, S., Liu, J., Cepko, C. and Wong, W. (2004). Clustering analysis of SAGE data using a Poisson approach. <i>Genome Biol.</i> <b>5</b> R51.

DeRisi, J., Iyer, V. and Brown, P. (1997). Exploring the metabolic and genetic control of gene expression on a genomic scale. <i>Science</i> <b>278</b> 680–686.

Dudoit, S., Fridlyand, J. and Speed, T. P. (2001). Comparison of discrimination methods for the classification of tumors using gene expression data. <i>J. Amer. Statist. Assoc.</i> <b>96</b> 1151–1160.

Kasowski, M., Grubert, F., Heffelfinger, C., Hariharan, M., Asabere, A., Waszak, S. M., Habegger, L., Rozowsky, J., Shi, M., Urban, A. E., Hong, M.-Y., Karczewski, K. J., Huber, W., Weissman, S. M., Gerstein, M. B., Korbel, J. O. and Snyder, M. (2010). Variation in transcription factor binding among humans. <i>Science</i> <b>328</b> 232–235.

Linsen, S. E. V., de Wit, E., Janssens, G., Heater, S., Chapman, L., Parkin, R. K., Fritz, B., Wyman, S. K., de Bruijn, E., Voest, E. E., Kuersten, S., Tewari, M. and Cuppen, E. (2009). Limitations and possibilities of small RNA digital gene expression profiling. <i>Nature Methods</i> <b>6</b> 474–476.

Monti, S., Savage, K. J., Kutok, J. L., Feuerhake, F., Kurtin, P., Mihm, M., Wu, B., Pasqualucci, L., Neuberg, D., Aguiar, R. C. T., Cin, P. D., Ladd, C., Pinkus, G. S., Salles, G., Harris, N. L., Dalla-Favera, R., Habermann, T. M., Aster, J. C., Golub, T. R. and Shipp, M. A. (2005). Molecular profiling of diffuse large B-cell lymphoma identifies robust subtypes including one characterized by host inflammatory response. <i>Blood</i> <b>105</b> 1851–1861.

Morozova, O., Hirst, M. and Marra, M. A. (2009). Applications of new sequencing technologies for transcriptome analysis. <i>Annu. Rev. Genomics Hum. Genet.</i> <b>10</b> 135–151.

Nagalakshmi, U., Wang, Z., Waern, K., Shou, C., Raha, D., Gerstein, M. and Snyder, M. (2008). The transcriptional landscape of the yeast genome defined by RNA sequencing. <i>Science</i> <b>320</b> 1344–1349.

Nielsen, T., West, R., Linn, S., Alter, O., Knowling, M., O’Connell, J. S. Z., Fero, M., Sherlock, G., Pollack, J., Brown, P., Botstein, D. and van de Rijn, M. (2002). Molecular characterisation of soft tissue tumours: A gene expression study. <i>The Lancet</i> <b>359</b> 1301–1307.

Oshlack, A., Robinson, M. and Young, M. (2010). From RNA-seq reads to differential expression results. <i>Genome Biol.</i> <b>11</b> 220.

Oshlack, A. and Wakefield, M. (2009). Transcript length bias in RNA-seq data confounds systems biology. <i>Biology Direct</i> <b>4</b> 14.

Pepke, S., Wold, B. and Mortazavi, A. (2009). Computation for ChIP-seq and RNA-seq studies. <i>Nature Methods</i> <b>6</b> S22–S32.

Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J., Poggio, T., Gerald, W., Loda, M., Lander, E. and Golub, T. (2001). Multiclass cancer diagnosis using tumor gene expression signatures. <i>Proc. Natl. Acad. Sci. USA</i> <b>98</b> 15149–15154.

Spellman, P. T., Sherlock, G., Iyer, V. R., Zhang, M., Anders, K., Eisen, M. B., Brown, P. O., Botstein, D. and Futcher, B. (1998). Comprehensive identification of cell cycle-regulated genes of the yeast <i>Saccharomyces cerevisiae</i> by microarray hybridization. <i>Mol. Biol. Cell</i> <b>9</b> 3273–3297.

Tibshirani, R., Hastie, T., Narasimhan, B. and Chu, G. (2003). Class prediction by nearest shrunken centroids, with applications to DNA microarrays. <i>Statist. Sci.</i> <b>18</b> 104–117.

Wang, S. M. (2007). Understanding SAGE data. <i>Trends Genet.</i> <b>23</b> 42–50.

Wilhelm, B. T. and Landry, J.-R. (2009). RNA-Seq-quantitative measurement of expression through massively parallel RNA-sequencing. <i>Methods</i> <b>48</b> 249–257.

Witten, D. and Tibshirani, R. (2011). Penalized classification using Fisher’s linear discriminant. <i>J. Roy. Statist. Soc. Ser. B</i> <b>73</b> 753–772.

Witten, D., Tibshirani, R., Gu, S., Fire, A. and Lui, W. (2010). Ultra-high throughput sequencing-based small RNA discovery and discrete statistical biomarker analysis in a collection of cervical tumours and matched controls. <i>BMC Biology</i> <b>8</b> 58.

Agresti, A. (2002). <i>Categorical Data Analysis</i>. Wiley, Hoboken, NJ.

Hastie, T., Tibshirani, R. and Friedman, J. (2009). <i>The Elements of Statistical Learning</i>: <i>Data Mining, Inference, and Prediction</i>. Springer, New York.

Li, J., Witten, D., Johnstone, I. and Tibshirani, R. (2011). Normalization, testing, and false discovery rate estimation for RNA-sequencing data. <i>Biostatistics</i>. To appear.