Statistical Issues in the Design and Analysis of Gene Expression Microarray Studies of Animal Models
Abstract
Appropriate statistical design and analysis of gene expression microarray studies are critical for drawing valid and useful conclusions from expression profiling studies of animal models. In this paper, several aspects of study design are discussed, including the number of animals that need to be studied to ensure sufficiently powered studies, the usefulness of replication and pooling, and the allocation of samples to arrays. Data preprocessing methods for both cDNA dual-label spotted arrays and Affymetrix-style oligonucleotide arrays are reviewed. High-level analysis strategies are briefly discussed for each of the three types of study aims, namely class comparison, class discovery, and class prediction. For class comparison, methods are discussed for identifying genes differentially expressed between classes while guarding against unacceptably high numbers of false positive findings. Various clustering methods are discussed for class discovery aims. Class prediction methods are briefly reviewed, and reference is made to the importance of proper validation of predictors.
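As a rough illustration of the sample-size question raised above, the sketch below applies the standard normal-approximation formula for a two-sided, two-sample comparison of a single gene on the log scale. It is not taken from the paper: the function name `animals_per_class`, the stringent per-gene significance level (0.001, chosen here to limit false positives across many genes), the power, and the example effect size and standard deviation are all assumptions made for illustration.

```python
import math
from statistics import NormalDist


def animals_per_class(delta, sigma, alpha=0.001, power=0.95):
    """Approximate number of animals needed in each of two classes.

    delta : mean difference in log expression to detect (assumed value)
    sigma : between-animal standard deviation of log expression (assumed value)
    alpha : two-sided per-gene significance level (assumed; small to limit
            false positives when thousands of genes are tested)
    power : desired probability of detecting a true difference of size delta

    Uses the normal approximation n = 2 * ((z_alpha + z_beta) * sigma / delta)^2,
    which ignores the t-distribution correction that matters for very small n.
    """
    if delta <= 0 or sigma <= 0:
        raise ValueError("delta and sigma must be positive")
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = std_normal.inv_cdf(power)           # quantile for the target power
    n = 2 * ((z_alpha + z_beta) * sigma / delta) ** 2
    return math.ceil(n)


if __name__ == "__main__":
    # Hypothetical inputs: a 2-fold change on the log2 scale (delta = 1.0)
    # and a between-animal standard deviation of 0.5.
    print(animals_per_class(delta=1.0, sigma=0.5))
```

Because the required number grows with the square of sigma/delta, halving the detectable difference roughly quadruples the number of animals, which is why realistic estimates of between-animal variability matter at the design stage.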