Local sequence alignments statistics: deviations from Gumbel statistics in the rare-event tail
Tóm tắt
The optimal score for ungapped local alignments of infinitely long random sequences is known to follow a Gumbel extreme value distribution. Less is known about the important case, where gaps are allowed. For this case, the distribution is only known empirically in the high-probability region, which is biologically less relevant. We provide a method to obtain numerically the biologically relevant rare-event tail of the distribution. The method, which has been outlined in an earlier work, is based on generating the sequences with a parametrized probability distribution, which is biased with respect to the original biological one, in the framework of Metropolis Coupled Markov Chain Monte Carlo. Here, we first present the approach in detail and evaluate the convergence of the algorithm by considering a simple test case. In the earlier work, the method was just applied to one single example case. Therefore, we consider here a large set of parameters: We study the distributions for protein alignment with different substitution matrices (BLOSUM62 and PAM250) and affine gap costs with different parameter values. In the logarithmic phase (large gap costs) it was previously assumed that the Gumbel form still holds, hence the Gumbel distribution is usually used when evaluating p-values in databases. Here we show that for all cases, provided that the sequences are not too long (L > 400), a "modified" Gumbel distribution, i.e. a Gumbel distribution with an additional Gaussian factor is suitable to describe the data. We also provide a "scaling analysis" of the parameters used in the modified Gumbel distribution. Furthermore, via a comparison with BLAST parameters, we show that significance estimations change considerably when using the true distributions as presented here. Finally, we study also the distribution of the sum statistics of the k best alignments. Our results show that the statistics of gapped and ungapped local alignments deviates significantly from Gumbel in the rare-event tail. We provide a Gaussian correction to the distribution and an analysis of its scaling behavior for several different scoring parameter sets, which are commonly used to search protein data bases. The case of sum statistics of k best alignments is included.
Tài liệu tham khảo
Brown S: Bioinformatics. 2000, Natick (MA): Eaton Publishing
Rashidi S, Buehler L: Bioinformatics Basics. 2000, Boca Raton (FL): CRC Press
The Protein Data Bank. http://www.pdb.org
Fraser C, Gocayne J: The Minimal Gene Complement of Mycoplasma Genitalium. Science. 1995, 270: 397-
Needleman SB, Wunsch CD: A General Method Applicabel to Search for Similarities in the Amino Acid Sequence of two Proteins. J Mol Biol. 1970, 48: 443-453.
Smith TF, Waterman MS: Identification of Common Molecular Subsequences. J Mol Biol. 1981, 147: 195-197.
Gotoh O: An Improved Algorithm for Matching Biological Sequences. J Mol Biol. 1982, 162: 705-
Altschul S, Gish W, Miller W, Myers E, Lipman D: Basic Local Alignment Search Tool. J Mol Biol. 1990, 215: 403-410.
Karlin S, Altschul S: Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc Natl Acad Sci USA. 1990, 87: 2264-
Dembo A, Karlin S, Zeitouni O: Limit Distribution of Maximal Non-Aligned Two-Sequence Segmental Score. Ann Prob. 1994, 22: 2022-2039.
Yu Y, Hwa T: Statistical Significance of Probabilistic Sequence Alignment and Related Local Hidden Markov Models. J Comp Biol. 2001, 8 (3): 249-282.
Yu Y, Bundschuh R, Hwa T: Statistical Significance and Extreme Ensemble of Gapped Local Hybrid Alignment. Biological Evolution and Statistical Physics. Edited by: Lässig M, Valeriani A. 2002, 3-22. Berlin: Springer-Verlag
Kschischo M, Lässig M, Yu Y: Toward an accurate statistics of gapped alignments. Bull Math Biol. 2004, 67: 169-191.
Siegmund D, Yakir B: Approximate p-Values for Local Sequence Alignments. Annals of Statistics. 2000, 28: 657-680.
Metzler D, Grossmann S, Wakolbinger A: A poisson model for gapped local alignments. Stat Prob Letters. 2002, 60: 91-100.
Altschul S, Gish W: Local Alignment Statistics. Meth Enzym. 1996, 266: 460-
Olsen R, Bundschuh R, Hwa T: Rapid Assessment of Extremal Statistics for Local Alignment with Gaps. Proceedings of the seventh International Conference on Intelligent Systems for Molecular Biology. Edited by: Lengauer T, Schneider R, Bork P, Brutlag D, Glasgow J, Mewes HW, Zimmer R, Menlo Park. 1999, 270: 211-222. CA: AAAI Press
Altschul S, Bundschuh R, Olsen R, Hwa T: The estimation of statistical parameters for local alignment score distributions. Nucl Acid Res. 2001, 29 (2): 351-361.
Hartmann A: Sampling rare events: Statistics of local sequence alignments. Phys Rev E. 2002, 65 (5 Pt 2): 056102-
Heinkoff S, Heinkoff J: Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA. 1992, 89: 10915-10919.
Dayhoff M, Schwartz R, Orcutt B: A model of Evolutionary Change in Proteins. Atlas of Protein Sequence and Structure. Edited by: Dayhoff M. 1978, 5 (Suppl 3): 345-352. Washington, D.C: National Biomedical Research Foundation
Schwartz R, Dayhoff M: Matrices for Detecting Distant Relationships. Atlas of Protein Sequence and Structure. Edited by: Dayhoff M. 1978, 5 (Suppl 3): 353-358. Washington, D.C.: National Biomedical Research Foundation
Gumbel E: Statistics of Extremes. 1958, New York: Columbia University Press
Arratia R, Waterman M: A Phase Transition for the Score in Matching Random Sequences Allowing Deletions. Ann Appl Prob. 1994, 4: 200-225.
Hwa T, Lässig M: Optimal Detection of Sequence Similarity by Local Alignment. Proceedings of the Second Annual International Conference on Computational Molecular Biology (RECOMB98). Edited by: Istrail S, Pevzner P, Waterman M. 1998, 109-
Sellers P: Pattern recognition in genetic sequences by mismatch density. Bull Math Biol. 1984, 46: 501-514.
Altschul S, Erickson B: Locally optimal subalignments using nonlinear similartity functions. Bull Math Biol. 1986, 48: 633-660.
Karlin S, Altschul S: Applications and statistics for multiple high-scoring segments in molecular sequences. Proc Natl Acad Sci USA. 1993, 90: 5873-5877.
Dieker A, Mandjes M: On Asymptotically efficient simulation of large deviation probabilities. Adv Appl Prob. 2005, 37: 539-552.
Hastings WK: Monte Carlo Sampling Methods Using Markov Chains and Their Applications. Biometrika. 1970, 57: 97-109.
Liu J: Monte Carlo Strategies in Scientific Computing. 2002, New York: Springer
Liu J: Metropolized independent sampling with comparisons to rejection sampling and importance sampling. Statist Comput. 1996, 6: 113-119.
Geyer C: Monte Carlo Maximum Likelihood for Depend Data. Proceedings of the 23rd Symposium on the Interface. 1991, 156-163.
Hukushima K, Nemoto K: Exchange Monte Carlo Method and Application to Spin Glass Simulations. J Phys Soc Jpn. 1996, 65: 1604-1608.
Earl D, Deem M: Parallel tempering: Theory, applications, and new perspectives. Phys Chem Chem Phys. 2005, 7: 3910-3916.
Zhou R: Exploring the protein folding free energy landscape: Coupling replica exchange method with P3ME/RESPA algorithm. J Molec Graph Mod. 2004, 22 (5): 451-463.
Zhou R, Berne B: Can a continuum solvent model reproduce the free energy landscape of a β -hairpin folding in water?. Proc Natl Acad Sci USA. 2002, 99: 12777-12782.
Zhou R, Berne B: Trp-cage: Folding free energy landscape in explicit water. Proc Natl Acad Sci USA. 2002, 100 (23): 13280-13285.
Garci'a A, Onuchic J: Folding a protein in a computer: An atomic description of the folding/unfolding of protein. Proc Natl Acad Sci USA. 2003, 100: 13898-13903.
Zhou R, Berne B, Germain R: The free energy landscape for β hairpin folding in explicit water. Proc Natl Acad Sci USA. 2001, 98: 14931-14936.
Auer S, Frenkel D: Prediction of absolute crystal-nucleation rate in hard-sphere colloids. Nature. 2001, 409: 1020-1023.
Marinari E, Parisi G, Ruiz-Lorenzo J: Numerical Simulations of Spin Glass Systems. Spin Glasses and Random Fields, Directions in Condensed Matter Physics. Edited by: Young A. 1998, 12: 109-World Scientific
Katzgraber H, Palassini M, Young A: Monte Carlo simulations of spin glasses at low temperatures. Phys Rev B. 2001, 63: 1844221-18442210.
Körner M, Katzgraber H, Hartmann A: Probing tails of energy distributions using importance-sampling in the disorder with a guiding function. Stat Mech. 2006, P04005-
Wilbur W: Accurate Monte Carlo Estimation of Very Small P-Values In Markov Chains. Comp Stat. 1998, 13: 153-168.
Geyer C: Estimating Normalization Constants and Reweighting Mixtures in Markov Chain Monte Carlo. Tech Rep 568. 1994, School of Statistics, University of Minnesota
Meng X, Wong W: Simulating Ratios of Normalization Constants via a Simple Identity: ATheoretical Exploration. Statistica Sinica. 1996, 6: 831-860.
Raftery A, Lewis S: How Many Iterations in the Gibbs Sampler. Bayesian Statistics 4. Edited by: Bernardo J, Berger J, Dawid A, Smith A. 1992, 763-773. Oxford University Press
Cowles M, Carlin B: Markov Chain Monte Carlo Convergence Diagnostics: A Comparative Review. JASA. 1996, 91 (434): 883-904.
StatLib. http://lib.stat.cmu.edu/
Coda R package. http://www.r-project.org/
Gelman A, Rubin D: Inference from iterative simulation using multiple sequences. Stat Sci. 1992, 7: 457-472.
Brooks S, Gelman A: General methods for monitoring convergence of iterative simulations. J Comput Graph Stat. 1998, 7: 434-455.
BEfron: The Jackknife, the Bootstrap and Other Resampling Plans. 1982, New York: SIAM
Robinson A, Robinson L: Distribution of glutamine and asparagine residues and their near neighbours in peptides and proteins. Proc Natl Acad Sci USA. 1991, 88: 8880-8884.
gnuplot. http://www.gnuplot.info/
SWISSPROT. http://www.expasy.org/
NCBI BLAST. http://www.ncbi.nlm.nih.gov/BLAST