Faster exact distributions of pattern statistics through sequential elimination of states

Annals of the Institute of Statistical Mathematics - Tập 69 - Trang 231-248 - 2015
Donald E. K. Martin1, Laurent Noé2
1Department of Statistics, North Carolina State University, Raleigh, USA
2CRIStAL (UMR 9189 Lille University/CNRS), INRIA Lille Nord-Europe, Villeneuve d’Ascq, France

Tóm tắt

When using an auxiliary Markov chain (AMC) to compute sampling distributions, the computational complexity is directly related to the number of Markov chain states. For certain complex pattern statistics, minimal deterministic finite automata (DFA) have been used to facilitate efficient computation by reducing the number of AMC states. For example, when statistics of overlapping pattern occurrences are counted differently than non-overlapping occurrences, a DFA consisting of prefixes of patterns extended to overlapping occurrences has been generated and then minimized to form an AMC. However, there are situations where forming such a DFA is computationally expensive, e.g., with computing the distribution of spaced seed coverage. In dealing with this problem, we develop a method to obtain a small set of states during the state generation process without forming a DFA, and show that a huge reduction in the size of the AMC can be attained.

Tài liệu tham khảo

Aho, A. V., Corasick, M. J. (1975). Efficient string matching: An aid to bibliographic search. Communications of the ACM, 18(6), 333–340. Aston, J. A. D., Martin, D. E. K. (2007). Distributions associated with general runs and patterns in hidden Markov models. Annals of Applied Statistics, 1(2), 585–611. Balakrishnan, N., Koutras, M. V. (2002). Runs and scans with applications. New York: Wiley. Bassino, F., Clement, J., Fayolle, J., Nicodème, P. (2008). Constructions for clumps statistics. Discrete Mathematics and Theoretical Computer Science (DMTCS), AI, 179–194. Benson, G. (1999). Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Research, 27, 573–580. Benson, G., Mak, D. Y. F. (2009). Exact distribution of a spaced seed statistic for DNA homology detection. String processing and information retrieval, Lecture Notes in Computer Science, 5280, 282–293. Buhler, J., Keich, U., Sun, Y. (2005). Designing seeds for similarity search in genomic DNA. Journal of Computer and System Sciences, 70, 342–363. Ebneshahrashoob, M., Gao, T., Wu, M. (2005). An efficient algorithm for exact distribution of discrete scan statistic. Methodology and Computing in Applied Probability, 7(4), 459–471. Fu, J. C. (1996). Distribution theory of runs and patterns associated with a sequence of multi-state trials. Statistica Sinica, 6, 957–974. Fu, J. C., Koutras, M. V. (1994). Distribution theory of runs: a Markov chain approach. Journal of the American Statistical Association, 89, 1050–1058. Fu, J. C., Lou, W. Y. (2003). Distribution theory of runs and patterns and its applications. Singapore: World Scientific Publishing Co. Fu, J. C., Lou, W. Y. W., Bai, Z.-D., Li, G. (2002). The exact and limiting distributions of the number of successes in success runs within a sequence of Markov-dependent two-state trials. Annals of the Institute of Statistical Mathematics, 54(4), 719–730. Hopcroft, J. E. (1971). An \(n\) log \(n\) algorithm for minimizing states in a finite automaton. In Z. Kohavi & A. Paz (Eds.), Theory of Machines and Computations (pp. 189–196). New York: Academic Press. Hopcroft, J. E., Motwani, R., Ullman, J. D. (2001). Introduction to automata theory, languages, and computation. New York: Addison-Wesley. Keich, U., Li, M., Ma, B., Tromp, J. (2004). On spaced seeds for similarity search. Discrete Applied Mathematics, 138(3), 253–263. Koutras, M. V., Alexandrou, V. A. (1995). Runs, scans and urn models: A unified Markov chain approach. Annals of the Institute of Statistical Mathematics, 47, 743–766. Kucherov G., Noé, L., Roytberg, M. (2007). Subset seed automaton. In Implementation and application of automata, Lecture Notes in Computer Science, Volume 4783 (pp. 180–191). Ledent, S., Robin, S. (2005). Checking homogeneity of motifs’ distribution in heterogenous sequences. Journal of Computational Biology, 12, 672–685. Lladser, M., Betterton, M. D., Knight, R. (2008). Multiple pattern matching: a Markov chain approach. Journal of Mathematical Biology, 56(1–2), 51–92. Lou, W. Y. W. (2003). The exact distribution of the \(k\)-tuple statistic for sequence homology. Statistics and Probability Letters, 61, 51–59. Ma, B., Tromp, J., Li, M. (2002). Patternhunter-faster and more sensitive homology search. Bioinformatics, 18(3), 440–445. Marshall, T. and Rahmann, S. (2008). Probabilistic arithmetic automata and their application to pattern matching statistics. Lecture Notes In Computer Science; Vol. 5029, Proceedings of the 19th Annual Symposium on Combinatorial Pattern Matching (pp. 95–106). Martin, D. E. K. (2006). The exact joint distribution of the sum of heads and apparent size statistics of a “tandem repeats finder” algorithm. Bulletin of Mathematical Biology, 68, 2353–2364. Martin, D. E. K. (2008). Application of auxiliary Markov chains to start-up demonstration tests. European Journal of Operational Research, 184(2), 574–583. Martin, D. E. K. (2013). Coverage of spaced seeds as a measure of clumping. In American Statistical (Ed.), Association 2013 Proceedings of the Section on Computational Statistics. Alexandria, VA: American Statistical Association. Martin, D. E. K. (2014). P-values for the discrete scan statistic through slack variables. Communications in Statistics, Simulation and Computation. doi:10.1080/03610918.2013.777457. Martin, D. E. K., Aston, J. A. D. (2008). Waiting time distribution of generalized later patterns. Computational Statistics and Data Analysis, 52, 4879–4890. Martin, D. E. K., Aston, J. A. D. (2013). Distributions of statistics of hidden state sequences through the sum-product algorithm. Methodology and Computing in Applied Probability, 15(4), 897–918. Martin, D. E. K., Coleman, D. A. (2011). Distributions of clump statistics for a collection of words. Journal of Applied Probability, 48, 1049–1059. Mealy, G. H. (1955). A method for synthesizing sequential circuits. Bell System Technical Journal, 34(5), 1045–1079. Moore E.F. (1956) Gedanken-experiments on sequential machines. Automata Studies: Annals of Mathematical Studies, \(34\), 129–153. Princeton, N.J.: Princeton University Press. Noé, L., Kucherov, G. (2004). Improved hit criteria for DNA local alignment. BMC Bioinformatics, 5, 149. Noé, L., Martin, D. E. K. (2014). A coverage criterion for spaced seeds and its applications to SVM string kernels and \(k\)-mer distances. Journal of Computational Biology, 21(12), 947–963. Nuel, G. (2008). Pattern Markov chains: Optimal Markov chain embedding through deterministic finite automata. Journal of Applied Probability, 45(1), 226–243. Parzen, E. (1962). Stochastic Processes. San Francisco: Holden-Day Inc. Ribeca, P., Raineri, E. (2008). Faster exact Markovian probability functions for motif occurrences: a DFA-only approach. Bioinformatics, 24(24), 2839–2848. Robin, S., Rodolphe, F., Schbath, S. (2005). DNA, words and models. United Kingdom: Cambridge University Press. Tewari, A., Srivastava, U., Gupta, P. (2002). A parallel DFA minimization algorithm. In High performance computing Hi PC, Lecture Notes in Computer Science (Vol. 2552, pp. 34–40).