Using interesting sequences to interactively build Hidden Markov Models

Data Mining and Knowledge Discovery - Tập 21 - Trang 186-220 - 2010
Szymon Jaroszewicz1
1National Institute of Telecommunications, Warsaw, Poland

Tóm tắt

The paper presents a method of interactive construction of global Hidden Markov Models (HMMs) based on local sequence patterns discovered in data. The method is based on finding interesting sequences whose frequency in the database differs from that predicted by the model. The patterns are then presented to the user who updates the model using their intelligence and their understanding of the modelled domain. It is demonstrated that such an approach leads to more understandable models than automated approaches. Two variants of the problem are considered: mining patterns occurring only at the beginning of sequences and mining patterns occurring at any position; both practically meaningful. For each variant, algorithms have been developed allowing for efficient discovery of all sequences with given minimum interestingness. Applications to modelling webpage visitors behavior and to modelling protein secondary structure are presented, validating the proposed approach.

Tài liệu tham khảo

Agrawal R, Imielinski T, Swami A (1993) Mining association rules between sets of items in large databases. In: ACM SIGMOD conference on management of data, pp 207–216 Asai K, Hayamizu S, Handa K (1993) Prediction of protein secondary structure by the Hidden Markov Model. Bioinformatics 9(2): 141–146 Balasubramanian V (1993) Equivalence and reduction of Hidden Markov Models. A.I. Technical Report 1370, Artificial Intelligence Laboratory, MIT Bouchaffra D, Tan J (2006) Protein fold recognition using a structural Hidden Markov Model. In: International conference on pattern recognition, vol 3. Los Alamitos, CA, pp 186–189 Dunham MH (2003) Data mining, introductory and advanced topics, Part III. http://engr.smu.edu/~mhd/dmbook/part3.ppt Han J, Cheng H, Xin D, Yan X (2007) Frequent pattern mining: current status and future directions. Data Min Knowl Discov 15(1): 55–86 Hand D (2002) Pattern detection and discovery. In: Hand D, Adams N, Bolton R (eds) Pattern detection and discovery. Springer, Berlin, pp 1–12 Hunter L (1993) Molecular biology for computer scientists. In: Hunter L (eds) Artificial intelligence and molecular biology, chap 1. AAAI Press, Menlo Park, pp 1–46 Huo Q, Lee CH (1997) On-line adaptive learning of the continuous density Hidden Markov Model based on approximate recursive Bayes estimate. IEEE Trans Speech Audio Process 5(2): 161–172 Huo Q, Chan C, Lee CH (1995) Bayesian adaptive learning of the parameters of hidden Markov model for speech recognition. IEEE Trans Speech Audio Process 3(5): 334–345 Inoue M, Ueda N (2003) Exploitation of unlabeled sequences in Hidden Markov Models. IEEE Trans Pattern Anal Mach Intell 25: 1570–1581 Jaroszewicz S (2008) Interactive HMM construction based on interesting sequences. In: Proceedings of local patterns to global models (LeGo’08) workshop at the 12th European conference on principles and practice of knowledge discovery in databases (PKDD’08), Antwerp, Belgium, pp 82–91 Jaroszewicz S, Scheffer T (2005) Fast discovery of unexpected patterns in data, relative to a Bayesian network. In: 11th ACM SIGKDD international conference on knowledge discovery and data mining (KDD 2005), Chicago, IL, pp 118–127 Jaroszewicz S, Simovici D (2004) Interestingness of frequent itemsets using Bayesian networks as background knowledge. In: 10th ACM SIGKDD international conference on knowledge discovery and data mining (KDD 2004), Seattle, WA, pp 178–186 Jaroszewicz S, Scheffer T, Simovici D (2009) Scalable pattern mining with Bayesian networks as background knowledge. Data Min Knowl Discov 18(1): 56–100 Ji S, Watson L, Carin L (2009) Semisupervised learning of Hidden Markov Models via a homotopy method. IEEE Trans Pattern Anal Mach Intell 31: 275–287 Laxman S, Sastry P (2005) Discovering frequent episodes and learning Hidden Markov Models: a formal connection. IEEE Trans Knowl Data Eng 17(11): 1505–1517 Laxman S, Sastry P (2006) A survey of temporal data mining. Sadhana 31(2): 173–198 Lee CH, Gauvin JL (1996) Bayesian adaptive learning and map estimation of HMM. In: Lee CH, Soong F, Paliwal K (eds) Automatic speech and speaker recognition: advanced topics. Springer, New York, pp 83–108 Li D, Laurent A, Poncelet P (2007) Mining unexpected sequential patterns and rules. Technical Report RR-07027, Laboratoire d’Informatique de Robotique et de Microélectronique de Montpellier Low-Kam C, Mas A, Teisseire M (2009) Mining for unexpected sequential patterns given a Markov model. http://www.math.univ-montp2.fr/~mas/lmt_siam09.pdf Meyer C (2001) Matrix analysis and applied linear algebra book and solutions manual. SIAM, Philadelphia Prum B, Rodolphe F, de Turckheim E (1995) Finding words with unexpected frequencies in DNA sequences. J R Stat Soc Ser B 57: 205–220 Qian N, Sejnowski T (1988) Predicting the secondary structure of globular proteins using neural network models. J Mol Biol 202: 865–884 Rabiner LR (1989) A tutorial on Hidden Markov Models and selected applications in speech recognition. Proc IEEE 77(2): 257–286 Robbins H (1956) An empirical bayes approach to statistics. In: Proceedings of the third Berkeley symposium on mathematical statistics and probability, vol 1: contributions to the theory of statistics, University of California Press, pp 157–163 Satsangi A, Zaiane O (2007) Contrasting the contrast sets: an alternative approach. In: Eleventh international database engineering and applications symposium (IDEAS 2007), Banff, Canada Spiliopoulou M (1999) Managing interesting rules in sequence mining. In: Proceedings of the third European conference on principles of data mining and knowledge discovery, pp 554–560 Uhlig F (2001) Transform linear algebra. Prentice Hall, Upper Saddle River van Loan C, Golub G (1996) Matrix computations. Johns Hopkins University Press, London Welch L (2003) Hidden Markov Models and the Baum-Welch algorithm. IEEE Inform Theory Soc Newsl 53(4): 1,10–13 Zaiane O, Yacef K, Kay J (2007) Finding top-n emerging sequences to contrast sequence sets. Tech. Rep. TR07-03, Department of Computing Science, University of Alberta, Edmonton, AB, Canada