ATreeGrep: approximate searching in unordered trees

D. Shasha1, J.T.L. Wang2, Huiyuan Shan3, Kaizhong Zhang4
1Department of Computer Science, Courant Institute of Mathematical Sciences, New York University, New York, NY, USA
2College of Computing Sciences of Technology, New Jersey Institute of Technology, Newark, NJ, USA
3Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, USA
4Department of Computer Science, University of Western Ontario, London, ONT, Canada

Tóm tắt

An unordered labeled tree is a tree in which each node has a string label and the parent-child relationship is significant, but the order among siblings is unimportant. This paper presents an approach to the nearest neighbor search problem for these trees. Given a database D of unordered labeled trees and a query tree Q, the goal is to find those trees in D that "approximately" contain Q. Our approach is based on storing the paths of the trees in a suffix array and then counting the number of mismatching paths between the query tree and a data tree. To speed up a search, we use a hash-based technique to filter out unqualified data trees at an early stage of the search. Experimental results obtained by running our techniques on phylogenetic trees and synthetic data demonstrate the good performance of the proposed approach. We also discuss the use of our work in XML and scientific database management.

Từ khóa

#XML #Computer science #Nearest neighbor searches #Phylogeny #Decision trees #Change detection algorithms #Object oriented databases #Filters #Algorithm design and analysis #Object oriented modeling

Tài liệu tham khảo

10.1109/ICDE.2001.914874 cracraft, 2000, Report from NSF Workshop 10.1109/34.935845 jaakkola, 1996, Report C-1996-83 jagadish, 2001, TAX: A tree algebra for XML, Proc of the Workshop on Data Base Programming Languages kannan, 1990, Determining the evolutionary tree, Proceedings of the 1st Annual ACM-SIAM Symposium on Discrete Algorithms, 475 leung, 1993, The AQUA data model and algebra, Proc of the Workshop on Data Base Programming Languages, 157 manber, 1990, Suffix arrays: A new method for on-line string searches, Proceedings of the 1st Annual ACM-SIAM Symposium on Discrete Algorithms, 319 noetzel, 1983, An analysis of the general tree-editing problem, Time Warps String Edits and Macromolecules The Theory and Practice of Sequence Comparison, 237 10.1109/SSDM.2002.1029699 10.1016/0020-0190(92)90136-J buneman, 1997, Adding structure to unstructured data, Proceedings of the 6th International Conference on Database Theory, 336 10.1006/jagm.1994.1003 10.1145/299432.299457 chang, 2001, Mining the World Wide Web An Information Search Approach, 10.1007/978-1-4615-1639-2 10.2307/2406441 10.1145/253260.253266 10.1109/ICDE.1998.655752 10.1145/375663.375730 chawathe, 1996, Change detection in hierarchically structured information, Proceedings of the ACM SIGMOD International Conference on Management of Data, 493, 10.1145/235968.233366 10.1093/sysbio/21.4.390 10.1109/ICDE.1995.380405 10.1145/191839.191863 subramanian, 1993, Ordered types in the AQUA data model, Proc of the Workshop on Data Base Programming Languages, 115 wang, 1999, Pattern Discovery in Biomolecular Data Tools Techniques and Applications, 10.1093/oso/9780195119404.001.0001 10.1109/69.298173 10.1145/135239.135244 10.1109/34.709622