XML data clustering
Abstract
Recent years have seen a proliferation of approaches for clustering XML documents and schemas based on their structure and content. The sheer number of approaches reflects the variety of applications that require clustering of XML data; these applications need data grouped by similar content, tags, paths, structures, and semantics. In this article, we first outline the application contexts in which clustering is useful. We then survey the approaches proposed so far, organizing them by the abstract representation of the data (instances or schemas), the similarity measure adopted, and the clustering algorithm used. Our aim is to draw a taxonomy in which current approaches can be classified and compared, and to offer an integrated view that is useful when comparing XML data clustering approaches, developing a new clustering algorithm, or implementing an XML clustering component. Finally, the article describes future trends and research issues that remain open.