XML data clustering

ACM Computing Surveys - Volume 43, Issue 4 - Pages 1-41 - 2011
Alsayed Algergawy1, Marco Mesiti2, Richi Nayak3, Gunter Saake4
1Magdeburg University, Magdeburg, Germany
2Univ. of Milano, Milano, Italy
3Queensland University of Technology, Brisbane, Australia
4Magdeburg University, Magdeburg, Germany

Abstract

In the last few years we have observed a proliferation of approaches for clustering XML documents and schemas based on their structure and content. So many approaches exist because different applications require XML data to be clustered. These applications need data grouped by similar contents, tags, paths, structures, and semantics. In this article, we first outline the application contexts in which clustering is useful, then survey the approaches proposed so far according to the abstract representation of the data (instances or schemas), the similarity measure identified, and the clustering algorithm. In this presentation, we aim to draw a taxonomy in which the current approaches can be classified and compared. We aim to introduce an integrated view that is useful when comparing XML data clustering approaches, when developing a new clustering algorithm, and when implementing an XML clustering component. Finally, the article describes future trends and research issues that still need to be addressed.
