XML data clustering

ACM Computing Surveys - Volume 43, Issue 4 - Pages 1-41 - 2011

Alsayed Algergawy¹, Marco Mesiti², Richi Nayak³, Gunter Saake⁴

¹Magdeburg University, Magdeburg, Germany

²Univ. of Milano, Milano, Italy

³Queensland University of Technology, Brisbane, Australia

⁴Magdeburg University, Magdeburg, Germany

Abstract

In the last few years we have observed a proliferation of approaches for clustering XML documents and schemas based on their structure and content. This large number of approaches stems from the different applications that require the clustering of XML data. These applications need data grouped by similar contents, tags, paths, structures, and semantics. In this article, we first outline the application contexts in which clustering is useful; we then survey the approaches proposed so far according to the abstract representation of data (instances or schemas), the similarity measure adopted, and the clustering algorithm. We aim to draw a taxonomy in which the current approaches can be classified and compared, and to introduce an integrated view that is useful when comparing XML data clustering approaches, when developing a new clustering algorithm, and when implementing an XML clustering component. Finally, the article describes future trends and research issues that still need to be addressed.
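The three components named in the abstract (data representation, similarity measure, clustering algorithm) can be illustrated with a minimal sketch. It is not taken from the article and is not one of the surveyed methods. It assumes, purely for illustration, that each document is represented as a set of tag paths, that similarity is the Jaccard coefficient, and that clustering is a single-pass threshold scheme. The function names, the sample documents, and the threshold of 0.5 are all hypothetical.

```python
import xml.etree.ElementTree as ET


def tag_paths(xml_text):
    """Representation (assumed): the set of root-to-node tag paths of a document."""
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        paths.add(path)
        for child in node:
            walk(child, path)

    walk(root, "")
    return paths


def jaccard(a, b):
    """Similarity measure (assumed): Jaccard coefficient of two path sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster(docs, threshold=0.5):
    """Clustering (assumed): single pass; a document joins the first cluster
    whose representative (its first member) is at least `threshold` similar."""
    clusters = []  # each cluster is a list of (name, path_set) pairs
    for name, text in docs.items():
        paths = tag_paths(text)
        for members in clusters:
            if jaccard(paths, members[0][1]) >= threshold:
                members.append((name, paths))
                break
        else:
            # No existing cluster was similar enough, so start a new one.
            clusters.append([(name, paths)])
    return [[name for name, _ in members] for members in clusters]


# Hypothetical sample documents.
docs = {
    "book1": "<book><title/><author/></book>",
    "book2": "<book><title/><author/><year/></book>",
    "paper1": "<article><title/><journal/></article>",
}
result = cluster(docs)
print(result)
```

Here the two book documents share three of their four distinct paths, so they fall into one cluster, while the article document shares none and starts its own. Swapping any one of the three functions for another representation, measure, or algorithm leaves the other two unchanged, which is the kind of comparison a taxonomy along these three axes supports.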
