Non-binary evaluation measures for big data integration

Tomer Sagi (1), Avigdor Gal (2)
(1) University of Haifa, Haifa, Israel
(2) Technion–Israel Institute of Technology, Haifa, Israel

Abstract

The evolution of data accumulation, management, analytics, and visualization has led to the coining of the term big data, which challenges the task of data integration. This task, common to any matching problem in computer science, involves generating alignments between structured data in an automated fashion. Historically, set-based measures, based upon binary similarity matrices (match/non-match), have dominated evaluation practices of matching tasks. However, in the presence of big data, such measures no longer suffice. In this work, we propose evaluation methods for non-binary matrices as well. Non-binary evaluation is formally defined, together with several new non-binary measures, using a vector space representation of the matching outcome. We provide empirical analyses of the usefulness of non-binary evaluation and show its superiority over its binary counterparts in several problem domains.
