Detecting referential inconsistencies in electronic CV datasets

Springer Science and Business Media LLC - Volume 23 - Pages 1-11 - 2017
Ivison C. Rubim1, Vanessa Braganholo2
1NCE, Federal University of Rio de Janeiro (UFRJ), Rio de Janeiro, Brazil
2Institute of Computing, Fluminense Federal University (UFF), Niterói, Brazil

Abstract

One way to measure a country's scientific progress is to evaluate the curricula vitae (CVs) of its researchers, and Brazil is no exception. The Lattes Platform is an information system whose primary objective is to provide a single repository for the CVs of Brazilian researchers. It is increasingly becoming the main source of information about the Brazilian community of researchers, students, managers, and other actors in the national system of science, technology, and innovation. However, the integrity of this important tool for gauging national bibliographic production may be compromised by ambiguities or referential inconsistencies in coauthorship citations. A first step towards solving this problem is to identify such inconsistencies. To that end, we propose a heuristic-based approach that uses similarity search to match papers across the CVs of coauthors. We then use this technique to analyze over 2000 CVs of researchers from a given institution, retrieved from the Lattes Platform. The results indicate that 18.98% of the analyzed publications present referential inconsistencies, a significant share for a dataset that is supposed to be correct and trustworthy.
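The idea of matching papers across coauthors' CVs by similarity search can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state which similarity function, heuristics, or threshold the paper uses, so the sketch swaps in Python's standard `difflib.SequenceMatcher`, assumes a threshold of 0.9, and assumes that an inconsistency means a paper listed in one researcher's CV has no sufficiently similar title in the CV of a coauthor it names. The data layout and the names `normalize`, `similarity`, `find_inconsistencies`, and `sample_cvs` are hypothetical.

```python
import re
from difflib import SequenceMatcher


def normalize(title: str) -> str:
    """Lowercase and replace runs of non-alphanumeric characters with a space."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two normalized titles (difflib ratio)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def find_inconsistencies(cvs, threshold=0.9):
    """Report papers that a coauthor's CV does not appear to list.

    cvs maps a researcher name to a list of papers, each a dict with a
    "title" string and a "coauthors" list of names (assumed layout).
    Returns (owner, coauthor, title) triples with no match above threshold.
    """
    unmatched = []
    for owner, papers in cvs.items():
        for paper in papers:
            for coauthor in paper["coauthors"]:
                # Only coauthors whose CV is in the dataset can be checked.
                if coauthor == owner or coauthor not in cvs:
                    continue
                best = max(
                    (similarity(paper["title"], other["title"])
                     for other in cvs[coauthor]),
                    default=0.0,
                )
                if best < threshold:
                    unmatched.append((owner, coauthor, paper["title"]))
    return unmatched


# Hypothetical toy data: Bruno's CV omits one paper that Ana's CV lists.
sample_cvs = {
    "Ana": [
        {"title": "Detecting Duplicates in Digital Libraries",
         "coauthors": ["Bruno"]},
        {"title": "A Survey of Schema Matching", "coauthors": ["Bruno"]},
    ],
    "Bruno": [
        {"title": "Detecting duplicates in digital libraries.",
         "coauthors": ["Ana"]},
    ],
}

print(find_inconsistencies(sample_cvs))
```

In this toy dataset the first title matches despite differences in case and punctuation, while the second is reported because Bruno's CV has no similar entry. A real dataset would also need author name matching, which this sketch sidesteps by assuming exact names.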
