A discovery system for narrative query graphs: entity-interaction-aware document retrieval

Hermann Kroll¹, Jan Pirklbauer¹, Jan-Christoph Kalo², Morris Kunz¹, Johannes Ruthmann¹, Wolf-Tilo Balke¹
¹ Institute for Information Systems, TU Braunschweig, Braunschweig, Germany
² Knowledge Representation and Reasoning Group, VU Amsterdam, Amsterdam, The Netherlands

Abstract

Finding relevant publications in the scientific domain can be tedious: accessing large-scale document collections often means formulating an initial keyword-based query and then refining it many times to retrieve a sufficiently complete, yet manageable set of documents that satisfies one’s information need. Since keyword-based search limits researchers to formulating their information needs as a set of unconnected keywords, retrieval systems try to guess each user’s intent. In contrast, distilling short narratives of the searchers’ information needs into simple, yet precise entity-interaction graph patterns provides all information needed for a precise search. As an additional benefit, such graph patterns may also feature variable nodes that flexibly allow for different substitutions of entities taking a specified role. An evaluation over the PubMed document collection quantifies the gains in precision for our novel entity-interaction-aware search. Moreover, we conduct expert interviews and a questionnaire to verify the usefulness of our system in practice. This paper extends our previous work by giving a comprehensive overview of the discovery system that realizes narrative query graph retrieval.
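To make the idea of an entity-interaction graph pattern with variable nodes more concrete, the following is a minimal sketch, not the system described in the paper. It assumes that each document has already been reduced to a set of (subject, predicate, object) statements, that a query is a list of such statements in which terms starting with "?" are variable nodes, and that a document matches only if all query statements can be satisfied within it under one consistent substitution. The function names, document ids, entity names, and predicates are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]   # (subject, predicate, object)
Binding = Dict[str, str]        # variable name -> substituted entity


def is_variable(term: str) -> bool:
    """Variable nodes are written with a leading '?' (assumed convention)."""
    return term.startswith("?")


def match_pattern(pattern: Triple, fact: Triple, binding: Binding) -> Optional[Binding]:
    """Extend `binding` so that `pattern` equals `fact`, or return None."""
    new_binding = dict(binding)
    for p_term, f_term in zip(pattern, fact):
        if is_variable(p_term):
            # A variable must keep the same substitution across the whole query.
            if new_binding.setdefault(p_term, f_term) != f_term:
                return None
        elif p_term != f_term:
            return None
    return new_binding


def match_document(query: List[Triple], facts: Set[Triple]) -> List[Binding]:
    """Return every consistent variable substitution for one document."""
    bindings: List[Binding] = [{}]
    for pattern in query:
        extended: List[Binding] = []
        for binding in bindings:
            for fact in sorted(facts):  # sorted for deterministic output
                candidate = match_pattern(pattern, fact, binding)
                if candidate is not None:
                    extended.append(candidate)
        bindings = extended
        if not bindings:
            break
    return bindings


def search(query: List[Triple], collection: Dict[str, Set[Triple]]) -> Dict[str, List[Binding]]:
    """Keep only documents in which the complete query graph matches."""
    results: Dict[str, List[Binding]] = {}
    for doc_id, facts in collection.items():
        bindings = match_document(query, facts)
        if bindings:
            results[doc_id] = bindings
    return results


# Hypothetical toy collection; real statements would come from an extraction step.
COLLECTION: Dict[str, Set[Triple]] = {
    "doc1": {("simvastatin", "treats", "hypercholesterolemia"),
             ("simvastatin", "metabolized_by", "CYP3A4")},
    "doc2": {("metformin", "treats", "diabetes mellitus")},
    "doc3": {("erythromycin", "inhibits", "CYP3A4"),
             ("simvastatin", "treats", "hypercholesterolemia")},
}

# "Which drugs treat hypercholesterolemia and are metabolized by CYP3A4?"
QUERY: List[Triple] = [("?drug", "treats", "hypercholesterolemia"),
                       ("?drug", "metabolized_by", "CYP3A4")]

results = search(QUERY, COLLECTION)
print(results)  # {'doc1': [{'?drug': 'simvastatin'}]}
```

In this toy setting, a keyword query for "simvastatin CYP3A4" would also return doc3, although that document does not state the requested interaction; the graph pattern returns only doc1 and reports which entity was substituted for the variable node.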
