DptOIE: a Portuguese open information extraction based on dependency analysis

Artificial Intelligence Review - Volume 56 - Pages 7015-7046 - 2022
Leandro Oliveira¹, Daniela Barreiro Claro¹, Marlo Souza¹
¹ FORMAS Research Group, Federal University of Bahia (UFBA)–Institute of Computing (IC), Salvador, Brazil

Abstract

It is estimated that more than 80% of the information on the Web is stored in textual form. As a result, it has become increasingly difficult for humans to sort through the daily influx of data and extract useful information from it. To automate this process, open information extraction (OIE) methods have been proposed, which extract facts from large textual bases. While most OIE methods were initially developed for the English language, recent literature has increasingly recognized the importance of developing methods for other languages, such as Portuguese. OIE methods based on hand-crafted rules and shallow syntactic analysis have achieved good performance for English. Nevertheless, methods based on similar approaches have not achieved equivalent success for Portuguese. We believe that the shallow syntactic patterns previously explored in the literature do not cover important aspects of Portuguese syntax. For this reason, we propose DptOIE, a method based on a new set of syntax-based rules that uses dependency parsers and a depth-first search (DFS) algorithm for OIE, together with a set of grammar-based rules to cover specific syntactic phenomena of the language. DptOIE was compared against state-of-the-art OIE systems for the Portuguese language, obtaining favorable results both in our empirical evaluation and in the IberLEF evaluation track of OIE systems for the Portuguese language. Furthermore, we believe our method can be easily adapted to other Romance languages related to Portuguese.
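To make the idea of extraction by depth-first search over a dependency tree more concrete, the following minimal Python sketch may help. It is not the DptOIE rule set: the `Token` class, the helper names, the single subject-verb-object rule, and the hand-annotated example sentence are all assumptions made for illustration, using Universal Dependencies-style labels (`nsubj`, `obj`, `det`, `amod`). A real system would obtain the tree from a dependency parser and apply many more rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One word of a dependency-parsed sentence (hypothetical structure)."""
    idx: int      # 1-based position in the sentence
    form: str     # surface form
    head: int     # index of the head token, 0 for the root
    deprel: str   # dependency relation to the head


def children(tokens, head_idx):
    """Return the direct dependents of the token at head_idx."""
    return [t for t in tokens if t.head == head_idx]


def subtree(tokens, root):
    """Collect every token under root with an iterative depth-first search."""
    collected = []
    stack = [root]
    while stack:
        tok = stack.pop()
        collected.append(tok)
        stack.extend(children(tokens, tok.idx))
    # Restore sentence order so the phrase reads naturally.
    return sorted(collected, key=lambda t: t.idx)


def phrase(tokens, root):
    """Render the subtree rooted at root as a string."""
    return " ".join(t.form for t in subtree(tokens, root))


def extract_triples(tokens):
    """Illustrative rule: pair each nsubj subtree with each obj subtree."""
    triples = []
    for verb in tokens:
        deps = children(tokens, verb.idx)
        subjects = [d for d in deps if d.deprel == "nsubj"]
        objects = [d for d in deps if d.deprel == "obj"]
        for subj in subjects:
            for obj in objects:
                triples.append(
                    (phrase(tokens, subj), verb.form, phrase(tokens, obj))
                )
    return triples


# Hand-annotated parse of "Maria comprou um carro novo"
# ("Maria bought a new car"); an assumption, not parser output.
SENTENCE = [
    Token(1, "Maria", 2, "nsubj"),
    Token(2, "comprou", 0, "root"),
    Token(3, "um", 4, "det"),
    Token(4, "carro", 2, "obj"),
    Token(5, "novo", 4, "amod"),
]

if __name__ == "__main__":
    print(extract_triples(SENTENCE))
    # [('Maria', 'comprou', 'um carro novo')]
```

Note that the post-nominal adjective "novo" is picked up because the traversal follows the whole subtree of "carro" rather than a fixed window of adjacent words, which is one reason a dependency-based approach can capture constructions that shallow patterns might miss.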
