Natural language processing for under-resourced languages: Developing a Welsh natural language toolkit

Computer Speech & Language - Volume 72 - Page 101311 - 2022
Daniel Cunliffe¹, Andreas Vlachidis², Daniel Williams¹, Douglas Tudhope¹
¹ School of Computing and Mathematical Sciences, University of South Wales, Trefforest, CF37 1DL, UK
² Department of Information Studies, University College London, Gower Street, London WC1E 6BT, UK
