Next generation community assessment of biomedical entity recognition web servers: metrics, performance, interoperability aspects of BeCalm

Springer Science and Business Media LLC - Volume 11 - Pages 1-16 - 2019
Martin Pérez-Pérez1,2,3, Gael Pérez-Rodríguez1,2,3, Aitor Blanco-Míguez1,2,3,4, Florentino Fdez-Riverola1,2,3, Alfonso Valencia5,6,7,8, Martin Krallinger5,6,9, Anália Lourenço1,2,3,10
1Department of Computer Science, ESEI, University of Vigo, Ourense, Spain
2The Biomedical Research Centre (CINBIO), Vigo, Spain
3SING Research Group, Galicia Sur Health Research Institute (ISS Galicia Sur), SERGAS-UVIGO, Vigo, Spain
4Department of Microbiology and Biochemistry of Dairy Products, Instituto de Productos Lácteos de Asturias (IPLA), Consejo Superior de Investigaciones Científicas (CSIC), Villaviciosa, Spain
5Life Science Department, Barcelona Supercomputing Centre (BSC-CNS), Barcelona, Spain
6Joint BSC-IRB-CRG Program in Computational Biology, Parc Científic de Barcelona, Barcelona, Spain
7Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, Spain
8Spanish Bioinformatics Institute INB-ISCIII ES-ELIXIR, Madrid, Spain
9Biological Text Mining Unit, Structural Biology and Biocomputing Programme, Spanish National Cancer Research Centre, Madrid, Spain
10Centre of Biological Engineering (CEB), University of Minho, Braga, Portugal

Abstract

Shared tasks and community challenges are key instruments for promoting research and collaboration and for determining the state of the art of biomedical and chemical text mining technologies. Traditionally, such tasks compared automatically generated results against a so-called Gold Standard dataset of manually labelled text, without regard to the efficiency and robustness of the underlying implementations. With the rapid growth of unstructured data collections, including patent databases and particularly the scientific literature, there is a pressing need to generate, assess and expose robust big data text mining solutions that semantically enrich documents in real time.

To address this need, a novel track called “Technical interoperability and performance of annotation servers” was launched under the umbrella of the BioCreative text mining evaluation effort. The aim of this track was to enable the continuous assessment of technical aspects of text annotation web servers, specifically online biomedical named entity recognition systems of interest for medicinal chemistry applications. A total of 15 out of 26 registered teams successfully implemented online annotation servers. They returned predictions in predefined formats during a two-month period and were evaluated through the BeCalm evaluation platform, developed specifically for this track. The track encompassed three levels of evaluation: data format considerations, technical metrics and functional specifications. Participating annotation servers were implemented in seven different programming languages and covered 12 general entity types. The continuous evaluation of server responses covered testing periods of low activity and of moderate to high activity, encompassing 4,092,502 requests overall from three different document provider settings. The median response time was below 3.74 s, with a median of 10 annotations/document. Most of the servers showed great reliability and stability, being able to process over 100,000 requests in a 5-day period.

The presented track was a novel experimental task that systematically evaluated the technical performance aspects of online entity recognition systems. It attracted the interest of a significant number of participants. Future editions of the competition will address the ability to process documents in bulk as well as to annotate full-text documents.
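
As a rough illustration of the kind of technical metrics reported above (median response time and median annotations per document), the following minimal sketch aggregates a small request log. It is not the BeCalm implementation: the log records, the field layout and the `summarize` function are hypothetical, and the sketch assumes one document per request.

```python
from statistics import median

# Hypothetical request log, one entry per request (assumed: one document each):
# (server_id, response_time_in_seconds, annotations_returned)
request_log = [
    ("server_a", 1.20, 8),
    ("server_a", 0.95, 12),
    ("server_b", 3.10, 10),
    ("server_b", 4.50, 15),
    ("server_b", 2.75, 9),
]


def summarize(log):
    """Return the median response time and median annotations per document."""
    times = [seconds for _, seconds, _ in log]
    counts = [annotations for _, _, annotations in log]
    return {
        "median_response_s": median(times),
        "median_annotations": median(counts),
    }


summary = summarize(request_log)
print(summary)
```

Medians are used here, as in the abstract, because a few very slow responses would otherwise dominate a mean.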
