Assessment of NER solutions against the first and second CALBC Silver Standard Corpus

Journal of Biomedical Semantics - Volume 2, Issue 5 - Pages 1-12 - 2011

Rebholz-Schuhmann, Dietrich¹, Yepes, Antonio Jimeno¹, Li, Chen¹, Kafkas, Senay¹, Lewin, Ian¹, Kang, Ning², Corbett, Peter³, Milward, David³, Buyko, Ekaterina⁴, Beisswanger, Elena⁴, Hornbostel, Kerstin⁴, Kouznetsov, Alexandre⁵, Witte, René⁶, Laurila, Jonas B⁵, Baker, Christopher JO⁵, Kuo, Cheng-Ju⁷, Clematide, Simone⁸, Rinaldi, Fabio⁸, Farkas, Richárd⁹, Móra, György⁹, Hara, Kazuo¹⁰, Furlong, Laura I¹¹, Rautschka, Michael¹¹, Neves, Mariana Lara¹², Pascual-Montano, Alberto¹², Wei, Qi¹³, Collier, Nigel¹³, Chowdhury, Md Faisal Mahbub¹⁴, Lavelli, Alberto¹⁴, Berlanga, Rafael¹⁵, Morante, Roser¹⁶, Van Asch, Vincent¹⁶, Daelemans, Walter¹⁶, Marina, José Luís¹⁷, van Mulligen, Erik², Kors, Jan², Hahn, Udo⁴

¹EMBL Outstation, European Bioinformatics Institute, Cambridge, U.K.

²Dept. of Medical Informatics, Erasmus University Medical Center, Rotterdam, NL

³Linguamatics Ltd, St. John's Innovation Centre, Cambridge, U.K.

⁴Language & Information Engineering (JULIE) Lab, Friedrich-Schiller-Universität, Jena, Germany

⁵Dept. of Computer Science & Applied Statistics, University of New Brunswick, Canada

⁶Dept. of Computer Science & Software Engineering, Concordia University, Montreal, Canada

⁷Institute of Information Science, Academia Sinica, Taipei, Taiwan

⁸University of Zürich, Switzerland

⁹Research Group on Artificial Intelligence, Hungarian Academy of Sciences, Hungary

¹⁰Nara Institute of Science and Technology, Nara, Japan

¹¹Research Programme on Biomedical Informatics (GRIB), IMIM (Hospital del Mar Research Institute), Universitat Pompeu Fabra, Barcelona, Spain

¹²National Center for Biotechnology-CSIC, Madrid, Spain

¹³National Institute of Informatics, Tokyo, Japan

¹⁴Fondazione Bruno Kessler, Trento, Italy

¹⁵Universitat Jaume I, Spain

¹⁶CLiPS, University of Antwerp, Belgium

¹⁷Complutense University of Madrid, Spain

Abstract

Competitions in text mining have been used to measure the performance of automatic text processing solutions against a manually annotated gold standard corpus (GSC). Preparing a GSC is time-consuming and costly, and the final corpus consists of at most a few thousand documents annotated with a limited set of semantic groups. To overcome these shortcomings, the CALBC project partners (PPs) have produced the first version of the Silver Standard Corpus (SSC-I), a large-scale annotated biomedical corpus covering four semantic groups, by harmonising the annotations from automatic text mining solutions. The four semantic groups are chemical entities and drugs (CHED), genes and proteins (PRGE), diseases and disorders (DISO) and species (SPE). This corpus was used for the First CALBC Challenge, which asked the participants to annotate the corpus with their own text processing solutions.

All four PPs from the CALBC project and, in addition, 12 challenge participants (CPs) contributed annotated data sets for an evaluation against the SSC-I. CPs could either ignore the training data and deliver the annotations from their own existing annotation system, or train a machine-learning approach on the provided pre-annotated data. In general, the annotation solutions performed worse on entities from the categories CHED and PRGE than on entities categorised as DISO and SPE. The best performance across all semantic groups was achieved by two annotation solutions that had been trained on the SSC-I. The data sets from participants who had not used the annotated SSC-I data for training were used to generate the harmonised Silver Standard Corpus II (SSC-II). The performance of the participants' solutions was then measured against the SSC-II, and it was again better for DISO and SPE than for CHED and PRGE.
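
The harmonisation step mentioned above can be illustrated with a minimal sketch. The abstract does not state the harmonisation rule, so the sketch assumes simple voting over exact spans: an annotation enters the harmonised corpus when at least `min_votes` systems agree on document, boundaries and semantic group. The names `Annotation`, `harmonise` and `min_votes` are hypothetical and are not taken from the CALBC tooling.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Annotation(NamedTuple):
    """One entity mention (hypothetical representation)."""
    doc_id: str
    start: int   # character offset where the mention starts
    end: int     # character offset where the mention ends
    group: str   # one of "CHED", "PRGE", "DISO", "SPE"


def harmonise(systems: Iterable[set[Annotation]], min_votes: int = 2) -> set[Annotation]:
    """Keep annotations proposed by at least `min_votes` systems.

    Assumption: agreement means an exact match on document, span and group.
    """
    votes: Counter = Counter()
    for annotations in systems:
        # Each system contributes a set, so it votes at most once per annotation.
        votes.update(annotations)
    return {ann for ann, n in votes.items() if n >= min_votes}


# Three toy systems annotating the same document.
system_a = {Annotation("d1", 0, 7, "CHED"), Annotation("d1", 10, 14, "PRGE")}
system_b = {Annotation("d1", 0, 7, "CHED"), Annotation("d1", 20, 25, "DISO")}
system_c = {Annotation("d1", 0, 7, "CHED"), Annotation("d1", 10, 14, "PRGE")}

silver = harmonise([system_a, system_b, system_c], min_votes=2)
```

Here `silver` keeps the CHED and PRGE mentions, which at least two systems agree on, and drops the DISO mention proposed by a single system. Raising `min_votes` trades coverage for agreement.
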
The SSC-I delivers a large set of annotations (1,121,705) for a large number of documents (100,000 Medline abstracts). The annotations cover four different semantic groups and are sufficiently homogeneous to be reproduced with a trained classifier, leading to an average F-measure of 85%. Benchmarking the CPs' annotation solutions against the SSC-II yields better performance than benchmarking them against the SSC-I.
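
As an illustration of how an F-measure against a silver standard can be computed, the following is a minimal sketch. It assumes exact-match scoring of (document, start, end, group) tuples, which is an assumption made here because the abstract does not specify the matching scheme. The names `Score`, `score` and `score_by_group` are hypothetical.

```python
from typing import NamedTuple

GROUPS = ("CHED", "PRGE", "DISO", "SPE")


class Score(NamedTuple):
    precision: float
    recall: float
    f_measure: float


def score(predicted: set, reference: set) -> Score:
    """Precision, recall and balanced F-measure under exact matching."""
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    total = precision + recall
    # Harmonic mean of precision and recall; 0.0 when both are zero.
    f_measure = 2 * precision * recall / total if total else 0.0
    return Score(precision, recall, f_measure)


def score_by_group(predicted: set, reference: set) -> dict:
    """Score each semantic group separately; the group is the 4th tuple field."""
    return {
        group: score(
            {a for a in predicted if a[3] == group},
            {a for a in reference if a[3] == group},
        )
        for group in GROUPS
    }


# Toy data: tuples of (doc_id, start, end, group).
reference = {
    ("d1", 0, 7, "CHED"),
    ("d1", 10, 14, "PRGE"),
    ("d1", 30, 35, "SPE"),
    ("d1", 40, 45, "SPE"),
}
predicted = {
    ("d1", 0, 7, "CHED"),
    ("d1", 10, 14, "PRGE"),
    ("d1", 20, 25, "DISO"),
}

overall = score(predicted, reference)
per_group = score_by_group(predicted, reference)
```

In this toy case two of three predictions are correct and two of four reference annotations are found, so precision is 2/3, recall is 1/2 and the F-measure is 4/7. Scoring per group makes it possible to compare, for example, CHED and PRGE against DISO and SPE.
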

