A system for identifying named entities in biomedical text: how results from two evaluations reflect on both the system and the evaluations

Comparative and Functional Genomics - Volume 6, Issue 1-2 - Pages 77-85 - 2005
Shipra Dingare1, Malvina Nissim1, Jenny Rose Finkel2, Christopher D. Manning2, Claire Grover1
1Institute for Communicating and Collaborative Systems, University of Edinburgh, 2 Buccleuch Place, Edinburgh EH8 9LW, UK
2Department of Computer Science, Stanford University, Gates Building 1A, 353 Serra Mall, Stanford CA 94305-9010, USA

Abstract

We present a maximum entropy‐based system for identifying named entities (NEs) in biomedical abstracts and report its performance in the only two comparative evaluations of biomedical named entity recognition (NER) held to date, namely BioCreative and Coling BioNLP. Our system obtained an exact match F‐score of 83.2% in the BioCreative evaluation and 70.1% in the BioNLP evaluation. We discuss our system in detail, including its rich use of local features, its attention to correct boundary identification, its innovative use of external knowledge resources (including parsing and web searches), and its rapid adaptation to new NE sets. We also discuss in depth the problems with data annotation in the evaluations that caused the final performance to be lower than optimal. Copyright © 2005 John Wiley & Sons, Ltd.
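
The exact match F‐score quoted above can be illustrated with a minimal sketch. It assumes entities are represented as (start token, end token, type) triples and that a prediction counts as correct only when all three agree with a gold annotation. The function name, the span representation, and the toy data are hypothetical and are not taken from either evaluation's official scorer.

```python
from typing import Set, Tuple

# Hypothetical span representation: (start token, end token, entity type).
Span = Tuple[int, int, str]


def exact_match_f_score(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, F-score) under exact-match scoring.

    A predicted entity is a true positive only if its boundaries and
    type are identical to those of a gold entity.
    """
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Toy data (invented for illustration): the second prediction is one
# token too long, so it earns no credit under exact matching.
gold = {(0, 2, "protein"), (5, 6, "DNA"), (9, 11, "protein")}
predicted = {(0, 2, "protein"), (5, 7, "DNA"), (9, 11, "protein")}
precision, recall, f_score = exact_match_f_score(gold, predicted)
```

Under this kind of scoring a single boundary error is penalised twice, once as a missed gold entity and once as a spurious prediction, which is one reason correct boundary identification matters so much for the reported scores.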
