An integrated pipeline model for biomedical entity alignment

Springer Science and Business Media LLC - Volume 15 - Pages 1-15 - 2021
Yu Hu1, Tiezheng Nie1, Derong Shen1, Yue Kou1, Ge Yu1
1School of Computer Science and Engineering, Northeastern University, Shenyang, China

Abstract

Biomedical entity alignment comprises two subtasks: entity identification and entity-concept mapping. It has great research value in biomedical text mining, where these techniques are widely used for entity name standardization, information retrieval, knowledge acquisition, and ontology construction. Previous work has invested heavily in feature engineering in order to apply feature-based models to entity identification and alignment. However, models that depend on subjective feature selection can suffer from error propagation and cannot exploit latent information. With the rapid growth of health-related research, researchers need an effective method for exploring the large volume of available biomedical literature. We therefore propose a two-stage entity alignment process, the biomedical entity exploring model, which identifies biomedical entities and aligns them to a knowledge base interactively. The model aims to obtain semantic information automatically in order to extract biomedical entities and to mine semantic relations through a standard biomedical knowledge base. Experiments show that the proposed method achieves better performance on entity alignment: it improves the F1 score by about 4.5% on entity identification and by about 2.5% on entity-concept mapping.
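To make the two-stage structure concrete, the following is a minimal Python sketch of an identify-then-map pipeline together with the F1 measure used to score each stage. It is not the authors' model: the abstract describes a learned approach, whereas this sketch substitutes a plain dictionary lookup for both stages. The knowledge base `TOY_KB`, its concept IDs, and all function names are hypothetical and exist only for illustration.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy knowledge base: surface form -> concept ID (IDs are made up).
TOY_KB: Dict[str, str] = {
    "aspirin": "C0001",
    "acetylsalicylic acid": "C0001",
    "headache": "C0002",
}


def identify_entities(text: str, lexicon: Set[str]) -> List[str]:
    """Stage 1 (entity identification): greedy longest-match dictionary lookup."""
    tokens = text.lower().replace(".", " ").replace(",", " ").split()
    max_len = max(len(term.split()) for term in lexicon)
    mentions: List[str] = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in lexicon:
                mentions.append(candidate)
                i += n
                break
        else:
            i += 1  # no lexicon entry starts at this token
    return mentions


def map_to_concepts(mentions: List[str], kb: Dict[str, str]) -> List[Tuple[str, str]]:
    """Stage 2 (entity-concept mapping): link each mention to a concept ID."""
    return [(m, kb[m]) for m in mentions if m in kb]


def align(text: str, kb: Dict[str, str]) -> List[Tuple[str, str]]:
    """Run both stages in sequence."""
    return map_to_concepts(identify_entities(text, set(kb)), kb)


def f1_score(predicted: Set, gold: Set) -> float:
    """Harmonic mean of precision and recall over sets of predictions."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Calling `align("Acetylsalicylic acid relieves headache.", TOY_KB)` links both mentions to their toy concept IDs. Because stage 2 consumes the output of stage 1, a mention missed in the first stage can never be mapped in the second, which is the error propagation the abstract refers to.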

Keywords

biomedical entity alignment; entity identification; entity-concept mapping; biomedical text mining; biomedical model
