
Boosting text segmentation via progressive classification

Knowledge and Information Systems - Volume 15 - Pages 285-320 - 2007

Eugenio Cesario¹, Francesco Folino¹, Antonio Locane¹, Giuseppe Manco¹, Riccardo Ortale¹

¹ICAR-CNR, Rende, CS, Italy

Abstract

A novel approach is proposed for reconciling tuples stored as free text into an existing attribute schema. The basic idea is to subject the available text to progressive classification, i.e., a multi-stage classification scheme in which, at each intermediate stage, a classifier is learned that analyzes the text fragments left unreconciled at the end of the previous steps. Classification is accomplished by an ad hoc exploitation of traditional association mining algorithms, and is supported by a data transformation scheme that takes advantage of domain-specific dictionaries/ontologies. A key feature is the capability of progressively enriching the available ontology with the results of the previous classification stages, thus significantly improving the overall classification accuracy. An extensive experimental evaluation shows the effectiveness of the approach.
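The multi-stage idea described above can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the association-rule classifier is replaced here by a much simpler "previous label implies next label" rule with a confidence threshold, and the names `learn_context_rules`, `progressive_classify`, `min_conf`, and `max_stages`, as well as the toy records and labels, are assumptions introduced only for illustration.

```python
from collections import Counter, defaultdict


def learn_context_rules(labels, min_conf=0.6):
    """Learn simple rules 'previous label -> label' from labelled neighbours.

    Stand-in (assumption) for the association-rule classifier in the paper.
    """
    counts = defaultdict(Counter)
    for labs in labels:
        for prev, cur in zip(labs, labs[1:]):
            if prev is not None and cur is not None:
                counts[prev][cur] += 1
    rules = {}
    for prev, ctr in counts.items():
        best, n = ctr.most_common(1)[0]
        # Keep the rule only if it is confident enough.
        if n / sum(ctr.values()) >= min_conf:
            rules[prev] = best
    return rules


def progressive_classify(sequences, dictionary, max_stages=5, min_conf=0.6):
    """Label tokens stage by stage, enriching the dictionary after each stage."""
    dictionary = dict(dictionary)  # work on a copy of the initial dictionary
    # Stage 0: plain dictionary lookup; unknown tokens stay unlabelled (None).
    labels = [[dictionary.get(tok) for tok in toks] for toks in sequences]
    for _ in range(max_stages):
        rules = learn_context_rules(labels, min_conf)
        updates = []
        for s, toks in enumerate(sequences):
            for i, tok in enumerate(toks):
                if labels[s][i] is not None:
                    continue  # already reconciled in an earlier stage
                guess = dictionary.get(tok)
                if guess is None and i > 0:
                    guess = rules.get(labels[s][i - 1])
                if guess is not None:
                    updates.append((s, i, tok, guess))
        if not updates:
            break  # nothing new was reconciled, so stop
        # Apply this stage's results and enrich the dictionary with them.
        for s, i, tok, guess in updates:
            labels[s][i] = guess
            dictionary.setdefault(tok, guess)
    return labels, dictionary


# Toy records (hypothetical data, already tokenized).
records = [
    ["john", "smith", "rome"],
    ["mary", "jones", "paris"],
    ["anna", "smith", "oslo"],
    ["jones", "oslo"],
]
seed = {"john": "NAME", "mary": "NAME", "smith": "SURNAME",
        "rome": "CITY", "paris": "CITY"}

labels, enriched = progressive_classify(records, seed)
for toks, labs in zip(records, labels):
    print(list(zip(toks, labs)))
```

In this toy run the last record cannot be labelled in the first stage, because neither token is in the seed dictionary and there is no labelled left neighbour. It is resolved in the second stage only because "jones" and "oslo" were added to the dictionary by the first stage, which is the enrichment effect the abstract refers to. The token "anna" stays unlabelled, since this simplified rule form looks only at the preceding label.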

Keywords

#progressive classification #text segmentation #association mining #ontology #classification accuracy

