Graph-based exploration and clustering analysis of semantic spaces

Applied Network Science - Volume 4, Issue 1 - Pages 1-26 - 2019
Veremyev, Alexander1, Semenov, Alexander2, Pasiliao, Eduardo L.3, Boginski, Vladimir1
1Department of Industrial Engineering and Management Systems, University of Central Florida, Orlando, USA
2Faculty of Information Technology, University of Jyväskylä, Jyväskylä, Finland
3Air Force Research Laboratory, Eglin AFB, USA

Abstract

The goal of this study is to demonstrate how tools and concepts from network science and graph theory can be used effectively to explore and compare the semantic spaces of word embeddings and lexical databases. Specifically, we construct semantic networks based on word2vec representations of words, which are "learned" from large text corpora (Google News, Amazon reviews), and "human-built" word networks derived from well-known lexical databases: WordNet and Moby Thesaurus. We compare "global" characteristics (e.g., degrees, distances, clustering coefficients) and "local" characteristics (e.g., the most central nodes and community-type dense clusters) of the considered networks. Our observations suggest that human-built networks have more intuitive global connectivity patterns, whereas the local characteristics (in particular, dense clusters) of machine-built networks provide much richer information about the contextual usage and perceived meanings of words. This reveals interesting structural differences between human-built and machine-built semantic networks. To our knowledge, this is the first study to use graph theory and network science in the considered context. We therefore also provide illustrative examples and discuss potential research directions that may motivate further work on combining lexicographic and machine learning tools and lead to new insights in this area.
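To make the idea of a semantic network built from word vectors concrete, the following is a minimal sketch, not the authors' procedure. It assumes that two words are linked when the cosine similarity of their vectors reaches a threshold; the abstract does not state the construction rule, so that rule, the threshold value, the toy 2-D vectors, and all function names are illustrative assumptions. The sketch then computes two of the "global" characteristics mentioned above: node degrees and the average clustering coefficient.

```python
import math
from itertools import combinations

# Toy 2-D "word vectors": hypothetical values, NOT real word2vec output.
vectors = {
    "good": (1.0, 0.1),
    "great": (0.9, 0.2),
    "nice": (0.8, 0.3),
    "bad": (-1.0, 0.1),
    "awful": (-0.9, 0.2),
}


def cosine(u, v):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def build_graph(vecs, threshold):
    """Link two words when their cosine similarity is >= threshold.

    The thresholding rule is an assumption made for illustration.
    """
    adj = {word: set() for word in vecs}
    for u, v in combinations(vecs, 2):
        if cosine(vecs[u], vecs[v]) >= threshold:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def local_clustering(adj, node):
    """Fraction of a node's neighbor pairs that are themselves linked."""
    neighbors = adj[node]
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))


graph = build_graph(vectors, threshold=0.9)
degrees = {word: len(nbrs) for word, nbrs in graph.items()}
avg_clustering = sum(local_clustering(graph, w) for w in graph) / len(graph)

print(degrees)
print(avg_clustering)
```

With real embeddings, the dictionary of toy vectors would be replaced by vectors loaded from a trained model, and the threshold would need to be chosen with care because it directly controls how dense the resulting network is.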

Keywords

#network science #graph theory #semantic networks #lexicography #machine learning
