Graph-based exploration and clustering analysis of semantic spaces
Abstract
The goal of this study is to demonstrate how tools and concepts from network science and graph theory can be used effectively to explore and compare the semantic spaces of word embeddings and lexical databases. Specifically, we construct semantic networks based on word2vec representations of words, which are "learned" from large text corpora (Google News, Amazon reviews), and "human-built" word networks derived from two well-known lexical databases: WordNet and Moby Thesaurus. We compare "global" characteristics (e.g., degrees, distances, clustering coefficients) and "local" characteristics (e.g., the most central nodes and community-type dense clusters) of the considered networks. Our observations suggest that human-built networks have more intuitive global connectivity patterns, whereas the local characteristics (in particular, the dense clusters) of machine-built networks provide much richer information about the contextual usage and perceived meanings of words. This reveals interesting structural differences between human-built and machine-built semantic networks. To our knowledge, this is the first study to use graph theory and network science in the considered context; we therefore also provide illustrative examples and discuss potential research directions that may motivate further work on combining lexicographic and machine-learning-based tools and lead to new insights in this area.
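The kind of pipeline the abstract describes (word vectors, then a similarity network, then "global" measures such as degree and clustering coefficient) can be illustrated with a minimal sketch. It is not the authors' implementation: the three-dimensional vectors are made-up toy values rather than real word2vec embeddings, the cosine-similarity threshold of 0.9 is an arbitrary assumption, and the helper names (`build_graph`, `local_clustering`, `average_clustering`) are hypothetical.

```python
import math
from itertools import combinations

# Toy "embeddings": made-up values for illustration only. Real word2vec
# vectors are learned from a corpus and usually have hundreds of dimensions.
TOY_VECTORS = {
    "good": (0.9, 0.1, 0.0),
    "great": (0.8, 0.2, 0.1),
    "nice": (0.85, 0.15, 0.05),
    "phone": (0.1, 0.9, 0.2),
    "screen": (0.0, 0.8, 0.3),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def build_graph(vectors, threshold):
    """Link two words when their cosine similarity is at least `threshold`.

    The threshold rule is an assumption of this sketch, not necessarily
    the construction used in the study.
    """
    adj = {word: set() for word in vectors}
    for a, b in combinations(vectors, 2):
        if cosine(vectors[a], vectors[b]) >= threshold:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def local_clustering(adj, node):
    """Fraction of pairs of neighbors of `node` that are themselves linked."""
    neighbors = adj[node]
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    return 2 * links / (k * (k - 1))


def average_clustering(adj):
    """Mean of the local clustering coefficients over all nodes."""
    return sum(local_clustering(adj, n) for n in adj) / len(adj)


graph = build_graph(TOY_VECTORS, threshold=0.9)
degrees = {word: len(neighbors) for word, neighbors in graph.items()}
print(degrees)
print(average_clustering(graph))
```

With these toy vectors the sentiment words form one triangle and the two device words form a separate pair, so even this tiny graph shows the sort of dense, context-driven cluster the abstract refers to. On real data one would likely load the vectors with a library such as gensim and analyze the graph with NetworkX or igraph, all of which appear in the paper's reference list.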
Keywords
#network science #graph theory #semantic networks #lexicography #machine learning